AMD CPU speculation... and expert conjecture

Page 737
Status
Not open for further replies.
Gamerk, you totally missed the point about StarCraft 2... That graph shows the FX performing much, much better than usual with an Intel flag. It's the same CPU in both colors; the only difference is the CPU flag. FX chips normally fall flat in SC2, which is always attributed to their 'weak IPC', yet with an Intel flag it magically performs better.

Irrespective of which compiler was used, this constitutes a serious problem, as it's artificially skewing the playing field against AMD...
 
If software numbers aren't real IPC numbers, don't call it IPC. Call it what it is: benchmark performance of software.

You can't just look at a program, assume it treats AMD and Intel equally, and laugh that the AMD CPU is doing more work for each task because someone at Intel wants people to believe that AMD's SSE extensions are "broken and don't work at all".

If it's truly believed that SSE 4.2 is broken on AMD hardware, why doesn't anyone know about it other than the people praising everything Intel? If AMD's SSE is broken, let's see the proof.

P.S. Remember my previous example of someone saying the 5960X has a higher IPC than the 5960X? You just said that about the FX-8350; that's the only CPU used in that SC2 test.
 


Ah, that's why I really need to stop doing analysis at 1 AM. But still, MAJOR problems here with the test setup. Here are the major ones:

1: Why exactly is CPUID picking up the FX as a single-core chip capable of running four threads? So the FX's native layout isn't even being detected right, and god knows what that's doing to performance.

2: Know any Core 2 CPUs with one core, four threads?

Worse, some of the CPUID results are still picking up the CPU as an AMD FX-8350, just running on LGA 775. I'm also seeing the register layout for an FX chip, not a Core 2, and the instruction set support of an FX, not a Core 2.

So frankly, I have no idea what the software is going to do here, since you're spoofing something that doesn't exist in reality. You could very well have code blocks that look like:

If Core 2 [This wouldn't trigger]
If Intel [This would]
If HTT [This will]
If AVX [Yep]

And so on. Since you're spoofing something that doesn't exist, I can't predict what the heck the software is going to do. I could easily see a combination of code paths being executed that wouldn't normally be executed together, which could lead to all sorts of strange performance.
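To make the branch-combination worry concrete, here is a minimal Python sketch. The `SpoofedCpu` fields and the branch names are hypothetical, made up for illustration; nothing here reflects what SC2 or any compiler runtime actually checks. It only shows how independent checks can fire in a combination that no real chip would produce.

```python
from dataclasses import dataclass


@dataclass
class SpoofedCpu:
    """Hypothetical snapshot of what CPUID-style queries report."""
    vendor: str   # vendor string, e.g. "GenuineIntel"
    family: str   # reported microarchitecture name
    htt: bool     # HTT flag
    avx: bool     # AVX feature flag


def triggered_branches(cpu: SpoofedCpu) -> list[str]:
    """Return the independent checks that would fire, in order."""
    branches = []
    if cpu.family == "Core 2":
        branches.append("core2")
    if cpu.vendor == "GenuineIntel":
        branches.append("intel")
    if cpu.htt:
        branches.append("htt")
    if cpu.avx:
        branches.append("avx")
    return branches


# Assumption: only the vendor string is spoofed; everything else still
# describes the FX-8350 underneath.
spoofed_fx = SpoofedCpu(vendor="GenuineIntel", family="Piledriver", htt=True, avx=True)
real_core2 = SpoofedCpu(vendor="GenuineIntel", family="Core 2", htt=False, avx=False)

print(triggered_branches(spoofed_fx))
print(triggered_branches(real_core2))
```

The spoofed chip takes the "intel", "htt" and "avx" branches but never the "core2" one, which is a combination a genuine Core 2 would not hit.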

3: VMware? Opens up a lot of weird performance questions, mainly how VMware is going to react to a spoofed Intel CPU. So you have a MAJOR ambiguity in the results before you actually measure anything.

4: The guest OS is running with 3GB of RAM, which is fairly low.

5: I SPECIFICALLY noted MSVC 2008. It's simply possible that SCII doesn't have a native BD codepath, since BD wasn't a thing at the time. And trying to run the old Thuban codepath, or even the generic SSE2 codepath, is going to cost some performance. Without knowing the EXACT build of the compiler, I can't speak for what the code-gen is doing for specific CPU architectures.

So I call the entire test suspect before it even begins.
 


And that's what I always call it.

You can't just look at a program, assume it treats AMD and Intel equally, and laugh that the AMD CPU is doing more work for each task because someone at Intel wants people to believe that AMD's SSE extensions are "broken and don't work at all".

Intel and AMD code paths are going to be fairly different. Just like Core 2 and Nehalem code paths are going to be different. And just like Thuban and Bulldozer code paths are going to be different.

But for the most part, the relative difference in instructions executed is going to be trivial, so you can get a decent relative performance difference in IPC. Something like 5% is more or less noise, but something like 40% is significant to the point where you can say there's a real difference in how fast the hardware is executing.

If it's truly believed that SSE 4.2 is broken on AMD hardware, why doesn't anyone know about it other than the people praising everything Intel? If AMD's SSE is broken, let's see the proof.

Who is claiming SSE is broken on AMD? Sure, AMD may implement it differently, but if it were broken, every SSE program would crash the PC in various ways.

Now ironically, you typically see performance decline after SSE2 when you set the relevant compiler switches. But the performance difference is <5%, so it falls into the "noise" category as far as I'm concerned. You really can't gain much via compiler switches beyond enabling SSE2.

P.S. Remember my previous example of someone saying the 5960X has a higher IPC than the 5960X? You just said that about the FX-8350; that's the only CPU used in that SC2 test.

Nonsensical statement?

PS:

What you're doing, rather than admitting AMD has a problem and pressing them to do something about it, is trying to deny they have a problem at all and blaming everyone else for AMD's architectural failings. Some of us called the IPC problem almost TWO YEARS before BD hit, for crying out loud. We can measure a consistent deficit of more than 20%, and you continue to claim the entire world is biased against AMD, rather than accept the simple explanation that their hardware design is suboptimal for the tasks it's being called on to do.
 
Gamerk, he spoofed the CPUID flags to mimic the Intel parts enough for SCII to see "Intel inside".

Kinda funny: any time Intel's cripple-AMD functions are proven, it's "suspect". That's the problem with Intel. Apparently they never cheat when dealing with AMD. It's always AMD's fault for not knowing Intel stabbed them in the back.
 


1: MSVC, not ICC. Or are you accusing MSFT of intentionally hurting AMD's performance?

2: Compilers put in more than AMD-or-Intel checks; the generated code also looks for extensions (SSE3/4/AVX), the HTT flag, processor layout (to some extent, though the OS cares more), and so on. And NONE OF THAT was spoofed.

<mod edit>

<What did I say about keeping it civil?>
<next one gets a vacation>
 
Point is, AMD is fighting two battles: architecture, and Intel software with cripple-AMD functions built in. As a programmer, are you supposed to treat Intel and AMD separately or go by the CPU extensions?

90% of the time when dealing with reviews, focus is put on software with the greatest disparity between amd and intel without questioning the software itself and putting 100% of the blame on "AMD sucks".
 


You're supposed to go by specific CPU architecture for the most part, although that should all be handled compiler side. In theory, where applicable, the compiler should be plopping in codepaths for every applicable CPU extension (probably down to SSE2, which is kind of the lowest supported now), with specific code blocks for specific CPU architectures where it matters for performance. You never do AMD/Intel code branches, because what works for Core 2 may not work for Nehalem; you need to go to the architecture level.

Why do you think GCC in particular has so many flags to fine-tune by architecture? MSVC goes by SSE levels and optimization flags, though I'd have to dig into the documentation to see what exactly they enable (especially -O3). I'd imagine it's more or less the same as GCC, though.

What ICC now does is default to SSE for AMD chips unless explicitly told otherwise by any number of compiler switches, which override the default behavior (though Intel notes in its documentation that this isn't technically supported by them and they make no promises about what it does for performance, or even whether the code will work properly). And I do again note: Intel is under no obligation to optimize for AMD architectures, just like AMD's compiler is under no obligation to optimize for Intel architectures.
 


Really? Got a source?

It is possible that they remove FMA4 in favor of de-composing it into another ISA extension. Even though I take that info with salt (as usual), your rebuttal is pretty clear, so you must have a source for that, right?

Cheers!
 


You are right. The distinction between maximum IPC allowed by the architecture and the IPC extracted by a concrete piece of software was stated two or three pages ago, and multiple times.

I am not very given to car analogies, but I will post one:

Maximum IPC attributed to a given architecture is analogous to maximum speed achieved by a given car.

Maximum IPC extracted from a piece of software programmed by Peter and compiled using compiler Z is analogous to maximum speed achieved by a car driven by John on circuit R.

Average IPC extracted from a piece of software programmed by Peter and compiled using compiler Z is analogous to average speed achieved by a car driven by John on circuit R.

Car-maximum-speed >= car-circuit-speed-record > car-average-speed-on-circuit

Hardware-maximum-IPC >= hardware-software-maximum-IPC > hardware-software-average-IPC

We can compare the average speed of different cars at Yas Marina Circuit the same way we can compare the average IPC of two different CPUs in Cinebench. E.g. Haswell has about 60% higher IPC (ST) than Piledriver. Steamroller has ~15% higher IPC (MT) than Piledriver. Broadwell brings 5% higher IPC than Haswell. Excavator brings 5% higher IPC than Steamroller...

We can compare the maximum speed of different cars the same way we can compare the maximum IPC of two different CPUs. E.g. Haswell is 8-way issue, whereas Piledriver is 4-way (8-way per module).

The analogy is rather complete. Thus we can say that some cars perform better on some circuits than on others, just as some CPU architectures work better for some workloads than others, depending on the design goal and the ability of the engineers.

Everyone knows that AMD has a serious IPC deficit. AMD also knows this. I have word from an AMD engineer that the goal for Zen is: "100% higher IPC than Piledriver". I doubt they will be able to increase the IPC by that much, but at least they know where the problem is, and that is a first step toward the solution.



 


Of course, Intel is under no obligation to optimize for AMD architectures, but the issue with the Cripple_AMD function was that (i) ICC checked the CPUID vendor, instead of architectural abilities, and disabled optimizations for AMD/VIA chips even when the CPUs were compatible with the fastest code, and (ii) Intel hid this odd behavior of its compiler, lied about it, and spread false information to developers, OEMs, and users in general.
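The difference between the two kinds of check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the feature names and the "sse4_2"/"baseline" path labels are all hypothetical, and this is not ICC's actual dispatcher logic.

```python
def pick_path_by_vendor(vendor: str, features: set[str]) -> str:
    """Vendor-gated dispatch: the fast path also requires an Intel vendor string."""
    if vendor == "GenuineIntel" and "sse4_2" in features:
        return "sse4_2"
    return "baseline"


def pick_path_by_features(vendor: str, features: set[str]) -> str:
    """Feature-gated dispatch: only the reported capabilities matter."""
    if "sse4_2" in features:
        return "sse4_2"
    return "baseline"


# Hypothetical AMD chip that reports the same extension an Intel chip would.
amd_vendor, amd_features = "AuthenticAMD", {"sse2", "sse4_2", "avx"}

print(pick_path_by_vendor(amd_vendor, amd_features))    # baseline
print(pick_path_by_features(amd_vendor, amd_features))  # sse4_2
```

With identical feature sets, the vendor-gated version sends the AMD chip down the slow path while the feature-gated version does not.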



The patch itself, which describes the FMA4 support:
 


https://sourceware.org/ml/binutils/2015-03/msg00078.html

Attached patch adds the following.
* New AMD znver1 processor. The architecture has the below features
* TBM, FMA4, XOP, LWP: ISAs are not supported.
* SMAP, RDSEED, SHA, XSAVEC, XSAVES, CLFLUSHOPT, ADCX: ISAs are supported.
* New CLZERO instruction support.
* clzero has opcode "0F 01 FC".
* clzero gets enabled with CPUID, 8000_0008, EBX[0] =1.
* clzero instruction zero's out the 64 byte cache line specified in rax. Bits 5:0 of rAX are ignored

I have added two new arch test files.
1. arch-13.s: Lists out the new ISAs that are supported and checks it against the march option.
2. x86-64-arch-3.s: The 64-bit version of the test.

It passes make check on x86-64. Okay to commit?
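The CLZERO details in the patch description can be turned into a small sketch. Python cannot execute CPUID itself, so the EBX value of leaf 8000_0008 is assumed to have been read by some other means and is passed in as a plain integer; the function and constant names are made up for illustration.

```python
CLZERO_CPUID_LEAF = 0x8000_0008  # leaf named in the patch description
CLZERO_EBX_BIT = 0               # EBX[0] = 1 means CLZERO is available
CACHE_LINE_BYTES = 64            # size of the line that clzero zeroes


def has_clzero(ebx: int) -> bool:
    """Decode the CLZERO bit from an already-read EBX value of the leaf above."""
    return bool((ebx >> CLZERO_EBX_BIT) & 1)


def clzero_line_base(rax: int) -> int:
    """Address of the cache line clzero would zero: bits 5:0 of rAX are ignored."""
    return rax & ~(CACHE_LINE_BYTES - 1)


print(has_clzero(0b1))
print(hex(clzero_line_base(0x1234)))
```

Masking off the low six bits is what "Bits 5:0 of rAX are ignored" amounts to for a 64-byte line.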

What patch? The actual push of the code?

Cheers!
 
It's deeper in the Phoronix comments...

+ { "CPU_ZNVER1_FLAGS",
+ "Cpu186|Cpu286|Cpu386|Cpu486|Cpu586|Cpu686|CpuSYSCALL|CpuRdtscp|Cpu387|Cpu687|CpuFISTTP|CpuNop|CpuMMX|CpuSSE|CpuSSE2|CpuSSE3|CpuSSE4a|CpuABM|CpuLM|CpuFMA|CpuFMA4|CpuBMI|CpuF16C|CpuCX16|CpuClflush|CpuSSSE3|CpuSVME|CpuSSE4_1|CpuSSE4_2|CpuAES|CpuAVX|CpuPCLMUL|CpuLZCNT|CpuPRFCHW|CpuXsave|CpuXsaveopt|CpuFSGSBase|CpuAVX2|CpuMovbe|CpuBMI2|CpuRdRnd|CpuADX|CpuRdSeed|CpuSMAP|CpuSHA|CpuXSAVEC|CpuXSAVES|CpuClflushOpt|CpuCLZERO" },
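One way to cross-check the flag string against the patch description is a quick Python sketch. It uses only a contiguous excerpt of the string quoted above (whitespace normalized), and the helper names are hypothetical; it is not part of binutils.

```python
# Excerpt of the CPU_ZNVER1_FLAGS string quoted above (not the full list).
ZNVER1_FLAGS_EXCERPT = "CpuSSE4a|CpuABM|CpuLM|CpuFMA|CpuFMA4|CpuBMI|CpuF16C|CpuCX16"

# ISAs the patch description lists as not supported.
DESCRIBED_UNSUPPORTED = {"TBM", "FMA4", "XOP", "LWP"}


def parse_flags(flag_string: str) -> set[str]:
    """Split a '|'-separated flag string and drop the 'Cpu' prefix from each entry."""
    return {
        flag.strip().removeprefix("Cpu")
        for flag in flag_string.split("|")
        if flag.strip()
    }


def contradictions(flag_string: str, unsupported: set[str]) -> list[str]:
    """Flags present in the string that the description says are unsupported."""
    return sorted(parse_flags(flag_string) & unsupported)


print(contradictions(ZNVER1_FLAGS_EXCERPT, DESCRIBED_UNSUPPORTED))  # ['FMA4']
```

On this excerpt the only mismatch is FMA4, which is the discrepancy being discussed here.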

 


IPC isn't the same between an i5-4670K and i7-4770K either.
 


Yeah, I started to look at the actual code push and it does support FMA4, lol.

Could it be a bug or typo in the letter?

Cheers!
 


None of this IPC stuff would be an issue if people just called it performance per clock in a given application with specific compiler settings. PPCGASCS. So easy to remember!

In regards to Titan X, it makes me feel good about 390x. That awkward launch at GDC followed by a rushed launch with availability only on the nvidia site is not a good sign for Nvidia, even if Titan X is a decent step up from older cards on the same process. It looks rushed.
 


Now that people know what to look for (znver1) things should get interesting.
 


Yes, but compilers are aware of HTT because there are impacts to how you want to schedule instructions, because of shared resources.
 


The basic performance equation of computer science is

Performance = IPC * Clock

Thus

IPC = Performance / Clock

Why would people have problems with the left hand side whereas the right hand side is fine?

Moreover, calling IPC "performance per clock in a given application with specific compiler settings. PPCGASCS", as you suggest, is a mistake, because IPC is also defined for non-compiled software, for a collection of applications, and for hardware.
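As a worked example of the equation above, here is a short Python sketch. The two chips and all their numbers are invented purely to show the arithmetic; they are not measurements of any real CPU.

```python
def performance(ipc: float, clock_hz: float) -> float:
    """Performance = IPC * Clock, in instructions per second."""
    return ipc * clock_hz


def ipc_from_performance(perf: float, clock_hz: float) -> float:
    """IPC = Performance / Clock."""
    return perf / clock_hz


def relative_ipc_gain(ipc_a: float, ipc_b: float) -> float:
    """Fractional IPC advantage of chip A over chip B (0.25 means 25% higher)."""
    return ipc_a / ipc_b - 1.0


# Hypothetical chips: A has the higher IPC, B has the higher clock.
perf_a = performance(ipc=2.0, clock_hz=3.5e9)
perf_b = performance(ipc=1.25, clock_hz=4.0e9)

print(f"A: {perf_a:.3g} instr/s, B: {perf_b:.3g} instr/s")
print(f"IPC of A recovered from performance: {ipc_from_performance(perf_a, 3.5e9):.2f}")
print(f"IPC advantage of A over B: {relative_ipc_gain(2.0, 1.25):.0%}")
```

The sketch only restates the identity: given any two of performance, IPC and clock for a particular workload, the third follows.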
 


AMD should be totally clear of that, they started the craze for this API with Mantle, and I think everyone here will concede that DX12 became so openly discussed and Vulkan pushed forward at an accelerated rate because of their work and efforts in unlocking gaming hardware for better performance. The work benefits most companies in that discussion as much or more than it benefits AMD, though it benefits them as well. So...the discussion really comes down to M$ dragging their feet on doing something more dynamic, which has been stalled, like the windows OS itself has been, since the days of DX9/XP
 