Discussion: AMD Ryzen

Status
Not open for further replies.


Windows 10 *never* uses AVX2, and Blender does not support it by default in any circumstance. You would literally have to recompile Blender, and run the benchmark on Linux, to get AVX2.

Which is why so much of Intel's theoretical advantage in floating point is blue sky... in the real world it does not matter because nobody uses it.
 
Speaking from my world for a minute here, SSE2 is pretty much the default compilation option. Anything more than that is just gravy. We don't like compiling any higher than that since you start dropping CPU support, and software devs like me almost NEVER plop in CPU opcodes manually anymore. To us, things like AVX/AVX2 exist only for benchmarks and very specialized processor loads.
 


It was shown just a couple of posts ago that the latest versions of Blender use AVX2. The developers' release notes were quoted.
 


I am not anti-AMD (see my signature), but I am anti-hype and anti-misinformation. And my comments aren't meant to be interpreted as part of some AMD vs Intel war. I am replying to a demo of Zen vs Broadwell made by AMD. If they had made a Zen vs Piledriver demo and claimed that Zen is ~3x faster clock-for-clock, I would be reacting in exactly the same way. To put things in perspective:

[Image: Broadwell-E Blender render-time chart]


According to AMD, Zen @ 3GHz would be somewhere around the 100-second mark, whereas the FX-8350 is at 320 seconds. This implies that Zen would have ~4.26x higher IPC than Piledriver (recall AMD's own slides stating that Zen is ~2x faster than the FX-8350 clock-for-clock).

The Stilt has made a similar analysis. He downloaded the latest version of Blender and benchmarked his Piledriver and Haswell chips. Testing single-thread, he found that a Zen core would have to be 140% faster than a PD core to match his Haswell. According to AMD, Zen is ~40% faster than XV; therefore XV would have to be about 100% faster than PD. The numbers don't match by a huge amount. He claims to be puzzled by the result, and he reached conclusions similar to mine.

We are not talking about a small "spin" — a 10% here and a 5% there to put products in a better light. We are talking about huge gaps. What are we supposed to do? Say "yes" to anything published/advertised by companies and not use our brains? Then why have forums? The news section would be enough.

And my facts aren't a video demo, but the details of the microarchitectures and the good record of predictions that I have made about Zen. A good amount of the information in those slides you mention was posted by me here before those slides were even made by the marketing folks.
 
Juan, I think you have to remember that the 40% IPC gain they claim over Excavator is not meant to apply to every piece of software in existence; even something above 100% is fairly possible in certain circumstances.
Even if you just count the FPU pipelines, there are 2 FMACs for 2 Bulldozer cores vs 2 add + 2 multiply pipelines for Zen, so at least a 100% improvement in theoretical maximum output, up to 300% with utopian perfect utilization.
Of course I mean clock-for-clock, but I don't think I'm the only one around here to recall that Bulldozer had massive problems utilizing its theoretical floating-point capability.

For the 2x perf figure of Summit Ridge vs Orochi, remember we are talking 4 modules vs 8 SMT2-cores.
 

I also agree with Juan: basically everything he predicted or somehow knew has come out to be true, and probably more accurate than other sites that make predictions. I will say, like others including Reynod have said, that all companies do this. Nvidia certainly does it, and did it this year, and AMD certainly does it too. It needs to be pointed out by the community in full outrage. I guess Intel does it too with their iGPU claims, though not so much with CPU performance.

As gamer said just previously, most software devs stick with SSE. I think it's even so bad that Intel/others made AVX able to run SSE4 code more easily, but I'm not too sure on the matter since I'm not a software developer.

In my opinion Zen is shaping up to be quite good, and I defended it since AMD is basically going back to a wide core design. They also improved their cache, probably increasing the memory controller capability as well. They learned that single-core performance matters a lot and that the module design isn't good for x86/x64. They also learned that some magic software wasn't going to make single-core performance matter less, even if their die-hard fans still don't think so.

I still think we will see Sandy/Ivy levels of performance for integer tasks (90+% of everything consumers use), and I personally think that's OK from a new design made by a company that is worth 25 times less than its main competitor.

Heck, Skylake is what, 25% better IPC compared to Sandy? Zen+ is already being made, and as others posted, Moore's law dying is actually helping AMD.
 


If the four cores were distributed over the whole die, the effective thermal density would be cut in half, and this would improve the thermal headroom to increase clocks. But the four cores still occupy half the die. The local density (where the heat is generated) is the same, and the rest of the die only provides a small improvement to heat dissipation. I maintain my computations:

95W 8-core <----> 3.0 GHz base and 3.5GHz turbo.

65W 4-core <----> 3.4 GHz base and 3.9GHz turbo.
 


You are entirely right that AMD's 40% claim is a sort of average. Precisely my point was that the Blender demo they shared with us is a cherry-picked benchmark with some special settings. No one should expect Zen to be 4.3x faster than the FX-8350 clock-for-clock across benchmarks.

About execution units: Zen has exactly twice the peak throughput of Piledriver. Two 128-bit FMAC units give a max throughput of 16 FLOP per core, whereas Piledriver's peak is 8 FLOP per core. That is a peak. The rest of the microarchitecture is not 2x better than Piledriver's: caches aren't 2x better, issue is not 2x wider, the OoO window is not 2x bigger, and so on. This means that the sustained throughput will not be 2x, but more in the ~1.7x range. Add SMT on Zen and the CMT module penalty on PD (for 2T) and Zen could be ~2.5x faster than Piledriver under special conditions. This is still very far from the 4.26x they suggested with the Blender demo, which again reinforces the thesis that it was a cherry-picked benchmark.

The Summit Ridge vs Orochi slide can be interpreted as follows:

1.5 x 1.2 x 1.2 = 2.16 ~ 2

from left to right (IPC gain, SMT2, and lack of module penalty). That slide did make sense, because 2x agrees with what we know about the microarchitectures of Zen and Piledriver. That is why I am reacting to their Blender demo but didn't react to their Zen vs Orochi slide.
 
The trouble with marketing material is its nature. It's marketing: it's meant to entice people, make bold claims, stir up whichever community it's targeting. It's not the same as white papers and internal architecture documents. Marketing may even use simplified charts and graphs to appear 'technical', similar to an actor in a lab coat pimping health and beauty products in commercials.

It's not AMD-specific; it's inherent to advertising to fudge a bit, and of course in favor of the product being marketed. Take claims such as batteries being a certain % more powerful than 'the leading national brand'. Well, what brand and model is that exactly? Good luck digging out those specifics; it's likely not what you think, or not tested in the scenario you think it was. It's just enough short of totally fudging to duck false-advertising claims.

Debates and speculation can be fun, but eventually the real test will be a retail chip available for the public to purchase, actually built into a machine and pitted against the existing available competition: benchmarks in real-world scenarios, whether encoding, office productivity, multitasking, gaming and so on, across a number of different tests.

Lab conditions are always pristine, best-case environments that may not represent the real world. The same is true of PC fan testing for the airflow and other specs listed on the side of the package. I would expect no different of CPU tests, especially ones done in-house by the company selling them. Expect to take some of the marketing claims with a grain of salt. I've said the same of Intel, since both companies have been guilty of it in the past.

Maybe I'm cynical but the last place or source I turn to for realistic appraisals of a product's performance is the manufacturer. They have the most vested interest in putting their product in a good light regardless.
 


Here's the issue with compiling: unless you start shipping multiple builds, the minute you start plopping SSE3 or higher opcodes into your source, you start limiting the CPUs your code can run on. Windows pretty much has SSE2 as the lowest baseline for CPU support, so from my perspective, why would I limit my app's potential users to SSE3-capable CPUs or better, when the host OS can run with SSE2?

As a result, the default option usually compiles against SSE2, and in some very specialized use cases we might manually write other code paths, but that is not normal for consumer-grade software. Never mind that for 90% of workloads, there are limited to no performance gains beyond the SSE2 code path.
 


Let me ask you two things, Juan, that are bothering me about those claims:

1. Are you considering that Zen is an 8-core, 16-thread CPU when comparing it against the 8-thread Piledriver? Because all of Zen's numbers should be halved (or PD's doubled) for a fair comparison.

2. Did Stilt match clocks between CPUs? If not, just by using an FX-8350 or an FX-9590, results would differ a lot. Also, against what Haswell, an i5 or an i7? We are talking IPC here, so obviously it should all be directly comparable.
 


If you think about it, those results in Blender *are* believable, because:
The Piledriver FX-8350 only has 4 FPUs, and I believe the FPU in PD is smaller than that in a single Zen core. So you are comparing 8 larger FPU units vs 4 smaller ones. In that case the uptick is going to be significantly more than the 40% IPC gain, just by virtue of having more than double the execution resources as well as much higher efficiency.

Juan, you did hit on a point though: an FP benchmark is likely showing best-case gains for Zen (as that was where AMD's earlier designs were weakest).

I think the take-home here, though, is that we really need to look at *single-thread* applications to measure the IPC gap between different processors (especially when comparing to PD). When looking at multi-thread, there are many more factors at play, such as double the core count in terms of FP units and so on.

 
Regarding IPC, to REALLY measure it correctly you need to stress a core, a module (usually a core + its SMT sibling), and the entire CPU, each at full load, to look for any bottlenecks.

This is never done. Heck, even core loading usually isn't maxed out (or even given), which leads to incorrect conclusions. See the AoS benchmark, where the raw numbers say Zen < BD in IPC, almost certainly because BD's core usage was lower.

So yeah, it's important to understand what you're trying to benchmark.
 


There's a lot of interesting information in there, even keeping things in perspective since it's WTFBBQTech.

Unfortunately, I don't have any smart remarks about the high-level overview of Zen. The only thing that caught my eye was the logic around the 2 AGUs. I think the uops cache is paying off in the design, along with the reorganization of the L2 and L3 caches. Other than that, nothing else caught my eye.

EDIT: http://www.anandtech.com/show/10591/amd-zen-microarchiture-part-2-extracting-instructionlevel-parallelism

More details and theory.

Cheers!
 


1. Yes, I am considering SMT. SMT usually brings between 0% and 40% gains depending on the code. I am using 20% as an average in my posts. I did it again just yesterday, when I explained why Zen would be ~2x faster (on average) than PD clock-for-clock on multithreaded applications:

http://www.tomshardware.co.uk/forum/id-2986517/discussion-amd-zen/page-8.html#18480265

SMT doesn't double performance (that is impossible because execution units are shared). In any case check the i5 and i7 in the Blender benchmark given above.

2. Yes, he tested at the same clocks, and made other changes such as disabling two memory channels on the Haswell side... He used a Xeon model.
 


Zen doesn't have "more than double the execution resources". I explained this before, when I stated that Zen is a 16 FLOP/core microarchitecture whereas PD is 8 FLOP/core. PD has two 128-bit FMAC units per module, therefore the FX-8350 has 8x 128-bit FMAC execution resources. Zen has two 128-bit FMAC units per core, therefore an octo-core Zen has 16x 128-bit FMAC execution resources, or just double.

If we want to talk about multithread performance we have to consider the effect of SMT on Zen (I take 20% as the average gain) and the effect of the CMT module penalty on PD (I take another 20%). If we want to eliminate those, then single-thread is the route, as you say. Which is precisely why The Stilt started testing Blender in single thread, as I wrote above:

The Stilt has made a similar analysis. He downloaded the latest version of Blender and benchmarked his Piledriver and Haswell chips. Testing single-thread, he found that a Zen core would have to be 140% faster than a PD core to match his Haswell. According to AMD, Zen is ~40% faster than XV; therefore XV would have to be about 100% faster than PD. The numbers don't match by a huge amount. He claims to be puzzled by the result, and he reached conclusions similar to mine.

140% faster with 100% more execution units means that special settings or code optimized for Zen were used, or that PD had some bug/bottleneck in that benchmark, or something else.

Indeed, FP benchmarks will show higher performance gains. My 40% over Piledriver IPC was an average. For FP I said I expect something like 80% over Piledriver. I still maintain both figures.
 


The WCCFTech article was written by someone who doesn't know what he is writing about. Stuff like this nonsense: "The two floating point units on the new core consist of 4 pipes with 128 FMACs per FPU." They also still haven't registered that Zen was officially delayed to 2017.

The AnandTech article is serious. The next one, critical of AMD's Blender demo claims, is also relevant:

http://www.anandtech.com/show/10585/unpacking-amds-zen-benchmark-is-zen-actually-2-faster-than-broadwell

I don't get what you mean by the logic around the 2 AGUs. I wrote about this when the leak about 2 AGUs was published and it disproved my prediction of 3 ALU + 3 AGU for Zen. Let me quote David Kanter:

 


The ALUs are not 100% symmetrical.

Cheers!
 
Pretty much positive Zen's granularity is four cores now:

http://www.pcper.com/reviews/Processors/AMD-Exposes-Zen-CPU-Architecture-Hot-Chips-28/Cache-Structures-Complexes-and-More

The L1/L2/L3 cache structure has changed from the previous architectures as well. This is not surprising as these caches are absolutely key for overall good performance and throughput for any architecture. An effective cache system also can improve upon energy efficiency as there are fewer wasted cycles going to main memory as well as the power required to make those accesses. Each core features 96KB of L1 divided into 64K 4-way instruction and 32K 8-way data. There is then 512 KB 8-way of L2 that is private to each core. This then is connected to a large and fast 8 MB L3 cache that is shared between four cores. The caches can all transmit up to 32 Bytes per clock. In a CPU with 8 cores, the two L3 caches look to feature a fast interconnect so that data accesses between cores in the different modules do not impose a significant bottleneck (cache accesses, writes, scrubbing, etc.).

Further evidence that Zen is going to be released in groups of four cores.

[Image: Hot Chips 28 AMD Zen slide (Mike Clark), page 14]
 


And?
 


From what I remember from my university course on CPU design: when the ALUs are 100% symmetrical, you need to have them aligned with the number of AGUs so the dispatched operands don't sit in a queue for long, waiting for the AGUs to do the addressing for all INT and memory ops. Since the ALUs are no longer symmetrical in current designs, the differences are usually balanced out through bigger op queues or a lower number of them; or at least, that is what I can think of about it.

It's been 10 years since I took that course though, so I might as well be outdated on this.

Cheers!
 
Utilizing SIMD is not magic, and compiler settings don't do a thing. You have to actually use assembly or intrinsics by hand to make it happen. And most programs don't. So it is not cherry-picked, even if it is just a 'real world' result.

Artificial benchmarks don't mean a damned thing when the real software doesn't take advantage of all the features. And that is why they support the older extensions more, and so does ARM.
 


That has been the general complaint about x86 CPUs for the last several years: minor IPC gains unless you use the special new instructions. AMD has compromised by sticking with 128-bit FP units. If applications don't take significant advantage of AVX2, then AMD is looking good.
 