Discussion: AMD Ryzen



The micro-op cache is an interesting addition. Intel introduced a 1.5K uOp cache to Sandy Bridge, though it's hard to say how much of Sandy's IPC improvement directly resulted from it. What's ironic is that the AMD-K6 was the first x86 to have a micro-op cache, and it was a whopping 20K at that.
 


The argument made by David Kanter and me already considers asymmetric ALUs (that is what he means when he mentions branches).
 




The problem is not with using 128-bit units. AMD could have used 4x 128-bit units to get twice the maximum throughput, but then they would have had to improve everything from the front end to the commit stage, including doubling the caches' bandwidth, which would have created very complex engineering problems.

I see the same mistake when people claim that Bulldozer performed badly because the FPU is shared between the two cores in a module. This is not right. Bulldozer performed badly because the FPU was only 256 bits wide. If a Bulldozer module had incorporated a 1024-bit-wide unit, it would beat the best designs Intel has. But of course a 4x bigger FPU would require lots of extra transistors and logic and power and...

And the excuse that AMD Zen uses 128-bit units because 256-bit is not very popular doesn't hold up on close inspection, because (i) AMD does support 256-bit on Zen by fusing the pair of 128-bit units, and (ii) AMD has been supporting ISAs with far less popularity, including HSA.
 


Minimal, because instructions already stream from the instruction cache at the pipeline's full rate; the new design saves cycles only when execution can be restarted from the L0 cache after a mispredicted branch. The major improvement introduced by the uop cache is on the efficiency side, because the x86 decoders are power hungry and can be shut down when fetching from the uop cache. In Haswell the total power is reduced by about 12%.



Link? I have just looked at the K6 diagram and there is no mention of that.
 


Zen is just the architecture, so yes there will be a vast range of CPUs for different price points.
 


That is actually not so hard to believe...

Consider this...

The 8C Zen has twice as many FPUs. So, assuming they are equal, we are already at +100% performance. Now we can assume +40% over Piledriver per AMD's estimates, and finally we can also include the fact that Zen can run AVX but Piledriver cannot, which would put another theoretical 60% of blue-sky headroom for benchmarks in there.

So we end up at +200% total, which is not far off your calculations (a 20% difference...).

 


SMT changes very little, basically nothing in FPU ops.

However, having double the physical number of FPUs on one chip is a huge difference.

Zen has 8 FPUs on 8 cores, PD has 4 FPUs.
 


Pretty much...new x86 extensions make for some nice blue sky numbers.

However, unless you hand-build your own Linux distro specifically for the maximum hardware capability you personally use, default compiler settings will show the gap is not nearly as large as many people want you to believe.
 


The initial launch of Zen will be aimed strictly at the enthusiast market.

There will be Zen offerings later with lower core counts and lower costs than the 8-core flagship, though.

The most mainstream part looks to be a 4 core CPU or APU.
 


It sure would be something if we had a repeat of the Summer of 99. When the Athlon came out, I bought it the very first day. Out went my Pentium III-500 and in went an Athlon-600, and some things ran almost TWICE as fast. Not only that, but that piece of launch-day silicon happily overclocked to 750 with the "gold fingers" device.

Maybe I'm being cynical, but I don't think that's going to happen this time around. They are only showing a 3GHz chip because they're either bluffing or actually struggling to get it to go faster. Time will tell.
 


I expected readers to understand I meant the same number of 128-bit FP units. No need for this erroneous tangent.
 


lol

AMD is faster on Blender, so how is it a failure?

The whole point is that artificial benchmarks showing Intel is 50x better are largely crap. It doesn't matter what extensions you support if no one uses them. This is ultimately why RISC is way better than CISC, and why real cores are better than Hyper-Threading.

Though I blame Microsoft's crappy compiler as well. There is no reason not to use those extensions automatically in many cases, but it never will, and GCC is only a little better.
 


The pre-decode cache on the K6 is not a uop cache. One of the requirements of a uop cache is that it has to be placed after the decode stage of the pipeline, not before, because the uop cache stores the RISC-like uops obtained from decoding the x86 (CISC) instructions.

Zen is the first AMD design with a uop cache.
 


Where did you read the word "failure"? It is certainly not found in my posts, which you are quoting.

Also, it is not proven that "AMD is faster on Blender". What has been demonstrated is much less: an overclocked Zen engineering sample was ~2% faster than an underclocked Broadwell chip under unknown settings (compiler? flags? platform?) using a custom image on an unknown version of Blender. Moreover, that "2% faster" is statistically insignificant because it is smaller than the margin of error, which implies the measured "faster" could be just a random effect.
 


Nope. As everyone now knows, current Zen silicon runs at a 2.8GHz base clock. They had to overclock it to 3GHz. AMD also refused to answer questions from the audience about the TDP of that Zen sample.
 


Just above, and quoted in your post, you can see a Blender benchmark showing the large performance gap between the i5 and the i7. In his internal testing, The Stilt found that Blender has abnormally large gains from SMT. He tested using the same Haswell chip with SMT enabled and disabled:

One more observation regarding Blender: the SMT yield in Blender appears to be unusually high. In similar applications, such as Cinebench, the yield is around 27% on Haswell-E; in Blender the yield is > 59%. The Blender BMW benchmark (at default resolution, 20x20 tiles) was completed in 127.98 seconds with 18C/18T, while with SMT enabled the time was reduced to 90.07 seconds.
 


Why can't AMD have a 3GHz ES in their hands? Why couldn't it be final silicon that they had to underclock to match Intel?

Also, I think they said the TDP was a strategic matter for them, so they wouldn't reveal it until final silicon is ready. I have a vague memory of this last part, so I could be horribly wrong.

Cheers!
 


Doesn't work that way.

First, 200% more is not enough because the performance gap is higher than 4x.

Second, you are counting the same performance gains twice. Zen has 4 ALUs per core; Piledriver has 2 ALUs. This means Zen can execute up to four integer instructions per cycle, which implies twice the peak throughput of Piledriver. But that is peak performance: Zen only has 2 AGUs (the same as Piledriver) and cannot feed data to sustain four ALUs every cycle. This implies that the sustained performance gain will be lower than 2x. We can do a cheap estimation:

Zen: 4 ALU + 2 AGU = 6 execution units
PD: 2 ALU + 2 AGU = 4 execution units

6/4 = 1.5x, which implies Zen could be about 50% faster than Piledriver in sustained workloads. The actual computation is more complex and has to account for other details, including the frequency of use of each unit (ALU, AGU) in real code, but ~50% more than Piledriver is close to the expected performance for Zen.
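
For what it's worth, here is the same cheap estimation as a small Python snippet. The unit counts are the ones above; weighting every execution unit equally is a deliberate simplification, not a claim about how real code behaves:

# Cheap sustained-throughput estimate from execution-unit counts.
# Equal weighting of ALUs and AGUs is a simplification, as noted above.
def sustained_gain(units_new: int, units_old: int) -> float:
    """Relative sustained gain, estimated as the ratio of total execution units."""
    return units_new / units_old - 1.0

zen_units = 4 + 2         # 4 ALUs + 2 AGUs per core
piledriver_units = 2 + 2  # 2 ALUs + 2 AGUs per core
print(f"{sustained_gain(zen_units, piledriver_units):.0%}")  # -> 50%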

Similar remarks apply to floating point. Zen is 16 FLOP/core and has 2x the peak throughput of Piledriver (8 FLOP/core), but on sustained floating-point workloads Zen will be ~70% faster than Piledriver. The same happens on the Intel side: a Haswell core has 2x the FP resources of an Ivy Bridge core, but it is only ~70% faster (sustained performance) clock for clock.
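
In case those peak figures look arbitrary, this is roughly where the 16 and 8 FLOP/core numbers come from. It is my reconstruction, assuming single-precision lanes, counting an FMA as two FLOPs, and giving each Piledriver core half of the module's shared pair of 128-bit FMACs:

# Peak single-precision FLOP/cycle per core from FMA-unit count and width.
# Assumptions: 32-bit SP lanes, FMA counted as multiply + add (2 FLOPs).
def peak_sp_flops_per_cycle(fma_units: int, unit_width_bits: int) -> int:
    lanes = unit_width_bits // 32  # SP lanes per unit
    return fma_units * lanes * 2   # FMA = 2 FLOPs per lane per cycle

zen_core = peak_sp_flops_per_cycle(fma_units=2, unit_width_bits=128)        # 2x 128-bit FMA pipes
piledriver_core = peak_sp_flops_per_cycle(fma_units=1, unit_width_bits=128) # half the module's shared FMACs
print(zen_core, piledriver_core)  # -> 16 8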

For code that mixes integer and floating point we can take an average, ~60% over Piledriver, which roughly corresponds to the 40% over Excavator officially claimed by AMD.
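
And a sketch of that blend, just to make the averaging explicit. The 50/50 integer/FP mix is an illustrative assumption; real workloads weight the two sides differently:

# Blend the sustained integer (~50%) and floating-point (~70%) estimates.
int_gain = 0.50   # from the execution-unit ratio above
fp_gain = 0.70    # sustained FP gain quoted above
int_share = 0.5   # assumed fraction of integer work in the mix
mixed_gain = int_share * int_gain + (1 - int_share) * fp_gain
print(f"{mixed_gain:.0%}")  # -> 60%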

You are taking the peak throughput gain (2x) and adding the average throughput gain (1.4x) on top of it. You are counting the same gain twice.

Finally, I don't know what you mean by Piledriver not supporting AVX. AMD has supported AVX since Bulldozer:

http://developer.amd.com/community/blog/2009/05/06/striking-a-balance/
 


Because the only known silicon is 2.8GHz.

Because they set the frequency to 3GHz without turbo when running Blender.

Because they had to underclock the Broadwell sample.

Because if they had final silicon they wouldn't announce a six-month delay of the chip.

When pressed about the TDP of final silicon, AMD said "comparable to Broadwell".
 
None of this is terribly promising, IMO. If we look at what people are buying, it's the high-clocked, high-IPC quad-core chips that Intel sells the most of. Haswell-E and Broadwell-E absolutely crush consumer Skylake parts in multi-threaded performance, but that doesn't stop people from choosing the 6700K, a single SKU, more than 5 times as often as all Broadwell-E parts combined and more than 3 times as often as all Haswell-E parts combined (6700K market share 5.2%, all BW-E 0.9%, all HW-E 1.4%; source: UserBenchmark). They are going to have to do better than matching the performance profile of a CPU that no one is buying. If they could release a quad-core Zen clocked at 5GHz, they'd have a slam dunk, but it looks like we are nowhere near that.
 


I agree ... there is no evidence of this.

Hmm ... J, once again you are making statements without the facts to substantiate your claims.




 