Discussion: AMD Ryzen



The micro-op cache is an interesting addition. Intel introduced a 1.5K uOp cache to Sandy Bridge, though it's hard to say how much of Sandy's IPC improvement directly resulted from it. What's ironic is that the AMD-K6 was the first x86 to have a micro-op cache, and it was a whopping 20K at that.
 


The argument made by David Kanter and me already considers asymmetric ALUs (that is what he means when he mentions branches).
 




The problem is not with using 128-bit units. AMD could have used 4x 128-bit units to get twice the maximum throughput, but then they would have had to improve everything from the front end to the commit stage, including doubling the caches' bandwidth, which would have created very complex engineering problems.

I see the same mistake when people claim that Bulldozer performed badly because the FPU is shared between the two cores in a module. This is not right. Bulldozer performed badly because the FPU was only 256 bits wide. If a Bulldozer module had incorporated a 1024-bit-wide unit, it would beat the best designs Intel has. But of course a 4x bigger FPU would require lots of extra transistors and logic and power and...

And the excuse that AMD Zen uses 128-bit units because 256-bit is not very popular doesn't hold up on close inspection, because (i) AMD does support 256-bit on Zen by fusing the pair of 128-bit units, and (ii) AMD has been supporting ISAs with far less popularity, including HSA.
 


Minimal, because instructions already stream from the instruction cache at the pipeline's full rate; the new design saves cycles only when execution can be restarted from the L0 cache after a mispredicted branch. The major improvement introduced by the uop cache is on the efficiency side, because the x86 decoders are power hungry and can be shut down when fetching from the uop cache. In Haswell the total power is reduced by about 12%.



Link? I have just looked at the K6 diagram and there is no mention of that.
 


Zen is just the architecture, so yes there will be a vast range of CPUs for different price points.
 


That is actually not so hard to believe...

Consider this...

The 8C Zen has twice as many FPUs. So, assuming they are equal, we are already at +100% performance. Now we can assume +40% over Piledriver per AMD's estimates, and finally we can also include the fact that Zen can run AVX but Piledriver cannot, which would put another theoretical 60% of blue-sky headroom for benchmarks in there.

So we end up at +200% total, which is not far off your calculations (a 20% difference...).

 


SMT changes very little, basically nothing in FPU ops.

However, having double the physical number of FPUs on one chip is a huge difference.

Zen has 8 FPUs on 8 cores, PD has 4 FPUs.
 


Pretty much...new x86 extensions make for some nice blue sky numbers.

However, unless you hand-build your own Linux distro specifically for the maximum hardware capability you personally use, default compiler settings will show the gap is not nearly as large as many people want you to believe.
 


The initial launch of Zen will be aimed strictly at the enthusiast market.

There will be Zen offerings later with lower core counts and lower costs than the 8-core flagship, though.

The most mainstream part looks to be a 4 core CPU or APU.
 


It sure would be something if we had a repeat of the Summer of 99. When the Athlon came out, I bought it the very first day. Out went my Pentium III-500 and in went an Athlon-600, and some things ran almost TWICE as fast. Not only that, but that piece of launch-day silicon happily overclocked to 750 with the "gold fingers" device.

Maybe I'm being cynical, but I don't think that's going to happen this time around. They are only showing a 3GHz chip because they're either bluffing or actually struggling to get it to go faster. Time will tell.
 


I expected readers to understand I meant the same number of 128-bit FP units. No need for this erroneous tangent.
 


lol

AMD is faster on Blender, so how is it a failure?

The whole point is that artificial benchmarks showing Intel is 50x better are largely crap. It doesn't matter what extensions you support if no one uses them. This is ultimately why RISC is way better than CISC, and why real cores are better than Hyper-Threading.

Though I blame Microsoft's crappy compiler as well. There is no reason not to use those extensions automatically in many cases, but it never will, and GCC is only a little better.
 


The pre-decode cache on the K6 is not a uop cache. One of the requirements of a uop cache is that it has to be placed after the decode stage of the pipeline, not before, because the uop cache stores the RISC-like uops obtained from decoding the x86 (CISC) instructions.

Zen is the first AMD design with a uop cache.
 


Where did you read the word "failure"? It is certainly not found in my posts, which you are quoting.

Also, it is not proven that "AMD is faster on Blender". What has been demonstrated is much less: an overclocked Zen engineering sample was ~2% faster than an underclocked Broadwell chip under unknown settings (compiler? flags? platform?) using a custom image on an unknown version of Blender. Moreover, that "2% faster" is statistically insignificant because it is smaller than the margin of error, which implies the measured "faster" could be just a random effect.
 


Nope. As everyone now knows, current Zen silicon runs at a 2.8GHz base clock. They had to overclock it to 3GHz. AMD also refused to answer questions from the audience about the TDP of that Zen sample.
 


Just above, and quoted in your post, you can see a Blender benchmark showing the large performance gap between the i5 and the i7. In his internal testing, The Stilt found that Blender has abnormally large gains from SMT. He tested using the same Haswell chip with SMT enabled and disabled:

One more observation regarding Blender: the SMT yield in Blender appears to be unusually high. In similar applications, such as Cinebench, the yield is around 27% on Haswell-E; in Blender the yield is > 59%. The Blender BMW benchmark (at default resolution, 20x20 tiles) was completed in 127.98 seconds with 18C/18T, while with SMT enabled the time was reduced to 90.07 seconds.
 


Why can't AMD have a 3GHz ES in their hands? Why couldn't it be final silicon that they had to underclock to match Intel?

Also, I think they said the TDP was a strategic matter for them, so they wouldn't reveal it until final silicon is ready. I have a vague memory of this last part, so I could be horribly wrong.

Cheers!
 


Doesn't work that way.

First, 200% more is not enough because the performance gap is higher than 4x.

Second, you are counting the same performance gains twice. Zen has 4 ALUs per core; Piledriver has 2 ALUs. This means Zen can execute up to four integer instructions per cycle, which implies twice the peak throughput of Piledriver. But that is peak performance: Zen only has 2 AGUs (the same as Piledriver) and cannot feed data to sustain four ALUs every cycle. This implies that the sustained performance gain will be lower than 2x. We can do a cheap estimation:

Zen: 4 ALU + 2 AGU = 6 execution units
PD: 2 ALU + 2 AGU = 4 execution units

6/4 = 1.5x, which implies Zen could be about 50% faster than Piledriver in sustained workloads. The actual computation is more complex and has to account for other details, including the frequency of use of each unit (ALU, AGU) in real code, but ~50% more than Piledriver is close to the expected performance for Zen.
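
For what it's worth, here is the same cheap estimation as a small Python snippet. The unit counts are the ones above; weighting every execution unit equally is a deliberate simplification, not a claim about how real code behaves:

# Cheap sustained-throughput estimate from execution-unit counts.
# Equal weighting of ALUs and AGUs is a simplification, as noted above.
def sustained_gain(units_new: int, units_old: int) -> float:
    """Relative sustained gain, estimated as the ratio of total execution units."""
    return units_new / units_old - 1.0

zen_units = 4 + 2         # 4 ALUs + 2 AGUs per core
piledriver_units = 2 + 2  # 2 ALUs + 2 AGUs per core
print(f"{sustained_gain(zen_units, piledriver_units):.0%}")  # -> 50%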

Similar remarks apply to floating point. Zen is 16 FLOP/core and has 2x the peak throughput of Piledriver (8 FLOP/core), but on sustained floating-point workloads Zen will be ~70% faster than Piledriver. The same happens on the Intel side: a Haswell core has 2x the FP resources of an Ivy Bridge core, but it is only ~70% faster (sustained performance) clock for clock.
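
In case those peak figures look arbitrary, this is roughly where the 16 and 8 FLOP/core numbers come from. It is my reconstruction, assuming single-precision lanes, counting an FMA as two FLOPs, and giving each Piledriver core half of the module's shared pair of 128-bit FMACs:

# Peak single-precision FLOP/cycle per core from FMA-unit count and width.
# Assumptions: 32-bit SP lanes, FMA counted as multiply + add (2 FLOPs).
def peak_sp_flops_per_cycle(fma_units: int, unit_width_bits: int) -> int:
    lanes = unit_width_bits // 32  # SP lanes per unit
    return fma_units * lanes * 2   # FMA = 2 FLOPs per lane per cycle

zen_core = peak_sp_flops_per_cycle(fma_units=2, unit_width_bits=128)        # 2x 128-bit FMA pipes
piledriver_core = peak_sp_flops_per_cycle(fma_units=1, unit_width_bits=128) # half the module's shared FMACs
print(zen_core, piledriver_core)  # -> 16 8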

For code that mixes integer and floating point we can take an average, ~60% over Piledriver, which roughly corresponds to the 40% over Excavator officially claimed by AMD.
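
And a sketch of that blend, just to make the averaging explicit. The 50/50 integer/FP mix is an illustrative assumption; real workloads weight the two sides differently:

# Blend the sustained integer (~50%) and floating-point (~70%) estimates.
int_gain = 0.50   # from the execution-unit ratio above
fp_gain = 0.70    # sustained FP gain quoted above
int_share = 0.5   # assumed fraction of integer work in the mix
mixed_gain = int_share * int_gain + (1 - int_share) * fp_gain
print(f"{mixed_gain:.0%}")  # -> 60%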

You are taking the peak throughput gain (2x) and adding the average throughput gain (1.4x) on top of it. You are counting the same gain twice.

Finally, I don't know what you mean by Piledriver not supporting AVX. AMD has supported AVX since Bulldozer:

http://developer.amd.com/community/blog/2009/05/06/striking-a-balance/
 


Because the only known silicon is 2.8GHz.

Because they set the frequency to 3GHz without turbo when running Blender.

Because they had to underclock the Broadwell sample.

Because if they had final silicon they wouldn't announce a six-month delay of the chip.

When pressed about the TDP of final silicon, AMD said "comparable to Broadwell".
 
None of this is terribly promising, IMO. If we look at what people are buying, it's the high-clocked, high-IPC quad-core chips that Intel sells the most of. Haswell-E and Broadwell-E absolutely crush consumer Skylake parts in multi-threaded performance, but that doesn't stop people from choosing the 6700K, a single SKU, more than 5 times as often as all Broadwell-E parts combined and more than 3 times as often as all Haswell-E parts combined (6700K market share 5.2%, all BW-E 0.9%, all HW-E 1.4%; source: UserBenchmark). They are going to have to do better than matching the performance profile of a CPU that no one is buying. If they could release a quad-core Zen clocked at 5GHz, they'd have a slam dunk, but it looks like we are nowhere near that.
 


I agree ... there is no evidence of this.

Hmm ... J, once again you are making statements without the facts to substantiate your claims.




 