News Ryzen 9000 CPUs drop 10% frequency executing AVX-512 instructions — Intel CPUs typically suffer from more substantial clock speed drops

King_V

Illustrious
Ambassador
Ok, maybe this is nitpicking, but if we're being honest, shouldn't the headline say "drops less than 10%" or "drops 7%"?

A 10% drop from 5700 MHz would be 5130 MHz. But, the article states:
it reduces clock speed from 5,700 MHz to 5,300 MHz
That's a 400 MHz drop. 400 is 7% of 5700. Or, 7.018%, depending on how precise you need to be.


And, given that 10 is about 42.9% higher than 7....
I mean, at that point, I might say "Headline overstates AMD clock frequency drop by 43%"
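The arithmetic is easy to sanity-check (a quick sketch; the 5,700 and 5,300 MHz figures are the ones quoted from the article):

```python
base_mhz = 5700
avx512_mhz = 5300

# Relative clock drop when executing AVX-512 instructions
drop_pct = (base_mhz - avx512_mhz) / base_mhz * 100
print(f"{drop_pct:.3f}%")  # 7.018%

# How much a "10%" headline overstates a 7% drop
overstatement_pct = (10 / 7 - 1) * 100
print(f"{overstatement_pct:.1f}%")  # 42.9%
```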
 

Jame5

Great
Sep 5, 2024
36
31
60
Two statements that feel very contradictory in back to back paragraphs.


1. "AMD's Zen 5-based desktop processors have four full-width 512-bit execution units for AVX-512, which makes execution of such instructions very efficient..."

2. "...it should be noted that by avoiding the implementation of full-blown 512-bit data paths, AMD makes its cores slightly more compact."

So are we praising AMD for having 512-bit execution, or for not having it?
 
  • Like
Reactions: usertests

bit_user

Titan
Ambassador
The article said:
it remains to be seen how much power AMD Ryzen 9000-series processors do (which are made on TSMC's N4P, 4nm-class process technology) consume when executing AVX-512 instructions.
Phoronix tested this, of course. The test suite-wide average went from 152.26 W with AVX-512 off to 148.64 W with it on. Sounds like I mixed those up, doesn't it? No, that's what the data says!

I think it's because their power limit-based clock throttling might have a tendency to overshoot, which would hit the AVX-512 case more, since that should be power-limited more frequently & severely. Just a guess.

The article said:
AMD's mobile parts, such as the codenamed Strix Point processors, use double-pumped AVX-256 to execute AVX-512 instructions.

While such an approach will probably confuse software developers and, to some degree, end users,
Huh? Why would it confuse software developers? Just like Zen 4, software sees them as full-blown AVX-512 implementations, which they are!

With AVX10, Intel is introducing the new requirement where you have to do a runtime check to see how wide the CPU's implementation is. Then, for software written to extract the most performance, precompiled* software would have to dispatch down either a 256-bit or 512-bit code path, because the operand width of the instructions is baked into the instruction stream.

* As opposed to JIT-compiled languages, like JavaScript or WebAssembly, where the JIT compiler can presumably generate autovectorized code suited to the native implementation.
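The dispatch pattern being described can be sketched like this (a toy sketch: `detect_max_vector_bits()` is a hypothetical stand-in for the real CPUID/AVX10 enumeration a native program would perform, and the two paths stand in for separately compiled 256-bit and 512-bit loops):

```python
# Width-based dispatch: precompiled code picks a 256-bit or 512-bit
# code path once, based on a runtime capability query.

def detect_max_vector_bits() -> int:
    # Hypothetical stand-in for the CPUID/AVX10 width check.
    return 256  # pretend this CPU only implements 256-bit datapaths

def sum_path_256(data):
    # Stand-in for a loop compiled with 256-bit (8 x float32) vectors.
    return sum(data)

def sum_path_512(data):
    # Stand-in for a loop compiled with 512-bit (16 x float32) vectors.
    return sum(data)

# Resolve the function pointer once at startup, then call it everywhere.
_sum = sum_path_512 if detect_max_vector_bits() >= 512 else sum_path_256

print(_sum([1.0, 2.0, 3.0]))  # takes the 256-bit path on this "CPU"
```

A JIT, by contrast, can skip the dual-path binary entirely and emit only the code suited to the width it detects.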
 
Last edited:

bit_user

Titan
Ambassador
Two statements that feel very contradictory in back to back paragraphs.
There's no contradiction. He's talking about the desktop & server CPUs implementing at 512-bit, then later explaining why the mobile cores are still limited to a 256-bit "double-pumped" implementation.

It actually aligns with Intel's stated plan of using AVX10/256 for client processors and AVX10/512 for server processors. The chiplet-based Ryzens use the same CCD chiplets as AMD's server CPUs, so the current crop of desktop processors can be seen as getting their 512-bit implementation by riding the coattails of the servers. However, once AMD brings Strix Point to the desktop, the picture will get more nuanced, because we'll be back to having some current-model AM5 CPUs with the same approach Zen 4 used.
 
  • Like
Reactions: King_V and Makaveli

bit_user

Titan
Ambassador
I read ChipsandCheese and now I’m gonna show people how smart I am. Parts of this literally read like they’re ripped straight from ChipsandCheese.
FWIW, I don't care if Toms talks and sounds like ChipsAndCheese, as long as they get the details right (and, of course, cite the other publication where & when actually quoting it or using their data).
 

mac_angel

Distinguished
Mar 12, 2008
663
140
19,160
Is Intel having trouble getting AVX-512 to work? They've dropped it from the past couple of generations, and from what I've seen, the new ones aren't going to support it either. Meanwhile, more and more games and programs do support it.

edit: been trying to find a list of games on Google, but most of the posts are a couple of years old.
 

bit_user

Titan
Ambassador
Is Intel having troubles getting AVX-512 to work?
No, the problem is that it takes up too much area in their E-cores, so it undermines their hybrid strategy.

AMD's demonstration that you can implement it at (mostly) 256-bit isn't quite the rebuke that it seems, since I'm pretty sure the E-cores actually have a 128-bit implementation of AVX2 (which is natively 256-bit). That would mean having to implement AVX-512 via 4x 128-bit, which might become rather complex, as in the case of "horizontal" operations.
 
Phoronix tested this, of course. The test suite-wide average went from 152.26 W with AVX-512 off to 148.64 W with it on. Sounds like I mixed those up, doesn't it? No, that's what the data says!

I think it's because their power limit-based clock throttling might have a tendency to overshoot, which would hit the AVX-512 case more, since that should be power-limited more frequently & severely. Just a guess.
That, or when AVX512 instructions are running the rest of the core's units are powered down - the AVX512 execution unit becomes the hot spot the CPU needs to throttle down for, but it still uses a bit less power than the whole core does.
 

Pierce2623

Prominent
Dec 3, 2023
485
368
560
No, the problem is that it takes up too much area in their E-cores, so it undermines their hybrid strategy.

AMD's demonstration that you can implement it at (mostly) 256-bit isn't quite the rebuke that it seems, since I'm pretty sure the E-cores actually have a 128-bit implementation of AVX2 (which is natively 256-bit). That would mean having to implement AVX-512 via 4x 128-bit, which might become rather complex, as in the case of "horizontal" operations.
That’s pure BS. Skymont is a wider core than Zen 5. It’s 9 wide decode and 8 ALUs wide in the backend. It also has four 128 bit floating point registers which is technically 100% capable of running avx512. Go get some more marketing BS to regurgitate because that one is a total lie.
 
Last edited:
How can you compare it to Intel chips when Intel chips have it disabled for the past 3 generations? :cool:
Early 12th gen had AVX512 disabled by firmware only - disabling the E-cores re-enabled AVX512 on those, allowing for testing.
Once that workaround was known, Intel fused off the AVX512 units, but on Xeon (using only P-cores) those units didn't change much between 12th and 13th/14th gens.
And, yes, you can simply benchmark a 14th-gen Core Xeon and call it a day.
 

edzieba

Distinguished
Jul 13, 2016
578
583
19,760
That’s pure BS. Skymont is a wider core than Zen 5. It’s 9 wide decode and 8 ALUs wide in the backend. It also has four 128 bit floating point registers which is technically 100% capable of running avx512. Go get some more marketing BS to regurgitate because that one is a total lie.
In what way is it a lie? Intel's E-cores take up less die area (i.e. are smaller, as claimed) than the Zen-C cores. An E-core takes up ~25% of the die area of a P-core, whilst a Zen-C core takes up 65% of the die area of a Zen non-C core. You can pack 4x E-cores into the same die area as a single P-core, whilst you can pack only 1.5x Zen-C cores into the die area of a single Zen non-C core. In direct die area (remember, different processes), a Zen 4C core is 2.48mm^2, whilst an Alder Lake E-core is 1.63mm^2, or about two-thirds the area.

AMD chose to maintain 1:1 ISA compatibility for their compact cores at the expense of those cores not being much more compact, whilst Intel went for extreme compactness at the expense of ISA extension parity.
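The area figures quoted above work out as follows (a quick sanity check of the quoted numbers only; the two cores are on different process nodes, so the comparison is rough):

```python
# Die areas quoted in the post (different process nodes, so only rough).
zen4c_mm2 = 2.48      # Zen 4C core
gracemont_mm2 = 1.63  # Alder Lake E-core

ratio = gracemont_mm2 / zen4c_mm2
print(f"E-core is {ratio:.0%} of a Zen 4C core")  # about two-thirds
```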
 
Last edited:

TheHerald

Respectable
BANNED
Feb 15, 2024
1,633
502
2,060
Early 12th gen had AVX512 disabled by firmware only - disabling the E-cores re-enabled AVX512 on those, allowing for testing.
Once that workaround was known, Intel fused off the AVX512 units, but on Xeon (using only P-cores) those units didn't change much between 12th and 13th/14th gens.
And, yes, you can simply benchmark a 14th-gen Core Xeon and call it a day.
I know, I have one of those early 12900Ks. I don't see any drop in clock speed with it enabled.
 

Pierce2623

Prominent
Dec 3, 2023
485
368
560
In what way is it a lie? Intel's E-cores take up less die area (i.e. are smaller, as claimed) than the Zen-C cores. An E-core takes up ~25% of the die area of a P-core, whilst a Zen-C core takes up 65% of the die area of a Zen non-C core. You can pack 4x E-cores into the same die area as a single P-core, whilst you can pack only 1.5x Zen-C cores into the die area of a single Zen non-C core. In direct die area (remember, different processes), a Zen 4C core is 2.48mm^2, whilst an Alder Lake E-core is 1.63mm^2, or about two-thirds the area.

AMD chose to maintain 1:1 ISA compatibility for their compact cores at the expense of those cores not being much more compact, whilst Intel went for extreme compactness at the expense of ISA extension parity.
Skymont changed everything. It was literally a 180-degree turn in E-core strategy. The old E-cores were only 4 ALUs wide. E-cores are no longer 25% of P-core area, but more like 70%, or roughly similar to Zen 5.
 

edzieba

Distinguished
Jul 13, 2016
578
583
19,760
Skymont changed everything. It was literally a 180-degree turn in E-core strategy. The old E-cores were only 4 ALUs wide. E-cores are no longer 25% of P-core area, but more like 70%, or roughly similar to Zen 5.
Instruction width does not tell you much about die area, e.g. a Zen 5 (non-C) core is 8-wide, whilst Golden Cove is 6-wide but a physically much larger core.
Also, Crestmont was 6-wide, not 4-wide.
 

bit_user

Titan
Ambassador
That, or when AVX512 instructions are running the rest of the core's units are powered down - the AVX512 execution unit becomes the hot spot the CPU needs to throttle down for, but it still uses a bit less power than the whole core does.
This is not grounded in reality.

The entire core runs in the same clock domain, and there's no evidence they're dynamically powering down parts of the core. They've never said anything like that, and we don't see any evidence of it in the benchmarks. Doing something like that would take a decent amount of time to flush out your pipelines and add a lot of complexity.

I'm not even sure why you decided to go there, when we know they have a perfectly viable way to modulate power consumption by their cores, which is simply to dial back the clock frequency by up to a few hundred MHz. That has a far less drastic impact on performance and it's something we have evidence of them actually doing.
 

bit_user

Titan
Ambassador
That’s pure BS. Skymont ...
Why are you talking about Skymont, as if Intel is only cancelling AVX-512 in client CPUs now?

... is a wider core than Zen 5. It’s 9 wide decode and 8 ALUs wide in the backend.
The suggestion is that the 3-wide decode clusters in Skymont can each only work on one branch target. This is materially different than a 9-wide decoder and it's not like Zen 5's ability to team its 2x 4-wide decoders on a single instruction stream.

As for the number of ALUs, you can't just boil down an entire core to that number. There are tons of other details that affect how much throughput it actually gets, and Intel even said of the 8-wide design: "It was cheap to add in terms of area and it helps with the peaks." This suggests that utilization of all 8 is not high. I would also point out that Zen 5 has 3 multiply-capable ALUs, among its 6, whereas Skymont has only 2.

It also has four 128 bit floating point registers which is technically 100% capable of running avx512.
That's not technically true. The first issue is that Zen 4 did indeed have to incorporate some specialized machinery to handle "horizontal" AVX-512 instructions that you can't do by simply breaking up one 512-bit operation into 4x 128-bit operations.

Secondly, if you would implement AVX-512 via 128-bit ops, the way to do it is just break down each 512-bit operation into a multiple of 4 micro-ops and then let the FP scheduler decide how to schedule them. You don't need the execution ports to be some multiple of 512.

Finally, when all is said and done, whether it's actually a win to implement 512-bit ops this way vs. only supporting up to 256-bit operands is an open question. The main point of even going to 512-bit vectors is to process more data per op. If you're having to dispatch 4x as many µops per instruction, then the frontend probably isn't your bottleneck anymore. There are plenty of examples of CPUs that break vector ops in half, but I'm not aware of any that break them into 4 parts. That should probably tell us something.
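The split being described can be illustrated with plain Python ints standing in for vector lanes (a toy sketch of the decomposition, not how hardware actually issues µops):

```python
# Model a 512-bit vector as 4 "128-bit" lanes of 4 x 32-bit ints each.
v512_a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
v512_b = [[1] * 4, [1] * 4, [1] * 4, [1] * 4]

# Element-wise ops decompose cleanly: one 512-bit add becomes four
# independent 128-bit micro-ops the scheduler can issue as it likes.
def add_as_4x128(a, b):
    return [[x + y for x, y in zip(la, lb)] for la, lb in zip(a, b)]

# "Horizontal" ops don't: a 512-bit reduction needs an extra cross-lane
# combine step after the per-lane work, which is the sort of specialized
# machinery mentioned above that Zen 4 had to add.
def hsum_as_4x128(a):
    per_lane = [sum(lane) for lane in a]  # four 128-bit partial sums
    return sum(per_lane)                  # cross-lane combine

print(add_as_4x128(v512_a, v512_b)[0])  # [2, 3, 4, 5]
print(hsum_as_4x128(v512_a))            # 136
```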

Go get some more marketing BS to regurgitate because that one is a total lie.
U mad, bro? Sounds to me like this might be too emotive a subject for you to have a serious technical discussion about.
 
  • Like
Reactions: TesseractOrion

bit_user

Titan
Ambassador
Early 12th gen had AVX512 disabled by firmware only - disabling the E-cores re-enabled AVX512 on those, allowing for testing.
From what I heard, it took more than disabling the E-cores. There was something else the BIOS had to do, involving some kind of undocumented diagnostic mode that Intel left exposed.

Once that workaround was known, Intel fused off the AVX512 units, but on Xeon (using only P-cores) those units didn't change much between 12th and 13th/14th gens.
And, yes, you can simply benchmark a 14th-gen Core Xeon and call it a day.
Ugh, where to start unpicking this ball of confusion?

First, we have to define what's a "Xeon". The Xeon E-series are nothing more than mainstream desktop CPU dies, in a mainstream socket, but requiring a special motherboard chipset (W680, in this case), in order to utilize their extra features. Intel never made Xeon E versions of the Gen 12 CPUs. Instead, Intel enabled some Xeon-class features on upper-end mainstream Gen 12 models, provided you used them on a mobo with the W680 chipset.

As for "Gen 13", Intel did finally release a Xeon E-series based on Raptor Lake dies, but they had the E-cores and AVX-512 both disabled. Also, the iGPU was fused off, whereas previous generations of Xeon E-series still offered iGPU on select models. There was no "Gen 14" refresh of the Xeon E-series.

Next, Intel has the Xeon W-series, which is a completely different die. As a matter of fact, they use two different die configurations, with the low-end going up to 24 P-cores and 4 memory channels, on a monolithic die, and the larger one using (I think) the same die configuration as the Xeon Scalable server CPUs, with 8 memory channels. The P-cores in all of these CPUs are fundamentally different from the ones in the desktop product line, in that they have an additional AVX-512 FMA port and an AMX unit in each core. They also feature a larger L2 cache, equal to that found in Raptor Cove, yet these cores aren't Raptor Cove. It's common for Intel to increase cache sizes in their server core variants and add a second AVX-512 FMA port.

In summary, the only way to do an apples-to-apples comparison of AVX-512 between Intel and AMD, on desktop platforms, is with those early Gen 12 dies. You cannot benchmark AVX-512 on any client cores newer than that.
 
Last edited:
This is not grounded in reality.

The entire core runs in the same clock domain, and there's no evidence they're dynamically powering down parts of the core. They've never said anything like that, and we don't see any evidence of it in the benchmarks. Doing something like that would take a decent amount of time to flush out your pipelines and add a lot of complexity.

I'm not even sure why you decided to go there, when we know they have a perfectly viable way to modulate power consumption by their cores, which is simply to dial back the clock frequency by up to a few hundred MHz. That has a far less drastic impact on performance and it's something we have evidence of them actually doing.
On AMD chips, it is - as they are able to shut down unused units in a core. If an integer unit, or the FPU, or the SIMD units aren't working, they're powered down. Since the AVX512 units heat up a lot when in use, they can create a localized hot spot, which will cause downclocking, but the total power draw will be smaller than if the whole CPU is under compute load. In normal use with heterogeneous loads it probably won't occur, but on such a benchmark, that's AFAIK quite likely.
 

bit_user

Titan
Ambassador
On AMD chips, it is - as they are able to shut down unused units in a core. If an integer unit, or the FPU, or the SIMD units aren't working, they're powered down.
You're thinking of clock gating. That's different. The unit is still powered and maintains state, but it's just not burning as much power.

Since the AVX512 units heat up a lot when in use, they can create a localized hot spot, which will cause downclocking, but the total power draw will be smaller than if the whole CPU is under compute load. In normal use with heterogeneous loads it probably won't occur, but on such a benchmark, that's AFAIK quite likely.
I think you're still telling fairy stories. Give us a source or it's not a "thing".
 
  • Like
Reactions: TesseractOrion