News Intel's Granite Rapids listed with huge L3 cache upgrade to tackle AMD EPYC - software development emulator spills the details

Admin · Jan 21, 2024

Intel's SDE tool receives an update alluding to a boosted L3 cache spec on Intel's next-gen Granite Rapids Xeon CPUs. This uplift in L3 cache should net a boost in AI workloads, as well as certain productivity workloads reliant on fast memory access.


richardvday · Jan 21, 2024

AMD innovates and Intel copies them. Nothing new same thing different day.

-Fran- · Jan 21, 2024

Ugh... So we're doing this dance, uh?

What matters is how much L3/L4/eRAM/etc each core has access to (I don't remember ever seeing L2 shared, and even less L1) and the associated latency. Saying it has more of it, while positive, is hardly something to be excited about, sorry.

Still, an improvement is an improvement, so good on Intel for trying to bring competition.

Regards.

bit_user · Jan 21, 2024

-Fran- said:
What matters is how much L3/L4/eRAM/etc amounts each core has access to

Pretty much what I was going to say. If you're increasing core counts, then you obviously have to increase L3 commensurately!

In Tom's own reporting, they quoted Granite Rapids at 84-90 cores.

https://www.tomshardware.com/news/intel-displays-granite-rapids-cpus-as-specs-leak-five-chiplets

That's 1.3x to 1.4x as many cores as Emerald Rapids. Hence, we would logically expect 420 to 450 MiB of L3 cache, if they merely wanted to keep the same per-core cache levels.

480 MiB is only 14.3% to 6.7% above those estimates. So, it wouldn't represent a very significant increase in per-core L3, if the core count estimates are correct.
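For anyone who wants to check that arithmetic, here is a minimal Python sketch. It assumes the figures quoted in this thread (64 cores and 320 MiB for the Xeon 8592+, 480 MiB for the listing, 84-90 cores for Granite Rapids); none of them are confirmed specs, and the constant and function names are just for illustration.

```python
# Figures as quoted in this thread; treat them as estimates, not official specs.
EMR_CORES = 64                    # Xeon 8592+ (Emerald Rapids)
EMR_L3_MIB = 320
GNR_L3_MIB = 480                  # figure attributed to the SDE listing
GNR_CORE_ESTIMATES = (84, 90)     # leaked core-count range


def scaled_l3(cores):
    """L3 (MiB) needed to keep Emerald Rapids' per-core cache at a new core count."""
    return EMR_L3_MIB / EMR_CORES * cores


def uplift_pct(cores):
    """How far the listed 480 MiB sits above the constant-per-core estimate, in percent."""
    return (GNR_L3_MIB / scaled_l3(cores) - 1) * 100


for cores in GNR_CORE_ESTIMATES:
    print(
        f"{cores} cores: {scaled_l3(cores):.0f} MiB expected, "
        f"{uplift_pct(cores):.1f}% uplift, "
        f"{GNR_L3_MIB / cores:.2f} MiB per core"
    )
```

If the final core count lands outside the 84-90 range, only `GNR_CORE_ESTIMATES` needs to change.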

bit_user · Jan 21, 2024

Keep in mind that a shared L3 cache on a CPU corresponds to the level of cache that's shared by and accessible to all CPU cores on the die

It's not such a strict definition, actually. For instance, AMD limits the scope of L3 to a single chiplet. That makes it a little tricky to compare raw L3 quantities between EPYC and Xeon.

Emerald Rapids' L3 cache improvement still managed to push Intel's Xeon CPUs into competition with EPYC Bergamo chips while technically being a refresh cycle, with solid wins in most AI workloads we tested, and at least comparable performance in most other benchmarks.

Eh, not really. The AI benchmarks were largely due to the presence of AMX, in the Xeons. With that, even Sapphire Rapids could outperform Zen 4 EPYC (Genoa) on such tasks.

As for the other benchmarks you list, some exhibit poor multi-core scaling, which is why Phoronix excluded them from his test suite. If you check the geomeans, on the last page of this article, it's a bloodbath (and the blood is blue):

https://www.phoronix.com/review/intel-xeon-platinum-8592

It's telling that AMD didn't even need raw core-count to win. Look at the EPYC 9554 vs. Xeon 8592+. Both are 64-core CPUs, rated at 360 (EPYC) and 350 (Xeon) Watts. The EPYC beats the Xeon by 3.6% in 2P configuration and loses by just 2.2% in a 1P setup. That speaks volumes to how well AMD executed on this generation. Also, the EPYC averaged just 227.12 and 377.42 W in 1P and 2P configurations, respectively, while the Xeon burned 289.52 and 556.83 W. So, that seeming 10 W TDP advantage for the EPYC isn't decisive, in actual practice.
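To put the performance and power figures together, here is a rough perf-per-watt sketch in Python. It uses only the numbers quoted above, and it assumes "beats by 3.6%" means 1.036x the Xeon's geomean and "loses by 2.2%" means 0.978x; the other reading of those percentages shifts the result slightly. The names are made up for the example.

```python
# Average package power (W) as quoted above from the Phoronix review.
AVG_POWER_W = {
    "1P": {"EPYC 9554": 227.12, "Xeon 8592+": 289.52},
    "2P": {"EPYC 9554": 377.42, "Xeon 8592+": 556.83},
}

# EPYC geomean performance relative to the Xeon (Xeon = 1.0).
# Assumption: "loses by 2.2%" is read as 0.978x, "beats by 3.6%" as 1.036x.
EPYC_REL_PERF = {"1P": 0.978, "2P": 1.036}


def perf_per_watt_ratio(config):
    """EPYC perf/W divided by Xeon perf/W for one socket configuration."""
    power = AVG_POWER_W[config]
    return EPYC_REL_PERF[config] * power["Xeon 8592+"] / power["EPYC 9554"]


for config in ("1P", "2P"):
    print(f"{config}: EPYC perf/W is {perf_per_watt_ratio(config):.2f}x the Xeon's")
```

The ratio inherits whatever the reported power readings do or don't include, so it is only as good as those averages.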

BTW, since this article is about cache, consider the EPYC 9554 has just 256 MiB of L3 cache, while the Xeon 8592+ has 320 MiB. So, even a decisive cache advantage wasn't enough to put Emerald Rapids solidly ahead of Genoa, on a per-core basis.

rluker5 · Jan 21, 2024

richardvday said:
AMD innovates and Intel copies them. Nothing new same thing different day.

My 4980hq says hi. (134MB L3+L4 cache in 2014) Pretty sure that wasn't the first time cache was relatively increased by a lot, but it is a good enough example.

thestryker · Jan 21, 2024

bit_user said:
Also, the EPYC averaged just 227.12 and 377.42 W in 1P and 2P configurations, respectively, while the Xeon burned 289.52 and 556.83 W. So, that seeming 10 W TDP advantage for the EPYC isn't decisive, in actual practice.

Whatever monitoring they're using isn't picking up the IO die; you can tell by the minimum power consumption numbers on the AMD side.

bit_user · Jan 21, 2024

thestryker said:
Whatever monitoring they're using isn't picking up the IO die; you can tell by the minimum power consumption numbers on the AMD side.

The 9554 has a minimum power of 11.1 W and 25.6 W in 1P and 2P configurations, respectively. Why do you find that too far-fetched, for CPU package idle power figures?

Yes, the minimum spec EPYC uses just 6.7 W, but that's also the Siena platform, with fewer max chiplets, memory channels, and I/O. Also, I think its socket might only support 1P configurations.

Then again, there are some pretty absurd max figures. So, maybe the power-sampling is a little wonky, when it comes to min/max. Perhaps it's derived from energy metrics, in which case noisy timestamps could result in artificially low/high figures, without invalidating the averages.
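Here is a toy Python sketch of that hypothesis. It is not how Phoronix's tooling works (I don't know that); it just assumes power is derived by differencing a cumulative energy counter, with made-up numbers: a constant 200 W load and a few hand-picked timestamp errors.

```python
TRUE_POWER_W = 200.0                         # hypothetical constant load
true_times_s = [0.0, 1.0, 2.0, 3.0, 4.0]     # when the counter was really read
energy_j = [TRUE_POWER_W * t for t in true_times_s]   # cumulative joules

# Hypothetical timestamp errors (seconds) on the interior samples.
jitter_s = [0.0, 0.3, -0.2, 0.25, 0.0]
stamps_s = [t + j for t, j in zip(true_times_s, jitter_s)]


def interval_power(energy, stamps):
    """Per-interval power: energy delta divided by the (noisy) time delta."""
    samples = zip(zip(energy, stamps), zip(energy[1:], stamps[1:]))
    return [(e1 - e0) / (t1 - t0) for (e0, t0), (e1, t1) in samples]


def average_power(energy, stamps):
    """Average power over the whole run: total energy over total time."""
    return (energy[-1] - energy[0]) / (stamps[-1] - stamps[0])


samples = interval_power(energy_j, stamps_s)
print("min:", round(min(samples), 1), "max:", round(max(samples), 1))
print("avg:", average_power(energy_j, stamps_s))
```

The per-interval values swing well below and above 200 W, while the whole-run average depends only on the first and last readings, so jitter on the samples in between cancels out.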

George³ · Jan 21, 2024

rluker5 said:
My 4980hq says hi. (134MB L3+L4 cache in 2014) Pretty sure that wasn't the first time cache was relatively increased by a lot, but it is a good enough example.

Intel says 7.25MB L1+L2+L3 and nothing for eDRAM? Maybe you have another CPU in mind?

thestryker · Jan 21, 2024

bit_user said:
The 9554 has a minimum power of 11.1 W and 25.6 W in 1P and 2P configurations, respectively. Why do you find that too far-fetched, for CPU package idle power figures?

That would make it lower than Ryzen 7000 (desktop), not to mention that the best the IO die managed alone last gen was ~50W. There's also no world where AMD's idle power consumption is lower than Intel's (this was a focus after SPR, as they had unusually high idle due to the way they did tiles). The cores themselves absolutely can get that low, as everything about the Zen 4 cores is fantastic for power consumption, but the IO die just isn't that efficient.

thestryker · Jan 21, 2024

George³ said:
Intel says 7.25MB L1+L2+L3 and nothing for eDRAM? Maybe you have another CPU in mind?

Everything with Intel Iris Pro Graphics 5200 has 128MB eDRAM which every Crystal Well part has as far as I'm aware.

bit_user · Jan 21, 2024

thestryker said:
That would make it lower than Ryzen 7000 (desktop)

Package power or system power?

thestryker said:
not to mention the best the IO die managed alone last gen was ~50W

It was made on an old process node (12 nm, IIRC).

Anyway, I also mentioned a theory about the min/avg/max numbers being derived from energy figures, which you didn't quote. Even if the min/max values are bad (and some of those max figures are really wacky!), it wouldn't invalidate the avg if they were computed by sampling the energy consumption.

Furthermore, even if we completely ignore the power figures, it doesn't take away from the overall per-core performance-parity between Genoa and Emerald Rapids. That's the real story, IMO.

thestryker · Jan 21, 2024

bit_user said:
Package power or system power?

Package. Zen 4 idles a fair bit higher than Intel due to the IO die. You can see it reflected in any single-threaded power consumption tests, as AMD will use more power than Intel despite Zen 4 being more efficient.

bit_user said:
It was made on an old process node (12 nm, IIRC).

Point being, if the IO die alone used 50W, there's no way measuring a Zen 4 Epyc at 11.1W includes the IO die.

bit_user said:
Anyway, I also mentioned a theory about the min/avg/max numbers being derived from energy figures, which you didn't quote. Even if the min/max values are bad (and some of those max figures are really wacky!), it wouldn't invalidate the avg if they were computed by sampling the energy consumption.

If their power consumption numbers don't include the IO die, then none of them are comparable to Intel's.

bit_user said:
Furthermore, even if we completely ignore the power figures, it doesn't take away from the overall per-core performance-parity between Genoa and Emerald Rapids. That's the real story, IMO.

Oh absolutely I think the Zen 4 architecture is the most impressive x86 arch since Conroe/Yonah. The performance efficiency is absolutely ridiculous in the clock ranges Epyc has and I love seeing the all core clocks so high in real world. I just wanted to point out something wasn't right with their power consumption numbers.

rluker5 · Jan 21, 2024

George³ said:
Intel says 7.25MB L1+L2+L3 and nothing for eDRAM? Maybe you have another CPU in mind?

OK you convinced me to go into my attic and fire up this old relic:

Plenty of L4 on 22nm. It also had similar gaming benefits over plain Haswell as the X3D has over its respective plain versions. When the 5800X3D came out, they even went to a lot of games already shown to favor eDRAM in their reviews. Shame that big cache went away, but I bet it would be a security vulnerability nowadays with all of the side channel stuff.

That and it really is about time I took down the artificial Christmas tree so I had to get those boxes too.

bit_user · Jan 21, 2024

thestryker said:
Package. Zen 4 idles a fair bit higher than Intel due to the IO die. You can see it reflected in any single-threaded power consumption tests, as AMD will use more power than Intel despite Zen 4 being more efficient.

Point being, if the IO die alone used 50W, there's no way measuring a Zen 4 Epyc at 11.1W includes the IO die.

You keep tossing about this 50 W figure, but it's actually from Milan (Zen 3), AFAICT. So, please either give us an idle power figure for Zen 4 or drop the point.

From a little poking around, I didn't find much, but I did find this:

Look at the Web XPRT metric. I assume that's I/O-bound (either disk or network), and therefore nearly idle. The bar for the 7950X is 7 pixels tall, in a plot area 304 pixels high. Since the Y axis is scaled to 300 W, that gives us a figure of about 7 W for a 7950X, which is doing something above idle.

thestryker said:
Oh absolutely I think the Zen 4 architecture is the most impressive x86 arch since Conroe/Yonah. The performance efficiency is absolutely ridiculous in the clock ranges Epyc has

It has demonstrably lower IPC than Golden Cove. Where it wins in a 64-core config is by its efficiency.

thestryker · Jan 21, 2024

bit_user said:
You keep tossing about this 50 W figure, but it's actually from Milan (Zen 3), AFAICT. So, please either give us an idle power figure for Zen 4 or drop the point.

Okay compare the idle from AnandTech to the testing numbers from Phoronix on the 7763 listed for these:
https://www.anandtech.com/show/16778/amd-epyc-milan-review-part-2/3
https://www.phoronix.com/review/intel-xeon-platinum-8490h/14
The only logical way to explain how the idle package power from AnandTech is so much higher than the Phoronix graphs is if they aren't including the IO die.

bit_user said:
It has demonstrably lower IPC than Golden Cove. Where it wins in a 64-core config is by its efficiency.

Yeah that's why it's impressive to me the efficiency is just so very good.

bit_user · Jan 21, 2024

thestryker said:
Okay compare the idle from AnandTech to the testing numbers from Phoronix on the 7763 listed for these:
https://www.anandtech.com/show/16778/amd-epyc-milan-review-part-2/3

Well, Phoronix says they're using RAPL power reporting. AnandTech doesn't say how they measure power. Between Zen 3 and Zen 4, it could be that the same mechanism reports different things. So, showing a difference in power for Zen 3 isn't conclusive.

Whatever the case, Phoronix reported EPYC 9554 averaging 62.4 and 179.4 W less power in 1P and 2P configurations than Xeon 8592+. Even if the EPYC numbers omit I/O power, that would still put it as comparable in 1P configurations and probably well ahead of Xeon in 2P configurations.

BTW, Genoa's I/O Die is made on TSMC N6. So, it should be a good deal more efficient than Milan's. On the flip side, it supports 12 chiplets vs. 8, 12 DDR5 channels vs. 8 DDR4 channels, and PCIe 5.0 vs. PCIe 4.0, making it a bit less obvious how their power would compare. The PCIe revision is probably the least important, for those tests, since very few of the lanes would have been in use.

JayNor · Jan 21, 2024

"...While AMD EPYC generally remains the player to beat in the enterprise CPU space,"

Any discussion of "player to beat" in the current environment would also have to take into account the per core AMX tiled matrix acceleration in the Granite Rapids chips. It isn't even mentioned in the article.

Similarly, Intel and SK hynix reportedly developed an MCR DIMM 8000MT/sec solution that will be supported in Granite Rapids. That seems potentially just as important as the caches, given the rapidly increasing parameter sizes of the LLMs.

thestryker · Jan 21, 2024

bit_user said:
BTW, Genoa's I/O Die is made on TSMC N6. So, it should be a good deal more efficient than Milan's. On the flip side, it supports 12 chiplets vs. 8, 12 DDR5 channels vs. 8 DDR4 channels, and PCIe 5.0 vs. PCIe 4.0, making it a bit less obvious how their power would compare. The PCIe revision is probably the least important, for those tests, since very few of the lanes would have been in use.

Yeah it's a lot more efficient, but also as you say has more IO and interconnect to deal with. I'm certain it's still lower than Milan, but without more comparative tests it's hard to say how much due to the efficiency differences between Zen 3 and 4. It's possible that it can deactivate (even just partially) which would be a big change.

jthill · Jan 22, 2024

So, 430MB L3 on next-gen chips Intel can't actually make for sale yet, "it's gonna be great!".

Meanwhile, AMD's been selling the 9684 since last year, with 1152MB L3, and their next-gen chips are expected this year too.

bit_user · Jan 22, 2024

jthill said:
So, 430MB L3 on next-gen chips Intel can't actually make for sale yet, "it's gonna be great!".

Meanwhile, AMD's been selling the 9684 since last year, with 1152MB L3, and their next-gen chips are expected this year too.

It's like I said: Intel's L3 cache is global, while AMD's is only shared by sets of 8 cores. It still works in AMD's favor, since it works out to 12 MiB per core, while Emerald Rapids has only 5 MiB per core. With NUMA-aware thread scheduling, the fact that AMD's L3 isn't truly global shouldn't be much of a liability.

In AMD's approach, a single core can use up to 96 MiB of L3, while Intel's allows a single core to use all 320 MiB. However, that's much more of a corner case for server workloads.

If you want to look at it on a technical level, AMD is distributing that 1152 MiB of cache across 24 dies (32 MiB per CCD + 64 MiB per 3D cache die), while Emerald Rapids has 160 MiB on each of only 2 dies. So, in some ways, Intel is the one being more aggressive.
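As a quick cross-check of those totals, here is a small Python sketch. It assumes 12 CCDs of 8 cores each for the 1152 MiB EPYC part (which is what the 24-die and 8-cores-per-L3 figures above imply) and 64 cores across 2 dies for the Xeon 8592+; the names are just for the example.

```python
# EPYC part with 3D V-Cache, per the figures above.
EPYC_CCDS = 12
EPYC_CORES_PER_CCD = 8
EPYC_CCD_L3_MIB = 32          # on-die L3 per CCD
EPYC_VCACHE_MIB = 64          # stacked cache die per CCD

# Xeon 8592+ (Emerald Rapids), per the figures above.
XEON_DIES = 2
XEON_L3_PER_DIE_MIB = 160
XEON_CORES = 64

epyc_reach_mib = EPYC_CCD_L3_MIB + EPYC_VCACHE_MIB      # L3 one core can reach
epyc_total_mib = EPYC_CCDS * epyc_reach_mib
epyc_per_core_mib = epyc_total_mib / (EPYC_CCDS * EPYC_CORES_PER_CCD)
epyc_cache_dies = EPYC_CCDS * 2                         # CCD plus its cache die

xeon_total_mib = XEON_DIES * XEON_L3_PER_DIE_MIB        # also one core's reach
xeon_per_core_mib = xeon_total_mib / XEON_CORES

print(f"EPYC: {epyc_total_mib} MiB over {epyc_cache_dies} dies, "
      f"{epyc_per_core_mib:.0f} MiB/core, {epyc_reach_mib} MiB reachable per core")
print(f"Xeon: {xeon_total_mib} MiB over {XEON_DIES} dies, "
      f"{xeon_per_core_mib:.0f} MiB/core, {xeon_total_mib} MiB reachable per core")
```

Capacity per core and capacity reachable by one core are separate numbers here, which is why the raw totals don't compare cleanly.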

-Fran- · Jan 22, 2024

Don't forget latency based on adjacency~~

Regards.
