News Nvidia Claims Grace CPU is 2X Faster Than AMD Genoa, Intel Sapphire Rapids at Same Power

Correct me if I'm wrong, but doesn't the Graph Analytics (GAP BFS) benchmark already take power consumption into account? So if this new datacenter chip is more efficient than the AMD part but performs the same, I'd expect the Graph Analytics result to show Nvidia with a significant lead.

Don't get me wrong, the Nvidia part is impressive. It has a little less than 4/5 the core count of the AMD chip and is on par in performance, so this is an impressive part.
 
  • Like
Reactions: gg83
The next-generation Arm "Neoverse"

Enough with this Verse madness!

Everybody is running around with this crazy name, thinking it's cool or supposed to impress simple minds: the Metaverse, the Multiverse, and now the Neoverse... Gah!
 
  • Like
Reactions: gg83
The claim is almost moot, unless/until Nvidia sells Grace-based machines for general-purpose computing, which I haven't heard them announce. Grace seems designed primarily (if not exclusively) to feed & harness their H100-class GPUs.

At best, the claim portends good things for Graviton 4, or whatever cloud processor comes along that uses Neoverse V2 cores and can be used by the general public for standard computing tasks.
 
  • Like
Reactions: gg83
I'll wait till Phoronix, Level1Techs, ServeTheHome, or Chips and Cheese gets ahold of these and benchmarks them.

Until then, I'm not holding my breath for any of their claims.
I expect someone will manage to run independent benchmarks on them, but I doubt Nvidia will be sending around review samples - or even giving reviewers free access to their cloud platform. These superchip modules are probably only usable in a DGX system (or whatever the current incarnation is called) and I'd guess the entry price on one is a lot more than the hardware budget of anyone out there doing half-decent server reviews.

Most likely, someone will rent some time on a cloud instance of one, and that's where they'll run the benchmarks. Because the machine will probably also house highly sought-after H100 chips, time on it will be scarce and expensive.

The obvious drawback, with a cloud instance, is limited (if any) visibility into power dissipation - especially at the system level or anything that would let us estimate this "5 MW Datacenter" metric.
 
For what it's worth, the Arm Neoverse V2 is based on the Cortex-X3, which is actually the newest Arm P-core uArch.

This acceleration means that the V2 provides a two-generation microarchitecture jump over the V1, resulting in a big boost in single-thread performance that could put it neck-and-neck with the leading x86 server CPUs.

Perf / clock, the Cortex-X3 exceeds Golden Cove & Zen4, so it can probably be clocked a good bit lower than SPR & Genoa, saving power. Notably, the Cortex-X3 figures below are with 1MB of L2, while Neoverse V2 is upgraded to 2MB of L2.

GB 6.1 1T pts per GHz:
Arm Cortex-X3: ~559 (S23U)
Intel Golden Cove: ~485 (i9-12900KS)
AMD Zen4: ~510 (7950X)
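
(The per-GHz figure is just the 1T score divided by the peak 1T clock. A quick sketch in Python, using approximate scores and boost clocks purely as illustrative placeholders:)

```python
# Per-GHz normalization: 1T score divided by peak 1T clock.
# The scores and clocks below are approximate placeholders, just to show the arithmetic.
def points_per_ghz(score_1t: float, peak_ghz: float) -> float:
    return score_1t / peak_ghz

examples = {
    "Arm Cortex-X3 (S23U)":        (1880, 3.36),
    "Intel Golden Cove (12900KS)": (2670, 5.5),
    "AMD Zen4 (7950X)":            (2910, 5.7),
}
for name, (score, ghz) in examples.items():
    print(f"{name}: ~{points_per_ghz(score, ghz):.0f} pts/GHz")
```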

As NUVIA shared, datacenter CPUs are basically at smartphone-level power usage per core. At 2.8 GHz, one whole Arm Neoverse V2 core + its L2 cache uses 1.4 W:
[Image: Arm Neoverse V2 per-core power slide from Arm's Hot Chips 35 presentation]


Arm discussed the V2 in more detail at Hot Chips last week.

Will wait for actual benchmarks, though, before passing judgement for / against the V2.

EDIT: V2, not N2, egad. Thank you, @bit_user
 
  • Like
Reactions: bit_user
For what it's worth, the Arm Neoverse N2 is based on the Cortex-X3, which is actually the newest Arm P-core uArch.
Typo: your link says V2, but you wrote N2. The N-series are mid-sized cores and based on the Cortex-A7xx series mobile cores. The V-series are the big cores, and based on the Cortex-X series mobile cores.

As for the statement, it's correct in the sense that X3 is the latest you can find in the wild, but ARM already announced the X4:

Some notable things about the X4 are that:
  • ARM claims it's the widest core, period. It features 11-wide decode!
  • It's exclusively 64-bit.
  • They eliminated the op cache, which probably made sense after they simplified its decoder by removing ARMv7 backwards compatibility.

Perf / clock, the Cortex-X3 exceeds Golden Cove & Zen4, so it can probably be clocked a good bit lower than SPR & Genoa, saving power. Notably, the Cortex-X3 figures below are with 1MB of L2, while Neoverse V2 is upgraded to 2MB of L2.

GB 6.1 1T pts per GHz:
Arm Cortex-X3: ~559 (S23U)
Intel Golden Cove: ~485 (i9-12900KS)
AMD Zen4: ~510 (7950X)
Of course, it matters what clockspeed you use to measure this. Ideally, you would run all processors at the same clockspeed, since IPC suffers as you increase frequency.
 
  • Like
Reactions: ikjadoon
Typo: your link says V2, but you wrote N2. The N-series are mid-sized cores and based on the Cortex-A7xx series mobile cores. The V-series are the big cores, and based on the Cortex-X series mobile cores.

As for the statement, it's correct in the sense that X3 is the latest you can find in the wild, but ARM already announced the X4:

Some notable things about the X4 are that:
  • ARM claims it's the widest core, period. It features 11-wide decode!
  • It's exclusively 64-bit.
  • They eliminated the op cache, which probably made sense after they simplified its decoder by removing ARMv7 backwards compatibility.


Of course, it matters what clockspeed you use to measure this. Ideally, you would run all processors at the same clockspeed, since IPC suffers as you increase frequency.

Ah, thank you. I wrote N2 more than once, but did not catch all the typos.

//

That is true about X4! I was initially thinking only of already-launched uArches. Arm has revealed the upgraded X4, which should ship late this year / early 2024.

But, yes: the X4 seems quite promising.

11-wide decode: I had not seen that, actually. From Hwcooling.net's notes, I believe X4 is now at 8x decoders (a notable +33% increase over X3). Relatedly, dispatch was widened to 10 instructions.

But ARM is also betting on core widening in other pipeline stages, especially in the frontend. The processor has eight parallel instruction decoders, which is also a record number for ARM’s cores (Apple also has eight decoders). These decoders can deliver eight instructions per cycle to the following processing stages. Dispatch then supports up to 10 micro-ops per cycle (not all instructions are decoded to one micro-op).

A78: 4-wide decode
X1: 5-wide decode
X2: 5-wide decode
X3: 6-wide decode
X4: 8-wide decode

Yes on 64-bit-only. Even the X2 was 64-bit-only in execution, but it seems to have taken a few more generations to fully exploit that. Removing the uOp cache entirely may well be one of those benefits, arriving two gens later.

That is wild; five generations of uOp caches and Arm now concludes "Don't need it; we're just going to go super-wide on decoding." They seem to have the simulations to back it up, though, so we'll see.

//

Now, I should correct myself: the V2 was designed with up to 2MB of L2$, but NVIDIA's Grace CPU actually ships with just 1MB of L2$ per core. The same applies to the next-generation X4: Arm says the X4 now supports up to 2MB, so the ceiling was raised, but we'll need to see who, if anyone, actually adds the extra SRAM.

Technically, Arm allows the X4 to ship with anywhere from 512KB to 2MB of L2.

//

Of course, it matters what clockspeed you use to measure this. Ideally, you would run all processors at the same clockspeed, since IPC suffers as you increase frequency.

Yes: ideally, the same clocks. But I'm curious: I've not seen much real degradation in IPC from higher frequency (except perhaps in extreme overclocking). Are there modern examples of say 3.5 GHz vs 5.5 GHz where IPC falls off notably?

My quick check at AnandTech's Bench & GB5 1T:

i7-12700K: 1897 @ 5 GHz
i3-12300: 1681 @ 4.4 GHz

So 13.6% increase in frequency, 12.8% increase in performance (so IPC seems to have been roughly maintained).
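
(For anyone who wants to check the arithmetic, here it is, computed from those two results:)

```python
# Frequency gain vs. performance gain, from the two GB5 1T results above.
gb5 = {"i7-12700K": (1897, 5.0), "i3-12300": (1681, 4.4)}  # (score, peak 1T GHz)

(hi_s, hi_f), (lo_s, lo_f) = gb5["i7-12700K"], gb5["i3-12300"]

freq_gain = hi_f / lo_f - 1                   # ~+13.6%
perf_gain = hi_s / lo_s - 1                   # ~+12.8%
ipc_ratio = (hi_s / hi_f) / (lo_s / lo_f)     # ~0.99, i.e. per-clock performance roughly maintained

print(f"freq: +{freq_gain:.1%}, perf: +{perf_gain:.1%}, per-clock ratio: {ipc_ratio:.2%}")
```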
 
  • Like
Reactions: bit_user
11-wide decode: I had not seen that, actually. From Hwcooling.net's notes, I believe X4 is now at 8x decoders (a notable +33% increase over X3). Relatedly, dispatch was widened to 10 instructions.
I don't know where I saw the "11-wide" figure. I can't find it, now. Maybe I confused it with something else.

10-wide dispatch is interesting. Given that it no longer has a uop cache, I wonder how they manage 10-wide dispatch with an 8-wide decoder. Then again, neither AnandTech nor WikiChip actually says how wide the decoder is.

"Arm says that the front-end of the Cortex-X4 has seen some significant changes. The instruction fetch delivery has been completely redesigned. As with the Cortex-A715, it seems that the Cortex-X followed suit and also dropped the macro-operations cache entirely. Instead, the Cortex-X4 widened the pipeline to support up to 10 instructions."


Are there modern examples of say 3.5 GHz vs 5.5 GHz where IPC falls off notably?
Good question. I was sort of wondering, myself, just what the fall-off curve looked like for IPC vs. GHz. I don't recall seeing anything on this, recently. I'll update this thread if I find anything.

My quick check at AnandTech's Bench & GB5 1T:

i7-12700K: 1897 @ 5 GHz
i3-12300: 1681 @ 4.4 GHz

So 13.6% increase in frequency, 12.8% increase in performance (so IPC seems to have been roughly maintained).
Nice!
 
  • Like
Reactions: ikjadoon
i7-12700K: 1897 @ 5 GHz
i3-12300: 1681 @ 4.4 GHz

So 13.6% increase in frequency, 12.8% increase in performance (so IPC seems to have been roughly maintained).
BTW, that's not quite an apples-to-apples comparison, because the i3 uses a different die with a smaller ring bus. Furthermore, when the big (hybrid) Alder Lake die has the E-cores enabled, the ring bus operates at a lower frequency. So, you'd want these measurements to be taken with the E-cores actually disabled, rather than simply unused.

However, both of those factors should hamper frequency scaling. If controlled, you'd expect it to look even better than your quoted numbers. So, I guess yours is a good lower-bound (on single-threaded workloads, at least).

Ideally, the benchmarks would be run on the same CPU, changing only the frequency limits. So, I've done exactly that, using the data from this article: https://chipsandcheese.com/2022/01/28/alder-lakes-power-efficiency-a-complicated-picture/

[Charts: per-clock throughput vs. frequency for Golden Cove (P-cores) and Gracemont (E-cores), in 7-Zip and x264]


Two different microarchitectures; two different workloads. Both tell a pretty good story about IPC scaling. Note that the data is collected from 4x cores of each type, rather than single core. I presume it's 1 thread per core, but I don't know that for certain.

If you compare the worst / best, Golden Cove has only 77.9% as high IPC on 7zip and 87.5% as high on x264. The same comparison puts Gracemont's worst at 84.0% of its best, on 7zip, and 92.0% on x264. What surprises me most is that we don't see a distinct drop-off, at the top of the scale. It basically follows the same gradual slope for most of the plot. I wish they'd used an i9-12900KS, so we could see if that started to become pronounced at the extremes.
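
For clarity on how I computed those worst/best figures: per-clock throughput at each measured frequency point, then the ratio of the lowest value to the highest. A minimal sketch (the arrays here are made-up placeholders, not the values scraped from the plots):

```python
# IPC proxy = throughput / frequency at each measured point; "worst/best" = min(IPC) / max(IPC).
# Placeholder data, not the actual Chips & Cheese measurements.
freqs_ghz  = [1.6, 2.4, 3.2, 4.0, 4.7]
throughput = [40.0, 58.0, 75.0, 89.0, 100.0]   # e.g. 7-Zip MIPS at each frequency point

ipc = [t / f for f, t in zip(freqs_ghz, throughput)]
print(f"worst/best per-clock throughput: {min(ipc) / max(ipc):.1%}")
```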
 
  • Like
Reactions: ikjadoon
BTW, that's not quite an apples-to-apples comparison, because the i3 uses a different die with a smaller ring bus. Furthermore, when the big (hybrid) Alder Lake die has the E-cores enabled, the ring bus operates at a lower frequency. So, you'd want these measurements to be taken with the E-cores actually disabled, rather than simply unused.

However, both of those factors should hamper frequency scaling. If controlled, you'd expect it to look even better than your quoted numbers. So, I guess yours is a good lower-bound (on single-threaded workloads, at least).

Ideally, the benchmarks would be run on the same CPU, changing only the frequency limits. So, I've done exactly that, using the data from this article: https://chipsandcheese.com/2022/01/28/alder-lakes-power-efficiency-a-complicated-picture/

Two different microarchitectures; two different workloads. Both tell a pretty good story about IPC scaling. Note that the data is collected from 4x cores of each type, rather than single core. I presume it's 1 thread per core, but I don't know that for certain.

If you compare the worst / best, Golden Cove has only 77.9% as high IPC on 7zip and 87.5% as high on x264. The same comparison puts Gracemont's worst at 84.0% of its best, on 7zip, and 92.0% on x264. What surprises me most is that we don't see a distinct drop-off, at the top of the scale. It basically follows the same gradual slope for most of the plot. I wish they'd used an i9-12900KS, so we could see if that started to become pronounced at the extremes.

RE the i3 vs. i7 comparison: that's a very valid critique of my admittedly quick check.

RE the ring clock: in my mind, if both tests ran with the lower 3.6 GHz ring clock, it would seemingly still be a valid comparison, no, particularly as we'd be testing 1T performance? Or, apologies, perhaps it's nT IPC we're examining here.

//

On the great charts you've shared: that is still a notable drop and that makes these ultra-high frequencies even less appealing. Thank you greatly for sharing these. Maybe the rumored 14th Gen Raptor Lake refresh will really show whether that 6+ GHz freq is worth the gradual loss of IPC.

I'm actually at the opposite end of the spectrum: I did not expect the drop to be that large! If the trend holds (and this is only up to ~5 GHz), we may see even larger drops, as 14th Gen Intel is seemingly boosting well past 5 GHz.

This is very interesting; I appreciate your serious search. Is that data available on that URL? I didn't spot clock frequency vs performance data by Chips & Cheese. I thought I'd kept a close eye on their work, but maybe I missed this, haha.

I believe it's testing 4 cores, so there may be some cross-core contention that also becomes a bottleneck as frequency increases? I typically think of "IPC" as a 1T metric, but as an nT metric it also tracks: as we significantly increase clocks, something in the uArch design creates a bottleneck where the frequency goes up but the performance doesn't.
 
  • Like
Reactions: bit_user
On the great charts you've shared: that is still a notable drop and that makes these ultra-high frequencies even less appealing. Thank you greatly for sharing these. Maybe the rumored 14th Gen Raptor Lake refresh will really show whether that 6+ GHz freq is worth the gradual loss of IPC.
Most of the credit goes to the Chips & Cheese article for actually collecting the data. Too bad they didn't publish their raw measurements - I had to laboriously estimate each datapoint from their plot bitmaps, then compute the transforms from the image space of the plots to their numerical equivalents. A lot of the remaining credit goes to Excel.
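
The image-space-to-data-space step is nothing exotic, roughly one linear map per axis, anchored on two known axis ticks; something like this sketch (the pixel coordinates and tick values here are placeholders, not the ones from their charts):

```python
# Map a pixel coordinate to a data value, given two reference ticks on that axis.
def make_axis_map(px_a: float, val_a: float, px_b: float, val_b: float):
    scale = (val_b - val_a) / (px_b - px_a)
    return lambda px: val_a + (px - px_a) * scale

# Placeholder example: x-axis ticks for 1.0 and 5.0 GHz sit at pixels 62 and 742;
# y-axis ticks for 0 and 100 throughput units sit at pixels 480 and 40 (y grows downward).
px_to_ghz  = make_axis_map(62, 1.0, 742, 5.0)
px_to_perf = make_axis_map(480, 0.0, 40, 100.0)

# A data point eyeballed at pixel (402, 260) comes back out in plot units:
print(px_to_ghz(402), px_to_perf(260))   # -> 3.0 (GHz), 50.0
```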

This is very interesting; I appreciate your serious search. Is that data available on that URL? I didn't spot clock frequency vs performance data by Chips & Cheese. I thought I'd kept a close eye on their work, but maybe I missed this, haha.
Well... I took a bit of a leap. I noticed they seemed to collect 4 data points from each test run they did, on each cluster of cores. I made this leap simply because the number of data points from the different plots matched per-core type (always 14 points for E-cores, 15 for P-cores). So, what I did was extract the data from their plots and combine it into a joint table, for each class of cores.

Once I had recovered the raw values (which I validated by reproducing their plots and visually inspecting them to ensure they looked almost identical), I could then plot the aspects which most interested me.

IMO, the most interesting thing I did with it was to import the data from a Python script I wrote to compute the optimal combination of clock speeds for each power level. I then (naively) combined the throughput from these clockspeed combinations to demonstrate how Alder Lake's E-cores enabled it to deliver more performance at any power level! For more on that, look here:


Yes, I had better things to spend my time on, but I like a good challenge and was genuinely curious about the outcome.
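
To give a flavor of the clockspeed-combination idea (this is a simplified sketch with placeholder numbers, not my actual script): for each power budget, pick the P-cluster / E-cluster clock pair with the highest combined throughput that still fits the budget.

```python
from itertools import product

# Placeholder per-cluster measurements: frequency_GHz -> (power_W, throughput).
p_cluster = {3.0: (18.0, 60.0), 3.6: (26.0, 70.0), 4.2: (38.0, 78.0), 4.7: (55.0, 84.0)}
e_cluster = {2.0: ( 6.0, 30.0), 2.6: ( 9.0, 37.0), 3.2: (14.0, 43.0), 3.8: (22.0, 48.0)}

def best_combo(power_budget_w: float):
    """Return (combined throughput, P clock, E clock) for the best mix under the budget."""
    feasible = [
        (p_t + e_t, p_f, e_f)
        for (p_f, (p_w, p_t)), (e_f, (e_w, e_t)) in product(p_cluster.items(), e_cluster.items())
        if p_w + e_w <= power_budget_w
    ]
    return max(feasible, default=None)

for budget in (35, 65, 95):
    print(f"{budget} W -> {best_combo(budget)}")
```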

I believe it's testing 4 cores, so there may be some cross-core contention that also becomes a bottleneck as frequency increases? I typically think of "IPC" as a 1T metric
Yes! I used the data I had. Sadly, the only Alder Lake I have access to is a 65W non-K variant. So, I would be even further restricted on clockspeeds and boost durations, if I tried to reproduce their experiment.

My hope was simply that neither 4 E-cores nor 4 P-cores would be bottlenecked too badly, in that 16-core chip. However, I'm well aware that the 4 E-cores were sharing a cluster (I believe Alder Lake's E-cores can only be disabled on a per-cluster basis), and all E-cores in a cluster share the same slice of L2 cache. So, that actually makes the E-cores look worse than if we had the same data for 1T on each class of core.

Thanks for reminding me of that. That means, if anything, the data is more pessimistic about E-core performance than P-core performance.

as we significantly increase clocks, something in the uArch design creates a bottleneck where the frequency goes up but the performance doesn't.
I can think of many plausible reasons for that. One being: as you increase clock speeds, cache fetches & memory latencies take more cycles, because the number of nanoseconds for those things doesn't change but you've got more cycles per nanosecond. So, that stresses the out-of-order buffers' ability to find useful work to do, while the thread is waiting for its data.

I'm sure there are other reasons, but that's one where I think we can probably say these buffer sizes are targeted at a given clockspeed. Running above that target should naturally result in some degree of drop-off. Of course, the effect is going to be workload-dependent - a memory-heavy workload will fill up the request queues, pushing absolute latencies higher.
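
(To put the first point in rough numbers, assuming an illustrative, fixed ~80 ns round trip to DRAM:)

```python
# A fixed-nanosecond memory round trip costs more core cycles as the clock rises.
dram_latency_ns = 80.0   # illustrative assumption, not a measured figure
for ghz in (3.5, 4.5, 5.5):
    print(f"{ghz} GHz: ~{dram_latency_ns * ghz:.0f} cycles stalled per miss")
```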

Anyway, thank you so much for taking the time to read what I posted and think about it. That's not something I take for granted, around here!
; )

I appreciate your keen insight.
 
  • Like
Reactions: ikjadoon
Most of the credit goes to the Chips & Cheese article for actually collecting the data. Too bad they didn't publish their raw measurements - I had to laboriously estimate each datapoint from their plot bitmaps, then compute the transforms from the image space of the plots to their numerical equivalents. A lot of the remaining credit goes to Excel.


Well... I took a bit of a leap. I noticed they seemed to collect 4 data points from each test run they did, on each cluster of cores. I made this leap simply because the number of data points from the different plots matched per-core type (always 14 points for E-cores, 15 for P-cores). So, what I did was extract the data from their plots and combine it into a joint table, for each class of cores.

Once I had recovered the raw values (which I validated by reproducing their plots and visually inspecting them to ensure they looked almost identical), I could then plot the aspects which most interested me.

IMO, the most interesting thing I did with it was to import the data from a Python script I wrote to compute the optimal combination of clock speeds for each power level. I then (naively) combined the throughput from these clockspeed combinations to demonstrate how Alder Lake's E-cores enabled it to deliver more performance at any power level! For more on that, look here:

Yes, I had better things to spend my time on, but I like a good challenge and was genuinely curious about the outcome.


Yes! I used the data I had. Sadly, the only Alder Lake I have access to is a 65W non-K variant. So, I would be even further restricted on clockspeeds and boost durations, if I tried to reproduce their experiment.

My hope was simply that neither 4 E-cores nor 4 P-cores would be bottlenecked too badly, in that 16-core chip. However, I'm well aware that the 4 E-cores were sharing a cluster (I believe Alder Lake's E-cores can only be disabled on a per-cluster basis), and all E-cores in a cluster share the same slice of L2 cache. So, that actually makes the E-cores look worse than if we had the same data for 1T on each class of core.

Thanks for reminding me of that. That means, if anything, the data is more pessimistic about E-core performance than P-core performance.


I can think of many plausible reasons for that. One being: as you increase clock speeds, cache fetches & memory latencies take more cycles, because the number of nanoseconds for those things doesn't change but you've got more cycles per nanosecond. So, that stresses the out-of-order buffers' ability to find useful work to do, while the thread is waiting for its data.

I'm sure there are other reasons, but that's one where I think we can probably say these buffer sizes are targeted at a given clockspeed. Running above that target should naturally result in some degree of drop-off. Of course, the effect is going to be workload-dependent - a memory-heavy workload will fill up the request queues, pushing absolute latencies higher.

Anyway, thank you so much for taking the time to read what I posted and think about it. That's not something I take for granted, around here!
; )

I appreciate your keen insight.

My apologies for my delayed reply. I wanted to have the full time to dedicate a response to this great work.

You have serious dedication to the data. I've sometimes drawn a few lines in Paint.NET to estimate bitmap charts, but this is another level.
  1. The 10P vs 12P vs 8P+8E chart is fantastic to see. As the core count increases, the perf / W increases, but it matters which type of cores you're adding: P-cores alone leave a lot of performance on the table at every power level. That, I did not expect; even at just 35W, there is a +20% perf boost with E-cores.
  2. It's good to see the 7-Zip vs x264 comparison; 7-Zip can be more memory intensive, so seeing the pattern hold across both workloads makes the picture clearer.
  3. This commitment to the data is what makes discussions actually interesting. Otherwise, it's just old ideas getting thrown around without any meaningful movement. I wish Chips & Cheese had done a similar analysis themselves; that "just use P-cores" claim is so popular, argh.
  4. That's quite fair. I do have a 125W Alder Lake, but I've not had a chance to tinker yet and am limited to software power measurements; still, it would be quite interesting to see where IPC begins to drop off. I'm thinking how I can test just one Gracemont core: I would assume by turning off the P-cores. This is already piquing my interest.
  5. On the theory: I think I see the idea. The RAM & cache are too slow to keep up with the enhanced fetch + execute performance of the CPU itself. I do remember seeing some data (on older Intel gens) on CPU OC'ing where, at higher overclocks, you'd be leaving a lot of perf on the table if you didn't also increase the uncore frequency.
Completely understand. I mean, this is the type of data & discussion that is actually interesting. We always get day 1 reviews and then maybe a few niche tests, but as microarchitectures are lasting longer & longer, there is so much left untold about these uArches.

This is one of the most fascinating and interesting discussions I've had on CPUs in a very long time. Thank you, @bit_user, for your insight, testing, and actual data to understand how these cores perform.
 
  • Like
Reactions: bit_user
My apologies for my delayed reply. I wanted to have the full time to dedicate a response to this great work.
Oh, don't apologize. This is just-for-fun stuff. Lower priority than nearly anything else.

You have serious dedication to the data. I've sometimes drawn a few lines in Paint.NET to estimate bitmap charts, but this is another level.
I had already been combing over those charts, squinting and trying to make out the data points, in my debates with Terry. So, I'd reached a point where I'm like "okay, let's scrape this stuff properly".

  1. The 10P vs 12P vs 8P+8E chart is fantastic to see. As the core count increases, the perf / W increases, but it matters which type of cores you're adding: P-cores alone leave a lot of performance on the table at every power level. That, I did not expect; even at just 35W, there is a +20% perf boost with E-cores.
Keep in mind that this is all extrapolation. What would be better is if I had a way to repeat their measurements while changing additional parameters. We all know these things don't scale linearly, and that could play into certain spec decisions as well. The "all P-core" extrapolations would be better quality if I had started with 8P+0E, instead of extrapolating from 4P+0E. Also, the 8P+8E should ideally be just a direct measurement.

2. It's good to see the 7-Zip vs x264 comparison; 7-Zip can be more memory intensive, so seeing the pattern hold across both workloads makes the picture clearer.
Don't forget AVX2, which I believe x264 heavily utilizes. I think that's a large part of why the E-cores perform less well. Ideally, I'd validate this with a set of comparisons using x264 rebuilt from its portable C implementation, instead of the optimized asm path containing the ISA-specific vector instructions.

Your earlier point about using data from quad-core clusters is also relevant, since the E-cores could be hurting from L2 pressure. It'd be interesting to do a scaling analysis on x264, in order to get a sense of such details.

3. This commitment to the data is what makes discussions actually interesting. Otherwise, it's just old ideas getting thrown around without any meaningful movement. I wish Chips & Cheese had done a similar analysis themselves; that "just use P-cores" claim is so popular, argh.
I try not to spend more time typing words than it would take to gather the relevant data. Before, it was uncommon to have access to this level of data and analysis, but outlets like C&C and SemiAnalysis are changing the game. I couldn't do what little I have, without them taking the first step.

I do wish C&C had spent more time with their own data, in that article. Thankfully, they gave me enough material to play with and arrive at my own conclusions.

I'm thinking how I can test just one Gracemont core: I would assume by turning off the P-cores. This is already piquing my interest.
According to Anandtech's original Alder Lake review, you cannot disable the P-cores (or not all of them, at least). The way they performed their core vs. power scaling was to use affinity. On Linux, I think you can do this with cpuset and numactl.
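
Besides cpuset / numactl, you can also set the affinity straight from Python on Linux. A minimal sketch (the logical CPU IDs here are assumptions; check lscpu or /proc/cpuinfo to see which ones actually map to an E-core cluster on your part):

```python
import os
import subprocess

# Assumed logical CPU IDs for one E-core cluster; verify with lscpu before trusting them.
E_CORE_CPUS = {16, 17, 18, 19}

os.sched_setaffinity(0, E_CORE_CPUS)      # pin this process (children inherit the mask)
subprocess.run(["7z", "b"], check=True)   # e.g. run the 7-Zip benchmark on just those cores
```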

I know Windows has similar tools - I've seen people mention them - but I don't pay attention to it, since all of my real focus and technical activities are on Linux.

This is one of the most fascinating and interesting discussions I've had on CPUs in a very long time. Thank you, @bit_user, for your insight, testing, and actual data to understand how these cores perform.
Thanks, again, for asking some good questions and your own contributions.

I don't even consider myself a real geek on this stuff. I'd just been debating Intel's E-cores for so long that I really wanted to make a purely numerical argument for them.

I've visited the RealWorldTech forums, several times. Never posted there, but I think you can find some good discussions there. It's weird that the main site is basically on life support, yet the forums appear to remain quite vibrant.

Also, I know Chips & Cheese has a Discord channel, access to which I think requires a base-level subscription (a couple $/month ought to do it).

I've never plumbed Reddit for this stuff. It's entirely possible you can find some good discussions on there. There are lots of CPU architecture geeks, students, and professionals (current and former) out there. You've got to figure they congregate and "talk shop" somewhere.
 
  • Like
Reactions: ikjadoon