
News Nvidia's Grace server CPU trades blows with AMD and Intel in detailed review -- Bergamo, Genoa, and Emerald Rapids outperformed in over half of the...

The Geomean shows a single 72-core Grace achieving 2175, while a single 96-core AMD EPYC 9654 achieves 2499. So, the EPYC is 14.9% faster with 33.3% more cores (and 166.7% more threads).

What's even more impressive is how well it does against the 128-core/256-thread Zen 4C-based Bergamo (EPYC 9754), which is only 13.1% faster with 77.8% more cores (255.6% more threads).

Not bad, Grace!
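For anyone who wants to check the arithmetic, here's a quick Python sketch using just the geomean scores and core counts quoted above (nothing else from the review):

```python
# Per-core comparison of Grace vs. EPYC 9654, using the geomean scores above.
chips = {
    "Grace":     {"score": 2175, "cores": 72, "threads": 72},   # no SMT
    "EPYC 9654": {"score": 2499, "cores": 96, "threads": 192},  # SMT2
}

grace, epyc = chips["Grace"], chips["EPYC 9654"]

speedup       = epyc["score"]   / grace["score"]   - 1  # how much faster the 9654 is
extra_cores   = epyc["cores"]   / grace["cores"]   - 1  # how many more cores it needs
extra_threads = epyc["threads"] / grace["threads"] - 1

print(f"{speedup:.1%} faster with {extra_cores:.1%} more cores "
      f"({extra_threads:.1%} more threads)")
# → 14.9% faster with 33.3% more cores (166.7% more threads)
```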
 
The Geomean shows a single 72-core Grace achieving 2175, while a single 96-core AMD EPYC 9654 achieves 2499. So, the EPYC is 14.9% faster with 33.3% more cores (and 166.7% more threads).

What's even more impressive is how well it does against the 128-core/256-thread Zen 4C-based Bergamo (EPYC 9754), which is only 13.1% faster with 77.8% more cores (255.6% more threads).

Not bad, Grace!
Is there a particular test that best reflects a linear relationship between score and number of cores? I wonder if there are diminishing returns.
 
Is there a particular test that best reflects a linear relationship between score and number of cores? I wonder if there are diminishing returns.
This is a good question and I'm not aware of anyone having done that analysis. I think Michael (the author at Phoronix) knows which tests tend to scale well.

An interesting distinction might also be whether they scale well to multiple, physical CPUs. That subset is easy to eyeball by just looking for tests where 2P configurations score about double (or half) of a 1P test setup. Among these, I see:
  • Algebraic Multi-Grid Benchmark 1.2
  • Xcompact3d Incompact3d 2021-03-11 - Input: input.i3d 193 Cells Per Direction
  • LULESH 2.0.3
  • Xmrig 6.18.1 - Variant: Monero - Hash Count: 1M
  • John The Ripper 2023.03.14 - Test: bcrypt
  • Primesieve 8.0 - Length: 1e13
  • Helsing 1.0-beta - Digit Range: 14 digit
  • Stress-NG 0.16.04 - Test: Matrix Math
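That eyeball check could be automated with a small helper like this (the scores below are hypothetical, just to show the check; for lower-is-better results you'd compare against half rather than double):

```python
def scales_to_2p(score_1p, score_2p, higher_is_better=True, tolerance=0.15):
    """Return True if the 2P score is within `tolerance` of ideal 2x scaling."""
    ideal = score_1p * 2 if higher_is_better else score_1p / 2
    return abs(score_2p - ideal) / ideal <= tolerance

# Hypothetical numbers, just to demonstrate:
print(scales_to_2p(100, 195))                          # True  (near-perfect scaling)
print(scales_to_2p(100, 130))                          # False (poor scaling)
print(scales_to_2p(60, 31, higher_is_better=False))    # True  (runtime nearly halves)
```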

I think categorizing the benchmarks in Phoronix Test Suite (which allegedly takes a couple months to run in its entirety!), based on things like multi-processor scalability, sensitivity to memory bottlenecks, intensity of disk I/O, etc. would be fertile ground for someone to tackle. This would also let you easily run just such a sub-category, based on what aspect you want to stress.

It would also be interesting to do some clustering analysis of the benchmarks in PTS, so that you could skip those which tend to be highly-correlated and run just the minimal subset needed to fully-characterize a system.
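As a rough sketch of that clustering idea (all names and data here are made up, assuming you had a matrix of per-benchmark scores collected across many systems): compute pairwise correlations between benchmarks and greedily drop any benchmark that's highly correlated with one you've already kept.

```python
import numpy as np

def minimal_subset(scores, names, threshold=0.95):
    """Greedily keep benchmarks that aren't highly correlated with one already kept.

    scores: (n_systems, n_benchmarks) array of results across many machines.
    """
    corr = np.corrcoef(scores, rowvar=False)  # benchmark-vs-benchmark correlation
    kept = []
    for j in range(scores.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Toy data: benchmark "B" is just a scaled copy of "A", so it gets pruned.
rng = np.random.default_rng(0)
a = rng.random(20)
c = rng.random(20)
scores = np.column_stack([a, a * 3.0, c])
print(minimal_subset(scores, ["A", "B", "C"]))  # → ['A', 'C']
```

A fancier version could use hierarchical clustering instead of a greedy pass, but the greedy version already captures the idea of skipping redundant tests.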

Something else that tends to come up is how heavily a benchmark uses vector instructions and whether it contains hand-optimized codepaths for certain architectures.

As PTS is an open source project, it might be possible to add some of these things (or, at least enough logging that some of these metrics can be computed), but I have no idea how open Michael might be for such contributions to be upstreamed.
 
The Geomean shows a single 72-core Grace achieving 2175, while a single 96-core AMD EPYC 9654 achieves 2499. So, the EPYC is 14.9% faster with 33.3% more cores (and 166.7% more threads).

What's even more impressive is how well it does against the 128-core/256-thread Zen 4C-based Bergamo (EPYC 9754), which is only 13.1% faster with 77.8% more cores (255.6% more threads).

Not bad, Grace!
It's still slower than the 9554 which has 64c.
 
It's still slower than the 9554 which has 64c.
Yes, the GeoMean is indeed lower (which is what I was talking about).

There are some outliers, though (in each direction, to be fair), but things like Xmrig (Monero) probably have x86 optimizations without ARM equivalents. John The Ripper (bcrypt) and ACES DGEMM seem like other cases where a few well-placed optimizations could tip the balance more in ARM's favor.

It would be nice if Phoronix could exclude benchmarks that have x86-specific optimizations but no ARM-optimized paths. However, given some comments in that article, it's clear this was not done.

That said, I think the V2 still does better on energy efficiency than Zen 4, according to the prior article. Ultimately, Grace is just there as a support chip for Hopper, so it doesn't need the best per-core performance to serve that role - efficiency is more important.

Also, Grace can scale up to at least 32 CPUs per server. As far as I know, the furthest x86 scales is at the top of the Xeon range, where you can build 8P systems at most; EPYC only scales to 2P.
 
This is a good question and I'm not aware of anyone having done that analysis. I think Michael (the author at Phoronix) knows which tests tend to scale well.

An interesting distinction might also be whether they scale well to multiple, physical CPUs. That subset is easy to eyeball by just looking for tests where 2P configurations score about double (or half) of a 1P test setup. Among these, I see:
  • Algebraic Multi-Grid Benchmark 1.2
  • Xcompact3d Incompact3d 2021-03-11 - Input: input.i3d 193 Cells Per Direction
  • LULESH 2.0.3
  • Xmrig 6.18.1 - Variant: Monero - Hash Count: 1M
  • John The Ripper 2023.03.14 - Test: bcrypt
  • Primesieve 8.0 - Length: 1e13
  • Helsing 1.0-beta - Digit Range: 14 digit
  • Stress-NG 0.16.04 - Test: Matrix Math

I think categorizing the benchmarks in Phoronix Test Suite (which allegedly takes a couple months to run in its entirety!), based on things like multi-processor scalability, sensitivity to memory bottlenecks, intensity of disk I/O, etc. would be fertile ground for someone to tackle. This would also let you easily run just such a sub-category, based on what aspect you want to stress.

It would also be interesting to do some clustering analysis of the benchmarks in PTS, so that you could skip those which tend to be highly-correlated and run just the minimal subset needed to fully-characterize a system.

Something else that tends to come up is how heavily a benchmark uses vector instructions and whether it contains hand-optimized codepaths for certain architectures.

As PTS is an open source project, it might be possible to add some of these things (or, at least enough logging that some of these metrics can be computed), but I have no idea how open Michael might be for such contributions to be upstreamed.
He definitely knows which tests tend to scale well to many cores or even multiple sockets because he often makes those annotations in his reviews. He also knows which tests are more bandwidth dependent etc. He literally benchmarks hardware for a living. He’s good at it, too.
 
He definitely knows which tests tend to scale well to many cores or even multiple sockets because he often makes those annotations in his reviews. He also knows which tests are more bandwidth dependent etc.
I know he has a sense of these things, but that doesn't mean his knowledge is complete.

He literally benchmarks hardware for a living. He’s good at it, too.
He isn't transparent about how he chooses which benchmarks to use for which hardware reviews - certainly one way the results could be biased. Having classifications and categories of benchmarks would provide greater transparency, better alignment between different benchmark runs (as well as with and among user submissions), and reduce suspicions of influencing the results.
 
Yes, the GeoMean is indeed lower (which is what I was talking about).

There are some outliers, though (in each direction, to be fair), but things like Xmrig (Monero) probably have x86 optimizations without ARM equivalents. John The Ripper (bcrypt) and ACES DGEMM seem like other cases where a few well-placed optimizations could tip the balance more in ARM's favor.

It would be nice if Phoronix could exclude benchmarks that have x86-specific optimizations but no ARM-optimized paths. However, given some comments in that article, it's clear this was not done.

That said, I think the V2 still does better on energy efficiency than Zen 4, according to the prior article. Ultimately, Grace is just there as a support chip for Hopper, so it doesn't need the best per-core performance to serve that role - efficiency is more important.

Also, Grace can scale up to at least 32 CPUs per server. As far as I know, the furthest x86 scales is at the top of the Xeon range, where you can build 8P systems at most; EPYC only scales to 2P.
Most software is optimized and designed for x86, so the benchmark is realistic.
 
Most software is optimized and designed for x86, so the benchmark is realistic.
This claim is starting to get pretty tired.

Ever since the smartphone revolution got going, lots of common software libraries & tools have been getting optimized for ARM. Then we started to see companies like Ampere, Applied Micro, Cavium, and even Qualcomm (i.e. their aborted Centriq product) develop ARM-based server CPUs, with optimizations of server software for ARM beginning to trickle in. That trickle turned into a stream when ARM launched its Neoverse initiative, back in late 2018.

The next big milestone was probably in March 2020, when Amazon rolled out Graviton 2 - with very competitive pricing. This fed in even more ARM optimizations.

In the embedded & self-driving space, Nvidia (along with most others, I think) has used ARM in its self-driving SoCs.

Even HPC felt ARM's presence, when Fugaku topped the charts, using Fujitsu's custom A64FX ARM core and no GPUs!

Finally, Windows/ARM has been brewing for probably the better part of the past decade, which should be focusing more attention on ARM optimizations of client-specific packages.

By now, we're many years into the ARM revolution. There are still some packages yet to receive ARM optimizations, but they're probably in the minority of the ones commonly in use.

I think your claim is past its sell-by date. x86 stalwarts really need to come up with a new rallying cry, or risk getting laughed off the field.
 
This claim is starting to get pretty tired.
It would be nice if Phoronix would be able to exclude benchmarks with x86-specific optimizations that don't also have ARM-optimized paths. However, given some comments in that article, it's clear this was not done.
Good thing you don't bring it up then...
You can't have it both ways: sure, everything that can be optimized for ARM has been, but there are still plenty of things that can't be, which is why you wanted Phoronix to exclude them.
 
Good thing you don't bring it up then...
You can't have it both ways,
I mentioned a few cases, whereas the post I replied to said "Most software is optimized and designed for x86". That's a big difference.

Furthermore, Phoronix sometimes tests fairly obscure apps. The ones most likely to have ARM optimizations are the ones most commonly used on ARM. It therefore isn't a contradiction to say that the majority of packages in common use are reasonably-well ARM-optimized, while also raising the spectre that some of the dozen or so packages included in that GeoMean might have x86-optimized paths but not ARM-optimized ones.

Finally, the post I replied to concluded that, because most software is x86-optimized, Phoronix's inclusion of x86-optimized apps made the benchmark "realistic". However, if someone is only using Grace to run deep learning or cloud hosting software, all of those packages will have been well-optimized for it, and it's therefore not realistic to include some x86-optimized ones in the suite.
 
I mentioned a few cases, whereas the post I replied to said "Most software is optimized and designed for x86". That's a big difference.
For the end user, most software is. You are not a normal end user - you are a coder or whatever, so you need server-oriented apps, and those are optimized for ARM to a high degree; consumer apps aren't.
I agree with you: this is a server CPU, and Phoronix is a server-oriented website. I'm just saying that you also have to understand normal users.
 
For the end user most software is, you are not a normal end user,
It sounds like you agree that Nvidia's Grace isn't designed for running normal client workloads. It's not even intended to be an ordinary server CPU. Hence, I maintain that the state of desktop software on ARM is irrelevant to this particular platform. This is a machine for AI and HPC, period.

you have to also understand normal users.
For a different kind of system? Yes. If this were an ARM-based laptop review, then it would be a fair point.
 
