News Nvidia Grace Superchip loses to Intel Sapphire Rapids in HPC performance benchmarks, but promises greater efficiency

Status
Not open for further replies.
Nvidia, already well in front on AI with CUDA and the best LLM performance, has at the same time created a competitive custom datacenter solution... Maybe their market valuation is justified.

It will be interesting to see an AMD Genoa comparison, as well as whether it can continue to compete. Custom ARM cores are making inroads, and these benchmarks show that for certain configs they are competent replacements.
 
Against Sapphire Rapids in HBM mode, Grace only won in three of the eight tests — though it was able to outperform in five tests when in DDR5 mode. It's a surprisingly mixed bag for Nvidia considering that Grace has 50% more cores and uses TSMC's more advanced 4nm node instead of Intel's aging Intel 7 (formerly 10nm) process.
Some key details this statement overlooks:
  • Clock speed: the Xeon Max 9468 runs at 3.5 GHz vs. 3.2 GHz for the Grace CPU tested - a 9% advantage for Intel. Allegedly, Grace is designed to clock higher, but the one tested was running at reduced clocks for some reason.
  • Sapphire Rapids' Golden Cove cores feature much wider AVX-512. I'm not sure of the number of ports, but possibly a total of 1536 bits or wider. Grace uses ARM Neoverse V2 cores, which have 4x 128-bit SVE2 units = 512 bits of vector throughput per cycle.
  • Xeon Max features 1 TB/s of HBM bandwidth, while Grace only manages about half as much bandwidth from its LPDDR5X. Furthermore, the NextPlatform article indicates that the researchers' system had Grace's memory running at reduced clocks, but doesn't specify by how much.
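To put rough numbers on the three points above, here is a back-of-the-envelope sketch. It assumes 48 cores for the Xeon Max 9468 (implied by Grace's 72 cores being 50% more) and takes the 1536-bit figure for Golden Cove at face value, even though that width is only a guess. The function name is made up for illustration, and none of this is measured data.

```python
# Rough peak-throughput comparison using the figures quoted in this post.
# ASSUMPTIONS: 48 Xeon cores (72 Grace cores = 50% more), and the
# 1536 bits/cycle Golden Cove estimate, which is a guess, not a spec.

def peak_vector_bits_per_ns(cores: int, clock_ghz: float, bits_per_cycle: int) -> float:
    """Aggregate vector bits processed per nanosecond across all cores."""
    return cores * clock_ghz * bits_per_cycle

xeon_max = peak_vector_bits_per_ns(cores=48, clock_ghz=3.5, bits_per_cycle=1536)
grace = peak_vector_bits_per_ns(cores=72, clock_ghz=3.2, bits_per_cycle=512)

clock_advantage = 3.5 / 3.2 - 1   # Intel's clock speed edge
vector_ratio = xeon_max / grace   # Xeon Max peak vector width vs. Grace
bandwidth_ratio = 1000 / 512      # 1 TB/s HBM vs. 512 GB/s LPDDR5X

print(f"Clock advantage (Intel): {clock_advantage:.1%}")
print(f"Peak vector throughput, Xeon Max / Grace: {vector_ratio:.2f}x")
print(f"Memory bandwidth, Xeon Max / Grace: {bandwidth_ratio:.2f}x")
```

Under those assumptions, Xeon Max would have roughly twice the peak vector throughput and twice the memory bandwidth, despite having fewer cores. Real workloads rarely hit peak numbers, so treat this as an upper bound on the gap.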

Basically, the main thing Grace has going for it is its 50% higher core count. Taken together, I think the results actually put Grace in a more positive light than I'd have expected. I mean, if you really run the numbers, the Grace setup just isn't designed for throughput like a Xeon Max is. That's because Nvidia never intended Grace to do the heavy lifting. They expect you to use their H100 (and now H200) to shoulder the main compute burden.

BTW, NextPlatform incorrectly describes Grace's 512 GB/s memory bandwidth as being aggregate for the superchip. It's actually 512 GB/s per CPU.

[Image: grace-CPU-superchip-graphic.png]

Source: https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip-architecture-in-depth/
 
Well, now we see a test that shows the chosen equipment winning, which justifies the spend and the choice. That seems to run contrary to all the expectations. Lots of stuff was done in the background to ensure the Intel product is on top, when frankly it should not be. Anybody else smell a blue fish?
 
anybody else smell a blue fish ?
Usually, you'd want to dig into the details of the setups and how the testing was performed, in order to spot anything which could've biased the results one way or another. Hopefully, all of those details are in the papers, themselves.

What strikes me as weird about the whole proposition is the use of Xeon Max in HBM mode. This limits you to just 64 GiB of memory, which seems woefully insufficient. I'm glad they tested this configuration, because I'm certainly curious about the potential of HBM, but it seems unsuitable for actual usage.

So, if they forgo HBM mode, that cuts down Xeon Max's advantage from winning 5/8 benchmarks to 3/8. There's a 3rd option, which is to use HBM as a cache. I wonder if the article just omitted those results or if the researchers hadn't tested them. Anyway, if the HBM cache mode adds little benefit over straight DDR5 mode, then I'd say they made a mistake in selecting Xeon Max and should've just gone with a standard Xeon model that supports higher clock speeds.

Fun stuff, though.
: )
 
Just gonna leave this here:

Sadly, no power consumption figures. However, the Geomean shows a single, 72-core Grace achieving 2175, while a single, 96-core AMD EPYC 9654 achieves 2499. So, that's 14.9% faster with 33.3% more cores (and 166.7% more threads).
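For anyone who wants to check the math, here's a quick sketch of those percentages, plus the per-core score that falls out of them. The thread counts assume Grace has no SMT (72 threads) and the EPYC 9654 runs 2 threads per core (192 threads). The helper name is just for illustration.

```python
# Reproduces the percentages from the Geomean figures quoted above.
# ASSUMPTION: 72 threads for Grace (no SMT), 192 for EPYC 9654 (2 per core).

def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, in percent."""
    return (a / b - 1) * 100

grace_score, grace_cores, grace_threads = 2175, 72, 72
epyc_score, epyc_cores, epyc_threads = 2499, 96, 192

faster = pct_more(epyc_score, grace_score)
more_cores = pct_more(epyc_cores, grace_cores)
more_threads = pct_more(epyc_threads, grace_threads)

grace_per_core = grace_score / grace_cores
epyc_per_core = epyc_score / epyc_cores

print(f"EPYC 9654 is {faster:.1f}% faster")
print(f"with {more_cores:.1f}% more cores and {more_threads:.1f}% more threads")
print(f"Geomean per core: Grace {grace_per_core:.1f}, EPYC {epyc_per_core:.1f}")
```

On a per-core basis, that puts Grace about 16% ahead of the EPYC 9654 in this particular Geomean, for whatever a per-core geomean is worth.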

What's even more impressive is how well it does against the 128-core/256-thread Zen 4C-based Bergamo (EPYC 9754), which is only 13.1% faster!

Not bad, Grace!
 