News Nvidia Unveils DGX GH200 Supercomputer and MGX Systems, Grace Hopper Superchips in Production

bit_user

Titan
Ambassador
Thanks for the writeup!

up to 1 TB/s of throughput between the CPU and GPU offering a tremendous advantage for certain memory-bound workloads.
The dual-CPU configuration option can create confusion about whether the stats refer to a single-CPU or dual-CPU config. As you can see from their developer blog post, the 1 TB/s is the combined memory bandwidth across 2 Grace CPUs. The chip-to-chip link is only 900 GB/s.

When a GPU is paired with a CPU, the GPU would be bottlenecked by the CPU's 500 GB/s memory interface. Then again, I'd bet Nvidia is arriving at 900 GB/s by summing throughput in each direction, in which case it could only read or write CPU memory at a peak of about 450 GB/s.
[Image: grace-CPU-superchip-graphic.png]
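Just to make that concrete, here's the back-of-envelope arithmetic as a tiny compilable sketch (the 50/50 split of the 900 GB/s between directions is my assumption, not something Nvidia states):

```
// Back-of-envelope numbers from the article and Nvidia's developer blog.
// Assumes the 900 GB/s NVLink-C2C figure is the sum of both directions;
// that split is my guess, not an Nvidia spec.
#include <cstdio>

int main() {
    const double per_cpu_mem_gbs = 500.0;  // LPDDR bandwidth of one Grace CPU
    const double c2c_total_gbs   = 900.0;  // headline NVLink-C2C figure

    printf("Grace+Grace combined memory bw: %.0f GB/s\n", 2 * per_cpu_mem_gbs);
    printf("C2C per direction (if summed) : %.0f GB/s\n", c2c_total_gbs / 2);
    return 0;
}
```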

The system has 150 miles of optical fiber and weighs 40,000 lbs, but presents itself as one single GPU. Nvidia says the 256 Grace Hopper Superchips ...
Really? That's 156 pounds per "superchip". That sounds like pretty bad scaling, actually.


@PaulAlcorn, in the 7-slide album under "Nvidia MGX Systems Reference Architectures", the last 2 slides appear to be duplicates of slides 2 & 3. Not that I really care that much about the details of MGX, but I did notice.
; )

The MGX reference designs could be the sleeper announcement of Nvidia’s Computex press blast
Yeah, but let's call it out for what it is: an end-run around OCP (Open Compute Project). The industry has a nice, open multi-vendor ecosystem that already supports modular accelerators.

Nvidia is pivoting to Ethernet as the interconnect for high-performance AI platforms
"Never bet against Ethernet."


@PaulAlcorn, in the album under "Nvidia Spectrum-X Networking Platform", the second image seems to be missing?

Nvidia's first Arm CPUs (Grace) ...
Server CPUs, that is. They've been making ARM-based SoCs for embedded applications for about a decade and a half.
 
  • Like
Reactions: msroadkill612

zecoeco

Prominent
BANNED
Sep 24, 2022
83
113
710
Nvidia is throwing the new buzzword "AI" into every product they announce.
They just won't stop overhyping AI.
 

ngilbert

Prominent
Mar 30, 2023
6
7
515
Nvidia is throwing the new buzzword "AI" into every product they announce.
They just won't stop overhyping AI.
Given that it is the buzzword of the month across the industry, and they have a massive lead in that type of workload, they would be foolish not to push it as far as they can.
 
  • Like
Reactions: bit_user

elforeign

Distinguished
Oct 11, 2009
100
142
18,770
Awesome write-up! Can someone illustrate just what the platform enables? I understand what's possible today in terms of language models, etc., but when they say 2.2-6.3x more performance, where does that take us in the next year?

How soon can I plug into the matrix?
 

bit_user

Titan
Ambassador
Awesome write-up! Can someone illustrate just what the platform enables? I understand what's possible today in terms of language models, etc., but when they say 2.2-6.3x more performance, where does that take us in the next year?
The details are (mostly) in the slides. And the devil is in the details. It shouldn't surprise you to hear that Nvidia loves to overhype their products, so they're going to highlight the best-case scenario, much more than the typical speedup. Don't get these two confused!

They claimed a 2.3x speedup in (training?) a 65B parameter Large Language model (e.g. ChatGPT), a 5x speedup in a 500 GB Deep Learning Recommendation Model, and a 5.5x speedup in a 400 GB Vector DB workload (think lookups in a large-population facial recognition database, among other examples that come to mind).

There's a common theme here: they're all big. That's because they didn't announce a new GPU! They announced a new system architecture, based around their "superchip" modules and next-gen interconnect technology. So, it's not the computation that got faster, but rather the data movement, locality, and latencies that improved.
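
To put some rough numbers on that, here's a back-of-envelope sketch (compiles as plain host code; the PCIe 5.0 figure and the 450 GB/s per-direction NVLink-C2C figure are my own assumptions, not Nvidia's):

```
// Rough streaming-time comparison for a model that doesn't fit in GPU memory.
// The 500 GB size is the DLRM case cited above; the link figures are my
// assumptions for illustration, not measured values.
#include <cstdio>

int main() {
    const double model_gb       = 500.0;
    const double pcie5_x16_gbs  = 64.0;   // ~PCIe 5.0 x16, one direction
    const double nvlink_c2c_gbs = 450.0;  // 900 GB/s total, assumed 450 each way

    printf("PCIe 5.0 x16 : %.1f s per full pass over the model\n",
           model_gb / pcie5_x16_gbs);
    printf("NVLink-C2C   : %.1f s per full pass over the model\n",
           model_gb / nvlink_c2c_gbs);
    return 0;
}
```

Roughly a 7x difference in data-movement time, which is at least in the same ballpark as the 5x DLRM claim.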

How soon can I plug into the matrix?
Depends on how you define it. If you just wanted VR without a headset, then the bottleneck would seem to be the brain-machine interface. Elon Musk's Neuralink just got approval to start human trials. I haven't followed their stuff, but I'm guessing it'd be several generations before they're ready to tap into the optic nerve. And it's anybody's guess if/when we can do that without having to sever the connection to your organic retinas.
 

elforeign

Distinguished
Oct 11, 2009
100
142
18,770
The details are (mostly) in the slides. And the devil is in the details. It shouldn't surprise you to hear that Nvidia loves to overhype their products, so they're going to highlight the best-case scenario, much more than the typical speedup. Don't get these two confused!

They claimed a 2.3x speedup in (training?) a 65B parameter Large Language model (e.g. ChatGPT), a 5x speedup in a 500 GB Deep Learning Recommendation Model, and a 5.5x speedup in a 400 GB Vector DB workload (think lookups in a large-population facial recognition database, among other examples that come to mind).

There's a common theme here: they're all big. That's because they didn't announce a new GPU! They announced a new system architecture, based around their "superchip" modules and next-gen interconnect technology. So, it's not the computation that got faster, but rather the data movement, locality, and latencies that improved.


Depends on how you define it. If you just wanted VR without a headset, then the bottleneck would seem to be the brain-machine interface. Elon Musk's Neuralink just got approval to start human trials. I haven't followed their stuff, but I'm guessing it'd be several generations before they're ready to tap into the optic nerve. And it's anybody's guess if/when we can do that without having to sever the connection to your organic retinas.
Thanks. For some reason I thought the Grace Hopper chip was a new chip, but if I'm understanding correctly, it's just the software stack and the interconnects that got upgraded to enable more of them to be daisy-chained while maintaining coherency.
 
  • Like
Reactions: msroadkill612

bit_user

Titan
Ambassador
Thanks. For some reason I thought the Grace Hopper chip was a new chip
Grace is a 72-core ARM CPU and each one has up to 512 GB of memory. There are 3 options for populating their modules (daughter cards):
  • Grace + Grace (2x CPU)
  • Grace + Hopper (CPU + GPU)
  • Hopper + Hopper (2x GPU)

The big news is that Grace is finally shipping, after probably close to a year of delays.

if I'm understanding correctly, it's just the software stack and the interconnects that got upgraded to enable more of them to be daisy-chained while maintaining coherency.
Yeah, I don't follow NVLink closely enough to say in detail just how much the latest round of upgrades is worth, but the game-changer is that you now have these CPU nodes in the NVLink mesh, each of which has 512 GB of memory (compared with the GPU nodes, which have only up to 96 GB). So that's why they're highlighting all of these big-memory workloads that were previously bottlenecked on accessing large amounts of data in pools of "host" memory.
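
Regarding coherency: my understanding from Nvidia's developer material is that the coherent NVLink-C2C link (plus driver support) lets a GPU kernel dereference plain malloc'd host memory directly. A minimal sketch, assuming that's how it behaves on GH200 (I haven't run this on real hardware):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Doubles every element of an array the GPU can see.
__global__ void scale(double *x, size_t n, double a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 20;

    // Plain malloc, no cudaMalloc / cudaMemcpy. On Grace Hopper, the coherent
    // NVLink-C2C link is supposed to let the GPU access this host allocation
    // directly; on systems without that support, this would fault.
    double *x = (double *)malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0;

    const int blocks = (int)((n + 255) / 256);
    scale<<<blocks, 256>>>(x, n, 2.0);
    cudaDeviceSynchronize();

    printf("x[0] = %f (expect 2.0 if coherent access works)\n", x[0]);
    free(x);
    return 0;
}
```

The point being that the big pool of CPU-attached LPDDR becomes directly addressable by the GPU, rather than something you have to stage through explicit copies.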

Beyond that, they're highlighting that you can bypass InfiniBand for scaling to multiple chassis. Using NVLink scales much better at the rack level. At some point, you still need to use InfiniBand.
 
  • Like
Reactions: elforeign

msroadkill612

Distinguished
Jan 31, 2009
203
29
18,710
Indeed. To this non-expert, there seems to be a lot of hype about Grace/Hopper and not much that's fundamentally new about the processors.

The stuff from AMD's stable, though, is revolutionary.

Four GPUs on a single MI300 substrate (32 such GPUs per blade). NV's LAN bandwidth boasts seem irrelevant to AI - it's all about latency and efficiency via proximity of processing resources (simple laws of physics).

The gloomiest prospect for NV, IMO, is that AMD is on the launch pad of GPU chiplets - not only chiplets that together comprise a single GPU (e.g. the 7900 XTX, as with a single-core-complex (CCX) Zen), but multiple GPUs on a single substrate (as with a multi-CCX Zen).

Nvidia is constrained by the scale limitations of monolithic processors, and that's a far worse problem for GPUs, where scale clearly matters far more. Intel's problems competing in the data center against EPYC will pale by comparison in these gigantic AI rigs.

Boasting of 150 miles of LAN fiber seems odd... not to mention the huge cost of all those $10k NICs.

Shared GPU/CPU memory on the MI300, 192-core EPYCs, 3D V-Cache EPYCs - a huge advance on Grace Hopper's discrete CPU/GPU memory.

If Grace Hopper is just now releasing, as you say, the more revolutionary MI300 seems not so far behind, with its release to El Capitan.
 

bit_user

Titan
Ambassador
Indeed. To this non-expert, there seems to be a lot of hype about Grace/Hopper and not much that's fundamentally new about the processors.
Here's what I see that's new:
  • First ARMv9-A server CPU (also the first to use ARM Neoverse V2 cores - Amazon Graviton 3 uses V1)
  • First server CPU to use co-packaged LPDDR memory, with the only other example I know being Apple.
  • Second and only current CPU to feature NVLink (the first was IBM's POWER, but that CPU is at least 2 generations old)
  • Made on TSMC N4 - gives a slight advantage over EPYC at N5.

NV's LAN bandwidth boasts seem irrelevant to AI
Of course it's relevant! Big models involve thousands of training GPUs, and that means you need to scale well beyond the number that can be packed in a single chassis or rack!

I'm incredibly interested in seeing some real-world benchmarks, purely for the sake of looking at how the latest & greatest ARM server cores compare to Intel and AMD. As for its role in building scalable AI systems, that's not terribly interesting to me, personally.

Compared with the MI300, I can't really say too much, technically. I'd just point out that the success of that product depends on a whole lot of factors, with technical parity/superiority being only one. However, I do hope it enables AMD to finally gain some traction in the AI market. I'm not going to place any bets, either way.
 