News: Nvidia Is Bringing Back the Dual GPU... for Data Centers

bit_user

Polypheme
Ambassador
@JarredWaltonGPU, I'm not really sure what's new here (besides the additional memory and power). Their existing H100 PCIe product already supported a 3x NVLink bridge for installing the cards in pairs. This PDF is dated Nov. 30, 2022 and includes installation diagrams clearly showing that.



It also states:

"The NVIDIA H100 PCIe operates unconstrained up to its maximum thermal design power (TDP) level of 350 W"​


So, you're right that they did increase the power limits.

Over at Anandtech, Ryan Smith is claiming the additional capacity is from enabling the 6th stack, which is also now HBM3. That increases the memory bandwidth of a single card to 3.9 TB/s, according to him.
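
For what it's worth, a quick back-of-envelope on those figures (my own arithmetic, not anything official), using the ~2 TB/s the existing PCIe card is rated at across its 5 active HBM2e stacks:

```python
# Rough per-stack bandwidth comparison (my arithmetic, not Nvidia's numbers).
HBM2E_CARD_BW_TBS = 2.0   # H100 PCIe: 5 active HBM2e stacks, ~2.0 TB/s total
HBM3_CARD_BW_TBS = 3.9    # H100 NVL (per card): 6 active HBM3 stacks, ~3.9 TB/s total

per_stack_hbm2e = HBM2E_CARD_BW_TBS / 5   # ~0.40 TB/s per stack
per_stack_hbm3 = HBM3_CARD_BW_TBS / 6     # ~0.65 TB/s per stack

print(f"HBM2e: ~{per_stack_hbm2e * 1000:.0f} GB/s per stack")
print(f"HBM3:  ~{per_stack_hbm3 * 1000:.0f} GB/s per stack")
```

So it's not just the extra stack; each stack would also be moving roughly 60% more data.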

The reference to "GPT3-175B" makes me wonder if GPT-3 was just too big to fit on a pair of their existing H100 PCIe cards, hence the need for this upgrade.
 
Oh... I somehow got it into my head that H100 was always HBM3. Seems like it was HBM2e on the PCIe model. I guess HBM3 and HBM2e must not be all that different at a base level. It's still a bit odd that the memory capacity goes to 94GB per card. Like, enabling the sixth stack makes sense. But why not the full 96GB per card? Were yields really that much better with 2GB disabled per card? Or maybe it's something with ECC, but I don't know. I'll ask Nvidia and see if it has a response.

As for the additional memory, I'm sure there's something about the extra VRAM enabling larger models. From what I can tell, 4-bit mode with OPT-13b needs at least ~10GB. So if you go up to 130b, it would be ~100GB, and 165b would be ~127GB. That assumes truly linear scaling, which probably isn't accurate. Whatever the limits are, 188GB vs. 160GB means the model can be 17.5% larger. Bigger is better? 🙃
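
Here's that linear-scaling estimate as a quick sketch, just to show the arithmetic (vram_estimate_gb is my own throwaway helper, and the linear assumption is the shaky part; it ignores activations and KV cache entirely):

```python
# Rough 4-bit VRAM estimate, assuming memory scales linearly with parameter count.
# Calibrated to ~10 GB for OPT-13B in 4-bit mode; treat the outputs as ballpark only.
BASELINE_PARAMS_B = 13   # OPT-13B
BASELINE_VRAM_GB = 10    # ~10 GB observed at 4-bit

def vram_estimate_gb(params_billion: float) -> float:
    return BASELINE_VRAM_GB * params_billion / BASELINE_PARAMS_B

for size_b in (13, 130, 165, 175):
    print(f"{size_b}B params -> ~{vram_estimate_gb(size_b):.0f} GB at 4-bit")
```

By that (shaky) math, a 175B-class model lands around ~135GB for the weights alone, before activations and KV cache.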
 
Reactions: bit_user
RIP SLI. Long-time SLI builder here, going back to the 3Dfx Voodoo2 days of the late 1990s. The common approach used to be gaming on one mid-range GPU at lower settings, then buying a second GPU (and maybe a bigger PSU) when budget allowed, to fully unlock a game's eye-candy potential at roughly single top-end GPU performance (at least that was me). However, at some point in the mid-2010s, game developers stopped supporting dual-GPU SLI/Crossfire, right around the time the 4th-gen consoles became their development focus.

What I don't remember is which came first in SLI's death: did game developers dropping dual-GPU support push the hardware manufacturers to stop supporting it, or did the manufacturers drop support first and the developers follow? Either way, since we are now forced to buy one single much more expensive GPU these days (REALLY much more expensive), upgrade paths are fewer and farther between for many.
 
Reactions: bit_user

bit_user

Polypheme
Ambassador
Either way, since we are now forced to buy one single much more expensive GPU these days (REALLY much more expensive), upgrade paths are fewer and farther between for many.
Excellent summary. Good questions.

As the latest nodes become increasingly expensive and GPU designers wrestle with chiplets and partitioning, I actually wonder if we could see a revival of multi-GPU. With PCIe 5 / CXL, we might not even need over-the-top connectors, although a dual GPU setup would typically mean 2 cards running at just x8.
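
To put rough numbers on the x8 concern (my own math from the published line rates; gb_per_s is just a throwaway helper): splitting a PCIe 5.0 x16 into two x8 links still gives each card about what a full PCIe 4.0 x16 slot delivers today.

```python
# Approximate usable per-direction PCIe bandwidth from line rate and 128b/130b encoding.
# Back-of-envelope only; real-world throughput is somewhat lower.
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}  # GT/s per lane for PCIe gens 3-5

def gb_per_s(gen: int, lanes: int) -> float:
    return GT_PER_LANE[gen] * (128 / 130) * lanes / 8  # GB/s, per direction

print(f"PCIe 4.0 x16: ~{gb_per_s(4, 16):.1f} GB/s")
print(f"PCIe 5.0 x8:  ~{gb_per_s(5, 8):.1f} GB/s")
print(f"PCIe 5.0 x16: ~{gb_per_s(5, 16):.1f} GB/s")
```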

Another key question is whether the increasing use of ray tracing might be an enabler here. Around 2006 or so, Intel started talking it up, making the case that it scales better than rasterization. That's when they put together a demo of ray-traced Doom or Quake running on a dual quad-core Core 2 Xeon workstation, with all the rendering done on the CPU cores. Maintaining the BVH could throw a wrench into that idea, though; I'm not sure how well you could distribute it.
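
Purely as a conceptual sketch of why I think the rays split more cleanly than the acceleration structure does (the function names here are made up, not any real API): the screen tiles partition trivially across devices, but every device still needs the whole BVH.

```python
# Toy illustration: per-tile ray work is embarrassingly parallel, but the BVH
# has to be built/refit and held in full by every worker (or shared somehow).
# build_bvh/trace_tile are hypothetical stand-ins, not a real renderer's API.
from concurrent.futures import ThreadPoolExecutor

def build_bvh(scene):
    return {"objects": scene}  # stand-in for the full acceleration structure

def trace_tile(bvh, tile_id):
    # Stand-in for tracing all rays in one screen tile against the whole BVH.
    return f"tile {tile_id}: traced against {len(bvh['objects'])} objects"

def render_frame(scene, num_gpus=2, tiles=8):
    bvh = build_bvh(scene)  # per-frame cost that doesn't partition by tile
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        return list(pool.map(lambda t: trace_tile(bvh, t), range(tiles)))

print(render_frame(["teapot", "floor", "light"]))
```

In the toy version sharing the BVH is free because it's all one address space; across discrete GPUs you'd be duplicating it or paying for remote access, which is exactly the wrench.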

Anyway, I think that whatever happens with multi-die GPUs, software designed to accommodate the partitioning will utilize the hardware more efficiently. And once modern game engines have to take that on board, multi-GPU is just a couple more steps down that logical progression. So, I wouldn't rule out seeing it return within the decade.
 
Reactions: 10tacle