Musk just purchased 300,000 Blackwell GPUs. He's certainly not the only one with tens of thousands of Nvidia GPUs in a single supercomputer. What you are talking about has nothing to do with the work involved in keeping massive supercomputers like these running through constant hardware failures and software issues. Reliability absolutely matters.
https://www.tomshardware.com/tech-i...ee-hours-for-metas-16384-gpu-training-cluster
I think that article actually highlights what I'm talking about:
"the main trick for developers is to ensure that the system remains operational regardless of such local breakdowns.
...
the Llama 3 team maintained over a 90% effective training time."
You can't have perfect hardware reliability at these scales, so you need fault tolerance at the software level. Once your system is fault tolerant, you can afford to clock the hardware a bit higher, up to the point where the failure rate rises so much that it ceases to be a net win. I'm not saying they actually do that, but maintaining over 90% effective training time in the face of all those failures clearly shows how effective their fault-tolerance strategies are.
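To make the fault-tolerance idea concrete, here's a minimal sketch of one common approach, periodic checkpointing with automatic resume. It's generic PyTorch, not anything Meta has published; the model, optimizer, and checkpoint path are placeholders.

```python
# Minimal checkpoint/resume loop (illustrative only, not Meta's actual setup).
# Assumes a generic PyTorch model/optimizer; names and paths are placeholders.
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path

def save_checkpoint(step, model, optimizer):
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # no checkpoint yet, start from scratch
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def train(model, optimizer, batches, total_steps, ckpt_every=1000):
    step = load_checkpoint(model, optimizer)  # resume after a crash/restart
    data = iter(batches)
    while step < total_steps:
        batch = next(data)
        loss = model(batch).mean()  # stand-in for a real loss function
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % ckpt_every == 0:
            save_checkpoint(step, model, optimizer)  # lose at most ~ckpt_every steps
        step += 1
```

The checkpointing itself is the easy part; the wrapper that detects the dead node and relaunches the job (Slurm requeue, Kubernetes restart, or similar) is where most of the real engineering effort goes.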
No. TSMC's reticle limit is around 800 mm^2. All of Nvidia's largest GPUs in recent years have been around this size. Hopper has 80 billion transistors; AD102 has 76 billion. The almost-full-die (98.6%) Ada A6000 boosts to 2505 MHz, more than 25% higher than Hopper.
Transistor counts can be misleading, because different types of transistors need to be different sizes. In particular I/O transistors are much larger than what's used in most of the computational logic. The H100 features a 3072-bit HBM2e interface and 48 NVLink lanes. There are probably other reasons behind the transistor density discrepancy, but that could be first among them.
The die sizes I found are 609 mm^2 for the RTX 4090 (AD102) and 814 mm^2 for Hopper (H100). The dies have different architectures and different utilization rates of their functional units, with the compute kernels tuned to get as close to 100% utilization as possible.
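Just to put numbers on the density discrepancy, using the transistor counts and die sizes quoted above (my arithmetic, nothing official):

```python
# Rough transistor-density comparison from the figures quoted above.
h100_transistors = 80e9      # Hopper (H100)
h100_area_mm2 = 814
ad102_transistors = 76e9     # AD102 (RTX 4090 die)
ad102_area_mm2 = 609

h100_density = h100_transistors / h100_area_mm2 / 1e6     # ~98 MTr/mm^2
ad102_density = ad102_transistors / ad102_area_mm2 / 1e6  # ~125 MTr/mm^2

print(f"H100:  {h100_density:.0f} MTr/mm^2")
print(f"AD102: {ad102_density:.0f} MTr/mm^2")
```

So the gaming die packs roughly 25% more transistors per mm^2, which is at least consistent with the H100 spending a larger share of its area on big I/O structures like the HBM and NVLink interfaces.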
Next, consider the difference in workloads. In a realtime rendering workload, once you finish rendering a frame, the GPU gets some idle time. Not only that, but because not everything in rendering is parallelizable, you won't even have full occupancy while the frame is still rendering. For compute workloads, as long as the system isn't bottlenecked on communication or I/O, the GPU is running flat out. Hence, it totally makes sense that a realtime-oriented GPU would have higher boost clocks, because there will be more times when it's under comparatively low utilization.
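If you want to see this for yourself, you can sample SM clock and utilization while a workload runs and compare a game against a compute job. Something along these lines with the NVML Python bindings (pynvml), assuming an Nvidia GPU and driver are present:

```python
# Sample GPU utilization and SM clock once per second (requires nvidia-ml-py / pynvml).
# Run it alongside a rendering workload vs. a compute job to compare the patterns.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    for _ in range(30):  # ~30 seconds of samples
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        print(f"util={util.gpu:3d}%  sm_clock={sm_clock} MHz")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```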
A 10U Hopper DGX system with 8 GPUs in it uses about 10kW of power; the same system with Blackwell consumes over 14kW. An additional 4kW of heat to dissipate out of each 10U box is not a trivial problem to solve. Not everyone is as smart as you and gets it right on the first try.
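For a rough sense of scale, here's a back-of-the-envelope airflow estimate for air-cooling those loads, assuming a 20 °C inlet-to-exhaust temperature rise (my assumption, purely illustrative, not a vendor spec):

```python
# Back-of-the-envelope: airflow needed to carry away a given heat load with air.
# Volumetric flow = P / (rho * cp * dT); the constants below are my assumptions.
RHO_AIR = 1.2        # kg/m^3, near sea level
CP_AIR = 1005        # J/(kg*K)
DELTA_T = 20         # K, assumed inlet-to-exhaust temperature rise
M3S_TO_CFM = 2118.88

def airflow_cfm(power_watts, delta_t=DELTA_T):
    m3_per_s = power_watts / (RHO_AIR * CP_AIR * delta_t)
    return m3_per_s * M3S_TO_CFM

print(f"10 kW system: ~{airflow_cfm(10_000):.0f} CFM")  # ~880 CFM
print(f"14 kW system: ~{airflow_cfm(14_000):.0f} CFM")  # ~1230 CFM
print(f"extra 4 kW:   ~{airflow_cfm(4_000):.0f} CFM")   # ~350 CFM
```

Multiply that extra airflow across every box in a row of racks and it's easy to see why denser GPU deployments keep pushing toward liquid cooling.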
Heh, flattery will get you everywhere!
: )
I trust Nvidia and datacenter operators to appreciate the ever-increasing challenges posed by cooling. I also agree with you that these aren't easy problems to solve. I'm no expert in these subject areas, so I find it interesting to hear about the struggles of even those who are.