News: Nvidia's data center Blackwell GPUs reportedly overheat, require rack redesigns, and cause delays for customers

It seems like Nvidia tried to launch with clock speeds that were too high.

Perhaps they felt compelled to do that, in order to justify the exorbitant pricing. This would also explain why they can't simply tell customers to reduce clock speeds a little, since that would significantly affect the perf/$ and might result in customers asking for a rebate to offset the lost performance.

While Nvidia provides industry-leading AI performance, they're probably less competitive on the perf/$ front. Intel basically said as much during recent announcements regarding their repositioning of Gaudi. This is something Nvidia must be keeping an eye on.

TBH, I expected Nvidia to have transitioned their AI-centric products away from a GPU-style architecture by now, towards a more purpose-built, dataflow-oriented architecture like most of their competitors have adopted. I probably just underestimated how much effort it would take them to overhaul their software infrastructure, which I guess supports Jim Keller's claim that CUDA is more of a swamp than a moat.
 
These days performance isn't all that important (all the hardware performs close together); temperature is more important.

I can't understand why they don't use AMD. AMD may be a little weaker, but it runs at much lower heat and wattage.

High wattage and high temperature = degradation.
 
A lack of comprehensive design/engineering, at a time when it's needed more than ever and when the facilities to do it are (or should be) available?
 
Well, even if Nvidia made a subpar product, companies and individuals would still throw money at them. So, with so few consequences, they can basically do whatever they want, can't they? All it takes is some fluffy, creative marketing built on a few edge cases from internal testing.
 
It seems like Nvidia tried to launch with clock speeds that were too high.

Perhaps they felt compelled to do that, in order to justify the exorbitant pricing. This would also explain why they can't simply tell customers to reduce clock speeds a little, since that would significantly affect the perf/$ and might result in customers asking for a rebate to offset the lost performance.

While Nvidia provides industry-leading AI performance, they're probably less competitive on the perf/$ front. Intel basically said as much during recent announcements regarding their repositioning of Gaudi. This is something Nvidia must be keeping an eye on.
Just to add onto this: their biggest customers are already diversifying their hardware sources, so cutting into that performance lead could be especially dangerous.
 
I thought the heat was from top Blackwell products essentially gluing two giant dies together, increasing the thermal density.
I'm not saying otherwise, just that I'm sure it's clocked well beyond the point of diminishing returns. In which case, they could reduce power consumption (and thereby heat), if they'd dial back the clocks a bit. It's such a simple and obvious solution that I thought it was worth exploring the implications of doing it.
 
How did we get from October mass production to January shipping?

Heat density is a thing. Even with liquid cooling, it is difficult to continuously move thousands of watts from A to B. A temporary fix is maybe to install 2 of 3 boards? Not earth-shattering. The problem is that if they change the case itself, there is a crazy amount of networking connectivity inside, as well as heavy power cabling. Swapping cases will cost time and money, but I doubt customers will have to wait to start testing their compute modules, the way we run Furmark on new systems to weed out infant mortality. Lowering clocks is fine after that, since the boards work OK at higher clocks while they wait on the mechanical issues. People still get work done; time doesn't stop for anyone.
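For a rough sense of scale, here's a back-of-the-envelope sketch in Python of how much water flow it takes to carry one chassis worth of heat away, using Q = ṁ·c_p·ΔT. The heat load and temperature rise are my own illustrative assumptions, not anyone's published specs:

```python
# Rough estimate of coolant flow needed to move one chassis worth of heat.
# All figures are illustrative assumptions, not vendor specifications.

HEAT_LOAD_W = 10_000   # assumed chassis heat load, watts
CP_WATER = 4186        # specific heat of water, J/(kg*K)
DELTA_T = 10           # assumed coolant temperature rise, inlet to outlet, K
RHO_WATER = 1.0        # kg per liter

mass_flow = HEAT_LOAD_W / (CP_WATER * DELTA_T)   # kg/s
liters_per_min = mass_flow / RHO_WATER * 60

print(f"{mass_flow:.2f} kg/s ≈ {liters_per_min:.0f} L/min per chassis")
# ~0.24 kg/s, or ~14 L/min -- modest for one box, but multiplied across a rack
# and a whole hall, the pumps, manifolds, and heat rejection add up fast.
```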
 
we run Furmark on new systems to weed out infant mortality.
Ooo! +1 for furmark. That's one of my favorite stress tests!

It's even available for Linux, now!

 
I'm not saying otherwise, just that I'm sure it's clocked well beyond the point of diminishing returns.
Enterprise hardware is never clocked like that. Efficiency and reliability are significantly more important to enterprise customers than all out performance. Hopper's boost clock is only 1980 MHz, compared to a 4090 that boosts up to 2850 MHz. Adding a 2nd chip, Blackwell increases the transistor count compared to Hopper from 80 billion to 208 billion, plus an additional 50GB of HBM. That's where the power increase is coming from. TDP increased from 700 W to 1000 W in the same form factor. If data centers did nothing to improve cooling with that increase in power/heat density, it's no surprise the systems are overheating.
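Just to put rough numbers on that (the per-GPU TDPs and transistor counts are from above; the per-system total is my own arithmetic):

```python
# Quick arithmetic on the Hopper -> Blackwell power increase described above.

hopper = {"tdp_w": 700, "transistors": 80e9}
blackwell = {"tdp_w": 1000, "transistors": 208e9}

# Power per transistor actually drops from one generation to the next...
print(hopper["tdp_w"] / hopper["transistors"] * 1e9)        # ~8.8 nW per transistor
print(blackwell["tdp_w"] / blackwell["transistors"] * 1e9)  # ~4.8 nW per transistor

# ...but packing that much more silicon into the same form factor still adds
# 300 W per GPU, i.e. +2.4 kW for an 8-GPU baseboard alone, before CPUs,
# NICs, and power-conversion losses.
print((blackwell["tdp_w"] - hopper["tdp_w"]) * 8)  # 2400 W
```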
 
Enterprise hardware is never clocked like that. Efficiency and reliability are significantly more important to enterprise customers than all out performance.
This isn't true of AI hardware, due to the demand for AI performance vs. its expense and scarcity. That incentivizes customers to milk the most performance out of whatever they can get their hands on. I previously ran through the numbers in this post:

As for reliability, AI training is inherently iterative. That gives you a natural checkpoint that you can resume from, in the event of a hardware failure. It's also more error-tolerant than most computation people do, using low-precision arithmetic and even relying on random seeds. Heck, it wouldn't surprise me too much to hear that some customers are even running the memory on these accelerators in non-ECC mode!
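To illustrate what I mean by a "natural checkpoint", here's a minimal PyTorch-style sketch of the resume-from-checkpoint pattern most training loops already use. The model, path, and intervals are all hypothetical stand-ins:

```python
# Minimal sketch of checkpoint/resume in a training loop (illustrative only).
import os
import torch

CKPT_PATH = "checkpoint.pt"           # hypothetical path
model = torch.nn.Linear(1024, 1024)   # stand-in for a real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

start_step = 0
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1     # resume where the last run died

for step in range(start_step, 10_000):
    x = torch.randn(32, 1024)         # stand-in for a real batch
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 500 == 0:               # periodic save bounds the work lost to a failure
        torch.save({"model": model.state_dict(),
                    "optimizer": opt.state_dict(),
                    "step": step}, CKPT_PATH)
```

If a node dies mid-run, you only lose the work since the last save, which is what makes a somewhat higher hardware failure rate tolerable in the first place.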

Hopper's boost clock is only 1980 MHz, compared to a 4090 that boosts up to 2850 MHz.
It's much bigger, though. That makes them hard to compare.

Adding a 2nd chip, Blackwell increases the transistor count compared to Hopper from 80 billion to 208 billion, plus an additional 50GB of HBM. That's where the power increase is coming from.
Like I said, it doesn't really matter why it's using more power; the point is they could still reduce it by trimming back the clock speed. That's the angle I was trying to explore.

If data centers did nothing to improve cooling with that increase in power/heat density, it's no surprise the systems are overheating.
The article says: "These problems have caused Nvidia to reevaluate the design of its server racks multiple times". So I think it's probably not a simple matter of everyone just assuming the same solutions they used before will still work.
 
Is this down to Nvidia's designs and clock speeds, or the rack-mount company's designs?

(Ignorance on my part as to who produces/supplies the rack mounting.)
 
As for reliability, AI training is inherently iterative. That gives you a natural checkpoint that you can resume from, in the event of a hardware failure. It's also more error-tolerant than most computation people do, using low-precision arithmetic and even relying on random seeds. Heck, it wouldn't surprise me too much to hear that some customers are even running the memory on these accelerators in non-ECC mode!

Musk just purchased 300,000 Blackwell GPUs. He's certainly not the only one with tens of thousands of Nvidia GPUs in a single supercomputer. What you are talking about has nothing to do with the work involved in keeping massive supercomputers like these running through constant hardware failures and software issues. Reliability absolutely matters.

https://www.tomshardware.com/tech-i...ee-hours-for-metas-16384-gpu-training-cluster

It's much bigger, though. That makes them hard to compare.

No. TSMC's reticle limit is around 800 mm^2, and all of Nvidia's largest GPUs in recent years have been around that size. Hopper has 80 billion transistors; AD102 has 76 billion transistors. The almost-full-die (98.6%) Ada A6000 boosts to 2505 MHz, more than 25% higher than Hopper.

The article says: "These problems have caused Nvidia to reevaluate the design of its server racks multiple times". So I think it's probably not a simple matter of everyone just assuming the same solutions they used before will still work.

A 10U Hopper DGX system with 8 GPUs in it uses about 10 kW of power; the same system with Blackwell consumes over 14 kW. An additional 4 kW of heat to dissipate out of each 10U rack is not a trivial problem to solve. Not everyone is as smart as you and gets it right on the first try.
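For anyone curious what that extra 4 kW means in practice, here's a rough airflow estimate, assuming an air-cooled chassis and a 20 K inlet-to-exhaust rise (both my own illustrative assumptions, not DGX specs):

```python
# Rough airflow needed to remove chassis heat: Q = m_dot * c_p * dT.
# Illustrative assumptions only -- not DGX specifications.

CP_AIR = 1005        # specific heat of air, J/(kg*K)
RHO_AIR = 1.2        # air density, kg/m^3
DELTA_T = 20         # assumed inlet-to-exhaust temperature rise, K
M3S_TO_CFM = 2118.9  # cubic meters per second -> cubic feet per minute

def cfm_needed(heat_w):
    mass_flow = heat_w / (CP_AIR * DELTA_T)        # kg/s
    return mass_flow / RHO_AIR * M3S_TO_CFM

print(f"10 kW chassis: ~{cfm_needed(10_000):.0f} CFM")  # ~880 CFM
print(f"14 kW chassis: ~{cfm_needed(14_000):.0f} CFM")  # ~1230 CFM
# Roughly 40% more air through the same 10U opening, or a bigger temperature
# rise, or a move to liquid -- none of which is free in an existing facility.
```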
 
Musk just purchased 300,000 Blackwell GPUs. He's certainly not the only one with tens of thousands of Nvidia GPUs in a single supercomputer. What you are talking about has nothing to do with the work involved in keeping massive supercomputers like these running through constant hardware failures and software issues. Reliability absolutely matters.

https://www.tomshardware.com/tech-i...ee-hours-for-metas-16384-gpu-training-cluster
I think that article actually highlights what I'm talking about:

"the main trick for developers is to ensure that the system remains operational regardless of such local breakdowns.
...
the Llama 3 team maintained over a 90% effective training time."

You can't have perfect hardware reliability at these scales, so you need fault tolerance at the software level. Once your system is fault-tolerant, you can afford to clock the hardware a bit higher, up to the point where the failure rate rises enough that it ceases to be a net win. I'm not saying they actually do that, but maintaining that uptime in the face of so many failures clearly shows how effective their fault-tolerance strategies are.
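Here's a toy model of that trade-off. The checkpoint interval, restart cost, and especially the failure-rate-vs-clock numbers are invented purely to illustrate the shape of the argument:

```python
# Toy model: does a higher clock still pay off once failures are counted?
# All numbers below are made up for illustration.

CHECKPOINT_INTERVAL_H = 1.0   # assumed hours between checkpoints
RESTART_COST_H = 0.25         # assumed hours to detect the failure and resume

def effective_throughput(clock_scale, failures_per_1000h):
    # Each failure costs the restart time plus, on average, half a
    # checkpoint interval of lost work.
    lost_per_failure_h = RESTART_COST_H + CHECKPOINT_INTERVAL_H / 2
    downtime_fraction = failures_per_1000h * lost_per_failure_h / 1000
    return clock_scale * max(0.0, 1 - downtime_fraction)

# Assume the failure rate climbs steeply as clocks go up (made-up curve):
for clock, fails in [(1.00, 10), (1.05, 20), (1.10, 60), (1.15, 200)]:
    print(f"clock x{clock:.2f}, {fails:4d} failures/1000h -> "
          f"effective x{effective_throughput(clock, fails):.3f}")
# With this made-up curve, +10% clock is still a net win; +15% is not.
```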

No. TSMC's reticle limit is around 800 mm^2, and all of Nvidia's largest GPUs in recent years have been around that size. Hopper has 80 billion transistors; AD102 has 76 billion transistors. The almost-full-die (98.6%) Ada A6000 boosts to 2505 MHz, more than 25% higher than Hopper.
Transistor counts can be misleading, because different types of transistors need to be different sizes. In particular, I/O transistors are much larger than what's used in most of the computational logic. The H100 features a 3072-bit HBM2e interface and 48 NVLink lanes. There are probably other reasons behind the transistor density discrepancy, but that could be first among them.

The die sizes I found are 609 mm^2 for RTX 4090 (AD102) and 814 mm^2 for Hopper (H100). The dies have different architectures and different utilization rates of their functional units, with the compute kernels being tuned to achieve as close to 100% utilization as possible.
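For what it's worth, the density gap falls straight out of those numbers (same figures as above; the arithmetic is mine):

```python
# Transistor density from the die sizes and counts quoted above.
ad102 = 76e9 / 609   # transistors per mm^2
gh100 = 80e9 / 814

print(f"AD102: {ad102 / 1e6:.0f} MTr/mm^2")  # ~125 MTr/mm^2
print(f"GH100: {gh100 / 1e6:.0f} MTr/mm^2")  # ~98 MTr/mm^2
# Hopper packs roughly 20% fewer transistors per mm^2, which is consistent
# with the point above about large I/O structures eating die area.
```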

Next, consider the difference in workloads. In a realtime rendering workload, once you finish rendering a frame, the GPU gets some idle time. Not only that, but because not everything in rendering is parallelizable, you won't even have full occupancy while the frame is still rendering. For compute workloads, as long as the system isn't bottlenecked on communication or I/O, the GPU is running flat out. Hence, it totally makes sense that a realtime-oriented GPU would have higher boost clocks, because there will be more times when it's under comparatively low utilization.

A 10U Hopper DGX system with 8 GPUs in it uses about 10 kW of power; the same system with Blackwell consumes over 14 kW. An additional 4 kW of heat to dissipate out of each 10U rack is not a trivial problem to solve. Not everyone is as smart as you and gets it right on the first try.
Heh, flattery will get you everywhere!
: )

I trust Nvidia and datacenter operators to appreciate the ever increasing challenges posed by cooling. I also agree with you that these aren't easy problems to solve. I'm no expert in these subject areas, so I find it interesting to hear the struggles of even those who are.
 
Tsk, tsk.

nVidia will say it's normal, to keep stock prices from dropping. How about instead being honest and saying they rushed the product out the door? Is it normal for customers to develop a timeline and then have to push it back? I don't think all these big players happened to miss their timelines at the same time.

There's no reason that nVidia couldn't stress test these racks with more intensity than what customers' workloads would actually put on them. Haha, love the Furmark mention -- good call! :) Instead, they calculated and simulated what they believed would be a "real world workload" and got it wrong. It's OK to be wrong, and it's incredibly refreshing when CEOs and mega corps admit to mistakes.

I think nVidia simply underestimated how big of a challenge it would be to provide sufficient cooling in a rack with this level of power density.
 
Because NVIDIA's motto - like that of the other big tech thieves - has always been: burn that oil like there's no tomorrow!
 
A 10U Hopper DGX system with 8 GPUs in it uses about 10 kW of power; the same system with Blackwell consumes over 14 kW. An additional 4 kW of heat to dissipate out of each 10U rack is not a trivial problem to solve.
Do you really think the problem is rack-level?
I would think it's really chip/module-level; that is, you need more water flow at the chip/module level.
Or - colder water? Didn't Nvidia call for room-temperature ("warm") water cooling?
Guess you can do it either way - increased warm-water flow, or the same flow at 40°F instead of 65°F. But there's only so much surface area on the chip/module, so temperature may dominate.
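A rough lumped model of that trade-off, purely for illustration (the heat load, flow rates, and cold-plate thermal resistance below are assumptions I picked, not measured Blackwell figures):

```python
# Lumped thermal model of a liquid-cooled module (illustrative assumptions only).
# T_chip ≈ T_inlet + Q/(m_dot*c_p) + Q*R_plate
#   - the middle term shrinks as you add flow,
#   - the last term is fixed by the cold plate's surface area and convection,
#   - lowering the inlet temperature shifts the whole stack down.

CP_WATER = 4186   # J/(kg*K)
Q = 1000          # assumed heat per module, W
R_PLATE = 0.03    # assumed chip-to-coolant thermal resistance, K/W

def chip_temp_c(t_inlet_c, flow_kg_s):
    coolant_rise = Q / (flow_kg_s * CP_WATER)
    return t_inlet_c + coolant_rise + Q * R_PLATE

print(chip_temp_c(t_inlet_c=18, flow_kg_s=0.05))  # ~53 C with a ~65°F inlet, baseline flow
print(chip_temp_c(t_inlet_c=18, flow_kg_s=0.10))  # ~50 C -- doubling the flow gains little
print(chip_temp_c(t_inlet_c=4,  flow_kg_s=0.05))  # ~39 C -- a ~40°F inlet gains a lot
# Doubling flow only trims the coolant-rise term, while a colder inlet moves
# everything down -- which is why, with limited surface area at the chip,
# inlet temperature tends to dominate.
```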