Musk just purchased 300,000 Blackwell GPUs. He's certainly not the only one with tens of thousands of Nvidia GPUs in a single supercomputer. What you are talking about has nothing to do with the work involved in keeping massive supercomputers like these running through constant hardware failures and software issues. Reliability absolutely matters.
https://www.tomshardware.com/tech-i...ee-hours-for-metas-16384-gpu-training-cluster
I think that article actually highlights what I'm talking about:
"the main trick for developers is to ensure that the system remains operational regardless of such local breakdowns.
...
the Llama 3 team maintained over a 90% effective training time."
You can't have perfect hardware reliability at these scales, so you need fault tolerance at the software level. Once your system is fault tolerant, you can afford to clock the hardware a bit higher, up to the point where the failure rate rises so much that it ceases to be a net win. I'm not saying they actually do that, but maintaining over 90% effective training time in the face of all those failures clearly shows how effective their fault-tolerance strategies are.
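To make the fault-tolerance idea concrete, here's a minimal sketch of one common approach, periodic checkpointing with automatic resume. It's generic PyTorch, not anything Meta has published; the model, optimizer, and checkpoint path are placeholders.

```python
# Minimal checkpoint/resume loop (illustrative only, not Meta's actual setup).
# Assumes a generic PyTorch model/optimizer; names and paths are placeholders.
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path

def save_checkpoint(step, model, optimizer):
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # no checkpoint yet, start from scratch
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def train(model, optimizer, batches, total_steps, ckpt_every=1000):
    step = load_checkpoint(model, optimizer)  # resume after a crash/restart
    data = iter(batches)
    while step < total_steps:
        batch = next(data)
        loss = model(batch).mean()  # stand-in for a real loss function
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % ckpt_every == 0:
            save_checkpoint(step, model, optimizer)  # lose at most ~ckpt_every steps
        step += 1
```

The checkpointing itself is the easy part; the wrapper that detects the dead node and relaunches the job (Slurm requeue, Kubernetes restart, or similar) is where most of the real engineering effort goes.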
No. TSMC's reticle limit is around 800 mm^2. All of Nvidia's largest GPUs in recent years have been around this size. Hopper has 80 billion transistors; AD102 has 76 billion. The almost-full-die (98.6%) Ada A6000 boosts to 2505 MHz, more than 25% higher than Hopper.
Transistor counts can be misleading, because different types of transistors need to be different sizes. In particular I/O transistors are much larger than what's used in most of the computational logic. The H100 features a 3072-bit HBM2e interface and 48 NVLink lanes. There are probably other reasons behind the transistor density discrepancy, but that could be first among them.
The die sizes I found are 609 mm^2 for the RTX 4090 (AD102) and 814 mm^2 for Hopper (H100). The dies have different architectures and different utilization rates of their functional units, with the compute kernels tuned to get as close to 100% utilization as possible.
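Just to put numbers on the density discrepancy, using the transistor counts and die sizes quoted above (my arithmetic, nothing official):

```python
# Rough transistor-density comparison from the figures quoted above.
h100_transistors = 80e9      # Hopper (H100)
h100_area_mm2 = 814
ad102_transistors = 76e9     # AD102 (RTX 4090 die)
ad102_area_mm2 = 609

h100_density = h100_transistors / h100_area_mm2 / 1e6     # ~98 MTr/mm^2
ad102_density = ad102_transistors / ad102_area_mm2 / 1e6  # ~125 MTr/mm^2

print(f"H100:  {h100_density:.0f} MTr/mm^2")
print(f"AD102: {ad102_density:.0f} MTr/mm^2")
```

So the gaming die packs roughly 25% more transistors per mm^2, which is at least consistent with the H100 spending a larger share of its area on big I/O structures like the HBM and NVLink interfaces.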
Next, consider the difference in workloads. In a realtime rendering workload, once you finish rendering a frame, the GPU gets some idle time. Not only that, but because not everything in rendering is parallelizable, you won't even have full occupancy while the frame is still rendering. For compute workloads, as long as the system isn't bottlenecked on communication or I/O, the GPU is running flat out. Hence, it totally makes sense that a realtime-oriented GPU would have higher boost clocks, because there will be more times when it's under comparatively low utilization.
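If you want to see this for yourself, you can sample SM clock and utilization while a workload runs and compare a game against a compute job. Something along these lines with the NVML Python bindings (pynvml), assuming an Nvidia GPU and driver are present:

```python
# Sample GPU utilization and SM clock once per second (requires nvidia-ml-py / pynvml).
# Run it alongside a rendering workload vs. a compute job to compare the patterns.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    for _ in range(30):  # ~30 seconds of samples
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        print(f"util={util.gpu:3d}%  sm_clock={sm_clock} MHz")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```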
A 10U Hopper DGX system with 8 GPUs in it uses about 10kW of power; the same system with Blackwell consumes over 14kW. An additional 4kW of heat to dissipate out of each 10U box is not a trivial problem to solve. Not everyone is as smart as you and gets it right on the first try.
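For a rough sense of scale, here's a back-of-the-envelope airflow estimate for air-cooling those loads, assuming a 20 °C inlet-to-exhaust temperature rise (my assumption, purely illustrative, not a vendor spec):

```python
# Back-of-the-envelope: airflow needed to carry away a given heat load with air.
# Volumetric flow = P / (rho * cp * dT); the constants below are my assumptions.
RHO_AIR = 1.2        # kg/m^3, near sea level
CP_AIR = 1005        # J/(kg*K)
DELTA_T = 20         # K, assumed inlet-to-exhaust temperature rise
M3S_TO_CFM = 2118.88

def airflow_cfm(power_watts, delta_t=DELTA_T):
    m3_per_s = power_watts / (RHO_AIR * CP_AIR * delta_t)
    return m3_per_s * M3S_TO_CFM

print(f"10 kW system: ~{airflow_cfm(10_000):.0f} CFM")  # ~880 CFM
print(f"14 kW system: ~{airflow_cfm(14_000):.0f} CFM")  # ~1230 CFM
print(f"extra 4 kW:   ~{airflow_cfm(4_000):.0f} CFM")   # ~350 CFM
```

Multiply that extra airflow across every box in a row of racks and it's easy to see why denser GPU deployments keep pushing toward liquid cooling.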
Heh, flattery will get you everywhere!
: )
I trust Nvidia and datacenter operators to appreciate the ever-increasing challenges posed by cooling. I also agree with you that these aren't easy problems to solve. I'm no expert in these subject areas, so I find it interesting to hear about the struggles of even those who are.