News Faulty Nvidia H100 GPUs and HBM3 memory contributed to failures every three hours during Meta's Llama 3 training — 16,384 GPU cluster detailed in w...

High-rel humblebrags and chassis adequacy are good to have. Sooo... looking at soft failures? Ah, the 96-page ops report is linked at the top of the article; better than being there.
 
Considering that a 16,384-GPU cluster experienced 419 failures in 54 days (7.76 per 24 hours, or a failure every three hours), we can only wonder how often xAI's cluster containing 100,000 H100 GPUs, a six-fold increase in the number of components that could fail, will experience a failure.

Well if a 16,384 GPU cluster had a 2.56% failure rate with 419 GPU failures, then a 100,000 GPU cluster experiencing the same rate would see 2,560 failures.

The question is if they're covered by warranty.
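For what it's worth, here's that back-of-the-envelope scaling as a small Python sketch. The 419 failures, 54 days, and 16,384 GPUs are the article's numbers; treating every interruption as a GPU failure and scaling linearly to 100,000 GPUs are the simplifying assumptions.

```python
# Back-of-the-envelope scaling of the reported interruption rate.
observed_failures = 419      # unexpected interruptions reported over the run
cluster_gpus = 16_384        # H100s in Meta's Llama 3 training cluster
run_days = 54                # length of the reported training run

failure_rate = observed_failures / cluster_gpus   # ~2.56%, if every interruption counted against a GPU
failures_per_day = observed_failures / run_days   # ~7.76 per day, i.e. one every ~3.1 hours

# Scale linearly to a hypothetical 100,000-GPU cluster (assumes the same per-GPU rate).
xai_gpus = 100_000
expected_failures = failure_rate * xai_gpus                          # ~2,557 (rounding via 2.56% gives 2,560)
interval_hours = 24 / (failures_per_day * xai_gpus / cluster_gpus)   # ~0.5 hours between interruptions

print(f"rate: {failure_rate:.2%}, expected failures: {expected_failures:.0f}, interval: {interval_hours:.1f} h")
```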
 
Well if a 16,384 GPU cluster had a 2.56% failure rate with 419 GPU failures, then a 100,000 GPU cluster experiencing the same rate would see 2,560 failures.

The question is if they're covered by warranty.
It'll be really interesting to see the full bathtub curve over the system lifetime.

These clusters have to run about three years; call that 20 × 54 days. If the failure rate stayed flat over that time (was this the cluster's first run after commissioning, or has it been around a while already?), that would mean roughly 50% of GPUs fail... Ouch! Does not seem very economical, or confidence-inspiring.
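Roughly, with the same headline numbers and the big flat-rate assumption, the three-year extrapolation looks like the sketch below; the compounding version, which asks how likely a given GPU is to fail at least once, comes out a bit lower than the simple 20 × 2.56%.

```python
# Extrapolate the 54-day rate over a ~3-year service life (flat-rate assumption).
rate_per_window = 419 / 16_384   # ~2.56% per 54-day window, taking the headline number at face value
windows = round(3 * 365 / 54)    # ~20 windows of 54 days in three years

linear = rate_per_window * windows                   # simple multiplication: ~51%
compounding = 1 - (1 - rate_per_window) ** windows   # P(a given GPU fails at least once): ~40%

print(f"linear: {linear:.0%}, compounding: {compounding:.0%}")
```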
 
Well if a 16,384 GPU cluster had a 2.56% failure rate with 419 GPU failures, then a 100,000 GPU cluster experiencing the same rate would see 2,560 failures.

The question is if they're covered by warranty.
Your numbers are wrong because the title is misleading, bordering on clickbait, and because you didn't bother to read any of the article. 419 is the total number of unexpected interruptions of any kind, not just the GPU. There were 148 GPU failures (0.90% failure rate) and 72 HBM3 failures (0.44%) for a total H100 failure rate of 1.34%.
 
Your numbers are wrong because the title is misleading, bordering on clickbait, and because you didn't bother to read any of the article. 419 is the total number of unexpected interruptions of any kind, not just the GPU. There were 148 GPU failures (0.90% failure rate) and 72 HBM3 failures (0.44%) for a total H100 failure rate of 1.34%.

Actually we are both wrong; it's 268, or about 1.64%, going by the category numbers.

Faulty GPU - 148
Faulty HBM3 - 72
Faulty SRAM - 19
Faulty GPU Processor - 17
Silent data corruption - 6
Thermal Interface Sensor - 6

 
Actually we are both wrong; it's 268, or about 1.64%, going by the category numbers.

Faulty GPU - 148
Faulty HBM3 - 72
Faulty SRAM - 19
Faulty GPU Processor - 17
Silent data corruption - 6
Thermal Interface Sensor - 6

Not all of those issues would require a GPU replacement. Regardless, the point stands that the article title is misleading and the number is much lower than 419 failures.
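As a quick sanity check on the category arithmetic, here is the sum of the GPU-related rows quoted above against the article's 16,384-GPU cluster (a sketch; category names copied from the breakdown):

```python
# GPU-related interruption categories from the quoted breakdown.
gpu_related = {
    "Faulty GPU": 148,
    "Faulty HBM3": 72,
    "Faulty SRAM": 19,
    "Faulty GPU Processor": 17,
    "Silent data corruption": 6,
    "Thermal Interface Sensor": 6,
}

total = sum(gpu_related.values())   # 268 incidents attributable to the GPU or its memory
rate = total / 16_384               # ~1.64% of the cluster over the 54-day run
print(f"{total} GPU-related interruptions, {rate:.2%} of the cluster")
```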
 
Even at a failure rate of "only 1.64%," this is over a rather short 54-day computing workload. Normalizing to an annualized failure rate and assuming the rate wouldn't climb (silicon degradation would accelerate the FR, but let's ignore that), Meta would be looking at roughly an 11% AFR.

I guess the upside is that the GPUs would still be under warranty, so then it just becomes an additional labor cost and a tiny slab of productivity lost on training.
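The annualization is just the 54-day rate scaled to 365 days, as in this sketch (again assuming the rate stays flat, which is the optimistic case):

```python
# Annualize the 54-day GPU-related failure rate (flat-rate assumption).
rate_54_days = 268 / 16_384     # ~1.64% over the reported run
afr = rate_54_days * 365 / 54   # ~11% annualized failure rate
print(f"AFR: {afr:.1%}")
```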
 
The link to the study is a temporary CDN URL which has expired, so it doesn't work any more. Can you please update it with the original link?
 
we can only wonder how often xAI's cluster containing 100,000 H100 GPUs ... will experience a failure.
Well, about 6x as often.
I suppose since failures are expected, you just write everything with checkpoint/restart logic and it's fine.
But let's see if xAI ever gets to 100k GPUs.
I think some of the humongous plans we've heard for LLM training systems actually became obsolete two years ago.
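For anyone curious what that checkpoint/restart pattern looks like in miniature, here is a hypothetical Python sketch; the function names and the pickle-to-local-file approach are illustrative only, since real training jobs checkpoint framework state to durable shared storage and rely on the scheduler to relaunch the job.

```python
import os
import pickle

# Hypothetical path; real runs write checkpoints to durable shared storage.
CHECKPOINT = "checkpoint.pkl"

def save_state(step, state):
    # Write atomically so a crash mid-write can't corrupt the last good checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, CHECKPOINT)

def load_state():
    # Resume from the most recent checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {}

def train(total_steps=1_000, checkpoint_every=100):
    step, state = load_state()            # after any interruption, pick up where we left off
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % checkpoint_every == 0:
            save_state(step, state)       # checkpoint often enough that a failure loses little work

if __name__ == "__main__":
    train()
```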
 