In Meta's 16,384-GPU Nvidia H100 cluster used to train Llama 3, something broke down roughly every three hours. According to Meta, faulty H100 GPUs and their HBM3 memory were to blame in most cases.
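To put that failure rate in perspective, here is a rough back-of-envelope sketch (not from Meta's write-up) of what one interruption every three hours implies per GPU, assuming failures are independent, spread uniformly across the cluster, and each traceable to a single GPU:

```python
# Back-of-envelope sketch (assumptions, not Meta's figures): if a 16,384-GPU
# cluster sees one failure roughly every three hours, what does that imply
# for each individual GPU? Assumes independent, uniformly distributed failures
# and that every interruption traces back to a single GPU.

cluster_gpus = 16_384
cluster_mtbf_hours = 3          # one interruption roughly every three hours

per_gpu_mtbf_hours = cluster_gpus * cluster_mtbf_hours
per_gpu_mtbf_years = per_gpu_mtbf_hours / (24 * 365)
annualized_failure_rate = 1 / per_gpu_mtbf_years

print(f"Implied per-GPU MTBF: {per_gpu_mtbf_hours:,} hours "
      f"(~{per_gpu_mtbf_years:.1f} years)")
print(f"Implied annualized failure rate per GPU: {annualized_failure_rate:.1%}")
```

Under those simplifying assumptions, each individual GPU would still go years between failures; it is only at the scale of tens of thousands of accelerators that a failure every few hours becomes the norm.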