News: World's Fastest Supercomputer Can't Run a Day Without Failure

Co BIY

Splendid
At this scale, with the huge number of parts, there will always be a problem somewhere in the system, but the design should be able to work around it.

60 million parts!

If mean time between failures is going to be hours rather than days, are you reaching the point where bigger is not better?

Would a system half the size, which could solve the same problem in double the time, be a better solution?
 

kanewolf

Titan
Moderator
At this scale, with the huge number of parts, there will always be a problem somewhere in the system, but the design should be able to work around it.

60 million parts!

If mean time between failures is going to be hours rather than days, are you reaching the point where bigger is not better?

Would a system half the size, which could solve the same problem in double the time, be a better solution?
YES. This is definitely the case. AND the software is written to tolerate failures. If a thread doesn't complete because the CPU died, the scheduler times out the thread, attempts to kill it, then resubmits the work on another CPU.
Writing software for 1,000 or 10,000 threads is a VERY different problem than typical desktop software. Checkpoints, abort logic, and restart logic are baked into the software.
CRCs are created as part of data sets and checked before trusting them. Reliability is fundamental to high-performance computing.
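In rough illustrative Python (my own toy names and numbers, not any real scheduler's API), that resubmit-and-verify pattern looks something like this:

```python
import random
import zlib

class Node:
    """Toy stand-in for a compute node that occasionally dies mid-task."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def run(self, item):
        if random.random() < 0.05:               # pretend ~5% of tasks kill the node
            self.alive = False
            return None                          # the scheduler sees this as a timeout
        payload = f"result-of-{item}".encode()
        return payload, zlib.crc32(payload)      # data plus the CRC computed at the source

def run_job(work_items, nodes):
    pending, done = list(work_items), {}
    while pending:
        healthy = [n for n in nodes if n.alive]
        if not healthy:
            raise RuntimeError("all nodes failed")   # a real system would page an operator
        item = pending.pop(0)
        result = random.choice(healthy).run(item)
        if result is None:                       # node died: resubmit the work elsewhere
            pending.append(item)
            continue
        payload, crc = result
        if zlib.crc32(payload) != crc:           # don't trust the data until the CRC matches
            pending.append(item)                 # treat it as corrupt and redo the work
            continue
        done[item] = payload
    return done

nodes = [Node(f"node{i}") for i in range(8)]
print(len(run_job(range(20), nodes)), "work items completed despite failures")
```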
 

bit_user

Titan
Ambassador
Wait I thought AMD can't do anything wrong....

Should have used Intel.
At the risk of piling on, I'll point out two relevant claims from the article:
  1. "others indicated that AMD’s Instinct MI250X compute GPUs were not as reliable as expected this year"
  2. “A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we are seeing,” the head of OLCF said. “It is a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”
Taken together with the part about "60 million parts", it does seem the reliability target would need to be awfully high to achieve a decent MTBF. Although, exactly what do they consider a "part"? Perhaps each independently replaceable unit, like a memory DIMM or a SSD? Do cables count?
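As a back-of-the-envelope illustration of just how high that target would be (my own assumptions: every one of those 60 million "parts" counts, failures are independent, and any part failing counts as a system failure):

```python
# Back-of-envelope only; treats the machine as a series system of independent parts.
N_PARTS = 60_000_000                  # the "60 million parts" figure from the article
TARGET_SYSTEM_MTBF_HOURS = 24         # suppose we wanted a failure-free day on average

# For independent exponential failures, the failure rates add up, so:
#   MTBF_system ~ MTBF_part / N   =>   MTBF_part ~ MTBF_system * N
required_part_mtbf_hours = TARGET_SYSTEM_MTBF_HOURS * N_PARTS
print(f"each part would need an MTBF of ~{required_part_mtbf_hours / 8766:,.0f} years")
# -> roughly 164,000 years per part, which is why "hours, not days" isn't so shocking
```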

And #1 doesn't characterize the sources or whether the problems are hardware vs. software. AMD has been pouring resources into its HIP software stack recently. It's a CUDA work-alike, and much younger than CUDA. There's a decent chance the sources are seeing issues with it, which could be shaken out as it matures. Or not. We simply don't know, but we should beware of over-interpreting the small amount of information presented here.

At this scale, with the huge number of parts, there will always be a problem somewhere in the system, but the design should be able to work around it.
Yes, I would expect that fault tolerance would be an essential aspect of the architecture, at this scale. In the past, I think the standard approach for HPC was to take periodic snapshots of the system that you could resume from, if a failure happened subsequently. Think of it like "saved games".

That works up to a certain point, but if the MTBF drops low enough, the amount of lost work could start to become significant. Also, snapshots aren't free - it would seem like you'd need a global barrier to essentially freeze the entire system. So, maybe you could use a more granular form of snapshots that would enable the computation to be resumed from an individual node? Or a transaction model which always saves the inputs from each parcel of work until the output has been collected and consolidated?
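For the "saved games" approach, the core mechanism is just periodic serialization of state plus a resume path. A minimal sketch (hypothetical file name and state layout, nothing specific to how Frontier actually does it):

```python
import os
import pickle

CKPT = "checkpoint.pkl"                    # hypothetical checkpoint file

def load_or_init():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)          # resume from the last snapshot
    return {"step": 0, "state": 0.0}       # otherwise start fresh

def save(snapshot):
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:             # write-then-rename, so a crash mid-save
        pickle.dump(snapshot, f)           # never corrupts the previous checkpoint
    os.replace(tmp, CKPT)

snap = load_or_init()
for step in range(snap["step"], 1_000):
    snap["state"] += 0.001                 # stand-in for the real computation
    snap["step"] = step + 1
    if snap["step"] % 100 == 0:            # checkpoint every 100 steps; on a failure,
        save(snap)                         # at most 100 steps of work are lost
```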


A big question I have is whether most jobs run on all machines. Not all computations will scale terribly well, so it seems like you could partition up the machines to concurrently run smaller jobs. Perhaps some jobs even have enough slack that you could run multiple jobs on the same machines, concurrently. If I had so much expensive hardware sitting around and depreciating, I know I'd sure be looking for ways to maximize utilization!
 

kanewolf

Titan
Moderator
At the risk of piling on, I'll point out two relevant claims from the article:
  1. "others indicated that AMD’s Instinct MI250X compute GPUs were not as reliable as expected this year"
  2. “A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we are seeing,” the head of OLCF said. “It is a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”
Taken together with the part about "60 million parts", it does seem the reliability target would need to be awfully high to achieve a decent MTBF. Although, exactly what do they consider a "part"? Perhaps each independently replaceable unit, like a memory DIMM or a SSD? Do cables count?

And #1 doesn't characterize the sources or whether the problems are hardware vs. software. AMD has been pouring resources into its HIP software stack recently. It's a CUDA work-alike, and much younger than CUDA. There's a decent chance the sources are seeing issues with it, which could be shaken out as it matures. Or not. We simply don't know, but we should beware of over-interpreting the small amount of information presented here.


Yes, I would expect that fault tolerance would be an essential aspect of the architecture, at this scale. In the past, I think the standard approach for HPC was to take periodic snapshots of the system that you could resume from, if a failure happened subsequently. Think of it like "saved games".

That works up to a certain point, but if the MTBF drops low enough, the amount of lost work could start to become significant. Also, snapshots aren't free - it would seem like you'd need a global barrier to essentially freeze the entire system. So, maybe you could use a more granular form of snapshots that would enable the computation to be resumed from an individual node? Or a transaction model which always saves the inputs from each parcel of work until the output has been collected and consolidated?


A big question I have is whether most jobs run on all machines. Not all computations will scale terribly well, so it seems like you could partition up the machines to concurrently run smaller jobs. Perhaps some jobs even have enough slack that you could run multiple jobs on the same machines, concurrently. If I had so much expensive hardware sitting around and depreciating, I know I'd sure be looking for ways to maximize utilization!
My experience with large systems like this is that they are partitioned in a standard configuration (8 x 1000 nodes plus 4 x 2500 nodes plus 1 x 10000 node for example). IF there is sufficient justification made to the administration, that configuration is temporarily changed. The job requiring an unusual configuration is run, then the hardware is normalized. If your software can "only" use 500 nodes, you are scheduled on a 1000 node partition, but maybe at a lower priority so that jobs that will fully utilize the partition are run first.
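To make that concrete, here's a toy model of the policy (my own numbers and scoring, not how a real workload manager such as Slurm is actually configured): pick the smallest partition a job fits in, and push under-filling jobs down the queue:

```python
# Toy model: fixed partitions, smallest fit wins, under-filling jobs get lower priority.
PARTITIONS = [1000] * 8 + [2500] * 4 + [10000]      # the example layout from the post

def schedule(jobs):
    """jobs: list of (name, nodes_needed) -> plan ordered by partition utilization."""
    plan = []
    for name, need in jobs:
        fits = sorted(p for p in set(PARTITIONS) if p >= need)
        if not fits:
            plan.append((name, None, 0.0))          # would need a temporary reconfiguration
            continue
        part = fits[0]                              # smallest standard partition that fits
        plan.append((name, part, need / part))      # e.g. 500 nodes on 1000 -> 0.5
    # Fuller jobs run first; a half-empty partition sorts toward the back of the queue
    return sorted(plan, key=lambda job: job[2], reverse=True)

print(schedule([("climate", 1000), ("qcd", 500), ("fusion", 12000)]))
# [('climate', 1000, 1.0), ('qcd', 1000, 0.5), ('fusion', None, 0.0)]
```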
 

bit_user

Titan
Ambassador
If your software can "only" use 500 nodes, you are scheduled on a 1000 node partition, but maybe at a lower priority so that jobs that will fully utilize the partition are run first.
You got me excited when you said "lower priority", but then you say the jobs are still run sequentially.

I think it'd be cool to have a way to see how much idle capacity exists for different job types, because I'll bet a lot of jobs could be doubled or tripled up, with little impact on the individual job.

I wonder if these supercomputers will ever be made fully obsolete by cloud computing. It just seems like the cloud computing model (which does optimize resource usage) is so much more efficient. I could imagine a 10x cost advantage, with only something like a 2-3x performance hit, from running a typical supercomputer job on the cloud.
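Just to spell out my own hypothetical numbers there, the 10x vs. 2-3x trade-off would still come out well ahead on work per dollar:

```python
# Purely hypothetical figures from the paragraph above, nothing measured
cloud_cost_advantage = 10      # assume cloud is 10x cheaper per unit of compute time
cloud_slowdown = 2.5           # ...but the same job takes ~2-3x longer to finish there

work_per_dollar_ratio = cloud_cost_advantage / cloud_slowdown
print(f"~{work_per_dollar_ratio:.0f}x more work per dollar, under these assumptions")
```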
 
  • Like
Reactions: gg83

kanewolf

Titan
Moderator
I wonder if these supercomputers will ever be made fully obsolete by cloud computing. It just seems like the cloud computing model (which does optimize resource usage) is so much more efficient. I could imagine a 10x cost advantage, with only something like a 2-3x performance hit, from running a typical supercomputer job on the cloud.
IMO, not any time soon. Jobs that require an HPC implementation take a much greater performance penalty in the cloud. Cloud implementations have lots of cores, but low I/O capacity compared to a supercomputer. HUGE datasets, like those created by the Large Hadron Collider, can't be processed in commercial clouds. The two types of computing are not competitors.
 

Co BIY

Splendid
If I had so much expensive hardware sitting around and depreciating, I know I'd sure be looking for ways to maximize utilization!

When doing some reading at the LUMI (Finnish supercomputer) site, it looked like getting enough researchers to bring them worthy workloads was a challenge (or maybe a focus). They have the additional complication of needing to allocate resources to financial partner countries in an acceptable manner. I had earlier read that a real concern for the project is whether the continent had enough computer researchers with the expertise to make use of the machine. OTOH, who wants to spend time learning how to use something not available to you? Bit of a catch-22.

Luckily for the US Department of Energy the calculation: (idle time rate x rate of depreciation of a #1 supercomputer = ) is likely a state secret. {Insert your own Trump declassification joke}
 

bit_user

Titan
Ambassador
Cloud implementations have lots of cores, but low I/O capacity compared to a supercomputer. HUGE datasets, like those created by the Large Hadron Collider, can't be processed in commercial clouds. The two types of computing are not competitors.
Cloud computing offers GPUs and certainly not insubstantial network and I/O speeds. It seems to me that the main places where you lose out are on single-thread performance and node-to-node latency.

I think that cloud computing is nibbling away at HPC, year by year. I'm sure that, for 5-10 years, I've been seeing a drip-feed of news about HPC cloud, where people are teaming up supercomputing resources across sites. More recently, I think I've seen announcements from the likes of AWS and Google for specifically HPC-oriented contracts.

Anyway, I found a recent article which examines this specific question, and comes to a similar conclusion to what I'm thinking:

Some other relevant articles:
 
  • Like
Reactions: gg83

bit_user

Titan
Ambassador
I had earlier read that a real concern for the project is whether the continent had enough computer researchers with the expertise to make use of the machine. OTOH, who wants to spend time learning how to use something not available to you? Bit of a catch-22.
Yeah, and this is where Nvidia's CUDA strategy was brilliant. By supporting CUDA up and down their product stack, they created a ready workforce because students could start using it on their gaming GPUs.

By contrast, AMD's GPU compute strategy has been disastrous for them. So many were turned off by the lack of compute support for consumer hardware and by buggy or late driver support.

Add to that AMD's strategic flip-flopping between OpenCL and HIP, and you get market confusion. Also, if HIP is supposed to be CUDA-like, it looks like a cheap imitation without a compelling value of its own. That creates the perception that you're better off just going for "the real deal" and using CUDA, if you can.

Luckily for the US Department of Energy the calculation: (idle time rate x rate of depreciation of a #1 supercomputer = ) is likely a state secret.
That's exactly right. Government is going to be the only one with pockets deep enough and a legitimate need (i.e., for dedicated computing resources that are segregated from the public cloud).
 

kanewolf

Titan
Moderator
I think that cloud computing is nibbling away at HPC, year by year. I'm sure that, for 5-10 years, I've been seeing a drip-feed of news about HPC cloud, where people are teaming up supercomputing resources across sites. More recently, I think I've seen announcements from the likes of AWS and Google for specifically HPC-oriented contracts.
I don't disagree with this. AWS has improved their offerings: 25Gb Ethernet rather than 10Gb, and you can get exclusive access to hardware. But that costs significantly more than traditional (shared) cloud access. Shared access has few guarantees on latency, so anything that is real-time can have issues. You may not be able to capture real-time 5Gb/s data, for example, even if you have a 25Gb/s link, because you might be time-sliced out in a shared environment for too long and miss data.
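Rough numbers for that real-time capture point (the buffer size is my own assumption, purely illustrative):

```python
# Why getting descheduled hurts real-time capture: the data doesn't wait for you.
stream_gbps = 5                       # incoming data rate from the example above
buffer_mb = 64                        # assume a 64 MB receive/ring buffer on the host

fill_rate_mb_per_ms = stream_gbps * 1e9 / 8 / 1e6 / 1e3    # ~0.625 MB per millisecond
max_stall_ms = buffer_mb / fill_rate_mb_per_ms
print(f"a stall longer than ~{max_stall_ms:.0f} ms overflows the buffer and drops data")
# ~102 ms here -- a heavily shared host can plausibly keep you off the CPU that long
```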
Again, IMO supercomputers and cloud are complementary technologies. There are problems that work better on one but not the other.
 
  • Like
Reactions: TesseractOrion

bit_user

Titan
Ambassador
Again, IMO supercomputers and cloud are complementary technologies. There are problems that work better on one but not the other.
Sounds like we mostly agree. Where there could be a "tipping point", of sorts, is if the HPC market gets too small to sustain itself. HPC vendors would react by offering hardware solutions that more closely resemble hyperscale/cloud hardware, to the extent that dedicated HPC no longer has a compelling value proposition, except for niche applications.

The whole thing is a feedback loop, because the smaller the market for exotic hardware gets, the more expensive it becomes, and that pushes even more users onto cloud offerings. Meanwhile, capitalizing on this trend, cloud providers are all too willing to cater to some specific needs of HPC users, like in the Google Cloud HPC link I posted above.
 

neojack

Honorable
I'm surprised that the MTBF of those systems is measured in "hours".

I mean, one can reasonably expect some redundancy at that scale?
A failing GPU or CPU should not bring the whole computer to a halt.
 
  • Like
Reactions: bit_user

cc2onouui

Prominent
Head of Oak Ridge Leadership Computing Facility: You will have failures at this scale.
“You are going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”

You have to love it when a journalist plays like he didn't get the meaning of the statement. It obviously means that with more complexity comes more needed maintenance ("the machine uses 60 million parts in total"), and he didn't blame AMD hardware; he said exactly: "I don't think that at this point that we have a lot of concern over the AMD products." So go ahead and complete your "sorry, I'm killing this business's reputation by mistake, I didn't mean to" act. It is so obvious what he said and what he meant, and it's obvious what you are doing... you are not just a kid... what a corrupted world we live in.