Head of Oak Ridge Leadership Computing Facility: You will have failures at this scale.
World's Fastest Supercomputer Can't Run a Day Without Failure : Read more
> Wait I thought AMD can't do anything wrong....
Are we trolling? Because, unless you're willing, and able, to back that statement, and show us that there's any serious claim being made that "AMD can't do anything wrong," then you're trolling.
Wait I thought AMD can't do anything wrong....
> At this scale with the huge number of parts there will always be a problem somewhere in the system but the design should be able to work around them.
YES. This is definitely the case. AND the software is written to tolerate failures. If a thread doesn't complete because the CPU died, the scheduler times out the thread, attempts to kill it, then resubmits the work on another CPU.
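Roughly the pattern being described, as a minimal Python sketch. The function names, timeout, and retry count here are hypothetical, not anything from Frontier's actual software stack:

```python
# Minimal sketch of timeout-and-resubmit fault tolerance (hypothetical,
# not the actual scheduler logic): if a unit of work doesn't finish in
# time, assume the node/CPU is gone and hand the same work to another worker.
from concurrent.futures import ProcessPoolExecutor, TimeoutError

def run_with_resubmit(work_fn, work_item, pool: ProcessPoolExecutor,
                      timeout_s=300.0, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        future = pool.submit(work_fn, work_item)     # schedule on some worker
        try:
            return future.result(timeout=timeout_s)  # wait, but not forever
        except TimeoutError:
            # cancel() can't kill an already-running process; a real system
            # would fence off the bad node and requeue the job step instead.
            future.cancel()
            print(f"attempt {attempt} timed out; resubmitting {work_item!r}")
        except Exception as exc:                     # the worker crashed outright
            print(f"attempt {attempt} failed ({exc}); resubmitting {work_item!r}")
    raise RuntimeError(f"{work_item!r} failed after {max_attempts} attempts")
```

In production this happens at the batch-scheduler level (requeueing a job or job step) rather than inside the application, but the control flow is the same: detect, kill, resubmit.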
60 million parts !
If mean time between failure is going to be hours rather than days, are you reaching the point where bigger is not better?
Would a system half the size that could solve the same problem in double the time be a better solution?
> Should have used Intel.
At the risk of piling on, I'll point out two relevant claims from the article:
> At the risk of piling on, I'll point out two relevant claims from the article:
My experience with large systems like this is that they are partitioned in a standard configuration (8 x 1000 nodes plus 4 x 2500 nodes plus 1 x 10,000 nodes, for example). If there is sufficient justification made to the administration, that configuration is temporarily changed. The job requiring an unusual configuration is run, then the hardware is normalized. If your software can "only" use 500 nodes, you are scheduled on a 1000-node partition, but maybe at a lower priority so that jobs that will fully utilize the partition are run first.
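A toy illustration of that policy, with made-up job names and sizes (not the actual OLCF scheduler or configuration): jobs that fully use a partition sort ahead of jobs that would leave half of it idle.

```python
# Toy illustration of "full-partition jobs run first" (made-up jobs,
# not a real scheduler): order the queue by how much of the partition
# each job would actually use.
PARTITION_NODES = 1000

queued_jobs = [
    {"name": "half_size_job", "nodes": 500},
    {"name": "full_size_job_a", "nodes": 1000},
    {"name": "full_size_job_b", "nodes": 1000},
]

def schedule_order(jobs, partition_nodes):
    # Highest utilization first; Python's sort is stable, so ties keep
    # their original submission order.
    return sorted(jobs, key=lambda j: j["nodes"] / partition_nodes, reverse=True)

for job in schedule_order(queued_jobs, PARTITION_NODES):
    print(f'{job["name"]}: {job["nodes"]} of {PARTITION_NODES} nodes')
# full_size_job_a and full_size_job_b are dispatched before half_size_job
```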
Taken together with the part about "60 million parts", it does seem the reliability target would need to be awfully high to achieve a decent MTBF. Although, exactly what do they consider a "part"? Perhaps each independently replaceable unit, like a memory DIMM or an SSD? Do cables count?
- "others indicated that AMD’s Instinct MI250X compute GPUs were not as reliable as expected this year"
- “A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we are seeing,” the head of OLCF said. “It is a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”
And #1 doesn't characterize the sources or whether the problems are hardware vs. software. AMD has been pouring resources into their HIP software stack recently. It's a CUDA work-alike, and much younger than CUDA. There's a decent chance the sources are seeing issues with it, which could be shaken out as it matures. Or not. We simply don't know, but we should beware of over-interpreting the small amount of information presented here.
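To put rough numbers on the "60 million parts" point above: assuming independent parts with exponentially distributed failures (a big simplification), failure rates add, so the system MTBF is the part MTBF divided by the part count.

```python
# Back-of-the-envelope: with ~60 million independent parts, how reliable
# must each part be for the whole system to average one failure per day?
# (Assumes exponential, independent failures; real systems are messier.)
N_PARTS = 60e6
TARGET_SYSTEM_MTBF_H = 24            # one failure per day

required_part_mtbf_h = TARGET_SYSTEM_MTBF_H * N_PARTS
print(f"{required_part_mtbf_h:.2e} hours per part "
      f"(~{required_part_mtbf_h / 8766:.0f} years)")
# ~1.44e9 hours, on the order of 160,000 years per part
```

Which is why even extremely reliable parts still give a system-level MTBF measured in hours rather than days.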
> At this scale with the huge number of parts there will always be a problem somewhere in the system but the design should be able to work around them.
Yes, I would expect that fault tolerance would be an essential aspect of the architecture, at this scale. In the past, I think the standard approach for HPC was to take periodic snapshots of the system that you could resume from, if a failure happened subsequently. Think of it like "saved games".
That works up to a certain point, but if the MTBF drops low enough, the amount of lost work could start to become significant. Also, snapshots aren't free - it would seem like you'd need a global barrier to essentially freeze the entire system. So, maybe you could use a more granular form of snapshots that would enable the computation to be resumed from an individual node? Or a transaction model which always saves the inputs from each parcel of work until the output has been collected and consolidated?
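On the "snapshots aren't free" trade-off, the usual rule of thumb is Young's approximation: checkpoint roughly every sqrt(2 × checkpoint cost × MTBF). The numbers below are purely illustrative, not measured values for any real machine:

```python
# Young's approximation for the checkpoint interval that balances
# time spent writing snapshots against work lost when a failure hits.
# Illustrative numbers only.
from math import sqrt

def optimal_checkpoint_interval_s(checkpoint_cost_s, mtbf_s):
    return sqrt(2 * checkpoint_cost_s * mtbf_s)

mtbf_s = 8 * 3600           # suppose the system fails about every 8 hours
checkpoint_cost_s = 5 * 60  # suppose a global snapshot takes ~5 minutes
interval_s = optimal_checkpoint_interval_s(checkpoint_cost_s, mtbf_s)
print(f"checkpoint every ~{interval_s / 60:.0f} minutes")  # ~69 minutes
```

As the MTBF shrinks, the optimal interval only shrinks with its square root, so the fraction of runtime spent checkpointing and redoing lost work keeps growing, which is exactly the concern above.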
A big question I have is whether most jobs run on all machines. Not all computations will scale terribly well, so it seems like you could partition up the machines to concurrently run smaller jobs. Perhaps some jobs even have enough slack that you could run multiple jobs on the same machines, concurrently. If I had so much expensive hardware sitting around and depreciating, I know I'd sure be looking for ways to maximize utilization!
> Just imagine what the comment section would have been if it was Intel instead of AMD.
You don't have to imagine:
> If your software can "only" use 500 nodes, you are scheduled on a 1000-node partition, but maybe at a lower priority so that jobs that will fully utilize the partition are run first.
You got me excited when you said "lower priority", but then you say the jobs are still run sequentially.
> I wonder if these supercomputers will ever be made fully obsolete by cloud computing. It just seems like the cloud computing model (which does optimize resource usage) is so much more efficient. I could imagine a 10x cost advantage, with only something like a 2-3x performance hit, from running a typical supercomputer job on the cloud.
IMO, not any time soon. Jobs that require an HPC implementation have a much greater performance penalty in the cloud. Cloud implementations have lots of cores, but low I/O capacity compared to a supercomputer. HUGE datasets, like those created by the Hadron collider, can't be processed in commercial clouds. The two types of computing are not competitors.
> Cloud implementations have lots of cores, but low I/O capacity compared to a supercomputer. HUGE datasets, like those created by the Hadron collider, can't be processed in commercial clouds. The two types of computing are not competitors.
Cloud computing offers GPUs and certainly not insubstantial network and I/O speeds. It seems to me that the main places where you lose out are on single-thread performance and node-to-node latency.
> I had earlier read that a real concern for the project is whether the continent had enough computer researchers with the expertise to make use of the machine. OTOH who wants to spend time learning how to use something not available to you. Bit of a catch 22.
Yeah, and this is where Nvidia's CUDA strategy was brilliant. By supporting CUDA up and down their product stack, they created a ready workforce, because students could start using it on their gaming GPUs.
> Luckily for the US Department of Energy the calculation: (idle time rate x rate of depreciation of a #1 supercomputer = ) is likely a state secret.
That's exactly right. Government is going to be the only one with pockets deep enough and a legitimate need (i.e. dedicated computing resources that are segregated from the public cloud).
> I think that cloud computing is nibbling away at HPC, year by year. I'm sure that, for 5-10 years, I've been seeing a drip-feed of news about HPC cloud, where people are teaming up supercomputing resources across sites. More recently, I think I've seen announcements from the likes of AWS and Google for specifically HPC-oriented contracts.
I don't disagree with this. AWS has improved their offerings: 25Gb Ethernet rather than 10Gb, and you can get exclusive access to hardware. But that costs significantly more than traditional cloud (shared) access. Shared access has few guarantees on latency, so anything that is real-time can have issues. You may not be able to capture a real-time 5Gb stream, for example, even if you have a 25Gb link, because you might be time-sliced out in a shared environment for too long and miss data.
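To put a number on that real-time example (illustrative only, assuming a fixed receive buffer and a steady 5 Gb/s ingest rate): the buffer size sets how long the VM can be descheduled before data is dropped.

```python
# Illustrative only: how long can a shared VM be scheduled out before a
# fixed receive buffer overflows at a steady ingest rate?
def max_stall_ms(buffer_mib, ingest_gbps):
    buffer_bits = buffer_mib * 1024 * 1024 * 8
    return buffer_bits / (ingest_gbps * 1e9) * 1000

print(f"{max_stall_ms(64, 5):.0f} ms")   # 64 MiB buffer at 5 Gb/s -> ~107 ms
print(f"{max_stall_ms(256, 5):.0f} ms")  # 256 MiB buffer -> ~430 ms
```

A noisy neighbor or a scheduling gap longer than that budget means lost samples, no matter how fast the link is.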
> Again, IMO supercomputers and cloud are complementary technologies. There are problems that work better on one but not the other.
Sounds like we mostly agree. Where there could be a "tipping point" of sorts is if the HPC market gets too small to sustain itself. HPC vendors would react by offering hardware solutions that more closely resemble hyperscale/cloud hardware, to the extent that dedicated HPC no longer has a compelling value proposition, except for niche applications.
> Head of Oak Ridge Leadership Computing Facility: You will have failures at this scale.
“You are going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”