El Capitan? Wasn't that the supercomputer cluster that had abnormally high failure rates for component racks?
That was the clickbait, but the article actually said the failure rate was expected for a system of its size. They were also still in the build-out phase and learning more about the failures that were occurring, so it's likely to become more stable with a bit of maturity.
HPC uses some fault-tolerance techniques, like checkpointing. That way, you don't lose all progress on a simulation if a failure does occur. Also, it's customary to partition these big supercomputers, and a single job very rarely spans all partitions. So, a failure in one part of "the machine" only affects the job using that partition.
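
To make the checkpointing idea concrete, here's a minimal single-process sketch in Python. It's only an illustration, not how any particular HPC site does it: the function names, the `fail_at` knob for simulating a crash, and the pickle-on-local-disk format are all assumptions of mine. Real jobs typically write to a parallel filesystem through a dedicated checkpoint library.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over the checkpoint."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic, so a crash mid-write never corrupts the old checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "value": 0.0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run_simulation(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy 'simulation' that resumes from the last checkpoint.

    fail_at is a hypothetical knob that raises at a given step to mimic a node failure.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["value"] += 1.0  # stand-in for one step of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

If the run dies at step 25 with a checkpoint every 10 steps, the restart picks up from step 20, so you redo 5 steps instead of 25. The checkpoint interval is the trade-off: shorter intervals lose less work per failure but spend more time on I/O.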