News AMD Instinct MI300 Data Center APU Pictured Up Close: 15 Chiplets, 146 Billion Transistors

bit_user

Polypheme
Ambassador
El Capitan? Wasn't that the supercomputer cluster that had abnormally high failure rates for component racks?
That was the clickbait, but the article actually said the failure rate was expected, for a system of its size. They were also still in the build-out phase and learning more about the failures that were occurring. So, it's likely to become more stable with a bit of maturity.

HPC uses some fault-tolerance techniques, like checkpointing. That way, you don't lose all progress on a simulation, if a failure does occur. Also, it's customary to partition these big supercomputers and a single job very rarely spans all partitions. So, a failure in one part of "the machine" only affects the job using that partition.
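
For anyone unfamiliar with checkpointing, here's a minimal sketch of the idea in Python. It's a toy and not how any real HPC code does it (those typically checkpoint to a parallel filesystem through dedicated libraries). The function names, the JSON file format, and the `fail_at` parameter are all made up for illustration. The point is just that state gets saved periodically, so a restart resumes from the last checkpoint instead of from step 0.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write cannot leave a
    half-written checkpoint at `path`.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "value": 0.0}


def run(path, total_steps, checkpoint_every=100, fail_at=None):
    """Toy 'simulation' that resumes from the last checkpoint.

    `fail_at` is a hypothetical knob that simulates a node failure
    when the simulation reaches that step.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["value"] += 1.0  # stand-in for one step of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job dies at step 250 with a checkpoint interval of 100, rerunning it picks up from step 200, so only 50 steps of work are lost. The interval is a trade-off: checkpoint more often and you lose less on a failure, but you spend more time writing state.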
 

bit_user

Polypheme
Ambassador
It will be interesting to see if this device can be used without standard DRAM
Unlikely, given the sizes of the datasets they're talking about using it on. Given when it's launching, I expect it'll use standard DDR5 RDIMMs. Future generations will probably use CXL memory.