News Tesla details how it finds and punishes defective cores on its million-core Dojo supercomputers — a single error can ruin a weeks-long AI training run

Developing and building a wafer-scale processor is an extremely complex task, and only two companies in the industry — Cerebras and Tesla — have accomplished it. Like other processors, these devices are prone to defects and degradation; however, Tesla has developed its own method of identifying faulty processing cores without taking the machine offline, which marks significant progress in wafer-scale reliability.

TSMC, which builds these gargantuan processors for Cerebras and Tesla, states that more companies will adopt wafer-scale designs using its SoIC-SoW technology in the coming years. Apparently, the industry is preparing and gaining experience for this. Little by little.
Every hyperscaler is going to have wafer scale chips soon. Facebook, Apple, Google, Microsoft, Amazon, and Netflix next.
 
Rented a Cyberbeast for a 4000 mile road trip to the redwoods. FSD is mind blowing. In that 4000 miles, I only needed to take control 3 times where I didn't like how close to the inner line it was on super curvy mountain roads. Absolutely incredible.
 
Every hyperscaler is going to have wafer scale chips soon. Facebook, Apple, Google, Microsoft, Amazon, and Netflix next.
This may just be the point where the roads split, and one or another current hyperscaler can no longer keep up and stay in the race.

Doing custom ARM cores and fabric ASICs may be one thing (and a lot of that seems to be semi-custom with Broadcom), but doing your own wafer-scale custom AI design on the order of a Cerebras is quite a different matter.

And I wouldn't bet on Tesla/Dojo actually having succeeded and earning its money just yet, or ever: Tesla is buying a surprisingly large number of Nvidia GPUs, and Mr. Musk's track record on truth may be one of the ever-fewer things he still shares with his ex-boss.
 
The article said:
Tesla says a single silent data error can destroy an entire training run that takes weeks to complete.
First, I thought AI models were generally more resilient to errors than that.

Second, this is not a new problem. In the field of HPC, they rely on checkpointing and revert to an earlier snapshot of the state when an error occurs or is detected.
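For reference, the checkpoint/restore pattern looks roughly like this. A minimal Python sketch; `train_step`, the pickle format, and the checkpoint interval are illustrative placeholders, not Tesla's or any real HPC framework's API:

```python
# Minimal checkpoint/restore sketch (hypothetical training loop).
import os
import pickle

CKPT = "checkpoint.pkl"

if os.path.exists(CKPT):
    os.remove(CKPT)  # start fresh for this demo

def train_step(state):
    # Placeholder for one unit of training work.
    state["step"] += 1
    state["loss"] = 1.0 / state["step"]
    return state

def save_checkpoint(state, path=CKPT):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path=CKPT):
    # Resume from the latest snapshot if one exists, else start over.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss": float("inf")}

state = load_checkpoint()
for _ in range(10):
    state = train_step(state)
    if state["step"] % 5 == 0:  # checkpoint every N steps
        save_checkpoint(state)
# After a crash or a detected error, rerunning the script resumes from
# the last snapshot instead of from step 0.
```

The trade-off is classic: more frequent checkpoints mean less lost work per error, but more I/O overhead per step.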

The article said:
Given the extreme complexity of the Dojo Training Tile (the large wafer-size chip), it isn't easy to detect defective dies even during the manufacturing process
Why not? I really don't understand this claim, even if it were like Cerebras' design (which it isn't). IIRC, Tesla attaches their compute dies to a carrier, which means they should be able to test each of the compute dies first, and only mount the good ones. Of course, you can have errors introduced during the mounting phase, or which occur later, but test-before-mounting should significantly cut down on the rate of hardware defects that make it into the finished assembly.
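The arithmetic behind that test-before-mounting argument is easy to sketch. Assumed, purely illustrative numbers below; real die counts and yields for the Dojo tile aren't public, and mounting-stage errors are ignored:

```python
# Back-of-the-envelope yield comparison for known-good-die testing.
def assembly_yield(die_yield, dies_per_tile, test_before_mount):
    if test_before_mount:
        # Only dies that already passed testing get mounted, so die-level
        # defects don't sink the tile (mounting errors ignored here).
        return 1.0
    # Otherwise every one of the N untested dies must happen to be good.
    return die_yield ** dies_per_tile

untested = assembly_yield(0.95, 25, False)  # 25 dies/tile, 95% die yield
tested = assembly_yield(0.95, 25, True)
print(f"all-good probability without die testing: {untested:.3f}")
```

Even at a healthy 95% per-die yield, an untested 25-die tile has only about a 28% chance of being defect-free, which is why known-good-die flows matter at wafer scale.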

The article said:
the Stress tool has also uncovered a rare design-level flaw
...
In addition, the company intends to extend the method to pre-silicon testing phases and early validation workflows to catch the aforementioned faults even before production, although it is challenging to envision exactly how this might be achieved, as SDCs can occur due to aging.
They previously said it could be used to find design flaws. That's what you'd use it for, in a pre-silicon validation scenario.
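One generic way to surface silent data corruption, which detectors of this kind build on, is redundant execution: run the same deterministic computation several times and flag any disagreement. A minimal sketch; this is a textbook technique, not Tesla's actual Stress tool, whose internals aren't public:

```python
# Generic SDC check via redundant execution (illustrative only).
def detect_sdc(compute, inputs, runs=3):
    """Run the same deterministic computation `runs` times; any
    disagreement indicates a silent data error on some run."""
    results = [compute(inputs) for _ in range(runs)]
    reference = results[0]
    return all(r == reference for r in results)

# A deterministic kernel should agree across runs on healthy hardware.
ok = detect_sdc(lambda v: sum(i * i for i in v), list(range(100)))
```

On real hardware the repeated runs would be scheduled on the suspect core under stress conditions; aging-induced faults can only be caught this way in the field, which is why extending such a tool to pre-silicon validation targets design flaws rather than SDCs.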
 