News Tachyum Submits Bid to Build 20 Exaflops Supercomputer

What about the software support? Surely the DoE does not want to rebuild their (already existing) software stack to fully utilize the maximum capability of the new hardware ... 🤔
 
What about the software support? Surely the DoE does not want to rebuild their (already existing) software stack to fully utilize the maximum capability of the new hardware ... 🤔

If their software stack is not Linux and standard HPC libraries, they're next-level stupid. So it should just be a recompile and tuning (some of which will already have been done by the hardware vendor) to move to a new architecture.
 
That was all before Tachyum sued Cadence, its intellectual property provider, for lower-than-expected performance of its Prodigy processor.
Uh, that's not what the prior article said. I went back to check, and the only complaints it mentioned were basically lack of functional and timely delivery of the promised IP. And Tachyum simply said this forced them to source the IP from other suppliers, causing them schedule delays.

If that episode had any impact on performance, it sounds like it was just by delaying their product launch so it had to face newer products from its competitors.
 
One of the interesting things about the DoE's supercomputing plans is that from now on it wants to upgrade its high-performance compute capabilities every 12–24 months, not every 4–5 years.
I definitely like the idea of having an upgrade path, and maybe you don't build out the entire machine at once, but you either add nodes or replace older nodes on a periodic basis. If that's what they mean, then cool. Otherwise, an upgrade cycle of 12-24 months sounds incredibly wasteful.
 
If their software stack is not Linux and standard HPC libraries, they're next-level stupid.
You might think so, but I know AMD put a tremendous amount of effort into their HIP stack for porting CUDA applications to run on ROCm, and that seemed driven by certain HPC contracts they had.

So it should just be a recompile and tuning (some of which will already have been done by the hardware vendor) to move to a new architecture.
When you're talking about such large machines, I think porting is a slightly more involved endeavor. Yeah, you can just use something like OpenMP and get a quick, easy speedup. However, if you really want your application to get a good speedup, you typically have to invest a lot more time & effort.
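
To illustrate the "quick, easy" end of that spectrum, here is a minimal sketch in C. The function name `sum_of_squares` is made up for illustration, and I'm assuming a compiler with OpenMP support (e.g. `-fopenmp` on GCC or Clang). Without that flag, the pragma is simply ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example function: sum of squares of x[0..n-1].
 * With OpenMP enabled, the iterations are divided among threads and
 * each thread's partial sum is combined by the reduction clause.
 * Without OpenMP, the pragma is ignored and the loop runs serially. */
double sum_of_squares(const double *x, size_t n)
{
    assert(x != NULL || n == 0);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        sum += x[i] * x[i];
    }
    return sum;
}
```

One pragma gets you threads on a single node. It does nothing about distributing data across nodes, memory placement, or communication, which is where the real porting effort on a large machine tends to go.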
 
I was very surprised to see the word VLIW in there.
Oh, yeah ...no. VLIW is quite dominant in DSPs and therefore a lot of deep learning ASICs.

To get good scaling, you need to keep the cores small and efficient. That eliminates out-of-order execution from consideration. So, that limits us to in-order cores. There are two basic options for squeezing more performance out of in-order cores: VLIW and SIMD.
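
A rough sketch in C of the kind of parallelism each option targets. The function names are made up for illustration, and whether a given compiler actually vectorizes the first loop or bundles the operations in the second depends entirely on the target and the compiler, so treat this as a conceptual example only.

```c
#include <assert.h>
#include <stddef.h>

/* SIMD-friendly: the same multiply-add is applied to every element,
 * so one vector instruction can process several elements at once.
 * `restrict` promises the compiler that x and y do not overlap. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    assert((x != NULL && y != NULL) || n == 0);

    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* VLIW-friendly: three different operations with no dependencies
 * between them, so a compiler could schedule them side by side in
 * one wide instruction word. */
int mixed_ops(int a, int b, int c, int d)
{
    int sum  = a + b;  /* independent of prod and diff */
    int prod = c * d;  /* independent of sum and diff */
    int diff = a - d;  /* independent of sum and prod */

    return sum ^ prod ^ diff;  /* the results only meet here */
}
```

In both cases the parallelism is found at compile time rather than by out-of-order hardware, which is what keeps the cores small.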

GPUs combine SIMD with SMT, in order to hide memory latency. You can certainly combine VLIW and SIMD. You can even combine VLIW with SMT, though it won't be as efficient as SMT is in other contexts.

What I really wonder is whether they use a conventional cache hierarchy, or if their SRAM is directly addressable and software-managed. GPUs started mostly with the latter, but have gradually been embracing a more traditional cache hierarchy over time.
 