News Exacluster with 144 Nvidia H200 AI GPUs detailed by its designer: Hydra Host enters the scene

The article said:
The setup uses 3.2Tbps InfiniBand for east-west traffic and 400Gbps Ethernet for north-south communication
In this context, what do "east-west" and "north-south" refer to? I'd guess east-west means communication among peer nodes and north-south is referring to communication with clients and storage.
 
The article said:
On the one hand, Hydra Host is a close Nvidia partner and only offers Nvidia GPUs as a service. In addition, its Brokkr software is optimized primarily for CUDA. On the other hand, ExaAI is a company backed by Nvidia, so it can potentially get preferential pricing.
LOL, it's the same hand!

The article said:
Hydra also specializes in building custom solutions for startups and even monetizes their machines when not in use.
This is a nice idea, in theory. However, the concern I'd have is that AI training tends to be so data-intensive that it would take a long time to upload all of your training data to their servers and then you get to actually use the cluster for how long???

Plus, once you get bumped because they want to use the machine, what do you do? I guess you have to transfer your partially-trained model + training data to some other cluster and continue there? Sounds inefficient to me. I guess if the cost savings are substantial vs. one of the big cloud operators, then it might be worth the downsides.
 
This setup is nothing special; most typical university supercomputers are like that, except instead of H200s or B200s they still have older A100s, which are around 3x slower than the H200.

By the way, since DeepSeek trained their model on 2,048 H800 GPUs for two months, a cluster like this would train it in roughly 9-12 months

(the H200 is 2-3x faster than the H800, which seems severely restricted on FP64, which isn't used for AI anyway, and on communication speed).
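A quick back-of-the-envelope sketch of that scaling (the 2,048-GPU, two-month figure is DeepSeek's reported setup; the 2-3x H200-vs-H800 factor is the assumption above):

```python
# Scale DeepSeek's reported run (2,048 H800s for ~2 months) to 144 H200s.
H800_GPUS = 2048      # DeepSeek's reported cluster size
H800_MONTHS = 2       # reported wall-clock training time
H200_GPUS = 144       # the Exacluster discussed in the article

for speedup in (2.0, 3.0):                    # assumed H200-vs-H800 factor
    h800_equivalent = H200_GPUS * speedup     # 144 H200s in "H800 units"
    months = H800_MONTHS * H800_GPUS / h800_equivalent
    print(f"{speedup:.0f}x per-GPU speedup -> ~{months:.1f} months")

# 2x -> ~14.2 months, 3x -> ~9.5 months: roughly the 9-12 month ballpark above
```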

The cost of electricity, at 100 kW × 10,000 hours × 10 cents per kWh, is just $100,000 per year, i.e., negligible.
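For the record, that arithmetic checks out (the 100 kW draw is the assumption above; 10,000 hours is a bit more than a year of 24/7 operation, since a year has 8,760 hours):

```python
power_kw = 100      # assumed average cluster power draw
hours = 10_000      # slightly more than one year of continuous operation
usd_per_kwh = 0.10  # assumed electricity price
print(f"${power_kw * hours * usd_per_kwh:,.0f}")   # -> $100,000
```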

So this ~$5M cluster is quite capable of doing the same job DeepSeek did, for about the same money
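For comparison, DeepSeek's own widely quoted cost figure comes from the same kind of arithmetic: the DeepSeek-V3 technical report states 2.788M H800 GPU-hours, priced at an assumed $2 per GPU-hour:

```python
gpu_hours = 2_788_000   # H800 GPU-hours reported for DeepSeek-V3 pre-training
usd_per_gpu_hour = 2.0  # rental rate assumed in the report
print(f"${gpu_hours * usd_per_gpu_hour:,.0f}")   # -> $5,576,000, i.e. ~$5.6M
```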
 
In this context, what do "east-west" and "north-south" refer to? I'd guess east-west means communication among peer nodes and north-south is referring to communication with clients and storage.
The proper names, in the case of supercomputers, are vertical and horizontal network communications. The article just uses the terms favored by the workers who physically install the network in supercomputers.
 
More interesting numbers.
If Elon Musk's current AI cluster of 100,000 H100 GPUs (which are not substantially slower than the H200) started training the DeepSeek model at 9 am, it would finish at approximately the end of the day, around 7 pm.
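The same scaling sketch as before, applied to a 100,000-GPU cluster (treating the H100 as roughly on par with the H200, i.e. ~2.5x an H800, per the assumptions above):

```python
# DeepSeek's run in H800 GPU-hours: 2,048 GPUs x ~2 months (~730 h/month)
h800_gpu_hours = 2048 * 2 * 730       # ~3.0M GPU-hours
h800_equivalent = 100_000 * 2.5       # 100k H100s expressed in "H800 units"
hours = h800_gpu_hours / h800_equivalent
print(f"~{hours:.0f} hours")          # -> ~12 hours, i.e. about one working day
```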

And if DeepSeek decided to submit an application requesting compute time on the most powerful supercomputer, Frontier, they would most probably get rejected for asking for too little time, below the cut-off threshold, which is around 250,000 node-hours
 