News Alibaba Cloud ditches Nvidia's interconnect in favor of Ethernet — tech giant uses own High Performance Network to connect 15,000 GPUs inside data...

bit_user

Titan
Ambassador
FYI, Tenstorrent, Habana, and Cerebras are all using 100+ Gigabit Ethernet. In the case of the first two, they're even using it for intra-chassis communication, which was referenced in some choice remarks by Jim Keller:

I'm not saying the Alibaba paper is unoriginal, just pointing out that its novelty is probably in its approach to mitigating switch failures and reducing their likelihood.
 
Jul 1, 2024
This article is a bit confusing, as NVLink is still used inside the GPU hosts (to interconnect the 8x GPU memory and communication). NVLink was never an option for connecting hosts together, that is the domain of Ethernet, InfiniBand, or a custom optical crossconnect (Google et al). Perhaps the author is referring to favoring Ethernet over InfiniBand? In that case I agree with @bit_user that this is known territory. Meta have been very vocal publicly about using Ethernet and lightweight IP traffic engineering at the ToR (top-of-rack) switch to favor paths for elephant flows. This would seem to be the same or a similar scheme.
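
To make the elephant-flow point concrete, here is a minimal Python sketch of the general idea: small flows follow a plain ECMP-style hash, while large flows are pinned to the least-loaded uplink. This is only an illustration under my own assumptions, not Meta's or Alibaba's actual scheme. The names `ecmp_path` and `steer`, the SHA-256 hash, and the 1 GB elephant threshold are all hypothetical.

```python
import hashlib


def ecmp_path(flow, n_paths):
    """Pick an uplink by hashing the flow 5-tuple.

    SHA-256 is only a stand-in here; real switches use their own hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def steer(flows, n_paths, elephant_bytes=1_000_000_000):
    """Assign each (five_tuple, size_in_bytes) flow to an uplink.

    Flows at or above elephant_bytes (an assumed threshold) go to the
    currently least-loaded uplink; everything else follows the hash.
    Returns (placement dict, per-uplink load in bytes).
    """
    load = [0] * n_paths
    placement = {}
    # Handle the largest flows first so they spread out before small ones land.
    for flow, size in sorted(flows, key=lambda fs: fs[1], reverse=True):
        if size >= elephant_bytes:
            path = min(range(n_paths), key=lambda p: load[p])
        else:
            path = ecmp_path(flow, n_paths)
        load[path] += size
        placement[flow] = path
    return placement, load


if __name__ == "__main__":
    # Four equal-size elephant flows across four uplinks (made-up addresses).
    flows = [
        ((f"10.0.0.{i}", "10.0.1.1", 4791, 4791, "udp"), 10_000_000_000)
        for i in range(4)
    ]
    placement, load = steer(flows, n_paths=4)
    print(placement)
    print(load)
```

With hash-only placement, two of those four flows can land on the same uplink purely by chance, which is the collision problem that steering tries to avoid.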
 

bit_user

Titan
Ambassador
This article is a bit confusing, as NVLink is still used inside the GPU hosts (to interconnect the 8x GPU memory and communication).
Correct.

NVLink was never an option for connecting hosts together,
In the early revs, it was just for intra-machine communication. In the last couple generations, it started to expand to rack-scale and maybe a little beyond.

From Nvidia's GTC 2024 Keynote:

04:29PM EDT - And NVIDIA is building a rack-scale offering using GB200 and the new NVLink opertions, GB200 NVL72

04:29PM EDT - NVLink 5 scales up to 576 GPUs


04:48PM EDT - 5000 NVLink cables. 2 miles of cables

04:48PM EDT - And those are all copper cables. No optical transceivers needed

04:48PM EDT - That saved 20kW to be spent on computation

You're right that, at some point, Nvidia wants you to switch to InfiniBand. I think their acquisition of Mellanox, ~5 years ago, had something to do with that.