News AMD enhances multi-GPU support in latest ROCm update: up to four RX or Pro GPUs supported, official support added for Pro W7900 Dual Slot

oofdragon

Distinguished
Oct 14, 2017
327
292
19,060
Where is CrossFire??? They surely can make it in 2024 since they did it in 2014, even 2004? The 8800 XT will be around a 7900 XT for $500, they say; well, make it CF and it will beat the 4090, no problem
 

abufrejoval

Reputable
Jun 19, 2020
584
424
5,260
Where is CrossFire??? They surely can make it in 2024 since they did it in 2014, even 2004? The 8800 XT will be around a 7900 XT for $500, they say; well, make it CF and it will beat the 4090, no problem
In both cases the big issue is the extra headaches and diminishing returns for the extra GPUs.

It's like baking a cake or building a house: adding another person doesn't always double the output or halve the time. When you add 5, 500, or 5,000 there is a good chance no cake or house will ever happen, unless your problem doesn't suffer too much from Amdahl's law and your solution process is redesigned to exploit that.
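To put rough numbers on that, here is a minimal sketch of Amdahl's law; the 90% parallel fraction is just an assumed figure for illustration:

```python
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), with parallel fraction p.
def amdahl(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Assumed p = 0.9 for illustration: 4 workers give ~3.1x, not 4x,
# and even 5000 workers cap out near 1 / (1 - p) = 10x.
print(amdahl(0.9, 4), amdahl(0.9, 5000))
```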

It's very hard to actually gain from the extra GPUs, because cross-communication at PCIe speeds vs. local VRAM is like sending an e-mail in-house but having to hand-write and hand-deliver it as soon as it needs to reach someone in the next building.

In some cases, like mixture-of-experts models, there are natural boundaries you can exploit. I've also experimented with an RTX 4070 and an RTX 4090, because they were the only ones I could fit into a single workstation for the likes of Llama-2. Some frameworks give you fine control over which layers of the network to load on which card, so you can exploit points where layers are less tightly connected.
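As an illustration of that kind of manual split (an assumed sketch, not a recipe), Hugging Face transformers lets you pin layers to cards via device_map; the model name, layer count, and split point below are placeholders for a Llama-2-7B-style setup:

```python
# Hypothetical split: first 20 decoder layers on the smaller card (GPU 0),
# the remaining 12 plus the head on GPU 1. Module names assume the Llama
# architecture in recent transformers releases.
from transformers import AutoModelForCausalLM

device_map = {"model.embed_tokens": 0}
for i in range(32):                      # Llama-2-7B has 32 decoder layers
    device_map[f"model.layers.{i}"] = 0 if i < 20 else 1
device_map["model.norm"] = 1
device_map["lm_head"] = 1

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder model name
    device_map=device_map,               # activations cross PCIe once, at layer 20
    torch_dtype="auto",
)
```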

But in most cases it just meant that token rates went down to the 5 tokens/s you also get with pure CPU inference, because that's just what an LLM on normal DRAM and PCIe v4 x16 will give you, no matter how much compute you put into the pile.
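A back-of-envelope check of why, with assumed round numbers rather than measurements: a memory-bound LLM tops out near bandwidth divided by bytes touched per token:

```python
# All figures are assumptions for illustration, not benchmarks.
model_bytes = 13e9        # e.g. a ~13B model at roughly 1 byte per weight
ddr5_bw     = 64e9        # dual-channel DDR5-4800, roughly 64 GB/s
pcie4_x16   = 32e9        # PCIe 4.0 x16, roughly 32 GB/s per direction

print(ddr5_bw / model_bytes)    # ~4.9 tokens/s from CPU + DRAM
print(pcie4_x16 / model_bytes)  # ~2.5 tokens/s if weights must stream over PCIe
```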

ML models or workloads need to be designed around very specific splits to suffer the least from a memory space that may be logically joined but is effectively partitioned by tight bottlenecks. And so far that's a very manual job that doesn't even port to a slightly different setup elsewhere.
 
  • Like
Reactions: oofdragon

systemBuilder_49

Distinguished
Dec 9, 2010
101
35
18,620
Except AMD is so stingy with PCIe lanes (only 24 on the 1000X-9000X CPUs) that this feature is USELESS to all but Threadripper customers. Nice one, AMD, democratizing AI for nothing but their richest customers!
 
  • Like
Reactions: oofdragon

LabRat 891

Honorable
Apr 18, 2019
108
76
10,660
Where is CrossFire??? They surely can make it in 2024 since they did it in 2014, even 2004? The 8800 XT will be around a 7900 XT for $500, they say; well, make it CF and it will beat the 4090, no problem
I recommend you take a look @

'CrossFire' is gone, long gone now. mGPU only works in Vulkan/DX12 titles, where supported.
However, AMD has already figured out how to 'bond' GPUs together over Infinity Fabric. The feature merely has not been offered to the consumer space.

I may be incorrect, but I believe Infinity Fabric inter-GPU communication is involved with ROCm multi-GPU, too.
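For what it's worth, you can probe what the runtime actually exposes for inter-GPU traffic. A hedged sketch using PyTorch's torch.cuda API, which is HIP-backed on ROCm builds (whether the link underneath is Infinity Fabric or plain PCIe depends on the hardware):

```python
# Probe which GPU pairs the runtime exposes peer-to-peer access for.
# On a ROCm build of PyTorch, torch.cuda maps to HIP, so this also runs
# on AMD cards; the physical link type is hardware-dependent.
import torch

n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b and torch.cuda.can_device_access_peer(a, b):
            print(f"GPU {a} -> GPU {b}: peer access available")
```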
 
I've always been AMD, but next gen I want 4090 perf for 1440p and/or at least 1.3x 4090 perf for 4K. I hope they make it somehow or I'll have to go Nvidia 😭
I gave up on AMD... I took the worst card from Nvidia, the infamous RTX 4060 Ti 16GB :)

Don't wait to go green team.

Get one on the cheap before the new cards come out.
 

DS426

Upstanding
May 15, 2024
254
190
360
Except AMD is so stingy with PCIe lanes (only 24 on the 1000X-9000X CPUs) that this feature is USELESS to all but Threadripper customers. Nice one, AMD, democratizing AI for nothing but their richest customers!
Entry-level Threadripper (7960X) is not terribly expensive (~$1,400) and gives 88 usable PCIe lanes on the TRX50 platform. A quad 7900 XTX system will probably need that much CPU, depending on the AI workload. We're still talking about performant, relatively low-cost AI systems here, without going to full-blown EPYC or Xeon server systems.

BTW, AM4 had 24 lanes but AM5 has 28, and they are PCIe 5.0 capable. I don't know that a 7900 XTX needs more than PCIe 4.0 x8 bandwidth (or maybe there's a small bottleneck?), so at least a dual-GPU setup seems more than feasible to me, as 16 lanes (minimum) are dedicated to PCIe slots.
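For rough context (approximate numbers, not benchmarks), per-lane PCIe throughput after 128b/130b encoding versus the 7900 XTX's local VRAM bandwidth:

```python
# Approximate per-lane PCIe throughput in GB/s, after encoding overhead.
gbs_per_lane = {"3.0": 0.985, "4.0": 1.969, "5.0": 3.938}

print(8 * gbs_per_lane["4.0"])   # PCIe 4.0 x8  -> ~15.8 GB/s
print(16 * gbs_per_lane["5.0"])  # PCIe 5.0 x16 -> ~63 GB/s
print(960)                       # 7900 XTX GDDR6 -> ~960 GB/s local VRAM
```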
 

abufrejoval

Reputable
Jun 19, 2020
584
424
5,260
Except AMD is so stingy with PCIe lanes (only 24 on the 1000X-9000X CPUs) that this feature is USELESS to all but Threadripper customers. Nice one, AMD, democratizing AI for nothing but their richest customers!
Lanes come at a cost, actually a huge cost in terms of die area and power consumption.

AMD gives you options: make do with 16 lanes on APUs, 24-28 lanes on "desktop" SoCs, and plenty more with Threadripper and EPYC.

Not everyone wants to pay extra for extra lanes on lower tier SoCs.

And some may be able to make do without the full complement of lanes for every GPU: in GPU mining a single lane was good enough, while with LLMs even 64 PCIe v5 lanes may still be too slow to be useful.

In theory you could even employ PCIe switches, which is what those ASMedia chipset chips are, too.

Whether you're stingy with your money or AMD is stingy on the lanes is a difference in perspective that complaining cannot bridge.