News AMD RDNA 3 professional GPUs with 48GB can beat Nvidia 24GB cards in AI — putting the 'Large' in LLM

Don't forget you can use multiple GPUs with LLMs.

Radeon PRO W7900 has 48 gigabytes of vram and 864 gigabytes a second of bandwidth.

Nvidia 5090 has 32 gigabytes of vram and 1792 gigabytes a second of bandwidth.

For $500 more, assuming MSRP, you could have dual Nvidia 5090s totaling 64 gigabytes of vram and 3584 gigabytes a second of bandwidth.
 
For $500 more, assuming MSRP, you could have dual Nvidia 5090s totaling 64 gigabytes of vram and 3584 gigabytes a second of bandwidth.
Rather bold of you to assume anything AI-related made by any company will be at MSRP... or even available for purchase at all.
 
Ok, a Pro W7900 or any 'Pro' graphics card is out of the question for someone like me who just uses (or wants to use) a GPU for generating subtitles with Whisper, but...
My question is, what LLM applications require or take advantage of such large pools of RAM?
To me, it seems like these benchmarks only apply to a very small percentage of the market, and is that percentage (I'm sure I'll be corrected/schooled) really looking for budget GPU performance vs. faster, premium GPUs with equivalent memory? Are there no Nvidia vendors with 48GB of memory? - or, like another poster noted, buying 2 cards (even with elevated pricing) will outperform the Pro W series.
I understand all about prices and cost of entry - this is why I still use Whisper on my Ryzen CPU and not a GPU, but if your use case requires vast amounts of RAM, aren't you more likely to have the $$ for more costly and effective options? It's not like you have one task to process; you likely have dozens, hundreds or thousands of tasks, and speed would be of the utmost importance.
For me, I can set up Whisper to transcribe a season of TV, let it run for a day or 4 and then move on to the next season/series.
Sure, I'd love for it to be done much more quickly and without hogging my CPU cycles, but it can be done.
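A minimal sketch of that kind of batch job, using the openai-whisper Python package (the "Season 01" folder and the model size are just placeholders, pick whatever fits your hardware):

```python
# Rough sketch: batch-transcribe a folder of episodes to .srt files with openai-whisper.
from pathlib import Path
import whisper


def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"


model = whisper.load_model("medium")        # runs on CPU if no usable GPU is present
for ep in sorted(Path("Season 01").glob("*.mkv")):
    result = model.transcribe(str(ep))      # needs ffmpeg on the PATH for decoding
    with open(ep.with_suffix(".srt"), "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
                    f"{seg['text'].strip()}\n\n")
```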
People/businesses that require tasks to be done quickly and efficiently are going to drop $$$ on the best tools/GPUs for speed and efficiency - AMD going on about how their last-gen GPU is still relevant or beating the competition seems like a PR/shareholder battle and not truly directed at the people buying for or running truly large LLM tasks.
What am I missing?
Ps. Anyone have a cheap Nvidia GPU that might be busted, but otherwise had working Media Engine and NPU they want to sell cheap? 😋
 
Don't forget you can use multiple GPUs with LLMs.
Please forget about using multiple GPUs with LLMs [unless you can afford to tailor-make your own proprietary models and software stack to actually make that work well enough for training, like DeepSeek did].

Of course, if you can afford NVlink switches and proper DC GPUs, there is a bit of scaling to be had, if you have the engineering teams to go with that.
Radeon PRO W7900 has 48 gigabytes of vram and 864 gigabytes a second of bandwidth.

Nvidia 5090 has 32 gigabytes of vram and 1792 gigabytes a second of bandwidth.

For $500 more, assuming MSRP, you could have dual Nvidia 5090s totaling 64 gigabytes of vram and 3584 gigabytes a second of bandwidth.
And with all that money you'll have the performance of a GT 1030 with an LLM.

Because as soon as you run out of memory on one of your GPUs and have to go across the PCIe bus for weights on the other GPU or in CPU RAM, you might as well just stick with CPUs; some of them have a memory bus that's faster than the PCIe link the GPUs have to share.

The reason why Nvidia can charge so much for their 4TB/s 96GB HBM GPUs with NVlink is because they understand the limitations of LLMs.

If all you had to do was cobble GPUs together in a chassis like they used to do for Ethereum, Nvidia would still be flogging gaming GPUs.

Of course, thinking myself smart, I actually had to try that.

I did put an RTX 4090 and a 4070 (5 slots total width) into my Ryzen 9 desktop and then observed what happened when I split models between them...

Pretty near the same thing that happens once layers are loaded to CPU RAM: performance goes down the drain, to very close to what CPUs can do on their own. At least when you have 16 good cores, like I do.

And if you look closely via HWiNFO, you'll see that the memory bus is at 100% utilization, while everybody else, CPUs and GPUs alike, is just twiddling their thumbs, waiting for data.
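A minimal sketch of that kind of two-GPU layer split, via the llama-cpp-python bindings (a CUDA build is assumed; the model path and the 2:1 split ratio are just placeholders):

```python
# Split a GGUF model's layers across two GPUs and run a short completion.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",
    n_gpu_layers=-1,           # ask for every layer to be offloaded to the GPUs
    tensor_split=[2.0, 1.0],   # rough proportion of layers per device (e.g. 24GB vs 12GB cards)
    n_ctx=4096,
)

out = llm("Why does memory bandwidth dominate LLM inference?", max_tokens=128)
print(out["choices"][0]["text"])
```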
 
AMD published DeepSeek R1 benchmarks of its W7900 and W7800 Pro series 48GB GPUs, massively outperforming the 24GB RTX 4090.

AMD RDNA 3 professional GPUs with 48GB can beat Nvidia 24GB cards in AI — putting the 'Large' in LLM : Read more
They had the same story at CES, how a single Strix Halo could outperform an RTX 4090 by a factor of 2.2...

The reason: they loaded a 70B model at Q4, which eats around 42GB.

With only 24GB of 1TB/s VRAM on board an RTX 4090, it has to go across the 64GB/s PCIe bus for any weight in [CPU] RAM.

The Strix Halo, meanwhile, can use its 256-bit LPDDR5X RAM at 256GB/s and thus will obviously be faster. Should be 4x, really.
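A rough, purely memory-bound back-of-the-envelope (it ignores compute, KV-cache traffic and any overlap, so take the absolute numbers with a grain of salt):

```python
# Per generated token, every weight has to be read once from wherever it happens to live.
MODEL_GB = 42                                  # 70B model at Q4, roughly

# RTX 4090: 24GB in ~1TB/s VRAM, the remaining weights pulled across ~64GB/s PCIe
t_4090 = 24 / 1000 + (MODEL_GB - 24) / 64      # seconds per token
# Strix Halo: the whole model in ~256GB/s LPDDR5X
t_halo = MODEL_GB / 256

print(f"4090, spilling to CPU RAM: ~{1 / t_4090:.1f} tok/s")   # ~3.3
print(f"Strix Halo:                ~{1 / t_halo:.1f} tok/s")   # ~6.1
print(f"speedup:                   ~{t_4090 / t_halo:.1f}x")   # ~1.9x, same ballpark as the 2.2x claim
print(f"PCIe vs LPDDR5X alone:     {256 / 64:.0f}x")           # the 4x for the spilled portion by itself
```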

But that's like driving a Ferrari into a corn field and having it run against a tractor: not a fair race.

Please, AMD marketing has pulled this cheap trick before. But it's totally misleading. So don't go mindlessly echoing their bull, use your brain to filter disinformation.

Or you'll just put people off TH.
 
Wouldn't the new AMD Strix Halo APUs provide much, much better AI-related performance per dollar compared to any of the dedicated GPUs discussed in this article, considering they would have up to 128 GB of RAM which could dynamically be allocated and shared between GPU and CPU? At least with regard to large LLMs?
 
I think most people are doing inference, not training? And there are things like this.
Short answer: start looking for performance results...

Longer attempt:

People (YouTube influencers, probably) also crowdfund cluster mainboards for Raspberry Pi CMs, evidently so even people on a budget can play around with clusters... when an Atom system would allow them to do the very same at a lower budget with virtual machines instead, ...and with better results to boot.

And that's what you could do as well, if you believe that the aggregate compute power of a lot of very small machines can match that of an equivalent big machine: partition a big Xeon into tiny VMs and then have both variants execute some HPC workload.

There are some cloud workloads that will do fairly similarly (apart from the RAM overhead of all those OS copies), but any single application that uses a lot of shared data will likely plummet in performance.

Inferencing in LLMs is actually more demanding in the sense that it is done in real time: people ask the AI oracle and expect an answer within their attention span (machine vision may require much shorter latencies).

Training is obviously far more computationally expensive, but can be batched or strung along a time axis: it may take days, weeks if not months (depending on what type and class of model you are doing) for the final training run after model architecture and data sets are developed.

And if it weren't for data dependencies, in theory each weight could take just as long to compute, only communicating the final result.

Unfortunately, it's those dependencies or connections that ML is all about: just picture connectivity as the synapses between neurons, and then picture those synapses sharing a memory bus or an Ethernet network.

That's where NVlink delivers bandwidth and latencies which are as close as you can possibly get to shared memory at multiple terabytes per second. If you could cheap out on that one, Nvidia wouldn't be where they are today.
 
Guess I should have been more clear. Using LM Studio to run a model across multiple GPUs.
I understood you well enough and that's what I did as well, for my last round of testing.

(I've been in HPC for four decades, and scale-out problems have been worked on since before computers were even properly invented.)

LM Studio is a really nice UI, but it only sits on top of a very tall stack of software layers.

The layer splitting is done by llama.cpp, which is also used by a lot of other tools or via command line, if that's how you roll.

And llama.cpp also sits on many layers and supports different runtimes underneath, which you can also configure and switch in LM Studio, e.g. allowing me to go between CPU-only, Vulkan and CUDA on my Nvidia machines, or just between Vulkan and CPU on my systems with Intel and AMD GPUs.

How that switches between the potentially different physical pools of RAM that hold the weights and the compute units that work on them isn't quite clear to me, but generally a preference for locality can be implied: if a layer's data sits on xPU A, there is a good likelihood that it will also be executed by that xPU and not on xPU B. However, technically there are quite a few options, because several GPUs and CPUs can share a virtual memory space and APUs might even have a physically unified pool... which might be NUMA...

So while CUDA as a framework isn't likely to execute on x86 CPUs or AMD GPUs, it could use CPU RAM and even (from a pure PCIe perspective) AMD/Intel GPU RAM. That becomes a little more interesting when you have unified memory and iGPUs.

More likely, llama.cpp will switch the runtimes along with where the layers are stored, so even if you select a CUDA runtime, as soon as layers exceed the RAM capacity of one device, they are likely moved entirely to whatever runtime the other device uses.

But that's all software; you could do your own if you can afford the time.

All that doesn't really matter if PCIe is the best you have in the middle of an LLM's inner loop: since that is 64GB/s on PCIe v4 x16 (or less with bifurcation), that's what limits their speed. The individual GPUs might have bandwidth adding up to multiple TB/s, and even the CPU might do 256GB/s on quad-channel RAM, but because of the high data dependency of LLMs that just changes how fast they can twiddle their thumbs. If they have to cross the bridge often enough for every token produced, it's going to be slow.

That's why scale out hardware architectures for machine learning use NVlink fabrics to come as close as possible to local RAM speeds.
 
And that's what you could do as well, if you believe that the aggregate compute power of a lot of very small machines can match that of an equivalent big machine
Running on Raspberry Pis is not the point of what I linked, much faster hardware can be used. But it shows that multiple devices can in theory be used without "[making] your own proprietary models and software stack" and it might be cheaper for somebody.

What we really want to see is lots and lots of RAM. Preferably user-upgradeable to take advantage of commodity pricing. Strix Halo 128 GB is good enough for 70B models for around $2,000 for a complete system. If you needed ~1.3 TB of RAM, consumer-grade 48 GB modules might cost less than $2,500 (Newegg has 2x48 for $160), but using 26x of those to run LLMs isn't so easy.
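A quick sanity check on that arithmetic (the kit price is just the Newegg example above, not a guarantee):

```python
# Rough check of the module math; $160 per 2x48GB kit is the quoted Newegg example price.
modules = 26
capacity_gb = modules * 48          # 1248 GB, i.e. about 1.25 TB
kits = modules // 2                 # 13 dual-module kits
cost = kits * 160                   # $2,080, comfortably under the ~$2,500 figure
print(f"{capacity_gb} GB across {kits} kits for about ${cost}")
```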

It's interesting that the consumer is not so far from being able to clear the memory hurdle for running giant LLMs at home. You can get 192 GB in 4 DIMMs, with 256 GB (64 GB modules) likely being available this year. DRAM capacity progress has slowed considerably, but it might pick up again after Samsung, Micron, and friends introduce 3D DRAM in the 2030s.

And yeah, the memory bandwidth is a problem.
 
And yeah, the memory bandwidth is a problem.
ServeTheHome has hinted at preparing a few LLM benchmarks on high-end EPYC systems with 12 memory channels. I'm not sure he'll pull through, because results that disappoint too many people have a tendency of getting lost.

It's hard to estimate at which bandwidth CPUs might run out of steam doing LLM inferencing. CPU vendors have generally tried to enable inline AI inferencing via ISA extensions, especially in custom hyperscaler designs, but those aren't sold to the public.

I'm not curious enough myself to go out and buy 4/8/12-channel hardware, so while I'd like to know as well, I'll just wait until someone with hands-on hardware is willing to talk about it.

Perhaps Wendel could be tempted as well.

And yeah, there clearly is a bit of pressure on dual-channel RAM designs, especially since dual DIMMs per channel is becoming a dead end. But the beachfront and traces that extra channels require are a hefty cost, too; plentiful and fast memory just cannot really be made cheap, they only offer you more and more variants of compromise.

For me the bigger issue is that I have yet to meet an LLM that was actually useful. Perhaps I'm just asking the wrong questions, but most of what I get from them is shockingly bad hallucinations.

So I'm glad I can at least use those GPUs for gaming... that's where complete fabrications are fun!
 