News AMD's software stack remains a weak spot — ROCm won't support RDNA 4 at launch

And this is exactly why AMD can't gain any market share.

There's legions of CS students and pros who would ditch Nvidia the second ROCm becomes a bit less of a bad-taste parody of CUDA. Every single generation, AMD spectacularly fails to deliver. I'm amazed they sell any HPC accelerators at all, considering how pathetic their SW stack is.

Since they have no interest in offering any versatility in their GPUs, let's at least hope that their "strictly for gaming" stock makes gamers happy so that there's less pressure on (sigh...) Nvidia.
 
And this is exactly why AMD can't gain any market share.
I think there's definitely more to it than just lacking and inconsistent compute support.

There's legions of CS students and pros who would ditch Nvidia the second ROCm becomes a bit less of a bad-taste parody of CUDA. Every single generation, AMD spectacularly fails to deliver. I'm amazed they sell any HPC accelerators at all, considering how pathetic their SW stack is.
Agreed. A couple years ago, I recall reading an impassioned plea by a prof (maybe adjunct) who had to make Nvidia GPUs a requirement for his class, after countless hours of headaches and failed attempts trying to help students get various AMD GPUs to work.

Since they have no interest in offering any versatility in their GPUs,
It's not for lack of interest. At least, not any time recently.

It turns out there's a purely technical explanation for why AMD's track record of ROCm support is so spotty. Here's an excerpt of a post by a former AMD-GPU employee who I think retired only about 6 months ago. I've only quoted the key bit, but I'd encourage you to follow the link and read the entire post.

"... rather than compiling compute code to an IR such as PTX and distributing that, the ROCm stack currently compiles direct to ISA and uses a fat binary mechanism to support multiple GPUs in a single binary. This was a decent approach back in the early days of ROCm when we needed a way to support compute efficiently without exposing the IR and shader compiler source used for Windows gaming, but it needed to be followed up with an IR-based solution before the number of chips we wanted to support became too large.

At the time we were thinking about needing one ISA version per chip generation but that changed quickly as hardware teams started introducing new features when ready rather than saving them up for a new generation so we could have 3 or 4 different ISA versions per GPU generation. This meant that binaries for libraries and application code grew much more quickly than initially expected, to the point where they were blowing past 2GB and causing other problems. I believe we fixed the 2GB issue quickly but that does not help to prevent stupidly large binaries if we support too many chips at once."

https://www.phoronix.com/forums/for...-like-supported-by-rocm?p=1520611#post1520611
Now, you can argue that AMD should've fixed this by now, and I wouldn't disagree. However, I think they've probably been consumed by trying to keep up with Nvidia in making HIP fully CUDA-compatible, porting various software packages to HIP, and doing all the bring-up work on their ambitious MI300. Not to mention having to add support for various RDNA GPUs.

AMD's GPU team is still far smaller than the equivalent parts of Nvidia and I can understand it's hard to prioritize activities that primarily don't benefit new hardware or revenue. Yet, failing to do so is strategically bad. So, it's a classic "rock and hard place" scenario - or, perhaps "ROCm and hard place", as the case may be.

Also, one fact that's uncomfortable for some of us Linux fans is that AMD has long been biased towards Windows. To a significant degree, their mainstream GPU Compute efforts have been consumed by supporting Microsoft initiatives, like C++AMP and DirectCompute. Within the organization, the Linux efforts were long seen as something just needed to support the relatively small workstation and server GPU markets and funded from those revenue streams.
 
It's not for lack of interest. At least, not any time recently.
I think the lack of interest is shown by the fact that they were *asked*. Intel published a guide on how to use their new GPUs with PyTorch and their ML stack when they released Alchemist. They gave Alchemist and Battlemage samples to AI YouTubers, not just gaming reviewers. If AMD cared, they would mention any progress with ROCm and make some effort at promoting it.

You make great points. I'm still not sold on the idea that investing in a good ML stack isn't the best move an Nvidia competitor can make right now:
- Their HPC hardware benefits from engineers who already know the APIs and the framework. That's the strength of CUDA: the same code that you debugged at home runs on the cluster.
- I'm still sceptical that the "homebrew ML" market is as small as people seem to think. Why does Tom's Hardware bother testing all new GPUs with ML workloads? (The tests aren't great, but they're there!). And homebrew ML engineers make new networks, which could run on new clusters... And see point 1.
- Ultimately, consumer GPU support *is* what made CUDA successful in the first place. If it worked once, it could work again?
 
CUDA support across all cards is the single biggest reason nvidia has dominated the market. People choose what they are comfortable with. Even if the nvidia card were slower, it would be chosen because people know how to use it.

In every ML buying decision I've been a part of, it's always which nvidia card is the right choice. Never, ever has AMD entered the conversation. We just need one person at the table to say something like.. "hey I've been playing around with my AMD card at home and it's working great." For the same money ML researchers will pick the slower nvidia card for their personal use because of uncertainty around ROCm.
 
I think the lack of interest is shown by the fact that they were *asked*. Intel published a guide on how to use their new GPUs with PyTorch and their ML stack when they released Alchemist. They gave Alchemist and Battlemage samples to AI YouTubers, not just gaming reviewers. If AMD cared, they would mention any progress with ROCm and make some effort at promoting it.
I'm not the best person to comment on what outreach AMD has or hasn't done in this area. I expect they'll do more as their hardware becomes more competitive. They do have a YouTube channel, however, with a lot of AI-specific content.

They also have a new blogging platform, with some posts specific to AI usage:


You make great points. I'm still not sold on the idea that investing in a good ML stack isn't the best move an Nvidia competitor can make right now:
Of course they are. Why do you think they announced ROCm support for the RX 7900 GPUs and on Windows? They do finally understand that people want to use their GPUs for compute and AI.

This is also why they're working on support for RDNA4, even though it won't be ready at launch. Maybe the reason for that delay is because they're finally addressing that IR (Intermediate Representation) problem described in the post I quoted!
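For anyone who wants to sanity-check whether a ROCm build of PyTorch actually sees their Radeon card, something like the sketch below is usually enough. It assumes a ROCm build of PyTorch is installed; ROCm builds reuse the "cuda" device name, and torch.version.hip is only populated on those builds.

```python
# Minimal ROCm/PyTorch sanity check (sketch, not official AMD guidance).
import torch

print("PyTorch:", torch.__version__)
print("HIP runtime:", torch.version.hip)         # None on CUDA builds, a version string on ROCm builds
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Tiny smoke test: one matmul on the GPU.
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    print("Matmul OK:", tuple(y.shape))
```

On consumer cards that aren't on the official support list, people often fall back on the HSA_OVERRIDE_GFX_VERSION environment-variable workaround, which is exactly the kind of friction this thread is complaining about.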
 
For the same money ML researchers will pick the slower nvidia card for their personal use because of uncertainty around ROCm.
They definitely earned themselves a bad reputation. About 10 years ago, they rewrote their Linux driver stack, and their OpenGL (and later Vulkan) support became pretty rock-solid and properly competitive with Nvidia. Many people assumed ROCm would rapidly improve and stabilize in a similar fashion, only to be met with repeated disappointments. This has created a lot of hurt feelings and bad will in the community. As the post I quoted outlined, I think this wasn't planned, but it was managed horribly. I hope the same people are no longer in charge.

Meanwhile, there's been another userspace stack taking shape. Mesa developed the "Rusticl" frontend for easily supporting OpenCL on the GPUs that Mesa supports. ROCm supports OpenCL, but at this point Rusticl might've already surpassed its implementation. I prefer OpenCL to CUDA/HIP anyhow, but support for AI on OpenCL isn't as good.
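If you're curious which OpenCL stack is actually servicing your GPU (ROCm's own OpenCL runtime vs. Mesa's Rusticl), a quick enumeration with pyopencl makes it obvious. A rough sketch, assuming pyopencl is installed (Rusticl's devices may stay hidden unless RUSTICL_ENABLE names your driver, e.g. radeonsi):

```python
# List every OpenCL platform and device the system exposes.
# On a typical AMD Linux box you might see both "AMD Accelerated Parallel Processing"
# (ROCm's OpenCL runtime) and "rusticl" (Mesa) offering the same GPU.
import pyopencl as cl

for platform in cl.get_platforms():
    print(f"Platform: {platform.name} ({platform.version})")
    for device in platform.get_devices():
        print(f"  {device.name}: {device.max_compute_units} CUs, "
              f"{device.global_mem_size // 2**20} MiB global memory")
```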

Lastly, there are also people using Vulkan for compute, which is an area I haven't followed as closely. However, I can say that Vulkan is much lower level than OpenCL and I'm not convinced it really addresses deficiencies of OpenCL the same way it does for OpenGL.

I guess I should also mention WebGPU, which is an API & framework usable from WebAssembly and I think Javascript. This is an area I know even less about than Vulkan. I believe browsers are implementing WebGPU atop Vulkan, but I'm not entirely sure.
 
In fact, just yesterday there was some news on this subject. I think AMD's pursuit of MLIR (Multi-Level Intermediate Representation) might indeed be aimed at addressing the problem I cited in post #5.

Some background:

IREE (Intermediate Representation Execution Environment) is an MLIR-based end-to-end compiler and runtime that lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments.

Key features:
  • Ahead-of-time compilation
  • Support for advanced model features
  • Designed for CPUs, GPUs, and other accelerators
  • Low overhead, pipelined execution
  • Binary size as low as 30KB on embedded systems
  • Debugging and profiling support

Supported ML frameworks:
  • JAX
  • ONNX
  • PyTorch
  • TensorFlow and TensorFlow Lite

Support for hardware accelerators and APIs:
  • Vulkan
  • ROCm/HIP
  • CUDA
  • Metal (for Apple silicon devices)
  • AMD AIE (experimental)
  • WebGPU (experimental)

They also claim OS support includes Linux, Windows, MacOS, Android, iOS, and WebAssembly (experimental). Supported host ISAs include Arm, x86, and RISC-V.

MLIR, itself:

Multi-Level Intermediate Representation Overview

The MLIR project is a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together.


Perhaps this is what was referred to by part of the post I quoted (again, see #5 for link):

"... it seems to me that we should be able to package up our extended LLVM IR as the "compiler output" then run the back half of LLVM on the IR at runtime."

I think we'll know if AMD has truly solved this problem based on whether, and to what degree, they can properly support all of their recent hardware. Certainly, when you add XDNA to the mix, the matrix of hardware they need to support is reaching the level of a first-order problem.
 
This doesn't bode well for Strix Halo and is especially troublesome when that system has been advertised as an ML secret weapon with its unified high bandwidth memory.

You can finally pre-order a Framework desktop with 128GB of RAM with ca. 200GB/s bandwidth, but without software it's rather useless or overpriced for what it can do.

Not that I think 200GB/s bandwidth will be that much fun with LLMs filling 128GB capacity, but that's another issue and perhaps a sign that AMD marketing doesn't talk to their engineers.

About the only good GPU news I've been able to observe lately is that Limited Edition B580 cards have finally appeared near MSRP in Europe, so I grabbed one.

Unfortunately, enabling resizable BAR on my Broadwell Xeon failed a BIOS checksum check, so for now it's basically a spare in case one of my Nvidia GPUs should fail: you can't even get spare parts any more, should the need arise...
 
You can finally pre-order a Framework desktop with 128GB of RAM with ca. 200GB/s bandwidth, but without software it's rather useless or overpriced for what it can do.
They describe the memory as:

Capacity: 128GB
Speed: LPDDR5x-8000
Maximum dedicated VRAM: 96GB

https://frame.work/products/desktop-mainboard-amd-ai-max300?v=FRAMBM0002

At a 256-bit width, that works out to a nominal bandwidth of 256 GB/s. So, it's enough to read the entire VRAM contents 2.67 times per second. If you're entirely memory-bound, then that seems about how fast you can inference the largest model it can hold.
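Here's that back-of-the-envelope arithmetic spelled out as a quick sketch; real token rates will come in lower once compute, KV-cache traffic and other overheads are accounted for.

```python
# Peak bandwidth from the memory config, then a naive memory-bound ceiling:
# one full sweep over the resident weights per generated token.
bus_width_bits = 256
transfer_rate_mts = 8000                                        # LPDDR5x-8000
bandwidth_gbs = bus_width_bits // 8 * transfer_rate_mts / 1000  # -> 256 GB/s

vram_gb = 96                                   # max dedicated VRAM on this board
sweeps_per_second = bandwidth_gbs / vram_gb    # -> ~2.67
print(f"{bandwidth_gbs:.0f} GB/s reads {vram_gb} GB about {sweeps_per_second:.2f} times per second")
# i.e. roughly 2-3 tokens/s if a model actually filled all 96 GB.
```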

Not that I think 200GB/s bandwidth will be that much fun with LLMs filling 128GB capacity, but that's another issue and perhaps a sign that AMD marketing doesn't talk to their engineers.
I really think they didn't conceive of it primarily for AI. The quote I saw around its origins really seemed more oriented towards graphics.

You might think AMD was taking a bit of inspiration from Apple Silicon, with its powerful CPU cores, graphics and unified memory. But according to VP Joe Macri, AMD was building towards this long before Apple. “We were building APUs [chips combining CPUs and Radeon graphics] while Apple was using discrete GPUs. They were using our discrete GPUs. So I don’t credit Apple with coming up with the idea.”

Macri gives Apple credit for proving that you don’t need discrete graphics to sell people on powerful computers. “Many people in the PC industry said, well, if you want graphics, it’s gotta be discrete graphics because otherwise people will think it’s bad graphics,” he said.

Source: https://www.engadget.com/computing/...ly-wouldnt-exist-without-apple-220034111.html
 
Let's not forget that AMD's Datacenter revenue was up 69% YoY. A lot of this was EPYC but some was Instinct.

Large enterprises have the resources needed to make ROCm work for them, while a CS Professor or class might not, at least not without noticeably more effort compared to going the CUDA route. Honestly, Radeon support is still relatively new, and with the RDNA vs. CDNA rift and ROCm aligning to CDNA, I'm not surprised that they weren't ready for RDNA4's launch.

Intel weren't the only ones that weren't as prepared as nVidia to ride the AI wave.
 
At a 256-bit width, that works out to a nominal bandwidth of 256 GB/s. So, it's enough to read the entire VRAM contents 2.67 times per second. If you're entirely memory-bound, then that seems about how fast you can inference the largest model it can hold.
AMD's literature quotes 256GB/s as the maximum; Framework says 8000MT/s at 256-bit, which works out to the same figure. That's about the same as the RTX 4060 in one of my laptops.
I really think they didn't conceive of it primarily for AI. The quote I saw around its origins really seemed more oriented towards graphics.
AMD bragged about it at CES, arguing that it could run Llama 3.1 70B-Q4 twice as fast as an RTX 4090 with 24GB.

Now that was obviously disingenuous, because the only way to fit a Llama 70B model into an RTX 4090 is with 2-bit quantisation, which basically produces pure gibberish. At 4 bits, some layers will need to go to CPU RAM, and then it's the PCIe bandwidth which determines performance; there was no difference in token speed between my 16-core Ryzen 9 and the RTX 4090, single-digit tokens per second, around 4 if I remember correctly.

But that is also not technically false: my CPU RAM bandwidth was near 100GB/s, so you might get 8-10 tokens/s with 256GB/s. Not a good experience, I think, but perhaps for some it's better than not being able to do anything at all.

As long as the models fit into the 24GB on my RTX 4090, token speeds came to around 40 per second, and that would be tolerable in terms of speed, if the results were usable. Perhaps I ask the wrong questions, but I usually get catastrophic hallucinations; that's another topic, though.
You might think AMD was taking a bit of inspiration from Apple Silicon, with its powerful CPU cores, graphics and unified memory. But according to VP Joe Macri, AMD was building towards this long before Apple. “We were building APUs [chips combining CPUs and Radeon graphics] while Apple was using discrete GPUs. They were using our discrete GPUs. So I don’t credit Apple with coming up with the idea.”​
Macri gives Apple credit for proving that you don’t need discrete graphics to sell people on powerful computers. “Many people in the PC industry said, well, if you want graphics, it’s gotta be discrete graphics because otherwise people will think it’s bad graphics,” he said.​
I've used AMD APUs pretty much from day one and I distinctly remember how with Kaveri AMD pushed the notion of being able to mix CPU and GPU code at the granularity (and the overhead) of a procedure call... which sounded so great I actually bought a Kaveri A10-7850k system only for testing that. But it never became a practical reality for lack of software support and as a normal PC it was a disappointment, both for CPU and for gaming performance, even with the best DDR3-2400 to feed the 512 iGPU cores.

So there is potential and then there is actual benefit. And with Strix Halo I see a bit of a repeat where it's hard to actually obtain value from what sounds awesome at first glance.

For pure graphics performance you get roughly an RTX 4060 mobile, not a bad experience for 1080p gaming, but much cheaper at €750 with an 8-core Ryzen laptop included; €145 extra swaps the included 16GB for 64GB of RAM, €321 for 128GB, but only with around 100GB/s of bandwidth.

For AI, it would seem that a quad- or even octa-channel EPYC might offer more capacity or speed. I don't know at what point CPUs become too weak for LLM inference, which is rather light on compute by machine-learning standards: in my tests, once more than a few layers were in CPU RAM, it made no difference whether the rest of the model was running on the RTX 4090 or everything on the CPUs. But neither the newer dual-channel Zens nor the older quad-channel Xeons at my disposal passed the 100GB/s mark, so beyond that it's terra incognita for me.

Framework quotes €1270 for the 32GB model and €2329 for the 128GB model; the latter is a bit more than what I paid for my RTX 4090 (which currently sells at more than twice that price), which offers 4x the bandwidth until you need more than 24GB.

A few months ago, that bought you quite a bit more gaming performance but also much better ML, as long as the models were small enough.

If model size tips you towards unified memory, 256GB/s may just not be good enough to make it worthwhile; data center GPUs with 96GB of RAM offer 4TB/s of bandwidth for acceptable token speeds.

So any which way I look at it, Strix Halo is serving a tight niche. But selling it as a Llama 70B machine, yet without ROCm support, seems to kick it into a Kaveri corner. And Kaveri's main advantage was price, which is not a Strix Halo forte so far.

Because of that niche, Strix Halo as a stand-alone product seems insane for lack of scale; I can only imagine it being worthwhile if it can serve the console market with little if any change. Yet there again, the equivalent of an RTX 4060 may not be good enough to reach 4K.

Well, we'll see, AMD usually isn't completely stupid, so more likely it's me who's wrong.

For me without ROCm I can't justify buying it professionally, and it's too costly as a toy, so I can't check for myself, which is my biggest complaint :)
 
AMD bragged about it on CES, argued that it could run Llama 3.1 70B-Q4 and was twice as fast as an RTX 4090 with 24GB.
At this point in time, I agree that they'd want to play up the AI aspect. I'm just saying that, back when they started this project, they probably weren't thinking about people doing edge-based inferencing of such huge AI models. I'd guess this project got underway in 2022 and was maybe greenlit even before that.

I've used AMD APUs pretty much from day one and I distinctly remember how with Kaveri AMD pushed the notion of being able to mix CPU and GPU code at the granularity (and the overhead) of a procedure call... which sounded so great I actually bought a Kaveri A10-7850k system only for testing that.
Yes, I remember their HSA push. I spec'd out a micro-ATX system for dabbling with it, and got as far as buying a case for it, but luckily that's as far as I got.

If model size tips you towards unified memory, 256GB/s may just not be good enough to make it worthwhile; data center GPUs with 96GB of RAM offer 4TB/s of bandwidth for acceptable token speeds.
It's rumored that a workstation variant of the RTX 5090 might feature 96 GB. For that to happen 24 Gb GDDR7 dies need to become available and I don't know if they are. But, it should be like < 1/5th the price of a H200, I think. And if you need more memory, just get a second one and divide the model between them.

I can only imagine it being worthwhile if it can serve the console market with little if any change.
While Ryzen AI Max 395's iGPU is comparable to that of a PS5, its memory bandwidth still comes up short of the console's 448 GB/s. Both use a 256-bit datapath, but the console uses GDDR6 and even the Pro model (which has yet better specs) costs only 1/3rd as much as the Framework PC.

For me without ROCm I can't justify buying it professionally, and it's too costly as a toy, so I can't check for myself, which is my biggest complaint :)
It will probably have ROCm support by the time you could actually get your hands on a Framework PC with it.
 
It's rumored that a workstation variant of the RTX 5090 might feature 96 GB. For that to happen 24 Gb GDDR7 dies need to become available and I don't know if they are. But, it should be like < 1/5th the price of a H200, I think. And if you need more memory, just get a second one and divide the model between them.
Professional double-capacity variants based on top consumer chips have existed for several generations, albeit with a hefty markup for the relatively cheap extra RAM chips, e.g. the L40 and RTX 6000 Ada, which are basically an RTX 4090 with 48GB of GDDR6, passive (L40) or active (RTX 6000 Ada) cooling, and some driver differentiation/extortion.

Interestingly, when RTX 4090 prices exploded towards the end of its life cycle, the formerly "insane" prices of those double-capacity variants spiked much less and almost seemed to become reasonable.

Never enough for me to go and buy one for testing, but I did a bit of research to find out why they seemed to hold so little attraction. From what I could gather, their effective token rates halved with double-sized models and dropped below acceptable "human patience thresholds" for production inference, especially when trying to run inference for multiple clients, because bandwidth can't keep up with capacity.

Their lack of modern NVLink support also made them unattractive for training, so they became very niche, which is the main reason I doubt the value of Strix Halo for AI. Their HBM cousins (with TB/s of NVLink), however, just remained unobtainium, mostly because at 4x the bandwidth, token rates remained high enough, while the maximum concurrent user population might drop as models grow.

With the RTX 5090 not being produced in volume, a "B40" isn't in the cards anytime soon.

And no, if adding cards and dividing the model were a thing, Nvidia couldn't charge what they do.

Unless you use the modern variants of NVLink, only available on the DC variants with HBM and the switching ASICs required to connect them, the cross-connect would be PCIe, which effectively puts you at CPU-level inference performance.

Of course I had to try that for myself, putting my RTX 4090 and my RTX 4070 into a single system (only possible because I used narrow PNY GPUs at 3 and 2 slot width) and tried distributing layers of Llama between them. Just moving 5-10% of the layers removed all GPU speedups, it was brutal, pretty much like running 4GB of software on a 256MB machine and using a hard disk for paging.

Your models and the fabric need to be specifically designed for the interconnect to work around the bottleneck and that can kind of work for final training when it is optimally decoupled and deeply batched. DeepSeek is supposed to have used that for training and mixtures of experts might have natural breaks between experts.

But inference is real-time, limited by bandwidth and consumer patience.
 
And no, if adding cards and dividing the model were a thing, Nvidia couldn't charge what they do.
For inferencing, you can use multiple L40's to implement either Tensor Parallelism or Pipeline Parallelism:

[Diagram: model parallelism (model-parallelism.webp)]

[Diagram: layer-wise model parallelism (model-parallelism-layer-wise.webp)]

Where PCIe becomes more of a bottleneck is with training. That's what NVLink is for.
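To make the layer-wise (pipeline-parallel) picture above concrete, here's a toy PyTorch sketch of the inference case on two local GPUs. The layer count and sizes are made up; the point is only that each stage's weights reside permanently on their own device, so all that crosses PCIe per step is one small activation tensor, never the model.

```python
# Toy pipeline parallelism: the first half of the layers lives on GPU 0,
# the second half on GPU 1. Weights never move; only one activation
# tensor per step crosses the PCIe link between the cards.
import torch
import torch.nn as nn

hidden = 4096
stage0 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(8)]).to("cuda:0")
stage1 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(8)]).to("cuda:1")

@torch.no_grad()
def forward(x):
    x = stage0(x.to("cuda:0"))   # runs entirely out of GPU 0's VRAM
    x = x.to("cuda:1")           # the only inter-GPU transfer: a few MB of activations
    return stage1(x)             # runs entirely out of GPU 1's VRAM

out = forward(torch.randn(1, hidden))
print(out.shape, out.device)
```

Tensor parallelism, by contrast, splits each individual layer's weight matrices across the GPUs, which lowers per-GPU memory further but makes the communication much chattier.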

Of course I had to try that for myself, putting my RTX 4090 and my RTX 4070 into a single system (only possible because I used narrow PNY GPUs at 3 and 2 slot width) and tried distributing layers of Llama between them. Just moving 5-10% of the layers removed all GPU speedups, it was brutal, pretty much like running 4GB of software on a 256MB machine and using a hard disk for paging.
How did you decide where to split the network? What was the connectivity at that point? Was the CPU orchestrating data movement between the two network partitions?

What I'd imagine would help is to use PCIe -> PCIe transfers between the cards. Perhaps you can even arrange for one GPU to inject a command in the other's command queue, once the transfer has completed, so that the host CPU doesn't need to get involved. PCIe 5.0 would help even further.
 
Where PCIe becomes more of a bottleneck is with training. That's what NVLink is for.
Well, current generation models require hundreds of thousands of GPUs, so even NVLink doesn't link them all, but only the closest cluster levels, while they then use Mellanox fabrics or pretty cool optical switches in the case of Google's TPUs.

But in all of these cases, topology is no longer transparent and automated optimizations are done by something they call "compilers", even if that doesn't fit my more traditional definition of a compiler as a language translator. Those compilation techniques still require that you provide information about the potential separation points, and you quickly face diminishing returns.
How did you decide where to split the network? What was the connectivity at that point? Was the CPU orchestrating data movement between the two network partitions?
I basically can only tell llama.cpp how many layers to use on which GPU, and I have no insight as to where there might be a better split, which layers have the most weights, or which tend to have the largest amount of cross-talk. That might be better if this were TensorFlow, where topology information is used to make automated placement decisions.

CUDA uses virtual memory, just like CPUs (perhaps without paging), and thus allows GPU and CPU memory to be mapped into a shared address space without the need for explicit data movement (copying). Data access (from either GPUs or CPUs) will simply go over the PCIe (or NVLink) bus if a memory region isn't local. With all layers in GPU VRAM, the CPU doesn't do much beyond setup and initial data load, and even if layers spill over into CPU RAM, that doesn't require CPU intervention: the GPUs will have it mapped into their virtual address space and just use it, with proper caching and a shared coherence protocol.

But the PCIe bus is the bottleneck: access from the RTX 4090 to the 4070's VRAM or to CPU RAM would be limited by PCIe v4 x8 in all cases, so shifting the number of layers between the GPUs or to the CPU had little impact.
What I'd imagine would help is to use PCIe -> PCIe transfers between the cards. Perhaps you can even arrange for one GPU to inject a command in the other's command queue, once the transfer has completed, so that the host CPU doesn't need to get involved. PCIe 5.0 would help even further.
As I said, that's all taken care of by a big shared virtual memory space between all participants, also for NVlink. Once you go beyond NVlink I have no idea what the giants use, but there is plenty of URDMA and MPI frameworks in HPC which have similar issues and solutions.

In my case PCIe 5 wasn't an option because the RTX 40 series doesn't support it, but it wouldn't do much anyway. NVLink is still 10x faster at 1.8TB/s vs. 125GB/s for PCIe 5 x16, yet still much slower than local HBM bandwidth.

In GPU compute, even VRAM is considered a bottleneck, pretty much like disk or database access compared to traditional CPU code, because it's shared by thousands of cores. The real work gets done solely in vast register files, which are truly local to each core.

Having to go across the PCIe bus is like having to fetch a tape, mount it and have it seek to where you then read/write a bit of data.
 
CUDA uses virtual memory, just like CPUs (perhaps without paging), and thus allows GPU and CPU memory to be mapped into a shared address space without the need for explicit data movement (copying). Data access (from either GPUs or CPUs) will simply go over the PCIe (or NVLink) bus if a memory region isn't local. With all layers in GPU VRAM, the CPU doesn't do much beyond setup and initial data load, and even if layers spill over into CPU RAM, that doesn't require CPU intervention: the GPUs will have it mapped into their virtual address space and just use it, with proper caching and a shared coherence protocol.

But the PCIe bus is the bottleneck: access from the RTX 4090 to the 4070's VRAM or to CPU RAM would be limited by PCIe v4 x8 in all cases, so shifting the number of layers between the GPUs or to the CPU had little impact.
This is exactly the problem. Even if the GPU can page the model into its local memory, caching is pretty much useless. Your hit rate will be near zero. You're just going to stream in the entire portion of the model you're using, on every inference. That's why NPUs don't use SMT-style latency hiding and instead just rely on DMA engines and direct-mapped on-die SRAM rather than a cache.

The point I was making is that you need to partition a model so that you can load portions of it on each GPU and let it reside there. Otherwise, like you said, it'll just be PCIe-bottlenecked to the point where you might as well just inference on your CPU.

In GPU compute even VRAM is considered a bottleneck and pretty much like a disk or database access compared to traditional CPU code, because it's shared by thousands of cores. The real work gets done soley in vast register files, which are truly local to each.
Last I checked (maybe back in the Volta era), you could configure up to half of the on-die memory as direct-mapped SRAM. So, not just registers. However, once you've finished with a set of weights, you have to fetch more from off-chip. The only way you avoid having to read in the entire model for every inference is by batching. However, the larger your batch size, the more of an issue the intermediates become.
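As a rough sketch of that batching trade-off (toy numbers, assuming a purely weight-bandwidth-bound decode step and ignoring the KV-cache and intermediate traffic that grows with batch size):

```python
# Weights are streamed from HBM once per decode step no matter how many
# sequences are in flight, so throughput scales roughly with batch size
# until compute or KV-cache traffic becomes the limit instead.
weights_gb = 40            # e.g. a ~70B-parameter model at 4-bit
hbm_bandwidth_gbs = 3350   # illustrative HBM figure for a datacenter GPU

sweeps_per_second = hbm_bandwidth_gbs / weights_gb   # full weight reads per second
for batch in (1, 8, 64):
    print(f"batch {batch:3d}: ~{sweeps_per_second * batch:6.0f} tokens/s upper bound")
```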
 
This is exactly the problem. Even if the GPU can page the model into its local memory, caching is pretty much useless. Your hit rate will be near zero. You're just going to stream in the entire portion of the model you're using, on every inference. That's why NPUs don't use SMT-style latency hiding and instead just rely on DMA engines and direct-mapped on-die SRAM rather than a cache.
I wouldn't mix NPUs and GPUs here, they share little in design or actual workloads AFAIK.

NPUs are typically built around very dense vision and sound models, much more like DSPs or even dataflow architectures; their DMA engines ensure real-time access to the input and output data, while reference data mostly stays in SRAM. At least that's how I've understood both Qualcomm's Hexagon and the Movidius chips. And they are all about doing what a CPU or a GPU could also do, but at much lower energy consumption.

Big GPUs doing LLMs need to basically traverse all those billions of weights of a model for every input token, so caching those weights would be wasted energy, because there is next to no data locality (while code is both very small and replicated everywhere).

Of course, there are still lots of inner loops running out of those register files and caches on intermediate data, some of it reference data and results, but the ratio of bandwidth to model size remains the major limitation for tokens/s.

On NVLink, especially with the newest generation of ASIC switch chips, distinct GPUs share a virtual memory space, and the bandwidth cliff between them may be only half of what the local HBM can deliver, even if latencies might be bigger, so the scale-out penalty is less pronounced than on a measly PCIe bus. But if you ignore topology, you still lose 50% performance on NVLink, and at Blackwell prices that's something they'd want to avoid.
The point I was making is that you need to partition a model so that you can load portions of it on each GPU and let it reside there. Otherwise, like you said, it'll just be PCIe-bottlenecked to the point where you might as well just inference on your CPU.
Exactly the same point I'd make, too. But splitting the model isn't trivial or even easy. TensorFlow was invented to make that somewhat automatable, but when the cliff is 50-125GB/s of PCIe vs. 1.8TB/s of NVLink, I'm pretty sure it can't help.

Mixture-of-experts models might be easier to split, and of course training in huge batches can ease some of the overhead, but for inference you need bandwidth to get speed. Unless you have a Cerebras, you'd better be able to fit your inference workload inside an NVLink cluster.
Last I checked (maybe back in the Volta era), you could configure up to half of the on-die memory as direct-mapped SRAM. So, not just registers. However, once you've finished with a set of weights, you have to fetch more from off-chip. The only way you avoid having to read in the entire model for every inference is by batching. However, the larger your batch size, the more of an issue the intermediates become.
Sounds right, and probably newer cards still support stuff like that: they still try to sell GPUs into HPC and I am quite sure they'd need and want that. Even the inner loops of LLMs are likely to use any such tricks, because if all LLM data had to reside in RAM, not even HBM would be nearly fast enough.
 
NPUs are typically built around very dense vision and sound models, much more like DSPs or even dataflow architectures; their DMA engines ensure real-time access to the input and output data, while reference data mostly stays in SRAM.
They work out of SRAM, but the DMA engines can be used to stream weights in, meaning they're not limited only to models that fit in SRAM. That would be excessively limiting. I defy you to show me a model that fits in like 4 MB and yet needs 45 TOPS of compute power.

they are all about doing what a CPU or a GPU could also do, but at much lower energy consumption.
The CPU cores don't have the 45 TOPS of compute power required by CoPilot+.

Big GPUs doing LLMs need to basically traverse all those billions of weights of a model for every input token, so caching those weights would be wasted energy, because there is next to no data locality (while code is both very small and replicated everywhere).
That was my point.

On NVLink, especially with the newest generation of ASIC switch chips, distinct GPUs share a virtual memory space, and the bandwidth cliff between them may be only half of what the local HBM can deliver, even if latencies might be bigger, so the scale-out penalty is less pronounced than on a measly PCIe bus. But if you ignore topology, you still lose 50% performance on NVLink, and at Blackwell prices that's something they'd want to avoid.
The idea is still to stream data over NVLink, not weights.

Exactly the same point I'd make, too. But splitting the model isn't trivial or even easy. TensorFlow was invented to make that somewhat automatable, but when the cliff is 50-125GB/s of PCIe vs. 1.8TB/s of NVLink, I'm pretty sure it can't help.
You're comparing apples & oranges. PCIe is point-to-point, whereas the NVLink lanes aren't connected like that. And anyway, you wouldn't usually have just two GPUs in a full SXM setup.

Sounds right, and probably newer cards still support stuff like that: they still try to sell GPUs into HPC and I am quite sure they'd need and want that. Even the inner loops of LLMs are likely to use any such tricks, because if all LLM data had to reside in RAM, not even HBM would be nearly fast enough.
Batching has been around at least since I first started messing with Caffe, about a decade ago.