At a 256-bit width, that works out to a nominal bandwidth of 256 GB/s, enough to read the entire VRAM contents about 2.67 times per second. If you're entirely memory-bound, that seems about as fast as you can run inference on the largest model it can hold.
AMD's literature backs 256 GB/s as the maximum, and Framework quotes 8000 MT/s at 256 bits, which works out to the same figure. That's about the same as the RTX 4060 in one of my laptops.
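A quick back-of-envelope sketch in Python, in case you want to plug in other widths and speeds; the 96 GB GPU-allocatable share of the 128 GB is my assumption:

```python
# Nominal bandwidth = transfer rate x bus width; then how often that
# bandwidth can stream the whole VRAM per second.
transfer_rate = 8000e6        # 8000 MT/s
bus_width_bits = 256
bandwidth = transfer_rate * bus_width_bits / 8      # bytes/s
print(f"{bandwidth / 1e9:.0f} GB/s")                # 256 GB/s

vram = 96e9   # assumed GPU-allocatable share of the 128 GB
print(f"{bandwidth / vram:.2f} full VRAM reads/s")  # ~2.67
```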
I really don't think they conceived of it primarily for AI. The quote I saw about its origins seemed oriented more towards graphics.
AMD bragged about it at CES, arguing that it could run Llama 3.1 70B-Q4 twice as fast as an RTX 4090 with 24GB.
Now that was obviously disingenuous, because the only way to fit a Llama 70B model into an RTX 4090 is with 2-bit quantisation, which basically produces pure gibberish. At 4 bits some layers need to go to CPU RAM, and then PCIe bandwidth determines performance: there was no difference in token speed between my 16-core Ryzen 9 and the RTX 4090, single-digit tokens per second, around 4 if I remember correctly.
But the claim is also not technically false: my CPU RAM bandwidth was near 100 GB/s, so you might get 8-10 tokens/s at 256 GB/s. Not a good experience, I think, but perhaps for some better than not being able to run the model at all.
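The arithmetic behind those guesses: for a dense model, generating one token means streaming essentially all the weights through memory once, so bandwidth divided by model footprint gives a memory-bound ceiling. A minimal sketch; the ~40 GB footprint for a Q4 70B model and the perfect bandwidth utilisation are both assumptions:

```python
def tokens_per_s_ceiling(bandwidth_gb_s, model_gb):
    """Memory-bound upper limit: every generated token streams
    all model weights through memory once."""
    return bandwidth_gb_s / model_gb

model_gb = 40  # assumed footprint of Llama 3.1 70B at Q4
for bw in (100, 256, 4000):  # desktop DDR5, Strix Halo, datacenter HBM
    print(f"{bw:>5} GB/s -> ~{tokens_per_s_ceiling(bw, model_gb):.0f} tokens/s")
```

(The ~4 tokens/s I actually saw at ~100 GB/s beat this ceiling because part of the model was still running on the 4090.)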
As long as the models fit into the 24GB on my RTX 4090, token speeds came to around 40 per second, which would be tolerable if the results were usable. Perhaps I ask the wrong questions, but I usually get catastrophic hallucinations; that's another topic, though.
You might think AMD was taking a bit of inspiration from Apple Silicon, with its powerful CPU cores, graphics and unified memory. But according to VP Joe Macri, AMD was building towards this long before Apple. “We were building APUs [chips combining CPUs and Radeon graphics] while Apple was using discrete GPUs. They were using our discrete GPUs. So I don’t credit Apple with coming up with the idea.”
Macri gives Apple credit for proving that you don’t need discrete graphics to sell people on powerful computers. “Many people in the PC industry said, well, if you want graphics, it’s gotta be discrete graphics because otherwise people will think it’s bad graphics,” he said.
I've used AMD APUs pretty much from day one, and I distinctly remember how, with Kaveri, AMD pushed the notion of mixing CPU and GPU code at the granularity (and the overhead) of a procedure call... which sounded so great that I bought a Kaveri A10-7850K system just to test it. But it never became a practical reality for lack of software support, and as a normal PC it was a disappointment, both in CPU and in gaming performance, even with the best DDR3-2400 to feed the 512 iGPU cores.
So there is potential and then there is actual benefit. And with Strix Halo I see a bit of a repeat where it's hard to actually obtain value from what sounds awesome at first glance.
For pure graphics performance you get roughly a mobile RTX 4060: not a bad experience for 1080p gaming, but available much cheaper, at €750 for a laptop with an 8-core Ryzen included. €145 extra swaps the included 16GB for 64GB of RAM, €321 for 128GB, though with only around 100 GB/s of bandwidth.
For AI, it would seem that a quad- or even octa-channel EPYC might offer more capacity or speed. I don't know at what point CPUs become too weak for LLM inference, which by machine-learning standards is rather light on compute: in my tests, once more than a few layers were in CPU RAM, it made no difference whether the rest of the model ran on the RTX 4090 or everything ran on the CPUs. But neither the newer dual-channel Zens nor the older quad-channel Xeons at my disposal passed the 100 GB/s mark, so beyond that it's terra incognita for me.
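For scale, theoretical DDR bandwidth is channels × transfer rate × 8 bytes per 64-bit channel. A sketch with illustrative DDR5 speeds; these are my assumptions, not measured figures:

```python
def ddr_bandwidth_gb_s(channels, mt_s):
    """Theoretical peak: each 64-bit channel moves 8 bytes per transfer."""
    return channels * mt_s * 8 / 1000

print(ddr_bandwidth_gb_s(2, 5600))   # dual-channel desktop:  89.6 GB/s
print(ddr_bandwidth_gb_s(4, 4800))   # quad-channel:         153.6 GB/s
print(ddr_bandwidth_gb_s(8, 4800))   # octa-channel EPYC:    307.2 GB/s
```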
Framework quotes €1270 for the 32GB model and €2329 for the 128GB model; the latter is a bit more than what I paid for my RTX 4090 (currently selling at more than twice that price), a card with 4x the bandwidth, as long as you don't need more than 24GB.
A few months ago, that money bought you quite a bit more gaming performance and also much better ML performance, as long as the models were small enough.
If model size tips you towards unified memory, 256 GB/s may just not be good enough to make it worthwhile: data-center GPUs with 96GB of RAM offer 4 TB/s of bandwidth, which by the estimate above works out to around 100 tokens/s on a Q4 70B model, an acceptable speed.
So any which way I look at it, Strix Halo serves a tight niche. But selling it as a Llama 70B machine, yet without ROCm support, seems to kick it into a Kaveri corner. And Kaveri's main advantage was price, which so far isn't a Strix Halo forte.
Because of that niche, Strix Halo as a stand-alone product seems insane for lack of scale; I can only imagine it being worthwhile if it can serve the console market with little if any change. Yet there, again, the equivalent of an RTX 4060 may not be good enough to reach 4K.
Well, we'll see, AMD usually isn't completely stupid, so more likely it's me who's wrong.
For me, without ROCm I can't justify buying it professionally, and it's too costly as a toy, so I can't check for myself; that's my biggest complaint.
