>That is actually the entire point of Strix Halo: to provide a solution for specific workloads that easily tops the usual standard PC/x86 builds at that price point, but without going deep into proper high-end workstation territory.
I agree, with caveats. Framework Desktop (FWDT hereafter) is a good first step, but it has substantial limitations, and software support (Windows and Linux) is still very much WIP. For LLM use, it's very bleeding edge. $2K is well within the enthusiast budget, but I suspect its useful lifespan will be short, before more functional and finished options become available. Probably a year.
Strix Halo isn't so much a design dictated by a well-defined use case as the one point where a set of generally diverging technological trends happened to intersect, once.
It really just trades the traces used for PCIe lanes to double the data lanes for RAM. That seems a workable compromise at exactly this point, because it means you don't have to change any of the other commodity components from which you build systems.
But you can't keep trading traces up or down to track model demands that grow gradually in different directions, e.g. widening the current 256-bit RAM bus bit by bit, or even byte by byte, toward 512 bits by sacrificing what remains of the PCIe or USB signals, etc.
Where a truly flexible architecture would require a switch, which unfortunately costs as many transistors and electrons as the APU itself, Strix Halo hard-wires that single other workable permutation to save that expense, at the cost of a future, really.
EPYCs give you that flexibility via their giant IOD switch, but at (mostly) higher cost, while the analog of the fruity cult's "Ultra" variant, which might be another attractive performance point, has no Strix Halo equivalent.
That's why I judge it a solution looking for a problem to match its capabilities, with no workable evolutionary path ahead, as genius as AMD might be.
[Aside: I would love to see Intel jump into this edge AI space, as it said it would. Hoping to see Nova Lake AX becoming a reality.]
With Nova Lake, Intel is already betting its entire innovative technology potential on duplicating a feature that AMD has well established in the market. And while that feature has been great for establishing and holding a flagpole position, it's not a volume niche, not even for AMD. AMD did V-Cache for servers; the consumer parts are great collateral, but if they didn't sell them in servers, AMD could perhaps not afford to make them just for gamers.
I don't see Intel selling those big-cache CPU CCDs in servers, nor do I see mixed P/E designs succeeding in servers, so I believe whoever is left at Intel is still chasing the ghosts of ancient glory.
LLM speed (tokens/sec) has two components: prompt processing (commonly measured as pp512) and token generation (tg128). Strix Halo has decent memory bandwidth and memory size for token generation, but prompt processing requires the high compute that comes with a dGPU. That's why you see Wendell of Level1Techs using an eGPU via USB4 for the KV cache. PP time is typically much smaller than TG time, so objectively it's a small bottleneck, but the latency, or time to first token (TTFT), is important in terms of perceived speed.
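To make the two components concrete, here's a minimal back-of-envelope latency model; the rates are hypothetical placeholders of my choosing, not measured Strix Halo numbers:

    # Toy latency model: prefill is compute-bound, generation is bandwidth-bound.
    def response_time(prompt_tokens, output_tokens, pp_rate, tg_rate):
        ttft = prompt_tokens / pp_rate           # prompt processing (prefill)
        total = ttft + output_tokens / tg_rate   # plus streamed generation
        return ttft, total

    # Assumed rates: pp = 400 tok/s, tg = 30 tok/s (illustrative only)
    ttft, total = response_time(2048, 512, 400.0, 30.0)
    print(f"TTFT: {ttft:.1f}s, total: {total:.1f}s")  # TTFT: 5.1s, total: 22.2s

Generation dominates the wall-clock time, but the ~5s of prefill is exactly the dead air the user stares at before anything appears.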
Similarly, Strix Halo isn't suited for image generation like Stable Diffusion, which requires the high compute that comes with a fast dGPU. Because SD is targeted toward consumer dGPUs, 16-32GB of VRAM is sufficient for most operations, including training.
Exactly: unless your workload is an ideal match for that single optimum Strix Halo has to offer, its value burns off very quickly.
This segues into FWDT's lack of expansion. It can't accommodate a dGPU. L1Techs' use of an eGPU via USB4 (20Gbps max) would be an expensive, slow, and kludgy workaround. You can also use an M.2->OcuLink adapter, which is faster, but then you have a Franken-setup that defeats the purpose of an SFF box.
Since they traded those PCIe lanes and their v5 power budgets for the iGPU and the wider RAM bus, they are locked into that single spot, one that neither EPYCs nor desktops can duplicate exactly, while those sweep a much broader range of use cases.
FWDT also lacks a high-speed interconnect for those who want to try a multinode cluster. Its 5GbE pales in comparison with the 400Gbps throughput of GB10's ConnectX-7 (dual node only). 10GbE would be a practical minimum.
Even with the likes of NVLink, any significant scale-out typically requires a bespoke redesign of the base model, unless we find a way to do quantum chip-to-chip links. The smaller you start in terms of GPUs, the more effort you have to invest there. It's no accident Nvidia makes so much money on NVLink ASICs; anyone can do GPUs these days.
And no, 10GbE for GPU scale-out is much like trying to run a RAID through modems on a serial line (10GbE is actually a bit faster than a SATA link, but not by much). And bandwidth alone isn't the issue; it's the latency overhead you need to eliminate on Ethernet, which is what Mellanox is good at, by actually talking InfiniBand.
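A crude cost model illustrates the point: per-message time is roughly latency + payload/bandwidth, and with the small, frequent messages of tensor exchange the fixed latency term dominates. The link parameters below are rough class numbers of my choosing, not vendor specs:

    # Toy transfer-cost model: time = latency + payload / bandwidth.
    def transfer_us(payload_bytes, latency_s, bw_bits_per_s):
        return (latency_s + payload_bytes / (bw_bits_per_s / 8)) * 1e6

    payload = 64 * 1024  # a hypothetical 64 KiB activation/KV shard
    print(transfer_us(payload, 50e-6, 10e9))   # 10GbE, kernel TCP: ~102 us/hop
    print(transfer_us(payload, 2e-6, 400e9))   # RDMA-class fabric:  ~3.3 us/hop

More Ethernet bandwidth only shrinks the second term; the tens of microseconds of software-stack latency stay put, and that is precisely the part RDMA/InfiniBand removes.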
The 112GB max memory (with the latest Linux drivers) is enough to run dense 70B LLMs, and barely enough to run Llama 4 Scout, the smallest Llama 4 model, at Q4. It's not enough for Qwen3-235B-A22B, which requires 143GB at Q4. The trend in open LLMs is toward larger sizes and MoE (mixture of experts) designs. The 112GB limit sits at an awkward "almost, but not quite" level for the most interesting open LLMs. I understand the necessity of soldered RAM, and 128GB is a cost trade-off made for the small edge-AI market (GB10 also has a 128GB limit). But 256GB would have been a more "comfortable" limit; I hope to see that in the next iteration.
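As a sanity check on those numbers, a quantized model's weight footprint is roughly params x bits-per-weight / 8 plus runtime overhead; the ~4.5 bits/weight (a common Q4-style ballpark) and the 10% overhead factor are my assumptions, and I'm taking Scout at its 109B total parameter count:

    # Rough Q4 footprint estimate in GB. Note that MoE models must hold ALL
    # experts in memory, even though only a few are active per token.
    def q4_gb(params_billion, bits=4.5, overhead=1.10):
        return params_billion * bits / 8 * overhead

    print(q4_gb(70))   # dense 70B:        ~43 GB -> fits in 112 GB
    print(q4_gb(109))  # Llama 4 Scout:    ~67 GB -> fits, until context/KV grows
    print(q4_gb(235))  # Qwen3-235B-A22B: ~145 GB -> doesn't fit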
I am less optimistic, and it's not just about the horrific hallucinations. The RTX Pro GDDRx professional series cards, which offer 4x the VRAM capacity of their gamer counterparts, typically match those gaming GPUs' bandwidth, since the hardware is very much the same. But with "only" 1.5TB/s of bandwidth available on the RTX 5090 and RTX Pro, vs. the 8TB/s of their HBM brethren, token rates already fall below 'human speed', or what most people would tolerate for inference responses, once you actually use the full 96GB capacity.
And at that point Strix Halo would deliver only 1/8th of that 1/4th, i.e. 1/32nd, of the accepted performance baseline of HBM datacenter GPUs. At that point the sweatshop conversions from China that double the VRAM of gamer GPUs seem more attractive, because their capacity-to-bandwidth ratio is a little better than the Pro cards'.
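The underlying arithmetic is a simple roofline: at batch size 1, every generated token has to stream the full (dense) weight set from memory once, so tokens/sec tops out near bandwidth divided by model size. The bandwidth figures below are ballpark class numbers:

    # Roofline ceiling for batch-1 generation on a dense model:
    # tg_max ~= memory bandwidth / bytes of weights read per token.
    def tg_ceiling(bw_gb_s, weights_gb):
        return bw_gb_s / weights_gb

    print(tg_ceiling(8000, 96))   # HBM datacenter GPU:     ~83 tok/s
    print(tg_ceiling(1500, 96))   # RTX 5090 / Pro, 96 GB:  ~16 tok/s
    print(tg_ceiling(256, 110))   # Strix Halo, 110 GB:    ~2.3 tok/s

The ~256GB/s Strix Halo figure follows from its 256-bit LPDDR5X bus; MoE models fare somewhat better because only the active experts' weights have to be read per token.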
WIP software: Strix Halo's NPU isn't currently usable except within AMD's Lemonade Server (Windows only). According to an AMD dev on /r/LocalLLaMA, it can't run in conjunction with the iGPU, only as a separate task. Many other aspects are WIP.
IMHO it's disinformation bordering on fraud: AMD is shooting themselves in the foot there.
I'm waiting for the release and reviews of the GB10 lineup, viz. the Asus Ascent GX10. At $3K, it's still within enthusiast/small-dev reach. It has the same RAM size and similar memory bandwidth; compute should be higher. Software support is better, and CUDA is in play. But it doesn't do Windows and can't serve as a general-purpose system. N1X next year (similar to GB10) will do WoA, but may not be targeted at the same market as GB10.
I'm very glad I got paid rather well researching what you can do on more pedestrian architectures. So the horrible quality of the LLM results was no issue, and I could recycle all the hardware I bought for other work and gaming.
Otherwise I can only recommend you spend your money elsewhere, as the baselines are rising faster than your ability to run even entry-level public models.
Even with the support and budgets Wendell gets, his actual results are more optimism than anything one could sell at a profit.
I'd rather have Wendell and others in a similar position dig in a little further than have people who need a new job spend their last pennies on a future that won't run on these machines.