News AMD slides claim Strix Halo can beat the RTX 4070 laptop GPU by up to 68% in modern games

Claiming it can beat a 4070 is a tall claim when the 4070 in question is only allowed to draw 65W... at that wattage, the card loses around 20 FPS compared to one that is allowed to draw 100W-140W, and the 4070 can usually still be OC'd. That's a significant difference in my book. Strix Halo does look impressive, especially for an iGPU, but I wouldn't call this an apples-to-apples comparison when the competitor is power limited. Especially with first-party benchmarks that will make it look as good as possible.
 
It's a shame that the best 4090/5090 GPUs for AI only deliver 2.5 t/s on simple models like the Llama 3.x-70b.
It's sad that they still get 4.5 stars and best choice for doing local LLMs at such an ugly speed.
 
The 2025-model Flow Z13 is supposed to use the Ryzen 395, with the same power and thermal constraints.
But seeing as they don't bother mentioning what chassis the 395 is in, I'm going to assume it's on an open-air test bench.
 
I'm not surprised it can beat a 4090 on Llama 3 70B; that model is way too big to fit into 24GB of memory and is thus extremely limited on the 4090. You need at least a pair of 3090s/4090s to have a chance of fitting that model in memory, and even then you'd need quantization. Not exactly fair without context, but it is true that a Strix Halo with 128GB of RAM would run 70B much better than a single 4090. 24GB cards are limited to about 34B models with quantization for optimal performance.
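
A rough back-of-envelope on those sizes, for anyone who wants to check the numbers (this sketch only counts the weights; KV cache, activations and runtime overhead push real requirements higher):

```python
# Rough memory footprint of dense LLM weights at different quantization levels.
# Only the weights are counted; KV cache and runtime overhead come on top.

def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # decimal GB

for params in (70, 34):
    for bits in (16, 8, 4):
        gb = weight_footprint_gb(params, bits)
        verdict = "fits" if gb <= 24 else "does NOT fit"
        print(f"{params}B @ {bits}-bit: ~{gb:.0f} GB -> {verdict} in 24 GB VRAM")

# 70B is still ~35 GB even at 4-bit, which is why a single 24 GB card can't hold it,
# while ~34B models squeeze in once quantized to 4-bit (~17 GB).
```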
 
But the Nvidia rival tested was constrained by the Asus ROG Flow Z13's thermal design power.

Were the thermals necessary due to being super thin and light?

Any GPU is going to struggle if its cooling is far, far from adequate. This whole story probably says more about the thin-and-light laptop than about the superior performance of the AMD chip. It is a nice chip though. Wish I could get one in a socket on a desktop.
 
AMD would sell massive amounts of this iGPU if it was a desktop product, what are they thinking??
1. It almost certainly will come to mini-PCs in time. It’s not even out yet.
2. It has quad-channel unified memory, and thus cannot be a drop-in option for AM5 based desktops.
3. It’s equal to a 4070 laptop with a 65W TGP, but that configuration is easily surpassed in gaming by midrange desktop dGPUs.
4. The complexity, the large iGPU, the extremely fast unified RAM, and the fact that it can’t drop into AM5 mean it would likely be very expensive as a desktop APU. Desktops that need good graphics performance, won’t use a dGPU, and will pay a premium for that combination are a slim niche, so it’s not AMD’s launch priority. Someone will fill it eventually, and people will probably complain it’s a poor value.
 
RTX 4070 mobile << RTX 4060 desktop: 128-bit GDDR6, 1.7 GHz boost. It doesn't take much to beat this weak GPU. And no amount of power limits will help it, especially when compared in the same 13" chassis 😉
They are about equal from what I have seen. Meaning that the 4070 mobile (and above, honestly) are a really bad deal. Meanwhile, the laptop 4060 is only about 9% worse than the desktop part and at least mine can be overclocked to roughly the same level as the desktop 4060, meaning it's the far better deal imho. Again, for an iGPU, the performance is pretty decent. It's just a bit disingenuously presented in my eyes.
 
1. It almost certainly will come to mini-PCs in time. It’s not even out yet.
2. It has quad-channel unified memory, and thus cannot be a drop-in option for AM5 based desktops.
3. It’s equal to a 4070 laptop with a 65W TGP, but that configuration is easily surpassed in gaming by midrange desktop dGPUs.
4. The complexity, the large iGPU, the extremely fast unified RAM, and the fact that it can’t drop into AM5 mean it would likely be very expensive as a desktop APU. Desktops that need good graphics performance, won’t use a dGPU, and will pay a premium for that combination are a slim niche, so it’s not AMD’s launch priority. Someone will fill it eventually, and people will probably complain it’s a poor value.
This would make a perfect HTPC for gaming, right?
 
1. I think the point of the test, despite using a 65W mobile 4070, is to show how much more efficient this chip is by testing both in a thermally constrained environment, so you might actually get more than 30 min of battery life if you're gaming on it.

2. Comparing against an RTX 4090 running a 70B-param model is very disingenuous, since the card can't really run something that large. HOWEVER, if I'm reading the results correctly, the AMD iGPU is hitting around 6 tokens per second on that same 70B model, which is actually very impressive.
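
That ~6 t/s figure is at least plausible if you treat inference as purely memory-bandwidth bound. A quick sanity check, where the ~256 GB/s figure for Strix Halo's unified LPDDR5X is my assumption, not something stated in the slides:

```python
# Naive upper bound for a bandwidth-bound decoder: each generated token has to
# stream roughly all of the weights once from memory.
model_bytes = 70e9 * 4 / 8   # 70B params at 4-bit quantization ≈ 35 GB
bandwidth = 256e9            # assumed Strix Halo unified-memory bandwidth, bytes/s
print(f"Upper bound: ~{bandwidth / model_bytes:.1f} tokens/s")  # ≈ 7.3 t/s
```

An observed ~6 t/s against a ~7 t/s theoretical ceiling would mean the chip is already sitting close to its memory limit.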

Does anyone know when the review embargo drops for the new z13? Or maybe that HP laptop version with the same chip?
 
Claiming it can beat a 4070 is a tall claim when the 4070 in question is only allowed to draw 65W... at that wattage, the card loses around 20 FPS compared to one that is allowed to draw 100W-140W, and the 4070 can usually still be OC'd. That's a significant difference in my book. Strix Halo does look impressive, especially for an iGPU, but I wouldn't call this an apples-to-apples comparison when the competitor is power limited. Especially with first-party benchmarks that will make it look as good as possible.
Depends on what power the Strix used; plus, let's not forget we should compare CPU-with-iGPU against CPU+GPU when it comes to performance/watt.

For me the BS started way before, with Nvidia using the same names for their desktop and laptop parts; hiding the details in the fine print is a blatant attempt at misleading their customers. Simply put, many computer buyers don't read the fine print or don't have the background to really understand the details.
 
I'm not surprised it can beat a 4090 on Llama 3 70B; that model is way too big to fit into 24GB of memory and is thus extremely limited on the 4090. You need at least a pair of 3090s/4090s to have a chance of fitting that model in memory, and even then you'd need quantization. Not exactly fair without context, but it is true that a Strix Halo with 128GB of RAM would run 70B much better than a single 4090. 24GB cards are limited to about 34B models with quantization for optimal performance.
I was going to write that 🙂

Except that using multiple GPUs in order to fit bigger models doesn't work either: there is a reason why the professional cards with double or more VRAM capacity cost far more than 2-3 4090 cards.

With LLMs, any time you have to go across the PCIe bus you might as well just do inferencing on the CPU, because the PCIe bus bandwidth is what limits the token rate... well, at least if your CPU isn't a Pentium.

I tested this some years ago with dual V100s, and perhaps two years ago with an RTX 4090 (PNY 3-slot) and an RTX 4070 (PNY 2-slot) in the same workstation (only PNYs would fit), against the Ryzen 9 5950X host, all running Llama-2 70B models: llama2.c allows distributing layers between GPUs and CPU RAM, but unless your model happens to be optimised to need very little bandwidth between those memory pools, the tightest bottleneck determines overall speed.

I believe 2-bit quantizations actually did fit on the RTX 4090 alone, but Llama-2 below 8-bit quantization generally produced unreadable garbage, while at 8-bit it at least produced grammatically correct hallucinations.

So yes, a Llama 70B model would probably fit on Strix Halo with 128GB, but token rates would still likely be below 5 tokens/s, and I find anything below 40 t/s too annoying to bother with.
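
To put some assumed numbers on the PCIe point above: if the layers that spill out of VRAM have to cross the bus for every generated token, the bus becomes the ceiling, and it is a lower ceiling than plain CPU RAM. A sketch with my own ballpark figures (the PCIe 4.0 x16 and dual-channel DDR5 numbers are assumptions, not measurements):

```python
# If spilled weights must cross PCIe for every token, the bus caps the token
# rate below what CPU-only inference straight from DRAM could already reach.
model_gb = 70 * 4 / 8              # 4-bit 70B model, ~35 GB of weights
links = {
    "PCIe 4.0 x16 streaming":      32,   # ~32 GB/s, theoretical
    "CPU-only, dual-channel DDR5": 80,   # ballpark
}
for name, gb_s in links.items():
    print(f"{name:30s} ~{gb_s / model_gb:.1f} tokens/s upper bound")
```

Which is roughly why "just add a second card over PCIe" doesn't buy you much once the model no longer fits in VRAM.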
 
I was going to write that 🙂

Except that using multiple GPUs in order to fit bigger models doesn't work either: there is a reason why the professional cards with double or more VRAM capacity cost far more than 2-3 4090 cards.

With LLMs, any time you have to go across the PCIe bus you might as well just do inferencing on the CPU, because the PCIe bus bandwidth is what limits the token rate... well, at least if your CPU isn't a Pentium.

I tested this some years ago with dual V100s, and perhaps two years ago with an RTX 4090 (PNY 3-slot) and an RTX 4070 (PNY 2-slot) in the same workstation (only PNYs would fit), against the Ryzen 9 5950X host, all running Llama-2 70B models: llama2.c allows distributing layers between GPUs and CPU RAM, but unless your model happens to be optimised to need very little bandwidth between those memory pools, the tightest bottleneck determines overall speed.

I believe 2-bit quantizations actually did fit on the RTX 4090 alone, but Llama-2 below 8-bit quantization generally produced unreadable garbage, while at 8-bit it at least produced grammatically correct hallucinations.

So yes, a Llama 70B model would probably fit on Strix Halo with 128GB, but token rates would still likely be below 5 tokens/s, and I find anything below 40 t/s too annoying to bother with.
According to the AMD propaganda, it gets about 6 tokens per second running at 55 watts (why didn't they run it at 120 watts, per the chip limit? Who knows), which is pretty good!

Also, the 3090 still supported NVLink, so running multiple 3090s in parallel is possible, as they can pool their VRAM with an overhead cost in the single digit percentages.
 
It's a shame that the best 4090/5090 GPUs for AI only deliver 2.5 t/s on simple models like the Llama 3.x-70b.
It's sad that they still get 4.5 stars and best choice for doing local LLMs at such an ugly speed.
And they are extremely bored doing so.

It's the nature of LLMs and their mostly sequential pass through all the weights, which leaves them almost completely VRAM-bandwidth bound. And that's the reason why GPUs built for AI use HBM: HBM3 offers roughly 4x the bandwidth of GDDR6 and 2x that of GDDR7.

And if you tested a Llama 70B model on your 4090 system, just compare it to the CPU-only variant: at least with 8-bit quantisations the results are pretty much identical to what you'd get with the GPU, because everything is just waiting on CPU RAM (odd 2-, 3-, 5- and 7-bit quantisations still tend to profit a tiny bit from the operand/data-type conversion inside the GPU).

No idea how things change with FP4 on the 5xxx, but I doubt that a simple type conversion will do.

At least in theory Strix has an ace here, because you can have blocks of weights share an exponent and then use only 4 bits for each mantissa while retaining much of the precision of an FP8 representation: you could label that "variation scarcity", I guess.
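
For those curious what block-shared scaling looks like in practice, here's a minimal toy sketch of the general idea (my own illustration, with a per-block scale standing in for the shared exponent; it is not AMD's or anyone's actual format):

```python
import numpy as np

def quantize_block_4bit(weights: np.ndarray, block_size: int = 32):
    """Toy block-scaled 4-bit quantization: each block of weights shares one
    scale (in spirit, a shared exponent) and stores 4-bit signed mantissas."""
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric 4-bit range -7..7
    scales[scales == 0] = 1.0
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_block_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, s = quantize_block_4bit(w)
err = np.abs(dequantize_block_4bit(q, s) - w).mean()
print(f"mean abs error: {err:.2e}, storage: ~4 bits/weight plus one scale per 32 weights")
```

Real block formats are more involved, but the storage math is the point: 4 bits per weight plus a small per-block scale, instead of 8 or 16 bits everywhere.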

Don't know if that's something Nvidia does as well, but in any case I doubt these gains come for free; they likely require a model redesign and rather selective precision management across the different layers to maintain what little quality there is in LLMs.
 
According to the AMD propaganda, it gets about 6 tokens per second running at 55 watts (why didn't they run it at 120 watts, per the chip limit? Who knows), which is pretty good!
55 watts probably because they aren't compute bound but limited by DRAM bandwidth. My RTX 4090 stops its fans when running any model that won't fit inside 24GB, because there is so little to do but wait.

I try to simplify things by converting a token to a syllable and making words 2-4 syllables (I'm German, my words are longer :)).

A little more than one word per second? Try that on your wife and see what that does to her patience!
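
Spelling out that heuristic with my own assumptions (roughly one token per syllable, 2-4 syllables per word):

```python
# Convert a token rate into rough words per second, assuming ~1 token/syllable
# and 2-4 syllables per word, per the heuristic above.
tokens_per_s = 6.0
for syllables_per_word in (2, 3, 4):
    print(f"{syllables_per_word} syllables/word -> ~{tokens_per_s / syllables_per_word:.1f} words/s")
# Somewhere between ~1.5 and 3 words per second with these assumptions:
# readable, but far from snappy.
```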

As a benchmark it's perhaps not bad, but in practical terms I'd consider it unusable.
Also, the 3090 still supported NVLink, so running multiple 3090s in parallel is possible, as they can pool their VRAM with an overhead cost in the single digit percentages.

Have a closer look at NVLink; not all versions and variants are created equal. For the 3090 it's a little over 100 GByte/s, which is pretty much DRAM speed these days, and you can only link two cards on anything PC-class.

CUDA code is designed to exploit the terabytes per second of aggregate bandwidth in the massive register files. VRAM access is already falling off a cliff, so much so that common subexpression elimination, a typical staple of compiler optimization on CPUs, is essentially reversed: storing and reloading an intermediate result would often be slower than recomputing it inside registers.

And even with the greatest and latest NVLink switches (7200 GByte/s on Hopper), that's not counting latency.
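
Rough ballpark figures for that bandwidth hierarchy on an Ampere-class card (my numbers, not from the article), just to show how steep each step off the die is:

```python
# Approximate bandwidth tiers an RTX 3090-class GPU has to live with (GB/s).
# Every hop away from the die drops bandwidth by a large factor.
tiers = {
    "register files (aggregate)": 10000,   # multiple TB/s, order of magnitude only
    "GDDR6X VRAM":                  936,
    "NVLink (3090, total)":         112,
    "PCIe 4.0 x16":                  32,
}
for name, gb_s in tiers.items():
    print(f"{name:28s} ~{gb_s:>6} GB/s")
```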

HPC and AI hardware is a little more complex than just putting Lego bricks together. And yeah, I hoped it was much simpler, too. But then I had the opportunity to test and researched a bit deeper.

And now I understand better why prices are the way they are.
 
When are these products actually being released? That seems to be the extremely important information nobody mentions in any of the coverage.

*Edit: Also, looking at the layout in that HP workstation, it could have easily been using LP-CAMM2 and been upgradable rather than having soldered-on RAM.
 
When are these products actually being released? That seems to be the extremely important information nobody mentions in any of the coverage.

*Edit: Also, looking at the layout in that HP workstation, it could have easily been using LP-CAMM2 and been upgradable rather than having soldered-on RAM.
I'm afraid they can only cover what they are being told by AMD. And even AMD only has some contractual control over the earliest release date from OEMs.

LP-CAMM2 sure sounds nice, but there is no way that I'd buy any of this stuff when it's new: affordability-driven patience, I'm afraid.
 
I'm afraid they can only cover what they are being told by AMD. And even AMD only has some contractual control over the earliest release date from OEMs.

LP-CAMM2 sure sounds nice, but there is no way that I'd buy any of this stuff when it's new: affordability-driven patience, I'm afraid.
Except that this very much looks like soldered RAM to me...
 
With jlake3's nice list here and the picture from that HP workstation I'm ready to make a first set of predictions about Strix Halo:
1. It almost certainly will come to mini-PCs in time. It’s not even out yet.
Yup, the HP Z2 proves that point
2. It has quad-channel unified memory, and thus cannot be a drop-in option for AM5 based desktops.
Actually I believe it's 8 channels at 32-bit instead of 4 channels at 64-bit, but that's LPDDR5 vs. DDR5...
The basic truth is that a socket like AM5 can't just double the number of RAM channels, so any Strix Halo product will most likely be soldered BGA, which you can still put into a desktop or tower chassis and give a few slots.
3. It’s equal to a 4070 laptop with a 65W TGP, but that configuration is easily surpassed in gaming by midrange desktop dGPUs.
And there the main question is: how much will you be charged for the "premium" of having an APU instead?
In a desktop: who'd want to pay extra?
In a laptop: the more portable the more you'll have to pay.

So AMD's current strategy is to recycle Apple's Mx argument and charge a premium for an "AI workstation" that is essentially cheaper to make than a CPU/GPU combo with the same performance... while those exact combinations can currently be bought for $600-1000.

For gaming, Strix Halo won't do better than a Lenovo LOQ with an RTX 4070 and a Phoenix APU; for portability you might get twice the battery endurance at iso-performance. But that could be 1 hour instead of 30 minutes: Strix Halo won't enable mobile gaming for eight hours at full APU power.
4. The complexity, the large iGPU, the extremely fast unified RAM, and the fact that it can’t drop into AM5 mean it would likely be very expensive as a desktop APU. Desktops that need good graphics performance, won’t use a dGPU, and will pay a premium for that combination are a slim niche, so it’s not AMD’s launch priority. Someone will fill it eventually, and people will probably complain it’s a poor value.
Actually, mass production seems quite capable of eating the complexity overhead, much of which is just shifted from mainboard and dGPU into the APU.

And AMD's genius is in making sure that not every part is completely new and bespoke: IP is largely recycled and silicon dies might see lots of reuse. Mostly I guess AMD plans to sell quite a lot of these, since Intel has nothing to counter.

So after accounting for the higher APU price vs. Strix Point, the mainboards wouldn't be much more expensive than an AOZ+dGPU combo, and the laptop's extra power-dissipation headroom doesn't carry a linear cost as long as the form factor and weight aren't pushed to crazy limits.

RAM production cost should be far more reasonable than what vendors will want to charge. And those first pictures hint at soldered RAM, which will vastly reduce production cost while vastly inflating the initial sales price.

Unfortunately, adding LPCAMM2 module support would significantly increase mainboard and memory module production cost, only to then have to compete on a cut-throat (consumer friendly) RAM market.

That can only happen if consumers refuse to bite and don't fall for soldered RAM.

Are these systems really in any way "Workstations for transformative AI performance"?

For training, anything this puny has zero mass appeal.

In LLM inference their speed advantage would really just be twice the [CPU] RAM bandwidth for any model that oversteps dGPU VRAM boundaries. That would be a sweet spot only if there were anything sweet or valuable to have there. After spending a couple of hours with the biggest DeepSeek R1 I could fit on my RTX 4090, I am less convinced than ever. It told me Marie Antoinette had no biological mother... few things are as certain as everyone needing a biological mother to exist.

And even then I wonder if it's worth paying twice the price of a current gamer laptop, or getting half the gaming performance of a dGPU system: 6 vs 3 tokens/sec is a lot like 6 vs 3 FPS for gaming.

Especially if you're not that much into gaming or AI, or your current hardware will do nicely.

And then there is Nvidia's new AI NUC...

Well, let's just say: the more you guys buy this stuff this year, the cheaper it will be for me next year.
 
"Up to" is doing the heavy lifting here, but it is a good showing.

I haven't seen any indication of LPCAMM models, just soldered. I hope we see LPCAMM in the future.

As for the high TDP, you have to remember that there are 16 CPU cores in the top part. This is meant to replace an i9-14900HK/Ryzen 9 9955HX and a 4070/5060 at the same time. What is the CPU going to use during gaming tests? Surely at least 15-20 Watts.

I don't think gamers are going to find these relevant because the prices will probably be way too high. Even the 8-core, 32 CU model. Maybe it doesn't have to be, idk. But this is the start of something that AMD can improve each generation.
 