News AMD unveils industry-first Stable Diffusion 3.0 Medium AI model generator tailored for XDNA 2 NPUs — designed to run locally on Ryzen AI laptops

With a 9GB model, you'd be far better off using a 12GB GPU, way faster and cheaper even in a laptop.

Strix Halo is great technology... at a terrible price

It's designed to be cheaper than a dGPU by using a wider bus on commodity DRAM, but they charge HBM prices for it.
 
With a 9GB model, you'd be far better off using a 12GB GPU, way faster and cheaper even in a laptop.

Strix Halo is great technology... at a terrible price
This isn't about Strix Halo, this is about XDNA2. That's included in Strix Point, Krackan, and even the newly announced Ryzen AI 5 330. So basic image generation capabilities using a 50 TOPS NPU will come to even sub-$400 laptops.

If it really needs 24 GB instead of 16 GB to run fast (or at all), that could be a problem in that segment, given there are many systems with soldered RAM. Hopefully all laptops with "AI" in the processor name will come with more than 12 GB, ideally matching the Windows Copilot+ requirement of 16 GB, but 24-32 GB could be rare. So you'd need 1-2 SODIMM slots to add more memory yourself.
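To put the memory question in perspective, here is a rough budget sketch in Python. Only the ~9 GB weights figure comes from the article; the activation and OS overhead numbers are illustrative assumptions, not measurements.

# Rough memory-budget sketch for the ~9 GB SD3 Medium package on a laptop.
# Only the 9 GB weights figure comes from the article; the other numbers are
# illustrative assumptions, not measurements.
weights_gb     = 9.0   # packaged model weights (per the article)
activations_gb = 2.0   # assumed working set for latents/activations
os_and_apps_gb = 6.0   # assumed Windows + background apps + the Amuse UI

total_gb = weights_gb + activations_gb + os_and_apps_gb
for installed in (16, 24, 32):
    print(f"{installed} GB installed -> {installed - total_gb:+.1f} GB headroom")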
 
...
Strix Halo is great technology... at a terrible price
...
Given the specs, it looks like it's targeting professionals, who are willing to pay a premium for a premium product. Besides, you can charge a premium when there is no competition in this space. Intel is asleep, and Nvidia doesn't have an x86 license.

For the rest of us, a Strix Halo with the same GPU, half the CPU cores, and 32-64 GB of RAM would make more sense. Supposedly those SKUs are in the pipeline. I read today that some Chinese outfit is bringing those variants to desktops.
 
Given the specs, it looks like it's targeting professionals, who are willing to pay a premium for a premium product. Besides, you can charge a premium when there is no competition in this space. Intel is asleep, and Nvidia doesn't have an x86 license.
The story is not about Strix Halo, it's about the XDNA2 NPU, which is also available in millions of Strix Point and Krackan laptops and mini PCs. All with around the same 50-55 TOPS of performance.

Maybe it can run faster on Strix Halo, though. I don't know if the NPU itself benefits from the additional memory channels and bandwidth, or if it's only the iGPU that needs them. Running Stable Diffusion on Strix Halo's iGPU may be better.
 
With a 9GB model, you'd be far better off using a 12GB GPU, way faster and cheaper even in a laptop.

Strix Halo is great technology... at a terrible price

It's designed to be cheaper than a dGPU by using a wider bus on commodity DRAM, but they charge HBM prices for it.
Folks, easy to beat up anything with "AI" slapped on it, but let's step back a bit here.

Sure, Strix Halo is expensive, but I don't know that faster and cheaper is always the case in the comparison being used here. As others already pointed out, regular Strix comes in a lot lower, so the trade-off could be cheaper and slower. Anyone doing serious AI inferencing should absolutely rely on a PC with a dGPU and ideally 16 GB of VRAM or more, since that lets a lot more models run entirely in VRAM rather than shared with system RAM, or at least keeps a greater portion in VRAM. That said, the whole point of NPUs on laptops was to strike a balance: some level of decent AI inferencing performance while maintaining decent battery life. Fire up a dGPU to about 100% utilization on a laptop and battery life falls off a cliff. If it's more of a workstation laptop that typically stays on a charger, that's a different scenario.

I'm seeing laptops for about $600 on Amazon (ASUS, etc.) and elsewhere with Ryzen AI 5 340, 16 GB of RAM. 32 GB RAM models are significantly higher, which I agree is silly, but I think this is an OEM problem and not so much AMD pricing. In any case, those prices should eventually come down as they saturate marketplaces.

Also realize there are smaller models; the purview of this article is this specific AI model, which recommends 32 GB of RAM. Laptops at $600 and even lower can run AI models for inferencing locally, and at much faster speeds than traditional laptops with faster CPUs but no NPUs. No, not everyone cares about this today, but the presumption is that the use case only grows as the technology comes of age, and the regret down the road would be not having an NPU.
 
This isn't about Strix Halo, this is about XDNA2. That's included in Strix Point, Krackan, and even the newly announced Ryzen AI 5 330. So basic image generation capabilities using a 50 TOPS NPU will come to even sub-$400 laptops.

If it really needs 24 GB instead of 16 GB to run fast (or at all), that could be a problem in that segment, given there are many systems with soldered RAM. Hopefully all laptops with "AI" in the processor name will come with more than 12 GB, ideally matching the Windows Copilot+ requirement of 16 GB, but 24-32 GB could be rare. So you'd need 1-2 SODIMM slots to add more memory yourself.
Thanks for the clarification!

But that has me much more sceptical about the speed and quality of the results. I've spent far too many hours trying to get diffusion models to produce something I had in mind using an RTX 4090 to believe I could be productive with something so puny (yes, the real issue may be in front of the machine...): there is no secret weapon in those NPUs beyond tiling the workload, low precision, and all weights in SRAM caches. Their design reminds me of the Movidius chips, which are very much a DSP design aimed at tiny dense models for real-time audio and image enhancements, not generative AI models.

At a much larger scale it's how Cerebras works as well, but there we aren't talking €300 laptops.

I just cannot imagine the main model really running on the NPU; they must be using the GPU for the image generation and the NPU for upscaling. That would be a bespoke solution that may be little more than a one-off tech demo, with little chance of working with any other hardware, even their own a generation older or younger, because you'd have to redesign the workload split between such heterogeneous bits of hardware: there is no software stack to support that generically.

To me it has the whiff of marketing people desperately searching for a market for their product, and I can only advise you to verify the functionality you expect or need before committing money you care about.
 
there is no secret weapon in those NPUs beyond tiling the workload, low precision, and all weights in SRAM caches. Their design reminds me of the Movidius chips, which are very much a DSP design aimed at tiny dense models for real-time audio and image enhancements, not generative AI models.
A quick search revealed this document, a work sponsored by AMD, but comparing CPU vs. NPU efficiency and speeds.

Unfortunately it doesn't detail data types, because I'd be quite surprised if the NPU were actually used, or efficient, at BF16: that's already a very big format for holding significant numbers of weights in SRAM caches; four- or eight-bit representations seem much more adequate, but perhaps AMD is using shared exponents here.

But once the NPU has to use those 9 GB of weights in DRAM, there is just no way it would be significantly more efficient than a modern GPU; if it could be done, GPUs would do it, too.
 
Given the specs, it looks like it's targeting professionals, who are willing to pay a premium for a premium product. Besides, you can charge a premium when there is no competition in this space. Intel is asleep, and Nvidia doesn't have an x86 license.
You don't need x86 to run diffusion models. Nvidia's Digits is all about proving that.
Unfortunately it seems they aren't willing to also forgo Windows in a consumer environment.
For the rest of us, a Strix Halo with the same GPU, half the CPU cores, and 32-64 GB of RAM would make more sense. Supposedly those SKUs are in the pipeline. I read today that some Chinese outfit is bringing those variants to desktops.
Yeah, for the vast majority of consumers Strix Halo is too much CPU for the GPU that comes with it, so a single CCD makes for a more rounded consumer device, even if I'd obviously spend the extra €50 a second CCD would cost AMD to make.

I'd also spend €100 extra to go from 64GB to 128GB of RAM, but that's where the trouble is: AMD is adding zeros to those prices at the end where it hurts.
 
I just cannot imagine the main model really running on the NPU; they must be using the GPU for the image generation and the NPU for upscaling
The small print on the Amuse-ai website seems to confirm the assumption that it's only the upscaling that runs on the NPU. A single slide in the blog post on AMD's website says that both stages run on the NPU, while the rest again claims "XDNA2 supersampling" [only].

AMD might claim a 'clerical [AI?] error' on that later, but if anyone happens to have a Strix Point device, perhaps they could validate that both stages run on the NPU: HWinfo and the task manager will tell.

I don't mind being wrong as much as I dislike being misled by 'creative' advertising.

In theory I should even be able to test this on one of my Hawk Point/Phoenix APUs, since to my understanding the difference between these NPU generations is only the number of tiles. But actually I understand next to nothing, because I can't find documentation. Perhaps I'll try that later, now that I have a desktop Hawk Point. On the Phoenix laptop I ran out of patience with the older, non-upscaling variant.

In any case, while diffusion models are widely used to implement upscaling, do not expect similar results from what the NPUs do here: their ability will be more akin to a digital zoom on your smartphone camera, or what a SmartTV does to fill those 4K pixels with something.
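To illustrate the class of result being described, here is a minimal Python sketch of a purely classical, non-generative 2x upscale with Pillow. The input filename is hypothetical, and this is obviously not AMD's actual upscaler, just a reference point for what "digital zoom"-style scaling looks like compared to diffusion-based upscaling that hallucinates new detail.

# Classical, non-generative 2x upscale with Pillow -- roughly the class of result
# a lightweight scaler produces, unlike a diffusion upscaler that invents detail.
# "render.png" is a hypothetical input file.
from PIL import Image

img = Image.open("render.png")
img.resize((img.width * 2, img.height * 2), Image.LANCZOS).save("render_2x.png")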

And of course AMD isn't unveiling a "model generator", but a model that generates images...

Either Anton had a bad day or somebody else did the headline (I've heard that before).
 
AMD should ditch Amuse and focus on improving whatever open-source project(s) instead.

It's a steaming pile of garbage: locked down with regard to model selection, and with censorship that's even sillier than what OpenAI enforces.
 
I just cannot imagine the main model really running on the NPU; they must be using the GPU for the image generation and the NPU for upscaling. That would be a bespoke solution that may be little more than a one-off tech demo, with little chance of working with any other hardware, even their own a generation older or younger, because you'd have to redesign the workload split between such heterogeneous bits of hardware: there is no software stack to support that generically.
I've heard the RTX 3060 has only 100 TOPS. Nvidia doesn't list that but they have RTX 4060 at 242 TOPS: https://www.nvidia.com/en-us/geforce/graphics-cards/compare/

I guess it still has to be verified to what extent the NPU is being used here, but someone could do that and test the speed of generation too.

It should be noted that the Ryzen AI 5 330 only has 2 compute units, but the same XDNA2 as other models. So it's unlikely that the iGPU would be much help there.

I don't think AMD is ditching the NPU soon, although some would like them to. I expect we'll see an XDNA3 at 100 TOPS within the next two years. XDNA1 early adopters have 10-16 TOPS and no BF16 support, so they are hosed. XDNA2+ will catch on as long as Microsoft keeps pushing Copilot+.
 
I don't think AMD is ditching the NPU soon,
Ditching Amuse and the NPU isn't quite the same thing.

NPUs may be new to PCs, but they have been around on mobile for many generations and have proven their use there, e.g. the famous Hexagon DSPs from Qualcomm, which are also their technological root as real-time data transformation enablers.

But just because NPUs and GPUs both emulate neural networks doesn't mean they are mutually replaceable or run similar software: ultimately their use cases grow to the point of some overlap, which is similar to what happened with CPUs and GPU compute.

And TOPS figures may be as meaningful as comparing GPU and CPU core counts or clocks.

The much more advanced functional differentiation of accelerator types on mobile was driven by much harder energy constraints and by their use as media players and cameras. Hexagons added all kinds of abilities far beyond their original DSP functionality and have become a SoC (on SoC) of their own, with a multitude of diverse functional blocks.

In PCs, especially those with desktop power budgets, both energy and media acceleration are less pressing demands and it's quite frankly harder to sell, especially with the huge hardware fragmentation and legacy APIs.

Microsoft's main interest was to push NPUs for Copilot and local user-data insight mining, just as the latest generation of NPUs started to see similar use by Meta, Google and their Asian equivalents. They mainly didn't want users to see their big fat paws always in, and even blocking, the user-data cookie jar: CPUs and GPUs are too visible, and people investigate when they slow down.
 
Ok, I'm ready to call this an AMD cheat show.

I've installed Amuse 3.1 on a Hawk Point Mini-ITX system with a Ryzen 7 8845HS configured for a 55 W TDP and 64 GB of DDR5-5600 DRAM. GPU-Z reports 238.9 GB/s of bandwidth, but that's most likely the iGPU's last-level cache. Realistically it's more like 55 GB/s, at least measured from the CPU, being just dual channel.
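For reference, a quick back-of-the-envelope calculation of what plain dual-channel DDR5-5600 can theoretically deliver, which makes clear the 238.9 GB/s GPU-Z figure cannot be DRAM bandwidth, while the ~55 GB/s measurement is in the plausible range:

# Theoretical peak bandwidth of dual-channel DDR5-5600.
channels        = 2
bytes_per_xfer  = 8        # 64-bit channel (2 x 32-bit sub-channels)
transfers_per_s = 5600e6   # DDR5-5600

peak = channels * bytes_per_xfer * transfers_per_s / 1e9
print(f"theoretical peak: {peak:.1f} GB/s")   # ~89.6 GB/s; real-world copies land lower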

The 8845HS only has a first-generation XDNA NPU (10-16 TOPS class), but it's discovered and enabled by Amuse automatically for use in XDNA2 upscaling. Also, its 780M iGPU is roughly 30% less powerful than Strix Point's 890M.

But in terms of AI-friendly data type support I cannot find any differences between the two iGPUs: BF16 seems equally supported on both, while the fact that the 55 TOPS Strix NPU also supports BF16 is both an oddity and of zero relevance, as we'll see below.

I've downloaded all AMD models, but used the SD3.5 Medium (AMDGPU) for the test, duplicating the prompt, steps (22), and seed (1544622413) from the blog post:
wide and low angle, cinematic, fashion photography for the brand NPU. Woman wearing a NPU jersey with big letters "NPU" logo text, and brown chinos. The background is a gradient, red, pink and orange, studio setting
I'd love to show the results, but evidently I cannot paste directly and I never figured out truly anonymous image hosting.
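For anyone who wants to compare against a dGPU, here is a minimal Python sketch of reproducing the same prompt, step count and seed with the open-weights SD3.5 Medium via Hugging Face diffusers. This is not the ONNX path Amuse uses internally, it assumes a CUDA GPU plus access to the gated model on Hugging Face, and the output will not match Amuse bit-for-bit since samplers and RNG differ.

# Reproduce the blog-post prompt / steps / seed on a CUDA dGPU with the
# open-weights SD3.5 Medium via diffusers -- for comparison only, not the
# ONNX/NPU path Amuse uses.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

prompt = ('wide and low angle, cinematic, fashion photography for the brand NPU. '
          'Woman wearing a NPU jersey with big letters "NPU" logo text, and brown '
          'chinos. The background is a gradient, red, pink and orange, studio setting')

image = pipe(
    prompt,
    num_inference_steps=22,
    generator=torch.Generator("cuda").manual_seed(1544622413),
).images[0]
image.save("npu_jersey.png")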

Long story short, the actual pictures do not resemble what's shown in the blog post at all; those were obviously retouched, even if the eyes on those first three identical triplets obviously weren't fixed entirely.

There are the typical anatomical deficiencies in the toucan, like a three-part beak (or extra eyes all over), rendering[!] the unfortunate beast unable to eat, and the "NPU" lettering on the jersey suffers the well-known issues of text rendering in Stability AI diffusion models: you're lucky if the result is three letters or even readable; my first render with the above parameters reads more like "NPG" instead.

So that's misleading on the achievable quality.

Performance should logically be next, but here we'd be comparing the 780M (RDNA 3) against the 890M (RDNA 3.5), which might be quite a bit of an uplift given the bigger raw resource allocation. But at 270 seconds for a single image, even a 4x improvement would be far from the "Rapid Iterations, No Re-Shoots" AMD promises.

I'd label that misleading on productivity.

But let's focus on the NPU: the Windows Task Manager fails to pick up NPU usage, at least with Hawk Point. My NPU driver is version 32.0.203.258 from April 1st this year (funny, that) and came with the latest and greatest AMD driver suite 25.6.1, which is still reported as current.

Likewise, HWinfo doesn't pick up all the counters from the NPU that it was prepared for; specifically, watt numbers and utilization remain at zero. The only indication of usage is an NPU clock, the first sign of life from the NPU I've ever seen. However, it reports up to 1.6 GHz, which I find implausible. But it does show when and how long the NPU is used.

And surprise, surprise, that happens at the very end, just before the GPU generated image is NPU upscaled and displayed.

And that happens in a heartbeat, even on the lowly 10 TOPS NPU, so that's misleading on the NPU benefits.

As I wrote before, BF16 weights are very atypical in the embedded AI models of NPUs, which are designed for real-time flow-through, not the iterative batch operations of diffusion or large language models. Their stream architecture holds weights in local SRAM while data flows through, preferably via DMA engines fetching data from CPU RAM or memory-mapped sensors.

AMD may be very proud of teaching its NPUs new tricks with block encoding of exponents to store BF16 data at INT8 storage density, but a) that trick isn't guaranteed to work and b) that's still twice INT4 and certainly no good with billions of weights.
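For readers unfamiliar with the trick: block floating point keeps one shared exponent per small block of weights and stores only low-bit mantissas. The Python sketch below illustrates the general idea with numpy; it is not AMD's actual encoding, just a toy version, and relative to BF16 the storage ends up at roughly INT8 density.

# Toy block-floating-point encoder: one shared exponent per block of weights,
# 8-bit signed mantissas. Not AMD's actual format -- just the general idea of
# storing BF16-ish values at roughly INT8 density.
import numpy as np

def bfp_encode(w, block=32):
    w = w.reshape(-1, block)
    # shared exponent per block, chosen so every mantissa fits in [-1, 1]
    exp = np.ceil(np.log2(np.abs(w).max(axis=1, keepdims=True) + 1e-38)).astype(np.int8)
    mant = np.clip(np.round(w / 2.0 ** exp * 127), -127, 127).astype(np.int8)
    return mant, exp

def bfp_decode(mant, exp):
    return mant.astype(np.float32) / 127 * 2.0 ** exp.astype(np.float32)

w = np.random.randn(1024).astype(np.float32)
mant, exp = bfp_encode(w)
print(f"{mant.nbytes + exp.nbytes} bytes packed vs {w.nbytes} bytes as FP32, "
      f"max error {np.abs(w - bfp_decode(mant, exp).ravel()).max():.4f}")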

Clearly someone was given the thankless task of having to weave a tale of value around an NPU that is plainly useless here, but perhaps at least usable with some tricks now. And the result is a bag of misleading texts and images that seem to imply NPU benefits simply because the NPU shares a room with the GPU doing the work.

To summarize:
The image generation passes do not use the NPU at all; the brunt of the work is done on the GPU.
The NPU is only used for image upscaling, which uses a fraction of its potential and is little more than a digital zoom, designed to handle even video and not comparable to diffusion-model AI upscaling at all. More specifically, this doesn't demonstrate the benefits of a 55 TOPS NPU over an older 10 TOPS variant, because the NPU is used for less than 1% of the total work.

Laptops won't be able to run AI based image or video generation with reasonable speed and quality currently, nor for a long time, extrapolating any currently known technology. NPUs don't change that--by design.

IMHO desktops and dGPUs aren't much better, all of these are designed as appetizers for cloud services.

As I said leading in, a bad cheat show that reminds me of Intel's worst.

AMD: you don't need to bow that low, your stuff will sell well enough without any of this garbage.

I fear these days doing your worst only increases reputation, but somehow I still believe I am in a bad dream and hope to wake up soon.
 
For the kicks of it (sorry for all the blurb!), I've also tried one of the bigger models, in this case SD3.5, on the Hawk Point, to see if unified memory management works as expected.

It does: the 780M iGPU runs with 24 GB of the system's 64 GB of RAM used for model weights, but trust me, performance isn't at the "rapid iterations" level where AI lets you avoid re-shoots of photos or videos.

At least "NPU" actually read as "NPU" on the first render, but again the 10 TOPS NPU work was done in a heartbeat, no need (or use) for the 5x variant.
 
On a roll, now...

Since the SD3.5 Turbo seemed so much better and faster I tried doing a sequence of renders, which pushes up the GPU memory usage. As it neared 34GB the system (or at least the screen) locked up just when it tried its XDNA2 upscaling and had to be rebooted.

So I had a look at the logs and among the first things that are being recorded are those limits:

The NPU says it can handle up to 30 GB of RAM (out of 64GB available overall) while the GPU says it can handle up to 34 GB, 4GB of which are statically allocated for exclusive GPU use (as per BIOS).

I guess someone isn't checking their error codes nor configuration limits...

Still, that's a little further out than I can normally achieve via my RTX 4090 with only 24GB of VRAM...
 
This is sorely needed, because Windows has failed to deliver any tangible impact on daily AI use, which has left the NPU useless for most. Now at least normal people can have some fun generating images with SD 3.

Note that SD 3 is a very old model (released in June 2024) and has been superseded by newer models like SD 3.5 and FLUX.1.
 
Since Amuse runs just as well on Nvidia and Intel GPUs (albeit without the XDNA2 upscaling), I gave that a twirl.

No functional issues on an RTX 5070 or a B580, just the usual VRAM capacity issue: if you run out of VRAM, performance drops off a cliff. Avoiding that is the theoretical benefit of a unified memory architecture.
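A trivial Python sketch of the check involved, assuming a CUDA GPU and using the article's 9 GB weight figure (actual runtime usage will be higher, so treat the threshold as illustrative):

# Will the packaged weights even fit in VRAM, or will they spill into system RAM?
# 9 GB is the model size from the article; runtime usage is higher in practice.
import torch

weights_gb = 9.0
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"VRAM {vram_gb:.1f} GB vs weights {weights_gb:.1f} GB ->",
      "fits" if weights_gb < 0.9 * vram_gb else "will spill")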

On the RTX 4090 using the SD3.5 Turbo model, I was finally able to recreate the quality from their blog post.

But that's a very different model and not exactly battery powered or using an NPU...

Shame on you AMD marketing guys!
 

@abufrejoval Apple's chips can do both generation and upscaling solely on the NPU. Is it because of your 10 TOPS that Amuse puts the generation on the GPU? I can't think of a reason why AMD's NPU can't do it.

I've been wrong before, which is why I hope someone with the proper hardware will be able to clarify and report performance and energy consumption data.

Yet I find it very hard to believe that AMD would duplicate the very same functionality within a single chip at a similar level of performance but very distinct levels of energy consumption.

Now you might argue that this is already the case with CPUs and GPUs doing AI, but between those two inventions there are decades.

Modern GPGPU designs should incorporate within them anything that would make diffusion models orders of magnitude more efficient at similar performance, not put that into an extra NPU for fun. If NPUs would generally deliver what GPUs do in AI only much more efficiently, Nvidia would be selling NPUs not GPUs.

I cannot discount that AMD might have put a layer or two of the neural networks involved on the NPU for a bit of acceleration. Come to think of it, it even makes a little sense, because these pipelines combine two different types of network essentially working against each other, one to generate, the other to recognize.

And the recognition model might well be rather suitable for an NPU.

Still, the question remains whether it significantly improves the speed or the energy efficiency of the whole process, and whether that makes it worthwhile for AI vendors to accommodate such splits in the most popular works.

And there I have more doubts than time to write them all down: The full generative model running on an NPU just seems so unlikely I won't just accept their marketing fluff.

Certainly not when there are obvious inconsistencies between their text and pictures and what you actually see on screen.
 
Dang, this got intense, lol.

Thank you for that deep-dive and elaboration, @abufrejoval . I don't mean to hype up NPU's... indeed Windows uses Phi Silica, a Small Language Model, to run things locally on Copilot PC's, as benefits are limited today. The two letters "AI" have been the biggest technology marketing term in many years, with things like "Copilot+" just being specific branding of this.

Anyways, I haven't tried Amuse on my Ryzen AI 375 HX work laptop with 32 GB of RAM, only on my Radeon RX 7900 rig at home. The difference between the SDXL 3 and 3.5 models is noticeable, along with the different optimizations, fine-tunes, etc. I've played around with LM Studio far more, including on this HX 375 laptop. In a lot of cases I don't see any NPU utilization according to Task Manager, which doesn't really matter anyways, as the 890M is surprisingly strong for an iGPU. Maybe the unified memory benefit mentioned earlier? I can't say for certain.
 
Dang, this got intense, lol.
I don't typically do these deep dives in public :)
Thank you for that deep-dive and elaboration, @abufrejoval . I don't mean to hype up NPU's... indeed Windows uses Phi Silica, a Small Language Model, to run things locally on Copilot PC's, as benefits are limited today. The two letters "AI" have been the biggest technology marketing term in many years, with things like "Copilot+" just being specific branding of this.
There are quite a few people out there who do understand what Microsoft is after here. It's never that difficult, because most of the time M$ is just trying to catch up, and in this case the fact that Google and Meta were able to [ab]use NPUs on mobiles to generate insights locally, instead of having to manage the raw data, got them to make that a prerequisite for 'their' platform, too.

Of course, spying on desktop users is a hard sell and M$ isn't having much luck, while the chip vendors just want to sell new hardware.
Anyways, I haven't tried Amuse on my Ryzen AI 375 HX work laptop with 32 GB of RAM, only on my Radeon RX 7900 rig at home. The difference between the SDXL 3 and 3.5 models is noticeable, along with the different optimizations, fine-tunes, etc. I've played around with LM Studio far more, including on this HX 375 laptop. In a lot of cases I don't see any NPU utilization according to Task Manager, which doesn't really matter anyways, as the 890M is surprisingly strong for an iGPU. Maybe the unified memory benefit mentioned earlier? I can't say for certain.
Amuse hasn't been my go-to GUI for AI either; I guess that's why the company is partnering with AMD to gain some traction. And admittedly the software has improved a bit over the previous release, but I prefer LM Studio for what I do, as well as older tools or even the occasional bit of Python on Linux.

Yes, the 3.5 models result in much more beautiful pictures, but that may be just the thing: they are trained so intensively on marketing material that marketing material is about all they produce well. And that's the fundamental issue with all these models: they can never be better than the data they were trained on, only more 'stoned'.

The task manager doesn't seem able to read NPU usage, that's why I used HWinfo's NPU clock as a usage indicator: that was the only sensor that detected NPU usage at all.

I did find a message in the Amuse log files which indicates that it would use my NPU only for upscaling, not for the diffusion side. So there is a clear indication that 55 TOPS NPUs are treated differently; lack of BF16 support in the older variants would be a good reason. And this becomes more evident when you select the RyzenAI variant on a machine that doesn't have a suitable NPU: starting generation results in an application error for the NPU-specific layers. The XDNA2 upscale button is also greyed out on the Nvidia and Arc GPU systems I tried.
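If, as I suspect, those RyzenAI model variants are ONNX graphs dispatched through ONNX Runtime, anyone can check which execution providers are present and which ones a session actually binds to. A minimal Python sketch under that assumption; the model path is hypothetical, and the VitisAI provider is the Ryzen AI NPU backend in AMD's ONNX Runtime builds:

# Check which ONNX Runtime execution providers exist on this machine and which
# ones a session actually binds to. "model.onnx" is a hypothetical path; the
# assumption that Amuse's RyzenAI variants go through ONNX Runtime is mine.
import onnxruntime as ort

wanted = ["VitisAIExecutionProvider",   # Ryzen AI NPU backend, if installed
          "DmlExecutionProvider",       # DirectML (GPU) fallback
          "CPUExecutionProvider"]
available = ort.get_available_providers()
print("available:", available)

sess = ort.InferenceSession("model.onnx",
                            providers=[p for p in wanted if p in available])
print("bound to:", sess.get_providers())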

So there are two distinct NPU functionalities supported by Amuse: upscaling, which is so fast there is no issue with older NPUs, and diffusion support, where only the latest generation of NPUs is supported.

And from some more deep-diving, my current working hypothesis is that the SD3 Medium (AMDGPU) model is the only one that has some degree of NPU support baked in. In Expert mode you get to choose between a default and a RyzenAI variant after loading.

And looking at the model files seems to also confirm my last post, that NPU support is limited to some layers in this RyzenAI variant, because you'll find those layers in a separate directory on disk.

Except that I got the generator and the recognizer parts wrong: the NPU supports the decoder/generator while the GPU runs the encoder/recognizer.

Just how that split results in either speedup or lower energy consumption can only be measured if you can find working instrumentation.

But what remains misleading is the impression that the entire AI workload runs on the NPU.

And what remains to be proven is that partial support of the NPU provides significant value for the overhead of doing a bespoke model design either in speedup or energy savings.

Of course, if AMD could demonstrate that moving layers between the GPU and the NPU is just pushing a button, that last point becomes moot. Again, I'd be very sceptical from past experience, but I'm just a small fish in the AI pond who doesn't even want to be there.

What also runs on the NPU is the UNet, and that looks like a classic NPU/DSP workload to me, one that might actually benefit if the data transfer doesn't kill it. Here the fact that the NPU, GPU and CPU run in a unified memory space might make that transfer much less costly than on other architectures. But that's just me guessing.
 