Ok, I'm ready to call this an AMD cheat show.
I've installed Amuse 3.1 on a Hawk Point Mini-ITX with a Ryzen 7 8845HS configured for a 55 W TDP and 64 GB of DDR5-5600 DRAM. GPU-Z reports 238.9 GB/s of bandwidth, but that's most likely the iGPU's last-level cache. Realistically it's more like 55 GB/s, at least as measured from the CPU, this being just a dual-channel setup.
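For anyone who wants to reproduce that ballpark, a plain copy loop like the sketch below (not a tuned STREAM run, so treat the result as a floor) lands near the practical DRAM limit:

```python
# Rough sketch of a CPU-side copy-bandwidth check; numbers are a floor,
# not a proper STREAM benchmark.
import time
import numpy as np

src = np.ones(64 * 1024 * 1024)  # 512 MiB of float64, far beyond any cache
dst = np.empty_like(src)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)
t1 = time.perf_counter()

# Each copy reads src once and writes dst once
gbps = reps * 2 * src.nbytes / (t1 - t0) / 1e9
print(f"~{gbps:.1f} GB/s effective copy bandwidth")
```

Dual-channel DDR5-5600 tops out at 89.6 GB/s theoretical (5600 MT/s x 8 bytes x 2 channels), so ~55 GB/s effective is about what you'd expect.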
The 8845HS only has a 16 TOPS XDNA NPU, but it's discovered and enabled by Amuse automatically for XDNA Super Resolution upscaling. Also, its 780M iGPU is roughly 30% less powerful than the Strix Point 890M.
But in terms of AI-friendly data type support I cannot find any difference between the two: BF16 seems equally supported on both, while the fact that the 50 TOPS Strix NPU also supports BF16 is both an oddity and of zero relevance, as we'll see below.
I've downloaded all AMD models, but used the SD3.5 Medium (AMDGPU) for the test, duplicating the prompt, steps (22), and seed (1544622413) from the blog post:
wide and low angle, cinematic, fashion photography for the brand NPU. Woman wearing a NPU jersey with big letters "NPU" logo text, and brown chinos. The background is a gradient, red, pink and orange, studio setting
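Amuse itself is a closed GUI, but for anyone who wants to replay the same parameters outside of it, a Hugging Face diffusers sketch like the one below does it (an illustration only; seeds are not portable across implementations or backends, so the output won't be pixel-identical to Amuse's):

```python
# Sketch: replaying the blog post's prompt/steps/seed with diffusers.
# Requires access to the gated stabilityai/stable-diffusion-3.5-medium repo.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16,
).to("cuda")  # Amuse uses an ONNX build on AMD instead; "cpu" also works, slowly

prompt = (
    'wide and low angle, cinematic, fashion photography for the brand NPU. '
    'Woman wearing a NPU jersey with big letters "NPU" logo text, and brown '
    'chinos. The background is a gradient, red, pink and orange, studio setting'
)

image = pipe(
    prompt,
    num_inference_steps=22,
    generator=torch.Generator("cpu").manual_seed(1544622413),
).images[0]
image.save("npu_jersey.png")
```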
I'd love to show the results, but evidently I cannot paste images directly here, and I never figured out truly anonymous image hosting.
Long story short: the actual pictures do not resemble what's shown in the blog post at all. Those were obviously doctored, even if the eyes on those first three identical triplets obviously weren't fixed entirely.
There are the typical anatomical deficiencies in the toucan, like a three-part beak (or extra eyes all over), rendering[!] the unfortunate beast unable to eat. And the "NPU" lettering on the jersey suffers the well-known issues of text rendering in Stability AI diffusion models: you're lucky if the result is three letters, or readable at all. My first render with the above parameters reads more like "NPG" instead.
So that's misleading on the achievable quality.
Performance should logically be next, and comparing the 780M (RDNA 3.0) against the 890M (RDNA 3.5) might show quite an uplift, given the bigger raw resource allocation. But at 270 seconds for a single image on my machine, even a 4x improvement would still mean almost 70 seconds per picture; this is far from the "Rapid Iterations, No Re-Shoots" AMD promises.
I'd label that misleading on productivity.
But let's focus on the NPU: the Windows Task Manager fails to pick up NPU usage, at least with Hawk Point. My NPU driver is version 32.0.203.258 from April 1st this year (funny, that) and came with the latest and greatest AMD driver suite 25.6.1, which is still reported as current.
Likewise, HWiNFO doesn't pick up all the NPU counters it is prepared for; specifically, wattage and utilization remain at zero. The only indication of usage is an NPU clock, the first sign of life I've ever seen from the NPU. It reports up to 1.6 GHz, which I find implausible, but it does show when and for how long the NPU is used.
And surprise, surprise, that happens at the very end, just before the GPU-generated image is NPU-upscaled and displayed.
And that happens in a heartbeat, even on the lowly 16 TOPS NPU, so that's misleading on the NPU benefits.
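Back-of-envelope, using my own numbers (the one-second upscale figure is a guess from how briefly the NPU clock blips):

```python
# Rough estimate of how much of the pipeline the NPU actually touches.
gen_s = 270.0    # measured: GPU diffusion time for one image
upscale_s = 1.0  # assumed: the NPU clock blip lasts about a second
print(f"NPU share of the run: {upscale_s / (gen_s + upscale_s):.1%}")  # ~0.4%
```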
As I wrote before, BF16 weights are very atypical in the embedded AI models of NPUs, which are designed for real-time flow-through, not the iterative batch operations of diffusion or large language models. Their stream architecture holds weights in local SRAM while data flows through, preferably via DMA engines fetching data from CPU RAM or memory-mapped sensors.
AMD may be very proud of teaching its NPUs new tricks with block encoding of exponents to store BF16 data at INT8 density, but a) that trick isn't guaranteed to work, and b) that's still twice the footprint of INT4 and certainly no good with billions of weights.
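For those who haven't met the trick: it's block floating point, one shared exponent per small block of weights with 8-bit mantissas. A toy sketch of the general technique (my own illustration, not AMD's actual encoding):

```python
# Toy block-floating-point roundtrip: one shared exponent per block of 32
# weights, INT8 mantissas. Illustrative only.
import numpy as np

def bfp_roundtrip(w: np.ndarray, block: int = 32) -> np.ndarray:
    w = w.reshape(-1, block)
    # One exponent per block, taken from the largest magnitude in the block
    shared_exp = np.floor(np.log2(np.abs(w).max(axis=1, keepdims=True) + 1e-38))
    scale = 2.0 ** shared_exp
    # 8-bit signed mantissas relative to the shared exponent
    mant = np.clip(np.round(w / scale * 64), -128, 127).astype(np.int8)
    return mant.astype(np.float32) / 64 * scale  # dequantized view

w = np.random.randn(1024).astype(np.float32)
err = np.abs(bfp_roundtrip(w) - w.reshape(-1, 32))
print(f"max roundtrip error: {err.max():.4f}")
```

The failure mode is visible right in the math: one outlier in a block sets the shared exponent and crushes the precision of its 31 neighbors, which is exactly why the trick isn't guaranteed to work.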
Clearly someone was given the thankless task of weaving a tale of value around an NPU that is plain useless, though perhaps at least usable with some tricks now. The result is a bag of misleading text and images that imply NPU benefits simply because the NPU shares a room with the GPU doing the work.
To summarize:
The image generation passes do not use the NPU at all; the brunt of the work is done on the GPU.
The NPU is only used for image upscaling, which exploits a fraction of its potential and is little more than a digital zoom, designed to handle even video and not comparable to diffusion-model AI upscaling at all. More to the point, this doesn't demonstrate the benefit of a 50 TOPS NPU over an older 16 TOPS variant, because the NPU handles less than 1% of the total work.
Laptops won't be able to run AI-based image or video generation at reasonable speed and quality, not now and not for a long time, extrapolating from any currently known technology. NPUs don't change that--by design.
IMHO desktops and dGPUs aren't much better; all of these are designed as appetizers for cloud services.
As I said leading in, this is a bad cheat show that reminds me of Intel's worst.
AMD: you don't need to bow that low, your stuff will sell well enough without any of this garbage.
I fear that these days doing your worst only boosts your reputation, but somehow I still believe I'm in a bad dream and hope to wake up soon.