A major problem it could face is how big LLMs tend to be. One of the things that makes integrated NPUs so interesting is that they can work directly from main memory. 192 GB is enough to fit a GPT-3 class model (175B parameters at 8-bit is roughly 175 GB). Desktops will soon get a bump to 256 GB, while laptops should handle at least 128 GB.
NPUs aren't integrated so much because they can access main memory: GPUs can do that too, especially iGPUs.
The main reason for NPUs is power efficiency. CPUs or GPUs might achieve similar performance (all of them limited by RAM bandwidth), but they spend way more watts. On a notebook, that kills the AI via an empty battery.
Large language models need a lot of RAM, but they also need huge bandwidth, because generating a token means combing through all, or at least a large portion, of the weights: it's not your typical HPC random-access pattern, but an exhaustive pass over the model for every token. If your LLM is twice the size of another, the same RAM bandwidth means half the token rate.
At 128/192/256 GB of RAM, even if it's good for 100 GB/s (typical dual-channel DDR5) or 200 GB/s (quad-channel LPDDR5), LLMs that size become very, very boring indeed: single-digit tokens per second. And it doesn't matter at all whether you're processing on an NPU, CPU or GPU; memory bandwidth is your only constraint at that point (not rumors, not second-hand opinions, I tested, extensively).
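To put rough numbers on that, here's a minimal back-of-the-envelope sketch (the model sizes and bandwidth figures are illustrative assumptions, not measurements): the decode rate of a dense model is capped by memory bandwidth divided by the bytes streamed per token.

```python
# Bandwidth ceiling on LLM decode speed: generating one token requires
# streaming (nearly) all of the quantized weights through the memory bus,
# so tokens/s <= bandwidth / model_size. KV cache and activations are
# ignored here; they only make the picture slightly worse.

def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-limited upper bound on tokens/s for a dense model."""
    return bandwidth_gb_s / model_size_gb

# Illustrative configurations (assumed sizes and bandwidths):
configs = [
    ("70B @ 4-bit (~40 GB), dual-channel DDR5 (100 GB/s)",    40, 100),
    ("70B @ 4-bit (~40 GB), quad LPDDR5 (200 GB/s)",          40, 200),
    ("175B @ 8-bit (~175 GB), dual-channel DDR5 (100 GB/s)", 175, 100),
    ("175B @ 8-bit (~175 GB), 1000 GB/s VRAM, if it fit",    175, 1000),
]

for name, size_gb, bw_gb_s in configs:
    print(f"{name}: <= {max_tokens_per_second(size_gb, bw_gb_s):.1f} tokens/s")
```

All four land in single digits, which is exactly the "very boring" verdict above.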
16 GB of VRAM at roughly 500 GB/s is pretty good on the A770M in the Serpent Canyon NUC: only about half the near-1000 GB/s that the 24 GB of VRAM manages on my RTX 4090, and way more economical. That RTX can also represent each weight in all the 2- to 16-bit INT formats and the various low-precision float formats that current CPUs can't yet handle (another point in favor of NPUs, which can).
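For a sense of what those low-bit formats buy you, here is a minimal sketch (the parameter counts are illustrative assumptions) of how a model's weight footprint scales with bits per weight; halving the precision halves both the RAM needed and the bytes streamed per token.

```python
# Weight footprint of a dense model: roughly parameters * bits / 8.
# Quantization scales/zero-points add a few percent and are ignored here.

def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params_billion in (7, 70, 175):      # illustrative model sizes
    for bits in (16, 8, 4, 2):           # FP16/BF16, INT8/FP8, 4-bit, 2-bit
        gb = weight_footprint_gb(params_billion, bits)
        print(f"{params_billion:>4}B @ {bits:>2}-bit -> ~{gb:7.1f} GB")
```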
Intel and AMD GPUs may not trail Nvidia here, but that doesn't solve the underlying problems: you're fighting a quality wall with reduced precision and a memory wall with bigger models. RAM capacity is much cheaper to double than bandwidth, yet the bandwidth demand grows just as linearly with model size, which is why everyone wants HBM to reach 4 TB/s or a little more.
A dGPU with PCIe 5.0 could still potentially stream the weights out of main memory, but PCIe would probably bottleneck it to the point where it's not much faster than an iNPU.
PCIe 5.0 x16 and dual-channel DDR5 aren't that far apart in bandwidth, but both are on the order of 40x slower than top-end HBM. Large language models aren't compute limited: my 4090 never uses more than 50% of its compute even when an LLM fits entirely inside its 1000 GB/s of VRAM.
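A quick sanity check on those ratios; the link figures below are nominal peak numbers, and using H100-class HBM3 as the reference point is my own assumption.

```python
# Nominal peak bandwidths in GB/s; real-world throughput is somewhat lower.
links = {
    "PCIe 5.0 x16 (per direction)": 64,
    "Dual-channel DDR5-6400":       102,
    "RTX 4090 GDDR6X":              1008,
    "Top-end HBM3 (H100-class)":    3350,
}

hbm = links["Top-end HBM3 (H100-class)"]
for name, gb_s in links.items():
    print(f"{name:>30}: {gb_s:5d} GB/s  (HBM3 is {hbm / gb_s:4.1f}x faster)")
```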
When your model gets bigger, even if only some layers spill to system RAM (I have 768 GB on some of my machines), performance falls off a cliff, and there is really no difference between CPU and GPU inference speeds, no matter how many CPU or GPU cores you throw at the same memory bus.
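Here's a minimal sketch of why that cliff happens (the split fractions and bandwidths are assumptions for illustration): per token you pay for the VRAM-resident portion at VRAM speed and for the spilled portion at PCIe/system-RAM speed, and the slow portion dominates almost immediately.

```python
# Per-token time for a model split between VRAM and system RAM:
# the fast portion streams at VRAM bandwidth, the spilled portion at
# the PCIe/RAM bandwidth, and the two costs add up.

def tokens_per_second(model_gb: float, frac_in_vram: float,
                      vram_gb_s: float = 1000,   # assumed 4090-class VRAM
                      slow_gb_s: float = 64):    # assumed PCIe 5.0 x16 path
    fast = model_gb * frac_in_vram / vram_gb_s
    slow = model_gb * (1 - frac_in_vram) / slow_gb_s
    return 1.0 / (fast + slow)

model_gb = 40  # e.g. a 70B model at 4-bit (illustrative)
for frac in (1.0, 0.9, 0.75, 0.5, 0.0):
    rate = tokens_per_second(model_gb, frac)
    print(f"{frac:4.0%} of weights in VRAM -> ~{rate:5.1f} tokens/s")
```

Spilling just 10% of the weights already cuts the rate by more than half; that's the cliff.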
That said, there are other AI models you could use that are small enough to fit in typical dGPU memory. So, if the focus isn't so much on LLMs, then dGPUs could remain an attractive option.
I cannot recommend buying any consumer GPU for AI work. If you treat it as a hobby or simply a new type of "text adventure" game, that's fine.
For a while, consumer GPUs actually did let you surf the AI waves and enabled quite a bit of experimentation. If you want to experiment with completely new ways of surfing, or with other domains of AI, GPUs might still help you have some fun.
Consider that the competition there is tens of thousands of wannabe PhDs with high ambitions and no sane constraints on overtime.
But "the real LLM game" is with big boys who surf with boards the size of ocean liners on waves higher than sky scrapers.