But even on desktops, it is clear that bandwidth significantly constrains performance, even for old cores.
I did provide several memory-scaling benchmarks which show that higher-bandwidth memory makes almost no difference to either compute or gaming workloads. The biggest impact is from using memory with lower latency, which is why DDR4-3600 even beat DDR5-4800, in that gaming benchmark.
That is why they built a bunch of cache levels, increasing their sizes, in an attempt to hide the problem from the public - insufficient bandwidth in x86.
CPU caches aren't only about solving the bandwidth problem. As I said before, latency is the more crucial issue. I cited a very current and relevant example, to illustrate this point. You seem to have glossed over it, perhaps without appreciating what it tells us.
I linked a slide showing the additional cache tier that Intel added to the Lion Cove P-cores in Lunar Lake and Arrow Lake. Intel calls these Level 0 (data), Level 1 (data), and L2. However, the L0D is the same size as the previous generation's L1D and the L2 is only a little bigger (by 50%) than the previous generation's L2. So, essentially, what they did was to shoehorn a level in between the old L1D and L2.
What's crucial to appreciate, here, is that
the only parameters which differ between the new L1D and L2 are the size and latency. Bandwidth is the same! So, we can clearly see what a top-level concern latency is, for the CPU designers.
If you compare these to DRAM latency, you can see why it's such a performance-killer. Again, here's how the cache latencies compare with each other and with DRAM.
Intel went to all the trouble of adding another level of cache, just for the sake of lowering the stair step after the lowest-level cache (what they used to call L1D and now call L0) and elongating the L2 step.
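To make that stair-step picture concrete, here's a minimal pointer-chasing sketch (my own illustration, not taken from the linked slide) that sweeps the working-set size. Plotting ns-per-load against size shows a flat plateau for each cache level and a tall step once you fall out to DRAM:

```c
/* Pointer chasing over a random cycle, swept across working-set sizes.
 * Each cache level shows up as a plateau in ns/load; DRAM is the big step. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns_per_load(size_t n_elems, long iters)
{
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) { perror("malloc"); exit(EXIT_FAILURE); }

    /* Sattolo's algorithm: a random permutation that forms one big cycle,
     * so the chase visits every element and the prefetcher can't predict
     * the next address. */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;        /* j in [0, i) */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long k = 0; k < iters; k++)
        idx = next[idx];                      /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = idx; (void)sink;   /* keep the loop from being optimized away */
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void)
{
    /* Sweep from well inside L1 to well past the last-level cache. */
    for (size_t kib = 16; kib <= 128 * 1024; kib *= 2) {
        size_t n = kib * 1024 / sizeof(size_t);
        printf("%8zu KiB : %6.2f ns/load\n", kib,
               chase_ns_per_load(n, 10 * 1000 * 1000));
    }
    return 0;
}
```

Because every load's address depends on the previous load's result, each plateau reflects the actual load-to-use latency of that level, rather than anything the prefetcher can paper over.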
All the memory bandwidth in the world won't help with latency! You could have an
infinite amount of memory bandwidth, but a CPU thread is still going to be sitting idle for a large chunk of that ~100 ns (which translates to ~500 clock cycles, in a CPU core running at 5 GHz), every time it needs to read more data from DRAM. It's
latency that can really murder CPU performance!
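If it helps, here's a hedged sketch of the two access patterns being contrasted, with the cycle math spelled out (using the same round 5 GHz and ~100 ns figures as above):

```c
/* At 5 GHz a clock cycle is 0.2 ns, so a ~100 ns DRAM access costs
 * roughly 100 / 0.2 = ~500 cycles. */
#include <stddef.h>

/* Pattern A: streaming sum.  The addresses are known in advance, so the core
 * and the prefetcher can keep many misses in flight -- bandwidth-bound. */
double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pattern B: linked-list walk.  The next address isn't known until the
 * current load completes, so misses serialize and each one pays the full
 * ~500-cycle DRAM latency -- latency-bound, no matter how much bandwidth
 * the memory subsystem offers. */
struct node { struct node *next; double value; };

double sum_list(const struct node *p)
{
    double s = 0.0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```

Out-of-order execution and prefetching can overlap the misses in the first loop, which is why it scales with bandwidth; nothing can overlap the misses in the second, because the next address doesn't even exist until the previous load completes.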
Now, how you want to characterize cache is a matter of perspective. Is it an optimization or a dirty trick? I only consider something a dirty trick, if it has some significant downside that's worse than whatever benefit I'm getting. In modern CPUs, caches are sufficiently refined that they don't really have such downsides. If you get rid of caches, you throw out the baby with the bathwater. You can't substitute them with 4096-bit HBM, or anything of the sort.
It is obvious that if a dGPU has VRAM capable of operating at 250-750 GB/s, with power consumption in the region of 80-140 W, it should be the same for an iGPU, which means the total bandwidth should be much, much larger.
The M-series Max chips have 512-bit memory interfaces, but the Ultra doubles this - and it's the one designed to compete against the fastest desktop dGPUs. The M1 Ultra had 800 GB/s of memory bandwidth, at a time when the fastest dGPU (RTX 3090) had 936 GB/s. However, the RTX 3090 also had a TDP of 350 W, which is more than twice as much as a Mac Studio would draw. It's clear that the M1 Ultra couldn't match it, performance-wise, but it was definitely the fastest iGPU ever made.
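For reference, here's the back-of-the-envelope math behind those numbers, assuming the commonly reported LPDDR5-6400 configuration for the M1 Max/Ultra (the memory type is my assumption, not something stated above):

```c
/* Peak bandwidth = (bus width in bits / 8 bits per byte) * transfer rate in GT/s,
 * assuming LPDDR5-6400 (6.4 GT/s). */
#include <stdio.h>

int main(void)
{
    const double gt_per_s = 6.4;   /* LPDDR5-6400 transfer rate */
    printf("M1 Max,   512-bit:  %.0f GB/s\n",  512 / 8.0 * gt_per_s);  /* ~410, marketed as 400 */
    printf("M1 Ultra, 1024-bit: %.0f GB/s\n", 1024 / 8.0 * gt_per_s);  /* ~819, marketed as 800 */
    return 0;
}
```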
I think Apple will never beat Nvidia with its iGPUs, but (for the most part) it doesn't really need to. It just needs to get into the ballpark, in order to meet the needs of most of its power users.