InvalidError :
bit_user :
If you look back into semi-recent history, many cutting-edge designs (like IBM's Cell processor) have eschewed big cache hierarchies, instead favoring software-managed local memory.
Having local "scratch" memory is basically software-directed caching. Yes, it might be slightly more power-efficient, since it sidesteps cache-coherence machinery.
...
Having yet another type of RAM internal to CPU cores that exists outside conventional memory address space would also open up a whole new can of worms security-wise.
Well, there are different levels at which you could place it. If you made it private to a core/thread, then there should be no security concern. IMO, cache is still a good solution for inter-thread communication, which tends to be far lower-bandwidth anyway.
As for additional context-switching overhead: the L1/L2 caches routinely get flushed anyway (and I believe some Spectre/Meltdown mitigations now do so explicitly), so we're really not talking about a lot of additional memory traffic. Plus, have you considered the size of the AVX-512 register file? At 32 ZMM registers x 64 bytes each (plus the mask registers), that's over 2 KiB of architectural state per thread that already gets saved and restored - admittedly still well short of the Cell's 256 KiB per-SPE local store, but hardly trivial.
Anyway, we can debate the relative merits, but only time will tell, for sure. Really, there's not much difference between this and simply having a large register file - just add indexed access and you're basically there. Another way to think of it is like having your stack frame fetched into local memory, which I think is a little like what SPARC tried to do with register windows.
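To make the Cell-style model concrete, here's a minimal C sketch of software-managed local memory: the program, not the hardware, decides what gets staged into the fast local store and when it gets written back. All the names here (local_store, dma_get, dma_put, the tiny 256-byte size) are illustrative stand-ins, not a real API - the actual SPE local store was 256 KiB and its MFC DMA ran asynchronously.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOCAL_STORE_SIZE 256  /* toy stand-in for the SPE's 256 KiB local store */

static uint8_t local_store[LOCAL_STORE_SIZE];

/* Explicit, software-directed transfers, instead of a hardware cache
   pulling lines in on demand. Modeled here as plain copies. */
static void dma_get(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void dma_put(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }

/* Double every byte of a large buffer, one local-store-sized tile at a time:
   stage in, compute only on local data, stage out. */
void double_all(uint8_t *mem, size_t len) {
    for (size_t off = 0; off < len; off += LOCAL_STORE_SIZE) {
        size_t n = len - off < LOCAL_STORE_SIZE ? len - off : LOCAL_STORE_SIZE;
        dma_get(local_store, mem + off, n);   /* stage tile in  */
        for (size_t i = 0; i < n; i++)
            local_store[i] = (uint8_t)(local_store[i] * 2);
        dma_put(mem + off, local_store, n);   /* stage tile out */
    }
}
```

The point of the sketch is the contrast with a cache: every byte the compute loop touches was placed there deliberately, so there are no misses, no tags, and no coherence traffic - but also no safety net if the software gets the staging wrong.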