InvalidError :
bit_user :
Again, the best testament to this will be Nvidia's new 64-bit Denver core. They've already announced an SoC built around it for automotive applications.
"Old world thinking" or not, from an engineering point of view, it is much simpler in terms of total design effort to work with an architecture that allows local optimizations on its own than having to re-engineer and re-characterize both the software and hardware stack with every architecture change that has any sort of effect on scheduling, dependencies and resource availability. With "old world" thinking, compilers and CPUs can be improved independently.
I don't follow. Compilers are optimized for specific micro-architectures. They all have options to tell them which instruction set extensions to use and how to tune the code. And I think you over-emphasize the degree to which they'd be coupled: compilers can & do continue to evolve high-level optimizations independently of their architecture-specific backends.
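To make that concrete, here's a toy C kernel (the file name is made up, but -march=core2 and -march=skylake are real GCC options). The same source goes through the same high-level optimizer either way; only the backend target changes what comes out:

    /* saxpy.c -- toy kernel to illustrate per-microarchitecture tuning.
     * The same source can be built for different targets, e.g.:
     *   gcc -O3 -march=core2   -c saxpy.c   (SSE-era code at most)
     *   gcc -O3 -march=skylake -c saxpy.c   (may use AVX2/FMA)
     * The high-level optimizations are shared; only the backend differs.
     */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* auto-vectorized per target ISA */
    }

The coupling lives in the compiler's backend, not in the source or the high-level optimizer.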
VLIW has already established this precedent. There have been numerous VLIW micro-architectures for embedded applications that lacked binary compatibility between generations.
InvalidError :
If you think that lacking instruction decoders and complex scheduling magically enables higher clock frequencies or performance, look at the very GPUs you cited as an example: low clock frequencies compared to Skylake.
Actually, it's a design choice. If your applications have enough concurrency and people are willing to pay for a larger die, then it's more energy-efficient to have lots of lower-clocked units. However, someone could easily tune the simple, in-order cores of modern GPUs to achieve blistering clock speeds. It's all about tradeoffs. In fact, if 14/16 nm die space were cheaper, I bet current-gen GPUs would have even higher shader counts and even lower clock speeds.
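Here's a rough back-of-envelope sketch of why that trade works, assuming dynamic power scales roughly with C*V^2*f and that voltage has to rise more or less in step with clock (a first-order approximation, not a real V/f curve):

    /* power_tradeoff.c -- first-order sketch of the wide-and-slow tradeoff.
     * Assumes dynamic power ~ C * V^2 * f and that V scales roughly
     * linearly with f (a simplification; real V/f curves are messier).
     */
    #include <stdio.h>

    static double rel_power(double f)     /* f is relative to a baseline clock */
    {
        double v = f;                     /* crude assumption: V tracks f */
        return v * v * f;                 /* power ~ V^2 * f, i.e. ~ f^3  */
    }

    int main(void)
    {
        /* One unit at full clock vs. two units at half clock -- the same
         * nominal throughput, if the workload has enough parallelism. */
        printf("1 unit  @ 1.0x clock: %.3f relative power\n", 1.0 * rel_power(1.0));
        printf("2 units @ 0.5x clock: %.3f relative power\n", 2.0 * rel_power(0.5));
        return 0;
    }

Same throughput on paper, roughly a quarter of the dynamic power, paid for in die area. That's the GPU trade in a nutshell.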
InvalidError :
In case you may have missed it, the multi-core mobile craze has been dying down lately as people have discovered that more weaker cores deliver generally worse overall experience than fewer stronger cores.
I hate to link to another site, but the best analysis of the matter you're likely to find doesn't really support your assertion:
http://www.anandtech.com/show/9518/the-mobile-cpu-corecount-debate/18
That said, I do see mobile core counts leveling off.
InvalidError :
Nvidia decided to give up on mobile computing and chase automotive opportunities which aren't as heavily battery-challenged.
Are you sure? Google's Pixel C uses Tegra X1.
For my final exhibit, and what got me seriously thinking about the issue, consider this piece:
http://www.csm.ornl.gov/SOS20/documents/Sohmers.pptx
He makes a compelling case about how much power is wasted by all the effort CPUs spend compensating for naive code. I'm not convinced these guys will succeed, but I think their compass is pointed in the right direction.
Think about it: why waste all the power & die area to continually decode the same instructions over and over again? The output is exactly the same. Why waste power managing cache, when compilers are smart enough to do it statically? Why redo all the work of scheduling the same instructions, when the ordering will be almost the same from one run to the next? Compilers can see more, be smarter, and persist optimizations from one run to the next. JIT-like runtimes can dynamically re-tune code, based on profiling, even while it's running.
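As one small, hedged illustration of the "let software do it" idea: __builtin_prefetch is a real GCC/Clang intrinsic that lets the code state its access pattern up front rather than leaving everything to hardware heuristics, and GCC's -fprofile-generate / -fprofile-use flags are the static cousin of that JIT-style re-tuning. The kernel itself is just a toy:

    /* stream_sum.c -- toy example of software-directed data movement.
     * Rather than relying purely on hardware prefetch heuristics, the
     * access pattern is stated explicitly. Profile-guided builds
     * (gcc -fprofile-generate, then -fprofile-use) apply the same idea
     * to branch layout and inlining: measure once, bake it into the binary.
     */
    #include <stddef.h>

    double stream_sum(const double *x, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&x[i + 64], 0, 0);  /* read-only, low temporal locality */
            sum += x[i];
        }
        return sum;
    }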
I don't know if it's true, but I've read that the VLIW philosophy started to take hold when people noticed that more power and die area were being spent on instruction scheduling, branch prediction, etc. than on actual computation. Yet true VLIW was confined to embedded applications, where cross-generational binary compatibility wasn't a requirement. In the world of mobile computing, the hardware and software platforms are so tightly controlled and interpreted languages are so dominant that I think the yoke of binary compatibility has effectively been shed.
To chip away at the wall that CPUs are up against, more creativity and daring are needed. Make the hardware simpler and the software smarter. Makes sense to me, anyway.
Just to be clear, I don't see x86 CPUs going this route. Not for a while, at least.