photonboy :
I think the reality is likely that there may not be a lot that can be done to improve the architecture without breaking backwards compatibility.
I think the problem is much worse than that: CPU architectures in general are reaching the point where further major IPC improvements are no longer practical because the complexity cost is disproportionate. If you make the instruction scheduler more complex to detect more fringe IPC extraction opportunities, it eats die space, it eats power, and it lengthens the critical path (more logic involved in instruction selection), which reduces achievable clocks on a given process.
IBM's Power architecture, Intel's x86 and Itanium, PA-RISC, Alpha, AMD's x86 and other CPU architectures have taken different paths in their core designs, but they all achieved comparable per-core performance. ARM is slowly catching up, but once it matures it will hit the same performance scaling bottlenecks as everyone else, simply because of typical software instruction mixes and patterns.
Regardless of what instruction set or architecture you use, there is only so much that even hardware with infinite resources can do to improve IPC when every 12th or so instruction is a conditional branch in typical software.
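
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is a toy model, not a simulation of any real core: the only figure taken from the point above is the branch every 12 instructions. The 95% prediction accuracy, the 15-cycle misprediction penalty and the function name `effective_ipc` are assumptions picked purely for illustration, and the model ignores cache misses and data dependencies entirely.

```python
def effective_ipc(issue_width, branch_interval=12,
                  predict_accuracy=0.95, mispredict_penalty=15):
    """Toy estimate of sustained IPC for a core limited only by branches.

    issue_width        -- instructions issued per cycle when nothing stalls
    branch_interval    -- one conditional branch every N instructions
    predict_accuracy   -- assumed fraction of branches predicted correctly
    mispredict_penalty -- assumed cycles lost per mispredicted branch
    """
    # Cycles per instruction if the core never stalled.
    base_cpi = 1.0 / issue_width
    # Extra cycles per instruction, averaged over branch mispredictions.
    stall_cpi = (1.0 - predict_accuracy) * mispredict_penalty / branch_interval
    return 1.0 / (base_cpi + stall_cpi)


if __name__ == "__main__":
    # The last width stands in for "infinite" execution resources.
    for width in (2, 4, 8, 16, 1_000_000):
        print(f"issue width {width:>9}: IPC ~ {effective_ipc(width):.2f}")
```

Under those assumed numbers, each doubling of issue width buys less than the one before it, and the IPC never passes branch_interval / ((1 - predict_accuracy) * mispredict_penalty) no matter how wide the core gets. Plug in your own accuracy and penalty figures; the ceiling moves, but it does not go away.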