I will write here a collective answer to several ARM questions some of you are asking.
First, the Anand article is wrong: they did not control several relevant variables during the power measurements. Moreover, Yuka mentioned another fundamental point: our lack of knowledge of several key parameters of the new 20nm node. There is a reason why several companies (including Nvidia) are skipping the 20nm node. It is also worth mentioning that Intel is having similar power issues with its 14nm node.
Second, I don't know why some of you insist on misunderstanding the issue about efficiency again and again. Evidently a 100W ARM CPU @3.0GHz will not be as efficient as a single-digit-watt ARM CPU @1.5GHz. Efficiency drops with higher IPC and frequency:
Efficiency \propto I^{-1} f^{-2}, where I is the IPC and f is the clock frequency.
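To make the scaling concrete, here is a toy sketch (my own illustration; both design points and their IPC/frequency numbers are hypothetical) comparing a big hot core against a small low-power core under the proportionality above:

```python
# Toy illustration of Efficiency ∝ 1/(I * f^2).
# The constant of proportionality cancels when comparing two designs,
# so only the ratio matters. All numbers below are made up.

def relative_efficiency(ipc, freq_ghz):
    """Efficiency up to an arbitrary constant: perf/watt ∝ 1/(I * f^2)."""
    return 1.0 / (ipc * freq_ghz ** 2)

# A big ~100W-class core at 3.0 GHz vs a small core at 1.5 GHz.
big = relative_efficiency(ipc=4.0, freq_ghz=3.0)
small = relative_efficiency(ipc=2.0, freq_ghz=1.5)

print(round(small / big, 1))  # 8.0 — the small core is 8x more efficient here
```

The exact factor depends entirely on the assumed IPC and frequency values; the point is only the direction of the scaling.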
What the whole industry is claiming is that a ~100W ARM CPU will be more efficient than a ~100W x86 CPU. Several of us estimate the ARM ISA provides about a 20--30% advantage over the x86 ISA at Xeon levels of performance, all else being equal. David Kanter claims that K12 would be about 10% better than Zen. At the lowest TDPs the gap between x86 and ARM is bigger, 2x or more, which is why Intel cannot enter the mobile market even with a process-node advantage.
I note that the 20--30% (or Kanter's 10%) is the overall ISA advantage. I see too many people on forums still believing that the x86 penalty is only in the decoder. I will simply copy and paste from an engineer:
It's a myth that ISA overhead is just in decode. There are many aspects of an ISA that affect the overall microarchitecture. Just to mention one example, x86 requires more load/store units due to having fewer registers and load+op instructions. x86 also uses a more complex memory ordering model.
Several ARM servers have been measured to provide more performance than Xeons while consuming less power. Jdwii gave the link to Cray announcing a partnership with Cavium to build ARM-based supercomputers. That is because there is an efficiency advantage: the 80W Cavium SoC gives more performance than 95W Xeons while consuming less power. Linley Gwennap wrote:
Compared with Xeon, ThunderX could deliver 50% to 100% more performance per watt and per dollar, particularly when considering the additional chips that Intel needs to complete the server design.
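As a back-of-the-envelope check of what "50% more performance per watt" means at these TDPs, here is the arithmetic (the 80W and 95W figures are from above; the raw performance scores are made-up placeholders, not benchmark results):

```python
# Back-of-the-envelope perf/watt comparison. TDPs come from the text
# (80W ThunderX-class SoC vs 95W Xeon); the performance scores are
# invented placeholders just to show the arithmetic.

def perf_per_watt(score, tdp_watts):
    return score / tdp_watts

xeon = perf_per_watt(score=100.0, tdp_watts=95.0)   # baseline
arm  = perf_per_watt(score=126.0, tdp_watts=80.0)   # ~26% more raw perf, assumed

print(round(arm / xeon, 2))  # 1.5, i.e. ~50% better perf/watt
```

Note how a modest raw-performance edge combined with a lower TDP compounds into a large perf/watt gap; the per-dollar part of the quote works the same way.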
Third, x86-64 is an extension of x86-32, which means all the bloat, legacy, and oddities of the old ISA are present in the new one. For low-power Intel chips, an estimated 20% of the die area is used by the microcode ROM. Any engineer knows that designing and implementing an x86 decoder is more complex than an A64 decoder. The complexity is not because the x86 ISA is better, but for two reasons: (i) decoding variable-length instructions is more complex than decoding fixed-length instructions, and (ii) the x86 decoder has to support hundreds of legacy instructions that nobody uses anymore but remain in the ISA regardless.
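To see why point (i) matters, here is a toy byte-stream decoder (the encodings are invented for illustration; real x86 and A64 formats are far richer): with a fixed width, the n-th instruction boundary is known immediately, while with variable lengths each boundary depends on having decoded everything before it.

```python
# Toy model of fixed- vs variable-length instruction decode.
# Encodings are invented; this only illustrates the boundary problem.

def fixed_boundaries(stream, width=4):
    """Fixed width: all boundaries are computable up front, in parallel."""
    return list(range(0, len(stream), width))

def variable_boundaries(stream, length_of_opcode):
    """Variable width: must walk the stream sequentially, because the
    position of instruction n depends on the lengths of 0..n-1."""
    boundaries, pos = [], 0
    while pos < len(stream):
        boundaries.append(pos)
        pos += length_of_opcode[stream[pos]]  # length known only after decode
    return boundaries

# Hypothetical ISA: opcode 0x01 is 1 byte, 0x02 is 2 bytes, 0x03 is 3 bytes.
lengths = {0x01: 1, 0x02: 2, 0x03: 3}
code = bytes([0x02, 0xAA, 0x01, 0x03, 0xBB, 0xCC, 0x01])

print(fixed_boundaries(bytes(8)))          # [0, 4]
print(variable_boundaries(code, lengths))  # [0, 2, 3, 6]
```

This serial dependency is why real x86 front-ends spend extra hardware (length predecoders, marker bits in the instruction cache) just to find where instructions start, a cost a fixed-width ISA like A64 simply does not pay.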
I know of studies showing that only about 30% of x86 opcodes are used in modern scientific applications. The rest are no longer used but must still be supported in hardware for formal compliance with the ISA. x86 is the most bloated and poorly designed ISA in use.
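A claim like that is easy to check mechanically on any disassembly listing. Here is a sketch of how one would measure the fraction of an ISA actually exercised (the instruction trace and the ISA size below are made up; in practice you would feed in real disassembler output and the real opcode count):

```python
from collections import Counter

# Hypothetical mnemonic trace and a made-up ISA size, purely for illustration.
# In practice one would parse disassembly of real binaries instead.
trace = ["mov", "add", "mov", "jmp", "mov", "cmp", "add", "call", "ret", "mov"]
isa_size = 20  # pretend the ISA defines 20 opcodes

used = Counter(trace)          # how often each distinct opcode appears
coverage = len(used) / isa_size

print(sorted(used))                        # distinct opcodes actually seen
print(f"{coverage:.0%} of the ISA used")   # 30% of the ISA used
```

Counting distinct mnemonics over large real workloads is exactly the kind of measurement those studies perform.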
ARM is different. A64 is not an extension of A32 but a new, separate ISA. Moreover, ARM engineers did not merely extend the architecture to 64-bit; they used the change to clean up the ISA, eliminating legacy and bloated aspects. The new A64 is a clean, elegant, and efficient ISA. Implementers can support A64 only, A32 only, or A64+A32. The first chips support both ISAs, A64+A32 (A57, A53, Denver, Cyclone). The next chips will focus on A64 and support only part of A32 (e.g., Vulcan supports full A64 but only the user mode of A32). Later ARM chips will support only A64, which will provide another boost in efficiency.
Finally, A64 software already exists: both OSes and applications. I gave some links many pages ago.