Well, I spent some time googling to get some info on this...
The Pentium Pro's x86 decoder occupied about 7% of the die. That was in 1995, with a total transistor count of 8.8 million per die.
Meanwhile, the early K8's x86 decoder occupied about 2% of its die.
Yup, but the Pentium Pro didn't have any on-die L2 cache (it had 256KB on-package, in a separate die) and only 16K of L1...
That makes those 7% look even more interesting, doesn't it? (It means that with 4MB of on-die cache, the decoder would shrink to much less than 1% of the die.)
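Just to make that arithmetic concrete, here is a back-of-the-envelope sketch in C. The 8.8M and 7% figures are the ones quoted above; the 6-transistors-per-SRAM-bit rule of thumb is an assumption, and tag/overhead transistors are ignored:

```c
#include <stdio.h>

int main(void)
{
    /* Rough figures from the discussion: ~8.8M transistors for the
       Pentium Pro die, ~7% of them spent on the x86 decoder. */
    double die     = 8.8e6;
    double decoder = 0.07 * die;                /* ~616K transistors */

    /* Hypothetical: add 4MB of on-die cache at ~6 transistors per
       SRAM bit (a common rule of thumb; tags and overhead ignored). */
    double cache = 4.0 * 1024 * 1024 * 8 * 6;   /* ~201M transistors */

    printf("decoder share, no cache:  %.1f%%\n", 100 * decoder / die);
    printf("decoder share, 4MB cache: %.2f%%\n",
           100 * decoder / (die + cache));
    return 0;
}
```

With these numbers the decoder falls from 7.0% to roughly 0.3% of the transistor budget.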
I'm not sure I understand your point here.
Anyway, my point is that you wouldn't use that space for more cache; you would use it for execution resources.
It's not even just die area; it's more about the amount of engineering work necessary to design a very sophisticated decoding scheme, and all the consequences for the control unit of the CPU itself.
Also, apart from die size, complex decoders waste clock cycles.
G5 - 16 stages
Conroe - 14 stages
Yep, but that doesn't mean much by itself.
It's like the enormous pipeline of Prescott: it had 30+ stages, and now look at Conroe..
The G4e, for example, had only a 7-stage pipeline, with the same instruction set.
And the G4, which was out at the time of the K7, had only 4 (while the K7 had 10-15; link).
Also, a longer pipeline has an additional cost in terms of branch penalty, so even more resources have to be invested in a sophisticated branch predictor.
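To see why, here is a rough model of the misprediction cost; the branch frequency, miss rates, and flush depths below are illustrative assumptions, not measurements:

```c
#include <stdio.h>

/* Extra cycles-per-instruction lost to branch mispredictions,
   in the simplest model where a miss flushes the whole pipeline:
   CPI_lost = branch_freq * miss_rate * flush_depth */
static double branch_cpi(double branch_freq, double miss_rate, int depth)
{
    return branch_freq * miss_rate * depth;
}

int main(void)
{
    /* Assume ~20% of instructions are branches. */
    printf("7-stage pipe,  5%% miss: %.3f CPI lost\n", branch_cpi(0.20, 0.05, 7));
    printf("30-stage pipe, 5%% miss: %.3f CPI lost\n", branch_cpi(0.20, 0.05, 30));
    printf("30-stage pipe, 2%% miss: %.3f CPI lost\n", branch_cpi(0.20, 0.02, 30));
    return 0;
}
```

A Prescott-class pipeline needs a much better predictor just to pay the same misprediction tax as a short G4e-class one.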
Here is what John Stokes from Ars Technica says:
To do all of this hefty decoding work, the K7 uses two separate decoding pipelines. It has a hardware decoder that decodes the smaller x86 instructions, and a microcode decoder that decodes the larger and more complicated x86 instructions. Since the larger x86 instructions are rarely used, the hardware decoder sees most of the action. Thus the microcode decoder doesn't really slow the K7's front end down. All it does is eat up transistor real estate, contributing to the K7's extremely high transistor count, large die size, high power dissipation, etc. Not only do all of these wild decoding contortions jack up the transistor count, but they affect the pipeline depth as well. The K7 spends three pipeline stages (or three cycles) after the instruction fetch finding and positioning the variable-length x86 instructions onto 6 decode positions (3 MacroOps = 6 ops). As a result, the K7's pipeline is about 3 stages longer than it would be if the x86 ISA were not CISC.
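To see where those boundary-finding stages go, here is a toy sketch; the opcode-to-length rule is made up (real x86 is far messier, with prefixes, ModRM, SIB, and so on), but the serial dependency is the point:

```c
#include <stdio.h>

/* Toy "ISA": each instruction's length is a function of its first
   byte. The encoding here is invented purely for illustration. */
static int insn_len(unsigned char first_byte)
{
    return 1 + (first_byte & 0x3);   /* 1..4 bytes */
}

int main(void)
{
    unsigned char code[] = { 0x02, 0x00, 0x13, 0x01, 0x00, 0x07 };
    int n = (int)sizeof code;

    /* The scan is inherently serial: instruction i+1's start isn't
       known until instruction i's length is decoded. A fixed-width
       32-bit RISC can point all its decoders at offsets 0, 4, 8...
       in parallel, with no such dependency chain. */
    for (int pc = 0; pc < n; pc += insn_len(code[pc]))
        printf("insn at offset %d, length %d\n", pc, insn_len(code[pc]));
    return 0;
}
```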
Now, the G5's pipeline and OOO core are broadly similar to those of the K8 and Conroe. So one possible explanation is that the x86 ISA has some hidden power that is not quite obvious at first look. Code density, and not being a load-store architecture, are my favorite explanations here.
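To illustrate the register-memory point, consider what a single C statement turns into on each kind of ISA; the assembly in the comments is schematic, not real compiler output:

```c
/* One C statement: x += table[i]; */
int accumulate(const int *table, int i, int x)
{
    /* Register-memory x86 can fold the load into the ALU op:
     *     add eax, [table_base + index*4]     ; one instruction
     *
     * A load-store RISC must split it into two:
     *     lwz r3, offset(r4)                  ; load first
     *     add r5, r5, r3                      ; then add
     *
     * Fewer, denser instructions mean more of the working set fits
     * in the I-cache, which is the code-density argument above.
     */
    return x + table[i];
}
```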
But at the end of the day, the pipeline of the G5 is not that different from a P4 or K8 (note: I'm not very familiar with the G5, and I have to read more on the subject). It seems to have more functional units, but it all depends on how you define a functional unit (for example, the G5 is said to have a "branch unit", while that doesn't count as a functional unit on the P4 and K8; link). So it has 2 integer units, 2 load/store units... pretty much, it should be similar, clock for clock.
But that is my point: if the rest of the OOO core is basically the same, a RISC ISA should show some advantage, if RISC really is so much better than "crippled" x86.
But in reality, x86 seems to be faster at crunching integer code. Even x86-32, with its stupidly small register file.
But the G5 trounces its x86 competitors in floating point code, and so does the Itanium.
Having lots of registers (and the 3- or 4-operand instructions typical of load-store architectures) is especially important in FP code, because a certain amount of data has to be used again and again, especially in matrix-intensive calculations.
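For instance, a blocked matrix-multiply kernel keeps a whole tile of accumulators live at once; this is a generic sketch, not code from any benchmark mentioned here:

```c
/* Multiply a 4x4 tile of C = A*B (n x n, row-major). The 16
   accumulators in c[][] should stay in registers for the entire
   k loop. With 32 FP registers (G5, Itanium) that fits easily;
   with only 8 (x87) the compiler is forced to spill to memory. */
void mm_tile(const double *A, const double *B, double *C,
             int n, int i0, int j0)
{
    double c[4][4] = {{0}};
    for (int k = 0; k < n; k++)
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                c[i][j] += A[(i0 + i) * n + k] * B[k * n + (j0 + j)];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            C[(i0 + i) * n + (j0 + j)] = c[i][j];
}
```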
In integer code, memory latency, memory bandwidth, and branch performance play the leading role, and a register-memory architecture like x86 proves to be very efficient if it has good memory latency.
I mean, in the end, from an architectural point of view, the K8 is just so close to the K7... but it outperforms it significantly, mostly for reasons (like the integrated memory controller) which have nothing to do with the pipeline.
Also, here you can see that even in integer applications, the G5 is very competitive clock for clock against x86 CPUs. Of course, this test was somewhat controversial, as the x86 CPU could get better results under Windows with a better compiler, but in the end it is not so unfair to test both machines with the same compiler and on the same OS.