Gahhh you didn't read the article now did you...
I'll post again...
Because - (1) POV-Ray's usage of SSE2 is not SSE (Streaming SIMD Extensions) at all, but really double-precision FP with random register access;
Here's a nice diagram for you.. it compares AMD's K8 with Intel Core (not to be confused with Core 2).
Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".
Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different from K8 in this respect and can only process a maximum of 3 per core per clock.
Then there's the second part of this... random register access. Core 2 is four times as fast as K8 at this; K10 is twice as fast as K8, and thus half the speed of Core 2 per clock.
Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation as much, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).
This explains why AMD chose to target its integer performance and ignored this one weakness, and also why you should read the article.
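To put the per-clock figures claimed earlier in this post into perspective, here is a minimal sketch of the usual peak-throughput arithmetic (ops per clock x clock speed x cores). The 5 and 3 per-clock values are simply the ones quoted above, taken at face value; the clock speed and core count are hypothetical placeholders, and the function names are made up for illustration.

```python
# Per-clock figures are the ones claimed in the post above, not measurements.
CLAIMED_DP_OPS_PER_CLOCK = {"Core 2": 5, "K10": 3}


def peak_dp_gops(ops_per_clock, clock_ghz, cores):
    """Theoretical peak, in billions of double-precision FP ops per second."""
    return ops_per_clock * clock_ghz * cores


def per_clock_advantage(faster, slower):
    """Ratio of the claimed per-clock DP FP rates of two architectures."""
    return CLAIMED_DP_OPS_PER_CLOCK[faster] / CLAIMED_DP_OPS_PER_CLOCK[slower]


if __name__ == "__main__":
    clock_ghz = 2.4  # hypothetical clock speed, same for both chips
    cores = 4        # hypothetical core count, same for both chips
    for name, ops in CLAIMED_DP_OPS_PER_CLOCK.items():
        peak = peak_dp_gops(ops, clock_ghz, cores)
        print(f"{name}: {peak:.1f} G DP ops/s theoretical peak")
    ratio = per_clock_advantage("Core 2", "K10")
    print(f"Core 2 vs K10 per clock: {ratio:.2f}x")
```

This is a ceiling only; real code rarely sustains the peak rate, so the ratio matters more than the absolute numbers.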
I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.
Here are the Ars links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info
Ryan
Wanna know what's funny.. looks like Ars Technica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu
And also here..
Wikipedia P6
Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M.
I wouldn't trust the PowerPoint from UVa either. It doesn't cite any sources or the credentials of the author (or even who the author is for that matter.) Plus it contradicts your table. The PPT says that Netburst had a pipeline depth of 20, yet the table says 21+8.
I don't think Wiki can be trusted here either. (We all know all too well that Wiki is to be taken with a grain of salt.) It states that the Pentium Pro was 14 stages and P3 was 10. Heck, the Wiki article even contradicts itself regarding the length of the PM pipeline. I cannot find any other source that states the pipeline was changed from PP to P3.
I only brought it up because I don't know how relevant the table is to the thread if it's not even accurate.
Ryan
The table is 100% accurate.
Manchester.edu
INTEL.com <------
The idea of out-of-order execution, or executing independent program instructions out of program order to achieve a higher level of hardware utilization, was first implemented in the P6 microarchitecture. Instructions are executed through an out-of-order 10 stage pipeline in program order.
Netburst Northwood has a 21-stage pipeline (depending on how you look at it).
Intel.com Willamette had a 20 stage pipeline.
The +8 represents the much-despised and ridiculed mispredictions. It's not an exact number but rather an estimate. You see, although the whole 20-21 stages are misprediction pipelines, the Netburst architecture actually mispredicted quite often, more so than any other architecture thus far. Although Intel's Rapid Execution Engine (ALUs running at twice the clock speed) helps alleviate some of the performance cost of a misprediction, it's not enough. The architecture itself is VERY inefficient.
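For anyone wondering why a misprediction hurts more on a long pipeline, here is a rough sketch of the usual textbook model: effective CPI = base CPI + (fraction of instructions that are branches) x (misprediction rate) x (penalty in cycles). It assumes the penalty is roughly equal to the pipeline depth, which is a simplification. Every input below is a hypothetical placeholder, not a measured figure for any real CPU, and the function name is made up for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are added."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    base_cpi = 1.0          # hypothetical ideal CPI with perfect prediction
    branch_fraction = 0.2   # hypothetical: 1 in 5 instructions is a branch
    mispredict_rate = 0.05  # hypothetical: 5% of branches are mispredicted

    # Assumption: the flush penalty is about as long as the pipeline is deep.
    for depth in (10, 20):
        cpi = effective_cpi(base_cpi, branch_fraction, mispredict_rate, depth)
        print(f"{depth}-stage pipeline: effective CPI ~ {cpi:.2f}")
```

With identical branch behaviour, doubling the depth doubles the stall term, so the deeper design needs either a better predictor or a higher clock just to break even.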