AMD Quadfather 4x4 vs. Intel Kentsfield

But predication is very limited to a small number of cases. Don't you see the problem?

That's a matter of opinion at this point; the fact is that predication does work, and by all points of reference it's quite effective. The only real reference would be to run heavily branched code, such as SQL.
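To make the predication point concrete, here is a minimal sketch (my own example, not from the posts above) of the kind of data-dependent branch that predication targets; on IA-64 both paths can be predicated, and on x86 a compiler will typically emit a conditional move instead of a predicted branch.

    /* Hypothetical example: clamp a value to a limit without a predicted branch. */
    int clamp(int value, int limit)
    {
        int result = value;
        if (value > limit)    /* data-dependent and hard to predict        */
            result = limit;   /* candidate for if-conversion / predication */
        return result;
    }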

But SPEC 2006 at least gives you a good idea of the general performance of the new processors that are available now.

HP ProLiant ML370 G5 (SAS, 3.0 GHz / 4MB DC): 169,360 tpmC
HP ProLiant DL585 (32GB / 2.4GHz / 4P): 115,110 tpmC
HP Integrity rx6600: 230,569 tpmC

TPC-C results, showing setups as similar as I could find: dual-processor, dual-core configurations.

References to SIMD do not make sense to me. Yes, SIMD is affected as well, but the code does not have to be SIMD to have parallel loop interleaved execution.

Intel insists that the case is back-to-back loop execution. I assume that otherwise you would be using too much of the processor's resources for loop execution; considering that there is a great deal of other software needing the limited resources that exist, it would be a safe bet that only "simulated" software could theoretically do two or more loop executions at once. In the real world it's not so realistic, and as they stated, performance is not ideal otherwise.

SIMD allows for vectorization; three loops being executed at once would by all standards be a lot easier for the SIMD engine to deal with than for the main core. I requested additional information and clarification from Intel, but Monday will be when they get a chance to look at it.
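For what it's worth, here is a minimal sketch (my own example, not Intel's) of the kind of loop an autovectorizer can turn into SIMD code: independent iterations, unit-stride accesses, and no branches, so the compiler can pack several iterations into one SSE/AVX instruction.

    #include <stddef.h>

    /* Hypothetical example: the function name and the use of "restrict" are
       mine; any vectorizing compiler (e.g. gcc -O3 or icc) can target it. */
    void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one SSE instruction handles 4 floats */
    }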

I posted a ways back about the autovectorizer and SIMD and was ridiculed by either pip or cxl, I don't remember which. spud, if you go back and see the post, I explained more of it, so it could be useful.

Ya it appears I need to do more homework on the subject.
 
Excuse me, I beg your pardon, I do apologize, etc., for interrupting you, but this is appalling: since this very date (September the 5th) I have not received any e-mail notices about posts on this thread... (ok, not that you care, but I do!). 😛

I hereby utter my protest! Damn!


Cheers!
 
Hmm yes, but in the end, code density only has an impact on the instruction cache size... and with prefetching, not even that much.
The complex decoding has a cost both in terms of wasted die space and in terms of decode stages / pipeline length.
Personally, I like the RISC approach more than investing hardware in complex decoding.

Well, I spent some time googling to get some info on this...

The Pentium Pro's x86 decoder occupied 7% of the die space. That was in 1995, with a total transistor count of 8.8 million per die.

Meanwhile, the early K8's x86 decoder occupies 2% of the die space.

Now look at the cache size Itanium needs, and perhaps those 2% are well spent to achieve higher code density (plus, of course, some compatibility too :) ).

Considering pipeline stages, the G5, which is now the only RISC core comparable to Conroe and the K8, has 16(!) integer pipeline stages. Meanwhile, everybody knows what happened with Macs :)

In fact, it makes me quite curious. Given that most benchmarks compare the G5 and x86 in 32-bit mode, I would expect the G5 to outperform x86 at the same frequency on integer code because of register pressure. Still, this does not seem to be the case.

Now the G5's pipeline and OOO core are principally similar to the K8 and Conroe. So the one possible explanation is that the x86 ISA has some hidden power that is not quite obvious at first look. Here, code density and not being a load-store architecture are my favorite explanations.
 
Well, I spent some time googling to get some info on this...

Pentium Pro x86 decoder occupies 7% of die space. That was in 1995, with total transistor count 8.8mil per die.

Meanwhile, early K8 x86 decoder occupies 2% of die space.
Yup, but Pentium Pro didn't have any on-die L2 cache (it had 256KB on package, in a separate die) and had only 16K of L1...
However, concerning die space, you shouldn't compare the amount used by the cache with the amount used by the logic units of the CPU; I don't know anything about process technology and stuff, but I think I read in several places that cache has a much lower impact on yield and on the cost to manufacture a CPU (Jack?).
If you look at this picture of a K8 die, you'll notice that the instruction decode block (excluding the instruction cache) occupies a die area comparable to the whole FPU/SSE execution unit...
Also, apart from die size, the complex decoders waste clock cycles.


Considering pipeline stages, G5, which is now the only comparable RISC core to Conroe and K8, has 16(!) integer pipeline stages. Meanwhile, everybody knows what happened with Macs :)

In fact, it makes me quite curious. Given that most benchmarks comparing G5 and x86 in 32-bits, I would expect G5 to outperform x86 on the same frequency in integer mode because of register pressure. Still this does not seem to be the case.

Now G5's pipeline and OOO core are principally similar to K8 and Conroe. So the one possible explanation is that x86 ISA has some hidden power that is not quite obvious at first look. Here, code density and not being load-store architecture are my favorite explanations.
But at the end of the day, the pipeline of the G5 is not that different from a P4 or K8 (note: I'm not very familiar with the G5, and I have to read more on the subject). It seems like it has more functional units, but it all depends on how you define a functional unit (for example, the G5 is said to have a "branch unit", while that doesn't count as a functional unit on the P4 and K8, link), so it has 2 integer units, 2 load/store units... pretty much, it should be similar, clock for clock.
But i have to read more on the subject.
 
Well, I spent some time googling to get some info on this...

Pentium Pro x86 decoder occupies 7% of die space. That was in 1995, with total transistor count 8.8mil per die.

Meanwhile, early K8 x86 decoder occupies 2% of die space.
Yup, but Pentium Pro didn't have any on-die L2 cache (it had 256KB on package, in a separate die) and had only 16K of L1...


That does make those 7% look even more interesting, doesn't it? (It means that with 4MB of on-die cache, it would be much, much less than 1%.)

picture of a K8 die, you'll notice that the instruction decode block (excluding the instruction cache) occupies a die area comparable to the whole FPU/SSE execution unit...

No, I do not think so. If I exclude the cache, there are still a lot of units common to any high-end CPU regardless of the ISA (branch predictor, cache control); the x86-related parts are 30% of it at maximum.

Of course, even a RISC ISA would have some decoders there, so the real difference is small. While I was unsuccessful in finding pictures of a G5 die with a description, it has 9 fetch/decode stages, so I do not expect it to be small either.

Also, apart from die size, the complex decoders waste clock cycles.

G5 - 16 stages
Conroe - 14 stages

Now G5's pipeline and OOO core are principally similar to K8 and Conroe. So the one possible explanation is that x86 ISA has some hidden power that is not quite obvious at first look. Here, code density and not being load-store architecture are my favorite explanations.

But at the end of the day, the pipeline of the G5 is not that different from a P4 or K8 (note, i'm not very familiar with the G5, and i have to read more on the subject), it seems like it has more functional units, but it all depends on how you define a functional unit (for example, the G5 is said to have a "branch unit", while that doesn't count as a functional unit on P4 and K8 link), so it has 2 integer units, 2 load/store units... pretty much, it should be similar, clock for clock.


But that is my point: if the rest of the OOO core is basically the same, a RISC ISA should show some advantage, if RISC is so much better than the "crippled" x86.

But in reality, x86 seems to be faster at crunching integer code. Even x86-32, with its stupidly small register file.
 
Well, I spent some time googling to get some info on this...

Pentium Pro x86 decoder occupies 7% of die space. That was in 1995, with total transistor count 8.8mil per die.

Meanwhile, early K8 x86 decoder occupies 2% of die space.
Yup, but Pentium Pro didn't have any on-die L2 cache (it had 256KB on package, in a separate die) and had only 16K of L1...
However, concerning die space, you shouldn't compare the amount used by the cache with the one used by the logical units of the CPU; i don't know anything about process technology and stuff, but i think i read in several places that the cache has a much lower impact on yield and costs to manufacture a CPU (Jack?).
If you look at this picture of a K8 die, you'll notice that the instruction decode block (excluding the instruction cache) occupies a die area comparable to the whole FPU/SSE execution unit...
Also, apart from die size, the complex decoders waste clock cycles.


Considering pipeline stages, G5, which is now the only comparable RISC core to Conroe and K8, has 16(!) integer pipeline stages. Meanwhile, everybody knows what happened with Macs :)

In fact, it makes me quite curious. Given that most benchmarks comparing G5 and x86 in 32-bits, I would expect G5 to outperform x86 on the same frequency in integer mode because of register pressure. Still this does not seem to be the case.

Now G5's pipeline and OOO core are principally similar to K8 and Conroe. So the one possible explanation is that x86 ISA has some hidden power that is not quite obvious at first look. Here, code density and not being load-store architecture are my favorite explanations.
But at the end of the day, the pipeline of the G5 is not that different from a P4 or K8 (note, i'm not very familiar with the G5, and i have to read more on the subject), it seems like it has more functional units, but it all depends on how you define a functional unit (for example, the G5 is said to have a "branch unit", while that doesn't count as a functional unit on P4 and K8 link), so it has 2 integer units, 2 load/store units... pretty much, it should be similar, clock for clock.
But i have to read more on the subject.

I recall the Pentium Pro in fact having 1 MB L2.

http://www.intel.com/design/archives/processors/pro/docs/243570.htm
 
Well, I spent some time googling to get some info on this...

Pentium Pro x86 decoder occupies 7% of die space. That was in 1995, with total transistor count 8.8mil per die.

Meanwhile, early K8 x86 decoder occupies 2% of die space.
Yup, but Pentium Pro didn't have any on-die L2 cache (it had 256KB on package, in a separate die) and had only 16K of L1...


That does make those 7% look even more interesting, does not it? (It means that with 4MB of on-die cache, it will be much much less than 1%).



I'm not sure I understand your point here.
Anyway, my point is that you wouldn't use that space for more cache; you would use it for execution resources.
It's not even just "die area"; it's more about the amount of engineering work necessary to design a very sophisticated decoding scheme, and all the consequences for the control unit of the CPU itself.

Also, apart from die size, the complex decoders waste clock cycles.

G5 - 16 stages
Conroe - 14 stages

Yep, but that does not mean so much.
It's like the enormous pipeline of Prescott: it had 30+ stages; now look at Conroe...
The G4e, for example, had only a 7-stage pipeline, with the same instruction set.
And the G4, which was out at the time of the K7, had only 4 (while the K7 had 10-15, link).
Also, the longer pipeline has an additional cost in terms of branch misprediction penalty, so even more resources have to be invested in a sophisticated branch predictor (a rough sketch of that cost follows below).
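To put a rough number on that cost, here is a back-of-the-envelope sketch (all figures are assumptions of mine, not from the thread): the extra CPI from mispredicted branches is roughly branch frequency x misprediction rate x flush penalty, and the flush penalty grows with pipeline depth.

    #include <stdio.h>

    int main(void)
    {
        double branch_freq     = 0.20;  /* assumed: about 1 in 5 instructions is a branch */
        double mispredict_rate = 0.05;  /* assumed: 95% predictor accuracy                */

        /* Compare flush penalties roughly spanning short vs. long pipelines. */
        for (int flush_penalty = 10; flush_penalty <= 30; flush_penalty += 10)
            printf("flush penalty %2d cycles -> extra CPI %.2f\n",
                   flush_penalty, branch_freq * mispredict_rate * flush_penalty);
        return 0;
    }

With these assumed numbers, going from a 10-cycle flush to a 30-cycle flush triples the misprediction cost, which is why a deeper pipeline needs a much better predictor.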
Here is what John Stokes from Ars Technica says:
To do all of this hefty decoding work, the K7 uses two separate decoding pipelines. It has a hardware decoder that decodes the smaller x86 instructions, and a microcode decoder that decodes the larger and more complicated x86 instructions. Since the larger x86 instructions are rarely used, the hardware decoder sees most of the action. Thus the microcode decoder doesn't really slow the K7's front end down. All it does is eat up transistor real estate, contributing to the K7's extremely high transistor count, large die size, high power dissipation, etc. Not only do all of these wild decoding contortions jack up the transistor count, but they affect the pipeline depth as well. The K7 spends three pipeline stages (or three cycles) after the instruction fetch finding and positioning the variable-length x86 instructions onto 6 decode positions (3 MacroOps = 6 ops). As a result, the K7's pipeline is about 3 stages longer than it would be if the x86 ISA were not CISC.

Now G5's pipeline and OOO core are principally similar to K8 and Conroe. So the one possible explanation is that x86 ISA has some hidden power that is not quite obvious at first look. Here, code density and not being load-store architecture are my favorite explanations.

But at the end of the day, the pipeline of the G5 is not that different from a P4 or K8 (note, i'm not very familiar with the G5, and i have to read more on the subject), it seems like it has more functional units, but it all depends on how you define a functional unit (for example, the G5 is said to have a "branch unit", while that doesn't count as a functional unit on P4 and K8 link), so it has 2 integer units, 2 load/store units... pretty much, it should be similar, clock for clock.


But that is my point - if the rest of OOO core is basically the same, RISC ISA should show some advantage, if RISC is so much better than "crippled" x86.

But in reality, x86 seems to be faster at crunching integer code. Even x86-32, with stupidly small registers file.
But the G5 trounces its x86 competitors in floating-point code, and so does the Itanium.
Having lots of registers (and 3-operand or 4-operand instructions, typical of load-store architectures) is especially important in FP code, because a certain amount of data has to be used again and again, especially in matrix-intensive calculations.
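To illustrate the register-reuse point, here is a hedged sketch (my own kernel, illustrative only; it assumes n is a multiple of 4): in a blocked matrix multiply the same a-value feeds four accumulators, and all of those values want to stay in registers at once, which is exactly where a small architectural register file starts to hurt.

    /* Hypothetical register-blocked matrix multiply: c = a * b, row-major,
       n assumed to be a multiple of 4. */
    void matmul_block4(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j += 4) {
                double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;  /* accumulators */
                for (int k = 0; k < n; k++) {
                    double aik = a[i * n + k];                  /* reused 4 times */
                    c0 += aik * b[k * n + j + 0];
                    c1 += aik * b[k * n + j + 1];
                    c2 += aik * b[k * n + j + 2];
                    c3 += aik * b[k * n + j + 3];
                }
                c[i * n + j + 0] = c0;
                c[i * n + j + 1] = c1;
                c[i * n + j + 2] = c2;
                c[i * n + j + 3] = c3;
            }
    }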
In integer operations, memory latency, bandwidth, and branch performance have the leading role, and a register-memory architecture like x86 proves to be very efficient if it has good memory latency.
I mean, in the end, from an architectural point of view, the K8 is just so close to the K7... but it outperforms it significantly, mostly for reasons (like the IMC) which have nothing to do with the pipeline.
Also, here you can see that even in integer applications, the G5 is very competitive clock for clock against x86 CPUs; of course this test was somewhat controversial, as the x86 CPU could get better results under Windows with a better compiler, but in the end it is not so unfair to test both machines with the same compiler and on the same OS.
 
It's not even just "die area", it's more about, amount of engineering work necessary to design a very sophisticated decoding scheme, and all the consequences on the control unit of the CPU itself.

It is certainly much harder to design. But even a startup company could do it... (NexGen).

The G4e for example, had only a 7 stage pipeline, with the same instruction set.
And the G4, which was out at the time of the K7, had only 4 (while the K7 had 10-15

Yes, but the G4 is not as wide, and a deeper pipeline means higher frequency, which was the major trouble for Apple at the time....

When they decided to fix the problem, suddenly the G5 had 16 stages...

Having lots of registers (and 3 operands or 4 operands instructions, typical of load store architectures) is especially important in FP code, because a certain amount of data has to be used again and again, especially in matrix intensive calculations.

Sure, the x87 FP stack is indeed a pretty stupid idea by today's standards; anyway, I guess this issue has been fixed since SSE2 (and AMD64).

In integer operations, memory latency and bandwidth and branch performance have a leading role, and a register-memory architecture like x86 proves to be very efficient, if it has good memory latency.

Yes, that is my point.

In the somewhat bigger picture, if x86 delivers superior integer performance, I can clearly see that a much better idea than introducing a new ISA would have been to fix the FP side. Which to some degree has already happened.

Which brings me to my very original point: Itanium was a mistake, from both a marketing and a technical standpoint. Itanium cannot compete with OOO architectures in the long run, and its market penetration will never be high enough because of software incompatibility.

Intel should not have wasted large amounts of resources on an in-order architecture and should instead have fixed x86 to be a good 64-bit CPU.
 
I'm not sure i understand your point here.
Anyway, my point is that you wouldn't use that space for more cache, you would use it for execution resources.

Isn't it the same? You save on the cache thanks to dense coding, so you can reuse it for execution resources as well 😉
 
I'm not sure i understand your point here.
Anyway, my point is that you wouldn't use that space for more cache, you would use it for execution resources.

Is not it the same? You save the cache thanks to dense coding, you can reuse it for execution resources as well 😉

Sorry to interrupt your [wonderful] debate.
I'm ignorant on these matters (overall), but my curiosity is just as large:

I understand "dense" & "coding" but I fail to understand "dense (or denser) coding", in the x86 ISA context (apart "squeezing" more parallelism or "going" 64-bit); the [ignorant] way I see it, "denser coding" could also demand larger caches, as a linear consequence of more instructions per cycle, hence, more data/instructions to be readily accessed (larger caches or...?).

Would you care to elaborate on this?

Thanks.


Cheers!
 
I'm not sure i understand your point here.
Anyway, my point is that you wouldn't use that space for more cache, you would use it for execution resources.

Is not it the same? You save the cache thanks to dense coding, you can reuse it for execution resources as well 😉

I understand "dense" & "coding" but I fail to understand "dense (or denser) coding", in the x86 ISA context


Dense *en*coding. While in the usual RISC ISA all opcodes are 4 bytes long, x86 has variable width, and, more or less accidentally, the most often used opcodes are the shorter ones. It even has 1-byte opcodes. Therefore the same code is usually shorter in x86.
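A rough sketch of what that means in bytes (the function is my own example; the byte counts are standard IA-32 encodings, not anything from this thread):

    int sum(const int *a, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];   /* x86 can use one register-memory instruction, e.g.
                            add eax, [edx+ecx*4]  -> 3 bytes;
                            a fixed-length load-store RISC needs a 4-byte load
                            plus a 4-byte add (8 bytes) for the same work.    */
        return s;
    }

Multiply that kind of saving over a whole program and the x86 binary, and therefore its instruction-cache footprint, usually ends up noticeably smaller.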
 
I'm not sure i understand your point here.
Anyway, my point is that you wouldn't use that space for more cache, you would use it for execution resources.

Is not it the same? You save the cache thanks to dense coding, you can reuse it for execution resources as well 😉

I understand "dense" & "coding" but I fail to understand "dense (or denser) coding", in the x86 ISA context


Dense *en*coding. While usual RISC ISA all opcodes 4 bytes long, x86 has variable width, and more or less accidentally, often used opcodes are shorter. It has even 1 byte opcodes. Therefore the same code is usually shorter in x86.

(By comparison to RISC ISAs) ok, now I got it.
But the issue is: RISC uses fixed-length instructions (32 bits each, if I'm not mistaken), while CISC (the x86 ISA) uses variable-length ones (up to 128 bits, again, if I'm not mistaken, even if they've got to be broken up into smaller pieces); in practical terms, how would ["dense encoding"] «save cache» if, in the x86 worst-case scenario, the ISA would have to deal with long instructions, and therefore with cases where more cache is needed? (And I'm considering the K8 & Core as well, although as a 'mingle' of both CISC & RISC ISAs.)

Thanks again.


Cheers!
 
But, the issue is: RISC uses fixed-length instructions (32 bits each, if I'm not mistaken), while CISC (x86 ISA) uses variable-length ones (up to 128 bits, again, if I'm not mistaken, even if they've got to be broken up into smaller pieces); in practical terms, how would it ["dense encoding"] «save cache» if, in the x86 worst-case scenario, the ISA would have to deal with long instructions, therefore, in cases where more cache is needed?

But the cache does not have to deal with the "worst-case scenario"... Why should it? It simply stores opcodes just as they are in the program. Well, NetBurst stores "decoded" opcodes in the trace cache, but that is still the L1 cache. The L2 cache on all architectures stores "raw" data.

Moreover, many of those long opcodes would have to be encoded as two or even three RISC opcodes.
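Here is a minimal sketch of that expansion (my own example; the encodings are the standard IA-32 ones, nothing from the posts):

    void bump(int *counter)
    {
        /* x86 can express this as a single read-modify-write instruction:
             add dword ptr [eax], 1      ; 3 bytes (83 00 01)
           A load-store RISC has to split it into three 4-byte instructions:
             load, add-immediate, store  ; 12 bytes in total.               */
        *counter += 1;
    }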

(and, I'm considering the K8 & Core as well, although as a 'mingle' of both CISC & RISC ISAs).

Oh, they are CISC, not a "mingle". The 32-bit ISA has been fixed since the 80386.
 
I'm not sure i understand your point here.
Anyway, my point is that you wouldn't use that space for more cache, you would use it for execution resources.

Is not it the same? You save the cache thanks to dense coding, you can reuse it for execution resources as well 😉
Yes, in terms of die space, of course :)
But in terms of design complexity, the cache should be considered separately from the core.
I mean, you can easily design huge caches, but it's not like you could substitute them with a 1000-issue ultramegasupadupa pipelined core just because they'd have the same die area 😀
 
The G4e for example, had only a 7 stage pipeline, with the same instruction set.
And the G4, which was out at the time of the K7, had only 4 (while the K7 had 10-15

Yes, but G4 is not as wide and deeper pipeline means higher frequency, which was major trouble for Apple at the time....

When they decided to fix the problem, suddenly G5 had 16 stages...

Meh, and Prescott had 30 :)

Having lots of registers (and 3 operands or 4 operands instructions, typical of load store architectures) is especially important in FP code, because a certain amount of data has to be used again and again, especially in matrix intensive calculations.

Sure, x87 fp-stack is indeed pretty stupid idea by today's standards, anyway I guess this issue is fixed since SSE2 (and AMD64).

SSE2 improved the situation a lot, but it still lacks 3-operand instructions and has relatively few registers (for non-vector code).
As such, the FP (SSE) performance of x86 CPUs is still not so stellar.

In integer operations, memory latency and bandwidth and branch performance have a leading role, and a register-memory architecture like x86 proves to be very efficient, if it has good memory latency.

Yes, that is my point.

In somewhat bigger picture, if x86 delivers superior integer performance, I can clearly see that much better idea than introducing the new ISA would have been to fix FP. Which to some degree already happened.

Which brings me to my very orginal point: Itanium was a mistake, both from marketing and technical standpoint. Itanium cannot in long run compete with OOO architectures and its market penetration will never be high enough because of SW incompatibility.

Intel should not have wasted large amounts of resources on in-order architecture and rather fix x86 to be a good 64-bit CPU.
Well, it doesn't really deliver superior integer performance clock for clock (compared to the Power architecture, not Itanium), but yes, it is competitive.
Again, I'm not so sure Itanium is a mistake, but certainly they could have created a "clean" and more traditional new ISA, maybe with 3-operand instructions AND memory/register ones, with a few more architectural registers, a regular (but not fixed) instruction length...
Something like a mixture between "clean" x86 and RISC... oh well, anyway, I'm sure they had their reasons. A CPU architect usually has to make a lot of compromises, and his choices are dictated by quantitative factors; then an architecture usually turns out to be a failure because of shifts in market demand or in the industry as a whole.
 
I'm not sure i understand your point here.
Anyway, my point is that you wouldn't use that space for more cache, you would use it for execution resources.

Is not it the same? You save the cache thanks to dense coding, you can reuse it for execution resources as well 😉
Yes, in terms of die space, of course :)
But in terms of complexity of design, the cache should be considered separately from the core.


Yes, that is true. To make some more points in support of your view:

- The x86 L1 instruction cache usually has 2 more bits to store instruction boundaries, so that decreases the code-density advantage for the L1 cache (but not for L2 and main memory).

- AFAIK, caches are (or can be) designed with some redundancy and error-correction possibilities, so that a single bad cache line can be "repaired" by configuring the CPU (burning fuses with a laser).

Anyway, I still maintain my point that the costs of the x86 ISA diminished to near zero around 1999, and it is not a significant problem for increasing high-end CPU performance in the future.

Also, for integer code, x86 has consistently shown superior performance for such a long period of time that very likely there are some hidden advantages (and I think it must be register-memory vs. load-store).

IMHO, at this moment IA-64 is the architecture with the performance problem, not x86, because of the fundamentally in-order nature of EPIC.
 
The G4e for example, had only a 7 stage pipeline, with the same instruction set.
And the G4, which was out at the time of the K7, had only 4 (while the K7 had 10-15

Yes, but G4 is not as wide and deeper pipeline means higher frequency, which was major trouble for Apple at the time....

When they decided to fix the problem, suddenly G5 had 16 stages...

Meh, and Prescott had 30 :)


But at the time, the match was PIII and K7 vs. the G4. The G4 was stuck at 500 MHz, while the PIII/K7 were at 1 GHz.

Nobody disputes the Prescott fiasco here :)

SSE2 improved the situation a lot, but it still lacks 3 operand instructions and has relatively few register (for non vector code).
As such, the FP (SSE) performance of x86 CPUs is still not so stellar.

Well, I must say I will have to learn a little bit more about the importance of 3-operand FP; anyway, at the same time I think x86 FP never had the FPU execution resources of the Itanium/G5. So let us wait and see what happens when it gets them :)

At the same time, I think that FP code makes up less than 5% of that valuable x86 legacy (in terms of software and software expertise).

Again, i'm not so sure if Itanium is a mistake, but certainly they could have created a "clean" and more traditional new ISA, maybe with 3 operands instructions AND memory/register ones, with a few more architectural registers, a regular (but not fixed) instruction length..

Yes, a super-x86 would be nice :)

factors, then an architecture usually turns up as a failure because of shifts in market demand or the industry as a whole.

Yes. But I think that around 1998 the EPIC problems were already clearly visible; they should have had the guts to cancel the project. They will do it anyway soon, with much more money lost.