AMD Quadfather (4x4) vs. Intel Kentsfield

But most modern CPUs have a branch misprediction rate of less than 5%, so the speedup is only marginal. Also, as I said, what happens if you encounter another branch before you have resolved the condition of the predicated one?
You can't use predication.
On a long pipeline like Prescott's, you could basically never use it.
This doesn't mean that predication is bad, but it's certainly no holy grail.
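To make the idea concrete, here is a minimal C sketch of what predication buys you: both "sides" of the condition are computed and the result is selected without a branch, so there is nothing for the predictor to mispredict. The function names (`max_branchy`, `max_branchless`) are mine, and a real predicated ISA (or a compiler emitting `cmov`) does this at the instruction level, not in C.

```c
/* Branchy version: relies on the branch predictor. */
static int max_branchy(int a, int b) {
    if (a > b)
        return a;
    return b;
}

/* Branch-free version: both "sides" are evaluated and the result is
 * selected with a mask, mimicking what predication (or a cmov) does. */
static int max_branchless(int a, int b) {
    int take_a = -(a > b);            /* all-ones mask if a > b, else 0 */
    return (a & take_a) | (b & ~take_a);
}

/* Usage: max_branchy(3, 7) and max_branchless(3, 7) both yield 7. */
```

Note that the branch-free version always does the work of both outcomes, which is exactly the trade-off being discussed: it only pays off when the branch would have been mispredicted often.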
Here I have a link which shows an example of HW loop unrolling thanks to Tomasulo's algorithm.
 
With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.
Well, a K7 can do that (and K8 and Core 2).
Perhaps even a humble Pentium 2.
They do it with register renaming (Tomasulo's algorithm) and speculative execution: the loop branch will be correctly predicted as taken every time except the first and the last.
Each loop iteration will use a different set of "shadow" registers until its instructions retire.

As for speculative execution, or as some of you are coining it, "speculative OOO" (I have never heard of someone calling it that), or simply branch prediction: my bad, I had just never heard speculative execution called "speculative OOO" before.

With that in mind, yes, modern x86 machines are capable of software pipelining, but it is done at the compiler level, not at the machine level as it is with EPIC. It's noticeably more efficient on EPIC than on x86, given x86's register limits as opposed to EPIC's.

It should also be noted that software pipelining is quite difficult to implement on x86; in nearly all cases it is faster and much easier to write a straightforward loop, or to use Duff's device for your loop unrolling, which is simpler still.
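For reference, here is the classic form of Duff's device in C: the switch jumps into the middle of an eight-way unrolled copy loop so that counts which are not a multiple of 8 need no separate cleanup loop. The `copy_ints` wrapper name is mine; the interleaved switch/do-while is the original technique.

```c
#include <stddef.h>

/* Duff's device: copy `count` ints (count must be > 0).  The switch
 * dispatches into the middle of the 8x-unrolled loop body to handle
 * the leftover count % 8 iterations on the first pass. */
static void copy_ints(int *to, const int *from, size_t count) {
    size_t n = (count + 7) / 8;       /* number of passes through the body */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

This is legal (if startling) C, because case labels may appear anywhere inside the statement controlled by the switch, including inside the do-while.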

Why are you bringing Tomasulo's algorithm into the argument? I'm quite aware that it gave birth to out-of-order execution processors. Did you look that up on Wikipedia to make me look stupid?
Nope. OOO execution existed before Tomasulo's algorithm was invented; it was based on a simpler algorithm called scoreboarding.
Tomasulo's algorithm allows dynamic loop unrolling at the HW level; that's why I mentioned it.
And if I find the time I might provide an example of this, but not tonight (yes, it's night here :) ).
And no, I didn't look it up on a wiki, since I teach this in a computer architecture course at a college... :)
Did you mention me looking it up on a wiki to make me look stupid? 😉

Well, I have read very little on scoreboarding. I took the initiative and found this, but really I am not seeing the overall relevance, other than that OOO did exist back in the day, which you proved, but that's about it.
 
You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.
The relevance is that x86 machines can do overlapping loop execution in HW, thanks to Tomasulo's algorithm.

Well, you are going to have to point me to the compiler white sheet that says that. I am looking as well, and I keep coming up with Itanium-only support.
 
With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.

Well, and this might explain something to you: I have seen it in reality when trying to optimize some of my algorithms. E.g., when I tried to optimize using MMX, a fine-tuned, hand-unrolled assembly loop on AMD64 was exactly as fast as the primitive C implementation.

After thinking about the issue for a week, I finally decided that the only explanation is that the loop is "unrolled" by the CPU and executed in parallel. Since then I feel deep respect for the combination of OOO and speculative execution.
 
Sure. Directly from Intel's marketing material. How many such cases do you meet in the real world?

But the example is comparing simple branches; does that generally carry over to complex branches?


E.g., a subroutine call cannot be handled with predication. And that is pretty common.

You'll have to show me when that changed from theory to silicon, as well...

In 1995. Years before the first Itanium launched.

...show me an x86 machine that can do overlapping loop execution without hardlocking.

Open your x86-compatible computer case and look at the chip under the biggest heatsink :)
 
As for speculative execution, or as some of you are coining it speculative OOO (I have never heard of someone calling it that)

Sorry for the confusion. AFAIK OOO and speculative execution are orthogonal. By "speculative OOO" I mean "OOO with speculative execution".

You obviously need both to have interleaved loop iteration execution.

With that in mind, yes, modern x86 machines are capable of software pipelining, but it is done at the compiler level, not at the machine level as it is with EPIC.

What the hell do you mean by this?

It's noticeably more efficient on EPIC than on x86, given x86's register limits as opposed to EPIC's.

The number of registers has nothing to do with the EPIC concept. E.g., SPARC has as many registers as Itanium, it even has register windows, and it is still an OOO machine. BTW, it does not work that well for SPARC either...

It is true that the 8 GPRs of IA32 somewhat limit performance (but not that much). Anyway, this problem is solved by AMD64. There is little you can complain about in the AMD64 ISA performance-wise.
 
You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.
The relevance is that x86 machines can do overlapping loop execution in HW, thanks to Tomasulo's algorithm.

Well, you are going to have to point me to the compiler white sheet that says that. I am looking as well, and I keep coming up with Itanium-only support.

The point you do not understand is that overlapping loop iterations are a natural property of OOO and speculative execution, and they are done by the hardware.

Maybe the trouble is that while for Itanium the marketing materials are able to explain the techniques used to the average Joe, and it all seems nice, a P6-like architecture is a kind of magic that is hard to explain to the public...
 
You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.
The relevance is that x86 machines can do overlapping loop execution in HW, thanks to Tomasulo's algorithm.

Well, you are going to have to point me to the compiler white sheet that says that. I am looking as well, and I keep coming up with Itanium-only support.
Finally I see where we misunderstand each other.
Sorry, but you can quit your search, as you won't find a compiler whitepaper referring to dynamic scheduling in OOO CPUs. :)
In fact, EPIC's rotating registers for loop unrolling are a "poor man's version" of the register renaming in Tomasulo's algorithm.
With register renaming, the CPU has a larger set of registers which are used by Tomasulo's algorithm for dynamic scheduling, in a way that is completely invisible to the compiler and the programmer.
For example, the K7 has 88 physical FP registers, while the instruction set of course exposes only 8; the FP scheduler has 36 entries, which means that up to 36 instructions can be "in flight", each with its own registers (the loop unrolling happens inside the reorder buffers of the CPU, with OOO).
Basically, the difference from EPIC is that there the compiler does the loop unrolling and explicitly tells the CPU to "rotate" the register file to resolve naming conflicts among registers, while an OOO CPU does all this internally in HW, without the compiler knowing anything about what's going on.
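A rough way to see in C what renaming exploits (the `dot` helper and the two-accumulator split are my own illustration, not anyone's quoted code): each iteration below feeds a different accumulator, so no iteration has to wait for the previous one. Tomasulo-style renaming discovers the same independence even in the rolled single-accumulator loop, by giving each in-flight iteration its own physical register.

```c
#include <stddef.h>

/* Dot product with two independent dependency chains.  Splitting the
 * accumulator makes explicit in software the overlap that an OOO core
 * with register renaming extracts on its own from the rolled loop. */
static long dot(const int *x, const int *y, size_t n) {
    long s0 = 0, s1 = 0;                   /* two independent chains */
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += (long)x[i] * y[i];           /* chain 0 */
        s1 += (long)x[i + 1] * y[i + 1];   /* chain 1 */
    }
    if (i < n)                             /* odd tail element */
        s0 += (long)x[i] * y[i];
    return s0 + s1;
}
```

The rolled version with one `s` is the code the compiler actually has to emit on register-starved x86; the point of the thread is that the renamed physical registers let the hardware behave as if the split had been done anyway.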
 
It's noticeably more efficient on EPIC than on x86, given x86's register limits as opposed to EPIC's.

The number of registers has nothing to do with the EPIC concept. E.g., SPARC has as many registers as Itanium, it even has register windows, and it is still an OOO machine. BTW, it does not work that well for SPARC either...

It is true that the 8 GPRs of IA32 somewhat limit performance (but not that much). Anyway, this problem is solved by AMD64. There is little you can complain about in the AMD64 ISA performance-wise.
Here I will side a bit with the pro-EPIC party :)
Yes, the extremely limited number of registers of x86 has nothing to do with OOO architectures in general.
However, the idea behind EPIC is the following.
OOO is cool and powerful, but it has a cost in terms of HW, an important cost.
So EPIC says: let's do without OOO and use that silicon to improve the execution resources instead, leaving the scheduling to the compiler.
For example, instead of using 88 registers only for dynamic scheduling, I could expose them to the programmer, who could then also use them as storage space, say for local variables, or to speed up procedure calls.
The same goes for the deep buffers, schedulers, and extra control logic, which becomes much more complicated; that silicon could be used to create extra ALUs, which can give a higher peak performance.
It's a bit like CISC -> RISC: with CISC you had complex instructions which could practically make you a coffee; RISC broke them down into sequences of simpler instructions, simplifying HW design and the control logic of the CPU. This in turn meant that hand-writing assembler code became more difficult, and compilers had to become more sophisticated.
But the HW overall became faster, and RISC was a big success.
However, EPIC has yet to prove to be the next big thing compared to OOO RISC.
Take the predication example: with EPIC you use redundant execution resources to process both sides of a branch and then kill the wrong one; however, it could be more efficient to use those resources to process a second thread, as in SMT and Hyper-Threading.
And there are cases where compiler analysis cannot find the dependencies, because they arise only at run time.
Let's take the following example:

array[n] = array[j] * a
array[k] = array[j] + c

Now, depending on the run-time values of n, j, k, we can have several possible dependencies which can turn into hazards:
n = j : real (or flow) dependency (read after write)
k = j : blocking (or anti) dependency (write after read)
k = n : write (or output) dependency (write after write)

Since the values of n, j, k are not known at compile time, a compiler cannot do much to optimize this code, while a HW OOO scheduler can dynamically schedule it at run time.
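The point can be seen by just running the two statements with particular index values; the `apply` helper below is my own wrapper around the example above. With n == j, the second statement must observe the first one's write, so the two cannot be swapped; with all three indices distinct, they could run in either order.

```c
/* The two statements from the example.  Which hazards exist depends
 * entirely on the run-time values of n, j and k, which is exactly why
 * static compiler analysis cannot resolve them. */
static void apply(double *array, int n, int j, int k, double a, double c) {
    array[n] = array[j] * a;   /* write array[n], read array[j]       */
    array[k] = array[j] + c;   /* read array[j] again, write array[k] */
}

/* With n == j == 1, k == 2, a == 2, c == 10 and array[1] == 2.0:
 * the first statement stores 4.0 into array[1], and the second must
 * see that new value (read-after-write), making array[2] == 14.0. */
```

Reordering the two statements in the n == j case would instead read the stale array[1], giving a different (wrong) result, which is the hazard the hardware scheduler has to detect dynamically.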

IMO, what made OOO so popular is that different CPUs in the same family can share the same compiled code, and each of them executes it as fast as possible given its own architecture.
Without OOO, AMD probably would have had no chance against Intel, because there was no way they could convince developers to really optimize code for their CPUs; the only way was for the CPU to optimize the code itself.
 
With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.

Well, and this might explain something to you: I have seen it in reality when trying to optimize some of my algorithms. E.g., when I tried to optimize using MMX, a fine-tuned, hand-unrolled assembly loop on AMD64 was exactly as fast as the primitive C implementation.

After thinking about the issue for a week, I finally decided that the only explanation is that the loop is "unrolled" by the CPU and executed in parallel. Since then I feel deep respect for the combination of OOO and speculative execution.

Show me then: show me the code and show me the assembly output. I want to see every line of code, so I can get it independently confirmed by Intel.

Until then I will not budge from my stance that an x86 machine is not capable of overlapping loop execution.
 
And there are cases where compiler analysis cannot find the dependencies, because they arise only at run time.
Let's take the following example:

array[n] = array[j] * a
array[k] = array[j] + c

Now, depending on the run-time values of n, j, k, we can have several possible dependencies which can turn into hazards:
n = j : real (or flow) dependency (read after write)
k = j : blocking (or anti) dependency (write after read)
k = n : write (or output) dependency (write after write)

Actually, this is my main complaint about the EPIC idea as well. See the thread a couple of messages ago where I suggested a similar example. Dynamic analysis is always more productive than static analysis.

My second complaint about EPIC is that parallel execution cannot pass a branch, simply because at the branch point a parallel group of instructions has to stop.
 
With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.

Well, and this might explain something to you: I have seen it in reality when trying to optimize some of my algorithms. E.g., when I tried to optimize using MMX, a fine-tuned, hand-unrolled assembly loop on AMD64 was exactly as fast as the primitive C implementation.

After thinking about the issue for a week, I finally decided that the only explanation is that the loop is "unrolled" by the CPU and executed in parallel. Since then I feel deep respect for the combination of OOO and speculative execution.

Show me then: show me the code and show me the assembly output. I want to see every line of code, so I can get it independently confirmed by Intel.


I can post it here (though I would have to dig deep in the version control system; the code has been abandoned for 2 years), but I am not sure what you are going to take from it. Do you want to run a benchmark?

Until then I will not budge from my stance that an x86 machine is not capable of overlapping loop execution.

Then go ask somewhere else. It would be much faster :)
 
Show me then: show me the code and show me the assembly output. I want to see every line of code, so I can get it independently confirmed by Intel.

Until then I will not budge from my stance that an x86 machine is not capable of overlapping loop execution.
Spud, you cannot see HW loop unrolling in the assembler code.
You can't, because in OOO CPUs the unrolling is not done there!
You just have your loop in assembler, and the loop unrolling takes place inside the schedulers (or reservation stations, reorder queues, etc.) of the CPU.
It's only the compiler which does loop unrolling at the assembler level; EPIC provides some special assistance for this, x86 does not, but if you do loop unrolling on x86 at the compiler level (or in hand-written assembler), you won't find any performance improvement.
The reason is that the CPU would unroll the loop internally anyway, doing "register rotation" (to say it in an EPIC-like way) with its shadow registers.
One of the links I posted has an example of how this works.
 
provides some special assistance to this; x86 does not, but if you do loop unrolling on x86 at compiler level (or hand written assembler), you won't find any performance improvements.

Actually, this is not quite true: in very specific cases there can still be some advantage to an unrolled loop, especially when the body is minimal (2-3 opcodes), at least the last time I benchmarked on AMD64. But the difference is very small.

The reason is that even a correctly predicted branch has a single clock of latency (it occupies a slot in the pipeline).
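A sketch of the case being described (the function names are mine): the 2x-unrolled version retires half as many loop branches, so with a tiny body the saved branch slots can show up as a small win, even though both versions perform the same additions.

```c
#include <stddef.h>

/* Rolled loop: one (predicted-taken) branch per element. */
static long sum_rolled(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* 2x-unrolled loop: one branch per two elements, plus an odd tail. */
static long sum_unrolled2(const int *v, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2)
        s += v[i] + v[i + 1];
    if (i < n)            /* leftover element when n is odd */
        s += v[i];
    return s;
}
```

Both return identical results; any measured difference comes purely from the halved branch count, which is why the effect only matters when the loop body is this small.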
 
OK, I agree, but this is nitpicking and is not exactly the direction we were pointing in :)
What I meant is that you cannot show HW loop unrolling from assembler code.
Now you mention that there are a few cases where SW loop unrolling is advantageous.
Well, in any case x86 CPUs, with their limited (architectural) register set, are not exactly good candidates for SW loop unrolling; also, the increased register pressure, from a compiler's point of view, might offset the slight performance increase by forcing the spilling of registers which could otherwise hold other variables.
So the scenario you're picturing seems more like finely crafted hand-written assembler code than typical compiler output.
 
Actually, this is my main complaint about the EPIC idea as well. See the thread a couple of messages ago where I suggested a similar example. Dynamic analysis is always more productive than static analysis.
I wouldn't say "always".
The compiler can have a much more general view of what is going on, while the HW scheduler has a fixed window and thus a very limited temporal horizon.
But anyway, the point of EPIC is not whether you can resolve hazards better with a compiler than with an online scheduler.
Its point is: if I invest the scheduler's die space in execution resources, like ALUs and registers, will I get something overall faster?
So far, the answer is neither a definite "yes" nor a definite "no".
 
But anyway, the point of EPIC is not whether you can resolve hazards better with a compiler than with an online scheduler.
Its point is: if I invest the scheduler's die space in execution resources, like ALUs and registers, will I get something overall faster?
So far, the answer is neither a definite "yes" nor a definite "no".

Will there ever be a time when this question is finally put to rest? It has been more than 18 years since HP started developing EPIC, and Intel has had a good compiler team working on the issue for many years.

On the contrary, the P6 design was finished in 1994 (I am not sure when they started, but I guess it took 5 years at most, maybe less) and it has seemed to work well since then. It even seems to scale.

Now don't get me wrong. EPIC is indeed a very interesting concept, just like Transmeta, SPARC register windows, ARM's simplicity, Niagara's hyperthreading, etc. It just seems that right now it is not the fastest platform, which, considering the efforts and investments (and hype), it should be.
 
E.g., a subroutine call cannot be handled with predication. And that is pretty common.

1.
2.
3.

I personally believe anything beyond what is said in these documents is well outside my scope, and I highly doubt an assembly writer will find the time to post on THG.

In 1995. Years before the first Itanium launched.

Interesting, that's when silicon development started for the Itanium.
 
E.g., a subroutine call cannot be handled with predication. And that is pretty common.

I personally believe anything beyond what is said in these documents is well outside my scope, and I highly doubt an assembly writer will find the time to post on THG.



Actually, looking at my post now, I should be more specific. You can in principle predicate the subroutine call itself, but that is equivalent to a conditional branch (the subroutine is either executed or not).
 
What the hell do you mean by this?

I mean what I said: a modern x86 machine (processor) can execute assembly code that has been set up to be pipelined, but it cannot morph, shift, or otherwise manipulate code at the machine level to pipeline it; EPIC, on the other hand, can.

The number of registers has nothing to do with the EPIC concept.

SPARC
General purpose: 32 x 64-bit
Floating point: 16 x 128-bit (quad precision), 32 x 64-bit (double precision), 32 x 32-bit (single precision)

Itanium
General purpose: 128 x 64-bit
Floating point: 128 x 82-bit
Predicate: 64 x 1-bit
Branch: 8 x 64-bit

And it has everything to do with the EPIC concept from the get-go. The more registers the software can see, the more performance can be squeezed out of the processor's resources; that is a well-known fact. And just as you said, that's why x86-64 gets a performance increase: because of the additional registers, not because of magical 64-bit gremlins.

Gelato has a very comprehensive explanation of the goals of EPIC; it might do you some good to check it out.
 
You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.
The relevance is that x86 machines can do overlapping loop execution in HW, thanks to Tomasulo's algorithm.

Well, you are going to have to point me to the compiler white sheet that says that. I am looking as well, and I keep coming up with Itanium-only support.

The point you do not understand is that overlapping loop iterations are a natural property of OOO and speculative execution, and they are done by the hardware.

Maybe the trouble is that while for Itanium the marketing materials are able to explain the techniques used to the average Joe, and it all seems nice, a P6-like architecture is a kind of magic that is hard to explain to the public...

No, I am sorry, but it is not possible on x86, and unless you can bring some whitepapers to the table that say otherwise, I will stand by my convictions.
 
You'll have to show me when that changed from theory to silicon, as well as show me an x86 machine that can do overlapping loop execution without hardlocking.
The relevance is that x86 machines can do overlapping loop execution in HW, thanks to Tomasulo's algorithm.

Well, you are going to have to point me to the compiler white sheet that says that. I am looking as well, and I keep coming up with Itanium-only support.
Finally I see where we misunderstand each other.
Sorry, but you can quit your search, as you won't find a compiler whitepaper referring to dynamic scheduling in OOO CPUs. :)
In fact, EPIC's rotating registers for loop unrolling are a "poor man's version" of the register renaming in Tomasulo's algorithm.
With register renaming, the CPU has a larger set of registers which are used by Tomasulo's algorithm for dynamic scheduling, in a way that is completely invisible to the compiler and the programmer.
For example, the K7 has 88 physical FP registers, while the instruction set of course exposes only 8; the FP scheduler has 36 entries, which means that up to 36 instructions can be "in flight", each with its own registers (the loop unrolling happens inside the reorder buffers of the CPU, with OOO).
Basically, the difference from EPIC is that there the compiler does the loop unrolling and explicitly tells the CPU to "rotate" the register file to resolve naming conflicts among registers, while an OOO CPU does all this internally in HW, without the compiler knowing anything about what's going on.

Again, I can provide whitepapers in defense of my position; you will have to do the same, otherwise you're just another person trying to discredit me, and frankly you're not doing a very good job at this point.
 
Here is a simple example of register renaming and dynamic scheduling at work.

Yes, I have read Ace's explanation on the matter, and it proves what I am saying: x86 has too few registers to do overlapping loop execution. Not enough registers for the instructions and the dependencies.