AMD Quadfather 4x4 vs. Intel Kentsfield

The system interface control logic contains an in-order queue (IOQ) and an out-of-order queue (OOQ), which track all transactions pending completion on the system interface. The IOQ tracks a request's in-order phases and is identical on all processors and the node controller. The OOQ holds only deferred processor requests. The IOQ can hold eight requests, and the OOQ can hold 18 requests. The system interface logic also contains two 128-byte coalescing buffers to support write-coalescing stores. The buffers can coalesce store requests at byte granularity, and they strive to generate full line writes for best performance. Writes of 1 to 8 bytes, 16 bytes, or 32 bytes are possible when holes exist in the coalescing buffers.

Isn't this rather related to committing writes to main memory? There is nothing that would prevent OOO cores from using similar techniques, and AFAIK they do.

If you look at predication and branching in Itanium, there is less of a chance of a stall in Itanium versus x86.

I believe you are speaking about a different subject. If a branch misprediction happens, the pipeline is not stalled but canceled.

What we are dealing with when speaking about a pipeline stall is a cache miss. In that case, any in-order core (like Itanium) has to wait for the data (the pipeline is stalled). That is why it is so important for Itanium to use explicit prefetch and why it has such a big, low-latency cache.

Meanwhile, an OOO core has a good chance to continue with another ready instruction from the reorder buffer. In the worst case, it will not do any real processing in the execution units, but it will likely still continue fetching instructions and placing them into the reorder buffer.

(I hope pippero will correct any errors I made 😉)

Every response phase on Itanium is followed by a deferred phase, in which the data phase can be out of order.

I believe this is dealing with completion and committing results to memory. That is nontrivial, but still the easier part of the trouble...
 
The system interface control logic contains an in-order queue (IOQ) and an out-of-order queue (OOQ), which track all transactions pending completion on the system interface. The IOQ tracks a request's in-order phases and is identical on all processors and the node controller. The OOQ holds only deferred processor requests. The IOQ can hold eight requests, and the OOQ can hold 18 requests. The system interface logic also contains two 128-byte coalescing buffers to support write-coalescing stores. The buffers can coalesce store requests at byte granularity, and they strive to generate full line writes for best performance. Writes of 1 to 8 bytes, 16 bytes, or 32 bytes are possible when holes exist in the coalescing buffers.
Here you're talking about the system interface logic, which sits outside of the execution engine of the CPU.
And this "write coalescing" process is called write combining (link) in x86 CPUs, and it has existed since the Pentium II.
It is used to speed up write transfers to non-cacheable memory areas (typically, I/O operations on the memory-mapped registers of peripheral devices, especially graphics cards) through the use of a special write buffer. In the case of cacheable memory areas, such a process is called write allocation, and consists in fetching a cache line also on a write miss (without write allocation, this happens only on load misses), with the idea that it is very likely that further writes will happen within the same line in a short amount of time (spatial and temporal locality).
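As an aside, a minimal sketch (my own example, not from the linked article) of how a programmer can target those write-combining buffers explicitly, using SSE2 non-temporal stores to fill a buffer without polluting the cache:

[code]
; fill ecx dwords at [edi] with the value in eax;
; consecutive movnti stores merge in the WC buffers
; and leave the core as full line writes
fill:
movnti [edi], eax   ; non-temporal store (SSE2)
add    edi, 4
dec    ecx
jnz    fill
sfence              ; drain the WC buffers before the data is read back
[/code]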
 
If you look at predication and branching in Itanium, there is less of a chance of a stall in Itanium versus x86.

I believe you are speaking about a different subject. If a branch misprediction happens, the pipeline is not stalled but canceled.

What we are dealing with when speaking about a pipeline stall is a cache miss. In that case, any in-order core (like Itanium) has to wait for the data (the pipeline is stalled). That is why it is so important for Itanium to use explicit prefetch and why it has such a big, low-latency cache.

Meanwhile, an OOO core has a good chance to continue with another ready instruction from the reorder buffer. In the worst case, it will not do any real processing in the execution units, but it will likely still continue fetching instructions and placing them into the reorder buffer.

(I hope pippero will correct any errors I made 😉)

What you wrote is correct :)
That's the reason why x86 CPUs have such deep reorder buffers: it's especially on memory access misses that an instruction can stall for tens or hundreds of clocks before being completed.
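A tiny made-up example of what that means (register names arbitrary):

[code]
mov eax, [esi]   ; load - may miss in the cache (hundreds of clocks)
add ebx, ecx     ; independent - an OOO core keeps executing these
imul edx, edi    ; while the load is still outstanding; an in-order
add ebp, 8       ; core would just sit on the miss
add eax, ebx     ; dependent on the load - waits in the scheduler
[/code]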
 
That's the reason why x86 CPUs have such deep reorder buffers: it's especially on memory access misses that an instruction can stall for tens or hundreds of clocks before being completed.

Actually, now learning more and more about this subject (to make correct posts in this thread :), I am starting to think that the CURRENT x86 architectures can deal with cache load latency this way, but would stall on memory access anyway, as there is just a single load and a single store unit on Intel CPUs, and while there are 2 load units on AMD, AMD seems to be unable to reorder loads. In other words, while a load does not stall the pipeline, there is a limited number of micro-ops that can be usefully performed before another load micro-op blocks the pipeline.

Moreover, until Conroe (and K8L), neither Intel nor AMD allowed reordering a load ahead of a store (speculative loads). (Conroe CAN do that, but still has just a single load unit.) The K8 seems to even disallow reordering of loads (which raises the interesting question of why it has 2 load units...)

In other words, I expect to see huge performance improvements when they start to do things right... like adding another load unit to Core 3 :)

Also, if AMD gets K8L speculative loads right, it can give a significant boost to small loops (with two load units) - just something AMD fanboys are looking for.

This of course does not change the basic premise that an OOO engine can unroll loops and process iterations in parallel. It just means that with current implementations, usually only two loop iterations are processed in parallel. OTOH, given the number of execution units on both Conroe and K8, this is hardly a problem, except maybe for some very small loops. The moral of the story is that it still might be a good idea to unroll a very small loop in SW for the K8 (see the sketch below), and that the boost comes from more than just branch latency.
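A minimal sketch of that SW unrolling (my own example, assuming the trip count in ebx is even) - note how register renaming keeps the two uses of eax from colliding:

[code]
lbl:
mov eax, [esi]     ; iteration i
add [edi], eax
mov eax, [esi+4]   ; iteration i+1 (eax gets renamed)
add [edi+4], eax
add esi, 8
add edi, 8
sub ebx, 2         ; two iterations consumed per pass
jnz lbl
[/code]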

Meanwhile, the Itanium 2 has only 2 load units, so you cannot expect anything spectacular from its SW-based solution either...
 
OK. What are register renaming and the roughly 100 physical registers on x86 implementations good for?

Not sure, because there aren't that many registers, and frankly your spreading of misinformation is beginning to wear on my patience as well.

There are 8 32-bit general-purpose registers: EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
There are 8 new 64-bit general-purpose registers: RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP
There are 8 x87 FP registers: ST0, ST1, ST2, ST3, ST4, ST5, ST6, ST7
There are 8 MMX registers: MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7
There are 8 XMM registers: XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7

Source page 81 gives you an overview, since you live in x86 dream land.

What is the reorder buffer good for?

The function of the reorder buffer is to put the instructions back in the original program order after the instructions have finished execution. You have any other questions you need answered?

What do you think out-of-order means?

In superscalar architectures, out-of-order issue mechanisms increase performance by dynamically rescheduling instructions that cannot be statically reordered by the compiler. You have any other questions you need answered?

Is it there just because chip designers were bored and wanted to waste some silicon for no real gain?

The idea of out-of-order execution was to achieve a higher level of hardware utilization; additionally, out-of-order execution alleviated memory latency issues. You have any other questions you need answered?

No, for EPIC, the compiler or programmer has to do it. EPIC hardware has neither out-of-order execution nor register renaming, and therefore cannot do architecturally invisible loop unrolling.

You're correct, EPIC needs the compiler to unroll these loops statically. But it should be noted that the Itanium 2 provides architectural support for software pipelining without unrolling. With that in mind, I should be more specific about which revisions support what; I have been generalizing EPIC, which has led to some miscommunication. I will try to proof my responses more thoroughly in the future.
 
OK. What are register renaming and the roughly 100 physical registers on x86 implementations good for?

Not sure, because there aren't that many registers, and frankly your spreading of misinformation is beginning to wear on my patience as well.

There are 8 32-bit general-purpose registers: EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
There are 8 new 64-bit general-purpose registers: RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP
There are 8 x87 FP registers: ST0, ST1, ST2, ST3, ST4, ST5, ST6, ST7
There are 8 MMX registers: MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7
There are 8 XMM registers: XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7

Source page 81 gives you an overview, since you live in x86 dream land.


These are ARCHITECTURAL registers. Please look up the definition and the difference.

Look, things would be much simpler if you would finally bother to follow any of the links we posted for you. You clearly neither believe the information we post here nor try to get the information elsewhere.

Even the good old Pentium Pro had 40 rename registers:

http://arstechnica.com/articles/paedia/cpu/pentium-1.ars/5

The function of the reorder buffer is to put the instructions back in the original program order after the instructions have finished execution. You have any other questions you need answered?

Nope. The reorder buffer sits before the execution units.

What do you think out-of-order means?

In superscalar architectures, out-of-order issue mechanisms increase performance by dynamically rescheduling instructions that cannot be statically reordered by the compiler. You have any other questions you need answered?


What do you think this means in the context of a loop unrolled by the branch predictor?

Also, "rescheduling instructions that cannot be statically reordered by the compiler" is something you have added to the definition. I bet the definition that you googled was "increase performance by dynamically scheduling instructions". It simply dynamically schedules ALL instructions, as soon as all operands are available. (How would you plan to detect instructions that the "compiler could not reorder"?)
 
Yes. This is exactly it - this is the result of an in-order architecture. In an OOO CPU, there are no pipeline stalls like that. That is why it is "out-of-order" - it simply continues with another independent opcode from the reorder buffer.

You're reading it incorrectly. It is saying that in the case of an exception, an EPIC machine will continue working, while an x86 machine must flush its pipeline because of the stall caused by the exception.

The Itanium 2 processor pipeline is fully interlocked such that a stall in the exception detect (DET) stage propagates to the instruction expand (EXP) stage and suspends instruction advancement. A stall caused by one instruction in the issue group stalls the entire issue group and never causes the core pipeline to flush and replay.

The DET-stage stall is the last opportunity for an instruction to halt execution before the pipeline control logic commits it to architectural state. The pipeline control logic also synchronizes the core pipeline and the L1D pipeline at the DET stage. The control logic allows these loosely coupled pipelines to lose synchronization so that the L1I and L2 caches can insert noncore requests into the memory pipeline with minimal impact on core instruction execution. Table 3 lists the stages and causes of potential stalls.

Stage ---------------- Cause of stall
Rename ------------- RSE activity required
Execute ------------- Scoreboard and hazard detection logic
Exception detect ---- L2D TLB miss; L2 cache resources unavailable for memory operations; floating-point and integer pipeline coordination to avoid possible floating-point traps; or L1D and integer pipeline coordination

If you are smart enough and combine this information with the fact that there is a branch predictor and speculative execution, you should clearly see how OOO cores can execute overlapping loop iterations in parallel. You really need nothing more than this info.

Intel got back to me on the multiple loop instances issue I posed to them. They insist that x86 machines, combined with compiler optimizations which use SIMD instructions (anything with MMX or SSE), can make loops execute in parallel. This is an extension to the ISA, something I didn't take into account; my bad, and I am very sorry for the misunderstanding, it was all on my side of the fence.

Intel also stated that the performance is not desirable at this point in time, and it would be just as effective to unroll the loops via SIMD instead, or in the case of one loop just simply unroll it.
 
Yes. This is exactly it - this is the result of an in-order architecture. In an OOO CPU, there are no pipeline stalls like that. That is why it is "out-of-order" - it simply continues with another independent opcode from the reorder buffer.

You're reading it incorrectly. It is saying that in the case of an exception, an EPIC machine will continue working, while an x86 machine must flush its pipeline because of the stall caused by the exception.


You are still confusing pipeline cancellation - a flush (because of a branch misprediction) - and a pipeline stall (because of, e.g., a cache miss).

The Itanium 2 processor pipeline is fully interlocked such that a stall in the exception detect (DET) stage propagates to the instruction expand (EXP) stage and suspends instruction advancement. A stall caused by one instruction in the issue group stalls the entire issue group and

Actually, this is the important part: "one instruction in the issue group stalls the entire issue group". This is simply a result of an in-order architecture.

Maybe you would understand it better in EPIC terms - an out-of-order engine simply creates "issue groups" dynamically from the reorder buffer. The CPU knows which instructions can be executed in parallel at any moment and chooses them dynamically.


Table 3 lists the stages and causes of potential stalls.

Exception detect ---- L2D TLB miss; L2 cache resources unavailable for memory operations

Yep. That is it. A cache miss causes a pipeline stall.

Intel got back to me on the multiple loop instances issue I posed to them. They insist that x86 machines, combined with compiler optimizations which use SIMD instructions (anything with MMX or SSE), can make loops execute in parallel. This is an extension to the ISA, something I didn't take into account; my bad, and I am very sorry for the misunderstanding, it was all on my side of the fence.

Intel also stated that the performance is not desirable at this point in time, and it would be just as effective to unroll the loops via SIMD instead, or in the case of one loop just simply unroll it.

This is still somewhat misleading IMHO. I would really like to see what exactly your question was and what exactly they replied (and who wrote the reply 😉)
 
OK. What are register renaming and the roughly 100 physical registers on x86 implementations good for?

Not sure, because there aren't that many registers, and frankly your spreading of misinformation is beginning to wear on my patience as well.

There are 8 32-bit general-purpose registers: EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
There are 8 new 64-bit general-purpose registers: RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP
There are 8 x87 FP registers: ST0, ST1, ST2, ST3, ST4, ST5, ST6, ST7
There are 8 MMX registers: MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7
There are 8 XMM registers: XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7

Source page 81 gives you an overview, since you live in x86 dream land.

Jump off your high horse, mister!
This is just the n-th proof that in this thread you don't know what you're talking about!
He said physical registers, not architectural.
Architectural registers are those which are exposed to the programmer by the instruction set.
Physical registers are a larger pool of registers that the CPU uses for dynamic scheduling; a subset of them gets mapped onto the architectural registers when an instruction is dispatched to the reservation stations, and when it finally retires.
For example, the K8 has 120 physical registers just for FP operations, which get mapped onto 8 architectural registers for MMX/FP and 2x16 architectural registers for SSE (in 64-bit mode). (link)
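A minimal illustration of what renaming buys (the physical register numbers are invented for the example):

[code]
mov eax, [esi]   ; eax mapped to physical register p17, say
add ebx, eax     ; reads p17
mov eax, [edi]   ; the WAR/WAW hazard on eax disappears: eax -> p42
add ecx, eax     ; reads p42 - both chains can run in parallel
[/code]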
Ok, so let's all wait until your friendly Intel engineers finally explain to you how OOO architectures work, since you obviously are not even trying to read and understand what cxl and I are telling you on the subject.
 
Intel also stated that the performance is not desirable at this point in time, and it would be just as effective to unroll the loops via SIMD instead, or in the case of one loop just simply unroll it.
Oh so they have replied at last :)
But there must have been a misunderstanding between you and them, because loop unrolling via SIMD instructions is again a kind of SW loop unrolling.
What we are talking about here is (staggered) HW loop unrolling, which takes place even on integer (non-SIMD) code.
 
These are ARCHITECTURAL registers. Please look up the definition and the difference.

Last time I checked, we were talking about ISAs, not schemes to get around ISA limitations. Now, truthfully, I have been a bit of an ass in regards to the whole discussion thus far, but the fact remains I was only discussing architectural registers; being the ass I was, I chose to ignore physical registers as a whole, since I was comparing apples to apples, if you will, of EPIC vs. x86.

As the discussion continued, I got into a loop and lost some focus. I apologize for that; I tend to lose my train of thought when I post, go to work, don't post for two or more days, come back and post, then repeat.

Nope. The reorder buffer sits before the execution units.

Yes, take a look at the Core 2; look at the flow chart.

Also "rescheduling instructions that cannot be statically reordered by the compiler" is something you have added to the definition. I bet the definition that you googled was " increase performance by dynamically scheduling instructions". It simply dynamically schedules ALL instructions, as soon as all operands are available. (how would you plan to detect instructions that "compiler could not reorder"?)

Actually no that’s not the case OOO processors have to retire instructions the same way they entered the pipeline. The fact they can execute out-of-order does not save them from the fact they have to retire in order of entry. Unless you know of a way to write software that isn’t serial in nature.
 
Actually, now learning more and more about this subject (to make correct posts in this thread :), I am starting to think that the CURRENT x86 architectures can deal with cache load latency this way, but would stall on memory access anyway, as there is just a single load and a single store unit on Intel CPUs, and while there are 2 load units on AMD, AMD seems to be unable to reorder loads. In other words, while a load does not stall the pipeline, there is a limited number of micro-ops that can be usefully performed before another load micro-op blocks the pipeline.

Moreover, until Conroe (and K8L), neither Intel nor AMD allowed reordering a load ahead of a store (speculative loads). (Conroe CAN do that, but still has just a single load unit.) The K8 seems to even disallow reordering of loads (which raises the interesting question of why it has 2 load units...)
This is a very interesting topic.
I also read on Anand about AMD's inability to reorder loads, but that does not seem to be true.
The L1 cache of the K8 is dual-ported, and it is possible to perform 2 loads per cycle, or 1 load and 1 store; this already violates the in-order constraint 😉
So it is more a question of how limited the K8 is in load reordering compared to the competition; I haven't read enough on this topic, but I'll get back on this later.
However, the K8 certainly cannot speculatively execute stores ahead of loads; actually, the stores are carried out to memory only when the instruction retires, but they can be forwarded earlier inside the load/store unit, if there is a dependent load which needs the data.
The reason for committing the changes to memory only when the store retires is that the store could have been a speculative one, and the speculation could have turned out to be wrong.
Yes, the K8 has 2 load/store units, but they work sequentially, not in parallel; the first one checks if the data is present in the cache, the second one acts after the data has been transferred from the cache. (link)
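A minimal two-instruction sketch of that forwarding (registers arbitrary):

[code]
mov [edi], eax   ; store waits in the store queue until it retires
mov ebx, [edi]   ; dependent load gets eax forwarded from the queue,
                 ; without waiting for the store to reach the cache
[/code]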



In other words, I expect to see huge performance improvements when they will start to make things right... like add another load unit to Core 3 :)
The LSU is not a unit in the traditional sense: a single ALU can only serve a single instruction per clock, but an LSU is just a queue (with logic to detect dependencies and do data forwarding) that can serve more than one instruction in the same clock cycle.
I don't think that the LSU is a bottleneck in the Core 2 architecture; the number of loads/stores that can be performed in a clock cycle depends mainly on the L1 cache architecture.
 
These are ARCHITECTURAL registers. Please look up the definition and the difference.

Last time I checked, we were talking about ISAs, not schemes to get around ISA limitations. Now, truthfully, I have been a bit of an ass in regards to the whole discussion thus far, but the fact remains I was only discussing architectural registers; being the ass I was, I chose to ignore physical registers as a whole, since I was comparing apples to apples, if you will, of EPIC vs. x86.

As the discussion continued, I got into a loop and lost some focus. I apologize for that; I tend to lose my train of thought when I post, go to work, don't post for two or more days, come back and post, then repeat.

Ok, good to see that you acknowledge that. :)
But for an apples-to-apples comparison, you cannot compare the EPIC ISA vs. the x86 ISA, because x86's ISA is just crappy; there's no discussion on that.
The discussion here was between OOO architectures and HW based scheduling VS explicit parallelism architectures (EPIC) and SW based scheduling.

Nope. The reorder buffer sits before the execution units.

Yes, take a look at the Core 2; look at the flow chart.

Yes, Spud is right here.
The reorder buffer is used for instruction retirement, which must take place in order, to achieve "precise interrupts", at the end of the out-of-order execution phase.

Also "rescheduling instructions that cannot be statically reordered by the compiler" is something you have added to the definition. I bet the definition that you googled was " increase performance by dynamically scheduling instructions". It simply dynamically schedules ALL instructions, as soon as all operands are available. (how would you plan to detect instructions that "compiler could not reorder"?)

Actually no that’s not the case OOO processors have to retire instructions the same way they entered the pipeline. The fact they can execute out-of-order does not save them from the fact they have to retire in order of entry. Unless you know of a way to write software that isn’t serial in nature.
I don't understand what you mean here.
Sure, OOO processors have to retire instructions in order (but this, as I said, is only to preserve the timing of interrupts), and that's also the program's semantics, because, as you say, you always write SW which is serial in nature.
But the purpose of OOO, or SW scheduling, or SW scheduling with HW assistance, is just to speed up the execution of your code.
I really do not understand your point here...
 
You are still confusing pipeline cancellation - a flush (because of a branch misprediction) - and a pipeline stall (because of, e.g., a cache miss).

How so? Each is detrimental to the performance of the machine's pipeline. x86 machines in either case will flush the entire pipeline and start again, while EPIC is unlikely to run into a branch misprediction due to predication; and cache misses, well, those are data dependencies, and there really isn't much x86 or EPIC can do if the data isn't there.

Actually, this is the important part: "one instruction in the issue group stalls the entire issue group". This is simply a result of an in-order architecture.

Maybe you would understand it better in EPIC terms - an out-of-order engine simply creates "issue groups" dynamically from the reorder buffer. The CPU knows which instructions can be executed in parallel at any moment and chooses them dynamically.

You can't choose which part you want to use for your argument; you miss the point that the instruction group doesn't affect anything else in the pipeline. The core stops it, moves it, and tries to rectify it.

Yep. That is it. A cache miss causes a pipeline stall.

All ISAs suffer from that, and like I said, it happens.

This is still somewhat misleading IMHO. I would really like to see what exactly your question was and what exactly they replied (and who wrote the reply

How so? This is what I was told; I was proven incorrect, what more do you want?
 
You are still confusing pipeline cancellation - a flush (because of a branch misprediction) - and a pipeline stall (because of, e.g., a cache miss).

How so? Each is detrimental to the performance of the machine's pipeline. x86 machines in either case will flush the entire pipeline and start again, while EPIC is unlikely to run into a branch misprediction due to predication; and cache misses, well, those are data dependencies, and there really isn't much x86 or EPIC can do if the data isn't there.


Yep. That is it. A cache miss causes a pipeline stall.

All ISAs suffer from that, and like I said, it happens.

Yes, but some architectures suffer more than others.
In-order architectures just sit there twiddling their thumbs.
OOO architectures can process other instructions while waiting for the data to arrive.
 
Nope. The reorder buffer sits before the execution units.

Yes, Spud is right here.

The reorder buffer is used for instruction retirement, which must take place in order, to achieve "precise interrupts", at the end of the out-of-order execution phase.


Actually, this is more or less a terminology issue. At least the AMD64 documentation uses this term to describe the retirement mechanism, while Intel (and some other reference material) seems to use it for the buffer that holds micro-ops before they get executed.
 
You are still confusing pipeline cancellation - a flush (because of a branch misprediction) - and a pipeline stall (because of, e.g., a cache miss).

How so? Each is detrimental to the performance of the machine's pipeline. x86 machines in either case will flush the entire pipeline and start again, while EPIC is unlikely to run into a branch misprediction due to predication; and cache misses, well, those are data dependencies, and there really isn't much x86 or EPIC can do if the data isn't there.


But predication is limited to a small number of cases. Don't you see the problem?

You can't choose which part you want to use for your argument; you miss the point that the instruction group doesn't affect anything else in the pipeline. The core stops it, moves it, and tries to rectify it.

Yes, that is the problem. The core stops the pipeline. Don't you see it yet?

Yep. That is it. A cache miss causes a pipeline stall.

All ISAs suffer from that, and like I said, it happens.


Nope. All *in-order cores* suffer from that. Out-of-order cores can, to some degree, overcome this problem. They do not always have to stop on a cache miss.

How so? This is what I was told; I was proven incorrect, what more do you want?

References to SIMD do not make sense to me. Yes, SIMD is affected as well, but the code does not have to be SIMD to get parallel, interleaved loop execution.
 
while EPIC is unlikely to run into a branch misprediction due to predication,

There are still a couple of problems with predication. The main one is that its applicability to general code is low.

and cache misses, well, those are data dependencies, and there really isn't much x86 or EPIC can do if the data isn't there.

That is the thing you should finally try to understand. An out-of-order core does not have to stop the pipeline on a cache miss. That has nothing to do with x86; it is the core design.

You can't choose which part you want to use for your argument; you miss the point that the instruction group doesn't affect anything else in the pipeline. The core stops it, moves it, and tries to rectify it.

Meanwhile, an out-of-order core does not need to try to rectify anything; it simply continues processing with any other independent instruction.

That is also the basis of my critique of the whole EPIC idea.
 
But for an apples-to-apples comparison, you cannot compare the EPIC ISA vs. the x86 ISA, because x86's ISA is just crappy; there's no discussion on that.

Actually, this is the point where I do not quite agree, at least if you consider x86-64.

Yes, it is damned more difficult to decode. Anyway, in the end, it has good code density, and the CISC character of the instructions has, at this very late stage of CPU architecture design, some advantages for OOO cores.

If you compare x86-64 to traditional RISCs, you can see (IMHO) two basic differences:

- x86-64 has variable instruction size. This results in much more complex decoding; OTOH, it improves code density.

- x86-64 has very CISC addressing modes. I believe that in the long term this point puts x86-64 ahead - it is no problem to build an address-generation unit capable of dealing with x86-64's advanced addressing modes in a single cycle, and standard compiled code is full of opportunities to exploit them.

I believe that at this stage, decoding difficulties are no longer too much of a problem when compared to the rest of the core.
 
Nope. The reorder buffer sits before the execution units.

Yes, Spud is right here.

The reorder buffer is used for instruction retirement, which must take place in order, to achieve "precise interrupts", at the end of the out-of-order execution phase.


Actually, this is more or less a terminology issue. At least the AMD64 documentation uses this term to describe the retirement mechanism, while Intel (and some other reference material) seems to use it for the buffer that holds micro-ops before they get executed.
Oh, right, I didn't pay too much attention to this.
But the "classical" meaning of the term is the one used by AMD; that's the reason for my (wrong) assumption here. :)
 
But for an apples-to-apples comparison, you cannot compare the EPIC ISA vs. the x86 ISA, because x86's ISA is just crappy; there's no discussion on that.

Actually, this is the point where I do not quite agree, at least if you consider x86-64.

Yes, it is damned more difficult to decode. Anyway, in the end, it has good code density, and the CISC character of the instructions has, at this very late stage of CPU architecture design, some advantages for OOO cores.

If you compare x86-64 to traditional RISCs, you can see (IMHO) two basic differences:

- x86-64 has variable instruction size. This results in much more complex decoding; OTOH, it improves code density.

Hmm, yes, but in the end, code density only has an impact on the instruction cache size... and with prefetching, not even so much.
The complex decoding has a cost both in terms of wasted die space and in terms of decode stages / pipeline length.
Personally, I like the RISC approach more.

- x86-64 has very CISC addressing modes. I believe that in the long term this point puts x86-64 ahead - it is no problem to build an address-generation unit capable of dealing with x86-64's advanced addressing modes in a single cycle, and standard compiled code is full of opportunities to exploit them.

I believe that at this stage, decoding difficulties are no longer too much of a problem when compared to the rest of the core.
Yes, but in the end all that is broken down into several "RISC" micro-ops inside the CPU.
The NetBurst architecture even went as far as decoding x86 code just once, then leaving the RISC micro-ops in the trace cache...
And of the huge instruction set of x86, only a small subset of instructions is frequently used.
In the end, I remain convinced that a CISC instruction set is useful only for handwritten code, and that all the tasks which can be performed efficiently by a compiler should be delegated to the compiler, rather than investing hardware in them.
 
But predication is limited to a small number of cases. Don't you see the problem?

That's a matter of opinion at this point; the fact is that predication does work, and all points of reference say it's quite effective. The only real test would be to run heavily branched code, such as SQL.

But SPEC 2006 gives you a good idea, at least of the general performance of the new processors that are available now.

HP HP ProLiant ML370 G5 SAS 3.0 GHz/4MB DC 169,360
HP HP ProLiant DL585 32GB/2.4GHz/4P 115,110
HP HP Integrity rx6600 230,569

TPC-C results, showing setups as similar as I could find: dual-processor, dual-core setups.

References to SIMD do not make sense to me. Yes, SIMD is affected as well, but the code does not have to be SIMD to get parallel, interleaved loop execution.

Intel insists that this is the case for back-to-back loop execution. I assume that otherwise you would be using too much of the processor's resources for loop execution; considering there is a great deal of other software needing the limited resources that exist, it would be a safe bet that only "simulated" software could theoretically do 2 or more loop executions at once, but in the real world it's not so realistic, as they stated the performance is not ideal otherwise.

SIMD allows for vectorization; 3 loop iterations being executed at once would by all standards be a lot easier for the SIMD engine to deal with than for the main core. I requested additional clarification from Intel, but Monday will be when they get a chance to look at it.
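For illustration, a minimal sketch (assuming 16-byte-aligned arrays and an element count divisible by 4) of the add loop discussed in this thread, vectorized with SSE2 so each instruction performs four 32-bit adds:

[code]
lbl:
movdqa xmm0, [esi]   ; load 4 dwords from the source
paddd  xmm0, [edi]   ; add 4 dwords from the destination
movdqa [edi], xmm0   ; store 4 results
add    esi, 16
add    edi, 16
sub    ebx, 4        ; four scalar iterations per pass
jnz    lbl
[/code]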
 
Hmm, yes, but in the end, code density only has an impact on the instruction cache size... and with prefetching, not even so much.

I think you should see the bigger picture. E.g., my Ubuntu distribution runs over 256 MB of code just sitting idle. With Itanium, I would need about 700 MB of memory just to hold all the code without swapping.

- x86-64 has very CISC addressing modes. I believe that in the long term this point puts x86-64 ahead...
I believe that at this stage, decoding difficulties are no longer too much of a problem when compared to the rest of the core.

Yes, but in the end all that is broken down into several "RISC" micro-ops inside the CPU.


I think this is not correct. The AMD64 optimization guide lists all addressing modes with the same latency. It also has an interesting passage explicitly encouraging the use of x86 opcodes that are as complex as possible, with listings that show reduced latency when compared to RISC-like code.

The example given is:

[code:1:7a981524a3]
movl %r10d,%r11d
leaq 0x68E35,%rcx
addq %rcx,%r11
movb (%r11,%r13),%cl
cmpb %al,%cl
[/code:1:7a981524a3]

According to AMD, this is 21 bytes with a total latency of 8 clocks. AMD suggests replacing this "RISC" code with the single complex instruction

[code:1:7a981524a3]
cmpb %al,0x68e35(%r10,%r13)
[/code:1:7a981524a3]

8 bytes and a total latency of 4 clocks.

Also, the idea of "RISC" micro-ops is misleading, I think. There are no RISCs that would have 100-bit-wide instructions (which is about the width of x86 micro-ops).

Meanwhile, G5 cores do the same thing - translate the Power RISC ISA into wide micro-ops. The G5 also has a similar pipeline depth. OK, it clearly has an easier job with fixed-width opcodes; OTOH, variable-width opcodes are an absolute must if you want to have advanced addressing modes.

In the end, I remain convinced that a CISC instruction set is useful only for handwritten code, and that all the tasks which can be performed efficiently by a compiler should be delegated to the compiler, rather than investing hardware in them.

See above :)
 
[code:1:e75f0a9b30]
lbl:
mov eax, [esi]   ; load from the source array
add [edi], eax   ; add to the destination array (load + add + store)
add esi, 4       ; advance the source pointer
add edi, 4       ; advance the destination pointer
dec ebx          ; decrement the iteration counter
jnz lbl          ; loop while the counter is nonzero
[/code:1:e75f0a9b30]

Do you understand the difference between HW loop unrolling and a SW-based solution now?

Yes, you are showing me a HW loop because of bad software coding/compiling.

No, it is not bad coding/compiling. You still do not get it. Anyway, I think that once Spud finally starts to READ the links about Tomasulo and OOO engines, he will come back and explain to you what this is all about.
 
[code:1:6e664e2ef4]
lbl:
mov eax, [esi]
add [edi], eax
add esi, 4
add edi, 4
dec ebx
jnz lbl
[/code:1:6e664e2ef4]

Actually, this is very bad programming code. You never want to cause a loop with a jmp to a certain part of the disassembly, and also these loops are not run OOO as you suggest; the jnz lbl has to actually do the jump before the other loop iteration can be finished.

You have completely ignored everything that Pippero and I posted here, and what Spud is finally starting to understand.

Also, your code is very likely to cause a very long loop and hang the hardware. I would suggest using a mutex or timer instead.

Your ignorance is not even funny anymore. It is starting to be sad. Mutex? Do you really know what you are speaking about?!

Do you have ANY clue what a mutex really is, how it is implemented, and what it is used for?

So your whole OOO loop argument is shot if you are using this code. This code would only run one iteration, then the other.

OK, one last try. This is what the micro-ops for two overlapping iterations of this code might look like before entering the out-of-order buffer:

[code:1:6e664e2ef4]
mov r1, [r20]   ; iteration 1: load [esi]
mov r2, [r21]   ; iteration 1: load [edi]
add r1, r2
mov [r21], r1   ; iteration 1: store to [edi]
add r21, 4
add r20, 4
sub r22, 1
mov r3, [r20]   ; iteration 2: renamed to r3/r4, independent of iteration 1
mov r4, [r21]
add r3, r4
mov [r21], r3   ; iteration 2: store to [edi]
add r21, 4
add r20, 4
sub r22, 1
[/code:1:6e664e2ef4]

Now the OOO engine executes the above instructions in the fastest possible order. It means that dependencies are detected (that is not that complicated); the engine executes micro-ops with no outstanding dependencies first, then executes the micro-ops that were waiting for their results, and so on.
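One possible dynamic schedule (purely illustrative, not from any manual; assumes 2 load ports, 3 ALU operations per clock, a 3-clock load latency, and register names reused as in the listing above):

[code]
; clock 1: mov r1,[r20] | mov r2,[r21] | add r20,4 | add r21,4 | sub r22,1
; clock 2: mov r3,[r20] | mov r4,[r21]            ; iteration 2 loads issue
; clock 4: add r1,r2                              ; iteration 1 loads done
; clock 5: add r3,r4 | mov [r21],r1 | sub r22,1
; clock 6: mov [r21],r3
[/code]

Two iterations finish in about the time a strictly serial execution would need for one.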

Please show me where in the AMD or Intel documentation it shows that x86 supports what you claim.

We have cited resources several times here, but they were ignored.