AMD Quadfather 4x4 vs. Intel Kentsfield

SPARC
General purpose: 32 × 64-bit
Floating point: 16 × 128-bit (quad precision)
32 × 64-bit (double precision)
32 × 32-bit (single precision)

You have missed the description of register windows. From wikipedia (http://en.wikipedia.org/wiki/SPARC):

"The SPARC processor usually contains as many as 128 general purpose registers. At any point, only 32 of them are immediately visible to software - 8 are global registers (one of which, g0, is hard-wired to zero, so only 7 of them are usable as registers) and the other 24 are from the stack of registers. These 24 registers form what is called a register window, and at function call/return, this window is moved up and down the register stack. Each window has 8 local registers and shares 8 registers with each of the adjacent windows. The shared registers are used for passing function parameters and returning values, and the local registers are used for retaining local values across function calls."

Yes, there are differences between Itanium and SPARC - in SPARC, you can "rotate" (in Itanium terms) by 8 registers only, and "overflow" is handled by a software interrupt and firmware. Anyway, the basic idea is the same.
 
With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, and you'll also have to show me an x86 machine that can do overlapping loop execution without hardlocking.

Well, and that might explain something to you: I have seen it in reality when trying to optimize some of my algorithms. E.g. when I tried to optimize using MMX, a fine-tuned, hand-unrolled assembly loop on AMD64 was exactly as fast as a primitive C implementation.

After thinking about the issue for a week, I finally decided that the only explanation is that the loop is "unrolled" by the CPU and executed in parallel. Since then I have felt deep respect for the combination of OOO and speculative execution.

Show me then - show the code and show me the assembly output. I want to see every line of code, so I can get it independently confirmed by Intel.


I can post it here (though I would have to dig deep in the version control system - the code has been abandoned for 2 years), but I am not sure what you are going to take from it. Do you want to do a benchmark?

Until then I will not budge from my stance that an x86 machine is not capable of overlapping loop execution.

Then go ask somewhere else. It would be much faster :)

No matter - I have requested this information from Intel, as it is something they would know for certain.
 
Here is a simple example of register renaming and dynamic scheduling at work.

Yes, I have read Ace's explanation on the matter, and it proves what I am saying: x86 has too few registers to do overlapping loop execution - not enough registers for the instructions and dependencies.

OK, if you are not able to accept basic, widely known facts of CPU design, it is a little hard to discuss more complex topics like the real impact of the EPIC architecture on performance.

At this point I am not sure what to say or suggest. If you believe neither a computer science / CPU core design college teacher nor a senior C++ programmer deeply interested in low-level optimizations, we are running out of options.
 
You'll have to show me when that changed from theory to silicon, and you'll also have to show me an x86 machine that can do overlapping loop execution without hardlocking.
The relevance is that x86 machines can do overlapping loop execution in HW, thanks to Tomasulo's algorithm.

Well, you are going to have to point me to the compiler white paper that says that. I am looking as well, and I keep coming up with Itanium-only support.
Finally I see where we misunderstand each other.
Sorry, but you can quit your search, as you won't find a compiler white paper referring to dynamic scheduling in OOO CPUs. :)
In fact, EPIC's rotating registers for loop unrolling are a "poor man's version" of register renaming in Tomasulo's algorithm.
With register renaming, the CPU has a larger set of registers which are used by Tomasulo's algorithm for dynamic scheduling, in a way that is completely invisible to the compiler and the programmer.
For example, the K7 has 88 physical FP registers, while the instruction set of course exposes only 8; the FP scheduler, however, has 36 entries, which means that up to 36 instructions can be "in flight", each with its own registers. (The loop unrolling happens inside the reorder buffers of the CPU, with OOO.)
Basically, the difference with EPIC is that there the compiler does the loop unrolling and explicitly tells the CPU to "rotate" the register file to resolve naming conflicts among registers, while an OOO CPU does all of this in HW internally, without the compiler knowing anything about what's going on.

Again, I can provide whitepapers in defence of my position; you will have to do the same, otherwise you're just another person trying to discredit me - and frankly, you're not doing a very good job at this point.
Gee. :roll:
I was not trying to discredit you, I was just trying to have a discussion with you.
If you see this in terms of idiotic "ownage", well, I feel like I just wasted a lot of time.
Anyway, I previously posted another link (not the one to Ace's Hardware, which you didn't understand anyway) which explained, even with an example, how loop unrolling works on an OOO CPU.
What "white papers" do you need?
And whether I did a good job or not is not yours to decide - I'll leave that to the other forum members.
Sorry, but your posts just repeatedly show that you don't understand dynamic HW scheduling at all.
 
Pippero, you can't own Spud. He is not ignorant and he has the balls to admit when he is wrong. He is one of those members who balance the credibility of THG, which is consistently diminished by BaronBS-ers, 9NMs and LameNoobMikes. I don't have time now, but later I'll read the whole discussion. It seems interesting and constructive; I just don't like the "ownage" part between you & Spud.
 
Pippero, you can't own Spud. He is not ignorant and he has the balls to admit when he is wrong.

We will see 😉

When Intel gets back to me, if I am wrong I'll be sure to let everyone know, and if I am right nothing changes on my end.
 
You guys are being silly - you can't apply your traditional proc architectures to Itanium, because it is not a traditional proc design.

Traditional architectures perform loop unrolling, which results in code expansion and also increases cache misses. The Itanium processor uses predication and register renaming to overlap execution of multiple loop instances. This further increases runtime performance.
I already explained this a while ago. Please read my links from before and see what is different.

The point you silly Itanium fans are still not able to grasp is that loop unrolling in an OOO design is not a feature of the code, but of the hardware.

In the code there is no expansion, just a simple loop. The OOO speculative execution engine does the unrolling in hardware.

That has the potential to be much faster than Itanium's software-based solution, because it can adapt to the dynamic situation rather than rely on the static flow analysis performed by the compiler (e.g. when data are not available due to a cache miss, the CPU can immediately continue with the next iteration and defer the previous one to a later time). In practice it shows - Itanium has lackluster integer performance.

You can argue that OOO register-renaming-based loop unrolling does not exist, but that is simply silly. You could just as well claim there are no LCD displays or that the earth is flat (I believe you would be able to dig up some whitepapers on both themes to prove your point 😉).

Please get your CPU design knowledge up to date before we continue the debate about EPIC vs. OOO.
 
Pippero, you can't own Spud. He is not ignorant and he has the balls to admit when he is wrong. He is one of those members who balance the credibility of THG, which is consistently diminished by BaronBS-ers, 9NMs and LameNoobMikes. I don't have time now, but later I'll read the whole discussion. It seems interesting and constructive; I just don't like the "ownage" part between you & Spud.
I agree, but the "ownage" part is completely one-sided, i.e. he seems to think that I'm trying to discredit him, which is simply untrue.
He generally makes good posts, but in this case, regardless of the real "OOO vs EPIC" debate, he does not seem to really understand how OOO CPUs work.
And unfortunately he also does not seem very open in this discussion, when he says that he'll only listen to what Intel tells him (if Intel ever replies to him at all).
 
Consider the following situation to understand why the prefetching of data is important and advantageous.

OMG. I never said that prefetching data is not important. But it is irrelevant here.

OK, prefetching is an absolute must for Itanium, where every cache miss results in a pipeline stall. It is less important for an OOO core, where a cache miss just leaves the instruction in the reorder buffer for a longer time - the CPU simply executes any other independent opcodes until the data are available.

A loop misses the L2 cache each iteration. Simply unrolling the loop will result in missing the cache more often per iteration. In turn, this could degrade performance so much that the benefits of the SIMD optimizations are lost.

Sorry, you seem to be either really confused or trying to look smart by throwing around terms you do not really understand. SIMD and the autovectorizer are completely unrelated topics here.

To alleviate this issue, the compiler directives can be used.

This is starting to get really funny. How many times do you need to be told that OOO loop unrolling and overlapping execution are done by HARDWARE and NOT the COMPILER? As opposed to EPIC, where loop unrolling has to be done by the compiler or programmer.

In order to improve the performance of applications that do not take advantage of the cache control instructions, the P4 processor has a HW prefetch mechanism which is invoked if cache misses occur in a recognizable pattern.

OK, this is true, but once again irrelevant.
 
Pippero, you can't own Spud. He is not ignorant and he has the balls to admit when he is wrong.

We will see 😉

When Intel gets back to me, if I am wrong I'll be sure to let everyone know, and if I am right nothing changes on my end.

BTW, I'm not sure whether pippero also quoted this nice wiki description of the OOO register renaming process:

http://en.wikipedia.org/wiki/Register_renaming



You guys are being silly - you can't apply your traditional proc architectures to Itanium, because it is not a traditional proc design.

Traditional architectures perform loop unrolling, which results in code expansion and also increases cache misses. The Itanium processor uses predication and register renaming to overlap execution of multiple loop instances. This further increases runtime performance.
I already explained this a while ago. Please read my links from before and see what is different.

That’s what I am trying to say: x86 machines can't do multiple loop unrolls - well, I guess "staggered" would be more descriptive of the process.
 
Pippero, you can't own Spud. He is not ignorant and he has the balls to admit when he is wrong. He is one of those members who balance the credibility of THG, which is consistently diminished by BaronBS-ers, 9NMs and LameNoobMikes. I don't have time now, but later I'll read the whole discussion. It seems interesting and constructive; I just don't like the "ownage" part between you & Spud.
I agree, but the "ownage" part is completely one-sided, i.e. he seems to think that I'm trying to discredit him, which is simply untrue.
He generally makes good posts, but in this case, regardless of the real "OOO vs EPIC" debate, he does not seem to really understand how OOO CPUs work.
And unfortunately he also does not seem very open in this discussion, when he says that he'll only listen to what Intel tells him (if Intel ever replies to him at all).

Intel has always gotten back to me, especially in regard to their ISA features. Whether or not it will be in a reasonable amount of time remains to be seen.
 
OK, prefetching is an absolute must for Itanium, where every cache miss results in a pipeline stall. It is less important for an OOO core, where a cache miss just leaves the instruction in the reorder buffer for a longer time - the CPU simply executes any other independent opcodes until the data are available.

You say it like it’s an issue for the Itanium - it's an 8-stage, fully interlocked pipeline. Here, let me explain Itanium pipeline control to you; it is a key feature of the Itanium. Please read page 8 and stop spreading misinformation.

This is starting to get really funny. How many times do you need to be told that OOO loop unrolling and overlapping execution are done by HARDWARE and NOT the COMPILER? As opposed to EPIC, where loop unrolling has to be done by the compiler or programmer.

And how many times do I have to tell you that it isn't possible to do overlapping execution of multiple loop instances on x86? EPIC hardware can do it all on its lonesome.

But what is really starting to bother me is that no one here has said "f*ck ya, EPIC is way better than x86", and no one, not even myself, said EPIC would be a good replacement for x86. But here you are defending x86 like EPIC is the god-damned plague and is somehow going to corrupt the POS x86 ISA more than it already is. You have turned an essentially ISA-feature debate into a standard AMD vs. Intel pissing match, with those two replaced by x86 and EPIC.
 
How do you get Intel to reply to you on this kind of stuff?

I link them the thread and request clarification on ISA features or whatever the debate is. It's a PR thing for them; they have never let me down before, and I have been asking them for clarification since I made this account.

Plenty of times I've been proven quite wrong, and other times quite right, but for me, having an Intel engineer say "yes, this is the way it is" is a lot more persuasive than someone posting that they believe this is how something works but never providing any relevant information to back up those claims.

It's very frustrating for me, because I go out and try to make my point of view valid and correct with supporting information, and then have someone say "no, you're wrong" without providing any information to support that view on the matter.

It leaves me feeling like my credibility is at stake, and regardless of what anyone thinks posting on a forum is worth, I do put heavy merit on credibility - and for me, if I don’t have that, then I have no business debating any particular topic.

Ah, I know it sounds corny, but what can I say - that’s how I feel.
 
What people are forgetting is that anyone can make unsubstantiated claims.
It leaves me feeling like my credibility is at stake, and regardless of what anyone thinks posting on a forum is worth, I do put heavy merit on credibility - and for me, if I don’t have that, then I have no business debating any particular topic.
At least, since it's coming from Intel directly, it's not like getting information from other, less, shall we say, credible sources. [cough... Inq]
 
And how many times do I have to tell you that it isn't possible to do overlapping execution of multiple loop instances on x86?

OK. What are register renaming and the roughly 100 physical registers in x86 implementations good for?

What is the reorder buffer good for?

What do you think out-of-order means?

Is it there just because chip designers were bored and wanted to waste some silicon for no real gain?

EPIC hardware can do it all on its lonesome.

No - for EPIC, the compiler or programmer has to do it. EPIC hardware has neither out-of-order execution nor register renaming, and therefore cannot do architecturally invisible loop unrolling.
 
Spud said:
OK, prefetching is an absolute must for Itanium, where every cache miss results in a pipeline stall. It is less important for an OOO core, where a cache miss just leaves the instruction in the reorder buffer for a longer time - the CPU simply executes any other independent opcodes until the data are available.

You say it like it’s an issue for the Itanium - it's an 8-stage, fully interlocked pipeline.
OK, please read carefully from the material you posted:

The Itanium 2 processor pipeline is fully interlocked, such that a stall in the exception detect (DET) stage propagates to the instruction expand (EXP) stage and suspends instruction advancement. A stall caused by one instruction in the issue group stalls the entire issue group and never causes the core pipeline to flush and replay.

Yes. This is exactly it - this is the result of an in-order architecture. In an OOO CPU, there are no pipeline stalls like that. That is why it is "out-of-order" - it simply continues with another independent opcode from the reorder buffer.

BTW, this is from AMD's optimization guide:

A.8 Instruction Control Unit
The instruction control unit (ICU) is the control center for the AMD Athlon 64 and AMD Opteron processors. It controls the centralized in-flight reorder buffer, the integer scheduler, and the floating-point scheduler. In turn, the ICU is responsible for the following functions: macro-op dispatch, macro-op retirement, register and flag dependency resolution and renaming, execution resource management, interrupts, exceptions, and branch mispredictions.
The instruction control unit takes the three macro-ops per cycle from the early decoders and places them in a centralized, fixed-issue reorder buffer. This buffer is organized into 24 lines of three macro-ops each. The reorder buffer allows the instruction control unit to track and monitor up to 72 in-flight macro-ops (whether integer or floating-point) for maximum instruction throughput. The instruction control unit can simultaneously dispatch multiple macro-ops from the reorder buffer to both the integer and floating-point schedulers for final decode, issue, and execution as micro-ops. In addition, the instruction control unit handles exceptions and manages the retirement of macro-ops.

If you are smart enough to combine this information with the fact that there is a branch predictor and speculative execution, you should clearly see how OOO cores can execute overlapping loop iterations in parallel. You really need nothing more than this info.
 
It's very frustrating for me, because I go out and try to make my point of view valid and correct with supporting information, and then have someone say "no, you're wrong" without providing any information to support that view on the matter.

It leaves me feeling like my credibility is at stake, and regardless of what anyone thinks posting on a forum is worth, I do put heavy merit on credibility - and for me, if I don’t have that, then I have no business debating any particular topic.
Look, I've posted you two links explaining how HW loop unrolling works on OOO architectures; in one case there was even an example of an actual loop execution.
The problem is that these references had to be read and understood, not ignored.
Do you want an even more credible source?
Then go to a library and get the following book:
Hennessy & Patterson, "Computer Architecture: A Quantitative Approach", published by Morgan Kaufmann.
It's what is used in most colleges around the world to teach the basic principles of computer architecture design.
 
Pippero, you can't own Spud. He is not ignorant and he has the balls to admit when he is wrong.

We will see 😉

When Intel gets back to me, if I am wrong I'll be sure to let everyone know, and if I am right nothing changes on my end.

BTW, I'm not sure whether pippero also quoted this nice wiki description of the OOO register renaming process:

http://en.wikipedia.org/wiki/Register_renaming



You guys are being silly - you can't apply your traditional proc architectures to Itanium, because it is not a traditional proc design.

Traditional architectures perform loop unrolling, which results in code expansion and also increases cache misses. The Itanium processor uses predication and register renaming to overlap execution of multiple loop instances. This further increases runtime performance.
I already explained this a while ago. Please read my links from before and see what is different.

That’s what I am trying to say: x86 machines can't do multiple loop unrolls - well, I guess "staggered" would be more descriptive of the process.
What you both don't understand is that when they say "traditional architecture" they mean "traditional in-order architectures", which use SW loop unrolling performed by the compiler; EPIC uses predication and rotating registers to make SW loop unrolling more efficient, by mimicking the execution which takes place in OOO architectures thanks to branch prediction & speculation, rename registers and dynamic scheduling.
What you also both do not understand is that in the case of OOO HW loop unrolling, the loop is still there in the code.
Completely unmodified.
No "code expansion."
No "additional cache misses."
 
Show me a HW loop unrolling link that shows this.
I showed you mine - you show me yours!

Pippero posted you a link to a .ppt presentation showing how HW loop unrolling works in Tomasulo's OOO engine. What else do you want to see?

Maybe you missed it; here is a similar one:

http://engineering.dartmouth.edu/~engs116/lectures/engs%20116%20lecture%208-05f.ppt

BTW, what you have shown us as Itanium loop unrolling is a SW solution, not a HW one - for Itanium, the loop has to be unrolled and the register renaming (rotation) performed by the compiler or programmer. Please try to finally understand this important difference.
 
Show me a HW loop unrolling link that shows this.
I showed you mine - you show me yours!

This one is more comprehensive and shows actual loop processing too:

http://faculty.cs.tamu.edu/ejkim/Courses/cpsc614/lec03_kim.ppt

BTW, loop unrolling is not the issue - any core with speculative execution unrolls loops. What I am really talking about is parallel processing of several loop iterations, which is only possible with an OOO engine.
 
Show me a HW loop unrolling link that shows this.
I showed you mine - you show me yours!

OK, maybe showing you a loop that gets unrolled and executed in parallel by the hardware will help. Let's make a loop that adds one integer vector to another:

[code:1:f8a3377d4f]
lbl:
mov eax, [esi]   ; load source element
add [edi], eax   ; add it to the destination element
add esi, 4       ; advance source pointer
add edi, 4       ; advance destination pointer
dec ebx          ; decrement the element counter
jnz lbl          ; loop while counter is nonzero
[/code:1:f8a3377d4f]

Do you understand the difference between HW loop unrolling and a SW-based solution now?
 
The system interface control logic contains an in-order queue (IOQ) and an out-of-order queue (OOQ), which track all transactions pending completion on the system interface. The IOQ tracks a request's in-order phases and is identical on all processors and the node controller. The OOQ holds only deferred processor requests. The IOQ can hold eight requests, and the OOQ can hold 18 requests. The system interface logic also contains two 128-byte coalescing buffers to support write-coalescing stores. The buffers can coalesce store requests at byte granularity, and they strive to generate full line writes for best performance. Writes of 1 to 8 bytes, 16 bytes, or 32 bytes are possible when holes exist in the coalescing buffers.



If you look at predication and branching in Itanium, there is less chance of a stall in Itanium than in x86. Every response phase on Itanium is followed by a deferred phase, in which the data phase can be out of order.