AMD Quadfather 4X4 vs Intel Kentsfield

Page 7 - Tom's Hardware community discussion.
How many pipelines does Itanium have compared to IA-32?

Depends on how you define pipeline.

Also, are you saying in-order execution is being used on Itanium, so it's worse?
I would say OOO procs are more convenient, and Itanium could just re-order the instructions and get good parallelism.

Yes, but introducing OOO into Itanium (if that is what you propose) makes the EPIC concept completely obsolete. Once you start to reorder, you need a decoder capable of dealing with implicit parallelism. All the "explicit" info, which is the main trademark of the EPIC architecture, is then completely useless.

(BTW, something quite similar happened to the POWER architecture's branch prediction flags - there are bits in POWER jump instructions that show the most likely jump direction and are set by the compiler. In later iterations of the architecture, these flags are ignored and a normal branch predictor is used. Another example of static flow analysis failure.)
 
4X4 may not completely outperform Kentsfield. But it will give stiff competition while offering a future quad-core upgrade on the same platform. When K8L comes, all will see a quantum jump in performance (speculation only), as K8L has double-precision FP, 2MB L3, and a couple of IPC enhancements.

The mobo of 4X4 is not going to be the bomb everyone is expecting. 4X4 will be available in 3 SKUs (FX-70, FX-72, and FX-74), and all are going to be 125W. I guess this should not be a problem, as these are unlocked processors and they will hit the max TDP of the process when unlocked. Also, I guess no one will think of putting 4X4 in a slim chassis.

Also, AMD is planning to announce X2s in November (5400/5600/5800 and 6000+); this will answer the E6700 and E6800 C2D.


Overall I see a gap of 300 USD between a 4X4 config and a Kentsfield config when they are out in Q4. Consider that 300 USD like an insurance premium for keeping the K8L upgrade path.
 
The I2 has a short pipeline; stalls are the least of its worries.

No, a short pipeline is good for branch misprediction, but a pipeline stall is a stall no matter what - the CPU simply has to wait for data. Having a short pipeline does not help (if it stops).

In contrast, OOO can continue processing, simply deferring the execution of the stalling instruction until the data are available.

But on a serious note, please don't force me to go out and get all the technical data for the technology, because there is frankly too much of it.

Well, let me just say that nothing I write about Itanium is my fantasy - it is all information from technical papers and articles about Itanium (well, certainly not marketing papers from Intel/HP, but those dealing with real pros/cons).

Obviously this isn't going anywhere. You believe it's a failure for whatever reasons at this point; I say no, it's a work in progress and has shown superior performance in its discipline. But frankly I'm not going to sit here and try to build a picture for you.

This is from a 3rd-party overview of the technology. If you take the time and read what they tried to do, what they have done, and what they would like to do with the technology, you may or may not change your view on the situation. But I think for at least reference's sake you might be interested in reading it; if not, that's cool too. Because frankly, the finer points of the architecture, more specifically the execution of code, are what got my attention.
 
First let me say I'm no one's fan here. I am a fan of fast and cheap!
I just bought a Core 2, and when K8L comes out, if it is faster, then I will buy a K8L.
Anyway, what I was saying is that if you get a 4x4 it should compete with, or at least come close to, Kentsfield, and down the road you can drop in quad-core replacements if you want to have 8 total cores.

This is what some people want: the performance of a server board without the price premium. Not to say it's gonna be cheap, but it will be cheaper than buying a server board/server CPUs/server memory by a long shot.

I have wanted a dual-socket system since the Pentium Pro days, when they first came out with them.
 
Even though I cannot change your point of view, there will be very big apps or databases which can be accommodated by Itanium, especially for a very big company.
 
EPIC doesn't exactly just execute instructions off of an FSB or something; it uses the giant register sets to figure out what needs to/can be run and then runs it.

Not sure what you mean by this... All modern CPUs read 99% of instructions from the L1 code cache. A giant register set has little to do with the FSB...

So in a way it works around the problem you mention and makes OOO-coded stuff run better.

What exactly is "OOO coded stuff"?

It is 6-way.

Woodcrest is 4-way at double the frequency, if you consider this a metric.

Anyway, I agree that we will hardly agree :) I clearly see how seductive the EPIC architecture looks. Everybody likes simple concepts with a performance promise. Nobody likes to study the complicated tweaks of a complicated architecture.

Well, in 2007 Intel seems set to unify the Itanium and Woodcrest sockets. We will see how long Itanium survives then. I think it will be dead and forgotten by 2010. (I will be happy to say "I told you so" by that time :)

Also, discussing Itanium's failure was not my original point. What I said first was that Intel was not interested in bringing a high-performance server x86 CPU to the market because of Itanium. I believe this point is proved by Intel not activating the 64-bit extensions of Prescott until absolutely forced by AMD64's success. In other words, it is Itanium that got Intel into trouble.
 
Not sure what you mean by this... All modern CPUs read 99% of instructions from the L1 code cache. A giant register set has little to do with the FSB...

Go look up rotating registers.


What do rotating registers have to do with the FSB? They have something to do with calling subroutines, if you wish.


What exactly is "OOO coded stuff"?
Basic, J++, C++, etc.


No. Maybe you mean "OOP"? (OOO = out-of-order vs OOP = object-oriented programming)

In fact, there is no "out-of-order" code. A compiler can optimize code for better out-of-order execution, but that is about it. In most cases, on really good modern cores, even such optimization has little impact.

No, making heaters instead of efficient procs got Intel into trouble.
Intel not wanting to bring in a 4-way ia86 chip to replace a 6-way chip that has more potential and already does 64-bit seems like a logical decision to me.

OK, so what exactly should Intel have introduced to the x86 line, based on your premises? You seem to suggest that NetBurst is an inefficient heater, but Intel was right not to bring a more efficient x86 chip? A 3-way ia86 chip limited to 32 bits is just right, or too much? Or what?
 
No, I mean OOO code; those compilers make things meant to run OOO.

? You think Itanium compilers generate code meant to run on an OOO engine?!

Go see what the markets look like if you don't get it.

I see Opteron and NetBurst Xeons dominating the low-to-medium server market and IBM POWER + Sun SPARC dominating the high end. All of them are out-of-order cores.

Also,
really go look at rotating registers so you understand how Itanium handles OOO in the execution units. It does OOO and you didn't know it.

Maybe I missed something. Could you please point me in the right direction?
 
x86-based processors have fewer strategies available for optimizing parallel throughput. Though they are designed to enable out-of-order instruction execution, the processor itself must determine which instructions can be safely executed in parallel or out of sequence. Extensive processing resources are devoted to this task, and since it happens during runtime, the processor can only look a few instructions ahead to find opportunities for optimization.

An EPIC compiler, on the other hand, can look thousands of instructions ahead to find and exploit opportunities for parallelism. This helps to improve code efficiency. It also frees the Intel Itanium 2 processor to focus all of its resources on executing the optimized code stream as fast as possible.
ALAT (Advanced Load Address Table)
 
x86-based processors have fewer strategies available for optimizing parallel throughput. Though they are designed to enable out-of-order instruction execution, the processor itself must determine which instructions can be safely executed in parallel or out of sequence. Extensive processing resources are devoted to this task, and since it happens during runtime, the processor can only look a few instructions ahead to find opportunities for optimization.

OTOH, the CPU has information that is unavailable to the compiler - the actual data being processed. Do not underestimate this.

E.g., consider this C++ function:

[code:1:bd1bbd4096]
void AddToBoth(int& first, int& second, int amount)
{
first += amount;
second += amount;
}
[/code:1:bd1bbd4096]

There is no way EPIC could compile this for parallel execution. But OOO can execute it in parallel, as long as &first != &second.

Secondly, the idea of scanning thousands of instructions ahead sounds good in marketing materials, but it is far from reality. First, most functions do not have thousands of instructions, and static flow analysis cannot go further. Second - actually, this point MIGHT be my fantasy, because I was unable to find confirmation of this problem, so if you know better please correct me - an Itanium parallel execution bundle cannot surpass a branch, unlike a speculative OOO design. Last time I checked the assembly of my C++ (which was 2 hours ago), most loop bodies in common code have 10-20 instructions, which is the limit for an EPIC parallel execution block (but not for an OOO machine).
 
Compiler optimizations do help. That is why Intel sells compilers and add-ons for Visual Studio etc. to better utilize their CPUs.
They are two-pass, I think, instead of one-pass like most compilers, but compiler technology is improving.

Telling us (programmers, that is) that we no longer need to optimize per CPU was a mistake; even though it was true at one point, it no longer is, yet many programmers still do not do it, as they were taught it was not necessary.

For best results, the code should be well written, the compiler should optimize it well, and the CPU needs good prediction as well. Mostly this does not happen.

They should get John Carmack to write a compiler; they call him the human compiler, he optimizes so well.
 
Yes, NetBurst was an inefficient heater. And no, Intel should have brought in a better chip, but it took them a while.
Conroe is here to take over for non-servers, Xeon for high-end workstations, Itanium for servers.

Beer, he clearly hasn't bothered to even glance at what EPIC is about; he most likely read some spin from a competitor or doesn't understand the concepts. In this case, let the sleeping dog lie.
 

C code:
if (r2 >= r3)
    r4 = r2 - r3;
else
    r4 = r3 - r2;

Non-predicated pseudo code (OOO compiler):
cmp r2, r3
jump_less P2
P1: sub r4 = r2, r3
jump end
P2: sub r4 = r3, r2
end: ...

Predicated assembly code (EPIC compiler):
cmp.ge p1,p2 = r2,r3;;
(p1) sub r4 = r2,r3
(p2) sub r4 = r3,r2

I know what I am looking at: no branch. Care to explain?

Also, what about loop unrolling in x86? EPIC, using rotating registers, can software pipeline. Why not have overlapping execution of multiple loops, which is attainable using EPIC?
 
Predicated assembly code (EPIC compiler):
cmp.ge p1,p2 = r2,r3;;
(p1) sub r4 = r2,r3
(p2) sub r4 = r3,r2

I know what I am looking at: no branch. Care to explain?

Sure. Straight from Intel's marketing material. How many such cases do you meet in the real world?

BTW, just for your pleasure, here is similar branchless code for x86:

http://aspn.activestate.com/ASPN/Mail/Message/pypy-dev/1819508

Good x86 compilers are able to use such techniques and do so.

Also, what about loop unrolling in x86? EPIC, using rotating registers, can software pipeline. Why not have overlapping execution of multiple loops, which is attainable using EPIC?

With speculative OOO, overlapping loop execution is standard.
 
Sure. Straight from Intel's marketing material. How many such cases do you meet in the real world?

Where else am I going to get assembly? I don't have any code on hand that is compiled with EPIC and x86 in mind; I have Trimaran but no assembler. But the example comparing simple branches will generally carry right on over to complex branches.

With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, and you'll also have to show me an x86 machine that can do overlapping loop execution without hardlocking.
 
With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, and you'll also have to show me an x86 machine that can do overlapping loop execution without hardlocking.
Well, a K7 can do that (and K8 and Core 2).
Perhaps even a humble Pentium 2.
With register renaming, thanks to Tomasulo's algorithm, and speculative execution (the branch will be correctly predicted as taken except the first and last times).
Each loop iteration will use a different "shadow" register, until the instructions retire.
 
Sure. Straight from Intel's marketing material. How many such cases do you meet in the real world?

Where else am I going to get assembly? I don't have any code on hand that is compiled with EPIC and x86 in mind; I have Trimaran but no assembler. But the example comparing simple branches will generally carry right on over to complex branches.


But predication wastes power, because half of the work it does is guaranteed to be wasted.
Also, this works fine only for short pipelines (or simple branches), because if you meet other branches before you have resolved the first, the number of parallel paths increases exponentially, and then you can't apply predication.
 
Guys, this discussion is awesome, but I think you're like trying to prove whether God exists or not.
I mean, the design of a CPU is dictated by statistics and models of the expected workload that the CPU is going to be sustaining; it also depends on marketing issues, process technology, etc.
Some concepts of EPIC, like predication, don't fit modern desktops very well, for example, because now making smart use of the available power is a top priority, while just a few years ago an idle execution resource was just wasted die space.
I still think that EPIC was a brilliant idea, even though commercially it hasn't been so successful; yet it can hardly be proven that it is the "next big thing" after RISC and OOO... it's more of an alternative.
 
With speculative OOO, overlapping loop execution is standard.

You'll have to show me when that changed from theory to silicon, and you'll also have to show me an x86 machine that can do overlapping loop execution without hardlocking.
Well, a K7 can do that (and K8 and Core 2).
Perhaps even a humble Pentium 2.
With register renaming, thanks to Tomasulo's algorithm, and speculative execution (the branch will be correctly predicted as taken except the first and last times).
Each loop iteration will use a different "shadow" register, until the instructions retire.

As for speculative execution, or as some of you are coining it "speculative OOO" (I have never heard of anyone calling it that), or simply branch prediction: my bad.

With that in mind, yes, modern x86 machines are capable of software pipelining, but it is done at the compiler level, not at the machine level, which is the case with EPIC. It is noticeably more efficient on EPIC than on x86, given x86's register limits as opposed to EPIC's.

It should also be noted that software pipelining is quite difficult to implement on x86, and in nearly all cases it is faster and much easier to do a straightforward loop, or use Duff's technique for loop unrolling, which is quite a bit simpler.

Why are you bringing Tomasulo's algorithm into the argument? I'm quite aware that it gave birth to out-of-order execution processors. Did you look that up on a wiki to make me look stupid?
 

But predication wastes power, because half of the work it does is guaranteed to be wasted.
Also, this works fine only for short pipelines (or simple branches), because if you meet other branches before you have resolved the first, the number of parallel paths increases exponentially, and then you can't apply predication.

Predication is done at the assembly level.
 
Predication means executing both sides of the branch in parallel, then discarding the wrong one.
This means that half of the execution resources are wasting power for nothing.
ONLINE, i.e., at run time.
 

As for speculative execution, or as some of you are coining it "speculative OOO" (I have never heard of anyone calling it that), or simply branch prediction: my bad.

With that in mind, yes, modern x86 machines are capable of software pipelining, but it is done at the compiler level, not at the machine level, which is the case with EPIC. It is noticeably more efficient on EPIC than on x86, given x86's register limits as opposed to EPIC's.

It should also be noted that software pipelining is quite difficult to implement on x86, and in nearly all cases it is faster and much easier to do a straightforward loop, or use Duff's technique for loop unrolling, which is quite a bit simpler.

Why are you bringing Tomasulo's algorithm into the argument? I'm quite aware that it gave birth to out-of-order execution processors. Did you look that up on a wiki to make me look stupid?
Nope, OOO execution existed before Tomasulo's algorithm was invented; it was based on a simpler algorithm called scoreboarding.
Tomasulo's algorithm allows dynamic loop unrolling at the HW level; that's why I mentioned it.
And if I find the time I might provide an example of this, but not tonight (yes, it's night here :) ).
And no, I didn't look it up in a wiki, since I teach it in a course on computer architecture at a college.. :)
Did you mention me looking it up in a wiki to make me look stupid? 😉
 
Predication means executing both sides of the branch in parallel, then discarding the wrong one.
This means that half of the execution resources are wasting power for nothing.
ONLINE, i.e., at run time.

Yes, I know what the compiler is creating for execution, but in the end it is faster than executing a complete branch.