AMD vs Intel on the performance equation


Kelledin

Distinguished
Mar 1, 2001
2,183
0
19,780
<A HREF="http://www.blanos.com/benchmark/bprint.cgi?lw_scene=variations&limit=10&search=type" target="_new">Yes, even with SSE2 optimizations on and an 800MHz handicap.</A>

Only reason the 1800MP isn't there is because it hasn't been put through those benchmarks yet. Actually use the search feature, and you'll see which one wins. :tongue:

Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."
 

FatBurger

Illustrious
*sniff sniff*

Ahh...the sweet sound of hairs being split down the middle.

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
 

balzi

Distinguished
Oct 16, 2001
121
0
18,680
From LoveGuRu
"SSE/SSE2, how effective are they compared to just executing the program?"

SSE/SSE2 and 3DNow! blah blah MMX stuff.. it's all pretty much the same.. just tailored differently for different CPU cores.

when coding in standard asm you use instructions like mov eax,yada .. add cx,y .. add y,bx .. whatever. they are "generally" small tasks. a better example is probably this:

let's say you wanna code C = (A*D + A*B) % 10;
(now there's probably no SSE/SSE2 instruction for this.)
but normally you would have to execute
A*D -> save result.
A*B -> add to previous result.
result % 10 -> store in C.

that's three instructions, and takes 3 clock cycles (well, kinda; pipelines make this all messy.. but imagine it takes 3)

the reason it takes 3 clock cycles and can't be calculated at once is that the CPU is not capable of knowing the final outcome of the maths; it just executes little bits of math in whatever order you tell it.

A smart compiler might even be able to say "hey, let's add D and B and multiply by A later, just to save dragging A out of memory twice.." but it's still 3 cycles.
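Just to sanity-check that rewrite, a quick Python sketch (the values of A, B and D are made up; this shows the algebra a compiler can exploit, not SSE itself):

```python
A, B, D = 7, 3, 5  # arbitrary example values

# Naive three-step version: two multiplies, an add, a modulo.
step1 = A * D
step2 = step1 + A * B
C_naive = step2 % 10

# What a smart compiler might emit instead: factor out A so it
# only has to be dragged out of memory once.
C_optimised = (A * (D + B)) % 10

assert C_naive == C_optimised
```

Same answer either way; the win is purely in how many separate operations the CPU has to grind through.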

now the CPU might be capable of doing add/mul integer maths on a few numbers in parallel and doing modulo arithmetic on the results in a single pipeline stage.
But this parallelism cannot be utilised because the instructions are separate; the CPU can't know to calculate them all at once without special flow-tracking hardware, which takes silicon and would produce heat.

But what if you could make an instruction that does it all at once..
Simple instructions are embedded together, so the CPU can easily calculate A*D and A*B at the same time, add them together, and do the modulo just before the result pops out of the pipe. An instruction could do all this.... and it would get labelled as an MMX-type thingy.

SSE/SSE2 and MMX yada yada.. do similar things (AFAIK), but as to exactly what they help with, I have no clue. But that's the concept: do lots of stuff in one instruction to make more efficient use of the CPU.
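If it helps, here's a rough Python model of the "lots of stuff in one instruction" idea. Real SSE packs 4 32-bit floats into one 128-bit register and adds them lanewise in one go; `packed_add` below is just an illustration of that shape, not a real intrinsic:

```python
def packed_add(xs, ys):
    """Model of a packed SIMD add: one 'instruction', four lanes at once."""
    assert len(xs) == len(ys) == 4  # an SSE register holds 4 x 32-bit floats
    return [x + y for x, y in zip(xs, ys)]

# Four scalar adds collapse into a single packed operation.
print(packed_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

One "instruction", four results; that's the whole trick, scaled up.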

An analogy might help.
Imagine you know your 6 times tables (hopefully no-one is 'imagining')
and can add 3 digit numbers with ease.

now I am dumb, and want something calculated for me.
you can calculate anything, but it takes you 3 seconds to write anything down.
I write down my question in bits... 5+5, 6*4, ..., AND 0xff.
I pass you 5+5?
wait for an answer --- 10.
I pass you 6*4?
wait for an answer --- 24.
I pass you ans1 + ans2
wait for an answer ---- 34.
I pass you ans3 % 10.
wait for answer ----- 4.
I pass you ans4 AND 0xff.
wait for answer --- 4.

so that's all very inefficient.
it takes you 5 × 3 seconds = 15 seconds to write down all those intermediate answers and give me the final result.
What if I pass you ((5+5)+(6*4)) % 10 AND 0xff
you look at it, and it takes you 1 second to calculate. umm, 10 + 24 is 34.. 34 % 10.. AND with 0xff..
yep..
you write back 4. it took 3 seconds because I told you everything at once.
because you know the full story, you can optimise the maths behind the result and get me the answer while only writing down one thing, '4', in 3 seconds.
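The arithmetic in the analogy checks out both ways, for what it's worth:

```python
# Step-by-step, one intermediate answer at a time (five round trips).
ans1 = 5 + 5          # 10
ans2 = 6 * 4          # 24
ans3 = ans1 + ans2    # 34
ans4 = ans3 % 10      # 4
ans5 = ans4 & 0xFF    # 4

# All at once, as a single expression.
fused = ((5 + 5) + (6 * 4)) % 10 & 0xFF

assert ans5 == fused == 4
```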

same with puters. they are real smart and real quick, but can you make good use of having 3 32-bit ALUs, 2 single-cycle FPUs, etc...?

that's SSE/SSE2 simple style...

Balzi

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
 

Conqueror

Distinguished
Nov 8, 2001
87
0
18,630
balzi, I understand what you're saying. I know you have to trade off clock speed with IPC. It's like this: you may have a very high clock speed, but not enough speed to carry out an instruction every clock. In other words, the clock is pulsing so fast, but the system that carries out the instructions cannot keep up with the clock. So the high clock speed is wasted.

On the other hand, you could have a low clock speed, and the system that carries out the instructions is fast. It carries out instructions at a much higher rate than the clock can pulse. It finishes carrying out the instructions before the next clock can pulse. Here, the clock speed is too slow for the system that carries out the instructions.


But I ask the question again: why can't Intel make the "instructions per cycle" (IPC) term of the equation bigger?

Why can't they use 20 transistors in their pipe stages, together with their high clock speed?
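The performance equation the thread title refers to is just performance = IPC × clock. A toy comparison (the IPC and clock figures below are invented for illustration, not measured numbers for any real chip):

```python
def relative_performance(ipc, clock_mhz):
    # Millions of instructions per second: IPC times clock rate.
    return ipc * clock_mhz

# Hypothetical numbers: a deep-pipeline/high-clock design
# versus a shorter-pipeline/high-IPC design.
p4_like     = relative_performance(ipc=1.0, clock_mhz=2000)
athlon_like = relative_performance(ipc=1.5, clock_mhz=1400)

print(p4_like, athlon_like)  # hypothetical MIPS: 2000.0 vs 2100.0
```

With these made-up figures the lower-clocked, higher-IPC design comes out ahead; which term is easier to grow is exactly what the thread is arguing about.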

<font color=blue><i>Mankind must put an end to War,
or War will put an end to mankind!</i></font color=blue>
 
G

Guest

Guest
maybe I need to repeat it again: "propagation delays".

If the propagation delay is longer than the clock period, then while the next clock pulse (i.e. sampling period) is occurring, the data is still trying to propagate to where it needs to be to be properly sampled. If, for example, some set of hardware was designed to output valid data within a certain time period and doesn't do it, then invalid data is used by the software, and we all know what kind of loveliness this causes. Of course it would be possible to only sample the data half as often, but that just gets you back to half speed; I don't think that's what you were after.

There are physical limits as to how the hardware can be used. If you are less than master of time space and dimension then you must remain within these limits.

"On the other hand, you could have a low clock speed, and the system that carries out the instructions is fast. It carries out instructions at a much higher rate than the clock can pulse. It finishes carrying out the instructions before the next clock can pulse. Here, the clock speed is too slow for the system that carries out the instructions."

No dude, not too slow, that is what you want , and how synchronous state machines operate. The idea to get maximum performance is make sure all the hardware stages stay busy until just before the next clock pulse, and they have data valid just in time for this next clock pulse.

It might be easiest to understand this by picturing the clock pulse as the data sampling time, where the data must be valid for a window of time before and after the actual clock pulse to make a succesful sample.
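To put a number on that: the clock period has to cover the slowest stage's propagation delay, so the slowest stage sets the ceiling on clock speed. A sketch (the stage delays are invented for illustration, and setup/hold margin is ignored):

```python
# Propagation delay of each pipeline stage, in nanoseconds (made-up values).
stage_delays_ns = [0.8, 1.0, 1.2, 0.9]

# The clock period must be at least as long as the slowest stage,
# or data won't be valid when the next pulse samples it.
min_period_ns = max(stage_delays_ns)
max_clock_ghz = 1.0 / min_period_ns

print(round(max_clock_ghz, 3))  # 0.833 GHz for a 1.2 ns worst stage
```

Shave that 1.2 ns stage down (or split it in two) and the whole pipeline can be clocked faster; that's the entire game.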
 

Conqueror

Distinguished
Nov 8, 2001
87
0
18,630
<i>No dude, not too slow, that is what you want , and how synchronous state machines operate. The idea to get maximum performance is make sure all the hardware stages stay busy until just before the next clock pulse, and they have data valid just in time for this next clock pulse. </i>

What do you mean "not too slow"? Of course it would be slow in comparison to the data transfer rate.

This is exactly what happened, for example, to the XT and AT computers. Their data transfer rate was too slow in comparison to the clock speed. So they had to be abandoned.

<b>Data tranfer rate = Bus Width * Clock speed / (8 * No. of clock pulses per transfer)</b>

Can you see that it depends on the data bus, address bus, and clock speed?
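Plugging rough ISA-era numbers into that formula (a 16-bit bus at about 8 MHz with 2 clock pulses per transfer; the exact clock varied by machine, so treat these as ballpark assumptions):

```python
def transfer_rate_mb_s(bus_width_bits, clock_mhz, clocks_per_transfer):
    # Data transfer rate = bus width * clock / (8 * clocks per transfer)
    return bus_width_bits * clock_mhz / (8 * clocks_per_transfer)

# 16-bit ISA bus, ~8 MHz, 2 clock pulses per transfer (assumed values).
print(transfer_rate_mb_s(16, 8, 2))  # 8.0 MB/s peak; ~5 MB/s in practice
```

The peak figure is above the ~5 MB/s quoted below because real buses lose cycles to wait states and arbitration.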

Some historical background for you:

Since a system can only run at the speed of its slowest element (a system is only as strong as its weakest link), the clock speed was reduced to what the memory could handle. This was a reflection of the state of the art in memory chip development at the time. Most ISA (Industry Standard Architecture) buses ran at around 5 MB/s. It was important that all cards currently in use at the time for the XTs would also work on the AT machines, so the bus layout had to be compatible - and still provide the extra data and address bus connections. This was achieved by keeping the original XT expansion bus and adding an extension section to the bus for the extra connections. That way, XT cards would fit in the expansion slot, while AT cards would also use the slot extension.

This is termed the ISA system (Industry Standard Architecture).

Then of course, they brought out the 386/486, MCA, EISA systems etc.

<font color=blue><i>Mankind must put an end to War,
or War will put an end to mankind!</i></font color=blue>
 

zengeos

Distinguished
Jul 3, 2001
921
0
18,980
Conqueror....

Intel probably COULD turn up the IPC if they completely redesigned the chip to shorten the steps etc. However, they would have to lower the clockspeed. It's a matter of tradeoffs.

Now, supposedly, there's a P4 variant that will be able to perform more instructions per cycle using a technique they call Hyperthreading. However, that is earmarked for their server chips and will likely be very expensive, PLUS even with Hyperthreading technology, the P4 will not have the IPC of the Athlon.

Mark-

When all else fails, throw your computer out the window!!!
 

Kemche

Distinguished
Oct 5, 2001
284
0
18,780
What I'd like to know is whether Intel can increase the IPC and also the MHz. I don't know if this is possible or not. But let's just say they did. Then do you think AMD will go back and rename their Athlon XP 1900+ as the Athlon XP 1700-?

Just making a joke.
KG.
 

AMD_Man

Splendid
Jul 3, 2001
7,376
2
25,780
They can, but it'll be more costly, and they'll need to implement some serious heat spreaders and thermal protection, as well as move to the .13 micron process.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the <b>ULTIMATE</b> PC processor
 

balzi

Distinguished
Oct 16, 2001
121
0
18,680
hehehe.. OK guys, let's try and wrap this one up THIS TIME.

firstly -> Conqueror's comment:
"This is exactly what happened, for example, to the XT and AT computers. Their data transfer rate was too slow in comparison to the clock speed. So they had to be abandoned.

Data tranfer rate = Bus Width * Clock speed / (8 * No. of clock pulses per transfer)"

Your point is extremely correct but also extremely off-topic. The data transfer rate you speak of is the memory subsystem transfer rate.

What knewton and I have been saying is that each 'stage' of a pipeline WITHIN the processor has a transfer or 'propagation' rate.

I'm sorry my previous examples haven't been satisfactory, but I'll try again.

Let's say that you and a group of friends (10 of you) are a professional PC construction team. (I thought I'd use something we all might know about)

You can organise yourselves into groups or do one task each but at the end you want the PC to be together and working (oh really!!)

now, your construction team needs an overseer, that's me - I'm your clock.
For the sake of this example, let's assume that you have as many PC parts as you'll need (that's equivalent to unlimited memory bandwidth.. always enough data)

First attempt, you guys organise yourselves into two groups of 5 people.
First team puts CPU and RAM onto MB. Second team puts MB into case along with all PCI cards, video card, HD, PS, CD, Floppy.. we're ignoring peripherals for now.

Now the flow of PC bits is moderated by me yelling "MOVE ON" at exact intervals. say every 1 minute.

so every 2 minutes (1 minute per stage) 5 PCs come out.. or do they.

now the first group has to pass their MBs to the second group. but we all know that the second stage is going to take longer; in fact they can't complete their duties in that 1 minute. so they tell me to 'clock' less often, yelling "MOVE ON" every 5 minutes. (slower clock)

so I do that, and now the second group is ready for new MBs on time. but now group one is sitting around doing nothing for most of each cycle.. this is inefficient.

it takes 10 minutes (5 per stage) to 'clock' the first 5 PCs through...
but on average, if we left the process running long enough, the time would be 5 minutes, because we start and finish 5 PCs every 5 minutes. (that's 1 PC per minute)

you re-organise your team. now you're in 2 groups again, but the first group is only 2 people; group 2 is 8 people.
The first group does the easier work, installing CPU and RAM. the second group does the same as before, but there are more of them to do it.
now I 'clock' every 5 minutes, but in that clock the first team can get 8 MBs ready for the second team, who can each finish 1 MB, so that's 8 MBs for the group.
so now every 5 minutes (on average, after time) you produce 8 PCs. that's 1.6 PCs per minute...

you just upped your rate by simply re-organising your pipeline stages; notice that the clock is the same.
Now I can't start 'clocking' every 2, 3 or 4 minutes to make it faster unless I train you guys to build faster... this is the equivalent of getting transistors to switch faster.

you guys re-organise again, this time you are in 4 groups - 2 people, 2 people, 3 and 3. (10 altogether)
Group 1 of 2 puts CPUs and RAM in - this takes 1 minute each, so 2 PCs per minute.
Group 2 of 2 puts the PS and MB in the case - this takes 1 minute each as well.
Group 3 of 3 work together to get the HD, Floppy and CD into the case. they can produce 2 PCs per minute all up.
Group 4 of 3 people stick all the PCI cards and video card in and plug in the HD, CD and Floppy cables. together they can assemble 2 PCs per minute.

so you have organised a nice pipeline; each stage is equal.. throughput is maximised.. and you can change your clock to 1-minute intervals. all up, on average (after time), your PCs will be assembled at 2 per minute.

now, see here, you have doubled your pipeline and gone for a 5 times faster clock, but your throughput is only up 25%, and that's really b/c your pipeline is managed better.

now if you clock faster, no-one will get anything done.
You could split into 8 stages and halve your clock period again, but your throughput wouldn't change, would it. each stage might only get 1 PC out per clock.. with a 30-second clock, that's still 2 PCs per minute.

so the time it takes for each person or group to get their tasks done is the assembly line's propagation delay.. which relates directly to the propagation delay we spoke of in real-world CPU transistor delays.
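The three team layouts boil down to one formula: PCs per minute = PCs finished per clock ÷ clock period in minutes. The numbers here are straight from the analogy above:

```python
def throughput_per_min(pcs_per_clock, clock_period_min):
    # Steady-state output rate of the assembly line.
    return pcs_per_clock / clock_period_min

# Layout 1: two unbalanced groups, 5-minute clock, 5 PCs per clock.
assert throughput_per_min(5, 5) == 1.0
# Layout 2: rebalanced groups, same 5-minute clock, 8 PCs per clock.
assert throughput_per_min(8, 5) == 1.6
# Layout 3: four balanced stages, 1-minute clock, 2 PCs per clock.
assert throughput_per_min(2, 1) == 2.0
```

Note the clock only helps when the stages are balanced enough to keep up with it; exactly the point of the story.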

I sincerely hope I covered it properly... any problems to speak of, feel free to correct me.

The logic is kinda simple when you finally get your head around it.. Sort of like once you find Wally you think everyone will see him straight away - too easy!!!!

balzi

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
This is a fairly good explanation. Now add to that the following: the people building the systems can hear the guys taking orders on the phones for these systems. They will attempt to predict what kind of systems they need to build before the sales call has been completed so they do not have to stall and waste time. If they predict incorrectly, they must start building the correct system from the beginning. This is the equivalent of a branch misprediction.

The amount of time they waste by mispredicting is determined by the clock speed (the guy yelling 'Move on!') and the number of steps in the pipeline (the assembly line of people making the system). In essence, it is the time for one system to make its way through the assembly line from start to finish. You can think of this as latency, while the total number of produced systems is bandwidth.
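That misprediction cost can be written down directly: roughly pipeline depth × clock period, i.e. the time to refill the line from the start. A sketch with invented figures (not real chip numbers):

```python
def mispredict_penalty_ns(pipeline_stages, clock_period_ns):
    # Time for one instruction to re-travel the whole pipeline.
    return pipeline_stages * clock_period_ns

# A long, fast pipeline vs a shorter, slower one (hypothetical figures).
deep_fast  = mispredict_penalty_ns(pipeline_stages=20, clock_period_ns=0.5)
short_slow = mispredict_penalty_ns(pipeline_stages=10, clock_period_ns=0.8)

print(deep_fast, short_slow)  # 10.0 ns vs 8.0 ns per mispredict
```

So a deeper pipeline can pay more per mispredict even when its clock is faster, which is why prediction accuracy matters more as pipelines get longer.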

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

mala

Distinguished
Oct 12, 2001
45
0
18,530
hehe. Nice explanation :).

I have a question.

The pipeline discussed in your (and many other) posts determines the amount of work done along ONE path inside the processor, whereas my post talked more about the number of paths, or the parallelism, of the processor.

My question is:
How do wider designs (i.e. higher issue bandwidth) limit the maximum clock frequency? I have the feeling that transforming the P4 into an Athlon with longer pipelines can only be done at the expense of maximum operating frequency, although I can't really convince myself of why that would be.

Of course, the logic issuing 6 uops has to "do more" than the one issuing 3 uops, but that can't be the limiting factor, or can it?

/Markus
 

balzi

Distinguished
Oct 16, 2001
121
0
18,680
ahh yes, parallelism... the buzzword of CPUs quite a while back.. when I was in uni. parallel processing, WOW?!?!?! hehe

anyway, parallelism is great because it's like having two sets of assembly teams who can operate independently but respond to the same 'clock'. The thing is, this works lovely for assembly lines, but when you're performing calculations, who cares about having two pipes to get the same answer..
you double production in assembly; you look stupid in CPU calculations... "hey, the answer's 50..", "I know!!!"

instead parallelism in CPUs is very different. It's when you try to ask as many mutually exclusive questions as possible, and then have the processor calculate lots of things in parallel (same time, different silicon).

Because of program flow on a CPU with branches and instructions that want to know the answer to the previous instruction and yada yada, parallelism is difficult to implement.

To Raystonn: there's lots of stuff that my analogy covers, but it's not really that close to the real thing. Thanks for your clarification.. the more people explaining, the better we cover everyone else trying to comprehend.

In thinking of your additions to the PC assembly example, I thought about the P4's branch prediction.. at least I think it's the P4.
it has a simultaneous execution unit.. where it executes both branches.
Kind of like: whenever the construction team hears a phone call, each team starts building a PC of a different type. When they find out which PC they're supposed to build, everyone but the correct team pulls their PC to bits and discards their work.

oo I like this analogy.. it's so scalable.. hehehe.



"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
 

mala

Distinguished
Oct 12, 2001
45
0
18,530
About the branch prediction:

I've read about some processor taking both branches, but I can't remember which one. I am pretty sure P4 doesn't do this though.

Actually, it is kind of a stupid approach. Branch prediction is based on the assumption that your hardware can predict the outcome of a branch more than 50% of the time. Otherwise the predictor would be a waste of space, as either 'always taken' or 'never taken' must be true at least 50% of the time. If you continue execution at both the taken and not-taken addresses, only half of your execution paths would be doing any useful work, and essentially you would be 50% right; that is - a waste of space...
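The 50% argument in miniature: compare a predictor with accuracy p against always running both paths. This toy model deliberately ignores misprediction penalties, which is what the following posts argue about:

```python
def useful_work_fraction_predictor(accuracy):
    # A predictor's work is useful whenever it guessed right.
    return accuracy

def useful_work_fraction_both_paths():
    # Executing both paths: exactly one of the two is ever useful.
    return 0.5

# Any predictor better than a coin flip beats eagerly executing both
# paths on raw utilisation (latency effects are a separate question).
assert useful_work_fraction_predictor(0.9) > useful_work_fraction_both_paths()
```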

/Markus
<P ID="edit"><FONT SIZE=-1><EM>Edited by mala on 11/15/01 11:30 PM.</EM></FONT></P>
 

balzi

Distinguished
Oct 16, 2001
121
0
18,680
yeah, sorry, I was only guessing.. way back in the recesses of my mind it linked the P4 to multiple branch execution... don't know why.

It wasn't supposed to be informative.. I was just mucking around with the assembly line analogy.

hey, Mala.. you seem to know a bit about everything.. I always wondered what the P4 did wrong with its FPU.. everybody disses it.. but I don't know what's up with that.

do you know??

ta
balzi

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
 

mala

Distinguished
Oct 12, 2001
45
0
18,530
I don't know how much of a difference there is (I still have Celerons powering my system) =/.

First you have the overall design of P4 which affects all code, both ALU and FPU.

Then, the Athlon (and P3) can issue up to 2 FP uops/cycle; the P4 can only do one.

And add to that higher latencies for instructions on the P4. The latency problem is not as bad as it sounds, because Athlons can issue twice as many instructions per cycle. There is probably a greater risk of stalling the Athlon pipeline than the P4's. (my guess)

I guess some of its "slowness" has to do with the P4 being a new processor, and the NetBurst microarchitecture differing quite a lot from previous CPUs. Maybe the applications produced today don't use the newest compilers, or maybe the programmers don't care to optimize for the P4 as it would lower the performance on P3s and Athlons(?). And of course, old programs released before the P4 haven't been optimized...

There was an article here on THG when the P4 was new. THG got horrible results in some video encoding benchmark. A guy working for Intel asked to look at the software and the next day the programmers at Intel (Raystonn? =) ) had tuned the code to much higher scores.

Humm... That was the long version of "sorry, I don't know" ;)

/Markus
 
G

Guest

Guest
>I've read about some processor taking both branches, but I
>can't remember which one.

That would be the Itanium. If I'm not mistaken, Intel calls this 'branch predication' (Raystonn, correct me if I'm wrong).

Anyway.. while this seems a crazy idea, it might not be that crazy. A lot of logic is spent on predicting the outcome of branches - a lot of die space, therefore. While it's true the predictions are correct in much more than 50% of the cases, don't forget the penalty for mispredicting is HUGE, especially with a long pipeline. When you follow ALL the branches, there is never a misprediction. It's a strange approach.. I'll admit. And frankly, it doesn't seem to work well. Read this:
<A HREF="http://www.aceshardware.com/Spades/read.php?article_id=45000187" target="_new">http://www.aceshardware.com/Spades/read.php?article_id=45000187</A>

Aces benched a Queens chess puzzle on the Itanium. This code has lots of branches, and this really seems to kill its performance. (to be fair, some of the other benches in that preview, especially the FP benches on the Itanium, are very impressive)

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
G

Guest

Guest
>maybe the programmers don't care to optimize for the P4
>as it would lower the performance on P3s and Athlons(?).

The funny thing is, these P4-optimized apps generally also perform (much) faster on the Athlon, and often even on the P3.. Go figure.

So that's not the reason we are seeing relatively little P4- or SSE2-optimized code. I think it has more to do with the compilers being used. Currently, Intel compilers seem to offer the best-performing code, especially on the P4. But I read somewhere that Intel compilers are rarely used to compile mainstream applications, because they supposedly aren't as stable as other C/C++ compilers?

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 

mala

Distinguished
Oct 12, 2001
45
0
18,530
Maybe the processor I read about was the Itanium.
But I don't agree that predication on the Itanium takes both branches.
The point of predication is to do if-statements without using branches at all.

Instead of a general compare setting flags in the processor and branches being taken if some flag equals something, IA-64 lets you test a value for a specific property and store the answer (true/false) in a 1-bit predicate register. Then any instruction can be made conditional on the result in a predicate register.

Kind of like the cmovxx instruction in x86, but more powerful.
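A branch-free select of the kind predication (or x86 cmov) gives you can be mimicked in Python with a mask trick; both inputs get computed and one is kept, with no jump in the control flow:

```python
def predicated_select(pred, a, b):
    """Branch-free select: returns a if pred is true, else b.

    mask is all-ones (-1 in two's complement) when pred holds and
    all-zeros otherwise, so exactly one operand survives the AND.
    Works for non-negative integer operands.
    """
    mask = -int(bool(pred))          # True -> -1 (...1111), False -> 0
    return (a & mask) | (b & ~mask)

print(predicated_select(True, 10, 20), predicated_select(False, 10, 20))
```

Hardware predication does the equivalent job with a predicate register gating each instruction, so the pipeline never has to guess a branch direction here at all.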

/Markus
 
G

Guest

Guest
You're right.. it's not predication; your explanation of it seems correct.

However, the capability of the Itanium to process different branches in parallel is called "control speculation". Here is a quote from it-enquirer.com:
---------------
In contrast to branch prediction, the speculative execution performed by processors with Itanium-architecture involves loading and executing both expected instruction sequences. HP and Intel call this procedure control speculation.

Itanium architecture flags the results in additional registers, so that the results of the unnecessary program branch which was executed can be discarded without any problems. The “costs” of this procedure are smaller than the time and thus performance losses if a false branch prediction is made. The multiple functional units within Itanium processors facilitate simultaneous execution of various program branches.
---------------

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 

Conqueror

Distinguished
Nov 8, 2001
87
0
18,630
Balzi, with all due respect, I understood your explanation about the clock when you explained it the very first time in one of your previous posts.

I even mentioned in one of my previous posts: <i>"You may have a very high clock speed, but not enough speed to carry out an instruction every clock. In other words, the clock is pulsing so fast, but the system that carries out the instructions cannot keep up with the clock. So the high clock speed is wasted.

On the other hand, you could have a low clock speed, and the system that carries out the instructions is fast. It carries out instructions at a much higher rate than the clock can pulse. It finishes carrying out the instructions BEFORE the next clock can pulse. Here, the clock speed is too slow for the system that carries out the instructions."</i>

Sorry, my language wasn't all that technical, but that was what you were basically trying to explain, wasn't it?

My question again is (lol, sorry!): why can't Intel organise their pipeline to make it similar to the Athlon's pipeline so that it would carry out more instructions per clock?? That way, they would have a high clock speed and high instructions per cycle, and hence higher performance.

I know about propagation delays etc., but this question is not related to that.

<font color=blue><i>Mankind must put an end to War,
or War will put an end to mankind!</i></font color=blue>
 

mala

Distinguished
Oct 12, 2001
45
0
18,530
Interesting.
I reread the IA-64 Software Developer's Manual just now, but I can't find anything that suggests that the Itanium would behave like this. Actually, I didn't find any reference to "procedure control speculation" at all.

I think it is possible that it-enquirer.com has misunderstood some concept. Of course, it is possible that I have too.

/Markus
 
G

Guest

Guest
Of course I could be wrong ($h1t happens), but this is the tradeoff: you either get high complexity per stage or high clock speed, not both. If you increase complexity per stage, as a result there is more propagation delay, and thus a lower clock is necessary. Also, if complexity per stage is reduced, the propagation through each stage is reduced, and so the clock speed may safely be increased. For Intel to have their system behave more like the Athlon's, i.e. get more work done per cycle, would require them to increase stage complexity, and thus be forced to reduce clock speed to compensate. Each approach is valid, just different.

"I know about propagation delays etc. but this question is not relating to that."

Sorry about using the "P" word, but I just don't know how else to talk about this.

I think maybe it is this statement where the communication breakdown is occurring:

"It carries out instructions at a much higher rate than the clock can pulse."

This is most certainly what is happening in an underclocked part. If, in general, there were an excess of idle time, then the clock speed would be increased as much as possible by the manufacturer, right up as close as possible to the limit. After all, by taking up this slack by raising the clock speed, the manufacturer can call it a faster processor and charge more for it.

It appears to me that everyone seems to be saying the same thing here, and chances are good that we are getting close to the truth. There is obviously some kind of miscommunication going on, but I'll play the odds and say it is you who is misunderstanding. Not to be disrespectful, but you may want to take a step back and re-evaluate what you are asking and what the replies were. You may find that the question has been sufficiently answered already.