Disclaimer:
===========
The following is informed speculation on the Pentium 4's FPU performance, based on publicly available developer documentation provided by Intel.
===========
I've just read Tom's latest benchmarks and comments on the Intel Pentium 4's FPU - the results are disappointing but, in some ways, not entirely surprising. Let me explain:-
I work as a developer for music software company FXpansion Audio; in our corner of the industry, the FPU is the _most important_ factor determining software performance, and our code tends to have a lot of hand-optimised x87 FPU assembler code.
When writing this code, we of course target the P6-core; most of the time, the Athlon's characteristics are similar enough that, with its better overall performance, it can match the P6-core clock for clock even on P6-targeted code.
However, the P4 is a very different kettle of fish, and here's why. Floating-point code mostly uses the FADD/FSUB and FMUL instructions; others, like FDIV, are used as little as possible, as they're slow anyway. It's FADD and FMUL performance that's the key to fast floating point.
Now, these instructions are quite complex, and it takes several clock cycles before the result of a calculation pops out the other end of the FPU pipeline ready to be used. For example,
a = a + (b * c)
The processor can't do the "a = a +" part - an FADD - until it knows the outcome of (b * c) - an FMUL. The time it takes for this outcome to be known is called latency, and is measured in clock cycles. However, while this is happening, it _can_ do _other_ calculations which don't need that result (like "e = (f * g)") - that's what pipelining is all about.
For a PPro/2/3 (P6-core), the FMUL latency is 5, and the FADD latency is 3. So, a typical piece of P6-code might be
temp = b * c // FMUL, latency 5
do_something_else
do_something_else
do_something_else
do_something_else
a = a + temp // FADD, latency 3
do_something_else
do_something_else
store (a)
= 9 clocks
A good programmer will be able to get the most out of the processor by getting it to do other useful things while waiting for the result of (b*c) and (a = a + temp).
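To make that concrete, here's a minimal sketch of what such a hand-scheduled fragment might look like in x87 assembler (NASM-style syntax; a, b, c, e, f and g are purely illustrative single-precision memory operands, not code from any real product):
; a = a + (b * c), with the independent e = f * g slotted in to fill the FMUL latency
fld   dword [b]      ; st0 = b
fmul  dword [c]      ; st0 = b*c              (FMUL issued; result not ready for ~5 clocks on P6)
fld   dword [f]      ; st0 = f,   st1 = b*c   (independent work starts straight away)
fmul  dword [g]      ; st0 = f*g, st1 = b*c   (second FMUL overlaps the first)
fxch  st1            ; st0 = b*c, st1 = f*g   (FXCH is essentially free on the P6-core)
fadd  dword [a]      ; st0 = a + b*c          (FADD issued; ~3 more clocks until usable)
fxch  st1            ; st0 = f*g, st1 = a + b*c
fstp  dword [e]      ; store e = f*g and pop
fstp  dword [a]      ; store a and pop
The ordering is the whole game: the FMULs and the FADD are started as early as possible and the stores pushed as late as possible, so the latencies overlap with useful work instead of with stalls.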
Now, let's take a look at the P4. Its longer pipeline, made up of more, simpler stages, allows much higher clock speeds than a P6-core. But, because each stage does less, the instructions need more stages to get through the pipe, hence longer latency. In the P4 FPU, FMUL has a latency of 7, and FADD has a latency of 5, versus 5 and 3 on the PIII.
So here's the same code, executing on P4:-
temp = b * c // FMUL, latency 7
do_something_else
do_something_else
do_something_else
do_something_else
*****
*****
a = a + temp // FADD, latency 5
do_something_else
do_something_else
*****
*****
store (a)
= 13 clocks
So, first thing to note is that the code takes longer to execute. Normally, this isn't actually as big a problem as you'd think, as you can "fill in the blanks" with other useful things - the =overall throughput=, in terms of operations-per-second, is just as high, because the chip can still execute one instruction every clock (a gross oversimplification, but it illustrates the point) - but look at those rows marked *****.
The point is, in this piece of P6-code, the programmer has arranged 4 clocks' worth of work to do while the FMUL result is being calculated, and 2 clocks' worth for the FADD. But the P4 needs longer than this, so it is _idle_, or stalled, for two clocks at the end of the FMUL, and another two clocks for the FADD.
Conclusion: on tightly optimised P6-core code, targeted at Pentium Pro/2/3 chips - as most high-performance FPU code is - the FPU core of the P4 will run up to 50% slower than the equivalent P3, clock for clock! (In the toy example above, 13 clocks versus 9 is already a ~44% increase in execution time.)
The less tightly targeted the code is (and hence, the more leeway it gives the processor's own scheduler), the smaller the problem. But for benchmarks like MPEG4, where every instruction counts, the P4 simply stinks at running P3 code.
The good news for Intel is that =most= code can be rewritten in such a way as to minimize this performance hit. With a bit of care, the performance difference versus a P3, clock for clock, can be reduced to something less than 10% in many cases, which will be enough to save the P4's bacon once its other advantages are taken into account.
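As a rough sketch of what that kind of rewrite involves (again NASM-style x87 with purely illustrative labels - this is just one way of doing it): instead of scheduling one multiply-accumulate at a time, you interleave two or more independent ones, so there are always instructions ready to issue while the 7-clock FMULs and 5-clock FADDs are in flight:
; Two independent accumulations, a1 = a1 + b1*c1 and a2 = a2 + b2*c2, interleaved
; so that while one chain is waiting on its FMUL/FADD result, the other has work to issue
fld   dword [b1]     ; st0 = b1
fmul  dword [c1]     ; st0 = b1*c1                   (FMUL #1 starts, ~7 clocks on P4)
fld   dword [b2]     ; st0 = b2,    st1 = b1*c1
fmul  dword [c2]     ; st0 = b2*c2, st1 = b1*c1      (FMUL #2 overlaps FMUL #1)
fxch  st1            ; st0 = b1*c1, st1 = b2*c2
fadd  dword [a1]     ; st0 = a1 + b1*c1              (FADD #1 starts, ~5 clocks)
fxch  st1            ; st0 = b2*c2, st1 = a1 + b1*c1
fadd  dword [a2]     ; st0 = a2 + b2*c2              (FADD #2 overlaps FADD #1)
fxch  st1            ; st0 = a1 + b1*c1, st1 = a2 + b2*c2
fstp  dword [a1]     ; store a1 and pop
fstp  dword [a2]     ; store a2 and pop
The cost is a bigger rewrite of loops that were already carefully tuned for the P6's shorter latencies - which is exactly why it only happens if developers think the P4 is worth the effort.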
The question, however, is - will anyone bother? Code specially targeted to the Athlon, as opposed to the usual P6-code which just happens to run OK, can be up to 50% faster than =either= Intel processor, clock for clock. If Athlons continue to sell well and P4 doesn't, well, you can see which way it will go.
I, for one, hope Intel wins this one, for two reasons. First, the P4's memory subsystem kicks huge quantities of butt, and the overall design has headroom to hit huge clock speeds. And second, because an uncompetitive Intel churning out P4s slowed to a crawl by lack of code support would be just as bad news for us performance freaks as an uncompetitive AMD was in the days before the Athlon. Fast P4s = faster/cheaper Athlons = happy users!
==============================
== http://www.fxpansion.com ==
==============================