Why do P4's kill Athlon 64 in media encoding?

>That would likely take you back to branch prediction, which
>the P4 does poorly,

As pointed out, it actually does the prediction rather excellently; it is, however, generally still slower on branchy code than the K8 because of, among other reasons, its longer pipeline.

>Still, you see the P4 getting a bigger boost from
>dual-channel mode than the A64.

I'd like to see some good benchmarks on that before accepting it as fact. Secondly, this wouldn't necessarily contradict my claim, for several reasons I'm too lazy to explain right now.

It's my opinion that this "P4 loves bandwidth and A64 loves low latency" idea is, for the most part, just a myth; in that other long thread where I discussed this with slvrphoenix, I gave a list of reasons why people think this, but I have yet to see convincing data showing any significant difference between the A64 and the P4 in bandwidth or RAM latency scaling.

Besides, I think DDR2 pretty much disproves this theory. Even though DDR533 has considerably more bandwidth than DDR400, and even its absolute latency numbers (expressed in ns, not CAS latency!) are quite close, there is nearly no gain whatsoever.
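Quick back-of-envelope on the bandwidth side (nominal peak figures per 64-bit channel, so ballpark only):

# Peak bandwidth per 64-bit channel = transfers/s * 8 bytes.
def peak_bw_gbs(mega_transfers_per_s):
    return mega_transfers_per_s * 1e6 * 8 / 1e9

print(f"DDR-400 : {peak_bw_gbs(400):.1f} GB/s")   # ~3.2 GB/s
print(f"DDR2-533: {peak_bw_gbs(533):.1f} GB/s")   # ~4.3 GB/s, roughly a third more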

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
>I'm looking at the end result,

Be careful doing that, because CPU performance is an incredibly complex matter, depending on so many factors that you just can't decide 'it does better on streaming media encoding, so it must be because of X'. To name just one potential (and likely) reason why the P4 performs so well: its SSE2 units basically have the same IPC as the K8's (AFAIK, at least), yet run at a ~50% higher clock.
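Rough math, assuming equal per-clock SSE2 throughput (that's my assumption for illustration, not a measured figure):

# Illustrative only: if per-clock SSE2 throughput were equal, the advantage
# would just be the clock ratio. Example model clocks (P4 560 vs A64 3800+).
p4_clock_ghz = 3.6
k8_clock_ghz = 2.4
print(f"~{(p4_clock_ghz / k8_clock_ghz - 1) * 100:.0f}% more SSE2 throughput for the P4")  # ~50%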

>the P4 acting like it has a lot of latency

Please do explain... a CPU having a lot of latency?

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
Taking longer to execute.

Come on now, everything has latency; how long does it take for light to get from the sun to the earth...

<font color=blue>Only a place as big as the internet could be home to a hero as big as Crashman!</font color=blue>
<font color=red>Only a place as big as the internet could be home to an ego as large as Crashman's!</font color=red>
 
A P4 with a 64-bit 800MHz bus can't use the extra bandwidth of a 128-bit 667MHz RAM bus. See, that's an easy one.
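The arithmetic behind that, peak figures only (and peak numbers are never reached in practice):

# Peak bus bandwidth = width in bytes * effective transfer rate.
def peak_gbs(width_bits, mega_transfers_per_s):
    return width_bits / 8 * mega_transfers_per_s * 1e6 / 1e9

print(f"P4 FSB, 64-bit @ 800 MT/s: {peak_gbs(64, 800):.1f} GB/s")    # 6.4 GB/s
print(f"RAM, 128-bit @ 667 MT/s  : {peak_gbs(128, 667):.1f} GB/s")   # ~10.7 GB/s, more than the FSB can carry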

<font color=blue>Only a place as big as the internet could be home to a hero as big as Crashman!</font color=blue>
<font color=red>Only a place as big as the internet could be home to an ego as large as Crashman's!</font color=red>
 
I think it's a good observation. When you start a procedure (load a program, select a menu choice, etc.), the shorter pipeline on the A64 means you start seeing results in a matter of dozens of clock ticks compared to the P4's hundred or so. Even though the clock speed is 50% higher, it takes 75% more ticks to get the same work done (numbers in the rough - not to be taken as gospel or even real - just to show the point).

I notice it a lot between work and home. Home is an XP3200 (well, almost: 191 FSB, 2196 MHz), work is a Prescott 3.0 GHz with HT. Which is the better performer in your opinion? The P4 seems to lumber along, and there's a slight delay when you click something before it gets up to speed. The AXP is more responsive and reacts faster. Some of that could be RAM (1GB on the AMD, 512MB on the Intel), but for basic Office apps and web surfing? I doubt much, if any.

Mike.

<font color=blue>Outside of a dog, a book is man's best friend. Inside the dog its too dark to read.
-- Groucho Marx</font color=blue>
 
>I think its a good observation. When you start a procedure
>(load a prog, select a menu choice, etc.) the shorter
>pipeline on the A64 means you start seeing results in a
>matter of dozens of clock ticks compared to the P4's hundred
> or so ticks. Even though the clock speed is 50% higher, it
>takes 75% more ticks to get the same work done (numbers in
>the rough - not to be taken as gospel or even real - just to
> show the point).

You may want to rethink that; a full trip through the P4 pipeline takes about 0.000000007 seconds (3 GHz NW). I somehow doubt that by itself could cause any noticeable delays, especially when the A64 takes roughly 0.0000000055 seconds (@ 2 GHz) to do the same. Are you claiming you could notice the difference?
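To put numbers on it (pipeline stage counts are rough, from memory, so treat them as illustrative):

# Time for one instruction to travel the whole pipeline = stages / clock.
# A clock in GHz is cycles per nanosecond, so stages / GHz gives nanoseconds.
def traversal_ns(stages, clock_ghz):
    return stages / clock_ghz

print(f"P4 Northwood, ~20 stages @ 3.0 GHz: {traversal_ns(20, 3.0):.1f} ns")  # ~6.7 ns
print(f"A64 (K8),     ~12 stages @ 2.0 GHz: {traversal_ns(12, 2.0):.1f} ns")  # ~6.0 ns
# Either way it's a handful of nanoseconds - nothing a user could ever feel.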

Any difference in responsiveness is more likely caused by hard disk performance, HD fragmentation, the OS, background apps, the size of the Windows registry... your own stress level...




= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
Please do compare FSB 1066 with DDR2-533 using mediocre timings against FSB 800 with low-latency DDR-400, so both memory subsystems have comparable absolute latency (in ns). I doubt the performance difference will be anything but negligible.
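Converting timings to absolute latency is simple enough if anyone wants to set that test up (the timings below are just examples; pick pairs that match):

# CAS latency in ns = CAS cycles / memory I/O clock in MHz * 1000.
def cas_ns(cas_cycles, io_clock_mhz):
    return cas_cycles / io_clock_mhz * 1000

print(f"DDR-400  CL2: {cas_ns(2, 200):.1f} ns")     # 10.0 ns
print(f"DDR-400  CL3: {cas_ns(3, 200):.1f} ns")     # 15.0 ns
print(f"DDR2-533 CL4: {cas_ns(4, 266.67):.1f} ns")  # 15.0 ns - same ballpark as DDR-400 CL3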

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
Ok, if it's not that...

>more likely caused by harddisk performance, HD fragmentation, the OS,
>background apps, size of the windows registry,.. your own stress level, ..
Both HDDs are 7200rpm, 8MB cache, 80GB. Don't know what brand is in the work machine (it's a Dell), but at home it's a Seagate Barracuda 7200.7. Both SATA.

Work's PC is a pretty new installation (March, IIRC) and there's about 50GB of free space, so fragmentation is minimal. At home the installation is a year old, there's 6GB of space left (95% full), and I've never defragged it yet. Both XP Pro SP2 with auto updates.

At home I run the monster of CPU suckers for AV (Norton); McAfee at work, which doesn't take as much.

Don't know about the size of the registry, but off the top of my head: the home PC, with numerous games installed and a one-year-old installation with lots of installs/uninstalls, would seem to have a larger one than work, with just 3 months of use, a couple of apps, and lots of free space.

Swapfile is set static on both. Don't remember sizes, but sufficient - 1gig or more in both cases.

Stress level? I doubt it, as I'm normally more interested in 'instant response' at home when it's 'my' time than at work when it's 'their' time, and I work in the most un-stressful environment I've ever had for a job (and I've been working for over 25 years).

Besides, I'm not talking about a single instruction. The simple act of pulling down a menu is going to involve hundreds if not thousands of instructions - branches out to the language table, loading a new program module, Windows calls to give the visual effect of clicking on the menu name, drawing the menu, etc. The excessively branchy nature of that sort of operation means the P4 branch predictor is likely at its lowest accuracy point, so...

I'd also wonder if, because of the longer pipeline, the P4 translates x86 instructions into more micro-ops than the A64. I'm (almost) certain they don't both use the exact same micro-op structure. That's going to be a difficult one to determine.

Oh yeah... and F@H loaded as a service on both systems. And the one at home allows large work units, the one at work small only.

All this just leads me to that conclusion. I'm open to other possibilities though, if they are feasible.

Mike.

<font color=blue>Outside of a dog, a book is man's best friend. Inside the dog its too dark to read.
-- Groucho Marx</font color=blue>
 
>Besides I'm not talking about a single instruction. The
>simple act of pulling down a menu is going to involve
>hundreds if not thousands of instructions

Exactly, and pipeline length and the associated potential for slow responsiveness are thereby completely mitigated; only overall performance counts, and there the P4 is pretty much neck and neck with the A64. And I sincerely doubt opening a menu and that sort of operation would take longer than a millisecond of CPU time on anything as fast as or faster than a P3. Anything taking longer than that is caused by other reasons, so it's just not realistic that there would be a noticeable difference caused by different CPU architectures.
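Scale of the thing, with made-up but generous numbers (instruction count and IPC are guesses purely for the sake of argument):

# Even a few thousand instructions of branchy menu code finish in microseconds.
instructions = 10_000
ipc = 0.5             # deliberately pessimistic for branchy code
clock_hz = 2.0e9
time_us = instructions / (ipc * clock_hz) * 1e6
print(f"~{time_us:.0f} microseconds")  # ~10 us, versus the ~100 ms it takes before a delay is even perceptible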

Try this: underclock your Athlon system as much as you can and tell me if it's still snappier than the P4 at work. Or wait, you probably don't even have to, since if you enabled C&Q it's already running at 800 MHz, and there is no way an 800 MHz A64 is faster than a 3 GHz P4 on anything but the most obscure benchmark unrelated to CPU performance.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
It's an XP not a 64. No C&Q.

Assuming what you say is true, what's your diagnosis of the perception, then? I notice it on a regular basis - start program X, or use function Y of Word or Excel - and at home the response is perceptibly quicker than doing the same thing at work. Not over a long time (i.e. the whole program load), just that split second (i.e. the splash screen display).

I suppose it's possible that other factors come into play - MS mouse at home, Logitech at work; LCD at work, CRT at home; integrated graphics at work, 9600XT at home. It's possible that the combination of all these things *could* produce what I see here, but I still feel a perceptible difference.

Mike.

<font color=blue>Outside of a dog, a book is man's best friend. Inside the dog its too dark to read.
-- Groucho Marx</font color=blue>
 
>Assuming what you say is true, what is your diagnosis of the
>perception then?

Tough call, could be so many things... Just a WAG: at work you are in a domain (Active Directory or otherwise) while at home you are not? As I said, try downclocking your Athlon, and if you still perceive a snappier system, you can be assured it's NOT the CPU.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
>Memory bandwidth requirements are simply defined by the software and, to some
>extent, by how efficient the cpu is at executing the software. If the cpu is
>dog slow at certain routines, more memory bandwidth will simply not be needed.
>If you'd somehow manage to make that code run 10x or 100x faster, either
>processor would be bandwidth starved.
That pretty much says it all. After all, the P4 does run 50% faster and executes SSE2 better, so it does indeed need more bandwidth.
You also seem to forget that latency is unidirectional, while the FSB is bi-directional, so an increase in FSB will have twice the impact on bandwidth that it will have on latency.
 
But the Pentium D is a $259-600 CPU and the X2 is $600-1100.
The EE 840 is $1100, but it is Hyper-Threaded and dual-core, so it can run 4 threads.
 
We've had a lot of discussion here of a number of factors but I think I finally found something that points out clearly the advantage that Intel has in video performance.

I think it boils down to floating point performance, possibly compounded by better SSE2, SSE3 performance.

What really stood out to me was a benchmark I found on this site under the title "The Mother of All CPU Charts Part 2".
The Intel bars are blue and the AMD bars are green.
If you skip down to the <A HREF="http://www.tomshardware.com/cpu/20041221/cpu_charts-18.html" target="_new">video performance benchmarks</A>
you'll see a lot of blue (Intel) at the top of the charts.
What's impressive is that some of those near the top are $270 CPUs (P4 3.4E) beating $500-and-up CPUs (A64 4000+).

What becomes even clearer is when you look at the synthetic benchmarks under <A HREF="http://www.tomshardware.com/cpu/20041221/cpu_charts-22.html" target="_new">Whetstones</A>, which measure floating-point performance. This is completely dominated by Intel.
The same is true for the <A HREF="http://www.tomshardware.com/cpu/20041221/cpu_charts-23.html" target="_new">multimedia bench</A>: both integer and floating point are dominated by Intel.

Clearly these differences are REAL and not explainable just by saying one benchmark is optimized for Intel. This is especially so since the video performance benchmarks are real applications and they tested 4 or 5 different applications.

I may sound like an Intel fan, but believe me I am not! I just want the best performance for my money, and if you are doing heavy video processing it just seems that Intel does better, while the A64 does better in gaming.

Like everyone, I started out with Intel but switched to a 750 MHz Athlon when AMD had double the FSB and better CPU performance (including floating point). I switched back to Intel when they came out with dual-channel DDR and the 800 MHz FSB and were way ahead in GHz. I'd really like to switch back to an A64, but not if it's going to cost more money or give less performance. Right now I just can't see anything that convinces me a reasonably priced A64 is going to outperform a reasonably priced P4 (reasonably priced being $150-300).

Maybe AMD has dominated 64-bit computing, but that is still in the future - not the floating-point and multimedia performance I need right now.
 
It's like AMD and Intel pull up to a stoplight and Intel starts revving its engine, and AMD knows it's gonna lose the race by a little bit, so it goes extra slow to prove it has no intention of racing at all.
That's my story and I'm stickin' to it


The know-most-of-it-all formally known as BOBSHACK
 
The guy in the AMD-mobile also gets the ladies, whereas the Intel guy is sad and lonely.

The know-most-of-it-all formally known as BOBSHACK
 
You do however see 3 little blue heads bobbing up and down on intel's lap and this draws a lot of attention.

The know-most-of-it-all formally known as BOBSHACK
 
I may just do that.

I'm on an NT4 domain in both locations.

Mike.

<font color=blue>Outside of a dog, a book is man's best friend. Inside the dog its too dark to read.
-- Groucho Marx</font color=blue>
 
Yes, an Intel 3.6 GHz Prescott will beat an A64 3500+ in some well-optimized encoding programs; DivX is a prime example. The A64 will win most other benchmarks. On the other hand, that P4 will not beat a similarly priced A64 3800+. True, the Intel will still win half the encoding benches. The AMD chip will win the other half of the encoding benches and eat the Intel chip in all other benchmarks.
Add to that the simple fact that without high-end extreme cooling the Intel system will spend a good deal of time throttling, and you have a good understanding of why AMD offers the better option for everyone right now.
 
>You also seem to forget that latency is unidirectional, while
> FSB is bi-directional, so an increase in fsb will have twice
> the impact on bandwidth that it will have on latency.

Nice logical fallacy, I almost fell for it :) But a 10% increase in FSB will still reduce (FSB) latency by about 10%, not 20% as you seem to suggest. Think about it...
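Worked out, since it's pure arithmetic:

# A 10% higher FSB clock shortens each bus cycle to 1/1.1 of its old length
# (strictly ~9% less, call it 10%) and raises peak bandwidth by 10%.
boost = 1.10
print(f"latency: -{(1 - 1 / boost) * 100:.1f}%")   # -9.1%
print(f"bandwidth: +{(boost - 1) * 100:.0f}%")     # +10%
# Nowhere does a 20% effect appear.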

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
You mean an increase of 10% on memory calls and 10% on memory. Not quite true, though, as memory isn't the only thing using the FSB on Intel chips. The actual gain is marginally better than 10% for each.
The thing is, getting more calls out sooner helps to negate the effect of latency, especially with the streaming extensions. Then the increased memory bandwidth gives a further boost on the return.
I agree that doesn't work out to a 20% gain, but I think it may account for more of the total gain than latency does.
 
>What becomes even more clear is when you look at the
>synthetic benchmarks under Whetstones which is floating
>point performance.

Whetstone and Dhrystone are nonsense benchmarks these days; they're minuscule pieces of 1980s code that simply don't reflect anything useful. They used to be somewhat useful for gauging performance in technical computing ("supercomputers"), but even there they're considered obsolete. Just looking at those graphs should tell you that much; try and find me one real-world app that runs faster on a 2.4 GHz Northwood than on an FX-55. Writing an app that reports the clock speed and adds the square root of the amount of cache would be a more meaningful performance indicator.
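Tongue in cheek, but that joke "benchmark" would literally be three lines - and unlike Whetstone it would at least not rank a 2.4 GHz Northwood above an FX-55:

# The sarcastic "benchmark" described above: clock speed plus the square root of cache size.
# Purely a joke, to show how little such synthetic scores tell you about real apps.
import math

def joke_score(clock_mhz, l2_cache_kb):
    return clock_mhz + math.sqrt(l2_cache_kb)

print(joke_score(2400, 512))    # 2.4 GHz Northwood, 512 KB L2
print(joke_score(2600, 1024))   # FX-55, 1 MB L2 - scores higher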

In general, synthetic benchmarks are totally useless for comparing performance between different architectures. They're only useful as an analysis tool if you know exactly what the benchmark does; they can help provide insight into what bottlenecks there might be (think apps like cachemem), but since no one *runs* synthetic apps, why should you care about the results, especially if they do not reflect actual application performance?

>This is completely dominated by Intel.
>The same is true for the multimedia bench both integer and
>floating point are dominated by Intel.

It's not as easy as that. In fact, in x87 floating point, the AMD chips perform considerably better. Still, it's true that NetBurst is generally better at video encoding, mostly because of its SSE2 performance. Other reasons include software optimization; Intel provides excellent compilers and libraries which perform very well, although of course primarily for their own CPUs.

Lastly, performance really does depend on what codec/app you use, and even the workload can make a huge difference. I've seen tests where, for instance, a certain CPU would beat the other by a large margin in 3ds Max (or a similar 3D renderer) using a certain scene, yet with the very same app, rendering a different scene, the tables turned. Now Intel provides reviewers with 'guidelines', which are more often than not at least partially followed, and guess which of the two scenes they will recommend? How come so many review sites use the exact same scenes?

Anyway, just a quick link to prove my point. Have a look here:
<A HREF="http://www.digit-life.com/articles2/intelamdcpuroundupvideo/" target="_new">http://www.digit-life.com/articles2/intelamdcpuroundupvideo/</A>

Yes, it's dated, but just look how different CPUs excel with one codec and fail with another. Everyone (especially THG) uses DivX to measure encoding speed, while XviD is the more popular codec (it's free, and generally better). Coincidence?



In short: don't believe everything you read on THG, but it's still true that NetBurst generally has an edge over the K8 in media encoding. But if you mainly use a single app/codec for your CPU-hungry conversions, you should find a benchmark that uses that specific app/codec and base your purchase decision on that. There's not much point in getting CPU A if it is generally better at encoding but offers considerably less bang/$ on the app *you* need it for.

One last thing: whatever CPU you end up buying, make sure it's 64-bit capable. Expect fairly considerable speedups for media encoding once 64-bit ports are released/mature.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 
You're not making too much sense here; using correct terminology might help get your point across.

>You mean an increase of 10% on memory calls, and 10% on
>memory

No, I meant it doesn't matter that memory access latency works "both ways" whereas bandwidth works "one way". If you decrease the latency by increasing the clock speed of the FSB and RAM, you will get at best a linear speedup with that clock speed. 10% faster FSB/RAM can never result in more than 10% better performance on anything, even low-level synthetic performance tests - and to get that 10% speedup, your benchmark would have to be 100% RAM-latency bottlenecked.

The exact same thing applies to bandwidth: a 10% clock increase (of FSB/RAM) can only result in up to 10% better performance.

Worse even: 10% higher bandwidth and 10% lower latency can only result in up to 10% better application performance, never more - you can't add those percentages. In fact, to ensure 10% faster application performance, ALL performance-sensitive clocks would have to get a 10% boost, so you would need a 10% faster CPU clock AND a 10% faster FSB AND 10% faster RAM (both bandwidth and latency). Only then COULD you see an actual 10% performance boost. Performance can't scale superlinearly with the clock speed of anything, only with things like cache *size* (doubling the cache can in theory increase performance by more than 2x on low-level synthetic benches, if the bigger cache can hold the entire dataset and the smaller one can't, since the L2 cache is much more than 2x faster than RAM).
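If you want to see the "you can't add those percentages" point with numbers (the memory-bound fractions below are made up purely for illustration):

# Amdahl-style bound: only the memory-limited fraction f of run time shrinks
# when memory gets 10% faster; the rest of the time is unaffected.
def speedup(f_memory_bound, mem_boost=1.10):
    return 1 / ((1 - f_memory_bound) + f_memory_bound / mem_boost)

for f in (0.2, 0.5, 1.0):
    print(f"memory-bound fraction {f:.0%}: {(speedup(f) - 1) * 100:.1f}% faster")
# 20% -> 1.9%, 50% -> 4.8%, 100% -> 10.0%: the 10% clock boost is the ceiling, never exceeded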

>I agree that that doesn't work out to a 20% gain, but think
>that it may account for more of the total gain than latency
>does.

Nope. Increase any or all clocks by X% and performance will increase by Y%, where Y < X. The only possible exceptions are where you change other things as well, like running the FSB in or out of sync.

= The views stated herein are my personal views, and not necessarily the views of my wife. =