Intel responds to AMD's POV display.

In simple words: if THAT quad-socket K10 system (with 6GB of RAM) had been using v3.6, the required RAM would have been multiplied by 16, more or less guaranteeing heavy HDD swapping.

In even simpler words, if AMD was dumb enough to do that then they deserve all this bad press about 10 times over 😉
 
Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not streaming SIMD (SSE stands for Streaming SIMD Extensions) at all, but really double-precision FP with random register access;

Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
march-comp-sm.png


Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8, and thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).

This explains why AMD chose to target its integer performance and ignore this one downfall, and also why you should read the article.
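
To picture what point (1) means, here is a minimal, hypothetical C++ sketch (this is NOT POV-Ray source, just an assumed illustration of the two access patterns): the first function is the long, contiguous loop that streaming SIMD was built for; the second is ray-tracer-style scalar double-precision math with data-dependent, scattered accesses, where per-clock DP throughput and latency matter far more than SIMD width.

```cpp
// Hypothetical sketch of the two access patterns discussed above.
// This is NOT POV-Ray source code, only an illustration of the difference
// between a streaming, SIMD-friendly loop and scalar double-precision math
// with irregular ("random") data access.
#include <cstddef>

// Streaming pattern: long, contiguous arrays, same operation on every element.
// A compiler can keep this in packed SSE2 registers and prefetch linearly.
void scale_stream(const double* in, double* out, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}

// Ray-tracer-like pattern: small vectors, data-dependent branching and
// scattered loads (e.g. chasing nodes of a spatial tree). Most of the work
// ends up as scalar double-precision FP, so raw DP operations per cycle and
// load/register latency dominate, not SIMD width.
struct Vec3 { double x, y, z; };

double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

struct Node { Vec3 normal; double d; const Node* left; const Node* right; };

// Walk a (hypothetical) tree, choosing a child based on each scalar FP result -
// the "double-precision FP with random register access" pattern.
double probe(const Node* n, const Vec3& dir) {
    double acc = 0.0;
    while (n) {
        double t = dot(n->normal, dir) + n->d;   // scalar DP FP
        acc += t;
        n = (t > 0.0) ? n->left : n->right;      // data-dependent pointer chase
    }
    return acc;
}
```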

I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the ARS links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan
 
That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best.
It's not simple speed scaling I am talking about (I am also overlooking their stupid claim that POV-Ray usually gains 100% 8O when doubling cores, and that those 'inexperienced' sites have tested on slower CPUs - don't know why the heck a slower CPU would be 20% less efficient than a faster one - and gotten only ~85% gain);
v3.6 uses a hell of a lot more memory than v3.7; it uses the scene's memory multiplied by the number of cores rendering it (each core needs its own copy), while v3.7 uses only one copy for all the threads.
In simple words: if THAT quad-socket K10 system (with 6GB of RAM) had been using v3.6, the required RAM would have been multiplied by 16, more or less guaranteeing heavy HDD swapping.
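
To make the 3.6-vs-3.7 memory model concrete, here is a minimal C++ sketch (my own illustration under the assumptions above, not actual POV-Ray code): in the first loop every worker captures its own private copy of the scene, so memory scales with core count, while in the second all threads share one read-only copy.

```cpp
// Minimal sketch (not POV-Ray code) of the two memory models described above:
// "3.6-style" rendering, where every worker gets its own private copy of the
// scene, versus "3.7-style", where all threads share one read-only copy.
#include <thread>
#include <vector>

struct Scene { std::vector<double> geometry; };   // stand-in for parsed scene data

void render_slice(const Scene& scene, int slice) {
    // read-only traversal of the scene for one band of the image
    (void)scene; (void)slice;
}

int main() {
    const unsigned cores = std::thread::hardware_concurrency();
    Scene scene{std::vector<double>(1'000'000)};  // pretend this is a big scene

    // 3.6-style: one private copy per worker -> memory use scales with core count.
    {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < cores; ++i)
            workers.emplace_back([copy = scene, i] { render_slice(copy, int(i)); });
        for (auto& t : workers) t.join();
    }

    // 3.7-style: all threads reference the same copy -> memory use stays constant.
    {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < cores; ++i)
            workers.emplace_back([&scene, i] { render_slice(scene, int(i)); });
        for (auto& t : workers) t.join();
    }
}
```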

With all due respect - I believe that most people got the point about 3.6 using A LOT of RAM the first time you mentioned it, and the rest got it with the second post on the subject. It is a fact that the 3.7 beta was used for the test and it made use of all available cores (i.e. 16). 4933 > 4000.
 
I think AMD is sane and wanted to show a scaling improvement. Since we don't know the frequency of any of these CPUs, it really isn't showing anything more than the scaling improvements. As has been said, AMD would truly deserve to be kicked if they tried to showcase this as an improvement in performance other than scaling.
 
:lol: The only one who didn't get it is you. Did you miss the:
...more or less guaranteeing heavy HDD swapping.
If that system started swapping to the HDD after exhausting ALL the RAM, the render time can slow down VEEEEERY much, because the RAM bottlenecks the CPU, and it does not really matter whether one CPU is better or worse than the other once it starts reading from the HDD :roll:
Short story: last week I had to render a file which, in the pre-render phase, was 1.2GB. My 3.0GHz P4 at work, with 2GB of RAM, crunched one shot in ~20 minutes, while my home X2 4200+ (much faster, but with only 1GB of RAM) only managed to do it in 28 minutes, and if you opened the task manager, you could easily understand why :wink:
 
Any good version that I can run on my desktop? You know, just to test my old beater.
To get Povray running, you need to get the 3.6 executable package from:

http://www.povray.org/download/

then get the 3.7 zip package:
http://www.povray.org/beta/

Install 3.6 then unzip 3.7.

At least they finally updated the beta so that it isn't already expired, which used to force you to change the computer clock just to get it to run.
 
:lol: The only one who didn't get it is you. Did you miss the:
...more or less guaranteeing heavy HDD swapping.
If that system started swapping to the HDD after exhausting ALL the RAM, the render time can slow down VEEEEERY much, because the RAM bottlenecks the CPU, and it does not really matter whether one CPU is better or worse than the other once it starts reading from the HDD :roll:
The Povray benchmark uses about 30MB of memory. Here Xbitlabs' V8 system scored 4677 with 4GB of memory:

http://www.xbitlabs.com/articles/cpu/display/intel-v8_11.html
 
Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not streaming SIMD (SSE stands for Streaming SIMD Extensions) at all, but really double-precision FP with random register access;

Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
march-comp-sm.png


Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8, and thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).

This explains why AMD chose to target its integer performance and ignore this one downfall, and also why you should read the article.

I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the ARS links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan

Wanna know what's funny... looks like Ars Technica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu

And also here.. Wikipedia P6

Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M.
 
I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 K8 cores (2 sockets x 2 cores) versus 8 K10 cores (2 sockets x 4 cores), not 8 vs. 16 cores. And if that was even the 2.1GHz Barcelona (not the lowest-clocked 1.9GHz one), a 2.5GHz Barcelona would improve that figure by ~23%, pulverizing Intel's score at above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3GHz, 8-core V8, a 2.5GHz 4x4 Agena will do better IF the one that scored ~4000 was at 2.1GHz or lower. So, in the end, this is more a point in favor of AMD than of Intel.

Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.
Is this yet another case of seeing only what you want to see? POVRay has no documented issue scaling past 2 sockets, yet you believe m25 right away.

I don't think there can be a dispute that the machine on the left loaded 8 cores and the machine on the right, 16. Watch the original video again - http://www.youtube.com/watch?v=VGiv9Dtrc5Q. From 0:45 to 1:10, pause or take a snapshot and count up the columns representing cores in task manager - on the left, clearly 8, and on the right, I know the video is blurry, but it's at least pretty close to 16.

Observe that the overall load represented by the separate green bars at the left of task manager remained at around 100%. That proves there was no substantial penalty from disk swapping during the videotaped portion. The benchmark only takes <30MB here on 32-bit Windows 2003 Server (probably the OS they were using); multiplying that by 16 cores and 2x for worst-case 64-bit memory scaling, the total comes out to 960MB RAM, so total memory is a non-issue.
 
Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not streaming SIMD (SSE stands for Streaming SIMD Extensions) at all, but really double-precision FP with random register access;

Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
march-comp-sm.png


Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8, and thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).

This explains why AMD chose to target its integer performance and ignore this one downfall, and also why you should read the article.

I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the ARS links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan

Wanna know what's funny... looks like Ars Technica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu

And also here.. Wikipedia P6

Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M.

I wouldn't trust the PowerPoint from UVa either. It doesn't cite any sources or the credentials of the author (or even who the author is for that matter.) Plus it contradicts your table. The PPT says that Netburst had a pipeline depth of 20, yet the table says 21+8.

I don't think Wiki can be trusted here either. (We all know all too well that Wiki is to be taken with a grain of salt.) It states that the Pentium Pro was 14 stages and P3 was 10. Heck, the Wiki article even contradicts itself regarding the length of the PM pipeline. I cannot find any other source that states the pipeline was changed from PP to P3.

I only brought it up because I don't know how relevant the table is to the thread if it's not even accurate.

Ryan
 
What I meant was, K8 didn't perform well with SSE, and apparently neither does K10.
Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design.
They are scaling somehow though; using what improvements? Ideas? I know POV-Ray isn't K8-friendly, so maybe even without any changes (SSE, SMP) it still scales well.
 
Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not streaming SIMD (SSE stands for Streaming SIMD Extensions) at all, but really double-precision FP with random register access;

Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
march-comp-sm.png


Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8, and thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).

This explains why AMD chose to target its integer performance and ignore this one downfall, and also why you should read the article.

I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the ARS links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan

Wanna know what's funny... looks like Ars Technica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu

And also here.. Wikipedia P6

Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M.

I wouldn't trust the PowerPoint from UVa either. It doesn't cite any sources or the credentials of the author (or even who the author is for that matter.) Plus it contradicts your table. The PPT says that Netburst had a pipeline depth of 20, yet the table says 21+8.

I don't think Wiki can be trusted here either. (We all know all too well that Wiki is to be taken with a grain of salt.) It states that the Pentium Pro was 14 stages and P3 was 10. Heck, the Wiki article even contradicts itself regarding the length of the PM pipeline. I cannot find any other source that states the pipeline was changed from PP to P3.

I only brought it up because I don't know how relevant the table is to the thread if it's not even accurate.

Ryan

The table is 100% accurate.

Manchester.edu
INTEL.com <------

The idea of out-of-order execution, or executing independent program instructions out of program order to achieve a higher level of hardware utilization, was first implemented in the P6 microarchitecture. Instructions are executed through an out-of-order 10 stage pipeline in program order.

Netburst Northwood has a 21-stage pipeline (depending on how you count it).
Intel.com: Willamette had a 20-stage pipeline.

The +8 represents the much-despised and ridiculed mispredictions. It's not an exact number but rather an estimate: see, although the whole 20-21 stage pipeline is what gets flushed on a misprediction, the Netburst architecture actually mispredicted quite often, more so than any other architecture thus far. Although Intel's Rapid Execution Engine (ALUs running at twice the clock speed) helps alleviate some of the performance cost of a misprediction, it's not enough. The architecture itself is VERY inefficient.
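
As a back-of-the-envelope illustration of why deep pipelines punish mispredictions (every number below is assumed, purely illustrative, not a measurement of Netburst), the average cost works out roughly like this:

```cpp
// Back-of-the-envelope illustration with assumed numbers: each misprediction
// throws away roughly a full pipeline's worth of in-flight work, so the
// penalty per instruction grows with pipeline depth.
#include <cstdio>

int main() {
    const double branch_fraction = 0.20;  // assume ~1 in 5 instructions is a branch
    const double mispredict_rate = 0.05;  // assume ~95% predictor accuracy
    const double flush_penalty   = 20.0;  // ~pipeline depth in cycles (deep-pipeline class)
    const double base_cpi        = 1.0;   // idealized cycles per instruction

    double penalty_per_instr = branch_fraction * mispredict_rate * flush_penalty;
    double effective_cpi     = base_cpi + penalty_per_instr;

    std::printf("added cycles per instruction: %.2f\n", penalty_per_instr); // 0.20
    std::printf("throughput loss vs. ideal:    %.0f%%\n",
                100.0 * penalty_per_instr / effective_cpi);                 // ~17%
}
```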
 
I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 K8 cores (2 sockets x 2 cores) versus 8 K10 cores (2 sockets x 4 cores), not 8 vs. 16 cores. And if that was even the 2.1GHz Barcelona (not the lowest-clocked 1.9GHz one), a 2.5GHz Barcelona would improve that figure by ~23%, pulverizing Intel's score at above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3GHz, 8-core V8, a 2.5GHz 4x4 Agena will do better IF the one that scored ~4000 was at 2.1GHz or lower. So, in the end, this is more a point in favor of AMD than of Intel.

Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.
Is this yet another case of seeing only what you want to see? POVRay has no documented issue scaling past 2 sockets, yet you believe m25 right away.

I don't think there can be a dispute that the machine on the left loaded 8 cores and the machine on the right, 16. Watch the original video again - http://www.youtube.com/watch?v=VGiv9Dtrc5Q. From 0:45 to 1:10, pause or take a snapshot and count up the columns representing cores in task manager - on the left, clearly 8, and on the right, I know the video is blurry, but it's at least pretty close to 16.

Observe that the overall load represented by the separate green bars at the left of task manager remained at around 100%. That proves there was no substantial penalty from disk swapping during the videotaped portion. The benchmark only takes <30MB here on 32-bit Windows 2003 Server (probably the OS they were using); multiplying that by 16 cores and 2x for worst-case 64-bit memory scaling, the total comes out to 960MB RAM, so total memory is a non-issue.

Typical AMD fanpoi behavior. m25's point was debunked yet you still believe him Baron. I guess you didn't read the other posts?
 
Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not streaming SIMD (SSE stands for Streaming SIMD Extensions) at all, but really double-precision FP with random register access;

Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
march-comp-sm.png


Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8, and thus half the speed of Core 2 per clock.

Not all apps use these optimizations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimization, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).

This explains why AMD chose to target its integer performance and ignore this one downfall, and also why you should read the article.
If this is the case, then it might mean we have another R600 here: one that has extreme power in one area, but a few huge bottlenecks that completely hamper its performance.

Makes me wonder a bit, really, if we have R600 delays because they can't beat Core 2, or if they really do have legit reasons.

btw...I fixed most of your spelling errors :wink:

Not really... as I've tried to explain, this only applies to apps that rely on double-precision floating-point operations. POV-Ray is one such app. So yes, Barcelona seems to be bottlenecked so far in this particular app, if the explanation given proves to be true.

It makes sense to me. But make no mistake, this is not an indicator of its processing power in the general-usage apps that we enjoy, which are mainly games, video/audio encoding and video/audio transcoding.
 
So the whole AMD K10 quad-socket setup would not have had any issues being supported by POV-Ray. Of course, Windows support is another thing entirely, and luckily for us most of these demos are run on Server versions of Windows.

They would have had to use Windows Server 2003 as the OS, as Windows XP and Vista will only work with two sockets, and the POV-Ray 3.7 beta has not been released for Linux quite yet (grumble grumble). The only build that'll work on OS X or Linux is 3.6, and that does not support SMP no matter the OS.
 
Man, I seriously hope that POV-Ray has scaling issues; if it doesn't, and that's the performance of 16 K10 cores compared to 8 Clovertown cores (even though the latter are clocked roughly 45% higher, if what I've heard is true), then we should be worried about K10's performance.

It won't be.... at least I don't think it will. I fully expect K10 to obliterate Core 2 per clock and give Penryn a run for its money.
 
The Intel link is trustworthy. Add one to the inaccurate Wiki articles total. That is also the first time I can think of that Ars Technica was flat-out wrong. I take it the table is yours, based on how defensive you were about it. I meant nothing more than to question an inconsistency I saw. Thank you for clearing it up.

Ryan
 
Must be Server 2003 because according to the M$ website, XP x64 still only supports 2 sockets.

Link

Multiprocessing and multicore processor support
Windows XP Professional x64 Edition is designed to support up to two single or multicore x64 processors for maximum performance and scalability.

Ryan
 
The kernel will support up to 32 cores and doesn't technically care about the number of sockets, but the Mother-may-I system that Microsoft installed sure does. They don't want somebody to pay only $399 for XP Professional x64 and run it on a >2-socket machine when the $999 Windows Server 2003 x64 is what MS wants you to buy instead. The same thing is true for XP Home vs. XP Professional. Same kernel, but the OS won't let the kernel use more than one socket with Home. God forbid somebody not pay them the extra hundred dollars.
 
Didn't Intel and MS have a spar a long time ago about licensing Windows at a per core level (i.e. what constitutes a "processor")? Anyone know what the final outcome was? I was under the impression that most consumer MS OS's are licensed for 1-2 CPUs (which I assume means physical CPU package...so could include 2 socket), while server OS's are licensed per CPU. Is it in fact per core? What about virtual cores? Obviously, I don't have any knowledge in this area and focus just on making sure my computers at home are "legal" :-D
 
Hi there,

Yeah, it's licensed per CPU, with Windows XP Pro capable of using two CPUs even if those CPUs are dual-core or even quad-core, whereas Windows XP Home can only use a single CPU even if it's dual-core or quad-core.

The number of cores doesn't matter so long as they're on the same package.
 
Their latest non-beta OS series is Windows Server 2003 R2 SP2, and it still treats a dual-core as two processors, unlike XP.

Number of cores only decides what minimum version of Server you need to make use of them, but the big price variance is based on access licensing - i.e., how many employees or clients simultaneously connect to a server.
 
Their latest non-beta OS series is Windows Server 2003 R2 SP2, and it still treats a dual-core as two processors, unlike XP.

Number of cores only decides what minimum version of Server you need to make use of them, but the big price variance is based on access licensing - i.e., how many employees or clients simultaneously connect to a server.

If there is a limit to how many processors can be supported, it should be because that's all the kernel was set up for and nothing less. The "pay for the number of sockets you'll use" bit is asinine in my book, as it's a very arbitrary and needless crippling of the program. It would be one thing if there had to be a new kernel to support n cores; then a higher price would be justified for the development costs. But just saying that version X will support n cores/sockets and version Z (which is more expensive) will support 2n cores/sockets when the source code is the same is crippleware. Crippleware really, really sucks.