Intel responds to AMD's POV display.

carlhungis · May 28, 2007

From our favorite rag..
http://theinquirer.net/default.aspx?article=39896

..."That said, the numbers speak for themselves, 4933, almost 5000, with only 8 cores."

I wonder how AMD is going to react to this. Maybe they will release more info about their test system to show if there is going to be enough headroom to beat Intel or not.

m25 · May 28, 2007

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.
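For reference, here is a quick check of the clock-scaling arithmetic in the post above. It is only a sketch: it assumes the score scales perfectly linearly with clock speed, which is optimistic. The ~4000 baseline, the 4933 Intel figure and the 1.9/2.1/2.5 GHz clocks are the numbers quoted in this thread, and the function name is made up for illustration.

```python
def scaled_score(base_score, base_ghz, target_ghz):
    """Project a benchmark score to another clock, assuming perfectly linear scaling."""
    return base_score * target_ghz / base_ghz

INTEL_SCORE = 4933     # 8-core figure quoted from the Inquirer article
AMD_DEMO_SCORE = 4000  # the approximate "~4000" mentioned in the thread

for base_ghz in (1.9, 2.1):
    projected = scaled_score(AMD_DEMO_SCORE, base_ghz, 2.5)
    gain_pct = (2.5 / base_ghz - 1) * 100
    verdict = "above" if projected > INTEL_SCORE else "below"
    print(f"{base_ghz} GHz -> 2.5 GHz: +{gain_pct:.1f}%, ~{projected:.0f} ({verdict} {INTEL_SCORE})")
```

Under these assumptions a 1.9 GHz baseline projects to about 5263 and a 2.1 GHz baseline to about 4762. The roughly 23% gain needed to pass 4933 would require the demo chip to have run at about 2.03 GHz or lower.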

r0ck · May 28, 2007

http://www.uberpulse.com/us/2007/05/quadcore_opteron_demo_barcelona_twice_as_fast.php
At what point in the video does he say it recognizes only 2 sockets? He says that Intel's Core is only capable of 2 sockets, then shows the 4-socket machines, then Task Manager clearly shows 8 and 16 cores.

bonkers · May 28, 2007

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.

http://www.povray.org/download/

The most significant change from the end-user point of view between versions 3.6 and 3.7 is the addition of SMP (symmetric multiprocessing) support, which in a nutshell allows the renderer to run on as many CPU's (or cores) as you have installed on your computer. This will be particularly useful for those users who intend purchasing a dual-core CPU or who already have a two (or more) processor machine.

m25 · May 28, 2007

The change from 3.6 to 3.7 refers to memory usage; 3.6 sucked so much that each thread allocated its own copy of the object in RAM, so to use 8 cores you used 8 times more RAM than you would normally use. However, I haven't seen any POV-Ray documentation referring to the maximum number of cores and sockets, and even some of the most advanced renderers have the core count limited to 8, while a normal copy of Windows won't let them run on more than 2 sockets.

ElMoIsEviL · May 28, 2007

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.

Wonder where you got that info from? On the other hand I opted to email POV-Ray at this address: warp@iki.fi (on their site)... here's the exchange.

My Email:

Hi there,

I just wanted to know if POV-Ray supports multi-core CPUs, and also how many separate sockets POV-Ray supports, as I was thinking about getting a 4-socket dual-core Opteron setup (8 cores total).

Would POV Ray support such a configuration?

Thanks,

Their Response:

Hello,

We apologize for any delay in our response to your question. To answer your question, POV-Ray supports any number of sockets, which means there are no limits on how many sockets one uses. POV-Ray also supports multi-core CPUs, including the newer Intel Clovertown quad-core CPUs.

Thank you for your patience,

Alex
warp@iki.fi

So the whole AMD K10 quad-socket setup would not have had any issues being supported by POV-Ray. Of course Windows support is another thing entirely, and luckily for us most of these demos are run on Server versions of Windows.

m25 · May 28, 2007

OK, looks like they support any number of cores. However, it looks like a bit of miraculous scaling, from choppy, not to say prohibitive, multithreaded support in v3.6 to an infinity of perfectly smooth, available threads in 3.7 :roll:

Ycon · May 28, 2007

I don't think that advanced software is limited to 2 sockets (leaving aside that socket count would be a weird limitation).

Even Cinebench for example, which is useful for err... zero things does support as many CPUs as you have.
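As a rough illustration of why a socket limit would be unusual: applications normally ask the operating system how many logical processors are available and size their worker pool from that, without caring how those processors are split across sockets. The sketch below uses only Python's standard library. It is not POV-Ray's code, and `render_tile` is a made-up placeholder for real work.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def render_tile(tile_index):
    """Hypothetical stand-in for rendering one tile of an image."""
    return tile_index * tile_index

def render_all(tile_count):
    # os.cpu_count() reports logical CPUs; sockets are not visible here.
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_tile, range(tile_count)))

print(os.cpu_count(), render_all(4))
```

In CPython, threads will not actually speed up CPU-bound work like this; the point is only how the worker count gets chosen.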

m25 · May 28, 2007

And don't forget that most probably they got the crappy v3.6 (see the radical changes between that and 3.7) for their tests, because 3.7 is still in beta and I doubt they took beta software for that test:
http://www.povray.org/download/

ElMoIsEviL · May 28, 2007

And don't forget that most probably they got the crappy v3.6 (see the radical changes between that and 3.7) for their tests, because 3.7 is still in beta and I doubt they took beta software for that test:
http://www.povray.org/download/

AMD would have taken the beta. AMD and Intel will both use beta software that suits their needs, much like ATi or nVIDIA use beta or alpha versions of games to try to sell their superiority.

AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.

http://abinstein.blogspot.com/2007/05/pov-ray-benchmark-and-amds.html

There's your info. AMD was using version 3.7 BETA.

It turns out that none of these questions is appropriate. Because - (1) PoV-Ray's usage of SSE2 is not SSE (Streaming SIMD Extensions) at all, but really double-precision FP with random register access; (2) PoV-Ray SSE seems to be optimized more specifically for Core 2 than anything else, where on K8 it is only about 5% faster than PoV-Ray x87. This is also not going to change with K10.

Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design.

So again it's SSE coming to bite AMD in the butt. And this will probably also be AMD's undoing in many professional apps. They really need to get their SSE performance up to par with Intel.

m25 · May 28, 2007

I don't think that advanced software is limited to 2 sockets (leaving aside that socket count would be a weird limitation).

Even Cinebench for example, which is useful for err... zero things does support as many CPUs as you have.

If you read how v3.6 handled multithreading, you won't think it's that advanced any more. Say you have a 1G object to load into RAM and run the render calculations on: with v3.6 you have to have one copy of the object FOR EACH THREAD 8O , so basically a 16-core Barcelona setup needs 16G of RAM instead of the 1G needed by v3.7. So if those systems had 6G of RAM and that scene was near 500MB, the Barcelona system should have swapped a lot to the HDD.
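The memory argument above can be put into numbers with a small sketch. It simply follows the description given in this post (one scene copy per thread versus one shared copy); whether v3.6 really behaves this way is not verified here, and the function name is invented for illustration.

```python
def scene_memory_mb(scene_mb, threads, per_thread_copy):
    """Memory needed for scene data under the two schemes described in the thread."""
    return scene_mb * threads if per_thread_copy else scene_mb

RAM_MB = 6 * 1024  # the 6G system mentioned in the post
SCENE_MB = 500     # the "near 500MB" scene
THREADS = 16       # one render thread per core

for label, per_thread in (("one copy per thread", True), ("one shared copy", False)):
    need = scene_memory_mb(SCENE_MB, THREADS, per_thread)
    fits = "fits in RAM" if need <= RAM_MB else "exceeds RAM, would swap"
    print(f"{label}: {need} MB ({fits})")
```

With these figures the per-thread scheme needs 8000 MB against 6144 MB of RAM, while the shared scheme needs only 500 MB.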

ElMoIsEviL · May 28, 2007

I don't think that advanced software is limited to 2 sockets (leaving aside that socket count would be a weird limitation).

Even Cinebench for example, which is useful for err... zero things does support as many CPUs as you have.

If you read how v3.6 handled multithreading, you won't think it's that advanced any more. Say you have a 1G object to load into RAM and run the render calculations on: with v3.6 you have to have one copy of the object FOR EACH THREAD 8O , so basically a 16-core Barcelona setup needs 16G of RAM instead of the 1G needed by v3.7. So if those systems had 6G of RAM and that scene was near 500MB, the Barcelona system should have swapped a lot to the HDD.

Good thing it used version 3.7

m25 · May 28, 2007

AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.

AMD only wanted to show 2x scaling within the same TDP. However, seeing the block diagrams of both the K8 and K10, I just can't believe their SSE performance is identical, as one interpretation of that demo might suggest. In the end, it's all assumptions, and it's really stupid to talk about unspecified CPUs at unspecified clock rates running unspecified software.

ElMoIsEviL · May 28, 2007

Well, you should at least read that post, because it does explain why Intel's Core 2 does so well under POV-Ray. Basically POV-Ray is very optimized for SSE (SSE2 etc). As such, any processor able to execute SSE code the fastest will have an advantage, especially a processor such as the Core 2, whose entire design is about efficiency.

Most professional applications are optimized heavily for SSE (like 99.9% of them).

So AMD really needs to do something about their SSE performance... if not, then Penryn will eat K10 alive (full SSE4 on Penryn) under the majority of applications.

What remains to be seen is the per clock performance of K10 which I do believe to be superior without a doubt over Core 2.

m25 · May 28, 2007

Good thing it used version 3.7

As you please, but till we get our feet on the ground, we're going more or less like this:

m25 · May 28, 2007

Well, you should at least read that post, because it does explain why Intel's Core 2 does so well under POV-Ray. Basically POV-Ray is very optimized for SSE (SSE2 etc). As such, any processor able to execute SSE code the fastest will have an advantage, especially a processor such as the Core 2, whose entire design is about efficiency.

And isn't Barcelona built for SSE efficiency too, having double the number of SSE engines, better prefetching and things like this? Maybe not better than Core 2, but how can that thing perform THE SAME AS A K8 :?: :!:

Most professional applications are optimized heavily for SSE (like 99.9% of them).
So AMD really needs to do something about their SSE performance... if not, then Penryn will eat K10 alive (full SSE4 on Penryn) under the majority of applications.
What remains to be seen is the per clock performance of K10 which I do believe to be superior without a doubt over Core 2.

POV-Ray is only optimized up to SSE2, which AMD has supported since the first K8s, and it scales almost linearly in performance with Intel CPUs; so there's no need to mention SSE4 here because it really does not matter. Professional software today is mostly optimized for SSE2, because this ensures wider compatibility, starting from the first P4s and K8s. SSE3 is just starting to be used in such software, and SSE4 will take its time.
So, you still want us to go like this

There is nothing solid to base this discussion on till we have real numbers 😀

ElMoIsEviL · May 28, 2007

Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Streaming SIMD Extensions) at all, but really double-precision FP with random register access;

Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8 at this, and thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation either, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).

This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.

m25 · May 28, 2007

Wait; who said Core2 performs 5 operations and K10 only 3; from all the articles I have seen, they process theoretically the same number of SSE instructions per cycle :roll:

ElMoIsEviL · May 28, 2007

You're thinking IPC-wise (instructions per cycle)... not double-precision FP-wise.

WR · May 28, 2007

OK, looks like they support any number of cores. However, it looks like a bit of miraculous scaling, from choppy, not to say prohibitive, multithreaded support in v3.6 to an infinity of perfectly smooth, available threads in 3.7 :roll:

That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best.
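To put the 2x versus ~1.8x figures side by side, here is a minimal sketch that expresses each as a fraction of ideal scaling when the core count doubles. The speedup numbers are the ones quoted in this post; the function name is made up for illustration.

```python
def scaling_efficiency(speedup, core_ratio=2.0):
    """Fraction of ideal speedup achieved when the core count grows by core_ratio."""
    return speedup / core_ratio

for name, speedup in (("POV-Ray 3.7", 2.0), ("POV-Ray 3.6 / Cinebench", 1.8)):
    print(f"{name}: {scaling_efficiency(speedup):.0%} of ideal when doubling cores")
```

Compounded over three doublings (2 to 16 cores), 1.8x per doubling would give about 5.8x against an ideal 8x, which is why the gap matters more on larger systems.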

AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.

http://abinstein.blogspot.com/2007/05/pov-ray-benchmark-and-amds.html

There's your info. AMD was using version 3.7 BETA.

That's the best analysis I've seen so far and basically says all the marketing speak about 2x bandwidth, double-width FP units, and single-cycle SSE doesn't tell the full story in an "SSE2-supporting" application.

But another possibility to consider besides writing off K10 SSE units as broken (perhaps poorly reverse engineered) is that AMD hasn't had the resources to develop compilers like Intel does. There may be an implementation that lets K10 shine like C2D, but without a compiler to make it happen, K10 is stuck running code suboptimally.

BaronMatrix · May 28, 2007

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.

Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

1Tanker · May 29, 2007

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.

Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

I'm rather surprised that AMD chose POV-Ray over Sciencemark to highlight their architectural advantages. I guess scaling was their main criterion for the demo, and maybe Sciencemark isn't showing the same degree of scaling. :? It's inarguably known that the K8 (and I would assume K10) arch. owns the Sciencemark suite.

turpit · May 29, 2007

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.

Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

You should have read the whole thread before responding.

r0ck · May 29, 2007

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.

Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

Dum, read the whole thread why don't ya. Point debunked.

m25 · May 29, 2007

That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best.

It's not simple speed scaling I am talking about. (I am also overlooking their stupid comments according to which POV-Ray usually gains 100% 8O from doubling the cores, while those 'inexperienced' sites tested on slower CPUs and got a ~85% gain; I don't know why the heck a slower CPU would be 20% less efficient than a faster one.) v3.6 uses a hell of a lot more memory than v3.7: it uses the scene's memory multiplied by the number of cores that render it (each core needs its own copy), while v3.7 uses only one copy for all the threads.
In simple words: if THAT quad-socket K10 system (with 6G of RAM) was using v3.6, the scene's RAM usage would have been multiplied by 16, more or less guaranteeing heavy HDD swapping.
