Intel responds to AMD's POV display.

This is a fair statement, though I thought POVray was FPU intensive rather than SSE dependent. Nonetheless, the most, and I emphasize the MOST, we can say about all of this hoopla is that it appears likely that Barcey did not improve IPC for the type of code POVray uses. That is all we can say; it is in no way indicative of how Barcey will fare in general.

However, I would argue that POVray is indeed K8 friendly, as it is one of the few cases where Opty/Athlon K8s can keep up with Core uArch. See any TechReport comparison for an example:
http://www.techreport.com/reviews/2007q1/quad-core/index.x?pg=8

Rendering apps such as Cinebench and POVray make heavy use of FPU, and the K8 FP capabilities are still strong even compared to Core uArch.
Jack

I think it depends which compiled version of POVray 3.7 you are running. The benchmark you referenced here is using the 64-bit version, which as far as I can tell is either not using optimized instructions or is optimized for AMD64. I think the SSE2-optimized version of POVray 3.7 is faster, and in those benchmarks Intel is about 40-45% faster than K8 clock for clock.
 

Yeah, I don't disagree. I wish people/sites would stop trying to draw comparisons from this data to Intel HW; it is simply not possible to draw any kind of meaningful conclusion.

However, considering that POVray.org has not released source code for anything higher than version 3.6 (including no 64-bit source code), we can assume (how validly is debatable) that all systems are using the same compiled versions. Your point that recompiling with different compilers and platform-specific options can significantly change the results is valid, but in the context of this one data point it is not applicable.

Why?

The POVray 'demo' that AMD sponsored is only meaningful when comparing Barcey to Optys, and even then only a very vague statement can be made: Barcey does not appear to have improved much on FPU IPC efficiency.

Now, using your argument to explain away this result sorta insults AMD: they would have needed to intentionally alter the code (if the source code existed) to favor one side or the other, and they did a very poor job of making Barcey look its very best.

It is a moot point, however, because the real purpose of the demo was not to show absolute performance but relative performance with a drop-in upgrade, which is a major thrust of AMD's strategy, numbers and comparisons notwithstanding.

Jack

The source code for 3.7 beta is not published, but they do distribute it with several different executables that are compiled with optimizations for different processor types. See http://www.povray.org/beta/

I'm not trying to discredit AMD. It is entirely possible that the AMD Barcelona demo used a version of the executable that doesn't use any optimized instructions, which is a possible explanation why the raw numbers do not compare well to Intel POVRay benchmarks.

It is also a possible explanation for why the IPC is not much improved over K8. For example, if they had used the SSE2-optimized version, there might have been a bigger IPC difference, since SSE2 is supposed to be much improved in K10. But this would assume that the people giving this demo were incompetent and didn't choose a good demo to showcase their product.
 
What annoys me though is that AMD tried to use this POVray demo, where a scaling factor not too far from 2 is expected, and then tried to claim that the factor of 2 improvement they got will be typical for other applications. It just makes them look bad when the only demo they give shows such poor results. If there is some other benchmark out there that shows better improvement over K8, why not show that?
 
Which is the $64,000 question. It's really hard to have any faith in AMD, when this is the best they can/will offer. :x
 
Nonetheless, nearly all sites using POVray as a bench are consistent about the K8 core, in that it holds its own against C2D.

I'm not sure. I tried searching for benchmarks on Google and all 4 sites I clicked on showed Intel ahead of K8, but maybe this was just by chance.

In terms of optimizing for SSE2, it is unlikely, though many in the forums, if you read them, have tried using SSE2 optimizations in their compilations.

I haven't read all the forums and I'm not sure what tests people have tried. I see people referencing a blog that says 3.7 beta doesn't use vectorized SSE2 instructions so SSE2 shouldn't make a difference, but I'm pretty sure this is incorrect. The versions of gcc, icc, and Visual C++ compilers within the last 2 years have optimizations for vectorized SSE2, and there is some evidence that the developers have been using them since version 3.6 and that this does make a difference:
AFAIK at the time 3.5 was released the available compilers were not able to automatically generate SSE2-optimized code for that particular piece of code, and Intel contributed the hand-optimized code you are referring to. Nowadays the situation is much different since, in particular, the Intel C++ compiler greatly improved its optimization framework (the GCC compiler did also improve a lot, though it is still not as good as ICPC at optimizing on the P4 architecture; K8 might be slightly different though). Therefore, the 3.6 codebase didn't need this hand-optimization any longer.

I'm not sure for the latest 3.6.1c Windows binary, but if it is indeed optimized for SSE2-capable CPUs, not only the noise code will benefit from the optimizations. At least the 3.7 beta offers a fully optimized SSE2 build as well as a non-SSE2 optimized binary.

Also here is an old link dealing with POVray 3.6 http://pov4grasp.free.fr/articles/fastpov1/ but you can get some idea that SSE2 optimizations do make a difference, and that the performance differences between different compilations will manifest themselves in a different way depending on the processor architecture. For example, if the SSE2 build of 3.7 beta is compiled with the Intel C++ Compiler, it might show a big improvement for Intel processors over a non-optimized build, but maybe doesn't generate a mix of instructions that show as much improvement for K8 CPUs.
 
From the scores that they got, I would gather that the clock rate of the Opteron AMD used was either 1.6 or 1.8 GHz. My reasoning: I have an Opteron 165 (1.8 GHz, 1 MB cache, DDR-400) that gets ~585 on the POV 3.7 benchmark, and an Athlon 64 X2 4200+ (2.2 GHz, 512 KB cache, DDR2-800) that gets ~737 on the same benchmark.

Theory: 2200 / 4 = ~550 score
585 * 4 = 2340 (using ECC DDR2-667 memory could drop this score to 2200)
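
For anyone who wants to check the clock-rate guess above, here is a rough sketch of the same reasoning in Python. It assumes that 2200 is the demo's Opteron system score and that dividing it by 4 gives a per-chip score comparable to the two dual-core results quoted in the post; both are my reading of the post, not confirmed figures. The function names are made up for illustration.

```python
# Scores and clocks quoted in the post (POV-Ray 3.7 beta benchmark).
OPTERON_165 = {"score": 585, "ghz": 1.8}
X2_4200 = {"score": 737, "ghz": 2.2}

DEMO_SCORE = 2200  # assumed: the demo Opteron system score used in the post
CHIPS = 4          # assumed: the post divides the demo score by 4


def score_per_ghz(chip):
    """Benchmark score per GHz for one reference chip."""
    return chip["score"] / chip["ghz"]


def implied_clock_ghz(total_score, chips, reference):
    """Clock a reference-like chip would need to hit total_score / chips,
    assuming the score scales linearly with clock."""
    per_chip = total_score / chips
    return per_chip / score_per_ghz(reference)


for name, chip in (("Opteron 165", OPTERON_165), ("X2 4200+", X2_4200)):
    ghz = implied_clock_ghz(DEMO_SCORE, CHIPS, chip)
    print(f"{name}: {score_per_ghz(chip):.0f} points/GHz, "
          f"implied demo clock ~{ghz:.2f} GHz")
```

Both reference chips put the implied clock between 1.6 and 1.8 GHz, which is where the guess comes from. Linear scaling with clock is only an approximation, since memory and cache differences are ignored.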

So if the Opteron were clocked at 2.2 GHz (a much fairer test against Intel), it would get 737 * 4 = 2948, and the Barcelona would score around 737 * 8 = 5896.

As I understand it, AMD's top-clocked Opteron (Barcelona model) will run at 2.6 GHz, so this score should come out to around 7000 for the high-end Barcelona 4x4.
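
The projection above can be written out as a small Python sketch. It takes the post's assumption at face value: the score scales linearly with both the number of dual-core-equivalent units and the clock rate. Real scaling would likely be somewhat lower, so treat the results as upper-end estimates. `project_score` is a hypothetical helper, not part of any benchmark tool.

```python
def project_score(base_score, units, base_ghz, target_ghz):
    """Scale one chip's score by unit count and by clock ratio (linear model)."""
    return base_score * units * (target_ghz / base_ghz)


BASE_SCORE = 737  # Athlon 64 X2 4200+ result quoted in the post
BASE_GHZ = 2.2

opteron_2_2 = project_score(BASE_SCORE, 4, BASE_GHZ, 2.2)    # 4 dual-core chips
barcelona_2_2 = project_score(BASE_SCORE, 8, BASE_GHZ, 2.2)  # twice the cores
barcelona_2_6 = project_score(BASE_SCORE, 8, BASE_GHZ, 2.6)  # top clock from the post

print(f"Opteron at 2.2 GHz:   ~{opteron_2_2:.0f}")
print(f"Barcelona at 2.2 GHz: ~{barcelona_2_2:.0f}")
print(f"Barcelona at 2.6 GHz: ~{barcelona_2_6:.0f}")
```

Under these assumptions the 2.6 GHz figure lands just under 7000, which matches the rounded number in the post.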

Note: AMD said it was a drop-in upgrade, right? From this I would gather they are using a current motherboard, meaning no HT3 or any of the other Barcelona features that are disabled in an older board (power saving, NB overclocking, etc.).