News Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance


NinoPino

Respectable
May 26, 2022
496
310
2,060
Is it proven that this is what happens?!
Consider that Intel removed these "optimizations" from its compiler some time ago. Apparently the affected versions are from 2022.0 to 2023.0.
Wouldn't SPEC need to be able to look through the source code of the compiler to be sure?
The Intel oneAPI compiler used is a derivative of LLVM, but as far as I know it is closed source.
The only other way would be to run counter-benchmarks with other apps and see if the performance changes, but if they did that, wouldn't they include that data?
The improvement was 9% overall on speed and 4% on rate. Not bad.
 
  • Like
Reactions: Logics and bit_user

bit_user

Titan
Ambassador
Is it proven that this is what happens?! Wouldn't SPEC need to be able to look through the source code of the compiler to be sure?
I do wish they were more transparent, but no. You can analyze the compiler output and try running the compiled program on different data. If the compiler is behaving as though it's doing PGO (Profile-Guided Optimization) targeted at that specific code and that specific dataset, it can be demonstrated beyond a statistical aberration that it's happening.
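
Roughly, the kind of experiment I have in mind looks like the sketch below (just a sketch: hot_kernel and the input sizes are placeholders, and this isn't SPEC's published procedure). You build the suspect code with the compiler in question, then time it on a reference-style input and on a slightly perturbed one. A general optimization should help both about equally; a dataset-targeted one mostly helps the exact case it was tuned for.

```c
/* Hypothetical detection harness, not SPEC's actual procedure: time the same
 * compiled routine on a reference-style input and on a slightly perturbed one.
 * A general optimization should speed up both about equally; a dataset-targeted
 * one mostly helps the exact case it was tuned for. Uses POSIX clock_gettime. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Placeholder for a hot loop from the benchmark under test. */
static double hot_kernel(const double *v, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += v[i] * v[i];
    return acc;
}

static double time_run(const double *v, size_t n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile double sink = hot_kernel(v, n);   /* volatile keeps the call live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    size_t n_ref = 1 << 24;                 /* stand-in for the reference dataset */
    size_t n_alt = (1 << 24) + 12345;       /* slightly reshaped dataset */
    double *a = malloc(n_ref * sizeof *a);
    double *b = malloc(n_alt * sizeof *b);
    if (!a || !b) return 1;
    for (size_t i = 0; i < n_ref; i++) a[i] = (double)(i % 97);
    for (size_t i = 0; i < n_alt; i++) b[i] = (double)(i % 101);
    printf("reference-like: %.3f s   perturbed: %.3f s\n",
           time_run(a, n_ref), time_run(b, n_alt));
    free(a);
    free(b);
    return 0;
}
```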

if they did that, wouldn't they include that data?
I don't see a press release from SPEC about this. I'm not sure how it was communicated.

If we use sports anti-doping as an analogy, there is an argument that you don't want to be too transparent about your methods, as such transparency could give future cheats some clues about how to evade detection.

Has Intel contested the finding?
 
  • Like
Reactions: Logics and NinoPino
Feb 17, 2024
1
0
10
So the news is that Intel has been doing some unfair optimization to make themselves look better than they really are; does this really surprise anyone? There isn't a major corporation that hasn't done the same thing with their marketing. Once you understand that marketing is just creative lying, you know not to believe what you read unless you can find a reliable source to confirm it, and to believe only half of what you see. So Intel gets a black eye for getting caught at it, and maybe AMD gets a bit more market share as a result... this doesn't change anything in my world at all. I have loads of computers with Intel processors and they all perform well enough that I keep using them. When they cease to do the job, they get retired and replaced with something better. So I'll sit back and relax and not get my panties in a twist because this happened.
 

AkroZ

Reputable
Aug 9, 2021
56
31
4,560
What the heck, SPEC? So basically what you are saying is that you are using useless benchmarks that don't target any kind of workload, or even specific applications, but then you are salty that somebody optimizes their compiler for it?!
Also, how is SPEC NOT a specific application? It doesn't get any more specific than benchmarks that don't target any kind of workload, or even specific applications.
It's like automobile manufacturers that coded the car to perform well in pollution tests. The test was too specific and well documented, and the optimization does not change the performance of the car outside of the pollution tests. (In Europe, the tests have since been changed to measure pollution while the car is actually being driven.)
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
this doesn't change anything in my world at all.
If you don't look at SPECbench scores to help inform your purchasing decisions, then of course it won't. Likewise, if you aren't in the market for Sapphire Rapids-based machines, it also won't matter to you.

News doesn't have to personally impact you, in order to be important or newsworthy. Are we really debating that point? Really??

It seems you registered a new account, just to say you don't care. Your actions belie your words.
 
  • Like
Reactions: NinoPino

Shirley Marquez

Honorable
Sep 2, 2019
13
7
10,515
It's no different, though. Whether they put an * in a PowerPoint saying "with Hyper-Threading disabled" or ship a custom compiler that cuts 100ms off a <insert benchmark here>, benchmarks from OEMs are typically tailored to one specific situation/setting/config/compiler/etc. and should always be taken with a grain of salt.

I know it's not supposed to be that way for SPEC, but just about every OEM out there works on ways to cheat the standards body. It's true even in other industries; look at all the emissions-defeat issues the auto industry had just a few years ago. In most cases, OEM metrics can't be trusted.
But it wasn't just OEM metrics. Intel put these benchmark-specific optimizations in the compiler that was released to the public, so ANYBODY who used Intel's compiler to compile the benchmark would get those enhancements.
 
  • Like
Reactions: bit_user

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
428
760
I'd really like to know whether the compiler was made to recognize a specific benchmark code snippet and optimize it, because that's the only way they could claim cheating. How would you even prove something like that without access to the compiler source code, or at least without disassembling the binary?
 
I'd really like to know whether the compiler was made to recognize a specific benchmark code snippet and optimize it, because that's the only way they could claim cheating. How would you even prove something like that without access to the compiler source code, or at least without disassembling the binary?
Already talked about: Intel has since released new compiler versions that don't include these parts, and the performance is 9% different.
 
  • Like
Reactions: Logics

Logics

Distinguished
Jan 29, 2015
2
0
18,510
I'll just leave this here.


Regards.
…And this is not the first time either. When AMD created 3DNow, the Intel compiler neither tested for nor optimised for it, and before that, when Intel created their competitor to AMD64 (not Itanium), if the CPU identifier contained "Intel", then the compiler optimised; otherwise it did not.

At that point, many serious benchmarkers looked for benchmarking tools NOT compiled with the Intel compiler, and many F/LOSS benchmarks started to become popular. Others refused to give them credence, claiming that they only test theoretical performance, and not real world performance. This was certainly true IF your real world software was compiled with the Intel compiler.

I personally have only been using F/LOSS benchmarks since the AMD Athlon CPUs came out. Linux, (in particular, Ubuntu GNOME), has made this easy to do, with the benchmarks being a part of standard installs.
 
…And this is not the first time either. When AMD created 3DNow, the Intel compiler neither tested for nor optimised for it,
3DNow was AMD property/IP; Intel would have had to do industrial espionage or illegally decompile it to optimize for it, or AMD would have had to give Intel that info. None of those options seem probable to me.
At least SSE (1/2/3) is a common extension they both used, where it's arguable whether there could be differences or not.
 

bit_user

Titan
Ambassador
3DNow was AMD property/IP; Intel would have had to do industrial espionage or illegally decompile it to optimize for it,
The 3D Now spec was public - no corporate espionage needed! Furthermore, I think AMD would've been happy for Intel's compilers to support generating optimized 3D Now code.

That said, I don't blame Intel for not supporting 3D Now, in its compilers. I think most people wouldn't have expected them to.

At the end of the day, SSE was the better & more comprehensive standard - and that's what won the day. 3D Now was an improvement over MMX, but then so was SSE. It did have a couple of interesting ideas, like being able to trade off more time for better accuracy in computing things like SQRT and 1/x, or getting approximate results more quickly.
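
To illustrate that tradeoff with the SSE analogue (this isn't actual 3D Now code, since no current compiler targets PFRCP/PFRCPIT1/PFRCPIT2): _mm_rcp_ps gives a cheap, roughly 12-bit reciprocal estimate, and one Newton-Raphson step refines it to near full single precision.

```c
/* Sketch of the "quick approximation vs. refined result" tradeoff, using the
 * SSE analogue of 3D Now's staged reciprocal (SSE is part of the x86-64
 * baseline, so this compiles with any mainstream compiler). */
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 x      = _mm_set_ps(7.0f, 3.0f, 1.5f, 0.25f);
    __m128 approx = _mm_rcp_ps(x);            /* ~12-bit accurate 1/x, very cheap */

    /* One Newton-Raphson step, r' = r * (2 - x*r), roughly doubles the accurate bits. */
    __m128 two     = _mm_set1_ps(2.0f);
    __m128 refined = _mm_mul_ps(approx, _mm_sub_ps(two, _mm_mul_ps(x, approx)));

    float a[4], r[4];
    _mm_storeu_ps(a, approx);
    _mm_storeu_ps(r, refined);
    for (int i = 0; i < 4; i++)
        printf("1/x: approx=%.7f  refined=%.7f\n", a[i], r[i]);
    return 0;
}
```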
 
  • Like
Reactions: NinoPino

JamesJones44

Reputable
Jan 22, 2021
866
807
5,760
It sounds to me like you're making another argument that because the benchmarks aren't absolutely free of manipulation or targeting, we shouldn't even try (i.e. making "perfect" the enemy of "good"). I think that would be sad. I have found SPEC2006 and SPEC2017 to be very valuable metrics, particularly in the way they had been embraced and utilized by Anandtech. The world would've been poorer without them.

No, those are your words, which you are injecting into the conversation to support some tangent you feel is important.

The point I'm making is that companies will always work to cheat and/or optimize for standardized tests (this doesn't have to be malicious, by the way; an engineer trying to eke out an extra millisecond of performance may use a standardized test to see if it has been achieved, but it may have zero value in 99% of workloads). Thus trusting any OEM submission is a fool's errand, in my opinion. Independent submissions from sites like Anandtech, Tom's Hardware and users are what is important to look at, as they are likely to give more realistic views of performance. If those independent sites used Intel-specific compilers, I would question why they would do such a thing. Using any OEM tool for a benchmark is likely to sway the results of an independent cross-system/platform test. There are plenty of generic x86 compilers these days. Now if Intel went into every compiler and added cheats and opcodes for benchmarks to all of them, then it would get my attention; otherwise this is just more OEM * talk.
 
Last edited:

JamesJones44

Reputable
Jan 22, 2021
866
807
5,760
But it wasn't just OEM metrics. Intel put these benchmark-specific optimizations in the compiler that was released to the public, so ANYBODY who used Intel's compiler to compile the benchmark would get those enhancements.
Again, this is no different from any other benchmark cheat, as I explained in my comment.

If there are ANY sites or users (aka ANYBODY) using OEM-specific compilers to run independent, cross-platform/system tests, then shame on them. Any OEM tool used for benchmarking is likely to sway the results and no longer be independent.

Which is the point people seem to be missing.
 
Last edited:

bit_user

Titan
Ambassador
Thus trusting any OEM submission is a fool's errand, in my opinion. Independent submissions from sites like Anandtech, Tom's Hardware and users are what is important to look at, as they are likely to give more realistic views of performance.
That's not how SPEC works, though.

"SPEC does not perform these measurements itself, rather it allows members and in some cases other licensees to submit results and measurements for review and publication on SPEC's website."

I didn't find a description of their rules governing how the tests are to be run, however. I would agree that it'd be better if you had to submit your system to an independent, 3rd-party lab for testing.

I think trusting sites like Toms is not without its own downsides, such as:
  1. They don't all run the same tests or use the same testing conditions.
  2. Aside from laptops, they test very few full systems, and most are consumer-oriented.
  3. They tend to run either opaque commercial benchmark suites or single-application benchmarks. SPECbench strikes a nice sweet spot in between, but I haven't seen anyone but Anandtech use it.
  4. These sites tend to rely on manufacturers for review samples, which could be cherry-picked golden samples and creates a form of dependence.

The thing about SPEC is that while it does rely on industry submissions, it's also industry-funded. If SPEC went away, I really don't see anything remotely comparable that would step in and fill the breach. The closest thing out there is Phoronix Test Suite/OpenBenchmarking.org, but it doesn't have well-defined, narrow suites like SPEC does, which means no composite scores like SPECint/SPECfp. Also, its submission rules are basically nonexistent and Michael can't test very much, by himself. Plus, he's dependent on vendor-supplied samples, just like most of the other sites.
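
For anyone wondering what those composites actually are: roughly speaking, a SPEC CPU composite is the geometric mean of each sub-benchmark's ratio against a fixed reference machine. A minimal sketch of that aggregation, with made-up ratios:

```c
/* Illustrative only: a SPEC-style composite is (roughly) the geometric mean of
 * each sub-benchmark's ratio against a reference machine. These ratios are
 * made-up numbers, not real submissions. Link with -lm. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double ratios[] = { 8.1, 12.4, 9.7, 15.2, 10.3 };  /* hypothetical per-test ratios */
    size_t n = sizeof ratios / sizeof ratios[0];

    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(ratios[i]);

    printf("composite (geometric mean) = %.2f\n", exp(log_sum / n));
    return 0;
}
```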

So, if we want SPECbench, then we have to accept that the scores on spec.org are going to be tested under favorable conditions. I'm alright with that, especially knowing that they try to hold a firm line on certain things. I only look at those scores to get a rough idea of a system's performance, anyhow. Probably not at the granularity where the kinds of manipulations you're talking about could majorly swing them, but maybe.

What I care a lot more about is 3rd parties' ability to run their test suites. Again, unless it's one group running multiple tests, I don't put a ton of stock in exact score comparisons, but it's definitely useful to have a consistent benchmark with scores you can compare between different reviewers.

If those independent sites used Intel-specific compilers, I would question why they would do such a thing. Using any OEM tool for a benchmark is likely to sway the results of an independent cross-system/platform test.
Why is it bad to use an Intel compiler on an Intel CPU or an AMD compiler on an AMD CPU? As long as those compilers don't have overly-targeted optimizations, intended specifically to rig such benchmarks, I think it's not necessarily a problem to use them.

You should consider that these types of benchmarking efforts largely grew out of the HPC industry, where using a vendor-supplied compiler is the norm. Heck, back in the 80's and 90's, even (non-x86) workstations and servers would tend to have their own OS and their own compilers. That's the era SPEC grew out of.

There are plenty of generic x86 compilers these days.
It would be interesting for them to define a benchmark suite that used a standard set of binaries, such as from RHEL or Debian upstream. Whether this better reflects customers' use cases really depends on the type of system and the customer. Since most of the submissions they get are servers and workstations, it's not unreasonable to expect many customers will either be compiling their own software or will be buying hardware from a software vendor's list of supported machines. In the latter case, that enables the software vendor to target a much newer ISA baseline than the 20-year-old x86-64 standard, for instance, or the ARMv8-A that's more than a decade old, and maybe even tune for a specific CPU target or use that CPU vendor's compiler.
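
To make that baseline point concrete (GCC/Clang flag names; just a sketch, not anything SPEC mandates): the same source simply inherits whatever ISA level the builder chose, and that's exactly the knob a software vendor with a supported-hardware list gets to turn.

```c
/* Sketch: the same source reflects whatever ISA baseline was chosen at build
 * time, e.g. (GCC/Clang flag names)
 *   gcc -O2 -march=x86-64     baseline.c    (the original ~2003 baseline)
 *   gcc -O2 -march=x86-64-v3  baseline.c    (an AVX2-class baseline)      */
#include <stdio.h>

int main(void) {
#if defined(__AVX512F__)
    puts("compiled against an AVX-512 class baseline");
#elif defined(__AVX2__)
    puts("compiled against an AVX2 (x86-64-v3) class baseline");
#elif defined(__SSE2__)
    puts("compiled against the generic x86-64 baseline (SSE2 only)");
#else
    puts("compiled for a pre-x86-64 target");
#endif
    return 0;
}
```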
 

JamesJones44

Reputable
Jan 22, 2021
866
807
5,760
I think trusting sites like Toms is not without its own downsides, such as:
  1. They don't all run the same tests or use the same testing conditions.
  2. Aside from laptops, they test very few full systems, and most are consumer-oriented.
  3. They tend to run either opaque commercial benchmark suites or single-application benchmarks. SPECbench strikes a nice sweet spot in between, but I haven't seen anyone but Anandtech use it.
  4. These sites tend to rely on manufacturers for review samples, which could be cherry-picked golden samples and creates a form of dependence.
Totally agree here. Benchmarking is hard, without a doubt, and money sways these results too. I tend to look at a few reviews before attempting to draw any conclusions about new product performance, as the variables are many and workload tests vary greatly.


Why is it bad to use an Intel compiler on an Intel CPU or an AMD compiler on an AMD CPU? As long as those compilers don't have overly-targeted optimizations, intended specifically to rig such benchmarks, I think it's not necessarily a problem to use them.
For most benchmarks the goal is to make the playing field as level as possible; this is especially true for a generic workload benchmark. Very few applications in the world are going to compile for Intel and AMD using Intel- or AMD-specific compilers. When a user goes and downloads a binary for Windows, there is almost always one binary for 64-bit/x86 Windows. For the benchmark to be realistic, they should do the same (use a single compiler). Sure, OEMs will do this because they know exactly what they are targeting, but once the user installs a 3rd-party application, suddenly the results don't match, because those applications are not going to use CPU-specific compilers.

Now, if a benchmark wanted to be complete and included sections for company-specific compilers, then I would agree, but for generic tests it would not show realistic workload results, which is what most people hope benchmarks will show them.

It would be interesting for them to define a benchmark suite that used a standard set of binaries, such as from RHEL or Debian upstream.

Yes, I agree. There are so many variables in a benchmark that it's really hard to compare anything apples to apples, and OEM-specific anything is likely to sway results (+/-). I wish we could take it a step further and remove the chipsets from the equation to get a true sense of a CPU's performance, like we can with GPUs (at least from a frame-rate perspective; GPU vector instruction compilers get back into vendor/OEM hell).

Since most of the submissions they get are servers and workstations, it's not unreasonable to expect many customers will either be compiling their own software or will be buying hardware from a software vendor's list of supported machines. In the latter case, that enables the software vendor to target a much newer ISA baseline than the 20-year-old x86-64 standard, for instance, or the ARMv8-A that's more than a decade old, and maybe even tune for a specific CPU target or use that CPU vendor's compiler.
This part I agree with to a degree. There are still some extremely performance-sensitive applications built for specific server hardware, for which Intel cheating a benchmark is potentially important and, I will submit, more dubious than an *. However, most of the industry has moved to containerization, which removes that aspect for most applications (Docker images, for example, are not usually CPU-specific, only architecture-specific). This is why I strongly believe that using generic compilers for generic and typical workload benchmarks is important.
 

bit_user

Titan
Ambassador
For most benchmarks the goal is to make the playing field as level as possible; this is especially true for a generic workload benchmark. Very few applications in the world are going to compile for Intel and AMD using Intel- or AMD-specific compilers. When a user goes and downloads a binary for Windows, there is almost always one binary for 64-bit/x86 Windows. For the benchmark to be realistic, they should do the same (use a single compiler).
Yes, but remember the history of SPECbench and the fact that most submissions are of workstations and servers. If it were intended as a generic benchmark for client machines, then I would more fully agree that they should really use a standard set of binaries - not just a standard compiler!

However, most of the industry has moved to containerization, which removes that aspect for most applications (Docker images, for example, are not usually CPU-specific, only architecture-specific). This is why I strongly believe that using generic compilers for generic and typical workload benchmarks is important.
This claim is what motivated my reply. Just because you can run a container on a variety of different machines doesn't mean you can't make the container arbitrarily specific. You just have to know enough about the hardware of the host machines it will be running on, and then you can tailor its contents to that specific machine specification.

Containers are basically lighter-weight variations on VMs. The main difference is that a VM has its own kernel, whereas the container uses the kernel of the host. A VM therefore can run under emulation, but the container is going to use the same ISA as its host (which, confusingly, could itself be a VM, running under emulation).

Anyway, the point is that containers aren't really an argument for/against using code optimized for the specific host. If you're running in the cloud, containers are probably being migrated amongst instances of the same kind. On my own personal machines, I never move containers between hosts.
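
As a minimal sketch of that (assuming a GCC/Clang toolchain; __builtin_cpu_supports is their builtin, and the AVX-512 requirement here is just an example), a binary shipped inside an image can be as CPU-specific as its builder wants:

```c
/* Sketch: the container runs on the host's CPU directly, so a narrowly
 * targeted build inside an image simply refuses to run (or crashes) on
 * hosts that lack the features it was tuned for.
 * __builtin_cpu_init / __builtin_cpu_supports are GCC/Clang builtins. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    __builtin_cpu_init();   /* initialize the compiler's CPU feature detection state */
    if (!__builtin_cpu_supports("avx512f")) {
        fprintf(stderr, "this image was built for AVX-512 hosts; refusing to run\n");
        return EXIT_FAILURE;
    }
    puts("AVX-512 host detected, running the tuned code path");
    return 0;
}
```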
 