News Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance

Status
Not open for further replies.

peachpuff

Reputable
BANNED
 
What the heck, SPEC? So basically what you are saying is that you are using useless benchmarks that don't target any kind of workload, or even specific applications, but then you are salty that somebody optimizes their compiler for it?!
Also, how is SPEC NOT a specific application? It doesn't get any more specific than benchmarks that don't target any kind of workload, or even specific applications.
The disclaimer that it is now attached to over 2,600 SPEC CPU 2017 results states, "The compiler used for this result was performing a compilation that specifically improves the performance of the 523.xalancbmk_r / 623.xalancbmk_s benchmarks using a priori knowledge." This means the compiler (in this case, Intel's oneAPI DPC++/C++ Compiler) was not optimized for a particular kind of workload, or even for specific applications, but specifically for two SPEC CPU 2017 benchmarks.
 

bit_user

Titan
Ambassador
523.xalancbmk_r / 623.xalancbmk_s benchmarks using a priori knowledge." This means the compiler (in this case, Intel's oneAPI DPC++/C++ Compiler) was not optimized for a particular kind of workload, or even for specific applications, but specifically for two SPEC CPU 2017 benchmarks.
I think it's basically just one benchmark that's included in two different suites. Xalan is an XSLT processor developed under the umbrella of the Apache Software Foundation.

As for the _r and _s distinction, these signify rate vs. speed. SPEC explains them as follows:

"There are many ways to measure computer performance. Among the most common are:
  • Time - For example, seconds to complete a workload.
  • Throughput - Work completed per unit of time, for example, jobs per hour.

SPECspeed is a time-based metric; SPECrate is a throughput metric."

https://www.spec.org/cpu2017/Docs/overview.html#Q15

They further list several key differences, but these two jumped out at me:
  • For speed, 1 copy of each benchmark in a suite is run. For rate, the tester chooses how many concurrent copies to run.
  • For speed, the tester may choose how many OpenMP threads to use. For rate, OpenMP is disabled.

I'm a little surprised by the latter point, but I guess it makes sense. What it means is that SPECspeed shouldn't be taken purely as a proxy for single-threaded performance. You really ought to use SPECrate[n=1] for that, which I think is what I've typically seen used.
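
To make that concrete, here's a rough sketch of how I understand the per-benchmark ratios rolling up into the overall scores. This is not SPEC's actual tooling; the reference times, run times, and copy counts below are invented purely for illustration:

```cpp
// Rough sketch of how SPEC CPU-style scores aggregate, as I understand SPEC's
// docs. The reference times, run times, and copy counts are invented for
// illustration; they are not real submission data.
#include <cmath>
#include <cstdio>
#include <vector>

struct Result {
    double ref_seconds;   // SPEC-defined reference time for the benchmark
    double run_seconds;   // measured elapsed time on the system under test
    int    copies;        // concurrent copies run (always 1 for SPECspeed)
};

// SPECspeed-style ratio: how much faster than the reference machine one copy ran.
double speed_ratio(const Result& r) { return r.ref_seconds / r.run_seconds; }

// SPECrate-style ratio: throughput credit scales with the number of copies.
double rate_ratio(const Result& r) { return r.copies * r.ref_seconds / r.run_seconds; }

// The overall score is the geometric mean of the per-benchmark ratios.
double geometric_mean(const std::vector<double>& ratios) {
    double log_sum = 0.0;
    for (double x : ratios) log_sum += std::log(x);
    return std::exp(log_sum / ratios.size());
}

int main() {
    std::vector<Result> suite = {
        {1592.0, 300.0, 64},   // imagine something xalancbmk-like
        {1716.0, 350.0, 64},
        {2472.0, 500.0, 64},
    };
    std::vector<double> ratios;
    for (const auto& r : suite) ratios.push_back(rate_ratio(r));
    std::printf("overall rate score: %.1f\n", geometric_mean(ratios));
    return 0;
}
```

Because the headline number is a geometric mean of the per-benchmark ratios, inflating even one ratio lifts the overall score, which is exactly why a narrowly targeted compiler trick on a single benchmark still pays off.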

Results for fifth-gen Xeon Emerald Rapids CPUs are very unlikely to have been running a version of the compiler with the banned optimization since Emerald Rapids came out after the good versions of the compiler were available.
I'm not sure how the author concludes this. I don't see anywhere to download previous versions of Intel's DPC++ compiler. The latest is 2024.0.2 and that release is dated Dec. 18th, 2023.

BTW, I wonder who tipped them off. Did someone just notice those results were suspiciously good and start picking apart the generated code, or did a disgruntled ex-Intel employee maybe drop a dime?
 

bit_user

Titan
Ambassador
What the heck, SPEC?
Standard Performance Evaluation Corporation

The System Performance Evaluation Cooperative, now named the Standard Performance Evaluation Corporation (SPEC), was founded in 1988 by a small number of workstation vendors who realized that the marketplace was in desperate need of realistic, standardized performance tests. The key realization was that an ounce of honest data was worth more than a pound of marketing hype.

SPEC publishes several hundred different performance results each quarter spanning a variety of system performance disciplines.

The goal of SPEC is to ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems. The path chosen is an attempt to balance requiring strict compliance and allowing vendors to demonstrate their advantages. The belief is that a good test that is reasonable to utilize will lead to a greater availability of results in the marketplace.

SPEC is a non-profit organization that establishes, maintains and endorses standardized benchmarks and tools to evaluate performance for the newest generation of computing systems. Its membership comprises more than 120 leading computer hardware and software vendors, educational institutions, research organizations, and government agencies worldwide.


One neat thing about SPECbench is that you actually get it in the form of source code that you can compile and run just about anywhere. For years, Anandtech even managed to run it on iPhone and Android phone SoCs. This allowed them to compare performance and efficiency relative to desktop x86 and other types of CPUs.

As far as I'm aware, GeekBench is one of the only other modern, cross-platform benchmarks. However, unlike SPECbench, it's basically a black box. This makes it a ripe target for allegations of bias towards one kind of CPU or platform vs. others.

so basically what you are saying is that you are using useless benchmarks that don't target any kind of workload, or even specific applications
No, SPECbench comprises real-world, industry-standard applications.

then you are salty that somebody optimizes their compiler for it?!
Yes. The article explains that the benchmark suite is intended to be predictive of how a given system will perform on certain workloads. If a vendor does highly-targeted compiler optimizations for the benchmark, those don't carry over to similar workloads and thus invalidate the benchmark. That undermines the whole point of SPECbench, which is why they need to take a hard line on this sort of activity.
 

bit_user

Titan
Ambassador
Let's be honest, who believes any of the benchmarks released by any host company? Not a day goes by where an independent benchmark doesn't look different from what Apple, Intel, AMD, Nvidia, Micron, etc. stated in their own benchmarks.
That's not what this is about.

SPEC gets submissions for an entire system. As such, they're usually submitted by OEMs and integrators. There's a natural tendency to use the compiler suite provided by the CPU maker, since those have all of the latest and greatest optimizations and tuning for the specific CPU model. That's where the trouble started.

SPEC has various rules governing the way systems are supposed to be benchmarked, in order to be eligible for submission. It's a little like the Guinness Book of World Records, or perhaps certain athletics bodies and their rules concerning official world records.
 

punkncat

Polypheme
Ambassador
What do you mean the new car I just purchased doesn't really get 40 MPG in real world conditions?

AGHAST!

Why does this even qualify as news? This isn't anything novel or unheard of. Whispers about it have been going around for years now. If anyone is surprised, they are also naive...and I got some beachfront property to sell you...
 
Yes. The article explains that the benchmark suite is intended to be predictive of how a given system will perform on certain workloads. If a vendor does highly-targeted compiler optimizations for the benchmark, those don't carry over to similar workloads and thus invalidate the benchmark. That undermines the whole point of SPECbench, which is why they need to take a hard line on this sort of activity.
I don't get the distinction...
If it's predictive of possible compiler optimizations, and Intel actually did those compiler optimizations, then what's the issue?!

It can't be both ways: either these particular benches are useless, or Intel optimized the compiler towards whatever the predicted use case was.
 

bit_user

Titan
Ambassador
Why does this even qualify as news? This isn't anything novel or unheard of. Whispers about it have been going around for years now. If anyone is surprised, they are also naive...and I got some beachfront property to sell you...
You're effectively making a "slippery slope" argument - that, because some people will probably always try to game the system, it's a lost cause and we should just give up on ever having a fair, standard way to benchmark anything.

I disagree. SPEC was founded to meet a need and I think it has been pretty effective to those ends. Don't let "perfect" be the enemy of "good", here.
 

bit_user

Titan
Ambassador
I don't get the distinction...
If it's predictive of possible compiler optimizations, and Intel actually did those compiler optimizations, then what's the issue?!
The benchmark is intended to be predictive of the system's performance on similar workloads. If the compiler optimizations aren't sufficiently general also to apply to those similar workloads, then they're counterproductive.
 

punkncat

Polypheme
Ambassador
You're effectively making a "slippery slope" argument - that, because some people will probably always try to game the system, it's a lost cause and we should just give up on ever having a fair, standard way to benchmark anything.

I disagree. SPEC was founded to meet a need and I think it has been pretty effective to those ends. Don't let "perfect" be the enemy of "good", here.


Honesty in product marketing and sales?

Lol. So, about that beachfront property...it's a bit of a walk to the water, but...
 
The benchmark is intended to be predictive of the system's performance on similar workloads. If the compiler optimizations aren't sufficiently general also to apply to those similar workloads, then they're counterproductive.
If the predicting benchmark isn't sufficiently general then how is it supposed to make any useful predictions?
In other words if it is playable/cheatable then it's a bad benchmark.
 

NinoPino

Respectable
If the predicting benchmark isn't sufficiently general then how is it supposed to make any useful predictions?
In other words if it is playable/cheatable then it's a bad benchmark.
If you optimize specifically for that piece of code, and to do so, you program the compiler to recognise that specific benchmark, then this is not an optimization of the workflow, but of the specific benchmark.
 
If you optimize specifically for that piece of code, and to do so, you program the compiler to recognise that specific benchmark, then this is not an optimization of the workflow, but of the specific benchmark.
Yes, I realize and understand that, but how is a benchmark that uses such a specific piece of code any good to tell anything about any workflow? If that piece of code only exists inside that benchmark and nowhere else then it's a bad benchmark.
 

JamesJones44

Reputable
That's not what this is about.

SPEC gets submissions for an entire system. As such, they're usually submitted by OEMs and integrators. There's a natural tendency to use the compiler suite provided by the CPU maker, since those have all of the latest and greatest optimizations and tuning for the specific CPU model. That's where the trouble started.

SPEC has various rules governing the way systems are supposed to be benchmarked, in order to be eligible for submission. It's a little like the Guinness Book of World Records, or perhaps certain athletics bodies and their rules concerning official world records.
It's not different though. Whether they put an * in a PowerPoint saying "with Hyper-Threading disabled" or ship a custom compiler that cuts 100ms off a <insert benchmark here>, benchmarks from OEMs are typically tailored to one specific situation/setting/config/compiler/etc. and should always be taken with a grain of salt.

I know it's not supposed to be that way for SPEC, but just about every OEM out there works on ways to cheat the standards body. It's true even in other industries; look at all the emissions-defeat issues the auto industry had just a few years ago. In most cases, OEM metrics can't be trusted.
 

NinoPino

Respectable
Yes, I realize and understand that, but how is a benchmark that uses such a specific piece of code any good to tell anything about any workflow? If that piece of code only exists inside that benchmark and nowhere else then it's a bad benchmark.
Here with "piece of code" we are talking of a C program that do XML document conversion to other formats.
The only info I found about the type of optimization is this :
"... the compiler ... was performing a compilation that specifically improves the performance of the 523.xalancbmk_r / 623.xalancbmk_s benchmarks using a priori knowledge of the SPEC code and dataset to perform a transformation that has narrow applicability"
 
People who encourage and like the saying "if you ain't cheatin', you ain't trying" are part of the problem.

Any company that tries to cheat must be loathed, because it goes against consumers (i.e. you; yes, the ones reading this post, you). Simping for corporations is not good for the consumer. It never has been and never will be.

If Intel can demonstrate those optimizations carry over to real workloads outside of the benchmark, then it's a valid "optimization". As they're keen on using "accelerators", I wouldn't be surprised if they can also "accelerate" certain benchmarks' well-known behaviours.

Regards.

EDIT: Spelling.
 

bit_user

Titan
Ambassador
If the predicting benchmark isn't sufficiently general then how is it supposed to make any useful predictions?
Whether intentionally or not, you're playing a shell game with words, in order to make a claim unrelated to mine, while making it sound like we're talking about the same thing.

A benchmark exists to provide a prediction of how a system will perform similar tasks. What you're talking about is the compiler making predictions, except the compiler isn't even doing that much, since the benchmark is deterministic. What you're spinning as a virtue is actually no such thing.

It's like this: if a student successfully studies for an exam, they will have reviewed the relevant material by predicting what the test will cover and learning it. That would be a legitimate use of the term. What Intel did was basically have the compiler memorize the correct answers, so that it can do well on the test, but without understanding the material or being able to work through the problems. If you present it with a slightly different test that covers similar material, it wouldn't do nearly so well, because it doesn't actually understand how to solve the problems very well.
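
To make the "memorized answers" idea concrete, here is a purely hypothetical sketch of the shape of such a trick. This is not Intel's code and the names are invented; a real compiler would do the recognition as a pattern match on intermediate representation rather than a string compare, but the logic is the same:

```cpp
// Purely hypothetical illustration of a benchmark-targeted "optimization":
// recognize one specific program and substitute a hand-tuned path for it.
// All names and numbers are invented; this is not any real compiler's code.
#include <cstdio>
#include <string>

struct TranslationUnit {
    std::string fingerprint;  // stand-in for whatever a compiler would pattern-match on
    double cost;              // pretend "cost" of the generated code (lower = faster)
};

// A generic optimization: modest gains, but they apply to any similar code,
// so the benchmark result stays predictive of real workloads.
void run_general_optimizations(TranslationUnit& tu) { tu.cost *= 0.95; }

// A hand-tuned path built with a priori knowledge of one benchmark's code and
// dataset: a big win on the test, with no carry-over to anything else.
void emit_memorized_xalancbmk_path(TranslationUnit& tu) { tu.cost *= 0.40; }

void optimize(TranslationUnit& tu) {
    // The objectionable part: the transformation only fires when the compiler
    // recognizes that it is compiling the benchmark itself.
    if (tu.fingerprint == "523.xalancbmk")
        emit_memorized_xalancbmk_path(tu);
    else
        run_general_optimizations(tu);
}

int main() {
    TranslationUnit benchmark{"523.xalancbmk", 100.0};
    TranslationUnit real_xslt_app{"my_xslt_pipeline", 100.0};
    optimize(benchmark);
    optimize(real_xslt_app);
    std::printf("benchmark cost: %.1f, similar real app cost: %.1f\n",
                benchmark.cost, real_xslt_app.cost);
    return 0;
}
```

The benchmark looks great; anything that merely resembles it sees none of the benefit. That's the gap between studying and memorizing.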

In other words if it is playable/cheatable then it's a bad benchmark.
This is a false and arbitrary standard. Plenty of real world workloads are cheatable. To somehow make the benchmark not cheatable would change its fundamental nature. It would be like saying that we're going to deal with athletes taking steroids by completely removing athletics from the competition and making them compete at chess, instead.

The solution isn't to try to bend over backwards to make it difficult to cheat. Don't compromise the integrity or predictive power of the benchmark, but rather catch & punish cheaters.


For anyone who doesn't regularly follow these forums, they should be aware of your 100% perfect record supporting & defending Intel, while bashing their competitors.
 

bit_user

Titan
Ambassador
If that piece of code only exists inside that benchmark and nowhere else then it's a bad benchmark.
I half agree with you that if you manage to write some piece of code that's unlike anything else in common use, that would make it a bad benchmark. However, that's not what we're talking about, here.

The code in Xalan isn't so unlike other code in common usage, yet I think the optimizations are not only specific to Xalan, but probably Xalan processing the specific dataset used by the benchmark!
 

bit_user

Titan
Ambassador
benchmarks from OEMs are typically tailored to one specific situation/setting/config/compiler/etc. and should always be taken with a grain of salt.
That's what makes SPEC special. They provide standard benchmarks that can be used to make apples-to-apples comparisons between systems from different vendors.

I know it's not supposed to be that way for SPEC, but just about every OEM out there works on ways to cheat the standards body. It's true even in other industries; look at all the emissions-defeat issues the auto industry had just a few years ago. In most cases, OEM metrics can't be trusted.
They have rules to try and keep submissions honest. In both examples you're talking about, the cheaters were eventually found out and punished.

It sounds to me like you're making another argument that, because the benchmarks aren't absolutely free of manipulation or targeting, we shouldn't even try (i.e. making "perfect" the enemy of "good"). I think that would be sad. I have found SPEC2006 and SPEC2017 to be very valuable metrics, particularly in the way they were embraced and utilized by Anandtech. The world would've been poorer without them.
 
It's like this: if a student successfully studies for an exam, they will have reviewed the relevant material by predicting what the test will cover and learning it. That would be a legitimate use of the term. What Intel did was basically have the compiler memorize the correct answers, so that it can do well on the test, but without understanding the material or being able to work through the problems. If you present it with a slightly different test that covers similar material, it wouldn't do nearly so well, because it doesn't actually understand how to solve the problems very well.
Is it proven that this is what happens?! Wouldn't SPEC need to be able to look through the source code of the compiler to be sure?
The only other way would be to run counter-benchmarks with other apps and see if the performance changes, but if they did that, wouldn't they include that data?
 