News: Purported Intel Core i7-14700K Benchmarks Up to 20% Faster in Multi-Threaded Workloads

Pretty much all online games, and the PlayStation games I own on PC, will use all the cores available on the CPU.

However, most of the games I play won't even saturate all the cores of my 2600K backup PC, so there's no need to upgrade at the moment. Don't get me wrong, though: I'm glad these newly released CPUs have so many cores -- plenty of good CPUs to choose from when the time to upgrade comes.

Playing Star Rail right now.


There are a few games, like DOA 6, that do push the 2600K to the limit, which is why I play those on the 10600K.

______

HandBrake also uses all the cores available on the CPU when I do video encoding. Even with NVENC active, it will still use CPU cores.
_____

My 10600K encodes faster, and can also play demanding games at a higher resolution than the 2600K. So I guess more cores do matter? Glad Intel released the 14th gen with more cores. :)
 
It has more to do with the 100% IPC increase from Sandy Bridge to 10th gen, although 4 cores vs. 8 definitely helps. More than 8 cores doesn't translate into better gaming performance: even though a game will utilize more than 8, the utilization is generally very low, and it's less efficient than stacking secondary processes onto cores until they are fully utilized. So there is a limit on core-count usefulness past 8.
 

Just a small correction: the 10600K has 6 cores. So it's 4 cores vs. 6.
 
For realz??? A 50% increase I could believe. 100% single-thread performance I could believe. But 100% IPC between Sandy Bridge and Comet Lake is just going too far. I expect there's barely that much IPC increase between Sandy Bridge and Golden Cove.
EDIT: OOPS!!!! Forgot we were talking about Comet Lake and not Raptor Lake; it's only a ~30% IPC increase from Sandy to 10th gen. My bad.

Well, it depends on which source you are viewing. According to Intel, the IPC gain between Sandy Bridge and Raptor Lake is well over +100% (Intel as the source: Sandy-Ivy = 6%, Ivy-Haswell = 11%, Haswell-Broadwell = >5%, Broadwell-Skylake = another >5%, Skylake through 10th gen = 0%, 10th-11th gen = 19%, 11th-12th gen = another 19%, 12th-13th gen = 15%). If you compound those together, it comes to a +112% IPC increase from Sandy Bridge to Raptor Lake.
However, a more believable increase comes from Cinebench R15 scores with the CPUs locked to 3 GHz: there, Raptor Lake has 65% greater IPC (Sandy Bridge @ 3 GHz = 103, Raptor Lake @ 3 GHz = 170).
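For anyone who wants to check the math, here's a small standalone C sketch of the compounding (my illustration; the two ">5%" steps are taken as exactly 5%, which lands at about +111%, i.e. the quoted +112% within rounding):

```c
#include <stdio.h>

int main(void) {
    /* Intel's claimed per-generation IPC gains, in order:
     * Sandy->Ivy, Ivy->Haswell, Haswell->Broadwell, Broadwell->Skylake,
     * 10th->11th, 11th->12th, 12th->13th gen.
     * (Skylake through 10th gen contribute 0%, so they're omitted.) */
    double gains[] = { 0.06, 0.11, 0.05, 0.05, 0.19, 0.19, 0.15 };
    double total = 1.0;
    for (size_t i = 0; i < sizeof gains / sizeof gains[0]; i++)
        total *= 1.0 + gains[i];
    printf("Compounded Intel claims: +%.0f%%\n", (total - 1.0) * 100.0);

    /* The fixed-3GHz Cinebench R15 comparison gives the ratio directly. */
    printf("R15 @ 3 GHz: +%.0f%%\n", (170.0 / 103.0 - 1.0) * 100.0);
    return 0;
}
```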
 
Thanks for that!

I think the reason you have trouble compounding the generational IPC is that it's probably the median across a variety of benchmarks, and probably not even the same benchmarks every time. From one generation to the next, which benchmarks receive the greatest benefit will differ. If you compounded just the generational improvements for a single benchmark, that should work. But, compounding the median won't, because median-computation is a nonlinear operation.
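A toy example of that nonlinearity (made-up numbers, nothing to do with real CPUs): let two of three benchmarks each gain 30%, but in different generations.

```c
#include <stdio.h>

int main(void) {
    /* Made-up speedup factors for three benchmarks over two generations;
     * different benchmarks benefit in different generations. */
    double gen1[3] = { 1.30, 1.00, 1.00 };  /* median uplift: 1.00 (0%) */
    double gen2[3] = { 1.00, 1.00, 1.30 };  /* median uplift: 1.00 (0%) */

    /* Compounding the per-generation medians claims no gain at all... */
    printf("compounded medians: %.2f\n", 1.00 * 1.00);

    /* ...but compounding per benchmark gives {1.30, 1.00, 1.30},
     * whose median is a 30% uplift. */
    for (int i = 0; i < 3; i++)
        printf("benchmark %d total: %.2f\n", i, gen1[i] * gen2[i]);
    return 0;
}
```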
 
Absolutely agree; that's why I compared Intel's claimed IPC increase per generation with real Cinebench R15 scores. Besides, the caveat with every manufacturer's IPC claim is "up to" because, like you said, it's the median across 20-30 different applications.

And BTW, before anyone asks "why did you use R15 instead of R20 or R23?": R15 is the last revision that doesn't use AVX at all. Why does that matter? Because Sandy Bridge does not support AVX2 (Haswell and newer), which has a 256-bit-wide AVX execution unit and can double-pump regular 128-bit AVX instructions; that would skew results, since Sandy Bridge and Ivy Bridge only have 128-bit-wide AVX execution units.
 
And BTW, before anyone asks "why did you use R15 instead of R20 or R23?": R15 is the last revision that doesn't use AVX at all. Why does that matter?
Yeah, I even wrote something about trying to exclude ISA extensions, but then edited it out.

since Sandy Bridge and Ivy Bridge only have 128-bit-wide AVX execution units.
That's funny, because I know I read something to that effect. But just yesterday, I was reading about Zen 4 and saw ChipsAndCheese claim that Sandy Bridge had a full 256-bit implementation of AVX:


So, I checked AnandTech, and they also seemed to suggest that Sandy Bridge's AVX is 256 bits wide:

Then, I checked WikiChip, and it goes into quite a bit of detail about how AVX was plumbed into Sandy Bridge:

"Intel doubled the width of all the associated executed units. This includes a full hardware 256-bit floating point multiply, add, and shuffle - all having a single-cycle latency."

 
For Sandy Bridge and Ivy Bridge, floating point was indeed 256-bit; however, AVX integer was limited to a 128-bit execution cycle. AVX2 allowed 256-bit AVX/SSE integer operations to be executed in one pass instead of two, so AVX/SSE integer performance doubles with AVX2. AVX2 floating-point performance also increased vs. AVX by adopting the new Fused-Multiply-Add (FMA3) instruction. So yes, Sandy Bridge does have a 256-bit execution unit (my bad, bit_user, it's been a while since I looked at Sandy Bridge architecture break-downs; I appreciate you keeping me humble 😉 ), but only some instructions could be executed at full bit width.
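For illustration, a minimal C sketch of the FMA3 point (my sketch; intrinsic names are from Intel's intrinsics guide, and __FMA__ is the feature macro GCC/Clang define under -mfma):

```c
#include <immintrin.h>

/* Computes a*b + c on four doubles. With FMA3 (which shipped alongside
 * AVX2 in Haswell) it's one fused instruction with a single rounding
 * step; on AVX1-only parts like Sandy Bridge, it's a separate multiply
 * and add. */
__m256d mul_add(__m256d a, __m256d b, __m256d c) {
#ifdef __FMA__
    return _mm256_fmadd_pd(a, b, c);               /* one vfmadd*pd   */
#else
    return _mm256_add_pd(_mm256_mul_pd(a, b), c);  /* vmulpd + vaddpd */
#endif
}
```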
 
For Sandy Bridge and Ivy Bridge, floating point was indeed 256-bit; however, AVX integer was limited to a 128-bit execution cycle.
I believe there's no such thing as "AVX integer". The support for integer vector elements was added in the form of AVX2 (which supported fp64, as well), in a repetition of what happened with SSE and SSE2.

AVX2 allowed 256-bit AVX/SSE integer operations to be executed in one pass instead of two, so AVX/SSE integer performance doubles with AVX2.
According to WikiChip, the way Sandy Bridge executed 256-bit AVX instructions was by dispatching them down two 128-bit pipes.

"Intel solved this problem by cleverly dual-purposing the two existing 128-bit stacks during AVX operations to move full 256-bit values. For example a 256-bit floating point add operation would use the Integer SIMD domain for the lower 128-bit half and the FP domain for the upper 128-bit half to form the entire 256-bit value."

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client)#New_256-bit_extension

This is consistent with what I quoted above about Sandy Bridge having:

"a full hardware 256-bit floating point multiply, add, and shuffle - all having a single-cycle latency."

I actually think they're wrong about that latency figure. What I think they mean is that you can issue one 256-bit operation per cycle. The reason I think so is that Broadwell achieved considerable latency reductions in floating point arithmetic:

"FP multiplication instructions has reduced latency (3 cycles, down from 5). Affects AVX, SSE, and FP instructions"

https://en.wikichip.org/wiki/intel/microarchitectures/broadwell_(client)#Key_changes_from_Haswell

Anyway, what they actually did in Haswell, apart from the additional instructions in the AVX2 ISA extension, was to add, re-balance, and widen execution ports, so that the vector ports were each 256 bits wide. The excerpt is too big to quote, but here's a link to the precise paragraph:
 
I rebut with quotes directly from an Intel White Paper and Intel employee comments.
P.S. Vector integers are indeed AVX workloads, so "AVX integer" may not be official language, but it's technically correct nonetheless.

AVX2 (also known as Haswell New Instructions) expands most integer commands to 256 bits and introduces new instructions. They were first supported by Intel with the Haswell processor, which shipped in 2013.
AVX2 makes the following additions:
  • expansion of most vector integer SSE and AVX instructions to 256 bits
  • Gather support, enabling vector elements to be loaded from non-contiguous memory locations
  • DWORD- and QWORD-granularity any-to-any permutes
  • vector shifts
  • Fused-Multiply-Add (FMA3) support
-http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/


“The main advantage of new ISA of AVX2 is for integer code/data types – there you can expect up to 2x speedup”
-Igor_A_Intel: an Intel employee.
 
Last edited:
Different question: why is AVX NOT part of IPC?

A CPU with 128-bit-wide AVX is going to be slower than one with 256-bit-wide AVX; how much stuff you can process is basically the definition of IPC.
Instructions per cycle, or cycles per instruction: the more and wider hardware you have, the more processing the CPU will be able to do.
So why would you measure that by handicapping a CPU with software that doesn't use all the available hardware?
 
I rebut with quotes directly from an Intel White Paper and Intel employee comments.
P.S. Vector integers are indeed AVX workloads, so "AVX integer" may not be official language, but it's technically correct nonetheless.
Here's the list of AVX instructions. Now, tell me which ones operate on a vector of packed ints:


There's a handful which reference signed ints, but a close reading of their descriptions shows that they're merely data-movement instructions and don't act upon the values of the contents. You could (and I have) use them to move floating-point data, and the CPU doesn't know or care.

AVX2 (also known as Haswell New Instructions) expands most integer commands to 256 bits and introduces new instructions. They were first supported by Intel with the Haswell processor, which shipped in 2013.
AVX2 makes the following additions:
  • expansion of most vector integer SSE and AVX instructions to 256 bits
...
“The main advantage of new ISA of AVX2 is for integer code/data types – there you can expect up to 2x speedup”
-Igor_A_Intel: an Intel employee.
Yes, that's what I'm talking about. AVX1 doesn't have vector arithmetic on integer elements. On CPUs without AVX2, you have to use SSE2+ instructions for that, which are limited to 128 bits.

There's only one thing I was wrong about, and it's AVX1 lacking fp64 support. In fact, it does have it, as you can see from all the "pd" (short for "packed double") instructions in the list I linked above.
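To make that concrete, a minimal C sketch (my illustration; intrinsics are as documented in Intel's intrinsics guide, and the last function assumes an AVX2-enabled build):

```c
#include <immintrin.h>

/* With AVX1 alone, 256-bit vector arithmetic exists for floats/doubles. */
void add_pd_avx(const double *a, const double *b, double *out) {
    __m256d va = _mm256_loadu_pd(a);               /* 256-bit packed double */
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));  /* vaddpd on ymm */
}

/* ...but packed-integer adds fall back to the 128-bit SSE2 forms. */
void add_epi32_sse2(const int *a, const int *b, int *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));  /* paddd, 128-bit */
}

#ifdef __AVX2__
/* AVX2 is what introduces the 256-bit integer form. */
void add_epi32_avx2(const int *a, const int *b, int *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(va, vb));  /* vpaddd */
}
#endif
```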
 
Different question: why is AVX NOT part of IPC?
Well, if one is using the strict definition of Instructions Per Clock, then it wouldn't even matter what kind of instructions. An AVX-512 fsqrt would count the same as an int8 xor.

In common usage, it's taken to mean "clockspeed-normalized performance". Using that definition, then whether or not to allow for new instructions is really a matter of what question you're trying to answer. If we only want to know how a core's microarchitectural efficiency changed (i.e. things like branch prediction, instruction scheduling, decoding throughput, etc.), then allowing for new instructions distorts the metric. On the other hand, if we're interested in characterizing efficiency either specifically on vectorizable workloads or on a broad portfolio of workloads, then it's fair.
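To put the two senses side by side (standard definitions, my own summary; the fixed-3GHz Cinebench comparison earlier in the thread is an instance of the second):

```latex
\mathrm{IPC}_{\text{strict}} = \frac{\text{instructions retired}}{\text{clock cycles}},
\qquad
\mathrm{IPC}_{\text{common}} = \frac{\text{benchmark score}}{\text{clock frequency}}
```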

So, in the ideal world, I'd like to know the clockspeed-normalized performance both with and without allowing for new instructions. If I had to pick only one, then it would depend on which type of workload I'm interested in doing. Something like code compilation or database queries is going to be overwhelmingly scalar and integer. So, in that case, any generic IPC metrics which include vectorized workloads would over-predict the performance improvements on such jobs. On the other hand, if I'm doing image/signal processing or AI, any relative performance metrics that aren't overwhelmingly dominated by vectorizable code are likely to under-predict the performance on those jobs.

So, it really just depends on what question you're trying to answer. That's why there are lots of benchmarks and they often don't agree. In fact, if you had 2 benchmarks in a suite which were always highly-correlated, you could just get rid of one and maybe double-weight the other.

how much stuff you can process is basically the definition of IPC.
Not the literal or traditional definition, no. Sadly, people abused the term instead of creating a new one.
 
Well then, why does my x86 programming textbook dedicate an entire chapter to AVX packed-integer operands?

Chapter Title: “AVX Programming-Packed Integers”

First paragraph:
“In the previous chapter, you learned how to use the AVX instruction set to perform calculations using packed floating-point operands. In this chapter, you learn how to carry out computations using packed integer operands. Similar to the previous chapter, the first few source code examples in this chapter demonstrate basic arithmetic operations using packed integers. The remaining source code examples illustrate how to use the computational resources of AVX to perform common image processing operations, including histogram creation and thresholding.”
-Modern x86 Assembly Language Programming: pg 215-275


“256-bit integer operations are only added since AVX2, so you'll have to use 128-bit __m128i vectors for integer intrinsics if you only have AVX1.

AVX1 does have integer loads/stores, and intrinsics like _mm256_set_epi32. Nevertheless if the int value range fits in 24 bits then you can use float instead. However note that if you need the exact result or the low bits of the result then you'll have to convert float to double, because a 24x24 multiplication will produce a 48-bit result which can be only stored exactly in a double. At that point you still only have 4 elements per vector, and might have been better off with XMM vectors of int32. AVX1 has VEX encodings of 128-bit integer operations so you can use them in the same function as 256-bit FP intrinsics without causing SSE-AVX transition stalls.”
-stack overflow

“Data types compatible with both AVX/AVX2:

__m128: 128-bit vector containing 4 floats
__m128d: 128-bit vector containing 2 doubles
__m128i: 128-bit vector containing integers
__m256: 256-bit vector containing 8 floats
__m256d: 256-bit vector containing 4 doubles
__m256i: 256-bit vector containing integers”

-CodeProject

IDK why you are refuting my justification for using a non-AVX version of Cinebench to determine real-world IPC when you yourself are saying the same thing as I am. AVX2-capable CPUs would have a 2x integer performance advantage over Sandy Bridge if the program took advantage of 256-bit integer operations??? Maybe it's because I failed to recognize that SSE was the precursor and AVX was the next evolution in SIMD, built on top of SSE. I apologize for assuming SSE integer execution within the AVX1 frame set is not the same as AVX2 integer support, but methinks we be splitting hairs on that one, since SSE and AVX are very much associated with one another. Plus, all the reputable sources, including Intel's own papers and employees, say AVX operates on integer vectors…

Anyway, I hope my stubbornness in this topic does not change your opinion of me; I am rather enjoying our "verbal jousting" on this subject, haha. I have much respect for you!
 
I encoded with HandBrake (using the same settings) on the 2600K and 10600K, to compare CPU core utilization.

The 2600K needs to work a little harder, with all cores always near 100% utilization, while the 10600K mostly sits at 80-90%.

[Screenshot: 2600K core utilization]

[Screenshot: 10600K core utilization]

______

I also played DOA 6. Here, the 10600K has all graphics settings maxed, while the 2600K uses lower graphics settings; the 2600K cannot use DOA 6's max settings, as it will lag and become unplayable. I played for nearly 30 minutes on both, so that HWMonitor would show more stable results.

Perhaps due to having fewer cores, the 2600K has all cores utilized at 50% and above. While the 10600K does have one core utilized at 70%, the rest of its cores are more relaxed at 10-40%.

[Screenshot: 2600K in DOA 6]

[Screenshot: 10600K in DOA 6]

__________

This 14th-gen Intel in the thread title looks impressive with the many cores it has. I assume it will handle encoding and gaming a lot more easily.

But it would probably be 16th-gen or 17th-gen Intel before I consider buying a new PC. The 10600K can still do the chores I want it to do. :)
 
Well then, why does my x86 programming textbook dedicate an entire chapter to AVX packed-integer operands?
Because they're using the term generically, to apply to AVX + AVX2, or SSEn + AVX. A tutorial isn't the same thing as an ISA reference. Wording can be fuzzy; their actual ISA definition is not.

Did you actually look at the link I provided? It filters instructions by ISA extension, and AVX(1) has no packed-integer arithmetic.

IDK why you are refuting my justification for using a non-AVX version of Cinebench to determine real-world IPC when you yourself are saying the same thing as I am.
I guess you're talking to Terry, now? I didn't think I refuted that.

Anyway, I hope my stubbornness in this topic does not change your opinion of me,
You can't worry about what anyone on the internet thinks, much less what they think of you. That said, I think you're a cut above.
 
For the most part, yes. I don't care about most people on the internet, but people who take the time to have an intellectual conversation with me I consider real people, worthy of respect.

I understand what you mean: SSE can be called upon through AVX. I stand corrected.
 