News: Rocket Lake i7 Loses to Apple M1 in Single-Threaded Performance

How can you even compare these chips - one is a laptop CPU with TDP of ~30W, the other is a desktop behemoth with TDP of 125W and actual use around 250W. Yet the little guy trounces the goliath. This is a complete wipeout by Apple. If M1 was allowed to use 250W, it would be much, much faster leaving all current desktop offerings in the dust. It's such a shame we cannot get a PC powered by M1, only a Mac.
If you pushed 250W through the M1 it would probably melt...
We have seen many times that a low-powered CPU does not scale up much when given more power.
 
To hsl: on GPUs, and with ucode on CPUs, instruction sets are broken down into a series of micro-ops. There is a tremendous amount of ucode that is extremely hardware dependent, e.g. addressing modes and virtual memory, or AVX instructions and machine-specific registers for things like interrupts and error-code trap recovery. There may be no equivalent hardware, and emulating it can be time consuming or near impossible.
This would only really matter if you're trying to run system software that expects ISA-specific behavior. But Rosetta wasn't meant to run system software because... well, why would you do that? The system software that runs natively on the hardware platform should already exist. So the apps Rosetta is meant to run don't actually care what CPU they're running on, only that the things they want to run get run. Sure, an app trying to use ISA-specific features like AVX will slow down because there doesn't seem to be an equivalent on M1, but the app won't really care as long as the system provides something to make it happy.
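For illustration, here is a minimal sketch (hypothetical app code of my own, not anything Apple ships; the `sum` functions are made up) of the runtime dispatch pattern a lot of x86 apps already use. Under Rosetta 2 the AVX2 check simply comes back negative, so the portable path runs and the app keeps working, just without the wide-vector speedup:

```c
/* Sketch of runtime feature dispatch. Built as an x86 binary and run under
 * Rosetta 2, the AVX2 check reports "not supported", so the scalar path runs. */
#include <stddef.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

__attribute__((target("avx2")))
static void sum_avx2(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {                 /* 8 floats per iteration */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
#endif

static void sum_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)               /* portable fallback path */
        out[i] = a[i] + b[i];
}

void sum(const float *a, const float *b, float *out, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {        /* false under Rosetta 2 */
        sum_avx2(a, b, out, n);
        return;
    }
#endif
    sum_scalar(a, b, out, n);
}

int main(void)
{
    float a[3] = {1, 2, 3}, b[3] = {4, 5, 6}, out[3];
    sum(a, b, out, 3);
    printf("%g %g %g\n", out[0], out[1], out[2]);
    return 0;
}
```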

I understand how the allocation of resources works and when hyperthreading kicks in, or doesn't. And you're right: if a core is truly saturated, there is nothing left for the hyperthread. All I am saying is that workloads, even benchmarks, frequently don't use all of those resources, as cores are designed in such a fashion that they tend to have some resources left for the hyperthread. But you're not wrong in what you said as you stated it.

On your edit... that was the whole point of my post to begin with. Using a test that only spawns a single thread on a hyperthreaded/simultaneous multithreading CPU core is, by its nature, potentially not using the entirety of the CPU core's resources/compute power. There is no question that the M1 has extremely potent single-threaded grunt. It was never my intent to imply otherwise. My point was that I didn't feel the comparison was an apples-to-apples one and therefore not entirely a fair comparison to make. I would just like to see some benchmarks where the M1 goes against an x86 CPU's entire core, both threads. ST absolutely matters, but so does using the whole core. I think having a comparison of both ST and total core performance would be more revealing about the position Apple is actually in, and x86 by extension. That is not an unreasonable stance to take.
If the single threaded test cannot saturate a core, then that's a problem with the implementation of the test, not how the test is designed. It's not like simultaneous multithreading reserves portions of the CPU execution units just in case there happens to be another thread to run. Because if it did that, well, we'd get Bulldozer.
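To make the "does one thread saturate a core?" question concrete, here is an illustrative toy microbenchmark of my own (not the test from the article): both loops do the same number of FP additions, but the first is one long dependency chain and leaves execution capacity idle, while the second uses independent accumulators and gets much closer to filling the core. SMT can only recover whatever idle capacity is left over.

```c
/* Toy illustration only. Build with -O2 and without -ffast-math (which would
 * let the compiler re-associate the sums and spoil the comparison). */
#include <stdio.h>
#include <time.h>

#define N 200000000UL

static double chained(void)
{
    double s = 0.0;
    for (unsigned long i = 0; i < N; i++)
        s += 1e-9;                      /* each add waits on the previous one */
    return s;
}

static double independent(void)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (unsigned long i = 0; i < N; i += 4) {   /* four chains in flight */
        s0 += 1e-9;
        s1 += 1e-9;
        s2 += 1e-9;
        s3 += 1e-9;
    }
    return s0 + s1 + s2 + s3;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = seconds();
    volatile double a = chained();
    double t1 = seconds();
    volatile double b = independent();
    double t2 = seconds();
    printf("chained:     %.2f s\nindependent: %.2f s\n", t1 - t0, t2 - t1);
    (void)a; (void)b;
    return 0;
}
```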
 
This would only really matter if you're trying to run system software that expects ISA-specific behavior. But Rosetta wasn't meant to run system software because... well, why would you do that? The system software that runs natively on the hardware platform should already exist. So the apps Rosetta is meant to run don't actually care what CPU they're running on, only that the things they want to run get run. Sure, an app trying to use ISA-specific features like AVX will slow down because there doesn't seem to be an equivalent on M1, but the app won't really care as long as the system provides something to make it happy.

And many, many apps still use these instructions and have no replacement. Why? Because these instructions make them faster!

Apple has to do native recompiles for a reason.
 
And many, many apps still use these instructions and have no replacement. Why? Because these instructions make them faster!

Apple has to do native recompiles for a reason.
If you're talking about things like SSE and AVX, then ARM has Neon and Helium. And even then, Intel made a report on how often their SIMD/Vector instructions are actually used. None of them hit more than 2.5%. So I'm under the impression that except in very few cases, there won't be a significant performance hit due to the lack of a mapping between x86 ISA extensions and ARM ISA extensions.

I mean yes, you should native recompile when possible, but it's not like most of the time running x86 apps on M1 is going to be a problem. Plus if said app is using a lot of Apple's API and Rosetta 2 is smart enough to figure this out, then it's not even running x86 at that point.
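As a rough illustration of why a mapping usually exists (a hand-written sketch of my own, not how Rosetta 2 actually translates anything), here is the same 4-wide float add expressed with SSE intrinsics and with NEON intrinsics:

```c
/* Sketch only: the same 4-wide float addition with x86 SSE intrinsics and
 * with ARM NEON intrinsics. Most common vector operations have a counterpart
 * like this; it's the wider or more exotic x86 extensions that don't. */
#include <stdio.h>

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>

static void add4(const float *a, const float *b, float *out)
{
    float32x4_t va = vld1q_f32(a);          /* NEON: load 4 floats */
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));      /* NEON: add and store */
}
#elif defined(__SSE__)
#include <xmmintrin.h>

static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* SSE: load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* SSE: add and store */
}
#else
static void add4(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++)             /* portable fallback */
        out[i] = a[i] + b[i];
}
#endif

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];
    add4(a, b, out);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```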
 

ginthegit

Comparing a RISC and a CISC chip on a benchmarking suite means literally nothing.

Once they test the Apple chip on a real-life product or game, that will give a much better metric of real performance.

For a start, RISC must do more work to achieve the same as CISC, and this may impact the result. It does not actually mean the same processing work has been done, just more activity on a set task.

Let's see a real performance metric before we get all excited or cognitively dissonant.
 
How can you even compare these chips - one is a laptop CPU with TDP of ~30W, the other is a desktop behemoth with TDP of 125W and actual use around 250W. Yet the little guy trounces the goliath. This is a complete wipeout by Apple. If M1 was allowed to use 250W, it would be much, much faster leaving all current desktop offerings in the dust. It's such a shame we cannot get a PC powered by M1, only a Mac.
While the i7 can use MUCH more power, when it comes to a single thread the power draw isn't very high. For example, Tiger Lake at 35 W with a 55 W boost can provide higher ST performance than the 10900K, even though the 10900K can draw 250 W.
 
Comparing a RISC and a CISC chip on a benchmarking suite means literally nothing.

Once they test the Apple chip on a real-life product or game, that will give a much better metric of real performance.

For a start, RISC must do more work to achieve the same as CISC, and this may impact the result. It does not actually mean the same processing work has been done, just more activity on a set task.

Let's see a real performance metric before we get all excited or cognitively dissonant.
This is true if the benchmark is made for x86/CISC, but if the bench that runs on RISC is made especially for RISC, then it will actually do less work, since that's the whole point of RISC: to do any supported instruction in the fewest clocks possible (on a single clock whenever possible).

But then again, seeing the differences between architectures is one of the main reasons for doing benchmarks.
 
This is true if the benchmark is made for x86/CISC, but if the bench that runs on RISC is made especially for RISC, then it will actually do less work, since that's the whole point of RISC: to do any supported instruction in the fewest clocks possible (on a single clock whenever possible).

But then again, seeing the differences between architectures is one of the main reasons for doing benchmarks.
If this were the '90s, then sure, there were big differences between RISC and CISC. However, over the last 20 years CISC has become more RISC-like and RISC has become more CISC-like.
 
Comparing a RISC and a CISC chip on a benchmarking suite means literally nothing.

Once they test the Apple chip on a real-life product or game, that will give a much better metric of real performance.

For a start, RISC must do more work to achieve the same as CISC, and this may impact the result. It does not actually mean the same processing work has been done, just more activity on a set task.

Let's see a real performance metric before we get all excited or cognitively dissonant.
The problem with testing a "real world" application is that those developers may not really care about fine-tuning the performance of their apps. As long as the app "just works" and nobody's complaining, the performance figures may vary depending on the quality of the port. It's like saying PCs suck because they can't run a shoddily done console port as well as the console. Benchmark utilities are carefully crafted to make sure they're running the same kinds of workloads that are not dependent on any major feature, API, or whatever.

Also, the difference between RISC and CISC is largely academic now. Modern implementations just decode the ISA into micro-ops anyway. And I don't know where you're getting "RISC must do more work," but if you're going by how many more lines of assembly there are in a RISC program vs. a CISC one, well, CISC programs simply hide details behind a more "complex" instruction. Sure, RISC can't do an add operation on two memory locations and must explicitly issue load and store instructions, but that doesn't mean the CISC processor isn't doing the same things behind your back when you do an add on two memory locations.
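To make that concrete, here is a tiny sketch of my own; the assembly in the comments is hand-written to show the typical shape of the output, not captured from a real compiler run:

```c
/* Sketch: one C statement, two ISAs. The x86 encoding hides the
 * load/modify/store inside a single instruction; a load-store RISC
 * spells the same steps out explicitly. */
#include <stdio.h>

void bump(int *counter)
{
    *counter += 5;
    /* x86-64 (roughly):            AArch64 (roughly):
     *   add dword ptr [rdi], 5       ldr w8, [x0]
     *                                add w8, w8, #5
     *                                str w8, [x0]
     * Internally the x86 core still splits its one instruction into
     * load + add + store micro-ops, so the "extra work" mostly shows
     * up in instruction count, not in what the hardware actually does. */
}

int main(void)
{
    int c = 1;
    bump(&c);
    printf("%d\n", c);   /* prints 6 */
    return 0;
}
```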
 

atomicWAR

How can you even compare these chips - one is a laptop CPU with TDP of ~30W, the other is a desktop behemoth with TDP of 125W and actual use around 250W. Yet the little guy trounces the goliath. This is a complete wipeout by Apple. If M1 was allowed to use 250W, it would be much, much faster leaving all current desktop offerings in the dust. It's such a shame we cannot get a PC powered by M1, only a Mac.

Not an unfair point of view, though I would want to see numbers rather than make assumptions about higher-wattage performance. You're likely not wrong in certain workloads, though.
 

funguseater

I'm sure it slots in exactly where it does; I mean, it's 200 points above my 8086K, but that's been out since 2017. I'll only be impressed if I can run my choice of OS, software, and peripherals, like GPUs, but it is Apple, so it will not happen.
 

domaru

This test is both single-threaded and synthetic, which limits its usefulness when comparing against different CPU generations, not to mention architectures. I would use it only to compare Intel to Intel and AMD to AMD. In other scenarios I would just find a real-life single-thread scenario. I tried to find one, and there are not many of these anymore. The closest I could find was 7-Zip with a thread limit (but why would you do that?). Unfortunately I couldn't find out whether it was running on Rosetta or not, but it scored similarly to Intel's i7-8559U in single thread and the Ryzen 5 4600U with unrestricted threads.
 

ginthegit

This is true if the benchmark is made for x86/CISC, but if the bench that runs on RISC is made especially for RISC, then it will actually do less work, since that's the whole point of RISC: to do any supported instruction in the fewest clocks possible (on a single clock whenever possible).

But then again, seeing the differences between architectures is one of the main reasons for doing benchmarks.

The difference between CISC and RISC is not academic. The two have different execution registers, and it takes time to fill up those registers. CISC usually has 4 general-purpose registers and an accumulator register, which can be complex to understand in terms of writing code for it. Direct assembly code is so different from a high-level language, which is why I said what I said.

With CISC, in a single call to the instruction register the CPU can carry out a string of functions all in a row, without having to make a further call on that instruction. The decoding or encoding is simply passed through a chain of preset functions, which negates the need to keep instructing the execution and command side of the core, repeatedly making multiple calls to run the same function as RISC must. This can cause latency that is inherent in RISC, because each execution, or part of an instruction, has to be decoded as an individual basic instruction. RISC chooses to keep the encode and decode aspect simple, which means that simple tasks like simple math algorithms run better on RISC, because its register core is better suited than the AX/BX loads of an 8086-style core structure. However, a call like a Fourier transform with multiple coefficients (done all the time in general-purpose computing) would run better on CISC.

So a benchmark suite can favour one over the other simply by favouring a complex instruction set or just smacking out simple instructions in a predictable way. A graphical suite like 3DMark will run a complex set of instructions, but in a linear way, whereas a game complicates that by having to switch between different complex tasks.

The last problem is simply the OS. A benchmark running on an optimised operating system, with Apple helping to streamline its development, is always going to perform better than on a bloated OS like Windows 10, and that adds a further unfair element to the comparison. In some cases, some Linux distros are 50% faster than Windows 10 in some games and apps.
 
The difference between CISC and RISC is not academic. The two have different execution registers, and it takes time to fill up those registers. CISC usually has 4 general-purpose registers and an accumulator register, which can be complex to understand in terms of writing code for it. Direct assembly code is so different from a high-level language, which is why I said what I said.

With CISC, in a single call to the instruction register the CPU can carry out a string of functions all in a row, without having to make a further call on that instruction. The decoding or encoding is simply passed through a chain of preset functions, which negates the need to keep instructing the execution and command side of the core, repeatedly making multiple calls to run the same function as RISC must. This can cause latency that is inherent in RISC, because each execution, or part of an instruction, has to be decoded as an individual basic instruction. RISC chooses to keep the encode and decode aspect simple, which means that simple tasks like simple math algorithms run better on RISC, because its register core is better suited than the AX/BX loads of an 8086-style core structure. However, a call like a Fourier transform with multiple coefficients (done all the time in general-purpose computing) would run better on CISC.

So a benchmark suite can favour one over the other simply by favouring a complex instruction set or just smacking out simple instructions in a predictable way. A graphical suite like 3DMark will run a complex set of instructions, but in a linear way, whereas a game complicates that by having to switch between different complex tasks.

The last problem is simply the OS. A benchmark running on an optimised operating system, with Apple helping to streamline its development, is always going to perform better than on a bloated OS like Windows 10, and that adds a further unfair element to the comparison. In some cases, some Linux distros are 50% faster than Windows 10 in some games and apps.
Ummmmm, x86 has 8 general-purpose registers, x86-64 has 16 general-purpose registers, and typical RISC designs have around 32. Since the introduction of x86-64, the number of registers hasn't been a big problem on x86 CPUs. Also, during that time RISC and CISC designs started to converge to the point that they are almost indistinguishable now.
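As a rough sketch of why the raw register count rarely decides things on its own (a toy example of my own, nothing measured): give a loop more live values than x86-64's 16 GPRs and the compiler spills the extras to the stack, which normally sits in L1 cache, so the cost is real but modest; AArch64's 31 GPRs just push that spill point out a bit further.

```c
/* Toy sketch: a loop with 20 live accumulators. On x86-64 (16 GPRs, some of
 * which hold the loop counter and pointer) a few of these typically get
 * spilled to the stack; on AArch64 (31 GPRs) they usually all stay in
 * registers. The spill traffic hits L1, so it costs something, but register
 * count alone rarely dominates performance. */
#include <stdio.h>

long churn(const long *data, long n)
{
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0, a4 = 0, a5 = 0, a6 = 0, a7 = 0,
         a8 = 0, a9 = 0, a10 = 0, a11 = 0, a12 = 0, a13 = 0, a14 = 0,
         a15 = 0, a16 = 0, a17 = 0, a18 = 0, a19 = 0;

    for (long i = 0; i + 20 <= n; i += 20) {
        a0  += data[i];      a1  += data[i + 1];  a2  += data[i + 2];
        a3  += data[i + 3];  a4  += data[i + 4];  a5  += data[i + 5];
        a6  += data[i + 6];  a7  += data[i + 7];  a8  += data[i + 8];
        a9  += data[i + 9];  a10 += data[i + 10]; a11 += data[i + 11];
        a12 += data[i + 12]; a13 += data[i + 13]; a14 += data[i + 14];
        a15 += data[i + 15]; a16 += data[i + 16]; a17 += data[i + 17];
        a18 += data[i + 18]; a19 += data[i + 19];
    }
    return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10 + a11 +
           a12 + a13 + a14 + a15 + a16 + a17 + a18 + a19;
}

int main(void)
{
    long data[40];
    for (int i = 0; i < 40; i++)
        data[i] = i;
    printf("%ld\n", churn(data, 40));   /* 0 + 1 + ... + 39 = 780 */
    return 0;
}
```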
 

ginthegit

Ummmmm, x86 has 8 general-purpose registers, x86-64 has 16 general-purpose registers, and typical RISC designs have around 32. Since the introduction of x86-64, the number of registers hasn't been a big problem on x86 CPUs. Also, during that time RISC and CISC designs started to converge to the point that they are almost indistinguishable now.

I am talking in terms of reserved versus personal use. If you are, say, doing a simple mathematical operation, AX and BX will suffice, as AX not only does part of the calculation but also affords itself to be the answer too. So yes, there are 8 possible general-purpose registers that can be programmed with assembly code, but not when we are talking about high-level compilers. Also, some of those registers are set aside for functions other than general-purpose use. And in general, in x64 programming there are not 16 usable registers, as AH and AL are combined into one and used as a full x64 register... but yes, you can separate them and control them individually if you still program in assembler. Do you? No! And that is my whole point: what you can do with them in assembly is not what a compiler does in practice, as some are reserved for timers etc. And yes, I can program in both assembly and C. So your point is noted, but also irrelevant in this case, as I am referring to how the compiler talks to the registers in terms of what you control. And RISC communicates with its GPRs differently too.

My point is that if the benchmark counts CPU activity instead of objective-driven tasks that test both CISC and RISC advantages (like games do), then it is a pointless comparison. The real question is (as it is with 3DMark): how long does it take for each processor to complete a rendered image with physics calculations and heavy calls created on the fly by a user?

Does this benchmark system do this, or does it just run some mathematical equations that favour RISC because of their simplicity?

Not only that, Intel's foul practices against AMD over the last two decades showed that objective benchmarking was not possible, due to compiler hampering and paying game companies to optimise for Intel only.