News Zen 5 SMT-focused testing suggests Intel made a mistake ditching Hyper-Threading on Lunar Lake

Trying to make any overall SMT efficiency conclusions based on a laptop platform seems useless. All you can really say is that when the CPUs are heavily power limited it's a net gain.

It's looking like Intel's strategy moving forward is using the E-cores on the client side of things in place of SMT. They're more space efficient and allow the P-cores to be smaller as well. It's a move that makes sense from a business perspective since they're leveraging the E-cores in enterprise and low power parts as well.
 

bit_user

Titan
Ambassador
The article said:
The Phoronix benchmarks demonstrate that Zen 5 and Zen 5c benefit massively from multi-threading technology. In the case of the Ryzen AI 9 HX 370, AMD is only giving up 2% of its power to extract a very impressive 18% more performance from the chip, significantly improving efficiency.

Ironically, Intel removed Hyper-Threading in Lunar Lake to improve performance efficiency.
Mike Clark said the dual-decoder microarchitecture of Zen 5 works differently between single-threaded and SMT modes. When executing multiple threads, each thread gets its own decoder. He also said there are a couple other per-thread resources, but I forget which and I think the decoder is really the main one.

In other words, I think this doesn't invalidate what Intel said about Hyper-Threading in Lion Cove. It just says that while Intel went in the direction of eliminating SMT, AMD went in the direction of understanding its bottlenecks and optimizing them.
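
For reference, the article's figures quoted above (18% more performance for 2% more power with SMT enabled) can be turned into a perf/W change with a quick calculation. This is only a minimal sketch: the function name is made up for illustration, and it takes the two percentages at face value as averages across the tested workloads.

```python
def perf_per_watt_gain(perf_gain: float, power_gain: float) -> float:
    """Fractional change in perf/W, given fractional changes in perf and power."""
    return (1.0 + perf_gain) / (1.0 + power_gain) - 1.0


# Figures as quoted from the article: +18% performance, +2% power with SMT on.
gain = perf_per_watt_gain(0.18, 0.02)
print(f"perf/W change with SMT on: {gain:+.1%}")
```

Because power barely moves, the efficiency gain ends up only slightly below the raw performance gain.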
 

bit_user

Titan
Ambassador
The article said:
Intel says that removing Hyper-Threading allowed its designers to squeeze a 30% improvement in performance per power per area out of the Lion Cove P-cores.
To be clear, the 30% number is some kind of weird metric they concocted. IMO, it's just an exercise in claiming bigger numbers, rather than anything terribly useful.

They said a single thread has a 15% perf/W and a 10% perf/area advantage vs. a single thread running on an equivalent core that's HT-capable.

[Image: Intel slide]


They said multithreaded apps have 5% better perf/W vs. a comparable core running 2 threads. However, on perf/area, the non-HT core is 15% less efficient than a comparable hyperthreaded core.

[Image: Intel slide]


It really would've been nice if you'd quoted these slides more precisely, or just posted them for people to see the key points I just outlined.

Anyway, whether HT makes sense is really a question of whether you're optimizing for lightly-threaded workloads and prioritizing power-efficiency, or focused mainly on multi-threaded workloads and optimizing for perf/$. This aligns with Intel's decision to remove it from the client version of Lion Cove, but to retain it in the server variant of the P-core.
 

bit_user

Titan
Ambassador
Trying to make any overall SMT efficiency conclusions based on a laptop platform seems useless.
Power does complicate the picture, but then multithreaded workloads tend to be power-limited (or thermally-limited - basically the same thing) almost no matter where they're run!

All you can really say is that when the CPUs are heavily power limited it's a net gain.
Intel is claiming SMT actually hurts power-efficiency. So, showing a gain on a more heavily power-limited platform is an even stronger endorsement of Zen 5's implementation!
 
I think this slide is meant to address 1-thread vs. 2-thread efficiency on the same HT-capable core:
It's still "projected" and "best-case" rather than an applicable overall statement and they're still prominently pushing area savings.

SMT is just one of those things where efficiency and cost can swing widely depending on the workload. I think the AMD enterprise chips will probably show similar gains to the laptop, but it's the desktop parts I'm most curious about, as they have a lot of power headroom relative to the number of cores.
 

Pierce2623

Prominent
AMD recently did an interview on Chips and Cheese where they specifically said they don’t add extra resources or larger structures for SMT. They design the architecture so that it can use the full resources of the core in 1T mode, since every core gets loaded before any SMT is engaged.
 

Pierce2623

Prominent
Mike Clark said the dual-decoder microarchitecture of Zen 5 works differently between single-threaded and SMT modes. When executing multiple threads, each thread gets its own decoder. He also said there are a couple other per-thread resources, but I forget which and I think the decoder is really the main one.

In other words, I think this doesn't invalidate what Intel said about Hyper-Threading in Lion Cove. It just says that while Intel went in the direction of eliminating SMT, AMD went in the direction of understanding its bottlenecks and optimizing them.
He also specifically said in the same interview that the core can use all of its resources in 1T mode and they didn’t overbuild any structures to improve SMT performance. 1T still uses the full 8-wide decode; it just doesn’t go into 2-thread operation. SMT is PURELY about trying to keep all the ALUs occupied and never stalled waiting on data, as that greatly improves efficiency.
 

TheSecondPower

Distinguished
One commenter on Phoronix said that turning off SMT on a Zen 5 doesn't actually power off its resources, and if that's true then this test does nothing to test the power savings that could be realized by removing SMT entirely.

Moreover, AMD's SMT since Zen has usually been considered a bigger boost to threaded workloads than Intel's Hyper-Threading, so Intel has less to lose by turning it off.

Lastly, the Phoronix article doesn't touch the theory behind Intel's plan with Lunar Lake. If I open Task Manager on my Ryzen 1800X during a moderately-threaded workload, every other logical core will be busy. The OS assigns work to the 8 physical cores first and only then begins to assign work to the "logical cores". SMT is useless until that 9th thread is scheduled. On Meteor Lake, work will be assigned to the 6 big cores and 8 little cores and so hyperthreading is useless until the 15th thread is scheduled.
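
The "physical cores first" behaviour described above can be sketched as a toy placement model. This is an illustration of the simplified picture in this post, not how any real OS scheduler is implemented; the function name and the split into SMT-capable and plain cores are assumptions made for the example.

```python
def place_threads(n_threads: int, smt_cores: int, plain_cores: int = 0) -> tuple[int, int]:
    """Toy model: give every core one thread before any SMT sibling is used.

    Returns (threads running alone on a core's first hardware thread,
             threads that had to land on an SMT sibling).
    """
    primary = min(n_threads, smt_cores + plain_cores)
    siblings = min(max(n_threads - primary, 0), smt_cores)
    return primary, siblings


# Ryzen 1800X: 8 SMT-capable cores.
print(place_threads(8, smt_cores=8))                  # (8, 0)
print(place_threads(9, smt_cores=8))                  # (8, 1)

# Meteor Lake as described above: 6 P-cores with HT plus 8 E-cores without.
print(place_threads(14, smt_cores=6, plain_cores=8))  # (14, 0)
print(place_threads(15, smt_cores=6, plain_cores=8))  # (14, 1)
```

In this model SMT only starts to matter at the 9th thread on the 1800X and at the 15th on the 6P+8E configuration, which matches the thread counts given above.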

Now Lunar Lake only has 4 big and 4 little cores, and the little cores aren't on the same ring bus nor L3 cache so the OS is going to try to keep related threads on only one type of core at a time, so the performance loss will probably be a little bigger for Lunar Lake than for Meteor Lake. But Lunar Lake's little cores will be a lot faster than Meteor Lake's. Intel said that multithreading improves performance by 20% and increases power consumption by 10%. Lunar Lake is going into low-power devices, where most users will spend 90% of their time running light workloads and will be wanting long battery life and quiet fans. Which is better, 10% less power consumption 90% of the time, or 20% more performance 10% of the time?
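
To put rough numbers on that closing question, here is a minimal time-weighted sketch. It takes the framing above literally: the 10% power saving applies only during the light-workload share of time, the 20% performance gain applies only during the heavy share, and everything is weighted by time. The function and parameter names are made up for the example, and none of this is measured data.

```python
def weighted_tradeoff(light_share: float = 0.90,
                      power_saving: float = 0.10,
                      perf_gain: float = 0.20) -> tuple[float, float]:
    """Return (average power of the no-SMT design relative to the SMT design,
               average performance of the SMT design relative to the no-SMT design)."""
    heavy_share = 1.0 - light_share
    # No-SMT design: saves power during light work, assumed equal otherwise.
    avg_power_no_smt = light_share * (1.0 - power_saving) + heavy_share * 1.0
    # SMT design: faster during heavy work, assumed equal otherwise.
    avg_perf_smt = light_share * 1.0 + heavy_share * (1.0 + perf_gain)
    return avg_power_no_smt, avg_perf_smt


power, perf = weighted_tradeoff()
print(f"no-SMT average power: {power:.2f}x")
print(f"SMT average performance: {perf:.2f}x")
```

Under these assumptions the no-SMT design averages about 9% less power, while the SMT design averages about 2% more performance. Changing `light_share` shows how quickly the balance shifts for users who spend more time in heavy workloads.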
 

bit_user

Titan
Ambassador
One commenter on Phoronix said that turning off SMT on a Zen 5 doesn't actually power off its resources, and if that's true then this test does nothing to test the power savings that could be realized by removing SMT entirely.
That's a false experiment, because nearly all of a core's resources are shared. In fact, the whole point of SMT is resource-sharing.

Lastly, the Phoronix article doesn't touch the theory behind Intel's plan with Lunar Lake. If I open Task Manager on my Ryzen 1800X during a moderately-threaded workload, every other logical core will be busy. The OS assigns work to the 8 physical cores first and only then begins to assign work to the "logical cores". SMT is useless until that 9th thread is scheduled. On Meteor Lake, work will be assigned to the 6 big cores and 8 little cores and so hyperthreading is useless until the 15th thread is scheduled.
Phoronix tested this on a CPU with 12 cores and 24 threads. That's not much different than the 6P+8E scenario you outlined (although I think you forgot about the 2 LP E-cores, but never mind them). For the benchmarks to have shown a benefit, the workloads he tested must use > 12 threads. That's not uncommon in tasks like compiling, rendering, video compression, etc.

Lunar Lake is going into low-power devices, where most users will spend 90% of their time running light workloads and will be wanting long battery life and quiet fans. Which is better, 10% less power consumption 90% of the time, or 20% more performance 10% of the time?
If users of thin-and-light laptops really had no need for more than 8 threads, I guess it would mean there are a whole lot of bad marketing departments out there, because that's not where thin-and-light laptops currently max out.