News Zen 5 SMT-focused testing suggests Intel made a mistake ditching Hyper-Threading on Lunar Lake

Status
Not open for further replies.
I think Intel is just trying to maximize performance within thermal constraints. The E-cores typically run significantly cooler than the P-cores.
They do, but it's not clear to me how well this holds when you normalize by performance. If you wanted to maximize perf/W, you could also run the P-cores at lower frequencies, as server CPUs do.

Raising the compute load/mm^2 raises temps with everything else equal.
E-cores have better compute density than P-cores. Gracemont vs. Golden Cove is like twice the perf/mm^2, in fact.

I think what you want to focus on is W/mm^2.
 
Intel was the forerunner of hyper-threading.
SMT goes back a lot further than the Intel-branded "Hyper Threading Technology".

Beyond what it says there, I first learned of SMT from Tera MTA, which dates back to the 1990's.

And now they are ditching it?
Not for server CPUs, at least those featuring Lion Cove.
 
SMT goes back a lot further than the Intel-branded "Hyper Threading Technology".

Beyond what it says there, I first learned of SMT from Tera MTA, which dates back to the 1990's.


Not for server CPUs, at least those featuring Lion Cove.
I am guessing the complexity of the Intel CPU, the failure to keep up with fabs, and the cost of whatever microcode they are doing are part of this craptaculon. Such sadness.
 
They do, but it's not clear to me how well this holds when you normalize by performance. If you wanted to maximize perf/W, you could also run the P-cores at lower frequencies, as server CPUs do.


E-cores have better compute density than P-cores. Gracemont vs. Golden Cove is like twice the perf/mm^2, in fact.

I think what you want to focus on is W/mm^2.
You are right about both of those, but Intel has its reasons for running the P-cores as fast as possible, and I should have been clearer about "everything else equal": I meant keeping the architecture the same as well. I know the E-core architecture will change a bit to increase IPC, but not that much. I just wanted to liken increasing IPC to increasing %load in terms of heat output, to point out that IPC can't be gained without power use rising too. The E-cores had some thermal headroom, so they can take an IPC increase.

I really think heat is the bottleneck on performance progress. There are shortcuts, like saving time on data delivery, that shorten total processing time but not the non-waiting processing time. Node improvements of course help. Cutting clock speeds into the efficient ranges also helps, but that offsets IPC in the metric of IPS. The P-cores could probably have their IPS increased that way, but it probably doesn't fit Intel's sales narrative.
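The IPS-vs-IPC tradeoff described here (IPS = IPC x clock) can be sanity-checked with toy numbers. A minimal sketch; all figures are made up for illustration and don't come from any real CPU:

```python
# IPS = IPC * clock: a clock cut offsets an IPC gain, per the framing above.
# All numbers are illustrative, not measurements of any CPU.
def ips(ipc, clock_ghz):
    """Instructions per second for a given IPC and clock."""
    return ipc * clock_ghz * 1e9

base = ips(ipc=5.0, clock_ghz=5.0)           # baseline core
tuned = ips(ipc=5.5, clock_ghz=5.0 / 1.1)    # +10% IPC, clock cut ~9.1%

print(f"base:  {base:.3e} IPS")
print(f"tuned: {tuned:.3e} IPS")  # essentially identical throughput
```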

I think that most of the time, without node improvements, there aren't a lot of efficiency gains, so IPS gains either come at the expense of more power or, if power dissipation is already maxed out, are limited to whatever power the node improvements save. At least until clocks start getting lowered on consumer chips.

To bring this back to HT: the additional power consumption per area from HT (it exists and is easy to test) was hindering the maximum single-thread performance of upcoming Intel products. Give the P-cores more thermal headroom and you can increase IPC and/or frequency. Seeing as voltage is already too high because of frequency, I'm glad they went with IPC, which increases amperage instead.
 
What is a false experiment?

When I have video compression work to do, I use my desktop. Most compiling work I do (JS) only takes a few seconds on my 4-core Tiger Lake laptop, so if given a choice I would take faster and more efficient cores for work that involves compiling.

The smaller variant of Meteor Lake is 2+8 cores (12 threads), Phoenix 2 is 2+4 cores (12 threads), and the M3 in the MacBook Air is 4+4 cores (8 threads). In small, thin, and light laptops, there's not a lot of thermal or battery headroom to power more cores. This is approximately the market Lunar Lake is after. I know Lunar Lake's thread count is down from the smaller Meteor Lake, but it trades 2 little cores for big cores and promises a 50% IPC increase for the remaining little cores.
You believe their 50% IPC claim? They said Redwood Cove would deliver a 20+% IPC increase.
It didn't.
They said Lion Cove would have a 2x IPC increase over Raptor Cove.
Now they've changed the claims.
 
I think Intel is just trying to maximize performance within thermal constraints. The E-cores typically run significantly cooler than the P-cores. Raising the compute load/mm^2 raises temps with everything else equal. The converse can be said for losing HT.

New desktop CPUs are pretty thermally limited, so raising the IPC where you have thermal headroom is picking the low-hanging fruit.
Yeah, heat will drop considerably by removing HT - games perform a tiny bit better and you can reach the same clockspeeds with lower voltage.

On the other hand HT boosts MT performance by a lot compared to the die space required.

It's just that Intel doesn't really need to rely on it when they have a bunch of small cores to deal with MT performance.
 
You believe their 50% IPC claim? They said Redwood Cove would deliver a 20+% IPC increase.
It didn't.
They said Lion Cove would have a 2x IPC increase over Raptor Cove.
Now they've changed the claims.
They did deliver ~21% ST improvements back to back, from Comet Lake to Rocket Lake to Alder Lake, at least when measured in 1T CBR23.
 
You believe their 50% IPC claim? They said Redwood Cove would deliver a 20+% IPC increase.
I never saw any IPC figures for Redwood Cove. The only 20% figure I can recall was Intel 4 vs. Intel 7 power efficiency. In fact, Intel being cagey about the performance uplift of Redwood Cove seemed weird until they released it and it wasn't really an improvement.
They said Lion Cove would have a 2x IPC increase over Raptor Cove.
Now they've changed the claims.
They certainly didn't say anything of the sort with regards to Lion Cove. The only 2x was Skymont vs. the Crestmont LPE cores in one of their weird, almost pointless marketing graphs. RPL was also only mentioned in relation to Skymont, where Intel projected the IPC to be about the same for "general software".
 
They said Redwood Cove would deliver a 20+% IPC increase.
It didn't.
I don't recall such a claim. Source?

They said Lion Cove would have a 2x IPC increase over Raptor Cove.
No, they said 15% improvement.

Again, I'd be really curious to see a source on that, if you can dig one up, because 2x improvement in IPC is something we probably haven't seen since Core 2 vs. Pentium 4.
 
Intel: Removing SMT support freed up die space and gave us a 30% performance-per-watt improvement.

Independent test: AMD SMT gives an 18% performance-per-watt improvement.

Tom's Hardware: "What were Intel thinking? Morons!"

😀
 
Why do you think it's clickbait? The data presented clearly shows Zen 5 & 5C getting significantly more performance for virtually the same power. In light of that, it's fair to question whether Intel made the right call to remove it, or took the right perspective on its downsides.

Also, let's not call people fools.
Intel claims it nets them a 30% performance-per-watt improvement, due to more die space - die space most likely given to the E-cores.

I think we can assume that the highly talented chip designers at Intel have a clue about how best to use the die space they have.
If they say that it's a win on modern hardware, then it's likely a win.

The only fools here are all of us, clueless, debating whether Intel knows what they're doing because an entirely different architecture from a different company demonstrated some small improvements with SMT. And we don't know what the die-space cost is for AMD.

One thing we do know: E-cores are very aggressive in their design, running a different, simpler microarchitecture. So it's pretty reasonable to take Intel at their word that the space savings more than pay for it. And remember, on modern many-core CPUs, 10% die space is another couple of cores. Dropping HT also simplifies the P-core architecture, likely allowing it to hit higher clock speeds.
 
Intel claims it nets them a 30% performance-per-watt improvement, due to more die space - die space most likely given to the E-cores.

I think we can assume that the highly talented chip designers at Intel have a clue about how best to use the die space they have.
If they say that it's a win on modern hardware, then it's likely a win.
Good point but you'll admit that we can be reasonably skeptical.

The only fools here are all of us, clueless, debating whether Intel knows what they're doing because an entirely different architecture from a different company demonstrated some small improvements with SMT. And we don't know what the die-space cost is for AMD.
"Fools" is a bit too much. But I agree that we are debating without detailed information on either architecture.
You forget that the disputing is fun, and we are all here for fun.

One thing we do know: E-cores are very aggressive in their design, running a different, simpler microarchitecture. So it's pretty reasonable to take Intel at their word that the space savings more than pay for it. And remember, on modern many-core CPUs, 10% die space is another couple of cores. Dropping HT also simplifies the P-core architecture, likely allowing it to hit higher clock speeds.
10% is a huge amount of space for hyper-threading. I haven't found reliable information on the area used by SMT implementations, but IMHO 1-5% of total area is a more reasonable estimate.
 
SMT is just one of those things where it can have big swings in efficiency and cost depending on the workload. I think the AMD enterprise chips will probably show gains similar to the laptop's, but it'd be the desktop parts I'm most curious about, as they have a lot of headroom relative to their core counts.
With commodity CPUs nowadays having more physical cores than most users have threads running on them (let alone threads with any substantial throughput or latency requirements), the advantage of SMT in cramming more concurrent threads through a dual-core or quad-core die is rather moot. On top of that, SMT was great at putting idle die area to work back when CPUs were concerned with leaving die area idle, but today CPUs are power-budget limited and architected around racing cores to complete their work and return to idle, so SMT can end up being antithetical to overall chip efficiency (even if efficiency per core is improved).
 
Still, regardless, AMD's Zen 5 architecture demonstrates that multi-threading is advantageous if you optimize for it.
Not quite. It demonstrates that using SMT is advantageous if you have already optimised for it. It doesn't demonstrate that optimising for and using it is advantageous, which is how the sentence reads, to me.

Also "still, regardless" is redundant.

I'll stop now.
 
Intel: Removing SMT support freed up die space and gave us a 30% performance-per-watt improvement.
No, you misread that. What the article actually said was:

"Intel says that removing Hyper-Threading allowed its designers to squeeze a 30% improvement in performance per power per area out of the Lion Cove P-cores."

I included their slide in post #8.


Independent test: AMD SMT gives an 18% performance-per-watt improvement.
Actually, Phoronix's tests yielded a 20.2% efficiency benefit. So, that's definitely bigger than 18%.

Of course, those figures aren't directly comparable, but using the other information I cited in post #8, I estimated Intel gets a 9.5% efficiency benefit when the exact same core is used with vs. without Hyper-Threading.

Tom's Hardware: "What were Intel thinking? Morons!"

😀
When you're calling someone out, it's a pretty good idea to make sure you've got the facts straight.
: )
 
Intel claims it nets them a 30% performance-per-watt improvement, due to more die space - die space most likely given to the E-cores.
No, that's even more wrong than your previous post. The 30% figure was comparing P-core vs. P-core. It wasn't comparing entire hybrid CPUs!

And remember, on modern many-core CPUs, 10% die space is another couple of cores...
Performance per area doesn't mean the P-cores are necessarily smaller than they otherwise would be. It's just a measure of the computational density. However, let's say you wanted to make your P-cores without HT perform single-threaded tasks the same as they would with HT - then you'd get a core that's 91% as big. Now, if you wanted to reinvest that die area in more E-cores, let's see how far it gets you...

As I mentioned in post #27, Gracemont is about 29% as big as Golden Cove, though I expect Skymont will be even closer in size to Lion Cove. Even using the Alder Lake ratio, shrinking 8x P-cores to 91% of their original size would net you only 2.51 extra E-cores. Unfortunately, the E-cores seem to come in clusters of 4, so that's not quite enough.
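The reinvestment arithmetic above is easy to reproduce. A quick sketch using the post's own estimated ratios (a 91% relative core size without HT, Gracemont at roughly 29% of Golden Cove's area), which are forum estimates rather than official Intel figures:

```python
# Back-of-the-envelope version of the reinvestment math above.
# Both ratios are the post's estimates, not official Intel figures.
P_CORES = 8
NO_HT_SCALE = 0.91   # relative size of a P-core with HT removed
E_TO_P_AREA = 0.29   # Gracemont area as a fraction of Golden Cove

area_saved = P_CORES * (1 - NO_HT_SCALE)   # measured in P-core-area units
extra_e_cores = area_saved / E_TO_P_AREA

print(f"area saved: {area_saved:.2f} P-core equivalents")
print(f"extra E-cores: {extra_e_cores:.2f}")  # ~2.5, short of a 4-core cluster
```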

Anyway, we know that Lunar Lake will have 4P + 4E cores and the flagship Arrow Lake die will have 8P + 16E cores, which is the same as Raptor Lake. So, whether Intel invested the improved computational density in making the P-cores perform better or it just got consumed by Skymont being relatively larger, we can't really say.
 
SMT was great in getting idle die area working when CPUs were concerned about leaving die area idle, but today CPUs are power budget limited
Phoronix's tests showed a 20.2% efficiency benefit within the same power limit. Granted, that experiment compared HT on vs. off rather than a core with support for it entirely removed, but even Intel's numbers imply a 9.5% efficiency benefit from using HT on multi-threaded tasks on an HT-capable core.

and architected around racing cores to complete their work and return to idle,
This doesn't make sense to me. Cores are more efficient at lower frequencies. Countless benchmarks show lower total task energy usage when frequencies are reduced. When power-limited, a core with higher pipeline occupancy will likely have to drop its clock speed, helping achieve the efficiency benefit.

Furthermore, at lower clock speeds, the penalty of cache misses looks smaller and becomes easier to mask via prefetching and out-of-order execution.
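The efficiency point follows from the usual first-order dynamic-power model, power ~ C * f * V^2, with voltage rising roughly linearly with frequency in the boost range. A sketch with made-up constants (the V/f curve here is purely illustrative, not measured from any real CPU):

```python
# Total energy for a fixed amount of work at a given clock, using the
# first-order model: power ~ C * f * V^2 and time ~ work / f, so
# energy ~ C * work * V^2, which drops as frequency (hence voltage) drops.
def task_energy(freq_ghz, work=1.0, c=1.0, v0=0.7, dv_per_ghz=0.15):
    volts = v0 + dv_per_ghz * freq_ghz   # assumed linear V/f curve
    power = c * freq_ghz * volts ** 2    # dynamic power
    time = work / freq_ghz               # fixed work, so time scales as 1/f
    return power * time

print(f"5.0 GHz: {task_energy(5.0):.3f} (arb. units)")  # high clock, high voltage
print(f"3.0 GHz: {task_energy(3.0):.3f} (arb. units)")  # same work, less energy
```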

so SMT can end up being antithetical to overall chip efficiency (even if efficiency per core is improved).
Well, this runs contrary to the Phoronix data, but Intel's slides do claim that their non-HT core is 5% more efficient on MT workloads than a fully-occupied HT-capable equivalent.
 
Out of curiosity why would you care how many threads there are as long as the necessary performance is provided?

Realistically speaking thread count doesn't matter anymore than core count does. It's about getting the right level of performance for the application in question.

In the case of LNL the drive behind dropping SMT is all about maximizing space and power efficiency while maintaining good performance. If they've managed to do that without sacrificing performance then it's a win in my book.

With the complexity of today's programs, and the emerging inclusion of AI-capable chips in workstations, I would say there's lots of opportunity to break the computing chain up into many small parallel pieces.

That is why I think more threads are better. Also, there's a reason why it has been 2 threads per core, and not, say, 3 or 4.

It's a bit like having a winning horse and replacing it with a brand new horse.
 
Intel and AMD are usually spot-on with IPC claims. If Intel promises Skymont will deliver a 38% increase in integer IPC and a 68% increase in floating-point IPC over low-power Crestmont*, then I believe it. I think part of the reason they're spot-on with IPC is that it's not a marketing claim. Cypress Cove had a huge IPC increase over Kaby Lake/Skylake, but a small performance increase due to frequency and core count regression. Willow Cove had a small IPC increase over Sunny Cove but a huge performance increase due to higher frequency at lower power.

Now it sounds like I'm saying Skymont might not be any faster than Crestmont, but IPC is just one of the gains in its favor. Skymont in Lunar Lake is also being built on TSMC N3 and Intel seems to be promising higher frequency at lower power. Theoretically Skymont will deliver so much of an increase in performance that if someone tested it and didn't know that hyperthreading was gone, the person wouldn't find out through benchmarks.

*Crestmont has 3 variants:
  • As an e-core in Meteor Lake sitting on an Intel 4 die and connected to an L3 cache and ring bus that it shares with Redwood Cove cores
  • As a low-power (LP) e-core in Meteor Lake sitting on a TSMC N6 die with no L3 cache
  • As the only core (still an e-core though) in Sierra Forest sitting on one or two Intel 3 dies with L3 cache.
Intel has only compared Skymont to the LP e-core variant of Crestmont, which is the slowest version. This is because Skymont in Lunar Lake is also an LP e-core, with no L3 cache and it's not on the same ring bus as the Lion Cove cores. But Skymont in Lunar Lake does have access to an 8MB "memory-side" cache or system-level cache in addition to a bigger L2 cache (4MB for 4 cores versus 2MB for 2 cores for Meteor Lake's LP e-cores). So it's not a true apples-to-apples comparison.
 
This doesn't make sense to me. Cores are more efficient at lower frequencies. Countless benchmarks show lower total task energy usage when frequencies are reduced. When power-limited, a core with higher pipeline occupancy will likely have to drop its clock speed, helping achieve the efficiency benefit.
Which is great, if the CPU is only processing tasks with no time requirements. A single E-core traipsing away at sub-GHz could likely perform the same overall amount of computation over a 24h period that a full-up multi-core CPU will during its few hours of normal usage, at much greater efficiency. But it'll be at best excruciatingly slow to use, and at worst outright worthless for any tasks with a requirement to complete within a limited time period (which includes all UI-interactive tasks as well as networking, cryptography, etc).
Some use-cases benefit from doing the same job with less overall energy over a longer time period, and some do not. Some use-cases benefit from doing multiple parallel runs of a task at the same time, and some do not. But all use-cases benefit from doing the same task in a shorter amount of time. It's why both AMD and Intel CPUs have continued to trend upwards in clock speed despite the per-core efficiency penalty.
 
Performance per area doesn't mean the P-cores are necessarily smaller than they otherwise would be. It's just a measure of the computational density. However, let's say you wanted to make your P-cores without HT perform single-threaded tasks the same as they would with HT - then you'd get a core that's 91% as big. Now, if you wanted to reinvest that die area in more E-cores, let's see how far it gets you...

As I mentioned in post #27, Gracemont is about 29% as big as Golden Cove, though I expect Skymont will be even closer in size to Lion Cove. Even using the Alder Lake ratio, shrinking 8x P-cores to 91% of their original size would net you only 2.51 extra E-cores. Unfortunately, the E-cores seem to come in clusters of 4, so that's not quite enough.

Anyway, we know that Lunar Lake will have 4P + 4E cores and the flagship Arrow Lake die will have 8P + 16E cores, which is the same as Raptor Lake. So, whether Intel invested the improved computational density in making the P-cores perform better or it just got consumed by Skymont being relatively larger, we can't really say.
But doesn't the "I expect Skymont will be even closer in size to Lion Cove" answer the whole thing? Removing HT allows them to make E-cores larger at the same die size, no?
 
I think part of the reason they're spot-on with IPC is that it's not a marketing claim.
Consider everything you see on a marketing slide to be a marketing claim. IPC is definitely subject to manipulation, based on which applications they measure, how they benchmark them, and how they distill the resulting distribution down to a single number. There are also aspects of the test setup subject to manipulation.

You'd do better by focusing on what they present at a tech forum, like Hot Chips, or looking at third-party analysis. BTW, I'm sure the Hot Chips presentations have been combed through by their lawyers and PR folks.

Even when an engineer gives an interview, you can be sure the PR team prepped them on what they can, can't, and should say. That's one reason I was surprised at how candid Mike Clark's interview with Chips & Cheese was. Then again, being the lead architect since the first Zen probably means there are very few people at AMD he really needs to listen to or worry about.


Intel has only compared Skymont to the LP e-core variant of Crestmont, which is the slowest version.
Yes, thanks for noticing. This is exactly the sort of jujitsu marketing people do in their slides - pick the most favorable point of comparison to elicit the strongest superlatives.
 
But doesn't the "I expect Skymont will be even closer in size to Lion Cove" answer the whole thing? Removing HT allows them to make E-cores larger at the same die size, no?
That embeds a lot of assumptions.

First, let's consider that perf/area is mostly a backwards looking metric and very much a point on a curve. It's not as if that one metric suddenly gives you a knob that you can use to order up a core of a specific performance level, based on how much die area you're willing to spend.

Second, as I've pointed out, it doesn't mean the cores are 91% as big. That would be true only if they took the area savings from removing HT and didn't reinvest them in making the P-cores faster. I find that unlikely.

Finally, we don't know the exact chronology. Maybe hyper-threading removal came fairly late in the process, which could help explain why the server Lion Cove will still have it. Maybe they looked at the die size projected for Lunar Lake, worked out the cost projections (which are higher since they're using TSMC N3; during their recent quarterly results, Intel said profit margins on Lunar Lake are going to suffer from this), and decided they needed to go back and find some area savings somewhere.

That said, I'm fairly certain they track the projected transistor count during the entire design process and work according to a budget. Still, the use of TSMC could've heightened their sensitivity to area efficiency.
 
Yes, thanks for noticing. This is exactly the sort of jujitsu marketing people do in their slides - pick the most favorable point of comparison to elicit the strongest superlatives.
Clearly, Intel should've compared Skymont LP e-cores to Crestmont LP e-cores only when exploring power consumption, and compared Skymont LP e-cores to the regular Crestmont e-cores when talking about performance, to ensure Skymont is presented in the worst possible light.
 