News Intel's Nova Lake CPU reportedly has up to 52 cores — Coyote Cove P-cores and Arctic Wolf E-cores onboard

If they're going to spread CPU cores across multiple chiplets, it'd be interesting to have chiplets with all P-cores and others with all E-cores. Then, Intel could easily mix & match to serve more markets.
It sounds like the plan is to connect up to two 8P + 16 E chiplets, and they'll disable some cores if needed. I think the mixing of these two types on one chiplet could help with switching tasks between the two core types without incurring a latency penalty. And that remains true for any SKU with only one of these 8+16 chiplets, since all the cores sit on that single die. With the exception of the LP E-cores...
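(For reference, that's also where the headline's 52 would come from, if the rumored configuration holds: 2 × (8P + 16E) = 48 cores across the compute tiles, plus 4 LP E-cores on the SoC tile, for 52 in total.)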

Bringing LP E-cores to desktop and increasing them from 2 (Meteor Lake) to 4 could help idle power efficiency immensely if they pull it off: great for office PCs that idle 90% of the time or only do very light work, video playback, etc. If we're looking at cores with something like Golden Cove IPC at 2 GHz, four of them could do surprisingly well on the SoC tile.
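To make the "light work on the E-cores" idea concrete, here's a minimal Linux sketch, purely illustrative and not anything Intel has announced, of how an app can already steer a background thread onto the E-cores using the hybrid-topology sysfs node the kernel exposes on Alder Lake and later. Whether (and how) LP E-cores would be enumerated separately is an open question.

```cpp
// Minimal sketch: steer a light background thread onto the E-cores of a
// hybrid Intel CPU under Linux. Assumes the kernel exposes the hybrid
// topology via /sys/devices/cpu_atom/cpus (present on Alder Lake and later).
// Build: g++ -O2 -pthread ecore_affinity.cpp
#include <cstdio>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
#include <pthread.h>
#include <sched.h>

// Parse a kernel cpulist string like "16-23" or "16-23,28" into CPU indices.
static std::vector<int> parse_cpulist(const std::string& list) {
    std::vector<int> cpus;
    std::stringstream ss(list);
    std::string range;
    while (std::getline(ss, range, ',')) {
        int lo = 0, hi = 0;
        if (std::sscanf(range.c_str(), "%d-%d", &lo, &hi) == 2) {
            for (int c = lo; c <= hi; ++c) cpus.push_back(c);
        } else if (std::sscanf(range.c_str(), "%d", &lo) == 1) {
            cpus.push_back(lo);
        }
    }
    return cpus;
}

int main() {
    std::ifstream f("/sys/devices/cpu_atom/cpus");  // E-core list on hybrid parts
    std::string list;
    if (!f || !std::getline(f, list)) {
        std::cerr << "No hybrid topology exposed; nothing to do.\n";
        return 0;
    }
    std::vector<int> ecores = parse_cpulist(list);

    std::thread background([] {
        // ... light periodic work: indexing, logging, media playback helpers, etc.
    });

    // Restrict the background thread to the E-cores only, so the P-cores can
    // stay parked (or busy with the foreground task).
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : ecores) CPU_SET(c, &set);
    pthread_setaffinity_np(background.native_handle(), sizeof(set), &set);

    background.join();
    return 0;
}
```

In the idle/office scenario above, the scheduler and Thread Director should be doing this placement on their own; the sketch just shows the mechanics.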
 
If they're going to spread CPU cores across multiple chiplets, it'd be interesting to have chiplets with all P-cores and others with all E-cores. Then, Intel could easily mix & match to serve more markets.
Yeah, this is a question Ian Cutress raised after the ADL launch. At this point I'm not really sure why they don't separate out the P and E-cores. They somewhat doubled down on the existing monolithic nature with ARL, given the E-core clusters are mixed in with the P-cores. I'm sure there are advantages in power-constrained parts, so maybe that's dictating client strategy.
 
Another piece of fake news. Arrow Lake was also supposedly going to scale up to 32 cores. This whole strategy of P- and E-cores on one die is a failure, as are all their pseudo-chiplets. They still have double-digit numbers of dies, thoughtlessly sinking money into the design and production of electronic waste. They're losing money on this, and that won't change until they understand that with chiplets, the simpler the strategy, the better. Take AMD as an example and do the same. Stubborn pride and lack of competence won't let them change. The era of their miserable ring bus has passed, and clumsy interposers with terrible latency aren't what client machines need. But here we are.
 
It sounds like the plan is to connect up to two 8P + 16 E chiplets, and they'll disable some cores if needed.
Right, that's how it sounds.

I think the mixing of these two types on one chiplet could help with switching tasks between the two core types without incurring a latency penalty.
But they still do. Arrow Lake's P-core to E-core latency is basically as bad as the latency between two different Zen 5 chiplets.

https://substack-post-media.s3.amazonaws.com/public/images/55af2adb-22d6-49b5-b5ce-c93df3f0518b_996x515.png


https://substack-post-media.s3.amazonaws.com/public/images/efaf8904-f4b4-4335-8d5d-184569d27e25_1001x517.png


Source: https://chipsandcheese.com/p/examining-intels-arrow-lake-at-the
Also, core-to-core latency isn't about context switching. It's about communication between active threads running on two different cores, and that's something schedulers already take into account when they place the threads of a single process.

In practice, core-to-core latency has been shown to be of minimal real-world importance. The main reason people look at it is to try and glean details about a CPU's interconnect. What's much more important is each core's cache & memory latency.
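For context on what those heatmaps actually measure: a core-to-core latency test is just two threads pinned to two specific cores, bouncing ownership of a single cache line back and forth. A rough sketch of that ping-pong (Linux; core IDs 0 and 16 are placeholders, pick one P-core and one E-core on your own machine):

```cpp
// Rough core-to-core latency ping-pong, in the spirit of the Chips and Cheese
// measurements. Core IDs 0 and 16 are placeholders for a P-core and an E-core.
// Build: g++ -O2 -pthread ping_pong.cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <pthread.h>
#include <sched.h>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    constexpr int kIters = 1'000'000;
    std::atomic<int> flag{0};          // the cache line being bounced

    std::thread responder([&] {
        pin_to_cpu(16);                // placeholder: an E-core
        for (int i = 0; i < kIters; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) { }
            flag.store(0, std::memory_order_release);
        }
    });

    pin_to_cpu(0);                     // placeholder: a P-core
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        flag.store(1, std::memory_order_release);
        while (flag.load(std::memory_order_acquire) != 0) { }
    }
    auto t1 = std::chrono::steady_clock::now();
    responder.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Each iteration is a full round trip, so one-way latency is roughly half.
    std::cout << "round trip: " << ns / kIters << " ns\n";
    return 0;
}
```

Point the same harness at two P-cores, or two E-cores sharing a cluster, and you get the lower numbers in the diagonal blocks of those charts.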

Bringing LP E-cores to desktop and increasing them from 2 (Meteor Lake) to 4 could help idle power efficiency immensely if they pull it off: great for office PCs that idle 90% of the time or only do very light work, video playback, etc. If we're looking at cores with something like Golden Cove IPC at 2 GHz, four of them could do surprisingly well on the SoC tile.
If desktops want to decrease idle power, the first thing they should do is implement dynamic scaling of memory frequency. I think we'll never see P-cores being used for LP duty, since the SoC tile tends to be on an older node, which would make them both less efficient and bigger area hogs. They also can't use an older core without holding back the ISA support of the newer ones.
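That last point about ISA support is worth unpacking, because it's the reason a mixed package can't just drop in an older or cut-down core: most software probes CPUID once and assumes the answer holds on every core it might later be migrated to. A small x86, GCC/Clang-flavored sketch of that pattern (the sum_avx2/sum_scalar names are just made up for illustration):

```cpp
// Why every core in a package has to advertise the same ISA level: software
// typically probes CPUID once and assumes the result holds on whatever core
// the scheduler later migrates it to. x86 + GCC/Clang sketch.
#include <iostream>

// Hypothetical AVX2 code path, compiled with AVX2 enabled for this function only.
__attribute__((target("avx2")))
static void sum_avx2(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];  // free to auto-vectorize
}

static void sum_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float out[8];

    __builtin_cpu_init();  // harmless here; only required before constructors run

    // This probe runs once, on whichever core the thread happens to start on.
    // If some LP core elsewhere in the package lacked AVX2, a later migration
    // onto it would fault inside sum_avx2, which is why the whole package has
    // to present a homogeneous ISA.
    if (__builtin_cpu_supports("avx2"))
        sum_avx2(a, b, out, 8);
    else
        sum_scalar(a, b, out, 8);

    std::cout << out[0] + out[7] << "\n";  // prints 18
    return 0;
}
```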

Not to mention how Skymont supposedly has IPC similar to Raptor Cove, showing that the newer E-cores are just fine for LP duty.
 
They somewhat doubled down on the existing monolithic nature with ARL, given the E-core clusters are mixed in with the P-cores. I'm sure there are advantages in power-constrained parts, so maybe that's dictating client strategy.
Interleaving them was an interesting move. I wonder which actually has the higher power density. Probably the P-cores, but I think the E-cores are actually quite power-dense too. Even if power density is similar, interleaving could make sense in cases where you have a lightly-threaded job hitting only the P-cores. Less likely, but an analogous thing could happen with a background job of some sort that's hitting only the E-cores.
 
I wonder how feasible it would be to go beyond what Intel did with ARL. I very much like the fact that it can drop down to the base JEDEC profile (assuming the feature is enabled) and then clock back up to the chosen XMP profile as needed.
GPUs and phones scale memory frequency way more than that.
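As a point of comparison, discrete Radeon cards on Linux expose their memory-clock DVFS states through the amdgpu driver's pp_dpm_mclk sysfs node (card0 is just an assumed path here), and the spread between the lowest and highest states is far wider than JEDEC-vs-XMP. A trivial sketch to dump them:

```cpp
// Dump the memory-clock DVFS states of an AMD GPU on Linux. Assumes the
// amdgpu driver and that the card is card0; the currently selected state is
// marked with '*'. Purely illustrative of how far GPU memory clocks scale.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream f("/sys/class/drm/card0/device/pp_dpm_mclk");
    if (!f) {
        std::cerr << "No amdgpu pp_dpm_mclk node found.\n";
        return 1;
    }
    std::string line;
    while (std::getline(f, line))
        std::cout << line << "\n";   // e.g. "0: 96Mhz", "1: 456Mhz", "3: 1000Mhz *"
    return 0;
}
```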

I'd guess the limit is probably something intrinsic to how DDR5 memory works. Perhaps we'll see even greater memory frequency scaling in DDR6.

I wonder if laptops already drop frequencies like this, with LPDDR5. Given phones' use of the LP memory standards, I think it should be supported on that end.
 
If they're going to spread CPU cores across multiple chiplets, it'd be interesting to have chiplets with all P-cores and others with all E-cores. Then, Intel could easily mix & match to serve more markets.
Taking this further... P and E don't necessarily need the same process node, so separating those tiles gives them considerable options to deal with whatever the future holds.
 
Taking this further... P and E don't necessarily need the same process node,
They don't, but this doesn't necessarily work out like you might expect. For instance, AMD used a TSMC N4 node for the Zen 5 CCD and an N3 node for the Zen 5C CCD (source: https://www.anandtech.com/show/2146...bile-strix-point-with-rdna-35-igpu-xdna-2-npu ).

I'm not even going to speculate on the factors leading to that decision (I can imagine a lot). I just wanted to point out that being the E-core doesn't necessarily mean it should get the worse node.
 
They don't, but this doesn't necessarily work out like you might expect. For instance, AMD used a TSMC N4 node for the Zen 5 CCD and an N3 node for the Zen 5C CCD (source: https://www.anandtech.com/show/2146...bile-strix-point-with-rdna-35-igpu-xdna-2-npu ).

I'm not even going to speculate on the factors leading to that decision (I can imagine a lot). I just wanted to point out that being the E-core doesn't necessarily mean it should get the worse node.
Funny... I was thinking E would get the more advanced node. It seems LP tends to be the pipe cleaner, with HP coming later. And even though SRAM scaling has improved recently, better doesn't mean the trend of SRAM being hard to shrink has reversed, so carrying less SRAM gives the E-cores an advantage there as well. Historically, Intel nodes hit higher frequencies whereas TSMC nodes have better efficiency, but really it's impossible to predict from where I'm sitting... just that more tiles equals more flexibility, at the cost of latency.