News Intel doesn't plan to bring 3D V-Cache-like tech to consumer CPUs for now — next-gen Clearwater Forest Xeon CPUs will feature "Local Cache" in th...


That's the only free link I can find that wasn't a YouTube video, but some of the paid sources on there will give more context and detail.
Thanks for the link!

you cannot say that matches Moore's law when it's an inverse exponent
Certainly, SRAM has a scaling problem. I think we were only negotiating the particulars.

That's far less scaling than they are achieving with transistor density and is the whole reason why E-Cores, dense cores, or whatever you want to call them are having their day in the sun.
So, you think that a lot of E-cores' density improvement comes from relying on smaller SRAM structures, like physical register files, reorder buffers, and caches? That's an interesting take. I honestly don't know enough to say one way or another.

BTW, the SRAM scaling matter puts an interesting perspective on ARM's move away from using micro-op caches. Last I heard, Intel's E-cores don't have them, either. I don't know if Skymont changed that, but their move from dual to triple decoder blocks (each 3-wide) would seem to suggest not, as it probably adds enough decoder bandwidth to keep the backend fed.

EDIT: Did I miss them saying they would use TSMC? Not doubling down, just genuinely asking: if they were using N3E for cache ($$$, but yeah) and Intel 7/5 for core tiles, then it's a whole different ball game.
I certainly haven't heard anything about which nodes they would use for what, but I'm not the most plugged-in to the rumor mill. It would indeed be a bombshell if Intel started using TSMC for substantial parts of their datacenter CPUs.
 
I'm curious why you say that.
[Attached image: sram-density-tsmc-n3b-n3e.png — SRAM density, TSMC N3B vs. N3E]

An addendum to the above link should be:

Which basically states that TSMC's N2 node achieves a 20% SRAM density improvement over N5/N3. It's not huge, but it's in line with the other purported density improvements offered by N2 vs N3.
It looks like SRAM scaling hit a wall when TSMC started with EUV at 7nm. Maybe high NA EUV will allow another step down in area?
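To put rough numbers on that wall: here's a small sketch computing the density step between nodes from published high-density 6T SRAM bit-cell areas. The figures below are approximations from public reporting, not official TSMC specs, so treat them as assumptions.

```python
# Approximate high-density 6T SRAM bit-cell areas (um^2) for recent
# TSMC nodes. Values are from public reporting and are assumptions,
# not official figures.
bitcell_um2 = {
    "N7":  0.027,
    "N5":  0.021,
    "N3E": 0.021,   # essentially no shrink vs. N5
    "N2":  0.0175,  # ~20% denser than N3E, per the claim above
}

def density_gain(old_node, new_node):
    """Percent SRAM density improvement going from old_node to new_node."""
    return (bitcell_um2[old_node] / bitcell_um2[new_node] - 1) * 100

print(f"N5  -> N3E: {density_gain('N5', 'N3E'):+.0f}%")
print(f"N3E -> N2 : {density_gain('N3E', 'N2'):+.0f}%")
```

With these figures, N5 to N3E is a flat 0% while N2 recovers about 20%, which matches the "hit a wall, then one modest step" picture.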
 
So, you think that a lot of E-cores' density improvement comes from relying on smaller SRAM structures, like physical register files, reorder buffers, and caches? That's an interesting take. I honestly don't know enough to say one way or another.

BTW, the SRAM scaling matter puts an interesting perspective on ARM's move away from using micro-op caches. Last I heard, Intel's E-cores don't have them, either. I don't know if Skymont changed that, but their move from dual to triple decoder blocks (each 3-wide) would seem to suggest not, as it probably adds enough decoder bandwidth to keep the backend fed.
AMD claims near performance parity between most of Zen 5c and Zen 5, so I would guess they found that having more execution units and smaller but faster caches is better per watt than refreshing more SRAM every cycle and gaining via throughput, as the larger cores do. That suits smaller, bursty, "more often idle" cores like 5c and ARM, where lower core-to-core and core-to-L3 latency adds to the performance of the backend. Their reorder buffers are definitely smaller, but I don't have details on the rest to hand and am pressed for time.

Intel is a harder beast to predict because, simply put, their E-core characteristics are too different from their P-cores to ascertain without someone else giving me the data. I haven't looked into them in detail yet; I've been too focused on the big cores.

I can only guess that the combination of more execution resources and lower latencies is what 5c uses to reach Zen 5 levels of burst performance. With smaller caches you can get better yields at higher clocks, so that's a factor too. Obviously larger loops fall over on cache size, but ARM and 5c aren't really built for that in standard form.

SRAM uses power regardless of how much it's used, so it would be my first target if I was trying to minimise base power draw. Taking up more and more space per generation only adds to the argument.

As for micro-op caches, ditching them was specifically because they were an easy vector for side-channel attacks. Asking for an op that had just been executed by another process and timing how quickly it came back would let you work out, pretty accurately, which ops were being used. The smaller the cache, the better this worked. I believe there are mitigations, but they hamstring the performance benefits and use more power and space.
AMD claims near performance parity between most of Zen 5c and Zen 5, so I would guess they found that having more execution units and smaller but faster caches is better per watt than refreshing more SRAM every cycle and gaining via throughput, as the larger cores do.
Zen 4C only differs from Zen 4 in terms of layout, removing the TSVs and Tag RAM to support 3D V-cache, and halving the L3 cache per CCD. They also were able to decrease thickness of certain wires and components, due to its lower frequency ceiling. However, the microarchitecture is exactly the same, even to the point of having the exact same IPC (i.e. for microbenchmarks that don't hit memory).

Indications are that Zen 5C continues this philosophy. You can read a bit about that, here:

The area data supports this, as Zen 5C is about 75% the size of a Zen 5 core (see above link), while Lunar Lake's Skymont seems to be about 38.1% as big as its Lion Cove counterpart.

Intel is a harder beast to predict because, simply put, their E-core characteristics are too different from their P-cores to ascertain without someone else giving me the data. I haven't looked into them in detail yet; I've been too focused on the big cores.
Chips And Cheese sometimes compares the estimated size of structures between different cores. Here's how the P & E cores in Alder Lake compare (with Zen 2 & 3, as well):

Here, I've attempted to compute the ratios. They don't align terribly well with the E-cores being only about 29.6% as big as the P-cores (excluding L2 & L3):

Structure           Gracemont   Golden Cove   Ratio
ROB                   256          512         50.0%
Int Regs              214          280         76.4%
Flags Regs            214          280         76.4%
256-bit VFP Regs      111          332         33.4%
Load Queue             80          192         41.7%
Store Queue            50          114         43.9%
BOB                   126          128         98.4%
Scheduler             221          205        107.8%
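For transparency, the ratio column is just Gracemont's structure size divided by Golden Cove's, as a percentage. A quick sketch of the computation, using the Chips and Cheese estimates quoted above:

```python
# Structure sizes (entry counts) from the Chips and Cheese estimates:
# (Gracemont, Golden Cove) per structure.
structures = {
    "ROB":              (256, 512),
    "Int Regs":         (214, 280),
    "Flags Regs":       (214, 280),
    "256-bit VFP Regs": (111, 332),
    "Load Queue":       (80, 192),
    "Store Queue":      (50, 114),
    "BOB":              (126, 128),
    "Scheduler":        (221, 205),
}

for name, (e_core, p_core) in structures.items():
    ratio = e_core / p_core * 100
    print(f"{name:18s} {ratio:6.1f}%")
```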

I also found where they compared Skymont to Crestmont.

Lastly, here's data on the latest Lion Cove P-cores. Sadly, I didn't find a table with just Lion Cove vs. Skymont.

Again, I've attempted to compute the ratios between the P and E cores. Excluding L2 and L3 cache, Skymont appears to be about 33.2% as big as Lion Cove. The relative sizes of their structures seem to correlate even worse to the relative core areas than we saw with Alder Lake, although I have stats on slightly fewer aspects.

Structure           Skymont   Lion Cove   Ratio
ROB                   416        576       72.2%
Int Regs              272        290       93.8%
256-bit VFP Regs      282        406       69.5%
Load Queue            114        189       60.3%
Store Queue            56        120       46.7%
BOB                    96        180       53.3%
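One rough way to quantify "correlates worse": compare the geometric mean of these structure ratios to the ~33.2% core-area ratio. This framing is my own back-of-envelope check, not something from the Chips and Cheese articles:

```python
from math import prod

# Skymont / Lion Cove per-structure ratios from the table above:
# ROB, Int Regs, 256-bit VFP Regs, Load Queue, Store Queue, BOB.
ratios = [416/576, 272/290, 282/406, 114/189, 56/120, 96/180]

# Geometric mean of the per-structure ratios.
geo_mean = prod(ratios) ** (1 / len(ratios))
print(f"Geometric mean of structure ratios: {geo_mean:.1%}")
print(f"Skymont / Lion Cove core area:      {0.332:.1%}")
```

The geometric mean lands around 64%, roughly double the 33.2% area ratio, i.e. Skymont's structures shrink far less than its area does.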

As for micro-op caches, ditching them was specifically because they were an easy vector for side-channel attacks.
What ARM publicly said about it was summarized here:

"By moving many of the benefits of the MOP cache along with its newly added decode lane, Arm says it was able to achieve similar performance without the MOP cache. For this reason it was removed. Removing the cache also offered some area and power gain, albeit in terms of performance, the fairly large design swap largely equal each other out."


The X4 subsequently picked up the change, also dropping its MOP cache.

Asking for an op that had just been executed by another process and timing how quickly it came back would let you work out, pretty accurately, which ops were being used.
This isn't consistent with how MOP cache works. The MOP cache is private to each core and none of these cores support SMT.
 