News Intel doesn't plan to bring 3D V-Cache-like tech to consumer CPUs for now — next-gen Clearwater Forest Xeon CPUs will feature "Local Cache" in th...


That's the only free link I can find that wasn't a YouTube video, but some of the paid sources on there will give more context and detail.
Thanks for the link!

you cannot say that matches Moore's law when it's an inverse exponent
Certainly, SRAM has a scaling problem. I think we were only negotiating the particulars.

That's far less scaling than they are achieving with transistor density and is the whole reason why E-Cores, dense cores, or whatever you want to call them are having their day in the sun.
So, you think that a lot of E-cores' density improvement comes from relying on smaller SRAM structures, like physical register files, reorder buffers, and caches? That's an interesting take. I honestly don't know enough to say one way or another.

BTW, the SRAM scaling matter puts an interesting perspective on ARM's move away from using micro-op caches. Last I heard, Intel's E-cores don't have them, either. I don't know if Skymont changed that, but their move from dual to triple decoder blocks (each 3-wide) would seem to suggest not, as it probably adds enough decoder bandwidth to keep the backend fed.

EDIT: Did I miss them saying they would use TSMC? Not doubling down, just genuinely asking: if they were using N3E for cache ($$$, but yeah) and Intel 7/5 for core tiles, then it's a whole different ball game.
I certainly haven't heard anything about which nodes they would use for what, but I'm not the most plugged-in to the rumor mill. It would indeed be a bombshell if Intel started using TSMC for substantial parts of their datacenter CPUs.
 
I'm curious why you say that.
[Attached image: sram-density-tsmc-n3b-n3e.png — SRAM density, TSMC N3B vs. N3E]

An addendum to the above link should be:

Which basically states that TSMC's N2 node achieves a 20% SRAM density improvement over N5/N3. It's not huge, but it's in line with the other purported density improvements offered by N2 vs N3.
It looks like SRAM scaling hit a wall when TSMC started with EUV at 7nm. Maybe high NA EUV will allow another step down in area?
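To put rough numbers on that wall: here's a small sketch computing the density step between nodes from published high-density 6T SRAM bit-cell areas. The figures below are approximations from public reporting, not official TSMC specs, so treat them as assumptions.

```python
# Approximate high-density 6T SRAM bit-cell areas (um^2) for recent
# TSMC nodes. Values are from public reporting and are assumptions,
# not official figures.
bitcell_um2 = {
    "N7":  0.027,
    "N5":  0.021,
    "N3E": 0.021,   # essentially no shrink vs. N5
    "N2":  0.0175,  # ~20% denser than N3E, per the claim above
}

def density_gain(old_node, new_node):
    """Percent SRAM density improvement going from old_node to new_node."""
    return (bitcell_um2[old_node] / bitcell_um2[new_node] - 1) * 100

print(f"N5  -> N3E: {density_gain('N5', 'N3E'):+.0f}%")
print(f"N3E -> N2 : {density_gain('N3E', 'N2'):+.0f}%")
```

With these figures, N5 to N3E is a flat 0% while N2 recovers about 20%, which matches the "hit a wall, then one modest step" picture.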
 
So, you think that a lot of E-cores' density improvement comes from relying on smaller SRAM structures, like physical register files, reorder buffers, and caches? That's an interesting take. I honestly don't know enough to say one way or another.

BTW, the SRAM scaling matter puts an interesting perspective on ARM's move away from using micro-op caches. Last I heard, Intel's E-cores don't have them, either. I don't know if Skymont changed that, but their move from dual to triple decoder blocks (each 3-wide) would seem to suggest not, as it probably adds enough decoder bandwidth to keep the backend fed.
AMD claims near performance parity between most of Zen 5c and Zen 5, so I would guess they found that having more execution units and smaller but faster caches is better per watt than refreshing more SRAM every cycle and gaining via throughput, as the larger cores do. That suits smaller, bursty, "more often idle" cores like 5c and ARM, where lower core-to-core and core-to-L3 latency adds to the performance of the backend. Their reorder buffers are definitely smaller, but I don't have details on the rest to hand and am pressed for time.

Intel is a harder beast to predict because, simply put, their E-core characteristics are too different from their P-cores to ascertain without someone else giving me the data. I haven't looked into them in detail yet; I've been too focused on the big cores.

I can only guess that the combination of more execution resources and lower latencies is what 5c uses to reach Zen 5 levels of burst performance. With smaller caches you can get better yields at higher clocks, so that's a factor too. Obviously larger loops fall over on cache size, but ARM and 5c aren't really built for that in standard form.

SRAM uses power regardless of how much it's used, so it would be my first target if I was trying to minimise base power draw. Taking up more and more space per generation only adds to the argument.

As for micro-op caches, ditching them was specifically because they were an easy vector for side-channel attacks. Asking for an op that had just been executed by another process and timing how quickly it came back would let you work out, pretty accurately, which ops were being used. The smaller the cache, the better this worked. I believe there are mitigations, but they hamstring the performance benefits and use more power and space.
AMD claims near performance parity between most of Zen 5c and Zen 5, so I would guess they found that having more execution units and smaller but faster caches is better per watt than refreshing more SRAM every cycle and gaining via throughput, as the larger cores do.
Zen 4C only differs from Zen 4 in terms of layout, removing the TSVs and Tag RAM to support 3D V-cache, and halving the L3 cache per CCD. They also were able to decrease thickness of certain wires and components, due to its lower frequency ceiling. However, the microarchitecture is exactly the same, even to the point of having the exact same IPC (i.e. for microbenchmarks that don't hit memory).

Indications are that Zen 5C continues this philosophy. You can read a bit about that, here:

The area data supports this, as Zen 5C is about 75% the size of a Zen 5 core (see above link), while Lunar Lake's Skymont seems to be about 38.1% as big as its Lion Cove counterpart.

Intel is a harder beast to predict because, simply put, their E-core characteristics are too different from their P-cores to ascertain without someone else giving me the data. I haven't looked into them in detail yet; I've been too focused on the big cores.
Chips And Cheese sometimes compares the estimated size of structures between different cores. Here's how the P & E cores in Alder Lake compare (with Zen 2 & 3, as well):

Here, I've attempted to compute the ratios. They don't align terribly well with the E-cores being only about 29.6% as big as the P-cores (excluding L2 & L3):

Structure           Gracemont   Golden Cove   Ratio
ROB                   256          512         50.0%
Int Regs              214          280         76.4%
Flags Regs            214          280         76.4%
256-bit VFP Regs      111          332         33.4%
Load Queue             80          192         41.7%
Store Queue            50          114         43.9%
BOB                   126          128         98.4%
Scheduler             221          205        107.8%
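For transparency, the ratio column is just Gracemont's structure size divided by Golden Cove's, as a percentage. A quick sketch of the computation, using the Chips and Cheese estimates quoted above:

```python
# Structure sizes (entry counts) from the Chips and Cheese estimates:
# (Gracemont, Golden Cove) per structure.
structures = {
    "ROB":              (256, 512),
    "Int Regs":         (214, 280),
    "Flags Regs":       (214, 280),
    "256-bit VFP Regs": (111, 332),
    "Load Queue":       (80, 192),
    "Store Queue":      (50, 114),
    "BOB":              (126, 128),
    "Scheduler":        (221, 205),
}

for name, (e_core, p_core) in structures.items():
    ratio = e_core / p_core * 100
    print(f"{name:18s} {ratio:6.1f}%")
```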

I also found where they compared Skymont to Crestmont.

Lastly, here's data on the latest Lion Cove P-cores. Sadly, I didn't find a table with just Lion Cove vs. Skymont.

Again, I've attempted to compute the ratios between the P and E cores. Excluding L2 and L3 cache, Skymont appears to be about 33.2% as big as Lion Cove. The relative sizes of their structures seem to correlate even worse to the relative core areas than we saw with Alder Lake, although I have stats on slightly fewer aspects.

Structure           Skymont   Lion Cove   Ratio
ROB                   416        576       72.2%
Int Regs              272        290       93.8%
256-bit VFP Regs      282        406       69.5%
Load Queue            114        189       60.3%
Store Queue            56        120       46.7%
BOB                    96        180       53.3%
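One rough way to quantify "correlates worse": compare the geometric mean of these structure ratios to the ~33.2% core-area ratio. This framing is my own back-of-envelope check, not something from the Chips and Cheese articles:

```python
from math import prod

# Skymont / Lion Cove per-structure ratios from the table above:
# ROB, Int Regs, 256-bit VFP Regs, Load Queue, Store Queue, BOB.
ratios = [416/576, 272/290, 282/406, 114/189, 56/120, 96/180]

# Geometric mean of the per-structure ratios.
geo_mean = prod(ratios) ** (1 / len(ratios))
print(f"Geometric mean of structure ratios: {geo_mean:.1%}")
print(f"Skymont / Lion Cove core area:      {0.332:.1%}")
```

The geometric mean lands around 64%, roughly double the 33.2% area ratio, i.e. Skymont's structures shrink far less than its area does.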

As for micro-op caches, ditching them was specifically because they were an easy vector for side-channel attacks.
What ARM publicly said about it was summarized here:

"By moving many of the benefits of the MOP cache along with its newly added decode lane, Arm says it was able to achieve similar performance without the MOP cache. For this reason it was removed. Removing the cache also offered some area and power gain, albeit in terms of performance, the fairly large design swap largely equal each other out."


The X4 subsequently picked up the change, also dropping its MOP cache.

Asking for an op that had just been executed by another process and timing how quickly it came back would let you work out, pretty accurately, which ops were being used.
This isn't consistent with how MOP cache works. The MOP cache is private to each core and none of these cores support SMT.
 