News Intel doesn't plan to bring 3D V-Cache-like tech to consumer CPUs for now — next-gen Clearwater Forest Xeon CPUs will feature "Local Cache" in the base tile

The gaming CPU market isn't large enough to justify it? This is more of the same: Intel sticking with the same safe bets and making excuses. It's not just raw performance; even Arrow Lake's perf-per-watt isn't close to Zen 4 and Zen 5 X3D parts.

"Local cache"? *Face palm* Cache is inherently local, lol, at least compared to main system memory (RAM). I'm not saying they need to go crazy on marketing it, but that sounds like zero effort.
 
I don't understand, "However, this is unlike AMD's X3D approach since the CPU chiplets are mutually dependent on the Base tile."
They're trying to say that the CPU Tile and Base Tile (cache) exist as a complete package and hence are dependent on one another (mutually dependent). You can't have CPU cores without cache and vice versa.
 
That Clearwater Forest though.
How many cores is that thing going to have with them being further shrunk?
Probably 288 or so. I believe 4 CPU tiles have 96 cores (saw this on WCCFTech a while back), i.e. 24 per CPU tile. Since there are 12 CPU tiles: 12 × 24 = 288, but that's just a guess.
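If it helps, here's that guess spelled out (the per-tile figure is just inferred from the WCCFTech number, not a confirmed spec):

```python
# Back-of-envelope core count for Clearwater Forest (guesswork, not specs).
cores_per_4_tiles = 96                    # figure reportedly from WCCFTech
cores_per_tile = cores_per_4_tiles // 4   # 24 cores per compute tile
tiles = 12                                # assumed maximum tile count
print(tiles * cores_per_tile)             # 288
```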
 
But, how is that "unlike AMD"?
It's simple. AMD's CCDs are not dependent on the cache chiplet since they have their own cache (L1 + L2 + L3); the additional chiplet simply extends the existing L3.
In Intel's approach the two are mutually dependent (one can't exist without the other), but in AMD's they aren't (one can exist without the other).
 
The gaming CPU market isn't large enough to justify it? This is more of the same: Intel sticking with the same safe bets and making excuses. It's not just raw performance; even Arrow Lake's perf-per-watt isn't close to Zen 4 and Zen 5 X3D parts.

"Local cache"? *Face palm* Cache is inherently local, lol, at least compared to main system memory (RAM). I'm not saying they need to go crazy on marketing it, but that sounds like zero effort.
I think maybe the news reporter was confused. Intel was using the term "local cache" the way it has always been used: to refer to L3 and L4 cache (and L2, L1, and L0 too, I guess).
 
But, how is that "unlike AMD"?
AMD's approach: CPU dies connect to support die via substrate links. CPU dies have some amount of local cache on them. 3D V-cache die sits on top of CPU dies connected via TSVs.

Intel's approach: CPU dies connect to support die via TSVs. CPU dies do not have their own cache (or at least not L3, possibly not L2); it instead lives on the support die.

In other words: AMD's approach is to bond an extra cache die to the CPU die and leave the support die alone. Intel's is to leave the CPU die alone and swap out the support die it's bonded to for one with whatever cache amount that SKU requires.
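To make the dependency difference concrete, here's a toy model in code; all the names and cache sizes are made up for illustration, not actual Intel/AMD specs:

```python
from dataclasses import dataclass, field

# Toy model of the two packaging approaches; numbers are illustrative only.

@dataclass
class AmdCcd:
    l3_mb: int = 32        # CCD carries its own L1/L2/L3 and works standalone
    vcache_mb: int = 0     # optional 3D V-Cache die stacked on it

    def total_l3(self) -> int:
        # The stacked die merely extends the L3 the CCD already has.
        return self.l3_mb + self.vcache_mb

@dataclass
class IntelComputeTile:
    cores: int = 24        # cores only; last-level cache is NOT on this tile

@dataclass
class IntelBaseTile:
    cache_mb: int          # the "Local Cache" lives here
    tiles: list[IntelComputeTile] = field(default_factory=list)

# AMD: the CCD is functional with or without the extra die.
print(AmdCcd().total_l3(), AmdCcd(vcache_mb=64).total_l3())  # 32 96

# Intel (per the article): a compute tile has no last-level cache of its own,
# so it only functions as a pair with the base tile it's stacked on.
base = IntelBaseTile(cache_mb=128, tiles=[IntelComputeTile() for _ in range(4)])
print(base.cache_mb, len(base.tiles))  # 128 4
```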
 
The article said:
Until now, even with a disaggregated design, Intel has employed "Compute Tiles" featuring all cores alongside their respective caches linked via the Ring Bus.
Intel's server CPUs haven't used a ring bus in a long time (Broadwell used up to 2.5 rings). Granite Rapids uses what Intel calls a "Modular Mesh Fabric".


So, does anyone know where the memory controllers will go, in Clearwater Forest?
 
"Local cache"? *Face palm* Cache is inherently local, lol, at least compared to main system memory (RAM).
The degree of locality varies. L2 cache is either per-core (P-cores) or per-cluster (E-cores). Intel traditionally has an L3 cache domain that spans the entire CPU, whereas AMD's L3 cache domain is just per CCD.

In Lunar Lake, Intel introduced something they call a System Level Cache, which is like an L4 cache with a domain spanning all of the CPU cores, the NPU, and the iGPU. I'm not sure whether it includes PCIe, but I'd guess so.

Anyway, I'd guess that what Intel means is that the caching domain supported by each of those base tiles is limited to just the specific CPU dies stacked on them.
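If you want to see these sharing domains for yourself, Linux exposes them through sysfs. A minimal sketch, assuming the standard cacheinfo layout:

```python
from pathlib import Path

def print_cache_domains(cpu: int = 0) -> None:
    """List each cache level of one core and which CPUs share it."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    for idx in sorted(base.glob("index*")):
        level = (idx / "level").read_text().strip()
        ctype = (idx / "type").read_text().strip()
        size = (idx / "size").read_text().strip()
        shared = (idx / "shared_cpu_list").read_text().strip()
        print(f"L{level:<2} {ctype:<12} {size:>8}  shared by CPUs {shared}")

print_cache_domains(0)
# On an AMD part, L3's shared_cpu_list stops at the CCD boundary; on a typical
# Intel client part it spans every core in the package.
```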
 
AMD's approach: CPU dies connect to support die via substrate links. CPU dies have some amount of local cache on them. 3D V-cache die sits on top of CPU dies connected via TSVs.
In Zen 5, the V-Cache die is underneath and I/O gets routed down through it. However, the 3D V-Cache die is still a separate and optional thing, apart from the substrate or interposer, which I think remain just passive.
 
So, does anyone know where the memory controllers will go, in Clearwater Forest?
To my knowledge Intel hasn't given away much of anything with regard to what is on the I/O vs. compute tiles. GNR has memory controllers on each compute tile, but I'm not sure if that's to minimize overall latency or to optimize for their NUMA split.

If CWF is a maximum of 12 compute tiles, it's entirely possible there's a single-channel memory controller on each one. The SKUs would then likely be made up of either 12 or 8 compute tiles, with disabled core clusters within the tiles. At the same time, if there's cache in the base tiles connecting each set of 4 compute tiles, perhaps it's doing double duty: adding cache and lowering the impact of memory controllers that are further away.
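For what it's worth, how such a NUMA split looks to software shows up in the ACPI SLIT distances. A quick Linux sketch, assuming the usual sysfs node layout:

```python
from pathlib import Path

# Print the NUMA distance matrix: bigger numbers mean memory that is
# "further away" (more hops / higher latency) from that node's cores.
for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    distances = (node / "distance").read_text().split()
    print(node.name, distances)
# A system with sub-NUMA clustering enabled reports multiple nodes per
# socket, with larger distances to the more remote memory controllers.
```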

It will certainly be interesting to see what the design ends up being.
 
I'm completely unsurprised that Intel's lack of a direct X3D competitor comes down to cost versus volume. Even if the CWF design bears fruit, I wouldn't count on seeing that type of implementation at the client level. The only way I really see it happening is if Intel can work it into their tile designs. The two hypotheticals that make the most sense to me are replacing a filler tile with cache (assuming it's adjacent to compute tiles), or, if they disaggregate P- and E-cores, replacing E-cores with cache for a gaming part.
 
If CWF is a maximum of 12 compute tiles, it's entirely possible there's a single-channel memory controller on each one. The SKUs would then likely be made up of either 12 or 8 compute tiles, with disabled core clusters within the tiles. At the same time, if there's cache in the base tiles connecting each set of 4 compute tiles, perhaps it's doing double duty: adding cache and lowering the impact of memory controllers that are further away.
On an aesthetic level, I like symmetrical and distributed designs.

My favorite EPYC was the first gen. If they'd had enough lanes for each CCD to be fully connected to all of the others, you'd always be zero or one hop away from memory, whereas current EPYC is always exactly one hop away.

Of course, all-to-all connectivity doesn't scale well as the number of chiplets increases. But if you then put them in a toroidal mesh, you could keep the memory controllers distributed and still keep the number of hops down.
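As a rough sketch of the trade-off (hop counts for hypothetical chiplet counts, nothing vendor-specific):

```python
from itertools import product

def torus_avg_hops(n: int, m: int) -> float:
    """Average hop count between distinct nodes of an n x m 2D torus,
    where a hop is one step of wrap-around Manhattan distance."""
    pts = list(product(range(n), range(m)))
    total = pairs = 0
    for (x1, y1), (x2, y2) in product(pts, pts):
        if (x1, y1) == (x2, y2):
            continue
        dx, dy = abs(x1 - x2), abs(y1 - y2)
        total += min(dx, n - dx) + min(dy, m - dy)
        pairs += 1
    return total / pairs

# Fully connected (like a 4-die first-gen EPYC) is always 1 hop between dies.
print(torus_avg_hops(2, 2))   # ~1.33: even at 4 nodes, a torus loses a little
print(torus_avg_hops(4, 3))   # average hops for a hypothetical 12-chiplet torus
```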
 
Maislinger stated, "But for us, this (gaming) is not an extremely large mass market. You still have to see that we sell a lot of CPUs that are not necessarily used for gaming. We still have it (3D Stacked Cache) technologically. This means that next year there will be a CPU (Clearwater Forest) for the first time that has a cache tile, but not on desktop."

It doesn't matter if it's "mass market" when there's a part of the market that's willing to pay a price premium for a product, and the "gamer" market is an ever-growing one where people will spend as much on a GPU as the "mass market" spends on an entire PC. In addition to getting a price premium, you also gain manufacturing experience, so you don't have to pit an early-generation product against a competitor's mature one.
 
It doesn't matter if it's "mass market" when there's a part of the market that's willing to pay a price premium for a product, and the "gamer" market is an ever-growing one where people will spend as much on a GPU as the "mass market" spends on an entire PC. In addition to getting a price premium, you also gain manufacturing experience, so you don't have to pit an early-generation product against a competitor's mature one.
Market size absolutely does matter when you manufacture on the scale that Intel does with the designs they have. The only reason we have X3D parts is that AMD was designing stacked cache for EPYC and uses the same CCDs across enterprise and client. That's not the case for Intel, where nothing physical carries over between enterprise and client.

Enthusiast-level parts are likely a rounding error on Intel's books, so carving out an even smaller niche that requires its own run makes no financial sense. In fact, I'd bet it would actually not be profitable at all for Intel given their current designs, even if they could raise the price by 25-50%.

It will take some very intentional design choices from the start of a CPU design process for it to make sense.
 
I guess we'll have to see how this plays out. They're ceding the gaming crown to AMD; if that lets them design server cache that beats AMD's on server workloads, that'll be big. If they can't… I mean, I'm here commenting because this looks close to a bet-the-company move. If they can't hold the general server market, they're going to suffer IBM's fate: niche hardware for lucrative, not-going-anywhere-soon niches and a reputation that still holds some of its glory-days lustre.
 
"Splits the core and cache into separate tiles" but doesn't say it stacks them.... That means nothing says it will actually come at the sort of capacities that X3D brings (cache did not really shrink at all since 14nm and so is now huge compared to core logic).

If it's not going to be stacked, then it's another market-segmentation ploy IMO, unless (unlikely) they're having cache yield issues or outsourcing cache tiles to Samsung or something.
 
I don't understand, "However, this is unlike AMD's X3D approach since the CPU chiplets are mutually dependent on the Base tile."
The CPU tile only contains the cores, with no cache, so it wouldn't function without the base tile, as all the cache resides there.
In contrast, AMD uses a fully functional CPU chiplet with all three levels of cache integrated, and then attaches a 3D V-Cache chiplet to add additional L3 cache.
 
Re: gaming
This move appears to be primarily motivated by core density: move all cache off the CPU tile = more cores per tile. This works best with E-cores, and E-cores aren't best for gaming, so this approach won't work well for a gaming CPU.

Extrapolating further, I believe Maislinger is hinting that it's not worthwhile to design a cache-less P-core + base tile combo, especially for the low-core-count desktop market.
 
(cache hasn't really shrunk since 14nm and so is now huge compared to core logic).
I'm curious why you say that.

[Chart: SRAM bit-cell density, TSMC N3B vs. N3E]


Source: https://fuse.wikichip.org/news/7343/iedm-2022-did-we-just-witness-the-death-of-sram/

An addendum to the above link should be:

Which basically states that TSMC's N2 node achieves a 20% SRAM density improvement over N5/N3. It's not huge, but it's in line with the other purported density improvements offered by N2 vs N3.
 
I'm curious why you say that.
[Chart: SRAM bit-cell density, TSMC N3B vs. N3E]

An addendum to the above link should be:

Which basically states that TSMC's N2 node achieves a 20% SRAM density improvement over N5/N3. It's not huge, but it's in line with the other purported density improvements offered by N2 vs N3.

That's the only free link I can find that wasn't a YouTube video, but some of the paid sources on there will give more context and detail.

I'm not saying TSMC's scaling isn't impressive, but even in your slide, if you ignore the curved line (which implies SRAM will never shrink anyway, since it's flat at the end: you can't call that Moore's law when it's an inverse exponential), you can see they achieved a 50% reduction between 10nm and 3nm, the same as they managed between 16nm and 10nm alone. So it took them 3.5 node hops to do what one hop used to deliver, which matches what I said about ~14nm being (Intel's) last node where SRAM scaled well. Now they are boasting about 20%.

They are probably way ahead of Intel Foundry here too. I mean, we know the node names are just branding at this point, but 16 -> 10 is a 40% reduction while 10 -> 3 implies a 3x+ one, or at least that's what the names suggest.

That's far less scaling than they're achieving with transistor density, and it's the whole reason why E-cores, dense cores, or whatever you want to call them are having their day in the sun. In essence, when you copy over your old Verilog to start your new die shrink, the first thing you'll see is the SRAM taking up a larger share of the die, because everything else has shrunk more.
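Running the commonly reported TSMC high-density bit-cell areas through that argument (treat the exact figures as approximate; they're from public reporting like the WikiChip piece above, not my measurements):

```python
# Commonly reported TSMC high-density SRAM bit-cell areas, in um^2.
cells = {"16nm": 0.074, "10nm": 0.042, "7nm": 0.027, "5nm": 0.021, "N3E": 0.021}

nodes = list(cells)
for a, b in zip(nodes, nodes[1:]):
    print(f"{a} -> {b}: {1 - cells[b] / cells[a]:.0%} bit-cell area reduction")

# 16nm -> 10nm alone: ~43%.  10nm -> N3E overall: 1 - 0.021/0.042 = 50%,
# i.e. several node hops to match what a single hop used to deliver.
print(f"10nm -> N3E: {1 - cells['N3E'] / cells['10nm']:.0%}")
```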

EDIT: Did I miss them saying they would use TSMC? Not doubling down, just genuinely asking, because if they were using N3E for cache ($$$, but yeah) and Intel 7/5 for core tiles, then it's a whole different ball game.
 