News AMD and JEDEC Develop DDR5 MRDIMMs With Speeds Up To 17,600 MT/s

Quad data rate means 400 MHz = 1600 MT/s (QDR is mostly seen on GPUs)
The way I saw it explained for DDR2 is that it maintained sending on both the rising and falling edges, but then halved the DRAM clock relative to the interface clock. Hence, the rate at which data is being sent is 4x per tick of the DRAM clock. I'm not sure exactly where DDR3 got another doubling, unless it was by halving the DRAM clock yet again.
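As a rough sketch of the arithmetic in the two posts above, assuming the usual relationship of data rate = clock × transfers per clock. The function name and the 200 MHz DDR2 core clock are illustrative choices, not values from the thread.

```python
def transfers_per_second(clock_mhz: float, transfers_per_clock: int) -> float:
    """Return the data rate in MT/s for a given clock (MHz) and transfers per clock."""
    return clock_mhz * transfers_per_clock

# Quad data rate: four transfers per clock tick.
qdr_rate = transfers_per_second(400, 4)

# DDR2 as described above: the interface still sends on both clock edges,
# but the DRAM core runs at half the interface clock.
dram_core_mhz = 200                    # hypothetical core clock
interface_mhz = dram_core_mhz * 2      # interface clock is double the core clock
ddr2_rate = transfers_per_second(interface_mhz, 2)
transfers_per_core_tick = ddr2_rate / dram_core_mhz

print(qdr_rate, ddr2_rate, transfers_per_core_tick)
```

Under these assumptions the DDR2 case works out to four transfers per DRAM core tick, which is the "4x per tick" described in the post.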

DDR5 splits 64 bits into two 32-bit streams (aka QDR or dual channel at half bandwidth), no bandwidth benefit, just latency related.
I think you're saying that only because you assume they could've kept scaling up frequencies at the current channel data width. I'm not sure that's true.
 
M(C)R-DIMMs don't put any extra load on memory controllers besides the strain from increased bus speed since all control and data signals are buffered on each DIMM and look like a single-point load to the memory controller no matter how many chips are on the buffer's back-end.
V4KUdzn.png

9dpYdSP.jpg

I think the MCR Buffer would be more useful for Ranks that come from "Multiple Rows" on a DIMM.

Eventually DDR6 will have DIMM Memory Sub-Channel (A + B on side 1) & (C + D on side 2), that means 4x DIMM Memory Sub-Channels.

Multiple Ranks will come from Multiple Rows.

Eventually I see more than 2x ROWS of RAM Packages per DIMM moving into the server world.

I expect Double Height DIMM Modules will become more common in the Server world, with more than 2x ROWS of RAM Packages, some time in the future once DDR6 and its 4x DIMM Memory Sub-Channels become standard. One Row ain't enough; you'll grow as many rows as there are Memory Ranks.
pf6XvAo.jpg
 
Eventually I see more than 2x ROWS of RAM Packages per DIMM moving into the server world.
Registered DIMMs (server memory) with multiple ranks of memory have been around for decades, nothing new there. The problem with those is that having 4X as much memory per DIMM reduces bandwidth per chip by 4X since they are all sharing the same common low-speed bus, and having four devices sharing data pins makes it difficult to scale bus speed. M(C)R multiplexes the data pins between DRAM chips at 2X/4X the data rate to make bandwidth scale with rank count.
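A minimal sketch of the scaling argument in this post. The 4,800 MT/s shared-bus figure is hypothetical, and reading the headline's 17,600 MT/s as two ranks multiplexed at 8,800 MT/s each is an assumption, not something stated in the thread.

```python
def per_rank_share(bus_mts: float, ranks: int) -> float:
    """Average share of a shared bus per rank when only one rank drives it at a time."""
    return bus_mts / ranks

def muxed_host_rate(dram_mts: float, ranks_muxed: int) -> float:
    """Host-side rate when a buffer interleaves several ranks onto one faster bus."""
    return dram_mts * ranks_muxed

# Conventional multi-rank DIMM: four ranks take turns on one bus.
shared_bus_mts = 4800                      # hypothetical bus speed
share = per_rank_share(shared_bus_mts, 4)

# Multiplexed ranks: the buffer runs the host side at a multiple of the DRAM rate.
host_rate = muxed_host_rate(8800, 2)       # assumed 2 ranks at 8,800 MT/s each

print(share, host_rate)
```

With the shared bus each of the four ranks averages a quarter of the bus rate, while the multiplexed case lets the host-side rate grow with the number of ranks being interleaved.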
 
Registered DIMMs (server memory) with multiple ranks of memory have been around for decades, nothing new there. The problem with those is that having 4X as much memory per DIMM reduces bandwidth per chip by 4X since they are all sharing the same common low-speed bus, and having four devices sharing data pins makes it difficult to scale bus speed. M(C)R multiplexes the data pins between DRAM chips at 2X/4X the data rate to make bandwidth scale with rank count.
I know, that's why I brought it up. But combine the MCR Buffers with a JEDEC-developed, open source / open patent equivalent of Micro-Threading for use in DDR6 or higher, since RAMBUS' Patent on Micro-Threading is due to expire in 2025 and I doubt DDR6 will be here by 2025; it's too early in the life cycle for that to happen. Stack that on top of 4x DIMM Memory Sub-Channels for each DDR6 DIMM Module, add more DIMM Memory Sub-Channels with future iterations of DDR past 6, and you have a recipe for more bandwidth.
n4qIdOz.png
 
Stack that on top of 4x DIMM Memory Sub-Channels for each DDR6 DIMM Module and adding more DIMM Memory Sub-Channels moving forward with future iterations of DDR past 6; you have a recipe for more bandwidth.
By the time DDR6 reaches maturity, mainstream CPUs will have on-package if not 3D-stacked primary DRAM and external memory expansion will shift to PCIe-attached memory controllers.
 
I see the left half of that image seems to be sourced from Rambus, but where did you get the right half of that image? What's your source on that claim? When are you claiming they developed it, and in what context?
I'm saying JEDEC should design / implement it in an open source / open patent way.
Some way that doesn't involve RAMBUS' patent on Micro-Threading since that patent is about to expire in 2025.

Give it a new name, implement it as part of the JEDEC standard for future DDR iterations.
 
By the time DDR6 reaches maturity, mainstream CPUs will have on-package if not 3D-stacked primary DRAM and external memory expansion will shift to PCIe-attached memory controllers.
We'll see in due time.

The more things change, the more things stay the same.

I don't think DIMM slots connected to the CPU will go away anytime soon.

The latency penalty from running through PCIe PHY & the CXL.Mem protocol won't be worth it compared to faster memory attached directly to the CPU's Memory Controller.

In fact, both will co-exist as another set of tiered Main memory.

That's why we have multiple tiers of memory, you can never have enough faster memory.

L4 = On-Package Memory
L5 = Main Memory attached to DIMM
L6 = PCIe Attached Memory.
 
That's why we have multiple tiers of memory, you can never have enough faster memory.

L4 = On-Package Memory
L5 = Main Memory attached to DIMM
L6 = PCIe Attached Memory.
That's not justifiable. If you look at the relative latency & bandwidth of layers in the cache hierarchy, there's a much bigger gap than what would exist between what you call L5 and L6. Capacity-wise, it's also hard to justify having both L4 and L5.
 
The latency penalty from running through PCIe PHY & the CXL.Mem protocol won't be worth it compared to faster memory attached directly to the CPU's Memory Controller.
When you have 128+GB of on-package HBM3, like every high-end server CPU is likely going to have by the end of 2024, your application's active working set is unlikely to overflow to external memory often enough to cause major performance issues unless you are plotting Chia in memory. The generally negligible performance impact won't be worth wasting 80 pins per extra dedicated memory channel when those pins could be better used for an extra PCIe 5.0 x16 interface that you could plug 8X as much memory into at 4-6X the speed.
 
That's not justifiable. If you look at the relative latency & bandwidth of layers in the cache hierarchy, there's a much bigger gap than what would exist between what you call L5 and L6. Capacity-wise, it's also hard to justify having both L4 and L5.
We'll see what customers choose to do.

When you have 128+GB of on-package HBM3, like every high-end server CPU is likely going to have by the end of 2024, your application's active working set is unlikely to overflow to external memory often enough to cause major performance issues unless you are plotting Chia in memory. The generally negligible performance impact won't be worth wasting 80 pins per extra dedicated memory channel when those pins could be better used for an extra PCIe 5.0 x16 interface that you could plug 8X as much memory into at 4-6X the speed.
I have a feeling on-package SRAM would be even faster with even lower latency and higher bandwidth.

Given TSMC's stacking technology with 3Dv-Cache and the Die-Area real estate of a typical Zen core, having a dedicated Core Slot with pure SRAM and many layers of SRAM just to store your data set would be the fastest memory possible; WAY faster and lower latency than anything DRAM based.

The fundamental limits of DRAM are coming into play; if you want speed, SRAM > DRAM.

Full Duplex vs Half Duplex.

SRAM L4 Cache would be expensive, but the fastest Memory possible, faster than DRAM, faster than HBM.

Hard to beat the fundamental principles of how each memory type operates.

Yes, DRAM has the capacity, but here is what I've estimated you can pack onto Zen Die Real Estate (in multiples of an 8x Core die area), given the limits of TSMC's stackable dies.

On 16x Cores worth of Zen Die Real Estate on 5nm, I estimate ~6,336 MiB using the upper stacking process limit of 12 layers.
On 32x Cores worth of Zen Die Real Estate on 5nm, I estimate ~12,672 MiB using the upper stacking process limit of 12 layers.

The $360-$600 or more BoM cost per stacked SRAM cache die would be worth it for Hyper Scalers or those in HPC and those who need the fastest memory alive.

We all know how fast SRAM cache is and how low latency it is, there's nothing in the DRAM world that can touch it in terms of pure speed or latency.

As far as capacity, that's what DRAM has in spades. But given ~12 GiB of L4 scratch area, I'm sure that could make most applications sing while having MASSIVE Bandwidth & Latency close to crossing CCD/CCX domains for accessing L3 cache. That's still better than anything DRAM has currently.

We also know that Enterprise / HPC markets are willing to pay for such high performance parts.

And we also know that AMD LOVES SRAM and isn't afraid to use more of it. Imagine sacrificing 1x or 2x CCD slots on a CPU for L4$ Stacked SRAM dies.

That would be INSANELY fast, that buffers everything before you hit DRAM.
 
I have a feeling on-package SRAM would be even faster with even lower latency and higher bandwidth.
SRAM may be faster with lower latency but it also has much lower cell density and is ~10X more expensive. 128GB of SRAM would require ~60 000sqmm of wafer space at 5nm, which would come with a price tag of about $15 000. 128GB of DDR5 on the other hand can be had for $600 and HBM should be around $1500, though I read Micron and friends have massively increased their HBM prices now that everyone wants heaps of it for AI.

Large SRAM isn't economically viable for general-purpose computing. That is why we have L1/2/3 and all of the new DDR channel sub-division and multiplexing hacks to make DRAM workable for a little while longer until on-package memory can take over most of the heavy lifting.
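The figures in this post can be cross-checked with a few lines of arithmetic. This is only a sketch that takes the post's own napkin numbers as inputs (60,000 sqmm and $15,000 for 128GB of SRAM, $600 for DDR5, $1,500 for HBM); the variable names are illustrative.

```python
# Inputs taken directly from the post's estimates.
capacity_gb = 128
sram_area_mm2 = 60_000
sram_cost_usd = 15_000
ddr5_cost_usd = 600
hbm_cost_usd = 1_500

# Implied wafer cost and SRAM density behind those estimates.
cost_per_mm2 = sram_cost_usd / sram_area_mm2
density_mb_per_mm2 = capacity_gb * 1024 / sram_area_mm2

# Price ratios between the three options at equal capacity.
sram_vs_ddr5 = sram_cost_usd / ddr5_cost_usd
sram_vs_hbm = sram_cost_usd / hbm_cost_usd

print(cost_per_mm2, round(density_mb_per_mm2, 2), sram_vs_ddr5, sram_vs_hbm)
```

Taken at face value, the estimates imply about $0.25 per sqmm of 5nm silicon and a price gap of 25X against DDR5 and 10X against HBM for the same 128GB.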
 
SRAM may be faster with lower latency but it also has much lower cell density and is ~10X more expensive. 128GB of SRAM would require ~60 000sqmm of wafer space at 5nm, which would come with a price tag of about $15 000. 128GB of DDR5 on the other hand can be had for $600 and HBM should be around $1500, though I read Micron and friends have massively increased their HBM prices now that everyone wants heaps of it for AI.

Large SRAM isn't economically viable for general-purpose computing. That is why we have L1/2/3 and all of the new DDR channel sub-division and multiplexing hacks to make DRAM workable for a little while longer until on-package memory can take over most of the heavy lifting.
For general purpose computing, for normal folks. I agree.

But for the Enterprise sector, $600 is a drop in the bucket, and the speed & latency is hard to beat.
I think on-package DRAM will have its place, but SRAM offers something technologically that DRAM can't compete with.

And it's familiar tech, nothing new under the sun, just an implementation issue.
 
But for the Enterprise sector, $600 is a drop in the bucket, and the speed & latency is hard to beat.
I think on-package DRAM will have its place, but SRAM offers something technologically that DRAM can't compete with.
No matter how much faster SRAM might be, you won't be able to pack 60 000sqmm worth of SRAM silicon close enough to a CPU or GPU to make it actually work, even if you didn't mind the ~25X higher price tag than DDR5. Even if you stacked them 8-high like HBM you'd still need almost 8000sqmm of substrate space to put them on.

Modern CPUs and GPGPUs have enough on-chip cache that even supercomputers which are the epitome of "money is no object" cannot be bothered with using SRAM as system memory.

There likely are zero applications using general-purpose processors where using SRAM as system memory would provide enough performance to justify a 25X higher price tag and the increased power draw from all that SRAM's leakage current.
 
No matter how much faster SRAM might be, you won't be able to pack 60 000sqmm worth of SRAM silicon close enough to a CPU or GPU to make it actually work, even if you didn't mind the ~25X higher price tag than DDR5. Even if you stacked them 8-high like HBM you'd still need almost 8000sqmm of substrate space to put them on.
The Die Area of Zen 4 manages to stack 3DvCache, that's literally pure SRAM stacked on top.

So you don't need to make that big of a Substrate in the modern era with modern Die Stacking.

And replace the Core Areas with the L3 Cache areas and stack more 3DvCache on top, and you have A LOT of SRAM in a small package.

Modern CPUs and GPGPUs have enough on-chip cache that even supercomputers which are the epitome of "money is no object" cannot be bothered with using SRAM as system memory.
Do they really? That's coming dangerously close to saying 640K is enough.

They could always use "More Cache"

There likely are zero applications using general-purpose processors where using SRAM as system memory would provide enough performance to justify a 25X higher price tag and the increased power draw from all that SRAM's leakage current.
We'll have to test it out to find out. That's what lab R&D is for =D
 
The Die Area of Zen 4 manages to stack 3DvCache, that's literally pure SRAM stacked on top.

So you don't need to make that big of a Substrate in the modern era with modern Die Stacking.

And replace the Core Areas with the L3 Cache areas and stack more 3DvCache on top, and you have A LOT of SRAM in a small package.
The V-cache chip is only 64MB in 36sqmm of 7nm silicon. 128GB is 2048X that amount, which is 73 000sqmm worth of silicon. Even if you stack that 8-high, the total footprint would still be over 8000sqmm, ~14X the size of an RTX4090 die. The substrate would need to be huge, no ifs or buts about it even if you layered those 8-tall stacks on top of CCDs, IODs and whatever else may be on there.
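The napkin math in this paragraph can be reproduced as a short sketch. The V-cache figures come from the post; the 608 sqmm RTX 4090 die size is an assumption (the commonly cited AD102 figure), and the variable names are illustrative.

```python
# Figures from the post.
vcache_mb = 64
vcache_area_mm2 = 36
target_gb = 128
stack_height = 8

# Assumption: commonly cited AD102 (RTX 4090) die size, not stated in the post.
rtx4090_die_mm2 = 608

dies_needed = target_gb * 1024 // vcache_mb      # V-cache-sized dies for 128GB
total_silicon_mm2 = dies_needed * vcache_area_mm2
footprint_mm2 = total_silicon_mm2 / stack_height  # floor space when stacked 8-high
footprint_vs_4090 = footprint_mm2 / rtx4090_die_mm2

print(dies_needed, total_silicon_mm2, footprint_mm2, round(footprint_vs_4090, 1))
```

With these inputs the footprint lands a little over 9,000 sqmm, consistent with the post's "over 8000sqmm", and the ratio to the assumed GPU die size comes out in the same mid-teens range as the post's ~14X.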

SRAM also has scaling issues: bigger SRAM has higher latency from having more decoding and routing logic overhead, especially if you want to maintain access concurrency from any client to any SRAM chunk. When AMD and Intel increase any cache tier's size, it usually comes at the expense of 1-3 extra latency cycles for L2 and L3. They generally don't touch L1 size unless they can get it for zero added latency, which is why L1 sizes in performance-oriented cores have only increased from 16KB to 64KB in 20+ years. Zen 4's larger L2 cache came at the expense of latency going up from 12 to 14 cycles while the V-cache came at the expense of L3 latency going up from 46 to 50 cycles. If going up 96X in size costs 3X in latency, going up to 2048X would mean ~200 cycles of latency and make SRAM at best marginally better than worst-case DRAM.
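How far the latency climbs depends on the scaling model you assume. The sketch below tries two simple ones, neither of which is claimed to be the method behind the post's estimate: a fixed number of extra cycles per doubling of capacity, and a power law fitted to "96X in size costs 3X in latency". The 14-cycle and 50-cycle data points are the post's; applying the 2048X growth on top of the 96MB L3 is an assumption.

```python
import math

# Data points from the post: 1MB L2 at 14 cycles, 96MB L3 (with V-cache) at 50 cycles.
l2_size_mb, l2_cycles = 1, 14
l3_size_mb, l3_cycles = 96, 50
growth = 2048  # assumed further size increase applied on top of the L3

# Model A: each doubling of capacity adds a fixed number of cycles.
cycles_per_doubling = (l3_cycles - l2_cycles) / math.log2(l3_size_mb / l2_size_mb)
log_estimate = l3_cycles + cycles_per_doubling * math.log2(growth)

# Model B: power law, latency ~ size**k, with k fitted so 96X in size gives 3X latency.
exponent = math.log(3) / math.log(96)
power_estimate = l3_cycles * growth ** exponent

print(round(log_estimate), round(power_estimate))
```

The per-doubling model gives an estimate a bit above 100 cycles and the power law gives a bit above 300, so the post's ~200 cycles sits between the two.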

We'll have to test it out to find out. That's what lab R&D is for =D
The R&D has already been done. That is how we ended up with the cache structures we have in today's CPUs and the server chips that get used in supercomputers.
 
The V-cache chip is only 64MB in 36sqmm of 7nm silicon. 128GB is 2048X that amount, which is 73 000sqmm worth of silicon. Even if you stack that 8-high, the total footprint would still be over 8000sqmm, ~14X the size of an RTX4090 die. The substrate would need to be huge, no ifs or buts about it even if you layered those 8-tall stacks on top of CCDs, IODs and whatever else may be on there.
Your math is a little wonky there. Using the same Density as 3DvCache.
Ffk7AGK.png
If you strip out all the Core stuff & L2 $ and use only L3$ at the base.

That's 32 MB for Base level only, Upper level 3DvCache is Double Density = 64 MB
Each 3DvCache area does slightly over-hang the L2$ area and that approximates ~36 mm²

Getting rid of unnecessary structures and minimizing connectors to what is necessary, I can easily see 128 MB per Layer using the full CCD area for a standard 8x Core Die-Area for the stacked 3DvCache. The base layer will obviously be smaller in cache size by ~(½ to ⅞) due to the need for 3D TSVs and the vertical communication channels.

If you go for 16x Core Die-Area floor plan, you can double that to 256 MB per 3DvCache stack.

Add up to TSMC's 12 stack maximum that they've stated.

You've got A LOT of L3$ available.

SRAM also has scaling issues: bigger SRAM has higher latency from having more decoding and routing logic overhead, especially if you want to maintain access concurrency from any client to any SRAM chunk. When AMD and Intel increase any cache tier's size, it usually comes at the expense of 1-3 extra latency cycles for L2 and L3. They generally don't touch L1 size unless they can get it for zero added latency, which is why L1 sizes in performance-oriented cores have only increased from 16KB to 64KB in 20+ years. Zen 4's larger L2 cache came at the expense of latency going up from 12 to 14 cycles while the V-cache came at the expense of L3 latency going up from 46 to 50 cycles. If going up 96X in size costs 3X in latency, going up to 2048X would mean ~200 cycles of latency and make SRAM at best marginally better than worst-case DRAM.
L1 is optimized for Speed, that's why the caches have been so tiny.
L2 is optimized as a balance between Density & Speed.
L3 is more optimized for Density, ergo the larger amount of Cache and slower speeds.
Even at 50 Cycles, it's not that big of a deal.
It's not going up 96X in size.

You only need to look at modern large 32 MB of SRAM cache and see the actual latency in the real world.
VZhqgnr.png
At 9.4 ns for the large L3$, it's still WAY better than standard main memory at 63.2 ns

And the bandwidth potential is literally on different orders of magnitude.
 
I think on-package DRAM will have its place, but SRAM offers something technologically that DRAM can't compete with.
Depending on the size & compactness of your working set, additional SRAM cache might provide little or no benefit. We've seen that with benchmarks of AMD's 3D V-Cache CPUs, where some workloads benefit substantially and others experience so little benefit that it doesn't even outweigh the loss in clockspeed.

Ultimately, what it comes down to is perf/$. I'd hazard a guess that the aggregate performance benefits of increasing cache size are basically logarithmic, but the cost increases faster than linear. That's hard math to overcome.
 
Your math is a little wonky there. Using the same Density as 3DvCache.
Nothing wonky there. I used 2nd-gen V-cache as a reference because V-cache chips are pretty much all SRAM and the interconnects you would still need to stack it.

Even if you shrink it to 5nm to reduce the total footprint to ~73000sqmm as in my earlier napkin calculation and stacked those 12-high, you'd still need over 6000sqmm of floor space. BTW, my calculations that you say are "wonky" seemingly because you think the density is too low yield ~1GB of SRAM per 8-tall CCD-sized SRAM stack. Looks like your "256MB per stack on top of 16x cores" is missing a multiplication by stack height. 256MB of SRAM on a double-sized CCD is approximately the same per-layer density as V-cache chips.

It's not going up 96X in size.
1MB L2 to 96MB L3 is 96X in size and the latency increase shown in your numbers is 4X instead of 3X. Going from 96MB to 128GB is a ~1365X increase, so we could expect an average access time latency increase by at least another 4X, quite possibly much worse if reads and writes have to traverse multiple hops across the fabric to get from the CPU to SRAM and back, which they certainly would have to when the SRAM spans 6000sqmm of floor space and 70000+sqmm of total silicon.

And the bandwidth potential is literally on different orders of magnitude.
Not really. SRAM stacks will be limited by bus bandwidth the same way DDR5 and HBM are. Based on how DDR5 platforms show no more scaling between 1R and 2R DIMMs, it looks like 32 banks is all DRAM needs to achieve sufficient concurrency and pipelining to keep the bus consistently saturated. Using SRAM won't improve bandwidth much beyond what HBM can achieve.
 
Nothing wonky there. I used 2nd-gen V-cache as a reference because V-cache chips are pretty much all SRAM and the interconnects you would still need to stack it.

Even if you shrink it to 5nm to reduce the total footprint to ~73000sqmm as in my earlier napkin calculation and stacked those 12-high, you'd still need over 6000sqmm of floor space. BTW, my calculations that you say are "wonky" seemingly because you think the density is too low yield ~1GB of SRAM per 8-tall CCD-sized SRAM stack. Looks like your "256MB per stack on top of 16x cores" is missing a multiplication by stack height. 256MB of SRAM on a double-sized CCD is approximately the same per-layer density as V-cache chips.
No, I factored in stack height, I'm just using a slightly denser library and different layout of the SRAM for the 3DvCache stacks that comes out to 512 MiB per 3DvCache layer on a 16x Core Layout that's using Zen 3 Die Area instead of Zen 4 and maxes out as much SRAM as I can get for what I want to do on future Zen Iteration of their Product Stacks / CCD|CCX design.

1MB L2 to 96MB L3 is 96X in size and the latency increase shown in your numbers is 4X instead of 3X. Going from 96MB to 128GB is a ~1365X increase, so we could expect an average access time latency increase by at least another 4X, quite possibly much worse if reads and writes have to traverse multiple hops across the fabric to get from the CPU to SRAM and back, which they certainly would have to when the SRAM spans 6000sqmm of floor space and 70000+sqmm of total silicon.
I'm NOT TRYING to hit 128 GB of SRAM, that was never my goal, look carefully at the numbers I posted in the previous post.
You're trying to hit 128 GB with DRAM, while I'm trying to hit:
  • 6,336 MiB = 6.1875 GiB of SRAM using 16x Cores of Die Area w/ 12 Stacks Hi
  • 12,672 MiB = 12.375 GiB of SRAM using 32x Cores of Die Area w/ 12 Stacks Hi
There's a major difference in what I'm trying to hit vs what you're trying to do.

You want to solve the issue with DRAM attached to the side, I want L4 SRAM cache.

We can both have what we want, it's just a matter of what market needs what solution.

Not really. SRAM stacks will be limited by bus bandwidth the same way DDR5 and HBM are. Based on how DDR5 platforms show no more scaling between 1R and 2R DIMMs, it looks like 32 banks is all DRAM needs to achieve sufficient concurrency and pipelining to keep the bus consistently saturated. Using SRAM won't improve bandwidth much beyond what HBM can achieve.
HBM has its own latency issues that are worse than normal DDR, that's why HBM is more useful for Video Graphics than for regular System Memory.
Latency w/ good enough Bandwidth is King on regular System RAM for CPU's, that's why I stick with SRAM.
It has Latency & Bandwidth advantages.

And the Serial Bus Bandwidth will improve with time as they up the bandwidth over the major generation iterations.

The SRAM is WAY faster than the Dual GMI Bus on the Zen Die layouts, so it has plenty of room to grow as the Bus Link improves over time.
 
Depending on the size & compactness of your working set, additional SRAM cache might provide little or no benefit. We've seen that with benchmarks of AMD's 3D V-Cache CPUs, where some workloads benefit substantially and others experience so little benefit that it doesn't even outweigh the loss in clockspeed.

Ultimately, what it comes down to is perf/$. I'd hazard a guess that the aggregate performance benefits of increasing cache size are basically logarithmic, but the cost increases faster than linear. That's hard math to overcome.
That's why AMD made different SKU's.
7950X3D has 1x CCD w/ 3DvCache.
7800X3D w/ 1x Layer of 3DvCache
Most of the Product Stack lineup won't need 3DvCache

That's why you come up with different products to target different segments.
 
HBM has its own latency issues that are worse than normal DDR, that's why HBM is more useful for Video Graphics than for regular System Memory.

The SRAM is WAY faster than the Dual GMI Bus on the Zen Die layouts, so it has plenty of room to grow as the Bus Link improves over time.
HBM is fundamentally the same structure as any other DRAM from the last 25 years, the only reason it has slightly higher latency is because the HBM base die is basically an FBDIMM buffer re-clocking commands and data between the host and HBM stack. Integrate the base die functions into the memory controller, 3D-stack the HBM on top of the die the controller resides in and you eliminate the extra latency.

I don't think SRAM stacks would have any meaningful bandwidth advantage over HBM3 when HBM3 can push up to 2.4TB/s of memory bandwidth vs 2.5TB/s for AMD's gen2 V-cache.

If all you want is bigger L3 caches, a major problem is that increasing L3 cache will increase L3 cache latency, which also increases the time it will take before cache misses go to memory. There are already cases where AMD's V-cache is actually hurting performance either due to the increased cache miss penalties or reduced clocks to offset the increased thermal resistance. Most of THG's productivity suite appears to dislike V-cache.

Increased SRAM sizes aren't a one-size-fits-all solution to performance bottlenecks. If it was, AMD and Intel would be in a cache size war instead of a core count one.
 
HBM is fundamentally the same structure as any other DRAM from the last 25 years, the only reason it has slightly higher latency is because the HBM base die is basically an FBDIMM buffer re-clocking commands and data between the host and HBM stack. Integrate the base die functions into the memory controller, 3D-stack the HBM on top of the die the controller resides in and you eliminate the extra latency.
That means you wouldn't mind a chiplet-based Memory Controller solution that moves the entire Memory Controller off the I/O die and onto its own chip that connects to the I/O die via a serial connection like AMD does or a parallel connection like Intel prefers.

So you just shoved the bottleneck from the connection between HBM and the Memory Controller to the connection between the Memory Controller and the I/O die.

If you just shoved it onto the Integrated Memory Controller of the existing I/O die, that will have some weird, unforeseen cost consequences per I/O die.
That means the price of your I/O die will skyrocket or there might be other performance constraints, similar to the ones that 3DvCache has with its Thermal & Voltage sensitivities.

We don't know what kind of limitations or CON(s) comes with shoving DRAM directly on top of the I/O die.

I don't think SRAM stacks would have any meaningful bandwidth advantage over HBM3 when HBM3 can push up to 2.4TB/s of memory bandwidth vs 2.5TB/s for AMD's gen2 V-cache.
It will definitely have Latency advantages & it'll be a Full Duplex interface vs Half Duplex compared to DRAM.

If all you want is bigger L3 caches, a major problem is that increasing L3 cache will increase L3 cache latency, which also increases the time it will take before cache misses go to memory. There are already cases where AMD's V-cache is actually hurting performance either due to the increased cache miss penalties or reduced clocks to offset the increased thermal resistance. Most of THG's productivity suite appears to dislike V-cache.
That's why you have options between the 3DvCache part and the non 3DvCache part.
You make different SKU's with different cache sizes to fit your target market.

3DvCache seems to target gaming & simulation workloads very well, but sucks at everyday Apps.

There are many "Non-Standard" apps that are larger than the 32 MiB of L3$ that AMD seems to have stuck with for quite a while.
Those could really benefit with a standard CCD/CCX with larger than 32 MiB of L3$, but not TOO much larger.

Increased SRAM sizes aren't a one-size-fits-all solution to performance bottlenecks. If it was, AMD and Intel would be in a cache size war instead of a core count one.
They still kind of are in a SRAM war, but more in the Enterprise side than on the consumer side.
And Intel's solution was to create an HBM-specific version of their Sapphire Rapids SKUs and let their customers pick between w/HBM and w/o HBM.

So you already won on that front, Intel was the first to attach HBM to their Enterprise CPU's.
 
That means you wouldn't mind a chiplet-based Memory Controller solution that moves the entire Memory Controller off the I/O die and onto its own chip that connects to the I/O die via a serial connection like AMD does or a parallel connection like Intel prefers.

So you just shoved the bottleneck from the connection between HBM and the Memory Controller to the connection between the Memory Controller and the I/O die.
You don't need to shove the memory controller into a separate die to stack HBM or similar memory on top of it, HBM-like memory could be stacked directly on top of the CPU/GPU tiles to eliminate the cost and latency of intermediate silicon altogether.

Putting memory on a separate base die between the CPU/GPU and DRAM stacks is only a temporary work-around until solutions are found to thermal challenges that come with stacking things directly on top of a high-power die. Avoiding an excessive increase in thermal resistance between the CPU cores and IHS is one of the reasons why AMD's V-cache overlaps little more than the CCD's internal cache, a relatively low-power die area. If AMD made V-cache cover the entire CCD, it may have to cut clocks by another 500MHz for thermal management.
 
You don't need to shove the memory controller into a separate die to stack HBM or similar memory on top of it, HBM-like memory could be stacked directly on top of the CPU/GPU tiles to eliminate the cost and latency of intermediate silicon altogether.
True, but you add in another layer of insulation between the core logic and the Heat Spreader.
That Thermal Barrier is a real physics problem that is hard to solve.

Putting memory on a separate base die between the CPU/GPU and DRAM stacks is only a temporary work-around until solutions are found to thermal challenges that come with stacking things directly on top of a high-power die.
That's why my proposed solution for stacking is mounting the DRAM or SRAM stacks on the Opposite side of the PCB, directly behind the CPU/GPU.

Then you can cool the Front / Back.

Yes there are interconnect / PCB challenges, but that seems more solvable if they really wanted to do it, it's the best compromised solution that I can come up with.

Avoiding an excessive increase in thermal resistance between the CPU cores and IHS is one of the reasons why AMD's V-cache overlaps little more than the CCD's internal cache, a relatively low-power die area. If AMD made V-cache cover the entire CCD, it may have to cut clocks by another 500MHz for thermal management.
That's why 3DvCache never goes past the L2$ die-area.
That's as far as they're willing to go, and I concur, it makes a lot of sense.

The main issue with those Structural Silicon Shims above the Die is that their Heat Conductivity is TRASH.

But I wish they would consider putting small slices of Dymalloy Shims above the core die where the structural silicon is located, buried within a cut-out of the Structural Silicon or above a very thin layer of Structural Silicon. This way you have better transfer of heat out of the core logic area.

Dymalloy has a higher Thermal Conductivity than Copper @ 420.00 W/(m•K) and can have its thermal expansion adjusted to match other materials like silicon.

Given how small the die area above the Core Parts is, I think it's worth exploring the cost of making tiny shims and bonding them above it to form a nice flat surface to transfer heat out of.

Dymalloy is far better than Thermal Conductivity of Silicon @ ~148.00 W/(m•K).

If you minimize the Silicon shim thickness needed to insulate the Die, then the Dymalloy could be used as a nice Heat Capacitor to shunt heat to the Heat Spreader above.
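To put the conductivity numbers in perspective, here is a minimal sketch using the one-dimensional conduction formula R = t / (k × A). The conductivities are the ones quoted in the post; the shim thickness and area are hypothetical, and bonding or interface resistance is ignored.

```python
def conduction_resistance(thickness_m: float, conductivity_w_mk: float, area_m2: float) -> float:
    """One-dimensional conduction resistance R = t / (k * A), in K/W."""
    return thickness_m / (conductivity_w_mk * area_m2)

K_DYMALLOY = 420.0   # W/(m*K), as quoted in the post
K_SILICON = 148.0    # W/(m*K), as quoted in the post

thickness_m = 0.5e-3  # hypothetical 0.5 mm shim
area_m2 = 10e-6       # hypothetical 10 mm^2 patch above the cores

r_silicon = conduction_resistance(thickness_m, K_SILICON, area_m2)
r_dymalloy = conduction_resistance(thickness_m, K_DYMALLOY, area_m2)
resistance_ratio = r_silicon / r_dymalloy

print(round(r_silicon, 3), round(r_dymalloy, 3), round(resistance_ratio, 2))
```

For the same geometry the ratio depends only on the two conductivities, so a Dymalloy shim would conduct heat a little under 3X better than a structural silicon shim of identical shape, before accounting for any interface losses.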
 