What I've read about 3D V-Cache stacking suggests they can only stack it 1-high. I'm guessing that's due to thermal issues.
I remember reading that TSMC stated that 12-high is the maximum, but that's also limited by the accumulated thermal load between layers. So you need to choose carefully what you put underneath.
Sounds like its communication link would be a bottleneck. I figured you wanted the L4 cache on the I/O Die.
I would if there was die area available, but it seems that the floor plan for the cIOD is maxed out.
There really isn't any room for more SRAM on the cIOD, much as I'd want it. So until they figure out 3D transistor stacking or something else to free up die space, whatever process-node shrink I can get will need to make space for the Directory-based Cache Coherency.
Directory-based Cache Coherency is mostly for CPUs with 3x CCDs and up. Normal Ryzen CPUs with at most 2x CCDs don't really need it; that isn't a complex enough CPU to warrant a directory.
And I want AMD to start segmenting its upper echelon of CPUs into different platforms for different markets:
1x-2x CCDs for Ryzen
3x-4x CCDs for Ryzen FX (Workstations / HEDT / SMB)
5x-6x CCDs for Ryzen TR (Threadripper PRO)
7x-24x CCDs for EPYC
By "basement floor", you mean putting logic and memory cells in the substrate? I'd imagine that would make it much more expensive, and would hardly be worthwhile, since it's so much lower-density.
No, I mean building one layer of transistors below the main layer of transistors.
3D-Stacked CMOS Takes Moore’s Law to New Heights
Kind of like building construction.
But whatever you choose to put on the bottom (Basement Layer) can't generate too much heat.
Ergo, I'm very selective about what goes on that layer; it's critical to min/maxing the 2D die area.
Most of what goes on the bottom is the L1I$ & L1D$, along with the μOp Cache.
I've measured what's possible with TSMC 5nm & 3D CMOS stacking, and underneath the main Zen 3 CCD logic area you can place a LOT of L1I$ & L1D$ & an enlarged μOp Cache.
I finally found space for my desired configuration: 192 KiB of L1 (I$ & D$) + a 65,536-entry µOp cache w/ 16-way associativity.
This gives enough cache resources to allow for SMT 1-8 on regular future Zen # cores.
Zen #C cores would be limited to SMT 1-4 because of their maximum cache sizes.
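To put rough numbers on that cache geometry, here's a quick back-of-the-envelope in C. The 64 B line size and the static even SMT split are just my assumptions for illustration, not part of the proposal:

```c
/* Toy calculation for the proposed cache sizes.
 * Assumptions (mine): 64 B cache lines, the 192 KiB figure taken per L1,
 * and a naive even split of the L1 across SMT threads. */
#include <stdio.h>

int main(void)
{
    const unsigned l1_bytes    = 192 * 1024; /* 192 KiB L1, per the proposal       */
    const unsigned line_bytes  = 64;         /* assumed line size                   */
    const unsigned l1_ways     = 16;         /* 16-way associativity, per the post  */
    const unsigned uop_entries = 65536;      /* 65,536-entry uOp cache              */
    const unsigned uop_ways    = 16;

    printf("L1 lines:  %u\n", l1_bytes / line_bytes);            /* 3072 */
    printf("L1 sets:   %u\n", l1_bytes / line_bytes / l1_ways);  /* 192  */
    printf("uOp sets:  %u\n", uop_entries / uop_ways);           /* 4096 */

    /* Naive per-thread share under SMT-8 with a static partition
     * (a simplification; real designs share capacity dynamically). */
    printf("L1 per SMT-8 thread: %u KiB\n", l1_bytes / 8 / 1024); /* 24 KiB */
    return 0;
}
```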
With the increased density of transistors for core logic, but the physical die area held to Zen 3's die size, the L1$ & L2$ & L3$ SRAM die area can't shrink, since SRAM transistor scaling has stopped at 5nm for now. So I'm working on a layout that needs 3D transistor stacking: a basement layer takes most of the Cache & Cache Tags as needed, to maximize the actual SRAM cache on the main level for L2$ & L3$.
L1$ will largely get shunted into the basement and send data up to the Core Logic area above.
What density figures did you use for that?
TSMC 5nm for SRAM, since that's where SRAM scaling has stopped until somebody figures out a better solution.
So, what are the tradeoffs?
For Cache Coherence, snooping tends to be faster if you have enough bandwidth available.
Directory-based Cache Coherence is better for scalability with many cores / CCDs.
Guess what era we're in: the era of ever more CCDs, cores, and caches.
Directory-based Cache Coherence is also what Tenstorrent is using for their upcoming RISC-V tile/chiplet-based CPU. It's better for CPUs that have a high number of cores/tiles, while the traditional bus snooping method is based on broadcasting out and waiting for replies on the cache; that's great when your core count is small.
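A rough sketch of why the broadcast approach stops scaling, with purely illustrative numbers (the "about 2 sharers per line" figure is just an assumption on my part):

```c
/* Broadcast snooping probes every other caching agent on every miss,
 * while a directory only contacts the home node plus the agents it
 * actually lists as sharers. Numbers here are illustrative only. */
#include <stdio.h>

static unsigned snoop_probes(unsigned ccds)    { return ccds - 1; }
static unsigned dir_probes(unsigned sharers)   { return 1 + sharers; } /* home lookup + sharers */

int main(void)
{
    unsigned ccd_counts[] = { 2, 4, 8, 16, 24 };
    for (size_t i = 0; i < sizeof ccd_counts / sizeof *ccd_counts; i++) {
        unsigned n = ccd_counts[i];
        unsigned sharers = (n - 1 < 2) ? n - 1 : 2; /* assume ~2 sharers on average, capped */
        printf("%2u CCDs: snoop probes per miss = %2u, directory probes = %2u\n",
               n, snoop_probes(n), dir_probes(sharers));
    }
    return 0;
}
```

At small CCD counts the broadcast is actually cheaper, which is why Ryzen gets away without a directory; the directory only pulls ahead as the agent count climbs.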
But given the hub-and-spoke model that AMD has chosen, it makes sense to have a Directory on the cIOD for maintaining Cache Coherency across the CPU for all the various data located in different caches.
Each cache would make changes to its local copy and pass the changes along across the CPU as necessary.
Obviously you pass along which state a cache line or segment is in. If it's locked by something else, you have to wait until you get the token to read/write to the cache line/segment once it gets updated.
Having a central directory is going to be crucial as caches and RAM / memory sizes grow larger.
Having multiple layers of cache on different parts of the system, at different points in the chain, allows very high speeds, but also adds some complexity around who is writing to what memory address.
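Roughly what I have in mind for the directory on the cIOD, sketched in C. The MESI-style states, the 24-CCD cap, and the bit-per-CCD sharer vector are my own simplifications of directory-based coherence in general, not AMD's actual design:

```c
/* Minimal sketch of a per-line directory entry living on the cIOD,
 * assuming a MESI-style protocol and up to 24 CCDs (illustrative). */
#include <stdint.h>
#include <stdio.h>

#define MAX_CCDS 24

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state_t;

typedef struct {
    line_state_t state;
    uint32_t     sharers; /* one presence bit per CCD */
    uint8_t      owner;   /* owning CCD when EXCLUSIVE/MODIFIED */
} dir_entry_t;

/* A CCD asks for write permission: the directory invalidates every other
 * sharer, then hands exclusive ownership to the requester. */
static void handle_write_request(dir_entry_t *e, uint8_t requester)
{
    for (uint8_t ccd = 0; ccd < MAX_CCDS; ccd++) {
        if (ccd != requester && (e->sharers & (1u << ccd)))
            printf("  send invalidate to CCD %u\n", ccd); /* stand-in for a real probe */
    }
    e->sharers = 1u << requester;
    e->owner   = requester;
    e->state   = MODIFIED;
}

int main(void)
{
    /* Line currently SHARED by CCDs 0, 3, and 5. */
    dir_entry_t line = { SHARED, (1u << 0) | (1u << 3) | (1u << 5), 0 };
    printf("CCD 3 requests write ownership:\n");
    handle_write_request(&line, 3);
    printf("new state: MODIFIED, owner CCD %u\n", line.owner);
    return 0;
}
```

The point is that the cIOD only has to probe the CCDs whose presence bits are set, instead of broadcasting to all of them.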
I don't know specifics on those, but it seems a rather poor fit. The latency will be similar to DIMMs, so your only real benefit is bandwidth. That makes sense for graphics, since GPUs tend to be good at latency-hiding and are very bandwidth hungry. For CPU cores, it seems a lot less interesting.
I don't agree with using DRAM as L4$; the only thing you really gain is massive memory capacity.
But the slowness of DRAM, along with the need to pipeline around its half-duplex nature to get more bandwidth, is unnecessary complexity where you don't want it.
SRAM has the low latency, the high bandwidth, the inherently full-duplex nature, and the capability to scale with the IFOP links as needed over time. And I'm sure IFOP will continue to grow in bandwidth as the PCIe PHY grows, along with however fast AMD feels like clocking it above PCIe spec using their own GMI protocols. That's why I want an L4$ SRAM-based CCD (Cache Complex Die). With the right Cache Coherency structure and design, it can literally be the correct buffer to keep all the CCDs filled on CPUs with more than 2 CCDs.
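Toy numbers to show the half-duplex vs full-duplex difference under a mixed read/write stream; the 64 GB/s link rate and the 70/30 read/write mix are purely illustrative, not real IFOP/GMI or DRAM figures:

```c
/* Compare aggregate throughput of a half-duplex medium (reads and writes
 * take turns) against a full-duplex one (each direction runs at full rate).
 * All numbers are made up for illustration. */
#include <stdio.h>

int main(void)
{
    const double link_gbps = 64.0; /* illustrative raw link bandwidth, GB/s */
    const double read_frac = 0.7;  /* assume a 70/30 read/write mix          */

    double half_read  = link_gbps * read_frac;         /* reads' share of one bus */
    double half_write = link_gbps * (1.0 - read_frac); /* writes' share           */

    double full_read  = link_gbps;                     /* dedicated read path     */
    double full_write = link_gbps;                     /* dedicated write path    */

    printf("half-duplex: %.1f GB/s read + %.1f GB/s write = %.1f GB/s total\n",
           half_read, half_write, half_read + half_write);
    printf("full-duplex: %.1f GB/s read + %.1f GB/s write = %.1f GB/s total\n",
           full_read, full_write, full_read + full_write);
    return 0;
}
```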
I'm thinking CPUs starting from 3x CCDs up to more than 24x CCDs in the future.
Keeping every CCD filled and busy is going to be an issue on multiple levels.