News Jim Keller Shares Zen 5 Performance Projections

What about L4$ and a massive SRAM stack for it?
That's also an option.
Same answer as last time: requires too much silicon to be worth the effort.

The CPU's L1+2+3 caches achieve their 99+% overall hit rate because the CPU's prefetch algorithms can almost always get the correct data moving before it is actually needed. The likelihood of an additional cache layer catching things prefetch couldn't get right is extremely low.

Also, adding yet another cache level between L3 and system memory means adding 50+ cycles of additional latency on top of the L3's 50 cycles latency before cache misses go to memory. That is yet another reason why chip manufacturers are reluctant to slap more cache than absolutely necessary on their chips.
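For a sense of scale, here is a quick back-of-the-envelope sketch of that latency stacking, assuming a ~5 GHz core and the cycle/ns figures used later in this thread (the 60-cycle L4 lookup is a hypothetical value, not a measured one):

```python
# Rough latency stack for a load that misses every cache level, assuming a
# ~5 GHz core (0.2 ns/cycle) and the typical figures quoted later in this
# thread: L1 ~4 cycles, L2 ~14, L3 ~50, DRAM ~60 ns. Illustrative only.
CYCLE_NS = 0.2

l1, l2, l3 = 4, 14, 50   # lookup latencies in cycles
l4 = 60                  # a hypothetical extra L4 lookup, in cycles
dram_ns = 60

miss_without_l4 = (l1 + l2 + l3) * CYCLE_NS + dram_ns
miss_with_l4 = (l1 + l2 + l3 + l4) * CYCLE_NS + dram_ns

print(f"full miss, no L4 : {miss_without_l4:.1f} ns")  # ~73.6 ns
print(f"full miss, w/ L4 : {miss_with_l4:.1f} ns")     # ~85.6 ns
```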

I told you, I'm against using that model. That's the same BS Intel did back in the day with FB-DIMMs, and that initiative failed as well.
Any attempt to change the parallel DIMM model to one with an on-DIMM controller is doomed to fail.
Meanwhile, Samsung and SK-Hynix are shipping 96-512GB EDSFF CXL-mem cards for application OEMs who require huge memory space.

FBDIMM and RDRAM were just ~20 years too early at serializing the memory interface. CPUs didn't have enough cache to keep a large enough subset of active data on-chip and tolerate the increased bulk memory latency very well. On-package memory will eliminate most of that problem.

BTW, it was just over 20 years ago that Rambus started terrorizing the tech industry with its trivial patent lawsuits over all things from SDRAM to PCIe including simply clocking data on both clock edges for DDR DRAM. Those garbage patents expiring over the last couple of years is probably one of the main reasons why interest in serial memory resurfaced in 2022.
 
You are, in a way. You want to make the big calls in system architecture. Those are crucial strategic decisions, with a lot at stake, which is why the privilege of having a seat at the table is earned. Meritocracy and all that.
So only those who are chosen are allowed at the table of decision-making.
Everybody else can sod off. Is that what you're implying?

There are a lot of engineers with a lot of ideas. A lot of ideas aren't as good as they seem. They either have some non-obvious fatal flaw, or they have a number of drawbacks that cumulatively overwhelm their benefits.

Sometimes, good ideas fail for other reasons, but it's pretty hard to keep a truly good idea down. They just have a way of coming up again and again, in one form or another.

Don't worry about the engineers. Most are mature enough to shake off some failures and setbacks. I'm guessing most of OpenCAPI's backers are now fully onboard the CXL bandwagon.
They kind of have to be, their organization folded into the CXL Consortium.

And who are you to judge?
I'm me, a fellow tech enthusiast, just like them.

What is the sun, in this case?
Putting complex logic like a Memory Controller onto the DIMM.
A decision that the PC industry and memory industry agreed to "never ever do" a long time ago for the sake of costs.
Also, it keeps the competitive advantages on the hardware vendors' side and leaves the DIMM manufacturers to worry about cost cutting and a race to the bottom on memory pricing.

Citation please.
[Image: VZhqgnr.png]
Notice how much faster in Read/Write/Copy every level of SRAM is over DRAM.

DRAM is a joke compared to SRAM. You don't want to transfer things from DRAM if you don't have to; you want to use as little RAM as possible.

SRAM has advantages in speed, latency, and energy cost.
It only has a few downsides (cost per unit of capacity).

Alder Lake’s Caching and Power Efficiency
Across all platforms we tested, we saw a consistent pattern where each level in the memory hierarchy cost more power to get data from than the previous level. As a rule of thumb, hitting L2 costs about twice as much power as hitting L1. Hitting L3 tends to cost more than twice as much power as hitting L2. Pulling data from memory is extremely expensive, with power costs typically 4-5x that of hitting L3.
[Image: vjdD9NI.png]
Race to sleep logic also applies. Some of the CPUs we tested actually reported lower package power draw when testing DRAM-sized regions, compared to hitting cache. But DRAM bandwidth is so much lower than cache bandwidth that it ends up costing more energy over the time it takes to bring data in from DRAM. When cores are waiting for data, they need to keep their large out-of-order structures powered up. The more time they spend waiting for data, the more power they waste. This applies to slower cache levels as well, just to a lesser extent. That means keeping data close to the core is not just good for performance – it’s very important for power efficiency too. And we’re not the first ones to notice:
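Turning that rule of thumb into rough relative numbers (arbitrary units, my own illustration of the quoted ratios, not additional measurements):

```python
# Relative energy per access, following the quoted rule of thumb:
# L2 ~2x L1, L3 more than 2x L2, DRAM typically 4-5x L3.
# Absolute values are arbitrary units normalized to an L1 hit.
energy = {"L1": 1.0}
energy["L2"] = 2.0 * energy["L1"]
energy["L3"] = 2.5 * energy["L2"]
energy["DRAM"] = 4.5 * energy["L3"]

for level, cost in energy.items():
    print(f"{level:4s}: ~{cost:.1f}x the energy of an L1 hit")
# DRAM lands around ~22x an L1 hit in this toy model, before counting the
# extra energy burned by cores stalling with their OoO machinery powered up.
```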

[Image: 6zLJGtG.png]
A new CI-700 Coherent Interconnect & NI-700 NoC For SoCs
The SLC can serve as both a bandwidth amplifier as well as reducing external memory/DRAM transactions, reducing system power consumption.
Imagine what a MUCH larger SRAM cache can do for overall System Power Reduction & enhancing bandwidth.

Silicon gets cheaper when the cost per transistor goes down through density increasing faster than the cost of newer fab nodes. We know that's over, for SRAM. So, no. It's not going to get substantially cheaper per bit.
But energy costs are still better for SRAM than DRAM; it's the nature of the tech, along with the distance the data has to travel.

Also, as DRAM scales to ever-shrinking process nodes, its die costs will go up, just like they did on the more advanced nodes our CPUs / GPUs had to endure. That means DRAM costs will still not go down nearly as quickly as you think.

Unlike SRAM, DRAM is still scaling.
For now, they're on older process nodes; they haven't hit the more expensive process nodes yet.

Not my problem. You're the one making the claims. You need to cite your source and we really need answers to the various circumstantial questions I raised.
See above for the memory, and
[Image: hZ6PCzx.png]
[Image: hOeljM4.png]
Sourced from this research paper here.

[Image: 1PdkQ81.png]
Sourced from this research paper over here.
All the numbers are more or less in line.

You don't want to go to DRAM if you don't have to.
Keeping everything in SRAM is superior for energy efficiency.

You want everything to use as few CPU cycles as possible and take the least amount of time to transfer.
[Image: SXenRAU.png]

You want to use the "Fastest Programming Languages possible" as well if you want to be as energy efficient as possible.

[Image: f83HsUk.png]
[Image: 4v3i6oF.png]

Well, are you not talking about mainstream computing? If not, you should specify which markets you're talking about.
I think more customers would prefer a choice in how much memory they're buying into, and not be stuck paying the OEM markup Apple imposes by forcing the RAM to be soldered onto their products. Now Apple customers don't even have an option to upgrade their RAM.
They're at the mercy of Apple's overpricing for hardware.

So far, that's what we're seeing in the consumer space. I can believe what @InvalidError has said about HBM's long-term cost competitiveness, but I don't know. Apple has shown how much you can do with just LPDDR5X.
Apple also overcharges for a lot of things.

CXL 1.0 and 2.0 are both 32 GT/s. CXL 3.0 is 64 GT/s.


Yes, CXL 3.0 uses PAM4. That was my point.
Yes, CXL is a protocol that runs on top of PCIe as an "Alternate Mode Protocol", and PCIe is nice enough to allow other protocols to use its PHY.
So CXL exists at the mercy/kindness of PCIe.
Otherwise, the CXL consortium would be forced to come up with their own PHY, and that would be a lot of extra work.
But since CXL uses the same PHY as PCIe, everybody is a winner.
We get to share hardware interfaces.

That doesn't negate the realities/downsides of accessing memory over PCIe or CXL.

It's going to be slower, with higher latency and higher energy costs than even regular main memory.

The advantage is having more PCIe connections to add in more memory, so your data set can sit in memory somewhere.
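For scale, here is a rough per-direction bandwidth estimate for an x16 link; this is my own approximation that ignores FLIT/packet overhead, not a figure from the spec:

```python
# Rough per-direction bandwidth of an x16 CXL/PCIe link.
# 32 GT/s generations use 128b/130b encoding; 64 GT/s (PAM4) is treated as
# raw signaling rate here, so real numbers come in a bit lower.
def x16_gb_per_s(gt_per_s: float, encoding: float = 1.0) -> float:
    return gt_per_s * encoding * 16 / 8  # 16 lanes, 8 bits per byte

print(f"CXL 1.1/2.0 (32 GT/s): ~{x16_gb_per_s(32, 128 / 130):.0f} GB/s per direction")
print(f"CXL 3.0     (64 GT/s): ~{x16_gb_per_s(64):.0f} GB/s per direction")
# Plenty of bandwidth for capacity expansion, but none of it removes the
# latency and energy penalty of leaving the package, which is the point above.
```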
 
Same answer as last time: requires too much silicon to be worth the effort.

The CPU's L1+2+3 caches achieve their 99+% overall hit rate because the CPU's prefetch algorithms can almost always get the correct data moving before it is actually needed. The likelihood of an additional cache layer catching things prefetch couldn't get right is extremely low.

Also, adding yet another cache level between L3 and system memory means adding 50+ cycles of additional latency on top of the L3's 50 cycles latency before cache misses go to memory. That is yet another reason why chip manufacturers are reluctant to slap more cache than absolutely necessary on their chips.
A giant slab of L4 Cache directly attached to the cIOD as just another CCD would still be better and more energy efficient than having to reach all the way to DRAM.

It's not like L4$ hasn't existed before; it's been done several times in computing history.

Meanwhile, Samsung and SK-Hynix are shipping 96-512GB EDSFF CXL-mem cards for application OEMs who require huge memory space.

FBDIMM and RDRAM were just ~20 years too early at serializing the memory interface. CPUs didn't have enough cache to keep a large enough subset of active data on-chip and tolerate the increased bulk memory latency very well. On-package memory will eliminate most of that problem.

BTW, it was just over 20 years ago that Rambus started terrorizing the tech industry with its trivial patent lawsuits over all things from SDRAM to PCIe including simply clocking data on both clock edges for DDR DRAM. Those garbage patents expiring over the last couple of years is probably one of the main reasons why interest in serial memory resurfaced in 2022.

And some projects just might need more Main Memory than what most systems currently offer.

On-Package memory is going to be hard to do with the chiplet mantra and disaggregation of the memory controller.

Also, the heat issues alone are a nightmare for memory.

Look at 3DvCache and its voltage limitations. Given that Intel & AMD are going chiplet and the memory controller is moving away from the CPU / GPU / monolithic dies, it's going to be hard to go backwards.

Serial memory can come back, just not in the form those people were thinking of with serial DIMMs.
 
So only those who are chosen are allowed at the table of decision-making.
Everybody else can sod off. Is that what you're implying?
I'm not saying anything you shouldn't already generally know about how the tech industry works. JEDEC is an industry consortium of memory makers and memory users. Its committees are comprised of employees of said member companies, and are there to represent their interests.

You want a seat at the table? Start a company and pay your dues. Or join a member company and work your way up. Or, publish & present original and compelling research which catches their attention.

The problem I see is that you want influence without having to do the work. That's just not realistic.

I'm me, a fellow tech enthusiast,
I think the opinion of a mere enthusiast doesn't really count for anything. And I include myself in that category.

Putting complex logic like a Memory Controller onto the DIMM.
A decision that the PC industry and memory industry agreed to "never ever do" a long time ago for the sake of costs.
Source?

That doesn't show what you think it does.

8 measly MiB of SLC (System Level Cache) AKA SRAM reduced the net power consumption from DDR by 8%
[Image: 6zLJGtG.png]
A lot of that is probably due to the mobile iGPU, which probably doesn't have as much L2 cache as the CPU cores.

But energy costs are still better for SRAM than DRAM; it's the nature of the tech, along with the distance the data has to travel.
If you're comparing access to the cells, then Wikipedia is contradicting you:

"it is less dense and more expensive than DRAM and also has a higher power consumption during read or write access. The power consumption of SRAM varies widely depending on how frequently it is accessed.[8] "​

Okay, so 2014 is rather outdated. What does "64-bit Cache" mean? What operation is it measuring?

<irrelevant stuff removed>

Now Apple customers don't even have an option to upgrade their RAM.
They could, if it supported CXL.mem.

Yes, CXL is a protocol that runs on top of PCIe as an "Alternate Mode Protocol", and PCIe is nice enough to allow other protocols to use its PHY.
Stop trolling.

But since CXL uses the same PHY as PCIe, everybody is a winner.
Yes, it means flexibility for the customer, since they can reassign lanes between CXL or PCIe. That's a good thing.

It's going to be slower, with higher latency and higher energy costs than even regular main memory.
I already answered this point. Repeating points I've already answered is making a bad faith argument.
 
Notice how much faster in Read/Write/Copy every level of SRAM is over DRAM.

DRAM is a joke compared to SRAM. You don't want to transfer things from DRAM if you don't have to; you want to use as little RAM as possible.
We aren't disputing the speed of SRAM. We are disputing the probable real-world benefits of more SRAM.

When the existing cache hierarchy and prefetch algorithms' hit rate is already 99+%, having all-SRAM system memory would only ever improve that last sub-1% of memory accesses that have to go beyond L3$. The likelihood of a cache hit after all else has already failed will boil down to luck unless you are running software optimized specifically for your L4$ size. From a programming viewpoint though, if you couldn't be bothered to telegraph your memory accesses far enough ahead of use for the CPU's prefetcher to get data into L3$ or closer on time and not need L4$ in the first place, you aren't going to bother optimizing for L4$ size either.

So the trade-off becomes:
  • 4 cycles L1 latency + 14 cycles L2 latency + 50 cycles L3 latency + sub-1% chance of 60ns memory latency or 200ns for pages in CXL
  • 4 cycles L1 latency + 14 cycles L2 latency + 50 cycles L3 latency + 60+ cycles "huge L4$" latency + memsize/L4size x 100% chance of 60ns memory latency or 200ns for CXL

If your L4$ is 1/128th the size of total system RAM, ex.: 1GB of SRAM for 128GB of DRAM with all 128GB getting exercised to some extent, then there will be a ~1% chance of a cache hit in software not optimized specifically for your L4$ or properly optimized for the available L1/2/3$.

The net outcome from adding your huge L4$ is that 99% of accesses that miss through L3$ will also miss through L4$, take 20+ns longer getting to DRAM and yield significantly degraded performance from threads stalling 20+ns longer than they would have otherwise. Why has the concept of L4$ been mostly abandoned? Because its benefits are highly situational, prefetch algorithms have improved and growing L1/2/3 caches have made L4 obsolete if there was ever a time where it consistently delivered tangible benefits. On Intel's Broadwell, the only thing Crystal Well was consistently good at is boosting the IGP's performance from passable to decent, which may be why it looks like Intel might be packing SRAM in its Meteor Lake SoC tile.
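A quick sketch of that trade-off, plugging in the figures from the list above (60 ns DRAM, a 20+ ns L4 lookup, a hit rate of roughly L4size/memsize for unoptimized software); this is just a restatement of the arithmetic, not new data:

```python
# Average cost of an access that has already missed L3, with and without a
# huge L4, using the figures above: ~60 ns DRAM, ~20 ns extra for the L4
# lookup, ~1% hit rate for 1GB of L4 in front of 128GB of active DRAM.
DRAM_NS = 60
L4_LOOKUP_NS = 20

def avg_post_l3_ns(l4_hit_rate: float) -> float:
    # Every L3 miss pays the L4 lookup; only L4 misses continue to DRAM.
    return L4_LOOKUP_NS + (1 - l4_hit_rate) * DRAM_NS

print(f"no L4            : {DRAM_NS:.1f} ns per L3 miss")
print(f"L4 @ ~1% hit rate: {avg_post_l3_ns(0.01):.1f} ns per L3 miss")
# ~79 ns vs ~60 ns: at that hit rate the big L4 makes the average L3 miss
# slower, not faster, exactly as described above.
```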

BTW, HBM2 requires less than 4pJ/bit for reads from row activation to the host. Not having to drive a long parallel bus practically eliminates IO power from the equation.
https://www.cs.utexas.edu/users/skeckler/pubs/MICRO_2017_Fine_Grained_DRAM.pdf
 
When the existing cache hierarchy and prefetch algorithms' hit rate is already 99+%,
AMD's X3D CPUs show there's still some room for improvement, but only on select workloads.

As for mitigating cache misses, don't forget the role of OoO and SMT.

From a programming viewpoint though, if you couldn't be bothered to telegraph your memory accesses far enough ahead of use
Modern CPUs are smart enough to pick up on patterns and make good guesses about what you're going to need. It's a little like branch prediction, in the sense that it's implicit and automatic.
 
We aren't disputing the speed of SRAM. We are disputing the probable real-world benefits of more SRAM.

When the existing cache hierarchy and prefetch algorithms' hit rate is already 99+%, having all-SRAM system memory would only ever improve that last sub-1% of memory accesses that have to go beyond L3$. The likelihood of a cache hit after all else has already failed will boil down to luck unless you are running software optimized specifically for your L4$ size. From a programming viewpoint though, if you couldn't be bothered to telegraph your memory accesses far enough ahead of use for the CPU's prefetcher to get data into L3$ or closer on time and not need L4$ in the first place, you aren't going to bother optimizing for L4$ size either.

So the trade-off becomes:
  • 4 cycles L1 latency + 14 cycles L2 latency + 50 cycles L3 latency + sub-1% chance of 60ns memory latency or 200ns for pages in CXL
  • 4 cycles L1 latency + 14 cycles L2 latency + 50 cycles L3 latency + 60+ cycles "huge L4$" latency + memsize/L4size x 100% chance of 60ns memory latency or 200ns for CXL
If your L4$ is 1/128th the size of total system RAM, ex.: 1GB of SRAM for 128GB of DRAM with all 128GB getting exercised to some extent, then there will be a ~1% chance of a cache hit in software not optimized specifically for your L4$ or properly optimized for the available L1/2/3$.

The net outcome from adding your huge L4$ is that 99% of accesses that miss through L3$ will also miss through L4$, take 20+ns longer getting to DRAM and yield significantly degraded performance from threads stalling 20+ns longer than they would have otherwise. Why has the concept of L4$ been mostly abandoned? Because its benefits are highly situational, prefetch algorithms have improved and growing L1/2/3 caches have made L4 obsolete if there was ever a time where it consistently delivered tangible benefits. On Intel's Broadwell, the only thing Crystal Well was consistently good at is boosting the IGP's performance from passable to decent, which may be why it looks like Intel might be packing SRAM in its Meteor Lake SoC tile.

BTW, HBM2 requires less than 4pJ/bit for reads from row activation to the host. Not having to drive a long parallel bus practically eliminates IO power from the equation.
https://www.cs.utexas.edu/users/skeckler/pubs/MICRO_2017_Fine_Grained_DRAM.pdf
But the upkeep energy for DRAM is still higher than SRAM's, along with the heat generation.
DRAM cells have to be periodically refreshed to maintain data integrity, whereas SRAM only needs a steady supply voltage.

There's a reason why SRAM has historically been inside the CPU / GPU / main chip and not DRAM.

With a larger L4$, if your current data set doesn't fit in the existing L3$, the necessary portions can be swapped in from L4$.
Going SRAM to SRAM would still be more efficient.

There's a reason why Zen cores would rather place data even in an adjacent CCD's L3$ than just shove it out to DRAM.
It's still faster & more energy efficient to hop across the cIOD to another CCD's L3$ than to go straight to DRAM.

And given how large the OS is, plus all the various apps and the size of modern data sets, the small L3$ can't possibly fit everything.

Especially with the constant context switches and loading / unloading data.

Having more SRAM acts as a great buffer to spend less energy in general and still be fast.

Even Ponte Vecchio uses SRAM as part of its RAMBO cache to buffer the HBM DRAM.

https://ieeexplore.ieee.org/document/9731673
Ponte Vecchio (PVC) is a heterogenous petaop 3D processor comprising 47 functional tiles on five process nodes. The tiles are connected with Foveros [1] and EMIB [2] to operate as a single monolithic implementation enabling a scalable class of Exascale supercomputers. The PVC design contains> 100B transistors and is composed of sixteen TSMC N5 compute tiles, and eight Intel 7 memory tiles optimized for random access bandwidth-optimized SRAM tiles (RAMBO) 3D stacked on two Intel 7 Foveros base dies. Eight HBM2E memory tiles and two TSMC N7 SerDes connectivity tiles are connected to the base dies with 11 dense embedded interconnect bridges (EMIB). SerDes connectivity provides a high-speed coherent unified fabric for scale-out connectivity between PVC SoCs. Each tile includes an 8-port switch enabling up to 8-way fully connected configuration supporting 90G SerDes links. The SerDes tile supports load/store, bulk data transfers and synchronization semantics that are critical for scale-up in HPC and AI applications. A 24-layer (11-2-11) substrate package houses the 3D Stacked Foveros Dies and EMIBs. To handle warpage, low-temperature solder (LTS) was used for Flip Chip Ball Grid Array (FCBGA) design for these die and package sizes.

https://www.techpowerup.com/292250/...r-63-tiles-600-watt-tdp-and-lots-of-bandwidth

https://www.tomshardware.com/news/intel-14th-gen-meteor-lake-cpus-may-embrace-an-l4-cache
The rumor mills are abuzz that the L4$ will be SRAM
 
With a larger L4$, if your current data set doesn't fit in the existing L3$, the necessary portions can be swapped in from L4$.
Going SRAM to SRAM would still be more efficient.
If your L4 is just more SRAM, how is it functionally different than simply increasing the size of L3?

There's a reason why Zen cores would rather place data even in an adjacent CCD's L3$ than just shove it out to DRAM.
Do they? I was under the impression they can read an adjacent CCD's L3, but only write to their own.

I'd be interested in seeing a source on that, either way.

It's still faster & more energy efficient to hop across the cIOD to another CCD's L3$ than to go straight to DRAM.
Yes, but still painful enough that I think you'd rather avoid it. That's why it makes sense to me if they do (avoid it).

And given how large the OS is, plus all the various apps and the size of modern data sets, the small L3$ can't possibly fit everything.
It doesn't need to - just the hot data. @InvalidError is right that the combination of speculative execution and prefetching can hide a lot of latency from cache misses.

For all the benefit that AMD's V-Cache provides, there are actually more cases where it provides little or no benefit, at all.

Especially with the constant context switches and loading / unloading data.
The typical OS "tick" rate is between 100 and 1000 Hz. It's far more likely than not that a thread will continue executing into the next timeslice.

Even in the worst case scenario of a highly-oversubscribed CPU, where you have a context switch and complete L2 replacement in each timeslice, that's only 1 MB * 1 kHz = 1 GB/s per core of extra memory traffic, for a rather extreme scenario.

In other words, largely a non-issue.
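A quick sanity check on that worst case (my own restatement of the arithmetic, with a hypothetical 16-core figure added for scale):

```python
# Worst case: a full 1 MiB L2 refill on every 1 kHz scheduler tick, per core.
l2_bytes = 1 * 1024**2
tick_hz = 1000
cores = 16  # hypothetical core count, just for scale

per_core_gb_s = l2_bytes * tick_hz / 1e9
print(f"per core : {per_core_gb_s:.2f} GB/s")             # ~1 GB/s
print(f"{cores} cores : {per_core_gb_s * cores:.1f} GB/s")  # ~17 GB/s
# Even the extreme case is a modest slice of the tens of GB/s that a
# dual-channel DDR5 setup provides.
```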

Even Ponte Vecchio uses SRAM as part of its RAMBO cache to buffer the HBM DRAM.
That's L2 cache. Nobody is saying you can get rid of your L2, just because you have HBM. AMD and Nvidia also have L2 in their HBM-equipped compute accelerators.

Not sure where you got the idea that SLC makes sense to be anything but SRAM.

We're not talking about in-package DRAM as cache, per se. In the event you also have some external memory, it just works like a slower tier. Kind of like swap, but much faster.
 
As for mitigating cache misses, don't forget the role of OoO and SMT.
I had lumped OoO and speculative execution in with the prefetch process: for the CPU to dependably prefetch data, it needs to rush scheduling of any AGU dependencies to find out what the next most likely load addresses are going to be.

SMT doesn't quite factor in there (at least IMO) since a thread stalled on a memory read is still going to be stalled and the CPU's resources will almost certainly get more under-used than they should even if an extra thread is still chugging along.

Even in the worst case scenario of a highly-oversubscribed CPU, where you have a context switch and complete L2 replacement in each timeslice, that's only 1 MB * 1 kHz = 1 GB/s per core of extra memory traffic, for a rather extreme scenario.
A normal context switch is only a couple of kBs though, easily accommodated by the L2/L3. If the OS is only borrowing the thread to do interrupt processing, chances are that it will return to the same thread and a different one will handle the rest of the interrupt as a DPC and whatever may have been evicted from L2 will still be in L3.

But the upkeep energy for DRAM is still higher than SRAM's, along with the heat generation.
DRAM cells have to be periodically refreshed to maintain data integrity, whereas SRAM only needs a steady supply voltage.
What power? 32GB of DDR5-4800 uses 2-3W per DIMM under full load while self-refresh uses about 200mW, no big deal.

SRAM has static leakage current and that current gets worse as transistors get smaller. The newest SRAM I could find a standby current spec for pulls 320mA at 2.5V on a 72Mbits chip. Assuming the leakage per bit gets cut by 3X going down to 1.2V on a newer process, you are looking at about 1W of static power draw per 64MB of SRAM. The fastest discrete SRAM I could find is spec'd at 2.6A @ 1.2V running at full speed, so chalk another ~3W per 128MB of SRAM for IO and internal routing.

If you wanted to have 16GB of SRAM, your SRAM DIMM may very well require its own 8-pin power connector.
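Reproducing that arithmetic for the hypothetical 16GB all-SRAM module (same per-chip assumptions as above):

```python
# Back-of-the-envelope power for a hypothetical 16GB all-SRAM module,
# using the figures above: ~1 W static leakage per 64 MB and ~3 W per
# 128 MB for IO/internal routing at full speed.
capacity_mb = 16 * 1024

static_w = capacity_mb / 64 * 1.0
active_w = capacity_mb / 128 * 3.0

print(f"static leakage   : ~{static_w:.0f} W")  # ~256 W
print(f"IO/routing active: ~{active_w:.0f} W")  # ~384 W
# Hundreds of watts for 16GB of SRAM, before cooling is even considered,
# hence the quip about it needing its own power connector.
```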

SRAM has horrible scaling in every possible way besides latency, which is why chip architects put so much design effort into using as little of it as they can.
 
SMT doesn't quite factor in there (at least IMO) since a thread stalled on a memory read is still going to be stalled and the CPU's resources will almost certainly get more under-used than they should even if an extra thread is still chugging along.
Yes and no. It helps to look at actual data of SMT scaling.

The latest SPEC2017 testing data with & without SMT I've found is from https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/6 . Sadly, I think Anandtech failed to review either Sapphire Rapids or Genoa.

[Image: 122609.png]

In the 2S case, EPYC 7763 got a 18.3% benefit from SMT. In the 1S case, it got 16.2%.
In the 2S case, Xeon 8380 got a 8.1% benefit from SMT. In the 1S case, it got 9.2%.

For integer workloads, it's a clear win. However, not such a huge win that you'd worry about too much of the CPU sitting idle, when one of the threads is stalled.

[Image: 117496.png]

In the 2S case, EPYC 7763 got a 2.3% deficit from SMT. In the 1S case, it got 0.8% deficit.
In the 2S case, Xeon 8380 got a 2.5% deficit from SMT. In the 1S case, it got 2.2%.

It's a little more tricky to say exactly why the benefit wasn't greater, but the data suggests that a single thread is typically good enough at keeping the FPU busy that adding a second thread probably causes more of a detriment through cache contention than any benefit it adds by hiding memory latency. I'd love to see the same testing repeated with the 3D V-cache version of Milan.
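For reference, the uplift/deficit percentages above are just the ratio of the SMT-on score to the SMT-off score; a tiny helper showing the calculation, with placeholder scores rather than Anandtech's raw numbers:

```python
# How the SMT uplift/deficit percentages are derived: the ratio of the
# SMT-on score to the SMT-off score, minus one. The scores below are
# hypothetical placeholders, not Anandtech's actual SPEC2017 results.
def smt_uplift_pct(score_smt_on: float, score_smt_off: float) -> float:
    return (score_smt_on / score_smt_off - 1.0) * 100.0

print(f"{smt_uplift_pct(118.3, 100.0):+.1f}%")  # +18.3%, a clear win
print(f"{smt_uplift_pct(97.7, 100.0):+.1f}%")   # -2.3%, a slight deficit
```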

A normal context switch is only a couple of kBs though, easily accommodated by the L2/L3.
I was taking "context switch" to imply a full task-switch, in which case you'd expect much of the prior thread's working set to get flushed from at least L2, in short order.
 
It's a little more tricky to say exactly why the benefit wasn't greater, but the data suggests that a single thread is typically good enough at keeping the FPU busy that adding a second thread probably causes more of a detriment through cache contention than any benefit it adds by hiding memory latency.
It could just be that a purely synthetic benchmark is over-exerting specific issue ports and the two threads bottleneck on that, in which case SMT on vs off becomes a measure of how much slack there is on the most contentious ports for that workload.

Memory controller contention would cause a similar behavior too: if 28+ cores in a single socket trample each other enough to max out memory controllers, SMT would do little more than almost double the average service time, minus whatever slack may have been left.
 
It could just be that a purely synthetic benchmark
Nope. SPEC2017 is comprised of industry-standard applications. There are something like 10 apps feeding into the integer score and 12 feeding into the fp score. You can find details on their website.

That said, it would be interesting if Anandtech had bothered to break out the subscores for the non-SMT cases, but they didn't. It would be nice to know specifically which workloads benefited most/least from SMT.
 
If your L4 is just more SRAM, how is it functionally different than simply increasing the size of L3?
L3 can only be of finite size, due to the shape and construction of the L3$ in the center of the CCD (Core Chiplet Die) and how many 3DvCache layers you're willing to stack. Most are only willing to stack one layer at the moment for cost reasons; nobody on the consumer or enterprise end has bothered with higher stacks, even though 12 is the current upper limit stated by TSMC. I think adding more layers adds more risk of failures or bad bonds, along with exorbitant costs per CCD. So most products stick with one layer of 3DvCache so far.

A dedicated L4$ SRAM-only CCD (Cache Chiplet Die) that functions as a victim cache for all the various L3$ across the CCDs in a CPU, and can act as a buffer for the main memory controllers, with at least one layer of 3DvCache on top, would be a "huge" scratch pad / buffer.

Based on Zen 3 CCD die size & area, but using TSMC 5nm with basement-floor usage for a lot of things like cache tags, you can pack a lot more SRAM onto the ground-floor area. Finally going 3D stacking like they promised, even if it's only one extra layer, and putting in structures that aren't going to get super hot. I estimate that TSMC 5nm with density-optimized cells, a massive tag space occupying the basement floor, and regular SRAM on top would allow for better performance and more space in general.
[Image: cUetvMr.png]
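For what it's worth, here is how I'd sanity-check SRAM capacity for a CCD-sized cache die; the ~80 mm² footprint and the 2-3 MB/mm² effective macro density are my own ballpark assumptions, not figures taken from the chart above:

```python
# Ballpark SRAM capacity for a CCD-sized cache chiplet. Die area and
# effective density (including tags, routing and spare overhead) are rough
# assumptions for illustration, not measured numbers.
die_mm2 = 80  # roughly a Zen 3 CCD-sized footprint

for density_mb_per_mm2 in (2.0, 3.0):
    per_layer_mb = die_mm2 * density_mb_per_mm2
    print(f"{density_mb_per_mm2:.1f} MB/mm^2 -> ~{per_layer_mb:.0f} MB per layer, "
          f"~{2 * per_layer_mb:.0f} MB with one stacked layer on top")
```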

Even having some extra L4$ on non-consumer Ryzen-based CPUs would boost EPYC / ThreadRipper performance massively.

For single-CCD Ryzens, using the other CCD slot for an L4$ could be a huge benefit, since it could buffer a lot of main memory data into the L4$.

Do they? I was under the impression they can read an adjacent CCD's L3, but only write to their own.

I'd be interested in seeing a source on that, either way.
You're correct: it can read data from adjacent CCDs if the necessary data is already there, but it can't write to them.

Yes, but still painful enough that I think you'd rather avoid it. That's why it makes sense to me if they do (avoid it).


It doesn't need to - just the hot data. @InvalidError is right that the combination of speculative execution and prefetching can hide a lot of latency from cache misses.

For all the benefit that AMD's V-Cache provides, there are actually more cases where it provides little or no benefit, at all.
I agree you want to avoid a cache miss, but no amount of L3$ is large enough for everything, especially when multi-tasking and given how large many modern programs are along with their data sets.

That's why when you have multiple Normal CCD's, having a nice big L4$ CCD as a buffer in the middle could really help.

The next big step, when the cIOD gets a die shrink, is for AMD to add in a Directory-based Cache Coherency scheme to know exactly where all the data is stored in the system.

If the Directory gets excessively large, it'll be a good thing that the massive L4$ is around to help maintain it.

[Image: nWYXJxj.png]
I think AMD will eventually consider a Directory-based Cache Coherency model down the line, instead of the current "Snoop" model.

The typical OS "tick" rate is between 100 and 1000 Hz. It's far more likely than not that a thread will continue executing into the next timeslice.

Even in the worst case scenario of a highly-oversubscribed CPU, where you have a context switch and complete L2 replacement in each timeslice, that's only 1 MB * 1 kHz = 1 GB/s per core of extra memory traffic, for a rather extreme scenario.

In other words, largely a non-issue.


That's L2 cache. Nobody is saying you can get rid of your L2, just because you have HBM. AMD and Nvidia also have L2 in their HBM-equipped compute accelerators.
Nobody's talking about getting rid of L2$

Not sure where you got the idea that SLC makes sense to be anything but SRAM.

We're not talking about in-package DRAM as cache, per se. In the event you also have some external memory, it just works like a slower tier. Kind of like swap, but much faster.
Well, Intel did have those CPUs with eDRAM attached next door that functioned as L4$.
The Broadwell CPUs had 128 MiB of eDRAM with latency of <150 cycles and 50 GiB/s of bandwidth.
So it's not unprecedented to consider DRAM as L4$.

But that's not what I want to do, I want SRAM as L4$, not DRAM.

What power? 32GB of DDR5-4800 uses 2-3W per DIMM under full load while self-refresh uses about 200mW, no big deal.

SRAM has static leakage current and that current gets worse as transistors get smaller. The newest SRAM I could find a standby current spec for pulls 320mA at 2.5V on a 72Mbits chip. Assuming the leakage per bit gets cut by 3X going down to 1.2V on a newer process, you are looking at about 1W of static power draw per 64MB of SRAM. The fastest discrete SRAM I could find is spec'd at 2.6A @ 1.2V running at full speed, so chalk another ~3W per 128MB of SRAM for IO and internal routing.

If you wanted to have 16GB of SRAM, your SRAM DIMM may very well require its own 8-pin power connector.

SRAM has horrible scaling in every possible way besides latency, which is why chip architects put so much design effort into using as little of it as they can.
First off, nobody is talking about mounting SRAM on DIMMs!
I don't know where you got that conclusion from.

SRAM was never meant to be mounted on a DIMM in the modern/current era.

SRAM is only to be used as cache on the CPU and memory controllers, and in most cases it'll be built into the chip itself, with only a few exceptions like my idea for an L4$ CCD.
 
But that's not what I want to do, I want SRAM as L4$, not DRAM.
Inserting an extra cache tier bigger than L3$ adds latency worse than L3$, which would be 20+ns of extra latency checking L4$ before going to DRAM.

If you make all DRAM accesses 20+ns slower due to having to check L4$ first when a DRAM access averages a cost of 60ns, your L4$'s hit rate needs to be at least 33% of what already missed L1-2-3 to break even against no L4$ whatsoever.

If you want your L4$ to have significantly better hit rate than the 33% bare minimum required to at least not make things worse on average and look completely silly, it will need to be several times bigger than all of the caches that have already missed. If we extend the ~16X progression per tier from L1-2-3, L4$ would need to land near giga-scale to be useful.
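The 33% break-even figure falls straight out of those numbers:

```python
# Break-even L4 hit rate: a 20 ns L4 lookup added in front of a 60 ns DRAM
# access (figures from the post above).
L4_LOOKUP_NS = 20
DRAM_NS = 60

# Without L4, every L3 miss costs DRAM_NS. With L4, it costs L4_LOOKUP_NS
# plus DRAM_NS on an L4 miss. Setting the averages equal:
#   DRAM_NS = L4_LOOKUP_NS + (1 - h) * DRAM_NS  =>  h = L4_LOOKUP_NS / DRAM_NS
break_even = L4_LOOKUP_NS / DRAM_NS
print(f"break-even L4 hit rate: {break_even:.0%}")  # ~33%
```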

Exponential costs, inversely proportional benefits. That is the reality of SRAM.
 
Inserting an extra cache tier bigger than L3$ adds latency worse than L3$, which would be 20+ns of extra latency checking L4$ before going to DRAM.

If you make all DRAM accesses 20+ns slower due to having to check L4$ first when a DRAM access averages a cost of 60ns, your L4$'s hit rate needs to be at least 33% of what already missed L1-2-3 to break even against no L4$ whatsoever.
Then it's time to follow Tenstorrent's lead and move on to the Directory-based Cache Coherency model to deal with all those latencies.

Good thing AMD has a cIOD that everything runs through.
A nice central Hub/Router for everything to go in/out towards.

You were definitely correct that the cIOD is where AMD would place the GPU.

I definitely didn't see them going with just 2 CUs, though.

But once they announced the CU size, I can understand why they didn't, from a market segmentation standpoint.

2 CUs would be "just enough" to beat Intel and never step on the toes of any of their APU or dGPU lines.

APUs would be limited to 3-12 CUs.

While dGPUs would have ≥ 12 CUs.

If you want your L4$ to have significantly better hit rate than the 33% bare minimum required to at least not make things worse on average and look completely silly, it will need to be several times bigger than all of the caches that have already missed. If we extend the ~16X progression per tier from L1-2-3, L4$ would need to land near giga-scale to be useful.
Perfect, that's what I was planning: 2x L4$ CCDs together would be around that size, with only one layer of 3DvCache on each as a starting point.

Exponential costs, inversely proportional benefits. That is the reality of SRAM.
Would you say that about the massive L3$ that AMD regularly uses?
Those are MASSIVE compared to what they used in the past.

The point of chiplets was to maintain reasonable costs and prevent them from growing exponentially.
 
Then it's time to follow Tenstorrent's lead and move on to the Directory-based Cache Coherency model to deal with all those latencies.

Would you say that about the massive L3$ that AMD regularly uses?
Those are MASSIVE compared to what they used in the past.

The point of chiplets was to maintain reasonable costs and prevent them from growing exponentially.
The directory coherency only deals with chip-to-chip "who is caching what" bus chatter, it doesn't change anything for tag-RAM lookups to find out which cache line a given RAM address is cached in and all of the pipelining around the SRAM to access it, which is where the L3$'s relatively high latency (3-4X L2's) comes from even in single-CCD CPUs where cache snooping across CCDs isn't a thing.

I wouldn't call 32MB of L3$ per CCD in "stock" form massive. The extra cache per chiplet is a work-around to soften the performance penalty from using CCDs in the first place. An extra 64MB with V-cache isn't that much of a step up from that.

The exponential cost is in the amount of SRAM required to improve the hit rate by a given fraction of what still misses the L3$. The effort is still exponentially costly regardless of how many chiplets you split it into.
 
L3 can only be of finite size, due to the shape and construction of the L3$ in the center of the CCD (Core Chiplet Die) and how many 3DvCache layers you're willing to stack. Most are only willing to stack one layer at the moment for cost reasons; nobody on the consumer or enterprise end has bothered with higher stacks, even though 12 is the current upper limit stated by TSMC. I think adding more layers adds more risk of failures or bad bonds, along with exorbitant costs per CCD. So most products stick with one layer of 3DvCache so far.
What I've read about 3D V-Cache stacking suggests they can only stack it 1-high. I'm guessing that's due to thermal issues.

A dedicated L4$ SRAM-only CCD (Cache Chiplet Die) that functions as a victim cache for all the various L3$ across the CCDs in a CPU, and can act as a buffer for the main memory controllers, with at least one layer of 3DvCache on top, would be a "huge" scratch pad / buffer.
Sounds like its communication link would be a bottleneck. I figured you wanted the L4 cache on the I/O Die.

Based on Zen 3 CCD die size & area, but using TSMC 5nm with basement-floor usage for a lot of things like cache tags, you can pack a lot more SRAM onto the ground-floor area.
By "basement floor", you mean putting logic and memory cells in the substrate? I'd imagine that would make it much more expensive, and would hardly be worthwhile, since it's so much lower-density.

My estimates for SRAM capacities:
[Image: cUetvMr.png]
What density figures did you use for that?

I think AMD will eventually consider a Directory-based Cache Coherency model down the line, instead of the current "Snoop" model.
So, what are the tradeoffs?

Well, Intel did have those CPUs with eDRAM attached next door that functioned as L4$.
The Broadwell CPUs had 128 MiB of eDRAM with latency of <150 cycles and 50 GiB/s of bandwidth.
So it's not unprecedented to consider DRAM as L4$.
I don't know specifics on those, but it seems a rather poor fit. The latency will be similar to DIMMs, so your only real benefit is bandwidth. That makes sense for graphics, since GPUs tend to be good at latency-hiding and are very bandwidth hungry. For CPU cores, it seems a lot less interesting.
 
What I've read about 3D V-Cache stacking suggests they can only stack it 1-high. I'm guessing that's due to thermal issues.
I remember reading that TSMC stated that 12-high is the maximum, but that's also limited by the accumulated thermal limits between layers. So you need to choose carefully what you put underneath.

Sounds like its communication link would be a bottleneck. I figured you wanted the L4 cache on the I/O Die.
I would if there was die area available, but it seems that the floor plan for the cIOD is maxed out.
There really isn't any room for more SRAM on the cIOD, despite what I want. So until they figure out 3D transistor stacking or something else to free up die space, whatever process node shrink I can get will need to make room for the Directory Cache Coherency.

The Directory Cache Coherency is mostly for CPUs with 3x CCDs and up. Normal Ryzen CPUs with at most 2x CCDs don't really need it; those aren't complex enough CPUs to need Directory-based Cache Coherency.

And I want AMD to start segmenting its upper echelon of CPUs into different platforms for different markets.

1x-2x CCD's for Ryzen

3x-4x CCD's for Ryzen FX (WorkStations / HEDT / SMB)

5x-6x CCD's for Ryzen TR (ThreadRipper PRO).

7x-24x CCD's for EPYC.

By "basement floor", you mean putting logic and memory cells in the substrate? I'd imagine that would make it much more expensive, and would hardly be worthwhile, since it's so much lower-density.
No, I mean building one layer of transistors below the main layer of transistors.
3D-Stacked CMOS Takes Moore’s Law to New Heights
Kind of like building construction.
But what you choose on the bottom (Basement Layer) needs to not generate too much heat.
Ergo, what I put on that layer is very selective and critical to Min/Max the 2D Die Area.
Most of what goes on the bottom is L1.I$ & L1D$ along with μOp Cache.
I've measured what is possible with TSMC 5nm & 3D CMOS stacking, and underneath the main Zen 3 CCD logic area you can place a lot of L1I$ & L1D$ & an enlarged μOp cache.
I finally found space for my desired 192 KiB of L1 (I$ & D$) + a 65,536-entry µOp cache w/ 16-way associativity.
This gives enough cache resources to allow for SMT1-8 on regular future Zen # cores.
Zen #C cores would be limited to SMT1-4 because of maximum cache sizes.

With the increased transistor density for core logic, but the physical die area held to Zen 3's size, the L1$ / L2$ / L3$ SRAM die area is constrained because SRAM transistor scaling has stopped at 5nm for now. I'm working on a layout that needs 3D transistor stacking, so that a basement layer can take most of the cache & cache tags as needed to maximize actual SRAM cache on the main levels for L2$ & L3$.
L1$ will largely get shunted into the basement and send data up to the core logic area.

What density figures did you use for that?
TSMC 5nm for the SRAM, since that's where SRAM scaling has stopped until somebody figures out a better solution.

So, what are the tradeoffs?
For cache coherence, snooping tends to be faster if you have enough bandwidth available.
Directory-based cache coherence is better for scalability with many cores / CCDs.
Guess what era we're in: the era of ever more CCDs, cores, and caches.

Directory-based cache coherence is also what Tenstorrent is using for their upcoming RISC-V tile/chiplet-based CPU. It's better for CPUs with a high number of cores/tiles, while the traditional bus-snooping method is based on broadcasting out and waiting for replies about cache state. That's great when your core count is small.

But given the hub-and-spoke model that AMD has chosen, it makes sense to have a Directory on the cIOD for maintaining cache coherency across the CPU for all the various data located in the different caches.
Each local copy would make changes and pass them along across the CPU as necessary.
Obviously you pass along which state a cache line or segment is in; if it's locked by something else, you have to wait for the token to read/write that cache line/segment once it gets updated.

Having a central directory is going to be crucial once caches grow larger and RAM/memory sizes grow larger.

Having multiple layers of cache on different parts of the system, at different points in the chain, allows very high speeds, but also adds some complexity over who is writing to which memory address.

I don't know specifics on those, but it seems a rather poor fit. The latency will be similar to DIMMs, so your only real benefit is bandwidth. That makes sense for graphics, since GPUs tend to be good at latency-hiding and are very bandwidth hungry. For CPU cores, it seems a lot less interesting.
I don't agree with using DRAM as L4$; the only thing you really gain is massive memory capacity.
But the slowness of DRAM, along with the need to pipeline around its half-duplex nature to get more bandwidth, is unnecessary complexity where you don't want it.

SRAM has low latency, high bandwidth, an inherently full-duplex nature, and the ability to scale with the IFOP links as needed over time. And I'm sure IFOP will continue to grow in bandwidth as the PCIe PHY does, plus however much faster AMD feels like clocking it above PCIe spec using their own GMI protocols. That's why I want an L4$ SRAM-based CCD (Cache Complex Die). With the right cache coherency structure and design, it can be exactly the buffer needed to keep all the CCDs fed on CPUs with more than 2 CCDs.

I'm thinking of CPUs ranging from 3x CCDs up to >24x CCDs in the future.

Keeping every CCD filled and busy is going to be an issue on multiple levels.
 
I don't agree with using DRAM as L4$; the only thing you really gain is massive memory capacity.
Using 16-64GB of on-package memory for mainstream and 64-512GB for servers isn't for caching, it is the main system working memory where nearly all computing that falls through L1-3 caches will occur and external memory becomes the new swapfile with ~100X lower latency than an SSD.
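To put the "~100X lower latency than an SSD" claim in perspective, here is a rough tier comparison with my own ballpark latencies (not figures from the post):

```python
# Rough latency tiers if on-package DRAM is main memory and external
# (CXL-attached) memory takes over the swap role. Ballpark values only.
tiers_ns = {
    "on-package DRAM": 60,           # ~60 ns
    "CXL-attached DRAM": 250,        # a few hundred ns
    "NVMe SSD random read": 20_000,  # ~20 us
}
ssd = tiers_ns["NVMe SSD random read"]
for name, ns in tiers_ns.items():
    print(f"{name:21s}: {ns:>7,} ns  (~{ssd / ns:.0f}x faster than the SSD)")
```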
 
I remember reading that TSMC stated that 12-high is the maximum, but that's also limited by the accumulated thermal limits between layers. So you need to choose carefully what you put underneath.
It matters what you're stacking. According to the data @InvalidError provided, thermal issues would prevent SRAM from stacking more than a few layers.

I would if there was die area available, but it seems that the floor plan for the cIOD is maxed out.
It would make more sense to move the iGPU to another chiplet than to move L4 there. I think the iGPU is only on the IO Die for convenience. With L4, putting it on the I/O Die would probably be key for achieving decent latency and energy-efficiency.

No, I mean building one layer of transistors below the main layer of transistors.
3D-Stacked CMOS Takes Moore’s Law to New Heights
Kind of like building construction.
Except not. It wouldn't make sense to partition the layers functionally, as if they're separate chiplets. Transistor-stacking would be better used to increase logic density. Maybe you can use it to increase memory density, as well.

The catch is that it won't be free. Probably cheaper than die-stacking, but more expensive than a single layer of transistors.

But what you choose on the bottom (Basement Layer) needs to not generate too much heat.
With such tight spacing, I doubt it matters whether you're talking about top or bottom layer. The main issue is going to be the areal density of thermal dissipation. It'd just be another layout constraint.

I've measured what is possible with TSMC 5nm & 3D CMOS stacking, and underneath the main Zen 3 CCD logic area you can place a lot of L1I$ & L1D$ & an enlarged μOp cache.
Why?

But given the hub-and-spoke model that AMD has chosen, it makes sense to have a Directory on the cIOD for maintaining cache coherency across the CPU for all the various data located in the different caches.
What if using it that way is protected by a patent? Maybe Tenstorrent got around that by using it differently. There must be some reason AMD didn't do it (assuming they're not). Do you think they're dumb?
 
It matters what you're stacking. According to the data @InvalidError provided, thermal issues would prevent SRAM from stacking more than a few layers.
I concur, that's why I limited the L1$ to be one layer only on the bottom, and no more.

It would make more sense to move the iGPU to another chiplet than to move L4 there. I think the iGPU is only on the IO Die for convenience. With L4, putting it on the I/O Die would probably be key for achieving decent latency and energy-efficiency.
That would be nice if people would agree to that, I'm amenable to moving the iGPU out, but that requires agreements on what to do.

Except not. It wouldn't make sense to partition the layers functionally, as if they're separate chiplets. Transistor-stacking would be better used to increase logic density. Maybe you can use it to increase memory density, as well.
I chose the parts to go underneath based on their limited transistor activity and heat generation, along with the space they need.

The catch is that it won't be free. Probably cheaper than die-stacking, but more expensive than a single layer of transistors.
True, but it's cheaper to build on top in one continuous chip than to die stack multiple chips.

With such tight spacing, I doubt it matters whether you're talking about top or bottom layer. The main issue is going to be the areal density of thermal dissipation. It'd just be another layout constraint.
That's true with any die stacking or 3D transistors.

Because I have an ultimate goal for larger L1$ sizes and SMT1-8 functionality based on core size.
192 KiB L1$ (I & D)for Zen # cores
_96 KiB L1$ (I & D) for Zen # C cores

Apple M1 was the first to hit massive L1$ w/ (192 KiB I & 128 KiB D)$ on their performance cores.
Their efficiency cores have L1$ w/ (128 KiB I & 64 KiB D)$

I've always wanted more L1$ for SMT & efficiency purposes, but I never thought Apple would be the first one to get to that point, much faster than I thought. They really took a big risk in design choices and it paid off in spades.

Apple M1 really has very impressive Energy Efficiency per core.

What if using it that way is protected by a patent? Maybe Tenstorrent got around that by using it differently. There must be some reason AMD didn't do it (assuming they're not). Do you think they're dumb?
I don't think they're dumb, but given all the other stuff I want to do, like OMI with extra SRAM caches on the memory controller out near the DIMMs, along with L4$ SRAM chiplets, I think a Directory Cache Coherency mechanism would be necessary, given how many separate SRAM caches would be lying around the CPU & MoBo package, along with how much more actual main memory address space I would be able to support.
 
Using 16-64GB of on-package memory for mainstream and 64-512GB for servers isn't for caching, it is the main system working memory where nearly all computing that falls through L1-3 caches will occur and external memory becomes the new swapfile with ~100X lower latency than an SSD.
It wouldn't be hard to beat most SSDs, either with DRAM or SRAM.

Both would be orders of magnitude faster than SSD NAND Flash, even with NVMe controllers.

[Image: DbvSGPm.jpg]
 
That would be nice if people would agree to that, I'm amenable to moving the iGPU out, but that requires agreements on what to do.
The only person you need to negotiate with is yourself. You're a CPU architect in your own mind, and nowhere else.

Because I have an ultimate goal for larger L1$ sizes and SMT1-8 functionality based on core size.
SMT8 won't happen. SMT4 is a stretch. Currently, the benefits shown by SMT2 are limited enough that I think even 4-way isn't a foregone conclusion.

Keep in mind that ARM Neoverse cores are competitive, even without SMT, and Intel seems to be moving that direction with Sierra Forest.

Apple M1 was the first to hit massive L1$ w/ (192 KiB I & 128 KiB D)$ on their performance cores.
That's easier to do when you're limited to a lower clockspeed. The latency hit from increasing the size is less, because each clock cycle takes longer.

Apple can also more readily afford to spend more $ on die area, if it saves them some power consumption.

Apple M1 really has very impressive Energy Efficiency per core.
Due to a lot more than just cache sizes.
 
It matters what you're stacking. According to the data @InvalidError provided, thermal issues would prevent SRAM from stacking more than a few layers.

The catch is that it won't be free. Probably cheaper than die-stacking, but more expensive than a single layer of transistors.
While the static leakage of 1GB of SRAM may be huge compared to 32GB DRAM (10W vs 200mW), SRAM only has a density of about 96MB/sqcm, which would be a power density of about 1W/cm2, quite trivial in a world of CPUs and GPUs pushing ~100W/cm2. Stacking a dozen of those could still be an issue with stacked thermal resistance though.
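Working through those numbers (just the arithmetic restated):

```python
# SRAM static power density from the figures above:
# ~10 W of leakage per GB and ~96 MB/cm^2 of density.
leak_w_per_gb = 10
density_mb_per_cm2 = 96

w_per_cm2 = leak_w_per_gb / 1024 * density_mb_per_cm2
print(f"~{w_per_cm2:.1f} W/cm^2 of static leakage")  # ~0.9 W/cm^2
# Trivial next to the ~100 W/cm^2 of a hot CPU or GPU, though stacking a
# dozen such layers still compounds the thermal resistance problem.
```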

As for the cost of stacking vs multi-layer vs planar, what was the rationale for the research into 30+ layer NAND that led to today's 200+ layer chips? Wafers are expensive. Now we have 170+ layer NAND and can get 1TB SSDs for under $100. Based on that, it seems layers are much cheaper if you can find a way to make your chips that way. If fabs figure out how to make SRAM and DRAM cells in NAND-like layers, the cost could go down quite drastically over the following years.

Once layering DRAM and SRAM gets figured out, you should have everything necessary to layer DRAM and SRAM on top of logic. Then chip design will be primarily limited by thermal density - can't pack things tighter than you can cool them.
 
While the static leakage of 1GB of SRAM may be huge compared to 32GB DRAM (10W vs 200mW), SRAM only has a density of about 96MB/sqcm, which would be a power density of about 1W/cm2, quite trivial in a world of CPUs and GPUs pushing ~100W/cm2. Stacking a dozen of those could still be an issue with stacked thermal resistance though.

As for the cost of stacking vs multi-layer vs planar, what was the rationale for the research into 30+ layer NAND that led to today's 200+ layer chips? Wafers are expensive. Now we have 170+ layer NAND and can get 1TB SSDs for under $100. Based on that, it seems layers are much cheaper if you can find a way to make your chips that way. If fabs figure out how to make SRAM and DRAM cells in NAND-like layers, the cost could go down quite drastically over the following years.

Once layering DRAM and SRAM gets figured out, you should have everything necessary to layer DRAM and SRAM on top of logic. Then chip design will be primarily limited by thermal density - can't pack things tighter than you can cool them.
That's the same stacked L4$ design principle I was advocating, using a CCD model.
Something that is already very familiar to AMD & TSMC.
Just cut out the core logic parts and fill the rest of the die area with more L3$.

Then stack as necessary, up to 12 layers high, and attach it to the cIOD as a CCD with GMI3 links.