News TSMC's 3nm Node: No SRAM Scaling Implies More Expensive CPUs and GPUs

Hell of a lot of numbers and nonsense to simply say - we want way more of your money, the world's getting more tech-based, and we want all your money. JUST LOOK AT THESE CRAZY NUMBERS! Pay me, gimme your money! Because you know you will and you don't have a choice because the world is so tech-based now. Thanks Covid.
 
I think once the cooling issue is handled, stacking SRAM on top of smaller logic and other "3D" techniques mitigate these issues.

Remember like 10 or 15 years ago when everyone was saying we were all doomed because of "quantum tunneling?" Well, here we are about to go down below 1nm. This is why research is always ongoing.
 
I could see Intel's chiplet method handling this the easiest. AMD's stacking method will always be hot because of the thermal insulation it makes. Also, the cache on a Zen chiplet is on the same expensive node as the CPU logic, and the stacked cache has to match perfectly, so a cheaper node for the stacked cache seems out of the question. Intel is already attaching HBM2e, why not the same area in L3?
 
and heres NVIDIA's excuse next gen.

"and thats why the 5090 costs $3000" - jensen prolly
Don't tempt Jensen Huang, he might be crazy enough to do it.

Hell of a lot of numbers and nonsense to simply say - we want way more of your money, the world's getting more tech-based, and we want all your money. JUST LOOK AT THESE CRAZY NUMBERS! Pay me, gimme your money! Because you know you will and you don't have a choice because the world is so tech-based now. Thanks Covid.
That was going to happen, regardless of COVID.
The question is "How Fast", and "How Soon" will we get there.
 
I think once the cooling issue is handled, stacking SRAM on top of smaller logic and other "3D" techniques mitigate these issues.

Remember like 10 or 15 years ago when everyone was saying we were all doomed because of "quantum tunneling?" Well, here we are about to go down below 1nm. This is why research is always ongoing.

Mainly because we are not even near 1nm in real life!
The marketing "nm" is so far from real nm that quantum tunneling is not problem for years!
 
Hell of a lot of numbers and nonsense to simply say - we want way more of your money, the world's getting more tech-based, and we want all your money. JUST LOOK AT THESE CRAZY NUMBERS! Pay me, gimme your money! Because you know you will and you don't have a choice because the world is so tech-based now. Thanks Covid.

Chill out dude! No one is forcing you to buy anything. And you are not entitled to get anything at the price you are willing to pay. If there is a match, fine - go and buy. If there isn't - well, maybe it's just out of your price range.

For companies to provide goods and services, they have to be able to make a profit. If there isn't one, they won't do it. The only things that keep them in check are actual market demand and competition. You cannot decree lower prices. I grew up in a former communist country and I've seen where this ends.

The only outcome from your whining is that manufacturers are reluctant to charge the equilibrium price for their product so they don't tarnish their image. Which invites scalpers. So you are not better off for it, but the actual manufacturers are worse off. I know this is not a popular opinion but, unfortunately, common sense is not very common.
 
I guess miniaturization can only go so far; it's time to make another design that either scales or goes 2.5D like FinFET did. I guess it's time for someone to notice you can get clusters of 4 cells with something shared, or maybe someone will find a way to overlap neighboring gates.
Whatever it will be, we need some innovation, not just gzip.
I hope smart guys can figure something out.
 
It seems like AMD has already implemented a good solution with the GCD/MCD arrangement on RDNA3, with the large L3 caches made on an older, cheaper node. I expect more designs to pick that up in the future.
Yes, but it only helps so much.

The problem with SRAM not scaling at the same pace as logic is that you can fit more cores on a die, but you can't scale up the caches proportionately. Using old nodes doesn't change that, unless you start doing wacky things like die-stacking.

So, I expect we'll not only see more die-stacking of L3 cache, but also more in-package DRAM. That can reduce latency (a little) and increase bandwidth (a bit), but neither nearly as much as SRAM.
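To put rough numbers on it (purely a back-of-the-envelope Python sketch; the ~1.6x logic density gain and ~1.0x SRAM gain are just the figures commonly quoted for N5 -> N3E, and the 70/30 area split is made up for illustration):

# Back-of-the-envelope: cache's share of a die when logic shrinks but SRAM doesn't.
# All numbers are illustrative assumptions, not real die measurements.
logic_area_n5 = 70.0   # mm^2 of logic on a hypothetical N5 die
sram_area_n5 = 30.0    # mm^2 of SRAM (caches) on the same die

logic_area_n3 = logic_area_n5 / 1.6   # assume ~1.6x logic density gain
sram_area_n3 = sram_area_n5 / 1.0     # assume ~no SRAM density gain

total_n5 = logic_area_n5 + sram_area_n5
total_n3 = logic_area_n3 + sram_area_n3
print(f"N5:  cache is {sram_area_n5 / total_n5:.0%} of the die")
print(f"N3E: cache is {sram_area_n3 / total_n3:.0%} of the die")
# cache goes from ~30% to ~41% of a smaller, pricier-per-mm^2 die

So even without growing the caches at all, they eat a bigger slice of an already more expensive wafer.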
 
AMD's stacking method will always be hot because of the thermal insulation it makes.
Maybe there are ways around that. Also, in server CPUs you don't want dies running so hot (heat = inefficient), so it's at least not a deal-breaker for them.

Also, the cache on a Zen chiplet is on the same expensive node as the CPU logic, and the stacked cache has to match perfectly, so a cheaper node for the stacked cache seems out of the question.
No, I think the cache doesn't need to be the same node. The connection between the dies uses TSVs, so you just need those to match up.

Intel is already attaching HBM2e, why not the same area in L3?
"attaching"? Their Sapphire Rapids Xeon Max seems to use separate HBM stacks, just like every other HBM-enabled product I think we've seen. I'm not sure what you're proposing ...unless it's to add a L3 cache die directly to the HBM stack. One downside I can see with that is slightly higher latency than putting the cache directly above the cores.
 
Maybe there are ways around that. Also, in server CPUs you don't want dies running so hot (heat = inefficient), so it's at least not a deal-breaker for them.
The 5800X has about 25% more thermal headroom than the 5800X3D under steady state load per Tom's testing: AMD Ryzen 7 5800X3D Boost Frequencies and Thermal Throttling Tests - AMD Ryzen 7 5800X3D Review: 3D V-Cache Powers a New Gaming Champion | Tom's Hardware (tomshardware.com)
You can mitigate thermal issues on stacked dies, but the same methods apply to flat ones, so they may shrink the gap but can never eliminate it. At some point it may pay to put the cache stack underneath and run TSVs down to the PCB.

Of course people don't run at full load often yet, but being able to dissipate 110 W will give you higher clocks than 86 W.
No, I think the cache doesn't need to be the same node. The connection between the dies uses TSVs, so you just need those to match up.
I suppose not. They could spread out the lower cache to match TSVs.
"attaching"? Their Sapphire Rapids Xeon Max seems to use separate HBM stacks, just like every other HBM-enabled product I think we've seen. I'm not sure what you're proposing ...unless it's to add a L3 cache die directly to the HBM stack. One downside I can see with that is slightly higher latency than putting the cache directly above the cores.
You know, like RDNA3, or multiple core chiplets as with SR. You could even have a few core chiplets surrounding a giant L3 chiplet, similar to the current monolithic designs in layout, and have more on the outside with combo I/O, RDNA3-style chiplets. That wouldn't be harder to cool. There would be more L3 latency, but it would beat RAM.
 
The 5800X has about 25% more thermal headroom than the 5800X3D
First gen products are often good learning opportunities. Wait a couple months, until we see what the Ryzen 7000X3D CPUs do.

However, since Zen 4 CCDs are smaller than Zen 3, cooling should be an even bigger challenge in the 7000X3D CPUs.

My solution is simply to make the CPU package into a vapor chamber.
 
Remember like 10 or 15 years ago when everyone was saying we were all doomed because of "quantum tunneling?" Well, here we are about to go down below 1nm. This is why research is always ongoing.
You need to keep in mind that the "nm" in process naming schemes doesn't have much to do with real-world nanometers anymore. It hasn't had much to do with actual physical sizes since 45nm or so, and the discrepancy between actual circuit density and node names is getting wider, especially with SRAM, which appears to have hit a scaling brick wall, possibly from excess tunneling leakage across its dense, regular pattern of rows and columns.
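To put a number on how dense SRAM already is: assuming the ~0.021 µm² high-density bitcell figure that's been widely reported for both N5 and N3E (treat the exact value as an assumption), a quick Python calculation gives:

# Raw SRAM bitcell density from the bitcell area, ignoring the array overhead
# (sense amps, decoders, redundancy) that real caches also carry.
bitcell_um2 = 0.021                   # assumed HD bitcell area in um^2
bits_per_mm2 = 1e6 / bitcell_um2      # 1 mm^2 = 1e6 um^2
mib_per_mm2 = bits_per_mm2 / (8 * 1024 * 1024)
print(f"~{bits_per_mm2 / 1e6:.1f} Mbit/mm^2, ~{mib_per_mm2:.1f} MiB/mm^2 of raw bitcells")
# ~47.6 Mbit/mm^2, ~5.7 MiB/mm^2 -- and if the bitcell doesn't shrink from
# N5 to N3E, a given cache needs the same silicon area on both nodes.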
 
So L1/L2/L3 Caches are going to be largely fixed in "Physical Size" moving forward unless something major develops to help shrink them.

Apple has 192 KiB of L1 Cache for their cores, so it's possible to still increase L1$ if you design around such massive caches.
 
So L1/L2/L3 Caches are going to be largely fixed in "Physical Size" moving forward unless something major develops to help shrink them.

Apple has 192 KiB of L1 Cache for their cores, so it's possible to still increase L1$ if you design around such massive caches.

Even more if you consider they split the L1 into two segments on their performance cores (192 KB for instructions + 128 KB for data).
 
Don't tempt Jensen Huang, he might be crazy enough to do it.
Actually, the issue with SRAM scaling was exactly part of what Jensen meant when he said that Moore's Law is dead; SRAM not scaling well isn't exactly a new development and it was bound to lead to no scaling at all at some point. Workarounds are possible, but they don't completely solve the issue and come with their own limitations for now.
 
TSMC's 3nm nodes will make chips faster and more power efficient — and more expensive, too.

TSMC's 3nm Node: No SRAM Scaling Implies More Expensive CPUs and GPUs : Read more
What I find a bit vexing is that I can't find anyone explaining why SRAM should scale worse than logic...
All the stories are about the impact of that trend, not its reason!

It seems to be an issue that's been well known and around for many generations, but I find it hard to believe explanations would need to be found in the pre-Internet ages: so am I just googling it all wrong?

From what little I know, SRAM is logic, only that it's six or eight transistors per bit, so it's relatively expensive, much more so than DRAM with its single gate, which of course has to lug a capacious capacitor around, and that seems to require surface area (which can be achieved via deep holes).

And I've eagerly followed the stories about alternate RAM designs which allow single-gate 'DRAM' density while even eliminating the volatility of DRAM, e.g. magnetic, phase-change, memristor or even more exotic designs, which unfortunately are much worse in terms of scaling, latency, wear etc.: those are all very different beasts, so it's easy to understand that mixing them with logic across process sizes creates challenges.

But for SRAM the 6-8 gate overhead is certainly a heavy burden; I just plain don't understand why it should scale less than logic. In the past I remember reading how the very regular structure of SRAM allowed many tweaks that gave the SRAM gates better-than-logic density.

Do these tricks now simply unravel? Are we simply going from a better-than-logic density to equal-to-logic density?

That might feel like a relative slow-down and disappoint those who want linear improvement from shrinks everywhere, but it would deserve to be called out for what's really happening.

Now, I have seen mentions of wiring limitations, which could offer an alternate explanation.

Sure, I can see that SRAM needs plenty of wires and that these wires have to go a long way along long rows and columns. And, yes, I can also see how logic might fare better there, because lots of inputs might coalesce and then only be passed along to immediate neighbors: fewer and shorter wires, on average if not in all cases. In this scenario SRAM looks like a worst case of logic in terms of signal lines: so is that the reason?

Or is it a combination of both?
 
So L1/L2/L3 Caches are going to be largely fixed in "Physical Size" moving forward unless something major develops to help shrink them.

Apple has 192 KiB of L1 Cache for their cores, so it's possible to still increase L1$ if you design around such massive caches.
Very interesting points. Perhaps what we'll see is a slight refactoring of the cache hierarchy, with it becoming flatter and the lower-numbered caches becoming larger?

The tension with L1 is always around size vs latency. However, I don't know if there's actually such a significant latency penalty from making L2 bigger.
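One way to frame that trade-off is average memory access time; here's a toy Python sketch with invented latencies and hit rates (just to show the shape of the trade-off, not real core numbers):

# Toy AMAT (average memory access time) model, in core cycles.
# All latencies and hit rates below are invented for illustration.
def amat(l1_hit, l1_rate, l2_hit, l2_rate, mem):
    return l1_hit + (1 - l1_rate) * (l2_hit + (1 - l2_rate) * mem)

small_fast_l2 = amat(l1_hit=4, l1_rate=0.95, l2_hit=12, l2_rate=0.80, mem=250)
big_slower_l2 = amat(l1_hit=4, l1_rate=0.95, l2_hit=16, l2_rate=0.90, mem=250)
print(f"small, fast L2: {small_fast_l2:.2f} cycles")   # ~7.10
print(f"big, slower L2: {big_slower_l2:.2f} cycles")   # ~6.05
# A few extra cycles of L2 latency can be a net win if the larger L2
# catches enough misses that would otherwise go all the way to DRAM.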

Even more if you consider they split the L1 into two segments on their performance cores (192 KB for instructions + 128 KB for data).
L1 is already split into instruction vs. data. Perhaps you mean extending the split to L2?
 
Actually, the issue with SRAM scaling was exactly part of what Jensen meant when he said that Moore's Law is dead; SRAM not scaling well isn't exactly a new development and it was bound to lead to no scaling at all at some point.
LOL, funny that it happens exactly when Nvidia decides to add a stonking huge amount of it to their GPUs!

Seriously, the 4090 has more cache than the RX 7900XTX, in spite of those 6 extra dies in the AMD GPU. And they did it on a 4 nm node, whereas AMD was smart to use a more cost-efficient 6 nm node for that stuff.
 
Very interesting points. Perhaps what we'll see is a slight refactoring of the cache hierarchy, with it becoming flatter and the lower-numbered caches becoming larger?

The tension with L1 is always around size vs latency. However, I don't know if there's actually such a significant latency penalty from making L2 bigger.


L1 is already split into instruction vs. data. Perhaps you mean extending the split to L2?

Still talking about L1. Most chips split their total L1 into two segments, an instruction cache and a data cache. For example, Raptor Lake has an 80 KB L1 that is split into two reserved segments (instruction cache and data cache). The M1/M2 actually has a total of 320 KB of L1, split into 192 KB for instructions + 128 KB for data. A lot of places only report the instruction cache size for whatever reason, but Intel and AMD lump the two together when they report L1 on their spec sheets.
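Spelled out as a quick Python tally (the 32+48 split is how Intel lists its P-core L1s; treat these as the commonly published figures rather than gospel):

# L1 is specified per core as instruction cache + data cache.
raptor_lake_p = {"L1I": 32, "L1D": 48}     # KB per P-core -> 80 KB total
apple_p_core = {"L1I": 192, "L1D": 128}    # KB per performance core -> 320 KB total
for name, l1 in (("Raptor Lake P-core", raptor_lake_p), ("Apple performance core", apple_p_core)):
    print(f"{name}: {sum(l1.values())} KB L1 ({l1['L1I']} KB I + {l1['L1D']} KB D)")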
 