News AMD RDNA4 Navi 48 is 25% denser than Nvidia Blackwell GPUs — 53.9 billion transistors in a die smaller than GB203

If Nvidia had a 25% denser package and kept the same yields, that would translate to more chips produced and more product to sell.

Making more from less is better, as long as you do not lose something in the process.

It should also make things faster and use less energy, since signals have less distance to cover.
 
It has lots of Infinity Cache, which will drive up density vs. the 4080. Cache circuitry is very small and packs extremely tightly onto the chip!

The correct AMD chip to compare it to is the 7800 XT, since it's a beefed-up 7800 XT!

No sense comparing this chip to the 5070 Ti - that chip is ALREADY IN THE REARVIEW MIRROR, as it's inferior...
 
The 5090/5080 use the same N4P node from TSMC. I would be curious to know where the extra 25% density is coming from. I understand the process node isn't everything, but 25% is a big difference for the same process node.
 
Navi 48 having more density and a higher transistor count than GB203 while, most likely, being slower than an RTX 5080 is not a win... just sayin'
As we note in the final paragraph.

"Ultimately, though, it comes down to performance. Whatever the claimed transistor density, the faster chip will still be faster."

Also, there are questions as to how dense these really are. I personally think there's a difference in counting transistors. Maybe AMD includes blank space, or debug logic that Nvidia omits, or something; I don't know. I'm just saying that all indications are that AMD doesn't intend to take on the 5080, and so Nvidia having a slightly larger chip that's faster but has fewer transistors? That may or may not be fully accurate.
It has lots of Infinity Cache, which will drive up density vs. the 4080. Cache circuitry is very small and packs extremely tightly onto the chip!
AMD and Nvidia have both talked about how cache density scaling has slowed down. That was one of AMD's key talking points for GPU chiplets with RDNA 3! I'm still not fully convinced it's accurate, but I don't know. What I do know is that AMD and Nvidia both have a lot of cache in a monolithic design now, and transistor density doesn't seem to have been hurt too much.
 
AMD and Nvidia have both talked about how cache density scaling has slowed down. That was one of AMD's key talking points for GPU chiplets with RDNA 3! I'm still not fully convinced it's accurate, but I don't know. What I do know is that AMD and Nvidia both have a lot of cache in a monolithic design now, and transistor density doesn't seem to have been hurt too much.
They're both 'staying / going back to' monolithic because they're both stuck on the same mature process. If they broke out chiplets, those would be on the same process, so they might as well keep everything on the same die. Chiplets only work when some components can be fabricated on older tech. There's no money to be saved by leaving part of the GPU on 5nm.
 
The article said:
AMD's decision to abandon RDNA 3's chiplet-style design for a return to a monolithic die does not seem to have sacrificed density or efficiency.
I found some info on how TSMC N4P compares with N5. According to WikiChip, it's 6% denser than N5. So, replacing the high-speed links to the MCDs with 64 MiB of L3 cache would've yielded a density of 141.5 MTr/mm^2 on the N5 node used by the RX 7900 GCD.

In other words, they did sacrifice density vs. RDNA3, but (assuming both numbers are accurate) gained back enough to compensate. Had they kept the same architecture as RDNA3, they would've gotten up to perhaps 159 MTr/mm^2. That loss of potential density increase seems to me like a sacrifice.
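
To spell that arithmetic out, here's a minimal Python sketch. The round 150 MTr/mm^2 starting figures (Navi 48's density on N4P, and the RX 7900 GCD's density on N5) are my approximations, and the 6% node gain is WikiChip's claim:

```python
# Back-of-the-envelope density comparison. Assumes Navi 48 lands around
# 150 MTr/mm^2 on N4P and that the RX 7900's N5 GCD was also roughly
# 150 MTr/mm^2; the 6% N4P-over-N5 gain is WikiChip's figure.

N4P_OVER_N5 = 1.06            # N4P is ~6% denser than N5 (WikiChip)

navi48_density_n4p = 150.0    # MTr/mm^2, approximate
print(navi48_density_n4p / N4P_OVER_N5)    # ~141.5: Navi 48's design on N5

rdna3_gcd_density_n5 = 150.0  # MTr/mm^2, approximate
print(rdna3_gcd_density_n5 * N4P_OVER_N5)  # ~159: an RDNA 3-style GCD on N4P
```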

Source: WikiChip

BTW, "... or efficiency"? The RX 7900 XTX actually burned something like 10 W on those chiplet links, IIRC. The chiplet approach was less efficient, but it was something AMD did for the sake of cost reduction, and perhaps in anticipation that the supply of N5 production would be limited like it was in the pandemic era.

The article said:
AMD appears to be attacking Nvidia's upper-midrange products with a zeal we haven't seen from the company's GPU division in quite some time.
RDNA2 was quite competitive. AMD had a winning idea with Infinity Cache. I think they hoped their chiplets would be another game changer, but it just didn't pan out. If N5 production had been more scarce, maybe it'd have worked better for them.
 
Also, there are questions as to how dense these really are. I personally think there's a difference in counting transistors.
Could be. I'm reminded of this piece, written by Anand himself:

AMD and Nvidia have both talked about how cache density scaling has slowed down. That was one of AMD's key talking points for GPU chiplets with RDNA 3! I'm still not fully convinced it's accurate, but I don't know.
It's true!

[Chart: SRAM bit-cell density, TSMC N3B vs. N3E]


Source: https://fuse.wikichip.org/news/7343/iedm-2022-did-we-just-witness-the-death-of-sram/

Anton recently penned an update on TSMC N2 (and another article claimed that Intel's 18A has similar SRAM cell sizes):

What I do know is that AMD and Nvidia both have a lot of cache in a monolithic design now, and transistor density doesn't seem to have been hurt too much.
Well, AMD removed the 6 chiplet links, and I/O drivers are way less dense than SRAM cells, IIUC.

Also, RDNA3 still had the tag RAM on the GCD, but that's probably only about 1/8th or so the size of the cache data memory.
 
The 5090/5080 use the same N4P node from TSMC.
Not exactly. They used "4NP", which we might presume to be similar to N4P, but TSMC has never said.

I would be curious to know where the extra 25% density is coming from.
True, the numbers seem to me like they can't have been counted the same way. See my link to AnandTech above, for one possible explanation.

Another could have something to do with the clock speeds they were designed to run at. How similar are those? One way AMD reduced the size of its "C" cores is by making more use of density-optimized cell libraries, with the consequence that the C cores simply cannot clock as high. If AMD made heavier use of high-density cells in RDNA4, while Nvidia used more frequency-optimized cells, perhaps it could account for some of the difference, but I think probably not on the order of 25%.
 
It has lots of Infinity Cache, which will drive up density vs. the 4080.
Navi 48 has 64 MiB of L3 cache, 8 MiB of L2 cache, and 2 MiB of what I'll call L1 cache. The RTX 4080 & 5080 both have 64 MiB of L2 cache, and the 4080 had 9.5 MiB of L1. So, basically the same amount of cache.
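
Just to tally those up (a quick sketch; the level labels follow my loose grouping above rather than the vendors' official naming):

```python
# Total on-die cache in MiB, using the figures above.
navi48   = {"L3": 64, "L2": 8, "L1": 2}
rtx_4080 = {"L2": 64, "L1": 9.5}

print(sum(navi48.values()))    # 74
print(sum(rtx_4080.values()))  # 73.5 -- basically the same total
```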

Oh, and here's where I got my figures on Navi 48:

[Image: Navi 48 specifications]


The correct AMD chip to compare it to is the 7800 XT, since it's a beefed-up 7800 XT!
Yeah, unless Jarred doesn't have a transistor count for the 7800 XT. Wikipedia claims it's 28.1B, but the source cited is an AMD press release and the link is broken, so I couldn't verify whether it actually cited a count for the smaller GCD or whether somebody just extrapolated it.
 
If Nvidia had a 25% denser package and kept the same yields, that would translate to more chips produced and more product to sell.
Usually, the way you boost yields is by having some extra compute units, so that you can disable some that contain defects and still end up with a die you can sell. Sometimes, there end up being defects in a critical part of the die that can't have redundancy. But, as long as most of the die has redundancy and the defect rate is low enough, it's not worth worrying about those remaining errors.

I think rather than just wasting density, they would probably prefer to simply use a node that's cheaper and more mature. That should be both more economical and more effective at minimizing bad dies.
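
To illustrate why die size matters for yields, here's a toy Poisson defect model in Python. The defect density is a made-up illustrative number, and the die areas are just the approximate published sizes of GB203 and Navi 48:

```python
import math

def perfect_die_fraction(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies with zero defects, under a simple Poisson model."""
    return math.exp(-defects_per_mm2 * area_mm2)

D0 = 0.001  # hypothetical defects per mm^2, purely illustrative
for name, area in (("GB203-sized die", 378), ("Navi 48-sized die", 357)):
    print(f"{name} ({area} mm^2): {perfect_die_fraction(area, D0):.1%} defect-free")

# With redundant compute units, dies with a defect in a redundant block can
# still ship as cut-down SKUs, so usable yield is higher than this.
```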
 
Navi 48 having more density and a higher transistor count than GB203 while, most likely, being slower than an RTX 5080 is not a win... just sayin'
RTX 5080 costs more, burns more power, and uses 30 Gbps GDDR7 (compared to the RX 9070 XT's 20 Gbps GDDR6). So, roughly 50% more memory bandwidth. Heck, even the RTX 4080 had 12% more memory bandwidth than the RX 9070 XT.
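
For reference, here's the bandwidth math as a quick sketch, assuming the usual 256-bit buses on all three cards and 22.4 Gbps GDDR6X on the 4080 (my numbers, so double-check them):

```python
# Bandwidth (GB/s) = per-pin data rate (Gbps) * bus width (bits) / 8.
def bandwidth_gb_s(gbps_per_pin: float, bus_bits: int = 256) -> float:
    return gbps_per_pin * bus_bits / 8

rx_9070_xt = bandwidth_gb_s(20)    # 640.0 GB/s (20 Gbps GDDR6)
rtx_5080   = bandwidth_gb_s(30)    # 960.0 GB/s (30 Gbps GDDR7)
rtx_4080   = bandwidth_gb_s(22.4)  # 716.8 GB/s (22.4 Gbps GDDR6X)

print(f"{rtx_5080 / rx_9070_xt - 1:.0%}")  # 50% more than the 9070 XT
print(f"{rtx_4080 / rx_9070_xt - 1:.0%}")  # 12% more than the 9070 XT
```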

But, I mean, it's Nvidia. The last time AMD equaled them was the RX 6950 XT. Before that, it was the Fury X, like 10 whole years ago! They're not easy to beat!

P.S. In my mind, memory bandwidth really jumps out as a significant weak spot of these cards. Then again, that's also one of the ways they're keeping costs down, so I shouldn't complain too much.
 
They're both 'staying / going back to' monolithic because they're both stuck on the same mature process. If they broke out chiplets, those would be on the same process, so they might as well keep everything on the same die. Chiplets only work when some components can be fabricated on older tech. There's no money to be saved by leaving part of the GPU on 5nm.
5nm-class has been around for three years (give or take), sure, but it's still pretty expensive by all accounts. AMD did the MCDs on 7nm-class (N6) for RDNA 3. If what it said at the time was true — that cache and external interfaces didn't scale much with process node shrinks — it would remain true now. It could have used MCDs with a GCD on RDNA 4, just as it did with RDNA 3. Unless...

Unless the use of chiplets didn't actually work out well overall. It adds complexity to packaging for sure, and when everything is wrapped up, perhaps it was adding more latency and hurting performance while not really saving much money. That's what I suspect. RDNA 3 Navi 32 GPUs basically tied the roughly equivalent RDNA 2 Navi 21 GPUs, meaning there was no performance uplift between the two architectures. Basically, RDNA 3 was to RDNA 2 what Blackwell seems to be to Ada Lovelace. Some minor performance improvements, and a few architectural tweaks, but nothing massive.
 
5nm-class has been around for three years (give or take), sure, but it's still pretty expensive by all accounts. AMD did the MCDs on 7nm-class (N6) for RDNA 3. If what it said at the time was true — that cache and external interfaces didn't scale much with process node shrinks — it would remain true now. It could have used MCDs with a GCD on RDNA 4,
Manufacturing nodes tend to get cheaper as they mature. So, there's not as much pressure to use die space as efficiently.

Also, maybe the decision to go ahead with the MCDs was made at a time when the fabs were way backlogged and cutting edge capacity was more limited. If you thought maybe you could only get a subset of the N5 wafers you really wanted, then the chiplet strategy would make a lot of sense.

Finally, don't forget about MI300. Maybe AMD wanted to get some experience with chiplet-based GPUs before taking on that mammoth project, and gaming GPUs were seen as a good avenue to try it out.

Don't forget they used a new chiplet interconnect technology that was supposedly like 10x as efficient as what Ryzen and EPYC used.

Unless the use of chiplets didn't actually work out well overall. It adds complexity to packaging for sure, and when everything is wrapped up,
Cost, complexity, and power are the likely factors that jump out at me.

I do think @lmcnabney has a point that if AMD weren't using a relatively mature node, but had instead moved to an N3-family node, there might've been more incentive to retain the GCD/MCD architecture.

perhaps it was adding more latency and hurting performance
Fortunately, Chips & Cheese actually tested that, comparing the monolithic RX 7600 vs. the RX 7900 XTX. Latency increased, but only a modest 9%, which is so small it's hard to know how much of that is due to the MCDs and how much is just because it's a bigger chip with more cache.

RDNA 3 was to RDNA 2 what Blackwell seems to be to Ada Lovelace. Some minor performance improvements, and a few architectural tweaks, but nothing massive.
WMMA was new to RDNA 3, right? I also found some compute benchmarks on OpenBenchmarking.org, where the RX 7800 XT did a fair bit better than the RX 6800 XT. But yeah, somewhat surprisingly similar, overall. Mostly just a bit cheaper and lower-power.
 
Navi 48 has 64 MiB of L3 cache, 8 MiB of L2 cache, and 2 MiB of what I'll call L1 cache. The RTX 4080 & 5080 both have 64 MiB of L2 cache, and the 4080 had 9.5 MiB of L1. So, basically the same amount of cache.
So to summarize this whole thing, disparity in cache amounts and Nvidia's heavy use of L1/L2 instead of L3 are responsible for the apparent differences in density.
 
So to summarize this whole thing, disparity in cache amounts and Nvidia's heavy use of L1/L2 instead of L3 are responsible for the apparent differences in density.
I was thinking about that, and I do wonder how much cache cells differ between the different levels. Obviously, the degree of associativity matters - and that's something we don't know and nobody has discussed.

I certainly wouldn't ascribe the density differences to the differences in their cache distribution. Maybe a little, but that seems like a reach to me.