News AMD's Navi 48 GPU pictured: around 390 mm², targeting mainstream gamers

The article said:
When AMD first showcased the Navi 48 GPU yesterday, it was presented as the company's Hawk Point Refresh processor. However, it certainly resembled the Navi 48 GPU from AMD's press materials, not a Hawk Point CPU. We asked AMD for clarification, but the company declined to comment.
Good job to all involved!

I'd also point out that the die-to-die communication imposed some power overhead on its predecessor. Going monolithic should improve efficiency, even if it's a little more wasteful to fab I/O and cache on N4P than on N6.

I never found an answer to whether each MCD cached only its own memory bank or whether its caching domain was global. If the former, the cache could have behaved like a much smaller one in some cases. If the latter, it would have compounded the inefficiency of the die-to-die communication.
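The capacity side of that question can be made concrete with a toy model. This is purely a hypothetical sketch (AMD never documented the policy, which is the point): it assumes the RX 7900 XTX layout of 6 MCDs with 96 MB of Infinity Cache total, i.e. 16 MB per slice, and compares how much cache a working set can actually use under each policy.

```python
# Toy model of the two possible MCD caching policies (hypothetical; AMD
# never documented which one RDNA3 used). Assumes the RX 7900 XTX layout:
# 6 MCDs, 96 MB of Infinity Cache total, i.e. 16 MB per MCD slice.

NUM_MCDS = 6
SLICE_MB = 16  # 96 MB / 6 slices

def usable_mb_local(channels_touched: int) -> int:
    """Each MCD caches only its own memory channel's lines: a working
    set that maps to few channels can only use those slices."""
    return channels_touched * SLICE_MB

def usable_mb_global(channels_touched: int) -> int:
    """Any slice may cache any line: full capacity regardless of which
    channels the working set lands on, at the cost of extra die-to-die
    traffic to look up remote slices."""
    return NUM_MCDS * SLICE_MB

# A skewed working set hitting only 2 of the 6 channels:
print(usable_mb_local(2))   # 32 -> behaves like a much smaller cache
print(usable_mb_global(2))  # 96 -> full capacity, but more cross-die hops
```

Either way there's a downside, which is exactly the trade-off the question is about.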
 
Actually, the 7nm-to-5nm transition still managed better SRAM scaling than the move to 3nm did. Staying on-die for the L3 (excuse me, INFINITY CACHE) should be helpful for latency too.
 
Staying on-die for the L3 (excuse me, INFINITY CACHE) should be helpful for latency too.
In RDNA2, I got the sense that Infinity Cache was basically L2. I think you're right that RDNA3's MCDs are L3, though.

GPUs tend to be pretty resilient to latency; the main benefit of something like an L3 cache is bandwidth. For instance, the RTX 4090 has only two levels of cache, and its L2 latency is about 138 ns, which is way more than the DDR5 latency of desktop CPUs!

https://substack-post-media.s3.amazonaws.com/public/images/675deb83-d802-462b-b2ba-ba6ce10a1fa2_1084x577.png


Source: https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4090
And keep in mind, those measurements were taken when the GPU was otherwise idle. So, those are best-case latency numbers!
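That latency tolerance falls out of Little's law: a GPU simply keeps enough requests in flight to cover the latency. A back-of-the-envelope sketch (the bandwidth and line-size numbers here are illustrative, not vendor specs; only the 138 ns figure comes from the measurement above):

```python
# Little's law: bytes in flight = bandwidth * latency.
# Handy unit coincidence: GB/s * ns = bytes exactly (1e9 * 1e-9 = 1).

def bytes_in_flight(bandwidth_gb_s: float, latency_ns: float) -> float:
    """Outstanding bytes needed to sustain the given bandwidth."""
    return bandwidth_gb_s * latency_ns

def requests_in_flight(bandwidth_gb_s: float, latency_ns: float,
                       line_bytes: int = 128) -> float:
    """Same thing, in units of cache-line-sized requests."""
    return bytes_in_flight(bandwidth_gb_s, latency_ns) / line_bytes

# Sustaining an assumed ~1 TB/s of L2 bandwidth at 138 ns latency needs:
print(bytes_in_flight(1000, 138))     # 138000.0 bytes in flight
print(requests_in_flight(1000, 138))  # 1078.125 outstanding 128 B requests
```

With tens of thousands of threads resident, that level of memory parallelism is routine for a GPU, whereas a CPU core with a few dozen outstanding misses can't hide anywhere near that much latency.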

BTW, AMD said it put the tag RAM for its L3 on the GCD. I guess that would help with latency, but it might also answer the question of whether the L3's caching domain was global or per-MCD.

Edit: here's the latency data from two RDNA3 GPUs. Even with the RX 7900 XTX's chiplet architecture, latency doesn't seem worse than the RTX 4090.
 
Actually 7nm to 5nm still managed to maintain better SRAM scaling than 3nm did. Staying on-die for the L3 (excuse me, INFINITY CACHE) should be helpful for latency too.
Yeah, even 5nm to 3nm was virtually zero scaling. However, it appears TSMC has had a breakthrough, and 2nm will see the first tangible improvement in SRAM scaling since the move from 10nm to 7nm. Even 7nm to 5nm wasn't great.
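For reference, here's that trend in numbers. These high-density bitcell areas are the figures reported in press coverage of TSMC's disclosures, so treat them as approximate, and the N2 value in particular as a pre-production claim:

```python
# Reported TSMC high-density SRAM bitcell areas, in um^2 (approximate,
# from public disclosures; the N2 figure is a pre-production claim).
BITCELL_UM2 = {
    "N7": 0.027,
    "N5": 0.021,
    "N3E": 0.021,   # effectively zero SRAM scaling vs. N5
    "N2": 0.0175,
}

def shrink_pct(old_node: str, new_node: str) -> float:
    """Percent bitcell area reduction going from old_node to new_node."""
    old, new = BITCELL_UM2[old_node], BITCELL_UM2[new_node]
    return round(100 * (1 - new / old), 1)

print(shrink_pct("N7", "N5"))   # 22.2 -> decent, but below logic scaling
print(shrink_pct("N5", "N3E"))  # 0.0  -> the stall described above
print(shrink_pct("N3E", "N2"))  # 16.7 -> first tangible shrink in a while
```

Which is exactly why parking big SRAM pools on older, cheaper nodes (as the MCDs did) made sense in the first place.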
 
AMD does net back some perf and power efficiency by not having MCDs, so yep, I'm curious to see where the 9070 lands in terms of gen-on-gen gains and against Nvidia, in both raw performance and value (perf/$, if you will).

There's some waste, as mentioned, but also a simpler packaging process, so it's probably more than made up in cost terms.