AMD GPU Appears to Leave Room for Future 3D V-Cache

IMO, it's hardly surprising. Why else would they launch their flagship GPU with less cache than the previous gen, unless they had some avenue for later surpassing it?

Meanwhile, if you look at what Nvidia did, their 4000-series GPUs have 12x as much L2 cache as the corresponding 3000-series cards, even surpassing the amount of L2 in AMD's 7900 XTX! So, this definitely seems like a key point for performance that AMD will want to exploit.

As for die-stacking vs. thermals, I presume the MCDs should be cooler than the GCD, so probably a non-issue. And it could further justify their decision to move cache off the GCD. I rather expected this is what they had up their sleeves.
 
It is so crazy that we are at the point where GDDR6 memory is considered slower! 😛
Than cache? It was always slower.

I found this example from the 2017 Tesla V100:
"V100 has a peak math rate of 125 FP16 Tensor TFLOPS, an off-chip memory bandwidth of approx. 900 GB/s, and an on-chip L2 bandwidth of 3.1 TB/s"​
The V100 was made on 12nm and used HBM2 memory, but this should give you some sense of the ratio of L2 to main-memory bandwidth.


Another datapoint: AMD claimed something like 5.3 TB/s bandwidth to the L3 cache in its RX 7900 XTX, whereas its GDDR6 bandwidth is just 0.96 TB/s.
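
If you want those ratios side by side, here's a trivial back-of-the-envelope script (just a sketch using the figures quoted above; Python for convenience):

```python
# Back-of-the-envelope check on the cache vs. VRAM bandwidth ratios quoted above.
v100_l2_tbps = 3.1     # V100 on-chip L2 bandwidth (TB/s), per Nvidia's figure
v100_hbm2_tbps = 0.9   # V100 off-chip HBM2 bandwidth (TB/s)
xtx_l3_tbps = 5.3      # RX 7900 XTX Infinity Cache (L3) bandwidth, per AMD
xtx_gddr6_tbps = 0.96  # RX 7900 XTX GDDR6 bandwidth (TB/s)

print(f"V100 L2 vs. HBM2:      {v100_l2_tbps / v100_hbm2_tbps:.1f}x")   # ~3.4x
print(f"7900 XTX L3 vs. GDDR6: {xtx_l3_tbps / xtx_gddr6_tbps:.1f}x")    # ~5.5x
```

So the on-chip advantage has roughly grown from ~3.4x to ~5.5x between those two designs.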
 
This option to stack V-Cache on the MCDs was leaked before the GPUs even came out.

*

Since AMD has been unable to meet its performance estimates for RDNA 3, it's possible even a binned and super-overclocked refresh variant wouldn't be memory-bandwidth-starved enough to warrant the extra cost. (This would be the mid-gen refresh "RX 7950XTX" or "RX 7970XTX".)

And even if they manage to pull a rabbit out of a hat and fix the performance with drivers or firmware, there's still the alternative of equipping the card with faster GDDR6 instead, depending on which is cheaper.

*

Having said that, AMD tends to create reusable designs. Even if V-Cache isn't required this generation, it could come in handy for the next one. This same MCD can be used for the GDDR6 cards, at least.
 
I wouldn't be surprised if they created 3DvCache/Infinity Cache for L2$ & L1$ on the GPU/CPU at some point,
Definitely not L1 cache. The latency of going to a stacked die would be too high for that.

L2 cache... again, I'd expect latency would be a major issue for CPUs, but less so for GPUs. The biggest issue for GPUs might turn out to be the thermal impact of stacking anything atop the GCD.
 
A semiconductor engineer has discovered the same 3D V-Cache connection points on AMD's RX 7900 XT as were found on AMD's Zen 3 CPU architecture, pointing to the possibility that AMD could build 3D V-Cache GPUs in the future.

Could AMD use these connectors to actually have two layers of chiplets for processing there?
Instead of V-Cache, two full processing cores, one on top of the other?
 
Could AMD use these connectors to actually have two layers of chiplets for processing there?
Instead of V-Cache, two full processing cores, one on top of the other?
The earliest leaks of the RDNA 3 architecture suggested AMD had planned for 16MB of L3 cache on each MCD, with the potential to stack another 16MB on top. There was even the possibility of doing a 2-high stack (an extra 32MB per MCD), but the benefits were outweighed by the cost.
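
For anyone wondering where the 96/192/288MB totals below come from, the arithmetic is just layers x MCDs (assuming the leaked 16MB-per-layer figure and all six MCDs of a full Navi 31):

```python
# Total L3 for a full six-MCD Navi 31, per the leaked 16MB-per-layer figure.
MCDS = 6       # MCD count on the 7900 XTX
LAYER_MB = 16  # base L3 per MCD, and per stacked layer

for stacked in (0, 1, 2):
    print(f"{stacked}-high stack: {MCDS * LAYER_MB * (1 + stacked)} MB total L3")
# 0-high: 96 MB, 1-high: 192 MB, 2-high: 288 MB
```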

In general, I think the gains from going to 192MB of total L3 cache on a 7900 XTX will be relatively small; there are diminishing returns. Maybe, best-case, AMD gets an additional 10–15% at 4K, and a future 7950 XTX certainly seems likely. But while going from 96MB to 192MB might get 10% or whatever, my bet is the move from 192MB to 288MB total only adds another 5% or less. Only time will tell, and perhaps the 1-high and 2-high stacks will be more for professional cards, where they cost $2,000 and spending an extra $200 per chip on stacking isn't out of the question.
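
To put my hand-waving about diminishing returns into rough numbers, here's a toy model, not a prediction: it assumes miss rate falls as 1/sqrt(capacity) (the old cache rule of thumb) and that a made-up 25% of frame time is stalled on L3 misses at 96MB:

```python
import math

STALL_FRACTION = 0.25  # assumed share of frame time stalled on L3 misses at 96 MB

for cap_mb in (96, 192, 288):
    miss = math.sqrt(96 / cap_mb)  # miss rate relative to the 96 MB baseline
    speedup = 1 / (1 - STALL_FRACTION + STALL_FRACTION * miss)
    print(f"{cap_mb:3d} MB: ~{(speedup - 1) * 100:.0f}% over the 96 MB baseline")
```

With those assumptions you get roughly +8% at 192MB and only about +4% more at 288MB, which is the shape of the curve I'm betting on.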
 
I'm wondering how they will use this "silver bullet" to shoot themselves in both feet.
My first thought is lower performance per dollar in workloads that don't benefit from it (the same way the 5800X3D was slower than the 5800X in non-cache-heavy tasks).

There's also the question of how many applications can actually benefit from it (think early-days RTX: no real use for the hardware if barely anything can use it).
 
Presumably the MCDs won't have the same heat problem from stacking extra cache on top, since they're separate from the main logic in the GCD, unlike X3D CPUs, where the cache sits on top of the normal CPU die.
 
Presumably the MCDs won't have the same heat problem from stacking extra cache on top, since they're separate from the main logic in the GCD, unlike X3D CPUs, where the cache sits on top of the normal CPU die.

V-Cache on the 5800X3D was stacked on top of the on-die L3 cache, not on top of the logic parts.

The idea that thermals were the problem is misinformation that AMD has addressed and debunked.

The 5800X3D being clocked lower (and locked) wasn't about thermals, but about the strict voltage requirements for the 1st gen V-Cache.

TechPowerUp (https://www.techpowerup.com/293001/amds-robert-hallock-confirms-lack-of-manual-cpu-overclocking-for-ryzen-7-5800x3d) said:
It turns out that the 3D V-Cache is Voltage limited to a maximum of 1.3 to 1.35 Volts, which means that the regular boost Voltage of individual Ryzen CPU cores, which can hit 1.45 to 1.5 Volts, would be too high for the 3D V-Cache to handle. As such, AMD implemented the restrictions for this CPU.
 
Could AMD use these connectors to actually have two layers of chiplets for processing there?
Instead of V-Cache, two full processing cores, one on top of the other?

I doubt it’s for anything other than cache expansion, but it would be cool to see processing-in-memory applied here. AMD could leverage FPGAs from its Xilinx acquisition to accelerate certain edge functions, speeding up memory accesses (as an FPGA accelerator) and/or directly processing AI algorithms in the stacked MCD. With the high-bandwidth interconnect, each MCD has 883GB/s to work with, and is mostly caching spatiotemporal frame data, plus some ray traversal/intersection data; the rest needs to stay very close to the CUs in the larger L0, doubled gfxL1s, 1.5x registers, various parameter caches, and L2. So, if you directly process AI within the MCD on those cached assets, theoretically, the GPU proper (the GCD) saves time by not having to use CUs for matrix math ops.

Custom memory controllers could be used, and Xilinx APIs would be needed to enable customization of the FPGAs in the MCDs (Infinity Link already handles communication to/from the GCD). This is probably a long way off, but it stands to reason that it may be in conceptual development. It's a logical way of offloading workloads into memory that already holds the necessary data and has direct access to the PHYs for reaching VRAM. It may take a long while to reach gaming devices, as it might be better suited to accelerating professional workloads first. Also: cost!

EDIT: But, if it was implemented in a gaming GPU, improved on-the-fly upscaling and image AI processing via FSR and video encoding are logical targets. As a memory accelerator, FPGAs can learn common patterns in memory accesses and accelerate them, just like in networking. A chip like a GPU is actually a network of discrete IP blocks linked via interconnect.
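
To illustrate the data-movement argument with a quick sketch (only the 883GB/s per-MCD link figure comes from my post above; the operand and result sizes are purely hypothetical):

```python
# Data shipped across Infinity Link: GCD-side compute vs. in-MCD (PIM) compute.
# All sizes are made-up illustrative numbers; only the link bandwidth is quoted.
LINK_GBPS = 883    # quoted per-MCD link bandwidth (GB/s)
OPERAND_MB = 64    # hypothetical tensor cached in one MCD
RESULT_MB = 0.25   # hypothetical reduced output (e.g., per-tile statistics)

def link_time_us(megabytes):
    # MB -> GB, divide by bandwidth, convert seconds -> microseconds (rough)
    return megabytes / 1024 / LINK_GBPS * 1e6

print(f"GCD computes: {link_time_us(OPERAND_MB + RESULT_MB):.1f} us on the link")
print(f"MCD computes: {link_time_us(RESULT_MB):.2f} us on the link")
```

The point being: if the reduction happens in the MCD, only the small result ever crosses the link.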
 
Could AMD use these connectors to actually have two layers of chiplets for processing there?
Instead of V-Cache, two full processing cores, one on top of the other?
Thermals would seem to prevent doing something like that. If I'm right, then stacking logic dies is only something you'd do in embedded processors that run at low clockspeeds for the sake of power-efficiency, but need extremely large amounts of compute (e.g. vision & AI for robotics).

it would be cool to see processing-in-memory applied here.
So much of what you describe sounds exactly like what Samsung and SK Hynix have been doing with their processing-enhanced memory products.

You can find details of SK Hynix' GDDR6-AiM at a link I left in the comments of this article:



You can see a rundown of Samsung's efforts in this area, here:



AMD can leverage FPGAs from its Xilinx acquisition and accelerate certain edge functions
Indeed, it would be an interesting application of FPGAs, though AMD would probably have to partner with a memory maker, since it's most efficient to put the processing in the memory dies themselves. Putting it in the MCDs wouldn't save you much over simply integrating it directly into the GCD.

As a memory accelerator, FPGAs can learn common patterns in memory accesses and accelerate them, just like in networking.
Exactly where are FPGAs being used to learn memory access patterns in networking? I'm skeptical of this claim for several reasons, not least of which is that FPGAs take a long time to reconfigure - many orders of magnitude longer than the timescale on which memory access patterns can change. Also, a high-performance memory controller is something you'd typically want to hard-wire, in order to keep its power, thermals, and area under control.
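
For a sense of scale, here are rough, assumed orders of magnitude (not measurements):

```python
# Both values are assumptions for illustration, not measured figures.
FPGA_RECONFIG_S = 1e-3   # partial FPGA reconfiguration is typically ms-scale
PATTERN_SHIFT_S = 1e-6   # GPU access patterns can shift between draw calls (us)

print(f"Reconfiguration is ~{FPGA_RECONFIG_S / PATTERN_SHIFT_S:,.0f}x slower "
      "than the patterns it would be chasing.")
```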