News AMD RDNA 3 GPU Architecture Deep Dive: The Ryzen Moment for GPUs


mrmessma

Distinguished
Jun 27, 2008
... Nobody buying a $300-$400 GPU is (should be) making their purchasing decision on the sole notion/metric that the RTX 4090 is the world's fastest GPU, so they should choose Nvidia regardless of the performance tier.
That's where you're wrong, buddy. So many people do this, despite it making no sense. "My Ford Fusion is great because the GT500 is the fastest car." It's dumb, but for some reason, people feel like they're on a winning team when they buy an inferior and overpriced product that shares a name with the performance king.
I try to talk to my friends about the price/performance of cards and it falls on deaf ears. They think AMD means you're poor, similar to how a kid in high school gets laughed at for having an Android instead of an iPhone. And my friends are on the wrong side of 35!
 

jp7189

Distinguished
Feb 21, 2012
https://en.wikipedia.org/wiki/Aspirational_brand

Flagship products (let's call one product A) tend to be aspirational products, i.e., the average buyer wishes to own one but can't because of its premium price. It creates a want, which translates into the buyer purchasing a lower-line product (product B) that has fewer features but higher "value" ($/feature).

A's high pricing is intentional, because that's how it creates the perception of a "good deal" for B. If A were priced at $1,000 and B had 90% of A's features but were priced at $300, then B would be a "very good deal." Whereas if A were $400, then B at $300 would be just an "OK deal, not too great." The higher A's price, the better a deal B's price is perceived to be. Again, the premium pricing is deliberate, and it is a fundamental marketing strategy.

Now you know why the RTX 4090 is $1600+.

Buying is not a matter of dollars and cents, but is about the psychology of buying behavior, and it can be manipulated through savvy marketing.

The above example is oversimplified to illustrate the point. Things get more complicated when competitors' pricing comes into play. In this instance, there are only two GPU vendors, neither of whom wants to rock the boat with a price war, so the above example still holds pretty well.
Someone should let AMD know about this. Their A product is only $100 more. Makes the B product seem undesirable.
 

bit_user

Polypheme
Ambassador
Thanks for the thorough writeup, @JarredWaltonGPU! I finally got back to finishing it!

I'm very hopeful AMD will unleash more performance in future driver revisions. There's a lot here that sounds great - better than the initial performance data would seem to indicate. If this thing works like it's supposed to, I really don't see why the XTX shouldn't be trading blows with the RTX 4080 (except for ray tracing).
 

bit_user

Polypheme
Ambassador
The pressure is now on nVidia to get creative,
With SRAM scaling having stopped, Nvidia will be forced to do as AMD has and put its SRAM on dies made on a larger, cheaper node.

...however, nothing says the dies have to be connected through an interposer or substrate. Another option would be to put your SRAM on a second die that you stack atop (or beneath) your compute die. That would enable the highest bandwidth and better topologies.
 

bit_user

Polypheme
Ambassador
they always launch at the high end, where they can charge enough to cover the process refinement period. Once they are building the parts efficiently, they launch the cut-downs.
Not always. Maxwell and Pascal launched the x80 Ti card last. Even their Titans came after the mid-range models. In those cases, they had such a big performance uplift over the previous generation that there wasn't much overlap in performance. That overlap is what channel partners and retailers hate, because it forces them to discount existing inventory.

Looking ahead to the 2000-series, that was a smaller performance increase and indeed started at the high-end + mid-range. But 12 nm also wasn't a cutting-edge node, so yields should've been good. Yields are one of the main reasons why I think the 900 and 1000 series didn't start with their biggest die, BTW. I think the idea behind including the mid-range was that they thought DLSS + RTX would be enough of a selling point.

Finally, the 3000-series was another round of launching the high-end first, except they also followed quicker with the mid-range cards, probably because supply was so tight and demand was raging.
 

bit_user

Polypheme
Ambassador
I may be mistaken, but when I look at 6x 64-bit VRAM controllers,
Where did you get 6x 64-bit memory controllers? GDDR6 uses 16-bit data channels. The memory controller certainly could interleave them, but given that a single package of GDDR6 is bifurcated into 2x independent 16-bit channels, that would seem counterproductive.


I see variable speed, capacity and bandwidth depending on how the GPU splits up the memory load.

AMD is getting good performance so this is likely well managed in the games tested, but if split to 3 controllers your size and speed would be cut in half.
I've been wondering about this, for a long time. The transaction size in GDDR5 and GDDR6 is 256 bits (32 bytes). They could do address interleaving, on that granularity, but I think there's greater potential in doing linear address mapping and having higher-level load-balancing of assets across the memory banks.
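
To make that distinction concrete, here's a toy Python sketch of the two mappings. The 256-byte interleave granularity, six MCDs, and 4 GB per MCD are just assumptions for illustration, not documented RDNA 3 parameters:

```python
# Hypothetical sketch: routing a physical address to one of six MCDs.
# The 256-byte interleave granularity and the 4 GB-per-MCD split are
# illustrative assumptions, not AMD's documented behavior.

NUM_MCDS = 6
INTERLEAVE_BYTES = 256        # assumed granularity (a multiple of the 32-byte transaction)
MCD_CAPACITY = 4 * 2**30      # 4 GB of GDDR6 behind each MCD

def mcd_interleaved(phys_addr: int) -> int:
    """Fine-grained interleaving: consecutive 256-byte chunks rotate across MCDs."""
    return (phys_addr // INTERLEAVE_BYTES) % NUM_MCDS

def mcd_linear(phys_addr: int) -> int:
    """Linear mapping: each MCD owns one contiguous 4 GB slice of the address space."""
    return phys_addr // MCD_CAPACITY

for addr in (0x0000, 0x0100, 0x0200, 0x1_0000_0000):
    print(hex(addr), "interleaved ->", mcd_interleaved(addr), "linear ->", mcd_linear(addr))
```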

Maybe splitting up the data into small parts is an important step towards the goal of multiple GPU shader dies.
As AMD states in the presentation, the reason they kept the GCD monolithic is that there's too much global data movement, in graphics workloads. If you were to look at the cross-sectional bandwidth of the GCD, I'm sure it'd be a lot higher than the 5.3 TB/s aggregate between it and the MCDs.
 

rluker5

Distinguished
Jun 23, 2014
Where did you get 6x 64-bit memory controllers? GDDR6 uses 16-bit data channels. The memory controller certainly could interleave them, but given that a single package of GDDR6 is bifurcated into 2x independent 16-bit channels, that would seem counterproductive.



I've been wondering about this, for a long time. The transaction size in GDDR5 and GDDR6 is 256 bits (32 bytes). They could do address interleaving, on that granularity, but I think there's greater potential in doing linear address mapping and having higher-level load-balancing of assets across the memory banks.


As AMD states in the presentation, the reason they kept the GCD monolithic is that there's too much global data movement, in graphics workloads. If you were to look at the cross-sectional bandwidth of the GCD, I'm sure it'd be a lot higher than the 5.3 TB/s aggregate between it and the MCDs.
Where are the memory controllers on RDNA3?
How would they access their local cache banks with cache-level latency? I got the segmented memory controllers from them when they explained what the chiplets were.
 

rluker5

Distinguished
Jun 23, 2014
Sorry, maybe I'm dumb, but I need you to spell it out more clearly. Or we can just drop it.
We could drop it. It is just conjecture on my part anyways.

I didn't see any possible way for a single MCD to access all 24GB of VRAM when there are 5 others that also would have to be able to do the same. You would have 6x the number of connections and would have to organize their access.

Remember the memory controllers are on the MCDs. And the caches on the MCDs are to speed up the memory that MCD is accessing. If there were a master memory controller directing these chiplet ones with full granularity capable of interleaving, you wouldn't need the MCDs. Without that there isn't interleaving, and each MCD has 4GB of VRAM on a 64-bit bus width and 16MB of cache for that 4GB. The localized cache also makes interleaving problematic. How do you have a coherent chunk of data in a cache when the cache only sees 1/6 of the interleaved data? I'm guessing what is in charge of the MCDs parcels out the memory requests in a much simpler manner, where coherent chunks of data are sent to and accessed from whichever MCD is available.

My conjecture also agrees with how the XT is reduced proportionally.

I was concerned this would lead to some bad frametimes. If the data on one MCD required more bandwidth than that MCD had to offer for the game to run smoothly, you could have frametimes like TechPowerUp's review showed in Days Gone, Deathloop, Dying Light 2, F1 22, Watch Dogs Legion, and CP 2077 (AMD Radeon RX 7900 XTX Review - Disrupting the RTX 4080 - Frametime Analysis | TechPowerUp). But it doesn't seem so bad that drivers can't fix it.

But I can't prove my conjecture to be true or anything.
 

bit_user

Polypheme
Ambassador
I didn't see any possible way for a single MCD to access all 24GB vram when there are 5 others that also would have to be able to do the same.
Right. I assume the way it works is each MCD caches only the addresses held in the GDDR6 memory attached to it. The downside of this is that you could end up thrashing L3 in the new architecture, whereas if the L3 were unified you'd have enough.

It doesn't have to work that way, of course. It just seems like it'd be more efficient if the L3 content and memory controller were paired.

If there were a master memory controller directing these chiplet ones with full granularity capable of interleaving you wouldn't need the MCDs.
I think the memory controllers aren't what's doing the L3 cache lookups. They only enter the picture when you know L3 doesn't have the requested data.

The localized cache also makes interleaving problematic. How do you have a coherent chunk of data in a cache when the cache only sees 1/6 of the interleaved data?
The interleaving granularity would be some multiple of the GDDR6 transaction size, which is 32 bytes. In CPU caches, the typical cacheline size is 64 bytes, but that tells us nothing about what a GPU is doing. It's probably somehow related to the tile size they used for tiled-rendering.

I'm guessing what is in charge of the MCDs parcels out the memory requests in a much simpler manner where coherent chunks of data are sent to and accessed from whichever MCD is available.
There should be an MMU, which is comparable to the page table used on CPUs. Then, there's either going to be an interleaved or sequential address mapping of the GDDR6 dies. Whether there's any fixed mapping between the physical address and the L3 cache in the MCDs is something we'd probably only know after some careful experimentation.

I was concerned this would lead to some bad frametimes. If the data on one MCD required more bandwidth than that MCD had to offer for the game to run smoothly you could have frametimes like Techpowerup's review showed in Days Gone, Deathloop, Dying Light 2, F1 22, Watch Dogs Legion, CP 2077. : AMD Radeon RX 7900 XTX Review - Disrupting the RTX 4080 - Frametime Analysis | TechPowerUp But it doesn't seem that bad that drivers can't fix it.
Hmmm... it needs someone to use a GPU performance analyzer tool if we're to have any real insight into what's going on there.
 

bit_user

Polypheme
Ambassador

rluker5

Distinguished
Jun 23, 2014
The RDNA3 arch has a lot of potential and should be doing better. I wonder if the lower cards with a traditional memory controller will do proportionately better relative to tflops, rops, tmus, etc. If they tune them down to keep market segmentation they could be very efficient and great for small form factor builds.
 
Right. I assume the way it works is each MCD caches only the addresses held in the GDDR6 memory attached to it. The downside of this is that you could end up thrashing L3 in the new architecture, whereas if the L3 were unified you'd have enough.

It doesn't have to work that way, of course. It just seems like it'd be more efficient if the L3 content and memory controller were paired.

I think the memory controllers aren't what's doing the L3 cache lookups. They only enter the picture when you know L3 doesn't have the requested data.

The interleaving granularity would be some multiple of the GDDR6 transaction size, which is 32 bytes. In CPU caches, the typical cacheline size is 64 bytes, but that tells us nothing about what a GPU is doing. It's probably somehow related to the tile size they used for tiled-rendering.

There should be an MMU, which is comparable to the page table used on CPUs. Then, there's either going to be an interleaved or sequential address mapping of the GDDR6 dies. Whether there's any fixed mapping between the physical address and the L3 cache in the MCDs is something we'd probably only know after some careful experimentation.

Hmmm... it needs someone to use a GPU performance analyzer tool if we're to have any real insight into what's going on there.
FYI, and this may or may not be pertinent to what you're discussing, but all the TLBs and cache tables are stored on the GCD. I asked about that the other day. I had assumed it was, but I just wanted to double-check, because otherwise there'd be a lot of traffic over the Infinity Fabric trying to determine if a needed piece of data was in cache. So the MCDs only cache their local memory, the main GCD TLBs know what has been cached and where, and the large L3 cache still behaves effectively like a unified L3, even though there are really (up to) six 16MB blocks scattered across the MCDs.
 

bit_user

Polypheme
Ambassador
all the TLBs
This part seems the only sensible way to do it. The modern GPU uses paging and virtual memory just like a CPU, for security and other reasons. That's the type of thing that would have to be centralized.

and cache tables are stored on the GCD.
This part is very revealing. If "cache tables" means "tag RAM" for the L3, then it would pretty much mean the L3 cache slices on the MCDs form a unified L3 cache, and don't map to the memory controlled by the respective GDDR attached to each of them. I think that strongly suggests the L3 is a victim cache, as well. There's a very real chance it's write-through, instead of write-back!

So the MCDs only cache their local memory, the main GCD TLBs know what has been cached and where, and the large L3 cache still behaves effectively like a unified L3
I'm thinking the opposite. Otherwise, you'd put the tag RAM on the MCD, also. If the L3 cache is segmented, then you already know which MCD you're talking to, irrespective of whether the data you want is in cache or not.

When I say "unified", I mean that any MCD can cache data from any memory address. If it could only cache the data attached to that MCD, then the L3 would be segmented.
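
Here's a toy sketch of the two models I'm contrasting, purely to pin down the terminology; the line size, tag layout, and lookup flow are all made up:

```python
# Toy contrast of the two L3 models being debated. The line size, tag layout,
# and lookup flow are all invented for illustration; nothing here is documented.

NUM_MCDS = 6
LINE = 128                    # assumed L3 line size
MCD_CAPACITY = 4 * 2**30      # 4 GB of GDDR6 behind each MCD

def l3_lookup_segmented(addr, mcd_tags):
    """Segmented: the address alone picks the MCD; only that MCD's tags get checked."""
    mcd = addr // MCD_CAPACITY            # owning MCD is fixed by the address
    return mcd, (addr // LINE) in mcd_tags[mcd]

def l3_lookup_unified(addr, global_tags):
    """Unified: a global tag store (e.g. on the GCD) says which MCD, if any, holds the line."""
    mcd = global_tags.get(addr // LINE)   # any MCD may hold any address
    return mcd, mcd is not None

# Example: line for address 0x100 cached on MCD 3 under the unified model.
print(l3_lookup_unified(0x100, {0x100 // LINE: 3}))                   # -> (3, True)
print(l3_lookup_segmented(0x100, [set() for _ in range(NUM_MCDS)]))   # -> (0, False)
```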

P.S. I hope you're having some relaxing holidays! I love that you added that info, but I'm good to wait for any replies you have until the new year!
: )
 
This part seems the only sensible way to do it. The modern GPU uses paging and virtual memory just like a CPU, for security and other reasons. That's the type of thing that would have to be centralized.


This part is very revealing. If "cache tables" means "tag RAM" for the L3, then it would pretty much mean the L3 cache slices on the MCDs form a unified L3 cache, and don't map to the memory controlled by the respective GDDR attached to each of them. I think that strongly suggests the L3 is a victim cache, as well. There's a very real chance it's write-through, instead of write-back!


I'm thinking the opposite. Otherwise, you'd put the tag RAM on the MCD, also. If the L3 cache is segmented, then you already know which MCD you're talking to, irrespective of whether the data you want is in cache or not.

When I say "unified", I mean that any MCD can cache data from any memory address. If it could only cache the data attached to that MCD, then the L3 would be segmented.

P.S. I hope you're having some relaxing holidays! I love that you added that info, but I'm good to wait for any replies you have until the new year!
: )
I believe Mike Mantor said the GCD has the TLBs as well as tag RAM. There's no need for the various L3 caches to cache anything other than data from the 4GB VRAM they connect to, which makes sense. Caches can't just arbitrarily store anything and everything in an ideal way; that's why there's set associativity. So for example an 8-way set associative cache effectively only has eight cache line locations that would map to any piece of data. A 4-way would only have four locations (which means if all four are in use, the least recently used gets evicted when a new line of data needs to be cached). No one AFAIK ever tries for full associativity as the lookup time becomes far too long.
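
In code terms, a bare-bones sketch of N-way set associativity with LRU eviction looks something like this (the set count, way count, and line size are arbitrary illustration values, not RDNA 3's):

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Minimal N-way set-associative cache with LRU eviction (illustration only)."""
    def __init__(self, num_sets=1024, ways=8, line_size=64):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]   # per-set LRU-ordered tags

    def access(self, addr):
        line = addr // self.line_size
        index = line % self.num_sets      # each address maps to exactly one set
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:                   # hit: refresh LRU position
            ways.move_to_end(tag)
            return True
        if len(ways) >= self.ways:        # miss with a full set: evict least recently used
            ways.popitem(last=False)
        ways[tag] = None
        return False

cache = SetAssociativeCache(num_sets=1, ways=4)
hits = [cache.access(a * 64) for a in (0, 1, 0, 2, 3, 4, 5, 1)]
print(hits)   # only the second access to line 0 hits; line 1 has been evicted before its reuse
```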

So based on that, each MCD has 4GB of physical memory it can cache, perhaps as an 8-way set-associative cache. All eight possible cache locations for any piece of data in that 4GB would of necessity only be in the cache lines for that MCD. The same goes for all other MCDs. But the GCD has the overall view of where everything sits, what's in cache at any time, etc., because it has the tag RAM and TLBs and such. Anyway, I did find this slide interesting:

[attached slide: AMD RDNA 3 cache hierarchy]

AMD lists set associativity for every one of the caches... EXCEPT the L3 Infinity Cache. LOL. You could definitely be right about it being a victim cache, thus increasing the effective cache size. I'll ask, but I highly doubt we'll get a response before CES is over.

Actually... looking at my email, the actual response never mentions tag RAM, probably because my questions didn't include that. But I take "tag RAM" and "TLB" to be more or less equivalent. That's probably not entirely accurate, as there are nuances, but I won't bother with that... perhaps because I've forgotten too much about the differences and don't want to try and refresh my brain. LOL. But here's exactly what AMD said when I asked about the cache stuff:
  • The goal to make AMD RDNA™ 3 a highly efficient architecture resulted in the creation of 2nd Generation AMD Infinity Cache - a cache level that alters the way data is delivered in GPUs. Along with the wider memory bus, the cache hierarchy size has been balanced and optimized for the ideal mix of Infinity and L2 cache to keep everything moving at highest efficiency. This global cache allows fast data access and acts as a massive bandwidth amplifier, enabling high performance bandwidth with superb power efficiency.
  • The TLBs are all on the GCD and centrally located to minimize latency, and while physically distributed, the Infinity Cache operates as a fully unified cache.
 

bit_user

Polypheme
Ambassador
I believe Mike Mantor said the GCD has the TLBs as well as tag RAM.
Putting the tag RAM on the GCD lets it decide which MCD has the data. That's the main advantage I see of putting it there. This contradicts your next point...

There's no need for the various L3 caches to cache anything other than data from the 4GB VRAM they connect to, which makes sense.
But it doesn't make sense if your reads and writes to the GDDR memory aren't both very well balanced across the MCDs. And that could be a big deal. It could result in 96 MB of L3 performing like far less. In the worst case, if you keep hitting one particular 4 GB range (for instance, if your frame buffer and Z-buffer are in there), you could get the effect of a mere 16 MB of L3, which wouldn't necessarily be enough to cache everything that's being hammered, whereas maybe a unified 96 MB L3 would.

The main disadvantage of treating the MCDs as a unified L3 cache is one extra data move you'd have to do when writing out a dirty cache line. If the cache line isn't local to the GDDR attached to that MCD, you'd have to bring the data back through the GCD and redirect it to the appropriate MCD whenever it eventually gets flushed. How bad that is, in practice, has a lot to do with the ratio of reads vs. writes. I expect interactive rendering does a fair bit more reads than writes. You could mitigate it by having the eviction logic prefer to find a cacheline on the same MCD as the data.
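
To put a rough number on that tradeoff, here's a back-of-envelope sketch; the write fraction and the assumption that dirty victims land on a uniformly random MCD are both invented for illustration:

```python
# Back-of-envelope estimate of the extra MCD -> GCD -> MCD hop a unified L3 would
# add on dirty evictions. The write fraction and the "dirty victims land on a
# uniformly random MCD" assumption are both invented for illustration.

def extra_hop_fraction(write_fraction: float, num_mcds: int = 6) -> float:
    """Expected fraction of traffic needing a second fabric hop at eviction time."""
    prob_nonlocal = (num_mcds - 1) / num_mcds   # chance the victim isn't on its home MCD
    return write_fraction * prob_nonlocal

# e.g. if only 20% of traffic is writes, roughly 17% of it pays the extra hop
print(round(extra_hop_fraction(0.20), 3))        # 0.167
```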

Caches can't just arbitrarily store anything and everything in an ideal way, that's why there's set associativity. So for example an 8-way set associative cache effectively only has eight cache line locations that would map to any piece of data. A 4-way would only have four locations (which means if all four are in use, the least recently used gets evicted when a new line of data needs to be cached). No one AFAIK ever tries for full associativity as the lookup time becomes far too long.
Correct, but irrelevant. It doesn't tell us whether MCDs only map local GDDR or behave as a unified L3.

AMD lists set associativity for every one of the caches... EXCEPT the L3 Infinity Cache. LOL. You could definitely be right about it being a victim cache, thus increasing the effective cache size. I'll ask, but I highly doubt we'll get a response before CES is over.
Good catch! I saw the diagram and worked through the numbers to appreciate the data paths to each slice, but I missed that point about associativity. I also find it interesting that they show the L3 as a logically unified block.

I take "tag RAM" and "TLB" to be more or less equivalent.
No, very different things. TLB is used to map addresses at the page level, and that's a durable mapping which is software-managed. Tag RAM is used to find whether/where an address resides in cache. It works at the cacheline granularity, and is completely hardware-managed.

The page table maps logical addresses to physical ones. Caches operate on physical addresses, not logical ones.

here's exactly what AMD said when I asked about the cache stuff:
  • The goal to make AMD RDNA™ 3 a highly efficient architecture resulted in the creation of 2nd Generation AMD Infinity Cache - a cache level that alters the way data is delivered in GPUs. Along with the wider memory bus, the cache hierarchy size has been balanced and optimized for the ideal mix of Infinity and L2 cache to keep everything moving at highest efficiency. This global cache allows fast data access and acts as a massive bandwidth amplifier, enabling high performance bandwidth with superb power efficiency.
  • The TLBs are all on the GCD and centrally located to minimize latency, and while physically distributed, the Infinity Cache operates as a fully unified cache.
Wow. So, that sounds a lot like what I'm thinking.
 
Putting the tag RAM on the GCD lets it decide which MCD has the data. That's the main advantage I see of putting it there. This contradicts your next point... But it doesn't make sense if your reads and writes to the GDDR memory aren't both very well balanced across the MCDs. And that could be a big deal. It could result in 96 MB of L3 performing like far less. In the worst case, if you keep hitting one particular 4 GB range (for instance, if your frame buffer and Z-buffer are in there), you could get the effect of a mere 16 MB of L3, which wouldn't necessarily be enough to cache everything that's being hammered, whereas maybe a unified 96 MB L3 would.
The physical memory addresses should all be interleaved across the six (or five or whatever) MCDs. There's no other way that would make sense to me, because if you allocate a big chunk of memory — like 32MB for a 4K Z-buffer — you absolutely wouldn't want that all coming from the same MCD and GDDR6 chip. Caching (L1/L2) would help some, but AFAIK all modern systems interleave memory addresses across the available memory channels. That's the best way to mostly guarantee the additional bandwidth gets used. In hypothetical worst-case scenarios, where you know the interleaving size and intentionally access memory in staggered chunks (i.e., hit every sixth piece of data so that it all comes from one controller), performance would get worse, but real-world use should almost never produce such an access pattern, especially on a GPU.
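
As a toy illustration of that pathological case (using an assumed 256-byte interleave across six channels, not a known RDNA 3 parameter), a stride of interleave size times channel count keeps hammering one controller:

```python
# Toy demo of the staggered worst case: with an assumed 256-byte interleave across
# six channels, a stride of 256 * 6 bytes lands every access on the same controller.

NUM_CHANNELS = 6
INTERLEAVE = 256              # assumed granularity, purely illustrative

def channel(addr: int) -> int:
    return (addr // INTERLEAVE) % NUM_CHANNELS

sequential = [channel(i * INTERLEAVE) for i in range(12)]
staggered = [channel(i * INTERLEAVE * NUM_CHANNELS) for i in range(12)]
print(sequential)   # cycles through 0..5, spreading load over all channels
print(staggered)    # all zeros: one channel does all the work
```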

Related to this is the question of virtual (logical) memory (VRAM) versus physical memory. I'm not sure how modern GPU designs handle this. Can you pretend to have 40GB even if you only have 8GB VRAM? On one hand, I would think so, but in practice I know there are quite a few applications (Octane comes to mind) where if you try to load a data set that's too large for the physical VRAM, it just fails with an out of memory error. Regardless, I would expect the virtual memory tracking to only occur in drivers or perhaps on the GCD. The MCDs should be relatively stupid, so all they know is there's 4GB of attached physical VRAM, and then they'd store cachelines for that data.

So again, I think the MCD would only need to cache its own physical VRAM. It would all be interleaved, but I still don't see how you'd ever need to cache a non-local piece of data in the MCD's cache. That's why I mentioned set associativity. It means any physical address can only be mapped to N different cache lines for an N-way set associative cache. Since the physical addresses of the 4GB of VRAM connected to an MCD should be known, all of the cacheline locations for that memory should map to the local cache.

It would seem very bassackward to say, "Hey, give me data located at 0x12345600," which is connected to MCD-1 as an example, but then have MCD-4 end up having that data in its cache. The MCDs in that scenario would have MCD-1 pull the data from GDDR6-1, route it through the Infinity Fabric (and GCD probably), and then pass it over to MCD-4 for caching. And then next time something needs that piece of data, if the GCD isn't managing the tag RAM and doesn't know which MCD cache holds it, things would get even worse. Plus, if MCD-4 can hold cache lines from MCD-1, then presumably it can hold cache lines from all of the other MCDs as well, which means the mappings are all scattered and at most 1/6 of the local cache lines for an MCD would be for the local GDDR6. That would kill throughput and efficiency of the Infinity Fabric if it were possible.

But this is far more granular than my knowledge of modern GPUs normally extends. I've assumed the logical addresses and their associated physical addresses would all be handled outside the MCD, probably with the drivers doing that bit. I could be totally wrong there. It's just I know in most situations, GPUs behave best when you don't exceed their physical VRAM capacity. Is that because the drivers then have to do more work managing things, plus pulling data over the PCIe bus? Or is it just the PCIe bus and GPUs already use logical memory allocations to allow the use of data sets larger than physical VRAM? I'm not sure, but either way, performance quickly tanks if you try to run a 9GB or 10GB data set (in a game) on an 8GB GPU. Something I've noticed happening increasingly often on the RTX 3070 Ti/3070/3060 Ti.
 

bit_user

Polypheme
Ambassador
The physical memory addresses should all be interleaved across the six (or five or whatever) MCDs. There's no other way that would make sense to me, because if you allocate a big chunk of memory — like 32MB for a 4K Z-buffer — you absolutely wouldn't want that all coming from the same MCD and GDDR6 chip.
I mostly agree. Two reasons they might not:
  1. Interleaving at cacheline boundaries could make contiguous access to a single GDDR channel shorter, limiting the potential to benefit from larger bursts. Of course, the interleaving granularity needn't be at cacheline intervals...
  2. You'd miss out on the opportunity to locate data closer to the compute unit that's using it. Interleaving it means potentially a lot more global communication within the GCD, which adds latency and wastes power. It could also be a bottleneck, if you exceed the capacity of the GCD's fabric.
TBH, I don't know how much of an issue or opportunity #1 really is. I'm presuming there's some ability to do a larger burst, but maybe it's simply fixed at 32 bytes per sub-channel. Additionally, I'm not sure if you get any benefit from doing consecutive bursts, or if the burst overhead is always the same.

I think #2 is potentially a significant win, but also very difficult to achieve in practice.

AFAIK all modern systems interleave memory addresses across the available memory channels. That's the best way to mostly guarantee the additional bandwidth gets used.
I wonder about that, quite a lot. The first thing that threw me was when the number of memory channels went to non-power-of-2. AFAIK, that happened with Nehalem, some 15 years ago. I always wondered if they really used a radix-3 divider or if the physical map was linear and they just let the OS balance utilization via the page table. An additional datapoint I have on this is that my old AMD Phenom II system had a BIOS option to interleave at page level (although, you had to read between the lines to figure out that's what it was), which I took as an indication that page-based interleaving was actually a thing.

As for GPUs, Nvidia likes to play with different numbers of memory channels. They'll often have not only multiples of 3, but also oddball configurations like 5, 10, or 11 channels. And then there was that controversy of the GTX 970, where one 32-bit channel was much slower than the rest of the memory, but it only caused a performance issue with games that utilized more than about 3.5 GB, which suggests at least that part of GDDR memory wasn't interleaved with the rest.

Related to this is the question of virtual (logical) memory (VRAM) versus physical memory. I'm not sure how modern GPU designs handle this. Can you pretend to have 40GB even if you only have 8GB VRAM?
They have a full MMU, so it should be possible to have a larger address range than the physically-available memory.

On one hand, I would think so, but in practice I know there are quite a few applications (Octane comes to mind) where if you try to load a data set that's too large for the physical VRAM, it just fails with an out of memory error.
Simply having an MMU isn't enough to enable memory over-subscription. You also need the concept of swap space, and it has to be enabled for the app. Perhaps the APIs allow for some of the app's data to get allocated in host memory, but I'm sure the application can override that, since it could tank performance if not done judiciously (e.g. if your z-buffer randomly happened to get allocated in host memory, your framerate could drop to 0.1 fps).

Anyway, here's a blast from the past (Vega launch coverage, circa 2017):

"AMD also gives the high-bandwidth cache controller (no longer just the memory controller) access to a massive 512TB virtual address space for large datasets.​
When asked about how the Vega architecture's broader memory hierarchy might be utilized, AMD suggested that Vega can move memory pages in fine-grained fashion using multiple, programmable techniques. It can receive a request to bring in data and then retrieve it through a DMA transfer while the GPU switches to another thread and continues work without stalling. The controller can go get data on demand but also bring it back in predictively. Information in the HBM can be replicated in system memory like an inclusive cache, or the HBCC can maintain just one copy to save space. All of this is managed in hardware, so it should be quick and low-overhead.​
As it pertains to Radeon RX Vega 64, AMD exposes an option in its driver called HBCC Memory Segment to allocate system memory to Vega's cache controller. The corresponding slider determines how much memory gets set aside. Per AMD, once the HBCC is operating, it'll monitor the utilization of bits in local GPU memory and, if needed, move unused information to the slower system memory space, effectively increasing the capacity available to the GPU. Given Vega 64's 8GB of HBM2, this option is fairly forward-looking; there aren't many games that need more. However, AMD has shown off content creation workloads that truly need access to additional memory."​


For Nvidia's part, I ran across this snippet in their CUDA C Programming Guide:

"When the application is run as a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices are within this virtual address range. "​
and:

"Depending on the system properties, specifically the PCIe and/or NVLINK topology, devices are able to address each other’s memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices."​


I still don't see how you'd ever need to cache a non-local piece of data in the MCD's cache.
Yeah, depending on where AMD went with the whole HBCC concept, maybe not. However, if Vega's functionality has been carried forward to RDNA3, then I would expect the L3 Tag RAM to have enough address bits to cache host memory, also. I know AMD has a protocol enabling cache-coherent shared memory, which relies on PCIe atomics. That's why their ROCm software framework requires at least a Haswell host CPU.

BTW, I also happen to know that NVLink is cache-coherent. This was highlighted as one of its advantages over PCIe.

It would seem very bassackward to say, "Hey, give me data located at 0x12345600," which is connected to MCD-1 as an example, but then have MCD-4 end up having that data in its cache. The MCDs in that scenario would have MCD-1 pull the data from GDDR6-1, route it through the Infinity Fabric (and GCD probably), and then pass it over to MCD-4 for caching.
With the Tag RAM on the GCD, that would never happen. The GCD is generating the reads & writes, and it would only steer them to a MCD after doing the tag lookup that told it that MCD had the address it wanted.
 
I mostly agree. Two reasons they might not:
  1. Interleaving at cacheline boundaries could make contiguous access to a single GDDR channel shorter, limiting the potential to benefit from larger bursts. Of course, the interleaving granularity needn't be at cacheline intervals...
  2. You'd miss out on the opportunity to locate data closer to the compute unit that's using it. Interleaving it means potentially a lot more global communication within the GCD, which adds latency and wastes power. It could also be a bottleneck, if you exceed the capacity of the GCD's fabric.
TBH, I don't know how much of an issue or opportunity #1 really is. I'm presuming there's some ability to do a larger burst, but maybe it's simply fixed at 32 bytes per sub-channel. Additionally, I'm not sure if you get any benefit from doing consecutive bursts, or if the burst overhead is always the same.

I think #2 is potentially a significant win, but also very difficult to achieve in practice.
Yeah, it's always a bit odd how much detail we have on the way system memory works and how little we know about GDDR memory. Timing and bursts and all that stuff should still exist, I would think... but what are they? ¯\_(ツ)_/¯
I wonder about that, quite a lot. The first thing that threw me was when the number of memory channels went to non-power-of-2. AFAIK, that happened with Nehalem, some 15 years ago. I always wondered if they really used a radix-3 divider or if the physical map was linear and they just let the OS balance utilization via the page table. An additional datapoint I have on this is that my old AMD Phenom II system had a BIOS option to interleave at page level (although, you had to read between the lines to figure out that's what it was), which I took as an indication that page-based interleaving was actually a thing.

As for GPUs, Nvidia likes to play with different numbers of memory channels. They'll often have not only multiples of 3, but also oddball configurations like 5, 10, or 11 channels. And then there was that controversy of the GTX 970, where one 32-bit channel was much slower than the rest of the memory, but it only caused a performance issue with games that utilized more than about 3.5 GB, which suggests at least that part of GDDR memory wasn't interleaved with the rest.

They have a full MMU, so it should be possible to have a larger address range than the physically-available memory.

Simply having an MMU isn't enough to enable memory over-subscription. You also need the concept of swap space, and it has to be enabled for the app. Perhaps the APIs allow for some of the app's data to get allocated in host memory, but I'm sure the application can override that, since it could tank performance if not done judiciously (e.g. if your z-buffer randomly happened to get allocated in host memory, your framerate could drop to 0.1 fps).

Anyway, here's a blast from the past (Vega launch coverage, circa 2017): ...

For Nvidia's part, I ran across this snippet in their CUDA C Programming Guide: ...

Yeah, depending on where AMD went with the whole HBCC concept, maybe not. However, if Vega's functionality has been carried forward to RDNA3, then I would expect the L3 Tag RAM to have enough address bits to cache host memory, also. I know AMD has a protocol enabling cache-coherent shared memory, which relies on PCIe atomics. That's why their ROCm software framework requires at least a Haswell host CPU.

BTW, I also happen to know that NVLink is cache-coherent. This was highlighted as one of its advantages over PCIe.
I still assume interleaving at page level or some other boundary is basically a given. I don't think it would need to be interleaving at cacheline size, as that's too granular with the capacities we now deal with on GPUs, but a 4K or even 8K page interleave would be sufficient. Lots of textures would be potentially many MB in size, and so a request for any reasonably sized texture could pull data from all the memory channels for maximum throughput. A short aside is that it's still mind boggling to me how much data can pass through a GPU each second. Even if real-world throughput is a lot lower than the theoretical maximum, it's still a massive number. 🤯
With the Tag RAM on the GCD, that would never happen. The GCD is generating the reads & writes, and it would only steer them to a MCD after doing the tag lookup that told it that MCD had the address it wanted.
Even if the GCD manages the tag RAM, this would still require data to come from the GDDR6, through one MCD, to the GCD, and then pushed back out to potentially a different MCD for caching. It would be massively simpler and more efficient to only end up with cachelines on the MCD for the GDDR6 physically attached to that MCD. The GCD could still manage this, but it wouldn't need to push a copy back out to the MCD for caching — the MCD could have enough intelligence to implement the LRU algorithm or whatever it is that the GCD implements. Even having two copies of the tag RAM would be better than routing potentially twice as much data over the Infinity Fabric.

Anyway, I need to submit a list of questions on this, so here's what I've got. Edit / append to the list as desired and I'll fire it off to AMD (even though we'll likely only hear back after CES). No guarantees they'll give a full response, naturally. :)
  1. Does the GCD also hold the tag RAM for cachelines, or do the MCDs manage that — or is there potentially two copies of the tag RAM, one on the GCD and another on the MCDs?
  2. Do the MCDs cache only the locally connected GDDR6, or do they cache data from GDDR6 connected to other MCDs?
  3. How is memory interleaving handled? What size are the memory "pages" or whatever that get allocated? Or is that left to the OS / drivers to manage?
  4. How large are the cachelines — and are they all the same size for L0/L1/L2/L3, or are they of different sizes?
  5. We routinely get details on maximum throughput (bandwidth) of GDDR6 memory, but how do the transfers actually occur? I'd assume there's an initial timing delay for the requested data, then time to first word followed by a burst of data, but details on how that actually works on GPUs would be great to have. I'd also assume GDDR memory functions at least somewhat similarly to DDR system memory, but what are the actual timings and such? Ranges of numbers for GDDR6 timings would be adequate.
  6. How does logical and physical memory mapping work with "oversubscription"? With a GPU that has 8GB VRAM for example, if someone attempts to run a 10GB data set, do the drivers manage that, or does the GPU itself handle the movement of data over the PCIe bus and the application just sees a large virtual address space? (And why do some applications get an out of memory error if there's logical memory?)
  7. Related to the above, what's the logical address space? Also, do modern GPUs still work with a 32-bit model using segment:offset stuff to access more than 4GB? (That would seem to be the case, and part of why ReBAR exists, but maybe I'm misunderstanding things.)
 

bit_user

Polypheme
Ambassador
Yeah, it's always a bit odd how much detail we have on the way system memory works and how little we know about GDDR memory. Timing and bursts and all that stuff should still exist, I would think... but what are they? ¯\_(ツ)_/¯
I did find some interesting GDDR6 explainers, but I didn't notice anyone getting into the question of larger or consecutive bursts, and don't really have the time to do more digging. It'd be cool to simply ask someone who'd know.

I did try asking a well-published authority on interactive 3D graphics I know about the ratio of reads vs. writes, but I don't know him very well and have no idea if he'll answer. It's the type of question that, if he doesn't know the answer, I think would probably interest him. I'll let you know if I ever hear back.

I don't think it would need to be interleaving at cacheline size, as that's too granular with the capacities we now deal with on GPUs, but a 4K or even 8K page interleave would be sufficient. Lots of textures would be potentially many MB in size,
There's an interesting nuance that's important to consider. Texture mapping typically uses tri-linear interpolation or anisotropic filtering. This means you're only accessing the texture at the level-of-detail where there's good spatial coherence, and that's usually going to be much lower than the full-resolution. Not only that, but texture lookups are going to be non-uniform and typically won't evenly cover the entire texture, if at all.

A short aside is that it's still mind boggling to me how much data can pass through a GPU each second. Even if real-world throughput is a lot lower than the theoretical maximum, it's still a massive number. 🤯
It sure is, but I also find it very interesting to look at the ratio of FLOPS per GB/s. Or, in its reduced form, floating point ops per byte. This shows both how dependent GPUs are on decent cache hit rates, as well as just how much "math" you can afford to do to compute each pixel. I was probably first clued into this idea in this blog post by Karl Rupp:


Fantastic data, IMO. Too bad he hasn't continued updating it. He's got a GitHub link and I see it has about a dozen forks, but I haven't followed up on any of them yet.

Annoyingly, they didn't seem to post plots anywhere. You have to clone the repo (or download a zip file) and run the scripts, yourself.​

If we just use the top-line numbers, the RTX 4090 can perform about 72.5 fp32 ops per byte (do note that fp32 is 4 bytes, but I'm sticking to his units). That's about 4 times any of the architectures Rupp plotted, in his 2013 post. Granted, he's more focused on HPC, so perhaps we should instead be looking at the H100.

AMD's RX 7900 XTX should come in at 48.6 fp32 ops per byte. So, almost exactly two-thirds as much as Nvidia. That suggests either the RTX 4090 is more bandwidth-starved, or the RX 7900 XTX is more compute-bottlenecked. But it's interesting that they came to different conclusions about the optimal ratio of compute to bandwidth.
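
For anyone who wants to check the math, the ratio is just peak fp32 throughput divided by peak DRAM bandwidth. The inputs below are assumed values chosen to reproduce the figures above, not official spec-sheet numbers:

```python
# The ratio discussed above is just peak fp32 throughput divided by peak DRAM
# bandwidth. The TFLOPS and bandwidth inputs are assumed values chosen to
# reproduce the figures in the post, not official spec-sheet numbers.

def fp32_ops_per_byte(tflops: float, bandwidth_gbps: float) -> float:
    return (tflops * 1e12) / (bandwidth_gbps * 1e9)

print(round(fp32_ops_per_byte(73.1, 1008), 1))   # RTX 4090    -> ~72.5
print(round(fp32_ops_per_byte(46.7, 960), 1))    # RX 7900 XTX -> ~48.6
```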

One factor likely weighing on Nvidia's decision is the massive amount of L2 cache in the RTX 4000 generation, compared to all of their prior gaming GPUs. It literally has more L2 cache than the RX 7900 XTX has L3 cache! I think that's amazing! Also, I recall hearing that since the RTX 3000 series, the practical amount of fp32 throughput you can achieve is a lot lower than the theoretical number. So, maybe their ratios are more similar, in practice.

Even if the GDC manages the tag RAM, this would still require data to come from the GDDR6, through one MCD, to the GCD, and then pushed back out to potentially a different MCD for caching.
If the L3 is a victim cache, then the amount of data movement during loads doesn't change. Whenever you bring something in from GDDR, it's going to be triggered by a miss further up the cache hierarchy. So, that data will always be moving onto the GCD.

The place where you have some extra data movement is if the evicted cacheline is dirty. Then, when it gets evicted from L3 cache, it has to get written out. And if it's non-local to the MCD that's caching it, you need to pass it through the GCD to another MCD. How painful that is, in practice, really depends on the ratio of reads-to-writes, which is why I took interest in that matter.

Edit: I just had a flash of inspiration. If the L3 is an exclusive victim cache, then L3 would only get allocated when data is evicted from L2. And, when that happens, you'd know whether or not it's dirty (i.e. has been modified). If it's dirty, you could force it to be cached by the MCD with the corresponding GDDR chips. Otherwise, it could go anywhere! That would enable L3 cache to act as a unified cache for reads, but segmented for writes.
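
If it helps, here's that idea sketched as a toy allocation policy. Everything in it (capacities, the placement heuristic, the addresses) is my conjecture made concrete, not anything AMD has described:

```python
# Sketch of the speculative policy above: on eviction from L2 into a (hypothetical)
# victim L3, dirty lines are pinned to the MCD that owns their GDDR6, while clean
# lines can go to whichever MCD slice has the most room. Pure conjecture in code form.

NUM_MCDS = 6
MCD_CAPACITY = 4 * 2**30      # assumed 4 GB of GDDR6 behind each MCD

def home_mcd(addr: int) -> int:
    return addr // MCD_CAPACITY               # MCD whose GDDR6 backs this address

def choose_l3_slice(addr: int, dirty: bool, free_lines: list) -> int:
    if dirty:
        return home_mcd(addr)                 # writeback stays local: no extra fabric hop
    # clean victim: act like a unified cache and use whichever slice has space
    return max(range(NUM_MCDS), key=lambda m: free_lines[m])

print(choose_l3_slice(0x4000_0000, dirty=True, free_lines=[10, 50, 5, 0, 0, 0]))   # -> 0 (home MCD)
print(choose_l3_slice(0x4000_0000, dirty=False, free_lines=[10, 50, 5, 0, 0, 0]))  # -> 1 (most room)
```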

Anyway, I need to submit a list of questions on this, so here's what I've got.
OMG, so 😎! Even if you only get a couple partial answers, any clues would be awesome!

FWIW, I don't really care where the tag RAM lives. If we just know whether the MCDs cache their directly-connected GDDR DRAM (which I term "segmented") or act as a unified L3, that's the main thing I care about. I think they should be able to answer that, because their competitors could rather easily devise ways to find the answer, experimentally. So, now that the hardware is shipping, perhaps they will say.

Next, I agree that we want to know about whether & how the GDDR memory is mapped. Interleaved or not? And what's the granularity of the interleaving? Again, these are discoverable facts, by a reasonably-skilled practitioner.

The final question I'd ask would be a generic question about GDDR6, which is whether there's any benefit from doing reads from consecutive addresses from a sub-channel, or whether each burst has the same setup overhead regardless of whether it's consecutive. I dimly recall stuff about row & column address strobes, when it comes to DRAM timing, but I'm not sure if GDDR6 has the same concepts or how the rows and columns are organized (i.e., do they align with the burst size?). Personally, I wouldn't try to get into a discussion of the underlying mechanics, but would instead just focus on the key implications.

So, that roughly maps to your questions: #2, #3, and #5. However, I'd recommend not complicating question #3 with anything about pages. I think we know GPUs use pages and a CPU-like MMU/TLB. It's the native mechanism of CPUs and is necessary for multiple apps to securely share a GPU. Also, it's important for enabling GPU shaders to read host memory without opening a gaping security hole. I think we can reasonably assume pages are the same size as the host - 4 kB (dunno if they have hugepage support, but you could fake it, if not). Page management will definitely happen in the drivers/OS.

I guess, if the answer comes back that GDDR channels are mapped linearly (i.e. not interleaved), then it would be reasonable to wonder if there's interleaving implemented via the page table.
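
As a sketch of what that could look like (page size, channel count, and the linear 4 GB-per-channel layout are all assumptions on my part), the driver could round-robin physical pages across channels when it allocates:

```python
# Sketch of "interleaving via the page table": even with linearly mapped channels,
# the driver could spread an allocation by handing out physical pages from rotating
# channels. Page size, channel count, and the 4 GB-per-channel layout are assumptions.

PAGE = 4096
NUM_CHANNELS = 6
CHANNEL_SPAN = 4 * 2**30          # assumed contiguous 4 GB region per channel

def alloc_pages(num_pages: int, next_free: list) -> list:
    """Return physical page addresses for an allocation, round-robining across channels."""
    phys = []
    for i in range(num_pages):
        ch = i % NUM_CHANNELS                      # rotate channels page by page
        phys.append(ch * CHANNEL_SPAN + next_free[ch] * PAGE)
        next_free[ch] += 1
    return phys

print([hex(p) for p in alloc_pages(8, [0] * NUM_CHANNELS)])
```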


Also, do modern GPUs still work with a 32-bit model using segment:offset stuff to access more than 4GB? (That would seem to be the case, and part of why ReBAR exists, but maybe I'm misunderstanding things.)
Looking at the Vega 20 (7 nm) ISA manual, it seems to have full arithmetic support for 64-bit scalar ints and 64-bit addressing. I don't imagine RDNA walked back on that...

"9.3. Addressing
FLAT instructions support both 64- and 32-bit addressing. The address size is set using a mode register (PTR32), and a local copy of the value is stored per wave.​
The addresses for the aperture check differ in 32- and 64-bit mode; however, this is not covered here.​
64-bit addresses are stored with the LSBs in the VGPR at ADDR, and the MSBs in the VGPR at ADDR+1. "​

Note how the 64-bit values are stored in register pairs, however.

Edit: here's the analogous bit from the RDNA2 ISA manual:
9.3.1. Legal Addressing Combinations
Not every combination of addressing modes is legal for each type of instruction. The legal combinations are:​
• FLAT​
a. VGPR (32 or 64 bit) supplies the complete address. SADDR must be NULL.​
• Global​
a. VGPR (32 or 64 bit) supplies the address. Indicated by: SADDR == NULL.​
b. SGPR (64 bit) supplies an address, and a VGPR (32 bit) supplies an offset​
• SCRATCH​
a. VGPR (32 bit) supplies an offset. Indicated by SADDR==NULL.​
b. SGPR (32 bit) supplies an offset. Indicated by SADDR!=NULL.​
Every mode above can also add the "instruction immediate offset" to the address.​
 