Yeah, it's always a bit odd how much detail we have on the way system memory works and how little we know about GDDR memory. Timing and bursts and all that stuff should still exist, I would think... but what are they? ¯\_(ツ)_/¯
I did find some interesting GDDR6 explainers, but I didn't notice anyone getting into the question of larger or consecutive bursts, and don't really have the time to do more digging. It'd be cool to simply ask someone who'd know.
I did try asking a well-published authority on interactive 3D graphics I know about the ratio of reads vs. writes, but I don't know him very well and have no idea if he'll answer. It's the type of question that, even if he doesn't know the answer, would probably interest him. I'll let you know if I ever hear back.
I don't think it would need to interleave at cacheline size, as that's too granular for the capacities we now deal with on GPUs; a 4K or even 8K page interleave would be sufficient. Lots of textures are potentially many MB in size, so they'd still get spread across every channel, even at page granularity.
There's an important nuance to consider here. Texture mapping typically uses tri-linear interpolation or anisotropic filtering. That means you're only accessing the texture at the level of detail where there's good spatial coherence, which is usually much lower than full resolution. Not only that, but texture lookups are non-uniform and typically won't cover the entire texture evenly, if at all.
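To put rough numbers on that, here's a quick Python sketch of a simplified version of the standard mip-level selection rule (the log2-of-texel-footprint formula from the GL/DX specs); the derivative values are invented just for illustration:

```python
import math

def mip_lod(du_dx, dv_dx, du_dy, dv_dy, tex_w, tex_h):
    # Convert UV-per-pixel derivatives into texels-per-pixel footprints.
    px = math.hypot(du_dx * tex_w, dv_dx * tex_h)
    py = math.hypot(du_dy * tex_w, dv_dy * tex_h)
    # Pick the level where one texel step roughly matches one pixel step.
    return max(0.0, math.log2(max(px, py)))

# 8 texels of a 4096x4096 texture per screen pixel -> sampled at level 3,
# i.e. from the 512x512 mip, touching far less data than level 0 would.
print(mip_lod(8/4096, 0.0, 0.0, 8/4096, 4096, 4096))  # 3.0
```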
A short aside: it's still mind-boggling to me how much data can pass through a GPU each second. Even if real-world throughput is a lot lower than the theoretical maximum, it's still a massive number. 🤯
It sure is, but I also find it very interesting to look at the ratio of FLOPS per GB/s. Or, in its reduced form, floating-point ops per byte. This shows both how dependent GPUs are on decent cache hit rates and just how much "math" you can afford to do to compute each pixel. I was probably first clued into this idea by Karl Rupp's blog post comparing CPU, GPU, and MIC hardware characteristics.
Fantastic data, IMO. Too bad he hasn't continued updating it. He's got a GitHub repo (karlrupp/cpu-gpu-mic-comparison, the data repository supplementing the blog post) and I see it has about a dozen forks, but I haven't followed up on any of them yet.
Annoyingly, he didn't seem to post plots anywhere. You have to clone the repo (or download a zip file) and run the scripts yourself.
If we just use the top-line numbers, the RTX 4090 can perform about 72.5 fp32 ops per byte (note that an fp32 value is 4 bytes, but I'm sticking to his units). That's about 4 times any of the architectures Rupp plotted in his 2013 post. Granted, he's more focused on HPC, so perhaps we should instead be looking at the H100.
AMD's RX 7900 XTX should come in at 48.6 fp32 ops per byte. So, almost exactly two-thirds as much as Nvidia. That suggests either the RTX 4090 is more bandwidth-starved, or the RX 7900 XTX is more compute-bottlenecked. But it's interesting that they came to different conclusions about the optimal ratio of compute to bandwidth.
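For reference, here's the back-of-the-envelope math in Python. The TFLOPS and bandwidth figures are my guess at the spec-sheet inputs (base-clock fp32 rates) that reproduce the numbers above:

```python
# Arithmetic intensity from top-line specs: fp32 ops per byte of DRAM bandwidth.
gpus = {
    "RTX 4090":    (73.1, 1008.0),  # fp32 TFLOPS, GB/s (384-bit GDDR6X @ 21 Gbps)
    "RX 7900 XTX": (46.7,  960.0),  # fp32 TFLOPS, GB/s (384-bit GDDR6  @ 20 Gbps)
}
for name, (tflops, gb_per_s) in gpus.items():
    ops_per_byte = (tflops * 1e12) / (gb_per_s * 1e9)
    print(f"{name}: {ops_per_byte:.1f} fp32 ops per byte")
# RTX 4090: 72.5 fp32 ops per byte
# RX 7900 XTX: 48.6 fp32 ops per byte
```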
One factor likely weighing on Nvidia's decision is the massive amount of L2 cache in the RTX 4000 generation, compared to all of their prior gaming GPUs. It has L2 cache on the same order as the RX 7900 XTX's entire L3 cache! I think that's amazing! Also, I recall hearing that, since the RTX 3000 series, the practical fp32 throughput you can achieve is a lot lower than the theoretical number. So, maybe their ratios are more similar, in practice.
Even if the GCD manages the tag RAM, this would still require data to come from the GDDR6, through one MCD, to the GCD, and then be pushed back out to a potentially different MCD for caching.
If the L3 is a victim cache, then the amount of data movement during loads doesn't change. Whenever you bring something in from GDDR, it's going to be triggered by a miss further up the cache hierarchy. So, that data will always be moving onto the GCD.
The place where you get some extra data movement is if the evicted cacheline is dirty. Then, when it gets evicted from L3 cache, it has to be written out. And if it's non-local to the MCD that's caching it, you need to pass it through the GCD to another MCD. How painful that is, in practice, really depends on the ratio of reads to writes, which is why I took an interest in that question.
Edit: I just had a flash of inspiration. If the L3 is an exclusive victim cache, then L3 would only get allocated when data is evicted from L2. And, when that happens, you'd know whether or not it's dirty (i.e. has been modified). If it's dirty, you could force it to be cached by the MCD with the corresponding GDDR chips. Otherwise, it could go anywhere! That would enable the L3 cache to act as a unified cache for reads, but segmented for writes.
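Here's a minimal sketch of that placement policy. The MCD count is Navi 31's, but the interleave granularity and the load-balancing pick for clean lines are just my assumptions:

```python
NUM_MCDS = 6        # Navi 31 has six MCDs, each fronting its own GDDR6
INTERLEAVE = 4096   # assumed page-granular channel interleave

def home_mcd(phys_addr: int) -> int:
    """The MCD whose GDDR6 chips actually back this address."""
    return (phys_addr // INTERLEAVE) % NUM_MCDS

def l3_slice_for_victim(phys_addr: int, dirty: bool, load: list[int]) -> int:
    """Pick which MCD's L3 slice receives a line evicted from L2."""
    if dirty:
        # A dirty line must eventually be written back to its own GDDR6,
        # so keep it at its home MCD and skip a cross-MCD hop later.
        return home_mcd(phys_addr)
    # A clean line never needs a writeback, so any slice may hold it;
    # here, just pick the least-loaded one (unified read caching).
    return min(range(NUM_MCDS), key=lambda i: load[i])
```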
Anyway, I need to submit a list of questions on this, so here's what I've got.
OMG, so 😎! Even if you only get a couple partial answers, any clues would be awesome!
FWIW, I don't really care where the tag RAM lives. If we just know whether the MCDs cache their directly-connected GDDR DRAM (which I term "segmented") or act as a unified L3, that's the main thing I care about. I think they should be able to answer that, because their competitors could rather easily devise ways to find the answer experimentally. So, now that the hardware is shipping, perhaps they will say.
Next, I agree that we want to know whether & how the GDDR memory is mapped. Interleaved or not? And what's the granularity of the interleaving? Again, these are facts discoverable by a reasonably skilled practitioner.
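For illustration, here's what different granularities would look like in the address mapping, assuming a simple modulo scheme; both the channel count and the granularities are placeholders, not known facts:

```python
NUM_CHANNELS = 12   # e.g. twelve 32-bit channels on a 384-bit bus

def channel(addr: int, granularity: int) -> int:
    return (addr // granularity) % NUM_CHANNELS

# Which channel each consecutive 256 B block of a linear range lands on:
for gran in (256, 4096):
    pattern = [channel(i * 256, gran) for i in range(24)]
    print(f"interleave={gran:4d}: {pattern}")
# 256:  0, 1, 2, ..., 11, 0, 1, ...  (every block hops to the next channel)
# 4096: 0 sixteen times, then 1, ... (a whole page stays on one channel)
```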
The final question I'd ask would be a generic one about GDDR6: is there any benefit to doing reads from consecutive addresses on a sub-channel, or does each burst carry the same setup overhead regardless of whether it's consecutive? I dimly recall stuff about row & column address strobes when it comes to DRAM timing, but I'm not sure whether GDDR6 has the same concepts, or how its rows and columns are sized (i.e. do they align with the burst size?). Personally, I wouldn't try to get into a discussion of the underlying mechanics, but would instead just focus on the key implications.
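FWIW, here's my reading of the basic burst arithmetic, pieced together from public GDDR6 overviews (worth confirming with whoever answers the questions):

```python
channel_width_bits = 16  # each GDDR6 chip exposes two independent 16-bit channels
burst_length = 16        # BL16, i.e. a 16n prefetch
bytes_per_burst = (channel_width_bits // 8) * burst_length
print(bytes_per_burst)   # 32 bytes per burst, per sub-channel
# Consecutive bursts within an already-open row should only need new column
# addresses; hitting a different row adds the activate/precharge timing --
# assuming GDDR6 behaves like other DRAM here, which is exactly the question.
```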
So, that roughly maps to your questions: #2, #3, and #5. However, I'd recommend not complicating question #3 with anything about pages. I think we know GPUs use pages and a CPU-like MMU/TLB. It's the native mechanism of CPUs and is necessary for multiple apps to securely share a GPU. Also, it's important for enabling GPU shaders to read host memory without opening a gaping security hole. I think we can reasonably assume pages are the same size as on the host - 4 kB (dunno if they have hugepage support, but you could fake it, if not). Page management will definitely happen in the drivers/OS.
I guess, if the answer comes back that GDDR channels are mapped linearly (i.e. not interleaved), then it would be reasonable to wonder if there's interleaving implemented via the page table.
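Something like this hypothetical scheme, where the physical map is linear per channel and the driver hands out pages round-robin (all names and sizes invented):

```python
NUM_CHANNELS = 12
CHANNEL_SPAN = 2 * 1024**3   # e.g. 2 GiB of physical address space per channel
PAGE = 4096

def phys_page_for(virt_page_idx: int) -> int:
    """Map the Nth virtual page to a physical page, striping across channels."""
    ch = virt_page_idx % NUM_CHANNELS             # round-robin channel choice
    page_within = virt_page_idx // NUM_CHANNELS   # next page inside that channel
    return ch * CHANNEL_SPAN + page_within * PAGE
```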
Also, do modern GPUs still work with a 32-bit model using segment:offset stuff to access more than 4GB? (That would seem to be the case, and part of why ReBAR exists, but maybe I'm misunderstanding things.)
Looking at the Vega 20 (7 nm) ISA manual, it seems to have full arithmetic support for 64-bit scalar ints and 64-bit addressing. I don't imagine RDNA walked back on that...
"9.3. Addressing
FLAT instructions support both 64- and 32-bit addressing. The address size is set using a mode register (PTR32), and a local copy of the value is stored per wave.
The addresses for the aperture check differ in 32- and 64-bit mode; however, this is not covered here.
64-bit addresses are stored with the LSBs in the VGPR at ADDR, and the MSBs in the VGPR at ADDR+1."
Note how the 64-bit values are stored in register pairs, however.
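Just to spell out what that excerpt means, here's the split in Python (the address value is made up):

```python
# Per the manual: a 64-bit address occupies a VGPR pair, LSBs first.
addr64 = 0x0000_7F3A_DEAD_BEEF
v_addr    = addr64 & 0xFFFF_FFFF          # VGPR at ADDR   -> 0xDEADBEEF
v_addr_p1 = (addr64 >> 32) & 0xFFFF_FFFF  # VGPR at ADDR+1 -> 0x00007F3A
```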
Edit: here's the analogous bit from the RDNA2 ISA manual:
9.3.1. Legal Addressing Combinations
Not every combination of addressing modes is legal for each type of instruction. The legal combinations are:
• FLAT
a. VGPR (32 or 64 bit) supplies the complete address. SADDR must be NULL.
• Global
a. VGPR (32 or 64 bit) supplies the address. Indicated by: SADDR == NULL.
b. SGPR (64 bit) supplies an address, and a VGPR (32 bit) supplies an offset
• SCRATCH
a. VGPR (32 bit) supplies an offset. Indicated by SADDR==NULL.
b. SGPR (32 bit) supplies an offset. Indicated by SADDR!=NULL.
Every mode above can also add the "instruction immediate offset" to the address.