Question Nvidia, does each SM have its own, independent 32-bit memory data bus?

80251 · Jan 29, 2025

According to this https://forums.developer.nvidia.com/t/what-is-cores-per-sm/29997/4 CUDA cores are part of SMs. So does each SM have its own MMU, connected over a 32-bit memory data bus to specific VRAM ICs? Or is there an independent MMU that accesses all VRAM on behalf of the individual SMs using some sort of ring bus or Infinity Fabric? I thought I had read somewhere in the past that each SM had its own independent 32-bit memory data bus.

Eximo · Jan 29, 2025

From some quick searching, sort of? There certainly is a 32-bit data bus for each SM (128 cores), but it sounds like the MMU is virtualized, allowing them all to share the total memory pool. How that works 'physically' is not clear to me.

https://images.nvidia.cn/aem-dam/So...ell/nvidia-rtx-blackwell-gpu-architecture.pdf

80251 · Jan 29, 2025

Thanks Eximo, your answer raises almost as many questions as it answers. Maybe each SM has its own MMU and they communicate with each other over some sort of bus? Maybe each individual SM MMU controls a specific address space, so that if another SM MMU needed data from that address space it would know which SM MMU to send its request to? In that case the video card would be a NUMA device, despite the fact that all the SMs are on the same die. Are the L2 caches specific to each SM?
The clock waveforms for GDDR6X and GDDR7 memory were amazing to see, way beyond edge-triggered or level-sensitive latches.
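
To make the "each controller owns a slice of the address space" idea concrete, here is a minimal Python sketch of address interleaving, where a requester can work out which memory channel owns an address from the address alone. The channel count, the interleave size, and the function names are all made up for illustration; none of them come from an Nvidia document, and real GPUs may map addresses quite differently.

```python
# Hypothetical illustration only: the channel count and interleave size
# below are assumptions, not values from any Nvidia documentation.
NUM_CHANNELS = 8          # assumed number of 32-bit memory channels
INTERLEAVE_BYTES = 256    # assumed size of each contiguous chunk per channel


def channel_for_address(addr: int) -> int:
    """Return the index of the channel that would own this physical address."""
    return (addr // INTERLEAVE_BYTES) % NUM_CHANNELS


def offset_in_channel(addr: int) -> int:
    """Return the byte offset of the address inside its owning channel."""
    chunk = addr // INTERLEAVE_BYTES
    return (chunk // NUM_CHANNELS) * INTERLEAVE_BYTES + addr % INTERLEAVE_BYTES


if __name__ == "__main__":
    # Consecutive 256-byte chunks land on consecutive channels.
    for addr in (0, 256, 2048, 2304):
        print(hex(addr), channel_for_address(addr), offset_in_channel(addr))
```

With a scheme like this, no lookup table is needed: any requester can compute the owner of an address directly, which is one way a shared pool can sit behind several narrow buses.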

Eximo · Jan 30, 2025

The L0 and L1 caches are on the SM, and the large L2 cache pool must be shared.
If each quadrant of the SM can pump out 32 threads per clock, perhaps there is an internal bus between them? They don't really go into detail on what the Load/Store blocks or the Special Function Units are capable of. That might be covered in older white papers, though.
The Load/Store units might be able to hold enough data while waiting for the bus between the shaders to be free to transmit on, but I am sure the block diagram leaves a lot unsaid.

80251 · Jan 30, 2025

@Eximo, I noticed that back when you could directly mod Nvidia vBIOSes for Maxwell, there were four clocks (aside from the VRAM frequency) you could modify: GPC = core clock, L2C = L2 cache, XBAR = crossbar, SYS = system. Are these four frequency domains still present on modern Nvidia GPUs?
If there is a shared L2 cache for all SMs, wouldn't that require an MMU as well? I don't think you can have a cache without an MMU, because something has to monitor the RAM addresses being requested to determine whether there's a cache hit or miss.
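
For reference, the hit/miss check in a cache is normally done by the cache itself comparing stored address tags, while an MMU's job is translating virtual addresses to physical ones. Below is a minimal Python sketch of a direct-mapped cache doing that tag comparison. The line size, line count, and class name are assumptions chosen for illustration; this is not a description of how Nvidia's L2 is actually organized.

```python
# Hypothetical direct-mapped cache; sizes are assumptions for illustration.
LINE_BYTES = 128   # assumed cache line size
NUM_LINES = 4      # deliberately tiny so conflicts are easy to see


class DirectMappedCache:
    def __init__(self):
        # One stored tag per line; None means the line is still empty.
        self.tags = [None] * NUM_LINES

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        line_number = addr // LINE_BYTES
        index = line_number % NUM_LINES   # which line the address maps to
        tag = line_number // NUM_LINES    # identifies which block is stored there
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag            # replace whatever was in that line
        return False


if __name__ == "__main__":
    cache = DirectMappedCache()
    for addr in (0, 64, 512, 0):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

The point of the sketch is that the address monitoring lives in the cache's own tag logic, so a shared L2 does not by itself imply a separate MMU per cache.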

Eximo · Jan 30, 2025

I don't see why they wouldn't be, but since Nvidia locked down BIOS editing to online-only tools, it's not really something I have delved into. More often I underclock my GPU these days; I can get about 95% of the performance at 80% of the power. Back in the Maxwell days I didn't have much need to overclock. I was running dual 980s, and I think I only tweaked the memory.

A virtualized MMU is mentioned in the Blackwell architecture paper. I'm not really sure where that would sit, but I would assume it can reach everything; otherwise, how would it manage things?
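
As a quick sanity check on the underclocking figures mentioned above, here is a small Python sketch of the performance-per-watt arithmetic. The function name is made up for illustration, and the 95% and 80% inputs are the rough estimates from the post, not measurements.

```python
def relative_efficiency(perf_fraction, power_fraction):
    """Performance per watt relative to stock settings (stock = 1.0)."""
    if power_fraction <= 0:
        raise ValueError("power_fraction must be positive")
    return perf_fraction / power_fraction


if __name__ == "__main__":
    # Rough figures from the post: about 95% performance at 80% power.
    print(round(relative_efficiency(0.95, 0.80), 4))
```

With those inputs the ratio comes out to roughly 1.19, or about 19% better performance per watt than stock.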
