Question Nvidia, does each SM have its own, independent 32-bit memory data bus?

80251

According to this thread, https://forums.developer.nvidia.com/t/what-is-cores-per-sm/29997/4, CUDA cores are part of SMs. So does each SM have its own MMU, with a 32-bit memory data bus to specific VRAM ICs? Or is there an independent MMU that accesses all VRAM on behalf of the individual SMs, using some sort of ring bus or something like Infinity Fabric? I thought I had read somewhere in the past that each SM had its own independent 32-bit memory data bus.
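A quick arithmetic check on the one-bus-per-SM idea, as a rough sketch only. The figures are the public RTX 4090 specs (384-bit bus, 128 enabled SMs), picked purely as an example, and the 32-bit channel width is taken from the question itself. The variable names are made up for illustration.

```python
# Back-of-the-envelope check: do SM counts line up with 32-bit memory channels?
# Example figures: public RTX 4090 specs, used only for illustration.
BUS_WIDTH_BITS = 384      # total memory bus width
CHANNEL_WIDTH_BITS = 32   # width of one memory channel, as assumed in the question
SM_COUNT = 128            # enabled SMs on this card

channels = BUS_WIDTH_BITS // CHANNEL_WIDTH_BITS
sms_per_channel = SM_COUNT / channels

print(channels)                   # 12
print(round(sms_per_channel, 2))  # 10.67
```

With roughly ten times more SMs than 32-bit channels on this card, a private 32-bit bus per SM does not look possible, at least on that example.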
 
Thanks Eximo, your answer brings up almost as many questions as it answers. Maybe each SM has its own MMU and they communicate with each other over some sort of bus? Maybe if each SM's MMU controls a specific address space, then another SM's MMU that needed data from that address space would know which SM's MMU to send its request to? In that case the video card would be a NUMA device, despite the fact that all the SMs are on the same die. Are the L2 caches specific to each SM?
The waveforms for the clocks for GDDR6X and GDDR7 memory were amazing to see, way beyond edge-triggered or level-sensitive latches.
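The idea that every address has a "home" that requests get routed to can be sketched as a simple interleaving function. This is a toy model only: Nvidia does not publish its real address-to-partition mapping, so `NUM_PARTITIONS`, `INTERLEAVE_BYTES` and `home_partition` are all hypothetical. It also assumes the "home" is a memory partition rather than an SM, which is my reading of the public block diagrams and not a confirmed detail.

```python
# Toy model only: the real address-to-partition mapping is not public.
NUM_PARTITIONS = 12       # hypothetical: one partition per 32-bit memory channel
INTERLEAVE_BYTES = 256    # hypothetical interleave granularity


def home_partition(phys_addr: int) -> int:
    """Return the index of the memory partition that owns a physical address."""
    return (phys_addr // INTERLEAVE_BYTES) % NUM_PARTITIONS


# Consecutive 256-byte blocks rotate across partitions, so the traffic from
# any single SM is spread over all of them instead of sticking to one.
owners = [home_partition(a) for a in range(0, 4 * INTERLEAVE_BYTES, INTERLEAVE_BYTES)]
print(owners)  # [0, 1, 2, 3]
```

Under a scheme like this, every SM would see the same mix of near and far partitions, so it would behave more like uniform memory behind an interconnect than like NUMA.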
 
The L0 and L1 caches are on the SM, and the large L2 cache pool must be shared.
If each quadrant of the SM can pump out 32 threads per clock, perhaps there is an internal bus between them? They don't really go into detail on what the Load/Store blocks or the Special Function Units are capable of. That might be available in older white papers, though.
The Load/Store units might be able to hold enough data while waiting for the bus between the shaders to be free to transmit on, but I am sure the block diagram leaves much unspoken.
 
@Eximo, I noticed that back when you could directly mod Nvidia vBIOSes for Maxwell, there were four clocks you could modify (aside from the VRAM frequency): GPC (core clock), L2C (L2 cache), XBAR (crossbar), and SYS (system). Are these four frequency domains still present on modern Nvidia GPUs?
If there is a shared L2 cache for all SMs, wouldn't that require an MMU as well? I don't think you can have a cache without an MMU, because something has to monitor the RAM addresses being requested to determine whether there's a cache hit or miss.
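For what it's worth, the hit/miss check described above is normally done by the cache's own tag-compare logic, and address translation (the MMU's job) is a separate step. Here is a minimal sketch of that check with a toy direct-mapped cache. The line size, line count and the `access` helper are hypothetical and are not taken from any Nvidia documentation.

```python
# Toy direct-mapped cache: hit/miss is decided by comparing address tags.
# No address translation is involved in this check.
LINE_BYTES = 128   # hypothetical cache line size
NUM_LINES = 4      # hypothetical, kept tiny so the example is easy to follow

tags = [None] * NUM_LINES  # tag stored for each cache line, None means empty


def access(addr: int) -> bool:
    """Return True on a hit; on a miss, fill the line and return False."""
    line_addr = addr // LINE_BYTES
    index = line_addr % NUM_LINES   # which cache line this address maps to
    tag = line_addr // NUM_LINES    # identifies which memory block is cached there
    if tags[index] == tag:
        return True
    tags[index] = tag               # miss: replace whatever was in the line
    return False


results = [access(a) for a in (0, 64, 128, 512, 0)]
print(results)  # [False, True, False, False, False]
```

The second access hits because address 64 is in the same line as address 0. The last access misses because address 512 mapped to the same line and evicted it.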
 
I don't see why they wouldn't be, but since Nvidia locked down BIOS editing to online-only tools, it's not really something I have delved into. More often I underclock my GPU these days; I can get about 95% of the performance at 80% of the power. In the Maxwell days I didn't have much need to overclock. I was running dual 980s, and I think I only tweaked the memory.
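To put the underclocking figures in efficiency terms, here is the arithmetic as a small sketch. The 95% and 80% values are the rough numbers quoted above, not measurements, and the variable names are just for illustration.

```python
# Rough figures from the post above, not measured values.
performance_pct = 95   # performance relative to stock
power_pct = 80         # power draw relative to stock

# Relative performance-per-watt improvement over stock settings.
perf_per_watt_gain = performance_pct / power_pct - 1
print(f"{perf_per_watt_gain:.2%}")  # 18.75%
```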

The architecture material for Blackwell mentions a virtualized MMU. I'm not really sure where that would sit, but I would assume it can reach everything; otherwise, how would it manage things?