A GPU deals with huge amounts of data; I don't see how caching would solve the memory bandwidth problem here... it would just add latency if it still has to fetch from VRAM most of the time.
Pre-emptive scheduling. It's no different from how a CPU prefetches data into cache when the decode engine starts reading instructions and predicts which sections of memory they will touch.
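To make the CPU analogy concrete, here's a minimal sketch of software prefetching on a predictable access pattern. `__builtin_prefetch` is a GCC/Clang intrinsic, and the lookahead distance of 16 is an arbitrary illustrative choice, not a tuned value:

```c
#include <stddef.h>

/* Walk an array and hint the cache about data we know we'll need
   soon, so the fetch overlaps with the work on earlier elements. */
static long sum_with_prefetch(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint: we'll read data[i + 16] shortly (0 = read,
           1 = low temporal locality). Purely advisory. */
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 1);
        sum += data[i];
    }
    return sum;
}
```

The GPU version the posts below describe is the same idea one level up: instead of hinting single cache lines, the scheduler stages a whole tile's worth of textures before the CU asks for them.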
New way:
<prefetch> Okay, I'm going to assign a CU to render a block in the upper left hand corner. Let's grab all the predicted textures in advance and create a small frame buffer for it.
<CU> Setting up triangles and calculating lighting in advance. Also doing ray hit testing. (200 cycles)
<CU> Okay, I'm ready to apply texture 1. Luckily for me it's already in the cache and I'm ready to go. (5 cycle penalty)
<CU> Okay, I'm ready to apply texture 2. Already in the cache, ready to go. (5 cycle penalty)
<CU> Okay, I'm ready to apply texture 3. Already in the cache, ready to go. (5 cycle penalty)
Old way:
<prefetch> Okay, I'm going to assign a CU to render a block in the upper left hand corner.
<CU> Setting up triangles and calculating lighting in advance. Also doing ray hit testing. (200 cycles)
<CU> Okay, I'm ready to apply texture 1. Let me retrieve it from VRAM. (20 cycle penalty)
<CU> Okay, I need texture 2. Let me retrieve that from VRAM. (20 cycle penalty)
<CU> Okay, I need texture 3. Let me retrieve that from VRAM. (20 cycle penalty)
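A quick back-of-envelope comparison using the numbers from the two dialogues above (200 cycles of setup, 5 cycles per cached texture hit, 20 cycles per VRAM fetch, 3 textures). The cycle counts are the dialogue's illustrative figures, not measured values:

```c
/* Illustrative per-block cycle totals for the two scenarios. */
enum {
    SETUP      = 200, /* triangle setup, lighting, ray hit testing */
    CACHE_HIT  = 5,   /* texture already prefetched into cache */
    VRAM_FETCH = 20,  /* texture pulled from VRAM on demand */
    TEXTURES   = 3
};

static int block_cycles(int per_texture_penalty)
{
    return SETUP + TEXTURES * per_texture_penalty;
}

/* block_cycles(CACHE_HIT)  -> 215 cycles (new way)
   block_cycles(VRAM_FETCH) -> 260 cycles (old way) */
```

Even with these made-up numbers the point holds: the prefetch hides the VRAM latency behind the 200 cycles of setup work the CU was going to do anyway.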
CUs handle small blocks at a time, so you only need a relatively small chunk of cache for each one.
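A rough footprint estimate shows why a per-block cache can stay small. The tile dimensions, texel size, and texture count here are all hypothetical, chosen just to show the arithmetic:

```c
/* Hypothetical per-CU tile: 32x32 texels, 4 bytes per texel
   (e.g. RGBA8), 3 textures resident at once. */
enum {
    TILE_W          = 32,
    TILE_H          = 32,
    BYTES_PER_TEXEL = 4,
    NUM_TEXTURES    = 3
};

static int tile_cache_bytes(void)
{
    /* 32 * 32 * 4 * 3 = 12288 bytes, i.e. 12 KiB per CU tile. */
    return TILE_W * TILE_H * BYTES_PER_TEXEL * NUM_TEXTURES;
}
```

Tens of kilobytes per CU is well within what on-chip SRAM can provide, which is the whole reason the block-at-a-time approach works.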
Plus, cross-CU cache coherency delays are greatly reduced. That matters when one CU's block is reading/writing memory that another CU is touching in the same address space. (It's also why Crossfire/SLI didn't work well and had glitches.)
See the diff?