> One of the biggest issues with GPGPU is that it's often not worth it to move a computation to a discrete GPU, even if that part of the algorithm is well suited to GPU parallelism, because the data transfer takes longer than the time the GPU saves on the computation. So, faster interfaces would allow GPU acceleration in a wider variety of programs.

Well said. I do think offloading in-line computations (where this tends to be an issue) is always going to be a challenge, because there's always going to be some communication latency. So, you're generally not going to see really fine-grained compute being offloaded.
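As a rough illustration (a back-of-envelope sketch with entirely hypothetical numbers, not a benchmark): an offload only wins when dispatch latency plus the round-trip transfer plus the kernel time beats doing the work in place.

```python
# Back-of-envelope offload break-even check. All numbers are hypothetical
# placeholders; measure your own hardware before trusting any of this.

def offload_wins(data_bytes, cpu_time_s, gpu_time_s,
                 pcie_bw=32e9,       # uni-directional PCIe 4.0 x16, bytes/s
                 dispatch_s=10e-6):  # assumed per-dispatch latency
    """True if shipping the data out and back plus the kernel beats the CPU."""
    transfer_s = 2 * data_bytes / pcie_bw  # to the GPU and back again
    return dispatch_s + transfer_s + gpu_time_s < cpu_time_s

# A 10 ms CPU task over 256 MiB, 10x faster as a GPU kernel: the ~17 ms
# round-trip transfer alone already loses, so this prints False.
print(offload_wins(256 * 2**20, cpu_time_s=10e-3, gpu_time_s=1e-3))
```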
I think an interesting metric is compute throughput per unit of bus bandwidth: FLO per PCIe byte. For computations that can be asynchronously dispatched, this can reveal potential bottlenecks. Below, I'm just using nominal, spec-sheet peak numbers:
| GPU | fp32 TFLOPS | PCIe bandwidth (uni-dir, GB/s) | fp32 kFLO per PCIe byte (uni-dir) |
|---|---|---|---|
| RTX 3090 | 35.6 | 32 | 1.11 |
| RTX 4090 | 82.6 | 32 | 2.58 |
| RX 6950 XT | 47.3 | 32 | 1.48 |
| RX 7900 XTX | 61.4 | 32 | 1.92 |
| MTT S80 | 14.4 | 64 | 0.23 |
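For what it's worth, the last column is just peak fp32 throughput divided by uni-directional PCIe bandwidth; a quick sketch reproducing it from the specs above:

```python
# Recompute the "fp32 kFLO per PCIe byte" column from the nominal specs.
specs = {  # name: (peak fp32 TFLOPS, uni-directional PCIe GB/s)
    "RTX 3090":    (35.6, 32),
    "RTX 4090":    (82.6, 32),
    "RX 6950 XT":  (47.3, 32),
    "RX 7900 XTX": (61.4, 32),
    "MTT S80":     (14.4, 64),
}
for name, (tflops, gbps) in specs.items():
    kflo_per_byte = (tflops * 1e12) / (gbps * 1e9) / 1e3
    print(f"{name:12} {kflo_per_byte:5.2f} kFLO/byte")
```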
So, the MTT S80 has about 11x the PCIe bandwidth per fp32 FLOP of the RTX 4090. Still, if what you're doing is computationally cheap, it'll be bandwidth-limited below roughly 230 FLO per byte transferred (920 FLO per fp32 value). Then again, if your compute runs close to memcpy speed, there's not much reason to dispatch it to a GPU in the first place.
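Put another way, a workload streamed over PCIe sits under a roofline: effective throughput is min(peak FLOPS, intensity x link bandwidth), where intensity is FLO per byte moved over the link. A sketch with the S80's nominal numbers; the crossover lands right at the ~230 FLO/byte figure above:

```python
# PCIe roofline: effective fp32 throughput for a streamed workload as a
# function of arithmetic intensity (FLO per byte moved over the link).
PEAK_FLOPS = 14.4e12  # MTT S80 nominal fp32
PCIE_BW = 64e9        # bytes/s, uni-directional

def effective_tflops(flo_per_byte):
    return min(PEAK_FLOPS, flo_per_byte * PCIE_BW) / 1e12

for intensity in (1, 10, 100, 225, 1000):
    print(f"{intensity:5} FLO/byte -> {effective_tflops(intensity):.2f} TFLOPS")
```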