And once again, a wavefront is not an execution unit any more than a 'thread' is an FPU. You're fundamentally missing the point.

The wavefront is an instruction stream, just like a warp, or what CPU folks call a "thread". Nvidia likes to confuse people by pretending each SIMD lane is a thread, so we have to talk about warps instead.
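To make the terminology concrete, here's a minimal CUDA sketch of my own (not from the article): the 32 "threads" in a warp are really SIMD lanes marching through one shared instruction stream, which is exactly why warp-wide primitives like __shfl_sync can move data between lanes in a single instruction.

```cpp
#include <cstdio>

__global__ void one_stream_many_lanes() {
    int lane = threadIdx.x & 31;              // lane index within the warp
    // Every lane executes this same add; the "parallelism" here is SIMD width,
    // not 32 independent instruction streams.
    int value = lane * 2;
    // Warp-wide shuffle: lane 0's value is broadcast to all 32 lanes in one instruction.
    int broadcast = __shfl_sync(0xffffffff, value, 0);
    if (lane == 0)
        printf("lane 0 computed %d, every lane received %d\n", value, broadcast);
}

int main() {
    one_stream_many_lanes<<<1, 32>>>();       // exactly one warp
    cudaDeviceSynchronize();
    return 0;
}
```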
Instruction issuance is not instruction execution. Instructions take more than one cycle to complete. We're talking about the actual execution, which takes place over multiple clock cycles, not the dispatching.

First, we should rewind our mental model back to the era of in-order CPUs, like the original Pentium. If you think about a single- or dual-issue in-order CPU, you can appreciate that the instruction stream can dispatch at most one or two instructions each cycle, regardless of how much concurrency the backend could sustain. If you're issuing a tensor product instruction, that displaces your ability to issue some other operation in that same slot.
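If it helps, here's a toy front-end model (purely my own illustration, not any real ISA): a dual-issue in-order machine dispatches at most two instructions per cycle, so every "mma" that goes through the front end is an issue slot a scalar op could not use that cycle, regardless of how long either takes to execute afterwards.

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Instr { std::string name; };

// Cycles of front-end bandwidth needed just to dispatch the stream.
int cycles_to_issue(const std::vector<Instr>& stream, int issue_width) {
    int cycles = 0;
    for (size_t i = 0; i < stream.size(); i += issue_width)
        ++cycles;                              // each cycle dispatches up to issue_width instructions
    return cycles;
}

int main() {
    std::vector<Instr> stream = {
        {"fadd"}, {"mma"}, {"fmul"}, {"mma"}, {"fadd"}, {"fadd"}
    };
    int mma_slots = 0;
    for (const auto& ins : stream)
        if (ins.name == "mma") ++mma_slots;    // slots consumed by matrix ops

    printf("dual-issue front end: %d cycles just to dispatch\n",
           cycles_to_issue(stream, 2));
    printf("%d issue slots went to matrix ops instead of scalar ops\n", mma_slots);
    return 0;
}
```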
It's physically different hardware performing the execution.

In the article, Paul said that Tensor Cores are fundamentally different from WMMA. At best, there are some implementation differences in Tensor Cores that might support higher throughput, but they're not really different in kind.
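For reference, this is roughly what the warp-level WMMA path looks like in CUDA (a bare-bones sketch, error checking omitted; needs one warp and sm_70 or newer). The whole warp issues mma_sync out of its single instruction stream, and the matrix hardware executes it:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a 16x16x16 tile: FP16 inputs, FP32 accumulation.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                  // C tile = 0 for simplicity
    wmma::load_matrix_sync(a_frag, a, 16);                // cooperative, warp-wide loads
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // the matrix multiply-accumulate
    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b; float *c;
    cudaMalloc((void**)&a, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&b, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&c, 16 * 16 * sizeof(float));
    cudaMemset(a, 0, 16 * 16 * sizeof(half));
    cudaMemset(b, 0, 16 * 16 * sizeof(half));
    wmma_16x16x16<<<1, 32>>>(a, b, c);                    // a single warp issues the whole tile op
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```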

If the Tensor cores were merely re-using the CUDA ALUs, then they would not be able to achieve the performance deltas seen in real-world benchmarking. e.g. an FP16 4x4 FMA operation (multiplying two 4x4 matrices together and then adding an additional 4x4 matrix) is composed of 64 fused multiply-add operations (4 × 4 × 4). If Tensor cores were merely ALUs working together, we would expect general-purpose FP16 performance to be the same, or very close. Instead, if we look at the H100 (for example), standard FP16 performance is 134 TFLOPS (double the 67 TFLOPS FP32 performance due to packed math, as expected) but Tensor FP16 performance is 1979 TFLOPS. If Tensor cores were merely the ALUs with different dispatching, then that would mean that over 90% of them just sit idle for no reason rather than contributing to regular compute. That's a lot of wasted die area!
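To spell out the arithmetic (a plain reference sketch of mine, using float instead of FP16 for readability): D = A*B + C on 4x4 tiles is exactly 4 × 4 × 4 = 64 multiply-accumulates, which is the work one tensor MMA step covers.

```cpp
#include <cstdio>

// Reference arithmetic only: D = A*B + C on 4x4 matrices, counting the FMAs.
void mma_4x4(const float A[4][4], const float B[4][4],
             const float C[4][4], float D[4][4], int *fma_count) {
    *fma_count = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                // the "accumulate" input matrix
            for (int k = 0; k < 4; ++k) {
                acc += A[i][k] * B[k][j];       // one FMA per (i, j, k) triple
                ++*fma_count;
            }
            D[i][j] = acc;
        }
}

int main() {
    float A[4][4] = {}, B[4][4] = {}, C[4][4] = {}, D[4][4];
    int fmas = 0;
    mma_4x4(A, B, C, D, &fmas);
    printf("4x4 MMA = %d fused multiply-adds\n", fmas);   // prints 64
    return 0;
}
```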
In reality, you have 1979 TFLOPS of FP16 operations... only if those operations are part of matrix FMAs. You can't 'split out' those FLOPS for other calculations; the Tensor cores physically cannot decompose themselves into generic ALUs, because they're not built that way. The entire reason they can pack so many operations per unit of die area is their hyperspecialisation in capability. This is why you can run the Tensor cores in parallel with other generic ALU operations: it's two different pieces of hardware doing two different things.
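And just to illustrate that last point (again a sketch of my own, not a benchmark): a single warp can interleave tensor MMAs with ordinary FP32 FMAs in its one instruction stream, and because the matrix units and the vector ALUs are separate hardware, the two kinds of work can overlap in execution. Whether they overlap cycle-for-cycle depends on the scheduler, but the point is that they target different pipelines.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Each loop iteration issues one tensor MMA (matrix pipeline) and one scalar FP32 FMA
// (vector ALU pipeline) from the same warp's instruction stream.
__global__ void mixed_pipelines(const half *a, const half *b, float *c, float *y, int iters) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);

    float s = threadIdx.x * 0.5f;
    for (int i = 0; i < iters; ++i) {
        wmma::mma_sync(acc, a_frag, b_frag, acc);   // executed by the matrix units
        s = fmaf(s, 1.0001f, 0.25f);                // executed by the ordinary FP32 ALUs
    }
    y[threadIdx.x] = s;
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b; float *c, *y;
    cudaMalloc((void**)&a, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&b, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&c, 16 * 16 * sizeof(float));
    cudaMalloc((void**)&y, 32 * sizeof(float));
    cudaMemset(a, 0, 16 * 16 * sizeof(half));
    cudaMemset(b, 0, 16 * 16 * sizeof(half));
    mixed_pipelines<<<1, 32>>>(a, b, c, y, 1000);   // one warp, needs sm_70 or newer
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(y);
    return 0;
}
```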