BTW, this is actually novel:
Source: http://www.anandtech.com/show/11367/nvidia-volta-unveiled-gv100-gpu-and-tesla-v100-accelerator-announced...
NVIDIA has also unveiled that they’ve made a pretty significant change to how SIMT works for Volta. The individual CUDA cores within a 32-thread warp now have a limited degree of autonomy; threads can now be synchronized at a fine-grain level, and while the SIMT paradigm is still alive and well, it means greater overall efficiency. Importantly, individual threads can now yield, and then be rescheduled together. This also means that a limited amount of scheduling hardware is back in NV’s GPUs.
So, if they mean they can recompose a warp when some lanes stall on a memory access or synchronization, that is meaningfully different from SIMD.
Until this, everything I could find about Nvidia's SIMT seemed easily achievable with software tools driving standard SIMD + SMT hardware.
Update: I've learned more about this feature. It does not recompose warps. Not at the hardware level, at least, and not on things like memory stalls. It basically amounts to a fancier predication mechanism, where some "threads" which don't take a branch can have their PC (Program Counter register) stored on a stack. While this does help with several-level-deep branching and function calls taken by only a subset of the threads, the rest of the warp basically idles until the active part rejoins their point of execution.
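To make that concrete, here is a rough toy model in Python of the behaviour as I understand it. It is only a sketch, not Nvidia's actual hardware logic: the names (`run_divergent_branch`, `Stats`, `then_len`, `else_len`) are made up for illustration, and it assumes a single if/else where each path is a fixed number of instructions. Lanes that are not on the path currently being executed get parked on a stack and simply idle.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # lanes per warp, as in the quoted article


@dataclass
class Stats:
    issue_slots: int = 0  # warp-wide instruction issues
    lane_work: int = 0    # lane-instructions that did useful work

    @property
    def utilization(self) -> float:
        return self.lane_work / (self.issue_slots * WARP_SIZE)


def run_divergent_branch(takes_branch, then_len, else_len):
    """Toy model of one if/else executed by a single warp.

    takes_branch: list of WARP_SIZE booleans, one per lane.
    then_len / else_len: hypothetical instruction counts of the two paths.
    """
    assert len(takes_branch) == WARP_SIZE
    stats = Stats()

    then_mask = list(takes_branch)
    else_mask = [not t for t in takes_branch]

    # Stack of deferred paths: (active_mask, path_length).
    # Lanes whose path is still on the stack are parked and idle.
    stack = []
    if any(else_mask):
        stack.append((else_mask, else_len))
    if any(then_mask):
        stack.append((then_mask, then_len))

    while stack:
        mask, length = stack.pop()
        active = sum(mask)
        # Each instruction on this path occupies the whole warp for one
        # issue slot, but only the active lanes do useful work.
        stats.issue_slots += length
        stats.lane_work += active * length

    return stats


# Example: 8 of 32 lanes take the branch, both paths are 10 instructions.
lanes = [i < 8 for i in range(WARP_SIZE)]
s = run_divergent_branch(lanes, then_len=10, else_len=10)
print(s.issue_slots, s.lane_work, f"{s.utilization:.2f}")
```

In this model the two paths are serialized, so the warp spends 20 issue slots on work that a non-divergent warp would finish in 10, and half of the lane-slots are wasted. Nothing here repacks the idle lanes into another warp, which is the part that would have made it meaningfully different from masked SIMD.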
So, even Volta is still just SIMD + SMT, although with a few more tweaks and boatloads of fp16 matrix multiply throughput.