Nvidia is always a bit cagey about the exact details.
Anandtech did a deep dive on this, back when Volta launched. Here's the key bit you need to know:
For each sub-core, the scheduler issues one warp instruction per clock to the local branch unit (BRU), the tensor core array, math dispatch unit, or shared MIO unit. For one, this precludes issuing a combination of tensor core operations and other math simultaneously. In utilizing the two tensor cores, the warp scheduler issues matrix multiply operations directly, and after receiving the input matrices from the register, perform 4 x 4 x 4 matrix multiplies. Once the full matrix multiply is completed, the tensor cores write the resulting matrix back into the register.
...
After the matrix multiply-accumulate operation, the result is spread out in fragments in the destination registers of each thread.
So, it breaks the SIMD model, harnessing the warp + registers that normally drive & feed the regular CUDA cores.
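You can see that "fragments spread across the warp" behavior directly in CUDA's warp-level WMMA API. Below is a minimal sketch, not the exact hardware path Anandtech describes: the hardware works in 4x4x4 steps internally, while the API exposes a 16x16x16 tile that one warp computes cooperatively, with no single thread holding the whole result.

```cuda
// Minimal WMMA sketch: one warp (32 threads) computes a 16x16x16 tile D = A*B + C.
// The fragments live in the registers of the warp's threads, spread across them.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *a, const half *b, float *d)
{
    // Per-thread register-backed fragments of the input and accumulator tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    // The whole warp issues these together: the scheduler hands the
    // matrix-multiply op to the tensor cores, fed from the register file.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Result fragments are written back from each thread's registers.
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
// Launch with at least one full warp, e.g. wmma_tile<<<1, 32>>>(a, b, d);
```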
AMD publishes details of its GPUs on GPUOpen.
The AMD RDNA™ 3 ISA reference guide is now available! The ISA guide is useful for anyone interested in the lowest level operation of the RDNA 3 shader core.
gpuopen.com
I haven't gone through it yet.
AFAIK, Intel doesn't publish such programming details of their GPUs, so you'd have to piece it together by analyzing the source code of their open source driver & deep learning framework code.
So with DLSS2 on Ampere, the initial frame rendering gets finished and the GPU moves on to rendering the next frame, while at the same time upscaling the previous frame. Or at least, I remember Nvidia talking about it being possible to do that.
The thing is, you could already do that in Turing. For most of the rendering process, shader occupancy isn't 100%. It's just that, if the tensor cores are truly independent of the CUDA cores, then you'd be able to use the two concurrently with less interference. The downside is that you'd have more dark silicon most of the time - it'd be less area-efficient, because it's rare that you'd be concurrently driving both near peak occupancy.
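In CUDA terms, that kind of overlap is just two independent workloads in separate streams. Here's a rough sketch with made-up kernels (render_frame / upscale_frame are stand-ins, not Nvidia's actual DLSS scheduling): whether they truly run concurrently depends on there being free SM resources, which is exactly the occupancy-headroom point above.

```cuda
// Sketch of overlapping "render next frame" with "upscale previous frame"
// using two CUDA streams. Kernel names and workloads are hypothetical.
#include <cuda_runtime.h>

__global__ void render_frame(float *frame, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frame[i] = i * 0.5f;          // stand-in for shading work
}

__global__ void upscale_frame(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;        // stand-in for tensor-core upscaling
}

int main() {
    const int n = 1 << 20;
    float *frameA, *frameB, *upscaled;
    cudaMalloc(&frameA, n * sizeof(float));
    cudaMalloc(&frameB, n * sizeof(float));
    cudaMalloc(&upscaled, n * sizeof(float));

    cudaStream_t renderStream, upscaleStream;
    cudaStreamCreate(&renderStream);
    cudaStreamCreate(&upscaleStream);

    // Frame N already sits in frameA. Kick off rendering of frame N+1 and
    // the upscale of frame N in different streams so they can overlap if
    // the GPU has spare resources.
    render_frame<<<n / 256, 256, 0, renderStream>>>(frameB, n);
    upscale_frame<<<n / 256, 256, 0, upscaleStream>>>(frameA, upscaled, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(renderStream);
    cudaStreamDestroy(upscaleStream);
    cudaFree(frameA); cudaFree(frameB); cudaFree(upscaled);
    return 0;
}
```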
As for pipelining, I even heard about one game that overlaps rendering of 4 consecutive frames, in order to achieve good shader occupancy and the highest framerates. I assume it was probably something more like an RTS game, because 3 extra frames of latency sounds horrible for a twitchy FPS game.
I mentioned in this article on DLSS 3.5 that it was interesting to see how relatively poorly the RTX 2080 Ti did, and guessed it was the lack of concurrent RT and Tensor workloads.
Turing was also made on 12 nm, carried the first iteration of RT cores, and only a small revision of Volta's Tensor cores. So, I wouldn't expect it to perform very well.