From the blog post, it appears that NVIDIA's algorithm represents the texture maps as tensors, essentially 3-D matrices. The only thing NTC assumes is that every map in a material has the same size, which can be a drawback of this method if not handled properly.
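To make that concrete, here's a rough sketch of how I picture the stacking step (the map names, channel counts, and resolution below are just placeholders, not what NVIDIA actually feeds in):

```python
import torch

# Stack a material's maps (all at the same resolution) into one H x W x C tensor.
# These maps and sizes are illustrative only.
albedo    = torch.rand(1024, 1024, 3)  # RGB
normal    = torch.rand(1024, 1024, 3)  # XYZ
roughness = torch.rand(1024, 1024, 1)  # scalar
ao        = torch.rand(1024, 1024, 1)  # scalar

material = torch.cat([albedo, normal, roughness, ao], dim=-1)
print(material.shape)  # torch.Size([1024, 1024, 8])
```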
That's also why the render time is actually higher than with BC, and why there's visual degradation at low bitrates. All the maps need to be the same size before compression, which is bound to complicate workflows and hurt the algorithm's speed.
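That resizing step is the workflow annoyance I mean: any map that doesn't match the target resolution would have to be resampled first before it can be stacked. Something like this, with made-up sizes:

```python
import torch
import torch.nn.functional as F

# A 512x512 roughness map has to be upsampled to match the 1024x1024 maps
# before the whole set can be stacked into one tensor.
roughness_small = torch.rand(1, 1, 512, 512)  # NCHW layout for interpolate
roughness_1k = F.interpolate(roughness_small, size=(1024, 1024),
                             mode='bilinear', align_corners=False)
print(roughness_1k.shape)  # torch.Size([1, 1, 1024, 1024])
```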
But it's an interesting concept nonetheless. NTC seems to rely on matrix multiplication techniques, which at least makes it more feasible and versatile thanks to the reduced disk and/or memory footprint.
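My guess is that decoding boils down to a small per-texel network, i.e. a couple of batched matrix multiplies. A toy sketch of that idea (the layer sizes and latent width are my guesses, not the real NTC architecture):

```python
import torch
import torch.nn as nn

# Toy decoder: per-texel latent features -> a few matrix multiplies -> texture channels.
decoder = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 8),            # 8 output channels, matching the stacked maps above
)

latents = torch.rand(1024 * 1024, 16)  # one small feature vector per texel
decoded = decoder(latents)             # batched matmuls, well suited to tensor cores
print(decoded.shape)                   # torch.Size([1048576, 8])
```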
And with GPU manufacturers being stingy with VRAM even on the newest mainstream/mid-range GPUs, the burden now falls on software engineers to squeeze more out of the hardware available today. Maybe this will become more practical after 2 or 3 more generations.
Kind of reminds me of a PyTorch implementation.