Wow, there's a lot to unpack here. Thank you for covering this,
@PaulAlcorn, and especially for your question about timeframes. I doubt it will surprise you that I have some feedback.
The article said:
When AMD moved on from its GCN microarchitecture back in 2019, the company decided to split its new graphics microarchitecture into two different designs, with RDNA designed to power gaming graphics products for the consumer market while the CDNA architecture was designed specifically to cater to compute-centric AI and HPC workloads in the data center.
Yes, but... for the most part, CDNA was just a rebrand of GCN. The two main changes I'm aware of were:
- the widening of their registers & datapaths to 64-bit, primarily in order to sustain 64-bit arithmetic at the same rate as fp32.
- the addition of "Matrix Cores".
The significant departure was actually RDNA, which halved the wavefront size (bringing it in line with Nvidia) and substantially cut compute latency as well.
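For what it's worth, that wavefront split is visible straight from software. A minimal HIP sketch of my own (nothing more than a device-property query) reports 64 on GCN/CDNA parts and 32 on RDNA parts running the default wave32 mode:

    // wave_width.cpp - print the wavefront width HIP reports for device 0.
    // Build (roughly): hipcc wave_width.cpp -o wave_width
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main()
    {
        hipDeviceProp_t props;
        if (hipGetDeviceProperties(&props, 0) != hipSuccess) {
            std::printf("no HIP device found\n");
            return 1;
        }
        // 64 on GCN/CDNA, 32 on RDNA (the default wave32 mode).
        std::printf("%s: warpSize = %d\n", props.name, props.warpSize);
        return 0;
    }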
The article said:
Nvidia began laying the foundation of its empire when it started with CUDA eighteen long years ago, and perhaps one of its most fundamental advantages is signified by the 'U' in CUDA, the Compute Unified Device Architecture. Nvidia has but one CUDA platform for all uses, and it leverages the same underlying microarchitectures for AI, HPC, and gaming.
I think you're reading too much into the "Unified" part of CUDA. The key thing it does is to provide a unified
API across all of their devices. It doesn't mean all of their devices have the same capabilities. If you study this table carefully, you can see that there are what appear to be "regressions", which generally correspond to some divergence of capabilities between their client and 100-series models.
Furthermore, when you compile CUDA code, you have to explicitly specify which architectures you want to compile it for.
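To illustrate (with a trivial kernel of my own, just to show the build step): you name each target architecture explicitly, and nvcc packs one compiled image per architecture into a fat binary - there's no single "unified" build that every GPU just runs.

    // scale.cu - toy kernel, shown only to illustrate per-architecture compilation.
    //
    // A typical multi-target build spells out every architecture, e.g.:
    //   nvcc -gencode arch=compute_86,code=sm_86 \
    //        -gencode arch=compute_90,code=sm_90 scale.cu -o scale
    // Each -gencode adds another compiled image to the resulting fat binary.
    #include <cuda_runtime.h>

    __global__ void scale(float* x, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= s;
    }

    int main()
    {
        float* d = nullptr;
        cudaMalloc(&d, 256 * sizeof(float));
        scale<<<1, 256>>>(d, 2.0f, 256);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }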
The article said:
Huynh told me that CUDA has four million developers
That's a curious number, and I wonder how it was arrived at. I'd be surprised if there were more than a few tens of thousands of serious CUDA developers, but lots more people download Nvidia's libraries and the other packages needed to compile software that contains CUDA code. Somewhere in between is probably the number of devs writing host code that calls CUDA-accelerated libraries - not that they've ever looked under the covers or tinkered with the CUDA code itself.
The article said:
The company also remains focused on ROCm despite the emergence of the UXL Foundation, an open software ecosystem for accelerators
HIP is essentially AMD's CUDA compatibility layer. It's virtually identical to CUDA, except they did a search-and-replace on the names, to reduce the chances of being attacked for copyright infringement.
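To show how literal that is, here's a toy host-side snippet of mine (not from AMD's docs) - the HIP version is the CUDA version with the cuda* names swapped for hip*:

    // alloc.cpp - HIP version; the CUDA original differs only in the names.
    #include <hip/hip_runtime.h>               // CUDA: #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const size_t n = 1 << 20;
        float* d_buf = nullptr;

        hipMalloc((void**)&d_buf, n * sizeof(float));   // CUDA: cudaMalloc
        hipMemset(d_buf, 0, n * sizeof(float));         // CUDA: cudaMemset
        hipDeviceSynchronize();                         // CUDA: cudaDeviceSynchronize
        hipFree(d_buf);                                 // CUDA: cudaFree

        std::printf("allocated, cleared, and freed %zu floats\n", n);
        return 0;
    }

AMD even ships hipify tools that perform exactly this renaming on existing CUDA sources.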
The article said:
one clear potential pain point has been the lack of dedicated AI acceleration units in RDNA. Nvidia brought tensor cores to the entire RTX line starting in 2018. AMD only has limited AI acceleration in RDNA 3, basically accessing the FP16 units in a more optimized fashion via WMMA instructions,
Tensor cores are "cores" in pretty much the same sense that Nvidia calls everything a "core" - in other words, they're not. You feed them using warp-level instructions and SIMD registers, pretty much exactly the way RDNA's WMMA and CDNA's MFMA instructions work.
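To make that concrete, here's a minimal sketch of my own (not from the article or Nvidia's samples) using CUDA's warp-level wmma API: the 32 threads of a warp cooperatively hold the matrix fragments in their SIMD registers and issue the matrix-multiply-accumulate together, which is the same usage pattern as RDNA's WMMA and CDNA's MFMA.

    // wmma_tile.cu - one warp computes a 16x16 tile: D = A*B + C (fp16 in, fp32 accumulate).
    // Needs a tensor-core-capable part, e.g. nvcc -arch=sm_70 or newer.
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_tile(const half* a, const half* b, float* d)
    {
        // Fragments live spread across the registers of the warp's 32 threads.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, a, 16);           // warp-wide load into SIMD registers
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the "tensor core" op, issued per warp
        wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
    }

In other words, the "core" is really a matrix-multiply execution unit that the warp's SIMD lanes feed - which is exactly the shape of AMD's WMMA and MFMA.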
The article said:
Given the preponderance of AI work being done on both data center and client GPUs these days, adding tensor support to client GPUs seems like a critical need.
RDNA3's WMMA already has what you need. This is why TinyBox wanted to pack six RX 7900 XTX GPUs into a low-cost server for training.
Look, they can do BF16 matrix products with fp32 accumulate:
This blog is a quick how-to guide for using the WMMA feature with our RDNA 3 GPU architecture using a Hello World example.
gpuopen.com
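Under the hood, that Hello World boils down to a single compiler builtin per 16x16x16 tile. Here's a rough, from-memory sketch of the bf16-in/fp32-accumulate flavor - the vector typedefs, the builtin name, and especially the per-lane fragment layout should be double-checked against the blog and the RDNA3 ISA guide before trusting it:

    // Sketch only: one wave computes acc += A*B for a 16x16x16 tile (bf16 in, fp32 accumulate).
    // bf16 elements travel as raw 16-bit values; loading/storing fragments in the required
    // per-lane layout is deliberately omitted here - see the GPUOpen post for that part.
    #include <hip/hip_runtime.h>

    typedef short bf16x16 __attribute__((ext_vector_type(16)));
    typedef float floatx8 __attribute__((ext_vector_type(8)));

    __device__ floatx8 wmma_bf16_tile(bf16x16 a_frag, bf16x16 b_frag, floatx8 acc)
    {
        // Each lane holds a slice of A, B, and the accumulator in its vector registers;
        // on gfx11 this should map to the V_WMMA_F32_16X16X16_BF16 instruction.
        return __builtin_amdgcn_wmma_f32_16x16x16_bf16_w32(a_frag, b_frag, acc);
    }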
IMO, the main area where RDNA has been lacking is in support for HPC apps - not AI. These days, I'm sure HPC is a far smaller market.
The article said:
The unified UDNA architecture is a good next logical step on the journey to competing with CUDA
We'll see. The distinctions between Nvidia's 100-series and its client GPUs haven't seemed to be a major impediment to its world domination, and it looked to me like the RDNA/CDNA split was AMD largely mirroring that. I don't think AMD can afford to make either one less adept at its purpose. To be honest, I expected both AMD and Nvidia to specialize further in the direction of AI, sacrificing some of the general-purpose programmability and HPC features of their datacenter-specific models. This unification almost seems like a step backwards, but I'll reserve judgement until we learn some specifics about what AMD has in mind - perhaps they're looking to ditch the wavefront matrix instructions in favor of integrating XDNA engines into their dGPUs?
I think what developers most wanted was a robust, well-supported software stack. ROCm just took too long to reach maturity, is too narrowly supported, and has too many issues on unsupported hardware. It didn't help that AMD changed its API strategy half-way through, from previously focusing on OpenCL (which is pretty much what Intel and UXL are doing) to building a CUDA work-alike, with HIP.
One way CUDA became dominant in AI is by being general enough that it could handle whatever people wanted to do with it, and that positioned it well for those looking to accelerate neural networks and deep learning. However, for someone trying to gain AI dominance today, I think there are short-cuts that make a lot more sense.