News AMD announces unified UDNA GPU architecture — bringing RDNA and CDNA together to take on Nvidia's CUDA ecosystem

When I first heard AMD talk about splitting their GPU architectures into RDNA and CDNA, I thought it sounded good. The idea was to strip data-center-oriented circuitry out of RDNA, freeing up silicon budget for gaming hardware and features, and vice versa for CDNA, which could be optimized purely for data-center workloads. It did sound excellent at the time.

Enter the "AI" market craze. That puts a different spin on things, eh? If "AI" really does have the long life ahead of it that proponents are advertising, then UDNA makes a great deal of sense: transistor designs and layouts for "AI" types of computation have a lot in common with data-center optimization, while gaming function and performance will always matter. Time and future GPUs will tell, I guess...😉
Hopefully that means FP64 and larger FP data types won't get gimped on the Radeon side.
 
So... They're bringing GCN back from the dead? LOL.

Christ...

EDIT: Just to add a bit more to my knee-jerk reaction to the overall information (thanks for the interviews, BTW!) in regards to my comment...

The reason why CUDA is king is longevity and support. AMD needs to stop screwing around with the long-term strategy, stop flip-flopping so much, and stick to something for longer than three generations.
The error was not going to Microsoft and agreeing on a common API for Windows.
Once Windows demands an API, programmers can just target that API and forget about the particular hardware.
 
Not really. You're thinking about $60K from a normal person's point of view.
Once a company crosses $10B in value (and the ones buying these are at $100B+), a mere $60K is just the cost of doing business, and it can possibly be written off, lowering the company's overall business taxes.



Nvidia didn't.
Nvidia changed its ToU to state that only Nvidia hardware can run CUDA, but that's basically a threat to their customers (as they can't actually tell who uses it).
AMD is the one who made the ZLUDA dev take it down and start re-doing the code from the pre-AMD-period codebase.
If you believe businesses don't care about the amount of money they spend, then you've not run a business. More importantly, if you think everyone is willing to just pony up to build out training farms, you are really mistaken about where the industry is at the moment. There is a good amount of investment and enthusiasm, but $60K is still $60K when you need thousands of them. It's assured that unless ROI comes knocking, these prices will not be sustainable.
 
"AMD only has limited AI acceleration in RDNA 3, basically accessing the FP16 units in a more optimized fashion via WMMA instructions"

Is that how that works in actuality? They certainly don't talk about it in a fashion that makes it sound so, though I suppose that would be why they don't have some cute marketable name for theirs like Tensor cores, or Matrix... whatever Intel calls their AI cores.

On the bright side, it sounds like we will indeed have a "Radeon VII Mark 2"... or better yet, another proper GCN-type GPU on our hands finally! (Not that Navi 31 is too far off.)
 
They're not going to gain meaningful market share as long as they keep their pricing so close to Nvidia. Undercut them by a third or more while offering near equal performance and a 16GB VRAM size and then things will move.
The thing is, what would stop Nvidia from lowering their prices too? They have the pockets to weather any meaningful price war from AMD. In your scenario, both AMD and Nvidia lose. Only the consumer wins, but neither Nvidia nor AMD is in business to make the consumer happier. Their fiduciary responsibility is to maximize value to shareholders, and your strategy doesn't really get them there.

For the consumer this sucks, and the only real way out of it is to have more competitors in the space, so start rooting for Intel (and hopefully others)
 
I can’t fault the general message and their strategy in going unified, but considering Huynh was evasive when you asked about a clear timeline of implementation, I guess I’ll believe it when I see it.
To be fair, he did say "UDNA 6, UDNA 7" at one point, which could indicate inner machinations to unify things after the alleged "RDNA 5" if he has his way.
 
Wow, there's a lot to unpack, here. Thank you for covering this, @PaulAlcorn, and especially for your question about timeframes. I doubt it will surprise you that I have some feedback.

The article said:
When AMD moved on from its GCN microarchitecture back in 2019, the company decided to split its new graphics microarchitecture into two different designs, with RDNA designed to power gaming graphics products for the consumer market while the CDNA architecture was designed specifically to cater to compute-centric AI and HPC workloads in the data center.
Yes, but... for the most part, CDNA was just a rebrand of GCN. The two main changes I'm aware of were:
  1. the widening of their registers & datapaths to 64-bit, primarily in order to sustain 64-bit arithmetic at the same rate as fp32.
  2. the addition of "Matrix Cores".

The significant departure was actually RDNA, which halved the Wavefront size (bringing it in line with Nvidia) and substantially cut down compute latency, as well.

The article said:
Nvidia began laying the foundation of its empire when it started with CUDA eighteen long years ago, and perhaps one of its most fundamental advantages is signified by the 'U' in CUDA, the Compute Unified Device Architecture. Nvidia has but one CUDA platform for all uses, and it leverages the same underlying microarchitectures for AI, HPC, and gaming.
I think you're reading too much into the "Unified" part of CUDA. The key thing it does is to provide a unified API across all of their devices. It doesn't mean all of their devices have the same capabilities. If you study this table carefully, you can see that there are what appear to be "regressions", which generally correspond to some divergence of capabilities between their client and 100-series models.

Furthermore, when you compile CUDA code, you have to actually specify which architectures you want to compile it for.
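For illustration, here's a minimal sketch (my own, nothing from the article) of what that looks like in practice: even a trivial kernel has to be built for explicit target architectures, and the resulting binary only carries machine code for the generations you named.

```cuda
// CUDA binaries are built per-architecture, e.g.:
//   nvcc -gencode arch=compute_86,code=sm_86 \
//        -gencode arch=compute_90,code=sm_90 saxpy.cu -o saxpy
// Leave a generation out and the embedded machine code won't run there
// (driver JIT from PTX aside).
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // expect 5.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```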

The article said:
Huynh told me that CUDA has four million developers
That's a curious number. I wonder how it was arrived at. I'd be surprised if there were more than a few tens of thousands of serious CUDA developers, but lots more people download their libraries and the other packages needed to compile software that contains CUDA code. Maybe somewhere in between is the number of devs writing host code that utilizes CUDA-accelerated libraries - not that they ever looked under the covers or tinkered with the CUDA code, itself.

The article said:
The company also remains focused on ROCm despite the emergence of the UXL Foundation, an open software ecosystem for accelerators
HiP is essentially AMD's CUDA compatibility layer. It's virtually identical to CUDA, except they did a search-and-replace, to reduce the chances of being attacked for copyright infringement.
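To show how thin the rename is, here's a minimal sketch (mine, not AMD's docs): each CUDA runtime call below has a same-shaped HIP counterpart, noted in the comments, so a port is largely mechanical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>        // HIP port: #include <hip/hip_runtime.h>

int main()
{
    float *d_buf;
    // Each call has a near-identical HIP twin -- typically just s/cuda/hip/:
    cudaMalloc(&d_buf, 1024 * sizeof(float));      // hipMalloc
    cudaMemset(d_buf, 0, 1024 * sizeof(float));    // hipMemset
    cudaDeviceSynchronize();                       // hipDeviceSynchronize
    cudaFree(d_buf);                               // hipFree
    printf("done\n");
    return 0;
}
```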

The article said:
one clear potential pain point has been the lack of dedicated AI acceleration units in RDNA. Nvidia brought tensor cores to the entire RTX line starting in 2018. AMD only has limited AI acceleration in RDNA 3, basically accessing the FP16 units in a more optimized fashion via WMMA instructions,
Tensor cores are "cores" in pretty much the same sense Nvidia calls everything a "core". In other words, they're not. You feed them using Warp instructions and SIMD registers pretty much exactly how RDNA's WMMA and CDNA's MFMA instructions work. See:
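To make that concrete, here's a minimal sketch using CUDA's public warp-level wmma API (the standard 16x16x16 fp16 tile, sm_70 or newer; my own example, not from the article). The "Tensor core" operation is just another warp-wide instruction whose operands live in the warp's ordinary registers.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of D = A*B (fp16 in, fp32 accumulate).
// It's issued as warp-wide mma instructions reading/writing the warp's
// regular register file -- no separately addressable "core" is involved.
__global__ void tile_mma(const half *A, const half *B, float *D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);                  // leading dimension = 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);                    // the "Tensor core" instruction
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```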

The article said:
Given the preponderance of AI work being done on both data center and client GPUs these days, adding tensor support to client GPUs seems like a critical need.
RDNA3's WMMA already has what you need. This is why TinyBox wanted to pack six RX 7900 XTX GPUs into a low-cost server for training.

Look, they can do BF16 matrix product with fp32 accumulate:


IMO, the main area where RDNA has been lacking is in support for HPC apps - not AI. These days, I'm sure HPC is a far smaller market.

The article said:
The unified UDNA architecture is a good next logical step on the journey to competing with CUDA
We'll see. The distinctions between Nvidia's 100-series vs. their client GPUs hasn't seemed a major impediment to their world domination. It seemed to me like what AMD did between RDNA and CDNA was largely mirroring that. I think AMD can't afford to make either less adept at its purpose. To be honest, I expected both AMD and Nvidia actually to specialize more in the direction of AI, sacrificing some general-purpose programmability and HPC features of their datacenter-specific models. This unification almost seems like a step backwards, but I'll reserve judgement until we learn some specifics about what AMD has in mind - perhaps they're looking to ditch the wavefront matrix instructions in favor of integrating XDMA engines into their dGPUs?

I think what developers most wanted was a robust, well-supported software stack. ROCm just took too long to reach maturity, is too narrowly supported, and has too many issues on non-supported hardware. It didn't help that AMD changed its API strategy half-way through, from previously focusing on OpenCL (which is pretty much what Intel and UXL are doing) to building a CUDA work-alike, with HiP.

One way CUDA became dominant in AI is by being general enough that it could handle whatever people wanted to do with it, and that positioned it well for those looking to accelerate neural networks and deep learning. However, for someone trying to gain AI dominance today, I think there are short-cuts that make a lot more sense.
 
"AMD only has limited AI acceleration in RDNA 3, basically accessing the FP16 units in a more optimized fashion via WMMA instructions"

Is that how that works in actuality? They certainly don't talk about it in a fashion that makes it sound so, though I suppose that would be why they don't have some cute marketable name for theirs like Tensor cores, or Matrix... whatever Intel calls their AI cores.
Good questions. Check out some of the links in my previous post.

CDNA did actually use the term "Matrix Cores", but they also include support for higher-precision arithmetic formats that RDNA GPUs lack. This is key to them targeting the HPC market and not just AI.

On the bright side, it sounds like we will indeed have a "Radeon VII Mark 2"... or better yet, another proper GCN-type GPU on our hands finally! (Not that Navi 31 is too far off.)
I doubt it. Even if AMD re-unifies their GPUs' ISA, that still doesn't mean their client GPUs will waste lots of area on fp64 arithmetic that consumers don't need. Nvidia doesn't do that, in their client GPUs.

Hopefully that means FP64 and larger FP data types won't get gimped on the Radeon side.
The last GPU to gimp fp64 was Radeon VII. After that, both AMD and Nvidia only had scalar fp64 in their client GPUs. Why waste die area for a feature with almost no consumer demand?
 
The error was not going to Microsoft and agreeing on a common API for Windows.
Once Windows demands an API, programmers can just target that API and forget about the particular hardware.
LOL, this is wrong on so many levels.
  1. Nvidia didn't do that. They just made CUDA and that was that. They quickly dominated GPU compute, and now they basically own the AI hardware market.
  2. AMD partnered with Microsoft on multiple GPU computing projects & frameworks, including C++ AMP and DirectCompute. They didn't get much uptake from developers.
  3. In the world of HPC and AI, Linux rules. There are various reasons. Certainly, you can do those things on Windows, but most of Nvidia's AI-oriented hardware purchases are by cloud & embedded users who are running Linux.

both AMD and Intel could just help OpenCL be relevant, but they aren't because they want their own stuff to be relevant, which is hilarious to see (how they fail).
AMD pretty much turned their back on OpenCL, but not Intel. Intel is basically the main proponent of it and SYCL (another Khronos standard, based on OpenCL). Those technologies form the basis of their oneAPI.

Maybe Khronos would be to blame there? Not sure. Just throw money at the problem, I guess.
Open standards don't necessarily just happen. For the most part, they need some significant demand from enough customers that vendors can't ignore them in favor of their own proprietary alternatives.

For instance, I think the main reason Vulkan happened was due to Google and their desire for Android to have a decent replacement for OpenGL. Through Android, Google had enough leverage to force enough vendors to get onboard and critical mass was quickly reached.

AMD GPUs can run CUDA, AMD pulled the plug on the software project. Technically there is no physical reason AMD couldn’t make the hardware interface CUDA compliant
At an API level, HiP is AMD's CUDA work-alike.

AMD would like people to port their CUDA code to HIP, which you can run on both AMD and Nvidia GPUs. I'll bet you can count the number of such examples on a single hand. If I had a mature CUDA codebase, and I were still interested in running it on Nvidia hardware (if no longer exclusively), I couldn't imagine it not running somehow worse after converting to HIP.
 
I can’t fault the general message and their strategy in going unified, but considering Huynh was evasive when you asked about a clear timeline of implementation, I guess I’ll believe it when I see it.
The timeline thing is weird, eh?

Usually, these companies don't like to spread confusion in the marketplace, which this announcement does in spades. Normally, such announcements happen much closer to the availability of hardware, and there are specifics about exactly what the new technology entails (even if available only under NDA).

What I think is going on here is basically a big, attention-grabbing announcement essentially to try and claw back some customer & developer mindshare. It seems desperate. Kinda like the corporate equivalent of shouting: "I know I messed up. I can change, baby. Please, give me another chance!"

So... They're bringing GCN back from the dead? LOL.
No, UDNA is probably going to look more like RDNA.

Speaking of which, RDNA 1 had backwards compatibility for running GCN code on it! I assumed this was primarily so that PS5 and the latest XBox could still play previous-generation content. I wonder if this capability is still present in RDNA3.
 
AMD would like people to port their CUDA code to HIP, which you can run on both AMD and Nvidia GPUs. I'll bet you can count the number of such examples on a single hand. If I had a mature CUDA codebase, and I were still interested in running it on Nvidia hardware (if no longer exclusively), I couldn't imagine it not running somehow worse after converting to HIP.
It’s a tale of two companies, or a tale of two cousins; think FreeSync vs. G-Sync.
Perhaps to get adoption AMD needs to open source? Nvidia will always go for closed, locked down solutions.

CUDA has many years of relatively easily achieved optimisations, as in it is a system designed to run on a limited set of GPUs. All else being equal it should be quicker than HiP. The advantage of going with HiP would be a degree of code portability.

What is preferable? Depends on your use case.
 
It’s a tale of two companies, or a tale of two cousins; think FreeSync vs. G-Sync.
Flawed analogy, but okay.

Perhaps to get adoption AMD needs to open source?
HIP is open source. So is ROCm.

Intel's oneAPI is also all open source.

Nvidia will always go for closed, locked down solutions.
They crack the door open, occasionally, when it suits them. The latest example is their Linux device driver.
 
The timeline thing is weird, eh?

Usually, these companies don't like to spread confusion in the marketplace, which this announcement does in spades. Normally, such announcements happen much closer to the availability of hardware, and there are specifics about exactly what the new technology entails (even if available only under NDA).

What I think is going on here is basically a big, attention-grabbing announcement essentially to try and claw back some customer & developer mindshare. It seems desperate. Kinda like the corporate equivalent of shouting: "I know I messed up. I can change, baby. Please, give me another chance!"


No, UDNA is probably going to look more like RDNA.

Speaking of which, RDNA 1 had backwards compatibility for running GCN code on it! I assumed this was primarily so that PS5 and the latest XBox could still play previous-generation content. I wonder if this capability is still present in RDNA3.
That's my sentiment as well, especially following the other interview where he announced that AMD would be focusing on the budget-to-midrange market going forward.
 
Flawed analogy, but okay.


HIP is open source. So is ROCm.

Intel's oneAPI is also all open source.


They crack the door open, occasionally, when it suits them. The latest example is their Linux device driver.
Yes, HiP and ROCm are open source; my point/question is that, given Nvidia’s entrenched position, AMD needs to encourage adoption by going open source, and Intel, being really late to the party, likewise.

It suits them to open source the Linux driver to persuade Linux users to play with their hardware, it’s only to expand their market. If they were dominant in the Linux ecosystem they would not be open sourcing their drivers.
 
Nvidia has a pretty smooth gradient of CUDA capable cards from low end home user/college student, to high-end consumer that doesn't break the bank for smaller teams, to Quadro with double the vram and double the price, and finally $20k datacenter offerings.

AMD gave up the high-end consumer, so what's the plan? A couple hundred dollar mid/low end and then nothing until $20k datacenter GPUs?

Frankly trying to unify those two is not a good idea. Make an ultra focused console gaming gpu and a datacenter AI accelerator.
 
Tensor cores are "cores" in pretty much the same sense Nvidia calls everything a "core". In other words, they're not. You feed them using Warp instructions and SIMD registers pretty much exactly how RDNA's WMMA and CDNA's MFMA instructions work.
The difference is that in RDNA, you are using the same INT and FP units for regular usage and for matrix usage, just feeding them in a different manner to eke out some extra efficiency. If you're doing regular GPU stuff, those cores are doing regular FP and INT maths, and if you're doing AI matrix math then those same cores are instead used for that. If you want to do both at the same time... you need to partition or alternate.
On the Nvidia side of things, the Tensor cores are fixed-function blocks dedicated ONLY to FMA operations. You can't feed them regular math (ADD/MUL/etc.); you can only feed them matrices and run FMA. But they do this very fast and very efficiently. Because they're separate FFBs from the rest of the INT and FP units for the CUDA cores, you can use them in parallel with all the normal hardware.

This difference in implementation is what leads to AMD's improved raster performance for similar GPUs: they're not 'wasting' die area on Tensor cores or RT units (that accelerate BVH traversal), it's allllll general purpose compute. But there's a downside: if you want to do general compute tasks (e.g. gaming) and Tensor tasks (e.g. DLSS) or RT, then RDNA needs to lower its raster performance in order to borrow some cores to do those tasks, whereas on the green side of things you're now using that previously dark die area. This is why FSR does not use trained model upscaling - because it would be a direct trade of raster performance for upscaling performance. DLSS gets the upscaling nearly 'for free' because most of it is performed on the Tensor cores, which would otherwise just not be doing anything.
 
Nvidia already did force the removal of the program that allowed AMD GPUs to run CUDA-based programs... so yeah.
No, Nvidia forced a small nobody to close down a translation layer. That’s COMPLETELY different from AMD just building GPUs that natively run CUDA code. If AMD doesn’t advertise that they “run CUDA” and it just magically works, what could Nvidia do?
 
The difference is that in RDNA, you are using the same INT and FP units for regular usage and for matrix usage, just feeding them in a different manner to eke out some extra efficiency. If you're doing regular GPU stuff, those cores are doing regular FP and INT maths, and if you're doing AI matrix math then those same cores are instead used for that. If you want to do both at the same time... you need to partition or alternate.
On the Nvidia side of things, the Tensor cores are fixed-function blocks dedicated ONLY to FMA operations. You can't feed them regular math (ADD/MUL/etc.); you can only feed them matrices and run FMA. But they do this very fast and very efficiently. Because they're separate FFBs from the rest of the INT and FP units for the CUDA cores, you can use them in parallel with all the normal hardware.

This difference in implementation is what leads to AMD's improved raster performance for similar GPUs: they're not 'wasting' die area on Tensor cores or RT units (that accelerate BVH traversal), it's allllll general purpose compute. But there's a downside: if you want to do general compute tasks (e.g. gaming) and Tensor tasks (e.g. DLSS) or RT, then RDNA needs to lower its raster performance in order to borrow some cores to do those tasks, whereas on the green side of things you're now using that previously dark die area. This is why FSR does not use trained model upscaling - because it would be a direct trade of raster performance for upscaling performance. DLSS gets the upscaling nearly 'for free' because most of it is performed on the Tensor cores, which would otherwise just not be doing anything.
You do realize AMD and Nvidia both build ray tracing accelerators into the basic building blocks of their GPUs, right? On Nvidia the RT cores will always be X per SM and on AMD it’s X per CU. Nvidia also does it with Tensor cores. They’re not some separate block on the GPU.
 
You do realize AMD and Nvidia both build ray tracing accelerators into the basic building blocks of their GPUs, right? On Nvidia the RT cores will always be X per SM and on AMD it’s X per CU. Nvidia also does it with Tensor cores. They’re not some separate block on the GPU.
They are separate blocks in that they are not sharing hardware with the regular shader blocks.
For the Tensor cores, if you want to add two basic INTs, then it goes to one of the regular INT units. You can't send it to the Tensor units; they don't know what to do with anything that isn't presented as a set of 3 matrices. If you present a set of 3 INT matrices for FMA, then you can send it to a Tensor unit as a single operation, or split it out into lots of additions and multiplications and do them on the regular INT units. But different areas of the die are used for each, and you could even do both in parallel (Nvidia explicitly lists concurrent operation as an optimisation for maximising performance).
On RDNA, whether you are doing a basic INT addition or a matrix FMA, the same hardware on the silicon is used in either case. You can think of it a little bit like SMT, in that there are not actually two different pieces of hardware performing the two different operations, but the hardware presents as if there were, to simplify operation.

As for RT: as of RDNA 3, there are no dedicated FFBs for BVH traversal or intersection; they use the same ALUs as other operations. The ALUs can be addressed in a much more efficient manner using BVH instructions than by breaking the calculations down beforehand, but the ALUs involved in BVH traversal cannot be simultaneously used for other tasks (e.g. rasterisation) because it's not actually separate hardware. It's a different philosophy of how to architect the GPU, and in tasks outside of RT or matrix FMA, AMD's approach means much more of the die area can actually be used for other tasks, which is why AMD's cards tend to have better performance in pure rasterisation tasks - if you do not expect to be doing much RT or AI processing, then AMD's approach is more die-area efficient and so more cost-efficient.
 
It suits them to open source the Linux driver to persuade Linux users to play with their hardware, it’s only to expand their market. If they were dominant in the Linux ecosystem they would not be open sourcing their drivers.
LOL, no. They definitely dominate, on Linux. That's the main workhorse for AI in the cloud (also, embedded).

They have other reasons for open sourcing it, mostly having to do with licensing restrictions on which kernel features a proprietary driver is allowed to access. They even suffered having some of their kernel patches removed, because there was no open source driver using them at the time.
 
Quadro with double the vram and double the price,
Only double the price?? Where are you finding these amazing deals??
: D

BTW, they killed off the Quadro branding, years ago.

AMD gave up the high-end consumer, so what's the plan? A couple hundred dollar mid/low end and then nothing until $20k datacenter GPUs?
He sort of walked that back, with a reference to multi-chiplet GPUs. He just said the main area where they will be competing is for the midrange and below.
 
The difference is that in RDNA, you are using the same INT and FP units for regular usage and for matrix usage, just feeding them in a different manner to eke out some extra efficiency. If you're doing regular GPU stuff, those cores are doing regular FP and INT maths, and if you're doing AI matrix math then those same cores are instead used for that. If you want to do both at the same time... you need to partition or alternate.
On the Nvidia side of things, the Tensor cores are fixed-function blocks dedicated ONLY to FMA operations. You can't feed them regular math (ADD/MUL/etc.); you can only feed them matrices and run FMA.
No, they're exactly the same. In all cases:
  • it's a wavefront or warp instruction
  • the data is to/from the regular CU/SM SIMD registers
  • the wavefront or warp executing it can't issue another instruction, in that slot.

This difference in implementation is what leads to AMD's improved raster performance for similar GPUs: they're not 'wasting' die area on ... RT units (that accelerate BVH traversal), it's allllll general purpose compute.
Both Nvidia and AMD have some degree of fixed-function hardware for RT. Nvidia has never publicly said what their hardware implementation of RT looks like, but here's what AMD does:

"AMD implements raytracing acceleration by adding intersection test instructions to the texture units. Instead of dealing with textures though, these instructions take a box or triangle node in a predefined format. Box nodes can represent four boxes, and triangle nodes can represent four triangles. The instruction computes intersection test results for everything in that node, and hands the results back to the shader. Then, the shader is responsible for traversing the BVH and handing the next node to the texture units. RDNA 3 additionally has specialized LDS instructions to make managing the traversal stack faster."

Source: https://chipsandcheese.com/2023/03/22/raytracing-on-amds-rdna-2-3-and-nvidias-turing-and-pascal/
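In other words, the hot loop looks roughly like the sketch below (written in CUDA/HIP-style C++). The type names and intersect_node() are hypothetical stand-ins of mine: intersect_node() represents the texture-unit intersection instruction (the image_bvh_intersect_ray path described above), while the stack management and traversal are plain shader code.

```cuda
struct Ray     { float ox, oy, oz, dx, dy, dz, t_max; };
struct BvhNode { bool is_leaf; int child[4]; int prim_id[4]; };
struct NodeHit { bool hit[4]; float t[4]; };

// Placeholder body: on real RDNA hardware this step maps to the texture-unit
// intersection instruction, which tests the ray against one node's 4 boxes
// or 4 triangles. Here it just reports "no hits" so the sketch compiles.
__device__ NodeHit intersect_node(const BvhNode &, const Ray &)
{
    return NodeHit{};
}

// Shader-driven traversal: the hardware only answers "what did the ray hit
// in this node?"; the shader owns the stack and decides what to visit next.
__device__ int trace(const BvhNode *nodes, Ray ray)
{
    int   stack[32];               // traversal stack (RDNA 3 adds LDS helpers for this)
    int   sp = 0, best_prim = -1;
    float best_t = ray.t_max;
    stack[sp++] = 0;               // start at the root node

    while (sp > 0) {
        const BvhNode &n = nodes[stack[--sp]];
        NodeHit h = intersect_node(n, ray);              // hardware-assisted step
        for (int i = 0; i < 4; i++) {
            if (!h.hit[i] || h.t[i] >= best_t) continue;
            if (n.is_leaf) { best_t = h.t[i]; best_prim = n.prim_id[i]; }
            else           { stack[sp++] = n.child[i]; } // software traversal
        }
    }
    return best_prim;              // closest primitive hit, or -1 for a miss
}
```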

if you want to do general compute tasks (e.g. gaming) and Tensor tasks (e.g. DLSS) or RT, then RDNA needs to lower its raster performance in order to borrow some cores to do those tasks, whereas on the green side of things you're now using that previously dark die area.
Nope. Nvidia is exactly the same. Want to use DLSS? That's going to tie up some of the same SM resources that normal rendering would do. Look how the fp32 core utilization plunges right when the tensor cores are active, in the DLSS case (bottom):

[Chart: GPU core-type utilization (FP32, RT, Tensor) over time for different tests; the DLSS case is the bottom graph]


I find it interesting that the same appears to hold true for RT cores. However, because we don't see the same sharp drop-off before/after they become active, it's a little harder to tell if they're tying up resources that would be used for other things, or if there's just not much else to do, at that point in the rendering process. Still, the way all of the core types appear to sum to <= 100% implies they're all tying up warp instruction throughput.

This is why FSR does not use trained model upscaling - because it would be a direct trade of raster performance for upscaling performance.
No, it didn't use the same approach as Nvidia, because RDNA didn't have anything equivalent to their tensor product instructions. RDNA didn't get that functionality until RDNA 3. So, AMD had to resort to other upscaling techniques, simply due to the fact that their neural network inferencing throughput was so inferior.

DLSS gets the upscaling nearly 'for free' because most of it is performed on the Tensor cores, which would otherwise just not be doing anything.
It literally does not.

I think it's little coincidence that you've provided zero sources. Had you even tried, you'd have had to walk back most or all of your claims.
 
That’s COMPLETELY different from AMD just building GPUs that natively run CUDA code. If AMD doesn’t advertise that they “run CUDA” and it just magically works, what could Nvidia do?
That's not feasible without some serious software translation layer. The hardware/software interface of GPUs is a lot more "messy" than that of CPUs. With a CPU, there's an ISA standard, and if your CPU does the wrong thing, it's a bug and you must fix it. People can use any toolchain they want, and they expect the code to run exactly as it should.

With a GPU, the toolchain is provided by the GPU vendor and it has intimate knowledge of the hardware, including various different generations, their quirks, bugs, limitations, and capabilities. You couldn't make hardware that's natively compatible with compiled CUDA, without extensive reverse engineering. Even then, it wouldn't be as fast, and it would only be compatible at a specific point-in-time.

Also, Nvidia uses signed firmware images on its GPUs. I'm pretty certain the CUDA runtime can check to ensure the GPU is running signed Nvidia firmware, making it virtually impossible to run CUDA code on a fake GPU, without also hacking the CUDA runtime. And what serious customer wants that?
 
No, they're exactly the same. In all cases:
  • it's a wavefront or warp instruction
  • the data is to/from the regular CU/SM SIMD registers
  • the wavefront or warp executing it can't issue another instruction, in that slot.
The 'wavefront' is not executing anything, the wavefront is effectively the job dispatcher. It's not the component actually executing the instructions, that's entirely different hardware.

That's like discussing the difference between integer and float execution units and claiming they are the same because they are both addressed by the same instruction decoder. It's a fundamental misunderstanding of how the processor operates.
Nope. Nvidia is exactly the same. Want to use DLSS? That's going to tie up some of the same SM resources that normal rendering would do.
False. The Tensor cores can and do operate in parallel with the other ALUs in the CUDA cores. Even as far back as Volta, Nvidia tell you explicitly to do this for improved performance:
An additional optimization is to use CUDA cores and Tensor Cores concurrently. This can be achieved by using CUDA streams in combination with CUDA 9 WMMA.
If the execution were occurring on the same hardware, this would be a literal impossibility.
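For reference, the pattern Nvidia describes looks roughly like this (a minimal sketch of mine; wmma_kernel and vector_kernel are empty placeholders standing in for a tensor-core-heavy kernel and a plain fp32 kernel):

```cuda
#include <cuda_runtime.h>

__global__ void wmma_kernel()   { /* placeholder: tensor-core (wmma) work */ }
__global__ void vector_kernel() { /* placeholder: ordinary fp32 SIMT work */ }

void launch_concurrently()
{
    cudaStream_t s_tensor, s_vector;
    cudaStreamCreate(&s_tensor);
    cudaStreamCreate(&s_vector);

    // Independent streams allow the scheduler to overlap the two kernels;
    // how much they truly run side by side depends on free SM resources.
    wmma_kernel<<<256, 256, 0, s_tensor>>>();
    vector_kernel<<<256, 256, 0, s_vector>>>();

    cudaStreamSynchronize(s_tensor);
    cudaStreamSynchronize(s_vector);
    cudaStreamDestroy(s_tensor);
    cudaStreamDestroy(s_vector);
}
```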
Look how the fp32 core utilization plunges right when the tensor cores are active, in the DLSS case (bottom):

[Chart: GPU core-type utilization (FP32, RT, Tensor) over time for different tests; the DLSS case is the bottom graph]

Those are graphs of percentage utilisation over time for different tests, not normalised graphs of absolute utilisation for the same calculation performed by different dispatching (matrix FMA vs. the same calculations split out to individual instructions to address the ALUs). You cannot draw the conclusion you are claiming from them.
 
The 'wavefront' is not executing anything, the wavefront is effectively the job dispatcher. It's not the component actually executing the instructions, that's entirely different hardware.
The wavefront is an instruction stream, just like a warp or what CPU folks call a "thread". Nvidia likes to confuse people, by pretending each SIMD lane is a thread, so we have to talk about warps, instead.

That's like discussing the difference between integer and float execution units and claiming they are the same because they are both addressed by the same instruction decoder. It's a fundamental misunderstanding of how the processor operates.
First, we should rewind our mental model back to the era of in-order CPUs, like the original Pentium. Now, if you think about a single or dual-issue in-order CPU, you can appreciate that the instruction stream can only dispatch a max of one or two instructions each cycle, regardless of whether the backend has more potential for concurrency than that. If you're issuing a tensor product instruction, that displaces your ability to execute some other operation in that same slot.

the Tensor cores can and do operate in parallel with the other ALUs in the CUDA cores. Even as far back as Volta, Nvidia tell you explicitly to do this for improved performance:
I never said the hardware wasn't pipelined or had no concurrency. You're the one claiming that RDNA 3 is any different, in this regard.

Those are graphs of percentage utilisation over time for different tests, not normalised graphs of absolute utilisation for the same calculation performed by different dispatching (matrix FMA vs. the same calculations split out to individual instructions to address the ALUs). You cannot draw the conclusion you are claiming from them.
As I understand it, they're showing the warp instruction breakdown, with idle SMs or stalled warps issuing no instructions. All of these "cores" are eating from the same pie of instruction (and register) bandwidth. Calling them different "cores" gives the impression of more concurrency and decoupling than what actually exists.

In the article, Paul said that Tensor Cores are fundamentally different than WMMA. At best, there are some implementation differences in Tensor Cores that might support higher throughput, but they're not really different in kind.