And once again, a wavefront is not an execution unit any more than a 'thread' is an FPU. You're fundamentally missing the point.

The wavefront is an instruction stream, just like a warp, or what CPU folks call a "thread". Nvidia likes to confuse people by pretending each SIMD lane is a thread, so we have to talk about warps instead.
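To make the terminology concrete, here's a minimal CUDA sketch of my own (not from the article): the 32 "threads" in a warp are really SIMD lanes marching through one shared instruction stream, which is exactly why warp-wide primitives like __shfl_sync can move data between lanes in a single instruction.

```cpp
#include <cstdio>

__global__ void one_stream_many_lanes() {
    int lane = threadIdx.x & 31;              // lane index within the warp
    // Every lane executes this same add; the "parallelism" here is SIMD width,
    // not 32 independent instruction streams.
    int value = lane * 2;
    // Warp-wide shuffle: lane 0's value is broadcast to all 32 lanes in one instruction.
    int broadcast = __shfl_sync(0xffffffff, value, 0);
    if (lane == 0)
        printf("lane 0 computed %d, every lane received %d\n", value, broadcast);
}

int main() {
    one_stream_many_lanes<<<1, 32>>>();       // exactly one warp
    cudaDeviceSynchronize();
    return 0;
}
```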
Instruction issuance is not instruction execution. Instructions take more than one cycle to complete. We're talking about the actual execution, which takes place over multiple clock cycles, not the dispatching.

First, we should rewind our mental model back to the era of in-order CPUs, like the original Pentium. If you think about a single- or dual-issue in-order CPU, you can appreciate that the instruction stream can dispatch at most one or two instructions each cycle, regardless of how much concurrency the backend could sustain. If you're issuing a tensor product instruction, that displaces your ability to issue some other operation in that same slot.
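If it helps, here's a toy front-end model (purely my own illustration, not any real ISA): a dual-issue in-order machine dispatches at most two instructions per cycle, so every "mma" that goes through the front end is an issue slot a scalar op could not use that cycle, regardless of how long either takes to execute afterwards.

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Instr { std::string name; };

// Cycles of front-end bandwidth needed just to dispatch the stream.
int cycles_to_issue(const std::vector<Instr>& stream, int issue_width) {
    int cycles = 0;
    for (size_t i = 0; i < stream.size(); i += issue_width)
        ++cycles;                              // each cycle dispatches up to issue_width instructions
    return cycles;
}

int main() {
    std::vector<Instr> stream = {
        {"fadd"}, {"mma"}, {"fmul"}, {"mma"}, {"fadd"}, {"fadd"}
    };
    int mma_slots = 0;
    for (const auto& ins : stream)
        if (ins.name == "mma") ++mma_slots;    // slots consumed by matrix ops

    printf("dual-issue front end: %d cycles just to dispatch\n",
           cycles_to_issue(stream, 2));
    printf("%d issue slots went to matrix ops instead of scalar ops\n", mma_slots);
    return 0;
}
```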
It's physically different hardware performing the execution.

In the article, Paul said that Tensor Cores are fundamentally different from WMMA. At best, there are some implementation differences in Tensor Cores that might support higher throughput, but they're not really different in kind.
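For reference, this is roughly what the warp-level WMMA path looks like in CUDA (a bare-bones sketch, error checking omitted; needs one warp and sm_70 or newer). The whole warp issues mma_sync out of its single instruction stream, and the matrix hardware executes it:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a 16x16x16 tile: FP16 inputs, FP32 accumulation.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                  // C tile = 0 for simplicity
    wmma::load_matrix_sync(a_frag, a, 16);                // cooperative, warp-wide loads
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // the matrix multiply-accumulate
    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b; float *c;
    cudaMalloc((void**)&a, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&b, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&c, 16 * 16 * sizeof(float));
    cudaMemset(a, 0, 16 * 16 * sizeof(half));
    cudaMemset(b, 0, 16 * 16 * sizeof(half));
    wmma_16x16x16<<<1, 32>>>(a, b, c);                    // a single warp issues the whole tile op
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```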

If the Tensor cores were merely re-using the CUDA ALUs, then they would not be able to achieve the performance deltas seen in real-world benchmarking. e.g. an FP16 4x4 FMA operation (multiplying two 4x4 matrices together and then adding an additional 4x4 matrix) is composed of 64 fused multiply-add operations (4 × 4 × 4). If Tensor cores were merely ALUs working together, we would expect general-purpose FP16 performance to be the same, or very close. Instead, if we look at the H100 (for example), standard FP16 performance is 134 TFLOPS (double the 67 TFLOPS FP32 performance due to packed math, as expected) but Tensor FP16 performance is 1979 TFLOPS. If Tensor cores were merely the ALUs with different dispatching, then that would mean that over 90% of them just sit idle for no reason rather than contributing to regular compute. That's a lot of wasted die area!
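To spell out the arithmetic (a plain reference sketch of mine, using float instead of FP16 for readability): D = A*B + C on 4x4 tiles is exactly 4 × 4 × 4 = 64 multiply-accumulates, which is the work one tensor MMA step covers.

```cpp
#include <cstdio>

// Reference arithmetic only: D = A*B + C on 4x4 matrices, counting the FMAs.
void mma_4x4(const float A[4][4], const float B[4][4],
             const float C[4][4], float D[4][4], int *fma_count) {
    *fma_count = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                // the "accumulate" input matrix
            for (int k = 0; k < 4; ++k) {
                acc += A[i][k] * B[k][j];       // one FMA per (i, j, k) triple
                ++*fma_count;
            }
            D[i][j] = acc;
        }
}

int main() {
    float A[4][4] = {}, B[4][4] = {}, C[4][4] = {}, D[4][4];
    int fmas = 0;
    mma_4x4(A, B, C, D, &fmas);
    printf("4x4 MMA = %d fused multiply-adds\n", fmas);   // prints 64
    return 0;
}
```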
In reality, you have 1979 TFLOPS of FP16 operations... only if those operations are part of matrix FMAs. You can't 'split out' those FLOPS for other calculations; the Tensor cores physically cannot decompose themselves into generic ALUs, because they're not built that way. The entire reason they can pack so many operations per unit of die area is their hyperspecialisation in capability. This is why you can run the Tensor cores in parallel with other generic ALU operations: it's two different pieces of hardware doing two different things.
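And just to illustrate that last point (again a sketch of my own, not a benchmark): a single warp can interleave tensor MMAs with ordinary FP32 FMAs in its one instruction stream, and because the matrix units and the vector ALUs are separate hardware, the two kinds of work can overlap in execution. Whether they overlap cycle-for-cycle depends on the scheduler, but the point is that they target different pipelines.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Each loop iteration issues one tensor MMA (matrix pipeline) and one scalar FP32 FMA
// (vector ALU pipeline) from the same warp's instruction stream.
__global__ void mixed_pipelines(const half *a, const half *b, float *c, float *y, int iters) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);

    float s = threadIdx.x * 0.5f;
    for (int i = 0; i < iters; ++i) {
        wmma::mma_sync(acc, a_frag, b_frag, acc);   // executed by the matrix units
        s = fmaf(s, 1.0001f, 0.25f);                // executed by the ordinary FP32 ALUs
    }
    y[threadIdx.x] = s;
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b; float *c, *y;
    cudaMalloc((void**)&a, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&b, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&c, 16 * 16 * sizeof(float));
    cudaMalloc((void**)&y, 32 * sizeof(float));
    cudaMemset(a, 0, 16 * 16 * sizeof(half));
    cudaMemset(b, 0, 16 * 16 * sizeof(half));
    mixed_pipelines<<<1, 32>>>(a, b, c, y, 1000);   // one warp, needs sm_70 or newer
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(y);
    return 0;
}
```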