News AMD Addresses Controversy: RDNA 3 Shader Pre-Fetching Works Fine

Yeah, but also consider that RTX 4000 has a lot more L2 cache than its predecessors. To the point that the RTX 4090 actually has more cache than the RX 7900 XTX has Infinity Cache!

So, when you say RDNA 3 underperformed, how much of that is comparing it with RDNA 2 versus comparing it with a new RTX that might have done more to catch up than we expected?
That gets too complicated. Nvidia is too different for some kind of clean comparison.
 
A prefetch would be useful for a hardware scheduler to preload textures before they are needed, using a predictive algorithm.
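Roughly something like this, as a pure software sketch - the class, the last-texture history table, and all the names here are made up for illustration, not taken from any real driver or hardware scheduler:

from collections import defaultdict

class TexturePrefetcher:
    """Toy predictive prefetcher: learns which texture tends to follow which."""
    def __init__(self):
        self.next_after = defaultdict(lambda: defaultdict(int))  # tex_id -> {successor: count}
        self.last_id = None

    def record_access(self, tex_id):
        # Learn "texture B tends to follow texture A" from the access stream.
        if self.last_id is not None:
            self.next_after[self.last_id][tex_id] += 1
        self.last_id = tex_id

    def predict_next(self):
        # Most frequently observed successor of the last texture, if any.
        candidates = self.next_after.get(self.last_id)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

prefetcher = TexturePrefetcher()
for tex in ["albedo_0", "normal_0", "albedo_1", "normal_1", "albedo_0", "normal_0"]:
    prefetcher.record_access(tex)
print(prefetcher.predict_next())  # -> "albedo_1", which a scheduler could pull into cache early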

There was talk of a standard to allow the GPU hardware to directly fetch from memory, completely bypassing the CPU. I'm not sure how this would work, as the memory controller (MCU) is on the CPU package these days, unless the GPU can address it on the CPU. But I haven't heard anything since.

There was talk for a long time of AMD's GPUs running at 2x the compute performance, which had NVIDIA panicking. AMD never squashed these rumors. We had a saying: "under-promise and over-deliver." AMD seems to go by the opposite philosophy. So it's obvious something didn't go as planned.

Can we say this feature is broken? They (AMD) are likely being quite honest. It's a feature that just didn't make release, and was considered an experimental "nice to have" but non-essential to release.
 
There's truth to this, but we also know of other examples, like how Vega shipped with some broken features... and we know that wasn't intentional, because they advertised them in their pre-launch messaging about it.

So, we always have to be skeptical about what we're being told, but we should also keep some perspective that chips have bugs, most of those bugs have workarounds, and most of them aren't performance-critical or else there would've been another respin.

I remember that. Dual-op math with reduced precision never became a reality. What a cluster that was. I mean, they (AMD) hinted at it, but it never came to pass. So the promised increases never materialized.

God I hate NVIDIA, but AMD just keeps letting us down. The RT gap has grown this gen and I think their last gen was closer to taking the crown.
 
I remember that. Dual-op math with reduced precision never became a reality. What a cluster that was. I mean, they (AMD) hinted at it, but it never came to pass. So the promised increases never materialized.
What I had in mind was the Draw Stream Binning Rasterizer, and wasn't there something broken with another unit for doing Mesh Shaders, or something like that?

God I hate NVIDIA, but AMD just keeps letting us down. The RT gap has grown this gen and I think their last gen was closer to taking the crown.
The RX 6000 wasn't a let-down, was it? A lot of people complained about the small amount of rasterization improvements in the RTX 2000 series, so I don't think Nvidia's track record is perfect, either.

I share your wishes for more competitive ray tracing performance. Aside from that, the minimum AMD needs to do is deliver performance that's price-competitive.

If there's a silver lining to this news, it's that the RX 7000 series is a node behind Nvidia and lacking at least a couple of features. That means there's more gas left in the tank for the 8000 series to deliver further improvements without necessarily having to make their chips a lot bigger (i.e. more expensive). Plus, the cache chiplets make a lot more sense now that we know SRAM scales so poorly. Hopefully, the 8000-series will further refine how the cache dies are used, in order to squeeze more performance from them.
 
What I had in mind was the Draw Stream Binning Rasterizer, and wasn't there something broken with another unit for doing Mesh Shaders, or something like that?


The RX 6000 wasn't a let-down, was it? A lot of people complained about the small amount of rasterization improvements in the RTX 2000 series, so I don't think Nvidia's track record is perfect, either.

I share your wishes for more competitive ray tracing performance. Aside from that, the minimum AMD needs to do is deliver performance that's price-competitive.

If there's a silver lining to this news, it's that the RX 7000 series is a node behind Nvidia and lacking at least a couple of features. That means there's more gas left in the tank for the 8000 series to deliver further improvements without necessarily having to make their chips a lot bigger (i.e. more expensive). Plus, the cache chiplets make a lot more sense now that we know SRAM scales so poorly. Hopefully, the 8000-series will further refine how the cache dies are used, in order to squeeze more performance from them.
Nvidia is on TSMC 5nm, same as AMD. The only difference is Nvidia worked with TSMC to tweak their 5nm to suit Nvidia's needs and then called it 4N to have a paper win. The full name is 4N 5nm.
https://www.techgoing.com/nvidia-cl...the,to a large number of media writing errors.
 
What I had in mind was the Draw Stream Binning Rasterizer, and wasn't there something broken with another unit for doing Mesh Shaders, or something like that?


The RX 6000 wasn't a let-down, was it? A lot of people complained about the small amount of rasterization improvements in the RTX 2000 series, so I don't think Nvidia's track record is perfect, either.
Yep those never materialized either. I forgot about those missing features.

Also, the 6000 series was the most successful launch since GCN 1 (7970), IMHO. But they made it $50 cheaper while lacking a DLSS competitor or anything close in RT to the 3080.

I know it was wishful pricing due to mining. However, the 6800 XT wasn't worth it unless you wanted raw raster at 1440p and below.

This gen has an even wider gap. Something got flubbed somewhere, given the early claims of double the compute performance.
 
AMD addresses a recent flurry of rumors claiming that its silicon was shipped in an unfinished state.

AMD Addresses Controversy: RDNA 3 Shader Pre-Fetching Works Fine : Read more

Shock as internet click hunters sensationalise a non-event, much like the inability of people to push a plug all the way in. These things always turn out to be blown out of proportion by fanboys of an opposing brand and fed on by 'youtubers' who'll talk about anything as long as they can post it.
 
There's truth to this, but we also know of other examples, like how Vega shipped with some broken features... and we know that wasn't intentional, because they advertised them in their pre-launch messaging about it.

So, we always have to be skeptical about what we're being told, but we should also keep some perspective that chips have bugs, most of those bugs have workarounds, and most of them aren't performance-critical or else there would've been another respin.

The only feature that wasn’t enabled in Vega was primitive shaders, which AMD could get working in professional applications, but not in games; Vega shipped with fixed function geometry engines, so the larger issue was the whitepaper hype of up to 17x (!!) higher primitive cull rates. DSBR (tiled-immediate mode rasterizer) worked and was good for about 10% increased performance, clock-for-clock over Fiji. Packed FP16 (2xFP16) also worked.

Vega also had a revision to TSMC N7 as Vega 20 (enabling 4xINT8, 8xINT4, and DOT8/DOT4 instructions), which was primarily for Instinct MI50/60, but also spawned Radeon VII and Vega II Pro Duo (Mac). No, primitive shaders were not enabled. RDNA1/Navi 10, with its redesigned geometry subsystem, had primitive shaders from the start and that continues with RDNA3/Navi 31.
 
The only feature that wasn’t enabled in Vega was primitive shaders, which AMD could get working in professional applications, but not in games; Vega shipped with fixed function geometry engines, so the larger issue was the whitepaper hype of up to 17x (!!) higher primitive cull rates. DSBR (tiled-immediate mode rasterizer) worked and was good for about 10% increased performance, clock-for-clock over Fiji.
Yes, Primitive Shaders are what I was trying to remember. As for DSBR, that wasn't enabled until at least 6 months after launch? Smells like a hardware bug requiring complex workarounds. 10% is quite likely a lot less than Nvidia got from tile-based rendering, which further suggests maybe a performance-robbing workaround was needed.
 
The only feature that wasn’t enabled in Vega was primitive shaders, which AMD could get working in professional applications, but not in games; Vega shipped with fixed function geometry engines, so the larger issue was the whitepaper hype of up to 17x (!!) higher primitive cull rates. DSBR (tiled-immediate mode rasterizer) worked and was good for about 10% increased performance, clock-for-clock over Fiji. Packed FP16 (2xFP16) also worked.

Vega also had a revision to TSMC N7 as Vega 20 (enabling 4xINT8, 8xINT4, and DOT8/DOT4 instructions), which was primarily for Instinct MI50/60, but also spawned Radeon VII and Vega II Pro Duo (Mac). No, primitive shaders were not enabled. RDNA1/Navi 10, with its redesigned geometry subsystem, had primitive shaders from the start and that continues with RDNA3/Navi 31.
2xFP16 (reduced-precision packed mode) never materialized into any gains. Most ops only need fp16, but for some reason that packed mode never benefited AMD the way it was hinted it would.

Underutilization of the wavefronts' resources is what led AMD to abandon GCN.
 
The Ryzen 1000 launch was "incredibly rough"? I bought an 1800X on release, paired it with 64GB DRAM from the QVL, and it worked fine from Day 1. Other than enabling XMP I did nothing to juice it up. My impression then and now was that 90% of the "problems" were people who ignored the QVL and / or chose to go beyond stock settings. AMD advertised stock performance at a ground-breaking price for 8 cores, and delivered. You run it 1 Hz higher than stock and problems are on you.
 
2xFP16 (reduced-precision packed mode) never materialized into any gains. Most ops only need fp16, but for some reason that packed mode never benefited AMD the way it was hinted it would.
It wasn't a general capability with a full contingent of instructions. Vega had only 5 instructions that used it, and those were mostly aimed at deep learning.
  • Min
  • Max
  • Add
  • Mul
  • FMA

There are other _F16 instructions, but these are the only packed ("PK") ones.

See https://developer.amd.com/wp-content/resources/Vega_Shader_ISA.pdf (page 44)
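To make the lane-wise 2xFP16 semantics concrete, here's a rough numpy sketch: two fp16 values share one 32-bit register and get operated on per lane, which is the basic idea behind something like V_PK_ADD_F16 from that list. It's only a behavioral sketch - exact rounding/denormal handling is whatever the ISA manual says, not whatever numpy does:

import numpy as np

def pack_f16x2(lo, hi):
    # Pack two fp16 values into one 32-bit word (lo in bits 0-15, hi in bits 16-31).
    lo_bits = int(np.array([lo], dtype=np.float16).view(np.uint16)[0])
    hi_bits = int(np.array([hi], dtype=np.float16).view(np.uint16)[0])
    return (hi_bits << 16) | lo_bits

def unpack_f16x2(word):
    lo = float(np.array([word & 0xFFFF], dtype=np.uint16).view(np.float16)[0])
    hi = float(np.array([(word >> 16) & 0xFFFF], dtype=np.uint16).view(np.float16)[0])
    return lo, hi

def pk_add_f16(a_word, b_word):
    # Lane-wise add of two packed fp16 pairs, in the spirit of V_PK_ADD_F16.
    a_lo, a_hi = unpack_f16x2(a_word)
    b_lo, b_hi = unpack_f16x2(b_word)
    return pack_f16x2(a_lo + b_lo, a_hi + b_hi)

a = pack_f16x2(1.5, 2.25)
b = pack_f16x2(0.5, 0.75)
print(unpack_f16x2(pk_add_f16(a, b)))  # (2.0, 3.0) - both lanes computed from one 32-bit word per operand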
 
I wish more devs would use fp24, a nice balance between fp16 & fp32.
What would be the benefit? All fp32 operations issue in a single clock cycle. Some fp16 operations can be doubled up, but the other benefit of fp16 is that you can fit an fp16 image in half the memory that it would take to represent it as fp32.

I don't see a real benefit to fp24, because the packing/unpacking is awkward and would impose overhead to do, and there's no throughput advantage.

It's also supported in AMD GPU's.
Not from what I can tell. Are you thinking of like 20 years ago, when AMD first introduced floating point shaders? Because, I know they started out with some kind of 24-bit floats... but I believe they standardized on IEEE 754 in the HD 4000 era, or so.

Perhaps more importantly, it's not a standard data type in GLSL. Not sure about HLSL (D3D).
 
What would be the benefit? All fp32 operations issue in a single clock cycle. Some fp16 operations can be doubled up, but the other benefit of fp16 is that you can fit an fp16 image in half the memory that it would take to represent it as fp32.
It all depends on how large of a numerical range your data needs to use. Some workloads need more than what the span of fp16 can offer.
But you don't need the full range that fp32 provides.
By using fp24, you use less data overall when your dataset doesn't need to span the full breadth of fp32.

I don't see a real benefit to fp24, because the packing/unpacking is awkward and would impose overhead to do, and there's no throughput advantage.
There's no real overhead when it's supported in hardware, which AMD had previously announced support for in their GPUs.

Not from what I can tell. Are you thinking of like 20 years ago, when AMD first introduced floating point shaders? Because, I know they started out with some kind of 24-bit floats... but I believe they standardized on IEEE 754 in the HD 4000 era, or so.
Alternative floating point standards are being used in the industry as is.
Look at bFloat16, it co-exists with standard fp16

Then there's nVIDIA's maddening 19-bit TensorFloat
Why on Earth would nVIDIA make a 19-bit floating point data-type?
It doesn't even Byte-Align, so you have 5-bits wasted when being stored in every 3-bytes.

At least AMD's fp24 is a slight variance of Pixar's PXR24.

Adding support in your FPU for all those data types, when you already support the full range of IEEE 754 fp data types, is a good idea in this day and age, when people have different needs for data types of different sizes.

Think about it, Pixar has their own 24-bit fp data type, that has a VERY minor difference with AMD's fp24.

Perhaps more importantly, it's not a standard data type in GLSL. Not sure about HLSL (D3D).
Back in the day, DX 9.0 had a minimum of fp24 required to support the spec.

So it worked back in the day.

But with modern computing, having flexibility of data type and size can affect your Data-set size, computational performance, etc.

That's why there are so many weird floating point sizes in existence now that need hardware support to fully benefit.
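For reference, here's roughly how the formats being thrown around in this thread split into sign/exponent/mantissa bits. The fp16/bfloat16/TF32/fp32 rows are well documented; the two 24-bit rows are the commonly cited layouts for ATI's old fp24 and Pixar's PXR24, so treat those as approximate:

# (sign, exponent, mantissa) bit counts for the formats mentioned in this thread.
formats = {
    "fp16":          (1, 5, 10),
    "bfloat16":      (1, 8, 7),
    "TF32 (Nvidia)": (1, 8, 10),  # 19 bits total, hence the byte-alignment complaint
    "fp24 (ATI)":    (1, 7, 16),  # commonly cited layout; approximate
    "PXR24 (Pixar)": (1, 8, 15),  # commonly cited layout; approximate
    "fp32":          (1, 8, 23),
}

for name, (sign, exp, man) in formats.items():
    print(f"{name:14s} total={sign + exp + man:2d} bits  exponent={exp}  mantissa={man}")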
 
It all depends on how large of a numerical range your data needs to use. Some workloads need more than what the span of fp16 can offer.
But you don't need the full range that fp32 provides.
By using fp24, you use less data overall when your dataset doesn't need to span the full breadth of fp32.
I'm not asking about merely theoretical benefits, though. What I was asking is: if their GPU lacks hardware support for packing/unpacking and fp24 issues at 1/cycle - the same as fp32 - then it's cumbersome to work with (due to manual packing/unpacking), so for what benefit?

There's no real overhead when it's supported in hardware, which AMD had previously announced support for in their GPUs.
Look at the Vega ISA manual I linked a couple posts ago, and tell me where it says they support 24-bit floating point in hardware.

Alternative floating point standards are being used in the industry as is.
Look at bFloat16, it co-exists with standard fp16
I know about BFloat16. It has many of the same advantages as fp16, except it trades some precision for more range (and easy conversion to/from fp32). But, we weren't talking about BFloat16!

Then there's nVIDIA's maddening 19-bit TensorFloat
Why on Earth would nVIDIA make a 19-bit floating point data-type?
It doesn't even Byte-Align, so you have 5-bits wasted when being stored in every 3-bytes.
According to this, it sounds like TF32 is just an in-register format? It probably just rounds the fraction to 10 bits.



The reason for reducing precision is to make the Tensor ALUs smaller, simpler and more power-efficient. The size of a FP multiplier is supposed to grow as the square of the number of fractional bits. That was one of the main arguments for BFloat16, but I guess someone decided it could use another 3 bits of precision. I think that makes it applicable to a much larger problem domain, such as audio processing, but maybe it also helps improve model convergence times.
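Back-of-the-envelope, that rule of thumb works out to something like this (it ignores exponent handling, normalization, and everything else in a real FPU, so it's only a rough intuition):

# "Multiplier area ~ mantissa bits squared" rule of thumb, relative to fp32.
mantissa_bits = {"fp32": 23, "TF32": 10, "fp16": 10, "bfloat16": 7}
fp32_area = mantissa_bits["fp32"] ** 2
for name, bits in mantissa_bits.items():
    print(f"{name:9s} ~{bits ** 2 / fp32_area:.0%} of an fp32 multiplier")  # TF32/fp16 ~19%, bfloat16 ~9%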

At least AMD's fp24 is a slight variance of Pixar's PXR24.
That's compressed using a dictionary scheme. If it's not supported in hardware (which I highly doubt), then decode performance on a GPU will be terrible. If you need to load or save images in PXR24, just convert them to/from a standard texture format on the CPU.

Adding support in your FPU for all those data types, when you already support the full range of IEEE 754 fp data types, is a good idea in this day and age, when people have different needs for data types of different sizes.
Not really, because modern GPUs have many thousands of ALUs, so any feature you add to them uses tens of thousands of times as much die area as just adding it to a single ALU.

Think about it, Pixar has their own 24-bit fp data type, that has a VERY minor difference with AMD's fp24.
Pixar has been around since the 1980's. There might be a lot of stuff out there with their name on it, that they themselves no longer even use.

But the bigger issue with AMD having some proprietary floating point format "just because", is that you need people to write AMD-specific shaders for it to serve any purpose, and game developers aren't going to do that if the benefits of using it don't significantly outweigh the effort of using it. Even then, a lot of game developers still won't bother.

Back in the day, DX 9.0 had a minimum of fp24 required to support the spec.

So it worked back in the day.
A "minimum" implies the implementation can go beyond. They probably specified it that way, so that DX 9 would run on old cards that only had fp24.
 
I've got a hypothesis as to why RDNA 3 shaders underperform relative to RDNA 2: the Infinity Cache has been diminished. RDNA 2 performed unexpectedly well vs RDNA 1 when it got it, and now that cache has been split into little 16 MB chunks.
The cache helped not just bandwidth, but latency as well. Like the X3D.
It would be interesting to see a comparison of the three architectures and see if RDNA 3 acts more like a large RDNA 1 than RDNA 2, and if there are games that favor cache on the GPU side.


I think it's also because parts of this GPU share some of the same resources, so it can't stretch its legs properly. If I recall, didn't AMD's Bulldozer CPUs do something similar? Granted, GPUs are much different.

(Along with the extra 32-bit floating-point compute, AMD also doubled the matrix (AI) throughput as the AI Matrix Accelerators appear to at least partially share some of the execution resources.)

I do agree that splitting it into 16 MB chunks didn't help; maybe the scaled-down GPUs will shed some light on that.
 
1 bit is the sign bit, 8 bits are for numeric range (same as in fp32), and 10 bits are for precision (same bit count as in fp16),
so it is less precise than fp32, but it supports the full fp32 range.
So it's basically fp32 with 18 bits of data; it's a hybrid for sure.
But then there are 5 bits that are wasted because it isn't "Byte-Aligned".
2 Bytes = 16 bits
nVIDIA Tensor Float = 19 bits
3 Bytes = 24 bits

Seriously, the person who created the Tensor Float couldn't "Byte-Align" his new numerical floating-point Data-Type?
 
I'm not asking about merely theoretical benefits, though. What I was asking is: if their GPU lacks hardware support for packing/unpacking and fp24 issues at 1/cycle - the same as fp32 - then it's cumbersome to work with (due to manual packing/unpacking), so for what benefit?
The whole point is to have hardware support so you can use the data type. If it's not supported in hardware anymore, then they'll deprecate support for it.

Look at the Vega ISA manual I linked a couple posts ago, and tell me where it says they support 24-bit floating point in hardware.
I don't see it in there; AMD must've moved on, then.
The last mention of fp24 was with the Radeon R300 series in the Wikipedia article.
ATI's Radeon chips did not go above FP24 until R520.
So it seems AMD moved past FP24 by the time of R520.

I know about BFloat16. It has many of the same advantages as fp16, except it trades some precision for more range (and easy conversion to/from fp32). But, we weren't talking about BFloat16!
=(

According to this, it sounds like TF32 is just an in-register format? It probably just rounds the fraction to 10 bits.



The reason for reducing precision is to make the Tensor ALUs smaller, simpler and more power-efficient. The size of a FP multiplier is supposed to grow as the square of the number of fractional bits. That was one of the main arguments for BFloat16, but I guess someone decided it could use another 3 bits of precision. I think that makes it applicable to a much larger problem domain, such as audio processing, but maybe it also helps improve model convergence times.
I wish IEEE 754 would just officially certify fp24 and call it a day. We need more fp data types of varying sizes so we can use the one with the appropriate size.

That's compressed using a dictionary scheme. If it's not supported in hardware (which I highly doubt), then decode performance on a GPU will be terrible. If you need to load or save images in PXR24, just convert them to/from a standard texture format on the CPU.
Isn't that going to be super slow to convert to/from on CPU?

Not really, because modern GPUs have many thousands of ALUs, so any feature you add to them uses tens of thousands of times as much die area as just adding it to a single ALU.
So we should be more picky as to which data types we should support?

Pixar has been around since the 1980's. There might be a lot of stuff out there with their name on it, that they themselves no longer even use.
Apparently, the reason PXR24 isn't as popular is because it's a lossy compression format.
=(

But the bigger issue with AMD having some proprietary floating point format "just because", is that you need people to write AMD-specific shaders for it to serve any purpose, and game developers aren't going to do that if the benefits of using it don't significantly outweigh the effort of using it. Even then, a lot of game developers still won't bother.
Then let's NOT make it "AMD Proprietary". Let's get IEEE 754 to certify it, get it added in properly to all the compilers.
Get the Hardware vendors to properly support the new fp data types.
Let's make these other sized fp data types a legit format to use.

A "minimum" implies the implementation can go beyond. They probably specified it that way, so that DX 9 would run on old cards that only had fp24.
IC, but I can see value in using smaller data types that aren't the standard IEEE 754 approved fp16, fp32, fp64, fp128, fp256
There are so many other fp data types that need love:
fp08, fp24, fp40, fp48, fp56, fp80, fp96, fp112, fp160, fp192, fp224
They all need love =D
 
But then there are 5 bits that are wasted because it isn't "Byte-Aligned".
2 Bytes = 16 bits
nVIDIA Tensor Float = 19 bits
3 Bytes = 24 bits

Seriously, the person who created the Tensor Float couldn't "Byte-Align" his new numerical floating-point Data-Type?
Tensor cores pretty much remove byte-alignment requirements... at least that's what Nvidia says, and performance goes up versus fp32... sooo

(image: Nvidia chart comparing tensor-core performance at different data alignments - Align1/2/4/8)


But I think we went a little off-topic; the thread is about AMD 😛
 
Tensor cores pretty much remove byte-alignment requirements... at least that's what Nvidia says, and performance goes up versus fp32... sooo
That's not how I read it. The text at the bottom indicates Align1 is 2-byte-aligned, Align2 is 4-byte-aligned, and Align8 is 16-byte-aligned. And even on the V100, the performance delta between Align2 and Align8 is easily big enough to justify some extra trouble.

But it's irrelevant to @Kamen Rider Blade 's concern, which was that TF32 wouldn't be byte-aligned. Well, in the post I linked, they said that TF32 is by definition compatible with fp32. That must mean TF32 values just have zeroed-out low-order bits. The values still reside in 32-bit registers, and when you write them out to memory, they'll still be 32 bits each.
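To make the "zeroed-out low-order bits" idea concrete, here's a tiny sketch that clears an fp32 mantissa down to TF32's 10 bits - the result is still a perfectly ordinary 32-bit float. (The actual hardware conversion presumably rounds rather than truncates, so this is only illustrative.)

import struct

def to_tf32_like(x):
    # Keep sign (1) + exponent (8) + top 10 mantissa bits; clear the low 13 mantissa bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFFE000
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y

x = 1.2345678
print(x, "->", to_tf32_like(x))  # still a normal fp32 value, just with reduced precision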
 
That's not how I read it. The text at the bottom indicates Align1 is 2-byte-aligned, Align2 is 4-byte-aligned, and Align8 is 16-byte-aligned. And even on the V100, the performance delta between Align2 and Align8 is easily big enough to justify some extra trouble.
The text at the bottom says 16-bit multiplied by n: align1 = 16-bit, align2 = 32-bit, align4 = 64-bit, align8 = 128-bit.

Well, in the post I linked, they said that TF32 is by definition compatible with fp32. That must mean TF32 values just have zeroed-out low-order bits. The values still reside in 32-bit registers, and when you write them out to memory, they'll still be 32 bits each.
Input from FP32, output to FP32, operands are rounded to FP16-level precision, and code will see it as fp32... there are some bandwidth savings, while the memory footprint remains the same as with fp32.
It's easier to use than FP16/bfloat16, which are used for performance reasons (not for memory footprint reasons), and it's way faster than fp16 (at least on Ampere and up).