News AMD Addresses Controversy: RDNA 3 Shader Pre-Fetching Works Fine

bit_user · Dec 20, 2022

kerberos_20 said:
text at bottom says 16bit muplied by n, align1 = 16bit, align2 = 32bit, align4 = 64bit, align8 = 128bit

Yes. And what that means is what I said. None of it has anything to do with @Kamen Rider Blade 's thinking that TF32 values are packed in memory.

Kamen Rider Blade · Dec 20, 2022

How many extra operations must you perform every time you convert to & from register just to use TF32?

What about when you have to store the data.

That's extra work / operations just to support TF32 if it's not natively supported in hardware & spec as a native fp type.

bit_user · Dec 20, 2022

Kamen Rider Blade said:
How many extra operations must you perform every time you convert to & from register just to use TF32?

The code sample from the link I posted makes it look like a single instruction converts from FP32 -> TF32. Probably for an entire 32-element Warp.

Kamen Rider Blade said:
What about when you have to store the data.

That same link said there's no inverse conversion instruction, because TF32 are conformant to FP32. That makes it sound like the conversion instruction just rounds off the fractional part and replaces the low-order bits with 0's.

Kamen Rider Blade said:
That's extra work / operations just to support TF32 if it's not natively supported in hardware & spec as a native fp type.

A conversion instruction and the fact that the tensor cores support it means it is supported in the HW, IMO.

Kamen Rider Blade · Dec 20, 2022

bit_user said:
The code sample from the link I posted makes it look like a single instruction converts from FP32 -> TF32. Probably for an entire 32-element Warp.

I wonder how much work goes underneath the instruction to get it to convert.

That same link said there's no inverse conversion instruction, because TF32 are conformant to FP32. That makes it sound like the conversion instruction just rounds off the fractional part and replaces the low-order bits with 0's.

I'm not much into AI, so I don't know if "rounding off the fractional part and replacing the low-order bits with 0's" is 'Ok' to do.

A conversion instruction and the fact that the tensor cores support it means it is supported in the HW, IMO.

In nVIDIA hardware specifically.
It's not part of IEEE 754 though, it's nVIDIA's own variant for their hardware, which has massive marketshare.
So that's fine for them.

But the bigger question is, should there be larger adoptions in terms of the wider Standards Body in the programming / hardware community?

Or is there a alternative solution?

kerberos_20 · Dec 20, 2022

Kamen Rider Blade said:
I wonder how much work goes underneath the instruction to get it to convert.

I'm not much into AI, so I don't know if "rounding off the fractional part and replacing the low-order bits with 0's" is 'Ok' to do.

In nVIDIA hardware specifically.
It's not part of IEEE 754 though, it's nVIDIA's own variant for their hardware, which has massive marketshare.
So that's fine for them.

But the bigger question is, should there be larger adoptions in terms of the wider Standards Body in the programming / hardware community?

Or is there a alternative solution?

alternative solution is fp16 or bfloat16, which also tells you if its okay to use it or not, you just dont use fp16 if you need high precision

bit_user · Dec 20, 2022

kerberos_20 said:
alternative solution is fp16 or bfloat16,

To @Kamen Rider Blade 's point about standardization, fp16 is actually included in the 2008 revision of IEEE 754.

It was only about 5 or 6 years ago that the Deep Mind folks (a UK-based company that Google acquired) decided fp16 had the wrong balance between precision and range, for AI. That's when they proposed BFloat16, which they added to Google's own TPUs and many hardware vendors have adopted, since. Interestingly, it was not included in the 2020 update of IEEE 754. If it continues to see widespread use, maybe it'll appear in the 2028 revision.

If you want to understand the rationale behind Nvidia TF32 format, here's probably the best explanation you'll find:

NVIDIA Blogs: TensorFloat-32 Accelerates AI Training HPC upto 20x

NVIDIA's Ampere architecture with TF32 speeds single-precision work, maintaining accuracy and using no new code.

blogs.nvidia.com

Something I hadn't previously noticed: it has just enough precision to exactly represent all fp16 values! That makes it a superset of both BFloat16 and fp16! In fact, it's the smallest such representation.

Kamen Rider Blade · Dec 21, 2022

bit_user said:
Something I hadn't previously noticed: it has just enough precision to exactly represent all fp16 values! That makes it a superset of both BFloat16 and fp16! In fact, it's the smallest such representation.

If that's the case, implementing a fp24 wouldn't be a bad idea.
AMD had implemented fp24 in the past, I'm sure they can implment it again in their hardware.
AMD have a Special version of fp24 that matches Pixar's PXR24 format in bit size, but was slightly off in bit-layout.
I'm sure they can make a slight adjustment to support PXR24.

This way it covers everything nVIDIA's TensorFloat covers & more.
_1-bit: Sign
_8-bit: Exponent
15-bit: fraction
That easily "Meets & Exceeds" nVIDIA's TensorFloat while being "Bit-Aligned".

Instead of bFloat16, we can call it aFloat24 =D

It should easily allow casting from FP32 down to AMD's FP24 formats.

Then we need to convince IEEE 754 that it's worth adding these FP24 formats into the standard and get all the compilers to support it.

bit_user · Dec 21, 2022

Kamen Rider Blade said:
Then we need to convince IEEE 754 that it's worth adding these FP24 formats into the standard and get all the compilers to support it.

I still don't know what problem you're trying to solve, here.

Anyway, a nice thing about TF32 is that it's already usable as fp32. The only compiler support needed is to convert to it, but you do that on the GPU and Nvidia's tools already support it.

Kamen Rider Blade · Dec 21, 2022

bit_user said:
I still don't know what problem you're trying to solve, here.

Wider Standards support for more fp data types on CPU & across the industry in all sorts of hardware.

Anyway, a nice thing about TF32 is that it's already usable as fp32. The only compiler support needed is to convert to it, but you do that on the GPU and Nvidia's tools already support it.

But that's proprietary to nVIDIA.
I'm not a big fan of "Proprietary" in any way / shape / form.

I like wide open standards support by the industry using the same data types.
Making things nice and portable between all vendors of Hardware & Software.

bit_user · Dec 21, 2022

Kamen Rider Blade said:
Wider Standards support for more fp data types on CPU & across the industry in all sorts of hardware.

That just increases hardware complexity.

BTW, since I noticed that TF32 can losslessly represent both fp16 and bf16, I have come to believe the reason they did it was simply because it's what the underlying hardware had to do to support both formats. Then, someone got the bright idea to expose it, so that users could simultaneously utilize both the range of bf16 and the precision of fp16. I don't think the format is as arbitrary as we might've presumed.

Kamen Rider Blade said:
But that's proprietary to nVIDIA.

So is CUDA. They seem to like it that way. Vendor lock-in, you know?

Kamen Rider Blade said:
I like wide open standards support by the industry using the same data types.

Same, but I doubt you'll convince Nvidia of that. Especially if it makes their hardware more expensive, less efficient, and/or slower.

nimbulan · Dec 21, 2022

atomicWAR said:
I don't. AMD gpu drivers have notariously been slow to extract the full power of their GPUs. This is where the mis-'belief' AMD GPUs age like fine wine. In reality Nvidia has a much larger software team dedicated to drivers and they tend to extract most out of their GPUs right away whereas due to AMDs more limited team size it takes them months to years to fully utilize their silicon. So yeah if history is any indicator in two years time these 7000 series GPUs could gain an additional 10 percent or more in performance.

Oh I'm fully aware, and I frequently point this out to others. It just seems like too much of a performance deficit to be explained by "fine wine" right now. The average performance in benchmarks right now is indicating essentially zero performance improvement from the dual-issue shaders, while nVidia got a ~15% improvement adding support for parallel INT execution with the 20 series, and another 25% adding FP support to that INT pipeline with the 30 series. Of course some of that improvement will be due to other architectural changes, but I don't have any way to break it down. Either way, that seems like far too much ground to make up to just be explained by drivers.

Kamen Rider Blade · Dec 21, 2022

bit_user said:
That just increases hardware complexity.

::shrugs:: Oh Well.
I'm sure everybody has said that about every new instruction set added by Intel
Sometimes you gotta put in the work to have new options for improved long term performance.

BTW, since I noticed that TF32 can losslessly represent both fp16 and bf16, I have come to believe the reason they did it was simply because it's what the underlying hardware had to do to support both formats. Then, someone got the bright idea to expose it, so that users could simultaneously utilize both the range of bf16 and the precision of fp16. I don't think the format is as arbitrary as we might've presumed.

Neither is my proposal for more fp data types, it's not arbitrary.

The current fp data types are very "One Size Fits" all and not enough finesse in terms of data size options for end users to choose from.

It's like clothing sizes, we need more options since people's body types vary by a wide degree.

So is CUDA. They seem to like it that way. Vendor lock-in, you know?

I know! That's nVIDIA's way, they're the proprietary everything vendor.
Many people in the industry HATE them for that.
Especially the "Open Source" community.

Same, but I doubt you'll convince Nvidia of that. Especially if it makes their hardware more expensive, less efficient, and/or slower.

I don't expect to convince Jensen Huang of anything, he's happy at the perch at the top of his little dGPU mountain with his 80% dGPU marketshare.

But the rest of the industry can work on solutions and options that are "Open Standards".

That's why I want to see fp24 so badly.

It is the middle size data type between fp16 & fp32 that has been missing and can be used appropriately for AI or Gaming.

Depends on which variant of fp24 you want to use.

Each one has it's use case.

bit_user · Dec 21, 2022

Kamen Rider Blade said:
I'm sure everybody has said that about every new instruction set added by Intel

When Intel added a new instruction, the hardware implementing it would usually have to be added per-core, meaning somewhere between only one and a couple dozen times per CPU. When a GPU adds something like that, it gets replicated anywhere between a hundred and tens of thousands of times.

Kamen Rider Blade said:
The current fp data types are very "One Size Fits" all and not enough finesse in terms of data size options for end users to choose from.

On Nvidia, you have:

fp8 (Hopper and later)
fp16
bf16 (Ampere and later)
tf32 (Ampere and later)
fp32
fp64

That's as many as 6 sizes. I don't think we need any more. That's just my opinion.

Kamen Rider Blade · Dec 21, 2022

bit_user said:
When Intel added a new instruction, the hardware implementing it would usually have to be added per-core, meaning somewhere between only one and a couple dozen times per CPU. When a GPU adds something like that, it gets replicated anywhere between a hundred and tens of thousands of times.

On Nvidia, you have:

fp8 (Hopper and later)

fp16

bf16 (Ampere and later)

tf32 (Ampere and later)

fp32

fp64

That's as many as 6 sizes. I don't think we need any more. That's just my opinion.

Even if you cap the upper data size at fp64:
What about:
fp24
fp40
fp48
fp56

Those are all viable data sizes.

umeng2002_2 · Dec 21, 2022

Arbie said:
The Ryzen 1000 launch was "incredibly rough"? I bought an 1800X on release, paired it with 64GB DRAM from the QVL, and it worked fine from Day 1. Other than enabling XMP I did nothing to juice it up. My impression then and now was that 90% of the "problems" were people who ignored the QVL and / or chose to go beyond stock settings. AMD advertised stock performance at a ground-breaking price for 8 cores, and delivered. You run it 1 Hz higher than stock and problems are on you.

People really only had issues with overclocking. Overclocking is so easy these day, people expect it to be flawless. It reminds me of people get mad at AMD for USB issues when they OC their infinity fabric near the known limits.

bit_user · Dec 22, 2022

Kamen Rider Blade said:
Even if you cap the upper data size at fp64:
What about:
fp24
fp40
fp48
fp56

Those are all viable data sizes.

But what is the value proposition? What is the benefit which justifies the extra hardware needed to implement them? That's the part I'm struggling with.

Just because you can do something doesn't mean it's worth doing.

Kamen Rider Blade · Dec 22, 2022

bit_user said:
But what is the value proposition? What is the benefit which justifies the extra hardware needed to implement them? That's the part I'm struggling with.

Just because you can do something doesn't mean it's worth doing.

Depending on the problem you're trying to solve, using a fp data type of the appropriate size would lead to RAM savings which affects how fast you can load / process everything.

That's the entire point of having data types of the correct size.

If you don't need the full range of fp64, but you need more than fp32, there should be some options in between.

Same with fp16 and fp32.

It all boils down to how efficient you want to be about solving your problem.

kerberos_20 · Dec 22, 2022

Kamen Rider Blade said:
That's why I want to see fp24 so badly.

It is the middle size data type between fp16 & fp32 that has been missing and can be used appropriately for AI or Gaming.

for gaming...
fp24 was abondoned with shader model 3.0
directx dictates min precision storage 10bits, fp24 had 16bit precision, but exponent was low, it wasnt enough so they went with fp32 instead

Kamen Rider Blade said:
Depending on the problem you're trying to solve, using a fp data type of the appropriate size would lead to RAM savings which affects how fast you can load / process everything.

That's the entire point of having data types of the correct size.

If you don't need the full range of fp64, but you need more than fp32, there should be some options in between.

Same with fp16 and fp32.

It all boils down to how efficient you want to be about solving your problem.

you know, you can order costumized silicon to your needs, right?

bit_user · Dec 22, 2022

Kamen Rider Blade said:
Depending on the problem you're trying to solve, using a fp data type of the appropriate size would lead to RAM savings which affects how fast you can load / process everything.

That's the entire point of having data types of the correct size.

If you don't need the full range of fp64, but you need more than fp32, there should be some options in between.

One approach to this is memory compression. In graphics, texture compression is implemented in hardware and used quite a lot. In AI, some hardware uses similar methods to compress AI model weights.

fp64 is something of a special case. That's mainly used by science, finance, and engineering (mechanical, structural, aerospace, etc.) applications. They would probably prefer to have more precision, even at the expense of a little performance, than to tweak it down to the last few % at the risk of errors.

Kamen Rider Blade · Dec 23, 2022

kerberos_20 said:
for gaming...
fp24 was abondoned with shader model 3.0
directx dictates min precision storage 10bits, fp24 had 16bit precision, but exponent was low, it wasnt enough so they went with fp32 instead

Basically, the PXR 24-bit fp format would've been more useful then?
PXR's 1-bit sign, 8-bit exponent, & 15-bit fraction would be closer to what they wanted?
Or was it fp32 was the only acceptable solution?

you know, you can order costumized silicon to your needs, right?

That's not the point, having customized silicon isn't the solution I want to see.
It's to get the industry to go in on the new expanded standards together.
Be more flexible on data type size.

Kamen Rider Blade · Dec 23, 2022

bit_user said:
One approach to this is memory compression. In graphics, texture compression is implemented in hardware and used quite a lot. In AI, some hardware uses similar methods to compress AI model weights.

fp64 is something of a special case. That's mainly used by science, finance, and engineering (mechanical, structural, aerospace, etc.) applications. They would probably prefer to have more precision, even at the expense of a little performance, than to tweak it down to the last few % at the risk of errors.

But Gaming and other fields can get by with less precision and have had done so in the past.
Not everything needs absolute precision of the larger data-types.

And you can still use memory compression on smaller data types.

bit_user · Dec 23, 2022

Kamen Rider Blade said:
Basically, the PXR 24-bit fp format would've been more useful then?

No, because it's dictionary-based. That's not friendly for being accelerated in texture engines.

You might find it interesting to read about the texture compression formats currently in use.

Kamen Rider Blade said:
But Gaming and other fields can get by with less precision and have had done so in the past.

Games aren't the ones using fp64 (with perhaps the rare exception). fp64 is typically implemented at 1/32 the rate of fp32, in gaming GPUs. That's a strong incentive for games not to use it.

Kamen Rider Blade · Dec 23, 2022

bit_user said:
No, because it's dictionary-based. That's not friendly for being accelerated in texture engines.

That's not what I mean, I'm talking about the bit-arrangement of PXR24 vs AMD's FP24 format.

You might find it interesting to read about the texture compression formats currently in use.

Ok, when I get free time.

bit_user said:
Games aren't the ones using fp64 (with perhaps the rare exception). fp64 is typically implemented at 1/32 the rate of fp32, in gaming GPUs. That's a strong incentive for games not to use it.

I know, most Games have no real reason to use FP64 for numerous reasons.
But somewhere in between FP32 & FP64 might have a data type that could be useful for certain use cases.

kerberos_20 · Dec 23, 2022

Kamen Rider Blade said:
That's not what I mean, I'm talking about the bit-arrangement of PXR24 vs AMD's FP24 format.

Ok, when I get free time.

I know, most Games have no real reason to use FP64 for numerous reasons.
But somewhere in between FP32 & FP64 might have a data type that could be useful for certain use cases.

fpu can handle 80bit..u can use it if you need higher precision

Kamen Rider Blade · Dec 23, 2022

kerberos_20 said:
fpu can handle 80bit..u can use it if you need higher precision

Is this what you're talking about:
https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format

This is what I'm talking about:

News AMD Addresses Controversy: RDNA 3 Shader Pre-Fetching Works Fine

Titan

Distinguished

Titan

Distinguished

Champion

Titan

Distinguished

Titan

Distinguished

Titan

Distinguished

Distinguished

Titan

Distinguished

Reputable

Titan

Distinguished

Champion

Titan

Distinguished

Distinguished

Titan

Distinguished

Champion

Distinguished

Share this page