News AMD to split flagship AI GPUs into specialized lineups for AI and HPC, add UALink: Instinct MI400-series models take a different path

Admin · May 15, 2025

AMD will split its Instinct MI400-series into separate AI-focused (MI450X) and HPC-focused (MI430X) GPUs in H2 2026 to maximize performance per workload, but limited availability of UALink switches may hinder scalability.

bit_user · May 15, 2025

The article said:
Right now, AMD's Instinct MI300-series accelerators are aimed at both AI and HPC, which makes them universal but lowers maximum performance for both types of workloads. Starting from its next-generation Instinct MI400-series, AMD will offer distinct processors for AI and supercomputers in a bid to maximize performance for each workload, according to SemiAnalysis.

Yes, it's been obvious this would happen. Nvidia was first to show signs of moving in this direction, if you look at how Blackwell started to back away from fp64 compute, in favor of more AI horsepower.

In the chiplet era, I could imagine HPC and AI chiplets being combined in a package. You'd have models with all AI chiplets, some with a 50/50 of AI and HPC... not sure if there's still enough of a market for 100% HPC chiplet accelerators, but they could do it if there were.

cheesecake1116 · May 15, 2025

bit_user said:
Yes, it's been obvious this would happen. Nvidia was first to show signs of moving in this direction, if you look at how Blackwell started to back away from fp64 compute, in favor of more AI horsepower.

In the chiplet era, I could imagine HPC and AI chiplets being combined in a package. You'd have models with all AI chiplets, some with a 50/50 of AI and HPC... not sure if there's still enough of a market for 100% HPC chiplet accelerators, but they could do it if there were.

Except a split of these 2 doesn't make any sense...

This would imply that AMD is making 2 different compute chiplets for MI430X and MI450X which I very much doubt they are doing based on AMD's prior behavior WRT how they deal with different market segments...

And I don't even understand what the limitation would even be for having the FP64 matrix compute there in the first place... the vast vast majority of the workloads that these processors are used in are memory bandwidth and more importantly memory capacity bound... not compute bound... And that is true for Nvidia as well...

bit_user · May 15, 2025

cheesecake1116 said:
This would imply that AMD is making 2 different compute chiplets for MI430X and MI450X which I very much doubt they are doing based on AMD's prior behavior WRT how they deal with different market segments...

Well, that's the point of this announcement. They're saying they are going to start deviating from their prior behavior.

cheesecake1116 said:
And I don't even understand what the limitation would even be for having the FP64 matrix compute there in the first place...

FP64 uses a lot of silicon. The amount of silicon used by FP multipliers scales roughly as a square of the mantissa.
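To put rough numbers on that scaling claim, here is a minimal sketch. It assumes multiplier area grows with the square of the significand width (stored mantissa bits plus the implicit leading 1) and ignores exponent logic, adders, and routing, so treat the ratios as ballpark figures only. The function name is made up for illustration.

```python
# Significand widths in bits, including the implicit leading 1.
SIGNIFICAND_BITS = {
    "BF16": 8,
    "FP16": 11,
    "TF32": 11,
    "FP32": 24,
    "FP64": 53,
}


def relative_multiplier_area(fmt: str, baseline: str = "FP32") -> float:
    """Rough area ratio, ASSUMING area scales as significand_bits ** 2."""
    return (SIGNIFICAND_BITS[fmt] / SIGNIFICAND_BITS[baseline]) ** 2


for fmt in SIGNIFICAND_BITS:
    ratio = relative_multiplier_area(fmt)
    print(f"{fmt}: {ratio:.2f}x the area of an FP32 multiplier")
```

Under that assumption an FP64 multiplier costs a bit under 5x an FP32 one, and roughly 23x an FP16 one.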

cheesecake1116 said:
the vast vast majority of the workloads that these processors are used in are memory bandwidth and more importantly memory capacity bound... not compute bound... And that is true for Nvidia as well...

In general, they put only as much compute as needed, in order to have a balanced system. So, I don't accept the notion that these accelerators have compute power much in excess of what the HBM memory subsystems can keep fed.

cheesecake1116 · May 15, 2025

bit_user said:
Well, that's the point of this announcement. They're saying they are going to start deviating from their prior behavior.

FP64 uses a lot of silicon. The amount of silicon used by FP multipliers scales roughly as a square of the mantissa.

In general, they put only as much compute as needed, in order to have a balanced system. So, I don't accept the notion that these accelerators have compute power much in excess of what the HBM memory subsystems can keep fed.

1) Yeah, I don't see that change happening the way they think it will in my opinion, because...

2) Yes, FP64 does use a lot of silicon, but as a reminder, since CDNA2, CDNA has used FP64 ALUs instead of FP32 ALUs like every other GPU...

3) As for putting only the amount of compute as needed to have a balanced system, that is not the case at all... Depending on the arithmetic intensity of an algorithm, you can fall WELL short of the on-paper numbers... For example, for an FP64 workload on MI300X, if you have a working dataset that is running mostly out of the Infinity Cache and you have an arithmetic intensity of ~1FLOP per byte, you aren't going to break 20TF of the hypothetical 80TF of FP64 compute... you need an arithmetic intensity of about 4.5FLOPs per byte to hit the 80TF of on-paper compute... and if you are working out of DRAM then you are going to need an intensity of about 15FLOPs per byte to hit the 80TF of FP64 compute... This is why roofline models are useful: they tell you how easy or hard it is to feed a specific processor using a specific datatype...
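The roofline argument in this post can be sketched in a few lines. The bandwidth figures here are not measured or taken from a spec sheet; they are simply what the quoted ridge points (4.5 and 15 FLOPs per byte at 80TF) imply, and the function name is hypothetical.

```python
def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline model: capped by peak compute or by bandwidth * intensity."""
    return min(peak_tflops, bandwidth_tb_s * intensity_flop_per_byte)


PEAK_FP64_TFLOPS = 80.0
# ASSUMPTION: bandwidths back-derived from the ridge points quoted in the post
# (bandwidth = peak / ridge intensity), not official numbers.
CACHE_BW_TB_S = PEAK_FP64_TFLOPS / 4.5   # roughly 17.8 TB/s
DRAM_BW_TB_S = PEAK_FP64_TFLOPS / 15.0   # roughly 5.3 TB/s

for intensity in (1.0, 4.5, 15.0):
    cache = attainable_tflops(PEAK_FP64_TFLOPS, CACHE_BW_TB_S, intensity)
    dram = attainable_tflops(PEAK_FP64_TFLOPS, DRAM_BW_TB_S, intensity)
    print(f"{intensity:4.1f} FLOP/byte: from cache {cache:5.1f} TF, "
          f"from DRAM {dram:5.1f} TF")
```

At 1 FLOP per byte the cache-resident case lands just under 18TF, which matches the "not going to break 20TF" figure.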

3en88 · May 15, 2025

Another shot in the foot by AMD.

bit_user · May 16, 2025

cheesecake1116 said:
2) Yes, FP64 does use a lot of silicon, but as a reminder, since CDNA2, CDNA has used FP64 ALUs instead of FP32 ALUs

That won them some supercomputer contracts and allowed them to leapfrog Nvidia and Intel on vector fp64, but fp64 has limited general applicability, which is why client CPUs implement only scalar support for it.

cheesecake1116 said:
3) As for putting only the amount of compute as needed to have a balanced system, that is not the case at all...

The range of potential workloads is so diverse that you can always find some which are memory-bottlenecked and others that are compute-bottlenecked. They probably have some suite of applications they're trying to target, which inform their decisions about how to balance compute vs. memory bandwidth.

cheesecake1116 said:
if you are working out of DRAM then you are going to need an intensity of about 15FLOPs per byte to hit the 80TF of FP64 compute...

15 FLOP/byte isn't hard to reach, if you're dealing in large convolutions or tensor products.
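As a back-of-the-envelope check of the convolution case, here is a sketch under idealized assumptions: the filter taps stay in registers, every input element is read from memory exactly once, and every output is written once. Real kernels will do worse; the function names are made up for illustration.

```python
import math


def conv_intensity(taps: int, bytes_per_element: int) -> float:
    """FLOPs per byte for a convolution with perfect on-chip reuse (idealized)."""
    flops_per_output = 2 * taps               # one multiply and one add per tap
    bytes_per_output = 2 * bytes_per_element  # one element read, one written
    return flops_per_output / bytes_per_output


def taps_needed(target_intensity: float, bytes_per_element: int) -> int:
    """Smallest tap count that reaches the target intensity in this model."""
    return math.ceil(target_intensity * bytes_per_element)


print(conv_intensity(taps=121, bytes_per_element=8))  # 11x11 kernel, FP64
print(taps_needed(15.0, bytes_per_element=8))
```

In this model an FP64 convolution needs about 120 taps (for example an 11x11 2D kernel) to reach 15 FLOP/byte.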

cheesecake1116 · May 16, 2025

bit_user said:
That won them some supercomputer contracts and allowed them to leapfrog Nvidia and Intel on vector fp64, but fp64 has limited general applicability, which is why client CPUs implement only scalar support for it.

The range of potential workloads is so diverse that you can always find some which are memory-bottlenecked and others that are compute-bottlenecked. They probably have some suite of applications they're trying to target, which inform their decisions about how to balance compute vs. memory bandwidth.

15 FLOP/byte isn't hard to reach, if you're dealing in large convolutions or tensor products.

1) Client CPUs, not GPUs, can do vector FP64... Just look at the VFMADD132PD instruction in AVX512... Client GPUs can do FP64 as well, just that they cut a lot of the throughput because CLIENT workloads don't need much of it... but HPC sure does...

2) But the majority of workloads are going to be memory bandwidth or memory capacity bound... look at all the roofline models that are out there...

3) 15FLOPs/Byte may not be hard to hit... but what about over 245 FLOPs per Byte which is what you'd need to hit for fully saturating FP16 Matrix on the MI300X... that WILL be a lot harder to hit...

bit_user · May 16, 2025

cheesecake1116 said:
1) Client CPUs, not GPUs, can do vector FP64...

Oops. I'm terribly sorry. That was a simple typo, on my part. I meant to type "client GPUs", as in the gaming GPUs. I'm well aware that x86 CPUs have supported vector fp64, ever since SSE2.

cheesecake1116 said:
Just look at the VFMADD132PD instruction in AVX512...

Yes, and thank you for even citing a specific example.

cheesecake1116 said:
Client GPUs can do FP64 as well,

This is the part where I was trying to say they only do scalar fp64. The way to see this for Nvidia is slightly convoluted. First you have to see which CUDA Compute Capability is supported by a given GPU. Then, you can read about its features and properties. So, we can take the example of their latest Blackwell client & server GPUs, but this holds for all of the ones they've shipped in the past decade.

According to this, the RTX 5000-series (i.e. Blackwell client GPUs) support CUDA Compute Capability 12.0, while the server Blackwells support 10.0:

https://developer.nvidia.com/cuda-gpus

In section 17.9.1 of their CUDA Programming Guide, they say that a 10.x SM (Streaming Multiprocessor) contains 128x fp32 pipelines, 64x fp64 pipelines, and 64x int32 pipelines. A warp is their unit of SIMD and is 32-wide. So, that corresponds to an ability for concurrent dispatch of up to 4 warp-level fp32 operations, 2 warp-level fp64 operations, or 2 warp-level int32 operations. They don't say how you can mix and match, but I assume they have only enough bandwidth to/from the vector register file to sustain 4x 32-bit warp ops per cycle.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-10-x

Contrast this with what 12.0 supports, in section 17.10.1. There, the number of fp64 pipelines per SM drops to just 2. That's their way of saying there are only two scalar fp64 pipelines per SM and no vector fp64 capability.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-12-0
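The warp arithmetic above can be written out as a small sketch. The pipeline counts are the ones quoted in this post, and the model assumes each pipeline retires one lane-operation per cycle, which is a simplification rather than anything Nvidia documents in this form.

```python
WARP_SIZE = 32  # lanes per warp


def warp_ops_per_cycle(pipelines_per_sm: int) -> float:
    """Warp-wide instructions per cycle, ASSUMING one lane-op per pipeline per cycle."""
    return pipelines_per_sm / WARP_SIZE


# Pipeline counts per SM as quoted in the post above.
quoted = {
    "CC 10.x fp32": 128,
    "CC 10.x fp64": 64,
    "CC 10.x int32": 64,
    "CC 12.0 fp64": 2,
}

for name, pipelines in quoted.items():
    rate = warp_ops_per_cycle(pipelines)
    print(f"{name}: {rate:g} warp ops/cycle, {1 / rate:g} cycles per warp op")
```

With only 2 fp64 pipelines, a single warp-wide fp64 instruction would take 16 cycles in this model, versus half a cycle on the 10.x parts.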

You can go through and see a similar pattern between all of their client vs. server GPUs.

cheesecake1116 said:
just that they cut a lot of the throughput because CLIENT workloads don't need much of it... but HPC sure does...

Yeah, funny enough, Intel's Alchemist generation dGPUs dropped all hardware fp64 capabilities and support it only through emulation. In the Battlemage generation, they brought back true hardware scalar fp64 support.

If you look at their GPU architecture slides from about 4+ years ago, they list fp64 support as being optional for the Xe cores used in both the "Xe-HPG" (High-Performance Gaming) GPUs and Xe-HP (general-purpose cloud GPU). They subsequently cancelled Xe-HP, which was probably the version of the core that actually had scalar hardware fp64, leaving only Xe-HPG and Xe-HPC (AKA Ponte Vecchio, which has full vector fp64). I'm having trouble locating a copy of the slide...

In AMD's case, at least RDNA4 seems to take a different approach. I'm not sure if this is true for all generations of RDNA, as I only checked the latest, but you do find a full contingent of V_..._F64 instructions in their ISA manual:

https://www.amd.com/content/dam/amd...ctures/rdna4-instruction-set-architecture.pdf

The specs indicate a 64:1 ratio of fp32 to fp64 TFLOPS. I think they might microcode those Vector F64 instructions to run on a single scalar port per WGP, but I haven't found direct confirmation.

cheesecake1116 said:
3) 15FLOPs/Byte may not be hard to hit... but what about over 245 FLOPs per Byte which is what you'd need to hit for fully saturating FP16 Matrix on the MI300X... that WILL be a lot harder to hit...

A couple of points about that. The first is that you're often convolving multiple data items against the same weights. The weights can be held in registers, leaving only the data items to be streamed in/out.

Next, if we're talking about matrix multiply, there's an inherent data efficiency in matrix multiply hardware. The amount of compute scales as roughly a cube of the input/output data width. This is how "AI" GPUs and NPUs can tout such eye-watering numbers of TOPS and why it was such a game-changer when Nvidia first added Tensor cores.
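The "compute grows as a cube, data as a square" point can be made concrete with a small sketch. It assumes a square N x N matrix multiply where A and B are each read once and C is written once, which is the best case for data reuse; the function names are hypothetical.

```python
import math


def matmul_intensity(n: int, bytes_per_element: int) -> float:
    """FLOPs per byte for an n x n matmul with ideal reuse (each matrix moved once)."""
    flops = 2 * n ** 3                            # n^3 multiply-adds
    bytes_moved = 3 * n ** 2 * bytes_per_element  # read A, read B, write C
    return flops / bytes_moved


def min_size_for(target_intensity: float, bytes_per_element: int) -> int:
    """Smallest n with intensity >= target, from intensity = 2n / (3 * bytes)."""
    return math.ceil(1.5 * target_intensity * bytes_per_element)


for n in (64, 256, 1024):
    print(f"n={n}: {matmul_intensity(n, 2):.1f} FLOP/byte in FP16")
print("n needed for 245 FLOP/byte in FP16:", min_size_for(245, 2))
```

In this idealized model, the 245 FLOPs per byte quoted earlier for FP16 matrix work is reached at n = 735, so the intensity is attainable for large matrices as long as the tiling actually achieves that reuse.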

Silicon area is expensive, which is why I'm rather confident there's not an over-capacity of compute power, for the key workloads they're targeting.

ET3D · Jun 1, 2025

As someone working with HPC that doesn't require the highest accuracy, I find BF16 helpful for performance (while FP4/FP8 are useless). float21 (storing 3 floats in 64 bits) would probably have been a better compromise, but still... So the combination I'd have liked is FP32+BF16, rather than the two extremes here. (And stochastic rounding in hardware would also have been nice.)
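For readers unfamiliar with the float21 idea, here is one way it could look in software. The bit layout is an assumption (the post does not specify one): each value keeps the top 21 bits of an FP32, i.e. 1 sign, 8 exponent, and 12 mantissa bits, and the low 11 mantissa bits are truncated rather than rounded. All names are hypothetical.

```python
import struct

MASK21 = (1 << 21) - 1


def f32_bits(x: float) -> int:
    """Raw 32-bit pattern of x rounded to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """FP32 value for a raw 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def pack3(a: float, b: float, c: float) -> int:
    """Pack three values into one 64-bit word (63 bits used)."""
    word = 0
    for x in (a, b, c):
        word = (word << 21) | (f32_bits(x) >> 11)  # keep the top 21 bits
    return word


def unpack3(word: int) -> tuple:
    """Recover three FP32 values; the dropped mantissa bits come back as zeros."""
    return tuple(bits_f32(((word >> shift) & MASK21) << 11)
                 for shift in (42, 21, 0))


packed = pack3(1.0, -2.5, 3.14159)
print(hex(packed), unpack3(packed))
```

Compared with BF16 this keeps the FP32 exponent range while adding 5 more mantissa bits, at the cost of 64-bit loads that need shift-and-mask unpacking.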

bit_user · Jun 1, 2025

ET3D said:
As someone working with HPC that doesn't require the highest accuracy, I find BF16 helpful for performance (while FP4/FP8 are useless).

Interesting. For my part, I think IEEE754 half-precision is more generally useful. In fact, Intel only recently added support for it to AVX-512 (circa Sapphire Rapids), apparently for signal-processing applications. The precision of BF16 is generally too low for something like that.

Half-precision is actually the first primitive GPUs added acceleration for, with AI in mind. Intel, Nvidia, and AMD all support packing two half-precision numbers per 32 bits. Nvidia's GP100 and Intel's Broadwell iGPU were the first to support accelerated dot-product operations on these, with AMD's Vega soon to follow.

Also, Ivy Bridge added AVX instructions for packing/unpacking them, even though it had no arithmetic support for the datatype. I guess the idea was that you'd use that for your in-memory representation, in order to reduce memory bottlenecks.
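The "half-precision as a storage format only" idea can be illustrated in software. This is only an analogy using Python's `struct` module and its `e` (IEEE 754 binary16) format code, not the actual CPU instructions; the helper names are made up.

```python
import struct


def store_as_half(values) -> bytes:
    """Pack floats as IEEE 754 binary16, 2 bytes per value."""
    return struct.pack(f"<{len(values)}e", *values)


def load_from_half(buf: bytes) -> list:
    """Unpack binary16 values back into ordinary Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


samples = [0.1, 1.0, 3.14159, 1000.5]
buf = store_as_half(samples)     # compact in-memory representation
restored = load_from_half(buf)   # widen before doing any arithmetic
total = sum(restored)            # the math itself runs at full precision

print(len(buf), "bytes instead of", 4 * len(samples), "for FP32")
print(restored, total)
```

The memory traffic is halved relative to FP32, and the only precision loss is the rounding that happens at the store.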

ET3D said:
float21 (storing 3 floats in 64 bits) would probably have been a better compromise, but still... So the combination I'd have liked is FP32+BF16, rather than the two extremes here. (And stochastic rounding in hardware would also have been nice.)

Nvidia has their TF32 format, which appears to share the same mantissa as IEEE754 half-precision and the same exponent as BF16. In total, it's just 19 bits. I have no idea whether they have any hardware support for packing/unpacking them in 64-bit chunks.

NVIDIA Blogs: TensorFloat-32 Accelerates AI Training HPC upto 20x

NVIDIA's Ampere architecture with TF32 speeds single-precision work, maintaining accuracy and using no new code.

blogs.nvidia.com
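The bit budgets being compared in this post can be tabulated with a short sketch. The field widths are the standard published layouts (sign, exponent, explicit mantissa); the helper names are hypothetical, and the range and precision figures follow from the usual IEEE-style formulas.

```python
FORMATS = {  # name: (sign bits, exponent bits, explicit mantissa bits)
    "FP16": (1, 5, 10),
    "BF16": (1, 8, 7),
    "TF32": (1, 8, 10),
    "FP32": (1, 8, 23),
}


def total_bits(fmt: str) -> int:
    return sum(FORMATS[fmt])


def max_finite(fmt: str) -> float:
    """Largest finite value, assuming IEEE-style biased exponents."""
    _, exp_bits, man_bits = FORMATS[fmt]
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias


def epsilon(fmt: str) -> float:
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -FORMATS[fmt][2]


for fmt in FORMATS:
    print(f"{fmt}: {total_bits(fmt)} bits, max {max_finite(fmt):.3g}, "
          f"epsilon {epsilon(fmt):.3g}")
```

This shows the trade-off directly: TF32 has FP16's precision with BF16's range in 19 bits, while FP16 tops out at 65504.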

ET3D · Jun 12, 2025

bit_user said:
Interesting. For my part, I think IEEE754 half-precision is more generally useful.

(Sorry for the late reply. I don't visit the forums often.)

It's useful for some things, but if the value range is large, its exponent can be too small. Sure, you can renormalise the data occasionally, but having more exponent bits is easier, and in some cases the loss of precision doesn't matter that much.

bit_user said:
Nvidia has their TF32 format, which appears to share the same mantissa as IEEE754 half-precision and the same exponent as BF16. In total, it's just 19 bits. I have no idea whether they have any hardware support for packing/unpacking them in 64-bit chunks.

I looked into it in the past, and it didn't support easy memory packing. So for applications where memory throughput is the bottleneck, that won't help much.

bit_user · Jun 12, 2025

ET3D said:
I looked into it in the past, and it didn't support easy memory packing. So for applications where memory throughput is the bottleneck, that won't help much.

Perhaps TF32 is good for use as an intermediate format, but you'd still load & save in either BF16 or FP16.

I think it's where they basically just exposed their internal computational capability (I couldn't imagine they actually had separate BF16 and FP16 FMACs), in case people found it useful. I agree that most will probably prefer one of the packed 16-bit formats, but if you needed more range or precision than either of those will allow, you could switch to TF32 and then switch back.
