News AMD to split flagship AI GPUs into specialized lineups for AI and HPC, add UALink — Instinct MI400-series models take a different path

The article said:
Right now, AMD's Instinct MI300-series accelerators are aimed at both AI and HPC, which makes them universal but lowers maximum performance for both types of workloads. Starting from its next-generation Instinct MI400-series, AMD will offer distinct processors for AI and supercomputers in a bid to maximize performance for each workload, according to SemiAnalysis.
Yes, it's been obvious this would happen. Nvidia was first to show signs of moving in this direction, if you look at how Blackwell started to back away from fp64 compute, in favor of more AI horsepower.

In the chiplet era, I could imagine HPC and AI chiplets being combined in a package. You'd have models with all AI chiplets, some with a 50/50 mix of AI and HPC... not sure if there's still enough of a market for 100% HPC-chiplet accelerators, but they could do it if there were.
 
Except a split of these 2 doesn't make any sense...

This would imply that AMD is making 2 different compute chiplets for the MI430X and MI450X, which I very much doubt they are doing, based on AMD's prior behavior WRT how they deal with different market segments...

And I don't even understand what the limitation would even be for having the FP64 matrix compute there in the first place... the vast vast majority of the workloads that these processors are used in are memory bandwidth and more importantly memory capacity bound... not compute bound... And that is true for Nvidia as well...
 
This would imply that AMD is making 2 different compute chiplets for the MI430X and MI450X, which I very much doubt they are doing, based on AMD's prior behavior WRT how they deal with different market segments...
Well, that's the point of this announcement. They're saying they are going to start deviating from their prior behavior.

And I don't even understand what the limitation would even be for having the FP64 matrix compute there in the first place...
FP64 uses a lot of silicon. The amount of silicon used by FP multipliers scales roughly as the square of the mantissa width.
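(As a rough, back-of-the-envelope illustration of that scaling, assuming multiplier area grows with the square of the significand width: fp32 has a 24-bit significand and fp64 a 53-bit one, so an fp64 multiplier needs on the order of (53/24)^2 ≈ 4.9x the area of an fp32 multiplier, before counting the wider registers and datapaths.)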

the vast vast majority of the workloads that these processors are used in are memory bandwidth and more importantly memory capacity bound... not compute bound... And that is true for Nvidia as well...
In general, they put only as much compute as needed, in order to have a balanced system. So, I don't accept the notion that these accelerators have compute power much in excess of what the HBM memory subsystems can keep fed.
 
1) Yeah, I don't see that change happening the way they think it will, in my opinion, because...

2) Yes, FP64 does use a lot of silicon, but as a reminder: since CDNA 2, CDNA has used FP64 ALUs instead of the FP32 ALUs that every other GPU uses...

3) As for putting only the amount of compute needed to have a balanced system, that is not the case at all... Depending on the arithmetic intensity of an algorithm, you can fall WELL short of the on-paper numbers... For example, for an FP64 workload on MI300X, if your working dataset is running mostly out of the Infinity Cache and you have an arithmetic intensity of ~1 FLOP per byte, you aren't going to break 20 TF of the hypothetical 80 TF of FP64 compute... you need an arithmetic intensity of about 4.5 FLOPs per byte to hit the 80 TF of on-paper compute... and if you are working out of DRAM, then you are going to need an intensity of about 15 FLOPs per byte to hit the 80 TF of FP64 compute... This is why roofline models are useful: they tell you how easy or hard it is to feed a specific processor using a specific datatype...
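(To make the roofline point concrete, here's a minimal sketch of that calculation in C. The peak and bandwidth figures are rough values I've assumed to match the break-even points quoted above, roughly 80 TF of FP64, ~17.8 TB/s of Infinity Cache bandwidth, and ~5.3 TB/s of HBM bandwidth; they are not official specs.)

Code:
/* Minimal roofline sketch: attainable FLOP/s = min(peak, intensity * bandwidth).
 * Peak and bandwidth figures below are rough, assumed values inferred from the
 * discussion above (~80 TFLOPS FP64, ~17.8 TB/s Infinity Cache, ~5.3 TB/s HBM),
 * not official specs. */
#include <stdio.h>

static double roofline_tflops(double peak_tflops, double bw_tbps, double flops_per_byte)
{
    double bw_limited = bw_tbps * flops_per_byte;   /* TB/s * FLOP/byte = TFLOP/s */
    return bw_limited < peak_tflops ? bw_limited : peak_tflops;
}

int main(void)
{
    const double peak = 80.0;        /* assumed peak FP64, TFLOPS       */
    const double cache_bw = 17.8;    /* assumed Infinity Cache BW, TB/s */
    const double hbm_bw = 5.3;       /* assumed HBM BW, TB/s            */
    const double intensities[] = {1.0, 4.5, 15.0};

    for (int i = 0; i < 3; i++) {
        double ai = intensities[i];
        printf("AI = %4.1f FLOP/byte: cache-resident %5.1f TF, DRAM-resident %5.1f TF\n",
               ai, roofline_tflops(peak, cache_bw, ai), roofline_tflops(peak, hbm_bw, ai));
    }
    return 0;
}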
 
2) Yes, FP64 does use a lot of silicon, but as a reminder: since CDNA 2, CDNA has used FP64 ALUs instead of the FP32 ALUs
That won them some supercomputer contracts and allowed them to leapfrog Nvidia and Intel on vector fp64, but fp64 has limited general applicability, which is why client CPUs implement only scalar support for it.

3) As for putting only the amount of compute needed to have a balanced system, that is not the case at all...
The range of potential workloads is so diverse that you can always find some which are memory-bottlenecked and others that are compute-bottlenecked. They probably have some suite of applications they're trying to target, which inform their decisions about how to balance compute vs. memory bandwidth.

if you are working out of DRAM, then you are going to need an intensity of about 15 FLOPs per byte to hit the 80 TF of FP64 compute...
15 FLOPs/byte isn't hard to reach, if you're dealing in large convolutions or tensor products.
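For a sense of how quickly dense linear algebra clears that bar (my own back-of-the-envelope arithmetic, counting only the unique data touched): multiplying two N x N fp64 matrices does about 2N^3 FLOPs on 3N^2 values, i.e. 24N^2 bytes, which is an intensity of roughly N/12 FLOPs per byte. So any matrix dimension above about 180 already exceeds 15 FLOPs/byte, even before accounting for reuse out of caches and registers.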
 
1) Client CPUs, not GPUs, can do vector FP64... Just look at the VFMADD132PD instruction in AVX512... Client GPUs can do FP64 as well, just that they cut a lot of the throughput because CLIENT workloads don't need much of it... but HPC sure does...

2) But the majority of workloads are going to be memory bandwidth or memory capacity bound... look at all the roofline models that are out there...

3) 15 FLOPs/byte may not be hard to hit... but what about over 245 FLOPs per byte, which is what you'd need to hit to fully saturate FP16 matrix on the MI300X... that WILL be a lot harder to hit...
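(For reference on where that threshold comes from, assuming I have the peaks right: roughly 1,300 TF of dense FP16 matrix throughput divided by ~5.3 TB/s of HBM bandwidth works out to about 245 FLOPs per byte.)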
 
1) Client CPUs, not GPUs, can do vector FP64...
Oops. I'm terribly sorry. That was a simple typo, on my part. I meant to type "client GPUs", as in the gaming GPUs. I'm well aware that x86 CPUs have supported vector fp64, ever since SSE2.

Just look at the VFMADD132PD instruction in AVX512...
Yes, and thank you for even citing a specific example.
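For anyone following along at home, here's a minimal sketch in C of what that looks like in practice, using the AVX-512 intrinsics. Built with something like gcc -O2 -mavx512f, the fused multiply-add below typically compiles down to one of the VFMADD...PD forms (whether you get the 132, 213, or 231 encoding is up to the compiler).

Code:
/* Minimal sketch: vector fp64 fused multiply-add on a client CPU via AVX-512F.
 * Each _mm512_fmadd_pd processes 8 doubles at once (r = a*b + c). */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[8] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
    double r[8];

    __m512d va = _mm512_loadu_pd(a);
    __m512d vb = _mm512_loadu_pd(b);
    __m512d vc = _mm512_loadu_pd(c);
    _mm512_storeu_pd(r, _mm512_fmadd_pd(va, vb, vc));

    for (int i = 0; i < 8; i++)
        printf("%g ", r[i]);   /* prints 8.5 14.5 18.5 20.5 20.5 18.5 14.5 8.5 */
    printf("\n");
    return 0;
}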

Client GPUs can do FP64 as well,
This is the part where I was trying to say they only do scalar fp64. The way to see this for Nvidia is slightly convoluted. First you have to see which CUDA Compute Capability is supported by a given GPU. Then, you can read about its features and properties. So, we can take the example of their latest Blackwell client & server GPUs, but this holds for all of the ones they've shipped in the past decade.

According to Nvidia's CUDA documentation, the RTX 5000-series (i.e. Blackwell client GPUs) supports CUDA Compute Capability 12.0, while the server Blackwells support 10.0.

In section 17.9.1 of their CUDA Programming Guide, they say that a 10.x SM (Streaming Multiprocessor) contains 128x fp32 pipelines, 64x fp64 pipelines, and 64x int32 pipelines. A warp is their unit of SIMD and is 32-wide. So, that corresponds to an ability for concurrent dispatch of up to 4 warp-level fp32 operations, 2 warp-level fp64 operations, or 2 warp-level int32 operations. They don't say how you can mix and match, but I assume they have only enough bandwidth to/from the vector register file to sustain 4x 32-bit warp ops per cycle.

Contrast this with what 12.0 supports, in section 17.10.1. There, the number of fp64 pipelines per SM drops to just 2. That's their way of saying there are only two scalar fp64 pipelines per SM and no vector fp64 capability.

You can go through and see a similar pattern between all of their client vs. server GPUs.
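As an aside, you don't even have to dig through tables to find the compute capability of whatever card you have installed; the sketch below (CUDA runtime host code, built with nvcc, filename my own choice) just asks the driver directly. The per-SM pipeline counts still come from the programming guide sections above.

Code:
/* Minimal sketch: print each visible GPU's CUDA Compute Capability via the
 * CUDA runtime API. Build with e.g.: nvcc -o query_cc query_cc.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA devices found.\n");
        return 0;
    }
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, compute capability %d.%d, %d SMs\n",
               i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}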

just that they cut a lot of the throughput because CLIENT workloads don't need much of it... but HPC sure does...
Yeah, funny enough, Intel's Alchemist-generation dGPUs dropped all hardware fp64 capability and supported it only through emulation. In the Battlemage generation, they brought back true hardware scalar fp64 support.

If you look at their GPU architecture slides from about 4+ years ago, they list fp64 support as being optional for the Xe cores used in both the "Xe-HPG" (High-Performance Gaming) GPUs and Xe-HP (general-purpose cloud GPU). They subsequently cancelled Xe-HP, which was probably the version of the core that actually had scalar hardware fp64, leaving only Xe-HPG and Xe-HPC (AKA Ponte Vecchio, which has full vector fp64). I'm having trouble locating a copy of the slide...

In AMD's case, at least RDNA4 seems to take a different approach. I'm not sure if this is true for all generations of RDNA, as I only checked the latest, but you do find a full contingent of V_..._F64 instructions in their ISA manual.

The specs indicate a 64:1 ratio of fp32 to fp64 TFLOPS. I think they might microcode those Vector F64 instructions to run on a single scalar port per WGP, but I haven't found direct confirmation.

3) 15 FLOPs/byte may not be hard to hit... but what about over 245 FLOPs per byte, which is what you'd need to hit to fully saturate FP16 matrix on the MI300X... that WILL be a lot harder to hit...
A couple of points about that. The first is that you're often convolving multiple data items against the same weights. The weights can be held in registers, leaving only the data items to be streamed in/out.

Next, if we're talking about matrix multiply, there's an inherent data efficiency in matrix multiply hardware. The amount of compute scales roughly as the cube of the matrix dimension, while the amount of input/output data only scales as its square. This is how "AI" GPUs and NPUs can tout such eye-watering numbers of TOPS and why it was such a game-changer when Nvidia first added Tensor cores.

Silicon area is expensive, which is why I'm rather confident there's not an over-capacity of compute power for the key workloads they're targeting.
 