News AMD CDNA 3 Roadmap: MI300 APU With 5X Performance/Watt Uplift

Bikki

If they claim 8x for AI training, that would be floating-point matrix multiplication performance, something like bfloat16, not int8.
 
Bikki said:
If they claim 8x for AI training, that would be floating-point matrix multiplication performance, something like bfloat16, not int8.
People have been looking into lower-precision alternatives for AI training for years. Google's TPUs focus mostly on INT8, from what we can tell. Nvidia has been quoting teraops (INT8) figures for a few years, and I think Intel or Nvidia has even talked about INT2 being beneficial for certain AI applications. Given that MI250X has the same peak throughput for bfloat16, fp16, int8, and int4, there's no speedup right now. But if AMD reworks things so that two int8 or four int4 operations execute in the same time as a single 16-bit operation, that's a 2x or 4x speedup.
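To put rough numbers on that packing idea, here's a quick sketch; the hardware figures in it are made-up placeholders for illustration, not MI250X or MI300 specs:

```python
# Rough sketch of why packing narrower integers into the same ALU width
# multiplies peak throughput. All hardware numbers below are placeholders
# chosen for illustration; they are not MI250X (or MI300) specifications.

num_lanes   = 1000   # hypothetical number of 16-bit MAC lanes
clock_ghz   = 1.0    # hypothetical clock speed in GHz
ops_per_mac = 2      # a multiply-accumulate counts as two ops

def peak_tops(packing_factor):
    """Peak tera-ops/s if each 16-bit lane does `packing_factor` MACs per clock."""
    # lanes x GHz gives giga-MACs per second; /1000 converts giga-ops to tera-ops.
    return num_lanes * clock_ghz * ops_per_mac * packing_factor / 1000

for fmt, packing in [("bf16/fp16", 1), ("int8", 2), ("int4", 4)]:
    print(f"{fmt:9}: {peak_tops(packing):5.1f} peak TOPS ({packing}x the 16-bit rate)")
```

Packing alone tops out at the 2x/4x shown, so any larger claimed uplift has to come from somewhere else (more units, higher clocks, sparsity, and so on).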
 

bit_user

Jarred said:
People have been looking into lower-precision alternatives for AI training for years. Google's TPUs focus mostly on INT8, from what we can tell. Nvidia has been quoting teraops (INT8) figures for a few years, and I think Intel or Nvidia has even talked about INT2 being beneficial for certain AI applications.
Jarred, you're missing a key point: training vs. inference. @Bikki was pointing out that int8 isn't useful for training, which traditionally requires more range & precision, like BF16. It's really inference that uses the lower-precision data types you mentioned.
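For anyone who wants to see that range/precision trade-off in numbers, here's a small sketch that derives them straight from the bit layouts (nothing vendor-specific, just the IEEE-754-style sign/exponent/mantissa split):

```python
# Dynamic range and precision of a few formats, derived from their bit
# layouts (1 sign bit, `exp_bits` exponent bits, `man_bits` mantissa bits).
# bf16 keeps fp32's 8-bit exponent, so it keeps fp32's range while giving
# up precision; fp16 trades range for extra precision.

def float_format(exp_bits, man_bits):
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # exponent field is 2**exp_bits - 2.
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** ((2 ** exp_bits - 2) - bias)
    min_normal = 2.0 ** (1 - bias)      # smallest positive normal value
    epsilon = 2.0 ** -man_bits          # spacing just above 1.0
    return max_val, min_normal, epsilon

for name, e, m in [("fp32", 8, 23), ("bf16", 8, 7), ("fp16", 5, 10)]:
    mx, mn, eps = float_format(e, m)
    print(f"{name}: max ~{mx:.3g}, min normal ~{mn:.3g}, epsilon {eps:.3g}")

print("int8: -128..127 in steps of 1, so scale factors have to carry the range")
```

bf16's appeal for training is exactly that its max and min match fp32's while only 16 bits move through the hardware.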
 
bit_user said:
Jarred, you're missing a key point: training vs. inference. @Bikki was pointing out that int8 isn't useful for training, which traditionally requires more range & precision, like BF16. It's really inference that uses the lower-precision data types you mentioned.
AFAIK, Nvidia and others are actively researching lower-precision formats for training as well as inference. Some things do fine; others need the higher precision of BF16. If a specific algorithm can work with INT8 or FP8 instead of BF16/FP16, that portion can effectively run twice as fast. Nvidia's Transformer Engine is supposed to help with switching formats based on what is needed. https://blogs.nvidia.com/blog/2022/03/22/h100-transformer-engine/
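As a loose illustration of the mixed-precision idea (this uses stock PyTorch autocast with bf16, not the Transformer Engine API or fp8, so treat it purely as a sketch of the concept):

```python
# Loose sketch of mixed precision: eligible ops run in a low-precision dtype
# while the master weights stay in fp32. This is plain PyTorch autocast, NOT
# Nvidia's Transformer Engine, which handles format selection dynamically.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
x = torch.randn(32, 256)

# bf16 autocast works on CPU, so this runs without a GPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)              # torch.bfloat16 -> the matmuls ran in bf16
print(model[0].weight.dtype)  # torch.float32  -> master weights stay fp32
```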
 

bit_user

Jarred said:
AFAIK, Nvidia and others are actively researching lower-precision formats for training as well as inference. Some things do fine; others need the higher precision of BF16. If a specific algorithm can work with INT8 or FP8 instead of BF16/FP16, that portion can effectively run twice as fast. Nvidia's Transformer Engine is supposed to help with switching formats based on what is needed. https://blogs.nvidia.com/blog/2022/03/22/h100-transformer-engine/
That link only mentions int8 in passing, but actually talks about using fp8 for training.

The cute thing about fp8 is that it's so small you can exhaustively enumerate all possible values in a reasonably-sized table. The Wikipedia page has one that's 32 rows and 8 columns:
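If anyone wants to regenerate a table like that, a short script can decode every 8-bit pattern. This sketch assumes an IEEE-754-style 1-4-3 layout (sign, 4-bit exponent, 3-bit mantissa, bias 7); real fp8 flavors such as E4M3 reassign the infinity encodings to extra finite values, so the extremes won't match every published table exactly:

```python
# Enumerate all 256 values of an IEEE-754-style minifloat with 1 sign bit,
# 4 exponent bits and 3 mantissa bits (bias 7), printed 8 per row in the
# same 32x8 shape as the table mentioned above. Variants like E4M3
# repurpose the inf encodings, so the top of the range differs by convention.

EXP_BITS, MAN_BITS = 4, 3
BIAS = 2 ** (EXP_BITS - 1) - 1  # 7

def decode(bits):
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    man = bits & ((1 << MAN_BITS) - 1)
    if exp == 0:                          # subnormals and +/- zero
        return sign * (man / 2 ** MAN_BITS) * 2.0 ** (1 - BIAS)
    if exp == (1 << EXP_BITS) - 1:        # all-ones exponent: inf / NaN
        return sign * float("inf") if man == 0 else float("nan")
    return sign * (1 + man / 2 ** MAN_BITS) * 2.0 ** (exp - BIAS)

values = [decode(b) for b in range(256)]
for row in range(32):
    print("  ".join(f"{v:>10.5g}" for v in values[row * 8:(row + 1) * 8]))
```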