News AMD CDNA 3 Roadmap: MI300 APU With 5X Performance/Watt Uplift

Bikki

If they claim 8x for AI training, that would be floating-point matrix multiplication performance, something like bfloat16, not int8.
 
Bikki said:
If they claim 8x for AI training, that would be floating-point matrix multiplication performance, something like bfloat16, not int8.
People have been looking into lower-precision alternatives for AI training for years. Google's TPUs focus mostly on INT8, from what we can tell. Nvidia has been quoting teraops (INT8) figures for a few years, and I think Intel or Nvidia has even talked about INT2 being beneficial for certain AI applications. Given that MI250X has the same peak throughput for bfloat16, fp16, int8, and int4, there's no speedup right now. But if AMD reworks things so that two int8 or four int4 operations execute in the same time as a single 16-bit operation, that's a 2x or 4x speedup.
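To put rough numbers on that packing idea, here's a quick sketch; the hardware figures in it are made-up placeholders for illustration, not MI250X or MI300 specs:

```python
# Rough sketch of why packing narrower integers into the same ALU width
# multiplies peak throughput. All hardware numbers below are placeholders
# chosen for illustration; they are not MI250X (or MI300) specifications.

num_lanes   = 1000   # hypothetical number of 16-bit MAC lanes
clock_ghz   = 1.0    # hypothetical clock speed in GHz
ops_per_mac = 2      # a multiply-accumulate counts as two ops

def peak_tops(packing_factor):
    """Peak tera-ops/s if each 16-bit lane does `packing_factor` MACs per clock."""
    # lanes x GHz gives giga-MACs per second; /1000 converts giga-ops to tera-ops.
    return num_lanes * clock_ghz * ops_per_mac * packing_factor / 1000

for fmt, packing in [("bf16/fp16", 1), ("int8", 2), ("int4", 4)]:
    print(f"{fmt:9}: {peak_tops(packing):5.1f} peak TOPS ({packing}x the 16-bit rate)")
```

Packing alone tops out at the 2x/4x shown, so any larger claimed uplift has to come from somewhere else (more units, higher clocks, sparsity, and so on).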
 

bit_user

Jarred said:
People have been looking into lower-precision alternatives for AI training for years. Google's TPUs focus mostly on INT8, from what we can tell. Nvidia has been quoting teraops (INT8) figures for a few years, and I think Intel or Nvidia has even talked about INT2 being beneficial for certain AI applications.
Jarred, you're missing a key point: training vs. inference. @Bikki was pointing out that int8 isn't useful for training, which traditionally requires more range & precision, like BF16. It's really inference that uses the lower-precision data types you mentioned.
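For anyone who wants to see that range/precision trade-off in numbers, here's a small sketch that derives them straight from the bit layouts (nothing vendor-specific, just the IEEE-754-style sign/exponent/mantissa split):

```python
# Dynamic range and precision of a few formats, derived from their bit
# layouts (1 sign bit, `exp_bits` exponent bits, `man_bits` mantissa bits).
# bf16 keeps fp32's 8-bit exponent, so it keeps fp32's range while giving
# up precision; fp16 trades range for extra precision.

def float_format(exp_bits, man_bits):
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # exponent field is 2**exp_bits - 2.
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** ((2 ** exp_bits - 2) - bias)
    min_normal = 2.0 ** (1 - bias)      # smallest positive normal value
    epsilon = 2.0 ** -man_bits          # spacing just above 1.0
    return max_val, min_normal, epsilon

for name, e, m in [("fp32", 8, 23), ("bf16", 8, 7), ("fp16", 5, 10)]:
    mx, mn, eps = float_format(e, m)
    print(f"{name}: max ~{mx:.3g}, min normal ~{mn:.3g}, epsilon {eps:.3g}")

print("int8: -128..127 in steps of 1, so scale factors have to carry the range")
```

bf16's appeal for training is exactly that its max and min match fp32's while only 16 bits move through the hardware.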
 
bit_user said:
Jarred, you're missing a key point: training vs. inference. @Bikki was pointing out that int8 isn't useful for training, which traditionally requires more range & precision, like BF16. It's really inference that uses the lower-precision data types you mentioned.
AFAIK, Nvidia and others are actively researching lower-precision formats for training as well as inference. Some things do fine; others need the higher precision of BF16. If a specific algorithm can work with INT8 or FP8 instead of BF16/FP16, that portion can effectively run twice as fast. Nvidia's Transformer Engine is supposed to help with switching formats based on what is needed. https://blogs.nvidia.com/blog/2022/03/22/h100-transformer-engine/
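As a loose illustration of the mixed-precision idea (this uses stock PyTorch autocast with bf16, not the Transformer Engine API or fp8, so treat it purely as a sketch of the concept):

```python
# Loose sketch of mixed precision: eligible ops run in a low-precision dtype
# while the master weights stay in fp32. This is plain PyTorch autocast, NOT
# Nvidia's Transformer Engine, which handles format selection dynamically.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
x = torch.randn(32, 256)

# bf16 autocast works on CPU, so this runs without a GPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)              # torch.bfloat16 -> the matmuls ran in bf16
print(model[0].weight.dtype)  # torch.float32  -> master weights stay fp32
```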
 

bit_user

Jarred said:
AFAIK, Nvidia and others are actively researching lower-precision formats for training as well as inference. Some things do fine; others need the higher precision of BF16. If a specific algorithm can work with INT8 or FP8 instead of BF16/FP16, that portion can effectively run twice as fast. Nvidia's Transformer Engine is supposed to help with switching formats based on what is needed. https://blogs.nvidia.com/blog/2022/03/22/h100-transformer-engine/
That link only mentions int8 in passing, but actually talks about using fp8 for training.

The cute thing about fp8 is that it's so small you can exhaustively enumerate all possible values in a reasonably-sized table. The Wikipedia page has one that's 32 rows and 8 columns:
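If anyone wants to regenerate a table like that, a short script can decode every 8-bit pattern. This sketch assumes an IEEE-754-style 1-4-3 layout (sign, 4-bit exponent, 3-bit mantissa, bias 7); real fp8 flavors such as E4M3 reassign the infinity encodings to extra finite values, so the extremes won't match every published table exactly:

```python
# Enumerate all 256 values of an IEEE-754-style minifloat with 1 sign bit,
# 4 exponent bits and 3 mantissa bits (bias 7), printed 8 per row in the
# same 32x8 shape as the table mentioned above. Variants like E4M3
# repurpose the inf encodings, so the top of the range differs by convention.

EXP_BITS, MAN_BITS = 4, 3
BIAS = 2 ** (EXP_BITS - 1) - 1  # 7

def decode(bits):
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    man = bits & ((1 << MAN_BITS) - 1)
    if exp == 0:                          # subnormals and +/- zero
        return sign * (man / 2 ** MAN_BITS) * 2.0 ** (1 - BIAS)
    if exp == (1 << EXP_BITS) - 1:        # all-ones exponent: inf / NaN
        return sign * float("inf") if man == 0 else float("nan")
    return sign * (1 + man / 2 ** MAN_BITS) * 2.0 ** (exp - BIAS)

values = [decode(b) for b in range(256)]
for row in range(32):
    print("  ".join(f"{v:>10.5g}" for v in values[row * 8:(row + 1) * 8]))
```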