News: Nvidia Claims Doubled Inference Performance with H100

I do have to wonder how much of this is simply from using FP8 computation instead of FP16/BF16: half the memory traffic per value, double the compute throughput, roughly double the performance. But I seriously doubt that every AI algorithm can use FP8 without running into problems from the loss of precision.

More likely, this is simply a case of the base models and algorithms not being tuned very well. Getting a 2X speedup by focusing on optimizations, especially when it's done by Nvidia people with deep knowledge of the hardware, is definitely possible.
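
To put a rough number on that precision concern, here is a minimal NumPy sketch that simulates round-to-nearest quantization onto an E4M3 grid (one of the two FP8 formats H100 supports, alongside E5M2) and measures the output error of a matrix-vector product. The quantize_e4m3 helper and the random test data are purely illustrative, not Nvidia's Transformer Engine API, and real FP8 pipelines also apply per-tensor scaling that this sketch omits.

```python
import numpy as np

def quantize_e4m3(x, exp_bits=4, man_bits=3):
    """Round-to-nearest quantization onto a simulated FP8 E4M3 grid (illustrative only)."""
    x = np.asarray(x, dtype=np.float32)
    bias = 2 ** (exp_bits - 1) - 1                        # 7 for E4M3
    max_exp = (2 ** exp_bits - 1) - bias                  # 8
    min_exp = 1 - bias                                    # -6, smallest normal exponent
    max_val = (2 - 2 ** (1 - man_bits)) * 2.0 ** max_exp  # 448 (top mantissa code is NaN)

    sign = np.sign(x)
    mag = np.abs(x)
    # Per-element exponent, clamped so tiny values land on the subnormal grid.
    e = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), min_exp, max_exp)
    step = 2.0 ** (e - man_bits)                          # spacing of representable values
    q = np.clip(np.round(mag / step) * step, 0.0, max_val)
    return (sign * q).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512), dtype=np.float32)
x = rng.standard_normal(512, dtype=np.float32)

y_fp32 = w @ x                                   # full-precision reference
y_fp8 = quantize_e4m3(w) @ quantize_e4m3(x)      # FP8-rounded inputs, FP32 accumulation

rel_err = np.linalg.norm(y_fp32 - y_fp8) / np.linalg.norm(y_fp32)
print(f"relative output error with simulated FP8 inputs: {rel_err:.4%}")
```

On well-scaled data like this the error stays small, which is why FP8 works at all; the trouble comes from outlier values that blow past E4M3's narrow range, which is what the per-tensor scaling is there to absorb.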
 
Inference can in many cases go much lower than eight bits. Large language models retain upwards of 98% of full-precision accuracy with just five bits, and even two-bit inference is usable. FP8 will, in most cases, be indistinguishable from full precision.
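
For a concrete sense of what dropping below eight bits does, here is a naive round-to-nearest weight quantizer in NumPy with a single scale per output row. This is a crude baseline for illustration, not what practical low-bit schemes (GPTQ, llama.cpp's k-quants, and so on) actually do; those add group-wise scales and calibration, which is largely why they hold up at five or even two bits where plain round-to-nearest falls apart.

```python
import numpy as np

def quantize_kbit(w, bits):
    """Symmetric round-to-nearest quantization, one scale per output row (naive baseline)."""
    levels = 2 ** (bits - 1) - 1                       # 127 for 8-bit, 15 for 5-bit, 1 for 2-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / levels
    q = np.clip(np.round(w / scale), -levels, levels)  # integer codes in [-levels, levels]
    return (q * scale).astype(np.float32)              # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024), dtype=np.float32)
x = rng.standard_normal(1024, dtype=np.float32)
y_ref = w @ x

for bits in (8, 5, 2):
    y_q = quantize_kbit(w, bits) @ x
    rel_err = np.linalg.norm(y_ref - y_q) / np.linalg.norm(y_ref)
    print(f"{bits}-bit weights: relative output error {rel_err:.3f}")
```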