Nvidia unveils TensorRT-LLM software that promises a major boost for LLM inference.
Nvidia Claims Doubled Inference Performance with H100: Read more
Inference in many cases can go much lower than eight bits. Large language models retain upwards of 98% of full-precision accuracy with just five bits, and even two-bit inference is usable. FP8 will in most cases be indistinguishable from full precision.

I do have to wonder how much of this is simply from using FP8 computations instead of FP16/BF16: half the bandwidth, double the compute, double the performance. But I would seriously doubt that all AI algorithms could use FP8 without running into problems from the loss of precision.
More likely, this is simply a case of the base models and algorithms not being especially well tuned. Getting a 2X speedup by focusing on optimizations, especially when it's done by Nvidia engineers with deep knowledge of the hardware, is definitely possible.
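A rough way to see why low-bit inference can hold up is to quantize a weight tensor and measure the error directly. The sketch below is plain NumPy, not TensorRT-LLM; the uniform round-to-nearest scheme is an assumption for illustration only (real LLM quantizers use per-channel scales, FP8 formats, outlier handling, and so on), but it shows how quickly quantization error shrinks as bit width grows:

```python
# Minimal sketch (assumed uniform round-to-nearest quantization, not Nvidia's code):
# quantize a toy weight tensor at several bit widths and compare against full precision.
import numpy as np

def quantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Uniformly quantize weights to 2**bits levels over their range, then dequantize."""
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels
    codes = np.round((weights - w_min) / scale)   # integer codes in [0, levels]
    return codes * scale + w_min                  # map back to floats for comparison

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)           # toy weight tensor

for bits in (8, 5, 2):
    err = np.abs(quantize(w, bits) - w)
    print(f"{bits}-bit: mean abs error = {err.mean():.6f}, max = {err.max():.6f}")
```

Running it, the mean error at 8 bits is tiny relative to the weight scale, still small at 5 bits, and only at 2 bits does it become a meaningful fraction of the weights themselves, which matches the pattern the accuracy numbers above suggest.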