Ugh. Okay, here we go...
This immediately struck me as weird, because Nvidia has added hardware acceleration for JPEG decoding. It's described specifically in reference to the A100, here:
According to surveys, the average person produces 1.2 trillion images that are captured by either a phone or a digital camera. The storage of such images, especially in high-resolution raw format… (developer.nvidia.com)
There, they compare it against CPU decoding and software-only (CUDA) GPU decoding:
As that implies, there are two options for GPU-accelerated decoding (a rough sketch of a pipeline that can use both follows this list):
- Use generic CUDA code/cores to accelerate the parallel portions of decoding (e.g. dequantization, IDCT, resampling, and colorspace transform).
- Use the NVJPEG engine, newly added to the A100.
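To make that concrete, here's a minimal sketch of what such a decode pipeline might look like, written with NVIDIA's DALI library. This is my own illustration, not anything from the Hugging Face post; the dataset path, batch size, and tuning numbers are placeholders. DALI's `hw_decoder_load` parameter splits the work between the hardware NVJPEG engine and CUDA-kernel decoding:

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def jpeg_decode_pipeline(data_dir):
    # Read the raw encoded JPEG bytes (and labels) on the host.
    encoded, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    # device="mixed": encoded bytes come from host memory, decoded images land in GPU memory.
    # hw_decoder_load steers roughly this fraction of images to the A100's hardware NVJPEG
    # engine; the rest are decoded with CUDA kernels (the first option above).
    images = fn.decoders.image(
        encoded,
        device="mixed",
        output_type=types.RGB,
        hw_decoder_load=0.75,
    )
    return images, labels

pipe = jpeg_decode_pipeline("/path/to/imagenet/train")  # placeholder path
pipe.build()
images, labels = pipe.run()  # images is a TensorListGPU, already resident on the device
```

The point is that on an A100-class system, getting decoded images into GPU memory this way is an exposed, off-the-shelf capability rather than something exotic.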
According to this diagram, the NVJPEG Engine even handles Huffman decoding:
That diagram assumes you want the final image back on the CPU, but elsewhere on the page they state that the library also supports the following mode:
* Input to the library is in the host memory, and the output is in the GPU memory.
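As a hedged illustration of that mode (not something taken from either blog post; the filename is a placeholder, and this assumes a CUDA-enabled torchvision build with nvJPEG support), torchvision's `decode_jpeg` will take the encoded bytes as a CPU tensor and, when asked, decode them through nvJPEG directly into GPU memory:

```python
from torchvision.io import read_file, decode_jpeg

# read_file returns the raw encoded JPEG bytes as a uint8 tensor in host memory.
data = read_file("sample.jpg")  # placeholder filename

# device="cuda" routes the decode through nvJPEG: the input stays in host memory,
# and the decoded uint8 tensor ([3, H, W] for a typical RGB JPEG) lands in GPU memory.
img_gpu = decode_jpeg(data, device="cuda")
print(img_gpu.device, img_gpu.dtype, img_gpu.shape)
```

If that call fails with a runtime error, the installed torchvision simply wasn't built against nvJPEG, which also makes it a handy smoke test for whether a given training environment can use it at all.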
So, what are we to make of Hugging Face's data? I checked the blog post cited by the article, and it makes absolutely no mention of nvJPEG. I went one step further and searched the linked git repo, again finding no reference to nvJPEG. I wouldn't say that's conclusive, because I don't know enough about how all of its dependencies are provided, or exactly where you'd expect nvJPEG to show up if it is in fact capable of being used. But I think I've done enough digging that questions should be raised and answered.
If their blog post were instead an academic paper, you'd absolutely expect them to mention nvJPEG and either demonstrate that it's being used or explain why not. And if they were comparing against nvJPEG, you'd expect them to point out how much better Habana's solution is than even Nvidia's purpose-built hardware engine. As it stands, this smells fishy: either the study's authors are not truly disinterested in the outcome, or they are surprisingly ignorant of, and incurious about, Nvidia's answer to this problem. Given that they correctly identified JPEG decoding as a bottleneck, it would be awfully surprising if Nvidia had never taken notice or done anything effective to alleviate it.
Another thought I had: I don't know how heavyweight the Hugging Face model is. If I were looking to accentuate a bottleneck in JPEG decoding, I'd use a relatively lightweight model that also plays to the other strengths of Habana's hardware. In other words, even if their experiment was properly conducted, the findings might not carry over to many of the models people are actually using, let alone to newer chips like the H100.