InvalidError :
Sounds like you think that FP16 is slower.
Because it is. You're so wrapped up in thinking about the hardware implementation that you're forgetting the vast majority of GPUs out there have token support for it, at best. All of the consumer Pascal GPUs have fp16 performance that's 1/64th of their fp32 performance, because Nvidia included only a single 2x-rate fp16 unit per SM.
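To put a number on what "one 2x fp16 unit" buys you, here's a minimal CUDA sketch (the kernel name is mine, purely illustrative) of the packed-half2 path those units execute, where each instruction does two half-precision operations at once. On GP100 this runs at double the fp32 rate; on GP102/104/106, it all funnels through that single fp16x2 unit per SM, hence the 1/64 ratio:

    #include <cuda_fp16.h>

    // Packed fp16 math: each __hfma2 performs two half-precision
    // fused multiply-adds in one instruction (compute capability 5.3+).
    __global__ void fma_half2(const __half2 *a, const __half2 *b,
                              __half2 *acc, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            acc[i] = __hfma2(a[i], b[i], acc[i]);  // acc = a*b + acc, on both halves
    }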
InvalidError :
An FP16 multiplier is about 1/4th the size of an FP32 multiplier and in many cases, FP16 multiplication is achieved by partitioning the FP32 multiplier instead of using completely separate circuitry.
That's how Nvidia did it in the GP100, but I don't know about Intel or AMD. Even so, it still costs die area and power. All for a feature that goes unused in most, if not all, games today. That's probably why it didn't take hold until machine learning offered a large enough benefit to justify adding it.
InvalidError :
The only reason you are seeing FP16 being slower than FP32 on consumer GPUs is arbitrary vendor restrictions, as evidenced by how the same architectures in "professional" variants mysteriously become twice as fast at FP16 as at FP32. Why only twice as fast instead of the 4x I claimed above? Because without extending the input register size by one method or another, halving the operand size only doubles the number of possible inputs and outputs from those same registers.
Great, so not only do they have the overhead of partitioning for fp16, but you also want them to double their registers and all the datapaths connecting them? Do you realize the GP102 has roughly 2 million 32-bit registers (30 SMs × 64K registers each)? You could enable the use of register pairs (ignoring, for a moment, all the complexity that would add), but you'd still need to double the datapaths.
InvalidError :
If GPUs went FP16-centric, FP16 performance would be quadruple that of FP32 performance.
It won't happen, because fp16 can be substituted for fp32 in only a minority of cases. Otherwise, they'd have done it already. Like I said, fp16 (aka half-precision) has been kicking around in shading languages for a while. x86 CPUs even have a pair of instructions for converting to/from it (the F16C extension's VCVTPS2PH/VCVTPH2PS), which I'm pretty sure was a nod to GPUs, since those are the only fp16 instructions they currently have.
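To make that last point concrete, here's a small host-side sketch using those F16C intrinsics (compile with -mf16c; the variable names are just for illustration). Notice that you can narrow fp32 values to fp16 and widen them back, but all the actual arithmetic still has to happen in fp32:

    #include <immintrin.h>  // F16C conversion intrinsics
    #include <cstdio>

    int main()
    {
        float in[4] = {1.0f, 0.5f, 3.14159f, 65504.0f};  // 65504 = largest normal fp16

        // VCVTPS2PH: narrow four fp32 values to four fp16 values (storage only)
        __m128i h = _mm_cvtps_ph(_mm_loadu_ps(in), _MM_FROUND_TO_NEAREST_INT);

        // VCVTPH2PS: widen them back to fp32 before doing any math with them
        float out[4];
        _mm_storeu_ps(out, _mm_cvtph_ps(h));

        for (int i = 0; i < 4; i++)
            printf("%g -> %g\n", in[i], out[i]);  // pi comes back as ~3.14062
        return 0;
    }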
The route Nvidia seems to be taking to unlock the full potential of fp16 is their special-purpose tensor cores. Great for machine learning, but not very interesting for other fp16 opportunities, like your HDR example. To my earlier point, the rise of machine learning might not actually benefit graphics processing horsepower, in the end.
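For the curious, here's roughly what driving those tensor cores looks like: a minimal sketch (the function name is mine) of one 16x16x16 matrix multiply-accumulate through CUDA's WMMA API, which requires Volta or newer and a full warp of 32 threads. Note how rigid it is: fixed tile shapes, fp16 inputs, matrix math only, which is exactly why it suits neural nets and does nothing for a shader that just wants cheap scalar fp16:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes C = A*B + C for a single 16x16x16 tile.
    // A and B are fp16; the accumulator stays in fp32.
    __global__ void wmma_tile(const half *a, const half *b, float *c)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::load_matrix_sync(a_frag, a, 16);
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the tensor core op
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }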