News BFloat16 Deep Dive: ARM Brings BF16 Deep Learning Data Format to ARMv8-A

Nice article. It's good to see this sort of content on the site.

a number thus contains three pieces of information: its sign, its mantissa (which itself can be positive or negative) and the exponent.
Um, I think you meant to say the exponent can itself be positive or negative. The sign bit applies to the mantissa, but the exponent is biased (i.e. stored with an offset of 127, so a stored 8-bit exponent of 127 represents 0, anything less than that is negative, and anything greater is positive).
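To make the bias concrete, here's a minimal sketch in Python (my example, not from the article) that pulls the three fields out of an FP32 bit pattern and subtracts the 127 bias to recover the signed exponent:

    # Decode an IEEE 754 single-precision value into its sign, exponent, and fraction fields.
    import struct

    def decode_fp32(x):
        bits, = struct.unpack('>I', struct.pack('>f', x))
        sign = bits >> 31                  # 1 bit, applies to the mantissa
        exp_biased = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
        fraction = bits & 0x7FFFFF         # 23 stored fraction bits
        return sign, exp_biased - 127, fraction

    print(decode_fp32(-0.625))   # (1, -1, 2097152): -0.625 == -(1 + 2097152/2**23) * 2**-1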

To make matters simple, IEEE has standardized several of these floating-point number formats for computers
You could list the IEEE standard (or even provide a link), so that people could do some more reading, themselves. I applaud your efforts to explain these number formats, but grasping such concepts from quite a brief description is a lot to expect of readers without prior familiarity. To that end, perhaps the Wikipedia page is a reasonable next step for any who're interested:


Nvidia recognized this trend early on (possibly aided by its mobile aspirations at the time) and introduced half-precision support in Maxwell in 2014 at twice the throughput (FLOPS) of FP32.
That's only sort of true. Their Tegra X1 is the only Maxwell-derived architecture to have it. And of Pascal (the following generation), the only chip to have it was the server-oriented P100.

In fact, Intel was first to the double-rate fp16 party, with their Gen8 (Broadwell) iGPU!

AMD was a relative late-comer, only adding full support in Vega. However, a few generations prior, they had load/store support for fp16, so that it could be used as the in-memory representation while actual computations continued to use full fp32.

It could be noted that use of fp16 in GPUs goes back about a decade further, when people had aspirations of using it for certain graphical computations (think shading or maybe Z-buffering, rather than geometry). And that format was included in the 2008 version of the standard. Unfortunately, there was sort of a chicken-and-egg problem: GPUs added only token hardware support for it, so few games bothered to use it.

[Diagram from the article showing the bfloat16 format.]


This is a nice diagram, but it would've been interesting to see FP16 aligned on the exponent-fraction boundary.
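For anyone curious about the layout, here's a small sketch (mine, not from the article): bfloat16 is simply the top 16 bits of FP32 (1 sign, 8 exponent, 7 fraction bits), so a conversion can be a plain truncation, whereas FP16's 5-bit exponent would need re-biasing, which is why it doesn't line up on that boundary:

    # FP32 <-> bfloat16 by keeping or dropping the low 16 bits.
    # Truncation for simplicity; real hardware would typically round to nearest even.
    import struct

    def fp32_to_bf16_bits(x):
        bits, = struct.unpack('>I', struct.pack('>f', x))
        return bits >> 16          # sign + 8-bit exponent + top 7 fraction bits

    def bf16_bits_to_fp32(b):
        return struct.unpack('>f', struct.pack('>I', b << 16))[0]

    print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))   # 3.140625, only ~3 significant digits survive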

hardware area (number of transistors) scales roughly with the square of the mantissa width
Important point - thanks for mentioning.
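As a rough back-of-the-envelope illustration (my numbers, not from the article): counting the implicit leading 1, FP32 multiplies 24-bit significands while bfloat16 multiplies 8-bit ones, so a BF16 multiplier array needs on the order of (8/24)^2, i.e. roughly 1/9th the area, which is where most of the silicon and power saving comes from.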

The elements of a Flexpoint tensor are (16-bit) integers, but they have a shared (5-bit) exponent whose storage and communication can be amortized over the whole tensor
As a side note, there are some texture compression formats like this. Perhaps that's where they got the idea?
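For what it's worth, here's a minimal sketch of that shared-exponent (block floating point) idea as I read the quoted description; the int16 width, the exponent-selection rule, and the example values below are my own illustration, not Intel's actual Flexpoint algorithm:

    # Quantize a tensor to 16-bit integers plus one shared power-of-two exponent.
    import numpy as np

    def to_shared_exponent(t, int_bits=16):
        max_int = 2**(int_bits - 1) - 1                   # 32767 for int16
        max_abs = float(np.max(np.abs(t)))
        # smallest power-of-two scale that lets the largest element fit in range
        exp = int(np.ceil(np.log2(max_abs / max_int))) if max_abs > 0 else 0
        ints = np.round(t / 2.0 ** exp).astype(np.int16)
        return ints, exp                                  # ints * 2**exp approximates t

    def from_shared_exponent(ints, exp):
        return ints.astype(np.float32) * 2.0 ** exp

    t = np.array([0.5, -3.25, 100.0], dtype=np.float32)
    print(from_shared_exponent(*to_shared_exponent(t)))   # close to the original values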

ARM too has not followed FP32 rigorously and instead introduced some simplifications.
Specific to BFloat16 instructions, right? Otherwise, I believe ARMv8A is IEEE 754-compliant.

The new BF16 instructions will be included in the next update of the Armv8-A instruction set architecture. Albeit not yet announced, this would be ARMv8.5-A. They should find their way to ARM processors from Arm's partners after that.
This strikes me as a bit odd. I just don't see people building AI training chips out of ARMv8A cores. I suppose people can try, but they're already outmatched.
 
Nice article. It's good to see this sort of content on the site.
Thanks.

Um, I think you meant to say the exponent can itself be positive or negative.
Correct.

That's only sort of true. Their Tegra X1 is the only Maxwell-derived architecture to have it. And of Pascal (the following generation), the only chip to have it was the server-oriented P100.

In fact, Intel was first to the double-rate fp16 party, with their Gen8 (Broadwell) iGPU!

(...)
Good comment. The P100 indeed introduced it for deep learning, not so much Maxwell.
As a side note, there are some texture compression formats like this. Perhaps that's where they got the idea?
I think they just tried to come up with a scheme to be able to use integer hardware instead of FP.


Specific to BFloat16 instructions, right? Otherwise, I believe ARMv8A is IEEE 754-compliant.
Yes, I was talking about the new BF16 instructions.


This strikes me as a bit odd. I just don't see people building AI training chips out of ARMv8A cores. I suppose people can try, but they're already outmatched.
I guess we'll see. Arm is adding the support, so someone will use it eventually, I'd imagine.
 
I guess we'll see. Arm is adding the support, so someone will use it eventually, I'd imagine.
My guess is that ARM got requests for it, a couple years ago, in the earlier days of the AI boom. Sometimes, feature requests take a while to percolate through the product development pipeline and, by the time they finally reach the market, everybody has moved on.

That's sort of how I see AMD's fumbling with deep learning features, only they've done a little bit better. For a couple of generations, they managed to leap-frog Nvidia's previous generation, but were well outmatched by Nvidia's current offering. So, I'm wondering whether AMD will get serious about building a best-in-class AI chip, or just accept that they missed the market window and back away from it.