Nice article. It's good to see this sort of content on the site.
a number thus contains three pieces of information: its sign, its mantissa (which itself can be positive or negative) and the exponent.
Um, I think you meant to say the exponent can itself be positive or negative. The sign bit applies to the mantissa, but the exponent is biased (i.e. stored so that an 8-bit exponent field of 127 means 0; anything less than that is negative, and anything greater is positive).
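If it helps, here's a quick Python sketch of my own (just for illustration) that unpacks an FP32 bit pattern into its fields, so you can see the 127 bias in action:

```python
import struct

def fp32_fields(x):
    """Unpack an IEEE 754 single-precision value into its sign, exponent and fraction fields."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign      = bits >> 31            # 1 bit
    exp_field = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction  = bits & 0x7FFFFF       # 23 bits
    return sign, exp_field, exp_field - 127, fraction

# 1.0 stores an exponent field of 127 (true exponent 0),
# 0.5 stores 126 (true exponent -1), 2.0 stores 128 (true exponent +1).
for v in (1.0, 0.5, 2.0):
    print(v, fp32_fields(v))
```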
To make matters simple, IEEE has standardized several of these floating-point number formats for computers
You could list the IEEE standard (or even provide a link), so that people could do some more reading themselves. I applaud your efforts to explain these number formats, but grasping such concepts from quite a brief description is a lot to expect of readers without prior familiarity. To that end, perhaps the Wikipedia page is a reasonable next step for any who're interested:
en.wikipedia.org
Nvidia recognized this trend early on (possibly aided by its mobile aspirations at the time) and introduced half-precision support in Maxwell in 2014 at twice the throughput (FLOPS) of FP32.
That's only sort of true. Their Tegra X1 is the only Maxwell-derived architecture to have it. And of Pascal (the following generation), the only chip to have it was the server-oriented P100.
In fact, Intel was first to the double-rate fp16 party, with their Gen8 (Broadwell) iGPU!
AMD was a relative late-comer, only adding full support in Vega. However, a few generations prior, they had load/store support for fp16, so that it could be used as the in-memory representation while actual computations continued to use full fp32.
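For what it's worth, that pattern is easy to mimic in software. Here's a rough NumPy sketch of the idea - nothing to do with AMD's actual hardware, just showing fp16 as the storage format while the math is done in fp32:

```python
import numpy as np

# Bulky data kept in memory as fp16 (half the footprint of fp32)...
weights = np.random.randn(1024, 1024).astype(np.float16)
inputs  = np.random.randn(1024).astype(np.float16)

# ...but upcast to fp32 for the actual arithmetic, then optionally stored back as fp16.
result_fp32 = weights.astype(np.float32) @ inputs.astype(np.float32)
result_fp16 = result_fp32.astype(np.float16)

print(weights.nbytes, "bytes in memory vs",
      weights.astype(np.float32).nbytes, "if kept as fp32")
```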
It could be noted that use of fp16 in GPUs goes back about a decade further, when people had aspirations of using it for certain graphical computations (think shading or maybe Z-buffering, rather than geometry). And that format was included in the 2008 version of the standard. Unfortunately, there was sort of a chicken-and-egg problem: GPUs added only token hardware support for it, and therefore few games bothered to use it.
This is a nice diagram, but it would've been interesting to see FP16 aligned on the exponent-fraction boundary.
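Something like this is what I mean - a throwaway sketch, using the standard 1/8/23, 1/5/10 and 1/8/7 field widths for FP32, FP16 and BF16, lined up on that boundary:

```python
formats = {            # (sign bits, exponent bits, fraction bits)
    "FP32": (1, 8, 23),
    "BF16": (1, 8, 7),
    "FP16": (1, 5, 10),
}

widest_exp  = max(e for _, e, _ in formats.values())
widest_frac = max(f for _, _, f in formats.values())

for name, (s, e, f) in formats.items():
    # right-align sign+exponent, left-align fraction, so the boundary lines up in every row
    row = ("S" * s + "E" * e).rjust(1 + widest_exp) + "." + ("F" * f).ljust(widest_frac)
    print(f"{name:5} {row}")
```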
hardware area (number of transistors) scales roughly with the square of the mantissa width
Important point - thanks for mentioning it.
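To put rough numbers on it (my own back-of-the-envelope, assuming the multiplier dominates and counting the implicit leading bit, so 24/11/8 significand bits for FP32/FP16/BF16):

```python
significand_bits = {"FP32": 24, "FP16": 11, "BF16": 8}   # includes the implicit leading 1

fp32_cost = significand_bits["FP32"] ** 2
for name, bits in significand_bits.items():
    cost = bits ** 2                                      # multiplier area ~ width^2
    print(f"{name}: {bits}^2 = {cost:3d}  (~{fp32_cost / cost:.1f}x smaller than FP32)")
```

By that rule of thumb, an FP16 multiplier is roughly 5x smaller than FP32, and a BF16 one roughly 9x smaller.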
The elements of a Flexpoint tensor are (16-bit) integers, but they have a shared (5-bit) exponent whose storage and communication can be amortized over the whole tensor
As a side note, there are some texture compression formats like this. Perhaps that's where they got the idea?
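For anyone curious what the shared-exponent trick looks like, here's a toy block-floating-point sketch of my own - not Nervana's actual Flexpoint format - where one exponent is chosen for the whole tensor and the elements become plain 16-bit integers:

```python
import numpy as np

def encode_shared_exponent(x, int_bits=16):
    """Toy block floating point: one shared exponent, int16 mantissas for every element."""
    int_max = 2 ** (int_bits - 1) - 1                     # 32767 for int16
    max_mag = float(np.max(np.abs(x)))
    # pick the one exponent that lets the largest value fit in the integer range
    exp = int(np.ceil(np.log2(max_mag / int_max))) if max_mag > 0 else 0
    ints = np.round(x / 2.0 ** exp).astype(np.int16)      # stored per element
    return ints, exp                                       # exp stored once per tensor

def decode_shared_exponent(ints, exp):
    return ints.astype(np.float32) * 2.0 ** exp

x = np.random.randn(4, 4).astype(np.float32)
ints, exp = encode_shared_exponent(x)
print("shared exponent:", exp)
print("max abs error:", np.max(np.abs(decode_shared_exponent(ints, exp) - x)))
```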
ARM too has not followed FP32 rigorously and instead introduced some simplifications.
Specific to BFloat16 instructions, right? Otherwise, I believe ARMv8A is IEEE 754-compliant.
The new BF16 instructions will be included in the next update of the Armv8-A instruction set architecture. Albeit not yet announced, this would be ARMv8.5-A. They should find their way to ARM processors from its partners after that.
This strikes me as a bit odd. I just don't see people building AI training chips out of ARMv8A cores. I suppose people can try, but they're already outmatched.