My guess is that it's basically like texture compression. I've heard of mobile NPUs using compression on the weights, but not really seen AMD or Nvidia talk about it (other than handling a limited degree of sparseness). It always seemed to me like a natural thing for GPUs to do, given their texture units are already in the datapath and already have hardware support for texture compression. ...except this isn't a GPU!
No. ...I'm 99.9% sure it's not.
But that's irrelevant if it's basically invisible to software, as they claim. The main reason for IEEE standards is to have some consistency between hardware implementations, so that software doesn't have to introduce a ton of special cases for each hardware implementation.
Yeah, could be, but I'm doubtful because it seems like a stumbling block for scaling size & speed long term.
I got the feeling it's more like how BFloat16 (the other B[F]16 🤓 ) relates to FP32: keep the exponent, round/truncate the mantissa. So it becomes an 'efficient'/elegant math workaround that can be used in any scenario, rather than adding silicon for de/compression. But that's just my take/guess on it.
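For anyone who hasn't looked at it: the reason BF16↔FP32 conversion is so cheap is that BF16 is literally the top 16 bits of an FP32. A minimal sketch (round-to-nearest-even on the dropped bits; the helper names are mine, not any library's):

```python
import struct

def f32_to_bits(x: float) -> int:
    # reinterpret an FP32 value as its 32-bit pattern
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def f32_to_bf16(x: float) -> int:
    """FP32 -> BF16: keep sign + 8-bit exponent + top 7 mantissa bits,
    rounding to nearest even on the 16 bits that get dropped."""
    b = f32_to_bits(x)
    rounded = b + 0x7FFF + ((b >> 16) & 1)  # nearest-even rounding bias
    return (rounded >> 16) & 0xFFFF

def bf16_to_f32(h: int) -> float:
    """BF16 -> FP32 is just a 16-bit left shift (zero-pad the mantissa)."""
    return bits_to_f32(h << 16)

print(hex(f32_to_bf16(1.0)))              # exponent/sign survive exactly
print(bf16_to_f32(f32_to_bf16(3.14159)))  # ~3.14, with only 7 mantissa bits
```

Same dynamic range as FP32, just less precision, which is why it's a "free" drop-in for a lot of training/inference math.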
It's something I wish they'd spent more time on, but I suspect we'll get that in the deep dives in the coming days/weeks before launch.
My initial read on it was "OH, this is how AMD is going to respond to not having Intel's AVX FP16 support," which of course is helpful for AI workloads.
Who knows, it may be the best of both worlds, giving them speed down low for consumer AI applications/platforms. Those are definitely more about doing 95% of the job in 1/3 the time than about speeding up full-fat FP16/FP32 by 20-50% (usually closer to 10%, because at that level you're still heavily memory bound/restricted, and more so if you're compression dependent IMO). 🤔🤷🏻♂️
To me, BFP16 is like adding a small turbo to a small engine in a light sports car (Lotus 7)... boom, huge impact. FP16 is like adding another turbo to a Chiron: OK, an improvement, but not dramatic, and not where the majority of the market will be for Copilot+ level AI PCs for the next coupla years.
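And for anyone wondering where the "Block" in Block FP16 comes from: block floating point stores one shared exponent per group of values plus a small integer mantissa per value, so most of the datapath becomes cheap integer math. A rough sketch of the idea (hypothetical block layout and bit widths, not AMD's actual encoding):

```python
import math

def quantize_block(values, mantissa_bits=8):
    """Block floating point: one shared exponent for the whole block,
    plus one small integer mantissa per value.
    Hypothetical illustration, not any vendor's real format."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0, [0] * len(values)
    # pick the exponent so the largest value fills the mantissa range
    shared_exp = math.frexp(max_abs)[1]   # max_abs = m * 2**shared_exp
    scale = 2.0 ** (shared_exp - mantissa_bits + 1)
    mantissas = [round(v / scale) for v in values]
    return shared_exp, mantissas

def dequantize_block(shared_exp, mantissas, mantissa_bits=8):
    scale = 2.0 ** (shared_exp - mantissa_bits + 1)
    return [m * scale for m in mantissas]

exp, mants = quantize_block([1.0, 0.5, 0.25, -0.125])
print(exp, mants)                        # one exponent, tiny int mantissas
print(dequantize_block(exp, mants))
```

The trade-off is visible right there: values much smaller than the block's max lose precision to the shared exponent, which is fine for NN weights that cluster in magnitude but rough on outlier-heavy data.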
Higher precision at higher speed still seems like the domain of commercial applications... (where dedicated racks of precision are still telling people to eat rocks 🥸 🤣 ).
Now if only Block FP16 were as exciting as the French movie District BF13. 🤡
(* after reading your other post in the other thread, I think I added too many unnecessary analogies and aphorisms, when you likely would've been fine with just half the words in the second paragraph. THG has definitely added more depth in the forums since I left over a decade ago. Wish there were more deep-dive folks then. Now... to try and curb the dad humour.... must resist. 🥸 😉 )