News Microsoft researchers build 1-bit AI LLM with 2B parameters — model small enough to run on some CPUs

The race to the bottom is finished... time to turn around and start heading back up. Though I think we'll need a paradigm shift to improve efficiency at higher precision. I still believe that eventually we'll get there.
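For context, "1-bit" here really means ternary: the weights take only the values -1, 0, or +1, which works out to log2(3) ≈ 1.58 bits each. Here's a rough sketch of how that quantization works, as I understand the BitNet b1.58 papers (the function and names are my own illustration, not Microsoft's code):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale.

    Rough sketch of BitNet-b1.58-style "absmean" quantization (my reading
    of the papers, not Microsoft's actual code): scale by the mean absolute
    value, then round and clip to the three allowed levels.
    """
    scale = np.mean(np.abs(w)) + eps
    w_q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_q, scale

w = np.random.randn(4, 4).astype(np.float32)
w_q, scale = ternary_quantize(w)
print(w_q)          # entries are only -1, 0, or +1
print(np.log2(3))   # ~1.58 bits of information per weight
```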
 
If it works, who cares? But I say that more about the precision than the parameter count. Optimizing parameter counts downward has been an important development, but at some point you have to increase them again, and that mostly means more memory, which desktop CPUs/APUs should be able to accommodate easily.

We can make LLMs run on 8 GB phones, but it would be better to have 24-32 GB of memory. I believe we can reach a point where budget smartphones have 32-64 GB of RAM, if Samsung/Micron/SK Hynix successfully develop 3D DRAM (analogous to 3D NAND) in the early-to-mid 2030s and that drives cost-per-bit down by an order of magnitude.
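To put rough numbers on that, here's a back-of-the-envelope footprint for the weights of a 2B-parameter model at different precisions (my own figures; this ignores activations, KV cache, and runtime overhead):

```python
# Back-of-the-envelope weight-memory footprint for a 2B-parameter model
# at various precisions (weights only; ignores activations, KV cache, runtime).
params = 2e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("ternary (1.58-bit)", 1.58)]:
    gib = params * bits / 8 / 2**30
    print(f"{name:>20}: {gib:.2f} GiB")
# FP16 ~3.7 GiB, INT8 ~1.9 GiB, INT4 ~0.9 GiB, 1.58-bit ~0.37 GiB --
# which is why even an 8 GB phone can hold the weights of a model this size.
```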

As far as improving efficiency at higher precision goes, I think AMD's adoption of "Block FP16" in the XDNA2 NPU could qualify. They promoted it as having the accuracy of 16-bit with the speed of INT8. There are only so many mathematical tricks you can pull to double performance, though, and maybe we're already stuck with a choice of BFP16, INT8, INT4, FP4, INT2, INT1.58, etc.
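For anyone unfamiliar with the idea, block floating point basically shares one exponent across a block of values while each value keeps a small integer mantissa. Here's a toy illustration of the general concept; it is my own sketch, not AMD's actual XDNA2 format:

```python
import numpy as np

def block_fp_quantize(x: np.ndarray, block_size: int = 8, mantissa_bits: int = 8):
    """Toy block floating point: each block of values shares one exponent,
    and individual values keep only a small signed integer mantissa.
    Illustrates the general BFP idea, not AMD's actual Block FP16 format.
    """
    x = x.reshape(-1, block_size)
    # One exponent per block, chosen from the largest magnitude in the block.
    max_abs = np.max(np.abs(x), axis=1, keepdims=True) + 1e-30
    exponents = np.ceil(np.log2(max_abs))
    scale = 2.0 ** exponents
    # Integer mantissas, computed against the shared per-block scale.
    qmax = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(x / scale * qmax), -qmax, qmax).astype(np.int16)
    return mantissas, exponents

def block_fp_dequantize(mantissas, exponents, mantissa_bits: int = 8):
    qmax = 2 ** (mantissa_bits - 1) - 1
    return mantissas / qmax * (2.0 ** exponents)

x = np.random.randn(32).astype(np.float32)
m, e = block_fp_quantize(x)
x_hat = block_fp_dequantize(m, e).reshape(-1)
print(np.max(np.abs(x - x_hat)))   # small reconstruction error
```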
 
I don't think GPUs or even NPUs will scale well enough to get beyond small incremental improvements, and as long as that's the paradigm we'll keep working at the bottom: lower precision, fewer parameters, better distillation techniques.

I'm thinking something entirely new will be needed to make something significantly more useful than what we have today. I have no idea what that will look like. A new architecture? New math? Cheap quantum endpoints? Who knows.
 
We're on an "S-curve", desperately searching for the next S-curve. This research could be applicable if low-precision math can be used for, e.g., a spiking neuron model. On the hardware side, maybe we'll see 3D neuromorphic chips developed to accelerate it. I'm predicting that an ever-closer "brain imitation" will be the next big thing, and that planar chips are insufficient for it. Current NPUs could become a thing of the past, or be repurposed for non-AI matrix operations.
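To make that concrete, here's a toy leaky integrate-and-fire layer driven by binary spikes and ternary weights, so every synaptic update is just an integer add or subtract. Purely my own sketch; the constants and names are made up:

```python
import numpy as np

def lif_step(v, spikes_in, w_ternary, leak=0.9, threshold=1.0):
    """One step of a toy leaky integrate-and-fire neuron layer.

    Inputs are binary spikes and weights are ternary {-1, 0, +1}, so the
    synaptic update is pure integer add/subtract -- the kind of math a
    1.58-bit model (or a neuromorphic chip) could do without multipliers.
    Entirely my own sketch; constants and names are made up.
    """
    current = w_ternary @ spikes_in          # integer accumulate only
    v = leak * v + current                   # leaky membrane potential
    spikes_out = (v >= threshold).astype(np.int8)
    v = np.where(spikes_out == 1, 0.0, v)    # reset neurons that fired
    return v, spikes_out

rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=(16, 64)).astype(np.int8)   # ternary weights
v = np.zeros(16)
for _ in range(10):
    spikes_in = (rng.random(64) < 0.1).astype(np.int8)   # sparse input spikes
    v, spikes_out = lif_step(v, spikes_in, w)
print(spikes_out)
```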

I don't think we've seen much adoption of "in-memory computing" yet.
 