News: OpenAI intros two lightweight open-weight language models that can run on consumer GPUs, with the smaller one optimized to run on devices with just 16 GB of memory.

Both models are Transformers built on a mixture-of-experts (MoE) design, an approach popularized by models such as DeepSeek-R1. Despite the consumer-GPU focus, both gpt-oss-120b and gpt-oss-20b support context lengths of up to 131,072 tokens, among the longest available for local inference. gpt-oss-120b activates 5.1 billion parameters per token, while gpt-oss-20b activates 3.6 billion. Both models use alternating dense and locally banded sparse attention patterns, along with grouped multi-query attention with a group size of 8.
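To make the active-parameter figures concrete, here is a minimal sketch of top-k mixture-of-experts routing in PyTorch. The layer sizes, expert count, and top-k value are illustrative assumptions, not gpt-oss's actual configuration; the point is simply that the router picks a few experts per token, so only a fraction of the model's total parameters runs on each forward pass.

```python
# Minimal top-k MoE routing sketch (toy sizes, NOT gpt-oss's real config).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an independent feed-forward block; only `top_k`
        # of them run for any given token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # choose top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out


layer = MoELayer(d_model=64, d_ff=256, n_experts=8, top_k=2)  # illustrative sizes
tokens = torch.randn(4, 64)
print(layer(tokens).shape)  # torch.Size([4, 64]); only 2 of 8 experts ran per token
```

The attention details mentioned above (alternating dense and locally banded sparse layers, grouped multi-query attention with a group size of 8) are not shown here; this sketch only illustrates the routing step that keeps the per-token active parameter count well below the total parameter count.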