OpenAI · MoE · Apache 2.0

gpt-oss 20B (MoE)

gpt-oss 20B (MoE) is a mixture-of-experts (MoE) transformer language model from the OpenAI family, containing 21B parameters across 24 layers. All 21B parameters are loaded into VRAM, with 3.6B active per token. It supports up to 131K tokens of context.

Parameters: 21.0B
Active: 3.6B
Max Context: 128K
Architecture: MoE
Released: August 2025
Modality: Text

About gpt-oss 20B (MoE)

gpt-oss 20B (MoE) is a mixture-of-experts (MoE) transformer language model from the OpenAI family, containing 21B total parameters across 24 layers, of which 3.6B are active per token; the full 21B must still be loaded into VRAM. The MoE block uses 32 experts with top-4 routing. The model supports up to 131K tokens of context, with a hidden dimension of 2,880 and 8 KV heads for efficient grouped-query attention (GQA). It is released under the Apache 2.0 license, fits in 16 GB of VRAM at MXFP4 quantization, and offers strong local reasoning performance.
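
To make the expert routing concrete, here is a minimal PyTorch sketch of top-4 routing over 32 experts: each token scores all 32 experts and only the 4 best are executed, which is why only 3.6B of the 21B parameters are active per token. The router projection, the softmax over the selected experts, and the tensor names are illustrative assumptions, not the model's exact implementation.

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS, TOP_K, D_MODEL = 32, 4, 2880  # sizes taken from the spec table below

def route_tokens(hidden, router_weight):
    """Pick 4 of 32 experts per token and return their mixing weights.

    hidden:        [num_tokens, d_model] activations entering the MoE block
    router_weight: [num_experts, d_model] learned router projection (illustrative)
    """
    logits = hidden @ router_weight.T                       # [num_tokens, num_experts]
    top_vals, top_idx = torch.topk(logits, TOP_K, dim=-1)   # best 4 experts per token
    gates = F.softmax(top_vals, dim=-1)                     # mix only the chosen experts
    return top_idx, gates

# Toy usage: 5 tokens routed through a random router
hidden = torch.randn(5, D_MODEL)
router_weight = torch.randn(NUM_EXPERTS, D_MODEL)
expert_ids, gates = route_tokens(hidden, router_weight)
print(expert_ids.shape, gates.shape)  # torch.Size([5, 4]) torch.Size([5, 4])
```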

Reasoning

Technical Specifications

Total Parameters: 21.0B
Active Parameters: 3.6B per token
Architecture: Mixture of Experts
Total Experts: 32 (top-4 routing)
Attention Type: Grouped-Query Attention (GQA)
Hidden Dimension: d = 2,880
Transformer Layers: 24
Attention Heads: 64
KV Heads: n_kv = 8
Head Dimension: d_head = 64
Activation Function: SwiGLU
Normalization: RMSNorm
Position Embedding: RoPE
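
The attention figures above describe grouped-query attention: 64 query heads share 8 KV heads, so each KV head serves a group of 8 query heads and the KV cache is 8x smaller than full multi-head attention would need. The PyTorch snippet below is a shape-level sketch of that grouping using the table's dimensions; the random tensors and the repeat-based KV expansion are illustrative, not the model's actual code.

```python
import torch
import torch.nn.functional as F

# Shapes from the spec table: 64 query heads, 8 KV heads, head_dim 64
n_q_heads, n_kv_heads, d_head = 64, 8, 64
group = n_q_heads // n_kv_heads           # 8 query heads share each KV head
batch, seq = 1, 16                        # illustrative batch and sequence length

q = torch.randn(batch, n_q_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)   # only 8 KV heads are cached
v = torch.randn(batch, n_kv_heads, seq, d_head)

# Expand K/V so every group of 8 query heads reads the same KV head
k_expanded = k.repeat_interleave(group, dim=1)    # [1, 64, 16, 64]
v_expanded = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k_expanded, v_expanded, is_causal=True)
print(out.shape)  # torch.Size([1, 64, 16, 64])
```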

System Requirements

Estimated VRAM (in GB, assuming 10% overhead) for different quantization methods and context sizes.

Quantization (bytes/weight, quality)     1K ctx                        128K ctx
Q4_K_M (0.50 B/W, ~97% of FP16)          10.90 GB (consumer GPU)       16.85 GB (consumer GPU)
Q8_0 (1.00 B/W, ~100% of FP16)           21.76 GB (consumer GPU)       27.71 GB (datacenter GPU)
F16 (2.00 B/W, reference)                43.47 GB (datacenter GPU)     49.42 GB (datacenter GPU)

Consumer GPU: fits a 24 GB consumer card. Datacenter GPU: fits an 80 GB datacenter card. Larger footprints require a cluster / multi-GPU setup.
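
The figures above come from the site's VRAM Calculator. As a rough approximation, the same kind of estimate (quantized weights plus an FP16 GQA KV cache, with the 10% overhead applied) can be sketched in Python as below; the exact constants, units, and rounding used by the calculator are assumptions here, so expect small differences from the table.

```python
# Back-of-the-envelope VRAM estimate for gpt-oss 20B (MoE).
# Constants and rounding are assumptions and will not exactly match the table.
N_PARAMS   = 21.0e9  # total parameters; every expert stays resident in VRAM
N_LAYERS   = 24
N_KV_HEADS = 8       # GQA: only 8 KV heads are cached, not 64
HEAD_DIM   = 64
KV_BYTES   = 2       # FP16 KV cache
OVERHEAD   = 1.10    # the table's 10% overhead, applied to the weights here

def estimate_vram(bytes_per_weight: float, ctx_tokens: int) -> float:
    """Return an approximate footprint in (binary) gigabytes."""
    weights = N_PARAMS * bytes_per_weight * OVERHEAD
    # K and V per token: 2 * layers * kv_heads * head_dim * bytes_per_element
    kv_cache = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES * ctx_tokens
    return (weights + kv_cache) / 2**30

for name, bpw in [("Q4_K_M", 0.50), ("Q8_0", 1.00), ("F16", 2.00)]:
    print(f"{name}: {estimate_vram(bpw, 1024):.2f} GB @ 1K ctx, "
          f"{estimate_vram(bpw, 131072):.2f} GB @ 128K ctx")
```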

Find the right GPU for gpt-oss 20B (MoE)

Use the interactive VRAM Calculator to see exactly how much memory you need at any quantization level, context length, and overhead setting.