GLM-5.1 (MoE)
| Parameters | Active | Max Context | Architecture | Released | Modality |
|---|---|---|---|---|---|
| 754.0B | 32.0B | 128K | MoE | Apr 7, 2026 | Text |
About GLM-5.1 (MoE)
GLM-5.1 is Z.ai's flagship model for long-horizon agentic coding tasks. It is built on a novel GlmMoeDSA architecture with 754B total parameters (256 routed experts plus 1 shared expert, with 8+1 active per token) across 78 layers, combining Gated DeltaNet linear attention with standard attention and sparse MoE feed-forward networks, which enables efficient inference while delivering top-tier intelligence. The model achieves state-of-the-art results: 58.4% on SWE-Bench Pro, 63.5% on Terminal-Bench 2.0, 95.3% on AIME 2026, and 86.2% on GPQA-Diamond. It is designed for sustained autonomous execution of up to 8 hours, breaking complex engineering tasks into iterative experiment-analyze-optimize loops. It supports a 200K context window and 128K max output tokens, and is available via API as glm-5.1 on Z.ai and BigModel.cn. Released April 7, 2026 under the MIT license.
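To make the "8+1 active per token" figure concrete, here is a minimal sketch of a sparse MoE feed-forward block with top-8 routing over 256 routed experts plus one always-on shared expert. The dimensions, activation function, and gating normalization are toy assumptions, not GLM-5.1's actual values, and the real GlmMoeDSA block also interleaves Gated DeltaNet linear attention with standard attention, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Toy sparse MoE feed-forward block: top-8 of 256 routed experts
    plus one always-on shared expert (the 8+1 pattern described above)."""

    def __init__(self, d_model: int = 64, d_ff: int = 128,
                 n_routed: int = 256, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_routed, bias=False)
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared = make_expert()  # the shared expert runs on every token

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate_weights, expert_ids = self.router(x).topk(self.top_k, dim=-1)
        gate_weights = F.softmax(gate_weights, dim=-1)  # renormalize over the top-8
        out = self.shared(x)                            # shared expert: always active
        # Naive per-token loop for clarity; production kernels batch by expert.
        for t in range(x.size(0)):
            for w, e in zip(gate_weights[t].tolist(), expert_ids[t].tolist()):
                out[t] = out[t] + w * self.routed[e](x[t])
        return out

moe = SparseMoEFFN()
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64]); 9 of 257 experts ran per token
```

Since the page lists API availability as glm-5.1 on Z.ai and BigModel.cn, a minimal call might look like the sketch below. Only the model id comes from this page; the OpenAI-compatible client usage and base URL follow the convention Z.ai has used for earlier GLM models and should be verified against current documentation.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZAI_API_KEY",               # placeholder
    base_url="https://api.z.ai/api/paas/v4",  # assumption: verify in Z.ai docs
)

resp = client.chat.completions.create(
    model="glm-5.1",  # model id as listed on this page
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
)
print(resp.choices[0].message.content)
```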
Technical Specifications
System Requirements
Estimated VRAM at 10% overhead for different quantization methods and context sizes.
| Quantization | Bytes/Weight | Quality | VRAM @ 1K ctx | VRAM @ 128K ctx | Deployment |
|---|---|---|---|---|---|
| Q4_K_M | 0.50 | ~97% of FP16 | 390.0 GB | 429.7 GB | Cluster / Multi-GPU |
| Q8_0 | 1.00 | ~100% of FP16 | 779.8 GB | 819.5 GB | Cluster / Multi-GPU |
| F16 | 2.00 | Reference | 1559.2 GB | 1598.9 GB | Cluster / Multi-GPU |
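The page does not spell out the exact formula behind these figures, but a common back-of-the-envelope estimate is weight bytes plus KV cache, scaled by the stated 10% overhead. The sketch below uses that assumption; the per-token KV-cache size is a hypothetical constant chosen to match the table's roughly 39.7 GB growth between 1K and 128K context, so the outputs are ballpark figures rather than an exact reproduction of the table.

```python
# Rough VRAM estimate (GB). The formula and KV-cache size are assumptions;
# this sketch approximates but does not exactly reproduce the table above.
TOTAL_PARAMS = 754e9        # total parameters, from the spec card
OVERHEAD = 1.10             # the stated 10% overhead
KV_BYTES_PER_TOKEN = 305e3  # hypothetical; implied by the ~39.7 GB growth
                            # from 1K to 128K context in the table above

def estimated_vram_gb(bytes_per_weight: float, ctx_tokens: int) -> float:
    weights_bytes = TOTAL_PARAMS * bytes_per_weight
    kv_cache_bytes = KV_BYTES_PER_TOKEN * ctx_tokens
    return (weights_bytes + kv_cache_bytes) * OVERHEAD / 1e9

for name, bpw in [("Q4_K_M", 0.50), ("Q8_0", 1.00), ("F16", 2.00)]:
    print(f"{name}: {estimated_vram_gb(bpw, 1024):7.1f} GB @ 1K ctx | "
          f"{estimated_vram_gb(bpw, 131072):7.1f} GB @ 128K ctx")
```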
Find the right GPU for GLM-5.1 (MoE)
Use the interactive VRAM Calculator to see exactly how much memory you need at any quantization level, context length, and overhead setting.