Local LLM Models Directory
Complete reference of 94 open-source local LLMs across 21 model families. Compare parameters, architecture, layers, context length, and Q4_K_M VRAM requirements.
| Model | Family | Params | Arch | Layers | Max Context | Q4_K_M VRAM | Notes |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 0.5B | Qwen | 490M | Dense | 24 | 32K | 228 MB | |
| Qwen 3 0.6B | Qwen | 600M | Dense | 28 | 32K | 279 MB | Apache 2.0. Thinking mode toggle. Tied embeddings. |
| Qwen 3.5 0.8B | Qwen | 800M | Dense | 6 | 256K | 373 MB | Apache 2.0. Hybrid DeltaNet + attention (KV cache on 25% of layers). 262K→1M ctx. |
| Gemma 3 1B | Gemma | 1.0B | Dense | 26 | 32K | 466 MB | |
| LFM2 1.2B | Liquid | 1.2B | Dense | 24 | 32K | 559 MB | LFM Open License (Apache 2.0 based). On-device hybrid model. Fast CPU/mobile inference. |
| Llama 3.2 1B | Llama | 1.2B | Dense | 16 | 128K | 577 MB | |
| Qwen 2.5 1.5B | Qwen | 1.5B | Dense | 28 | 32K | 717 MB | |
| DeepSeek R1 Distill Qwen 1.5B | DeepSeek | 1.5B | Dense | 28 | 32K | 717 MB | Reasoning distilled into the Qwen 2.5 1.5B base. |
| Qwen 3 1.7B | Qwen | 1.7B | Dense | 28 | 32K | 801 MB | Apache 2.0. Thinking mode toggle. |
| Qwen 3.5 2B | Qwen | 2.0B | Dense | 6 | 256K | 931 MB | Apache 2.0. Hybrid DeltaNet + attention (KV cache on 25% of layers). 262K→1M ctx. |
| Granite 3.1 2B | IBM Granite | 2.0B | Dense | 32 | 128K | 931 MB | Apache 2.0. Enterprise RAG, code, safety. |
| Ministral 3 3B | Mistral | 3.0B | Dense | 26 | 256K | 1.4 GB | Apache 2.0. Cascade-distilled from Mistral Small 3.1. |
| SmolLM3 3B | SmolLM | 3.0B | Dense | 32 | 8K | 1.4 GB | Apache 2.0. Tiny model for CPU/browser/phone. Educational use. |
| Qwen 2.5 3B | Qwen | 3.1B | Dense | 36 | 32K | 1.4 GB | |
| Llama 3.2 3B | Llama | 3.2B | Dense | 28 | 128K | 1.5 GB | |
| Phi-4-mini 3.8B | Phi | 3.8B | Dense | 32 | 16K | 1.8 GB | |
| Qwen 3.5 4B | Qwen | 4.0B | Dense | 8 | 256K | 1.9 GB | Apache 2.0. Hybrid DeltaNet + attention (KV cache on 25% of layers). 262K→1M ctx. |
| Gemma 3 4B | Gemma | 4.0B | Dense | 34 | 32K | 1.9 GB | |
| Nemotron 3 Nano 4B | Nvidia | 4.0B | Dense | 32 | 256K | 1.9 GB | Nemotron Open Model License. Hybrid Mamba2-Transformer. Laptop/workstation friendly. |
| Qwen 3 4B | Qwen | 4.0B | Dense | 36 | 32K | 1.9 GB | Apache 2.0. Thinking mode toggle. Great small local model. |
| Gemma 4 E2B | Gemma | 5.1B | Dense | 35 | 128K | 2.4 GB | Effective 2.3B active via PLE. Hybrid local+global attention. Audio + image. 128K ctx. |
| Phi-4-multimodal 5.6B | Phi | 5.6B | Dense | 36 | 16K | 2.6 GB | MIT license. Image + audio + text. Good compact multimodal local model. |
| OLMo 3 7B | AI2 OLMo | 7.0B | Dense | 32 | 32K | 3.3 GB | Fully open data/code/weights. Transparent research model. |
| StarCoder2 7B | BigCode | 7.0B | Dense | 32 | 16K | 3.3 GB | BigCode OpenRAIL license. Code completion/instruct. Mature local code base. |
| Mistral 7B v0.3 | Mistral | 7.3B | Dense | 32 | 32K | 3.4 GB | |
| Qwen 2.5 7B | Qwen | 7.6B | Dense | 28 | 128K | 3.5 GB | |
| Qwen 2.5 Coder 7B | Qwen | 7.6B | Dense | 28 | 32K | 3.5 GB | Apache 2.0. Mature GGUF/MLX support. Excellent laptop coding. |
| DeepSeek R1 Distill Qwen 7B | DeepSeek | 7.6B | Dense | 28 | 32K | 3.5 GB | Reasoning distilled into the Qwen 2.5 7B base. Great local reasoning. |
| Ministral 3 8B | Mistral | 8.0B | Dense | 34 | 256K | 3.7 GB | Apache 2.0. Cascade-distilled from Mistral Small 3.1. |
| Gemma 4 E4B | Gemma | 8.0B | Dense | 42 | 128K | 3.7 GB | Effective 4.5B active via PLE. Hybrid local+global attention. Audio + image. 128K ctx. |
| Llama-Nemotron 8B | Nvidia | 8.0B | Dense | 32 | 128K | 3.7 GB | Nvidia fine-tune of Llama 3.1 8B for reasoning. |
| Granite 3.1 8B | IBM Granite | 8.0B | Dense | 40 | 128K | 3.7 GB | Apache 2.0. Enterprise chat, code, safety. Mature local deployments. |
| MiniCPM 4 8B | MiniCPM | 8.0B | Dense | 32 | 32K | 3.7 GB | Open weights. On-device agents + MCP tool use. Good for local tools/agents. |
| Llama 3.1 8B | Llama | 8.0B | Dense | 32 | 128K | 3.7 GB | |
| Qwen 3 8B | Qwen | 8.2B | Dense | 36 | 128K | 3.8 GB | Apache 2.0. Hybrid reasoning. Strong all-round local model. |
| Qwen 3.5 9B | Qwen | 9.0B | Dense | 8 | 256K | 4.2 GB | Apache 2.0. Hybrid DeltaNet + attention (KV cache on 25% of layers). 13x smaller than gpt-oss-120b. |
| Nemotron-Nano 9B v2 | Nvidia | 9.0B | Dense | 40 | 128K | 4.2 GB | NVIDIA Open Model License. Unified reasoning/non-reasoning. 128K ctx. |
| Yi-Coder 9B | Yi | 9.0B | Dense | 40 | 32K | 4.2 GB | Apache 2.0. Chinese/English coding. Mature GGUF support. |
| Mistral NeMo 12B | Mistral | 12.0B | Dense | 40 | 128K | 5.6 GB | Apache 2.0. Quantization-aware. NVIDIA collaboration. 128K context. |
| Gemma 3 12B | Gemma | 12.0B | Dense | 48 | 32K | 5.6 GB | |
| Ministral 3 14B | Mistral | 14.0B | Dense | 40 | 256K | 6.5 GB | Apache 2.0. Includes a vision encoder. Strong laptop coding option. |
| DeepCoder 14B | Coding | 14.0B | Dense | 40 | 32K | 6.5 GB | RL-derived code reasoning. Good local coding-reasoner class. |
| Qwen 2.5 14B | Qwen | 14.7B | Dense | 48 | 128K | 6.8 GB | |
| Qwen 2.5 Coder 14B | Qwen | 14.7B | Dense | 48 | 32K | 6.8 GB | Apache 2.0. Strong local coding with mature runtime support. |
| DeepSeek R1 Distill Qwen 14B | DeepSeek | 14.7B | Dense | 48 | 32K | 6.8 GB | Reasoning distilled into the Qwen 2.5 14B base. |
| Phi-4 14B | Phi | 14.7B | Dense | 40 | 16K | 6.8 GB | MIT license. Math/reasoning specialist. High quality for its size. |
| Qwen 3 14B | Qwen | 14.8B | Dense | 40 | 128K | 6.9 GB | Apache 2.0. Dense 14B. Excellent workstation model. |
| StarCoder2 15B | BigCode | 15.0B | Dense | 40 | 16K | 7.0 GB | BigCode OpenRAIL license. Strong code completion with responsible-use clauses. |
| Granite 3.1 20B | IBM Granite | 20.0B | Dense | 52 | 128K | 9.3 GB | Apache 2.0. Strong enterprise local option. |
| gpt-oss 20B (MoE) | OpenAI | 21.0B (3.6B active) | MoE | 24 | 128K | 9.8 GB | Apache 2.0. 32 experts, top-4 routing. Fits in 16 GB at MXFP4. Strong local reasoning. |
| Mistral Small 3.1 24B | Mistral | 24.0B | Dense | 56 | 128K | 11.2 GB | Apache 2.0. Runs on an RTX 4090 or a 32 GB Mac. Vision + function calling. |
| Magistral Small 24B | Mistral | 24.0B | Dense | 56 | 128K | 11.2 GB | Apache 2.0. Reasoning-focused dense model. Good workstation option. |
| Gemma 4 26B-A4B (MoE) | Gemma | 25.2B (3.8B active) | MoE | 30 | 256K | 11.7 GB | 128 experts, 8 active + 1 shared. 1K sliding window. 256K ctx. |
| Qwen 3.5 27B | Qwen | 27.0B | Dense | 16 | 256K | 12.6 GB | Apache 2.0. Dense 27B. Hybrid DeltaNet + attention (KV cache on 25% of layers). 262K ctx. |
| Qwen 3.6 27B | Qwen | 27.0B | Dense | 16 | 256K | 12.6 GB | Apache 2.0. Dense 27B coding specialist. Hybrid DeltaNet + attention (KV cache on 25% of layers). |
| Gemma 3 27B | Gemma | 27.0B | Dense | 64 | 128K | 12.6 GB | |
| Qwen 3 30B-A3B (MoE) | Qwen | 30.0B (3.0B active) | MoE | 48 | 128K | 14.0 GB | Apache 2.0. Efficient local MoE. 3B active per token. |
| Nemotron 3 Nano 30B-A3B (MoE) | Nvidia | 30.0B (3.0B active) | MoE | 40 | 256K | 14.0 GB | Nemotron Open Model License. Up to 1M context. Efficient local reasoning/agents. |
| Gemma 4 31B | Gemma | 30.7B | Dense | 60 | 256K | 14.3 GB | Dense 31B. Hybrid local+global attention. Dual RoPE. TurboQuant 3-bit KV. 256K ctx. #3 open model on Arena. |
| OLMo 3 32B | AI2 OLMo | 32.0B | Dense | 64 | 32K | 14.9 GB | Fully open research model. Instruction/thinking variants. |
| Qwen 2.5 32B | Qwen | 32.5B | Dense | 64 | 128K | 15.1 GB | |
| Qwen 2.5 Coder 32B | Qwen | 32.5B | Dense | 64 | 32K | 15.1 GB | Apache 2.0. Top local coding model with mature support. |
| DeepSeek R1 Distill Qwen 32B | DeepSeek | 32.5B | Dense | 64 | 32K | 15.1 GB | Reasoning distilled into the Qwen 2.5 32B base. Top local reasoning. |
| Qwen 3 32B | Qwen | 32.8B | Dense | 64 | 128K | 15.3 GB | Apache 2.0. Dense 32B. Top-tier workstation coding/general model. |
| Qwen 3.5 35B-A3B (MoE) | Qwen | 35.0B (3.0B active) | MoE | 10 | 256K | 16.3 GB | Apache 2.0. 256 experts, 8 + 1 active. DeltaNet+MoE hybrid. 3.5 tok/s on an RTX 4090. |
| Qwen 3.6 35B-A3B (MoE) | Qwen | 35.0B (3.0B active) | MoE | 10 | 256K | 16.3 GB | Apache 2.0. 256 experts, 8 + 1 active. DeltaNet+GA hybrid. 262K ctx, extendable to ~1M with YaRN. SWE-bench 73.4. |
| Command R 35B | Cohere | 35.0B | Dense | 40 | 128K | 16.3 GB | CC-BY-NC. RAG, multilingual, and tool-use specialist. 128K context. |
| Mixtral 8x7B (MoE) | Mistral | 46.7B (12.9B active) | MoE | 32 | 32K | 21.7 GB | 8 experts, 2 active. All 46.7B params loaded. |
| Llama 3.1 70B | Llama | 70.6B | Dense | 80 | 128K | 32.9 GB | |
| Llama 3.3 70B | Llama | 70.6B | Dense | 80 | 128K | 32.9 GB | |
| DeepSeek R1 Distill Llama 70B | DeepSeek | 70.6B | Dense | 80 | 32K | 32.9 GB | Reasoning distilled into the Llama 3.3 70B base. Workstation class. |
| Qwen 2.5 72B | Qwen | 72.7B | Dense | 80 | 128K | 33.9 GB | |
| Command R+ 104B | Cohere | 104.0B | Dense | 64 | 128K | 48.4 GB | |
| Llama 4 Scout (MoE) | Llama | 109.0B (17.0B active) | MoE | 48 | 256K | 50.8 GB | 16 experts, 2 active. All 109B params loaded into VRAM. |
| gpt-oss 120B (MoE) | OpenAI | 117.0B (5.1B active) | MoE | 36 | 128K | 54.5 GB | Apache 2.0. 128 experts, top-4 routing. Runs on a single 80 GB GPU. 128K YaRN context. |
| Qwen 3.5 122B-A10B (MoE) | Qwen | 122.0B (10.0B active) | MoE | 12 | 256K | 56.8 GB | Apache 2.0. 256 experts. DeltaNet+MoE hybrid. Server/high-end workstation. |
| Devstral 2 123B | Mistral | 123.0B | Dense | 96 | 256K | 57.3 GB | Modified MIT. Agentic coding dense model. 256K context. Server class. |
| DBRX 132B (MoE) | Databricks | 132.0B (36.0B active) | MoE | 40 | 32K | 61.5 GB | Databricks Open Model License. Older but important open MoE. |
| Mixtral 8x22B (MoE) | Mistral | 141.0B (39.0B active) | MoE | 56 | 64K | 65.7 GB | 8 experts, 2 active. All 141B params loaded. |
| Qwen 3 235B-A22B (MoE) | Qwen | 235.0B (22.0B active) | MoE | 96 | 128K | 109.4 GB | Apache 2.0. MoE flagship. Server class. |
| DeepSeek V4-Flash (MoE) | DeepSeek | 284.0B (13.0B active) | MoE | 48 | 1.0M | 132.2 GB | April 2026. 1M context. Economical V4 variant. High-memory server class. |
| GLM-4.5 (MoE) | GLM | 355.0B (32.0B active) | MoE | 64 | 200K | 165.3 GB | MIT license. 200K context. Server class. |
| Qwen 3.5 397B-A17B (MoE) | Qwen | 397.0B (17.0B active) | MoE | 15 | 256K | 184.9 GB | Apache 2.0. MoE flagship: 512 experts. DeltaNet+MoE hybrid. Server class. |
| Llama 4 Maverick (MoE) | Llama | 400.0B (17.0B active) | MoE | 48 | 256K | 186.3 GB | 128 experts, 2 active (1 routed + 1 shared). All ~400B params loaded. |
| Llama 3.1 405B | Llama | 405.0B | Dense | 126 | 128K | 188.6 GB | Server/cluster class. Full precision is impractical on consumer hardware. |
| Qwen 3 Coder 480B-A35B (MoE) | Qwen | 480.0B (35.0B active) | MoE | 96 | 256K | 223.5 GB | Apache 2.0. Agentic coding MoE. Up to 1M extrapolated ctx. Server class. |
| Snowflake Arctic (MoE) | Snowflake | 480.0B (17.0B active) | MoE | 64 | 32K | 223.5 GB | Apache 2.0. Enterprise SQL/coding MoE. Server class. |
| DeepSeek R1 (MoE) | DeepSeek | 671.0B (37.0B active) | MoE | 61 | 64K | 312.5 GB | 256 experts, 8 active. MLA compresses the KV cache by ~95%. All 671B loaded. |
| DeepSeek V3 (MoE) | DeepSeek | 671.0B (37.0B active) | MoE | 61 | 64K | 312.5 GB | Same architecture as R1. Non-reasoning variant. |
| Mistral Large 3 (MoE) | Mistral | 675.0B (41.0B active) | MoE | 88 | 256K | 314.3 GB | Apache 2.0. 128 experts, top-4 routing. Server class. |
| DeepSeek V3 0324 (MoE) | DeepSeek | 685.0B (37.0B active) | MoE | 61 | 64K | 319.0 GB | March 2025 update. 685B total params. MLA-compressed KV cache. |
| GLM-5.1 (MoE) | GLM | 754.0B (32.0B active) | MoE | 80 | 128K | 351.1 GB | MIT license. DSA attention. FP8 repo ~1.5 TB. Agentic engineering. Server class. |
| Kimi K2.6 (MoE) | Kimi | 1.0T (32.0B active) | MoE | 61 | 256K | 465.7 GB | Modified MIT. 384 experts, 8 + 1 active. MLA for KV compression. Multimodal (MoonViT 400M). Server class. |
| DeepSeek V4-Pro (MoE) | DeepSeek | 1.6T (49.0B active) | MoE | 80 | 1.0M | 745.1 GB | April 2026 preview. 1M context. DSA + token compression. Cluster class. |
About This Data
Q4_K_M VRAM
Estimated GPU memory for the model weights alone at Q4_K_M quantization, approximated here as 0.5 bytes per parameter (real Q4_K_M files average slightly more, around 4.8 bits per weight). Actual usage is higher once the KV cache and runtime overhead are added. For precise figures that account for context length and overhead, use the VRAM Calculator.
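A minimal sketch of how the VRAM column can be reproduced, assuming the 0.5 bytes/parameter simplification stated above (the function name and example values are illustrative, not part of any tool):

```python
# Reproduce the table's Q4_K_M weight-memory estimate.
# Assumes 0.5 bytes/param as stated above; real Q4_K_M files run
# slightly larger, and the KV cache comes on top of this figure.

def q4_k_m_weight_vram_gib(params: float, bytes_per_param: float = 0.5) -> float:
    """GPU memory for the quantized weights alone, in GiB."""
    return params * bytes_per_param / 1024**3

for name, params in [("Qwen 2.5 7B", 7.6e9), ("Llama 3.1 70B", 70.6e9)]:
    print(f"{name}: ~{q4_k_m_weight_vram_gib(params):.1f} GB")
# Qwen 2.5 7B: ~3.5 GB; Llama 3.1 70B: ~32.9 GB -- matching the table rows.
```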
MoE Models
Mixture-of-Experts models list both total parameters (all experts must be loaded into VRAM) and active parameters (the per-token compute). An MoE model therefore needs as much VRAM as a dense model of the same total size, but generates tokens faster because only the active parameters participate in each forward pass.
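A sketch of that trade-off, under the same 0.5 bytes/parameter assumption plus the common rule of thumb of ~2 FLOPs per active parameter per token (`moe_profile` is a hypothetical helper; the figures come from the gpt-oss 20B row):

```python
# Weight VRAM scales with TOTAL parameters (every expert is resident);
# per-token compute scales with ACTIVE parameters (routed experts only).

def moe_profile(total_params: float, active_params: float,
                bytes_per_param: float = 0.5) -> tuple[float, float]:
    vram_gib = total_params * bytes_per_param / 1024**3  # all experts loaded
    gflops_per_token = 2 * active_params / 1e9           # ~2 FLOPs per active param
    return vram_gib, gflops_per_token

vram, gflops = moe_profile(total_params=21.0e9, active_params=3.6e9)  # gpt-oss 20B
print(f"weights ~{vram:.1f} GB, ~{gflops:.1f} GFLOPs/token")
# A dense 21B model would need the same ~9.8 GB of weights but ~42 GFLOPs
# per token instead of ~7 -- roughly 6x more compute per generated token.
```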
Architecture Types
Dense: All parameters active per token. Standard transformer architecture.
MoE: Expert sub-networks with sparse activation. Better quality-per-FLOP ratio.
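To make sparse activation concrete, here is a toy top-2 router over eight expert MLPs, echoing the Mixtral 8x7B shape in the table. All dimensions and weights are hypothetical (random); real routers add load-balancing losses and, in many models, a shared expert:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2

# Each expert is a tiny two-layer MLP; the router scores all experts.
experts = [(rng.standard_normal((d, 4 * d)) * 0.02,
            rng.standard_normal((4 * d, d)) * 0.02) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router                    # score every expert
    top = np.argsort(logits)[-top_k:]      # keep only the top-k
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the chosen experts
    out = np.zeros(d)
    for w, i in zip(weights, top):         # only k of n_experts actually run
        w_in, w_out = experts[i]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out

y = moe_layer(rng.standard_normal(d))
print(y.shape)  # (64,) -- 2 of 8 experts did the work for this token
```

All eight experts' weights must sit in memory so any of them can be routed to, which is why the table's MoE VRAM figures track total rather than active parameters.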