LLM VRAM Calculator
Estimate GPU memory requirements for running large language models locally. Pick a model, choose a quantization, and dial in your context length to see how much VRAM you need, plus which GPUs can handle it.
[Interactive calculator. Example configuration: an 8.0B-parameter model (32 layers, d = 4096, 8 KV heads, max 131,072 context) at 4-bit quantization, retaining roughly 97% of FP16 quality. Batch sizes above 1 multiply the KV cache for parallel sequences and require extra scratch memory for prompt processing; the default 10% overhead covers cuBLAS buffers and workspace, and can be raised for multi-GPU or unusual setups. At these settings, any GPU with ≥ 5 GB of VRAM qualifies, e.g. the GeForce RTX 4060, 4060 Ti, 5060, and 5070, or the Radeon RX 7600, 7800 XT, 9060 XT, and 9070.]
How Does Local LLM VRAM Calculation Work?
This calculator uses an empirical formula derived from real-world testing with the llama.cpp inference backend. VRAM usage consists of two distinct components: a fixed cost (model weights, CUDA overhead, scratchpad) and a variable cost that grows linearly with context length (the KV cache).
Learn more: What is a Large Language Model? (Wikipedia) · GPU Memory & LLM Inference Explained (BentoML)
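The two components simply add. As a minimal sketch in code (the function and parameter names below are illustrative, not the calculator's actual internals):

```python
def estimate_vram_gb(weights_gb: float, kv_gb_per_1k_tokens: float,
                     context_length: int,
                     overhead: float = 0.10, scratch: float = 0.01) -> float:
    """Fixed cost (weights plus overhead and scratch) + KV cache that scales with context."""
    fixed_gb = weights_gb * (1 + overhead + scratch)
    variable_gb = kv_gb_per_1k_tokens * (context_length / 1024)
    return fixed_gb + variable_gb

# A Llama-3.1-8B-like model at 4-bit (4.0 GB weights, ~0.125 GB KV per 1k tokens):
print(round(estimate_vram_gb(4.0, 0.125, 4096), 2))  # ~4.94 GB
```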
1. Fixed Cost (Constant)
- Model Weights: The quantized parameters stored in GPU memory.
- CUDA Overhead: ~10% of model weights for cuBLAS buffers, workspace, and driver allocations.
- Scratchpad Memory: ~1% of model weights for temporary tensors and activations during inference.
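A sketch of the fixed-cost side, assuming a flat bits-per-weight figure for the chosen quantization (real GGUF quants such as Q4_K_M average slightly more than 4 bits per weight):

```python
def fixed_cost_gb(params_billions: float, bits_per_weight: float,
                  overhead: float = 0.10, scratch: float = 0.01) -> float:
    """Quantized weights plus ~10% CUDA overhead and ~1% scratchpad."""
    weights_gb = params_billions * bits_per_weight / 8  # 8 bits per byte
    return weights_gb * (1 + overhead + scratch)

# An 8.0B model at a flat 4 bits: 4.0 GB of weights -> ~4.44 GB fixed cost
print(round(fixed_cost_gb(8.0, 4.0), 2))
```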
2. Variable Cost (Linear with Context)
- KV Cache: Grows linearly with every token in your prompt + generated output.
- Per-token cost: Depends on the model's layer count, KV head count, and head dimension — not on total parameters.
- Context Impact: Long conversations or documents can consume more VRAM than the model weights themselves.
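The per-token cost follows directly from those three architecture numbers; a small sketch:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """K and V tensors (factor of 2) across every layer, FP16 cache by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element

# Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128): 131,072 bytes per token,
# so a 4,096-token context costs exactly 0.5 GiB of KV cache.
print(kv_bytes_per_token(32, 8, 128) * 4096 / 2**30)
```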
Precise VRAM Formulas for Local LLMs
1. Standard Architecture (Llama, Mistral, Qwen 2.5, Gemma, Phi)
KV cache (bytes) = 2 × layers × kv_heads × head_dim × bytes_per_element × context_length

The factor of 2 covers the separate K and V tensors, and bytes_per_element is 2 for an FP16 cache. The kv_heads × head_dim term is equivalent to d/g, where g = attention_heads / kv_heads (the GQA factor). For Llama 3.1 8B: d/g = 4096/4 = 1024, and kv_heads × head_dim = 8 × 128 = 1024.
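That equivalence is easy to verify with the Llama 3.1 8B numbers:

```python
# Llama 3.1 8B: 32 attention heads grouped over 8 KV heads (GQA factor g = 4)
d, attention_heads, kv_heads, head_dim = 4096, 32, 8, 128
g = attention_heads // kv_heads
assert d // g == kv_heads * head_dim == 1024
```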
2. Decoupled Head Dimension (Qwen 3 32B)
Used when the K and V head dimension is decoupled from the Q (query) dimension, so head_dim must be read from the model config rather than derived as d / attention_heads:

KV cache (bytes) = 2 × layers × kv_heads × head_dim (from config) × bytes_per_element × context_length

Qwen 3 32B uses this architecture, so its KV cache differs from the standard d/g result; Qwen 3 8B and other models in the family follow the standard formula.
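In code, the only change is where head_dim comes from. The values below are illustrative, in the spirit of a decoupled model like Qwen 3 32B, rather than official config numbers:

```python
# Decoupled head dimension: derive nothing -- read head_dim from the config.
# (Illustrative values; check the actual model config for real numbers.)
d, attention_heads, head_dim_from_config = 5120, 64, 128
derived = d // attention_heads      # 80 -- wrong for decoupled architectures
# The KV formula should use head_dim_from_config (128), not the derived 80.
print(derived, head_dim_from_config)
```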
3. MoE Models (Mixtral, DeepSeek, Llama 4)
All expert parameters must be loaded into VRAM, not just the active subset. For example, Mixtral 8×7B has 46.7B total parameters (all 8 experts), even though only 12.9B are active per token. DeepSeek R1/V3 loads all 671B parameters across 256 experts (37B active). The calculator uses total parameters for the model weights calculation.
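A sketch of the MoE weight calculation, again assuming a flat 4 bits per weight:

```python
def moe_weights_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """MoE VRAM is driven by TOTAL parameters; active parameters affect speed, not memory."""
    return total_params_billions * bits_per_weight / 8

# Mixtral 8x7B: all 46.7B parameters must be resident, not the 12.9B active per token
print(round(moe_weights_gb(46.7, 4.0), 2))  # ~23.35 GB, in line with the table below
```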
Reference Examples (Verified)
| Model | Quant | Context | Weights | KV Cache | Total | Real-World |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 4,096 | 4.01 GB | 0.50 GB | 4.96 GB | ~4.8 GB |
| Qwen 2.5 7B | Q4_K_M | 4,096 | 3.80 GB | 0.22 GB | 4.42 GB | ~4.2 GB |
| Llama 3.3 70B | Q4_K_M | 4,096 | 35.3 GB | 1.25 GB | 40.6 GB | ~39 GB |
| Mixtral 8×7B | Q4_K_M | 4,096 | 23.4 GB | 0.50 GB | 26.5 GB | ~25 GB |
| Qwen 2.5 32B | Q4_K_M | 4,096 | 16.3 GB | 0.50 GB | 18.5 GB | ~18 GB |
| DeepSeek R1 | Q4_K_M | 4,096 | 335.5 GB | 0.48 GB | 370.6 GB | N/A* |
* DeepSeek R1 uses Multi-head Latent Attention (MLA), which compresses the KV cache significantly. The listed KV cache uses the standard formula; the actual MLA cache is much smaller (roughly 95% compression). No real-world measurement is available due to the extreme hardware requirements.
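As a sanity check, the first table row can be reproduced by inlining the formulas above (a sketch; the weights figure is taken straight from the table):

```python
weights_gb = 4.01                              # Llama 3.1 8B, Q4_K_M
kv_gb = 2 * 32 * 8 * 128 * 2 * 4096 / 2**30    # K+V x layers x kv_heads x head_dim x fp16 x ctx
total_gb = weights_gb * 1.11 + kv_gb           # +10% overhead, +1% scratchpad
print(round(kv_gb, 2), round(total_gb, 2))     # 0.5 and ~4.95, vs. the table's 0.50 / 4.96
```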
Frequently Asked Questions
What is the KV cache and why does it grow with context?
Why does quantization reduce VRAM for weights but not the KV cache?
Can I offload layers to system RAM (CPU) to use a larger model?
Why do MoE models like DeepSeek R1 require so much VRAM?
Does batch size affect VRAM requirements?
Related Guides
Turn these VRAM numbers into a buying decision. Our in-depth GPU guides walk you through the best hardware for every budget and model size.
Best GPU for Local LLMs
Complete GPU buying guide for running LLMs at home.
How Much VRAM for Local LLMs?
VRAM requirements for every popular model explained.
Is 16GB VRAM Enough for Local LLMs?
What you can run with 16GB of VRAM.
Is 24GB VRAM Enough for Local LLMs?
24GB GPU capabilities for LLM inference.
24GB vs 32GB GPU for Local LLMs
Is the extra 8GB worth the premium?
Best Budget GPU for Local LLMs
Affordable GPU options for LLM experimentation.
Best Used GPU for Local LLMs
Used GPU deals that make sense for LLMs.
Best NVIDIA GPU for Local LLMs
NVIDIA GPU rankings for LLM workloads.
Last updated: April 28, 2026