LLM VRAM Calculator
Estimate GPU memory requirements for running large language models locally. Pick a model, choose a quantization, and dial in your context length to see how much VRAM you need, plus which GPUs can handle it.
[Interactive calculator. Example configuration: an 8.0B-parameter model (32 layers, d = 4096, 8 KV heads, max 131,072 context) at 4-bit quantization, retaining roughly 97% of FP16 quality. Batch sizes above 1 multiply the KV cache for parallel sequences and require extra scratch memory for prompt processing; the default 10% overhead covers cuBLAS buffers and workspace, and can be raised for multi-GPU or unusual setups. At these settings, any GPU with ≥ 5 GB of VRAM qualifies, e.g. the GeForce RTX 4060, 4060 Ti, 5060, and 5070, or the Radeon RX 7600, 7800 XT, 9060 XT, and 9070.]
How Does Local LLM VRAM Calculation Work?
This calculator uses an empirical formula derived from real-world testing with the llama.cpp inference backend. VRAM usage consists of two distinct components: a fixed cost (model weights, CUDA overhead, scratchpad) and a variable cost that grows linearly with context length (the KV cache).
Learn more: What is a Large Language Model? (Wikipedia) · GPU Memory & LLM Inference Explained (BentoML)
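The two components simply add. As a minimal sketch in code (the function and parameter names below are illustrative, not the calculator's actual internals):

```python
def estimate_vram_gb(weights_gb: float, kv_gb_per_1k_tokens: float,
                     context_length: int,
                     overhead: float = 0.10, scratch: float = 0.01) -> float:
    """Fixed cost (weights plus overhead and scratch) + KV cache that scales with context."""
    fixed_gb = weights_gb * (1 + overhead + scratch)
    variable_gb = kv_gb_per_1k_tokens * (context_length / 1024)
    return fixed_gb + variable_gb

# A Llama-3.1-8B-like model at 4-bit (4.0 GB weights, ~0.125 GB KV per 1k tokens):
print(round(estimate_vram_gb(4.0, 0.125, 4096), 2))  # ~4.94 GB
```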
1. Fixed Cost (Constant)
- Model Weights: The quantized parameters stored in GPU memory.
- CUDA Overhead: ~10% of model weights for cuBLAS buffers, workspace, and driver allocations.
- Scratchpad Memory: ~1% of model weights for temporary tensors and activations during inference.
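A sketch of the fixed-cost side, assuming a flat bits-per-weight figure for the chosen quantization (real GGUF quants such as Q4_K_M average slightly more than 4 bits per weight):

```python
def fixed_cost_gb(params_billions: float, bits_per_weight: float,
                  overhead: float = 0.10, scratch: float = 0.01) -> float:
    """Quantized weights plus ~10% CUDA overhead and ~1% scratchpad."""
    weights_gb = params_billions * bits_per_weight / 8  # 8 bits per byte
    return weights_gb * (1 + overhead + scratch)

# An 8.0B model at a flat 4 bits: 4.0 GB of weights -> ~4.44 GB fixed cost
print(round(fixed_cost_gb(8.0, 4.0), 2))
```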
2. Variable Cost (Linear with Context)
- KV Cache: Grows linearly with every token in your prompt + generated output.
- Per-token cost: Depends on the model's layer count, KV head count, and head dimension — not on total parameters.
- Context Impact: Long conversations or documents can consume more VRAM than the model weights themselves.
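The per-token cost follows directly from those three architecture numbers; a small sketch:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """K and V tensors (factor of 2) across every layer, FP16 cache by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element

# Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128): 131,072 bytes per token,
# so a 4,096-token context costs exactly 0.5 GiB of KV cache.
print(kv_bytes_per_token(32, 8, 128) * 4096 / 2**30)
```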
Precise VRAM Formulas for Local LLMs
1. Standard Architecture (Llama, Mistral, Qwen 2.5, Gemma, Phi)
KV cache (bytes) = 2 × layers × kv_heads × head_dim × bytes_per_element × context_length

The factor of 2 covers the separate K and V tensors, and bytes_per_element is 2 for an FP16 cache. The kv_heads × head_dim term is equivalent to d/g, where g = attention_heads / kv_heads (the GQA factor). For Llama 3.1 8B: d/g = 4096/4 = 1024, and kv_heads × head_dim = 8 × 128 = 1024.
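That equivalence is easy to verify with the Llama 3.1 8B numbers:

```python
# Llama 3.1 8B: 32 attention heads grouped over 8 KV heads (GQA factor g = 4)
d, attention_heads, kv_heads, head_dim = 4096, 32, 8, 128
g = attention_heads // kv_heads
assert d // g == kv_heads * head_dim == 1024
```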
2. Decoupled Head Dimension (Qwen 3 32B)
Used when the K and V head dimension is decoupled from the Q (query) dimension, so head_dim must be read from the model config rather than derived as d / attention_heads:

KV cache (bytes) = 2 × layers × kv_heads × head_dim (from config) × bytes_per_element × context_length

Qwen 3 32B uses this architecture, so its KV cache differs from the standard d/g result; Qwen 3 8B and other models in the family follow the standard formula.
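In code, the only change is where head_dim comes from. The values below are illustrative, in the spirit of a decoupled model like Qwen 3 32B, rather than official config numbers:

```python
# Decoupled head dimension: derive nothing -- read head_dim from the config.
# (Illustrative values; check the actual model config for real numbers.)
d, attention_heads, head_dim_from_config = 5120, 64, 128
derived = d // attention_heads      # 80 -- wrong for decoupled architectures
# The KV formula should use head_dim_from_config (128), not the derived 80.
print(derived, head_dim_from_config)
```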
3. MoE Models (Mixtral, DeepSeek, Llama 4)
All expert parameters must be loaded into VRAM, not just the active subset. For example, Mixtral 8×7B has 46.7B total parameters (all 8 experts), even though only 12.9B are active per token. DeepSeek R1/V3 loads all 671B parameters across 256 experts (37B active). The calculator uses total parameters for the model weights calculation.
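A sketch of the MoE weight calculation, again assuming a flat 4 bits per weight:

```python
def moe_weights_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """MoE VRAM is driven by TOTAL parameters; active parameters affect speed, not memory."""
    return total_params_billions * bits_per_weight / 8

# Mixtral 8x7B: all 46.7B parameters must be resident, not the 12.9B active per token
print(round(moe_weights_gb(46.7, 4.0), 2))  # ~23.35 GB, in line with the table below
```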
Reference Examples (Verified)
| Model | Quant | Context | Weights | KV Cache | Total | Real-World |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 4,096 | 4.01 GB | 0.50 GB | 4.96 GB | ~4.8 GB |
| Qwen 2.5 7B | Q4_K_M | 4,096 | 3.80 GB | 0.22 GB | 4.42 GB | ~4.2 GB |
| Llama 3.3 70B | Q4_K_M | 4,096 | 35.3 GB | 1.25 GB | 40.6 GB | ~39 GB |
| Mixtral 8×7B | Q4_K_M | 4,096 | 23.4 GB | 0.50 GB | 26.5 GB | ~25 GB |
| Qwen 2.5 32B | Q4_K_M | 4,096 | 16.3 GB | 0.50 GB | 18.5 GB | ~18 GB |
| DeepSeek R1 | Q4_K_M | 4,096 | 335.5 GB | 0.48 GB | 370.6 GB | N/A* |
* DeepSeek R1 uses Multi-head Latent Attention (MLA), which compresses the KV cache significantly. The listed KV cache uses the standard formula; the actual MLA cache is much smaller (roughly 95% compression). No real-world measurement is available due to the extreme hardware requirements.
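As a sanity check, the first table row can be reproduced by inlining the formulas above (a sketch; the weights figure is taken straight from the table):

```python
weights_gb = 4.01                              # Llama 3.1 8B, Q4_K_M
kv_gb = 2 * 32 * 8 * 128 * 2 * 4096 / 2**30    # K+V x layers x kv_heads x head_dim x fp16 x ctx
total_gb = weights_gb * 1.11 + kv_gb           # +10% overhead, +1% scratchpad
print(round(kv_gb, 2), round(total_gb, 2))     # 0.5 and ~4.95, vs. the table's 0.50 / 4.96
```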
Frequently Asked Questions
What is the KV cache and why does it grow with context?
Why does quantization reduce VRAM for weights but not the KV cache?
Can I offload layers to system RAM (CPU) to use a larger model?
Why do MoE models like DeepSeek R1 require so much VRAM?
Does batch size affect VRAM requirements?
Related Guides
Turn these VRAM numbers into a buying decision. Our in-depth GPU guides walk you through the best hardware for every budget and model size.
Best GPU for Local LLMs
Complete GPU buying guide for running LLMs at home.
How Much VRAM for Local LLMs?
VRAM requirements for every popular model explained.
Is 16GB VRAM Enough for Local LLMs?
What you can run with 16GB of VRAM.
Is 24GB VRAM Enough for Local LLMs?
24GB GPU capabilities for LLM inference.
24GB vs 32GB GPU for Local LLMs
Is the extra 8GB worth the premium?
Best Budget GPU for Local LLMs
Affordable GPU options for LLM experimentation.
Best Used GPU for Local LLMs
Used GPU deals that make sense for LLMs.
Best NVIDIA GPU for Local LLMs
NVIDIA GPU rankings for LLM workloads.
Last updated: April 28, 2026