Apr 28, 2026

How much VRAM you actually need for local LLMs

VRAM is the single bottleneck that decides which models you can run. Here is the formula, the numbers, and the tiers - so you buy once and do not overspend.

Andre
1.0

The VRAM formula

Total VRAM = model_weights + KV_cache + overhead
model_weights = parameters x bytes_per_weight
| Quantization | Bytes/Param | 7B Model | 70B Model |
| --- | --- | --- | --- |
| FP16 | 2.0 | ~14 GB | ~140 GB |
| Q8 | 1.0 | ~7 GB | ~70 GB |
| Q4_K_M | 0.5 | ~4 GB | ~38 GB |
| Q3_K_M | 0.375 | ~3 GB | ~30 GB |
| Q2_K | 0.25 | ~2 GB | ~20 GB |

Overhead (tokenizer, graph, framework buffers) adds 10-20% on top of raw weights.
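The formula above is easy to turn into a quick calculator. A minimal sketch, using the approximate bytes-per-weight values from the table and an assumed 15% overhead factor (the midpoint of the 10-20% range):

```python
# Rough VRAM estimate: weights x bytes/param, plus a framework overhead
# factor. Byte widths are the approximate values from the table above.

BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}

def estimate_weight_vram_gb(params_billion: float, quant: str,
                            overhead: float = 0.15) -> float:
    """Estimate VRAM for model weights plus 10-20% framework overhead."""
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]  # 1B params x 1 byte ~= 1 GB
    return weights_gb * (1 + overhead)

print(f"{estimate_weight_vram_gb(7, 'FP16'):.1f} GB")    # 7B at FP16 -> ~16 GB with overhead
print(f"{estimate_weight_vram_gb(70, 'Q4_K_M'):.1f} GB")
```

Note this covers weights and fixed overhead only; the KV cache, covered next, comes on top.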

2.0

The hidden cost: KV cache

Every token in your conversation occupies memory in the KV cache. For a 7B model, each additional 1,000 tokens of context costs roughly 50-100 MB. For a 70B model, that jumps to 300-500 MB per 1,000 tokens. Long context windows can double or triple your total VRAM usage beyond the model weights alone.

| Context Length | 7B at Q4 Total | 70B at Q4 Total |
| --- | --- | --- |
| 2,048 tokens | ~4.5 GB | ~40 GB |
| 8,192 tokens | ~5.5 GB | ~46 GB |
| 32,768 tokens | ~8 GB | ~62 GB |
| 128,000 tokens | ~14 GB | ~110 GB |

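The KV cache cost can be budgeted the same way. A sketch, assuming the rough per-1,000-token figures quoted above (~75 MB for a 7B-class model, ~400 MB for a 70B-class model; actual values depend on architecture and KV-cache precision):

```python
# Total VRAM = weights + KV cache, with the KV cost expressed as an
# assumed MB-per-1,000-tokens figure (model- and precision-dependent).

def total_vram_gb(weights_gb: float, context_tokens: int,
                  kv_mb_per_1k_tokens: float) -> float:
    """Add KV-cache growth (linear in context length) to the weight footprint."""
    kv_gb = (context_tokens / 1000) * kv_mb_per_1k_tokens / 1024
    return weights_gb + kv_gb

# 7B at Q4 (~4 GB weights), 32k context, assuming ~75 MB per 1k tokens
print(f"{total_vram_gb(4.0, 32768, 75):.1f} GB")
```

With the higher end of the range (~100 MB per 1k tokens) the same call lands near the ~8 GB shown in the table.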
3.0

Bandwidth determines token speed

During autoregressive generation, the GPU reads the entire model for each token. The theoretical maximum token speed is bandwidth divided by model size. Real-world throughput is typically 60-80% of theoretical due to overhead and KV cache access.

tokens/second = bandwidth_GB_per_s / model_size_GB x efficiency

Example: RTX 4090 (1,008 GB/s) running 70B at Q4 (38 GB): 1,008 / 38 x 0.7 = ~18.5 t/s

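The same back-of-envelope estimate as code, with the 0.7 efficiency factor from the example (anywhere in the 0.6-0.8 range is plausible):

```python
# Theoretical decode speed: the GPU must stream the full model from
# VRAM once per generated token, so bandwidth / model size bounds t/s.
# efficiency discounts for overhead and KV-cache reads.

def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                      efficiency: float = 0.7) -> float:
    return bandwidth_gb_s / model_size_gb * efficiency

# RTX 4090: 1,008 GB/s, 70B at Q4 ~ 38 GB
print(f"{tokens_per_second(1008, 38):.1f} t/s")  # ~18.6 t/s
```

This is why a smaller quantization speeds up generation even when the full model already fits: fewer bytes to stream per token.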
4.0

Quick tier summary

| VRAM | Models You Can Run | Verdict |
| --- | --- | --- |
| 8 GB | 7B at Q4 | Experimentation only |
| 12 GB | 7B at Q8, 13B at Q4 | Basic inference |
| 16 GB | 13B at Q8, 34B at Q3 | Solid 7B-13B usage |
| 24 GB | 35B at Q4, Mixtral 8x7B, 70B at Q3 | Sweet spot for most |
| 32 GB | 70B at Q4, long context 35B+ | Fewest compromises |
| 48 GB+ | 70B at FP16, Mixtral 8x22B | Multi-GPU territory |
Frequently Asked Questions

How does quantization level affect VRAM usage?
Linearly. A 7B model at FP16 needs ~14 GB. At Q8, ~7 GB. At Q4, ~4 GB. Q4_K_M is the most popular quantization because it retains ~97% of FP16 quality at roughly 25% of the size. The relationship is simply: VRAM = parameters x bytes_per_weight.
How does context length affect VRAM?
The KV cache grows linearly with context length. For a 7B model at Q4, 2,048 tokens of context might use 4.5 GB total, while 32,768 tokens could push that to 8+ GB. Budget extra VRAM for long context - roughly 30% more than model weights alone.
Can I offload part of a model to system RAM?
Yes, llama.cpp supports GPU/CPU split. You can run a model larger than your VRAM by keeping some layers on the GPU and the rest in system RAM. The downside is speed: CPU layers run 3-5x slower. Keep as many layers on GPU as your VRAM allows.
Do I need ECC memory for local LLMs?
No. Consumer GPUs without ECC work fine for inference. ECC matters more for training where bit flips can corrupt weights over many iterations. For inference, a random bit flip barely affects output quality.
