Apr 28, 2026

How much VRAM you actually need for local LLMs

VRAM is the single bottleneck that decides which models you can run. Here is the formula, the numbers, and the tiers - so you buy once and do not overspend.

Andre
1.0

The VRAM formula

Total VRAM = model_weights + KV_cache + overhead
model_weights = parameters x bytes_per_weight
| Quantization | Bytes/Param | 7B Model | 70B Model |
| --- | --- | --- | --- |
| FP16 | 2.0 | ~14 GB | ~140 GB |
| Q8 | 1.0 | ~7 GB | ~70 GB |
| Q4_K_M | 0.5 | ~4 GB | ~38 GB |
| Q3_K_M | 0.375 | ~3 GB | ~30 GB |
| Q2_K | 0.25 | ~2 GB | ~20 GB |

Overhead (tokenizer, graph, framework buffers) adds 10-20% on top of raw weights.
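The formula above is easy to turn into a quick calculator. A minimal sketch, using the approximate bytes-per-weight values from the table and an assumed 15% overhead factor (the midpoint of the 10-20% range):

```python
# Rough VRAM estimate: weights x bytes/param, plus a framework overhead
# factor. Byte widths are the approximate values from the table above.

BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}

def estimate_weight_vram_gb(params_billion: float, quant: str,
                            overhead: float = 0.15) -> float:
    """Estimate VRAM for model weights plus 10-20% framework overhead."""
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]  # 1B params x 1 byte ~= 1 GB
    return weights_gb * (1 + overhead)

print(f"{estimate_weight_vram_gb(7, 'FP16'):.1f} GB")    # 7B at FP16 -> ~16 GB with overhead
print(f"{estimate_weight_vram_gb(70, 'Q4_K_M'):.1f} GB")
```

Note this covers weights and fixed overhead only; the KV cache, covered next, comes on top.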

2.0

The hidden cost: KV cache

Every token in your conversation occupies memory in the KV cache. For a 7B model, each additional 1,000 tokens of context costs roughly 50-100 MB. For a 70B model, that jumps to 300-500 MB per 1,000 tokens. Long context windows can double or triple your total VRAM usage beyond the model weights alone.

| Context Length | 7B at Q4 Total | 70B at Q4 Total |
| --- | --- | --- |
| 2,048 tokens | ~4.5 GB | ~40 GB |
| 8,192 tokens | ~5.5 GB | ~46 GB |
| 32,768 tokens | ~8 GB | ~62 GB |
| 128,000 tokens | ~14 GB | ~110 GB |

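The KV cache cost can be budgeted the same way. A sketch, assuming the rough per-1,000-token figures quoted above (~75 MB for a 7B-class model, ~400 MB for a 70B-class model; actual values depend on architecture and KV-cache precision):

```python
# Total VRAM = weights + KV cache, with the KV cost expressed as an
# assumed MB-per-1,000-tokens figure (model- and precision-dependent).

def total_vram_gb(weights_gb: float, context_tokens: int,
                  kv_mb_per_1k_tokens: float) -> float:
    """Add KV-cache growth (linear in context length) to the weight footprint."""
    kv_gb = (context_tokens / 1000) * kv_mb_per_1k_tokens / 1024
    return weights_gb + kv_gb

# 7B at Q4 (~4 GB weights), 32k context, assuming ~75 MB per 1k tokens
print(f"{total_vram_gb(4.0, 32768, 75):.1f} GB")
```

With the higher end of the range (~100 MB per 1k tokens) the same call lands near the ~8 GB shown in the table.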
3.0

Bandwidth determines token speed

During autoregressive generation, the GPU reads the entire model for each token. The theoretical maximum token speed is bandwidth divided by model size. Real-world throughput is typically 60-80% of theoretical due to overhead and KV cache access.

tokens/second = bandwidth_GB_per_s / model_size_GB x efficiency

Example: RTX 4090 (1,008 GB/s) running 70B at Q4 (38 GB): 1,008 / 38 x 0.7 = ~18.5 t/s

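The same back-of-envelope estimate as code, with the 0.7 efficiency factor from the example (anywhere in the 0.6-0.8 range is plausible):

```python
# Theoretical decode speed: the GPU must stream the full model from
# VRAM once per generated token, so bandwidth / model size bounds t/s.
# efficiency discounts for overhead and KV-cache reads.

def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                      efficiency: float = 0.7) -> float:
    return bandwidth_gb_s / model_size_gb * efficiency

# RTX 4090: 1,008 GB/s, 70B at Q4 ~ 38 GB
print(f"{tokens_per_second(1008, 38):.1f} t/s")  # ~18.6 t/s
```

This is why a smaller quantization speeds up generation even when the full model already fits: fewer bytes to stream per token.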
4.0

Quick tier summary

| VRAM | Models You Can Run | Verdict |
| --- | --- | --- |
| 8 GB | 7B at Q4 | Experimentation only |
| 12 GB | 7B at Q8, 13B at Q4 | Basic inference |
| 16 GB | 13B at Q8, 34B at Q3 | Solid 7B-13B usage |
| 24 GB | 35B at Q4, Mixtral 8x7B, 70B at Q3 | Sweet spot for most |
| 32 GB | 70B at Q4, long context 35B+ | Fewest compromises |
| 48 GB+ | 70B at FP16, Mixtral 8x22B | Multi-GPU territory |
Frequently Asked Questions

How does quantization level affect VRAM usage?
Linearly. A 7B model at FP16 needs ~14 GB. At Q8, ~7 GB. At Q4, ~4 GB. Q4_K_M is the most popular quantization because it retains ~97% of FP16 quality at roughly 25% of the size. The relationship is simply: VRAM = parameters x bytes_per_weight.
How does context length affect VRAM?
The KV cache grows linearly with context length. For a 7B model at Q4, 2,048 tokens of context might use 4.5 GB total, while 32,768 tokens could push that to 8+ GB. Budget extra VRAM for long context - roughly 30% more than model weights alone.
Can I offload part of a model to system RAM?
Yes, llama.cpp supports GPU/CPU split. You can run a model larger than your VRAM by keeping some layers on the GPU and the rest in system RAM. The downside is speed: CPU layers run 3-5x slower. Keep as many layers on GPU as your VRAM allows.
Do I need ECC memory for local LLMs?
No. Consumer GPUs without ECC work fine for inference. ECC matters more for training where bit flips can corrupt weights over many iterations. For inference, a random bit flip barely affects output quality.
