How much VRAM you actually need for local LLMs
VRAM is the single bottleneck that decides which models you can run. Here are the formula, the numbers, and the hardware tiers, so you buy once and don't overspend.

The VRAM formula
Weights alone need roughly: parameters (in billions) × bytes per parameter = gigabytes of VRAM. The table below lists the bytes per parameter for common quantization levels:
| Quantization | Bytes per Parameter | 7B Weights | 70B Weights |
|---|---|---|---|
| FP16 | 2.0 | ~14 GB | ~140 GB |
| Q8 | 1.0 | ~7 GB | ~70 GB |
| Q4_K_M | 0.5 | ~4 GB | ~38 GB |
| Q3_K_M | 0.375 | ~3 GB | ~30 GB |
| Q2_K | 0.25 | ~2 GB | ~20 GB |
Overhead (tokenizer, graph, framework buffers) adds 10-20% on top of raw weights.
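If you want to plug in your own numbers, here is a minimal sketch of that estimate in Python, using the bytes-per-parameter values from the table above and assuming 15% overhead:

```python
# Approximate weight-memory estimator; bytes-per-parameter values are
# rough averages for each format, and real GGUF files vary slightly.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8": 1.0,
    "q4_k_m": 0.5,
    "q3_k_m": 0.375,
    "q2_k": 0.25,
}

def weight_vram_gb(params_billion: float, quant: str, overhead: float = 0.15) -> float:
    """Estimate VRAM for model weights plus framework overhead (10-20%)."""
    return params_billion * BYTES_PER_PARAM[quant] * (1 + overhead)

for size in (7, 13, 70):
    print(f"{size}B at Q4_K_M: ~{weight_vram_gb(size, 'q4_k_m'):.1f} GB")
```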
The hidden cost: KV cache
Every token in your conversation occupies memory in the KV cache. For a 7B model, each additional 1,000 tokens of context costs roughly 50-100 MB. For a 70B model, that jumps to 300-500 MB per 1,000 tokens. Long context windows can double or triple your total VRAM usage beyond the model weights alone.
| Context Length | Total VRAM, 7B at Q4 | Total VRAM, 70B at Q4 |
|---|---|---|
| 2,048 tokens | ~4.5 GB | ~40 GB |
| 8,192 tokens | ~5.5 GB | ~46 GB |
| 32,768 tokens | ~8 GB | ~62 GB |
| 128,000 tokens | ~14 GB | ~110 GB |
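The per-token cost comes straight from the model architecture: the cache stores one key and one value vector per layer per token. A rough sketch of the standard formula, assuming Mistral-7B-style dimensions (32 layers, 8 KV heads, head dimension 128) and an FP16 cache; these values vary by model:

```python
def kv_cache_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache = 2 (keys and values) x layers x KV heads x head dim
    x bytes per element x tokens currently in context."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * n_tokens / 1024**3

# 7B-class model with grouped-query attention at a 32k context:
print(f"~{kv_cache_gb(32_768, n_layers=32, n_kv_heads=8, head_dim=128):.1f} GB")
```

Models without grouped-query attention, or with more layers, need several times more cache; quantizing the cache to 8-bit roughly halves it.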
Bandwidth determines token speed
During autoregressive generation, the GPU reads the entire model for each token. The theoretical maximum token speed is bandwidth divided by model size. Real-world throughput is typically 60-80% of theoretical due to overhead and KV cache access.
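As a sanity check, assuming a card with roughly 1,000 GB/s of memory bandwidth and the ~4 GB Q4 7B model from the table above:

```python
def tokens_per_second(bandwidth_gb_s: float, model_gb: float,
                      efficiency: float = 0.7) -> float:
    """Bandwidth ceiling on generation speed: every token requires streaming
    all weights from VRAM once; efficiency (~0.6-0.8) covers KV cache reads
    and framework overhead."""
    return bandwidth_gb_s / model_gb * efficiency

print(f"~{tokens_per_second(1000, 4.0):.0f} tokens/s")  # ceiling 250, realistic ~175
```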
Quick tier summary
| VRAM | Models You Can Run | Verdict |
|---|---|---|
| 8 GB | 7B at Q4 | Experimentation only |
| 12 GB | 7B at Q8, 13B at Q4 | Basic inference |
| 16 GB | 13B at Q8, 34B at Q3 | Solid 7B-13B usage |
| 24 GB | 35B at Q4, Mixtral 8x7B at Q3, 70B at Q2 | Sweet spot for most |
| 32 GB | 70B at Q3 (tight), long-context 35B | Fewest compromises |
| 48 GB+ | 70B at Q4-Q5, Mixtral 8x22B quantized | Multi-GPU territory |
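To check a specific setup against these tiers, here is a small sketch that combines the weight and KV cache estimates from earlier; the 400 MB per 1,000 tokens figure is the midpoint of the 70B range given above:

```python
def fits_in_vram(vram_gb: float, params_billion: float, bytes_per_param: float,
                 context_tokens: int, kv_mb_per_1k_tokens: float) -> bool:
    """Rough check: weights (plus 15% overhead) and KV cache within budget."""
    weights_gb = params_billion * bytes_per_param * 1.15
    kv_gb = context_tokens / 1000 * kv_mb_per_1k_tokens / 1024
    return weights_gb + kv_gb <= vram_gb

# 70B at Q4 (0.5 bytes/param) with an 8k context on a 48 GB budget:
print(fits_in_vram(48, 70, 0.5, 8_192, 400))  # True, with a few GB to spare
```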