Why 24 GB is the sweet spot for local LLMs
24 GB covers the widest range of useful models at usable speeds. At 4-bit quantization it runs Mixtral 8x7B and Qwen 2.5 32B entirely on the GPU, and it can handle 70B models with partial offloading to system RAM. Here is exactly where 24 GB is enough - and where it falls short.

What fits in 24 GB
| Model | Quant | VRAM | Speed | Quality |
|---|---|---|---|---|
| Llama 3.1 8B | FP16 | ~14 GB | ~70 t/s | Perfect |
| Mixtral 8x7B | Q4_K_M | ~14 GB | ~35 t/s | Very good |
| Qwen 2.5 32B | Q4_K_M | ~18 GB | ~25 t/s | Very good |
| Command R 35B | Q4_K_M | ~20 GB | ~20 t/s | Good |
| Llama 3.1 70B | Q3_K_M | ~30 GB | ~12 t/s* | Good |
| Llama 3.1 70B | Q4_K_M | ~38 GB | ~8 t/s* | Very good |
* Partial CPU offload required. Speed estimates assume ~1,000 GB/s bandwidth (RTX 4090/3090 class). 70B speeds reflect mixed GPU+CPU inference.
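
The VRAM figures above follow mostly from bits per weight. As a rough sanity check, here is a minimal Python sketch; the bits-per-weight averages and the flat overhead allowance are assumptions, so expect the result to land within a gigabyte or two of a real GGUF file rather than match it exactly.

```python
# Rough VRAM estimate: weights (params * bits-per-weight) plus a flat
# allowance for runtime buffers. KV cache is NOT included; it grows with
# context length. Bits-per-weight averages are approximations.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def weight_vram_gb(params_billion: float, quant: str, overhead_gb: float = 1.0) -> float:
    weight_gb = params_billion * BITS_PER_WEIGHT[quant] / 8.0
    return weight_gb + overhead_gb

# ~20 GB for Qwen 2.5 32B at Q4_K_M, roughly in line with the table above
print(f"Qwen 2.5 32B Q4_K_M: ~{weight_vram_gb(32, 'Q4_K_M'):.0f} GB")
```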
The 70B problem
The most common question about 24 GB: can you run the big models? At Q3 quantization, 70B needs ~30 GB. You keep ~24 GB on the GPU and offload ~6 GB to system RAM. The speed penalty for that small offload is typically 15-20% - acceptable for interactive use.
At Q4, 70B needs ~38 GB. Offloading 14 GB to RAM hurts more - expect 40-60% speed reduction. The quality improvement from Q3 to Q4 is real but modest for most tasks. If you regularly need Q4 70B with zero compromise, 32 GB (RTX 5090) is the answer.
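
In practice, partial offload is a single parameter in most runtimes. Here is a minimal llama-cpp-python sketch, assuming a local Q3_K_M GGUF file; the file name and the layer split are illustrative, not a tuned recommendation.

```python
# Partial GPU offload with llama-cpp-python. Llama 3.1 70B has 80 transformer
# layers; keeping most of them in VRAM and spilling the rest to system RAM is
# what produces the modest speed penalty described above.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-70b-instruct.Q3_K_M.gguf",  # hypothetical local file
    n_gpu_layers=70,  # layers resident in VRAM; the remaining ~10 run on the CPU
    n_ctx=8192,       # context window; larger values grow the KV cache
)

out = llm("Explain why memory bandwidth limits token generation.", max_tokens=128)
print(out["choices"][0]["text"])
```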
Token speed at different bandwidths
Bandwidth is the primary bottleneck for token generation. All three 24 GB consumer GPUs have different bandwidth, which translates directly to different speeds for the same model.
| GPU | Bandwidth | Mixtral 8x7B Q4 | Qwen 32B Q4 |
|---|---|---|---|
| RTX 4090 | 1,008 GB/s | ~50 t/s | ~35 t/s |
| RX 7900 XTX | 960 GB/s | ~45 t/s | ~33 t/s |
| RTX 3090 | 936 GB/s | ~42 t/s | ~30 t/s |
Estimates at 70% efficiency. Actual speeds depend on framework, batch size, and KV cache size.
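
These table values come from a simple back-of-envelope model: every generated token requires streaming the model's weights from VRAM once, so tokens per second is roughly bandwidth times efficiency divided by the bytes read per token. A minimal sketch, with the 70% efficiency and the 18 GB weight size treated as assumptions; real numbers land a little lower once KV cache reads and framework overhead are counted.

```python
# Back-of-envelope decode speed: each new token streams the model's weights
# from VRAM once, so t/s ~= bandwidth * efficiency / GB read per token.
# The 0.7 efficiency factor and the 18 GB weight size are assumptions.
def tokens_per_sec(bandwidth_gb_s: float, weights_gb: float, efficiency: float = 0.7) -> float:
    return bandwidth_gb_s * efficiency / weights_gb

for gpu, bw in [("RTX 4090", 1008), ("RX 7900 XTX", 960), ("RTX 3090", 936)]:
    print(f"{gpu}: ~{tokens_per_sec(bw, 18):.0f} t/s on an ~18 GB model (Qwen 32B Q4)")
```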
When you need 32 GB instead
- You run 70B at Q4 daily. Per the table above, 70B Q4_K_M needs ~38 GB, so 32 GB still offloads a few gigabytes rather than ~14 GB; the smaller spill keeps speeds 2-3x higher than partial offload on 24 GB.
- You use long context (32K+) on 35B+ models. The KV cache at long contexts can push 24 GB setups into offloading territory (see the sketch after this list).
- You want zero VRAM anxiety. 32 GB runs everything in the table above except the 70B rows without a second thought.
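
To put a number on the KV cache point: cache size scales linearly with context length and with the model's layer and KV-head counts. A minimal sketch, using assumed figures for a Qwen 2.5 32B-class model with grouped-query attention; check your model's actual config.

```python
# KV cache size:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_elem
# The layer/head counts below are assumptions for a 32B-class model with GQA.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

print(f"32K context: ~{kv_cache_gb(64, 8, 128, 32_768):.1f} GB")  # ~8.6 GB on top of the weights
print(f"8K context:  ~{kv_cache_gb(64, 8, 128, 8_192):.1f} GB")   # ~2.1 GB
```

At a 16-bit cache, 32K of context adds several gigabytes on top of an ~18 GB 32B model, which is exactly what pushes a 24 GB card toward offloading.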
See 24 GB vs 32 GB for Local LLMs for the full comparison, or Best GPU for Local LLMs for GPU recommendations.