Llama 3.1 8B
- Parameters: 8.0B
- Max Context: 128K
- Architecture: Dense
- Released: Jul 23, 2024
- Modality: Text
About Llama 3.1 8B
Llama 3.1 8B is the most popular local LLM in the world. Trained by Meta on 15 trillion tokens, it delivers strong general-purpose performance across chat, coding, reasoning, and retrieval tasks. At Q4_K_M quantization it requires only ~4 GB VRAM, making it the default starting point for anyone getting into local LLMs. It supports a 128K context window, tool use, and multilingual output across 8 languages. The broad ecosystem of fine-tunes, quantized GGUF variants, and framework support (llama.cpp, Ollama, LM Studio, vLLM) makes it the best-supported local model available.
Technical Specifications
System Requirements
Estimated VRAM (GB) with a 10% overhead allowance, for different quantization methods and context sizes.
| Quantization | Bytes/weight | Quality | 1K ctx | 128K ctx |
|---|---|---|---|---|
| Q4_K_M | 0.50 | ~97% of FP16 | 4.28 GB (consumer GPU) | 20.15 GB (consumer GPU) |
| Q8_0 | 1.00 | ~100% of FP16 | 8.43 GB (consumer GPU) | 24.30 GB (datacenter GPU) |
| F16 | 2.00 | Reference | 16.73 GB (consumer GPU) | 32.60 GB (datacenter GPU) |
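Estimates like those in the table can be approximated with a back-of-envelope formula: model weights (parameters × bytes per weight) plus the KV cache, with the 10% overhead applied on top. The sketch below assumes Llama 3.1 8B's published architecture (32 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 KV cache; real runtimes differ in KV-cache precision and overhead accounting, so treat the output as a rough estimate rather than an exact reproduction of the table.

```python
def estimate_vram_gb(
    params_billion: float,
    bytes_per_weight: float,
    ctx_tokens: int,
    n_layers: int = 32,       # Llama 3.1 8B transformer layers
    n_kv_heads: int = 8,      # grouped-query attention KV heads
    head_dim: int = 128,      # hidden size 4096 / 32 attention heads
    kv_bytes: int = 2,        # fp16 KV cache (assumption)
    overhead: float = 0.10,   # 10% allowance, as in the table above
) -> float:
    """Rough VRAM estimate in GB for a dense decoder-only model."""
    weights = params_billion * 1e9 * bytes_per_weight
    # K and V caches: 2 tensors per layer, one vector per KV head per token
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx_tokens
    return (weights + kv_cache) * (1 + overhead) / 1e9

# Q4_K_M (0.50 bytes/weight) at a 1K context:
print(round(estimate_vram_gb(8.0, 0.50, 1024), 2))  # prints 4.55
```

The formula makes clear why long contexts dominate: at 128K tokens the KV cache alone is several times larger than the Q4_K_M weights, which is why the table's 128K column jumps from consumer-GPU to datacenter-GPU territory for the larger quantizations.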
Find the right GPU for Llama 3.1 8B
Use the interactive VRAM Calculator to see exactly how much memory you need at any quantization level, context length, and overhead setting.