Llama · Dense · Llama 3.2 Community License

Llama 3.2 3B

Parameters: 3.2B
Max Context: 128K
Architecture: Dense
Released: Sep 25, 2024
Modality: Text

About Llama 3.2 3B

Llama 3.2 3B is a sweet spot for lightweight local deployment: capable enough for meaningful assistant tasks while fitting comfortably on almost any GPU. At 3.21B parameters it runs fast on laptop GPUs and phones, and acceptably even on CPU. It is a good fit for quick chat, simple coding help, and text-processing tasks where you want sub-second latency.

Lightweight Assistant · Mobile · Text Processing · Basic Code

Technical Specifications

Total Parameters: 3.2B
Architecture: Dense
Attention Type: GQA (Grouped Query Attention)
Hidden Dimension: d = 3,072
Transformer Layers: 28
Attention Heads: 24
KV Heads: n_kv = 8
Head Dimension: d_head = 128
Activation Function: SwiGLU
Normalization: RMSNorm
Position Embedding: RoPE
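
These specifications are enough to estimate the KV-cache footprint, which dominates memory at long context. Below is a minimal sketch using the values above; the helper name and the fp16-cache assumption are ours, not from any particular runtime:

```python
# KV-cache size under GQA: keys + values for every layer, but only
# n_kv = 8 head groups instead of all 24 query heads.
def kv_cache_bytes(n_tokens, n_layers=28, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):  # 2 bytes = fp16
    # factor of 2 for the separate key and value tensors
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

print(kv_cache_bytes(1))           # 114,688 bytes per token (~112 KiB)
print(kv_cache_bytes(128 * 1024))  # ~15.0e9 bytes (14.0 GiB) at 128K context
```

With 24 query heads sharing only 8 KV heads, the cache is a third of what full multi-head attention would need (roughly 42 GiB at 128K), which is what makes long context plausible on consumer hardware.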

System Requirements

Estimated VRAM in GB, assuming 10% overhead, for different quantization methods and context sizes. A sketch of how such estimates are computed follows the table.

Quantization    Bytes/Weight    Quality           1K ctx      128K ctx
Q4_K_M          0.50            ~97% of FP16      1.77 GB     15.66 GB
Q8_0            1.00            ~100% of FP16     3.43 GB     17.32 GB
F16             2.00            Reference         6.75 GB     20.64 GB

All three quantizations fit a 24 GB consumer GPU at both 1K and 128K context.
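
For intuition on where these numbers come from, here is a rough sketch of the standard accounting: quantized weights at the listed bytes-per-weight, plus an fp16 KV cache for the chosen context, plus the 10% overhead for activations and runtime buffers. This is our own approximation rather than the exact formula behind the table, so its outputs land near, but not exactly on, the values above:

```python
def estimate_vram_gb(params=3.21e9, bytes_per_weight=0.50,
                     ctx_tokens=1024, n_layers=28, n_kv_heads=8,
                     head_dim=128, overhead=0.10):
    """Rough VRAM estimate: quantized weights + fp16 KV cache + overhead."""
    weights = params * bytes_per_weight
    # keys + values, per layer, per KV head, fp16 (2 bytes per element)
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * 2
    return (weights + kv_cache) * (1 + overhead) / 1e9  # decimal GB

print(estimate_vram_gb())                       # Q4_K_M, 1K ctx: ~1.9 GB
print(estimate_vram_gb(ctx_tokens=128 * 1024))  # Q4_K_M, 128K ctx: ~18.3 GB
print(estimate_vram_gb(bytes_per_weight=2.0))   # F16, 1K ctx: ~7.2 GB
```

At 128K context the KV cache dominates the weights, which is why the three quantizations converge toward similar totals in the right-hand column.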

Find the right GPU for Llama 3.2 3B

Use the interactive VRAM Calculator to see exactly how much memory you need at any quantization level, context length, and overhead setting.