How to choose the right quantization level based on your hardware
Prerequisites: knowledge of your system's available RAM and a base model in mind.
What this does
Measures available system memory, estimates RAM demands for quantized models, and selects an appropriate quantization level so inference runs without swapping or OOM errors. The result is a model and quantization pairing that fits the hardware budget.
Steps
Check available system RAM. Measures the memory ceiling available for model loading.
free -h
Expected output: Total, used, and available memory columns.
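For scripting the later steps, the available figure can also be read as a plain number. The following is a minimal sketch, assuming a Linux system that exposes the `MemAvailable` field in `/proc/meminfo`; the function name `available_gib` and its optional file argument are illustrative and not part of any tool.

```shell
# Print available memory in GiB, read from the MemAvailable field (reported in kB).
# Hypothetical helper: the optional argument (default /proc/meminfo) exists
# only so the function can be pointed at a saved copy of the file.
available_gib() {
    awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}
```

Calling `available_gib` with no arguments prints a single number that can be compared against the estimated model requirement.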
Estimate RAM requirement from model size and quantization. A 7B model in Q4_K_M needs roughly 4.5-5 GB on disk and 6-8 GB during inference. Larger parameter counts scale roughly proportionally: a 14B model at the same quantization needs about twice as much.
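The proportional scaling can be sketched as a one-line calculation. This assumes the 6-8 GB inference range quoted for a 7B Q4_K_M model and strictly linear scaling with parameter count, which is only a rough approximation; the helper name `estimate_q4km_ram_gb` is hypothetical.

```shell
# Scale the 7B Q4_K_M inference range (6-8 GB) linearly with parameter count.
# Assumption: linear scaling; real usage also depends on context length.
# Usage: estimate_q4km_ram_gb <params_in_billions>
estimate_q4km_ram_gb() {
    awk -v p="$1" 'BEGIN { printf "%.1f-%.1f\n", 6 * p / 7, 8 * p / 7 }'
}
```

For example, `estimate_q4km_ram_gb 14` prints `12.0-16.0`.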
Select the quantization level that matches available memory. Use this guide (figures are for 7B models):
- Q8_0: requires 12+ GB free
- Q5_K_M: suits 8-12 GB free
- Q4_K_M: ideal for 6-8 GB free
- Q3_K_M: for 4-6 GB free
- Q2_K: use when less than 4 GB is free; expect a noticeable drop in output quality
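The guide above can be expressed as a small lookup, shown here as a sketch rather than a definitive rule. The helper name `pick_quant` is hypothetical, the thresholds apply to 7B models only, and where two ranges share a boundary (8 or 12 GB) this version assumes the higher-quality level is preferred.

```shell
# Map free RAM in GB to a quantization level for a 7B model.
# Assumption: at a shared boundary (8 or 12 GB) the higher-quality level wins.
# Usage: pick_quant <free_gb>
pick_quant() {
    awk -v gb="$1" 'BEGIN {
        if      (gb >= 12) print "Q8_0"
        else if (gb >= 8)  print "Q5_K_M"
        else if (gb >= 6)  print "Q4_K_M"
        else if (gb >= 4)  print "Q3_K_M"
        else               print "Q2_K"
    }'
}
```

For example, `pick_quant 7` prints `Q4_K_M`.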
Pull the chosen variant. Downloads the selected quantization level.
ollama pull llama3:q4_K_M
Expected output: Progress bars and success.
Verification
free -h | awk 'NR==2{print "Available RAM: " $7}' && ollama list | grep q4_K_M
# Expected: Available RAM greater than estimated model need; model variant present in list
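The comparison in the expected result can also be made explicit in a script. A minimal sketch, assuming both figures are already known in GB; the helper name `fits_in_ram` is hypothetical.

```shell
# Succeed (exit status 0) only when available RAM exceeds the estimated need.
# Usage: fits_in_ram <available_gb> <needed_gb>
fits_in_ram() {
    awk -v a="$1" -v n="$2" 'BEGIN { exit !(a > n) }'
}
```

It can be used in a conditional, for instance `fits_in_ram 8 6.5 && echo "fits"`.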
Common failures
- OOM killer triggers during model load - Available RAM was overestimated; close other applications or switch to a lighter quantization.
- disk size != RAM usage - On-disk size underreports RAM need; real memory depends on context window and batch settings.
- GPU offload complications - Offloading layers to the GPU may fail or fall back to CPU without a supported backend such as CUDA; check runtime GPU support.
- confusing VRAM vs RAM - On discrete GPU systems, VRAM and system RAM are separate pools; each must be considered independently.
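The context-window dependence noted under "disk size != RAM usage" comes largely from the key/value cache, which grows linearly with context length. The arithmetic can be sketched as follows, assuming an unquantized fp16 cache (2 bytes per element) and a Llama-2-7B-style layout of 32 layers, 32 KV heads, and head dimension 128. These architecture numbers and the helper name `kv_cache_gib` are illustrative assumptions; check the model card for real values.

```shell
# KV cache size in GiB = 2 (keys and values) * layers * context tokens
#                        * KV heads * head dimension * bytes per element.
# Usage: kv_cache_gib <layers> <ctx> <kv_heads> <head_dim> <bytes_per_elem>
kv_cache_gib() {
    awk -v l="$1" -v c="$2" -v h="$3" -v d="$4" -v b="$5" \
        'BEGIN { printf "%.2f\n", 2 * l * c * h * d * b / 1073741824 }'
}
```

Under those assumptions, `kv_cache_gib 32 4096 32 128 2` prints `2.00`, meaning a 4096-token context adds about 2 GiB on top of the model weights.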