What this does

Measures available system memory, estimates RAM demands for quantized models, and selects an appropriate quantization level so inference runs without swapping or OOM errors. The result is a model and quantization pairing that fits the hardware budget.

Steps

Check available system RAM. Measures the memory ceiling available for model loading.
```
free -h
```
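Expected output: A table with total, used, free, and available columns. Budget against the available value in the Mem row rather than free, since available also counts reclaimable cache.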
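For scripting, the same figure can be read directly instead of parsed from the human-readable table. The sketch below assumes a Linux system that exposes MemAvailable in /proc/meminfo; the function name available_gib and its optional path argument are illustrative, not part of any tool.

```shell
# available_gib [MEMINFO_FILE]
# Print the MemAvailable value in GiB from a meminfo-style file.
# Defaults to /proc/meminfo (Linux only); the optional path is a
# hypothetical convenience for testing with a saved copy.
available_gib() {
  awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}
```

Calling available_gib with no arguments prints a single number such as 8.0, which is easier to compare against an estimate than the output of free -h.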
Estimate RAM requirement from model size and quantization. A 7B model in Q4_K_M takes roughly 4-5 GB on disk and 6-8 GB of RAM during inference, because the context cache and runtime buffers sit on top of the weights. Larger parameter counts scale roughly proportionally.
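The estimate can be sketched as weights plus a fixed allowance. The function below is a rough illustration, not a measurement: the bits-per-weight figure is supplied by the caller (about 4.8 for Q4_K_M is an approximation), and the 2 GB default overhead is an assumed allowance for context cache and runtime buffers.

```shell
# estimate_ram_gb PARAMS_BILLIONS BITS_PER_WEIGHT [OVERHEAD_GB]
# weights (GB) = parameters (billions) * bits per weight / 8
# OVERHEAD_GB is an assumed allowance (default 2) for context cache
# and runtime buffers; tune it for your context length.
estimate_ram_gb() {
  awk -v p="$1" -v b="$2" -v o="${3:-2}" \
    'BEGIN { printf "%.1f\n", p * b / 8 + o }'
}
```

With these assumptions, estimate_ram_gb 7 4.8 prints 6.2, which falls inside the 6-8 GB range quoted for a 7B Q4_K_M model.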
Select quantization matching available memory. Use this guide, which assumes a 7B model:
- Q8_0: 12+ GB free
- Q5_K_M: 8-12 GB free
- Q4_K_M: 6-8 GB free
- Q3_K_M: 4-6 GB free
- Q2_K: under 4 GB free; expect a noticeable drop in output quality
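The guide can be turned into a small lookup. This is a minimal sketch that treats the lower bound of each band as its threshold and applies to 7B models only; pick_quant is a hypothetical helper name.

```shell
# pick_quant FREE_GB
# Print the quantization level suggested by the guide for a 7B model.
# Each threshold is the lower bound of the matching band.
pick_quant() {
  awk -v gb="$1" 'BEGIN {
    if      (gb >= 12) q = "Q8_0"
    else if (gb >= 8)  q = "Q5_K_M"
    else if (gb >= 6)  q = "Q4_K_M"
    else if (gb >= 4)  q = "Q3_K_M"
    else               q = "Q2_K"
    print q
  }'
}
```

For example, pick_quant 7.5 prints Q4_K_M. Boundary values resolve to the higher-quality level, so pick_quant 8 prints Q5_K_M; lower the thresholds if other applications will compete for memory.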
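Pull the chosen variant. Downloads the selected quantization level. Tag names differ per model, so confirm the exact tag on the model's page in the Ollama library before pulling.
```
ollama pull llama3:8b-instruct-q4_K_M
```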
Expected output: Download progress bars for each layer, followed by a success message.

Verification

```
free -h | awk 'NR==2{print "Available RAM: " $7}' && ollama list | grep q4_K_M
# Expected: Available RAM greater than estimated model need; model variant present in list
```
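The comparison between available RAM and the estimated need can also be made explicit. The sketch below is an assumption-laden helper, not part of Ollama: fits_in_ram is a hypothetical name, and the 1 GB default headroom is an arbitrary safety margin.

```shell
# fits_in_ram AVAILABLE_GB NEEDED_GB [HEADROOM_GB]
# Exit 0 when available >= needed + headroom, non-zero otherwise.
# HEADROOM_GB defaults to 1, an assumed margin for other processes.
fits_in_ram() {
  awk -v a="$1" -v n="$2" -v h="${3:-1}" 'BEGIN { exit !(a >= n + h) }'
}
```

A typical use is fits_in_ram 8 6.2 && echo "fits", where 8 is the available figure and 6.2 the estimate; with the default headroom, 7 GB available would not pass for the same estimate.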

Common failures

OOM killer triggers during model load - Available RAM was overestimated; close other applications or switch to a lighter quantization.
Disk size is not RAM usage - On-disk size underreports RAM need; real memory use also depends on context window and batch settings.
GPU offload complications - Setups that expect GPU offload may fail or fall back to CPU when no supported GPU runtime such as CUDA is present; check runtime GPU support.
Confusing VRAM with RAM - On discrete GPU systems, VRAM and system RAM are separate pools; each must be budgeted independently.
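To see why context window settings matter, the key-value cache can be estimated separately from the weights. The sketch below uses the standard formula for an unquantized cache; kv_cache_mib is a hypothetical helper, and the example shape (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) is an assumption matching a typical Llama-style 7B model, so check your model's actual configuration.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS [BYTES_PER_VALUE]
# bytes = 2 (keys and values) * layers * kv_heads * head_dim
#         * context tokens * bytes per value (default 2, i.e. 16-bit)
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * ${5:-2} / 1048576 ))
}
```

Under those assumptions, kv_cache_mib 32 32 128 4096 prints 2048, so a 4096-token context adds about 2 GiB on top of the weights, and doubling the context doubles that figure.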

How to choose the right quantization level based on your hardware
