How to run quantized models on systems with limited RAM
System with 8-16 GB RAM, Ollama installed
What this does
Configures a low-memory Ollama environment by selecting small parameter-count models, applying Q4 quantization, and reducing the context window to keep RAM usage within system limits. The result is a working inference setup that avoids out-of-memory failures on 8-16 GB systems.
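To see why parameter count and quantization dominate the memory floor, the weights alone can be estimated as parameters times bits per weight, divided by 8. The sketch below is a rough illustration, not a measurement: the function name `estimate_weights_gb` is hypothetical, the 4.5 bits-per-weight figure for a Q4-style format is an assumed approximation, and the KV cache and runtime overhead are left out.

```shell
#!/bin/sh
# Rough estimate of the RAM taken by model weights alone (in GB, 10^9 bytes).
# Not included: KV cache (grows with the context window) and runtime overhead.

# estimate_weights_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# (hypothetical helper name, for illustration only)
estimate_weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 2 4       # 2B parameters at exactly 4 bits per weight
estimate_weights_gb 3.8 4.5   # 3.8B parameters at an assumed ~4.5 bits per weight
estimate_weights_gb 7 16      # 7B parameters at 16 bits, for comparison
```

The gap between the 16-bit and 4-bit figures is the saving that makes 8-16 GB systems workable; the context window then decides how much is added on top.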
Steps
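Select a model with 1B-3B parameters in a Q4 variant. Smaller parameter counts give a lower memory floor, because the weights must fit in RAM before any context is allocated. Examples include Phi-3 Mini (3.8B, slightly above this range) and Gemma 2B.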
ollama pull phi3:q4_K_M
Expected output: Progress bars and success.
Reduce the context window to lower peak memory usage. Sets num_ctx to a smaller value, typically 1024-2048 tokens.
OLLAMA_NUM_CTX=1024 ollama run phi3:q4_K_M "Hello, how are you?"
Expected output: A generated response with lower peak RAM consumption.
Monitor RAM during inference. Run this in a second terminal while the model generates to confirm that memory stays within limits.
free -h
Expected output: The "available" column stays above zero, even during peak generation.
Persist the context limit in the environment. This applies the setting to all subsequent Ollama calls made from the same shell session.
export OLLAMA_NUM_CTX=1024
Expected output: No output; the variable is set silently.
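Because the model is loaded by the Ollama server process rather than by each individual command, another way to persist a context limit is to bake `num_ctx` into a derived model with a Modelfile. The sketch below only writes the Modelfile; the helper `write_modelfile`, the file location, and the derived model name `phi3-lowmem` are hypothetical names chosen for illustration, and the base tag is copied from the steps above.

```shell
#!/bin/sh
# Sketch: write a Modelfile that fixes a smaller context window for a base model.

# write_modelfile BASE_MODEL NUM_CTX OUTPUT_PATH
# (hypothetical helper; FROM and PARAMETER are Modelfile instructions)
write_modelfile() {
    printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Hypothetical location for the generated file.
modelfile="${TMPDIR:-/tmp}/Modelfile.lowmem"
write_modelfile "phi3:q4_K_M" 1024 "$modelfile"
cat "$modelfile"

# With Ollama installed, the derived model could then be built and run
# (not executed here; "phi3-lowmem" is a made-up name):
#   ollama create phi3-lowmem -f "$modelfile"
#   ollama run phi3-lowmem "Hello, how are you?"
```

A derived model carries its context limit with it, so the setting does not depend on which shell or service started Ollama.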
Verification
OLLAMA_NUM_CTX=1024 ollama run phi3:q4_K_M "Write one sentence." && free -h | awk 'NR==2{print "Available RAM: " $7}'
# Expected: generated output followed by available memory greater than 1 GB
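The verification can also be turned into a pass/fail check. The following is a minimal sketch, assuming a Linux `free` from a recent procps release where "available" is the 7th field of the `Mem:` row; the helper names `available_mib` and `has_headroom` are hypothetical, and the 1024 MiB threshold mirrors the 1 GB figure above.

```shell
#!/bin/sh
# Sketch: check that enough memory remains available after a test prompt.

# available_mib: reads `free -m` output on stdin, prints the "available" column.
available_mib() {
    awk '$1 == "Mem:" { print $7 }'
}

# has_headroom AVAILABLE_MIB MIN_MIB: succeeds when AVAILABLE_MIB >= MIN_MIB.
has_headroom() {
    [ "$1" -ge "$2" ]
}

# Query the live system only when `free` exists; LC_ALL=C keeps the "Mem:" label in English.
if command -v free >/dev/null 2>&1; then
    avail=$(LC_ALL=C free -m | available_mib)
    if has_headroom "${avail:-0}" 1024; then
        echo "OK: ${avail} MiB available"
    else
        echo "LOW: ${avail:-unknown} MiB available"
    fi
fi
```

Running this right after a test prompt, while the model is still loaded, gives a closer picture of steady-state headroom than checking an idle system.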
Common failures
- OOM kill during first prompt - Context window still too high; lower OLLAMA_NUM_CTX incrementally until the prompt succeeds.
- response quality degradation - A context window that is too small truncates history; increase to 2048 if RAM permits.
- model fails to generate - Q4 variant may be missing for this model; try a different base model or quantization level.
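- environment variable not persisting - export only affects the current shell and its child processes; run it in the same shell as subsequent Ollama commands, or add it to your shell profile (for example ~/.bashrc) to keep it across sessions.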
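The "lower incrementally" advice for OOM kills can be sketched as a simple halving schedule. This is one possible approach under stated assumptions, not a prescribed procedure: the helper `next_ctx`, the starting value of 4096, and the floor of 512 tokens are all illustrative choices.

```shell
#!/bin/sh
# Sketch: generate a descending list of context sizes to try after an OOM kill.

# next_ctx CURRENT [FLOOR]: halve CURRENT without going below FLOOR (assumed default 512).
# A value already at or below the floor is returned unchanged.
next_ctx() {
    cur=$1
    floor=${2:-512}
    if [ "$cur" -le "$floor" ]; then
        echo "$cur"
        return
    fi
    half=$((cur / 2))
    if [ "$half" -lt "$floor" ]; then
        half=$floor
    fi
    echo "$half"
}

# Print the candidate values, largest first, stopping once the floor is reached.
ctx=4096
while :; do
    echo "try num_ctx=$ctx"
    new=$(next_ctx "$ctx")
    if [ "$new" -eq "$ctx" ]; then
        break
    fi
    ctx=$new
done
```

Trying the largest value that survives a full prompt keeps as much history as the machine allows, which limits the quality degradation described in the second item.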