What this does

Configures a low-memory Ollama environment by selecting small parameter-count models, applying Q4 quantization, and reducing the context window to keep RAM usage within system limits. The result is a working inference setup that avoids out-of-memory failures on 8-16 GB systems.

Steps

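Select a model with roughly 1B-3B parameters in a Q4 variant. Weight memory scales with parameter count and bits per weight, so small models at 4-bit quantization have the lowest memory floor. Examples include Gemma-2B and Phi-3 Mini (3.8B parameters, slightly above this range). Tag names differ between models, so confirm the exact Q4 tag on the model's page in the Ollama library before pulling.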
```
ollama pull phi3:q4_K_M
```
Expected output: Download progress bars for each layer, followed by a final success line.
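As a rough check on whether a model will fit before pulling it, the weight footprint can be estimated from parameter count and bits per weight. The sketch below is a back-of-the-envelope calculation, not something Ollama reports: the function name estimate_weights_gb is hypothetical, the 4.8 bits-per-weight figure for Q4_K_M is an approximation, and the estimate excludes the context cache and runtime overhead.

```shell
#!/bin/sh
# Rough weight-memory estimate: params (billions) * bits per weight / 8 = GB.
# estimate_weights_gb is a hypothetical helper, not an Ollama command.
estimate_weights_gb() {
    params_billions=$1
    bits_per_weight=$2
    awk -v p="$params_billions" -v b="$bits_per_weight" \
        'BEGIN { printf "%.2f\n", p * b / 8 }'
}

# Assumed figures: ~4.8 bits per weight for a Q4_K_M-style quantization.
echo "2B model:   $(estimate_weights_gb 2 4.8) GB"
echo "3.8B model: $(estimate_weights_gb 3.8 4.8) GB"
```

Treat the result as a lower bound: the context cache and runtime buffers come on top of the weights.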
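Reduce the context window to lower peak memory usage. This sets num_ctx to a smaller value, typically 1024-2048 tokens. The command below uses the OLLAMA_NUM_CTX variable as written in this guide; if your Ollama version ignores it, set num_ctx through a documented route instead: /set parameter num_ctx 1024 inside an interactive ollama run session, a PARAMETER num_ctx 1024 line in a Modelfile, or, in recent releases, OLLAMA_CONTEXT_LENGTH in the server's environment.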
```
OLLAMA_NUM_CTX=1024 ollama run phi3:q4_K_M "Hello, how are you?"
```
Expected output: A generated response. Peak RAM is lower than at the default context size because the key-value cache grows with num_ctx.
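A smaller num_ctx helps because the key-value cache grows linearly with context length. The sketch below computes that cache size under stated assumptions: an fp16 cache (2 bytes per element) and a model shape of 32 layers, 32 key-value heads, and a head dimension of 96. These values are assumed here to approximate Phi-3 Mini and should be checked against the model card. kv_cache_mib is a hypothetical helper name.

```shell
#!/bin/sh
# KV cache size in MiB: 2 (keys and values) * layers * kv_heads * head_dim
# * context tokens * bytes per element, divided by 1024*1024.
# kv_cache_mib is a hypothetical helper; the shape values below are assumptions.
kv_cache_mib() {
    layers=$1; kv_heads=$2; head_dim=$3; n_ctx=$4; bytes_per_elem=$5
    echo $(( 2 * layers * kv_heads * head_dim * n_ctx * bytes_per_elem / 1048576 ))
}

# Compare a few context sizes for the assumed model shape.
for ctx in 4096 2048 1024; do
    echo "num_ctx=$ctx -> $(kv_cache_mib 32 32 96 "$ctx" 2) MiB"
done
```

Under these assumptions, dropping from 4096 to 1024 tokens reduces the cache from 1536 MiB to 384 MiB.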
Monitor RAM during inference. Run this in a second terminal while the model generates to confirm that memory stays within limits.
```
watch -n 1 free -h
```
Expected output: The available column stays well above zero, even during peak generation.
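For a scripted version of the same check, available memory can be read from /proc/meminfo and compared with a threshold. This is a minimal sketch assuming a Linux system; available_mib and has_headroom are hypothetical helper names, and the 1024 MiB threshold mirrors the 1 GB figure used in the Verification section.

```shell
#!/bin/sh
# Read MemAvailable (reported in kB) from a meminfo-style file and print MiB.
# available_mib and has_headroom are hypothetical helpers for this guide.
available_mib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed when available memory (MiB) is at or above the required minimum.
has_headroom() {
    [ "$1" -ge "$2" ]
}

# Only query the live system where /proc/meminfo exists (Linux).
if [ -r /proc/meminfo ]; then
    avail=$(available_mib)
    if has_headroom "$avail" 1024; then
        echo "OK: ${avail} MiB available"
    else
        echo "LOW: only ${avail} MiB available"
    fi
fi
```

Running it in a loop while the model generates gives a simple pass/fail signal instead of a table to read by eye.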
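Persist the context limit in the environment. Exporting the variable applies the setting to every subsequent Ollama call in the current shell session; to keep it across sessions, add the same line to your shell profile (for example ~/.bashrc).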
```
export OLLAMA_NUM_CTX=1024
```
Expected output: No output; variable is set silently.
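Another way to make the limit stick is to bake it into a derived model, since a Modelfile accepts a PARAMETER num_ctx line. A minimal sketch follows; the derived model name phi3-lowmem and the temporary directory are hypothetical, and the base tag is the one used in this guide.

```shell
#!/bin/sh
# Write a Modelfile that pins the context window for a derived model.
# BASE_MODEL reuses this guide's tag; NEW_MODEL is a hypothetical name.
BASE_MODEL="phi3:q4_K_M"
NEW_MODEL="phi3-lowmem"
workdir=$(mktemp -d)
modelfile="$workdir/Modelfile"

cat > "$modelfile" <<EOF
FROM $BASE_MODEL
PARAMETER num_ctx 1024
EOF

# Show the generated file for review.
cat "$modelfile"
# Printed rather than executed, so the script runs without an Ollama server.
echo "Next: ollama create $NEW_MODEL -f $modelfile"
```

A model created this way carries the context limit with it, so the setting does not depend on which shell or service launches Ollama.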

Verification

OLLAMA_NUM_CTX=1024 ollama run phi3:q4_K_M "Write one sentence." && free -h | awk 'NR==2{print "Available RAM: " $7}'
# Expected: generated output, followed by an available-memory figure greater than 1 GB

Common failures

OOM kill during first prompt - The context window is still too large; lower OLLAMA_NUM_CTX step by step (for example 2048, then 1024, then 512) until the prompt completes.
Response quality degradation - A context window that is too small truncates conversation history; increase it to 2048 if RAM permits.
Model fails to pull or generate - A Q4 variant may not be published under that tag for this model; try a different base model or quantization level.
Environment variable not persisting - export only affects the current shell and its child processes; run it in the same shell as the subsequent Ollama commands, or add it to your shell profile.
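The advice to lower the context size incrementally can be automated with a simple back-off loop. The sketch below is one possible approach, not part of Ollama: find_working_ctx is a hypothetical helper that halves the context size until a caller-supplied command succeeds, and fake_run is a stand-in for a real inference attempt.

```shell
#!/bin/sh
# Halve the context size until the supplied command succeeds.
# find_working_ctx is a hypothetical helper: it calls "$@" with the candidate
# context size appended as the last argument and prints the first size that works.
find_working_ctx() {
    ctx=$1; min=$2; shift 2
    while [ "$ctx" -ge "$min" ]; do
        if "$@" "$ctx"; then
            echo "$ctx"
            return 0
        fi
        ctx=$(( ctx / 2 ))
    done
    return 1
}

# Stand-in for a real inference attempt: pretend anything above 1024 tokens fails.
# Replace this with a function that runs your prompt at the given context size.
fake_run() {
    [ "$1" -le 1024 ]
}

find_working_ctx 4096 256 fake_run
```

The probe command should send its own output elsewhere (for example to /dev/null) so that only the chosen context size is printed.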

How to run quantized models on systems with limited RAM
