How to run quantized models on systems with limited RAM

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Prerequisites

System with 8-16 GB RAM, Ollama installed

What this does

Configures a low-memory Ollama environment by selecting models with small parameter counts, using 4-bit (Q4) quantized variants, and reducing the context window to keep RAM usage within system limits. The result is a working inference setup that avoids out-of-memory failures on 8-16 GB systems.

Steps

  1. Select a model in the 1B-4B parameter range in a Q4 variant. Weight memory scales with parameter count, so smaller models have the lowest memory floor. Examples include Phi-3 Mini (3.8B) and Gemma-2B.

    ollama pull phi3:q4_K_M
    

    Expected output: Download progress bars followed by a success message.
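
As a rough way to see why parameter count and quantization set the memory floor in step 1, the sketch below estimates weight size as parameters × bits per weight ÷ 8. The function name `estimate_weights_gb` and the default of 4.5 bits per weight are illustrative assumptions, not values taken from Ollama. Real Q4 files differ somewhat, and runtime overhead and the context cache come on top of the weights.

```shell
#!/usr/bin/env bash
# Rough estimate of model weight size in GB (10^9 bytes).
# Hypothetical helper for illustration; not part of Ollama.
#   $1 = parameter count in billions (e.g. 3.8)
#   $2 = average bits per weight (assumed default: 4.5 for a Q4-style quant)
estimate_weights_gb() {
  local params_b="$1"
  local bits="${2:-4.5}"
  # billions of parameters * bits per weight / 8 bits per byte = GB
  LC_ALL=C awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

estimate_weights_gb 3.8   # a 3.8B model at the assumed 4.5 bits per weight
estimate_weights_gb 2 4   # a 2B model at exactly 4 bits per weight
```

Under these assumptions a 3.8B model needs a little over 2 GB for weights alone, which leaves headroom on an 8 GB system only if the context cache and other processes stay small.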

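  2. Reduce the context window to lower peak memory usage. This sets num_ctx to a smaller value, typically 1024-2048 tokens. Support for this environment variable depends on the Ollama version; if it has no effect, set num_ctx in a Modelfile or with /set parameter num_ctx 1024 inside an interactive ollama run session.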

    OLLAMA_NUM_CTX=1024 ollama run phi3:q4_K_M "Hello, how are you?"
    

    Expected output: A generated response with lower peak RAM consumption.
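
To illustrate why a smaller context window in step 2 lowers peak memory, the sketch below estimates the key/value cache, which grows linearly with num_ctx. The helper name `kv_cache_mib` is hypothetical. The dimensions in the example (32 layers, hidden size 3072, 16-bit cache values, no grouped-query attention) are assumptions chosen to resemble a Phi-3 Mini class model; check the model card for real values.

```shell
#!/usr/bin/env bash
# Approximate KV cache size in MiB for a transformer without grouped-query attention.
# Hypothetical helper. Formula:
#   2 (keys and values) * layers * hidden_size * bytes_per_value * num_ctx
#   $1 = num_ctx, $2 = layers, $3 = hidden size,
#   $4 = bytes per cached value (assumed default: 2, i.e. 16-bit)
kv_cache_mib() {
  local ctx="$1" layers="$2" hidden="$3" bytes="${4:-2}"
  echo $(( 2 * layers * hidden * bytes * ctx / 1024 / 1024 ))
}

# Assumed dimensions, for illustration only.
kv_cache_mib 1024 32 3072   # prints 384
kv_cache_mib 4096 32 3072   # prints 1536
```

Under these assumptions, each additional 1024 tokens of context costs about 384 MiB, so quartering the window from 4096 to 1024 frees over 1 GiB.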

  3. Monitor RAM during inference. Run this in a second terminal while the model generates to confirm that memory stays within limits.

    free -h
    

    Expected output: The available column of the Mem: row stays well above zero, even during peak generation.
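
A single free -h snapshot in step 3 can miss the peak. As a sketch, the functions below sample MemAvailable from /proc/meminfo once per second and report the lowest value seen. The function names and the default sample count are illustrative choices, not part of any tool.

```shell
#!/usr/bin/env bash
# Print MemAvailable in MiB from a meminfo-style file (default: /proc/meminfo).
mem_available_mib() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Sample available memory N times, one second apart, and print the minimum.
#   $1 = number of samples (assumed default: 30)
watch_min_available() {
  local samples="${1:-30}" min="" cur i=0
  while [ "$i" -lt "$samples" ]; do
    cur="$(mem_available_mib)"
    # Keep the smallest reading seen so far.
    if [ -z "$min" ] || [ "$cur" -lt "$min" ]; then min="$cur"; fi
    sleep 1
    i=$((i + 1))
  done
  echo "Lowest available memory: ${min} MiB"
}

# Example usage (run in a second terminal while the model generates):
# watch_min_available 60
```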

  4. Persist the context limit in the environment. This applies the setting to all subsequent Ollama calls made from the current shell session; add the line to ~/.bashrc to keep it across sessions.

    export OLLAMA_NUM_CTX=1024
    

    Expected output: No output; variable is set silently.
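
Another way to make a context limit stick, independent of shell state, is to bake num_ctx into a derived model with a Modelfile. The sketch below writes such a file. The helper name `write_low_ctx_modelfile` and the derived model name phi3-lowctx are hypothetical; FROM, PARAMETER num_ctx, and ollama create -f are standard Ollama Modelfile and CLI features.

```shell
#!/usr/bin/env bash
# Write a Modelfile that derives a model with a fixed, smaller context window.
#   $1 = base model tag, $2 = num_ctx value, $3 = output path (default: ./Modelfile)
write_low_ctx_modelfile() {
  local base="$1" ctx="$2" out="${3:-Modelfile}"
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$ctx" > "$out"
}

# Example usage (commented out so the sketch has no side effects when sourced):
# write_low_ctx_modelfile phi3:q4_K_M 1024
# ollama create phi3-lowctx -f Modelfile   # "phi3-lowctx" is a hypothetical name
# ollama run phi3-lowctx "Hello, how are you?"
```

Because the limit is stored with the derived model, it applies no matter which shell or client starts the run.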

Verification

OLLAMA_NUM_CTX=1024 ollama run phi3:q4_K_M "Write one sentence." && free -h | awk 'NR==2{print "Available RAM: " $7}'
# Expected: generated output followed by available memory greater than 1 GB
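
Because free -h prints human-readable units, the 1 GB threshold is easier to check numerically. Here is a minimal sketch, assuming the procps free -m layout in which the seventh field of the Mem: row is the available column. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Extract the "available" column (MiB) from `free -m` output read on stdin.
available_from_free() {
  awk '$1 == "Mem:" { print $7 }'
}

# Succeed (exit status 0) if available memory meets the minimum.
#   $1 = available MiB, $2 = minimum MiB (assumed default: 1024)
ram_ok() {
  [ "$1" -ge "${2:-1024}" ]
}

# Example usage on a live system:
# ram_ok "$(free -m | available_from_free)" && echo "PASS" || echo "FAIL: under 1 GiB available"
```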

Common failures

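  • OOM kill during first prompt - The context window is still too large; lower OLLAMA_NUM_CTX in steps until the prompt completes.
  • Response quality degradation - A context window that is too small truncates the conversation history; increase it to 2048 if RAM permits.
  • Model fails to generate - The Q4 variant may be missing for this model; check the tags listed for the model in the Ollama library, or try a different base model or quantization level.
  • Environment variable not persisting - The export command must run in the same shell as subsequent Ollama commands. If Ollama runs as a systemd service, the server does not inherit variables from your shell; set them with systemctl edit ollama.service instead.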
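
To confirm that a failed prompt was an OOM kill rather than some other crash, the kernel log can be searched. Here is a minimal sketch: the function name `oom_lines` is hypothetical, and the patterns assume the usual Linux kernel messages ("Out of memory: Killed process ..." and "oom-kill:"). Reading the kernel log may require sudo on some systems.

```shell
#!/usr/bin/env bash
# Filter kernel log text on stdin down to lines that indicate an OOM kill.
oom_lines() {
  grep -Ei 'out of memory|oom-kill' || true   # "|| true": finding no match is not an error
}

# Example usage on a live system (kernel messages from the current boot):
# journalctl -k -b --no-pager | oom_lines
```

If this prints nothing after a failed run, the cause is probably not memory exhaustion, and the other entries in the list above are worth checking first.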
