Performance Tuning — Local AI on macOS (Chapter 9)

Performance tuning on Apple Silicon has two levers: memory allocation and batch sizing. Everything else is secondary.

Memory headroom is the first tuning step. If your model is consuming 90% of RAM, reduce the context window. In Ollama:

# Run with a reduced context (ollama run has no context flag; set it inside the session)
ollama run llama3.2:3b
>>> /set parameter num_ctx 1024   # 1024 max context tokens instead of the model default

# Or set in the API
curl -X POST http://localhost:11434/api/generate \
  -d '{"model":"llama3.2:3b","prompt":"Hello","options":{"num_ctx":1024}}'

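A smaller context window directly reduces memory usage. The KV cache (which stores the attention keys and values for every token in context) scales roughly linearly with context length, so cutting from 4096 to 1024 tokens shrinks the cache by about 75%. The saving in total memory is smaller because the model weights are unaffected, and any throughput gain depends on how much memory pressure the larger context was causing.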
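
To make the linear scaling concrete, the sketch below estimates KV cache size for a given context length. The layer count, KV-head count, and head dimension (28, 8, 128) are assumed values for a Llama 3.2 3B-class model with a 16-bit cache, and kv_cache_mib is a hypothetical helper, not part of Ollama. Check your model's metadata before relying on the numbers.

```shell
#!/usr/bin/env bash
# Rough KV cache estimate:
#   2 (keys + values) x layers x kv_heads x head_dim x bytes/element x context
# The defaults below are ASSUMED values for illustration only.
kv_cache_mib() {
  local ctx=$1
  local layers=${2:-28}
  local kv_heads=${3:-8}
  local head_dim=${4:-128}
  local bytes=${5:-2}
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1048576 ))
}

kv_cache_mib 4096   # prints 448
kv_cache_mib 1024   # prints 112
```

Under these assumptions the cache falls from about 448 MiB to 112 MiB. That is a 75% cut in the cache itself, while the memory taken by the weights stays the same.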

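The second lever is batch size, which controls how many prompt tokens are processed in parallel during prompt evaluation. In llama.cpp-based runtimes (including Ollama), it is set with the -b flag in the llama.cpp CLI or the num_batch option in Ollama. Thread count and Ollama's concurrency settings are separate knobs that are usually tuned alongside it: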

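# Ollama environment tuning (set before starting the server)
export OLLAMA_NUM_PARALLEL=4      # max concurrent requests; context memory grows with this
export OLLAMA_MAX_LOADED_MODELS=1  # only one model at a time
export OLLAMA_GPU_OVERHEAD=0       # bytes of VRAM to reserve per GPU; 0 reserves nothing extra

# For llama.cpp CLI (newer builds name these binaries llama-quantize and llama-cli)
./quantize model.gguf model-q4km.gguf Q4_K_M   # arguments: input, output, quantization type
./main -m model-q4km.gguf -c 2048 -b 512 -t 8 -ngl 99
# -b 512: process up to 512 prompt tokens per batch
# -t 8: use 8 CPU threads
# -ngl 99: offload all layers to GPU (Metal)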
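
In Ollama, batch size can also be set per request through the options object, alongside num_ctx. The sketch below builds such a request body. ollama_payload is a hypothetical helper, the values 1024 and 256 are illustrative rather than recommended, and the commented curl call assumes a local Ollama server on the default port.

```shell
#!/usr/bin/env bash
# Build a /api/generate request body with explicit num_ctx and num_batch.
# NOTE: no JSON escaping is done, so model and prompt must be plain strings
# without quotes or backslashes.
ollama_payload() {
  local model=$1 prompt=$2 num_ctx=$3 num_batch=$4
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d,"num_batch":%d}}\n' \
    "$model" "$prompt" "$num_ctx" "$num_batch"
}

ollama_payload "llama3.2:3b" "Hello" 1024 256

# To send it (requires a running server):
# curl http://localhost:11434/api/generate \
#   -d "$(ollama_payload "llama3.2:3b" "Hello" 1024 256)"
```

Keeping the payload in one function makes it easy to rerun the same prompt with different num_batch values and compare timings.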

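Real failure mode: setting -ngl 99 on an M1 with 8 GB can cause an out-of-memory failure when the offloaded layers of a 7B model plus the KV cache need more GPU memory than macOS will allocate. Note that -ngl sets the total number of layers offloaded, so -ngl 1 offloads a single layer and -ngl 32 offloads every layer of a 32-layer model. On a constrained chip, start with a value well below the layer count (for example -ngl 16 for a 32-layer model) and raise it until memory pressure appears.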
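
A starting value for -ngl can be estimated from the model file size and a memory budget. The sketch below assumes every layer takes roughly an equal share of the file and sets aside a fixed reserve for the KV cache and scratch buffers. estimate_ngl is a hypothetical helper, and the figures in the example (3500 MiB budget, 4200 MiB model, 512 MiB reserve) are made up for illustration, not measured.

```shell
#!/usr/bin/env bash
# Estimate how many layers fit in a GPU memory budget.
# Assumes each layer is about model_size / n_layers. All sizes are in MiB.
estimate_ngl() {
  local budget=$1 model=$2 layers=$3 reserve=${4:-512}
  local usable=$(( budget - reserve ))
  if [ "$usable" -le 0 ]; then
    echo 0
    return
  fi
  local fit=$(( usable * layers / model ))
  # Never report more layers than the model has.
  if [ "$fit" -gt "$layers" ]; then
    fit=$layers
  fi
  echo "$fit"
}

estimate_ngl 3500 4200 32   # prints 22
```

Treat the result as a first guess: run with that value, watch memory pressure in Activity Monitor, and adjust up or down.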

Thermal management matters on MacBooks: Metal performance drops when the chip reaches its thermal limits. You cannot change the thermal policy, but you can reduce sustained load by limiting batch sizes or using a slightly smaller model.