Performance Tuning — Ollama — Installation to Mastery (Chapter 9)

Several parameters affect inference speed and quality. Tuning them requires understanding the tradeoffs between response speed, coherence, and resource usage.

Context Window Size

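The num_ctx parameter sets the context window: the number of tokens the model can consider at once. Smaller contexts are faster and use less memory but limit long conversations. The ollama run command has no flag for model parameters, so set them from inside the interactive session with /set parameter:

ollama run llama3.2:1b
>>> /set parameter num_ctx 512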

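Reducing num_ctx from the default (commonly 2048 or 4096, depending on the Ollama version and configuration) to 512 cuts memory usage, mainly by shrinking the KV cache, and speeds up processing for short prompts. Input that exceeds the window is truncated.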
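Parameters such as num_ctx can also be supplied per request through the "options" field of the REST API. The following is a minimal Python sketch, assuming a local server on the default port 11434; the helper names build_payload and generate are illustrative and not part of Ollama, and the prompt is a placeholder.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_payload(model, prompt, **options):
    """Build a non-streaming /api/generate request body with per-request options."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Request a 512-token context window for this call only.
payload = build_payload("llama3.2:1b", "Summarize TCP in one sentence.", num_ctx=512)
# With a server running, generate(payload)["response"] would hold the model's text.
```

Other parameters discussed in this chapter (temperature, num_gpu, num_batch) can be passed the same way as extra keyword arguments.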

Temperature and Sampling

Temperature controls randomness:

0.0 - Deterministic output. Same prompt gives same response.
0.7 - Balanced creativity. Good for general conversation.
1.0 - High creativity. May produce incoherent responses.
1.5+ - Chaotic. Often unusable.

Use temperature 0 for code generation or factual answers. Use temperature 0.8 or higher for creative writing.

ollama run llama3.2:1b
>>> /set parameter temperature 0.0
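To see why low temperatures make output more predictable, it helps to look at how temperature typically reshapes the token probabilities. The sketch below shows the standard temperature-scaled softmax; it is not Ollama's actual sampling code, the logits are made-up scores for three candidate tokens, and treating temperature 0 as a pure argmax is an assumption made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding, all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability shifts from the top-scoring token toward the alternatives, which is why high settings produce more varied and eventually incoherent text.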

GPU Layer Allocation

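By default, Ollama places as many model layers on the GPU as fit in the available VRAM. The num_gpu parameter sets how many layers are sent to the GPU. For very large models or systems with limited VRAM, you can lower it so the remaining layers run on the CPU:

ollama run llama3.2:3b
>>> /set parameter num_gpu 16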

Lower num_gpu values reduce VRAM usage but slow down inference. The optimal value depends on your GPU memory and model size.
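A rough starting value for num_gpu can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope heuristic, not how Ollama itself decides: it assumes every layer is the same size, and the reserve_gb margin (for the KV cache and other buffers) and the example figures are hypothetical. Treat the result as a first guess to refine by measurement.

```python
def estimate_gpu_layers(model_size_gb, total_layers, free_vram_gb, reserve_gb=1.0):
    """Estimate how many equally sized layers fit in VRAM after a safety reserve."""
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    per_layer_gb = model_size_gb / total_layers  # assumption: uniform layer size
    return min(total_layers, int(usable_gb // per_layer_gb))


# Hypothetical example: an 8 GB model with 32 layers and 6 GB of free VRAM.
suggested = estimate_gpu_layers(model_size_gb=8.0, total_layers=32, free_vram_gb=6.0)
print("suggested num_gpu:", suggested)
```

If the estimate equals the total layer count, the model should fit entirely and num_gpu does not need to be lowered.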

Batch Size

The num_batch parameter controls how many tokens are processed together. Higher values increase throughput but require more memory:

ollama run llama3.2:1b
>>> /set parameter num_batch 512

For typical single-user scenarios, the default batch size works well. Increase it for high-throughput batch processing.

Measuring Performance

Use the timing information from API responses:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Write a paragraph about computers",
  "stream": false
}'

Key metrics (all durations are reported in nanoseconds):

total_duration - Total request time
load_duration - Time to load model into memory
prompt_eval_duration - Time to process input tokens
eval_duration - Time to generate output tokens
eval_count - Number of tokens generated

Tokens per second = eval_count / (eval_duration / 1e9)
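
The formula can be applied directly to a parsed response. Below is a minimal Python sketch that converts the timing fields listed above into seconds and computes generation speed; the function name summarize_timing is illustrative, and the sample numbers are made up rather than measured.

```python
NS_PER_SECOND = 1e9


def summarize_timing(response):
    """Convert the nanosecond timing fields of a response into seconds and tokens/s."""
    eval_seconds = response["eval_duration"] / NS_PER_SECOND
    return {
        "total_s": response["total_duration"] / NS_PER_SECOND,
        "load_s": response["load_duration"] / NS_PER_SECOND,
        "prompt_eval_s": response["prompt_eval_duration"] / NS_PER_SECOND,
        "eval_s": eval_seconds,
        "tokens_per_second": response["eval_count"] / eval_seconds,
    }


# Made-up response values for illustration; real ones come from /api/generate.
sample = {
    "total_duration": 5_000_000_000,
    "load_duration": 1_000_000_000,
    "prompt_eval_duration": 500_000_000,
    "eval_duration": 3_000_000_000,
    "eval_count": 150,
}
stats = summarize_timing(sample)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

When comparing settings, run the same prompt several times and ignore the first result, since load_duration inflates total_duration whenever the model is not already in memory.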