RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Ollama — Installation to Mastery

09. Performance Tuning

Chapter 9 of 20 · 20 min
KEY INSIGHT

The fastest model is useless if the output quality suffers. Tune `temperature` and `num_ctx` for your use case, then measure actual throughput with `eval_count` and `eval_duration`.

Several parameters affect inference speed and quality. Tuning them requires understanding the tradeoffs between response speed, coherence, and resource usage.

Context Window Size

The num_ctx parameter sets the context window: the number of tokens the model can consider at once. Smaller contexts are faster and use less memory, but limit long conversations. Set it from inside an interactive session:

ollama run llama3.2:1b
>>> /set parameter num_ctx 512

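Reducing num_ctx from the default (which depends on the Ollama version; commonly 2048 or 4096) to 512 cuts memory usage and speeds up processing for short prompts. Input that does not fit in the window is truncated, so size it for your longest expected prompt plus response.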
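Outside the interactive prompt, the same setting can be passed per request through the `options` field of the generate endpoint. The sketch below is one way to build such a request from a shell. The helper name `ctx_payload` is hypothetical, and it assumes the prompt contains no double quotes or backslashes, since it does no JSON escaping.

```shell
#!/bin/sh
# Build a /api/generate request body that overrides num_ctx for one request.
# ctx_payload is a hypothetical helper, not part of Ollama.
# Assumption: the prompt has no double quotes or backslashes (no JSON escaping).
ctx_payload() {
  model=$1
  prompt=$2
  num_ctx=$3
  printf '{"model": "%s", "prompt": "%s", "stream": false, "options": {"num_ctx": %s}}\n' \
    "$model" "$prompt" "$num_ctx"
}

# Example (needs a running Ollama server on the default port):
#   ctx_payload llama3.2:1b "Summarize: the sky is blue." 512 |
#     curl -s http://localhost:11434/api/generate -d @-
```

Keeping the payload builder separate from the curl call makes it easy to inspect the JSON before sending it.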

Temperature and Sampling

Temperature controls randomness:

  • 0.0 - Near-deterministic output. The same prompt usually gives the same response.
  • 0.7 - Balanced creativity. Good for general conversation.
  • 1.0 - High creativity. May produce incoherent responses.
  • 1.5+ - Chaotic. Often unusable.

Use temperature 0 for code generation or factual questions. Use temperature 0.8+ for creative writing. To set it inside an interactive session:

ollama run llama3.2:1b
>>> /set parameter temperature 0.0
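To keep a temperature setting across sessions, it can also be baked into a derived model with a Modelfile. This is a minimal sketch; the scratch file path and the derived model name `llama3.2-precise` are illustrative choices, not names from this chapter.

```shell
#!/bin/sh
# Write a Modelfile that pins temperature 0 on top of llama3.2:1b.
# The file path and the derived model name are hypothetical examples.
modelfile="${TMPDIR:-/tmp}/Modelfile.precise"

cat > "$modelfile" <<'EOF'
FROM llama3.2:1b
PARAMETER temperature 0
EOF

# Then build and run the derived model (requires Ollama to be installed):
#   ollama create llama3.2-precise -f "$modelfile"
#   ollama run llama3.2-precise
```

A per-request `options` value or an interactive `/set parameter` should still override what the Modelfile sets, so the derived model acts as a default rather than a lock.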

GPU Layer Allocation

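By default, Ollama places as many model layers on the GPU as fit in VRAM. The num_gpu parameter sets how many layers go to the GPU; the remaining layers run on the CPU. For very large models or systems with limited VRAM, you can lower it from inside an interactive session:

ollama run llama3.2:3b
>>> /set parameter num_gpu 16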

Lower num_gpu values reduce VRAM usage but slow down inference. The optimal value depends on your GPU memory and model size.
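As a rough starting point for num_gpu, one can estimate how many layers fit in free VRAM. The sketch below is a back-of-the-envelope heuristic, not how Ollama itself decides. It assumes all layers are about the same size and that a fixed reserve (1024 MiB by default, an arbitrary guess) covers the context cache and runtime overhead. The function name and the example numbers are hypothetical.

```shell
#!/bin/sh
# Rough estimate of how many layers fit on the GPU.
# Arguments: model size (MiB), total layer count, free VRAM (MiB),
#            optional reserve for cache/overhead (MiB, default 1024).
# Assumption: every layer is roughly model_size / layer_count in size.
estimate_gpu_layers() {
  model_mib=$1
  total_layers=$2
  free_vram_mib=$3
  reserve_mib=${4:-1024}
  awk -v m="$model_mib" -v n="$total_layers" -v v="$free_vram_mib" -v r="$reserve_mib" 'BEGIN {
    per_layer = m / n            # average size of one layer
    usable = v - r               # VRAM left after the reserve
    if (usable < 0) usable = 0
    fit = int(usable / per_layer)
    if (fit > n) fit = n         # cannot offload more layers than exist
    print fit
  }'
}

# Example with made-up numbers: a 2000 MiB model with 28 layers, 1800 MiB free.
#   estimate_gpu_layers 2000 28 1800
```

Treat the result as a first guess, then check the actual CPU/GPU split with `ollama ps` after loading the model and adjust from there.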

Batch Size

The num_batch parameter controls how many prompt tokens are processed together. Higher values increase throughput but require more memory. To set it inside an interactive session:

ollama run llama3.2:1b
>>> /set parameter num_batch 512

For typical single-user scenarios, the default batch size works well. Increase it for high-throughput batch processing.

Measuring Performance

Use the timing information from API responses:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Write a paragraph about computers",
  "stream": false
}'

Key metrics (all durations are in nanoseconds):

  • total_duration - Total request time
  • load_duration - Time to load model into memory
  • prompt_eval_duration - Time to process input tokens
  • eval_duration - Time to generate output tokens
  • eval_count - Number of tokens generated

Tokens per second = eval_count / (eval_duration / 1e9)
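The formula above can be wrapped in a small shell helper. This is a sketch: `tokens_per_second` is a hypothetical function name, and the two arguments are the `eval_count` and `eval_duration` values copied from an API response.

```shell
#!/bin/sh
# Compute generation speed from two fields of an Ollama API response.
# Arguments: eval_count (tokens), eval_duration (nanoseconds).
tokens_per_second() {
  eval_count=$1
  eval_duration_ns=$2
  awk -v c="$eval_count" -v d="$eval_duration_ns" 'BEGIN {
    if (d <= 0) { print "0.00"; exit }   # guard against division by zero
    printf "%.2f\n", c / (d / 1e9)       # nanoseconds -> seconds
  }'
}

# Example: 100 tokens generated in 2 seconds (2000000000 ns).
#   tokens_per_second 100 2000000000
#
# If jq is installed, the same value can be read straight from a saved response:
#   jq '.eval_count / (.eval_duration / 1e9)' response.json
```

Because eval_duration excludes model loading and prompt processing, this number reflects generation speed only, not end-to-end latency.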

EXERCISE

Run the same prompt with temperature 0, 0.7, and 1.2. Compare the responses for factual content (like "What is the capital of France?") versus creative content (like "Write a story about a robot").
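One way to run the exercise without retyping each request is a small loop over the three temperatures. The sketch below assumes a server on the default port and the llama3.2:1b model used earlier. The helper names `temp_payload` and `run_exercise` are hypothetical, and prompts must not contain double quotes or backslashes because no JSON escaping is done.

```shell
#!/bin/sh
# Hypothetical helper: request body with a per-request temperature override.
# Assumption: the prompt has no double quotes or backslashes (no JSON escaping).
temp_payload() {
  printf '{"model": "llama3.2:1b", "prompt": "%s", "stream": false, "options": {"temperature": %s}}\n' \
    "$1" "$2"
}

# Send one prompt at each temperature from the exercise and print the raw replies.
# Needs a running Ollama server; defined here but not called automatically.
run_exercise() {
  prompt=$1
  for t in 0 0.7 1.2; do
    echo "--- temperature $t ---"
    temp_payload "$prompt" "$t" | curl -s http://localhost:11434/api/generate -d @-
    echo
  done
}

# Usage:
#   run_exercise "What is the capital of France?"
#   run_exercise "Write a story about a robot"
```

Running each prompt more than once per temperature makes the difference easier to see, since variation between repeated runs is the effect being measured.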
