09. Performance Tuning
Several parameters affect inference speed and quality. Tuning them requires understanding the tradeoffs between response speed, coherence, and resource usage.
Context Window Size
The num_ctx parameter sets the context window: the number of tokens the model can consider at once. Smaller contexts are faster but limit long conversations:
ollama run llama3.2:1b
>>> /set parameter num_ctx 512
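Reducing num_ctx from the default (commonly 2048 or 4096, depending on the Ollama version) to 512 cuts memory usage and speeds up processing for short prompts. Input longer than the context window is truncated, so keep the value above your longest expected prompt plus response.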
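To make the memory effect concrete, here is a rough sketch of how the key/value cache grows with num_ctx. The layer count, KV head count, and head dimension below are assumed values for a small Llama-style model, not figures read from Ollama, and the real footprint also includes model weights and compute buffers.

```python
# Rough KV-cache size estimate. All model-shape numbers are assumptions
# chosen for illustration; check your model's metadata for real values.

def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_ctx tokens.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Assumed shape for a ~1B-parameter Llama-style model (hypothetical values).
N_LAYERS = 16
N_KV_HEADS = 8
HEAD_DIM = 64

MIB = 1024 ** 2
for ctx in (512, 4096, 8192):
    size = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"num_ctx={ctx}: ~{size / MIB:.0f} MiB of KV cache")
```

The cache scales linearly with num_ctx, so under these assumptions going from 4096 to 512 tokens shrinks it by a factor of eight.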
Temperature and Sampling
Temperature controls randomness:
0.0 - Deterministic output. Same prompt gives same response.
0.7 - Balanced creativity. Good for general conversation.
1.0 - High creativity. May produce incoherent responses.
1.5+ - Chaotic. Often unusable.
Use temperature 0 for code generation or factual questions. Use temperature 0.8+ for creative writing.
ollama run llama3.2:1b
>>> /set parameter temperature 0.0
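The effect of temperature can be illustrated with a minimal sketch of temperature-scaled softmax. This is the textbook formulation, not Ollama's actual sampler (which also applies settings such as top_k and top_p), and the logits are made-up numbers.

```python
import math


def temperature_softmax(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.0, 0.7, 1.0, 1.5):
    probs = temperature_softmax(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, which is why output becomes repeatable; higher temperatures spread probability toward unlikely tokens.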
GPU Layer Allocation
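By default, Ollama places as many model layers on the GPU as fit in available VRAM. The num_gpu parameter sets the number of layers sent to the GPU; for very large models or systems with limited VRAM, you can lower it to keep the remaining layers on the CPU: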
ollama run llama3.2:3b
>>> /set parameter num_gpu 16
Lower num_gpu values reduce VRAM usage but slow down inference. The optimal value depends on your GPU memory and model size.
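As a starting point for choosing num_gpu, the following back-of-the-envelope sketch assumes every layer is the same size and reserves a fixed amount of VRAM for everything else. The function name, the reserve size, and the example numbers are hypothetical; it ignores the KV cache and compute buffers, so treat the result as an upper bound to test against.

```python
GIB = 1024 ** 3


def estimate_gpu_layers(total_layers, model_bytes, free_vram_bytes,
                        reserve_bytes=GIB // 2):
    """Guess how many equally sized layers fit in the VRAM left after a reserve.

    Uses integer arithmetic so the result does not depend on float rounding.
    """
    usable = max(free_vram_bytes - reserve_bytes, 0)
    fit = usable * total_layers // model_bytes
    return min(total_layers, fit)


# Hypothetical example: a 28-layer model of about 2 GiB with 1.5 GiB VRAM free.
layers = estimate_gpu_layers(total_layers=28,
                             model_bytes=2 * GIB,
                             free_vram_bytes=3 * GIB // 2)
print(f"try num_gpu around {layers}")
```

If generation fails or slows sharply at the estimated value, step num_gpu down a few layers at a time and measure again.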
Batch Size
The num_batch parameter controls how many tokens are processed together. Higher values increase throughput but require more memory:
ollama run llama3.2:1b --param num_batch 512
For typical single-user scenarios, the default batch size works well. Increase it for high-throughput batch processing.
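The parameters in this section can also be set per request through the options field of the /api/generate endpoint. The sketch below builds such a request with the standard library; the helper names build_payload and generate are illustrative, and the URL assumes a local server on the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes a local default install


def build_payload(model, prompt, **options):
    """Build a non-streaming generate request with optional tuning parameters."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        # e.g. num_ctx, temperature, num_gpu, num_batch
        payload["options"] = options
    return payload


def generate(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_payload("llama3.2:1b", "Write a paragraph about computers",
                        num_ctx=512, num_batch=512, temperature=0.0)
print(json.dumps(payload, indent=2))
# With a server running: result = generate(payload)
```

Setting options per request keeps experiments reproducible, since each call records exactly which values were used.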
Measuring Performance
Use the timing information from API responses:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:1b",
"prompt": "Write a paragraph about computers",
"stream": false
}'
Key metrics:
total_duration - Total request time in nanoseconds
load_duration - Time to load model into memory
prompt_eval_duration - Time to process input tokens
eval_duration - Time to generate output tokens
eval_count - Number of tokens generated
Tokens per second = eval_count / (eval_duration / 1e9)
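The formula above can be applied directly to a parsed API response. In this sketch the sample response contains made-up numbers, and the prompt-side rate relies on the prompt_eval_count field, which the API returns alongside the fields listed above.

```python
def tokens_per_second(response):
    """Generation speed: output tokens divided by generation time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def prompt_tokens_per_second(response):
    """Prompt processing speed, using prompt_eval_count and prompt_eval_duration."""
    return response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)


# Made-up timing values in nanoseconds, shaped like an /api/generate response.
sample = {
    "total_duration": 5_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_count": 20,
    "prompt_eval_duration": 500_000_000,
    "eval_count": 120,
    "eval_duration": 4_000_000_000,
}

print(f"generation: {tokens_per_second(sample):.1f} tokens/s")
print(f"prompt:     {prompt_tokens_per_second(sample):.1f} tokens/s")
```

Comparing these two rates before and after changing a parameter shows whether the change helped prompt processing, generation, or neither.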
Run the same prompt with temperature 0, 0.7, and 1.2. Compare the responses for factual content (like "What is the capital of France?") versus creative content (like "Write a story about a robot").