13. Temperature and Sampling
Understanding Temperature
Temperature controls how "random" the model's output is.
High temperature (e.g., 0.9-1.2):
- More creative, varied output
- Good for brainstorming, creative writing
- Higher chance of unexpected (sometimes wrong) responses
Low temperature (e.g., 0.1-0.3):
- More focused, deterministic output
- Good for factual responses, code, structured tasks
- More consistent across runs
Temperature = 0:
- Greedy decoding—always picks the most likely next token
- Deterministic in principle, but can fall into repetitive loops on longer outputs
How It Works
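At each step, the model produces a probability distribution over possible next tokens. The model's raw scores are divided by the temperature before they are converted to probabilities. With temperature = 1, sampling uses the model's probabilities unchanged. Lower temperature sharpens the distribution, concentrating probability on the tokens that were already likely. Higher temperature flattens the distribution, giving low-probability tokens a better chance.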
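Written out, this is the usual temperature-scaled softmax. The symbols are introduced here for illustration: $z_i$ is the model's raw score (logit) for token $i$ and $T$ is the temperature. This is the textbook formulation, not a description of any specific runtime's internals.

```latex
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
\qquad\text{equivalently}\qquad
p_i(T) \propto p_i(1)^{1/T}
```

As $T \to 0$ the distribution collapses onto the single most likely token (greedy decoding). As $T \to \infty$ it approaches uniform.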
Token probabilities (example):
"the": 0.15, "a": 0.08, "cat": 0.05, "dog": 0.04, ...
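Temperature 0.1: "the" takes nearly all of the probability (above 0.99, assuming no remaining token is more likely than "dog")
Temperature 1.0: Keep original distribution
Temperature 2.0: Flatter distribution; "the" is still the most likely token, but rarer tokens are picked noticeably more often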
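The effect on the example distribution can be checked with a short sketch. It assumes only the four listed tokens exist (the "..." tail is dropped and the rest renormalized), so the numbers are illustrative rather than what a real model would produce. The helper name `apply_temperature` is made up for this example.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution as if it were sampled at `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    # log(p) / T equals the logit divided by T, up to a constant that cancels out.
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Assumption: only these four tokens exist, renormalized to sum to 1.
tokens = ["the", "a", "cat", "dog"]
raw = [0.15, 0.08, 0.05, 0.04]
base = [p / sum(raw) for p in raw]

for t in (0.1, 1.0, 2.0):
    rescaled = apply_temperature(base, t)
    print(t, {tok: round(p, 3) for tok, p in zip(tokens, rescaled)})
```

In this four-token version, "the" ends up above 0.99 at temperature 0.1 and at roughly 0.35 at temperature 2.0, so the distribution is flatter but still far from uniform.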
Setting Temperature in Ollama
# Set temperature inside an interactive session
ollama run llama3.2
>>> /set parameter temperature 0.9
>>> Write a poem about stars
# Or persist it in a Modelfile, then rebuild the model with ollama create
echo 'PARAMETER temperature 0.7' >> Modelfile
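Temperature can also be set per request through Ollama's local HTTP API, which accepts sampling parameters in an `options` object. The sketch below assumes a server on the default address `http://localhost:11434` and a pulled model named `llama3.2`. The helper names `build_payload` and `generate` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address

def build_payload(prompt, temperature, model="llama3.2"):
    """Build the JSON body for one non-streaming generation request."""
    return {
        "model": model,  # assumption: this model has already been pulled
        "prompt": prompt,
        "stream": False,  # ask for a single JSON reply instead of a stream
        "options": {"temperature": temperature},
    }

def generate(prompt, temperature, model="llama3.2"):
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_payload(prompt, temperature, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

# Show the request body without contacting the server.
print(json.dumps(build_payload("Write a poem about stars", 0.9), indent=2))
```

With the server running, `generate("Write a poem about stars", temperature=0.9)` returns the model's text. Other sampling parameters such as `top_p` and `top_k` go in the same `options` object.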
Other Sampling Parameters
top_p (nucleus sampling):
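Restricts sampling to the smallest set of most likely tokens whose combined probability reaches top_p. top_p 0.9 means the model samples only from the most likely tokens that together account for 90% of the probability mass; the remaining low-probability tail is discarded.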
ollama run llama3.2
>>> /set parameter top_p 0.9
>>> Continue this story
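A minimal sketch of nucleus filtering, using a made-up toy distribution rather than real model output. It includes the token that crosses the threshold, which is the common convention, though implementations can differ in such details. The function name `top_p_filter` is hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; drop the rest of the tail
    total = sum(kept.values())
    # Renormalize so the surviving tokens form a valid distribution.
    return {token: p / total for token, p in kept.items()}

# Hypothetical toy distribution, not real model output.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(top_p_filter(probs, 0.9))
```

Here "the", "a", and "cat" together reach 0.95, so "dog" is removed before sampling.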
top_k:
Limits sampling to the k most likely tokens, regardless of how much probability they hold. top_k 40 means only the 40 most likely tokens can be chosen.
ollama run llama3.2
>>> /set parameter top_k 40
>>> Explain recursion
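For comparison, top_k filtering can be sketched the same way. The distribution is again a made-up toy example and `top_k_filter` is a hypothetical name. Unlike top_p, the cutoff is a fixed count of tokens, not a share of probability mass.

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:k])  # everything after position k is discarded
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical toy distribution, not real model output.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(top_k_filter(probs, 2))
```

With k = 2 only "the" and "a" survive, and their probabilities are rescaled to sum to 1.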
Typical values for creative tasks: temperature 0.8-1.0, top_p 0.9-1.0
Typical values for factual/coding: temperature 0.2-0.5, top_p 0.9
Common Issues
Too high temperature:
- Nonsensical output
- Rambling or drifting off topic
- Incoherence
Too low temperature:
- Repetitive, formulaic responses
- "Safe" but boring
- May miss creative solutions
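Interaction with top_p: Temperature and top_p both reshape the same distribution, so it is often best to adjust one and leave the other at its default. Ollama's default settings are usually a reasonable starting point.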
Practical Guidelines
| Task | Recommended Settings |
|---|---|
| Creative writing, brainstorming | temp 0.8-1.0, top_p 0.95 |
| Code generation | temp 0.2-0.5, top_p 0.9 |
| Factual Q&A | temp 0.1-0.3, top_p 0.9 |
| Summarization | temp 0.3-0.5, top_p 0.9 |
| Translation | temp 0.2-0.4, top_p 0.9 |
Take the same creative prompt and run it three times with Ollama: once with temperature 0.2, once with temperature 0.7, once with temperature 1.2. Compare the outputs for creativity, coherence, and variation. Then try the same with a factual question (e.g., "What is the capital of Brazil?") and note whether the answer stays correct and how much the wording changes as temperature rises.
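One way to run this comparison is a small script against Ollama's local HTTP API. This is a sketch under assumptions: a server on the default port 11434, a pulled model named `llama3.2`, and hypothetical helper names (`ask`, `compare`, `distinct_counts`). It repeats each temperature a few times so that run-to-run variation is visible.

```python
import json
import urllib.request

def ask(prompt, temperature, model="llama3.2"):
    """Query a local Ollama server once (assumed at the default port 11434)."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

def compare(prompt, temperatures=(0.2, 0.7, 1.2), runs=3, ask_fn=ask):
    """Collect `runs` outputs per temperature so they can be read side by side."""
    return {t: [ask_fn(prompt, t) for _ in range(runs)] for t in temperatures}

def distinct_counts(results):
    """Count how many different outputs each temperature produced."""
    return {t: len(set(outputs)) for t, outputs in results.items()}

# Example (requires a running server):
#   results = compare("Write a two-line poem about stars")
#   print(distinct_counts(results))
```

The count of distinct outputs is only a rough proxy for variation; reading the outputs is still the better way to judge coherence and creativity.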