Temperature and Sampling — What is Local AI — And Why It Matters (Chapter 13)

Understanding Temperature

Temperature controls how "random" the model's output is.

High temperature (e.g., 0.9-1.2):

More creative, varied output
Good for brainstorming, creative writing
Higher chance of unexpected (sometimes wrong) responses

Low temperature (e.g., 0.1-0.3):

More focused, deterministic output
Good for factual responses, code, structured tasks
More consistent across runs

Temperature = 0:

Greedy decoding—always picks the most likely next token
Deterministic but often lower quality (repetitive)

How It Works

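At each step, the model produces a probability distribution over possible next tokens. Before sampling, the model's raw scores (logits) are divided by the temperature and then converted to probabilities. With temperature = 1, sampling uses the model's probabilities unchanged. A temperature below 1 sharpens the distribution, so tokens that were already likely become even more likely. A temperature above 1 flattens it, giving low-probability tokens a better chance.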

Token probabilities (example):
"the": 0.15, "a": 0.08, "cat": 0.05, "dog": 0.04, ...

Temperature 0.1: "the" becomes ~0.9 probability
Temperature 1.0: Keep original distribution
Temperature 2.0: Flatter distribution; low-probability tokens are picked noticeably more often (it approaches uniform only at much higher temperatures)
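The rescaling described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular runtime implements it: the function name `apply_temperature` is made up for this example, and the toy distribution uses only the four tokens listed above, renormalized to sum to 1 (an assumption, since a real vocabulary has tens of thousands of entries).

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature > 0.

    Equivalent to dividing the logits by the temperature and re-applying softmax.
    """
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding and is not handled here.
        raise ValueError("temperature must be positive")
    # Work in log space: log(p) / T, then a numerically stable softmax.
    scaled = {tok: math.log(p) / temperature for tok, p in probs.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Toy distribution: the four listed tokens only, renormalized (assumption).
raw = {"the": 0.15, "a": 0.08, "cat": 0.05, "dog": 0.04}
raw_total = sum(raw.values())
toy = {tok: p / raw_total for tok, p in raw.items()}

for t in (0.1, 1.0, 2.0):
    rescaled = apply_temperature(toy, t)
    print(t, {tok: round(p, 3) for tok, p in rescaled.items()})
```

In this toy case, temperature 0.1 pushes nearly all probability onto "the", while temperature 2.0 narrows the gaps between tokens without making them equal.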

Setting Temperature in Ollama

# Set temperature for the current interactive session
ollama run llama3.2:3b
>>> /set parameter temperature 0.9
>>> Write a poem about stars

# Or in Modelfile
echo 'PARAMETER temperature 0.7' >> Modelfile
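Sampling settings can also be passed per request through Ollama's local HTTP API, where they go in an `options` object. The sketch below assumes the server is running at the default address `http://localhost:11434` and that a model named `llama3.2` has been pulled; the helper names `build_generate_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model, prompt, **options):
    """Build the JSON body for /api/generate; sampling settings go under "options"."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

def generate(model, prompt, **options):
    """Send the request and return the generated text (needs a running server)."""
    body = json.dumps(build_generate_request(model, prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

A call might look like `generate("llama3.2", "Write a poem about stars", temperature=0.9)`.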

Other Sampling Parameters

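top_p (nucleus sampling): Restricts sampling to the smallest set of most likely tokens whose combined probability reaches top_p. With top_p 0.9, the model samples only from the tokens that together make up the top 90% of probability mass and ignores the long tail of unlikely tokens.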

ollama run llama3.2:3b
>>> /set parameter top_p 0.9
>>> Continue this story
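The top_p cutoff can be sketched as follows. This is an illustrative version under simple assumptions, not the code any runtime actually uses: the function name `top_p_filter` and the probability values are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving tokens sum to 1 again.
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical next-token probabilities for illustration.
toy = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(top_p_filter(toy, 0.9))
```

With these values, "the", "a", and "cat" together cross the 0.9 threshold, so "dog" is dropped before sampling.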

top_k: Limits to the top k most likely tokens. top_k 40 means only the 40 most likely tokens can be chosen.

ollama run llama3.2:3b
>>> /set parameter top_k 40
>>> Explain recursion
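As a rough sketch of the top_k cutoff followed by sampling, under the assumption that the filter simply truncates and renormalizes (the names `top_k_filter` and `sample`, and the probabilities, are hypothetical):

```python
import random

def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token probabilities for illustration.
toy = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
rng = random.Random(0)  # fixed seed so repeated runs give the same draws
filtered = top_k_filter(toy, 2)
print(filtered, sample(filtered, rng))
```

With k = 2, only "the" and "a" can ever be drawn, however many times the sampler runs.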

Typical values for creative tasks: temperature 0.8-1.0, top_p 0.9-1.0
Typical values for factual/coding: temperature 0.2-0.5, top_p 0.9

Common Issues

Too high temperature:

Nonsensical output
Wrong or made-up details
Incoherence

Too low temperature:

Repetitive, formulaic responses
"Safe" but boring
May miss creative solutions

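Interaction with top_p: Temperature and top_p both control randomness, so changing them together makes the effect hard to predict. It is often best to tune one and leave the other at its default. Ollama's defaults are usually a fine starting point.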

Practical Guidelines

Task	Recommended Settings
Creative writing, brainstorming	temp 0.8-1.0, top_p 0.95
Code generation	temp 0.2-0.5, top_p 0.9
Factual Q&A	temp 0.1-0.3, top_p 0.9
Summarization	temp 0.3-0.5, top_p 0.9
Translation	temp 0.2-0.4, top_p 0.9
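One way to keep these settings handy is a small lookup table in code. The sketch below copies the ranges from the table above; the task keys, the `options_for` helper, and the choice of the midpoint of each temperature range are assumptions made for illustration, not recommendations from Ollama.

```python
# (temperature range, top_p) per task, copied from the table above.
# The key names are hypothetical.
PRESETS = {
    "creative": ((0.8, 1.0), 0.95),
    "code": ((0.2, 0.5), 0.9),
    "factual_qa": ((0.1, 0.3), 0.9),
    "summarization": ((0.3, 0.5), 0.9),
    "translation": ((0.2, 0.4), 0.9),
}

def options_for(task):
    """Return sampling options for a task, using the midpoint of its temperature range."""
    (low, high), top_p = PRESETS[task]
    return {"temperature": round((low + high) / 2, 2), "top_p": top_p}

print(options_for("code"))
```

The returned dictionary uses the same parameter names (`temperature`, `top_p`) as the Modelfile and session settings shown earlier.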