RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Transformer & LLM components

Top-p (Nucleus) Sampling

Top-p (nucleus) sampling is a text generation strategy that draws the next token from the smallest set of tokens whose cumulative probability reaches or exceeds a threshold p (e.g., 0.9). Instead of sampling from every possible next token, the sampler discards the low-probability tail, renormalizes the remaining probabilities, and samples from this 'nucleus' of likely candidates. This reduces the chance of nonsensical or overly random output while preserving diversity. Operators adjust top-p alongside temperature to control creativity: a lower p (e.g., 0.5) makes output more focused, while a higher p (e.g., 0.95) allows more variety.

Deeper dive

Top-p sampling addresses a limitation of top-k sampling, which always keeps a fixed number of tokens regardless of the shape of the probability distribution. For example, if the five most likely tokens have probabilities [0.4, 0.3, 0.2, 0.05, 0.05], top-k=5 keeps the last two low-probability tokens, potentially introducing noise. Top-p instead adds tokens in descending order of probability until the cumulative probability reaches p, so with p=0.9 it keeps only the first three tokens (0.4+0.3+0.2=0.9) and renormalizes their probabilities before sampling. The candidate set therefore adapts to the distribution: when the model is confident (sharp distribution), fewer tokens are considered; when it is uncertain (flat distribution), more tokens are included. In practice, top-p is often combined with temperature scaling: temperature divides the logits before softmax, flattening (temperature above 1) or sharpening (temperature below 1) the distribution, and top-p then selects from the result. As a rough starting point, p values of 0.8 to 0.95 are common for creative tasks, and 0.5 to 0.7 for factual or code generation.
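The filtering step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of any particular inference engine: the function name top_p_filter and the eps tolerance are assumptions introduced here (eps only absorbs floating-point rounding, since 0.4+0.3+0.2 does not sum to exactly 0.9 in binary floating point).

```python
def top_p_filter(probs, p, eps=1e-9):
    """Return (kept_indices, renormalized_probs) for the nucleus of `probs`.

    Hypothetical helper for illustration; `eps` is an assumed tolerance
    that absorbs floating-point rounding in the running sum.
    """
    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        # Stop as soon as the running total reaches the threshold p.
        if cumulative >= p - eps:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(probs[i] for i in kept)
    return kept, [probs[i] / total for i in kept]


# The distribution from the text: p=0.9 keeps only the first three tokens.
kept, renorm = top_p_filter([0.4, 0.3, 0.2, 0.05, 0.05], p=0.9)
print(kept, [round(x, 3) for x in renorm])

# The nucleus adapts to the distribution's shape at the same p.
sharp_kept, _ = top_p_filter([0.97, 0.01, 0.01, 0.01], p=0.9)
flat_kept, _ = top_p_filter([0.25, 0.25, 0.25, 0.25], p=0.9)
print(len(sharp_kept), len(flat_kept))
```

With the same p=0.9, the sharp distribution keeps a single token while the flat one keeps all four, which is the adaptive behavior that a fixed top-k cannot provide.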

Practical example

When generating a story with Llama 3.1 8B in LM Studio, setting top-p=0.9 and temperature=0.8 typically yields varied but coherent output. With top-p=0.5, the sampler keeps drawing from the same few high-probability tokens, so output tends to become predictable and can turn repetitive. With top-p=1.0 (effectively off), every token stays eligible, so the sampler occasionally picks a very unlikely token and the text can drift or lose coherence, especially at higher temperatures. On an RTX 4090, these settings have negligible impact on throughput (~80 tok/s in this example) because sampling happens after the forward pass and is cheap by comparison.
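To see how the settings above change the candidate set, the sketch below applies temperature and then top-p to a toy set of logits. The logits are invented for illustration and do not come from Llama 3.1 or any real model; the helper names softmax and sample_top_p are likewise assumptions, not an engine's API.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_top_p(logits, temperature, top_p, rng):
    """Sample one token index; also return the nucleus size (illustrative)."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p - 1e-9:  # small tolerance for float rounding
            break
    # random.choices normalizes the weights, so no explicit renormalization.
    weights = [probs[i] for i in nucleus]
    token = rng.choices(nucleus, weights=weights, k=1)[0]
    return token, len(nucleus)


# Hypothetical logits for an 8-token vocabulary (not from a real model).
toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]
rng = random.Random(0)  # fixed seed so runs are repeatable

for top_p in (0.5, 0.9, 1.0):
    token, size = sample_top_p(toy_logits, temperature=0.8, top_p=top_p, rng=rng)
    print(f"top_p={top_p}: nucleus size {size}, sampled token {token}")
```

On this toy distribution the nucleus grows from one token at top_p=0.5 to the full vocabulary at top_p=1.0. The extra work per step is a sort and a short loop over the vocabulary, which is small next to the model's forward pass.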

Workflow example

In Ollama, set top-p inside an interactive ollama run llama3.1:8b session with /set parameter top_p 0.9, persist it in a Modelfile with PARAMETER top_p 0.9, or pass it in the options object of an API request. In llama.cpp's command-line tool (llama-cli, formerly the main example), use --top-p 0.9. In Hugging Face Transformers, pass do_sample=True, top_p=0.9 to model.generate(). In vLLM, set top_p=0.9 in SamplingParams. Operators often tune top-p interactively in LM Studio's generation settings.
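For the Ollama API route, a request might be built as in the sketch below, using only the Python standard library. It assumes a local Ollama server on its default port (11434) with llama3.1:8b already pulled; the helper names build_generate_request and generate are hypothetical, introduced here for illustration.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3.1:8b",
                           temperature=0.8, top_p=0.9):
    """Build (but do not send) a POST request for Ollama's generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object, not a stream
        # Sampling settings go in the "options" object.
        "options": {"temperature": temperature, "top_p": top_p},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a live server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]


# Usage, once the server is running:
# print(generate("Write a two-sentence story about a lighthouse.", top_p=0.9))
```

Keeping request construction separate from sending makes it easy to inspect the exact options that will be submitted before any tokens are generated.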

Reviewed by Fredoline Eruo. See our editorial policy.
