Top-p (Nucleus) Sampling
Top-p (nucleus) sampling is a text generation strategy that samples the next token from the smallest set of tokens whose cumulative probability reaches a threshold p (e.g., 0.9). Instead of considering all possible next tokens, the sampler discards the low-probability tail and keeps only the "nucleus" of likely candidates. This reduces the chance of generating nonsensical or overly random outputs while maintaining diversity. Operators adjust top-p alongside temperature to control creativity: a lower p (e.g., 0.5) makes output more focused, while a higher p (e.g., 0.95) allows more variety.
Deeper dive
Top-p sampling addresses a limitation of top-k sampling, which always considers a fixed number of tokens regardless of the shape of the probability distribution. For example, if the top-5 tokens have probabilities [0.4, 0.3, 0.2, 0.05, 0.05], top-k=5 includes the last two low-probability tokens, potentially introducing noise. Top-p instead adds tokens in descending order of probability until the cumulative probability reaches p, so with p=0.9 it would keep only the first three tokens (0.4+0.3+0.2=0.9). The kept probabilities are then renormalized so they sum to 1 before a token is drawn. This adapts to the shape of the probability distribution: when the model is confident (sharp distribution), fewer tokens are considered; when it is uncertain (flat distribution), more tokens are included. In practice, top-p is often used with temperature scaling: temperature divides the logits before softmax, flattening or sharpening the distribution, and top-p then selects from the result. The exact order of sampling steps can differ between runtimes, so the same pair of values may behave slightly differently from one tool to another. Common p values range from 0.8 to 0.95 for creative tasks, and 0.5 to 0.7 for factual or code generation.
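The filtering step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular runtime implements it: the function names top_p_filter and sample_top_p are made up for this example, and the small eps tolerance is an assumption added so that floating-point rounding in the cumulative sum does not pull in an extra token at the p=0.9 boundary.

```python
import random

def top_p_filter(probs, p, eps=1e-9):
    """Return {token_index: renormalized_prob} for the smallest nucleus.

    Hypothetical helper for illustration. `eps` is an assumed tolerance
    that guards against floating-point rounding in the running sum.
    """
    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p - eps:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, p, rng=random):
    """Draw one token index from the nucleus (hypothetical helper)."""
    nucleus = top_p_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

# The five-token distribution from the text.
probs = [0.4, 0.3, 0.2, 0.05, 0.05]
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))  # [0, 1, 2]: the two 0.05 tokens are dropped
print(sample_top_p(probs, 0.9))  # one of 0, 1, 2, chosen at random
```

Real implementations operate on the full vocabulary as tensors and usually mask logits instead of building a dictionary, but the selection rule is the same.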
Practical example
When generating a story with Llama 3.1 8B in LM Studio, setting top-p=0.9 and temperature=0.8 yields varied but coherent outputs. With top-p=0.5, the model keeps picking from the same few high-probability tokens, which tends to make output repetitive. With top-p=1.0 (effectively off), the model samples from the full vocabulary, so unlikely tokens can occasionally slip in and derail the text, especially at higher temperatures. On an RTX 4090, these settings have negligible impact on throughput (~80 tok/s) since sampling happens after the forward pass and is cheap by comparison.
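To get a feel for how these settings change the candidate pool, the sketch below counts how many tokens survive at different p values and temperatures. The logits are invented toy values, not output from Llama 3.1, and nucleus_size is a hypothetical helper. It assumes temperature is applied before top-p, which may not match the order your runtime uses.

```python
import math

def nucleus_size(logits, p, temperature=1.0):
    """Count tokens in the top-p nucleus (hypothetical helper)."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    # Add tokens from most to least probable until the threshold is reached.
    cumulative = 0.0
    for count, prob in enumerate(probs, start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

# Made-up logits for an 8-token vocabulary (assumption, for illustration only).
toy_logits = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0]

for temperature in (0.8, 1.5):
    sizes = {p: nucleus_size(toy_logits, p, temperature) for p in (0.5, 0.9, 0.95, 1.0)}
    print(f"temperature={temperature}: {sizes}")
```

On these toy values the nucleus grows as p rises and as temperature rises, and p=1.0 keeps every token, which mirrors the behavior described above.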
Workflow example
In Ollama, ollama run has no command-line flag for sampling options. Set top-p inside an interactive session with /set parameter top_p 0.9, persist it in a Modelfile with PARAMETER top_p 0.9, or pass it in the options object of an API request. In llama.cpp's llama-cli (formerly the main example), use --top-p 0.9. In Hugging Face Transformers, pass do_sample=True, top_p=0.9 to model.generate(). In vLLM, set top_p=0.9 in SamplingParams. Operators often tune top-p interactively with the slider in LM Studio's generation settings.
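For scripted use, the same values can be sent to a local Ollama server over its REST API. The sketch below uses only the Python standard library. It assumes Ollama is listening on its default address (http://localhost:11434) and that the llama3.1:8b model has already been pulled. The helper names build_payload and generate are made up for this example.

```python
import json
import urllib.request

# Assumed default local endpoint; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt, model="llama3.1:8b", temperature=0.8, top_p=0.9):
    """Build the JSON body for /api/generate (hypothetical helper)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON response instead of a stream
        "options": {"temperature": temperature, "top_p": top_p},
    }

def generate(prompt, **kwargs):
    """Send the request and return the generated text (hypothetical helper)."""
    data = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

# Example call (requires a running Ollama server):
# print(generate("Write the opening line of a short story.", top_p=0.9))
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the exact sampling options before sending them.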
Reviewed by Fredoline Eruo. See our editorial policy.