RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Transformer & LLM components

Top-k Sampling

Top-k sampling is a text-generation strategy that restricts the model's next-token choices to the k tokens with the highest probabilities. The runtime renormalizes the probabilities of that shortlist so they sum to 1, then samples one token at random in proportion to them. This prevents the model from picking extremely unlikely tokens (which can derail coherence) while still allowing variety. Operators encounter it as a generation parameter: a low k (e.g., 10) makes output more focused; a high k (e.g., 100) increases diversity. It is often used alongside temperature, which rescales the logits to sharpen or flatten the distribution. Because a positive temperature preserves the ranking of tokens, the same k tokens survive whether temperature is applied before or after the top-k cut.
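The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not how any particular runtime implements it: the function names and the six-token `toy_logits` list are hypothetical, and real implementations work on tensors over the full vocabulary.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature (must be > 0) rescales the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize so they sum to 1."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample one token id from the renormalized top-k shortlist."""
    shortlist = top_k_filter(softmax(logits, temperature), k)
    ids = list(shortlist)
    weights = [shortlist[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Hypothetical logits for a 6-token vocabulary (illustration only).
toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0, -3.0]
print(sample_top_k(toy_logits, k=3, temperature=0.8))
```

With k=1 the shortlist holds only the highest-scoring token, so the function always returns its id, which is the greedy case.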

Deeper dive

Top-k sampling addresses the limitations of greedy decoding (always picking the most likely token, which tends to produce repetitive text) and pure sampling (which can produce incoherent output by selecting rare tokens). By keeping only the top k tokens, the model avoids the long tail of improbable choices. The value of k is typically between 10 and 100. A common alternative is top-p (nucleus) sampling, which keeps the smallest set of most-probable tokens whose cumulative probability reaches a threshold p, so the shortlist size varies from step to step instead of being a fixed count. In practice, operators often combine top-k with temperature: values below 1 sharpen the probability distribution, values above 1 flatten it, and top-k cuts off the tail. For example, in llama.cpp, setting --temp 0.8 --top-k 40 means the model samples from the 40 most likely tokens, with a temperature of 0.8 sharpening the distribution over them.
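To make the contrast with a fixed count concrete, here is a small sketch of a top-p filter in plain Python. The function name and the two five-token distributions are hypothetical, chosen only to show how the shortlist size changes with the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable token ids whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

# Hypothetical next-token distributions (illustration only).
peaked = [0.92, 0.04, 0.02, 0.01, 0.01]  # the model is confident
flat = [0.25, 0.22, 0.20, 0.18, 0.15]    # the model is unsure

print(len(top_p_filter(peaked, 0.9)), len(top_p_filter(flat, 0.9)))
```

When the model is confident, top-p keeps a single token; when it is unsure, it keeps all five. A top-k filter with k=3 would keep exactly three tokens in both cases.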

Practical example

On a 24 GB RTX 4090 running Llama 3.1 8B via llama.cpp at Q4_K_M, setting --top-k 40 with --temp 0.7 is a reasonable starting point for responses that are varied but coherent. If top-k is set to 1, the output becomes deterministic (greedy), since only the single most likely token can be chosen. If set to 300, the shortlist is still a tiny fraction of Llama 3.1's roughly 128,000-token vocabulary, but it admits far more of the low-probability tail, so the model may occasionally pick a rare token, causing a sudden topic shift or gibberish. Operators often tune top-k alongside top-p: many find top-p=0.9 with top-k=40 a safe starting point.
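As a rough illustration of why a larger k lets more of the tail through, the sketch below measures how much probability mass survives a top-k cutoff. The Zipf-like distribution is an assumption made for the sake of the arithmetic, not a measurement of Llama 3.1; only the vocabulary size (128,256) matches the real model, and real next-token distributions vary widely from step to step.

```python
def zipf_like(vocab_size, exponent=1.0):
    """Hypothetical long-tailed distribution: p(rank r) is proportional to 1 / r**exponent."""
    weights = [1.0 / (r ** exponent) for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def mass_kept(probs, k):
    """Share of the original probability mass that survives a top-k cutoff."""
    return sum(sorted(probs, reverse=True)[:k])

# Vocabulary size of Llama 3.1; the distribution shape itself is assumed.
probs = zipf_like(128_256)
for k in (1, 40, 300):
    print(f"top-k={k:>3}: keeps {mass_kept(probs, k):.1%} of the probability mass")
```

The gap between the k=40 and k=300 figures is the extra tail mass the sampler can now land in. On real models that gap depends on how peaked the distribution is at each step.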

Workflow example

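In Ollama, you set top-k in the Modelfile: PARAMETER top_k 40. After you build a model from that Modelfile with ollama create and start it with ollama run, the runtime applies the value during generation. In LM Studio, the 'Sampling' panel has a slider for top-k (default 40). In vLLM, top-k is a per-request sampling parameter rather than a server flag: set SamplingParams(top_k=40) in the Python API, or send a top_k field with the request to the OpenAI-compatible server. In Hugging Face Transformers code, it's model.generate(..., do_sample=True, top_k=40); without do_sample=True the call decodes greedily and top_k has no effect. Operators typically adjust top-k when they notice the model repeating phrases (a sign k may be too low, so raise it) or becoming incoherent (a sign k may be too high, so lower it).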
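Sampling settings can also be sent per request instead of being baked into a Modelfile. The sketch below builds a request body for Ollama's /api/generate endpoint with top_k, top_p, and temperature under "options". The helper names are hypothetical, the defaults simply mirror the starting point mentioned above, and it assumes an Ollama server listening on the default localhost:11434 with llama3.1:8b already pulled.

```python
import json
import urllib.request

def build_generate_request(prompt, model="llama3.1:8b", top_k=40, top_p=0.9, temperature=0.7):
    """Build the JSON body for Ollama's /api/generate with per-request sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"top_k": top_k, "top_p": top_p, "temperature": temperature},
    }

def generate(prompt, host="http://localhost:11434", **sampling):
    """Send the request to a local Ollama server (assumed to be running) and return the text."""
    body = json.dumps(build_generate_request(prompt, **sampling)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

# Building the payload needs no server; calling generate() does.
payload = build_generate_request("Name three uses for a Raspberry Pi.", top_k=10)
print(json.dumps(payload["options"]))
```

Calling generate("...", top_k=10) and then generate("...", top_k=100) on the same prompt is a quick way to compare focused and diverse output without editing a Modelfile.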

Reviewed by Fredoline Eruo. See our editorial policy.
