Transformer & LLM components

Top-k Sampling

Top-k sampling is a text-generation strategy that restricts the model's next-token choices to the k tokens with the highest probabilities. The model then samples randomly from this shortlist, with each token weighted by its renormalized probability. This prevents the model from picking extremely unlikely tokens (which can derail coherence) while still allowing variety. Operators encounter it as a generation parameter: a low k (e.g., 10) makes output more focused; a high k (e.g., 100) increases diversity. It is often used alongside temperature, which rescales the model's logits to sharpen or flatten the distribution being sampled.
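The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular runtime implements it; the function name top_k_sample and the toy logits are assumptions made for the example.

```python
import math
import random

def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample one token index from the k highest-scoring logits (illustrative helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature rescales the logits; it never changes their ranking.
    scaled = [x / temperature for x in logits]
    # Keep the indices of the k largest scaled logits.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax over the survivors only, which renormalizes their probabilities.
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy scores for a hypothetical 5-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
samples = [top_k_sample(logits, k=2, temperature=0.8) for _ in range(1000)]
```

With k=2, every sampled index is 0 or 1, because the other three tokens are removed before sampling.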

Deeper dive

Top-k sampling was introduced to address the limitations of greedy decoding (always picking the most likely token, which leads to repetitive text) and pure sampling (which can produce incoherent output by selecting rare tokens). By keeping only the top k tokens, the model avoids the long tail of improbable choices. The value of k is typically between 10 and 100. A common alternative is top-p (nucleus) sampling, which keeps the smallest set of most-likely tokens whose cumulative probability reaches a threshold p, so the number of candidates varies from step to step instead of being a fixed count. In practice, operators often combine top-k with temperature: temperature flattens or sharpens the probability distribution, and top-k cuts off the tail. For example, in llama.cpp, setting --temp 0.8 --top-k 40 means the model samples from the 40 most likely tokens, with temperature 0.8 sharpening the distribution among them. Because temperature scaling preserves the ranking of tokens, the same 40 tokens survive whether a runtime applies temperature before or after the cutoff.
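To make the contrast between a fixed count and a cumulative threshold concrete, here is a small sketch that applies both filters to a peaked and a flat distribution. The helper names (softmax, top_k_indices, top_p_indices) and the toy logits are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(probs, k):
    """Indices of the k most probable tokens: always exactly k of them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_indices(probs, p):
    """Smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

peaked = softmax([8.0, 1.0, 0.5, 0.0, -1.0])  # one token dominates
flat = softmax([1.0, 0.9, 0.8, 0.7, 0.6])     # nearly uniform

print(len(top_k_indices(peaked, 3)), len(top_k_indices(flat, 3)))
print(len(top_p_indices(peaked, 0.9)), len(top_p_indices(flat, 0.9)))
```

Top-k keeps three candidates in both cases, while top-p keeps a single token when the model is confident and all five when it is not.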

Practical example

On a 24 GB RTX 4090 running Llama 3.1 8B via llama.cpp at Q4_K_M, setting --top-k 40 with --temp 0.7 tends to produce creative but coherent responses. If top-k is set to 1, the output becomes deterministic (greedy). If it is set much higher (e.g., 300, still a small fraction of Llama 3.1's roughly 128,000-token vocabulary), more of the low-probability tail becomes eligible, so the model may occasionally pick a rare token, causing a sudden topic shift or gibberish. Operators often tune top-k alongside top-p: many find top-p=0.9 with top-k=40 a safe starting point.
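The three settings mentioned above can be compared on a toy distribution. The sketch below assumes top-k is applied first and top-p is then measured on the renormalized survivors; runtimes differ in sampler order, so treat this as one plausible arrangement. The function filter_top_k_top_p and the Zipf-like distribution are hypothetical, not taken from llama.cpp.

```python
def filter_top_k_top_p(probs, k, p):
    """Apply top-k, then top-p on the renormalized survivors (illustrative helper)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    k_mass = sum(probs[i] for i in order)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i] / k_mass  # cumulative share within the top-k set
        if mass >= p:
            break
    kept_mass = sum(probs[i] for i in kept)
    # Return the final candidates with probabilities that sum to 1.
    return {i: probs[i] / kept_mass for i in kept}

# Toy Zipf-like distribution over a hypothetical 1,000-token vocabulary.
raw = [1.0 / (rank + 1) for rank in range(1000)]
total = sum(raw)
probs = [r / total for r in raw]

greedy = filter_top_k_top_p(probs, k=1, p=1.0)           # only the top token
starting_point = filter_top_k_top_p(probs, k=40, p=0.9)  # at most 40 candidates
wide = filter_top_k_top_p(probs, k=300, p=1.0)           # long tail admitted

print(len(greedy), len(starting_point), len(wide))
```

With k=1 the only candidate has probability 1, which is why the output is deterministic; with k=300 many low-probability tokens remain eligible.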

Workflow example

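In Ollama, you set top-k in a Modelfile built on llama3.1:8b: PARAMETER top_k 40. Once you build a model from it with ollama create, the runtime applies the setting whenever that model is run with ollama run. In LM Studio, the 'Sampling' panel has a slider for top-k (default 40). In vLLM, top-k is a per-request sampling parameter (the top_k field of SamplingParams) rather than a fixed server setting. In Hugging Face Transformers code, it's model.generate(..., do_sample=True, top_k=40); without do_sample=True, generate decodes greedily and top_k has no effect. Operators typically adjust top-k when they notice the model repeating phrases (raise k) or becoming incoherent (lower k).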
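Top-k can also be set per request rather than in a Modelfile. The sketch below builds a request for Ollama's /api/generate endpoint with top_k passed in the options object. It assumes a local server on the default port 11434 and a pulled llama3.1:8b model; the helper build_generate_request is a hypothetical name, and the request is constructed but not sent.

```python
import json
import urllib.request

def build_generate_request(prompt, top_k=40, temperature=0.7,
                           model="llama3.1:8b",
                           host="http://localhost:11434"):
    """Build (but do not send) a generate request with sampling options."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON response
        "options": {"top_k": top_k, "temperature": temperature},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_generate_request("Name three uses for a paperclip.")

# To send it against a running server (assumed, not executed here):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["response"])
```

Changing top_k in the call is enough to compare settings across requests without rebuilding a model.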

Reviewed by Fredoline Eruo. See our editorial policy.
