Transformer & LLM components
Prefill (Prompt Processing)
Prefill is the first phase of LLM inference: the model processes the entire prompt in a single parallel pass, building the KV cache for every prompt token. Prefill is compute-bound: the work is dominated by large matrix-matrix multiplications that can saturate the GPU's tensor cores.
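To make the contrast concrete, here is a minimal single-head, single-layer sketch in toy NumPy (no real model; the weights, shapes, and function names are made up for illustration). Prefill runs all prompt tokens through attention in one parallel pass and returns the KV cache; decode then handles one token at a time against that cache:

```python
import numpy as np

d = 64  # head dimension (toy value)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def prefill(prompt_embeddings):
    """Process all prompt tokens in one parallel pass; return output + KV cache."""
    q = prompt_embeddings @ W_q              # (T, d): one matmul over the whole prompt
    k = prompt_embeddings @ W_k
    v = prompt_embeddings @ W_v
    T = len(prompt_embeddings)
    scores = q @ k.T / np.sqrt(d)            # (T, T) matrix-matrix product
    scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, (k, v)               # (k, v) is the KV cache

def decode_step(token_embedding, kv_cache):
    """Generate one token: a single query row against the cached keys/values."""
    k_cache, v_cache = kv_cache
    q = token_embedding @ W_q                # (d,): matrix-vector work, memory-bound
    k_cache = np.vstack([k_cache, token_embedding @ W_k])
    v_cache = np.vstack([v_cache, token_embedding @ W_v])
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, (k_cache, v_cache)

# Prefill the whole prompt at once, then decode token by token.
prompt = rng.standard_normal((8, d))
_, cache = prefill(prompt)
next_tok = rng.standard_normal(d)
_, cache = decode_step(next_tok, cache)
```

Note the asymmetry: prefill is one large matrix-matrix product over all T prompt tokens, while each decode step is matrix-vector work against the growing cache, which is why prefill tends to be compute-bound and decode memory-bound.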
Prefill latency dominates time-to-first-token (TTFT). For a 7B model on an RTX 4090, prefill runs at roughly 3,000–8,000 tokens/sec depending on batch geometry, so a 2K-token prompt takes 250–700 ms before generation even starts.
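A quick back-of-the-envelope check of those figures (the throughput numbers are the illustrative ones quoted above, not measurements):

```python
prompt_tokens = 2048
for rate in (3_000, 8_000):  # illustrative prefill throughput, tokens/sec
    print(f"{rate} tok/s -> TTFT floor ~ {prompt_tokens / rate * 1000:.0f} ms")
# 3000 tok/s -> TTFT floor ~ 683 ms
# 8000 tok/s -> TTFT floor ~ 256 ms
```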
Optimizations: chunked prefill (process the prompt in slices to overlap with decode), prefix caching (reuse KV from a previous prompt with the same prefix), and Flash Attention (reduce memory traffic during attention).
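To illustrate the first of these, here is a hypothetical chunked-prefill sketch (again toy NumPy; `chunk=512` and the scheduling comment are assumptions, not any particular engine's API). Each slice attends to its own keys plus the cache accumulated from earlier slices, so the result matches a monolithic prefill while leaving gaps where a scheduler could interleave decode steps for other requests:

```python
import numpy as np

def attend(q, k, v, causal_offset):
    """Causal attention where query row i sits at global position causal_offset + i."""
    Tq, Tk = len(q), len(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.arange(Tk)[None, :] > (causal_offset + np.arange(Tq))[:, None]
    scores[mask] = -np.inf                   # queries cannot see future keys
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def chunked_prefill(q_all, k_all, v_all, chunk=512):
    """Run prefill attention one slice at a time; between slices a real engine
    would schedule pending decode steps for other requests."""
    d = k_all.shape[-1]
    k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
    outs = []
    for start in range(0, len(q_all), chunk):
        k_cache = np.vstack([k_cache, k_all[start:start + chunk]])
        v_cache = np.vstack([v_cache, v_all[start:start + chunk]])
        outs.append(attend(q_all[start:start + chunk], k_cache, v_cache,
                           causal_offset=start))
    return np.vstack(outs), (k_cache, v_cache)

rng = np.random.default_rng(0)
T, d = 2048, 64
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out, kv_cache = chunked_prefill(q, k, v, chunk=512)  # 4 slices of 512 tokens
```

Prefix caching follows the same shape of idea: if the `(k_cache, v_cache)` for a prompt prefix already exists from an earlier request, prefill can skip straight to the new suffix tokens.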