Chunked Prefill
Also known as: prefill chunking
Chunked prefill is an inference-engine technique that splits long-prompt processing into smaller chunks so the engine can interleave decode steps from in-flight requests in between. Without chunked prefill, a single long prompt monopolizes the GPU until the entire prefill completes, blocking decode tokens for every other request and spiking inter-token latency. Engines like vLLM and SGLang use chunked prefill to keep tail latency stable under mixed workloads.
Deeper dive
Inference has two distinct compute phases: prefill (compute KV cache for the prompt) and decode (generate one token at a time). Prefill is compute-bound and processes the entire prompt in parallel; a 32K-token prompt prefill can take seconds. During this time, any other request waiting to generate the next token sits idle. Chunked prefill breaks the prefill into pieces (e.g., 512 tokens each) so the engine schedules a small prefill chunk, then a decode step for other in-flight requests, then another prefill chunk, alternating until the prefill completes. The result is smoother tail latency for decode-bound users at the cost of slightly higher end-to-end latency for the long-prompt user. The technique pairs well with continuous batching — together they form the latency-stability backbone of high-throughput open-source inference servers.
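To make the scheduling idea concrete, here is a minimal Python sketch of an interleaving loop. It is not vLLM's or SGLang's actual scheduler; the Request class, the 512-token chunk size, and the one-chunk-then-decode policy are illustrative assumptions chosen to mirror the description above.

```python
from collections import deque
from dataclasses import dataclass

CHUNK_SIZE = 512  # illustrative chunk size, as in the text above

@dataclass
class Request:
    name: str
    prompt_len: int      # total prompt tokens needing prefill
    output_budget: int   # decode tokens left to generate
    prefilled: int = 0   # prompt tokens already prefilled

    @property
    def prefill_done(self) -> bool:
        return self.prefilled >= self.prompt_len

def step(requests: deque) -> None:
    """One scheduler iteration: at most one prefill chunk, then one
    decode token for every request whose prefill has finished."""
    # Advance the first request that still has prompt tokens to prefill.
    for r in requests:
        if not r.prefill_done:
            chunk = min(CHUNK_SIZE, r.prompt_len - r.prefilled)
            r.prefilled += chunk  # stand-in for a forward pass over the chunk
            print(f"prefill {r.name}: +{chunk} ({r.prefilled}/{r.prompt_len})")
            break
    # Decode one token for every request that is past its prefill.
    for r in requests:
        if r.prefill_done and r.output_budget > 0:
            r.output_budget -= 1  # stand-in for one decode forward pass
            print(f"decode  {r.name}: 1 token")

if __name__ == "__main__":
    queue = deque([
        Request("long-summarize", prompt_len=2048, output_budget=4),
        Request("short-chat", prompt_len=0, output_budget=8),  # already prefilled
    ])
    for _ in range(6):
        step(queue)
```

Even while the long request is chewing through its 2,048-token prompt one chunk per iteration, the short request emits a decode token every iteration; a monolithic prefill would instead run all four chunks back to back before the short request made any progress.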
Practical example
A server fielding both a single long-context summarization request (16K-token prompt, 200-token output) and many short chat requests (500-token prompt, 100-token output) would, without chunked prefill, see the chat requests freeze for the duration of the summarization prefill: a second or more of zero progress on chat tokens. With chunked prefill set to a 1,024-token chunk size, the summarization prefill yields back to the chat decode steps after every chunk (on the order of 50ms each, depending on model and hardware), so chat users see continuous streaming throughout. The summarization request still completes in roughly the same total time; only the latency curve flattens.
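Back-of-the-envelope numbers for this scenario, as a small Python sketch. The prefill throughput figure is an assumption chosen only to make the arithmetic concrete; real numbers depend on model size and hardware.

```python
prompt_tokens = 16_000        # long summarization prompt
chunk_size = 1_024            # chunked-prefill chunk size
prefill_tok_per_s = 20_000    # assumed prefill throughput (hardware-dependent)

chunks = -(-prompt_tokens // chunk_size)                 # ceil division -> 16 chunks
chunk_ms = chunk_size / prefill_tok_per_s * 1000         # ~51 ms per chunk
stall_monolithic_s = prompt_tokens / prefill_tok_per_s   # ~0.8 s of uninterrupted prefill

print(f"{chunks} chunks of ~{chunk_ms:.0f} ms each")
print(f"worst chat stall without chunking: ~{stall_monolithic_s:.1f} s")
print(f"worst chat stall with chunking:    ~{chunk_ms:.0f} ms (one chunk)")
```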
Workflow example
In vLLM, chunked prefill is enabled via --enable-chunked-prefill (off by default in older versions, on by default in newer ones), and the chunk size is controlled by --max-num-batched-tokens. SGLang enables it by default. To measure whether it's helping, capture P99 inter-token latency for short requests while a long-prompt request is running: without chunked prefill there will be a spike; with it, the curve stays roughly flat. Tuning the chunk size is a tradeoff: smaller chunks improve fairness, larger chunks improve overall throughput. Start with the engine default and only tune if measurement says you should.
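A minimal sketch of the measurement step in Python, assuming you already record an arrival timestamp per streamed token on the client. The 30ms baseline gap and the 800ms stalls are made-up numbers standing in for what a monolithic long-prompt prefill does to the latency curve, not real measurements.

```python
from statistics import quantiles

def p99_itl_ms(arrival_s: list) -> float:
    """P99 of the gaps between consecutive token arrivals, in milliseconds."""
    gaps = [(b - a) * 1000 for a, b in zip(arrival_s, arrival_s[1:])]
    return quantiles(gaps, n=100)[98]  # 99th-percentile cut point

def simulate(stall_every: int = 0) -> list:
    """Fake arrival times: ~30 ms per token, with optional 800 ms stalls
    standing in for monolithic long-prompt prefills hogging the GPU."""
    ts, t = [], 0.0
    for i in range(1, 201):
        t += 0.8 if stall_every and i % stall_every == 0 else 0.03
        ts.append(t)
    return ts

if __name__ == "__main__":
    print(f"with long-prompt stalls (no chunking): P99 ITL ~{p99_itl_ms(simulate(40)):.0f} ms")
    print(f"without stalls (chunked prefill):      P99 ITL ~{p99_itl_ms(simulate()):.0f} ms")
```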