
Latency

Latency measures how fast you get a response. Two metrics matter for local LLMs:

Time to First Token (TTFT) — wall-clock time from sending the request to receiving the first generated token. Dominated by the prefill phase, which is compute-bound, so TTFT grows roughly linearly with prompt length. On an RTX 4090, a 1K-token prompt sees roughly 50 ms TTFT; a 32K-token prompt, 1-2 seconds.
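The linear scaling can be sketched with back-of-envelope arithmetic. All constants below (model size, sustained TFLOPS, efficiency) are illustrative assumptions, not measurements of any particular setup:

```python
# Back-of-envelope TTFT estimate for a dense transformer.
# Prefill is compute-bound: FLOPs scale as ~2 * params * prompt_tokens,
# so estimated TTFT grows linearly with prompt length.

def estimate_ttft_s(prompt_tokens: int,
                    params: float = 7e9,        # assumed 7B-parameter model
                    gpu_tflops: float = 165.0,  # assumed peak fp16 compute
                    efficiency: float = 0.5):   # assumed fraction of peak achieved
    flops = 2.0 * params * prompt_tokens        # ~2 FLOPs per parameter per token
    return flops / (gpu_tflops * 1e12 * efficiency)

print(f"1K prompt:  {estimate_ttft_s(1024):.3f} s")
print(f"32K prompt: {estimate_ttft_s(32768):.3f} s")
```

Real stacks can beat or miss these numbers depending on quantization, kernel quality, and attention implementation; the point is the roughly linear growth with prompt length.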

Inter-Token Latency (ITL) — time between consecutive tokens during generation. Roughly the inverse of single-stream tokens-per-second. Dominated by memory bandwidth in the decode phase, since each generated token requires streaming the model weights from VRAM.
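Both metrics can be measured directly from any streaming generation API by timestamping each token as it arrives. The generator below is a hypothetical stand-in for a real streaming call, with assumed sleep times simulating prefill and decode:

```python
import time

def fake_generate(n_tokens=20, prefill_s=0.05, decode_s=0.01):
    """Hypothetical stand-in for a streaming LLM API: sleeps prefill_s
    before the first token and decode_s between subsequent tokens."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(decode_s)
        yield f"tok{i}"

def measure(stream):
    """Consume a token iterator; return (ttft_s, mean_itl_s, tokens_per_s)."""
    start = time.perf_counter()
    stamps = []
    for _ in stream:
        stamps.append(time.perf_counter())
    ttft = stamps[0] - start                               # time to first token
    itls = [b - a for a, b in zip(stamps, stamps[1:])]     # gaps between tokens
    mean_itl = sum(itls) / len(itls)
    return ttft, mean_itl, 1.0 / mean_itl                  # tok/s = 1 / ITL

ttft, itl, tps = measure(fake_generate())
print(f"TTFT {ttft*1000:.0f} ms, ITL {itl*1000:.1f} ms, {tps:.0f} tok/s")
```

The same `measure` function works unchanged against a real token stream.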

Latency is distinct from throughput, which measures total tokens-per-second across batched or concurrent requests. A serving system optimized for throughput (e.g., vLLM with continuous batching) often has worse single-request latency than one optimized for latency (e.g., ExLlamaV2).
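The trade-off falls out of a toy model of batched decoding. Because decode is memory-bound, each step streams the model weights once regardless of batch size, plus a small per-request cost (e.g., KV-cache reads); the constants below are illustrative assumptions:

```python
# Toy model of the throughput/latency trade-off in batched decoding
# (all timing constants are assumed for illustration, not measured).

def step_time_ms(batch, weights_ms=10.0, per_req_ms=0.5):
    # One decode step: fixed cost to stream the weights, plus a small
    # per-request cost that grows with batch size.
    return weights_ms + per_req_ms * batch

for batch in (1, 8, 32):
    t = step_time_ms(batch)
    per_request_tps = 1000.0 / t        # each request gets one token per step
    aggregate_tps = batch * per_request_tps
    print(f"batch={batch:2d}  ITL={t:5.1f} ms  "
          f"per-request {per_request_tps:5.1f} tok/s  "
          f"aggregate {aggregate_tps:6.1f} tok/s")
```

Aggregate throughput climbs nearly linearly with batch size while each individual request's inter-token latency gets worse, which is exactly the choice a throughput-oriented server makes.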
