Inference
Inference is the act of running a trained model to generate predictions, as opposed to training, which produces the model. For LLMs, inference has two distinct phases:
Prefill processes the entire prompt in one parallel pass; this is compute-bound (matrix-matrix operations). Time-to-first-token (TTFT) is the latency from request to first generated token, dominated by prefill.
Decode generates one token at a time autoregressively; this is memory-bandwidth-bound (matrix-vector operations that re-read the full model weights for every token). Tokens-per-second is the headline metric, and it scales roughly linearly with hardware memory bandwidth.
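To make the contrast concrete, here is a toy NumPy sketch (the hidden size and prompt length are illustrative, and a single weight matrix stands in for the whole model): prefill amortizes one read of the weights over every prompt token, while decode re-reads the same weights for each generated token.

```python
import numpy as np

d = 4096            # hidden size (illustrative)
prompt_len = 512    # tokens in the prompt (illustrative)

W = np.random.randn(d, d).astype(np.float32)   # stand-in for one weight matrix

# Prefill: all prompt tokens hit the weights in one matrix-matrix multiply,
# so lots of arithmetic per byte of weights read -> compute-bound.
prompt = np.random.randn(prompt_len, d).astype(np.float32)
prefill_out = prompt @ W                       # (512, 4096) @ (4096, 4096)

# Decode: each new token is a single matrix-vector multiply, so every step
# re-reads the full weight matrix for little arithmetic -> bandwidth-bound.
token = np.random.randn(1, d).astype(np.float32)
decode_out = token @ W                         # (1, 4096) @ (4096, 4096)
```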
For local AI on consumer hardware, decode dominates the real-world experience. An RTX 5080 with 960 GB/s of memory bandwidth running an 8B-parameter model at Q4 quantization (~5 GB of weights) has a theoretical peak of ~190 tok/s, with real-world measurements typically landing at 60-75% of that.
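The peak figure is simply bandwidth divided by model size, since each decoded token must stream every weight from VRAM once. A quick back-of-envelope sketch, using the numbers above:

```python
# Roofline-style bound on single-stream decode: each token reads all weights
# once, so peak tok/s = memory bandwidth / model size.
bandwidth_gb_s = 960.0   # RTX 5080 memory bandwidth (GB/s)
model_size_gb = 5.0      # 8B-parameter model at Q4 quantization (GB)

peak = bandwidth_gb_s / model_size_gb        # ~192 tok/s theoretical
low, high = 0.60 * peak, 0.75 * peak         # typical measured range
print(f"theoretical peak ~{peak:.0f} tok/s; expect ~{low:.0f}-{high:.0f} tok/s")
```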