Inference
Inference is the act of running a trained model to generate predictions, as opposed to training, which produces the model. For LLMs, inference has two distinct phases:
Prefill processes the entire prompt in one parallel pass; this is compute-bound (matrix-matrix operations). Time-to-first-token (TTFT) is the latency from request to first generated token, dominated by prefill.
Decode generates one token at a time autoregressively; this is memory-bandwidth-bound (matrix-vector operations that re-read the full model weights for every token). Tokens-per-second is the headline metric, and it scales roughly linearly with hardware memory bandwidth.
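To make the contrast concrete, here is a toy NumPy sketch (the hidden size and prompt length are illustrative, and a single weight matrix stands in for the whole model): prefill amortizes one read of the weights over every prompt token, while decode re-reads the same weights for each generated token.

```python
import numpy as np

d = 4096            # hidden size (illustrative)
prompt_len = 512    # tokens in the prompt (illustrative)

W = np.random.randn(d, d).astype(np.float32)   # stand-in for one weight matrix

# Prefill: all prompt tokens hit the weights in one matrix-matrix multiply,
# so lots of arithmetic per byte of weights read -> compute-bound.
prompt = np.random.randn(prompt_len, d).astype(np.float32)
prefill_out = prompt @ W                       # (512, 4096) @ (4096, 4096)

# Decode: each new token is a single matrix-vector multiply, so every step
# re-reads the full weight matrix for little arithmetic -> bandwidth-bound.
token = np.random.randn(1, d).astype(np.float32)
decode_out = token @ W                         # (1, 4096) @ (4096, 4096)
```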
For local AI on consumer hardware, decode dominates the real-world experience. An RTX 5080 with 960 GB/s of memory bandwidth running an 8B-parameter model at Q4 quantization (~5 GB of weights) has a theoretical peak of ~190 tok/s, with real-world measurements typically landing at 60-75% of that.
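The peak figure is simply bandwidth divided by model size, since each decoded token must stream every weight from VRAM once. A quick back-of-envelope sketch, using the numbers above:

```python
# Roofline-style bound on single-stream decode: each token reads all weights
# once, so peak tok/s = memory bandwidth / model size.
bandwidth_gb_s = 960.0   # RTX 5080 memory bandwidth (GB/s)
model_size_gb = 5.0      # 8B-parameter model at Q4 quantization (GB)

peak = bandwidth_gb_s / model_size_gb        # ~192 tok/s theoretical
low, high = 0.60 * peak, 0.75 * peak         # typical measured range
print(f"theoretical peak ~{peak:.0f} tok/s; expect ~{low:.0f}-{high:.0f} tok/s")
```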