Throughput

Throughput measures how much work a system completes per unit time, typically tokens per second across all concurrent requests. It is distinct from latency, which measures how long a single request takes.
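
As a rough illustration of the definition (all numbers hypothetical), aggregate throughput is the total number of generated tokens divided by the wall-clock time of the serving window:

```python
# Hypothetical numbers, illustrating only the aggregate-throughput calculation.
concurrent_requests = 8
tokens_per_request = 256      # generated tokens per request
wall_clock_seconds = 4.0      # time for the whole batch to finish

throughput = concurrent_requests * tokens_per_request / wall_clock_seconds
print(f"{throughput:.0f} tokens/second aggregate")  # 512 tokens/second
```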

A vLLM server with continuous batching can serve dozens of concurrent users with 5-10× the aggregate throughput of a single-stream llama.cpp setup, because batching amortizes the cost of reading model weights from VRAM across multiple requests' tokens.
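
A minimal sketch of measuring this with vLLM's offline batched API follows; the model name, prompt count, and token budget are illustrative assumptions, not taken from the text above.

```python
# Sketch: measure aggregate throughput by pushing many prompts through vLLM at once.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model choice
params = SamplingParams(max_tokens=256)
prompts = ["Summarize the benefits of continuous batching."] * 64  # 64 simulated users

start = time.perf_counter()
outputs = llm.generate(prompts, params)  # continuous batching handles scheduling internally
elapsed = time.perf_counter() - start

# Count every generated token across all requests, then divide by wall-clock time.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.0f} tok/s aggregate")
```

Running the same prompts one at a time through a single-stream setup would typically show far lower aggregate tokens per second, which is the gap the paragraph above describes.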

For solo local use you mostly care about latency, not throughput. For self-hosted multi-user deployments (a team sharing a local LLM), throughput is the key metric. The right runner choice differs accordingly: ExLlamaV2 tends to win for a single user, while vLLM wins once many users share the server.
