PagedAttention
PagedAttention is the KV-cache memory layout introduced by vLLM: the cache is stored in fixed-size blocks (pages), analogous to pages in virtual memory. Each request keeps a page table mapping logical token positions to physical blocks, and blocks can be shared across requests with identical prefixes.
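To make the bookkeeping concrete, here is a minimal Python sketch of that page table. BlockAllocator, Sequence, and their methods are illustrative names, not vLLM's actual API; the block size of 16 tokens matches vLLM's default.

```python
BLOCK_SIZE = 16  # tokens per physical block (vLLM's default block size)

class BlockAllocator:
    """Hands out fixed-size physical blocks and refcounts them for sharing."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1  # same physical block, one more owner
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)  # freed only when no request points at it

class Sequence:
    """Per-request page table: logical block index -> physical block id."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.page_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one fills up,
        # so each request holds at most one partially used block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.page_table.append(self.allocator.allocate())
        self.num_tokens += 1
```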
The win: it eliminates the internal fragmentation that wastes 60–80% of KV cache memory under naive contiguous allocation, letting vLLM pack 2–4× more concurrent requests into the same VRAM.
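To see where the waste comes from, here is a back-of-envelope comparison with illustrative numbers (a server that pre-reserves 2048 slots per request versus a paged allocator); the figures are assumptions, not measurements:

```python
max_len = 2048      # slots a contiguous allocator reserves up front
actual_len = 300    # tokens the request actually produced
block_size = 16     # tokens per page under paging

contiguous_waste = max_len - actual_len               # 1748 idle slots
paged_blocks = -(-actual_len // block_size)           # ceil(300 / 16) = 19
paged_waste = paged_blocks * block_size - actual_len  # 4 idle slots

print(f"contiguous: {contiguous_waste} wasted slots "
      f"({contiguous_waste / max_len:.0%} of the reservation)")
print(f"paged: {paged_waste} wasted slots (always less than one block)")
```

Paging caps per-request waste at one partially filled block, no matter how far short of the reservation a request stops.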
A side benefit: trivial prefix caching. Multiple requests that share a system prompt point at the same physical blocks, so the prefill cost for that prefix is paid once.
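Continuing the sketch above (it reuses BlockAllocator, Sequence, and BLOCK_SIZE), here is one way prefix sharing could look. The prefix_cache dict keyed by the raw prompt string is an assumption for illustration, not how vLLM keys its cache.

```python
allocator = BlockAllocator(num_blocks=1024)
prefix_cache: dict[str, list[int]] = {}  # system prompt -> physical blocks

def start_request(system_prompt: str, prompt_tokens: int) -> Sequence:
    seq = Sequence(allocator)
    blocks = prefix_cache.get(system_prompt)
    if blocks is None:
        # First request with this prefix: allocate blocks and (in a real
        # engine) run prefill once to populate them with KV entries.
        num_blocks = -(-prompt_tokens // BLOCK_SIZE)  # ceil division
        blocks = [allocator.allocate() for _ in range(num_blocks)]
        prefix_cache[system_prompt] = blocks
    else:
        # Later requests just bump refcounts; no prefill for the prefix.
        blocks = [allocator.share(blk) for blk in blocks]
    seq.page_table.extend(blocks)
    seq.num_tokens = prompt_tokens
    return seq

req_a = start_request("You are a helpful assistant.", prompt_tokens=32)
req_b = start_request("You are a helpful assistant.", prompt_tokens=32)
assert req_a.page_table == req_b.page_table  # same physical blocks
```

A real implementation also needs copy-on-write when a request appends into a partially filled shared block; the sketch sidesteps that by using a prompt length that fills whole blocks.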