vLLM

High-throughput serving engine. PagedAttention, continuous batching, prefix caching. Production default for self-hosted LLM APIs at scale.

By Fredoline Eruo·Last verified May 6, 2026·50,000 GitHub stars

Overview

High-throughput serving engine. PagedAttention, continuous batching, prefix caching. Production default for self-hosted LLM APIs at scale.

Operating systems	Linux
GPU backends	NVIDIA CUDA AMD ROCm Intel Gaudi TPU
License	Open source · free

Yes — vLLM is free to download and use and open-source under a permissive license.

vLLM supports Linux.

vLLM supports NVIDIA CUDA, AMD ROCm, Intel Gaudi, TPU. CPU-only inference is also possible but slow.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.