vLLM
server · Open source · free · 4.8/5
Overview
vLLM is a high-throughput serving engine for large language models. Its core techniques are PagedAttention (paged KV-cache memory management), continuous batching, and prefix caching, and it is a common production choice for self-hosted LLM APIs at scale.
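For a concrete sense of the engine's Python API, here is a minimal offline-inference sketch; the model name is illustrative, and any checkpoint vLLM supports can be substituted.

```python
from vllm import LLM, SamplingParams

# Load a model; vLLM manages its KV cache with PagedAttention internally.
# "facebook/opt-125m" is a small illustrative model, not a recommendation.
llm = LLM(model="facebook/opt-125m")

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Multiple prompts are scheduled together via continuous batching.
outputs = llm.generate(
    ["The capital of France is", "A serving engine's job is to"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```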
Pros
- Best-in-class throughput under concurrent load
- OpenAI-compatible API (see the client sketch after this list)
- Tensor parallelism for multi-GPU serving
- Speculative decoding support
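Because the server exposes an OpenAI-compatible API, existing OpenAI client code can usually be pointed at it unchanged. A minimal sketch, assuming a server already started with `vllm serve meta-llama/Llama-3.1-8B-Instruct` on the default port 8000 (the model name here is illustrative):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default.
# The API key is ignored unless the server was started with one configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Tensor parallelism is configured on the server side (for example, `--tensor-parallel-size 4` to shard a model across four GPUs), so client code like the above does not change.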
Cons
- Linux-only
- GPU-centric: the CPU backend is functional but slow
- Steeper learning curve than Ollama
Compatibility
| Spec | Details |
| --- | --- |
| Operating systems | Linux |
| Accelerator backends | NVIDIA CUDA, AMD ROCm, Intel Gaudi, TPU |
| License | Open source · free |
Get vLLM
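vLLM is distributed on PyPI. On a Linux machine with a supported NVIDIA GPU, `pip install vllm` installs both the Python library and the `vllm` CLI used to launch the OpenAI-compatible server. Installation steps for the other backends (ROCm, Gaudi, TPU) differ; the project's documentation covers each build.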
Frequently asked
Is vLLM free?
Yes. vLLM is free to download and use, and it is open source under the permissive Apache-2.0 license.
What operating systems does vLLM support?
vLLM supports Linux.
Which GPUs work with vLLM?
vLLM supports NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Intel Gaudi accelerators, and TPUs. CPU-only inference is also possible but slow.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.