TabbyAPI
Overview
OpenAI-API frontend for ExLlamaV2. Wraps the EXL2 inference engine in a clean HTTP API, adds streaming, batching, and OAI-compatible chat templates. The default front-of-house when you've already committed to the EXL2 quant format and want to expose it to clients that speak OpenAI.
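Because TabbyAPI speaks the OpenAI wire format, any OpenAI-compatible client works by pointing it at the local server. A minimal sketch of the request payload a client would POST to the server's chat-completions endpoint — the base URL and model name here are placeholder assumptions for a local deployment; the payload shape is the standard OpenAI one:

```python
# Sketch: building an OpenAI-style chat-completions payload for a local
# TabbyAPI server. BASE_URL and the model name are assumptions — adjust
# to your deployment; the JSON shape follows the OpenAI chat API.
BASE_URL = "http://localhost:5000/v1"  # assumed local TabbyAPI address

def chat_request(prompt: str, model: str = "my-exl2-model",
                 stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,  # name of the EXL2 model the server has loaded
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = chat_request("Hello!", stream=True)
# This dict would be POSTed to f"{BASE_URL}/chat/completions".
```

Since the format is unchanged, existing OpenAI SDKs, LangChain integrations, and the like only need their base URL swapped.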
Stack & relationships
How TabbyAPI relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.
Recommended stack
- Pairs with ExLlamaV2
The canonical pairing for production-ish ExLlamaV2 serving. ExLlamaV2 is the engine; TabbyAPI is the front of house.
Alternatives
- Alternative to Ollama
Both expose OpenAI-compatible APIs locally. TabbyAPI wins on raw single-card EXL2 speed for advanced users; Ollama wins on ergonomics and breadth of quant formats. Pick by quant commitment.
Depends on
- Depends on ExLlamaV2
TabbyAPI is purely a frontend — it wraps ExLlamaV2 in an OpenAI-compatible HTTP API. No TabbyAPI without ExLlamaV2 installed underneath.
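In practice the dependency shows up at configuration time: TabbyAPI points at an ExLlamaV2-loadable EXL2 model directory and exposes it over HTTP. A hypothetical config sketch — the key names and values below are illustrative assumptions, not verbatim TabbyAPI configuration; consult the project's shipped sample config for the real schema:

```yaml
# Illustrative sketch only — key names are assumptions, not the
# authoritative TabbyAPI schema.
network:
  host: 127.0.0.1   # bind locally; front with a proxy for remote clients
  port: 5000        # assumed default port
model:
  model_dir: models           # directory containing EXL2 quants
  model_name: my-exl2-model   # folder of the quant to load at startup
```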
Pros
- Cleanest production wrapper for ExLlamaV2
- Streaming + batching + tool-call support
- Minimal operational footprint (single process)
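Streaming support means clients receive the reply as a sequence of OpenAI-style chunks, each carrying a `delta` with a fragment of the text. A minimal sketch of how a client reassembles the full reply — the chunk shape follows the standard OpenAI streaming format, and the mock chunks stand in for server-sent events:

```python
# Sketch: reassembling a streamed chat completion from OpenAI-style
# chunks, as a TabbyAPI client would. Field names follow the OpenAI
# streaming spec; the mock data below is illustrative.

def collect_stream(chunks: list[dict]) -> str:
    """Concatenate delta.content fragments from streamed chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:      # first chunk may carry only the role
            parts.append(delta["content"])
    return "".join(parts)

# Mock chunks standing in for the server's event stream:
mock = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
]
print(collect_stream(mock))  # prints "Hello"
```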
Cons
- NVIDIA only (it's bound to ExLlamaV2)
- EXL2 quant only (no GGUF, no AWQ)
- Smaller community than vLLM / llama.cpp server mode
Compatibility
| Operating systems | Linux, Windows, macOS |
| GPU backends | NVIDIA CUDA |
| License | Open source, free (AGPL-3.0) |
Frequently asked
Is TabbyAPI free?
Yes. TabbyAPI is open-source software released under the AGPL-3.0 license.
What operating systems does TabbyAPI support?
Linux, Windows, and macOS.
Which GPUs work with TabbyAPI?
NVIDIA GPUs via CUDA only, since TabbyAPI is bound to ExLlamaV2.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.