ExLlamaV2
Hand-optimized inference for EXL2-quantized models. Fastest single-GPU runtime for the EXL2 quant format on Ada/Hopper hardware. Lower-level than llama.cpp; pairs with text-generation-webui + TabbyAPI as front-ends.
Overview
What ExLlamaV2 actually is
ExLlamaV2 is a CUDA-only inference engine for quantized transformer models, written by Turbo (turboderp) with a single design goal: maximum single-stream tokens-per-second on consumer NVIDIA GPUs. It is the engine that makes a 24 GB RTX 4090 or RTX 3090 punch dramatically above its price class for local-AI workloads, and it ships its own quantization format — EXL2 — designed specifically for the kernels it runs.
It is not a general-purpose engine. It is not a multi-tenant production server. It does one thing — fast single-stream decode of large quantized models on consumer NVIDIA hardware — and in May 2026 it remains the fastest path on the 24 GB consumer tier by a meaningful margin.
Where it fits in the stack
ExLlamaV2 is an engine layer with a thin server frontend (exllamav2 API + community wrappers like TabbyAPI). The stack:
- Frontend: TabbyAPI or Open WebUI pointed at TabbyAPI's OAI-compatible endpoint
- Engine: ExLlamaV2
- Hardware: consumer NVIDIA — RTX 3090 / 4090 / 5090 are the canonical targets
- Model format: EXL2 (preferred) or GPTQ
It is not the right layer for production serving with concurrent users; for that, use vLLM. It is not the right layer if you need cross-platform portability; for that, use llama.cpp. It is the right layer when you have a single 24 GB card, you're the only user, and you want every last token-per-second.
Best use cases
- Solo developer with an RTX 4090 / 3090 / 5090. Single-stream decode at the top of the consumer tier. See /hardware/rtx-4090 and /hardware/rtx-3090.
- 70B-class models on dual 24 GB cards. EXL2 + tensor-parallel splits across two cards efficiently; see /stacks/dual-3090-workstation.
- Long-context single-user agents. ExLlamaV2's KV-cache management is unusually efficient — 32K+ context fits where vLLM would OOM on the same hardware.
- Workloads where prefill latency matters less than decode throughput. ExLlamaV2 is decode-optimized; vLLM's continuous batching wins on prefill at scale.
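To put a number on the long-context point, here is a rough back-of-the-envelope estimate of KV-cache size. It uses the usual formula (one K and one V tensor per layer per token) and assumes a Llama-style 70B layout with 80 layers, 8 KV heads, and a head dimension of 128. The function name, the model shape, and the 0.5 bytes-per-element figure for a 4-bit cache (which ignores scale overhead) are illustrative assumptions, not values read from ExLlamaV2.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2.0):
    """Approximate KV-cache size: K and V tensors for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len

# Assumed Llama-style 70B shape: 80 layers, 8 KV heads, head_dim 128.
fp16_32k = kv_cache_bytes(80, 8, 128, 32_768)       # 16-bit cache
q4_32k = kv_cache_bytes(80, 8, 128, 32_768, 0.5)    # 4-bit cache, scales ignored

print(f"FP16 cache at 32K tokens:  {fp16_32k / 2**30:.1f} GiB")  # 10.0 GiB
print(f"4-bit cache at 32K tokens: {q4_32k / 2**30:.1f} GiB")    # 2.5 GiB
```

Under these assumptions a 32K-token FP16 cache costs about 10 GiB on top of the weights, which is why cache precision and context length have to be budgeted together with the bpw of the model.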
OS support
| OS | Quality |
|---|---|
| Linux (x86_64, CUDA 12+) | excellent — reference platform |
| Windows native | excellent — official wheels |
| Windows (WSL2) | excellent — same as Linux |
| macOS | unsupported — CUDA-only |
| Linux ARM64 | unsupported in practice |
Hardware / backend support
- NVIDIA only. That's the start and end of the list.
- Compute capability 7.5+ (Turing and later: RTX 20-series and up) is the practical floor for good performance.
- Compute capability 8.6+ (Ampere — RTX 30-series) is where EXL2 starts to really sing because of FP16 tensor-core throughput.
- Compute capability 8.9 (Ada — RTX 40-series) and 9.0 (Hopper) are the current sweet spots.
ExLlamaV2 will technically run on a GTX 1080 — but you are bottlenecked on memory bandwidth and tensor-core absence; pick llama.cpp instead.
Model / quant format support
- EXL2 — native format; the production-recommended path. EXL2 is calibration-aware mixed-bit quantization — different layers can run at different precisions based on importance scores from a calibration dataset. Models are typically published at "X bpw" (bits per weight) — 4.0 bpw, 4.65 bpw, 5.0 bpw, 6.0 bpw, 8.0 bpw.
- GPTQ — supported, slower than EXL2 on the same hardware. Useful when only GPTQ checkpoints exist for a given model.
- GGUF / AWQ / FP8 — unsupported; out-of-scope.
For the cross-runtime comparison see /systems/quantization-formats.
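The bpw label translates directly into weight size, so it is worth doing the arithmetic before downloading a quant. The sketch below is a simple estimate (parameters × bits ÷ 8), not something ExLlamaV2 reports. The helper name is hypothetical, and real files differ somewhat because parameter counts are nominal and not every tensor is stored at the average bitrate.

```python
def weight_gb(params_billion, bpw):
    """Approximate quantized weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bpw / 8

for bpw in (4.0, 4.65, 5.0):
    print(f"70B at {bpw} bpw is roughly {weight_gb(70, bpw):.1f} GB of weights")
```

For a nominal 70B model this gives about 35.0, 40.7, and 43.8 GB. Whatever remains of the total VRAM has to hold the KV cache and runtime overhead.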
Setup path
The most common path in 2026 is via TabbyAPI, which gives you an OAI-compatible HTTP server on top of ExLlamaV2:
    git clone https://github.com/theroyallab/tabbyAPI
    cd tabbyAPI
    python -m venv venv && source venv/bin/activate
    pip install -r requirements.txt
    # place your EXL2 model under models/<name>/
    python main.py
Or for direct library usage:
    pip install exllamav2
Pre-converted EXL2 checkpoints are abundant on Hugging Face — search "exl2" or look at the turboderp, bartowski, and LoneStriker repos for canonical quants of most popular open models.
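For direct library usage, a minimal generation script looks roughly like the sketch below. It follows the pattern used in the examples in the ExLlamaV2 repository and assumes a recent release that includes the dynamic generator. The wrapper function, its defaults, and the model path you pass in are illustrative, and class names or arguments may differ between versions, so check them against the version you have pinned.

```python
def generate_once(model_dir, prompt, max_new_tokens=128):
    """Load an EXL2 model from model_dir and return one completion."""
    # Imported lazily so this module can be imported on a machine without a GPU.
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )
    from exllamav2.generator import ExLlamaV2DynamicGenerator

    config = ExLlamaV2Config(model_dir)       # reads config.json from the model dir
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)  # cache is allocated during loading
    model.load_autosplit(cache)               # spread layers over the visible GPUs
    tokenizer = ExLlamaV2Tokenizer(config)

    generator = ExLlamaV2DynamicGenerator(
        model=model, cache=cache, tokenizer=tokenizer
    )
    return generator.generate(prompt=prompt, max_new_tokens=max_new_tokens)
```

Calling generate_once with the path to a downloaded EXL2 directory and a prompt string returns the generated text. Loading is the slow part, so a real script would keep the model and generator alive between requests.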
What breaks first
- CUDA / PyTorch version drift. ExLlamaV2 wheels are tightly coupled to a CUDA + PyTorch version. pip install -U is dangerous; pin everything.
- EXL2 quant published at the wrong bpw. A 4.0 bpw 70B model fits 2× 24 GB cards comfortably; 4.65 bpw is on the edge; 5.0 bpw will OOM. The bpw label is load-bearing.
- KV-cache eviction at long context. ExLlamaV2 has solid KV-cache management, but past ~32K tokens on a 70B 4.0 bpw split across 2× 3090, you start swapping cache pages and tok/s collapses.
- TabbyAPI auth misconfig. TabbyAPI ships with API-key auth on by default; first-time setups often hit "401 unauthorized" before they figure that out.
- Tensor-parallel boot order. On dual-GPU setups, both cards must be visible to CUDA before boot; CUDA_VISIBLE_DEVICES ordering matters.
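One way to make device visibility and ordering explicit is to set the CUDA environment variables before anything initializes CUDA. The sketch below assumes a two-card machine whose GPUs are indices 0 and 1; those indices are an assumption, so confirm them with nvidia-smi. CUDA_DEVICE_ORDER=PCI_BUS_ID is a standard CUDA setting that numbers devices by PCI slot instead of the default fastest-first order.

```python
import os

# These must be set before torch or exllamav2 is imported,
# because CUDA reads them once at initialization.
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")  # stable, slot-based numbering
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"                # assumed indices of the two cards

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPUs exposed to CUDA, in order: {visible}")
```

Exporting the same two variables in the shell or service unit that launches the server has the same effect and keeps the ordering identical across restarts.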
Alternatives by intent
| If you want… | Reach for |
|---|---|
| Same hardware, multi-user serving | vLLM (AWQ-INT4 path) |
| Same hardware, friendly UX | Ollama (GGUF) — slower but simpler |
| Apple Silicon equivalent | MLX-LM |
| AMD equivalent | llama.cpp on ROCm |
| Production datacenter throughput | TensorRT-LLM on H100 |
Best pairings
- RTX 4090 + ExLlamaV2 + EXL2 4.65bpw + 32B model = the canonical solo-user inference setup
- RTX 3090 ×2 + ExLlamaV2 + EXL2 4.0bpw + 70B model = the canonical "70B on a budget" setup; see /stacks/dual-3090-workstation
- Open WebUI + TabbyAPI + ExLlamaV2 = the canonical solo-user chat stack
- Aider / Continue.dev routed at TabbyAPI's OAI-compatible endpoint
Who should avoid ExLlamaV2
- Anyone on AMD or Apple Silicon. CUDA-only, full stop.
- Production serving with concurrent users. vLLM wins above ~3 concurrent users.
- Operators who don't want to pin Python / CUDA versions. ExLlamaV2 rewards careful environment management; sloppy environments break it.
- Anyone who needs maximum portability across model formats. EXL2 + GPTQ is a narrow format set.
Related
- Stacks: /stacks/dual-3090-workstation, /stacks/local-coding-agent
- System guides: /systems/quantization-formats, /guides/running-local-ai-on-multiple-gpus-2026
- Hardware: RTX 4090, RTX 3090, NVIDIA H100 SXM
- Errors: /errors/wsl2-gpu-not-detected
Setup guidance
Install from the ExLlamaV2 repository:
    git clone https://github.com/turboderp/exllamav2
    cd exllamav2
    pip install -e .
Requires Python 3.10+ and CUDA 12.1+, NVIDIA GPU only (Maxwell through Blackwell supported).
Convert a Hugging Face model to EXL2 format:
    python convert.py -i ./Llama-3.1-8B-Instruct -o ./Llama-3.1-8B-Instruct-exl2 -cf Llama-3.1-8B-Instruct-4.0bpw -b 4.0
Here -i is the source model, -o is the working directory for the conversion job, -cf is the directory that receives the finished, loadable model, and -b 4.0 sets the target to 4.0 bits per weight. EXL2 supports arbitrary bitrates (2.5–8.0 bpw), calibrated per layer for minimal PPL degradation. The 4-bit calibration measurement pass takes 10–30 minutes per model.
Serve with TabbyAPI (the most common EXL2 server): from a tabbyAPI checkout with its requirements installed, place the finished model directory (the -cf output) under models/, point config.yml at it, and start the server:
    python main.py
TabbyAPI exposes an OpenAI-compatible /v1/chat/completions endpoint on port 5000. Verify:
    curl http://localhost:5000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer <your-api-key>" -d '{"messages":[{"role":"user","content":"Hello"}]}'
API-key auth is on by default, so a request without a key returns 401. First run after conversion: instant load, ~2 seconds warmup.
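The curl check can also be scripted with the Python standard library. The sketch below only builds the request so it can be inspected without a running server. The base URL matches the port used above. The helper name, the placeholder key, and the bearer-token header are assumptions, based on the OpenAI-compatible convention and on TabbyAPI shipping with API-key auth enabled.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, user_message):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {"messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder key, replace with yours
        },
        method="POST",
    )

req = build_chat_request("http://localhost:5000", "YOUR_API_KEY", "Hello")

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

A 401 response to this request usually means the key is missing or wrong, not that the model failed to load.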
Workload fit
Best for:
- Single-user local inference on NVIDIA consumer GPUs (RTX 3090/4090/5090) where maximum tokens-per-second is the goal.
- 4–5 bit quantization scenarios where EXL2's calibrated bitrate delivers better quality-per-byte than GGUF.
- Speculative decoding with a draft model (ExLlamaV2's batched verify is fast).
- Creative writing and roleplay workloads where high single-stream decode speed matters more than throughput.
Not suited for:
- Multi-tenant production serving (use vLLM).
- Non-NVIDIA hardware (use llama.cpp).
- CPU inference.
- Apple Silicon.
- Models larger than the combined VRAM of your GPUs at the target quantization.
- Rapid model switching (EXL2 format conversion is a separate build step for each model).
Alternatives
Use ExLlamaV2 when you need maximum decoding speed on NVIDIA consumer GPUs (RTX 3090, 4090, 5090) at 4–5 bit quantization — its fused attention kernels and tensor-core-optimized matmuls are 20–50% faster than llama.cpp CUDA on the same hardware for single-user decode. The EXL2 format's per-layer bitrate calibration produces measurably lower PPL at the same file size vs GGUF at low bitrates (<4.5 bpw). Switch to llama.cpp when you need CPU offloading, Apple Silicon, or broader hardware support — ExLlamaV2 is NVIDIA-only. Use vLLM when you need multi-tenant concurrent serving with continuous batching — ExLlamaV2 is single-user-optimized. Use TensorRT-LLM for Hopper/Blackwell datacenter deployment; ExLlamaV2 excels on consumer cards. Use Ollama when you want a polished CLI and auto-quantization selection.
Troubleshooting + when to switch
Problem: RuntimeError: CUDA error: no kernel image is available for execution on the device.
Fix: ExLlamaV2 compiles CUDA kernels at wheel-install time for your compute capability. Reinstall from the repo directory with: pip uninstall exllamav2 && pip install -e . --no-build-isolation. Ensure your CUDA toolkit version is supported by your installed driver.
Problem: Quantization calibration measurement produces poor PPL on your specific model.
Fix: The default calibration dataset is general-purpose and may not match your domain. Run conversion with -c /path/to/your/calibration.parquet to calibrate on domain-specific text. EXL2's per-layer bit allocation is dataset-sensitive; calibration on domain text can produce a 0.5–1.5 PPL improvement on that domain.
Problem: TabbyAPI hangs on model load with no error.
Fix: Check the config.yml model path and ensure the directory contains config.json, tokenizer.model or tokenizer_config.json, and the .safetensors files holding the quantized weights. EXL2 needs both the quantized weights and the tokenizer files.
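The model-directory check from the last fix is easy to automate before starting the server. The sketch below is a hypothetical helper, not part of ExLlamaV2 or TabbyAPI, and it only looks for the files named above.

```python
from pathlib import Path

def missing_model_files(model_dir):
    """Return the expected-but-absent items in an EXL2 model directory."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    tokenizer_names = ("tokenizer.model", "tokenizer_config.json")
    if not any((root / name).is_file() for name in tokenizer_names):
        missing.append("tokenizer.model or tokenizer_config.json")
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors weights")
    return missing
```

An empty list means the directory has the minimum set of files. A non-empty list names what to fix before looking for deeper causes of a silent hang.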
Stack & relationships
How ExLlamaV2 relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.
Recommended stack
- Pairs with TabbyAPI
The canonical pairing for production-ish ExLlamaV2 serving. ExLlamaV2 is the engine; TabbyAPI is the front of house.
Depends on
- Depends on TabbyAPI
TabbyAPI is purely a frontend — it wraps ExLlamaV2 in an OpenAI-compatible HTTP API. No TabbyAPI without ExLlamaV2 installed underneath.
Featured in these stacks
The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3 · Workstation tier · Role: Alternative high-throughput runtime · Dual RTX 3090 workstation stack: 70B-class on $1,800 of used GPUs
ExLlamaV2 with EXL2 quants is the throughput leader on dual-3090 NVLink for single-stream decode. Slightly sharper than vLLM AWQ-INT4 at the cost of a less-mature serving stack. Use when peak per-stream tok/s matters more than concurrent serving.
- Stack · L3 · Homelab tier · Role: Alternative for asymmetric layer-split · Mixed RTX 4090 + 3090 workstation: the asymmetric upgrade path
ExLlamaV2 EXL2 quants accept ratio-based split. Sharper quants than GGUF at equivalent size; pick when peak per-stream throughput on the 4090's strengths matters.
Pros
- Top single-card NVIDIA speed
- Custom EXL2 quant format
- Tight memory usage
Cons
- NVIDIA only
- EXL2 ecosystem narrower than GGUF
Compatibility
| Property | Value |
|---|---|
| Operating systems | Linux, Windows |
| GPU backends | NVIDIA CUDA |
| License | Open source · free |
Runtime health
Operator-grade signals on how actively ExLlamaV2 is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
8 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Ecosystem stability
Editorial rating from RunLocalAI — qualitative, not measured.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.