RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
runner · Open source · free · 4.4/5

ExLlamaV2

Hand-optimized inference for EXL2-quantized models. Fastest single-GPU runtime for the EXL2 quant format on Ada/Hopper hardware. Lower-level than llama.cpp; pairs with text-generation-webui + TabbyAPI as front-ends.

By Fredoline Eruo · Last verified Jun 12, 2026 · 4,500 GitHub stars

Overview

What ExLlamaV2 actually is

ExLlamaV2 is a CUDA-only inference engine for quantized transformer models, written by Turbo (turboderp) with a single design goal: maximum single-stream tokens-per-second on consumer NVIDIA GPUs. It is the engine that makes a 24 GB RTX 4090 or RTX 3090 punch dramatically above its price class for local-AI workloads, and it ships its own quantization format — EXL2 — designed specifically for the kernels it runs.

It is not a general-purpose engine. It is not a multi-tenant production server. It does one thing — fast single-stream decode of large quantized models on consumer NVIDIA hardware — and in May 2026 it remains the fastest path on the 24 GB consumer tier by a meaningful margin.

Where it fits in the stack

ExLlamaV2 is an engine layer with a thin server frontend (exllamav2 API + community wrappers like TabbyAPI). The stack:

  • Frontend: TabbyAPI or Open WebUI pointed at TabbyAPI's OAI-compatible endpoint
  • Engine: ExLlamaV2
  • Hardware: consumer NVIDIA — RTX 3090 / 4090 / 5090 are the canonical targets
  • Model format: EXL2 (preferred) or GPTQ

It is not the right layer for production serving with concurrent users; for that, use vLLM. It is not the right layer if you need cross-platform portability; for that, use llama.cpp. It is the right layer when you have a single 24 GB card, you're the only user, and you want every last token-per-second.

Best use cases

  • Solo developer with an RTX 4090 / 3090 / 5090. Single-stream decode at the top of the consumer tier. See /hardware/rtx-4090 and /hardware/rtx-3090.
  • 70B-class models on dual 24 GB cards. EXL2 + tensor-parallel splits across two cards efficiently; see /stacks/dual-3090-workstation.
  • Long-context single-user agents. ExLlamaV2's KV-cache management is unusually efficient — 32K+ context fits where vLLM would OOM on the same hardware.
  • Workloads where prefill latency matters less than decode throughput. ExLlamaV2 is decode-optimized; vLLM's continuous batching wins on prefill at scale.

OS support

OS                          Quality
Linux (x86_64, CUDA 12+)    excellent (reference platform)
Windows native              excellent (official wheels)
Windows (WSL2)              excellent (same as Linux)
macOS                       unsupported (CUDA-only)
Linux ARM64                 unsupported in practice

Hardware / backend support

  • NVIDIA only. That's the start and end of the list.
  • Compute capability 7.5+ (Turing and later — RTX 20-series and up).
  • Compute capability 8.6+ (Ampere — RTX 30-series) is where EXL2 starts to really sing because of FP16 tensor-core throughput.
  • Compute capability 8.9 (Ada — RTX 40-series) and 9.0 (Hopper) are the current sweet spots.

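ExLlamaV2 will technically run on a GTX 1080, but that card is Pascal (compute capability 6.1), which sits below the 7.5 floor listed above. It has no tensor cores and you are bottlenecked on memory bandwidth, so treat it as unsupported in practice and pick llama.cpp instead.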
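
The tiers above can be written down as a small lookup. This is a sketch only: the function name `exl2_tier` and the tier labels are ours, not part of ExLlamaV2, and the thresholds are simply the ones listed in this section. Treating anything newer than 9.0 as "sweet spot" is an assumption.

```python
def exl2_tier(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the tiers described in this section.

    Hypothetical helper: thresholds come from the list above, labels are ours.
    """
    cc = (major, minor)
    if cc < (7, 5):
        return "below listed minimum"   # e.g. Pascal (6.1): no tensor cores
    if cc < (8, 6):
        return "supported"              # Turing (7.5) up to, not including, 8.6
    if cc < (8, 9):
        return "good"                   # consumer Ampere (8.6)
    return "sweet spot"                 # Ada (8.9), Hopper (9.0); newer is assumed


if __name__ == "__main__":
    for cc in [(6, 1), (7, 5), (8, 6), (8, 9), (9, 0)]:
        print(cc, "->", exl2_tier(*cc))
```

With PyTorch installed, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` pair for the first GPU, which can be passed straight in as `exl2_tier(*torch.cuda.get_device_capability(0))`.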

Model / quant format support

  • EXL2 — native format; the production-recommended path. EXL2 is calibration-aware mixed-bit quantization — different layers can run at different precisions based on importance scores from a calibration dataset. Models are typically published at "X bpw" (bits per weight) — 4.0 bpw, 4.65 bpw, 5.0 bpw, 6.0 bpw, 8.0 bpw.
  • GPTQ — supported, slower than EXL2 on the same hardware. Useful when only GPTQ checkpoints exist for a given model.
  • GGUF / AWQ / FP8 — unsupported; out-of-scope.
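
The bpw label translates directly into an approximate weight footprint: parameters × bits per weight ÷ 8 gives bytes. The sketch below does that arithmetic; `exl2_weight_gib` is a hypothetical helper name, and the estimate ignores the KV cache, activations, and any layers kept at a different precision, so real checkpoints will differ somewhat.

```python
def exl2_weight_gib(params_billion: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB.

    params * bits-per-weight / 8 gives bytes. This is a rough estimate:
    KV cache, activations and per-file overhead are not included.
    """
    total_bytes = params_billion * 1e9 * bpw / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # The bpw values commonly published for EXL2 quants, applied to a 70B model.
    for bpw in (4.0, 4.65, 5.0, 6.0, 8.0):
        print(f"70B @ {bpw} bpw ~ {exl2_weight_gib(70, bpw):.1f} GiB")
```

For a 70B model at 4.0 bpw this comes to roughly 32.6 GiB of weights, which is why the same model needs two 24 GB cards rather than one.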

For the cross-runtime comparison see /systems/quantization-formats.

Setup path

The most common path in 2026 is via TabbyAPI, which gives you an OAI-compatible HTTP server on top of ExLlamaV2:

git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# place your EXL2 model under models/<name>/
python main.py

Or for direct library usage:

pip install exllamav2

Pre-converted EXL2 checkpoints are abundant on Hugging Face — search "exl2" or look at the turboderp, bartowski, and LoneStriker repos for canonical quants of most popular open models.

What breaks first

  1. CUDA / PyTorch version drift. ExLlamaV2 wheels are tightly coupled to a CUDA + PyTorch version. pip install -U is dangerous; pin everything.
  2. EXL2 quant published at the wrong bpw. A 4.0 bpw 70B model fits 2× 24 GB cards comfortably; 4.65 bpw is on the edge; 5.0 bpw will OOM. The bpw label is load-bearing.
  3. KV-cache eviction at long context. ExLlamaV2 has solid KV-cache management, but past ~32K tokens on a 70B 4.0 bpw split across 2× 3090, you start swapping cache pages and tok/s collapses.
  4. TabbyAPI auth misconfig. TabbyAPI ships with API-key auth on by default; first-time setups often hit "401 unauthorized" before they figure that out.
  5. Tensor-parallel boot order. On dual-GPU setups, both cards must be visible to CUDA before boot; CUDA_VISIBLE_DEVICES ordering matters.
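
Item 3 is easier to reason about with the KV-cache arithmetic written out. The sketch below assumes the published Llama 70B shape (80 layers, 8 KV heads through grouped-query attention, head dimension 128) and an FP16 cache; `kv_cache_gib` is a hypothetical helper, and you should substitute the values from your own model's config.json.

```python
def kv_cache_gib(tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """KV-cache size in GiB for a given context length.

    Defaults are assumptions for illustration (Llama 70B shape, FP16 cache).
    Pass a smaller bytes_per_elem to model a quantized cache.
    """
    # One K and one V vector per layer, per KV head, per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token_bytes / 2**30


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384, 32768):
        print(f"{ctx:>6} tokens ~ {kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions each token costs 320 KiB, so a 32K context needs about 10 GiB of cache on top of the weights. Against 48 GB across two 3090s that leaves little headroom for a 70B model at 4.0 bpw, which is consistent with throughput degrading around that context length.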

Alternatives by intent

If you want…                          Reach for
Same hardware, multi-user serving     vLLM (AWQ-INT4 path)
Same hardware, friendly UX            Ollama (GGUF), slower but simpler
Apple Silicon equivalent              MLX-LM
AMD equivalent                        llama.cpp on ROCm
Production datacenter throughput      TensorRT-LLM on H100

Best pairings

  • RTX 4090 + ExLlamaV2 + EXL2 4.65bpw + 32B model = the canonical solo-user inference setup
  • RTX 3090 ×2 + ExLlamaV2 + EXL2 4.0bpw + 70B model = the canonical "70B on a budget" setup; see /stacks/dual-3090-workstation
  • Open WebUI + TabbyAPI + ExLlamaV2 = the canonical solo-user chat stack
  • Aider / Continue.dev routed at TabbyAPI's OAI-compatible endpoint

Who should avoid ExLlamaV2

  • Anyone on AMD or Apple Silicon. CUDA-only, full stop.
  • Production serving with concurrent users. vLLM wins above ~3 concurrent users.
  • Operators who don't want to pin Python / CUDA versions. ExLlamaV2 rewards careful environment management; sloppy environments break it.
  • Anyone who needs maximum portability across model formats. EXL2 + GPTQ is a narrow format set.
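
Pinning is easier to keep honest with a small drift check. The following is a sketch using only the standard library; the versions in `PINS` are placeholders for illustration, not recommended versions, and the helper names are ours.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

# Placeholder pins for illustration only; use the versions your wheel was built against.
PINS = {"torch": "2.3.1", "exllamav2": "0.1.8"}


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up the installed version of each package, or None if it is absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_drift(pins: Dict[str, str],
               installed: Dict[str, Optional[str]]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (pinned, installed)} for every package that differs."""
    return {name: (want, installed.get(name))
            for name, want in pins.items()
            if installed.get(name) != want}


if __name__ == "__main__":
    for name, (want, have) in find_drift(PINS, installed_versions(PINS)).items():
        print(f"{name}: pinned {want}, installed {have}")
```

The comparison is an exact string match, so a CUDA build that reports a local suffix such as `+cu121` will show up as drift unless the pin includes the same suffix.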

Related

  • Stacks: /stacks/dual-3090-workstation, /stacks/local-coding-agent
  • System guides: /systems/quantization-formats, /guides/running-local-ai-on-multiple-gpus-2026
  • Hardware: RTX 4090, RTX 3090, NVIDIA H100 SXM
  • Errors: /errors/wsl2-gpu-not-detected

Setup guidance

Install from the ExLlamaV2 repository. Requires Python 3.10+ and CUDA 12.1+, NVIDIA GPU only (Maxwell through Blackwell are listed as supported; see Hardware / backend support for where it is practical):

git clone https://github.com/turboderp-org/exllamav2
cd exllamav2
pip install -e .

Convert a Hugging Face model to EXL2 format:

python convert.py -i ./Llama-3.1-8B-Instruct -o ./Llama-3.1-8B-Instruct-exl2 -cf Llama-3.1-8B-Instruct-4.0bpw -b 4.0

Here -i is the source model, -o is the working directory for intermediate files, -cf is the directory that receives the finished model, and -b 4.0 sets the target to 4.0 bits per weight. EXL2 supports arbitrary bitrates (2.5–8.0 bpw), calibrated per layer for minimal perplexity (PPL) degradation. The calibration measurement pass for a 4-bit target takes 10–30 minutes per model.

Serve with TabbyAPI (the most common EXL2 server), installed as described under Setup path: put the finished model directory (the -cf output, not the -o working directory) under TabbyAPI's models/ directory, check the model path in config.yml, and run python main.py. TabbyAPI exposes an OpenAI-compatible /v1/chat/completions endpoint on port 5000. Verify with curl, passing your API key because auth is on by default:

curl http://localhost:5000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer <your-api-key>" -d '{"messages":[{"role":"user","content":"Hello"}]}'

First run after conversion: instant load, ~2 seconds warmup.
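
The same check as the curl command can be scripted. This is a minimal standard-library sketch, not TabbyAPI's own client: `build_chat_request` and `send` are hypothetical helpers, and the `Authorization: Bearer` scheme and the response layout follow the OpenAI convention, which is an assumption you should confirm against your TabbyAPI version.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, content: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    body = json.dumps({"messages": [{"role": "user", "content": content}]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme (OpenAI-style bearer token).
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response. Needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    req = build_chat_request("http://localhost:5000", "<your-api-key>", "Hello")
    print(req.get_method(), req.full_url)
    # reply = send(req)  # uncomment once the server is up
    # print(reply["choices"][0]["message"]["content"])  # OpenAI-style layout, assumed
```

A 401 from `send` usually means the key is missing or wrong rather than that the model failed to load.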

Workload fit

Best for:

  • Single-user local inference on NVIDIA consumer GPUs (RTX 3090/4090/5090) where maximum tokens-per-second is the goal.
  • 4–5 bit quantization, where EXL2's calibrated bitrate delivers better quality-per-byte than GGUF.
  • Speculative decoding with a draft model (ExLlamaV2's batched verify is fast).
  • Creative writing and roleplay workloads where single-stream decode speed matters more than aggregate throughput.

Not suited for:

  • Multi-tenant production serving (use vLLM).
  • Non-NVIDIA hardware, CPU inference, or Apple Silicon (use llama.cpp).
  • Models larger than the combined VRAM of your GPUs at the target quantization; there is no CPU offload to fall back on.
  • Rapid model switching when no published EXL2 quant exists, because EXL2 conversion is a separate build step for each model.

Alternatives

Use ExLlamaV2 when you need maximum decoding speed on NVIDIA consumer GPUs (RTX 3090, 4090, 5090) at 4–5 bit quantization — its fused attention kernels and tensor-core-optimized matmuls are 20–50% faster than llama.cpp CUDA on the same hardware for single-user decode. The EXL2 format's per-layer bitrate calibration produces measurably lower PPL at the same file size vs GGUF at low bitrates (<4.5 bpw). Switch to llama.cpp when you need CPU offloading, Apple Silicon, or broader hardware support — ExLlamaV2 is NVIDIA-only. Use vLLM when you need multi-tenant concurrent serving with continuous batching — ExLlamaV2 is single-user-optimized. Use TensorRT-LLM for Hopper/Blackwell datacenter deployment; ExLlamaV2 excels on consumer cards. Use Ollama when you want a polished CLI and auto-quantization selection.

Troubleshooting + when to switch

Problem: RuntimeError: CUDA error: no kernel image is available for execution on the device.
Fix: The installed kernels were not built for your GPU's compute capability. ExLlamaV2 compiles its CUDA kernels at install time, so rebuild from the repo directory with pip uninstall exllamav2 && pip install -e . --no-build-isolation. Make sure the CUDA toolkit version is one your driver supports.

Problem: Quantization calibration measurement produces poor PPL on your specific model.
Fix: The default calibration dataset (WikiText) may not match your domain. Run the conversion with -c /path/to/your/calibration.parquet to calibrate on domain-specific text. EXL2's per-layer bit allocation is dataset-sensitive; calibrating on domain text produces a 0.5–1.5 PPL improvement on that domain.

Problem: TabbyAPI hangs on model load with no error.
Fix: Check the model path in config.yml and make sure the directory contains config.json, tokenizer.model or tokenizer_config.json, and the quantized .safetensors weight files. EXL2 needs both the quantized weights and the tokenizer files.
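
The file check for the silent-hang case can be automated before launching the server. A minimal sketch, assuming the file names listed above; `missing_model_files` is a hypothetical helper and it only checks that the files exist, not that they are valid.

```python
from pathlib import Path
from typing import List


def missing_model_files(model_dir: str) -> List[str]:
    """Return the required pieces that are absent from an EXL2 model directory."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    # Either tokenizer file satisfies the tokenizer requirement.
    if not any((root / name).is_file() for name in ("tokenizer.model", "tokenizer_config.json")):
        missing.append("tokenizer.model or tokenizer_config.json")
    # Quantized weights are stored as one or more .safetensors files.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


if __name__ == "__main__":
    import sys
    problems = missing_model_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("OK" if not problems else "Missing: " + ", ".join(problems))
```

An empty list means the directory has the pieces described above; anything else names what to fix before pointing config.yml at it.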

Stack & relationships

How ExLlamaV2 relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.

ExLlamaV2 ↔ ecosystem

Recommended stack

  • Pairs with
    TabbyAPI

    The canonical pairing for production-ish ExLlamaV2 serving. ExLlamaV2 is the engine; TabbyAPI is the front of house.

Required by

  • Required by
    TabbyAPI

    TabbyAPI is purely a frontend: it wraps ExLlamaV2 in an OpenAI-compatible HTTP API, so the dependency runs from TabbyAPI to ExLlamaV2. No TabbyAPI without ExLlamaV2 installed underneath.

Featured in these stacks

The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Workstation tier·Role: Alternative high-throughput runtime
    Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs

    ExLlamaV2 with EXL2 quants is the throughput leader on dual-3090 NVLink for single-stream decode. Slightly sharper than vLLM AWQ-INT4 at the cost of a less-mature serving stack. Use when peak per-stream tok/s matters more than concurrent serving.

  • Stack · L3·Homelab tier·Role: Alternative for asymmetric layer-split
    Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path

    ExLlamaV2 EXL2 quants accept ratio-based split. Sharper quants than GGUF at equivalent size; pick when peak per-stream throughput on the 4090's strengths matters.

Pros

  • Top single-card NVIDIA speed
  • Custom EXL2 quant format
  • Tight memory usage

Cons

  • NVIDIA only
  • EXL2 ecosystem narrower than GGUF

Compatibility

Operating systems: Linux, Windows
GPU backends: NVIDIA CUDA
License: Open source · free

Runtime health

Operator-grade signals on how actively ExLlamaV2 is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal on this row.

Active
Updated Jun 12, 2026

8 days since last refresh · source: lastUpdated

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Ecosystem stability

Editorial rating from RunLocalAI — qualitative, not measured.

4.4/5 ✓ Editorial

Get ExLlamaV2

GitHub
https://github.com/turboderp-org/exllamav2

Frequently asked

Is ExLlamaV2 free?

Yes — ExLlamaV2 is free to use and open-source.

What operating systems does ExLlamaV2 support?

ExLlamaV2 supports Linux and Windows (native or WSL2). macOS is unsupported because the engine is CUDA-only.

Which GPUs work with ExLlamaV2?

ExLlamaV2 runs on NVIDIA GPUs through CUDA only. There is no CPU-only, AMD, or Apple Silicon path; use llama.cpp for those.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
