
ExLlamaV2 not loading — fix the model format, cache, or driver issue

ExLlamaV2 load failures usually trace to one of three causes: the wrong model format (it needs EXL2 or EXL3, not GGUF), a KV cache too large for the chosen context, or a driver/runtime version mismatch. The EXL2/EXL3 format requirement is non-negotiable.

ExLlamaV2 · TabbyAPI · Text-Generation-WebUI ExLlama loader
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Trying to load GGUF in ExLlamaV2

Diagnose

Load fails immediately with 'Could not find model index' or 'Invalid format.' GGUF and EXL2 are different, incompatible formats.

Fix

ExLlamaV2 needs EXL2 (or newer EXL3) format. Find quantized versions on HuggingFace: search '<model> exl2' (e.g., 'turboderp/Llama-3.1-70B-exl2'). Pick a bpw (bits-per-weight) tier matching your VRAM.
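Not sure what format a downloaded folder actually contains? The files settle it: GGUF ships as a single .gguf file, while EXL2/EXL3 repos carry config.json plus .safetensors shards. A minimal sketch (the directory path is hypothetical):

```python
from pathlib import Path

def detect_format(model_dir: str) -> str:
    """Rough format check: GGUF is one .gguf file; EXL2/EXL3
    repos ship config.json plus .safetensors shards."""
    p = Path(model_dir)
    if any(p.glob("*.gguf")):
        return "GGUF: use llama.cpp, not ExLlamaV2"
    if (p / "config.json").exists() and any(p.glob("*.safetensors")):
        return "safetensors: loadable if quantized as EXL2/EXL3"
    return "unknown: check the repo's model card"

print(detect_format("models/Llama-3.1-70B-exl2"))  # hypothetical local path
```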

#2

Cache (KV) larger than VRAM allows at chosen context

Diagnose

Loads, but errors at first inference: 'CUDA out of memory.' The KV cache (FP16 unless you enable Q8/Q4 quantization) grows linearly with context and consumes more than expected.

Fix

Lower max_seq_len. Use `--cache_q4` for aggressive cache quantization (Q4 roughly halves cache memory vs Q8 and quarters it vs FP16). On 24 GB, budget the cache after the weights: pick a model whose weights leave a few GB free, then size the context to the remainder, as the sketch below shows.
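To see why context length dominates, here is a back-of-envelope cache estimator. It's a sketch: the shape constants default to Llama-3.1-70B's published config (80 layers, 8 KV heads, head dim 128), and the Q8/Q4 byte costs are approximations.

```python
def kv_cache_gb(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_el: float = 2.0) -> float:
    """Approximate KV cache size: one K and one V vector of
    n_kv_heads * head_dim elements per token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el / 1e9

for ctx in (8192, 16384, 32768):
    print(f"{ctx:>6} ctx | FP16 {kv_cache_gb(ctx):5.1f} GB"
          f" | Q8 {kv_cache_gb(ctx, bytes_per_el=1.0):4.1f} GB"
          f" | Q4 {kv_cache_gb(ctx, bytes_per_el=0.5):4.1f} GB")
```

At 16K context that's roughly 5.4 GB in FP16 but only ~1.3 GB at Q4, which is often the difference between fitting and OOM.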

#3

Driver / CUDA version too old for ExLlamaV2 build

Diagnose

Crashes with kernel-image errors or 'undefined symbol.' ExLlamaV2 main branch tracks recent CUDA features (FlashAttention 2/3).

Fix

Update the NVIDIA driver to 555+ for CUDA 12.4, then reinstall ExLlamaV2 against a PyTorch build that matches your CUDA: `pip install --no-cache-dir exllamav2 --extra-index-url https://download.pytorch.org/whl/cu124`.
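Before rebuilding anything, confirm what CUDA version your PyTorch was actually built against; a minimal check with standard torch calls:

```python
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)    # e.g. '12.4'
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```

Compare the reported CUDA version against the maximum your driver supports (shown in the header of `nvidia-smi` output); the build version must not exceed it.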

#4

Wrong bpw tier for the VRAM

Diagnose

Model loads partially, then OOM. EXL2 models come in many bpw flavors (3.0, 4.0, 4.65, 5.0, 6.0, 8.0). Higher bpw = better quality, more VRAM.

Fix

Choose a bpw that fits: weights need roughly params × bpw / 8 bytes. A 70B at 4.0 bpw is ~35 GB of weights alone, so a single 24 GB card tops out around 2.4 bpw with short context; 48 GB fits 4.0-5.0 bpw with 16K+ context. Repos list bpw in the model name. A quick estimator follows.
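The arithmetic is simple enough to script; a rough estimator (weights only, ignoring the KV cache and a couple of GB of framework overhead):

```python
def weights_gb(params_b: float, bpw: float) -> float:
    """Approximate weight memory: parameters * bits-per-weight / 8."""
    return params_b * bpw / 8  # billions of params * bytes per weight = GB

for bpw in (2.4, 3.0, 4.0, 4.65, 5.0, 6.0):
    print(f"70B @ {bpw} bpw ~ {weights_gb(70, bpw):5.1f} GB weights")
```

Leave headroom for cache and activations on top of these figures.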

#5

FlashAttention build issue (compute capability mismatch)

Diagnose

flash-attn fails to build or import, or attention kernels crash at load. ExLlamaV2 wants flash-attn 2.5+, whose FA2 kernels target Ampere and newer; Pascal (sm_61) and Turing (sm_75) cards can't run them.

Fix

FA2 requires sm_80+ (Ampere). On older cards, ExLlamaV2 falls back to xformers — slower but works. Or upgrade GPU.
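Compute capability is one torch call away if you want to check where your card lands (a minimal sketch):

```python
import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (8, 0):
    print(f"sm_{major}{minor}: Ampere or newer, FlashAttention 2 supported")
else:
    print(f"sm_{major}{minor}: pre-Ampere, expect the slower fallback path")
```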

Frequently asked questions

ExLlamaV2 vs vLLM vs llama.cpp — which to use?

ExLlamaV2 wins on single-user inference perf at 24+ GB VRAM (highest tok/s of the three). vLLM wins on multi-user serving (paged KV cache + continuous batching). llama.cpp wins on portability + cross-platform. ExLlamaV2 = power user tool.

What's a good bpw to choose for EXL2?

4.0 bpw is the value sweet spot — minimal quality loss vs FP16, fits more on the same VRAM. 5.0+ for higher-stakes work. 3.0 only when desperate for VRAM (noticeable quality drop). Repos like 'turboderp' ship multiple tiers.

Does ExLlamaV2 support multi-GPU?

Yes, via `--gpu-split`: specify a VRAM allocation per card. The default split is by layers, which mainly extends capacity; with tensor-parallel loading (available in recent builds), dual-card throughput can approach 2x. Pair with TabbyAPI for OpenAI-compatible serving. A loading sketch follows.
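For reference, a minimal loading sketch based on the exllamav2 Python examples. The model path and split values are illustrative, and the API can shift between releases, so verify against the version you have installed:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config("models/Llama-3.1-70B-exl2-4.0bpw")  # hypothetical path
model = ExLlamaV2(config)

# Manual split: GB of weights to place on each card, in device order.
model.load(gpu_split=[20, 22])
cache = ExLlamaV2Cache(model)

# Alternative: let the loader distribute layers automatically,
# reserving room for the cache as it fills each card.
# cache = ExLlamaV2Cache(model, lazy=True)
# model.load_autosplit(cache)
```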

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: