ExLlamaV2

A GPU-only inference library optimized for consumer NVIDIA cards. It offers the fastest tokens-per-second on a single 24 GB card for 30B-parameter models in EXL2 quantization.

By Fredoline Eruo · Last verified May 6, 2026 · 4,500 GitHub stars

Overview

ExLlamaV2 is free to download and use, and it is open-source under a permissive license.

ExLlamaV2 runs on Linux and Windows.

ExLlamaV2 runs on NVIDIA GPUs through CUDA. As a GPU-only library, it does not offer a CPU-only inference path.
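
To make the 24 GB figure concrete, here is a rough back-of-the-envelope sketch of how much VRAM the quantized weights of a 30B-parameter model occupy at different bits-per-weight settings. The function names, the 4 GiB headroom for KV cache and activations, and the sample bit widths are all illustrative assumptions, not values taken from ExLlamaV2 itself. Real usage also depends on context length and cache settings.

```python
# Rough VRAM estimate for quantized model weights.
# Everything here is an illustrative assumption, not an ExLlamaV2 API.

GIB = 2**30  # bytes per GiB


def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / GIB


def fits_on_card(
    params_billion: float,
    bits_per_weight: float,
    vram_gib: float = 24.0,      # assumed card size
    headroom_gib: float = 4.0,   # assumed allowance for KV cache, activations
) -> bool:
    """True if weights plus the assumed headroom fit in the given VRAM."""
    needed = weight_memory_gib(params_billion, bits_per_weight) + headroom_gib
    return needed <= vram_gib


if __name__ == "__main__":
    # Sample bit widths chosen for illustration; 16.0 stands in for FP16.
    for bpw in (3.0, 4.0, 5.0, 6.0, 16.0):
        size = weight_memory_gib(30, bpw)
        verdict = "fits" if fits_on_card(30, bpw) else "does not fit"
        print(f"30B at {bpw:>4} bpw: {size:5.1f} GiB of weights, {verdict} in 24 GiB")
```

Under these assumptions, a 30B model at around 4 bits per weight leaves room to spare on a 24 GB card, while the same model at 16 bits per weight is far too large, which is why quantization matters for single-card setups.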

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.