
bitsandbytes CUDA error — fix the quant library after a driver or torch update

bitsandbytes silently breaks after PyTorch or NVIDIA driver updates. The fix is usually a reinstall with the right CUDA version, or switching to a prebuilt wheel. Here's the diagnostic order.

bitsandbytes · PyTorch · Hugging Face Transformers · QLoRA fine-tuning · NVIDIA CUDA
By Fredoline Eruo · Last verified 2026-05-09

Diagnostic order — most likely first

#1

bitsandbytes was installed before a PyTorch upgrade and now links against an old CUDA runtime

Diagnose

`python -c 'import bitsandbytes as bnb; print(bnb.__version__)'` succeeds. But `model = AutoModelForCausalLM.from_pretrained(..., load_in_8bit=True)` errors with 'CUDA Setup failed' or 'libbitsandbytes_cuda124.so not found.'
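Before reinstalling anything, it helps to confirm which CUDA runtime PyTorch was actually built against versus what bitsandbytes is trying to load. A minimal sketch (just the comparison logic, not an official bitsandbytes diagnostic; the exact `.so` name in the warning varies by CUDA version):

```python
# Minimal version-mismatch check: which CUDA runtime is torch built against,
# and which bitsandbytes build is installed alongside it?
import torch
import bitsandbytes as bnb

print("torch:", torch.__version__)                # e.g. 2.4.0+cu124
print("torch CUDA runtime:", torch.version.cuda)  # e.g. 12.4
print("bitsandbytes:", bnb.__version__)

# If torch.version.cuda says 12.4 but the import-time warning mentions an
# older libbitsandbytes_cudaXXX.so, bitsandbytes predates the torch upgrade.
```

Recent bitsandbytes releases also ship a built-in self-check, run with `python -m bitsandbytes`, which prints the CUDA setup it detected.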

Fix

Reinstall bitsandbytes against the current PyTorch + CUDA: `pip install --upgrade --force-reinstall bitsandbytes`. If that fails, try the prebuilt wheel: `pip install bitsandbytes --index-url https://jllllll.github.io/bitsandbytes-windows-webui` (Windows) or `pip install bitsandbytes==0.43.3` (pinned known-good version for CUDA 12.4).

#2

Windows: MSVC runtime not installed (bitsandbytes builds CUDA kernels that link against MSVC)

Diagnose

`pip install bitsandbytes` fails with 'error: Microsoft Visual C++ 14.0 or greater is required', or the import succeeds but `load_in_8bit=True` errors with a DLL-load failure. The compiler error appears when pip can't use a prebuilt wheel and falls back to building the CUDA kernels from source with MSVC; the DLL-load failure appears when the installed binaries can't find the MSVC runtime they link against.
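For the runtime case, a quick Windows-side check (a sketch; the DLL names are the standard MSVC runtime components, and a failure here points at the missing redistributable rather than at bitsandbytes itself):

```python
# Windows-only: can the MSVC runtime DLLs that native extensions link against
# be loaded at all? OSError here means the VC++ redistributable is missing.
import ctypes

for dll in ("vcruntime140.dll", "msvcp140.dll"):
    try:
        ctypes.WinDLL(dll)
        print(f"{dll}: found")
    except OSError:
        print(f"{dll}: missing")
```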

Fix

Install Visual Studio Build Tools 2022 with the 'Desktop development with C++' workload. Then rebuild: `pip install --upgrade --force-reinstall --no-cache-dir bitsandbytes`. If this is still painful, use the prebuilt Windows wheel from the jllllll GitHub repo instead.

#3

Compute capability of the GPU isn't in bitsandbytes' compiled kernel list

Diagnose

`nvidia-smi --query-gpu=compute_cap --format=csv,noheader` shows e.g. `8.9` (RTX 4090). But bitsandbytes was compiled without sm_89 in its target list. The import errors with 'no kernel image is available for execution on the device' or the 8-bit optimizer silently runs at FP32 speed.
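You can get the same number from Python without nvidia-smi; `torch.cuda.get_device_capability` returns the (major, minor) pair that maps onto the sm_XY names:

```python
import torch

# (8, 9) -> sm_89 (Ada), (8, 6) -> sm_86 (Ampere), (12, 0) -> sm_120 (Blackwell GeForce)
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
```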

Fix

Set `BNB_CUDA_VERSION=124` and `CUDA_VERSION=124` environment variables, then reinstall from source: `pip install bitsandbytes --no-cache-dir --no-build-isolation`. For Blackwell GeForce cards (RTX 5090, sm_120): ensure bitsandbytes ≥ 0.44.0, which added Blackwell kernel support. For older cards (Pascal sm_61): bitsandbytes ≥ 0.39.0 is required.
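Note that `BNB_CUDA_VERSION` is read when bitsandbytes is first imported, so it must be set before the import (or exported in the shell), not after. A sketch of the runtime override, assuming a CUDA 12.4 PyTorch build:

```python
import os

# Must run before the first `import bitsandbytes`. "124" should match
# torch.version.cuda with the dot removed (12.4 -> "124").
os.environ["BNB_CUDA_VERSION"] = "124"

import bitsandbytes as bnb
print(bnb.__version__)  # the import now resolves against the CUDA 12.4 binary, if one ships
```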

#4

Out of VRAM during 4-bit model loading (fits in FP16 but quant loading path uses extra VRAM)

Diagnose

FP16 model loads fine in PyTorch (fits in VRAM). Loading with `load_in_4bit=True` OOMs. The 4-bit quant path creates temporary FP16 buffers during conversion that consume additional VRAM before the quantized tensors are finalized.
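To confirm it really is transient conversion overhead rather than the final footprint, compare peak against steady-state allocation during the load. A rough sketch (the model name is a placeholder; substitute whatever checkpoint is OOMing for you):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder checkpoint
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
GiB = 1024 ** 3
print(f"steady-state: {torch.cuda.memory_allocated() / GiB:.1f} GiB")
print(f"peak during load: {torch.cuda.max_memory_allocated() / GiB:.1f} GiB")
```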

Fix

Pre-quantize the model offline with `bnb`'s quantization API and save as a 4-bit checkpoint. Then load directly: `model = AutoModelForCausalLM.from_pretrained('./path/to/4bit-model')`. This avoids the in-memory conversion step. Or lower `max_memory` in the device map to reserve headroom: `device_map='auto', max_memory={0: '20GiB'}` (on a 24 GB card) leaves 4 GB for the conversion buffer.
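A sketch of the pre-quantize-and-save workflow (model name and paths are placeholders; serializing 4-bit bitsandbytes checkpoints via `save_pretrained` needs a reasonably recent transformers/bitsandbytes pair):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# One-time conversion -- do this where you have the headroom.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder checkpoint
    quantization_config=quant_cfg,
    device_map="auto",
)
model.save_pretrained("./mistral-7b-nf4")  # placeholder path

# Later, on the VRAM-tight card: loads the already-quantized weights directly.
model = AutoModelForCausalLM.from_pretrained("./mistral-7b-nf4", device_map="auto")
```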

#5

bitsandbytes conflicts with another quantization library in the same env

Diagnose

`pip list | grep -Ei 'bitsandbytes|gptq|awq'` shows both `bitsandbytes` AND `auto-gptq` (or another GPTQ/AWQ backend) installed. The two libraries ship overlapping CUDA kernels and can collide on the import of `cuda_setup`.
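A quick environment inventory from Python, using importlib.metadata instead of grep (the package list here is just the usual suspects, not exhaustive):

```python
from importlib.metadata import version, PackageNotFoundError

for pkg in ("bitsandbytes", "auto-gptq", "gptqmodel", "autoawq", "optimum"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```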

Fix

Install in separate virtual environments. `python -m venv .venv-bnb && source .venv-bnb/bin/activate && pip install bitsandbytes ...`. Don't mix bitsandbytes with GPTQ/AWQ in the same env — the CUDA runtime path conflicts are real and hard to debug.

Frequently asked questions

Is bitsandbytes still needed if I'm not doing QLoRA fine-tuning?

For pure inference, no. Most runtimes (vLLM, llama.cpp, ExLlamaV2) have their own quantization engines that are faster and more stable than bitsandbytes' inference path. bitsandbytes' value is in QLoRA fine-tuning: 4-bit NF4 base weights plus its 8-bit and paged optimizers, which is what makes fine-tuning a ~65B model on a single 48 GB GPU feasible. For inference-only, use the runtime's native quant.

Windows + bitsandbytes = always broken? What actually works in 2026?

The official bitsandbytes Windows support is fragile. Two paths that actually work: (1) Use WSL2 + Ubuntu — install bitsandbytes inside WSL where it behaves identically to native Linux (same CUDA path, no MSVC dependency). (2) Use the prebuilt Windows wheels from github.com/jllllll/bitsandbytes-windows-webui which are tested against common combos. Native `pip install bitsandbytes` on Windows works for some combos but fails silently for many.

What's the difference between load_in_8bit and load_in_4bit in practice?

8-bit: ~50% VRAM reduction vs FP16, near-lossless quality. Good for inference + continued pre-training. Set `load_in_8bit=True, llm_int8_threshold=6.0`. 4-bit: ~75% VRAM reduction vs FP16, minimal quality loss with NF4 dtype. `load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4'`. For VRAM-tight QLoRA fine-tuning, 4-bit is the standard. For higher-quality inference, 8-bit or EXL2 5.0 bpw.
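Written out as `BitsAndBytesConfig` objects (transformers now prefers an explicit `quantization_config` over the bare `load_in_*` kwargs), a sketch of the two setups described above:

```python
import torch
from transformers import BitsAndBytesConfig

int8_cfg = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,           # outlier threshold mentioned above
)

nf4_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # optional, shaves a little more VRAM
)

# Pass either one to from_pretrained(..., quantization_config=...).
```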

Why does HuggingFace's documentation use bitsandbytes everywhere if it's finicky?

Because HuggingFace's `transformers` library integrated bitsandbytes as the default quantization backend for `from_pretrained(load_in_4bit=True)`. It's the path of least resistance in the HF ecosystem despite being finicky on non-Linux platforms. Community preference is shifting toward EXL2 (for GPU-only) and GGUF (for cross-platform) for inference, and bitsandbytes for QLoRA training specifically.

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: