60 common errors when running AI locally — with verified causes and solutions. Paste your error message into Google and you should land on the right page.
Failed to fetch from /ollama / WebUI says "Connection failed: Could not connect to Ollama"
(downloads at 100 KB/s instead of saturating bandwidth)
401 Client Error: Unauthorized for url: https://huggingface.co/...
Error: listen tcp 127.0.0.1:11434: bind: address already in use
ImportError: libcudart.so.12: cannot open shared object file (typical when the cu124 wheel of vLLM lands on a cu118-only...
RuntimeError: CUDA error: device-side assert triggered
RuntimeError: The detected CUDA version (12.4) mismatches the version that was used to compile PyTorch (12.1). Please ma...
RuntimeError: CUDA error: no kernel image is available for execution on the device
could not select device driver "" with capabilities: [[gpu]]
Error: model 'X' not found, try pulling it first
Error: connect ECONNREFUSED 127.0.0.1:11434
(no error — tok/s drops from 50 to 5 as context fills)
(no error — long inputs get silently truncated)
(no error — output is correct but tok/s is 5-10× slower than expected)
(no error — TTFT goes from 200ms at 2K context to 30+ seconds at 64K context)
(no error — tok/s reads e.g. 4 tok/s on hardware that should do 40 tok/s)
(no error — onnxruntime falls back to CPUExecutionProvider despite DirectML wheel installed)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory
RuntimeError: CUDA error: CUDA driver version is insufficient for CUDA runtime version
torch.cuda.is_available() == False and "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" in...
NCCL error: unhandled system error / peer to peer access not supported between GPU{0} and GPU{1}
CUDA driver version is insufficient for CUDA runtime version
Command 'nvidia-smi' not found, or NVIDIA-SMI failed because it couldn't communicate with the NVIDIA driver
nvidia-smi: command not found
could not select device driver "nvidia" with capabilities: [[gpu]]
docker: Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]].
Killed
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate
AsyncEngineDeadError: Background loop has errored already
Error: model requires more system memory than is available
RuntimeError: KV cache pool full (RadixAttention) — increase --mem-fraction-static or reduce --max-running-requests
torch.cuda.OutOfMemoryError or 'cannot allocate KV cache' at >32K tokens
RuntimeError: No available KV cache blocks
HIP error: invalid device function / hipErrorNoDevice
HIP error: invalid device function
HIP error: invalid device function / hipErrorInvalidDeviceFunction (typical wording when HSA_OVERRIDE_GFX_VERSION is uns...
HSA_STATUS_ERROR_INVALID_DEVICE or rocminfo shows no agents
llama_model_load: error loading model: failed to load model 'X': bad magic / unsupported GGUF version
llama_model_load: error loading model: failed to open ... or mmap
llama_model_load: error loading model: this GGUF file is version X but llama.cpp supports up to version Y
(no error — generation is fluent gibberish, repeats one token, or emits raw special tokens like <|im_start|>)
(no error — output is garbled like 'the the the' or random unicode)
(no error — output is incoherent, repeats, or generates until max tokens)
Vocab size mismatch: model has X tokens, tokenizer has Y
TypeError: 'NoneType' object is not subscriptable
OSError: Can't load tokenizer for '...'. If you were trying to load it from 'https://huggingface.co/models'
make: nvcc: No such file or directory
GGML_USE_CUDA defined but nvcc not found in PATH
error: unsupported GNU version! gcc versions later than 13 are not supported
ImportError: cannot import name 'ExLlamaV2' from 'exllamav2'
ERROR: Could not build wheels for flash-attn
RuntimeError: MPS backend out of memory (MPS allocated: ... GB, other allocations: ... GB, max allowed: ... GB)
[METAL] Metal Allocator: out of memory (Allocation size X exceeds available)
[MLX][ERROR] Metal command buffer execution failed
metal::MetalCommandQueue allocation failed or [MPS] OOM
Warning: Memory pressure detected. Consider reducing the batch size.
We add ~5 new errors per month based on what readers report.
Contact support by email with the exact error message and what you tried. If it's a common one, we'll write it up; if it's something only you hit, we'll often help directly.