Local AI errors & fixes

21 common errors when running AI locally — with verified causes and solutions. Paste your error message into Google and you should land on the right page.

Solutions tagged "verified by owner" have been hit and fixed personally on the test hardware. Others are sourced from authoritative GitHub issue threads with citations.

Configuration (5 entries)

Token generation slows as conversation gets longer

(no error — tok/s drops from 50 to 5 as context fills)

verified

Ollama: bind: address already in use (port 11434)

Error: listen tcp 127.0.0.1:11434: bind: address already in use

verified
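
Before starting a second server, it can help to confirm that something is already listening on the port. The sketch below simply attempts a TCP connection. It is not part of Ollama: `port_in_use` is a hypothetical helper name, and the host and port defaults are taken from the error message above.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 11434, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("Port 11434 is taken; a server is probably already running.")
    else:
        print("Port 11434 looks free.")
```

If the port is taken, the usual cause is an instance that is already running, so a second start is not needed.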

Ollama truncates input — default context length is only 2048

(no error — long inputs get silently truncated)

verified
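
Ollama's REST API accepts a per-request `num_ctx` value inside `options`, which is one way to raise the context window for a single call. The sketch below only builds the request body. The model name `llama3`, the value 8192, and the helper name `build_generate_payload` are placeholders chosen for illustration; pick a window your hardware can actually hold.

```python
import json


def build_generate_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a body for POST /api/generate with an explicit context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # num_ctx overrides the default context length for this request only.
        "options": {"num_ctx": num_ctx},
    }


payload = build_generate_payload("llama3", "Summarize this document: ...")
print(json.dumps(payload, indent=2))
```

The resulting JSON would be sent to `http://localhost:11434/api/generate`. A larger window also needs more memory, so increase it gradually.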

Ollama: connection refused on localhost:11434

Error: connect ECONNREFUSED 127.0.0.1:11434

verified

Ollama: Error: model 'X' not found

Error: model 'X' not found, try pulling it first

verified

Model format / GGUF (1 entry)

llama.cpp: failed to mmap GGUF file

llama_model_load: error loading model: failed to open ... or mmap

verified

Tokenizer mismatches (2 entries)

Model produces gibberish or repeats one token forever

(no error — output is garbled like 'the the the' or random unicode)

verified

Model loaded but tokenizer vocab size mismatch

Vocab size mismatch: model has X tokens, tokenizer has Y

Out of memory (5 entries)

Process killed (OOM killer) when loading large model

Killed

verified

CUDA out of memory when loading a model

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB

verified
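
A rough pre-check for this error is to estimate how much memory the weights alone need: parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement. It ignores activations, the KV cache, and framework overhead, and the 7-billion-parameter figure is an illustrative assumption, as is the helper name `weights_gib`.

```python
def weights_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Illustrative 7B-parameter model at two precisions (assumed values).
for label, bytes_per_param in (("fp16", 2.0), ("4-bit (approx.)", 0.5)):
    print(f"{label}: {weights_gib(7e9, bytes_per_param):.2f} GiB")
```

If the estimate is already close to the GPU's total memory, the load is likely to fail once runtime overhead is added.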

vLLM: No available KV cache blocks

RuntimeError: No available KV cache blocks

Ollama: model requires more system memory than is available

Error: model requires more system memory than is available

Out of memory specifically at long context lengths

torch.cuda.OutOfMemoryError or 'cannot allocate KV cache' at >32K tokens
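
The reason long contexts run out of memory even when the model itself fits is that the KV cache grows linearly with sequence length. A common approximation is 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The sketch below applies that formula. The configuration (32 layers, 8 KV heads, head dimension 128, fp16) is an illustrative assumption rather than the spec of any particular model, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,  # 2 bytes for fp16/bf16
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # The leading 2 accounts for one tensor of keys and one of values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch_size


# Assumed example configuration at a 32K-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)
print(f"KV cache: {size / 2**30:.1f} GiB")  # prints "KV cache: 4.0 GiB"
```

Halving the context length or quantizing the cache to fewer bytes per element reduces this figure proportionally.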

ROCm / AMD (1 entry)

ROCm: HIP error: invalid device — no GPU detected

HIP error: invalid device function / hipErrorNoDevice

verified

Network / downloads (2 entries)

HuggingFace download is extremely slow or stalls

(downloads at 100 KB/s instead of saturating bandwidth)

HuggingFace: 401 Unauthorized or 403 Forbidden when downloading a gated model

401 Client Error: Unauthorized for url: https://huggingface.co/...
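
Gated repositories require an authenticated request, sent as a bearer token in the `Authorization` header. The sketch below shows one way to build that header from the `HF_TOKEN` environment variable. The helper name `auth_headers` and the token value in the demo are hypothetical. A valid token is not enough by itself: the account also needs to have been granted access on the model page.

```python
import os
from typing import Optional


def auth_headers(token: Optional[str] = None) -> dict:
    """Return HTTP headers carrying a Hugging Face access token."""
    # Fall back to the HF_TOKEN environment variable when no token is passed.
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("No token found: set HF_TOKEN or pass one explicitly.")
    return {"Authorization": f"Bearer {token}"}


# Demo with a made-up token; never hard-code a real one.
print(auth_headers("hf_example_token"))
```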

Driver issues (2 entries)

CUDA driver version is insufficient for CUDA runtime version

nvidia-smi: command not found

Quantization issues (1 entry)

Q2_K or Q3 quantized model produces nonsense

(no error — output is incoherent at Q2_K but fine at Q4_K_M)

Build / compile failures (1 entry)

llama.cpp build fails: nvcc not found

GGML_USE_CUDA defined but nvcc not found in PATH
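
This error means the build system could not locate the CUDA compiler on the search path. A quick diagnostic, sketched below, is to ask the standard library where a tool resolves. `find_tool` is a hypothetical helper name, and checking `nvidia-smi` alongside `nvcc` is an added convenience, since the driver and the toolkit are installed separately.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


for tool in ("nvcc", "nvidia-smi"):
    path = find_tool(tool)
    print(f"{tool}: {path if path else 'not found in PATH'}")
```

If `nvcc` is installed but not found, the CUDA toolkit's `bin` directory is probably missing from `PATH`.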

Metal / Apple Silicon (1 entry)

MLX / Metal: command buffer execution failed

[MLX][ERROR] Metal command buffer execution failed

Hit an error we don't have?

We add ~5 new errors per month based on what readers report.

Email hello@runlocalai.co with the literal error message and what you tried. If it's a common one we'll write it up; if it's something only you hit, we'll often help directly.