RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated

Local AI errors & fixes

60 common errors when running AI locally — with verified causes and solutions. Paste your error message into Google and you should land on the right page.

Solutions tagged "verified by owner" have been hit and fixed personally on the test hardware. Others are sourced from authoritative GitHub issue threads with citations.

Network / downloads (4) · CUDA / NVIDIA (4) · Configuration (11) · Driver issues (9) · Out of memory (8) · ROCm / AMD (4) · Model format / GGUF (3) · Tokenizer mismatches (6) · Build / compile failures (5) · Metal / Apple Silicon (5) · Quantization issues (1)

Network / downloads · 4 entries

Open WebUI: Failed to fetch from /ollama (cannot reach Ollama backend)

Failed to fetch from /ollama / WebUI says "Connection failed: Could not connect to Ollama"

verified

HuggingFace download is extremely slow or stalls

(downloads at 100 KB/s instead of saturating bandwidth)

HuggingFace: 401 Unauthorized / 403 Forbidden when downloading a gated model

401 Client Error: Unauthorized for url: https://huggingface.co/...

Ollama can't bind port 11434 — already in use

Error: listen tcp 127.0.0.1:11434: bind: address already in use
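For the port 11434 entry above, one quick way to confirm that something is already listening before starting a second server is a plain TCP probe. This is a minimal sketch, not taken from the linked write-up: the helper name `port_in_use` is hypothetical, and the host and port defaults simply mirror the address in the error message.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 11434, timeout: float = 0.5) -> bool:
    """Return True if something already accepts TCP connections on host:port.

    Hypothetical helper for illustration; the defaults mirror the address
    shown in the bind error above.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value on failure.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("127.0.0.1:11434 is taken; another process is probably already listening")
    else:
        print("127.0.0.1:11434 looks free")
```

A True result only says that some process is listening; it does not identify which one, so the usual next step is to check with the operating system's own tools.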

CUDA / NVIDIA · 4 entries

vLLM install picks the wrong CUDA wheel

ImportError: libcudart.so.12: cannot open shared object file (typical when the cu124 wheel of vLLM lands on a cu118-only...

verified

RuntimeError: CUDA error: device-side assert triggered

RuntimeError: CUDA error: device-side assert triggered

verified

CUDA runtime version doesn't match the installed driver

RuntimeError: The detected CUDA version (12.4) mismatches the version that was used to compile PyTorch (12.1). Please ma...

verified

PyTorch: CUDA error: no kernel image is available for execution on the device

RuntimeError: CUDA error: no kernel image is available for execution on the device

Configuration · 11 entries

Docker: could not select device driver "" with capabilities: [[gpu]]

could not select device driver "" with capabilities: [[gpu]]

verified

Ollama: Error: model 'X' not found

Error: model 'X' not found, try pulling it first

verified

Ollama: bind: address already in use (port 11434)

Error: listen tcp 127.0.0.1:11434: bind: address already in use

verified

Ollama: connection refused on localhost:11434

Error: connect ECONNREFUSED 127.0.0.1:11434

verified

Token generation slows as conversation gets longer

(no error — tok/s drops from 50 to 5 as context fills)

verified

Ollama truncates input — default context length is only 2048

(no error — long inputs get silently truncated)

verified

Slow tokens/sec on capable GPU (silent CPU fallback)

(no error — output is correct but tok/s is 5-10× slower than expected)

verified

Ollama: listen tcp 127.0.0.1:11434 bind: address already in use

Error: listen tcp 127.0.0.1:11434: bind: address already in use

verified

Very slow first token / OOM only at long prompts

(no error — TTFT goes from 200ms at 2K context to 30+ seconds at 64K context)

LM Studio generation much slower than expected

(no error — tok/s reads e.g. 4 tok/s on hardware that should do 40 tok/s)

Windows DirectML model runs on CPU instead of GPU

(no error — onnxruntime falls back to CPUExecutionProvider despite DirectML wheel installed)

Driver issues · 9 entries

WSL2: nvidia-smi works but PyTorch sees no CUDA / libcuda.so missing

OSError: libcuda.so.1: cannot open shared object file: No such file or directory

verified

PyTorch CUDA error: driver version is insufficient for CUDA runtime

RuntimeError: CUDA error: CUDA driver version is insufficient for CUDA runtime version

verified

WSL2: torch.cuda.is_available() returns False

torch.cuda.is_available() == False and "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" in...

verified

NCCL error: peer to peer not supported (multi-GPU)

NCCL error: unhandled system error / peer to peer access not supported between GPU{0} and GPU{1}

verified

CUDA driver version is insufficient for CUDA runtime version

CUDA driver version is insufficient for CUDA runtime version

WSL2 GPU not detected — nvidia-smi missing or empty

Command 'nvidia-smi' not found, or NVIDIA-SMI failed because it couldn't communicate with the NVIDIA driver

nvidia-smi: command not found

nvidia-smi: command not found

Docker container can't see GPU — nvidia-container-toolkit missing

could not select device driver "nvidia" with capabilities: [[gpu]]

Docker: could not select device driver "nvidia"

docker: Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]].

Out of memory · 8 entries

Process killed (OOM killer) when loading large model

Killed

verified

CUDA out of memory when loading a model

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB

verified

CUDA OOM that only happens at long context (KV cache blowup)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate

verified

vLLM AsyncEngineDeadError after large batch / OOM

AsyncEngineDeadError: Background loop has errored already

verified

Ollama: model requires more system memory than is available

Error: model requires more system memory than is available

SGLang: RadixAttention KV cache overflow / out of memory

RuntimeError: KV cache pool full (RadixAttention) — increase --mem-fraction-static or reduce --max-running-requests

Out of memory specifically at long context lengths

torch.cuda.OutOfMemoryError or 'cannot allocate KV cache' at >32K tokens

vLLM: No available KV cache blocks

RuntimeError: No available KV cache blocks
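Several entries in this category come down to the KV cache growing linearly with context length. A rough back-of-the-envelope estimate, assuming standard attention with one key and one value tensor per layer, can be sketched as below. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions, not measurements from the test hardware, and real runtimes add overhead on top.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_elem=2 assumes a 16-bit cache; quantized caches use less.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to show the scaling.
    for ctx in (2048, 32768, 65536):
        gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=ctx) / 2**30
        print(f"{ctx:>6} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so a model that loads comfortably at 2K context (0.25 GiB of cache) needs about 8 GiB for the cache alone at 64K, which is consistent with OOM errors that appear only on long prompts.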

ROCm / AMD · 4 entries

ROCm: HIP error: invalid device — no GPU detected

HIP error: invalid device function / hipErrorNoDevice

verified

ROCm: HIP error: invalid device function

HIP error: invalid device function

ROCm: hipErrorInvalidDeviceFunction on RX 7000-series

HIP error: invalid device function / hipErrorInvalidDeviceFunction (typical wording when HSA_OVERRIDE_GFX_VERSION is uns...

ROCm: HSA_STATUS_ERROR_INVALID_DEVICE — GPU not detected

HSA_STATUS_ERROR_INVALID_DEVICE or rocminfo shows no agents

Model format / GGUF · 3 entries

llama.cpp: error loading model — bad magic / unsupported GGUF

llama_model_load: error loading model: failed to load model 'X': bad magic / unsupported GGUF version

verified

llama.cpp: failed to mmap GGUF file

llama_model_load: error loading model: failed to open ... or mmap

verified

Failed to load model: GGUF version mismatch

llama_model_load: error loading model: this GGUF file is version X but llama.cpp supports up to version Y

verified

Tokenizer mismatches · 6 entries

GGUF model outputs garbage — tokenizer / chat-template mismatch

(no error — generation is fluent gibberish, repeats one token, or emits raw special tokens like <|im_start|>)

verified

Model produces gibberish or repeats one token forever

(no error — output is garbled like 'the the the' or random unicode)

verified

Quantized model produces garbage / never stops generating

(no error — output is incoherent, repeats, or generates until max tokens)

verified

Model loaded but tokenizer vocab size mismatch

Vocab size mismatch: model has X tokens, tokenizer has Y

TypeError: 'NoneType' object is not subscriptable in tokenizer

TypeError: 'NoneType' object is not subscriptable

OSError: Can't load tokenizer for ... / no file named tokenizer.json

OSError: Can't load tokenizer for '...'. If you were trying to load it from 'https://huggingface.co/models'

Build / compile failures · 5 entries

llama.cpp build fails: nvcc not found / CUDA toolkit missing

make: nvcc: No such file or directory

verified

llama.cpp build fails: nvcc not found

GGML_USE_CUDA defined but nvcc not found in PATH

llama.cpp CUDA build: unsupported GNU version! gcc versions later than X are not supported

error: unsupported GNU version! gcc versions later than 13 are not supported

exllamav2 ImportError: cannot import name 'ExLlamaV2' / undefined symbol

ImportError: cannot import name 'ExLlamaV2' from 'exllamav2'

flash-attn install fails on Windows / no precompiled wheel

ERROR: Could not build wheels for flash-attn

Metal / Apple Silicon · 5 entries

Apple Silicon: RuntimeError: MPS backend out of memory

RuntimeError: MPS backend out of memory (MPS allocated: ... GB, other allocations: ... GB, max allowed: ... GB)

verified

Metal Allocator: out of memory on Apple Silicon

[METAL] Metal Allocator: out of memory (Allocation size X exceeds available)

verified

MLX / Metal: command buffer execution failed

[MLX][ERROR] Metal command buffer execution failed

Metal allocation failed — Apple Silicon OOM under unified memory pressure

metal::MetalCommandQueue allocation failed or [MPS] OOM

MLX: Memory pressure detected — consider reducing batch size

Warning: Memory pressure detected. Consider reducing the batch size.

Quantization issues · 1 entry

Q2_K or Q3 quantized model produces nonsense

(no error — output is incoherent at Q2_K but fine at Q4_K_M)

Hit an error we don't have?

We add ~5 new errors per month based on what readers report.

Email support with the literal error message and what you tried. If it's a common one we'll write it up; if it's something only you hit, we'll often help directly.