Diagnostics · 43 guides · Editorial

Local AI troubleshooting

Fix the most common local AI errors: CUDA out of memory, Ollama running on CPU, ROCm not detected, models crashing mid-inference. Operator-grade diagnostics, real fixes, no copy-paste-from-Reddit guesses.

Most common errors

fatal

CUDA out of memory

Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).

NVIDIA CUDA · PyTorch · vLLM · ComfyUI · Ollama
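Before loading anything, a back-of-envelope check answers "does this model fit at all?": weights at a given quant plus a flat overhead guess. The bytes-per-weight figures and the 1.5 GB overhead below are rough assumptions, not measured values.

```python
# Rough VRAM estimate for a quantized LLM: weights + a flat overhead guess.
# Rule of thumb only: real usage adds KV cache, activations, runtime buffers.

BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "q8_0": 1.0,     # ~8 bits/weight
    "q4_k_m": 0.56,  # ~4.5 bits/weight, typical for GGUF Q4_K_M
}

def fits_in_vram(params_b: float, quant: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """True if a params_b-billion-parameter model at `quant` plausibly fits."""
    weights_gb = params_b * BYTES_PER_WEIGHT[quant]
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(7, "q4_k_m", 8))   # 7B at Q4_K_M in 8 GB: plausible
print(fits_in_vram(13, "fp16", 12))   # 13B at fp16 in 12 GB: no chance
```

If this check fails, no amount of flag-tweaking will save you; pick a smaller quant or a smaller model.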
degrades

Ollama is slow / running on CPU instead of GPU

Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.

Ollama · NVIDIA CUDA · AMD ROCm · Apple Silicon Metal
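One way to confirm the fallback programmatically: query Ollama's `/api/ps` endpoint, which (per the Ollama API docs) reports `size` and `size_vram` for each loaded model. A sketch, assuming those field names:

```python
import json
import urllib.request

def gpu_fraction(model: dict) -> float:
    """Fraction of a loaded model resident in VRAM, from an Ollama /api/ps
    entry. Assumes the entry carries 'size' (total bytes) and 'size_vram',
    as documented for the endpoint. 1.0 = fully on GPU, 0.0 = CPU fallback."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0

def check_ollama(host: str = "http://localhost:11434") -> None:
    """Print the GPU residency of every model the local Ollama has loaded."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        for m in json.load(resp).get("models", []):
            print(f"{m.get('name')}: {gpu_fraction(m):.0%} on GPU")
```

Anything well under 100% means part of the model is paged to system RAM, and token speed falls off a cliff.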
fatal

ROCm not detected / AMD GPU not found

ROCm is finicky on consumer AMD GPUs in 2026. Here's the install order, the gfx-version override that fixes 80% of detection failures, and when to give up and use Vulkan.

AMD ROCm · PyTorch ROCm · llama.cpp ROCm · Linux · Windows WSL
fatal

WSL2 cannot see GPU / nvidia-smi fails inside WSL

WSL2 doesn't pass the GPU through unless the host driver is right and the kernel is current. Here's the install order that actually works in 2026, and how to confirm passthrough is live before you waste an afternoon.

WSL2 · Ubuntu on WSL · NVIDIA driver · Docker Desktop on Windows
fatal

Docker container cannot access GPU / `--gpus all` fails

Docker doesn't expose the host GPU by default. The NVIDIA Container Toolkit is the bridge. Here's the install + the runtime config + the four common symptoms that mean it's misconfigured.

Docker · NVIDIA Container Toolkit · Linux · WSL2 + Docker Desktop · Kubernetes
fatal

vLLM: CUDA version mismatch / 'no kernel image is available for execution'

vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.

vLLM · PyTorch · CUDA Toolkit · NVIDIA driver
fatal

llama.cpp Metal: GGML_ASSERT / mtl_buffer crash on macOS

Most Metal crashes in llama.cpp on Apple Silicon trace to a too-aggressive context size, an old GGUF format, or a model with tensor shapes Metal has no kernel for. Diagnostic + fix order.

llama.cpp · Apple Silicon · Metal · Ollama on Mac · LM Studio on Mac
degrades

Ollama: 'address already in use' / port 11434 conflict

Ollama defaults to port 11434. When something else is on that port — often a previous Ollama process, Docker container, or another LLM server — startup fails. Here's how to find the squatter and reclaim the port.

Ollama · Docker · LM Studio · macOS · Linux · Windows
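A minimal bind probe settles "is something squatting on 11434?" before you go hunting. Sketch with the standard library only; the `lsof` hint mirrors the usual next step.

```python
import socket

def port_free(port: int, host: str = "127.0.0.1") -> bool:
    """True if we can bind `port`, i.e. nothing is squatting on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

if not port_free(11434):
    print("Port 11434 is taken. Find the squatter with `lsof -i :11434` "
          "(macOS/Linux) or `netstat -ano` (Windows).")
```

Common squatters in practice: a stale `ollama serve`, a Docker container publishing 11434, or a second LLM server configured to the same port.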
degrades

GGUF tokenizer mismatch / 'tokenizer model not found'

When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.

llama.cpp · Ollama · LM Studio · Hugging Face GGUF conversions
degrades

FlashAttention: 'kernel not supported' / not available on this GPU

FlashAttention 2 / 3 require specific compute capabilities. Older GPUs and consumer Pascal/Turing cards don't support it. Here's the support matrix and the runtime fallbacks.

FlashAttention 2 · FlashAttention 3 · vLLM · PyTorch · Transformers
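The gate reduces to one comparison: the flash-attn project states FlashAttention 2 needs Ampere or newer, i.e. compute capability 8.0+. A one-line check, with the live-system call left as a comment:

```python
def flash_attn2_supported(cc_major: int, cc_minor: int) -> bool:
    """FlashAttention 2 needs compute capability 8.0+ (Ampere or newer),
    per the flash-attn project's stated requirements. Turing (7.5) and
    Pascal (6.x) must fall back to standard attention / SDPA."""
    return (cc_major, cc_minor) >= (8, 0)

# On a live system: major, minor = torch.cuda.get_device_capability()
print(flash_attn2_supported(8, 6))  # RTX 3090/4070-class Ampere/Ada -> True
print(flash_attn2_supported(7, 5))  # RTX 2080-class Turing -> False
```

On unsupported cards, PyTorch's built-in SDPA is the usual fallback; it is slower but correct.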
fatal

torch.cuda.is_available() returns False

PyTorch falsely reporting no CUDA is the most common Python ML setup failure. The cause is almost always: wrong PyTorch wheel for your CUDA version, or a CPU-only build accidentally installed.

PyTorch · Hugging Face Transformers · vLLM · any CUDA-using Python lib
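The fastest tell is the wheel's local-version suffix: PyTorch builds encode their target after the `+`, so `2.3.1+cpu` is the CPU-only culprit and `2.3.1+cu121` is a CUDA 12.1 build. A parsing sketch (the suffix convention is PyTorch's; the classifier is ours):

```python
def wheel_flavor(torch_version: str) -> str:
    """Classify a torch version string by its local-version suffix:
    '+cu121' = CUDA build, '+rocm6.0' = ROCm build, '+cpu' = CPU-only
    (the usual cause of torch.cuda.is_available() == False)."""
    _, _, local = torch_version.partition("+")
    if local.startswith("cu"):
        return f"CUDA {local[2:4]}.{local[4:]}"
    if local.startswith("rocm"):
        return f"ROCm {local[4:]}"
    if local == "cpu":
        return "CPU-only"
    return "unknown/default"

# On a live system: print(wheel_flavor(torch.__version__))
print(wheel_flavor("2.3.1+cpu"))    # -> CPU-only
print(wheel_flavor("2.3.1+cu121"))  # -> CUDA 12.1
```

If it says CPU-only, reinstall from the PyTorch index URL that matches your CUDA version; no amount of driver fixing will help.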
fatal

ROCm: HSA_STATUS_ERROR / HIP runtime errors during inference

HSA / HIP errors mid-inference on AMD GPUs usually trace to thermal limits, kernel-driver mismatch, or known-bad memory modes on consumer cards. Here's the diagnostic order.

ROCm · PyTorch ROCm · llama.cpp HIP backend · AMD RDNA 2/3
fatal

SGLang: server hangs / requests time out

When an SGLang server hangs on startup or stops responding mid-load, the cause is usually request-batching saturation, a mis-sized KV cache, a scheduler deadlock, or a runtime-CUDA mismatch. Here's the diagnostic order.

SGLang · vLLM · FlashInfer · Triton kernels
degrades

ComfyUI stuck on 'loading' / first run never completes

ComfyUI hanging on first launch is usually a custom-node conflict, model-file corruption, or a Python env collision with A1111. Bisect via --disable-all-custom-nodes and you'll catch 80% of cases in 30 seconds.

ComfyUI · PyTorch · NVIDIA CUDA · Apple Silicon Metal · Python venv
degrades

PyTorch MPS falling back to CPU on Apple Silicon

PyTorch on Apple Silicon silently falls back to CPU when an op isn't supported by MPS. Set PYTORCH_ENABLE_MPS_FALLBACK=1 to make it audible, then fix the actual op (cast dtype, disable flash-attention, lower batch).

PyTorch · Apple Silicon (M1-M4) · Metal · ComfyUI on Mac · Hugging Face Transformers
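The env-var step as a sketch. Setting the variable before `import torch` is the commonly recommended ordering; treat that ordering as an assumption rather than a documented guarantee.

```python
import os

# Opt in to CPU fallback for ops MPS doesn't implement. Place this at the
# very top of the entry script, before `import torch` (the commonly
# recommended ordering), so torch sees it.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

# import torch  # only after the env var is in place
# device = "mps" if torch.backends.mps.is_available() else "cpu"
```

Fallback makes the unsupported op slow but visible: the runtime warning names the offending op, which tells you what to actually fix (cast the dtype, disable flash-attention, lower the batch).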
fatal

llama.cpp build failed (CUDA / Metal / Vulkan flags rejected)

Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.

llama.cpp · CMake · CUDA Toolkit · Metal · Vulkan SDK · ROCm
fatal

WSL2 OOM-killer killing inference / 'Killed' message

WSL2 inherits a fraction of host RAM by default and won't let processes exceed it. Edit .wslconfig to set `memory=32GB` (or whatever you need) and restart WSL. Then verify with `free -h` inside the distro.

WSL2 · Ubuntu on WSL · Docker Desktop on Windows · Linux memory subsystem
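The `.wslconfig` edit above, as a minimal file. The values are illustrative; size them to your host.

```ini
; %UserProfile%\.wslconfig on the Windows host (not inside the distro)
[wsl2]
memory=32GB   ; cap WSL2 RAM; the default is a fraction of host RAM
swap=8GB     ; optional: gives the kernel headroom before the OOM-killer fires
```

Apply with `wsl --shutdown` from PowerShell, then confirm the new ceiling with `free -h` inside the distro.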
fatal

NVIDIA driver / CUDA toolkit version mismatch

When PyTorch / vLLM / a CUDA app errors on 'CUDA driver version is insufficient' or 'no kernel image,' the host driver is too old (or sometimes too new) for the installed toolkit. Read nvidia-smi's max-CUDA, match it.

NVIDIA driver · CUDA Toolkit · PyTorch · vLLM · any CUDA-using library
fatal

Windows: CUDA not found / 'Could not load nvcuda.dll'

Windows CUDA loading errors trace to a driver-vs-toolkit version skew, a PATH that doesn't include CUDA bin, or a CPU-only PyTorch wheel. Check nvidia-smi first, then the wheel suffix, then PATH.

Windows · PyTorch on Windows · vLLM on Windows · ComfyUI Windows portable · any CUDA app
degrades

llama.cpp running too slow / CPU-bound on supposedly-GPU build

If llama.cpp tok/s is 5-10x lower than expected on your GPU, the build probably defaulted to CPU, the model is partially CPU-offloaded, or flash-attention isn't enabled. Diagnose in 60 seconds with --verbose.

llama.cpp · NVIDIA CUDA · AMD ROCm · Apple Metal · Vulkan backend
fatal

MLX: out of memory / 'Failed to allocate memory'

MLX OOM on Apple Silicon traces to wrong-size model for unified memory, missing wired-memory limit, or memory pressure from other apps. macOS reserves 25-30% for system; the rest is your AI budget.

MLX · MLX-LM · Apple Silicon (M1-M4) · macOS
fatal

Ollama: 'model not found' / 'pull manifest unknown' errors

Ollama 'model not found' errors trace to typos in the model name, pulling a model that doesn't exist in the official registry, network blocks on the registry, or pulling from a custom registry without auth.

Ollama · Ollama Hub registry · custom Ollama registries
fatal

HuggingFace download failed / 401 / rate-limit / network error

HuggingFace download errors split into auth (gated model, no token), rate-limit (anonymous traffic capped), or network (corporate proxy, country block). Diagnose by HTTP status code, fix per cause.

huggingface_hub Python · huggingface-cli · diffusers · transformers · any HF-pulling tool
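The status-code split above as a lookup. The bucketing is our heuristic; the `huggingface-cli login` command is the real auth step.

```python
def hf_triage(status: int) -> str:
    """Map an HTTP status from a Hugging Face download to the likely fix.
    Heuristic bucketing matching the auth / rate-limit / network split."""
    if status == 401:
        return "auth: missing or invalid token; run `huggingface-cli login`"
    if status == 403:
        return "auth: gated model; accept the license on the model page"
    if status == 429:
        return "rate limit: authenticate, or back off and retry"
    if status == 404:
        return "wrong repo id or filename"
    if status >= 500:
        return "server side; retry later"
    return "check proxy / network path"

print(hf_triage(429))  # the anonymous rate-limit case
```

Anything not covered by a status code (connection reset, TLS failure) is the network bucket: corporate proxy or country block.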
fatal

Tensor parallelism: NCCL crash / 'unable to allocate' / 'distributed init failed'

Multi-GPU tensor-parallel crashes trace to NCCL backend issues (PCIe topology, missing peer access), insufficient GPU pair memory, or tensor-parallel-size not matching GPU count. Diagnose with NCCL_DEBUG=INFO.

vLLM · ExLlamaV2 · TensorRT-LLM · DeepSpeed · PyTorch DDP · NCCL
fatal

ExLlamaV2: model not loading / 'Could not find model index' / cache OOM

ExLlamaV2 load failures trace to the wrong model format (it needs EXL2 or EXL3, not GGUF), insufficient cache for the context size, or a driver/runtime version mismatch. An EXL-format quant is non-negotiable.

ExLlamaV2 · TabbyAPI · Text-Generation-WebUI ExLlama loader
warning

Quantized model: noticeable quality loss / repetition / coherence drop

Output quality drop after quantization usually means the bpw is too aggressive, KV cache quantization is too low, or the calibration data didn't match the model. Q4_K_M is the safe floor; below that needs care.

llama.cpp GGUF · ExLlamaV2 EXL2 · AWQ · GPTQ · any quantized inference
fatal

Tokenizer mismatch / 'Unknown token' / 'Token ID out of range'

Tokenizer errors usually mean the loaded tokenizer doesn't match the model weights, the chat template is wrong, or special tokens (BOS/EOS) weren't preserved through quantization. Verify tokenizer config first.

Hugging Face Transformers · vLLM · llama.cpp · Ollama · any tokenizer-using lib
fatal

CUDA driver too old / 'CUDA driver version is insufficient'

If PyTorch / vLLM / CUDA app errors with 'driver version insufficient,' your NVIDIA driver predates the CUDA runtime. Driver 555+ supports CUDA 12.4 (the 2026 standard). Update via nvidia.com or distro.

NVIDIA driver · PyTorch · vLLM · TensorRT-LLM · any CUDA app
fatal

Python: wheel build failed / 'Failed building wheel for X'

Wheel build failures in pip install almost always trace to: missing compiler (gcc / MSVC), missing system headers (Python.h, CUDA), or a Rust-based package without the Rust toolchain. Fix compiler first, then verify wheel availability.

pip install · flash-attn · vllm · exllamav2 · tokenizers · any pip package needing build
fatal

safetensors: 'header validation failed' / 'invalid format'

Safetensors header errors mean the file is corrupted, partially downloaded, or isn't actually a safetensors file. Check file size against the repo, re-download if mismatch, fall back to checked download tools.

safetensors · Hugging Face Transformers · diffusers · any safetensors-loading tool
fatal

Model keeps crashing / segfault during inference

Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.

NVIDIA CUDA · AMD ROCm · llama.cpp · vLLM · Ollama
fatal

Windows LLM install failed / Python CUDA not found

Why first-time Windows AI installs fail, how to fix each link in the driver-CUDA-Python chain, and the specific download links that actually work.

Windows · PyTorch · CUDA Toolkit · Ollama · vLLM
degrades

Model loads but generation is slow / tok/s far below expectation

When the model loads (no OOM) but token generation is far below expected speeds, the bottleneck is usually VRAM paging, KV cache overcommit, or GPU contention. Here's how to diagnose and fix each.

NVIDIA CUDA · AMD ROCm · llama.cpp · Ollama · vLLM · LM Studio
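KV cache overcommit is the easiest of the three to rule out by arithmetic: per sequence it costs 2 (K and V) x layers x KV heads x head dim x context x element size. A sketch; the Llama-3-8B-style shape in the example (32 layers, 8 KV heads, head dim 128) is an assumption for illustration.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache size in GB. bytes_per_elem=2 is fp16;
    quantized KV caches (q8/q4) shrink this proportionally."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# Llama-3-8B-style GQA shape at an 8k context:
print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # -> 1.07 (GB)
```

Multiply by concurrent sequences for batched servers; if weights plus total KV cache crowd your VRAM, the runtime starts paging and tok/s collapses without any OOM error.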
degrades

Token generation too slow / low throughput across runtimes

Slow token generation across multiple runtimes (not specific to Ollama or vLLM) means a system-level bottleneck: GPU underutilization, missing flash-attention, wrong thread count, thermal throttle, or VRAM paging.

llama.cpp · Ollama · vLLM · Transformers · ExLlamaV2 · LM Studio
fatal

ComfyUI CUDA out of memory

ComfyUI-specific CUDA OOM: what triggers it (loaded checkpoints, IPAdapter/ControlNet overhead, missing --lowvram), how to fix it, and the ComfyUI settings that matter.

ComfyUI · NVIDIA CUDA · Stable Diffusion · Flux
fatal

vLLM worker crashed / vLLM scheduler crash

vLLM worker/scheduler crashes: KV cache fraction misconfiguration, max-model-len exceeding VRAM, worker timeouts, NCCL failures, and quant incompatibility. The exact fix order that production operators use.

vLLM · NVIDIA CUDA · Python
fatal

TensorRT-LLM build failed / TensorRT-LLM compilation failed

TensorRT-LLM compilation/build failures: missing CUDA arch flag, version mismatches, Python wheel OOM, and NVCC compute capability issues. Honest advice: for most users, vLLM is the saner path.

TensorRT-LLM · NVIDIA CUDA · Windows · Linux
degrades

ONNX Runtime falls back to CPU / ONNX Runtime GPU not used

ONNX Runtime silently falls back to CPU even with a GPU present. Fix the provider registration, package choice, and model export to get GPU inference working.

ONNX Runtime · NVIDIA CUDA · Python · Transformers
fatal

bitsandbytes: CUDA error / 'CUDA Setup failed despite GPU being available'

bitsandbytes silently breaks after PyTorch or NVIDIA driver updates. The fix is usually a reinstall with the right CUDA version, or switching to a prebuilt wheel. Here's the diagnostic order.

bitsandbytes · PyTorch · Hugging Face Transformers · QLoRA fine-tuning · NVIDIA CUDA
degrades

HuggingFace 429 Too Many Requests / rate limit exceeded

HuggingFace returns HTTP 429 when you exceed the anonymous rate limit. A free account + token raises your ceiling dramatically. Here's exactly what triggers it, how to authenticate, and how to batch downloads so you never hit it again.

huggingface_hub · huggingface-cli · transformers · diffusers · HF Hub API
fatal

GGUF corrupt on disk / 'invalid magic number' / 'failed to read model file'

A corrupt GGUF file fails with cryptic magic-number or read errors. Here's how to validate the file without loading it, identify the corruption, and re-download only the damaged parts.

llama.cpp · Ollama · LM Studio · GGUF format · any GGUF-loading tool
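"Validate without loading" can start with four bytes: per the GGUF spec, every valid file opens with the ASCII magic `GGUF`. This cheap probe catches truncated downloads and HTML error pages saved as `.gguf`; it does not catch mid-file corruption, for which you still compare sizes or hashes against the repo.

```python
GGUF_MAGIC = b"GGUF"  # first four bytes of every valid GGUF file, per spec

def looks_like_gguf(path: str) -> bool:
    """Cheap validity probe: read only the 4-byte magic, never the model.
    False usually means a truncated or wrong-content download."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC
```

A common failure shape: a proxy or expired link returns an HTML error page, the downloader saves it under the model filename, and the runtime then reports "invalid magic number".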
fatal

WSL: systemd not running / 'System has not been booted with systemd as init'

WSL2 defaults to a non-systemd init for speed. For Docker, NVIDIA Container Toolkit, and multi-service AI stacks, you need systemd enabled. Here's how to turn it on and verify it's running.

WSL2 · Docker Desktop on Windows · NVIDIA Container Toolkit · Ubuntu on WSL · systemd
fatal

FlashAttention build failed on Windows / 'nvcuda.dll not found' / MSVC linker errors

FlashAttention compilation on Windows is the most common build failure in the local AI stack. The three real fixes: a prebuilt wheel, WSL2, or switching to SDPA.

FlashAttention 2 · Windows · MSVC · NVIDIA CUDA · PyTorch

Don't see your error?

We're building the troubleshooting library by the highest-volume queries first. If you're hitting an error that isn't covered, the diagnostic patterns here usually transfer: check VRAM headroom, check thermals, check driver versions, check the model file. Most local AI failures fall in those four buckets.