
Model crashes mid-inference — debug the actual cause

Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean a corrupted model file, thermal throttling, PSU instability, or failing VRAM. Here's the diagnostic order.

NVIDIA CUDA · AMD ROCm · llama.cpp · vLLM · Ollama
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Model file is corrupted (incomplete download, bad mirror)

Diagnose

Crash is reproducible at the same token / step. Sometimes the runtime reports a hash mismatch.

Fix

Re-download the model from the source (HuggingFace direct, official Ollama registry). Verify the SHA256 if the source publishes one. Don't use sketchy mirrors.
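
If you'd rather check than blindly re-download, hash the local file and compare it to the published checksum. A minimal sketch; the filename and expected hash are placeholders, so substitute the values from the model's source page:

```bash
# Hash the local model file (path is a placeholder)
sha256sum ./model.Q4_K_M.gguf

# Or compare against the published checksum in one step
# (note the two spaces between hash and path)
echo "<expected-sha256>  ./model.Q4_K_M.gguf" | sha256sum --check
```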

#2

GPU thermal throttling / unstable overclock

Diagnose

Crash correlates with high load. `nvidia-smi -q -d TEMPERATURE` shows GPU temp > 85°C right before crash. Or you've manually overclocked.

Fix

Reset clocks to stock. Improve case airflow. Underclock VRAM by 100-200 MHz if you have an aggressive AIB card; many crashes that look like 'illegal memory access' are actually VRAM stability issues.
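
To confirm the crash actually lines up with heat or sagging clocks, log temperature, clocks, and power once a second while you reproduce it. A minimal sketch using standard `nvidia-smi` query fields:

```bash
# Log temperature, SM/memory clocks, and power draw once per second
nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,clocks.mem,power.draw \
           --format=csv -l 1 | tee gpu_thermal.log

# In another terminal: does the driver report an active throttle reason?
# (the section label varies by driver version, hence the loose grep)
nvidia-smi -q -d PERFORMANCE | grep -i -A 12 "reasons"
```

If the log shows temperature pinned at the limit and clocks dropping right before the crash, it's thermal; if clocks hold steady and it still dies, keep going down the list.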

#3

PSU not stable under load (transient power dips)

Diagnose

Crash happens 5-30 seconds into sustained inference. PC may also reboot under load. PSU is undersized or aging.

Fix

Check PSU wattage vs total system load. Single 4090 / 5090 needs 850-1000W minimum. Consider a higher-quality PSU (Seasonic Prime, Corsair RMx, EVGA SuperNova). Old PSUs degrade — 5+ year old units are suspect.
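
`nvidia-smi` can't see the millisecond transients that actually trip a marginal PSU, but logging sustained board power while you reproduce the crash at least shows how close to the unit's rating you're running. A rough sketch:

```bash
# Sample GPU board power every 200 ms until you Ctrl-C
# (real transient spikes are shorter than this interval, so treat
#  these numbers as a floor, not a peak)
nvidia-smi --query-gpu=timestamp,power.draw,power.limit --format=csv -lms 200 \
    | tee psu_load.log

# Rough look at the highest sustained readings afterwards
sort -t, -k2 -rn psu_load.log | head
```

Add CPU, drives, and fans on top of that figure when comparing against the PSU's rating.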

#4

VRAM ECC error (used cards, mining cards)

Diagnose

Crashes are random, not load-correlated. On cards that report ECC (workstation / datacenter parts), `nvidia-smi -q -d ECC` shows non-zero double-bit errors; consumer GeForce boards typically show N/A, so you're inferring from behavior. Worn VRAM is common on used / ex-mining 3090s.

Fix

If under 100 errors and isolated to one VRAM bank, you can sometimes work around it by underclocking VRAM. If consistent or growing, the card is failing — replace it.
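
Where the card exposes ECC at all, you can read the lifetime counters directly. A minimal check:

```bash
# Full ECC report: mode plus volatile and aggregate error counters
nvidia-smi -q -d ECC

# Just the lifetime corrected / uncorrected totals, where the fields are exposed
nvidia-smi --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total \
           --format=csv
```

Re-run it after a few days of use; a count that keeps climbing matters more than the absolute number.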

#5

Runtime + driver incompatibility

Diagnose

Crash happens immediately on load. Logs show CUDA error 700 (illegal memory access) before a single token is generated.

Fix

Update drivers to latest stable. If on the bleeding-edge runtime (vLLM nightly, llama.cpp HEAD), pin to the last release tag. New runtimes occasionally ship CUDA kernels that need newer drivers than you have.
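
A quick sanity check before pinning anything: compare the highest CUDA version the driver supports with the CUDA version the runtime was built against. This sketch assumes a PyTorch-backed runtime such as vLLM:

```bash
# Driver version and the max CUDA version it supports (nvidia-smi header)
nvidia-smi | head -n 5

# CUDA version the Python runtime was built against (assumes PyTorch is installed)
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```

If the runtime's CUDA build is newer than what the driver advertises, update the driver or drop back to the runtime's last tagged release.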

#6

RAM (system) corruption causing GPU memory transfer failures

Diagnose

Crash is random, sometimes during model load (before any inference). System RAM might be the issue, not VRAM.

Fix

Run memtest86 overnight. Bad system RAM corrupts model weights as they transfer to GPU, producing CUDA crashes that look like GPU issues.
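
memtest86 means rebooting from a USB stick, so a common first pass (quicker, but less thorough) is `memtester` inside the running OS. A rough sketch, assuming roughly 8 GB of RAM is free to lock:

```bash
# In-OS RAM check: lock 8 GB and run 2 passes of memtester's patterns
# (the size is an assumption; leave headroom below your free RAM)
sudo memtester 8G 2

# Also scan the kernel log for earlier machine-check / memory complaints
sudo dmesg | grep -iE "mce|memory.*(error|corrupt)"
```

A clean memtester run doesn't fully clear the RAM; the overnight memtest86 pass is still the more thorough answer.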

Frequently asked questions

Is my GPU dying if local AI keeps crashing?

Possibly, but check the cheaper causes first: PSU stability, thermals, model file integrity, drivers. If you've ruled all of those out and `nvidia-smi -q -d ECC` shows growing error counts, the card has a real hardware problem.

Can a used GPU from a mining rig be safe for local AI?

Often yes, with caveats. Mining wears the fans (replaceable) and the thermal pads (replaceable on most cards). It rarely wears the VRAM or the GPU die unless the card was overclocked and run hot for years. Buy from sellers willing to demo the card running stress tests, and check ECC error counts where the card reports them.

What stress test should I run on a GPU before trusting it for AI?

Run a 30-minute llama.cpp inference loop on a model that fully fits VRAM, monitoring `nvidia-smi -l 1`. Watch for thermal throttling (clocks dropping under sustained load), VRAM errors, or driver resets. If it survives 30 minutes at 95%+ utilization without issues, it'll handle inference.
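
A rough sketch of that burn-in, assuming a recent llama.cpp build where the CLI binary is named `llama-cli` (older builds call it `main`) and a GGUF model path you substitute yourself:

```bash
#!/usr/bin/env bash
# 30-minute inference burn-in with a background GPU telemetry log.
# MODEL and the prompt are placeholders; -ngl 99 offloads all layers to the GPU.
MODEL=./model.Q4_K_M.gguf

nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,utilization.gpu,power.draw \
           --format=csv -l 1 > burnin_gpu.log &
LOGGER=$!

end=$(( $(date +%s) + 1800 ))   # 30 minutes
while [ "$(date +%s)" -lt "$end" ]; do
    llama-cli -m "$MODEL" -ngl 99 -n 256 --seed -1 \
        -p "Write a detailed history of computing." > /dev/null \
        || { echo "inference run failed"; break; }
done

kill "$LOGGER"
```

Afterwards, scan burnin_gpu.log for temperatures pinned at the limit or clocks sagging well below boost under sustained load.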

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: