COURSE · OPS · B018

Troubleshooting Local AI

Learn to troubleshoot local AI through RunLocalAI's practical lens: debugging, error diagnosis, hardware fit, runtime settings, verification habits, and local-vs-cloud tradeoffs.

15 chapters · 5h · Operator track · By Fredoline Eruo
PREREQUISITES
  • B001
  • B003

Course B018: Troubleshooting Local AI

Why this course exists

Local AI deployments fail in predictable ways: GPU memory exhausted after the third request, models that download successfully but refuse to load, APIs that work from curl but time out from the application. These problems are not random. They follow patterns that, once recognized, become straightforward to resolve. Most troubleshooting guides present solutions without teaching the diagnostic thinking that separates a competent operator from someone who reinstalls everything and hopes for the best. This course teaches a systematic approach: isolate the failure layer, identify the component responsible, and apply the correct fix. In practice, the same ten or so failure modes cause the large majority of local AI problems. Learn to recognize them once, and you stop Googling the same errors forever.

What you will know after

  • Diagnose GPU, memory, and network failures using command-line tools rather than guesswork
  • Identify which system layer (hardware, driver, runtime, application) is responsible for a given error
  • Apply targeted fixes for the 10 most common local AI failure modes
  • Read logs strategically to extract signal from noise
  • Build a personal runbook that documents fixes for recurring problems on your specific hardware
CHAPTERS
  1. Systematic Debugging Approach (15 min): 90% of local AI failures occur at the runtime layer or below (runtime, driver, or hardware). Before touching application code, verify the hardware and software stack beneath it.
  2. Installation Failures (20 min): Installation failures almost always leave error messages specific enough to diagnose, if you actually read them. Pip's verbose flag (`-v`) and Docker's logs (`docker logs`) are underused diagnostic tools.
  3. GPU Not Detected (20 min): "GPU not detected" is three different problems, depending on where detection fails. `lspci` checks the hardware, `nvidia-smi` checks the driver, and PyTorch's `torch.cuda.is_available()` checks the runtime. Solve each in sequence.
  4. OOM Errors (20 min): Exit code 137 means the process was killed by SIGKILL (128 + 9). On a machine running local models, the usual sender is the kernel's OOM killer. That is not a Python error or a model error; it is the kernel enforcing memory limits. Check `dmesg` to confirm and `free -h` to understand how much headroom existed.
  5. Model Download Failures (20 min): Always verify checksums when downloading models. The safetensors format removes pickle's code-execution risk, but it is not a substitute for a checksum. A corrupted model file often produces errors along the lines of "dimension mismatch in layer" or "NaN values detected": symptoms that look like model bugs but originate from download corruption.
  6. Slow Inference (15 min): Slow inference is almost always a configuration issue, not a hardware issue. The same model on the same GPU runs at different speeds depending on quantization level, batch size, attention implementation, and KV cache settings.
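  7. API Connection Refused (20 min): "Connection refused" means the connection attempt reached the host, which answered with a reset: the TCP handshake never completed because nothing was listening on that address and port. The process is either not running, bound to the wrong interface (for example `127.0.0.1` instead of `0.0.0.0`), or a firewall is actively rejecting the connection. A firewall that silently drops packets produces a timeout instead.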
  8. Docker Issues (15 min): Docker layer caching means builds reuse unchanged layers. After you modify `requirements.txt`, the instruction that copies it and every layer after it rebuild, while earlier layers come from the cache. Use `--no-cache` only when debugging layer-specific failures.
  9. WSL2 Problems (20 min): WSL2 is a Linux kernel running in a lightweight VM. It has the same debugging needs as a Linux VM, with the added complexity of the Windows host. GPU debugging follows the same driver-then-runtime order as bare-metal Linux, except that the NVIDIA driver is installed on the Windows host, not inside WSL2.
  10. Context Length Errors (15 min): Hugging Face tokenizers report this as "Token indices sequence length is longer than the specified maximum sequence length for this model"; other runtimes word it differently. The message tells you exactly what happened: the input exceeded the model's limit. The fix depends on whether you can truncate the input or need to switch to a model with a longer context.
  11. Model Hallucination Debugging (15 min): Before fixing hallucinations, reproduce them. Use identical seeds, print intermediate states, and verify that the model is receiving the inputs you believe it is receiving. Most "hallucination bugs" turn out to be retrieval bugs or prompt formatting bugs.
  12. Performance Profiling (15 min): Measure before optimizing. A profile tells you exactly which operation consumes time or memory. Optimizing the wrong operation wastes effort.
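  13. Log Analysis (20 min): Error messages are specific. "CUDA out of memory" means VRAM is exhausted. A "CUDA error" that does not mention memory needs its full text read: some point to driver or hardware failure, but "device-side assert triggered" usually means bad inputs, such as an out-of-range token ID. Read the actual error, not a summary of the error.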
  14. Community Resources (15 min): Well-documented bugs get fixed. Poorly documented bugs get closed as "cannot reproduce" or "needs more information." Ten minutes spent writing a complete bug report can save days of waiting for a response.
  15. Troubleshooting Runbook Project (20 min): A runbook is not documentation you write once; it is documentation you update every time you solve a new problem. After each debugging session, spend 5 minutes adding the fix to your runbook. Six months later, you'll thank yourself.

Completion Criteria

You have completed this course when you can:
  • Run the full GPU diagnostic sequence and interpret each command's output
  • Identify which system layer (hardware, driver, runtime, application) is responsible for any given error
  • Fix the 10 most common local AI errors from memory rather than by searching
  • Build a runbook that documents your specific system's configuration and recurring fixes
  • Profile inference performance and identify the bottleneck (compute, memory bandwidth, or transfer)

These skills are not about memorizing error messages. They are about developing a mental model of how the layers of a local AI system stack up, so that diagnosing a new error takes minutes instead of hours.
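The bottom-up order from Systematic Debugging Approach can be sketched as a loop that runs one check per layer, from hardware upward, and stops at the first failure. This is a minimal sketch: the check functions in the demo are hypothetical placeholders, and in practice each would wrap a real probe such as `lspci` or `nvidia-smi`.

```python
from typing import Callable, Dict, Optional

# Layers ordered bottom-up, matching the hardware, driver, runtime, application stack.
LAYERS = ["hardware", "driver", "runtime", "application"]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run each layer's check bottom-up and return the first layer whose check fails.

    Returns None when every registered check passes. Layers without a check are skipped.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None


if __name__ == "__main__":
    # Hypothetical checks: the driver check fails, so application code is never examined.
    demo = {
        "hardware": lambda: True,
        "driver": lambda: False,
        "runtime": lambda: True,
        "application": lambda: True,
    }
    print(first_failing_layer(demo))
```

Stopping at the first failure is the point: a broken driver makes every runtime and application symptom above it misleading.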
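The three-step GPU Not Detected sequence can be scripted so the first failing layer is reported directly. A sketch assuming bare-metal Linux with an NVIDIA GPU; under WSL2, `lspci` typically does not list the physical NVIDIA card, so the hardware check would be misleading there. PyTorch is treated as optional.

```python
import shutil
import subprocess


def check_hardware() -> bool:
    """Hardware layer: does the PCI bus list an NVIDIA device? Needs `lspci` (pciutils)."""
    if shutil.which("lspci") is None:
        return False
    out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    return "nvidia" in out.lower()


def check_driver() -> bool:
    """Driver layer: `nvidia-smi` exits non-zero when it cannot talk to the driver."""
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0


def check_runtime() -> bool:
    """Runtime layer: can PyTorch see a CUDA device? False if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def diagnose_gpu() -> str:
    steps = [
        ("hardware (lspci)", check_hardware),
        ("driver (nvidia-smi)", check_driver),
        ("runtime (torch.cuda.is_available)", check_runtime),
    ]
    for name, check in steps:
        if not check():
            return f"GPU detection fails at: {name}"
    return "GPU visible at hardware, driver, and runtime layers"


if __name__ == "__main__":
    print(diagnose_gpu())
```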
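For OOM Errors, two small helpers make the numbers concrete: one decodes an exit status (shells report death-by-signal as 128 plus the signal number, while Python's `subprocess` reports it as a negative return code), and one reads available memory from `/proc/meminfo`, the same source `free -h` uses on Linux. Decoding 137 only tells you SIGKILL arrived; `dmesg` is still what confirms the OOM killer sent it. The status decoding is a heuristic, since a program can also call `exit(137)` itself.

```python
import signal


def describe_exit(code: int) -> str:
    """Explain an exit status: 137 = 128 + 9 (SIGKILL); subprocess reports the same as -9."""
    if code < 0:
        sig = -code
    elif code > 128:
        sig = code - 128
    else:
        return f"exited normally with status {code}"
    try:
        name = signal.Signals(sig).name
    except ValueError:
        name = f"signal {sig}"
    return f"killed by {name}"


def mem_available_mib(meminfo_text: str) -> int:
    """Parse MemAvailable (reported in kB, i.e. KiB) from /proc/meminfo text; Linux only."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")


if __name__ == "__main__":
    print(describe_exit(137))
    with open("/proc/meminfo") as f:
        print(f"{mem_available_mib(f.read())} MiB available")
```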
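Checksum verification from Model Download Failures needs only the standard library. A minimal sketch, assuming the model publisher lists a SHA-256 digest for the file; hashing in chunks keeps multi-gigabyte weights out of RAM.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large model files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Compare with the published digest; hex case and surrounding whitespace are ignored."""
    return sha256_file(path) == expected_sha256.strip().lower()
```

A mismatch means re-download before debugging anything else; no runtime setting fixes corrupted weights.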
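Because Slow Inference is usually a configuration question, the useful habit is measuring tokens per second while changing one setting at a time. A sketch in which `generate` is a hypothetical callable wrapping whatever runtime you use and returning the number of tokens it produced; one untimed warm-up call keeps one-off costs such as model loading out of the figure.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Average throughput over `runs` timed calls, after one untimed warm-up call."""
    generate()  # warm-up: excluded from timing
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

Run it once per configuration (for example, two quantization levels of the same model) and compare the numbers, rather than judging speed by feel.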
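The difference between "refused" and "timeout" in API Connection Refused can be probed from Python's `socket` module. A minimal sketch: "refused" points at the listening process or its bind address (compare with `ss -ltnp` on the server), while "timeout" points at a firewall dropping packets or an unreachable host. Other network errors are left to propagate.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt as "open", "refused", or "timeout"."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # handshake completed: something is listening
    except ConnectionRefusedError:
        return "refused"  # host answered with a reset: nothing listening, or an active reject
    except socket.timeout:
        return "timeout"  # no answer at all: silent drop or unreachable host
```

Probing the same port from the server itself and then from another machine separates a `127.0.0.1`-only bind from a process that is not running at all.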
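For WSL2 Problems, a first step is confirming you are inside WSL and that the GPU paravirtualization device is exposed. A sketch under two assumptions: WSL kernels include "microsoft" in `/proc/version` (capitalized differently in WSL1 and WSL2), and WSL2 exposes the GPU through `/dev/dxg`. The status strings are illustrative.

```python
from pathlib import Path


def is_wsl(proc_version: str) -> bool:
    """True if the kernel version string identifies a WSL kernel (case-insensitive)."""
    return "microsoft" in proc_version.lower()


def wsl_gpu_status(proc_version_path: str = "/proc/version") -> str:
    try:
        version = Path(proc_version_path).read_text()
    except OSError:
        return "no /proc/version: not Linux, so not WSL"
    if not is_wsl(version):
        return "not WSL: debug as bare-metal Linux"
    if not Path("/dev/dxg").exists():
        return "WSL without /dev/dxg: GPU not exposed (WSL1, or check the Windows-side driver)"
    return "WSL2 GPU device present: continue with nvidia-smi inside WSL"


if __name__ == "__main__":
    print(wsl_gpu_status())
```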
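When truncation is the right fix for Context Length Errors, the budget has to leave room for the tokens the model will generate. A minimal sketch operating on already-tokenized input; `max_context` and `reserve_for_output` are illustrative parameter names. Keeping the most recent tokens suits chat history; for documents, chunking or summarizing may serve better.

```python
from typing import List


def fit_to_context(tokens: List[int], max_context: int, reserve_for_output: int) -> List[int]:
    """Keep the most recent tokens so prompt plus generated output fits the context window."""
    budget = max_context - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output leaves no room for the prompt")
    return tokens[-budget:] if len(tokens) > budget else tokens
```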
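The reproduce-first advice in Model Hallucination Debugging rests on two habits: fixing the seed and printing the exact input. The toy sampler below stands in for a model's decoding step and is purely hypothetical; with a real runtime you would set its own seed option (PyTorch, for instance, provides `torch.manual_seed`).

```python
import random
from typing import Dict, List


def sample_next(probs: Dict[str, float], rng: random.Random) -> str:
    """Toy stand-in for one decoding step: draw a token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


def run(prompt: str, seed: int, steps: int = 5) -> List[str]:
    # repr() exposes stray whitespace, missing template markers, or an empty retrieval result.
    print(repr(prompt))
    rng = random.Random(seed)  # same seed, same draws: the failure becomes repeatable
    probs = {"yes": 0.5, "no": 0.3, "maybe": 0.2}
    return [sample_next(probs, rng) for _ in range(steps)]
```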
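Performance Profiling starts with knowing which stage is slow. A minimal sketch of per-stage wall-clock timing; stage names are illustrative. GPU work is asynchronous, so when timing GPU stages call `torch.cuda.synchronize()` before the clock is read, or the time lands in whichever later stage waits on the result. `cProfile` and `torch.profiler` give finer detail once the slow stage is known.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List, Tuple

timings: DefaultDict[str, float] = defaultdict(float)


@contextmanager
def stage(name: str) -> Iterator[None]:
    """Accumulate wall-clock seconds spent inside the block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start


def report() -> List[Tuple[str, float]]:
    """Stages sorted slowest first: optimize the top entry, not a guess."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
```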
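The distinctions in Log Analysis can be encoded as an ordered pattern list, with specific messages checked before generic ones so "device-side assert triggered" is not lumped in with every other "CUDA error". The patterns and hints are illustrative, not an exhaustive catalogue; extend them from your own logs.

```python
import re
from typing import List, Pattern, Tuple

# Order matters: the first matching pattern wins, so specific messages come first.
PATTERNS: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"CUDA out of memory", re.I), "VRAM exhausted: reduce context, batch size, or model size"),
    (re.compile(r"device-side assert triggered", re.I), "likely bad inputs (for example an out-of-range token ID)"),
    (re.compile(r"CUDA error", re.I), "other CUDA error: read the full message, then check driver and hardware"),
    (re.compile(r"Connection refused", re.I), "nothing listening on that address and port"),
]


def triage(log_text: str) -> List[str]:
    """Return one hint per matching line, in log order."""
    hints = []
    for line in log_text.splitlines():
        for pattern, hint in PATTERNS:
            if pattern.search(line):
                hints.append(f"{line.strip()} -> {hint}")
                break
    return hints
```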
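A complete bug report, as Community Resources recommends, starts with exact versions. A minimal sketch using only the standard library; the default package list is an assumption, so swap in whatever your stack uses. PyTorch users can also paste the output of `python -m torch.utils.collect_env`.

```python
import platform
import sys
from importlib import metadata
from typing import Tuple


def env_report(packages: Tuple[str, ...] = ("torch", "transformers", "llama-cpp-python")) -> str:
    """Collect OS, Python, and package versions as plain text for pasting into an issue."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for pkg in packages:
        try:
            lines.append(f"{pkg}: {metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{pkg}: not installed")
    return "\n".join(lines)


if __name__ == "__main__":
    print(env_report())
```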
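The five-minute habit from the Troubleshooting Runbook Project is easier to keep when adding an entry is a single call. A sketch that stores the runbook as JSON lines, one solved problem per line, so it stays greppable; the file layout and field names are hypothetical choices, not a required format.

```python
import json
from datetime import date
from pathlib import Path
from typing import Dict, List


def add_runbook_entry(path: Path, symptom: str, layer: str, cause: str, fix: str) -> None:
    """Append one solved problem as a single JSON line."""
    entry = {"date": date.today().isoformat(), "symptom": symptom, "layer": layer, "cause": cause, "fix": fix}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def search_runbook(path: Path, term: str) -> List[Dict[str, str]]:
    """Return entries whose symptom mentions `term` (case-insensitive)."""
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return [e for e in entries if term.lower() in e["symptom"].lower()]
```

Recording the layer alongside the fix makes patterns visible over time, such as how often a problem that looked like an application bug was really a driver issue.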