Systematic Debugging Approach — Troubleshooting Local AI (Chapter 1)

The Layer Model

Local AI systems stack in predictable layers, listed here from the bottom up. Most failures trace back to a root cause in a single layer, even when the symptom surfaces in a different one:

Hardware layer: GPU not detected, insufficient VRAM, thermal throttling
Driver layer: NVIDIA driver not installed, driver too old for the CUDA version in use, NVIDIA Container Toolkit missing
Runtime layer: cuDNN mismatches, missing shared libraries, LD_LIBRARY_PATH misconfigured
Container layer: Docker not running, port conflicts, volume mount failures
Application layer: wrong model file, corrupted weights, missing tokenizer
Network layer: firewall blocking, DNS resolution failure, self-signed certificate rejection

When something breaks, start at the hardware layer and work upward through the list. The instinct to check the application first wastes hours, because an application error caused by missing CUDA libraries can look identical to one caused by a corrupted model file.
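
The bottom-up walk can be sketched in Python as an ordered list of checks that stops at the first failure. This is a minimal sketch, not a complete diagnostic: the helper names first_failing_layer and command_succeeds are hypothetical, and the two example commands (nvidia-smi and docker info) are assumptions about a typical NVIDIA plus Docker setup. Substitute checks that match your own stack.

```python
import shutil
import subprocess
from typing import Callable, List, Optional, Tuple

# A check is a layer name paired with a function that returns True on success.
Check = Tuple[str, Callable[[], bool]]


def command_succeeds(args: List[str]) -> bool:
    """Return True if the command is on PATH and exits with status 0."""
    if shutil.which(args[0]) is None:
        return False
    try:
        result = subprocess.run(args, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the first layer whose check fails."""
    for layer, check in checks:
        if not check():
            return layer  # Stop here: higher layers depend on this one.
    return None


# Illustrative checks only (assumed setup: NVIDIA GPU, Docker).
EXAMPLE_CHECKS: List[Check] = [
    ("hardware/driver", lambda: command_succeeds(["nvidia-smi"])),
    ("container", lambda: command_succeeds(["docker", "info"])),
]
```

Calling first_failing_layer(EXAMPLE_CHECKS) returns the name of the lowest layer that failed, or None if every check passed. Stopping at the first failure is deliberate: results from higher layers are not trustworthy until the layer beneath them works.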

The Diagnostic Loop

Every troubleshooting session follows this sequence:

Observe → Hypothesize → Test → Conclude → Fix

Observe: Capture the exact error message, reproduction steps, and system state. "It doesn't work" is not an observation. "Inference hangs for 90 seconds then returns 'Connection reset by peer'" is an observation.
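
One way to make an observation exact is to run the failing command once and record everything it produced. The sketch below assumes the failure can be reproduced from a single command line; the function name capture_observation and the field names in the returned dictionary are hypothetical choices for illustration.

```python
import datetime
import platform
import subprocess
from typing import Dict, List


def capture_observation(args: List[str], timeout: float = 120.0) -> Dict[str, object]:
    """Run the failing command once and record exactly what happened."""
    started = datetime.datetime.now(datetime.timezone.utc).isoformat()
    try:
        result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
        returncode, stdout, stderr = result.returncode, result.stdout, result.stderr
        timed_out = False
    except subprocess.TimeoutExpired:
        # A hang is itself an observation; record it instead of raising.
        returncode, stdout, stderr = None, "", ""
        timed_out = True
    return {
        "command": args,
        "started_utc": started,
        "platform": platform.platform(),
        "returncode": returncode,
        "stdout": stdout,
        "stderr": stderr,
        "timed_out": timed_out,
    }
```

The returned dictionary holds the exit code, the verbatim error text, and whether the command hung, which is the raw material for a specific hypothesis.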

Hypothesize: Name the specific layer and component. "The GPU driver is not exposing CUDA to the container" is a hypothesis. "Something is wrong with the GPU" is not.

Test: Run a command whose result confirms or refutes the hypothesis. Restarting everything is not a test. Running nvidia-smi and comparing its output against the specific result you expect is a test.
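
As a sketch of a test with a stated expectation, the code below queries nvidia-smi in CSV form and checks whether at least one visible GPU has a minimum amount of memory. The helper names and the threshold are hypothetical; the expectation itself should come from your hypothesis.

```python
import subprocess
from typing import List, Tuple

# nvidia-smi query mode: one CSV line per GPU, memory in MiB without units.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]


def parse_gpu_lines(output: str) -> List[Tuple[str, int]]:
    """Parse lines of the form 'GPU name, memory_in_MiB'."""
    gpus = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, memory = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory)))
    return gpus


def gpu_meets_expectation(output: str, min_vram_mib: int) -> bool:
    """True if at least one listed GPU has min_vram_mib or more."""
    return any(memory >= min_vram_mib for _, memory in parse_gpu_lines(output))


def run_query() -> str:
    """Run nvidia-smi; raises if it is missing or exits non-zero."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

A call such as gpu_meets_expectation(run_query(), 16000) yields a yes-or-no answer to the hypothesis "a GPU with at least 16000 MiB is visible", where 16000 is only an example value. Run the same check inside the container to test whether the GPU is exposed there as well.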

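Conclude: Decide from the test result whether the hypothesis was confirmed or refuted, and accept that answer. If it was refuted, return to Hypothesize with what you learned. Never skip this step.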

Fix: Apply the minimum change that resolves the root cause.

Document Everything

Keep a text file of what you tried, what worked, and what did not. Local AI hardware configurations vary too much for generic guides to cover your specific setup. Your documented fixes become your personal knowledge base.
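
A plain text file is enough, but a little structure makes old entries easier to search. The sketch below appends one JSON line per troubleshooting session, with fields that mirror the diagnostic loop. The DebugEntry class, its field names, and the one-line-per-entry layout are hypothetical choices, not a required format.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class DebugEntry:
    """One pass through the diagnostic loop."""
    observation: str
    hypothesis: str
    test: str
    conclusion: str
    fix: str = ""  # Left empty when the hypothesis was refuted.


def append_entry(log_path: Path, entry: DebugEntry) -> None:
    """Append the entry as a single JSON line; earlier entries are untouched."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(entry)) + "\n")


def load_entries(log_path: Path) -> List[DebugEntry]:
    """Read every entry back in the order it was written."""
    with log_path.open(encoding="utf-8") as handle:
        return [DebugEntry(**json.loads(line)) for line in handle if line.strip()]
```

Recording refuted hypotheses alongside the fixes keeps you from repeating the same dead ends the next time a similar error appears.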