01. Systematic Debugging Approach

Chapter 1 of 15 · 15 min

The Layer Model

Local AI systems stack in predictable layers, and most failures can be traced to a root cause in one of them:

  1. Hardware layer: GPU not detected, insufficient VRAM, thermal throttling
  2. Driver layer: NVIDIA driver or CUDA not installed, driver/CUDA version mismatch, NVIDIA Container Toolkit missing
  3. Runtime layer: cuDNN mismatches, missing shared libraries, LD_LIBRARY_PATH misconfigured
  4. Container layer: Docker not running, port conflicts, volume mount failures
  5. Application layer: wrong model file, corrupted weights, missing tokenizer
  6. Network layer: firewall blocking, DNS resolution failure, self-signed certificate rejection

When something breaks, start at layer 1 and work upward. The instinct to check the application first can waste hours, because an application error caused by a missing CUDA library can look identical to one caused by a corrupted model file.
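The bottom-up order described above can be sketched as a short script that probes each layer in turn and stops at the first failure. This is a minimal sketch, not a complete diagnostic: the names LAYER_CHECKS, run_check and first_failing_layer are hypothetical, and the choice of which command stands in for which layer is an assumption you should adapt to your own stack.

```python
import shutil
import subprocess
import sys

# Illustrative probes, ordered bottom-up. Which command stands in for which
# layer is an assumption of this sketch; adapt the list to your own stack.
LAYER_CHECKS = [
    ("hardware/driver", ["nvidia-smi"]),
    ("runtime", [sys.executable, "-c",
                 "import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)"]),
    ("container", ["docker", "ps"]),
]


def run_check(command, timeout=30):
    """Run one probe command and return (ok, detail)."""
    if shutil.which(command[0]) is None:
        return False, f"{command[0]} not found on PATH"
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    if result.returncode != 0:
        # Prefer the command's own error text; fall back to the exit code.
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, "ok"


def first_failing_layer(checks=LAYER_CHECKS):
    """Walk the layers in order and stop at the first probe that fails."""
    for layer, command in checks:
        ok, detail = run_check(command)
        if not ok:
            return layer, detail
    return None, "all probes passed"


if __name__ == "__main__":
    layer, detail = first_failing_layer()
    print(f"first failing layer: {layer} ({detail})")
```

Stopping at the first failure is deliberate: a failure at a lower layer makes every result above it unreliable, so there is little value in reading the higher-layer output until the lower one passes.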

The Diagnostic Loop

Every troubleshooting session follows this sequence:

Observe → Hypothesize → Test → Conclude → Fix

Observe: Capture the exact error message, reproduction steps, and system state. "It doesn't work" is not an observation. "Inference hangs for 90 seconds then returns 'Connection reset by peer'" is an observation.

Hypothesize: Name the specific layer and component. "The GPU driver is not exposing CUDA to the container" is a hypothesis. "Something is wrong with the GPU" is not.

Test: Run a command whose result can confirm or refute the hypothesis, and decide what output you expect before you run it. Restarting everything is not a test. Running nvidia-smi and comparing its output against what you predicted is a test.

Conclude: State explicitly whether the test confirmed or refuted the hypothesis. If it was refuted, return to Hypothesize with what you learned. Never skip this step.

Fix: Apply the minimum change that resolves the root cause.
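One way to keep the Hypothesize, Test and Conclude steps honest is to write the prediction down before running the command. The sketch below assumes a hypothetical Hypothesis record and check_hypothesis helper. It compares only the exit status against the prediction, which is a simplification: a real test often needs to inspect the output text as well.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Hypothesis:
    layer: str            # the layer you suspect
    statement: str        # the specific, falsifiable claim
    command: list         # a command whose result can confirm or refute it
    expect_success: bool  # the exit status you predict if the claim is true


def check_hypothesis(hypothesis, timeout=30):
    """Run the command and compare the outcome with the prediction."""
    try:
        result = subprocess.run(
            hypothesis.command, capture_output=True, text=True, timeout=timeout
        )
        succeeded = result.returncode == 0
        evidence = (result.stdout + result.stderr).strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A command that cannot run at all counts as a failed command.
        succeeded = False
        evidence = f"{type(exc).__name__}: {exc}"
    verdict = "supported" if succeeded == hypothesis.expect_success else "refuted"
    return verdict, evidence


if __name__ == "__main__":
    # Illustrative hypothesis; the wording is an example, not a diagnosis.
    h = Hypothesis(
        layer="driver",
        statement="nvidia-smi cannot run on this host",
        command=["nvidia-smi"],
        expect_success=False,
    )
    verdict, evidence = check_hypothesis(h)
    print(f"{h.layer}: {h.statement} -> {verdict}")
    print(evidence)
```

The verdict is worded as "supported" rather than "proven" because a single command rarely rules out every alternative explanation. A "refuted" verdict is still progress: it removes one layer from the list of suspects.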

Document Everything

Keep a text file of what you tried and what worked. Local AI hardware configurations vary too much for generic guides to cover your specific setup. Your documented fixes become your personal knowledge base.
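A plain append-only text file is enough. As a minimal sketch, the helper below adds one timestamped entry per attempt. The file name debug-log.txt, the function name log_attempt and the three fields are all assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("debug-log.txt")  # hypothetical file name; use any location you like


def log_attempt(symptom, tried, outcome, path=LOG_PATH):
    """Append one timestamped entry describing what was tried and what happened."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    entry = (
        f"[{stamp}]\n"
        f"symptom: {symptom}\n"
        f"tried:   {tried}\n"
        f"outcome: {outcome}\n\n"
    )
    # Append mode keeps earlier entries, so the file grows into a history.
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

A call such as log_attempt("Inference hangs for 90 seconds then returns 'Connection reset by peer'", "what you changed", "what happened next") appends one entry. Record failed attempts too, since knowing what did not help saves time the next time the same symptom appears.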

EXERCISE

Run nvcc --version, nvidia-smi, python -c "import torch; print(torch.cuda.is_available())" and docker ps on your system. Note the exact output of each, including any errors such as a command that is not found. This baseline becomes your reference when things break.
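If you would rather capture the baseline in one pass, the four commands can be run from a short script and saved together. This is a sketch under a few assumptions: the names BASELINE_COMMANDS, capture and baseline_report are hypothetical, and sys.executable replaces the bare python command so the torch check runs in the same interpreter as the script. A tool that is not installed is recorded as such rather than treated as a crash, because nvcc, for example, is only present when the CUDA toolkit is installed.

```python
import shlex
import subprocess
import sys

BASELINE_COMMANDS = [
    ["nvcc", "--version"],
    ["nvidia-smi"],
    [sys.executable, "-c", "import torch; print(torch.cuda.is_available())"],
    ["docker", "ps"],
]


def capture(command, timeout=60):
    """Run one command and return its exit code and output, or why it could not run."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except OSError as exc:
        # Covers a missing executable as well as permission problems.
        return f"COULD NOT RUN: {exc}"
    except subprocess.TimeoutExpired:
        return f"TIMED OUT after {timeout}s"
    return f"exit code {result.returncode}\n{result.stdout}{result.stderr}".rstrip()


def baseline_report(commands=BASELINE_COMMANDS):
    """Build one text report with a section per command."""
    sections = [f"$ {shlex.join(command)}\n{capture(command)}" for command in commands]
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(baseline_report())
```

Redirect the output to a dated file and keep it next to your notes. When something breaks later, running the script again and comparing the two files shows which layer changed.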