15. Troubleshooting Runbook Project
Building Your Personal Runbook
A runbook documents your specific system's configuration, recurring problems, and their fixes. Generic documentation covers your hardware; your runbook covers your system.
Runbook Template
# System: [Hostname/Description]
## Hardware
- GPU: [Model, VRAM]
- RAM: [Total]
- OS: [Distribution, Kernel version]
## Common Problems
### Problem: Ollama returns "connection refused"
**Symptoms**: `curl http://localhost:11434/api/tags` fails
**Cause**: The Ollama service is not running
**Fix**:
```bash
sudo systemctl restart ollama
sudo systemctl status ollama
```
### Problem: Model loads but inference is slow
**Symptoms**: Fewer than 5 tokens/second on a 7B model
**Cause**: Running on CPU instead of GPU
**Fix**:
```bash
# Verify GPU detection
python -c "import torch; print(torch.cuda.is_available())"
# Check environment variables
echo "$CUDA_VISIBLE_DEVICES"
```
## Installation Notes
- CUDA Version: 12.1
- Driver Version: 535.154.05
- Ollama Version: 0.1.26
## Model Registry
| Model | Size | Quantization | Location |
|---|---|---|---|
| Llama-2-7B | 7B | Q4_K_M | /models/llama-2-7b-q4 |
## Completion Criteria
You have completed this course when you can:
- Run the full GPU diagnostic sequence and interpret each command's output
- Identify which system layer (hardware, driver, runtime, application) is responsible for any given error
- Fix the 10 most common local AI errors from memory rather than by searching
- Build a runbook that documents your specific system's configuration and recurring fixes
- Profile inference performance and identify the bottleneck (compute, memory bandwidth, or transfer)
These skills are not about memorizing error messages—they are about developing a mental model of how local AI systems stack, so diagnosing a new error takes minutes instead of hours.
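For the profiling criterion, a throughput number is the usual starting point. The sketch below assumes you are measuring through Ollama, whose non-streaming `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The helper name `tokens_per_second` is hypothetical, not part of any tool.

```bash
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# Converts Ollama's eval_count and eval_duration (nanoseconds) into tokens/second.
# Hypothetical helper for illustration; returns non-zero on bad input.
tokens_per_second() {
    if [ "$#" -ne 2 ]; then
        echo "usage: tokens_per_second EVAL_COUNT EVAL_DURATION_NS" >&2
        return 1
    fi
    awk -v n="$1" -v ns="$2" \
        'BEGIN { if (ns + 0 <= 0) exit 1; printf "%.1f\n", n * 1e9 / ns }'
}

# Example: 120 tokens generated in 30 seconds
#   tokens_per_second 120 30000000000    # prints 4.0
#
# Example with real data (assumes jq is installed; "llama2" is a placeholder model name):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "llama2", "prompt": "Hello", "stream": false}' \
#     | jq -r '"\(.eval_count) \(.eval_duration)"'
```

A result below the 5 tokens/second mark used in the template's slow-inference entry is a cue to check whether the model is running on the CPU before looking at anything else.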
Build a runbook for your system. Document hardware spec (GPU model, VRAM, driver version), installed AI frameworks with versions, and the three most common errors you've encountered (symptoms, cause, fix). Then create a shell script that generates a diagnostic report and save its output as your baseline.
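A minimal sketch of the diagnostic script this exercise asks for. The function names `section` and `generate_report` are hypothetical, and the tool list assumes an NVIDIA GPU with Ollama and PyTorch, as in the template's examples; swap in the checks that match your stack. A missing tool is noted in the report rather than treated as an error, so the script still runs on a partially configured machine.

```bash
#!/usr/bin/env bash
# Sketch of a diagnostic report generator (function names are illustrative).
# Every tool is optional: a missing command is recorded, not fatal.

# section TITLE COMMAND [ARGS...]
# Prints a header, then the command's output, or a note if the tool is absent.
section() {
    local title="$1"; shift
    printf '\n== %s ==\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || printf '(command exited with status %s)\n' "$?"
    else
        printf '(%s not found)\n' "$1"
    fi
}

# Walks the stack from the OS up to the application layer.
generate_report() {
    printf 'Diagnostic report: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    section "Kernel" uname -sr
    section "GPU and driver" nvidia-smi
    section "CUDA toolkit" nvcc --version
    section "Ollama version" ollama --version
    section "Ollama API" curl -sS --max-time 5 http://localhost:11434/api/tags
    section "PyTorch GPU check" python -c "import torch; print(torch.cuda.is_available())"
}
```

To capture the baseline, add a final line `generate_report > baseline.txt` and run the script once while the system is healthy. After a failure, write a fresh report to `current.txt` and run `diff baseline.txt current.txt`; apart from the timestamp line, any difference points at the layer that changed.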
Key Insight: A runbook is not documentation you write once—it's documentation you update every time you solve a new problem. After each debugging session, spend 5 minutes adding the fix to your runbook. Six months later, you'll thank yourself.
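One way to keep that five-minute habit cheap is a small helper that appends an entry in the template's Symptoms/Cause/Fix layout. This is a sketch under assumptions: the function name `add_runbook_entry` and the default path in `RUNBOOK` are hypothetical, so point it at wherever your runbook lives.

```bash
# Hypothetical default location; override by setting RUNBOOK before calling.
RUNBOOK="${RUNBOOK:-$HOME/runbook.md}"

# add_runbook_entry PROBLEM SYMPTOMS CAUSE FIX_COMMAND
# Appends one problem entry to the runbook, matching the template's layout.
add_runbook_entry() {
    if [ "$#" -ne 4 ]; then
        echo "usage: add_runbook_entry PROBLEM SYMPTOMS CAUSE FIX_COMMAND" >&2
        return 1
    fi
    # \140 is the octal escape for a backtick; three of them form a Markdown code fence.
    local fence
    fence="$(printf '\140\140\140')"
    {
        printf '\n### Problem: %s\n' "$1"
        printf '**Symptoms**: %s\n' "$2"
        printf '**Cause**: %s\n' "$3"
        printf '**Fix**:\n'
        printf '%sbash\n%s\n%s\n' "$fence" "$4" "$fence"
    } >> "$RUNBOOK"
}

# Example:
#   add_runbook_entry 'Ollama returns "connection refused"' \
#     'curl http://localhost:11434/api/tags fails' \
#     'Ollama not running' \
#     'sudo systemctl restart ollama'
```

Appending keeps entries in the order you hit them; reorganize under "Common Problems" by hand when the file gets long.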