Log Analysis — Troubleshooting Local AI (Chapter 13)

Finding Signal in Noise

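Application logs for AI systems contain three types of messages: informational (normal operation), warnings (suboptimal but functional), and errors (action required). Errors also differ by the layer that emits them. The application, the GPU runtime, and the kernel driver each use a recognizable format, and identifying the layer is the first step in narrowing down the cause.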
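The three message types can be counted with a few lines of Python. This is a minimal sketch that assumes each log line carries a level keyword such as INFO, WARNING, or ERROR; the sample lines and the function names classify_line and summarize are hypothetical, and a real service may spell its levels differently.

```python
import re
from collections import Counter

# Assumed convention: each line contains a level keyword. Adjust the
# pattern if the service being inspected uses other spellings.
LEVEL_PATTERN = re.compile(r"\b(INFO|WARN(?:ING)?|ERROR)\b", re.IGNORECASE)


def classify_line(line):
    """Return 'information', 'warning', 'error', or 'unknown' for one line."""
    match = LEVEL_PATTERN.search(line)
    if match is None:
        return "unknown"
    level = match.group(1).upper()
    if level == "INFO":
        return "information"
    if level.startswith("WARN"):
        return "warning"
    return "error"


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(classify_line(line) for line in lines)


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "2024-01-01 12:00:00 INFO model loaded",
        "2024-01-01 12:00:05 WARNING falling back to CPU",
        "2024-01-01 12:00:09 ERROR out of memory",
    ]
    print(summarize(sample))
```

Lines without a level keyword are reported as unknown rather than guessed, so a large unknown count is a hint that the pattern needs adjusting.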

Application Layer Errors

ValueError: cannot sample with temperature=0.0 using greedy decoding

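This is a parameter validation error: the decoding settings contradict each other. Fix: choose one mode. For sampling, set do_sample=True and use a temperature greater than 0. For deterministic output, use greedy decoding (do_sample=False); whether temperature=0 is accepted in that mode depends on the library, so omit the argument if the error persists.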
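One way to avoid this class of error is to derive both settings from a single value so they cannot disagree. The helper below is a hypothetical sketch: the name generation_kwargs is invented for illustration, and the keyword names do_sample and temperature assume a library that follows the Hugging Face transformers convention.

```python
def generation_kwargs(temperature):
    """Build decoding arguments that do not contradict each other.

    Hypothetical helper. The keyword names assume a transformers-style
    generate() call; other libraries may use different names.
    """
    if temperature < 0:
        raise ValueError(f"temperature must be non-negative, got {temperature}")
    if temperature == 0:
        # Deterministic output: greedy decoding, no temperature argument.
        return {"do_sample": False}
    # Stochastic output: sampling with a positive temperature.
    return {"do_sample": True, "temperature": temperature}


if __name__ == "__main__":
    print(generation_kwargs(0))
    print(generation_kwargs(0.7))
```

The returned dictionary can be unpacked into the generation call, so the rest of the application only ever handles one number.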

Runtime Layer Errors

RuntimeError: CUDA error: an illegal memory access was encountered

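This means a GPU kernel attempted to read or write memory outside its valid address range. Common causes: a tensor shape mismatch, an out-of-bounds index, or corrupted model weights. Because CUDA kernels run asynchronously, the stack trace may point at a later call than the one that actually failed; rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1 makes the reported location more reliable.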
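Out-of-bounds indices are the easiest of these causes to rule out, because they can be checked on the CPU before any kernel launches. The sketch below uses plain Python lists; the function name find_out_of_range_ids and the vocabulary size are illustrative assumptions, not part of any library.

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, id) pairs that would index outside an embedding table."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if not 0 <= tok < vocab_size
    ]


if __name__ == "__main__":
    # Hypothetical input: valid ids are 0 .. 31999 for a 32000-entry table.
    bad = find_out_of_range_ids([1, 5, 32000, -1], vocab_size=32000)
    print(bad)  # [(2, 32000), (3, -1)]
```

An empty result does not prove the input is safe, since shape mismatches and corrupted weights remain possible, but a non-empty result gives a readable error in place of an opaque CUDA one.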

Driver Layer Errors

NVRM: Xid: GPU 0: GPU fault: reasons...

These NVIDIA kernel driver messages appear in dmesg and indicate hardware-level problems, typically overheating, power supply issues, or driver corruption. Rebooting clears the error state but does not fix the underlying cause.
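The three example messages in this chapter can be told apart by simple substring markers. The sketch below is a rough first-pass filter, not a complete classifier: the marker list contains only the strings shown above, and the name guess_layer is hypothetical.

```python
# Markers taken from the example messages in this chapter. The list is a
# starting point and should be extended for the stack actually in use.
LAYER_MARKERS = [
    ("driver", ("NVRM:",)),
    ("runtime", ("CUDA error",)),
    ("application", ("ValueError",)),
]


def guess_layer(line):
    """Return 'driver', 'runtime', 'application', or 'unknown' for a log line."""
    # Lower layers are checked first, so a line that mentions both a CUDA
    # error and a Python exception name is attributed to the runtime.
    for layer, markers in LAYER_MARKERS:
        if any(marker in line for marker in markers):
            return layer
    return "unknown"


if __name__ == "__main__":
    for text in (
        "ValueError: cannot sample with temperature=0.0 using greedy decoding",
        "RuntimeError: CUDA error: an illegal memory access was encountered",
        "NVRM: Xid: GPU 0: GPU fault: reasons...",
    ):
        print(guess_layer(text), "<-", text)
```

Checking the lower layers first reflects the usual direction of failure: a driver or runtime fault often surfaces later as an application exception, so the deepest matching layer is the more useful label.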

Structured Log Collection

# Collect logs from Ollama
journalctl -u ollama --no-pager -n 100

# Collect Docker logs (2>&1 also captures stderr, where many services write their logs)
docker logs --tail 200 your-container-name > docker_logs.txt 2>&1

# Collect GPU error log
sudo dmesg | grep -E "(nvidia|NVRM|GPU)" > gpu_logs.txt
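The same collection step can be scripted so that every log is saved the same way. The sketch below assumes Python 3.7 or later; the function name collect, the output file names, and the use of exit code 127 for a missing command are illustrative choices. It merges stderr into the saved output so that error messages are not lost.

```python
import subprocess
from pathlib import Path


def collect(command, out_path):
    """Run a command and save its stdout and stderr together in out_path.

    Returns the exit code so the caller can tell a failed collection
    apart from an empty log.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured text
            text=True,
            check=False,
        )
    except FileNotFoundError as exc:
        # The tool is not installed on this machine; record that fact.
        Path(out_path).write_text(f"command not found: {exc}\n")
        return 127
    Path(out_path).write_text(result.stdout)
    return result.returncode


# Example calls, mirroring the commands above (not executed here):
#   collect(["journalctl", "-u", "ollama", "--no-pager", "-n", "100"],
#           "ollama_logs.txt")
#   collect(["docker", "logs", "--tail", "200", "your-container-name"],
#           "docker_logs.txt")
```

Passing the command as a list avoids shell quoting problems. The dmesg pipeline is left out because it needs elevated privileges and a filter step; its output could be filtered in Python after collection instead.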

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
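The record described above can be produced by a short script. This is a minimal sketch using only the standard library; the function name checkpoint_record, the field names, and the example package and path are assumptions made for illustration.

```python
import json
import platform
from importlib import metadata


def checkpoint_record(package, data_path, observed_output, constraints=""):
    """Gather the facts needed to reproduce an exercise later."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not installed"
    return {
        "package": package,
        "package_version": version,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "data_path": str(data_path),
        "observed_output": observed_output,
        # Free-text note: model size, CPU/GPU backend, available memory.
        "constraints": constraints,
    }


if __name__ == "__main__":
    # Hypothetical values; replace with the exercise actually run.
    record = checkpoint_record(
        "torch",
        "./data/sample.txt",
        "<paste observed output here>",
        constraints="CPU backend, 16 GB RAM",
    )
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise keeps the environment details and the observed result in one place.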