03. GPU Not Detected
The Diagnostic Sequence
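GPU detection failures cascade upward from hardware to driver, CUDA runtime, and container, so a failure at one layer breaks every layer above it. Work through the steps in order, and do not conclude the GPU is working until every step passes.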
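Because each layer depends on the one below it, the useful question is which layer fails first. This is a minimal sketch of that ordering. The layer names and the `first_failing_layer` helper are hypothetical, and each boolean stands for whether that step's command succeeded.

```python
from __future__ import annotations

# Layers in the order the diagnostic sequence checks them (hypothetical names)
LAYERS = ["hardware", "driver", "cuda_runtime", "container"]


def first_failing_layer(results: dict[str, bool]) -> str | None:
    """Return the lowest layer that failed, or None if every layer passed.

    A missing entry counts as a failure: an unchecked layer is not known to work.
    """
    for layer in LAYERS:
        if not results.get(layer, False):
            return layer
    return None


# Higher-layer failures are symptoms; fix the lowest failing layer first
print(first_failing_layer(
    {"hardware": True, "driver": False, "cuda_runtime": False, "container": False}
))  # prints "driver"
```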
Step 1: Hardware Check
lspci | grep -i nvidia
If this returns nothing, the Linux kernel cannot see the GPU on the PCIe bus. The cause is hardware or firmware (card not fully seated, auxiliary power not connected, or the slot disabled in firmware), not software. Check that the PCIe slot is enabled in the BIOS/UEFI settings.
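To run the same check from a script, you can filter `lspci` output for NVIDIA display-class devices. This is stricter than the `grep` above, which also matches the card's onboard HDMI audio function. This is a sketch: the `nvidia_gpus` helper is hypothetical, and it assumes lspci's default one-line-per-device format, where GPUs appear as "VGA compatible controller" or "3D controller".

```python
from __future__ import annotations

import shutil
import subprocess

# PCI class descriptions lspci uses for display devices
GPU_CLASSES = ("VGA compatible controller", "3D controller")


def nvidia_gpus(lspci_output: str) -> list[str]:
    """Return lspci lines that describe NVIDIA GPUs (not their audio functions)."""
    return [
        line
        for line in lspci_output.splitlines()
        if "nvidia" in line.lower() and any(cls in line for cls in GPU_CLASSES)
    ]


if __name__ == "__main__":
    if shutil.which("lspci") is None:
        print("lspci not installed (it ships with pciutils)")
    else:
        out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
        gpus = nvidia_gpus(out)
        print("\n".join(gpus) if gpus else "No NVIDIA GPU visible on the PCIe bus")
```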
Step 2: Driver Check
nvidia-smi
If this fails with "command not found", the NVIDIA driver is not installed (nvidia-smi ships with the driver). If it fails with "No devices were found", the driver is installed but did not detect the GPU. Typical causes are a driver version that does not support this GPU or a kernel module that failed to load.
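In a script, the two failure messages above can be separated like this. This is a minimal sketch: the helper names are hypothetical, and it matches the literal text "No devices were found", so a different driver version that words the message differently would fall through to the generic failure branch.

```python
from __future__ import annotations

import shutil
import subprocess


def classify_nvidia_smi(found: bool, returncode: int, output: str) -> str:
    """Map an nvidia-smi outcome to the diagnosis described in Step 2."""
    if not found:
        return "driver not installed"
    if "No devices were found" in output:
        return "driver loaded but no GPU detected"
    if returncode != 0:
        return "nvidia-smi failed; check lsmod and dmesg"
    return "ok"


def run_nvidia_smi() -> str:
    path = shutil.which("nvidia-smi")
    if path is None:
        return classify_nvidia_smi(False, 127, "")
    proc = subprocess.run([path], capture_output=True, text=True)
    # Search both streams, since the message may not go to stdout
    return classify_nvidia_smi(True, proc.returncode, proc.stdout + proc.stderr)


if __name__ == "__main__":
    print(run_nvidia_smi())
```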
# Check loaded kernel modules
lsmod | grep nvidia
# Check dmesg for GPU-related errors
sudo dmesg | grep -i nvidia
# The NVIDIA kernel module prefixes its log lines with "NVRM:", which the pattern above misses
sudo dmesg | grep -i nvrm
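`lsmod` reads `/proc/modules`, so a script can check the module list without running another command. This sketch uses a hypothetical helper that returns loaded modules whose names start with `nvidia`. On a working setup you would typically see `nvidia` itself plus companions such as `nvidia_drm` or `nvidia_uvm`. An empty result means the kernel module never loaded, and dmesg is where to look for the reason.

```python
from __future__ import annotations

from pathlib import Path


def loaded_nvidia_modules(modules_text: str) -> set[str]:
    """Return loaded module names starting with 'nvidia'.

    Accepts /proc/modules or lsmod output; both put the module name in the
    first column. lsmod's 'Module Size Used by' header never matches.
    """
    names = set()
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            names.add(fields[0])
    return names


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        mods = loaded_nvidia_modules(proc_modules.read_text())
        print(sorted(mods) if mods else "No nvidia kernel modules loaded; check dmesg")
```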
Step 3: CUDA Runtime Check
nvidia-smi
# Should show GPU model, driver version, temperature, and memory usage; the "CUDA Version" in the header is the highest CUDA version this driver supports
Then verify that PyTorch can reach the GPU through CUDA:
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}'); print(f'Device name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
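An expanded version of the one-liner can also report `torch.version.cuda`, which is the CUDA version PyTorch was built against and is None for a CPU-only build. If nvidia-smi works but `cuda_available` is False, a None there points to a CPU-only PyTorch install. Otherwise, a build that targets a newer CUDA than the driver's reported "CUDA Version" is a likely suspect. This is a sketch: the `cuda_report` helper is hypothetical, and it returns early instead of failing when torch is not installed.

```python
from __future__ import annotations

import importlib.util


def cuda_report() -> dict:
    """Summarize what PyTorch can see; safe to call without torch installed."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return {
        "torch_installed": True,
        # None means a CPU-only PyTorch build, which can never use the GPU
        "torch_cuda_version": torch.version.cuda,
        "cuda_available": available,
        "device_count": count,
        "device_names": [torch.cuda.get_device_name(i) for i in range(count)],
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```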
Step 4: Container Check
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
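If this fails but nvidia-smi works on the host, the NVIDIA Container Toolkit is either not installed or not configured for Docker. Because the image pins CUDA 12.0, the host driver must also support CUDA 12.0 (driver 525.60.13 or newer on Linux). Otherwise the check can fail even when the toolkit is set up correctly.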
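The host-versus-container comparison can be expressed as a small decision function. This is a sketch with hypothetical helper names. It reuses the image tag from the command above and, when run directly, assumes `docker` and `nvidia-smi` are both on PATH.

```python
from __future__ import annotations

import shutil
import subprocess

# Image used by the container check above
IMAGE = "nvidia/cuda:12.0.0-base-ubuntu22.04"


def container_check_command(image: str = IMAGE) -> list[str]:
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


def diagnose_container(host_ok: bool, container_ok: bool) -> str:
    """Localize the failure by comparing nvidia-smi on the host and in a container."""
    if not host_ok:
        return "fix the host driver first (Step 2); a container cannot see a GPU the host cannot"
    if not container_ok:
        return "NVIDIA Container Toolkit missing or Docker not configured to use it"
    return "GPU visible on host and in containers"


if __name__ == "__main__":
    if shutil.which("docker") and shutil.which("nvidia-smi"):
        host_ok = subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
        container_ok = subprocess.run(container_check_command(), capture_output=True).returncode == 0
        print(diagnose_container(host_ok, container_ok))
```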
Common Causes and Fixes
Driver version too old: New GPUs need a driver release recent enough to include support for them. For RTX 40-series cards, use driver 535 or newer.
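nvidia-smi can report the installed driver version in machine-readable form with `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`. This sketch parses that output and compares each version numerically with a minimum. The helper names are hypothetical, and the threshold of 535 mirrors the figure above rather than an official support table.

```python
from __future__ import annotations

import shutil
import subprocess


def parse_gpu_drivers(csv_text: str) -> list[tuple[str, str]]:
    """Parse 'name, driver_version' rows from nvidia-smi CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        # rsplit keeps any comma inside the GPU name attached to the name
        name, version = (field.strip() for field in line.rsplit(",", 1))
        rows.append((name, version))
    return rows


def version_tuple(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version: str, minimum: str) -> bool:
    """Numeric comparison, so '535.104.05' >= '535' and '1000.1' > '999'."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            for name, version in parse_gpu_drivers(proc.stdout):
                verdict = "ok" if meets_minimum(version, "535") else "older than 535"
                print(f"{name}: driver {version} ({verdict})")
```

Comparing tuples of integers avoids the string-comparison trap where "1000" sorts before "999".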
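Secure Boot blocking the driver: With Secure Boot enabled, the kernel refuses to load modules that are not signed with a trusted key, so an unsigned NVIDIA kernel module never loads. Disable Secure Boot in UEFI, or sign the module and enroll the signing key (for example, as a Machine Owner Key with mokutil).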
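On distributions that ship `mokutil`, `mokutil --sb-state` reports whether Secure Boot is on. This sketch interprets that output. It assumes the text contains "SecureBoot enabled" or "SecureBoot disabled", which may vary between versions, and the helper name is hypothetical. Anything else is reported as unknown rather than guessed.

```python
from __future__ import annotations

import shutil
import subprocess


def secure_boot_enabled(sb_state_output: str) -> bool | None:
    """True/False from `mokutil --sb-state` text, or None if unrecognized."""
    text = sb_state_output.lower()
    if "secureboot enabled" in text:
        return True
    if "secureboot disabled" in text:
        return False
    return None


if __name__ == "__main__":
    if shutil.which("mokutil"):
        proc = subprocess.run(["mokutil", "--sb-state"], capture_output=True, text=True)
        state = secure_boot_enabled(proc.stdout + proc.stderr)
        messages = {
            True: "Secure Boot is on: an unsigned nvidia module will be rejected",
            False: "Secure Boot is off: it is not what blocks the driver",
            None: "Secure Boot state unknown",
        }
        print(messages[state])
```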
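Docker without the NVIDIA runtime: Pass --gpus all on every docker run command, or register the NVIDIA runtime in /etc/docker/daemon.json and make it the default. To do that, run `sudo nvidia-ctk runtime configure --runtime=docker` to add the "runtimes" entry, add "default-runtime": "nvidia", and restart Docker. Setting "default-runtime" alone is not enough if no "nvidia" runtime is registered.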
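This sketch checks /etc/docker/daemon.json. It reports whether an `nvidia` entry exists under `runtimes` and whether `default-runtime` points at it. The helper name is hypothetical. The keys follow the layout that `nvidia-ctk runtime configure --runtime=docker` writes, where `runtimes.nvidia.path` names `nvidia-container-runtime`.

```python
from __future__ import annotations

import json
from pathlib import Path


def docker_nvidia_runtime(daemon_json: str) -> dict[str, bool]:
    """Report whether daemon.json registers the nvidia runtime and makes it default."""
    config = json.loads(daemon_json) if daemon_json.strip() else {}
    return {
        "runtime_registered": "nvidia" in config.get("runtimes", {}),
        "default_is_nvidia": config.get("default-runtime") == "nvidia",
    }


if __name__ == "__main__":
    path = Path("/etc/docker/daemon.json")
    text = path.read_text() if path.exists() else ""
    print(docker_nvidia_runtime(text))
```

The combination `default_is_nvidia` without `runtime_registered` is the inconsistent case described above, because the default names a runtime Docker does not know about.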
On your own system, run the full diagnostic sequence from the hardware check through the container check, and record the output of each command. Once you know what each command verifies, a detection failure points you straight to the layer that broke.
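To keep a record of each step, a small script can run the commands in order and collect everything in one log. This is a minimal sketch. The log filename `gpu-diagnostics.log` is an assumption, and `sudo dmesg` is left out because it may prompt for a password. A missing binary or a timeout is recorded in the log instead of stopping the run.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path

# Commands from the diagnostic sequence, in order (sudo dmesg omitted: it may prompt)
COMMANDS = [
    ["lspci"],
    ["nvidia-smi"],
    ["lsmod"],
    [sys.executable, "-c", "import torch; print(torch.cuda.is_available())"],
    ["docker", "run", "--rm", "--gpus", "all", "nvidia/cuda:12.0.0-base-ubuntu22.04", "nvidia-smi"],
]


def run_step(cmd: list[str], timeout: float = 300) -> str:
    """Run one command and return a log entry with its exit status and output."""
    header = "$ " + " ".join(cmd)
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return f"{header}\n[not found: {cmd[0]}]\n"
    except subprocess.TimeoutExpired:
        return f"{header}\n[timed out after {timeout}s]\n"
    return f"{header}\n[exit {proc.returncode}]\n{proc.stdout}{proc.stderr}\n"


if __name__ == "__main__":
    log = "".join(run_step(cmd) for cmd in COMMANDS)
    Path("gpu-diagnostics.log").write_text(log)
    print(log)
```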