System guide · Operations

Maintaining a local-AI build over time

What breaks after 3 months. Driver drift, CUDA mismatches, ROCm cycles, Windows-update fallout, Docker / runtime versioning, SSD wear from embeddings, fan curves, dust, BIOS updates, PSU degradation. The operator's day-90-and-beyond reality.

By Fredoline Eruo · Reviewed 2026-05-07 · ~2,100 words

Why this guide exists

Most local-AI guidance stops at does it boot? But every operator who runs a build for more than three months hits the same set of slow-motion failures: a driver auto-update breaks vLLM, a Windows kernel patch silently switches WSL2 GPU access off, the Docker daemon's overlay2 layer cache fills the SSD, a ROCm minor bump makes Q4_K_M output garbled tokens, the GPU fan bearing starts failing after months of running at 73 °C, and the PSU's capacitors, aged by sustained load, start drooping under transient 4090 spikes.

None of these are show-stopping bugs. All of them are operator failures: failures to notice in time. This page is the operator's pre-mortem.

The maintenance schedule that works

The cadence below is what breaks the failure-of-noticing pattern. Every event is small and cheap; missing the cadence is what compounds.

  • Daily. Glance at the GPU temp + power dashboard (see /systems/local-ai-observability). Should take 10 seconds.
  • Weekly. docker system prune -af --volumes if you run agent sandboxes (OpenHands leaks layers fast); this is automated in the cron sketch after this list. Verify that Caddy / Tailscale auto-renewal didn't error.
  • Monthly. Read the change-log of every pinned image / wheel before bumping. Run nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv (temperature.memory reads N/A on many consumer cards) and check whether the memory junction has crept up.
  • Quarterly. Open the case, blow out the dust, listen for fan-bearing noise. Re-pin the OS auto-update window.
  • Annually. Decide whether to upgrade — driver, CUDA, kernel, runtime. Read the relevant RC threads first.
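
Both recurring items can run unattended. A minimal cron sketch, assuming root cron and illustrative log paths; adjust to your own layout:

    # /etc/cron.d/local-ai-maintenance (illustrative; adjust user, paths, schedule)
    # Weekly, Sunday 04:00: reclaim the Docker layers agent sandboxes leak.
    0 4 * * 0  root  docker system prune -af --volumes >> /var/log/docker-prune.log 2>&1
    # Monthly, 1st at 04:10: append GPU temps so slow creep is visible in hindsight.
    10 4 1 * *  root  nvidia-smi --query-gpu=timestamp,temperature.gpu,temperature.memory --format=csv,noheader >> /var/log/gpu-temps.csv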

Driver / CUDA architecture and what breaks

NVIDIA's split between driver, CUDA toolkit, and cuDNN / TensorRT / NCCL is where most operator pain originates. The driver is the kernel module (plus the userspace libcuda it ships) that actually talks to your GPU; the CUDA toolkit is the userspace API your inference engine compiles against; cuDNN ships optimized kernels for common ML ops. They have to agree on ABI versions or you get cryptic loader errors.

The most frequent failure: a Linux distro auto-updates the driver from 535.x to 545.x. vLLM compiled against CUDA 12.4 still loads, but a newly pulled ExLlamaV2 wheel built against 12.6 crashes with a libcuda.so version mismatch. The fix is dull: apt-mark hold nvidia-driver-535 on Ubuntu, and bump only deliberately, on a quiet weekend.
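
The pin itself, sketched for Ubuntu; package names vary by distro and driver branch:

    # What driver is the kernel module actually running?
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    # Freeze the driver metapackage so unattended-upgrades cannot bump it.
    sudo apt-mark hold nvidia-driver-535
    # On the quiet weekend, bump deliberately:
    sudo apt-mark unhold nvidia-driver-535
    sudo apt-get install nvidia-driver-545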

On WSL2 the driver lives on the Windows host but the toolkit lives in the Linux guest. A Windows Update that bumps the driver but not the WSL kernel patch produces a working nvidia-smi alongside torch.cuda.is_available() returning False. See /errors/wsl2-gpu-not-detected for the canonical fix.
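
A quick triage from inside the guest, assuming a PyTorch environment is installed:

    # The Windows driver injects nvidia-smi into WSL; this succeeding only proves
    # the host driver is visible, not that CUDA can initialize.
    nvidia-smi
    # The check that actually fails when driver and WSL kernel drift apart:
    python -c "import torch; print(torch.cuda.is_available())"
    # WSL kernel version, to compare against the new driver's release notes:
    uname -r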

ROCm breakage cycles vs CUDA

ROCm is on a faster, less-stable cadence than CUDA. ROCm 6.0 → 6.1 → 6.2 in 2025 each shifted gfx-target compile flags; binaries built against one hipBLAS version refused to load with another. The operator pattern that survives this:

  • Pin both amdgpu-dkms and ROCm userspace versions explicitly (apt-mark hold).
  • Build inference engines from source against the pinned ROCm rather than relying on prebuilt wheels (see the sketch after this list).
  • Read every minor release-notes page. AMD has done some excellent ROCm work in 2025-2026, but it still ships breaking changes more often than NVIDIA.
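
A hedged sketch of the pattern for llama.cpp; the flag names (GGML_HIP, AMDGPU_TARGETS) and the hipconfig helpers track recent llama.cpp build docs and may differ in your checkout, and the gfx target must match your card:

    # Pin kernel driver and ROCm userspace together so they cannot drift apart.
    sudo apt-mark hold amdgpu-dkms rocm
    # Build llama.cpp against exactly the pinned ROCm. gfx1100 = RDNA3; set yours.
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
        cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release -j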

If you are on Windows + AMD, that is the highest-pain combination of all. The ROCm-on-Windows path improved in 2025 but still falls over on multi-GPU and FlashAttention paths. See /errors/rocm-device-error.

Apple Silicon — different shape, similar discipline

Apple's stack moves more slowly and breaks less often. The main pain point is macOS major-version upgrades (Sonoma → Sequoia → Tahoe), which can break MLX-LM or llama.cpp's Metal backend for a few weeks while the engines catch up. Defer the OS upgrade for 6-8 weeks after release.

The other Apple-specific item: thermal throttling. The fanless / quiet design that makes M3 Ultra Mac Studios so pleasant in chat workloads becomes a problem under sustained inference. Long-running agent loops or batch embeddings push the chassis to the thermal ceiling; you watch tok/s drop without an obvious error. The fix is to schedule heavy batch work for cool ambient hours, or to add a quiet external fan to the chassis surface.
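
Since there is no error to catch, measure the throttling directly. A sketch using the built-in powermetrics tool; output fields vary across macOS versions:

    # One thermal sample; throttling shows as a pressure level above "Nominal".
    sudo powermetrics --samplers thermal -i 1000 -n 1

Watch tok/s alongside this during a long batch run to correlate the drop with pressure level rather than guessing.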

Windows-update workflow / failure modes

Windows Update is the largest source of unsolicited change in any local-AI build that uses Windows or WSL2. The discipline:

  • Set Active Hours wide; never let updates kick in mid-eval-run.
  • Defer feature updates 90+ days. Taking cumulative security updates within 7-14 days is fine; feature updates frequently shift WSL behavior.
  • Snapshot before any major update. If your Linux environment runs in VirtualBox or VMware, take a VM snapshot. If it runs in WSL2, at least export the distro (wsl --export, sketched below) and back up /etc and the Docker volumes.
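
The WSL2 export, sketched; run from PowerShell or cmd on the host, with your distro name taken from wsl -l -v and a backup path of your choosing:

    # Export the distro your build lives in to a tarball before the update.
    wsl -l -v
    wsl --export Ubuntu-22.04 D:\backups\wsl-pre-update.tar
    # If the feature update breaks the guest, re-import under a fresh name:
    wsl --import Ubuntu-restored D:\wsl\restored D:\backups\wsl-pre-update.tar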

Docker + runtime version drift

Pinning Docker image SHAs (not tags) is the single highest-leverage maintenance discipline. vllm/vllm-openai:latest changes silently; vllm/vllm-openai@sha256:... does not. Same for Open WebUI, Ollama, Qdrant, Caddy.
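
The pin workflow, sketched; the digest is whatever your local pull resolves to, not a value to copy:

    # Resolve the digest the tag currently points at...
    docker pull vllm/vllm-openai:latest
    docker inspect --format '{{index .RepoDigests 0}}' vllm/vllm-openai:latest
    # ...then pin by digest in compose so "latest" can never move underneath you:
    #   image: vllm/vllm-openai@sha256:<digest from above>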

The other Docker-specific pain: overlay2 layer growth. A long-running OpenHands agent leaks per-task ephemeral layers that don't get cleaned up automatically. docker system prune -af --volumes is the weekly cron that prevents the disk filling up.
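
Before the disk actually fills, docker system df shows whether overlay2 growth is the culprit:

    # Images / containers / volumes / build cache at a glance; the RECLAIMABLE
    # column is what the weekly prune would free.
    docker system df
    # Per-image and per-volume detail when something looks bloated:
    docker system df -v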

SSD wear: embeddings + checkpoints + sandbox layers

Modern consumer NVMe drives are rated 600 TBW (terabytes written) for the cheap end, 1200-2400 TBW for the prosumer end. Local AI rarely approaches those limits in normal use, but specific workloads grow fast:

  • Embedding ingestion. A nightly full re-ingest of a 100K-file monorepo writes ~1-3 GB; annually that is 365-1100 GB. Switch to incremental (watcher-based) ingestion, sketched after this list.
  • Fine-tune checkpoints. Saving a checkpoint every 50 steps on a 7B QLoRA writes ~150 MB per save. Multi-day training campaigns can easily write 50-100 GB.
  • Agent sandbox layers. See above.
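
The watcher pattern, sketched; inotifywait ships with inotify-tools, the repo path is illustrative, and reingest.sh is a hypothetical hook that embeds a single file:

    # Incremental ingestion: re-embed only files that actually changed.
    inotifywait -m -r -e close_write,create,moved_to --format '%w%f' /srv/monorepo |
      while read -r path; do
        ./reingest.sh "$path"   # hypothetical: embed just this file, not the repo
      done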

The mitigation is mundane: monitor SMART data quarterly with smartctl -a /dev/nvme0n1; replace at 85% wear. Do not run a fine-tuning workstation on a QLC drive.
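
The quarterly check, filtered to the two NVMe counters that matter; smartctl is from the smartmontools package:

    # "Percentage Used" is the drive's own wear estimate; "Data Units Written"
    # (x512,000 bytes per unit) tracks against the TBW rating.
    sudo smartctl -a /dev/nvme0n1 | grep -Ei 'percentage used|data units written'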

Thermals over time: fans, dust, paste

On a 24/7 homelab, fans are the #1 mechanical failure. RTX 4090 / 3090 reference-design vapor chambers degrade subtly: a small thermal-pad gap forms, the memory junction temp creeps from 78 °C to 92 °C over 18 months, and you start seeing memory errors and instability under load. Quarterly dust-blow is non-negotiable. Repaste every 18-24 months on cards that run hot.

On Mac Studio hardware, the failure is dust ingestion through the bottom intake. Same quarterly schedule with an air can.

PSU degradation under sustained AI load

Cheaper PSUs use capacitors that drift over 2-3 years of sustained inference load. The first symptom is intermittent system shutdowns under transient GPU spikes (4090 transients hit 600 W+). Buy a Gold-rated 1000 W+ PSU (Seasonic, Corsair RMx) for any single-4090 build; 1200 W+ for dual-card. Replace at the first sign of intermittent shutdowns; do not chase the diagnosis.

BIOS updates: when to bother, when to skip

Most BIOS updates fix bugs you don't have. The exceptions are: (1) AGESA / microcode updates that fix CPU memory-controller bugs; (2) PCIe bifurcation fixes that affect multi-GPU stability; (3) security patches you actually need. Skip cosmetic updates.

VRAM fragmentation and OOM-near-miss

After weeks of mixed-workload usage, the GPU memory allocator can fragment. The model fits, the KV cache fits, the activations fit — but a contiguous block fails. The symptom is intermittent OOM at the same context length that worked yesterday. Restart the inference engine. If you see this often, look at /will-it-run/custom's effective-VRAM math — you may be running closer to the edge than you think.
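
Restarting is the fix; if it recurs on a PyTorch-based engine, the allocator can be switched to expandable segments, which trades a little speed for fragmentation resistance. A sketch assuming a systemd-managed engine under the hypothetical unit name vllm.service:

    # One-off fix: bounce the engine; a fresh process gets an unfragmented heap.
    sudo systemctl restart vllm.service
    # Recurring fix: add to the unit's Environment= (or the launch env):
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True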

What this looks like in practice (calendar)

Concretely, an operator running a single-4090 coding-agent workflow should expect, in the first year:

  • 2-3 driver auto-update incidents requiring rollback or pin.
  • 1 Docker disk-fill event if no weekly prune.
  • 1 thermal creep event (dust accumulation) at the 6-month mark.
  • 0-1 PSU intermittent-shutdown events if the PSU is undersized.
  • 3-5 Open WebUI / Ollama image bumps that tweak something.

When the build is genuinely past it

The honest signal: when keeping the build current consumes more time than the build saves. For a single-4090 homelab, that's typically 3-4 years. For a small-team production deployment, 18-24 months. At that point the upgrade math is straightforward — see /will-it-run/custom for current-gen alternatives, and /compare/builds to compare your current build against next-gen.

Adjacent guides: observability and security are the other two operator-realism surfaces.