
Homelab: 24/7 local AI you don't babysit

For: Owners of a homelab box (Proxmox, bare metal, or NAS-adjacent) who want a stable always-on inference service. By the end: A homelab inference service that survives a power cycle, reports its own health, and lets you reach it from anywhere securely.

By Fredoline Eruo · 8 milestones · Last reviewed 2026-05-07

A homelab inference service is an entirely different problem than a workstation that runs a model when you ask. The box has to survive a power blip at 3am. The fans need to behave when nobody's watching. The model has to come back up after a kernel upgrade without you SSH'ing in. This path walks the eight operational disciplines that turn a stack-on-a-shelf into a service you can ignore.

Pick the chassis and rule out the wrong ones

Used Threadripper, refurb Xeon, prosumer ATX with a 1000W Platinum — all fine. Mini-PCs with a soldered GPU and a laptop-sized PSU — not for 24/7 inference. The rule of thumb: at sustained 80% GPU load, your PSU should run at 60-70% of rated, not 90%. If you can't measure this, you can't run 24/7.
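The headroom rule above is simple arithmetic once you have real draw numbers. A sketch, using hypothetical figures (your GPU draw from nvidia-smi or a wall meter, your platform estimate, your PSU's rating):

```shell
# Hypothetical numbers -- substitute your own measurements.
GPU_DRAW_W=420        # GPU at sustained ~80% load (nvidia-smi or wall meter)
CPU_PLATFORM_W=250    # CPU, drives, fans, board (estimate or metered)
PSU_RATED_W=1000

TOTAL_W=$((GPU_DRAW_W + CPU_PLATFORM_W))
LOAD_PCT=$((100 * TOTAL_W / PSU_RATED_W))
echo "sustained load: ${TOTAL_W}W = ${LOAD_PCT}% of rated"
# Target 60-70%. Above ~80% sustained, size up the PSU before going 24/7.
```

A $15 wall-plug meter is the cheapest way to get a real TOTAL_W instead of a datasheet guess.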

Cooling is the other half. Run your model at full tilt for an hour and watch GPU temp. If it climbs past 83°C in the first 30 minutes you have a thermals problem now and a reliability problem in three months.
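One way to capture that hour-long soak, assuming an NVIDIA card (the query fields and loop flag are standard nvidia-smi options):

```shell
# Log temperature and power every 10s while a sustained load test runs;
# review the CSV afterward for the climb past ~83C.
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw \
  --format=csv -l 10 | tee gpu-thermal-log.csv
```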

When this is done you should have
A box with adequate PSU headroom, real cooling, and a thermals story you've measured at sustained load.

Lock the OS, lock the kernel

The single biggest cause of "homelab inference broke overnight" is an automatic kernel upgrade that left the GPU driver behind. Disable unattended-upgrades for kernel packages. When you do upgrade, do it on a schedule you choose, with a rollback plan you've tested.
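On Ubuntu/Debian, one way to express this is a blacklist in the unattended-upgrades config (the file path is the Ubuntu default; the patterns are examples covering kernel and NVIDIA packages):

```
// /etc/apt/apt.conf.d/50unattended-upgrades fragment:
// security updates still flow, but kernel and driver packages are skipped.
Unattended-Upgrade::Package-Blacklist {
    "linux-image-";
    "linux-headers-";
    "nvidia-";
};
```

Belt-and-suspenders: `apt-mark hold` on the currently working kernel and driver packages blocks them from any upgrade path, not just the unattended one.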

Distro choice matters less than the discipline. Ubuntu 22.04/24.04 LTS or Debian Stable both work. Avoid rolling distros for an unattended box.

When this is done you should have
Linux installed (Ubuntu LTS or Debian Stable), kernel pinned, automatic kernel updates disabled. Driver stack verified to match the kernel.

Run inference as a managed service

Tmux sessions are not a service. Screen is not a service. A systemd unit (or Docker container with restart=unless-stopped) is. Pick one and stick to it. The unit file is now part of your homelab definition — version-control it.
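A minimal unit file sketch for the systemd route — paths, user, model, and port are placeholders for your install:

```ini
# /etc/systemd/system/vllm.service -- example values throughout
[Unit]
Description=vLLM inference server
After=network-online.target
Wants=network-online.target

[Service]
User=inference
ExecStart=/opt/vllm/bin/vllm serve /models/llama-3-8b --port 8000
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`systemctl enable --now vllm.service` gets you start-on-boot and restart-on-crash in one move; `journalctl -u vllm.service` is your persistent log.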

Validate by killing the process and watching it come back. Then power-cycle the box and watch the model come up without intervention. If it doesn't, you don't have a homelab service yet.
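The crash half of that validation, assuming the unit name from your own setup (here `vllm.service`) and vLLM's health endpoint:

```shell
# Kill the process hard and confirm systemd brings it back on its own.
sudo systemctl kill --signal=SIGKILL vllm.service
sleep 10
systemctl is-active vllm.service              # should print "active"
curl -sf http://127.0.0.1:8000/health && echo "serving"

# The power-cycle half has no shortcut: sudo reboot, hands off the keyboard,
# then run the two checks above again.
```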

When this is done you should have
vLLM or llama.cpp running under systemd (or a Docker container with a restart policy), starting on boot, restarting on crash, logs going somewhere persistent.

Add observability before you need it

Set this up while everything is healthy. The whole point of observability is that when something breaks at 2am, you already have the dashboard, the alert, and the baseline. Adding metrics during an incident is not a strategy.

Minimum viable: GPU temperature, GPU memory utilization, request rate, error rate, system load. If your dashboard has fewer than five tiles, it's too sparse. If it has fifty, you'll never look at it.
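A minimal Prometheus scrape config covering those tiles, assuming the common exporters on their default ports (dcgm-exporter on 9400 for GPU metrics, node_exporter on 9100 for system load — adjust to whatever you actually run):

```yaml
# prometheus.yml fragment -- job names and targets are examples
scrape_configs:
  - job_name: gpu
    static_configs:
      - targets: ["localhost:9400"]   # NVIDIA DCGM exporter: temp, memory
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]   # node_exporter: load, disk, network
```

Request rate and error rate come from your inference server's own metrics endpoint, if it exposes one; add it as a third job.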

When this is done you should have
Prometheus + Grafana (or equivalent) showing GPU temp, GPU memory, request rate, and error rate. A dashboard you've actually looked at.

Power and UPS discipline

A 4090 + Threadripper at full tilt pulls 700-900W. Your UPS sizing must include real margin — a "1000VA" UPS is closer to 600W usable. The job of the UPS is not to keep the box running through an outage; it's to give you 5 clean minutes for an automatic shutdown.
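The VA-to-watts gap is worth computing explicitly. A sketch with hypothetical figures (consumer UPSes commonly carry a power factor around 0.6):

```shell
# Hypothetical figures: a "1500VA" consumer UPS, ~0.6 power factor.
UPS_VA=1500
POWER_FACTOR_PCT=60                 # usable watts ~= VA * 0.6
LOAD_W=800                          # measured box draw at full tilt

USABLE_W=$((UPS_VA * POWER_FACTOR_PCT / 100))
echo "usable: ${USABLE_W}W, load: ${LOAD_W}W"
# 900W usable vs 800W load: it holds, but runtime at that load is minutes.
# Configure NUT/apcupsd to start shutdown the moment power goes to battery.
```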

Test it. With the model loaded and serving requests, pull the wall plug and watch what happens. If the box hard-crashes, you don't have a UPS, you have a heavy paperweight.

When this is done you should have
A UPS sized for clean shutdown, configured with NUT or apcupsd, tested by pulling the plug while the box is loaded.

Remote access without exposing the model port

Do not port-forward 8000 to your model. Don't put it behind a basic-auth proxy and call it secure. Use a mesh VPN — Tailscale is the path of least resistance, WireGuard if you want full control. Then your phone, your laptop, and your other boxes reach the model on a private address that simply doesn't exist on the public internet.

Verify with a port scanner from outside your network. If 8000 responds, you did it wrong; go back.
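One way to run both checks, assuming Tailscale and an nmap scan from a machine outside your LAN (a phone hotspot works):

```shell
# From OUTSIDE your network: the model port must not be open.
nmap -Pn -p 8000 YOUR_PUBLIC_IP       # expect: filtered or closed, never open

# From a device on the tailnet: the model answers on its private address
# (Tailscale assigns addresses in the 100.64.0.0/10 range).
curl -sf http://100.x.y.z:8000/v1/models
```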

When this is done you should have
WireGuard or Tailscale configured. The model port is reachable only over the VPN, never on the public internet.

Restart discipline and rollback plan

"It works, don't touch it" is not a maintenance plan. A real plan: schedule monthly maintenance windows, snapshot the system before changes (LVM, ZFS, or just a config backup), test rollback before you ever need it. The difference between a homelab and a "machine that runs sometimes" is exactly this discipline.
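For a ZFS root, the snapshot step of that plan is one command before each window (the dataset name is an example; `zfs list` shows yours):

```shell
# Before touching kernel, driver, or runtime: snapshot the root dataset.
sudo zfs snapshot rpool/ROOT/ubuntu@pre-upgrade-$(date +%F)

# If the new kernel/driver pair fails, roll back and reboot:
#   sudo zfs rollback rpool/ROOT/ubuntu@pre-upgrade-<date>
```

On LVM the equivalent is `lvcreate --snapshot`; with neither, a tarball of /etc plus your unit files is the floor, not the goal.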

When this is done you should have
A documented procedure for kernel upgrades, driver upgrades, and runtime upgrades. A backup of the working configuration.

Document the runbook

Future-you, three months from now, has forgotten exactly why the GPU power limit is set to 350W and why the inference service depends on the local DNS server starting first. Document it. The runbook is for that future-you.

If the box dies and you have to rebuild on new hardware, the runbook is the spec. Treat it like infrastructure code, because it is.

When this is done you should have
A written runbook with the boot sequence, recovery steps, contact list, and dependency map. Stored somewhere that survives the box dying.

Next recommended step

The cadence questions: weekly health checks, monthly upgrades, quarterly hardware inspections.