System guide · Operating system

Local AI on Linux — operator-grade deployment paths

The definitive Linux local-AI operating manual. CUDA + ROCm install, Docker GPU passthrough, vLLM/SGLang/llama.cpp deployment, systemd services, headless deployment, AMD-on-Linux advantages, container-native serving, homelab and production patterns. Where serious local AI actually runs.

By Fredoline Eruo · Last reviewed 2026-05-07

Why Linux is the production-default for local AI

Every production-grade local-AI runtime in 2026 — vLLM, SGLang, TensorRT-LLM, TGI, Ray Serve — is Linux-first. NVIDIA's container toolkit, AMD's ROCm, and Intel's oneAPI all target Linux as the reference platform. Windows + WSL2 works for development; Linux is where the deployment lives.

The honest framing of the OS-tier hierarchy:

  • Linux: production-default. Every runtime supported. Every deployment pattern documented.
  • macOS: Apple Silicon-only. MLX-LM is the production path; vLLM doesn't run.
  • Windows + WSL2: development viable. Production deployment belongs on Linux.
  • Windows native: hobbyist tier. Ollama / LM Studio work; serious serving needs WSL2 or Linux.

Distro choices — Ubuntu, Debian, Fedora, Arch, NixOS

For local AI, distro choice matters less than driver-stack compatibility. The pragmatic ranking:

  • Ubuntu LTS (22.04 or 24.04): the operator default. NVIDIA datacenter drivers, ROCm, OpenVINO, Docker — all packaged officially. 90%+ of deployment guides target Ubuntu LTS.
  • Debian stable: where you go when you don't want Ubuntu's telemetry. Trails Ubuntu on driver freshness by 6-12 months.
  • Fedora / RHEL: solid for NVIDIA work; ROCm support is improving. Good fit for enterprise environments.
  • Arch / Manjaro: bleeding-edge driver versions. Useful for testing latest features; risky for production stability.
  • NixOS: most reproducible deployment story. Steep learning curve. Production-viable if your team already uses Nix.

NVIDIA stack: driver + CUDA toolkit

The canonical install of the NVIDIA stack on Ubuntu 24.04 LTS:

# 1. Add NVIDIA package signing key + repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# 2. Install proprietary driver + CUDA toolkit
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560

# 3. Reboot for driver to load
sudo reboot

# 4. Verify
nvidia-smi
# Expected: GPU listed, driver 560+, CUDA 12.6+

The nvidia-driver-560-server variant is the production-grade pin for sustained-load deployments. Don't mix the open-source nouveau driver with proprietary NVIDIA — blacklist nouveau if it loads.
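
If nouveau is already loaded, the usual fix is a modprobe blacklist plus an initramfs rebuild. A sketch (the file name is just a convention):

# Stop nouveau from binding the GPU before the proprietary driver
sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot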

AMD stack: ROCm install + verification

ROCm is AMD's CUDA equivalent. Linux is the production-grade path; Windows ROCm trails by 6-12 months.

# 1. ROCm 6.x install on Ubuntu 22.04 / 24.04 (URL below is the 22.04 "jammy" path; use "noble" for 24.04)
sudo apt install -y python3-setuptools python3-wheel
wget https://repo.radeon.com/amdgpu-install/6.2/ubuntu/jammy/amdgpu-install_6.2.60200-1_all.deb
sudo apt install -y ./amdgpu-install_6.2.60200-1_all.deb
sudo amdgpu-install -y --usecase=rocm

# 2. Add user to render + video groups
sudo usermod -a -G render,video $USER
sudo reboot

# 3. Verify
rocm-smi
# Expected: GPU listed (e.g. RX 7900 XTX, MI300A)
rocminfo | grep "Marketing Name"

For consumer cards: the RX 7900 XTX is fully supported; older Polaris / Vega cards have been dropped from 2026 ROCm releases. Confirm your card is in the ROCm support matrix before committing time to setup.
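
Once rocm-smi sees the card, a quick end-to-end check is to install PyTorch's ROCm wheels and confirm torch can see the device. A sketch, assuming the rocm6.2 wheel index matches your installed ROCm version:

# ROCm builds of PyTorch expose the GPU through the torch.cuda API surface
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"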

Intel stack: oneAPI + IPEX-LLM

For Intel Arc GPUs (A770, B580) and Lunar Lake / Meteor Lake NPUs:

# 1. Intel oneAPI base toolkit
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/.../oneapi-base-toolkit.sh
sudo sh oneapi-base-toolkit.sh

# 2. Source oneAPI environment
source /opt/intel/oneapi/setvars.sh

# 3. IPEX-LLM in a venv
python3 -m venv venv && source venv/bin/activate
pip install --pre --upgrade "ipex-llm[xpu]"

# 4. Verify XPU detected
python -c "import intel_extension_for_pytorch as ipex; print(ipex.xpu.is_available())"

Intel Arc support on Linux is solid as of 2026, but the community is smaller than CUDA's or ROCm's. See the IPEX-LLM operational review for deployment patterns.

Docker + nvidia-container-toolkit

For containerized inference, you need nvidia-container-toolkit to expose GPUs to containers:

# 1. Add NVIDIA Container Toolkit repo
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# 2. Install + restart Docker
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 3. Verify GPU access from container
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi

For ROCm: containers can reach the GPU with plain device passthrough (--device=/dev/kfd --device=/dev/dri --group-add video); AMD also ships a container-toolkit equivalent, and the official rocm/ Docker images (rocm/rocm-terminal, rocm/pytorch) cover the runtime side.
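
A minimal smoke test with plain device passthrough, assuming the rocm/rocm-terminal image ships rocm-smi on PATH:

# No container toolkit required: hand the ROCm devices straight to the container
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  rocm/rocm-terminal rocm-smi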

vLLM deployment path

The production-default for multi-tenant serving on Linux + NVIDIA:

# venv + vLLM install
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install vllm autoawq

# Pull a quantized 32B model
huggingface-cli download casperhansen/qwen-2.5-coder-32b-instruct-awq

# Launch the OpenAI-compatible server
vllm serve casperhansen/qwen-2.5-coder-32b-instruct-awq \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000

For multi-GPU: add --tensor-parallel-size N where N is the GPU count. See multi-GPU buying guide for the topology decisions.
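
Once the server is up, any OpenAI-style client can talk to it; a curl smoke test against the launch above:

# Chat-completions request; the model field must match the served model name
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/qwen-2.5-coder-32b-instruct-awq",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'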

SGLang deployment path

SGLang wins over vLLM when prefix-cache hit rate is high (agent loops with stable system prompts). The install pattern mirrors vLLM:

pip install "sglang[all]"

python -m sglang.launch_server \
  --model-path casperhansen/qwen-2.5-coder-32b-instruct-awq \
  --quantization awq \
  --tp 1 \
  --port 30000
# RadixAttention (prefix caching) is on by default; no flag needed

llama.cpp on Linux (CPU + every GPU backend)

llama.cpp is the most portable runtime — it runs on CPU and every GPU backend. The Linux build path:

# Build with CUDA (llama.cpp builds with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Or with ROCm (GGML_HIPBLAS on older trees)
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)

# Or with Vulkan (vendor-agnostic)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Pull a GGUF model + serve
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF Qwen2.5-7B-Instruct-Q4_K_M.gguf --local-dir .
./build/bin/llama-server --model Qwen2.5-7B-Instruct-Q4_K_M.gguf -ngl 999 --host 0.0.0.0 --port 8080

Ollama as systemd service

Ollama installs as a systemd service by default (for development, running ollama serve in a user session works too):

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Enable + start
sudo systemctl enable --now ollama

# Pull + run a model
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b

# OpenAI-compatible endpoint live at http://localhost:11434/v1

For headless deployment: OLLAMA_HOST=0.0.0.0:11434 + OLLAMA_MODELS=/large-disk/models via systemd environment file.
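
One way to wire that up is a systemd drop-in; a sketch, with /large-disk/models as a placeholder path:

# Override the packaged unit without editing it in place
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/large-disk/models"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama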

Headless / SSH deployment patterns

The standard remote-deploy shape:

  • Ubuntu Server 24.04 LTS minimal install
  • SSH key-only auth; fail2ban for noise reduction
  • NVIDIA driver via Ubuntu repo (no GUI)
  • Docker + nvidia-container-toolkit for containerized runtimes
  • Caddy or nginx as reverse proxy with TLS via Let's Encrypt (see the sketch after this list)
  • WireGuard or Tailscale for VPN access (don't expose port 8000 to the internet)
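
A minimal Caddy sketch for the reverse-proxy item above, assuming Caddy is installed from its apt repo, a vLLM instance on port 8000, and a placeholder domain (Caddy provisions the Let's Encrypt certificate automatically):

# Front the inference server with TLS; only the proxy is exposed
sudo tee /etc/caddy/Caddyfile <<'EOF'
ai.example.com {
    reverse_proxy 127.0.0.1:8000
}
EOF
sudo systemctl reload caddy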

Multi-GPU + cluster deployment

For multi-GPU on Linux, see running local AI on multiple GPUs in 2026 and the relevant L3 stack guides.

Filesystem + storage choices for model weights

Model weights are large (3 GB - 200 GB+). Storage choices:

  • Local NVMe SSD: fastest for cold start. PCIe 4.0+ NVMe loads a 30 GB model in <5 seconds.
  • Local SATA SSD: 3-5× slower cold start. Cheap; fine for non-production.
  • HDD: avoid. 30+ second cold start; long-tail seek latency hurts mmap.
  • Network filesystem (NFS, S3, MinIO): only for read-only model registries shared across cluster nodes. Cold-start latency is real.
  • tmpfs / RAM disk: useful for benchmarking; not practical for >10 GB models.

Filesystem choice: ext4 or xfs for local model weights. ZFS/btrfs work but add overhead. Hugging Face caches default to ~/.cache/huggingface/hub — set HF_HOME to your model-storage volume.
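
For example, with a placeholder path:

# Keep Hugging Face downloads on the model-storage volume
mkdir -p /data/models/hf
echo 'export HF_HOME=/data/models/hf' >> ~/.bashrc
# For systemd-managed runtimes, set HF_HOME via Environment= in the unit instead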

Power management + thermal monitoring

Production GPU deployment requires:

# Set persistence mode (driver stays loaded between requests)
sudo nvidia-smi -pm 1

# Lock power limit (avoid thermal throttling cliffs)
sudo nvidia-smi -pl 350  # for RTX 3090 / 4090

# Monitor in real time
nvidia-smi --query-gpu=index,power.draw,temperature.gpu,temperature.memory,utilization.gpu --format=csv -l 5

# Or use nvtop / nvitop for terminal UI
sudo apt install nvtop && nvtop

Memory-junction temps (temperature.memory) crossing 105°C trigger silent throttling that temperature.gpu doesn't reveal. Production deployments must monitor both.

Common failure modes

  1. nvidia-smi works but containers can't see GPU. nvidia-container-toolkit not installed or Docker daemon not restarted. Re-run sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker.
  2. CUDA mismatch error after driver upgrade. CUDA toolkit + driver versions must align. Pin both via apt-mark in production (sketch after this list).
  3. vLLM OOM on long context. Drop --gpu-memory-utilization to 0.88, or reduce --max-model-len. KV cache spikes during prefill.
  4. Permission denied on model files. Hugging Face cache permissions can drift if seeded as root then run as user. Always download as the runtime user.
  5. Flash-attention version mismatch. FA v2 / v3 / FlashInfer have different kernel requirements per GPU architecture. vLLM auto-selects; manual pinning helps when compile fails.
  6. systemd startup race. Ollama / your runtime starts before the GPU driver is fully initialized. After=nvidia-persistenced.service in the unit file fixes it (sketch after this list).
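
Sketches for fixes 2 and 6, assuming the package versions and the Ollama unit used earlier in this guide (swap in your own):

# Fix 2: hold the driver + toolkit at the tested pairing
sudo apt-mark hold nvidia-driver-560 cuda-toolkit-12-6

# Fix 6: start the runtime only after the NVIDIA persistence daemon
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/10-gpu-order.conf <<'EOF'
[Unit]
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
EOF
sudo systemctl daemon-reload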

Production-grade hardening

  • Pin every version: kernel, driver, CUDA, Python, runtime, model checkpoint hash
  • Read-only filesystem for the runtime container; writable volume only for logs / temp
  • Secrets via env files or Vault; never bake API keys into images
  • Prometheus + Grafana for nvidia-smi metrics, vLLM /metrics endpoint, request latency p50/p95/p99 (quick check after this list)
  • Log all requests to ELK / Loki; rotate aggressively (LLM payloads are large)
  • Health-check endpoints on every replica; orchestrator drops unhealthy nodes
  • Network: WireGuard or Tailscale, not direct internet exposure
  • Back up the model weights + Hugging Face cache to cold storage; re-downloading hundreds of gigabytes after a loss is slow
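
A quick check that the vLLM metrics endpoint is exporting before pointing Prometheus at it (default port assumed):

# vLLM exposes Prometheus-format metrics prefixed with "vllm"
curl -s http://localhost:8000/metrics | grep -m 5 '^vllm'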

Going deeper