Local AI on Linux — operator-grade deployment paths
The definitive Linux local-AI operating manual. CUDA + ROCm install, Docker GPU passthrough, vLLM/SGLang/llama.cpp deployment, systemd services, headless deployment, AMD-on-Linux advantages, container-native serving, homelab and production patterns. Where serious local AI actually runs.
Why Linux is the production-default for local AI
Every production-grade local-AI runtime in 2026 — vLLM, SGLang, TensorRT-LLM, TGI, Ray Serve — is Linux-first. NVIDIA's container toolkit, AMD's ROCm, Intel's oneAPI all target Linux as the reference platform. Windows + WSL2 works for development; Linux is where the deployment lives.
The honest framing of the OS-tier hierarchy:
- Linux: production-default. Every runtime supported. Every deployment pattern documented.
- macOS: Apple Silicon-only. MLX-LM is the production path; vLLM doesn't run.
- Windows + WSL2: development viable. Production deployment belongs on Linux.
- Windows native: hobbyist tier. Ollama / LM Studio work; serious serving needs WSL2 or Linux.
Distro choices — Ubuntu, Debian, Fedora, Arch, NixOS
For local AI, distro choice matters less than driver-stack compatibility. The pragmatic ranking:
- Ubuntu LTS (22.04 or 24.04): the operator default. NVIDIA datacenter drivers, ROCm, OpenVINO, Docker — all packaged officially. 90%+ of deployment guides target Ubuntu LTS.
- Debian stable: where you go when you don't want Ubuntu's telemetry. Trails Ubuntu on driver freshness by 6-12 months.
- Fedora / RHEL: solid for NVIDIA work; ROCm support is improving. Good fit for enterprise environments.
- Arch / Manjaro: bleeding-edge driver versions. Useful for testing latest features; risky for production stability.
- NixOS: most reproducible deployment story. Steep learning curve. Production-viable if your team already uses Nix.
NVIDIA stack: driver + CUDA toolkit
The NVIDIA stack on Ubuntu 24.04 LTS, the canonical install:
# 1. Add NVIDIA package signing key + repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# 2. Install proprietary driver + CUDA toolkit
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560
# 3. Reboot for driver to load
sudo reboot
# 4. Verify
nvidia-smi
# Expected: GPU listed, driver 560+, CUDA 12.6+
The nvidia-driver-560-server variant is the production-grade pin for sustained-load deployments. Don't mix the open-source nouveau driver with proprietary NVIDIA — blacklist nouveau if it loads.
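One way to handle the nouveau blacklist mentioned above, if lsmod shows it loaded (the modprobe.d file name is just a convention):
# Check whether nouveau is loaded
lsmod | grep nouveau
# If it is: blacklist it and rebuild the initramfs
sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot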
AMD stack: ROCm install + verification
ROCm is AMD's CUDA equivalent. Linux is the production-grade path; Windows ROCm trails by 6-12 months.
# 1. ROCm 6.x install on Ubuntu 22.04 (jammy); for 24.04, swap jammy for noble in the repo URL
sudo apt install -y python3-setuptools python3-wheel
wget https://repo.radeon.com/amdgpu-install/6.2/ubuntu/jammy/amdgpu-install_6.2.60200-1_all.deb
sudo apt install -y ./amdgpu-install_6.2.60200-1_all.deb
sudo amdgpu-install -y --usecase=rocm
# 2. Add user to render + video groups
sudo usermod -a -G render,video $USER
sudo reboot
# 3. Verify
rocm-smi
# Expected: GPU listed (e.g. RX 7900 XTX, MI300A)
rocminfo | grep "Marketing Name"
For consumer cards: RX 7900 XTX is fully supported; older Polaris / Vega cards are not in 2026 ROCm. Confirm your card is in the ROCm support matrix before committing time to setup.
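As an optional end-to-end check beyond rocm-smi, the ROCm build of PyTorch can confirm the card is usable from Python. A minimal sketch, assuming ROCm 6.2 (match the wheel index to your installed ROCm version):
python3 -m venv ~/venvs/rocm-check && source ~/venvs/rocm-check/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True plus your card's marketing name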
Intel stack: oneAPI + IPEX-LLM
For Intel Arc GPUs (A770, B580) and Lunar Lake / Meteor Lake NPUs:
# 1. Intel oneAPI base toolkit
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/.../oneapi-base-toolkit.sh
sudo sh oneapi-base-toolkit.sh
# 2. Source oneAPI environment
source /opt/intel/oneapi/setvars.sh
# 3. IPEX-LLM in a venv
python3 -m venv venv && source venv/bin/activate
pip install --pre --upgrade ipex-llm[xpu]
# 4. Verify XPU detected
python -c "import intel_extension_for_pytorch as ipex; print(ipex.xpu.is_available())"Intel Arc support on Linux is solid as of 2026 but the community is smaller than CUDA / ROCm. See IPEX-LLM operational review for deployment patterns.
Docker + nvidia-container-toolkit
For containerized inference, you need nvidia-container-toolkit to expose GPUs to containers:
# 1. Add NVIDIA Container Toolkit repo
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Install + restart Docker
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 3. Verify GPU access from container
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
For ROCm: the standard pattern is device passthrough (--device=/dev/kfd --device=/dev/dri) with AMD's ROCm Docker images (e.g. rocm/rocm-terminal); no NVIDIA-style runtime shim is required.
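For reference, a typical ROCm container invocation under that pattern looks like this:
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  rocm/rocm-terminal rocm-smi
# Expected: the same GPU listing as rocm-smi on the host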
vLLM deployment path
The production-default for multi-tenant serving on Linux + NVIDIA:
# venv + vLLM install
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install vllm autoawq
# Pull a quantized 32B model
huggingface-cli download casperhansen/qwen-2.5-coder-32b-instruct-awq
# Launch the OpenAI-compatible server
vllm serve casperhansen/qwen-2.5-coder-32b-instruct-awq \
--quantization awq \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--port 8000
For multi-GPU: add --tensor-parallel-size N where N is the GPU count. See the multi-GPU buying guide for the topology decisions.
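Once the server is up, any OpenAI-compatible client works. A minimal curl smoke test against the endpoint above (the model name must match whatever you served):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/qwen-2.5-coder-32b-instruct-awq",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'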
SGLang deployment path
SGLang wins over vLLM when prefix-cache hit rate is high (agent loops with stable system prompts). Same install model:
pip install "sglang[all]"
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-Coder-32B-Instruct \
--quantization awq \
--tp 1 \
--port 30000
RadixAttention prefix caching is enabled by default in SGLang, so no extra flag is needed; --disable-radix-cache turns it off if you need to rule it out when benchmarking.
llama.cpp on Linux (CPU + every GPU backend)
llama.cpp is the most-portable runtime — runs on CPU + every GPU backend. The Linux build path:
# Build with CUDA (llama.cpp builds via CMake; the old Makefile path is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Or with ROCm
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)
# Or with Vulkan (vendor-agnostic)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
# Pull a GGUF model + serve
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF Qwen2.5-7B-Instruct-Q4_K_M.gguf --local-dir .
./build/bin/llama-server --model Qwen2.5-7B-Instruct-Q4_K_M.gguf -ngl 999 --host 0.0.0.0 --port 8080
Ollama as systemd service
Ollama installs as a systemd service by default (for ad-hoc development you can run ollama serve in the foreground instead):
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Enable + start
sudo systemctl enable --now ollama
# Pull + run a model
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b
# OpenAI-compatible endpoint is live at http://localhost:11434/v1
For headless deployment: set OLLAMA_HOST=0.0.0.0:11434 and OLLAMA_MODELS=/large-disk/models via a systemd environment override (sketch below).
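A minimal sketch of that systemd override, assuming /large-disk/models exists and is writable by the ollama user:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/large-disk/models"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama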
Headless / SSH deployment patterns
The standard remote-deploy shape:
- Ubuntu Server 24.04 LTS minimal install
- SSH key-only auth; fail2ban for noise reduction
- NVIDIA driver via Ubuntu repo (no GUI)
- Docker + nvidia-container-toolkit for containerized runtimes
- Caddy or nginx as reverse proxy with TLS via Let's Encrypt (see the sketch after this list)
- WireGuard or Tailscale for VPN access (don't expose port 8000 to the internet)
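A minimal Caddy sketch for the reverse-proxy item above, assuming a DNS name pointing at the box and vLLM listening on port 8000 (hostname and port are placeholders):
sudo tee /etc/caddy/Caddyfile > /dev/null <<'EOF'
ai.example.com {
    reverse_proxy 127.0.0.1:8000
}
EOF
sudo systemctl reload caddy
# Caddy obtains and renews the Let's Encrypt certificate automatically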
Multi-GPU + cluster deployment
For multi-GPU on Linux see running local AI on multiple GPUs in 2026 and the relevant L3 stack:
- Dual RTX 3090 workstation — prosumer 70B-class
- Quad RTX 3090 workstation — prosumer ceiling
- 4× H100 SXM workstation — datacenter reference
- Ray Serve multi-node — replica orchestration pattern
Filesystem + storage choices for model weights
Model weights are large (3 GB - 200 GB+). Storage choices:
- Local NVMe SSD: fastest for cold start. PCIe 4.0+ NVMe loads a 30 GB model in <5 seconds.
- Local SATA SSD: 3-5× slower cold start. Cheap; fine for non-production.
- HDD: avoid. 30+ second cold start; long-tail seek latency hurts mmap.
- Network filesystem (NFS, S3, MinIO): only for read-only model registries shared across cluster nodes. Cold-start latency is real.
- tmpfs / RAM disk: useful for benchmarking; not practical for >10 GB models.
Filesystem choice: ext4 or xfs for local model weights. ZFS/btrfs work but add overhead. Hugging Face caches default to ~/.cache/huggingface/hub — set HF_HOME to your model-storage volume.
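For example, assuming /nvme/models is your fast model volume (the path is a placeholder):
# Point the Hugging Face cache at the NVMe volume (persist in ~/.bashrc or a systemd environment file)
export HF_HOME=/nvme/models/huggingface
# Downloads now land under $HF_HOME/hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct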
Power management + thermal monitoring
Production GPU deployment requires:
# Set persistence mode (driver stays loaded between requests)
sudo nvidia-smi -pm 1
# Lock power limit (avoid thermal throttling cliffs)
sudo nvidia-smi -pl 350 # for RTX 3090 / 4090
# Monitor in real time
nvidia-smi --query-gpu=index,power.draw,temperature.gpu,temperature.memory,utilization.gpu --format=csv -l 5
# Or use nvtop / nvitop for terminal UI
sudo apt install nvtop && nvtop
Memory-junction temps (temperature.memory) crossing 105°C trigger silent throttling that temperature.gpu doesn't reveal. Production deployments must monitor both.
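Note that -pm and -pl do not survive a reboot. One way to reapply them at boot is a small oneshot unit (unit name and power limit are placeholders):
sudo tee /etc/systemd/system/gpu-power.service > /dev/null <<'EOF'
[Unit]
Description=Apply GPU persistence mode and power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 350

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now gpu-power.service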
Common failure modes
- nvidia-smi works but containers can't see GPU. nvidia-container-toolkit not installed or Docker daemon not restarted. Re-run sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker.
- CUDA mismatch error after driver upgrade. CUDA toolkit + driver versions must align. Pin both via apt-mark in production.
- vLLM OOM on long context. Drop --gpu-memory-utilization to 0.88, or reduce --max-model-len. KV cache spikes during prefill.
- Permission denied on model files. Hugging Face cache permissions can drift if seeded as root then run as user. Always download as the runtime user.
- Flash-attention version mismatch. FA v2 / v3 / FlashInfer have different kernel requirements per GPU architecture. vLLM auto-selects; manual pinning helps when compile fails.
- systemd startup race. Ollama / your runtime starts before the GPU driver is fully initialized. After=nvidia-persistenced.service in the unit file fixes it (drop-in sketch below).
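A sketch of that ordering fix for the Ollama unit; the same drop-in approach works for any runtime you manage with systemd:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/10-gpu-order.conf > /dev/null <<'EOF'
[Unit]
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama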
Production-grade hardening
- Pin every version: kernel, driver, CUDA, Python, runtime, model checkpoint hash (apt-mark example after this list)
- Read-only filesystem for the runtime container; writable volume only for logs / temp
- Secrets via env files or Vault; never bake API keys into images
- Prometheus + Grafana for nvidia-smi metrics, vLLM /metrics endpoint, request latency p50/p95/p99
- Log all requests to ELK / Loki; rotate aggressively (LLM payloads are large)
- Health-check endpoints on every replica; orchestrator drops unhealthy nodes
- Network: WireGuard or Tailscale, not direct internet exposure
- Back up the model weights + Hugging Face cache to cold storage; re-downloading hundreds of GB after a disk failure is real time lost
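For the version-pinning item at the top of this list, apt-mark hold is the usual mechanism on Ubuntu / Debian (package names below match the install section and may differ on your system):
sudo apt-mark hold nvidia-driver-560 cuda-toolkit-12-6
# Confirm the holds
apt-mark showhold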
Going deeper
- Setup path-finder — pick OS + hardware, get the runtime + first commands
- Runtime compatibility matrix — what runs where
- Multi-GPU guide
- Distributed inference architecture
- Dual RTX 3090 stack — Linux deployment recipe