Local AI on Linux — operator-grade deployment paths
The definitive Linux local-AI operating manual. CUDA + ROCm install, Docker GPU passthrough, vLLM/SGLang/llama.cpp deployment, systemd services, headless deployment, AMD-on-Linux advantages, container-native serving, homelab and production patterns. Where serious local AI actually runs.
Why Linux is the production-default for local AI
Every production-grade local-AI runtime in 2026 — vLLM, SGLang, TensorRT-LLM, TGI, Ray Serve — is Linux-first. NVIDIA's container toolkit, AMD's ROCm, Intel's oneAPI all target Linux as the reference platform. Windows + WSL2 works for development; Linux is where the deployment lives.
The honest framing of the OS-tier hierarchy:
- Linux: production-default. Every runtime supported. Every deployment pattern documented.
- macOS: Apple Silicon-only. MLX-LM is the production path; vLLM doesn't run.
- Windows + WSL2: development viable. Production deployment belongs on Linux.
- Windows native: hobbyist tier. Ollama / LM Studio work; serious serving needs WSL2 or Linux.
Distro choices — Ubuntu, Debian, Fedora, Arch, NixOS
For local AI, distro choice matters less than driver-stack compatibility. The pragmatic ranking:
- Ubuntu LTS (22.04 or 24.04): the operator default. NVIDIA datacenter drivers, ROCm, OpenVINO, Docker — all packaged officially. 90%+ of deployment guides target Ubuntu LTS.
- Debian stable: where you go when you don't want Ubuntu's telemetry. Trails Ubuntu on driver freshness by 6-12 months.
- Fedora / RHEL: solid for NVIDIA work; ROCm support is improving. Good fit for enterprise environments.
- Arch / Manjaro: bleeding-edge driver versions. Useful for testing latest features; risky for production stability.
- NixOS: most reproducible deployment story. Steep learning curve. Production-viable if your team already uses Nix.
NVIDIA stack: driver + CUDA toolkit
The NVIDIA stack on Ubuntu 24.04 LTS, the canonical install:
# 1. Add NVIDIA package signing key + repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# 2. Install proprietary driver + CUDA toolkit
sudo apt install -y cuda-toolkit-12-6 nvidia-driver-560
# 3. Reboot for driver to load
sudo reboot
# 4. Verify
nvidia-smi
# Expected: GPU listed, driver 560+, CUDA 12.6+
The nvidia-driver-560-server variant is the production-grade pin for sustained-load deployments. Don't mix the open-source nouveau driver with proprietary NVIDIA — blacklist nouveau if it loads.
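One way to handle the nouveau blacklist mentioned above, if lsmod shows it loaded (the modprobe.d file name is just a convention):
# Check whether nouveau is loaded
lsmod | grep nouveau
# If it is: blacklist it and rebuild the initramfs
sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot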
AMD stack: ROCm install + verification
ROCm is AMD's CUDA equivalent. Linux is the production-grade path; Windows ROCm trails by 6-12 months.
# 1. ROCm 6.x install on Ubuntu 22.04 (jammy); for 24.04, swap jammy for noble in the repo URL
sudo apt install -y python3-setuptools python3-wheel
wget https://repo.radeon.com/amdgpu-install/6.2/ubuntu/jammy/amdgpu-install_6.2.60200-1_all.deb
sudo apt install -y ./amdgpu-install_6.2.60200-1_all.deb
sudo amdgpu-install -y --usecase=rocm
# 2. Add user to render + video groups
sudo usermod -a -G render,video $USER
sudo reboot
# 3. Verify
rocm-smi
# Expected: GPU listed (e.g. RX 7900 XTX, MI300A)
rocminfo | grep "Marketing Name"
For consumer cards: RX 7900 XTX is fully supported; older Polaris / Vega cards are not in 2026 ROCm. Confirm your card is in the ROCm support matrix before committing time to setup.
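As an optional end-to-end check beyond rocm-smi, the ROCm build of PyTorch can confirm the card is usable from Python. A minimal sketch, assuming ROCm 6.2 (match the wheel index to your installed ROCm version):
python3 -m venv ~/venvs/rocm-check && source ~/venvs/rocm-check/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True plus your card's marketing name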
Intel stack: oneAPI + IPEX-LLM
For Intel Arc GPUs (A770, B580) and Lunar Lake / Meteor Lake NPUs:
# 1. Intel oneAPI base toolkit
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/.../oneapi-base-toolkit.sh
sudo sh oneapi-base-toolkit.sh
# 2. Source oneAPI environment
source /opt/intel/oneapi/setvars.sh
# 3. IPEX-LLM in a venv
python3 -m venv venv && source venv/bin/activate
pip install --pre --upgrade ipex-llm[xpu]
# 4. Verify XPU detected
python -c "import intel_extension_for_pytorch as ipex; print(ipex.xpu.is_available())"Intel Arc support on Linux is solid as of 2026 but the community is smaller than CUDA / ROCm. See IPEX-LLM operational review for deployment patterns.
Docker + nvidia-container-toolkit
For containerized inference, you need nvidia-container-toolkit to expose GPUs to containers:
# 1. Add NVIDIA Container Toolkit repo
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Install + restart Docker
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 3. Verify GPU access from container
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
For ROCm: the standard pattern is device passthrough (--device=/dev/kfd --device=/dev/dri) with AMD's ROCm Docker images (e.g. rocm/rocm-terminal); no NVIDIA-style runtime shim is required.
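For reference, a typical ROCm container invocation under that pattern looks like this:
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  rocm/rocm-terminal rocm-smi
# Expected: the same GPU listing as rocm-smi on the host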
vLLM deployment path
The production-default for multi-tenant serving on Linux + NVIDIA:
# venv + vLLM install
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install vllm autoawq
# Pull a quantized 32B model
huggingface-cli download casperhansen/qwen-2.5-coder-32b-instruct-awq
# Launch the OpenAI-compatible server
vllm serve casperhansen/qwen-2.5-coder-32b-instruct-awq \
--quantization awq \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--port 8000
For multi-GPU: add --tensor-parallel-size N where N is the GPU count. See the multi-GPU buying guide for the topology decisions.
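Once the server is up, any OpenAI-compatible client works. A minimal curl smoke test against the endpoint above (the model name must match whatever you served):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/qwen-2.5-coder-32b-instruct-awq",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'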
SGLang deployment path
SGLang wins over vLLM when prefix-cache hit rate is high (agent loops with stable system prompts). Same install model:
pip install "sglang[all]"
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-Coder-32B-Instruct \
--quantization awq \
--tp 1 \
--port 30000
RadixAttention prefix caching is enabled by default in SGLang, so no extra flag is needed; --disable-radix-cache turns it off if you need to rule it out when benchmarking.
llama.cpp on Linux (CPU + every GPU backend)
llama.cpp is the most-portable runtime — runs on CPU + every GPU backend. The Linux build path:
# Build with CUDA (llama.cpp builds via CMake; the old Makefile path is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Or with ROCm
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)
# Or with Vulkan (vendor-agnostic)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
# Pull a GGUF model + serve
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF Qwen2.5-7B-Instruct-Q4_K_M.gguf --local-dir .
./build/bin/llama-server --model Qwen2.5-7B-Instruct-Q4_K_M.gguf -ngl 999 --host 0.0.0.0 --port 8080
Ollama as systemd service
Ollama installs as a systemd service by default (for ad-hoc development you can run ollama serve in the foreground instead):
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Enable + start
sudo systemctl enable --now ollama
# Pull + run a model
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b
# OpenAI-compatible endpoint is live at http://localhost:11434/v1
For headless deployment: set OLLAMA_HOST=0.0.0.0:11434 and OLLAMA_MODELS=/large-disk/models via a systemd environment override (sketch below).
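A minimal sketch of that systemd override, assuming /large-disk/models exists and is writable by the ollama user:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/large-disk/models"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama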
Headless / SSH deployment patterns
The standard remote-deploy shape:
- Ubuntu Server 24.04 LTS minimal install
- SSH key-only auth; fail2ban for noise reduction
- NVIDIA driver via Ubuntu repo (no GUI)
- Docker + nvidia-container-toolkit for containerized runtimes
- Caddy or nginx as reverse proxy with TLS via Let's Encrypt (see the sketch after this list)
- WireGuard or Tailscale for VPN access (don't expose port 8000 to the internet)
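A minimal Caddy sketch for the reverse-proxy item above, assuming a DNS name pointing at the box and vLLM listening on port 8000 (hostname and port are placeholders):
sudo tee /etc/caddy/Caddyfile > /dev/null <<'EOF'
ai.example.com {
    reverse_proxy 127.0.0.1:8000
}
EOF
sudo systemctl reload caddy
# Caddy obtains and renews the Let's Encrypt certificate automatically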
Multi-GPU + cluster deployment
For multi-GPU on Linux see running local AI on multiple GPUs in 2026 and the relevant L3 stack:
- Dual RTX 3090 workstation — prosumer 70B-class
- Quad RTX 3090 workstation — prosumer ceiling
- 4× H100 SXM workstation — datacenter reference
- Ray Serve multi-node — replica orchestration pattern
Filesystem + storage choices for model weights
Model weights are large (3 GB - 200 GB+). Storage choices:
- Local NVMe SSD: fastest for cold start. PCIe 4.0+ NVMe loads a 30 GB model in <5 seconds.
- Local SATA SSD: 3-5× slower cold start. Cheap; fine for non-production.
- HDD: avoid. 30+ second cold start; long-tail seek latency hurts mmap.
- Network filesystem (NFS, S3, MinIO): only for read-only model registries shared across cluster nodes. Cold-start latency is real.
- tmpfs / RAM disk: useful for benchmarking; not practical for >10 GB models.
Filesystem choice: ext4 or xfs for local model weights. ZFS/btrfs work but add overhead. Hugging Face caches default to ~/.cache/huggingface/hub — set HF_HOME to your model-storage volume.
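For example, assuming /nvme/models is your fast model volume (the path is a placeholder):
# Point the Hugging Face cache at the NVMe volume (persist in ~/.bashrc or a systemd environment file)
export HF_HOME=/nvme/models/huggingface
# Downloads now land under $HF_HOME/hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct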
Power management + thermal monitoring
Production GPU deployment requires:
# Set persistence mode (driver stays loaded between requests)
sudo nvidia-smi -pm 1
# Lock power limit (avoid thermal throttling cliffs)
sudo nvidia-smi -pl 350 # for RTX 3090 / 4090
# Monitor in real time
nvidia-smi --query-gpu=index,power.draw,temperature.gpu,temperature.memory,utilization.gpu --format=csv -l 5
# Or use nvtop / nvitop for terminal UI
sudo apt install nvtop && nvtop
Memory-junction temps (temperature.memory) crossing 105°C trigger silent throttling that temperature.gpu doesn't reveal. Production deployments must monitor both.
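Note that -pm and -pl do not survive a reboot. One way to reapply them at boot is a small oneshot unit (unit name and power limit are placeholders):
sudo tee /etc/systemd/system/gpu-power.service > /dev/null <<'EOF'
[Unit]
Description=Apply GPU persistence mode and power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 350

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now gpu-power.service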
Common failure modes
- nvidia-smi works but containers can't see GPU. nvidia-container-toolkit not installed or Docker daemon not restarted. Re-run sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker.
- CUDA mismatch error after driver upgrade. CUDA toolkit + driver versions must align. Pin both via apt-mark in production.
- vLLM OOM on long context. Drop --gpu-memory-utilization to 0.88, or reduce --max-model-len. KV cache spikes during prefill.
- Permission denied on model files. Hugging Face cache permissions can drift if seeded as root then run as user. Always download as the runtime user.
- Flash-attention version mismatch. FA v2 / v3 / FlashInfer have different kernel requirements per GPU architecture. vLLM auto-selects; manual pinning helps when compile fails.
- systemd startup race. Ollama / your runtime starts before the GPU driver is fully initialized. After=nvidia-persistenced.service in the unit file fixes it (drop-in sketch below).
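A sketch of that ordering fix for the Ollama unit; the same drop-in approach works for any runtime you manage with systemd:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/10-gpu-order.conf > /dev/null <<'EOF'
[Unit]
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama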
Production-grade hardening
- Pin every version: kernel, driver, CUDA, Python, runtime, model checkpoint hash (apt-mark example after this list)
- Read-only filesystem for the runtime container; writable volume only for logs / temp
- Secrets via env files or Vault; never bake API keys into images
- Prometheus + Grafana for nvidia-smi metrics, vLLM /metrics endpoint, request latency p50/p95/p99
- Log all requests to ELK / Loki; rotate aggressively (LLM payloads are large)
- Health-check endpoints on every replica; orchestrator drops unhealthy nodes
- Network: WireGuard or Tailscale, not direct internet exposure
- Back up the model weights + Hugging Face cache to cold storage; re-downloading hundreds of GB after a disk failure is real time lost
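For the version-pinning item at the top of this list, apt-mark hold is the usual mechanism on Ubuntu / Debian (package names below match the install section and may differ on your system):
sudo apt-mark hold nvidia-driver-560 cuda-toolkit-12-6
# Confirm the holds
apt-mark showhold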
Going deeper
- Setup path-finder — pick OS + hardware, get the runtime + first commands
- Runtime compatibility matrix — what runs where
- Multi-GPU guide
- Distributed inference architecture
- Dual RTX 3090 stack — Linux deployment recipe