Multi-user local AI server
Production-tier self-hosted AI for 20-100 users. SGLang or vLLM with replicas, LiteLLM gateway, Postgres-backed Open WebUI, SSO, observability, audit logging, backup. The internal-tools-team setup.
Build summary
Goal: Deploy a private LLM API + chat UI for an organization without sending traffic to a cloud LLM vendor.
Operator card
- ✓ Companies replacing OpenAI/Anthropic API spend with self-hosted
- ✓ Regulated industries that can't ship data to cloud LLMs
- ✓ Internal tools teams with 20-200 users
- ✓ Organizations with an existing K8s + observability stack
- ⚠ Headcount under 20: overkill (use [/workflows/homelab-ai-api](/workflows/homelab-ai-api))
- ⚠ You don't have a platform team
- ⚠ Your workload is bursty / unpredictable (cloud is cheaper)
- ⚠ You can't commit to a multi-year hardware lifecycle
Service ledger
7 services across 4 layers. Each entry includes a one-line operator note explaining why it was picked over the alternatives.
Hardware
H100 SXM 80 GB is the sweet spot. Two cards via NVLink or NVLink-Switch fabric serve a 70B model with FP8 + concurrent batching for 50+ users.
RTX 6000 Ada (48 GB) is the lower-cost alternative: four cards via tensor parallelism hit similar capacity at higher power draw.
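A back-of-envelope check on that sizing, assuming a Llama-3-70B-style geometry (80 layers, 8 KV heads via GQA, head dim 128) and FP8 for both weights and KV cache; swap in your model's real numbers:

```python
# Rough VRAM sizing for a 70B model on 2x H100 80 GB (160 GB total).
# Architecture numbers below are assumptions for a Llama-3-70B-like model.

PARAMS = 70e9
WEIGHT_BYTES = 1                    # FP8 weights
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BYTES = 1                        # FP8 KV cache

weights_gb = PARAMS * WEIGHT_BYTES / 1e9
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES   # K + V
users, ctx = 50, 4096

kv_gb = users * ctx * kv_per_token / 1e9
print(f"weights: {weights_gb:.0f} GB, KV for {users} users x {ctx} ctx: {kv_gb:.0f} GB")
print(f"total: {weights_gb + kv_gb:.0f} GB vs 160 GB across 2x H100")
# -> weights 70 GB + KV ~34 GB = ~104 GB: fits with headroom for activations
```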
Storage: NVMe RAID 1 minimum on the master node; Postgres + Qdrant are write-amplification heavy. 4 TB total gets you through ~2 years of conversation + document growth at typical org size.
Networking: 10 GbE between nodes is the floor. Inter-node tensor-parallel over 1 GbE is unusable.
Storage
Postgres for Open WebUI (50 GB / year / 100 users). Qdrant for shared embeddings (500 GB / 5M chunks). MinIO for raw documents + nightly backups (~2 TB rolling 30-day).
Backup strategy is non-negotiable. Velero schedules + offsite replication + monthly restore drills.
Conversation history is regulated data in many industries. Encrypt at rest. Define a retention policy (90 days? 1 year?) and enforce it via cron.
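A minimal sketch of that retention cron, assuming Open WebUI's conversations live in a Postgres table named `chat` with a timestamptz `updated_at` column; verify both against your deployed schema before running it:

```python
# Retention enforcement job: run daily from cron or a K8s CronJob.
import os
import psycopg2

RETENTION_DAYS = int(os.environ.get("RETENTION_DAYS", "90"))

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # Table and column names are assumptions about the Open WebUI schema.
    cur.execute(
        "DELETE FROM chat WHERE updated_at < now() - %s * interval '1 day'",
        (RETENTION_DAYS,),
    )
    print(f"purged {cur.rowcount} conversations older than {RETENTION_DAYS} days")
conn.close()
```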
Networking
Internal: K8s ingress controller (nginx / Traefik), per-pod NetworkPolicies, mTLS between services via Linkerd or Istio.
External: corporate VPN OR Cloudflare Tunnel + Access. Public DNS, gated entry. Never expose SGLang / Qdrant directly.
DNS + LB: a single ai.internal.corp.com hostname; LB distributes across SGLang replicas.
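As an illustration of the per-pod NetworkPolicy idea, a sketch using the official kubernetes Python client that restricts SGLang ingress to the gateway pods only; the namespace and labels (`ai`, `app=sglang`, `app=litellm-gateway`) are assumptions to replace with your own:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# Only pods labeled app=litellm-gateway may reach the SGLang pods.
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="sglang-from-gateway-only", namespace="ai"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(match_labels={"app": "sglang"}),
        policy_types=["Ingress"],
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[
                    client.V1NetworkPolicyPeer(
                        pod_selector=client.V1LabelSelector(
                            match_labels={"app": "litellm-gateway"}
                        )
                    )
                ]
            )
        ],
    ),
)
client.NetworkingV1Api().create_namespaced_network_policy(namespace="ai", body=policy)
```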
Observability
Required dashboards:
- Per-user usage (calls/day, tokens/day, latency p99). Catch runaway scripts.
- Cluster health (GPU utilization across pods, KV-cache pressure, queue depth).
- Cost-equivalent vs cloud (token volume × OpenAI rate); see the sketch after this list.
- Audit log volume. Compliance teams will ask.
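A sketch of that cost-equivalent metric exposed as a Prometheus gauge via prometheus_client; the per-million-token rates are placeholders to pin to current vendor pricing:

```python
from prometheus_client import Gauge, start_http_server

# Reference rates in $ per 1M tokens -- assumed values, update per vendor.
INPUT_RATE, OUTPUT_RATE = 2.50, 10.00

cost_equiv = Gauge(
    "llm_cloud_cost_equivalent_usd",
    "What today's token volume would have cost on a cloud API",
)

def update(input_tokens: int, output_tokens: int) -> None:
    # Feed this from the gateway's per-call token counts.
    cost_equiv.set(
        input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE
    )

start_http_server(9100)  # scrape target for Prometheus
```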
Alerts:
- GPU temp ≥ 84 °C → page ops
- p99 latency > 5s → page ops
- LiteLLM gateway down → page ops
- Qdrant write errors > 1/min → page ops
OTel + Loki + Tempo gives you trace-level debugging when a specific user reports slowness.
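A minimal OTel tracing sketch for the gateway path, exporting spans to Tempo over OTLP; the endpoint, tracer, and attribute names here are illustrative, not a prescribed schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-gateway")

with tracer.start_as_current_span("chat.completion") as span:
    span.set_attribute("user.id", "u-123")  # ties the trace to the user report
    span.set_attribute("llm.model", "qwen-2.5-32b-instruct")
    # ... call the model; child spans cover queueing / prefill / decode
```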
Security
SSO + RBAC. Every user goes through Authelia → Open WebUI / LiteLLM admin. No shared accounts.
Per-user model whitelisting. Different teams get different model lists. Engineering may have access to coding models; HR doesn't.
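LiteLLM can enforce per-key model lists in its own configuration; this standalone sketch only illustrates the shape of the check, with hypothetical team and model names:

```python
# Map each team to the models it may call. Names are examples only.
TEAM_MODELS: dict[str, set[str]] = {
    "engineering": {"qwen-2.5-32b-instruct", "qwen-2.5-coder-32b"},
    "hr": {"qwen-2.5-32b-instruct"},
}

def authorize(team: str, model: str) -> None:
    """Raise if the requesting team is not whitelisted for the model."""
    if model not in TEAM_MODELS.get(team, set()):
        raise PermissionError(f"team {team!r} may not call {model!r}")

authorize("engineering", "qwen-2.5-coder-32b")  # ok
authorize("hr", "qwen-2.5-coder-32b")           # raises PermissionError
```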
Audit log retention. Legal will require this. 365 days minimum in most regulated industries.
Network segmentation. AI server in its own VLAN; no direct access to production databases.
Vulnerability scanning. Trivy on every image; Falco for runtime detection. Container escapes from agent code-execution sandboxes are a real threat.
Upgrade path
HA: dual-master K8s, multi-AZ if you have a real datacenter. Without HA, plan for ~99.9% uptime (roughly 8.8 hours of downtime a year); with it, ~99.99% (roughly 53 minutes).
Bigger models: 405B-class needs 4-8× H100 with NVLink-Switch — at that scale, evaluate cloud H100 rental honestly. Self-hosting frontier-class models is rarely cost-justified for orgs <500 users.
Fine-tuning: add Axolotl / Unsloth on a separate GPU pool. Production inference and fine-tuning don't share GPUs cleanly.
Multi-region: replicate Postgres + Qdrant cross-region; add Cloudflare for global routing. Crosses the line into "you have a real platform team now."
What breaks first
- GPU fan / thermal failures on continuous load. Scheduled hardware swaps every 18-24 months.
- K8s node-version drift. Kubelet upgrades break GPU passthrough until DCGM-operator catches up. Stage upgrades.
- SGLang RadixAttention assumptions. When prefix-cache hit rate drops (e.g. agentic prompts diverge), throughput collapses. Profile per workload type.
- Postgres bloat. Open WebUI writes a lot. Run VACUUM ANALYZE weekly (sketched after this list); consider PgBouncer for connection pooling.
- Audit log explosion. Per-call logs at 100 users grow fast. Loki + S3 backend, retention-tiered to cold storage at 90 days.
- Compliance review surprises. GDPR / HIPAA / SOC2 consultations always find one missing log or one unencrypted volume. Build to the standard from day one.
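The VACUUM job referenced above, sketched with psycopg2 and an assumed table name `chat`; VACUUM cannot run inside a transaction, hence autocommit:

```python
# Weekly Postgres maintenance for the Open WebUI database.
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
conn.autocommit = True  # VACUUM must run outside a transaction block
with conn.cursor() as cur:
    cur.execute("VACUUM (ANALYZE) chat")
conn.close()
```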
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-32b-instruct via sglang. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements.