Multi-user local AI server
Production-tier self-hosted AI for 20-100 users. SGLang or vLLM with replicas, LiteLLM gateway, Postgres-backed Open WebUI, SSO, observability, audit logging, backup. The internal-tools-team setup.
Build summary
Goal: Deploy a private LLM API + chat UI for an organization without sending traffic to a cloud LLM vendor.
Operator card
- ✓ Companies replacing OpenAI/Anthropic API spend with self-hosted
- ✓ Regulated industries that can't ship data to cloud LLMs
- ✓ Internal tools teams with 20-200 users
- ✓ Organizations with an existing K8s + observability stack
- ⚠ Headcount under 20: overkill (use [/workflows/homelab-ai-api](/workflows/homelab-ai-api))
- ⚠ You don't have a platform team
- ⚠ Your workload is bursty / unpredictable (cloud is cheaper)
- ⚠ You can't commit to a multi-year hardware lifecycle
Service ledger
7 services across 4 layers. Each entry includes a one-line operator note explaining why it was picked over the alternatives.
Hardware
H100 SXM 80 GB is the sweet spot. Two cards via NVLink or NVLink-Switch fabric serve a 70B model with FP8 + concurrent batching for 50+ users.
RTX 6000 Ada (48 GB) is the lower-cost alternative: four cards via tensor parallelism hit similar capacity at higher power draw.
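A back-of-envelope check on that sizing, assuming a Llama-3-70B-style geometry (80 layers, 8 KV heads via GQA, head dim 128) and FP8 for both weights and KV cache; swap in your model's real numbers:

```python
# Rough VRAM sizing for a 70B model on 2x H100 80 GB (160 GB total).
# Architecture numbers below are assumptions for a Llama-3-70B-like model.

PARAMS = 70e9
WEIGHT_BYTES = 1                    # FP8 weights
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BYTES = 1                        # FP8 KV cache

weights_gb = PARAMS * WEIGHT_BYTES / 1e9
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES   # K + V
users, ctx = 50, 4096

kv_gb = users * ctx * kv_per_token / 1e9
print(f"weights: {weights_gb:.0f} GB, KV for {users} users x {ctx} ctx: {kv_gb:.0f} GB")
print(f"total: {weights_gb + kv_gb:.0f} GB vs 160 GB across 2x H100")
# -> weights 70 GB + KV ~34 GB = ~104 GB: fits with headroom for activations
```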
Storage: NVMe RAID 1 minimum on the master node; Postgres + Qdrant are write-amplification heavy. 4 TB total gets you through ~2 years of conversation + document growth at typical org size.
Networking: 10 GbE between nodes is the floor. Inter-node tensor-parallel over 1 GbE is unusable.
Storage
Postgres for Open WebUI (50 GB / year / 100 users). Qdrant for shared embeddings (500 GB / 5M chunks). MinIO for raw documents + nightly backups (~2 TB rolling 30-day).
Backup strategy is non-negotiable. Velero schedules + offsite replication + monthly restore drills.
Conversation history is regulated data in many industries. Encrypt at rest. Define a retention policy (90 days? 1 year?) and enforce it via cron.
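A minimal sketch of that retention cron, assuming Open WebUI's conversations live in a Postgres table named `chat` with a timestamptz `updated_at` column; verify both against your deployed schema before running it:

```python
# Retention enforcement job: run daily from cron or a K8s CronJob.
import os
import psycopg2

RETENTION_DAYS = int(os.environ.get("RETENTION_DAYS", "90"))

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # Table and column names are assumptions about the Open WebUI schema.
    cur.execute(
        "DELETE FROM chat WHERE updated_at < now() - %s * interval '1 day'",
        (RETENTION_DAYS,),
    )
    print(f"purged {cur.rowcount} conversations older than {RETENTION_DAYS} days")
conn.close()
```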
Networking
Internal: K8s ingress controller (nginx / Traefik), per-pod NetworkPolicies, mTLS between services via Linkerd or Istio.
External: corporate VPN OR Cloudflare Tunnel + Access. Public DNS, gated entry. Never expose SGLang / Qdrant directly.
DNS + LB: a single ai.internal.corp.com hostname; LB distributes across SGLang replicas.
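As an illustration of the per-pod NetworkPolicy idea, a sketch using the official kubernetes Python client that restricts SGLang ingress to the gateway pods only; the namespace and labels (`ai`, `app=sglang`, `app=litellm-gateway`) are assumptions to replace with your own:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# Only pods labeled app=litellm-gateway may reach the SGLang pods.
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="sglang-from-gateway-only", namespace="ai"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(match_labels={"app": "sglang"}),
        policy_types=["Ingress"],
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[
                    client.V1NetworkPolicyPeer(
                        pod_selector=client.V1LabelSelector(
                            match_labels={"app": "litellm-gateway"}
                        )
                    )
                ]
            )
        ],
    ),
)
client.NetworkingV1Api().create_namespaced_network_policy(namespace="ai", body=policy)
```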
Observability
Required dashboards:
- Per-user usage (calls/day, tokens/day, latency p99). Catch runaway scripts.
- Cluster health (GPU utilization across pods, KV-cache pressure, queue depth).
- Cost-equivalent vs cloud (token volume × OpenAI rate); see the sketch after this list.
- Audit log volume. Compliance teams will ask.
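A sketch of that cost-equivalent metric exposed as a Prometheus gauge via prometheus_client; the per-million-token rates are placeholders to pin to current vendor pricing:

```python
from prometheus_client import Gauge, start_http_server

# Reference rates in $ per 1M tokens -- assumed values, update per vendor.
INPUT_RATE, OUTPUT_RATE = 2.50, 10.00

cost_equiv = Gauge(
    "llm_cloud_cost_equivalent_usd",
    "What today's token volume would have cost on a cloud API",
)

def update(input_tokens: int, output_tokens: int) -> None:
    # Feed this from the gateway's per-call token counts.
    cost_equiv.set(
        input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE
    )

start_http_server(9100)  # scrape target for Prometheus
```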
Alerts:
- GPU temp ≥ 84 °C → page ops
- p99 latency > 5s → page ops
- LiteLLM gateway down → page ops
- Qdrant write errors > 1/min → page ops
OTel + Loki + Tempo gives you trace-level debugging when a specific user reports slowness.
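A minimal OTel tracing sketch for the gateway path, exporting spans to Tempo over OTLP; the endpoint, tracer, and attribute names here are illustrative, not a prescribed schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-gateway")

with tracer.start_as_current_span("chat.completion") as span:
    span.set_attribute("user.id", "u-123")  # ties the trace to the user report
    span.set_attribute("llm.model", "qwen-2.5-32b-instruct")
    # ... call the model; child spans cover queueing / prefill / decode
```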
Security
SSO + RBAC. Every user goes through Authelia → Open WebUI / LiteLLM admin. No shared accounts.
Per-user model whitelisting. Different teams get different model lists. Engineering may have access to coding models; HR doesn't.
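LiteLLM can enforce per-key model lists in its own configuration; this standalone sketch only illustrates the shape of the check, with hypothetical team and model names:

```python
# Map each team to the models it may call. Names are examples only.
TEAM_MODELS: dict[str, set[str]] = {
    "engineering": {"qwen-2.5-32b-instruct", "qwen-2.5-coder-32b"},
    "hr": {"qwen-2.5-32b-instruct"},
}

def authorize(team: str, model: str) -> None:
    """Raise if the requesting team is not whitelisted for the model."""
    if model not in TEAM_MODELS.get(team, set()):
        raise PermissionError(f"team {team!r} may not call {model!r}")

authorize("engineering", "qwen-2.5-coder-32b")  # ok
authorize("hr", "qwen-2.5-coder-32b")           # raises PermissionError
```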
Audit log retention. Legal will require this. 365 days minimum in most regulated industries.
Network segmentation. AI server in its own VLAN; no direct access to production databases.
Vulnerability scanning. Trivy on every image; Falco for runtime detection. Container escapes from agent code-execution sandboxes are a real threat.
Upgrade path
HA: dual-master K8s, multi-AZ if you have a real datacenter. Without HA, plan for ~99.9% uptime (roughly 8.8 hours of downtime a year); with it, ~99.99% (roughly 53 minutes).
Bigger models: 405B-class needs 4-8× H100 with NVLink-Switch — at that scale, evaluate cloud H100 rental honestly. Self-hosting frontier-class models is rarely cost-justified for orgs <500 users.
Fine-tuning: add Axolotl / Unsloth on a separate GPU pool. Production inference and fine-tuning don't share GPUs cleanly.
Multi-region: replicate Postgres + Qdrant cross-region; add Cloudflare for global routing. Crosses the line into "you have a real platform team now."
What breaks first
- GPU fan / thermal failures on continuous load. Scheduled hardware swaps every 18-24 months.
- K8s node-version drift. Kubelet upgrades break GPU passthrough until DCGM-operator catches up. Stage upgrades.
- SGLang RadixAttention assumptions. When prefix-cache hit rate drops (e.g. agentic prompts diverge), throughput collapses. Profile per workload type.
- Postgres bloat. Open WebUI writes a lot. Run VACUUM ANALYZE weekly (sketched after this list); consider PgBouncer for connection pooling.
- Audit log explosion. Per-call logs at 100 users grow fast. Loki + S3 backend, retention-tiered to cold storage at 90 days.
- Compliance review surprises. GDPR / HIPAA / SOC2 consultations always find one missing log or one unencrypted volume. Build to the standard from day one.
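The VACUUM job referenced above, sketched with psycopg2 and an assumed table name `chat`; VACUUM cannot run inside a transaction, hence autocommit:

```python
# Weekly Postgres maintenance for the Open WebUI database.
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
conn.autocommit = True  # VACUUM must run outside a transaction block
with conn.cursor() as cur:
    cur.execute("VACUUM (ANALYZE) chat")
conn.close()
```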
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-32b-instruct via sglang. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements.