Private ChatGPT replacement
The full ChatGPT-style experience without OpenAI. Open WebUI as the chat surface, Ollama serving Llama 3.1 / Qwen 2.5, optional persistent memory, optional code-interpreter sandbox, optional document chat. Sized for solo or small-team use on a single workstation.
Build summary
Goal: Drop-in replacement for the ChatGPT.com workflow with private weights and zero cloud dependencies.
Operator card
- ✓ Anyone replacing the ChatGPT.com habit on private hardware
- ✓ Households with multiple curious users
- ✓ Privacy-sensitive solo developers
- ✓ Apple Silicon laptop workflows
- ⚠ You need real-time speech (use [/workflows/local-voice-assistant](/workflows/local-voice-assistant))
- ⚠ You need >5 concurrent users (move to vLLM tier)
- ⚠ You need multimodal vision today (Open WebUI vision support is partial)
Service ledger
7 services across 4 layers. Each entry includes a one-line operator note explaining why that pick won over the alternatives.
Hardware
Single 4090 is overkill but pleasant. RTX 3090 (24 GB) is the budget default. Apple M3 Max 64 GB is the silent-laptop alternative — same UX, lower throughput.
The killer optimization: keep one model in residence; let Ollama auto-evict. Two large models swapping in/out of VRAM is the #1 performance complaint here.
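A sketch of that residency policy via Ollama's documented environment variables; the 24h value is illustrative (the default keep-alive is five minutes), and where you set these depends on how Ollama runs (systemd override, container env, or shell):

```sh
# Keep at most one model resident; Ollama evicts the old one on swap
export OLLAMA_MAX_LOADED_MODELS=1
# Hold the loaded model in VRAM for 24h instead of the 5-minute default
export OLLAMA_KEEP_ALIVE=24h
ollama serve
```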
For 5-user concurrency: bump to vLLM (see /workflows/local-coding-agent-system) — Ollama serializes per loaded model.
Storage
Plan 150 GB: 3-5 model weights at ~10-15 GB each, conversation history (1 MB / month / user), embeddings index (~50 MB / 100 K chunks).
Conversations are the user-data layer. Back them up. Open WebUI stores them in SQLite under a Docker volume; the lazy nightly backup is:

```sh
docker run --rm -v webui_data:/data -v $(pwd)/backup:/backup alpine tar czf /backup/webui-$(date +%F).tgz /data
```
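To make it actually nightly, a minimal crontab entry; 03:30 is an arbitrary slot, and note that cron expands %, so it must be escaped:

```sh
# crontab -e on the workstation; % must be written as \% inside crontab
30 3 * * * docker run --rm -v webui_data:/data -v "$HOME/backup":/backup alpine tar czf /backup/webui-$(date +\%F).tgz /data
```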
Networking
Tailscale + Open WebUI is the recommended path. The web UI binds 0.0.0.0 inside the container; publish it on the host loopback only (127.0.0.1:8080); Tailscale wraps it, as sketched below.
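A minimal sketch of that binding, assuming the ghcr.io/open-webui/open-webui:main image and a recent Tailscale CLI (`tailscale serve` flag syntax varies across releases):

```sh
# Publish Open WebUI on loopback only: invisible to the LAN and the internet
docker run -d --name open-webui \
  -p 127.0.0.1:8080:8080 \
  -v webui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

# Proxy it to the tailnet over HTTPS; Tailscale handles certs and identity
tailscale serve --bg 8080
```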
If multiple household members need access from outside the LAN: each gets a Tailscale device. MagicDNS makes workstation.tail-net.ts.net resolve.
Never publish 8080 to the public internet. Open WebUI's auth is fine for trusted users on private networks; do not stress-test it against attackers.
Observability
Lighter than the production workflows. Watch:
- Ollama load time (first token after model swap). >5s means VRAM contention; close other GPU consumers.
- Conversation count growth. SQLite backing store gets slow past ~10K conversations / user; archive or rotate.
- Disk usage on the Open WebUI volume.
Grafana is overkill; docker stats + a weekly disk-usage check are sufficient.
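A minimal sketch of that weekly check, assuming the volume is named webui_data and the containers ollama and open-webui; the webui.db filename and chat table reflect Open WebUI's current layout and may shift between releases:

```sh
#!/bin/sh
# Container resource snapshot: the rough stand-in for Grafana
docker stats --no-stream ollama open-webui

# Disk usage on the Open WebUI volume
docker run --rm -v webui_data:/data alpine du -sh /data

# Conversation count in the SQLite store (archive or rotate past ~10K/user)
docker run --rm -v webui_data:/data alpine sh -c \
  'apk add --no-cache sqlite >/dev/null && sqlite3 /data/webui.db "SELECT COUNT(*) FROM chat;"'
```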
Security
Default-disable signup. Set ENABLE_SIGNUP=false in the container environment. Add accounts manually from the admin panel.
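Wiring that into the run command from Networking; DEFAULT_USER_ROLE=pending is an optional belt-and-suspenders setting from Open WebUI's environment reference that parks any account that does get created until the admin approves it:

```sh
docker run -d --name open-webui \
  -p 127.0.0.1:8080:8080 \
  -e ENABLE_SIGNUP=false \
  -e DEFAULT_USER_ROLE=pending \
  -v webui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```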
Strong owner password. Don't reuse credentials. Open WebUI's auth uses bcrypt — fine — but treat the admin account like a bastion.
Conversation privacy. SQLite stores plaintext. Encrypt the host volume (LUKS) if your threat model includes physical workstation theft.
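A minimal LUKS sketch for a dedicated data disk; /dev/sdX and the mount point are placeholders, and luksFormat destroys whatever is on the device:

```sh
# One-time setup: encrypt, open, make a filesystem, mount
cryptsetup luksFormat /dev/sdX
cryptsetup open /dev/sdX webui_crypt
mkfs.ext4 /dev/mapper/webui_crypt
mount /dev/mapper/webui_crypt /srv/webui-data

# Then bind-mount the encrypted path instead of the named volume:
#   docker run ... -v /srv/webui-data:/app/backend/data ...
```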
Memory plugin. If you enable Open WebUI's memory feature, remember that it reads and writes the same vector store as RAG. Don't store secrets there.
Upgrade path
Multi-user (5+): swap Ollama → vLLM with batching; add per-user API keys; consider LiteLLM as a proxy.
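A sketch of the vLLM swap, assuming a 24 GB card and the Llama 3.1 8B instruct weights; model ID, port, and key are illustrative:

```sh
# vLLM exposes an OpenAI-compatible API with continuous batching
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --api-key changeme

# Point Open WebUI at it instead of Ollama:
#   OPENAI_API_BASE_URL=http://127.0.0.1:8000/v1
#   OPENAI_API_KEY=changeme
```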
Bigger models (32-70B): add VRAM. Single 4090 caps at 32B; dual 3090 / dual 4090 unlocks 70B. See /workflows/local-coding-agent-system.
Document chat: enable Open WebUI's RAG plugin; pre-warm the embeddings model in Ollama so first ingest doesn't cold-start.
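A pre-warm sketch using Ollama's embeddings endpoint; nomic-embed-text is an assumed model choice, and any embeddings model Open WebUI is configured for works the same way:

```sh
# Pull and load the embeddings model before the first document ingest
ollama pull nomic-embed-text
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "warmup"}' > /dev/null
```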
Voice: integrate Whisper + Piper as a per-conversation voice mode. See /workflows/local-voice-assistant for the full pattern.
What breaks first
- Model thrashing when users switch models per-conversation. Either keep one model loaded or bump VRAM.
- An auto-updating Docker pull grabs a new Open WebUI image with breaking config changes. Pin the image digest; see the sketch after this list.
- SQLite write contention at 10+ concurrent active users. Migrate to Postgres backing store before you hit this.
- Ollama port-conflict with other services on 11434. See /errors/ollama-bind-port-conflict.
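A digest-pinning sketch for the auto-update failure mode; the digest placeholder below is not a real hash:

```sh
# Resolve the digest of the tag you're running today
docker inspect --format '{{index .RepoDigests 0}}' ghcr.io/open-webui/open-webui:main

# Run by digest so a future pull can't silently move you forward
docker run -d --name open-webui \
  ghcr.io/open-webui/open-webui@sha256:<digest-from-above>
```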
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 2 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-14b-instruct via ollama. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.
- Unvalidated: qwen-2.5-coder-3b via ollama. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.