Private ChatGPT replacement
The full ChatGPT-style experience without OpenAI. Open WebUI as the chat surface, Ollama serving Llama 3.1 / Qwen 2.5, optional persistent memory, optional code-interpreter sandbox, optional document chat. Sized for solo or small-team use on a single workstation.
Build summary
Goal: Drop-in replacement for the ChatGPT.com workflow with private weights and zero cloud dependencies.
Operator card
- ✓ Anyone replacing the ChatGPT.com habit on private hardware
- ✓ Households with multiple curious users
- ✓ Privacy-sensitive solo developers
- ✓ Apple Silicon laptop workflows
- ⚠ You need real-time speech (use [/workflows/local-voice-assistant](/workflows/local-voice-assistant))
- ⚠ You need >5 concurrent users (move to vLLM tier)
- ⚠ You need multimodal vision today (Open WebUI vision support is partial)
Service ledger
7 services across 4 layers. Each entry includes a one-line operator note explaining why that pick won over the alternatives.
Hardware
Single 4090 is overkill but pleasant. RTX 3090 (24 GB) is the budget default. Apple M3 Max 64 GB is the silent-laptop alternative — same UX, lower throughput.
The killer optimization: keep one model in residence; let Ollama auto-evict. Two large models swapping in/out of VRAM is the #1 performance complaint here.
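A sketch of that residency policy via Ollama's documented environment variables; the 24h value is illustrative (the default keep-alive is five minutes), and where you set these depends on how Ollama runs (systemd override, container env, or shell):

```sh
# Keep at most one model resident; Ollama evicts the old one on swap
export OLLAMA_MAX_LOADED_MODELS=1
# Hold the loaded model in VRAM for 24h instead of the 5-minute default
export OLLAMA_KEEP_ALIVE=24h
ollama serve
```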
For 5-user concurrency: bump to vLLM (see /workflows/local-coding-agent-system) — Ollama serializes per loaded model.
Storage
Plan 150 GB: 3-5 model weights at ~10-15 GB each, conversation history (1 MB / month / user), embeddings index (~50 MB / 100 K chunks).
Conversations are the user-data layer. Back them up. Open WebUI stores them in SQLite under a Docker volume; the lazy nightly backup is:

```sh
docker run --rm -v webui_data:/data -v $(pwd)/backup:/backup alpine tar czf /backup/webui-$(date +%F).tgz /data
```
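To make it actually nightly, a minimal crontab entry; 03:30 is an arbitrary slot, and note that cron expands %, so it must be escaped:

```sh
# crontab -e on the workstation; % must be written as \% inside crontab
30 3 * * * docker run --rm -v webui_data:/data -v "$HOME/backup":/backup alpine tar czf /backup/webui-$(date +\%F).tgz /data
```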
Networking
Tailscale + Open WebUI is the recommended path. The web UI binds 0.0.0.0 inside the container; publish it on the host loopback only (127.0.0.1:8080); Tailscale wraps it, as sketched below.
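A minimal sketch of that binding, assuming the ghcr.io/open-webui/open-webui:main image and a recent Tailscale CLI (`tailscale serve` flag syntax varies across releases):

```sh
# Publish Open WebUI on loopback only: invisible to the LAN and the internet
docker run -d --name open-webui \
  -p 127.0.0.1:8080:8080 \
  -v webui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

# Proxy it to the tailnet over HTTPS; Tailscale handles certs and identity
tailscale serve --bg 8080
```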
If multiple household members need access from outside the LAN: each gets a Tailscale device. MagicDNS makes workstation.tail-net.ts.net resolve.
Never publish 8080 to the public internet. Open WebUI's auth is fine for trusted users on private networks; do not stress-test it against attackers.
Observability
Lighter than the production workflows. Watch:
- Ollama load time (first token after model swap). >5s means VRAM contention; close other GPU consumers.
- Conversation count growth. SQLite backing store gets slow past ~10K conversations / user; archive or rotate.
- Disk usage on the Open WebUI volume.
Grafana is overkill; docker stats + a weekly disk-usage check are sufficient.
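A minimal sketch of that weekly check, assuming the volume is named webui_data and the containers ollama and open-webui; the webui.db filename and chat table reflect Open WebUI's current layout and may shift between releases:

```sh
#!/bin/sh
# Container resource snapshot: the rough stand-in for Grafana
docker stats --no-stream ollama open-webui

# Disk usage on the Open WebUI volume
docker run --rm -v webui_data:/data alpine du -sh /data

# Conversation count in the SQLite store (archive or rotate past ~10K/user)
docker run --rm -v webui_data:/data alpine sh -c \
  'apk add --no-cache sqlite >/dev/null && sqlite3 /data/webui.db "SELECT COUNT(*) FROM chat;"'
```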
Security
Default-disable signup. Set ENABLE_SIGNUP=false in the container environment. Add accounts manually from the admin panel.
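Wiring that into the run command from Networking; DEFAULT_USER_ROLE=pending is an optional belt-and-suspenders setting from Open WebUI's environment reference that parks any account that does get created until the admin approves it:

```sh
docker run -d --name open-webui \
  -p 127.0.0.1:8080:8080 \
  -e ENABLE_SIGNUP=false \
  -e DEFAULT_USER_ROLE=pending \
  -v webui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```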
Strong owner password. Don't reuse credentials. Open WebUI's auth uses bcrypt — fine — but treat the admin account like a bastion.
Conversation privacy. SQLite stores plaintext. Encrypt the host volume (LUKS) if your threat model includes physical workstation theft.
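A minimal LUKS sketch for a dedicated data disk; /dev/sdX and the mount point are placeholders, and luksFormat destroys whatever is on the device:

```sh
# One-time setup: encrypt, open, make a filesystem, mount
cryptsetup luksFormat /dev/sdX
cryptsetup open /dev/sdX webui_crypt
mkfs.ext4 /dev/mapper/webui_crypt
mount /dev/mapper/webui_crypt /srv/webui-data

# Then bind-mount the encrypted path instead of the named volume:
#   docker run ... -v /srv/webui-data:/app/backend/data ...
```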
Memory plugin. If you enable Open WebUI's memory feature, remember that it reads and writes the same vector store as RAG. Don't store secrets there.
Upgrade path
Multi-user (5+): swap Ollama → vLLM with batching; add per-user API keys; consider LiteLLM as a proxy.
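A sketch of the vLLM swap, assuming a 24 GB card and the Llama 3.1 8B instruct weights; model ID, port, and key are illustrative:

```sh
# vLLM exposes an OpenAI-compatible API with continuous batching
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --api-key changeme

# Point Open WebUI at it instead of Ollama:
#   OPENAI_API_BASE_URL=http://127.0.0.1:8000/v1
#   OPENAI_API_KEY=changeme
```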
Bigger models (32-70B): add VRAM. Single 4090 caps at 32B; dual 3090 / dual 4090 unlocks 70B. See /workflows/local-coding-agent-system.
Document chat: enable Open WebUI's RAG plugin; pre-warm the embeddings model in Ollama so first ingest doesn't cold-start.
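A pre-warm sketch using Ollama's embeddings endpoint; nomic-embed-text is an assumed model choice, and any embeddings model Open WebUI is configured for works the same way:

```sh
# Pull and load the embeddings model before the first document ingest
ollama pull nomic-embed-text
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "warmup"}' > /dev/null
```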
Voice: integrate Whisper + Piper as a per-conversation voice mode. See /workflows/local-voice-assistant for the full pattern.
What breaks first
- Model thrashing when users switch models per-conversation. Either keep one model loaded or bump VRAM.
- An auto-updating Docker pull grabs a new Open WebUI image with breaking config changes. Pin the image digest; see the sketch after this list.
- SQLite write contention at 10+ concurrent active users. Migrate to Postgres backing store before you hit this.
- Ollama port-conflict with other services on 11434. See /errors/ollama-bind-port-conflict.
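A digest-pinning sketch for the auto-update failure mode; the digest placeholder below is not a real hash:

```sh
# Resolve the digest of the tag you're running today
docker inspect --format '{{index .RepoDigests 0}}' ghcr.io/open-webui/open-webui:main

# Run by digest so a future pull can't silently move you forward
docker run -d --name open-webui \
  ghcr.io/open-webui/open-webui@sha256:<digest-from-above>
```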
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 2 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-14b-instruct via ollama. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.
- Unvalidated: qwen-2.5-coder-3b via ollama. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements. 0 benchmarks.