RUNLOCALAI · v38


Homelab
Weekend build-out

Private ChatGPT replacement

The full ChatGPT-style experience without OpenAI. Open WebUI as the chat surface, Ollama serving Llama 3.1 / Qwen 2.5, optional persistent memory, optional code-interpreter sandbox, optional document chat. Sized for solo or small-team use on a single workstation.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,700 words

Build summary

Hardware footprint
RTX 3090 / 4090 / Apple M3 Max · 32-64 GB RAM · 500 GB NVMe
Concurrency
1-5 concurrent users.
Power
~250-350 W under sustained load.

Goal: Drop-in replacement for the ChatGPT.com workflow with private weights and zero cloud dependencies.

Operator card

Workflow
Best for
  • ✓Anyone replacing the ChatGPT.com habit on private hardware
  • ✓Households with multiple curious users
  • ✓Privacy-sensitive solo developers
  • ✓Apple Silicon laptop workflows
Avoid if
  • ⚠You need real-time speech (use /workflows/local-voice-assistant)
  • ⚠You need >5 concurrent users (move to vLLM tier)
  • ⚠You need multimodal vision today (Open WebUI vision support is partial)
Stability
stable
Maintenance
Monthly check
Skill
Intermediate
Long-session reliability
reliable

Service ledger

7 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
Ollama
Inference
11434/tcp
Inference engine. Friendliest local-LLM UX. Exposes an OpenAI-compatible API alongside its native one; Open WebUI talks to it natively.
Runs: host service
Qwen 2.5 14B Instruct (Q4_K_M)
Model
Default chat model. Strong general chat at 14B size — outperforms Llama 3.1 8B on most benchmarks at modest VRAM cost. The Q4_K_M weights fit 12 GB cards; budget extra headroom for the KV cache if you actually run 32K context.
Runs: Ollama
Qwen 2.5 Coder 7B
Model
Coding fallback model. Coding-specialized 7B for IDE-style queries. Open WebUI's per-conversation model switching makes this seamless.
Runs: Ollama
nomic-embed-text-v1.5
Embeddings
Embeddings (for RAG mode). Open WebUI's RAG plugin uses this by default. Small enough to share the GPU.
Runs: Ollama
Surface
Open WebUI
Frontend
8080/tcp
Chat surface. Closest open-source ChatGPT clone — multi-model switching, conversation history, RAG, persona presets, voice. LibreChat is the alternative when you need tighter MS365 / SSO integration.
Runs: Docker container
Data
ChromaDB (Open WebUI default)
Vector DB
Vector DB (built-in). Built into Open WebUI; zero-config; perfectly fine for solo use up to ~100 K chunks. Swap to Qdrant for larger corpora.
Runs: embedded in Open WebUI
Operations
Tailscale + Open WebUI built-in auth
Auth
Auth + remote access. Open WebUI's auth covers user management; Tailscale wraps the whole thing in a private mesh. Zero cloud, zero public-internet exposure.
Runs: host service

Hardware

Single 4090 is overkill but pleasant. RTX 3090 (24 GB) is the budget default. Apple M3 Max 64 GB is the silent-laptop alternative — same UX, lower throughput.

The killer optimization: keep one model in residence; let Ollama auto-evict. Two large models swapping in/out of VRAM is the #1 performance complaint here.
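
Enforcing single-model residency looks roughly like this (a sketch: OLLAMA_MAX_LOADED_MODELS and the keep_alive request parameter are real Ollama knobs, but the model tag shown is an assumption — match it to whatever you pulled):

```shell
# Cap Ollama at one resident model so a second request evicts
# rather than co-loads (systemd override shown; adapt to your init).
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"

# Keep the daily-driver model warm for an hour after each request
# instead of the default eviction timeout.
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:14b-instruct-q4_K_M", "keep_alive": "1h"}'
```

With the cap at 1, switching models in Open WebUI still works — it just forces a full swap, which is exactly the thrashing you want to make visible rather than silent.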

For more than 5 concurrent users: bump to vLLM (see /workflows/local-coding-agent-system) — Ollama serializes requests per loaded model.

Storage

Plan 150 GB: 3-5 model weights at ~10-15 GB each, conversation history (1 MB / month / user), embeddings index (~50 MB / 100 K chunks).

Conversations are the user-data layer. Back them up. Open WebUI stores them in SQLite under a Docker volume; docker run --rm -v webui_data:/data -v $(pwd)/backup:/backup alpine tar czf /backup/webui-$(date +%F).tgz /data is the lazy nightly cron.
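
Spelled out as a cron-able script (a sketch; assumes the volume really is named webui_data — confirm with docker volume ls):

```shell
#!/bin/sh
# Nightly tarball of the Open WebUI data volume, pruned to two weeks.
set -eu
BACKUP_DIR="$HOME/backup"
mkdir -p "$BACKUP_DIR"
docker run --rm \
  -v webui_data:/data:ro \
  -v "$BACKUP_DIR":/backup \
  alpine tar czf "/backup/webui-$(date +%F).tgz" -C / data
# Keep the last 14 archives; prune older ones.
find "$BACKUP_DIR" -name 'webui-*.tgz' -mtime +14 -delete
```

Drop it in /etc/cron.daily/ or an equivalent timer; the :ro mount keeps the backup job from ever writing into the live volume.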

Networking

Tailscale + Open WebUI is the recommended path. The web UI binds to 0.0.0.0 inside its container; publish it on the host as 127.0.0.1:8080 only; Tailscale handles remote access.

If multiple household members need access from outside the LAN: each gets a Tailscale device. MagicDNS makes workstation.tail-net.ts.net resolve.

Never publish 8080 to the public internet. Open WebUI's auth is fine for trusted users on private networks; do not stress-test it against attackers.
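
The safe publish pattern, as a sketch (the tailscale serve syntax has changed across Tailscale releases, so verify against tailscale serve --help on your version):

```shell
# Publish Open WebUI on loopback only -- never 0.0.0.0 on the host.
docker run -d --name open-webui \
  -p 127.0.0.1:8080:8080 \
  -v webui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

# Expose it to your tailnet over HTTPS (tailnet-only, not public).
tailscale serve --bg 8080
```

The -p 127.0.0.1:8080:8080 binding is the line that matters: without the explicit loopback address, Docker publishes on all interfaces and quietly bypasses host firewall assumptions.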

Observability

Lighter than the production workflows. Watch:

  • Ollama load time (first token after model swap). >5s means VRAM contention; close other GPU consumers.
  • Conversation count growth. SQLite backing store gets slow past ~10K conversations / user; archive or rotate.
  • Disk usage on the Open WebUI volume.

Grafana is overkill; docker stats + a weekly disk-usage check are sufficient.
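
A minimal weekly check, assuming the container is named open-webui and the volume webui_data:

```shell
# Container CPU / memory snapshot (one-shot, no streaming).
docker stats --no-stream open-webui

# Disk usage of the conversation/data volume.
docker system df -v | grep webui_data

# GPU residency: which model is loaded and how much VRAM it holds.
ollama ps
```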

Security

Default-disable signup. Set ENABLE_SIGNUP=false in the Open WebUI environment, then add accounts manually.

Strong owner password. Don't reuse credentials. Open WebUI's auth uses bcrypt — fine — but treat the admin account like a bastion.

Conversation privacy. SQLite stores plaintext. Encrypt the host volume (LUKS) if your threat model includes physical workstation theft.

Memory plugin. If you enable Open WebUI's memory feature, remember it reads / writes to the same vector store as RAG. Don't store secrets there.

Upgrade path

Multi-user (5+): swap Ollama → vLLM with batching; add per-user API keys; consider LiteLLM as a proxy.
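
The shape of that swap, as a sketch (model name, ports, and flags are assumptions — check the vLLM and LiteLLM docs for your versions):

```shell
# vLLM's OpenAI-compatible server with continuous batching.
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --max-model-len 32768 \
  --port 8000

# LiteLLM proxy in front of it, where per-user keys live.
litellm --model openai/Qwen/Qwen2.5-14B-Instruct \
  --api_base http://localhost:8000/v1 \
  --port 4000
```

Open WebUI then points at the proxy instead of Ollama, so the frontend layer is unchanged by the swap.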

Bigger models (32-70B): add VRAM. Single 4090 caps at 32B; dual 3090 / dual 4090 unlocks 70B. See /workflows/local-coding-agent-system.

Document chat: enable Open WebUI's RAG plugin; pre-warm the embeddings model in Ollama so first ingest doesn't cold-start.
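
Pre-warming is one pull plus one throwaway embed call (a sketch; /api/embeddings is Ollama's legacy endpoint name — newer releases also expose /api/embed):

```shell
# Pull the embedding model ahead of first ingest.
ollama pull nomic-embed-text

# One dummy request loads it into VRAM, so the first document
# ingest doesn't pay the cold-start cost.
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "warmup", "keep_alive": "1h"}'
```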

Voice: integrate Whisper + Piper as a per-conversation voice mode. See /workflows/local-voice-assistant for the full pattern.

What breaks first

  1. Model thrashing when users switch models per-conversation. Either keep one model loaded or bump VRAM.
  2. Open WebUI auto-update on Docker pulls a new image with breaking config changes. Pin the image SHA.
  3. SQLite write contention at 10+ concurrent active users. Migrate to Postgres backing store before you hit this.
  4. Ollama port-conflict with other services on 11434. See /errors/ollama-bind-port-conflict.
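
Two of those mitigations in command form (a sketch: DATABASE_URL is Open WebUI's documented Postgres switch; the digest and database credentials shown are placeholders, not real values):

```shell
# 1. Capture the exact digest of the image you're running today,
#    then deploy by digest instead of the moving :main tag.
docker inspect --format '{{index .RepoDigests 0}}' \
  ghcr.io/open-webui/open-webui:main

# 2. Point Open WebUI at Postgres before SQLite contention bites.
docker run -d --name open-webui \
  -e DATABASE_URL="postgresql://webui:change-me@db-host:5432/webui" \
  -v webui_data:/app/backend/data \
  ghcr.io/open-webui/open-webui@sha256:<digest>
```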

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

  • /stacks/memory-enabled-agent →
  • /stacks/rtx-4090-workstation →
  • /stacks/apple-silicon-ai →
Map this workflow to a build

Open the custom build engine and explore which hardware tier actually supports this workflow.

Open custom builder →

Workflow validation

unvalidated

Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 2 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.

  • Unvalidated
    qwen-2.5-14b-instruct via ollama
    No public benchmarks yet. The workflow's claim about this model is currently unsubstantiated by measurements.
    0 benchmarks · Submit the first benchmark →
  • Unvalidated
    qwen-2.5-coder-7b via ollama
    No public benchmarks yet. The workflow's claim about this model is currently unsubstantiated by measurements.
    0 benchmarks · Submit the first benchmark →
✓ Editorial · Validate this workflow → · See benchmark roadmap → · How validation works →
Help keep this page accurate

We read every submission. Editorial review takes 1-7 days.

Report outdated · Suggest a correction · Did this workflow work for you?