Offline RAG pipeline
Air-gapped retrieval-augmented chat: document ingestion via unstructured.io, nomic-embed embeddings into Qdrant, bge-reranker reranking, Qwen 2.5 14B-Instruct generation, Open WebUI as the chat surface, plus observability and nightly snapshot backups.
Build summary
Goal: Ship a private Q&A system over a corpus of internal documents that never leaves the network.
Operator card
- ✓ Compliance-heavy teams that can't ship documents to cloud LLMs
- ✓ Internal knowledge-base Q&A on private corpora
- ✓ Regulated industries (legal, healthcare, finance)
- ✓ Single-team RAG at 5-15 concurrent users
- ⚠ You need >50 concurrent users (move to multi-replica or cloud)
- ⚠ Your corpus is mostly low-quality scanned PDFs (OCR pre-step required)
- ⚠ You don't have an ops person who can run Docker + Prometheus
Service ledger
10 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.
Hardware
RTX 4090 is the comfortable single-card target. Apple M3 Max 64 GB is the silent alternative — runs the same models via MLX-LM at ~25-35% lower decode tok/s but draws a fraction of the power.
The vector DB and reranker tax CPU + RAM more than the LLM does. Budget 32 GB RAM minimum for Qdrant + Open WebUI's RAG processing concurrent with the model. 64 GB is the comfortable working number.
NVMe storage is non-negotiable. SATA SSDs choke on large-corpus ingestion (an HDD chokes at the first batch).
Storage
Plan storage in three tiers: (1) raw documents in MinIO (1× corpus size as a floor; more once extracted images are stored alongside), (2) Qdrant HNSW indices (~150 MB per million chunks at 768 dims, quantized), (3) snapshot backups (~Qdrant index size × N retention generations).
For a 1 GB corpus of typical PDFs: ~3 GB raw (some documents carry heavy images), ~50K chunks, ~7.5 MB of Qdrant index. Even a 100 GB corpus stays under 500 GB total once everything's quantized.
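The three-tier arithmetic above can be packaged as a back-of-envelope estimator. The defaults below are the document's own figures (~50K chunks per GB of PDFs, ~150 MB of quantized index per million 768-dim chunks, raw storage ~3× corpus size, 7+4+6 = 17 snapshot generations); treat them as assumptions to tune against your actual corpus:

```python
def storage_estimate_gb(corpus_gb: float,
                        chunks_per_gb: int = 50_000,
                        index_mb_per_million: float = 150.0,
                        raw_multiplier: float = 3.0,
                        snapshot_generations: int = 17) -> dict:
    """Rough three-tier storage plan: raw docs, Qdrant index, snapshots.

    All defaults are assumptions taken from this doc's worked example,
    not measurements of your corpus.
    """
    chunks = corpus_gb * chunks_per_gb
    index_gb = chunks / 1_000_000 * index_mb_per_million / 1024
    raw_gb = corpus_gb * raw_multiplier
    snapshots_gb = index_gb * snapshot_generations
    return {
        "raw_gb": round(raw_gb, 2),
        "index_gb": round(index_gb, 3),
        "snapshots_gb": round(snapshots_gb, 2),
        "total_gb": round(raw_gb + index_gb + snapshots_gb, 2),
    }
```

For a 100 GB corpus this lands near ~313 GB all-in, consistent with the "under 500 GB" claim.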
Snapshot strategy. Qdrant supports atomic snapshots without downtime. Run a nightly cron that snapshots → uploads to a second storage volume (or to MinIO). Keep 7 daily + 4 weekly + 6 monthly. Total cold-storage cost stays under 100 GB for any practical corpus.
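The 7 daily + 4 weekly + 6 monthly retention rule is a classic grandfather-father-son rotation; a sketch of the selection logic over snapshot dates (adapt to however your cron names snapshot files):

```python
from datetime import date, timedelta

def snapshots_to_keep(dates: list[date],
                      daily: int = 7, weekly: int = 4,
                      monthly: int = 6) -> set[date]:
    """Grandfather-father-son rotation over nightly snapshot dates.

    Keeps the newest `daily` snapshots, the newest snapshot of each of
    the last `weekly` ISO weeks, and the newest of each of the last
    `monthly` calendar months. Everything else is safe to delete.
    """
    ordered = sorted(dates, reverse=True)  # newest first
    keep = set(ordered[:daily])

    seen_weeks: list[tuple] = []
    seen_months: list[tuple] = []
    for d in ordered:
        wk = d.isocalendar()[:2]          # (iso_year, iso_week)
        if wk not in seen_weeks:
            seen_weeks.append(wk)
            if len(seen_weeks) <= weekly:
                keep.add(d)
        mo = (d.year, d.month)
        if mo not in seen_months:
            seen_months.append(mo)
            if len(seen_months) <= monthly:
                keep.add(d)
    return keep
```

Run this after each nightly snapshot and delete anything not in the returned set; pair it with the disk-usage alert mentioned under "What breaks first".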
Networking
Air-gapped means: no DNS to public resolvers, no NTP to public servers, no auto-updates. Run a local Pi-hole + an internal NTP server.
If users access via a private corporate network: bind Caddy to the internal interface only. If users access via Tailscale: bind to the tailnet interface only. Never bind to 0.0.0.0.
Inside the workflow: every container talks via Docker bridge networks, no published ports except Caddy (443) and (optionally) Open WebUI debug (8080 loopback).
Observability
Critical metrics:
- Retrieval latency p99. Qdrant cold-start can take 100ms+; warm queries are <20ms. Sustained p99 > 200ms means the index doesn't fit in RAM.
- Rerank latency p99. bge-reranker-v2-m3 on CPU ~80-150ms for top-10. Sustained > 400ms means the CPU is overcommitted.
- Generation tok/s. Should stay above 30 tok/s on a 4090 + 14B AWQ; below that, concurrent load has exceeded capacity.
- Document ingestion success rate. unstructured.io fails on ~2-5% of typical PDFs (scans, password-protected, mixed RTL). Track and triage manually.
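These latency thresholds only mean something if you compute true percentiles over a window rather than averages. Prometheus histograms give you this in production; for ad-hoc log analysis the same arithmetic is a one-liner (nearest-rank method, sketch):

```python
import math

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a window of latency samples (ms).

    Compare the result against the SLOs above (e.g. retrieval p99 < 200 ms,
    rerank p99 < 400 ms) when triaging from raw logs.
    """
    if not samples_ms:
        raise ValueError("empty sample window")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```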
Security
Document scoping. Open WebUI supports per-user document collections — use them. Never share collections across teams that have different access policies.
Embedding model integrity. Pin the embeddings model SHA. A swapped embedding model breaks every existing query in subtle ways and is a supply-chain attack vector.
Audit trail. Open WebUI logs every query + retrieval; pipe to Loki + retain 90 days for compliance audits.
Backup encryption. Qdrant snapshots contain the full text of indexed documents. Encrypt at rest with age or gpg before shipping to off-site.
Upgrade path
Tighter retrieval (more accurate citations): swap nomic-embed → e5-mistral-7b-instruct (much larger, ~10 GB VRAM) for top-tier MTEB scores. Or stack: keep nomic-embed for speed, run e5-mistral as a secondary embedder for re-ranking via vector similarity.
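The "stacked embedder" option above reduces to re-scoring the fast retriever's top-k with the larger model's vectors, and the scoring step is plain cosine similarity. A toy sketch (the candidate vectors are hand-written here; in the real setup they would come from the secondary embedder, e.g. e5-mistral):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rescore(query_vec: list[float],
            candidates: list[tuple[str, list[float]]]) -> list[tuple[str, list[float]]]:
    """Re-rank (chunk_id, vector) candidates by similarity to the query.

    Retrieval order came from the fast primary embedder (nomic-embed);
    this pass re-orders the shortlist using the larger model's vectors.
    """
    return sorted(candidates,
                  key=lambda c: cosine(query_vec, c[1]),
                  reverse=True)
```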
Larger corpus (>10 GB documents): move from single Qdrant node → 3-node Qdrant cluster on shared NVMe. Adds operational complexity but stays self-hosted.
Multi-tenant production: add per-tenant Qdrant collections, audit logging via Loki + Vector, an API gateway (Kong) in front of Open WebUI for SSO.
What breaks first
- OCR coverage. Scanned PDFs hit unstructured.io's OCR fallback (Tesseract), which works but is slow and error-prone. Either pre-OCR with a better tool (e.g. Surya — note that Amazon Textract is a cloud API and would break the air gap) or live with the degradation.
- Document drift. Re-ingestion of changed documents creates orphan vectors. Run a periodic "find vectors with no matching document" cleanup.
- Reranker bottleneck. bge-reranker on CPU caps at ~10 reranks/sec. At 15 concurrent users you'll start queueing; move reranker to GPU when this happens.
- Open WebUI version drift. 0.x → 0.y minor bumps occasionally break the RAG pipeline. Pin the image SHA.
- Snapshot rotation forgotten. Eventually fills the disk and Qdrant goes read-only mid-day. Set up disk-usage alerts in Grafana.
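The "document drift" cleanup above boils down to comparing each vector's payload against the live document store. A sketch, assuming each Qdrant point's payload stores the source doc ID and a content hash (in practice you would scroll the collection in batches and delete the returned IDs via the Qdrant API):

```python
def find_orphans(points: dict[str, tuple[str, str]],
                 live_docs: dict[str, str]) -> set[str]:
    """Return point IDs whose source document is gone or has changed.

    points:    point_id -> (doc_id, content_hash) from each point's payload
    live_docs: doc_id -> current content hash in the document store (MinIO)

    A point is an orphan if its doc no longer exists, or if the doc was
    re-ingested with different content (hash mismatch).
    """
    return {
        pid for pid, (doc_id, content_hash) in points.items()
        if live_docs.get(doc_id) != content_hash
    }
```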
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-14b-instruct via vllm — no public benchmarks yet (0 benchmarks); the workflow's claim about this model is currently unsubstantiated by measurements.