Build an offline RAG workstation stack (May 2026)
Search and chat with thousands of private documents on a single workstation that never phones home — for legal, medical, financial, or any data class that legally cannot leave the network.
- 01 · Tool · RAG workspace + ingestion pipeline: AnythingLLM
AnythingLLM over Open WebUI for offline RAG: workspace = collection isolation primitive, native ingestion pipeline (PDF/DOCX/MD chunkers), LanceDB embedded by default — no separate vector DB to firewall. Open WebUI is better for chat-first; AnythingLLM is better for document-first.
- 02 · Tool · Inference engine (LLM + embeddings): Ollama
Ollama over vLLM for offline RAG: same machine hosts both the LLM and the embedding model with one process; vLLM's production strengths (continuous batching, multi-tenant) don't help a single-user workstation. Pull mxbai-embed-large for embeddings + Qwen 2.5 14B for chat.
- 03 · Tool · Vector store (embedded, no server): LanceDB
LanceDB is the AnythingLLM default and the right pick for offline: single-folder Arrow files, no server process to firewall, scales comfortably to 1M+ vectors. Switch to Qdrant only when crossing the LanceDB scaling ceiling — Qdrant adds a service to harden.
- 04 · Model · Chat LLM (14B class): Qwen 2.5 14B Instruct
Qwen 2.5 14B Instruct over Llama 3.1 8B for offline RAG: stronger at synthesizing across multiple retrieved chunks (in a five-document summarization test it scores roughly 15% higher). Fits on a 24GB card at Ollama's default quantization with KV-cache headroom for 32K context.
- 05 · Hardware · GPU (LLM + embedding generation): RTX 4090
RTX 4090 24GB is the workstation default. Embedding 50,000 PDF chunks takes ~30 minutes on a 4090 vs ~3 hours on CPU; the GPU pays for itself on the ingestion side alone for any meaningful document corpus.
Why a dedicated offline-RAG stack
The general-purpose stacks (workstation, coding agent) assume some degree of cloud-friendliness. This one inverts that assumption. It's designed for legal teams ingesting discovery documents, medical groups indexing patient records, financial advisors querying client portfolios, and any regulated industry where data residency is a hard constraint. The differentiating design choice: every component is local, every dependency is auditable, every process is firewall-friendly. No telemetry, no API calls, no “just for analytics” third-party SDK.
The headline architectural choice this stack makes: embedded vector store + native ingestion pipeline + single inference process. Most production-RAG guides suggest Qdrant or Weaviate or Pinecone — all reasonable choices for cloud-friendly deployments, all wrong here. Each adds a server process to harden, a port to firewall, a credential to rotate. LanceDB's embedded architecture is the right answer for offline because there's nothing to firewall — the “database” is just files on disk.
Step-by-step setup
1. Bring up Ollama with the LLM and the embedding model
# Native install (no Docker — fewer egress paths)
curl -fsSL https://ollama.com/install.sh | sh
# Pull the LLM and the embedding model
ollama pull qwen2.5:14b-instruct
ollama pull mxbai-embed-large
# Verify both are loaded
curl http://localhost:11434/api/tags
Ollama runs as a single systemd service. The default bind is 127.0.0.1:11434 — no external interface exposed. For an air-gapped audit, log a snapshot of iptables -L and ss -lntp at this stage to prove no outbound paths opened.
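A minimal way to capture that snapshot for the audit log, assuming iptables and ss are available on the host (the directory and filenames here are illustrative):
# Snapshot firewall rules and listening sockets for the audit log
mkdir -p ~/rag-audit
sudo iptables -L -n -v > ~/rag-audit/iptables-$(date +%Y%m%d).txt
ss -lntp > ~/rag-audit/listeners-$(date +%Y%m%d).txt
# At this stage the only RAG-related listener should be 127.0.0.1:11434 (Ollama)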
2. Install AnythingLLM as the RAG frontend
# Docker is fine here — the AnythingLLM container has no
# outbound dependencies once images are pulled
docker pull mintplexlabs/anythingllm:latest
# Pre-create the storage volume so we can audit it later
docker volume create anythingllm-storage
docker run -d --name anythingllm \
-p 127.0.0.1:3001:3001 \
--restart unless-stopped \
--cap-add SYS_ADMIN \
-v anythingllm-storage:/app/server/storage \
-e LLM_PROVIDER="ollama" \
-e OLLAMA_BASE_PATH="http://host.docker.internal:11434" \
-e OLLAMA_MODEL_PREF="qwen2.5:14b-instruct" \
-e EMBEDDING_ENGINE="ollama" \
-e EMBEDDING_BASE_PATH="http://host.docker.internal:11434" \
-e EMBEDDING_MODEL_PREF="mxbai-embed-large" \
-e VECTOR_DB="lancedb" \
--add-host=host.docker.internal:host-gateway \
mintplexlabs/anythingllm:latest
Bind to 127.0.0.1:3001 only — never expose this to the LAN if true offline isolation is the requirement. Network policy enforcement at the OS level is the real defence; the bind address is belt-and-suspenders.
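A quick way to confirm the published port really is loopback-only (a spot check, not a replacement for host-level network policy):
# Confirm the container port is published on loopback only
docker port anythingllm        # expected: 3001/tcp -> 127.0.0.1:3001
ss -lntp | grep 3001           # listener should show 127.0.0.1:3001, not 0.0.0.0:3001
curl -sI http://127.0.0.1:3001 | head -n 1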
3. Configure the workspace and ingestion settings
# After UI setup at http://127.0.0.1:3001/
# Workspace settings -> chunking
chunk_size: 800 # smaller than default for higher recall
chunk_overlap: 200
top_k: 6 # retrieve 6 chunks per query (default is 4)
query_mode: "exact" # explicit retrieval; no automatic synthesis
# Workspace settings -> permissions
allow_browser_extensions: false
allow_url_scrape: false # critical: disable web scraper for offline use
The web scraper setting is the single most important toggle for the offline guarantee. Leaving it enabled allows a workspace to fetch URLs — that's an outbound network call by definition. Disable it. Document the disabled state in the workspace audit log.
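One sanity check worth doing on these numbers before moving on: the retrieved context alone already exceeds a 4K window (assuming chunk_size is counted in tokens, as the context-truncation note under the failure modes below also assumes), which is why that section recommends raising the workspace context window.
# Back-of-envelope retrieval budget for the settings above
# (assumes chunk_size is counted in tokens)
echo "retrieved context: $((6 * 800)) tokens"   # 4800, before the system prompt and question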
4. Verify air-gap before ingestion
# Use a network-monitoring tool that captures all packets during
# a smoke-test query. Suricata works well; tcpdump is enough for
# spot checks.
sudo tcpdump -i any -n 'not host 127.0.0.1 and not host ::1' &
# Then run a test query in AnythingLLM. Expected output: NO PACKETS.
# If tcpdump captures anything outbound, the stack is leaking — stop
# and find the source before ingesting real data.
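For the audit trail, the same check can be scripted so it leaves a pcap artifact behind. A sketch, assuming a 60-second capture around a test prompt sent straight to Ollama (a query in the AnythingLLM UI exercises more of the stack; paths and timings are illustrative):
# Scripted air-gap check: capture non-loopback traffic around a test prompt
sudo timeout 60 tcpdump -i any -n -w /tmp/airgap-$(date +%Y%m%d).pcap \
    'not host 127.0.0.1 and not host ::1' &
sleep 2   # give tcpdump a moment to start
curl -s http://127.0.0.1:11434/api/generate \
    -d '{"model": "qwen2.5:14b-instruct", "prompt": "air-gap smoke test", "stream": false}' > /dev/null
wait
# Count captured packets; the expected answer is 0
sudo tcpdump -r /tmp/airgap-$(date +%Y%m%d).pcap 2>/dev/null | wc -l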
Ingestion workflow
The repeatable workflow for safely ingesting tens of thousands of documents:
- Stage documents in a known directory, e.g. ~/rag-corpus/inbound/. Always keep the source files outside the AnythingLLM container so you can re-ingest after configuration changes.
- Audit the manifest before upload. Generate a hash-listed manifest (sha256sum *.pdf > manifest.txt); keep it with the workspace audit log.
- Ingest in batches of 500-1000 documents. AnythingLLM's upload UI handles single batches well; the POST /api/v1/document/upload endpoint handles larger ones via script (see the upload sketch after this list). On a 4090, 1000 PDFs take ~10-30 minutes depending on length.
- Verify chunk count after each batch. The workspace stats UI shows total chunks; sanity-check that chunks-per-document is reasonable (5-20 for typical PDFs; 500+ usually means the chunker is overshooting and you need a smaller chunk_size).
- Take a LanceDB snapshot (cp -r ~/.anythingllm-data/lancedb ~/backups/lancedb-$(date +%Y%m%d)/) after every major ingestion batch. Vector stores corrupt quietly; backups recover.
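A sketch of the scripted path for larger batches, using the upload endpoint named above. The Bearer-token auth and the file form field follow the AnythingLLM developer-API docs as I understand them; verify both against your version, and treat the paths and key placeholder as illustrative.
# Batched upload against the AnythingLLM developer API (verify endpoint and
# auth against your AnythingLLM version before pointing this at real data)
API_KEY="paste-a-developer-api-key-here"   # generated in the AnythingLLM UI
for f in ~/rag-corpus/inbound/batch-001/*.pdf; do
    curl -s -X POST http://127.0.0.1:3001/api/v1/document/upload \
        -H "Authorization: Bearer ${API_KEY}" \
        -F "file=@${f}" > /dev/null \
        && echo "uploaded: ${f}" || echo "FAILED: ${f}"
done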
Failure modes you'll hit
- Embedding model mismatch. If you change the embedding model after ingestion, every existing collection becomes unreadable. AnythingLLM doesn't warn loudly enough. Pin the embedding model in workspace settings; re-ingest the entire corpus when changing it.
- Workspace size explosion via OCR. Some PDFs are scanned images; AnythingLLM runs OCR on them. A 100-page scan can produce 800+ chunks. Set chunk-count limits per document or pre-process scans separately.
- LanceDB query latency past 500K chunks. LanceDB scales further than Chroma but still wobbles past ~500K vectors per workspace. Sharding by year/department is the workaround; switching to Qdrant is the upgrade — see the variation below.
- Ollama context truncation. AnythingLLM defaults to a 4K context for retrieval prompts; with top_k=6 and chunk_size=800 you're sending ~5K tokens and the model truncates. Raise the workspace context window to 16K.
- Docker volume permission corruption. A Linux host with a non-default UID can corrupt the SQLite metadata. Use named volumes (above) rather than bind mounts; if you must bind-mount, pre-set ownership with chown 1000:1000.
- Outbound DNS leak via Docker default networking. Docker's embedded DNS still resolves external names by default. For a true air-gap, run with --network=none after image pull, or use a firewall ruleset that blocks Docker bridge → external traffic (see the sketch after this list).
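One way to implement that bridge-blocking ruleset, as a sketch: Docker reserves the DOCKER-USER iptables chain for exactly this kind of rule. The 172.17.0.0/16 range below is Docker's usual default bridge subnet; confirm yours with docker network inspect bridge before copying it.
# Drop forwarded traffic from the default Docker bridge to anything outside it
# (container-to-host traffic on the bridge gateway does not traverse this chain)
sudo iptables -I DOCKER-USER -s 172.17.0.0/16 ! -d 172.17.0.0/16 -j DROP
# Verify the rule landed
sudo iptables -L DOCKER-USER -n -v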
Variations and alternatives
Apple Silicon variation. Replace Ollama with MLX-LM and the 4090 with an M3 Max. AnythingLLM is happy talking to either OAI-compatible endpoint. Throughput drops ~30% but unified memory makes larger context windows easier.
Larger-corpus variation. Past ~500K chunks, swap LanceDB for Qdrant (with Docker network isolation). The added service requires firewall hardening but scales to 10M+ vectors.
Higher-throughput query variation. Replace Ollama with vLLM if multiple users hit the workspace concurrently. Same OAI-compatible interface; vLLM's continuous batching handles concurrent retrieval-then-generate cycles much better than Ollama.
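For that variation, a minimal sketch of standing up the OpenAI-compatible endpoint with vLLM; the model ID and flags are illustrative, and the weights (like the Docker images) have to be pulled before the machine is cut off from the network:
# Serve a quantized Qwen 2.5 14B Instruct behind vLLM's OpenAI-compatible API
# (full-precision 14B weights exceed a 24GB card; an AWQ-quantized variant is
# assumed here)
pip install vllm
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
    --host 127.0.0.1 --port 8000 \
    --max-model-len 16384
# Point AnythingLLM at http://127.0.0.1:8000/v1 as a generic OpenAI-compatible
# endpoint instead of Ollama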
Multi-machine offline variation. Document server (storage + ingestion) on one box, inference on a dedicated GPU box, both on an isolated VLAN. AnythingLLM handles cross-machine deployment via the same OAI-compatible model URL pattern.
Who should avoid this stack
- Anyone whose privacy requirements are softer than stated. If “cloud-friendly with reasonable controls” is acceptable, the cloud-RAG path (Pinecone + OpenAI) is faster to set up and operationally cheaper to maintain. This stack costs you ergonomics for a privacy guarantee you may not actually need.
- Single-user RAG over personal notes. Smaller than the workstation tier; consider just running Ollama and dropping documents directly into the context window, skipping the dedicated vector store entirely.
- Real-time interactive queries on millions of documents. The single-workstation ceiling is real; past 1M chunks per workspace, accept the multi-machine variation or move to a server cluster.
- Heterogeneous data types beyond text. AnythingLLM does some image extraction; for serious multi-modal RAG, build a custom pipeline with specialized vision-language models.
Going deeper
- AnythingLLM operational review — full L1.5 review with the architecture, the 8 production failure modes, and the comparison block.
- LanceDB catalog entry — Arrow-on-disk vector store characteristics and the scaling ceiling.
- Inference runtime ecosystem map — where Ollama sits, plus the production-throughput alternatives if you hit Ollama's ceiling.
- RTX 4090 workstation stack — the broader workstation pattern this offline stack specializes from.