
Build an offline RAG workstation stack (May 2026)

Search and chat with thousands of private documents on a single workstation that never phones home — for legal, medical, financial, or any data class that legally cannot leave the network.

By Fredoline Eruo · Last reviewed 2026-05-06 · ~12 min read
The stack
  1. Tool · RAG workspace + ingestion pipeline · anythingllm

    AnythingLLM over Open WebUI for offline RAG: workspace = collection isolation primitive, native ingestion pipeline (PDF/DOCX/MD chunkers), LanceDB embedded by default — no separate vector DB to firewall. Open WebUI is better for chat-first; AnythingLLM is better for document-first.

  2. Tool · Inference engine (LLM + embeddings) · ollama

    Ollama over vLLM for offline RAG: same machine hosts both the LLM and the embedding model with one process; vLLM's production strengths (continuous batching, multi-tenant) don't help a single-user workstation. Pull mxbai-embed-large for embeddings + Qwen 2.5 14B for chat.

  3. Tool · Vector store (embedded, no server) · lancedb

    LanceDB is the AnythingLLM default and the right pick for offline: single-folder Arrow files, no server process to firewall, scales comfortably to 1M+ vectors. Switch to Qdrant only when crossing the LanceDB scaling ceiling — Qdrant adds a service to harden.

  4. Model · Chat LLM (14B class) · qwen-2.5-14b-instruct

    Qwen 2.5 14B Instruct over Llama 3.1 8B for offline RAG: stronger at synthesizing across multiple retrieved chunks (in a real test, it wins a 5-document summarization benchmark by ~15%). At 4-bit quantization (Ollama's default pull) it fits on a 24GB card with KV-cache headroom for 32K context; FP16 weights alone would need ~28GB.

  5. Hardware · GPU (LLM + embedding generation) · rtx-4090

    RTX 4090 24GB is the workstation default. Embedding 50,000 PDF chunks takes ~30 minutes on a 4090 vs ~3 hours on CPU; the GPU pays for itself on the ingestion side alone for any meaningful document corpus.

Why a dedicated offline-RAG stack

The general-purpose stacks (workstation, coding agent) assume some degree of cloud-friendliness. This one inverts that assumption. It's designed for legal teams ingesting discovery documents, medical groups indexing patient records, financial advisors indexing client portfolios, and any regulated industry where data residency is a hard constraint. The differentiating design choice: every component is local, every dependency is auditable, every process is firewall-friendly. No telemetry, no API calls, no “just for analytics” third-party SDK.

The headline architectural choice this stack makes: embedded vector store + native ingestion pipeline + single inference process. Most production-RAG guides suggest Qdrant or Weaviate or Pinecone — all reasonable choices for cloud-friendly deployments, all wrong here. Each adds a server process to harden, a port to firewall, a credential to rotate. LanceDB's embedded architecture is the right answer for offline because there's nothing to firewall — the “database” is just files on disk.
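
Once the stack is up (step 2 below), you can verify that claim directly; the path assumes Docker's default named-volume location:

# The entire vector store is plain files on disk: no port, no daemon to audit.
sudo ls /var/lib/docker/volumes/anythingllm-storage/_data/lancedb/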

Step-by-step setup

1. Bring up Ollama with the LLM and the embedding model

# Native install (no Docker — fewer egress paths)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the LLM and the embedding model
ollama pull qwen2.5:14b-instruct
ollama pull mxbai-embed-large

# Verify both models are present (api/tags lists pulled models)
curl http://localhost:11434/api/tags

Ollama runs as a single systemd service. The default bind is 127.0.0.1:11434 — no external interface exposed. For an air-gapped audit, log a snapshot of iptables -L and ss -lntp at this stage to prove no outbound paths opened.
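
A minimal snapshot sketch, assuming ~/rag-audit/ as the evidence directory (adjust to your retention policy):

# Record firewall rules and listening sockets as point-in-time evidence
mkdir -p ~/rag-audit
sudo iptables -L -n -v > ~/rag-audit/iptables-$(date +%Y%m%dT%H%M%S).txt
sudo ss -lntp > ~/rag-audit/sockets-$(date +%Y%m%dT%H%M%S).txt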

2. Install AnythingLLM as the RAG frontend

# Docker is fine here — the AnythingLLM container has no
# outbound dependencies once images are pulled
docker pull mintplexlabs/anythingllm:latest

# Pre-create the storage volume so we can audit it later
docker volume create anythingllm-storage

docker run -d --name anythingllm \
  -p 127.0.0.1:3001:3001 \
  --restart unless-stopped \
  --cap-add SYS_ADMIN \
  -v anythingllm-storage:/app/server/storage \
  -e LLM_PROVIDER="ollama" \
  -e OLLAMA_BASE_PATH="http://host.docker.internal:11434" \
  -e OLLAMA_MODEL_PREF="qwen2.5:14b-instruct" \
  -e EMBEDDING_ENGINE="ollama" \
  -e EMBEDDING_BASE_PATH="http://host.docker.internal:11434" \
  -e EMBEDDING_MODEL_PREF="mxbai-embed-large" \
  -e VECTOR_DB="lancedb" \
  --add-host=host.docker.internal:host-gateway \
  mintplexlabs/anythingllm:latest

Bind to 127.0.0.1:3001 only — never expose this to the LAN if true offline isolation is the requirement. Network policy enforcement at the OS level is the real defence; the bind address is belt-and-suspenders.
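
One way to make the OS the enforcement point is a default-deny egress policy. A hedged ufw sketch (the docker0 interface name assumes Docker's default bridge; adapt the rules to your distribution):

# Default-deny both directions; allow loopback and the one
# container-to-host path AnythingLLM needs to reach Ollama
sudo ufw default deny incoming
sudo ufw default deny outgoing
sudo ufw allow in on lo
sudo ufw allow out on lo
sudo ufw allow in on docker0 to any port 11434 proto tcp
sudo ufw enable

Docker writes its own iptables rules and can bypass ufw for published ports, so re-verify with the tcpdump check in step 4.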

3. Configure the workspace and ingestion settings

# After UI setup at http://127.0.0.1:3001/
# Workspace settings -> chunking
chunk_size: 800        # smaller than default for higher recall
chunk_overlap: 200
top_k: 6              # retrieve 6 chunks per query (default is 4)
query_mode: "exact"   # explicit retrieval; no automatic synthesis

# Workspace settings -> permissions
allow_browser_extensions: false
allow_url_scrape: false  # critical: disable web scraper for offline use

The web scraper setting is the single most important toggle for the offline guarantee. Leaving it enabled allows a workspace to fetch URLs — that's an outbound network call by definition. Disable it. Document the disabled state in the workspace audit log.

4. Verify air-gap before ingestion

# Use a network-monitoring tool that captures all packets during
# a smoke-test query. Suricata works well; tcpdump is enough for
# spot checks.
sudo tcpdump -i any -n 'not host 127.0.0.1 and not host ::1' &

# Then run a test query in AnythingLLM. Expected output: NO PACKETS.
# If tcpdump captures anything outbound, the stack is leaking — stop
# and find the source before ingesting real data.
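
To make that check repeatable, a hedged wrapper sketch (the 60-second window and pcap path are arbitrary choices):

# Capture during a fixed window while the smoke-test query runs,
# then fail loudly if a single non-loopback packet was seen
sudo timeout 60 tcpdump -i any -n -w /tmp/airgap-smoke.pcap \
  'not host 127.0.0.1 and not host ::1'
captured=$(sudo tcpdump -r /tmp/airgap-smoke.pcap 2>/dev/null | wc -l)
if [ "$captured" -gt 0 ]; then
  echo "LEAK: $captured packets captured; do not ingest" >&2
  exit 1
fi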

Ingestion workflow

The repeatable workflow for safely ingesting tens of thousands of documents:

  1. Stage documents in a known directory, e.g. ~/rag-corpus/inbound/. Always keep the source files outside the AnythingLLM container so you can re-ingest after configuration changes.
  2. Audit the manifest before upload. Generate a hash-listed manifest (sha256sum *.pdf > manifest.txt); keep it with the workspace audit log.
  3. Ingest in batches of 500-1000 documents. AnythingLLM's upload UI handles single batches well; the POST /api/v1/document/upload endpoint handles larger ones via script (see the sketch after this list). On a 4090, 1000 PDFs take ~10-30 minutes depending on length.
  4. Verify chunk count after each batch. Workspace stats UI shows total chunks; sanity-check that chunks-per-document is reasonable (5-20 for typical PDFs; 500+ usually means the chunker is overshooting and you need smaller chunk_size).
  5. Take a LanceDB snapshot. The named volume from step 2 lives at /var/lib/docker/volumes/anythingllm-storage/_data on the host; after every major ingestion batch, copy its lancedb/ directory, e.g. sudo cp -r /var/lib/docker/volumes/anythingllm-storage/_data/lancedb ~/backups/lancedb-$(date +%Y%m%d)/. Vector stores corrupt quietly; backups recover.
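
A hedged upload-loop sketch against the endpoint named in step 3 (the bearer-token header and multipart field name are assumptions; check the API docs for your AnythingLLM version):

# Upload the first batch of 500 staged PDFs via the developer API.
# ANYTHINGLLM_API_KEY is generated in the AnythingLLM admin UI.
find ~/rag-corpus/inbound -name '*.pdf' -print0 | head -z -n 500 |
  while IFS= read -r -d '' f; do
    curl -s -X POST http://127.0.0.1:3001/api/v1/document/upload \
      -H "Authorization: Bearer $ANYTHINGLLM_API_KEY" \
      -F "file=@$f"
  done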

Failure modes you'll hit

  1. Embedding model mismatch. If you change the embedding model after ingestion, every existing collection becomes unreadable. AnythingLLM doesn't warn loudly enough. Pin the embedding model in workspace settings; re-ingest the entire corpus when changing it.
  2. Workspace size explosion via OCR. Some PDFs are scanned images; AnythingLLM runs OCR on them. A 100-page scan can produce 800+ chunks. Set chunk-count limits per document or pre-process scans separately.
  3. LanceDB query latency past 500K chunks. LanceDB scales further than Chroma but still wobbles past ~500K vectors per workspace. Sharding by year/department is the workaround; switching to Qdrant is the upgrade — see the variation below.
  4. Ollama context truncation. AnythingLLM defaults to 4K context for retrieval prompts; with top_k=6 and chunk_size=800 you're sending ~5K tokens of retrieved context, and the model truncates. Raise the workspace context window to 16K.
  5. Docker volume permission corruption. On a Linux host, a non-default UID can corrupt the SQLite metadata. Use named volumes (as above) rather than bind mounts; if you must bind-mount, chown the directory to 1000:1000 first.
  6. Outbound DNS leak via Docker default networking. Docker's embedded DNS still resolves external names by default. For a true air-gap, run with --network=none after image pull (note this also severs the container's path to host Ollama), or use a firewall ruleset that blocks Docker bridge → external traffic; see the sketch after this list.
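
A hedged sketch of that ruleset using the DOCKER-USER chain, which Docker consults before its own rules (docker0 assumes the default bridge; container-to-host traffic such as the Ollama call traverses INPUT, not FORWARD, so it survives this rule):

# Drop anything containers try to forward off the host
sudo iptables -I DOCKER-USER -i docker0 -j DROP
# Verify the DROP sits at the top of the chain (re-apply after reboot)
sudo iptables -L DOCKER-USER -n -v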

Variations and alternatives

Apple Silicon variation. Replace Ollama with MLX-LM and the 4090 with an M3 Max. AnythingLLM is happy talking to either OpenAI-compatible endpoint. Throughput drops ~30%, but unified memory makes larger context windows easier.
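
A hedged launch sketch via mlx-lm's OpenAI-compatible server (the quantized model repo is an assumption; substitute whatever MLX build you've vetted):

# mlx-lm ships a small OpenAI-compatible HTTP server
pip install mlx-lm
python -m mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit \
  --host 127.0.0.1 --port 8080
# Point AnythingLLM's generic OpenAI provider at http://127.0.0.1:8080/v1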

Larger-corpus variation. Past ~500K chunks, swap LanceDB for Qdrant (with Docker network isolation). The added service requires firewall hardening but scales to 10M+ vectors.
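
A hedged bring-up for that swap (image tag and volume name are illustrative; point AnythingLLM's vector-database setting at the new endpoint afterwards):

# Loopback-only bind keeps the added service off the LAN
docker volume create qdrant-storage
docker run -d --name qdrant \
  -p 127.0.0.1:6333:6333 \
  --restart unless-stopped \
  -v qdrant-storage:/qdrant/storage \
  qdrant/qdrant:latest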

Higher-throughput query variation. Replace Ollama with vLLM if multiple users hit the workspace concurrently. Same OpenAI-compatible interface; vLLM's continuous batching handles concurrent retrieve-then-generate cycles much better than Ollama.
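
A hedged launch sketch (the AWQ model ID and context length are assumptions; the quantized build matters because 14B FP16 weights alone exceed 24GB, and vLLM pulls from Hugging Face, so download weights before going offline):

# vLLM exposes the same OpenAI-compatible API on loopback
pip install vllm
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --host 127.0.0.1 --port 8000 \
  --max-model-len 16384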

Multi-machine offline variation. Document server (storage + ingestion) on one box, inference on a dedicated GPU box, both on an isolated VLAN. AnythingLLM handles cross-machine deployment via the same OpenAI-compatible model URL pattern.

Who should avoid this stack

  • Anyone whose privacy requirements are softer than stated. If “cloud-friendly with reasonable controls” is acceptable, the cloud-RAG path (Pinecone + OpenAI) is faster to set up and operationally cheaper to maintain. This stack costs you ergonomics for a privacy guarantee you may not actually need.
  • Single-user RAG over personal notes. Smaller than the workstation tier; consider just pasting documents into the model's context window via Ollama and skipping the dedicated vector store entirely.
  • Real-time interactive queries on millions of documents. The single-workstation ceiling is real; past 1M chunks per workspace, accept the multi-machine variation or move to a server cluster.
  • Heterogeneous data types beyond text. AnythingLLM does some image extraction; for serious multi-modal RAG, build a custom pipeline with specialized vision-language models.

Going deeper