
Private Document Analysis

RAG over sensitive documents (legal, medical, financial, personal) where data must not leave the local environment. The local-first wedge.

Capability notes

Local RAG (Retrieval-Augmented Generation) over private documents is the canonical "local-first wedge": data must not leave your infrastructure, so cloud RAG APIs are a non-starter for compliance. Stack: embed documents → store in vector DB → retrieve relevant chunks at query time → generate answer from retrieved context. All components run locally with open-weight models.

Accuracy on legal documents: local RAG with [BGE-M3](/models/bge-m3) embeddings + [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) + [Llama 3.3 70B](/models/llama-3-3-70b) achieves 78–85% on LegalBench-RAG, vs GPT-4 + OpenAI embeddings at 85–90%, a gap of roughly 5–10 percentage points. For medical (PubMedQA, MedRAG): local [Llama 3.3 70B](/models/llama-3-3-70b) + BGE-M3 reaches 72–80% vs Med-PaLM 2 at 80–86%. The gap narrowed from 15–20 points (2024) to 5–10 points (2026). For many compliance-driven use cases, a 5–10 point tradeoff is acceptable when the alternative is "cannot use AI."

The pipeline:

  1. Document ingestion: chunk at paragraph/section boundaries with 10–20% overlap.
  2. Query processing: embed the query via BGE-M3, retrieve the top 50–100 candidates, rerank the top 50 via BGE Reranker V2 M3, select the top 5–10.
  3. Grounded generation: inject the top-N chunks as context and instruct the model to cite specific chunks for each claim.

What local RAG does well: answer factual questions over 100,000 documents with 80%+ accuracy, extract specific clauses from contracts, summarize document collections, cross-reference across documents ("compare severance clauses for VPs vs Directors"), and maintain source attribution.

What it struggles with: multi-hop reasoning across 5+ documents (50–65% accuracy), questions requiring implicit knowledge not in the documents, and temporal reasoning about document validity.
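The ingestion step (chunking at paragraph boundaries with overlap) can be sketched as follows. This is a minimal illustration, not any particular tool's implementation: the function name `chunk_paragraphs`, the use of whitespace-separated words as a stand-in for tokens, and the default sizes are all assumptions made for the example.

```python
import re

def chunk_paragraphs(text, max_words=200, overlap_ratio=0.15):
    """Pack whole paragraphs into chunks and carry a word overlap forward.

    Assumptions for this sketch: words approximate tokens, and paragraphs
    are separated by blank lines. A single paragraph longer than max_words
    is kept whole rather than split mid-paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        # Close the current chunk if adding this paragraph would overflow it.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            overlap = round(len(current) * overlap_ratio)
            # Seed the next chunk with the tail of the chunk just emitted.
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With `overlap_ratio` between 0.10 and 0.20 this matches the 10–20% overlap suggested above; a real pipeline would count tokens with the embedding model's tokenizer instead of splitting on whitespace.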

If you just want to try this

Lowest-friction path to a working setup.

Install [AnythingLLM](https://anythingllm.com), a desktop app bundling a complete local RAG pipeline with near-zero configuration. Launch it and, from setup:

  1. **LLM Provider**: "Ollama" → [Llama 3.3 70B](/models/llama-3-3-70b) (or [Qwen 3 32B](/models/qwen-3-32b) for smaller hardware). Ollama must be running with the model pulled: `ollama pull llama3.3:70b-instruct-q4_K_M`.
  2. **Embedding Model**: "Ollama" → "nomic-embed-text".
  3. **Vector Database**: "LanceDB" (built-in, zero-config).
  4. Click "Create Workspace."

Drag PDFs, DOCX, TXT files, or folders into AnythingLLM. It chunks the documents, generates embeddings, stores them, and makes the workspace available for chat. Ask questions, and AnythingLLM retrieves chunks and generates answers with source citations.

Hardware: [Llama 3.3 70B](/models/llama-3-3-70b) at Q4 is ~40 GB of weights, so it does not fit entirely on a single 24 GB card; Ollama offloads the remainder to system RAM at a large speed cost. On smaller hardware, use [Qwen 3 32B](/models/qwen-3-32b) (~19 GB at Q4). The quality drop is smaller for RAG than for general chat because retrieved context grounds the response. On a [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) with 64 GB+: 70B runs in unified memory at 25–35 tok/s.

The three-step version: (1) install [Ollama](/tools/ollama) and pull llama3.3:70b + nomic-embed-text, (2) install AnythingLLM and point it at local Ollama, (3) drop in documents and ask questions. This handles up to ~10,000 documents on a laptop before search latency exceeds 2 seconds. For larger corpora, switch the vector DB from LanceDB to [Qdrant](/tools/qdrant) (one Docker command) to keep search sub-second at much larger corpus sizes. For programmatic control: `pip install langchain chromadb ollama` gives a Python-native RAG pipeline with full control over chunking, retrieval depth, and generation parameters.

For production deployment

Operator-grade recommendation.

Production local RAG requires a chunking strategy, access control, and retrieval quality monitoring. Stack: [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) + [BGE-M3](/models/bge-m3) for embeddings, [Qdrant](/tools/qdrant) for vector search, [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for precision, [vLLM](/tools/vllm) + [Llama 3.3 70B](/models/llama-3-3-70b) for generation.

**Chunking strategy.** Chunk at semantic boundaries (paragraphs, sections, list items), not fixed character counts. Chunk size: 500–1,000 tokens with 10–20% overlap. For legal: clause level (~200–500 tokens) because queries target specific clauses. For medical: paragraph level (~300–800 tokens). Store chunk metadata (source UUID, page number, section heading, sequence number, document date) to enable filtered retrieval and source citation.

**Multi-document retrieval.** Naive top-k global retrieval fails when answers span multiple documents. A query comparing policies from two documents requires chunks from both, and both may not land in the global top-k. Implement document-aware retrieval: retrieve the top 3 chunks per query-relevant document, not the top 30 globally. Group first-stage results by source document, then select the top N within each group.

**Access control.** Tag each chunk with access metadata (user IDs, group IDs, clearance level) and include access filters in every retrieval query. [Qdrant](/tools/qdrant) supports payload filtering: `must: [{key: "access_groups", match: {any: ["legal", "compliance"]}}]`. This adds 1–3ms to query time. Do NOT implement access control via post-retrieval filtering: restricted chunks still compete for top-k slots, and their presence in the embedding index can be inferred.

**Retrieval quality monitoring.** Track precision@k and recall@k on a labeled test set of 100–200 query-document pairs. If precision@10 is below 70% or recall@10 is below 80%, investigate chunk boundary artifacts, embedding drift, or query-document vocabulary mismatch. Use query expansion for legal/medical vocabulary gaps. Log retrieved chunks + generated answers + user feedback for systematic failure identification.

**API vs self-host.** For legal, medical, financial, or compliance-regulated documents: self-host. Sending attorney-client-privileged documents to a third-party API can put privilege at risk in many jurisdictions. HIPAA requires a BAA, which not every AI API provider offers on every tier. For non-sensitive documents under 1,000 queries/month: API-based RAG (OpenAI embeddings + GPT-4) costs ~$50–200/month with zero infrastructure. For sensitive documents at any volume, or non-sensitive documents above 10,000 queries/month: self-hosted local RAG is both the compliance requirement and the cost-effective option.
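The document-aware retrieval and quality-monitoring steps from this section can be sketched in plain Python. This is a hedged illustration: the function names, the hit dictionary keys (`doc_id`, `chunk_id`, `score`), and the `needs_investigation` helper are hypothetical, and `hits` is assumed to be first-stage results that the vector DB has already access-filtered inside the query.

```python
from collections import defaultdict

def document_aware_select(hits, per_doc=3, max_docs=10):
    """Keep the best per_doc chunks from each of the best max_docs documents."""
    by_doc = defaultdict(list)
    for hit in hits:
        by_doc[hit["doc_id"]].append(hit)
    # Rank documents by their single best chunk score.
    ranked_docs = sorted(
        by_doc, key=lambda d: max(h["score"] for h in by_doc[d]), reverse=True
    )[:max_docs]
    selected = []
    for doc_id in ranked_docs:
        best = sorted(by_doc[doc_id], key=lambda h: h["score"], reverse=True)
        selected.extend(best[:per_doc])
    return selected

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    return sum(1 for r in retrieved_ids[:k] if r in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labeled relevant ids that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for r in retrieved_ids[:k] if r in relevant_ids) / len(relevant_ids)

def needs_investigation(p_at_10, r_at_10):
    """Apply the thresholds quoted above: precision@10 < 70% or recall@10 < 80%."""
    return p_at_10 < 0.70 or r_at_10 < 0.80
```

Averaging `precision_at_k` and `recall_at_k` over the labeled test set and feeding the means to `needs_investigation` gives a simple regression check to run after every re-index.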

What breaks

Failure modes operators see in the wild.

**Retrieval misses: wrong chunk retrieved.** Symptom: the answer includes incorrect information because the most relevant chunk wasn't retrieved, and the model faithfully generates from the wrong context. Cause: the embedding model maps the query and the relevant chunk to vectors whose similarity falls below the retrieval cutoff, so the chunk exists but ranks below position 50. Mitigation: increase retrieval depth (top-50 → top-100) and apply reranking. Use hybrid retrieval (dense + BM25 sparse). Implement query rewriting: LLM-generated synonyms improve first-stage recall by 10–25%.

**Hallucination on retrieved context.** Symptom: the model generates an answer contradicting the retrieved chunks; the context says "2-year warranty" but the model outputs "3 years." Cause: the model's internal knowledge competes with the context, and on borderline cases it defaults to internal knowledge. Mitigation: a grounded generation prompt requiring chunk citations and a "documents do not contain this information" reply for unanswerable queries. Add post-generation factuality verification: extract claims, search the chunks for each, and flag answers with >10% unsupported claims.

**PII leakage in embeddings.** Symptom: embedding vectors carry identifiable information that can be recovered through embedding inversion or membership inference attacks. Cause: embedding models encode semantic information about people, so "John Smith's diagnosis" is distinguishable from other patients. Mitigation: apply PII redaction before embedding by replacing names, SSNs, phones, and emails with placeholders ([NAME], [SSN]). Store the mapping in an encrypted database with audit logging. Generate from redacted chunks, then de-redact in the output.

**Chunk boundary incoherence.** Symptom: a retrieved chunk starts or ends mid-sentence, cutting critical context. Cause: naive chunking at fixed character positions ignores sentence boundaries. Mitigation: chunk at sentence boundaries and use document structure (headers, sections). Overlap 1–2 sentences with neighbors. At retrieval, include chunks N-1 and N+1 as auxiliary context, which recovers 10–15% of lost information.

**Embedding drift over time.** Symptom: retrieval degrades over months as new terminology, model updates, and preprocessing changes shift the embedding space. Cause: the index is a snapshot of model + preprocessing at indexing time. Mitigation: re-index quarterly (slow-changing corpora) or monthly (active corpora). Pin the model version hash and never change it without a full re-index. Use a versioned vector DB with blue-green deployment: index to a new Qdrant collection, validate, swap the pointer.
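The redact-before-embedding mitigation for PII leakage can be sketched with the standard library. This is an illustration under stated assumptions: the regexes cover only US-style SSNs, simple phone formats, and emails; the names `PII_PATTERNS`, `redact`, and `restore` are hypothetical; and person names are not handled here because they typically need an NER model rather than a regex. Placeholders are numbered (`[SSN_1]`) so de-redaction is unambiguous, and the mapping would live in the encrypted, audited store described above.

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text):
    """Replace matched PII with numbered placeholders; return text and mapping."""
    mapping = {}
    counters = {}
    for label, pattern in PII_PATTERNS.items():
        def replace(match, label=label):
            counters[label] = counters.get(label, 0) + 1
            token = f"[{label}_{counters[label]}]"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(replace, text)
    return text, mapping

def restore(text, mapping):
    """De-redact generated output by substituting the original values back."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Only the redacted text is embedded and passed to the model; `restore` runs on the final answer, after generation.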

Hardware guidance

Local RAG combines three model types: embedding (lightweight), generation (heavyweight), and an optional reranker (lightweight). Size for the generation model; embedder + reranker together add only ~2.2 GB.

**Hobbyist tier ($600–1,000 system).** [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs 7B–13B generation + embeddings simultaneously. [Qwen 3 30B-A3B](/models/qwen-3-30b-a3b) (MoE, ~3B active) is larger than 12 GB at Q4, so expect partial offload to system RAM; the small active parameter count typically keeps that usable. 7B-class RAG accuracy: 65–75% on legal/medical, which is adequate for personal use, not professional compliance. Use [Ollama](/tools/ollama) or [llama.cpp](/tools/llama-cpp) for multi-model serving.

**SMB tier ($2,500–4,000 system).** [RTX 4090](/hardware/rtx-4090) at 24 GB: [Qwen 3 32B](/models/qwen-3-32b) at Q4 (~19 GB) + BGE-M3 (1.1 GB) + BGE Reranker (1.1 GB) fit on one GPU with under 3 GB left for KV cache, so context must stay short. That is a complete RAG stack on a single consumer card. Llama 3.3 70B at Q4 (~40 GB) does not fit in 24 GB (see the VRAM formula below). [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 64 GB+ unified: full stack at 25–35 tok/s for 70B.

**Professional tier ($8,000–15,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) at 48 GB: 70B at 4-bit (~35 GB) + embedding (1.1 GB) + reranker (1.1 GB) leaves ~10 GB for KV cache, roughly 12K tokens of total context by the formula below. [L40S](/hardware/nvidia-l40s) at 48 GB: the datacenter equivalent; deploy for multi-user RAG via [vLLM](/tools/vllm).

**Enterprise tier ($25,000+).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB: 70B at 4-bit + full stack + ~40 GB KV cache (about 50K tokens of total context by the formula below) at 140–170 tok/s, for 50+ concurrent users. [H200](/hardware/nvidia-h200) at 141 GB: enables [DeepSeek V4](/models/deepseek-v4) generation for RAG over 1M+ corpora. [AMD MI300X](/hardware/amd-mi300x) at 192 GB: full stack with 128K context and 100+ concurrent sessions.

**VRAM formula.** Total = generation model + 2.2 GB (embed + rerank) + KV cache (~0.8 GB per 1K context for 70B) + 2 GB OS. For 70B Q4: ~40 + 2.2 + 13 (16K context) + 2 = ~57 GB, which exceeds any consumer GPU and requires multi-GPU, Apple unified memory, or an enterprise GPU. For 32B: ~19 + 2.2 + 7 + 2 = ~30 GB, which fits an [RTX 5090](/hardware/rtx-5090) at 32 GB. The 70B RAG sweet spot is a used A100 80 GB or an Apple Silicon Mac with 64 GB+ unified memory.
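The VRAM formula above is simple enough to encode as a sizing helper. This sketch uses only the constants quoted in the text (2.2 GB for embedder + reranker, 2 GB OS overhead, ~0.8 GB of KV cache per 1K context for 70B); the function names are hypothetical, and the per-1K KV figure should be treated as this guide's rule of thumb, not a measured value for any specific runtime.

```python
AUX_MODELS_GB = 2.2      # embedder + reranker, from the formula above
OS_OVERHEAD_GB = 2.0     # OS / driver reservation, from the formula above
KV_GB_PER_1K_70B = 0.8   # rule-of-thumb KV cache per 1K context tokens (70B)

def kv_cache_70b_gb(context_k):
    """KV cache estimate for a 70B model at context_k thousand tokens."""
    return KV_GB_PER_1K_70B * context_k

def total_vram_gb(model_gb, kv_cache_gb):
    """Total = generation model + embed/rerank + KV cache + OS overhead."""
    return model_gb + AUX_MODELS_GB + kv_cache_gb + OS_OVERHEAD_GB

def fits(total_gb, device_gb):
    """True when the estimated total fits in the device's memory."""
    return total_gb <= device_gb

# Worked examples from the text.
total_70b = total_vram_gb(40, kv_cache_70b_gb(16))  # 70B Q4 at 16K context
total_32b = total_vram_gb(19, 7)                    # 32B Q4 with 7 GB KV cache
```

Here `fits(total_32b, 32)` holds for a 32 GB card, while `fits(total_70b, 24)` does not, matching the conclusion that 70B needs multi-GPU, unified memory, or an enterprise card.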

Runtime guidance

**AnythingLLM vs LangChain/LlamaIndex vs custom.** [AnythingLLM](https://anythingllm.com) provides a complete local RAG UI: drag in documents, ask questions, view citations. It bundles LanceDB and interfaces with local LLM backends (Ollama, LM Studio). Advantage: zero configuration, offline, 5-minute deployment. Limitation: fixed character-based chunking, no multi-user access control, basic query rewriting. Use it for personal RAG and proof-of-concept work, not for production multi-user RAG.

[LangChain](https://langchain.com) and [LlamaIndex](https://llamaindex.ai) provide composable RAG components: loaders, splitters, embedders, vector stores, retrievers, chains. Advantage: full programmatic control, 50+ vector DBs, 100+ LLMs. Limitation: abstraction layers add debugging complexity. They are the right choice for production RAG that needs customization without building the entire integration layer.

**Custom pipeline.** Build your own when you need fine-grained control, want minimum dependencies (httpx + vector DB client + LLM API), or have a use case that doesn't fit the framework abstractions. A custom pipeline is 200–500 lines orchestrating: document load → chunk → TEI embed → Qdrant upsert → query → search → rerank → vLLM generate. It is more upfront work but eliminates framework debugging overhead.

**Ollama vs llama.cpp backend.** [Ollama](/tools/ollama) serves generation + embedding from one binary: unified API, automatic GPU detection, model library. Limitation: no continuous batching, no reranking, roughly 60% of a dedicated engine's throughput. Right for single-user setups. [llama.cpp](/tools/llama-cpp) server: higher throughput with continuous batching, lower embedding latency. For multi-user (5+): vLLM for generation + TEI for embeddings.

**Decision tree.** Personal RAG: AnythingLLM + Ollama + nomic-embed-text + [Llama 3.3 70B](/models/llama-3-3-70b) or [Qwen 3 32B](/models/qwen-3-32b). Production with 5–50 users: LlamaIndex + TEI (BGE-M3 + BGE Reranker V2 M3) + [Qdrant](/tools/qdrant) + vLLM (Llama 3.3 70B). Production with compliance: custom pipeline, for maximum control and minimum dependency surface. Canonical stack: TEI + BGE-M3 + Qdrant + BGE Reranker V2 M3 + vLLM + Llama 3.3 70B.
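The query-time half of a custom pipeline (query → search → rerank → generate) can be sketched as a thin orchestrator with the backends injected as callables. Everything here is an assumption for illustration: `answer_query`, `build_grounded_prompt`, and the chunk dictionary keys are hypothetical names, and the `embed`, `search`, `rerank`, and `generate` parameters stand in for whatever client code talks to TEI, Qdrant, and vLLM in a real deployment.

```python
from typing import Callable, Sequence

def build_grounded_prompt(question: str, chunks: list) -> str:
    """Inline the chunks with their ids so the model can cite them."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below. Cite the chunk id in brackets "
        "after each claim. If the context does not contain the answer, reply "
        '"documents do not contain this information".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer_query(
    question: str,
    embed: Callable[[str], Sequence[float]],         # e.g. a TEI client
    search: Callable[[Sequence[float], int], list],  # e.g. a Qdrant client
    rerank: Callable[[str, list], list],             # returns chunks best-first
    generate: Callable[[str], str],                  # e.g. a vLLM client
    first_stage_k: int = 50,
    final_n: int = 5,
) -> dict:
    """Run embed -> search -> rerank -> grounded generation for one question."""
    query_vector = embed(question)
    candidates = search(query_vector, first_stage_k)
    top_chunks = rerank(question, candidates)[:final_n]
    prompt = build_grounded_prompt(question, top_chunks)
    return {
        "answer": generate(prompt),
        "sources": [c["id"] for c in top_chunks],
    }
```

Injecting the backends keeps the orchestration testable with stubs and makes it straightforward to swap one component (for example, Ollama for vLLM) without touching the control flow.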

Setup walkthrough

  1. Install LM Studio → download Llama 3.1 8B Q4_K_M (~5 GB).
  2. Install AnythingLLM (anythingllm.com — desktop app, free tier).
  3. In AnythingLLM: Settings → LLM → select "LM Studio" as provider → point to localhost:1234.
  4. Create a workspace → upload your PDFs, DOCXs, TXT files (stored locally, never leave your machine).
  5. Ask questions: "Summarize the key findings from the Q3 report" or "What are the contract termination clauses?"
  6. First answer in 5-15 seconds. All data stays local — the entire pipeline (embedding, retrieval, generation) runs on your machine.
  7. For CLI users: `pip install llama-index` → set up a local embedding model + local LLM → run your own query script, e.g. `python rag_query.py "question"`.

The cheap setup

Used RTX 3060 12 GB ($200-250, see /hardware/rtx-3060-12gb). Runs a full local RAG stack: Nomic Embed Text for indexing (500 docs/second), BGE Reranker for retrieval quality, Llama 3.1 8B for generation at 50-80 tok/s. Handles 1,000+ PDF documents with sub-10-second query latency. Pair with Ryzen 5 5600 + 32 GB DDR4 + 1TB NVMe. Total: ~$400-480. For legal/medical use cases with strict compliance needs, this is the minimum viable private analysis rig.

The serious setup

Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Targets Llama 3.3 70B Q4_K_M for high-quality document analysis at 15-25 tok/s. Note that 70B Q4_K_M is ~40 GB of weights (see the VRAM formula under Hardware guidance), so on a single 24 GB card it needs partial CPU offload or a lower-bit quant, and throughput will fall below that figure; a second 3090 avoids the offload. Handles 10,000+ documents with sub-5-second query latency when combined with a production vector DB (Qdrant/Weaviate). Embedding and reranking fit alongside generation for 1-5 concurrent users. Pair with Ryzen 7 7700X + 64 GB DDR5 + 2TB NVMe. Total: ~$1,800-2,200. For law firms or medical practices, this is the production-tier self-hosted RAG rig.

Common beginner mistake

The mistake: uploading sensitive documents to ChatGPT/Claude "for convenience" because setting up local RAG "seems complicated." Why it fails: cloud AI providers may log prompts, may use data for training depending on tier and settings, and are subject to subpoena. HIPAA obligations, attorney-client privilege, GDPR, and corporate confidentiality can all be breached once data leaves the machine. The fix: use AnythingLLM + LM Studio or Ollama. The setup takes about 15 minutes, all data stays local, and an 8B model delivers roughly 80-90% of the quality for document Q&A. For regulated industries, local-only processing is often a legal requirement rather than a preference, and the convenience isn't worth the compliance risk.

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

Before you buy

Verify your specific hardware can handle private document analysis before committing money.
