Capability notes
Local RAG (Retrieval-Augmented Generation) over private documents is the canonical "local-first wedge" — data must not leave your infrastructure and cloud RAG APIs are a non-starter for compliance. Stack: embed documents → store in vector DB → retrieve relevant chunks at query time → generate answer from retrieved context. All components run locally with open-weight models.
Accuracy on legal documents: local RAG with [BGE-M3](/models/bge-m3) embeddings + [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) + [Llama 3.3 70B](/models/llama-3-3-70b) achieves 78–85% on LegalBench-RAG, vs GPT-4 + OpenAI embeddings at 85–90%: a gap of 5–10 percentage points. For medical (PubMedQA, MedRAG): local [Llama 3.3 70B](/models/llama-3-3-70b) + BGE-M3 scores 72–80% vs MedPaLM 2 at 80–86%. The gap narrowed from 15–20 points (2024) to 5–10 points (2026). For many compliance-driven use cases, a 5–10 point tradeoff is acceptable when the alternative is "cannot use AI."
The pipeline: (1) document ingestion — chunk at paragraph/section boundaries with 10–20% overlap; (2) query processing — embed via BGE-M3, retrieve top-50–100, rerank top-50 via BGE Reranker V2 M3, select top-5–10; (3) grounded generation — inject top-N as context, instruct model to cite specific chunks for each claim.
What local RAG does well: answer factual questions from 100,000 documents with 80%+ accuracy, extract specific clauses from contracts, summarize document collections, cross-reference across documents ("compare severance clauses for VPs vs Directors"), maintain source attribution. What it struggles with: multi-hop reasoning across 5+ documents (50–65% accuracy), questions requiring implicit knowledge not in documents, temporal reasoning about document validity.
If you just want to try this
Lowest-friction path to a working setup.
Install [AnythingLLM](https://anythingllm.com), a desktop app bundling a complete local RAG pipeline with zero configuration. Launch it and work through the setup screens:
1. **LLM Provider**: "Ollama" → [Llama 3.3 70B](/models/llama-3-3-70b) (or [Qwen 3 32B](/models/qwen-3-32b) for smaller hardware). Ollama must be running: `ollama pull llama3.3:70b-instruct-q4_K_M`.
2. **Embedding Model**: "Ollama" → "nomic-embed-text".
3. **Vector Database**: "LanceDB" (built-in, zero-config).
4. Click "Create Workspace."
Drag PDFs, DOCX, TXT files, or folders into AnythingLLM. It chunks documents, generates embeddings, stores them, makes the workspace available for chat. Ask questions — AnythingLLM retrieves chunks and generates answers with source citations.
Hardware: [Llama 3.3 70B](/models/llama-3-3-70b) Q4 is a ~40 GB model; on a 24 GB card Ollama offloads the remainder to system RAM at a steep speed penalty. On 12–16 GB, use [Qwen 3 32B](/models/qwen-3-32b) instead. The quality drop is smaller for RAG than general chat because retrieved context grounds the response. On a [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) with 64 GB+ unified memory, the 70B runs entirely in memory at 25–35 tok/s.
3-click setup: (1) install [Ollama](/tools/ollama) → pull llama3.3:70b + nomic-embed-text, (2) install AnythingLLM → point at local Ollama, (3) drop documents → ask questions. Handles up to ~10,000 documents on a laptop before search latency exceeds 2 seconds. For larger corpora, switch vector DB from LanceDB to [Qdrant](/tools/qdrant) (Docker one-command) for sub-second search at any corpus size.
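The Qdrant switch really is one command; this is roughly the project's documented quickstart invocation (the local storage path is your choice):

```bash
# Start a local Qdrant instance; vectors persist in ./qdrant_storage across restarts
docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant
```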
For programmatic control: `pip install langchain chromadb ollama` for a Python-native RAG pipeline with full control over chunking, retrieval depth, and generation parameters.
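A minimal sketch of that programmatic path, using the chromadb and ollama Python clients directly rather than LangChain's wrappers; the model tags and exact client method names are assumptions that can differ between library versions:

```python
import chromadb
import ollama

# Persistent local vector store (file-backed, no server process needed)
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection("documents")

def embed(text: str) -> list[float]:
    # Requires `ollama pull nomic-embed-text`; newer ollama clients also offer ollama.embed()
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def index(doc_id: str, chunks: list[str]) -> None:
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        embeddings=[embed(c) for c in chunks],
        documents=chunks,
        metadatas=[{"source": doc_id, "seq": i} for i in range(len(chunks))],
    )

def ask(question: str, k: int = 5) -> str:
    hits = collection.query(query_embeddings=[embed(question)], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    reply = ollama.chat(
        model="llama3.3:70b-instruct-q4_K_M",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context and cite the passage you used."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply["message"]["content"]
```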
For production deployment
Operator-grade recommendation.
Production local RAG requires chunking strategy, access control, and retrieval quality monitoring. Stack: [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) + [BGE-M3](/models/bge-m3) for embeddings, [Qdrant](/tools/qdrant) for vector search, [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for precision, [vLLM](/tools/vllm) + [Llama 3.3 70B](/models/llama-3-3-70b) for generation.
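One way to stand up that stack is a single Docker Compose file. The sketch below is illustrative only: image tags, model IDs, ports, and flags should be checked against each project's current documentation, and GPU device reservations are omitted.

```yaml
# Sketch only: verify current image tags, model IDs, and GPU flags before deploying.
services:
  embeddings:                 # TEI serving BGE-M3
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: --model-id BAAI/bge-m3
    ports: ["8081:80"]
  reranker:                   # second TEI instance serving the reranker
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: --model-id BAAI/bge-reranker-v2-m3
    ports: ["8082:80"]
  qdrant:                     # vector search
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes: ["./qdrant_storage:/qdrant/storage"]
  generation:                 # vLLM OpenAI-compatible endpoint
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3.3-70B-Instruct --max-model-len 16384
    ports: ["8000:8000"]
```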
**Chunking strategy.** Chunk at semantic boundaries (paragraphs, sections, list items), not fixed character counts. Chunk size: 500–1,000 tokens with 10–20% overlap. For legal: clause level (~200–500 tokens) because queries target specific clauses. For medical: paragraph level (~300–800 tokens). Store chunk metadata: source UUID, page number, section heading, sequence number, document date — enables filtered retrieval and source citation.
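A sketch of paragraph-boundary chunking with overlap and a few of the metadata fields above (page number and section heading are omitted for brevity); the token counter is a rough whitespace heuristic, not a real tokenizer:

```python
import uuid

def chunk_document(text: str, source_id: str, doc_date: str,
                   max_tokens: int = 800, overlap_paras: int = 1) -> list[dict]:
    """Pack blank-line-separated paragraphs into ~max_tokens chunks,
    overlapping neighbouring chunks by `overlap_paras` paragraphs."""
    def approx_tokens(parts):  # rough heuristic: ~1.3 tokens per whitespace word
        return int(sum(len(p.split()) for p in parts) * 1.3)

    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups: list[list[str]] = []
    current: list[str] = []
    for para in paras:
        if current and approx_tokens(current + [para]) > max_tokens:
            groups.append(current)
            current = current[-overlap_paras:]   # carry overlap into the next chunk
        current.append(para)
    if current:
        groups.append(current)

    return [{
        "id": str(uuid.uuid4()),     # chunk UUID
        "source": source_id,         # source document UUID / path
        "seq": i,                    # sequence number within the document
        "date": doc_date,            # document date for temporal filtering
        "text": "\n\n".join(parts),
    } for i, parts in enumerate(groups)]
```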
**Multi-document retrieval.** Naive top-k global retrieval fails when answers span multiple documents. A query comparing policies from two documents requires chunks from both, which may not both be in top-k. Implement document-aware retrieval: retrieve top-3 chunks per query-relevant document, not top-30 globally. Group first-stage results by source document, then select top-N within each group.
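A sketch of that grouping step, assuming first-stage hits arrive as dicts carrying a `source` field and a `score`:

```python
from collections import defaultdict

def document_aware_select(hits: list[dict], per_doc: int = 3, max_docs: int = 10) -> list[dict]:
    """Group first-stage results by source document, keep the top `per_doc`
    chunks within each document, and cap how many documents are considered."""
    by_doc: dict[str, list[dict]] = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        by_doc[hit["source"]].append(hit)

    # Rank documents by their best-scoring chunk, then take per-document winners
    ranked_docs = sorted(by_doc.values(), key=lambda group: group[0]["score"], reverse=True)
    selected = []
    for group in ranked_docs[:max_docs]:
        selected.extend(group[:per_doc])
    return selected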
**Access control.** Tag each chunk with access metadata (user IDs, group IDs, clearance level) and include access filters in every retrieval query. [Qdrant](/tools/qdrant) supports payload filtering: `must: [{key: "access_groups", match: {any: ["legal", "compliance"]}}]`. This adds 1–3ms to query time. Do NOT implement access control via post-retrieval filtering — chunk presence in the embedding index can be inferred.
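With the qdrant-client Python package, that filter looks roughly like this (the collection name is an assumption; the field name comes from the text, and recent client versions prefer `query_points` over the older `search` used here):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny

client = QdrantClient(url="http://localhost:6333")

def search_with_acl(query_vector: list[float], user_groups: list[str], k: int = 50):
    # The filter is applied inside the vector search itself, never after it,
    # so chunks the caller cannot see are never scored or returned.
    return client.search(
        collection_name="documents",
        query_vector=query_vector,
        query_filter=Filter(must=[
            FieldCondition(key="access_groups", match=MatchAny(any=user_groups)),
        ]),
        limit=k,
    )
```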
**Retrieval quality monitoring.** Track precision@k and recall@k on a labeled test set of 100–200 query-document pairs. If precision@10 <70% or recall@10 <80%, investigate chunk boundary artifacts, embedding drift, or query-document vocabulary mismatch. Use query expansion for legal/medical vocabulary gaps. Log retrieved chunks + generated answers + user feedback for systematic failure identification.
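A sketch of the metric computation over that labeled set, where each test case maps a query to the set of chunk IDs that should have been retrieved:

```python
def precision_recall_at_k(test_set: list[dict], retrieve, k: int = 10) -> tuple[float, float]:
    """test_set items look like {"query": str, "relevant": set of chunk IDs};
    `retrieve(query, k)` returns a ranked list of chunk IDs. Every case is
    assumed to have at least one relevant chunk."""
    precisions, recalls = [], []
    for case in test_set:
        retrieved = retrieve(case["query"], k)[:k]
        hits = len(set(retrieved) & case["relevant"])
        precisions.append(hits / k)
        recalls.append(hits / len(case["relevant"]))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Thresholds from the text: investigate if precision@10 < 0.70 or recall@10 < 0.80.
```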
**API vs self-host.** For legal, medical, financial, or compliance-regulated documents: self-host. Sending attorney-client-privileged documents to OpenAI's API waives privilege in many jurisdictions. HIPAA requires a BAA, which most AI API providers don't offer. For non-sensitive documents under 1,000 queries/month: API-based RAG (OpenAI embeddings + GPT-4) costs ~$50–200/month with zero infrastructure. For sensitive documents at any volume, or non-sensitive above 10,000 queries/month: self-hosted local RAG is both the compliance requirement and the cost-effective option.
What breaks
Failure modes operators see in the wild.
**Retrieval misses — wrong chunk retrieved.** Symptom: answer includes incorrect information because the most relevant chunk wasn't retrieved. Model faithfully generates from wrong context. Cause: embedding model places the query and the relevant chunk too far apart in vector space — the chunk exists in the index but ranks below position 50. Mitigation: increase retrieval depth (top-50 → top-100) and apply reranking. Use hybrid retrieval (dense + BM25 sparse). Implement query rewriting — LLM-generated synonyms improve first-stage recall by 10–25%.
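A common way to combine the dense and BM25 result lists is reciprocal rank fusion; a sketch, assuming each retriever returns a ranked list of chunk IDs:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists (e.g. dense + BM25) by summing 1 / (k + rank);
    k=60 is the damping constant commonly used with RRF."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([dense_top100, bm25_top100])[:100]  # then rerank the top 50
```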
**Hallucination on retrieved context.** Symptom: model generates answer contradicting retrieved chunks — context says "2-year warranty" but model outputs "3 years." Cause: model's internal knowledge competes with context; on borderline cases, defaults to internal knowledge. Mitigation: grounded generation prompt requiring chunk citations and "documents do not contain this information" for unanswerable queries. Post-generation factuality verification: extract claims, search chunks for each, flag answers with >10% unsupported claims.
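A sketch of a grounded-generation prompt in the spirit described above; the wording is illustrative, not a benchmarked template, and the numbering helper is a hypothetical convenience:

```python
GROUNDED_PROMPT = """You are answering questions strictly from the numbered document
excerpts below. Rules:
1. Every factual claim must cite the excerpt it comes from, e.g. [chunk 3].
2. If the excerpts do not contain the answer, reply exactly:
   "The documents do not contain this information."
3. Never use knowledge that is not in the excerpts, even if you believe it is correct.

Excerpts:
{numbered_chunks}

Question: {question}
"""

def build_prompt(chunks: list[str], question: str) -> str:
    numbered = "\n\n".join(f"[chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    return GROUNDED_PROMPT.format(numbered_chunks=numbered, question=question)
```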
**PII leakage in embeddings.** Symptom: embedding vectors encode identifiable information recoverable via embedding-inversion or membership-inference attacks. Cause: embedding models encode semantic information about people — "John Smith's diagnosis" is distinguishable from other patients' records. Mitigation: apply PII redaction before embedding — replace names, SSNs, phones, emails with placeholders ([NAME], [SSN]). Store the mapping in an encrypted database with audit logging. Generate from redacted chunks, de-redact in output.
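A regex-only sketch of the redaction step; real deployments typically add an NER model for names, which simple patterns cannot catch reliably, and the mapping here is deliberately simplified:

```python
import re

PII_PATTERNS = {
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[PHONE]": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(text: str) -> tuple[str, dict[str, list[str]]]:
    """Replace PII with placeholders before embedding. Returns the redacted text
    plus the placeholder -> original-values mapping, to be stored encrypted with
    audit logging and used for de-redaction after generation."""
    mapping: dict[str, list[str]] = {}
    for placeholder, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            mapping[placeholder] = matches
            text = pattern.sub(placeholder, text)
    return text, mapping
```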
**Chunk boundary incoherence.** Symptom: retrieved chunk starts/ends mid-sentence, cutting critical context. Cause: naive chunking at fixed character positions ignores sentence boundaries. Mitigation: chunk at sentence boundaries, use document structure (headers, sections). Overlap 1–2 sentences with neighbors. At retrieval, include chunks N-1 and N+1 as auxiliary context — recovers 10–15% of lost information.
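A sketch of that neighbor expansion at retrieval time, assuming chunks were stored with the `source` and `seq` metadata from the chunking sketch so adjacent chunks can be looked up directly:

```python
def expand_with_neighbors(hits: list[dict], lookup) -> list[dict]:
    """For each retrieved chunk, pull chunks seq-1 and seq+1 from the same
    document as auxiliary context. `lookup(source, seq)` returns a chunk dict or None."""
    expanded, seen = [], set()
    for hit in hits:
        for seq in (hit["seq"] - 1, hit["seq"], hit["seq"] + 1):
            neighbor = hit if seq == hit["seq"] else lookup(hit["source"], seq)
            if neighbor and (neighbor["source"], neighbor["seq"]) not in seen:
                seen.add((neighbor["source"], neighbor["seq"]))
                expanded.append(neighbor)
    return expanded
```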
**Embedding drift over time.** Symptom: retrieval degrades over months — new terminology, model updates, preprocessing changes shift embedding space. Cause: space is a snapshot of model + preprocessing at indexing time. Mitigation: re-index quarterly (slow-changing) or monthly (active). Pin model version hash; never change without full re-index. Use versioned vector DB with blue-green deployment: index to new Qdrant collection, validate, swap pointer.
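Qdrant's collection aliases support the blue-green swap described above; a sketch with the qdrant-client Python package (collection names are assumptions, and the alias API may differ slightly between client versions):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    CreateAlias, CreateAliasOperation, DeleteAlias, DeleteAliasOperation,
)

client = QdrantClient(url="http://localhost:6333")

# 1. Re-index everything into a fresh collection (e.g. "documents_2026q2"),
# 2. validate retrieval metrics against the labeled test set,
# 3. then atomically repoint the alias the application queries:
client.update_collection_aliases(change_aliases_operations=[
    DeleteAliasOperation(delete_alias=DeleteAlias(alias_name="documents")),
    CreateAliasOperation(create_alias=CreateAlias(
        collection_name="documents_2026q2", alias_name="documents")),
])
```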
Hardware guidance
Local RAG combines three model types: embedding (lightweight), generation (heavyweight), and optional reranker (lightweight). Size for the generation model — embedder + reranker are negligible overhead.
**Hobbyist tier ($600–1,000 system).** [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs 7B–13B generation + embeddings simultaneously. [Qwen 3 30B-A3B](/models/qwen-3-30b-a3b) (MoE, ~7B active) at Q4 fits in 12 GB. 7B-class RAG accuracy: 65–75% on legal/medical — adequate for personal use, not professional compliance. Use [Ollama](/tools/ollama) or [llama.cpp](/tools/llama-cpp) for multi-model serving.
**SMB tier ($2,500–4,000 system).** [RTX 4090](/hardware/rtx-4090) at 24 GB: [Qwen 3 32B](/models/qwen-3-32b) Q4 generation (~19 GB) + BGE-M3 (1.1 GB) + BGE Reranker (1.1 GB) all on one GPU with a few GB of headroom — a complete RAG stack on a single consumer card. 70B Q4 (~40 GB) does not fit in 24 GB. For 70B-class generation at this tier, a [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 64 GB+ unified memory runs the full stack at 25–35 tok/s.
**Professional tier ($8,000–15,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) at 48 GB: 70B Q4 (~40 GB) + embedding (1.1 GB) + reranker (1.1 GB) leaves roughly 4–6 GB for KV cache — enough for moderate context lengths and a handful of concurrent users. [L40S](/hardware/nvidia-l40s) at 48 GB: datacenter equivalent, deploy for multi-user RAG via [vLLM](/tools/vllm).
**Enterprise tier ($25,000+).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB: 70B Q4 (~40 GB) + full stack + roughly 35 GB of KV cache for long contexts at 140–170 tok/s and 50+ concurrent users. [H200](/hardware/nvidia-h200) at 141 GB: enables [DeepSeek V4](/models/deepseek-v4) generation for RAG over 1M+ document corpora. [AMD MI300X](/hardware/amd-mi300x) at 192 GB: full stack with 128K context and 100+ concurrent sessions.
**VRAM formula.** Total = generation model + 2.2 GB (embed + rerank) + KV cache (~0.8 GB per 1K context for 70B) + 2 GB OS. For 70B Q4: ~40 + 2.2 + 13 (16K context) + 2 = ~57 GB — exceeds consumer GPU, requires multi-GPU, Apple unified memory, or enterprise GPU. For 32B: ~19 + 2.2 + 7 + 2 = ~30 GB — fits [RTX 5090](/hardware/rtx-5090) at 32 GB. The 70B RAG sweet spot is a used A100 80 GB or Apple Silicon Mac with 64 GB+ unified memory.
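The same formula as a small calculator, using the text's rule-of-thumb constants; the 0.44 GB per 1K KV rate for the 32B example is back-solved from the ~7 GB figure above and is an assumption:

```python
def rag_vram_gb(model_weights_gb: float, context_k_tokens: float,
                kv_gb_per_1k: float = 0.8,      # ~0.8 GB per 1K context for a 70B model
                embed_rerank_gb: float = 2.2,   # BGE-M3 + BGE Reranker V2 M3
                os_overhead_gb: float = 2.0) -> float:
    """Rule-of-thumb total VRAM for a local RAG stack."""
    return model_weights_gb + embed_rerank_gb + context_k_tokens * kv_gb_per_1k + os_overhead_gb

print(rag_vram_gb(40, 16))                      # 70B Q4, 16K context -> ~57 GB
print(rag_vram_gb(19, 16, kv_gb_per_1k=0.44))   # 32B Q4, 16K context -> ~30 GB
```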
Runtime guidance
**AnythingLLM vs LangChain/LlamaIndex vs custom.**
[AnythingLLM](https://anythingllm.com) provides complete local RAG UI — drag documents, ask questions, view citations. Bundles LanceDB, interfaces with local LLM backends (Ollama, LM Studio). Advantage: zero configuration, offline, 5-minute deployment. Limitation: fixed character-based chunking, no multi-user access control, basic query rewriting. Use for personal RAG and proof-of-concept. Not for production multi-user RAG.
[LangChain](https://langchain.com) and [LlamaIndex](https://llamaindex.ai) provide composable RAG components — loaders, splitters, embedders, vector stores, retrievers, chains. Advantage: full programmatic control, 50+ vector DBs, 100+ LLMs. Limitation: abstraction layers add debugging complexity. Right for production RAG needing customization without building entire integration layer.
**Custom pipeline.** Build your own when: fine-grained control needed, minimum dependencies (httpx + vector DB client + LLM API), or use case doesn't fit framework abstraction. Custom is 200–500 lines orchestrating: document load → chunk → TEI embed → Qdrant upsert → query → search → rerank → vLLM generate. More upfront work, eliminates framework debugging overhead.
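A compressed skeleton of that orchestration, reusing the chunking and prompt sketches above and assuming TEI on :8081, a TEI reranker on :8082, Qdrant on :6333, and vLLM's OpenAI-compatible server on :8000; ports, collection name, and payload fields are assumptions:

```python
import httpx
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

TEI_EMBED, TEI_RERANK, VLLM = "http://localhost:8081", "http://localhost:8082", "http://localhost:8000"
# Assumes a "documents" collection already created with BGE-M3's 1024-dim vectors
qdrant = QdrantClient(url="http://localhost:6333")

def embed(texts: list[str]) -> list[list[float]]:
    return httpx.post(f"{TEI_EMBED}/embed", json={"inputs": texts}, timeout=60).json()

def index(chunks: list[dict]) -> None:           # chunks from chunk_document() above
    vectors = embed([c["text"] for c in chunks])
    qdrant.upsert("documents", points=[
        PointStruct(id=c["id"], vector=v, payload=c) for c, v in zip(chunks, vectors)
    ])

def answer(question: str) -> str:
    hits = qdrant.search("documents", query_vector=embed([question])[0], limit=50)
    texts = [h.payload["text"] for h in hits]
    scores = httpx.post(f"{TEI_RERANK}/rerank",
                        json={"query": question, "texts": texts}, timeout=60).json()
    top = [texts[s["index"]] for s in sorted(scores, key=lambda s: s["score"], reverse=True)[:8]]
    resp = httpx.post(f"{VLLM}/v1/chat/completions", json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        # build_prompt comes from the grounded-generation sketch above
        "messages": [{"role": "user", "content": build_prompt(top, question)}],
    }, timeout=120).json()
    return resp["choices"][0]["message"]["content"]
```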
**Ollama vs llama.cpp backend.** [Ollama](/tools/ollama) serves generation + embedding from one binary — unified API, auto GPU, model library. Limitation: no continuous batching, no reranking, roughly 60% of a dedicated engine's throughput. Right for single-user. [llama.cpp](/tools/llama-cpp) server: higher throughput with continuous batching, lower embedding latency. For multi-user (5+): vLLM for generation + TEI for embeddings.
**Decision tree.** Personal RAG: AnythingLLM + Ollama + nomic-embed-text + [Llama 3.3 70B](/models/llama-3-3-70b) or [Qwen 3 32B](/models/qwen-3-32b). Production 5–50 users: LlamaIndex + TEI (BGE-M3 + BGE Reranker V2 M3) + [Qdrant](/tools/qdrant) + vLLM (Llama 3.3 70B). Production with compliance: custom pipeline — maximum control, minimum dependency surface. Canonical stack: TEI + BGE-M3 + Qdrant + BGE Reranker V2 M3 + vLLM + Llama 3.3 70B.
Hardware buying guidance for Private Document Analysis
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.