Capability notes
Local RAG (Retrieval-Augmented Generation) over private documents is the canonical "local-first wedge" — data must not leave your infrastructure and cloud RAG APIs are a non-starter for compliance. Stack: embed documents → store in vector DB → retrieve relevant chunks at query time → generate answer from retrieved context. All components run locally with open-weight models.
Accuracy on legal documents: local RAG with [BGE-M3](/models/bge-m3) embeddings + [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) + [Llama 3.3 70B](/models/llama-3-3-70b) achieves 78–85% on LegalBench-RAG, vs GPT-4 + OpenAI embeddings at 85–90%: a gap of 5–10 percentage points. For medical (PubMedQA, MedRAG): local [Llama 3.3 70B](/models/llama-3-3-70b) + BGE-M3 scores 72–80% vs MedPaLM 2 at 80–86%. The gap narrowed from 15–20 points (2024) to 5–10 points (2026). For many compliance-driven use cases, a 5–10 point tradeoff is acceptable when the alternative is "cannot use AI."
The pipeline: (1) document ingestion — chunk at paragraph/section boundaries with 10–20% overlap; (2) query processing — embed via BGE-M3, retrieve top-50–100, rerank top-50 via BGE Reranker V2 M3, select top-5–10; (3) grounded generation — inject top-N as context, instruct model to cite specific chunks for each claim.
What local RAG does well: answer factual questions from 100,000 documents with 80%+ accuracy, extract specific clauses from contracts, summarize document collections, cross-reference across documents ("compare severance clauses for VPs vs Directors"), maintain source attribution. What it struggles with: multi-hop reasoning across 5+ documents (50–65% accuracy), questions requiring implicit knowledge not in documents, temporal reasoning about document validity.
If you just want to try this
Lowest-friction path to a working setup.
Install [AnythingLLM](https://anythingllm.com), a desktop app bundling a complete local RAG pipeline with zero configuration. Launch it and work through the setup screens:
1. **LLM Provider**: "Ollama" → [Llama 3.3 70B](/models/llama-3-3-70b) (or [Qwen 3 32B](/models/qwen-3-32b) for smaller hardware). Ollama must be running: `ollama pull llama3.3:70b-instruct-q4_K_M`.
2. **Embedding Model**: "Ollama" → "nomic-embed-text".
3. **Vector Database**: "LanceDB" (built-in, zero-config).
4. Click "Create Workspace."
Drag PDFs, DOCX, TXT files, or folders into AnythingLLM. It chunks documents, generates embeddings, stores them, makes the workspace available for chat. Ask questions — AnythingLLM retrieves chunks and generates answers with source citations.
Hardware: [Llama 3.3 70B](/models/llama-3-3-70b) Q4 is a ~40 GB model; on a 24 GB card Ollama offloads the remainder to system RAM at a steep speed penalty. On 12–16 GB, use [Qwen 3 32B](/models/qwen-3-32b) instead. The quality drop is smaller for RAG than general chat because retrieved context grounds the response. On a [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) with 64 GB+ unified memory, the 70B runs entirely in memory at 25–35 tok/s.
3-click setup: (1) install [Ollama](/tools/ollama) → pull llama3.3:70b + nomic-embed-text, (2) install AnythingLLM → point at local Ollama, (3) drop documents → ask questions. Handles up to ~10,000 documents on a laptop before search latency exceeds 2 seconds. For larger corpora, switch vector DB from LanceDB to [Qdrant](/tools/qdrant) (Docker one-command) for sub-second search at any corpus size.
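The Qdrant switch really is one command; this is roughly the project's documented quickstart invocation (the local storage path is your choice):

```bash
# Start a local Qdrant instance; vectors persist in ./qdrant_storage across restarts
docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant
```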
For programmatic control: `pip install langchain chromadb ollama` for a Python-native RAG pipeline with full control over chunking, retrieval depth, and generation parameters.
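A minimal sketch of that programmatic path, using the chromadb and ollama Python clients directly rather than LangChain's wrappers; the model tags and exact client method names are assumptions that can differ between library versions:

```python
import chromadb
import ollama

# Persistent local vector store (file-backed, no server process needed)
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection("documents")

def embed(text: str) -> list[float]:
    # Requires `ollama pull nomic-embed-text`; newer ollama clients also offer ollama.embed()
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def index(doc_id: str, chunks: list[str]) -> None:
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        embeddings=[embed(c) for c in chunks],
        documents=chunks,
        metadatas=[{"source": doc_id, "seq": i} for i in range(len(chunks))],
    )

def ask(question: str, k: int = 5) -> str:
    hits = collection.query(query_embeddings=[embed(question)], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    reply = ollama.chat(
        model="llama3.3:70b-instruct-q4_K_M",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context and cite the passage you used."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply["message"]["content"]
```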
For production deployment
Operator-grade recommendation.
Production local RAG requires chunking strategy, access control, and retrieval quality monitoring. Stack: [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) + [BGE-M3](/models/bge-m3) for embeddings, [Qdrant](/tools/qdrant) for vector search, [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for precision, [vLLM](/tools/vllm) + [Llama 3.3 70B](/models/llama-3-3-70b) for generation.
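One way to stand up that stack is a single Docker Compose file. The sketch below is illustrative only: image tags, model IDs, ports, and flags should be checked against each project's current documentation, and GPU device reservations are omitted.

```yaml
# Sketch only: verify current image tags, model IDs, and GPU flags before deploying.
services:
  embeddings:                 # TEI serving BGE-M3
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: --model-id BAAI/bge-m3
    ports: ["8081:80"]
  reranker:                   # second TEI instance serving the reranker
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: --model-id BAAI/bge-reranker-v2-m3
    ports: ["8082:80"]
  qdrant:                     # vector search
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes: ["./qdrant_storage:/qdrant/storage"]
  generation:                 # vLLM OpenAI-compatible endpoint
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3.3-70B-Instruct --max-model-len 16384
    ports: ["8000:8000"]
```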
**Chunking strategy.** Chunk at semantic boundaries (paragraphs, sections, list items), not fixed character counts. Chunk size: 500–1,000 tokens with 10–20% overlap. For legal: clause level (~200–500 tokens) because queries target specific clauses. For medical: paragraph level (~300–800 tokens). Store chunk metadata: source UUID, page number, section heading, sequence number, document date — enables filtered retrieval and source citation.
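A sketch of paragraph-boundary chunking with overlap and a few of the metadata fields above (page number and section heading are omitted for brevity); the token counter is a rough whitespace heuristic, not a real tokenizer:

```python
import uuid

def chunk_document(text: str, source_id: str, doc_date: str,
                   max_tokens: int = 800, overlap_paras: int = 1) -> list[dict]:
    """Pack blank-line-separated paragraphs into ~max_tokens chunks,
    overlapping neighbouring chunks by `overlap_paras` paragraphs."""
    def approx_tokens(parts):  # rough heuristic: ~1.3 tokens per whitespace word
        return int(sum(len(p.split()) for p in parts) * 1.3)

    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups: list[list[str]] = []
    current: list[str] = []
    for para in paras:
        if current and approx_tokens(current + [para]) > max_tokens:
            groups.append(current)
            current = current[-overlap_paras:]   # carry overlap into the next chunk
        current.append(para)
    if current:
        groups.append(current)

    return [{
        "id": str(uuid.uuid4()),     # chunk UUID
        "source": source_id,         # source document UUID / path
        "seq": i,                    # sequence number within the document
        "date": doc_date,            # document date for temporal filtering
        "text": "\n\n".join(parts),
    } for i, parts in enumerate(groups)]
```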
**Multi-document retrieval.** Naive top-k global retrieval fails when answers span multiple documents. A query comparing policies from two documents requires chunks from both, which may not both be in top-k. Implement document-aware retrieval: retrieve top-3 chunks per query-relevant document, not top-30 globally. Group first-stage results by source document, then select top-N within each group.
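A sketch of that grouping step, assuming first-stage hits arrive as dicts carrying a `source` field and a `score`:

```python
from collections import defaultdict

def document_aware_select(hits: list[dict], per_doc: int = 3, max_docs: int = 10) -> list[dict]:
    """Group first-stage results by source document, keep the top `per_doc`
    chunks within each document, and cap how many documents are considered."""
    by_doc: dict[str, list[dict]] = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        by_doc[hit["source"]].append(hit)

    # Rank documents by their best-scoring chunk, then take per-document winners
    ranked_docs = sorted(by_doc.values(), key=lambda group: group[0]["score"], reverse=True)
    selected = []
    for group in ranked_docs[:max_docs]:
        selected.extend(group[:per_doc])
    return selected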
**Access control.** Tag each chunk with access metadata (user IDs, group IDs, clearance level) and include access filters in every retrieval query. [Qdrant](/tools/qdrant) supports payload filtering: `must: [{key: "access_groups", match: {any: ["legal", "compliance"]}}]`. This adds 1–3ms to query time. Do NOT implement access control via post-retrieval filtering — chunk presence in the embedding index can be inferred.
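With the qdrant-client Python package, that filter looks roughly like this (the collection name is an assumption; the field name comes from the text, and recent client versions prefer `query_points` over the older `search` used here):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny

client = QdrantClient(url="http://localhost:6333")

def search_with_acl(query_vector: list[float], user_groups: list[str], k: int = 50):
    # The filter is applied inside the vector search itself, never after it,
    # so chunks the caller cannot see are never scored or returned.
    return client.search(
        collection_name="documents",
        query_vector=query_vector,
        query_filter=Filter(must=[
            FieldCondition(key="access_groups", match=MatchAny(any=user_groups)),
        ]),
        limit=k,
    )
```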
**Retrieval quality monitoring.** Track precision@k and recall@k on a labeled test set of 100–200 query-document pairs. If precision@10 <70% or recall@10 <80%, investigate chunk boundary artifacts, embedding drift, or query-document vocabulary mismatch. Use query expansion for legal/medical vocabulary gaps. Log retrieved chunks + generated answers + user feedback for systematic failure identification.
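A sketch of the metric computation over that labeled set, where each test case maps a query to the set of chunk IDs that should have been retrieved:

```python
def precision_recall_at_k(test_set: list[dict], retrieve, k: int = 10) -> tuple[float, float]:
    """test_set items look like {"query": str, "relevant": set of chunk IDs};
    `retrieve(query, k)` returns a ranked list of chunk IDs. Every case is
    assumed to have at least one relevant chunk."""
    precisions, recalls = [], []
    for case in test_set:
        retrieved = retrieve(case["query"], k)[:k]
        hits = len(set(retrieved) & case["relevant"])
        precisions.append(hits / k)
        recalls.append(hits / len(case["relevant"]))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Thresholds from the text: investigate if precision@10 < 0.70 or recall@10 < 0.80.
```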
**API vs self-host.** For legal, medical, financial, or compliance-regulated documents: self-host. Sending attorney-client-privileged documents to OpenAI's API waives privilege in many jurisdictions. HIPAA requires a BAA, which most AI API providers don't offer. For non-sensitive documents under 1,000 queries/month: API-based RAG (OpenAI embeddings + GPT-4) costs ~$50–200/month with zero infrastructure. For sensitive documents at any volume, or non-sensitive above 10,000 queries/month: self-hosted local RAG is both the compliance requirement and the cost-effective option.
What breaks
Failure modes operators see in the wild.
**Retrieval misses — wrong chunk retrieved.** Symptom: answer includes incorrect information because the most relevant chunk wasn't retrieved. Model faithfully generates from wrong context. Cause: embedding model places the query and the relevant chunk too far apart in vector space — the chunk exists in the index but ranks below position 50. Mitigation: increase retrieval depth (top-50 → top-100) and apply reranking. Use hybrid retrieval (dense + BM25 sparse). Implement query rewriting — LLM-generated synonyms improve first-stage recall by 10–25%.
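A common way to combine the dense and BM25 result lists is reciprocal rank fusion; a sketch, assuming each retriever returns a ranked list of chunk IDs:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists (e.g. dense + BM25) by summing 1 / (k + rank);
    k=60 is the damping constant commonly used with RRF."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([dense_top100, bm25_top100])[:100]  # then rerank the top 50
```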
**Hallucination on retrieved context.** Symptom: model generates answer contradicting retrieved chunks — context says "2-year warranty" but model outputs "3 years." Cause: model's internal knowledge competes with context; on borderline cases, defaults to internal knowledge. Mitigation: grounded generation prompt requiring chunk citations and "documents do not contain this information" for unanswerable queries. Post-generation factuality verification: extract claims, search chunks for each, flag answers with >10% unsupported claims.
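A sketch of a grounded-generation prompt in the spirit described above; the wording is illustrative, not a benchmarked template, and the numbering helper is a hypothetical convenience:

```python
GROUNDED_PROMPT = """You are answering questions strictly from the numbered document
excerpts below. Rules:
1. Every factual claim must cite the excerpt it comes from, e.g. [chunk 3].
2. If the excerpts do not contain the answer, reply exactly:
   "The documents do not contain this information."
3. Never use knowledge that is not in the excerpts, even if you believe it is correct.

Excerpts:
{numbered_chunks}

Question: {question}
"""

def build_prompt(chunks: list[str], question: str) -> str:
    numbered = "\n\n".join(f"[chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    return GROUNDED_PROMPT.format(numbered_chunks=numbered, question=question)
```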
**PII leakage in embeddings.** Symptom: embedding vectors encode identifiable information recoverable via embedding-inversion or membership-inference attacks. Cause: embedding models encode semantic information about people — "John Smith's diagnosis" is distinguishable from other patients' records. Mitigation: apply PII redaction before embedding — replace names, SSNs, phones, emails with placeholders ([NAME], [SSN]). Store the mapping in an encrypted database with audit logging. Generate from redacted chunks, de-redact in output.
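A regex-only sketch of the redaction step; real deployments typically add an NER model for names, which simple patterns cannot catch reliably, and the mapping here is deliberately simplified:

```python
import re

PII_PATTERNS = {
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[PHONE]": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(text: str) -> tuple[str, dict[str, list[str]]]:
    """Replace PII with placeholders before embedding. Returns the redacted text
    plus the placeholder -> original-values mapping, to be stored encrypted with
    audit logging and used for de-redaction after generation."""
    mapping: dict[str, list[str]] = {}
    for placeholder, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            mapping[placeholder] = matches
            text = pattern.sub(placeholder, text)
    return text, mapping
```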
**Chunk boundary incoherence.** Symptom: retrieved chunk starts/ends mid-sentence, cutting critical context. Cause: naive chunking at fixed character positions ignores sentence boundaries. Mitigation: chunk at sentence boundaries, use document structure (headers, sections). Overlap 1–2 sentences with neighbors. At retrieval, include chunks N-1 and N+1 as auxiliary context — recovers 10–15% of lost information.
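A sketch of that neighbor expansion at retrieval time, assuming chunks were stored with the `source` and `seq` metadata from the chunking sketch so adjacent chunks can be looked up directly:

```python
def expand_with_neighbors(hits: list[dict], lookup) -> list[dict]:
    """For each retrieved chunk, pull chunks seq-1 and seq+1 from the same
    document as auxiliary context. `lookup(source, seq)` returns a chunk dict or None."""
    expanded, seen = [], set()
    for hit in hits:
        for seq in (hit["seq"] - 1, hit["seq"], hit["seq"] + 1):
            neighbor = hit if seq == hit["seq"] else lookup(hit["source"], seq)
            if neighbor and (neighbor["source"], neighbor["seq"]) not in seen:
                seen.add((neighbor["source"], neighbor["seq"]))
                expanded.append(neighbor)
    return expanded
```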
**Embedding drift over time.** Symptom: retrieval degrades over months — new terminology, model updates, preprocessing changes shift embedding space. Cause: space is a snapshot of model + preprocessing at indexing time. Mitigation: re-index quarterly (slow-changing) or monthly (active). Pin model version hash; never change without full re-index. Use versioned vector DB with blue-green deployment: index to new Qdrant collection, validate, swap pointer.
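Qdrant's collection aliases support the blue-green swap described above; a sketch with the qdrant-client Python package (collection names are assumptions, and the alias API may differ slightly between client versions):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    CreateAlias, CreateAliasOperation, DeleteAlias, DeleteAliasOperation,
)

client = QdrantClient(url="http://localhost:6333")

# 1. Re-index everything into a fresh collection (e.g. "documents_2026q2"),
# 2. validate retrieval metrics against the labeled test set,
# 3. then atomically repoint the alias the application queries:
client.update_collection_aliases(change_aliases_operations=[
    DeleteAliasOperation(delete_alias=DeleteAlias(alias_name="documents")),
    CreateAliasOperation(create_alias=CreateAlias(
        collection_name="documents_2026q2", alias_name="documents")),
])
```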
Hardware guidance
Local RAG combines three model types: embedding (lightweight), generation (heavyweight), and optional reranker (lightweight). Size for the generation model — embedder + reranker are negligible overhead.
**Hobbyist tier ($600–1,000 system).** [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs 7B–13B generation + embeddings simultaneously. [Qwen 3 30B-A3B](/models/qwen-3-30b-a3b) (MoE, ~7B active) at Q4 fits in 12 GB. 7B-class RAG accuracy: 65–75% on legal/medical — adequate for personal use, not professional compliance. Use [Ollama](/tools/ollama) or [llama.cpp](/tools/llama-cpp) for multi-model serving.
**SMB tier ($2,500–4,000 system).** [RTX 4090](/hardware/rtx-4090) at 24 GB: [Qwen 3 32B](/models/qwen-3-32b) Q4 generation (~19 GB) + BGE-M3 (1.1 GB) + BGE Reranker (1.1 GB) all on one GPU with a few GB of headroom — a complete RAG stack on a single consumer card. 70B Q4 (~40 GB) does not fit in 24 GB. For 70B-class generation at this tier, a [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 64 GB+ unified memory runs the full stack at 25–35 tok/s.
**Professional tier ($8,000–15,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) at 48 GB: 70B Q4 (~40 GB) + embedding (1.1 GB) + reranker (1.1 GB) leaves roughly 4–6 GB for KV cache — enough for moderate context lengths and a handful of concurrent users. [L40S](/hardware/nvidia-l40s) at 48 GB: datacenter equivalent, deploy for multi-user RAG via [vLLM](/tools/vllm).
**Enterprise tier ($25,000+).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB: 70B Q4 (~40 GB) + full stack + roughly 35 GB of KV cache for long contexts at 140–170 tok/s and 50+ concurrent users. [H200](/hardware/nvidia-h200) at 141 GB: enables [DeepSeek V4](/models/deepseek-v4) generation for RAG over 1M+ document corpora. [AMD MI300X](/hardware/amd-mi300x) at 192 GB: full stack with 128K context and 100+ concurrent sessions.
**VRAM formula.** Total = generation model + 2.2 GB (embed + rerank) + KV cache (~0.8 GB per 1K context for 70B) + 2 GB OS. For 70B Q4: ~40 + 2.2 + 13 (16K context) + 2 = ~57 GB — exceeds consumer GPU, requires multi-GPU, Apple unified memory, or enterprise GPU. For 32B: ~19 + 2.2 + 7 + 2 = ~30 GB — fits [RTX 5090](/hardware/rtx-5090) at 32 GB. The 70B RAG sweet spot is a used A100 80 GB or Apple Silicon Mac with 64 GB+ unified memory.
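The same formula as a small calculator, using the text's rule-of-thumb constants; the 0.44 GB per 1K KV rate for the 32B example is back-solved from the ~7 GB figure above and is an assumption:

```python
def rag_vram_gb(model_weights_gb: float, context_k_tokens: float,
                kv_gb_per_1k: float = 0.8,      # ~0.8 GB per 1K context for a 70B model
                embed_rerank_gb: float = 2.2,   # BGE-M3 + BGE Reranker V2 M3
                os_overhead_gb: float = 2.0) -> float:
    """Rule-of-thumb total VRAM for a local RAG stack."""
    return model_weights_gb + embed_rerank_gb + context_k_tokens * kv_gb_per_1k + os_overhead_gb

print(rag_vram_gb(40, 16))                      # 70B Q4, 16K context -> ~57 GB
print(rag_vram_gb(19, 16, kv_gb_per_1k=0.44))   # 32B Q4, 16K context -> ~30 GB
```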
Runtime guidance
**AnythingLLM vs LangChain/LlamaIndex vs custom.**
[AnythingLLM](https://anythingllm.com) provides complete local RAG UI — drag documents, ask questions, view citations. Bundles LanceDB, interfaces with local LLM backends (Ollama, LM Studio). Advantage: zero configuration, offline, 5-minute deployment. Limitation: fixed character-based chunking, no multi-user access control, basic query rewriting. Use for personal RAG and proof-of-concept. Not for production multi-user RAG.
[LangChain](https://langchain.com) and [LlamaIndex](https://llamaindex.ai) provide composable RAG components — loaders, splitters, embedders, vector stores, retrievers, chains. Advantage: full programmatic control, 50+ vector DBs, 100+ LLMs. Limitation: abstraction layers add debugging complexity. Right for production RAG needing customization without building entire integration layer.
**Custom pipeline.** Build your own when: fine-grained control needed, minimum dependencies (httpx + vector DB client + LLM API), or use case doesn't fit framework abstraction. Custom is 200–500 lines orchestrating: document load → chunk → TEI embed → Qdrant upsert → query → search → rerank → vLLM generate. More upfront work, eliminates framework debugging overhead.
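A compressed skeleton of that orchestration, reusing the chunking and prompt sketches above and assuming TEI on :8081, a TEI reranker on :8082, Qdrant on :6333, and vLLM's OpenAI-compatible server on :8000; ports, collection name, and payload fields are assumptions:

```python
import httpx
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

TEI_EMBED, TEI_RERANK, VLLM = "http://localhost:8081", "http://localhost:8082", "http://localhost:8000"
# Assumes a "documents" collection already created with BGE-M3's 1024-dim vectors
qdrant = QdrantClient(url="http://localhost:6333")

def embed(texts: list[str]) -> list[list[float]]:
    return httpx.post(f"{TEI_EMBED}/embed", json={"inputs": texts}, timeout=60).json()

def index(chunks: list[dict]) -> None:           # chunks from chunk_document() above
    vectors = embed([c["text"] for c in chunks])
    qdrant.upsert("documents", points=[
        PointStruct(id=c["id"], vector=v, payload=c) for c, v in zip(chunks, vectors)
    ])

def answer(question: str) -> str:
    hits = qdrant.search("documents", query_vector=embed([question])[0], limit=50)
    texts = [h.payload["text"] for h in hits]
    scores = httpx.post(f"{TEI_RERANK}/rerank",
                        json={"query": question, "texts": texts}, timeout=60).json()
    top = [texts[s["index"]] for s in sorted(scores, key=lambda s: s["score"], reverse=True)[:8]]
    resp = httpx.post(f"{VLLM}/v1/chat/completions", json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        # build_prompt comes from the grounded-generation sketch above
        "messages": [{"role": "user", "content": build_prompt(top, question)}],
    }, timeout=120).json()
    return resp["choices"][0]["message"]["content"]
```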
**Ollama vs llama.cpp backend.** [Ollama](/tools/ollama) serves generation + embedding from one binary — unified API, auto GPU, model library. Limitation: no continuous batching, no reranking, roughly 60% of a dedicated engine's throughput. Right for single-user. [llama.cpp](/tools/llama-cpp) server: higher throughput with continuous batching, lower embedding latency. For multi-user (5+): vLLM for generation + TEI for embeddings.
**Decision tree.** Personal RAG: AnythingLLM + Ollama + nomic-embed-text + [Llama 3.3 70B](/models/llama-3-3-70b) or [Qwen 3 32B](/models/qwen-3-32b). Production 5–50 users: LlamaIndex + TEI (BGE-M3 + BGE Reranker V2 M3) + [Qdrant](/tools/qdrant) + vLLM (Llama 3.3 70B). Production with compliance: custom pipeline — maximum control, minimum dependency surface. Canonical stack: TEI + BGE-M3 + Qdrant + BGE Reranker V2 M3 + vLLM + Llama 3.3 70B.
Hardware buying guidance for Private Document Analysis
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.