
Local AI for document search — what RAG actually delivers

Local RAG over your own documents in 2026: what works, what doesn't, what to expect. Embedding model picks for English and multilingual, vector DB choices (Chroma vs Qdrant vs SQLite-vec), chunking strategies, retrieval quality vs context window tradeoffs, BM25 + dense hybrid, PDF and OCR ingestion gotchas, and realistic accuracy expectations with eval anchors.

By Fredoline Eruo · Last reviewed 2026-05-08 · ~1,850 words

Answer first

Local RAG (retrieval-augmented generation) over a few hundred to a few thousand documents on your own hardware is one of the highest-leverage things you can do with a local AI stack in 2026. A 7-32B local model paired with a sensible embedding model and a vector DB on the same machine produces useful, citable answers over your contracts, your textbooks, your meeting notes, your codebase, your medical record — without the documents leaving the disk. The technology is mature; the gap between “works” and “works well” is almost entirely in chunking, retrieval strategy, and document ingestion, not in the choice of vector DB or embedding model.

That is the answer most operators want. The honest extension: retrieval quality is the bottleneck, not generation. A frontier-class language model can't fix bad retrieval. A small local model with great retrieval beats a large cloud model with bad retrieval on almost every document-Q&A task. This page is an operator-grade map of where the wins and the failure modes actually live. The full RAG glossary entry is the conceptual primer; the workflow at /workflows/offline-rag-pipeline is the end-to-end production setup.

What local RAG is and what it isn't

Local RAG is a three-step pipeline. (1) Embed each chunk of each document into a vector with an embedding model. (2) Store the vectors and the source text in a vector database. (3) Retrieve the most similar chunks at query time and feed them to a language model alongside the user's question, asking the model to answer using only those chunks and to cite them.
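
A minimal sketch of those three steps in Python, assuming an Ollama server on localhost with nomic-embed-text and a chat model already pulled. Model names, the chunk list, and the prompt are illustrative, not a reference implementation:

```python
import requests
import chromadb

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Step 1: turn a chunk (or a query) into a vector with the embedding model.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

# Step 2: store vectors and source text in a local vector DB.
client = chromadb.PersistentClient(path="./rag-db")
docs = client.get_or_create_collection("docs")
chunks = [("contract.pdf p.3", "Either party may terminate with 30 days notice...")]
docs.add(ids=[str(i) for i in range(len(chunks))],
         embeddings=[embed(text) for _, text in chunks],
         documents=[text for _, text in chunks],
         metadatas=[{"source": src} for src, _ in chunks])

# Step 3: retrieve the nearest chunks and ask the model to answer from them only.
question = "What is the termination notice period?"
hits = docs.query(query_embeddings=[embed(question)], n_results=5)
context = "\n\n".join(f"[{m['source']}] {d}"
                      for d, m in zip(hits["documents"][0], hits["metadatas"][0]))
prompt = ("Answer using ONLY the excerpts below and cite the source tags.\n\n"
          f"{context}\n\nQuestion: {question}")
r = requests.post(f"{OLLAMA}/api/generate",
                  json={"model": "qwen2.5:14b-instruct",
                        "prompt": prompt, "stream": False})
print(r.json()["response"])
```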

That is what local RAG is. What it is not: a way to make a model “know” your documents in a deep sense. The model has not learned anything from the corpus. It is reading the relevant chunks at inference time and producing a grounded summary. This means RAG is excellent at “what does this document say about X?” questions and weak at “what is the overall theme across these 100 documents?” questions, because the latter requires synthesis the retrieval step cannot deliver. For corpus-level synthesis you need either a much larger context window, a deliberate map-reduce-style multi-pass workflow, or fine-tuning — RAG is not the right tool. The honest framing of when RAG fits and when it doesn't is the difference between operators who get value from local RAG and operators who give up after two weeks.

Embedding model picks — English first, then multilingual

The embedding model is the single most consequential choice in the pipeline because every other decision compounds against it. The picks below are the ones operators actually run in 2026 — all open-weight, all free, all locally servable.

  • nomic-embed-text-v1.5 (137M params) — the daily-driver English embedding model for most local stacks. Strong on the MTEB leaderboard for its size, runs on CPU acceptably and on GPU very fast, integrates cleanly with AnythingLLM and LlamaIndex. The default pick unless you have a reason to pick something else.
  • BGE-base / BGE-large (BAAI) — the previous generation of go-to English embedders, and still strong. BGE-large-en-v1.5 is a fair quality jump over nomic on harder retrieval tasks at the cost of being roughly 3x larger. Worth running for high-stakes corpora.
  • multilingual-e5-large — the right pick if your corpus mixes English with other languages. Robust across ~100 languages, slightly behind English-specialized models on pure-English benchmarks.
  • jina-embeddings-v3 — a 2024-vintage embedding family with strong long-document handling (8K input tokens directly). Useful when you want fewer, larger chunks rather than many small ones.
  • nomic-embed-vision — for image-text retrieval if your corpus has scanned diagrams, slides, or figures alongside text. A separate question from textual RAG; addressed only when needed.

The non-obvious operational rule: do not change embedding models on a populated database. The vectors are model-specific; mixing embeddings from two different models in one DB makes similarity search meaningless. Pick once, populate, and live with the choice. If you must change, re-embed everything from source. The embedding glossary entry covers the underlying mechanics.
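
A cheap way to enforce that rule, sketched with Chroma: record the model name in the collection's metadata at creation and refuse to write if it doesn't match. The embed_model key is a convention of this sketch, not a built-in Chroma safeguard:

```python
import chromadb

EMBED_MODEL = "nomic-embed-text-v1.5"
client = chromadb.PersistentClient(path="./rag-db")
coll = client.get_or_create_collection(
    "docs", metadata={"embed_model": EMBED_MODEL})

# If the collection already existed, its stored metadata wins; check it.
stored = (coll.metadata or {}).get("embed_model")
if stored != EMBED_MODEL:
    raise RuntimeError(f"collection embedded with {stored!r}; "
                       f"re-embed from source before switching to {EMBED_MODEL!r}")
```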

Vector DB picks — Chroma, Qdrant, SQLite-vec

For local-only document search at the freelancer / homelab / individual scale (under ~500K chunks), three options handle the workload comfortably. The ranking is operator-grade, not benchmark-driven, because at this scale the database is rarely the bottleneck.

Chroma. The smoothest default for individual operators. Embedded in your Python process, no separate server, persistence to a local directory. Excellent for scripts and Jupyter work. Scales fine to ~100K chunks, gets sluggish above ~500K. The right pick for “I have a few thousand PDFs and want to ask questions over them.”

Qdrant. A real production-grade vector DB you run locally as a service. Filtering on metadata is the killer feature — “find chunks from documents tagged ‘client A’ modified after 2026-01-01.” Filters in Chroma exist but are slower and less expressive. Run it via Docker; the daemon is light. The right pick for any corpus with structured metadata or any setup that will outgrow the embedded-DB pattern.
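
A sketch of that filter pattern with the qdrant-client package. The collection name, payload fields, and placeholder query vector are illustrative; the date is stored as a unix timestamp so a numeric range filter applies:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient(url="http://localhost:6333")
query_vector = [0.0] * 768  # placeholder: embed the user's question here

hits = client.search(
    collection_name="docs",
    query_vector=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="client", match=MatchValue(value="client A")),
        FieldCondition(key="modified_ts", range=Range(gte=1767225600)),  # 2026-01-01 UTC
    ]),
    limit=10,
)
```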

SQLite-vec. A vector index extension for SQLite. Single file on disk, no daemon, embeddable in any language with SQLite bindings. Genuinely the right pick for portable local-only setups — “a personal knowledge base I want to be able to copy to a USB drive.” Slower than Qdrant on large corpora; for under 50K chunks the speed is fine.
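
The single-file pattern, sketched with the sqlite-vec Python package. The table name and the 768-dimension placeholder vectors are illustrative:

```python
import sqlite3
import sqlite_vec

db = sqlite3.connect("knowledge-base.db")  # one portable file on disk
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks "
           "USING vec0(embedding float[768])")
db.execute("INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
           (1, sqlite_vec.serialize_float32([0.1] * 768)))

# Nearest-neighbour query: MATCH against a query vector, ordered by distance.
rows = db.execute(
    "SELECT rowid, distance FROM chunks WHERE embedding MATCH ? "
    "ORDER BY distance LIMIT 5",
    (sqlite_vec.serialize_float32([0.1] * 768),),
).fetchall()
```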

What about LanceDB, Weaviate, Milvus, pgvector? All viable, each with a niche. LanceDB is excellent if you're already in Apache Arrow / DuckDB territory. Weaviate is overkill for individual local use but right for small-team self-hosted. Milvus is industrial-scale and the wrong choice for a single laptop. pgvector is right if you already have Postgres in your stack. Pick from this list only if one of those conditions matches; otherwise stay on Chroma or Qdrant.

Chunking — the lever most operators ignore

The single biggest determinant of local RAG quality is how you chop documents into chunks. Five operator-grade rules that most tutorials skip:

  • Chunk on semantic boundaries first, length second. A clean paragraph boundary, a section heading, a code-block fence — these are the cuts that produce chunks the embedding can represent cleanly. Cuts mid-sentence destroy retrieval quality even on small documents.
  • Target 200-500 tokens per chunk, with overlap. Below 200, individual chunks lose enough context that the model can't reason over them. Above 500, chunks contain too many topics and the embedding becomes a mush. Overlap of 50-100 tokens between adjacent chunks ensures concepts straddling chunk boundaries are still findable.
  • Add a parent-document anchor to every chunk. Each chunk should carry the document title, section heading, and page number as metadata. The model uses this to cite. Without it, the model says “the document says X” — without “the document” resolving to anything specific.
  • Treat tables and code as their own chunks. Embedding a chunk that is half-prose and half-table loses signal on both halves. Extract tables as their own units, keep code blocks intact, use prose-style descriptions to make tables retrievable on natural-language queries.
  • Re-chunk if quality is bad before changing anything else. Most “my RAG is bad” problems are chunking problems. Do not change embedding models, vector DBs, or LLMs first. Look at the actual chunks your pipeline produced and ask “could I answer the user's question from this chunk?”

The LlamaIndex and LangChain ecosystems both ship semantic-aware chunkers that improve on naive fixed-size splitting; both also ship hierarchical retrievers (sentence-level retrieval, paragraph-level reranking) that handle the “chunks too small” problem at retrieval time rather than at chunking time. Either is a real upgrade over hand-rolled splitting once you outgrow the basics.
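
As a concrete example, LangChain's recursive splitter implements the first two rules: it cuts on the largest separator that fits (paragraphs, then lines, then sentences) and sizes chunks in tokens with overlap. Sizes, separators, and the metadata anchor are illustrative, and tiktoken is assumed installed for token counting:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=400,                        # rule: 200-500 tokens per chunk
    chunk_overlap=80,                      # rule: 50-100 tokens of overlap
    separators=["\n\n", "\n", ". ", " "],  # semantic boundaries first
)

doc_text = open("contract.txt").read()
anchor = {"title": "Master Services Agreement",
          "section": "4. Termination", "page": 3}
# Rule: every chunk carries its parent-document anchor as metadata.
chunks = [{"text": t, **anchor} for t in splitter.split_text(doc_text)]
```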

Retrieval quality vs context window — the real tradeoff

The 2026 question every RAG operator hits: I have a model with a 32K or 128K context window. Should I just stuff every retrieved chunk into context and let the model figure it out? The honest answer is no as a default, but yes more often than you might think.

Two countervailing forces. On the “stuff more context” side: long-context local models (Qwen 2.5 with YaRN to 128K, Llama 3.1 native 128K, MLX-served Qwen on Apple Silicon to 128K) are genuinely capable, and putting the top-20 chunks in context produces measurably better answers than putting the top-5. On the “don't stuff” side: long-context attention is computationally expensive, time-to-first-token grows substantially, the model still loses information buried in the middle of long inputs (the “lost in the middle” effect), and noisy chunks crowd out signal.

The operator-grade default that lands well: retrieve top 10-20, rerank to top 5-8 with a cross-encoder reranker, then put those in context. The reranker (e.g. bge-reranker-v2-m3) is small, fast on CPU, and lifts retrieval precision substantially. This is what most production-grade local RAG stacks settle on, and it is what the offline RAG pipeline workflow documents in detail.
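
A sketch of that retrieve-then-rerank step using sentence-transformers' CrossEncoder wrapper around bge-reranker-v2-m3. The candidate list is a stand-in for your top-20 dense-retrieval hits:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], keep: int = 6) -> list[str]:
    # Score each (query, chunk) pair jointly, then keep the best few.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

top20 = ["Either party may terminate with 30 days notice...",
         "The fees in Schedule B are payable quarterly..."]  # stand-in hits
context_chunks = rerank("What is the termination notice period?", top20)
```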

BM25 + dense hybrid is the honest default

Pure dense retrieval (embed the query, find the nearest chunks by cosine similarity) misses exact-string matches. A user asking “what does section 4.3 say?” or “find the clause about indemnification” benefits from old-fashioned keyword retrieval — BM25, the lexical ranking function that powers Lucene and Elasticsearch.

The honest production setup runs both: BM25 over the chunks for keyword precision, dense embedding similarity for semantic recall, and a fusion step (reciprocal rank fusion is the simple-and-strong default) that merges the two ranked lists. Empirically this is 5-15% better than dense-only on heterogeneous corpora and substantially better on technical documents where exact terms matter. Both Qdrant and Weaviate ship native hybrid retrieval; on Chroma you wire up BM25 yourself with a search library, or you graduate to Qdrant. The cost is a few hundred lines of Python and one extra moving piece — almost always worth it.
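
A minimal sketch of the fusion step, assuming the rank-bm25 package for the lexical side and a dense ranking you have already computed. Chunk ids and texts are illustrative:

```python
from rank_bm25 import BM25Okapi

chunks = {"c1": "Section 4.3: indemnification obligations of the supplier...",
          "c2": "The parties agree to binding arbitration in Delaware..."}
ids = list(chunks)
bm25 = BM25Okapi([text.lower().split() for text in chunks.values()])

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list votes 1/(k + rank) for its members.
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

query = "indemnification clause"
lex = bm25.get_scores(query.lower().split())
bm25_ranked = [ids[i] for i in
               sorted(range(len(ids)), key=lambda i: lex[i], reverse=True)]
dense_ranked = ["c2", "c1"]  # stand-in for the dense similarity ranking
fused = rrf([bm25_ranked, dense_ranked])
```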

Document ingestion — PDFs, tables, OCR

Where local RAG reliably falls over: the document ingestion step. Operator-grade rules:

  • PDFs are not a format, they are a graveyard of formats. Native digital PDFs (Word/LaTeX/Pages exports) extract cleanly with PyMuPDF, pdfplumber, or pdfminer. Scanned PDFs need OCR. PDFs that are partly digital and partly scanned (a digital cover page over a scanned body) need both. Always inspect a sample after ingestion before declaring the pipeline working; see the triage sketch after this list.
  • OCR is now genuinely good locally. Tesseract is the baseline; newer engines like Surya and PaddleOCR are significantly better on multi-column layouts and tables. EasyOCR is the easiest install. For document-heavy work, run OCR once at ingestion and persist the extracted text alongside the original.
  • Tables need structural extraction, not flat text. Camelot or pdfplumber's table extractor produces row-by-row structure. Embedding the result row-by-row, with column headers, lets the model retrieve specific cells.
  • Strip headers, footers, and page numbers before chunking. Repeated noise like “Page 47 of 312” or a copyright footer destroys embedding quality because every chunk looks similar in that dimension.
  • Slide decks and Word docs ingest via their native libraries. python-pptx for PowerPoint, python-docx for Word. Don't convert to PDF first — you lose structure.
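
The triage sketch referenced in the first rule: extract text per page with PyMuPDF and flag near-empty pages as likely scans that need an OCR pass. The character threshold is an illustrative heuristic:

```python
import fitz  # PyMuPDF

def triage_pdf(path: str, min_chars: int = 40) -> list[int]:
    doc = fitz.open(path)
    needs_ocr = []
    for page in doc:
        if len(page.get_text().strip()) < min_chars:
            # Near-empty text layer: probably a scanned page.
            needs_ocr.append(page.number)
    return needs_ocr

print(triage_pdf("sample.pdf"))  # inspect a sample before trusting the pipeline
```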

Realistic accuracy and how to evaluate it

Honest accuracy expectations for a well-tuned local RAG stack on a 1,000-document corpus, asking specific factual questions: top-1 retrieval relevance ~70-80%, top-5 ~90-95%, generation factuality conditioned on retrieved context ~85-92%. These numbers are from operator reports across 2025-2026 with the stack above (nomic-embed + Qdrant + reranker + 14-32B local model). Below those numbers, your stack has a fixable problem; above, you are at the ceiling of what the technology delivers.

Evaluation in practice: build a small hand-labeled eval set. Twenty to fifty real questions you might ask, with the actual document and chunk that should answer each one. Run your pipeline against it, count top-K retrieval hits and answer correctness. Re-run after every change. The public BEIR suite (15 retrieval datasets, including MS MARCO and NaturalQuestions) is the standard if you want public-corpus comparisons, but for individual operators, an internal eval set against your own corpus is the more useful artifact. The LlamaIndex evaluation harness gives you most of this for free.
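
A tiny harness for that hand-labeled set. retrieve() is a stand-in for your pipeline's retrieval step returning ranked chunk ids, and the items shown are illustrative:

```python
def retrieve(question: str) -> list[str]:
    # Stand-in: call your pipeline here and return chunk ids, best first.
    return ["msa.pdf#p3-c2", "msa.pdf#p1-c1"]

eval_set = [
    {"question": "What is the termination notice period?",
     "expected_chunk": "msa.pdf#p3-c2"},
    # ... 20-50 real questions, each labeled with the chunk that answers it
]

def hit_rate(items, k: int = 5) -> float:
    hits = sum(item["expected_chunk"] in retrieve(item["question"])[:k]
               for item in items)
    return hits / len(items)

print(f"top-5 retrieval hit rate: {hit_rate(eval_set):.0%}")
```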

The minimum viable local RAG stack

If you want to be running document search on your own machine in an hour:

  1. Runtime: Ollama with qwen2.5:14b-instruct for generation (or 7B on lighter hardware).
  2. Embedding model: ollama pull nomic-embed-text. Same runtime, separate model.
  3. Frontend with built-in RAG: AnythingLLM. Browser-based, drag-and-drop ingestion, points at your Ollama instance for both embedding and generation, ships a working chunker out of the box.
  4. Vector DB: AnythingLLM defaults to LanceDB internally; for a custom build use Qdrant via Docker.
  5. Ingest 50-100 representative documents first. Try real questions. Inspect the retrieved chunks before declaring it working.
  6. Iterate on chunking and on the prompt, in that order. Most early problems are not the model.

The full operator-grade reference setup with reranker, BM25 hybrid, ingestion pipeline, and eval harness is in /workflows/offline-rag-pipeline. The privacy framing for why you might be running local RAG instead of uploading docs to a hosted service is in /guides/local-ai-for-privacy; the broader free-tools tour is in /guides/best-free-local-ai-tools.

Next recommended step

The end-to-end production-grade stack with reranker, hybrid retrieval, and evals: /workflows/offline-rag-pipeline.

RAG workloads split into an embedding pass that wants batch throughput and a generation pass that wants low single-token latency. A GPU that handles one well may bottleneck on the other. The cards that balance both stages — high VRAM for large embedding batches plus enough compute headroom for responsive generation — define the practical floor for a document search system you actually enjoy using day to day.

The hardware that balances both stages of the RAG pipeline: best GPU for RAG, and RTX 3090 verdict.