RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced RAG — Chunking, Retrieval, Re-ranking

01. RAG Pipeline Anatomy

Chapter 1 of 24 · 15 min
KEY INSIGHT

Pipeline quality is multiplicative across stages; optimizing only retrieval ignores compounding failures upstream.

Before generation, a RAG pipeline has four primary stages: ingestion, chunking, indexing, and retrieval. Each stage adds latency and can degrade quality, and those losses compound downstream.

Ingestion involves loading raw documents from storage—file systems, databases, or object stores. Large documents require streaming parsers to avoid memory exhaustion. PDFs present parsing challenges due to layout variability; table extraction often requires specialized libraries.
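The streaming idea can be sketched as follows, assuming plain-text input. The function name `iter_text_blocks` and the 64 KiB default block size are illustrative choices, not part of any library. The generator reads a file in bounded blocks, so memory use stays flat regardless of file size.

```python
from typing import Iterator


def iter_text_blocks(path: str, block_chars: int = 64 * 1024) -> Iterator[str]:
    """Yield successive blocks of at most block_chars characters from a text file."""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            block = handle.read(block_chars)
            if not block:  # read() returns "" at end of file
                break
            yield block
```

Downstream stages can consume the generator one block at a time. PDFs and other binary formats would need a format-specific parser in place of `open()`.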

Chunking splits documents into segments that balance context preservation against retrieval precision. Chunk size directly affects embedding quality and query-document similarity scores. Oversized chunks dilute relevance; undersized chunks lose necessary context.
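One common baseline for the size trade-off is a fixed-size sliding window with overlap. The sketch below is a minimal version that counts words rather than tokens. `chunk_words` and its default sizes are hypothetical values picked for illustration, not recommendations.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of up to chunk_size words; neighbors share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap gives each chunk some of its neighbor's context, which softens the cost of cutting at an arbitrary word boundary.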

Indexing embeds each chunk as a vector in a high-dimensional space and stores it for search. The embedding model choice determines what semantic relationships the index captures. At scale, dense indices rely on approximate nearest neighbor algorithms (HNSW, IVF) to keep queries sub-second, because exact search compares the query against every stored vector.
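To illustrate how an approximate index avoids scanning every vector, here is a toy inverted-file (IVF) sketch in pure Python. The `ToyIVF` class, its fixed centroids, and the `n_probe` parameter name are assumptions made for illustration. Production libraries learn centroids with k-means and use optimized distance kernels.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyIVF:
    """Bucket vectors by nearest centroid; search only the closest buckets."""

    def __init__(self, centroids: List[Vector]):
        self.centroids = centroids
        self.buckets: Dict[int, List[Tuple[str, Vector]]] = defaultdict(list)

    def _nearest_centroids(self, v: Vector, n: int) -> List[int]:
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(v, self.centroids[i]))
        return order[:n]

    def add(self, item_id: str, v: Vector) -> None:
        bucket = self._nearest_centroids(v, 1)[0]
        self.buckets[bucket].append((item_id, v))

    def search(self, query: Vector, top_k: int = 3, n_probe: int = 1) -> List[str]:
        # Only vectors in the n_probe nearest buckets are compared to the query.
        candidates: List[Tuple[str, Vector]] = []
        for bucket in self._nearest_centroids(query, n_probe):
            candidates.extend(self.buckets.get(bucket, []))
        candidates.sort(key=lambda item: sq_dist(query, item[1]))
        return [item_id for item_id, _ in candidates[:top_k]]


index = ToyIVF(centroids=[(0.0, 0.0), (10.0, 10.0)])
index.add("a", (0.5, 0.2))
index.add("b", (9.0, 9.5))
index.add("c", (1.0, 1.0))
nearest = index.search((0.4, 0.4), top_k=2, n_probe=1)
```

Raising `n_probe` scans more buckets, trading speed for recall. A true neighbor that sits just across a bucket boundary is missed when `n_probe` is too low.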

Retrieval matches user queries against indexed chunks using a similarity metric, typically cosine similarity or inner product (the two produce identical scores when embeddings are normalized to unit length). The retrieved chunks feed a language model that synthesizes an answer.
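The relationship between the two metrics can be checked with a few lines of standard-library Python. The helper names below are illustrative. The sketch shows that, once vectors are scaled to unit length, the plain inner product gives the same value as cosine similarity on the original vectors.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length so that dot(v, v) == 1."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
chunk = [4.0, 3.0]
cos = cosine_similarity(query, chunk)
inner = dot(normalize(query), normalize(chunk))
assert math.isclose(cos, inner)
```

This is why many pipelines normalize embeddings once at indexing time and then use the cheaper inner product at query time.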

The critical insight: each stage can only work with what the previous stage hands it, so failures compound rather than average out. A poorly chunked document produces an index that cannot represent its semantic content, and no retrieval algorithm can compensate.
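The multiplicative claim can be made concrete with a small calculation. This is a simplified model that assumes each stage independently preserves some fraction of the information a query needs. The per-stage scores are made-up illustrative numbers, not measurements.

```python
import math
from typing import Dict


def end_to_end_quality(stage_quality: Dict[str, float]) -> float:
    """Multiply per-stage quality fractions into one pipeline-level estimate."""
    return math.prod(stage_quality.values())


# Hypothetical scores: chunking is the weakest stage.
stages = {"ingestion": 0.95, "chunking": 0.80, "indexing": 0.95, "retrieval": 0.90}

baseline = end_to_end_quality(stages)
better_retrieval = end_to_end_quality({**stages, "retrieval": 0.99})
better_chunking = end_to_end_quality({**stages, "chunking": 0.95})
```

With these numbers the baseline is about 0.65. Pushing retrieval to 0.99 lifts it to about 0.71, while fixing chunking lifts it to about 0.77, so the weakest upstream stage is the better place to spend effort.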

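# Minimal RAG pipeline structure.
# The embedder, vector_store, and ranker objects are supplied by the caller;
# search() and rerank() are expected to return RetrievedChunk lists,
# best match first.
import uuid
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    metadata: dict
    chunk_id: str

@dataclass
class RetrievedChunk:
    chunk: Chunk
    score: float

class RAGPipeline:
    def __init__(self, embedder, vector_store, ranker=None):
        self.embedder = embedder
        self.vector_store = vector_store
        self.ranker = ranker  # Optional cross-encoder

    def index_documents(self, documents: List[str], metadata: List[dict]):
        # zip() silently drops unmatched items, so check lengths up front.
        if len(documents) != len(metadata):
            raise ValueError("documents and metadata must have the same length")
        # Each document is stored as a single chunk here; a real pipeline
        # would run a chunker before this step.
        chunks = [Chunk(text=doc, metadata=meta, chunk_id=uuid.uuid4().hex)
                  for doc, meta in zip(documents, metadata)]
        embeddings = self.embedder.encode([c.text for c in chunks])
        self.vector_store.add(embeddings, chunks)

    def retrieve(self, query: str, top_k: int = 10) -> List[RetrievedChunk]:
        query_embedding = self.embedder.encode(query)
        # Over-fetch so the re-ranker sees more candidates than it returns.
        initial_results = self.vector_store.search(query_embedding, top_k * 2)
        if self.ranker:
            reranked = self.ranker.rerank(query, initial_results)
            return reranked[:top_k]
        return initial_results[:top_k]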
EXERCISE

Using a 10-document sample, profile a simple pipeline by timing each stage, then identify the slowest one. Prefer time.perf_counter() over time.time(): it is monotonic and has higher resolution for short intervals.
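One possible starting point for the exercise is a small timing context manager. The names `timed` and `slowest_stage` are hypothetical, and the two stages below are stand-ins that should be replaced with real chunking and indexing calls.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Add the wall-clock duration of the with-block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


def slowest_stage(timings: Dict[str, float]) -> str:
    """Return the name of the stage with the largest accumulated time."""
    return max(timings, key=timings.get)


timings: Dict[str, float] = {}
documents = [f"document {i} " * 50 for i in range(10)]  # stand-in 10-document sample

with timed("chunking", timings):
    chunks = [doc.split() for doc in documents]        # stand-in for real chunking
with timed("indexing", timings):
    index = {i: len(c) for i, c in enumerate(chunks)}  # stand-in for embedding + indexing

print(slowest_stage(timings), timings)
```

Because the timer accumulates per stage name, the same block can wrap calls inside a loop and still report one total per stage.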
