13. Repo-Level RAG

Chapter 13 of 18 · 15 min

Retrieval-Augmented Generation (RAG) applied to code repositories lets you ask questions that span the entire codebase, a capability impossible with single-file context windows. To be practically useful, RAG for code requires careful index design, a chunking strategy, and retrieval optimization.

The core challenge differs from document RAG. Code has structure, relationships, and semantics that naive text chunking destroys. A function definition relates to its callers, its imported modules, its type signatures, and its documentation. Chopping code into arbitrary chunks loses these relationships and produces retrieval results that are technically relevant but practically useless.

Effective code RAG starts with structural awareness. Index functions, classes, and modules as first-class entities rather than text fragments. Capture the dependency graph: what this file imports, what imports this file, what functions call this function. When a user asks "how does authentication work," the system should retrieve all auth-related code with its relationships preserved, not scattered snippets that happen to contain the word "auth."
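As a minimal sketch of structure-aware indexing, the following uses Python's standard `ast` module to record functions and classes as entities, together with the modules a file imports and the names each entity calls. The `Entity` class, the `index_module` function, and the sample source are illustrative assumptions, not part of any particular tool, and call targets are matched by name only.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One indexed function or class, with the calls found inside it."""
    name: str
    kind: str          # "function" or "class"
    start_line: int
    end_line: int
    calls: set[str] = field(default_factory=set)


def index_module(source: str) -> tuple[list[Entity], set[str]]:
    """Return (entities, imported module names) for one Python source file."""
    tree = ast.parse(source)
    imports: set[str] = set()
    entities: list[Entity] = []

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            entity = Entity(node.name, kind, node.lineno, node.end_lineno)
            # Record every name called anywhere inside this entity's body.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    if isinstance(inner.func, ast.Name):
                        entity.calls.add(inner.func.id)
                    elif isinstance(inner.func, ast.Attribute):
                        entity.calls.add(inner.func.attr)
            entities.append(entity)
    return entities, imports


# Hypothetical sample file used only to exercise the sketch.
SAMPLE = '''
import hashlib
from db import find_user

def hash_password(password):
    return hashlib.sha256(password.encode()).hexdigest()

def login(username, password):
    user = find_user(username)
    return user is not None and user.password_hash == hash_password(password)
'''

entities, imports = index_module(SAMPLE)

# Invert the call sets to answer "what functions call this function".
callers: dict[str, set[str]] = {}
for entity in entities:
    for callee in entity.calls:
        callers.setdefault(callee, set()).add(entity.name)
```

Here `callers["hash_password"]` contains `login`, so a question about password hashing can pull in the login path as well. Resolving calls across files and distinguishing same-named methods would need more than name matching.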

Embedding strategy affects retrieval quality. General-purpose embeddings trained on prose struggle with code syntax. Code-specific embeddings capture semantic similarity that surface-level matching misses: two functions with different variable names but identical logic should match. Several open-source embedding models are trained specifically on code corpora and tend to outperform general-purpose alternatives on code retrieval tasks.

Chunk boundaries require special handling. Python's indentation-based blocks, JavaScript's closure scopes, and SQL's query structures each have natural boundaries. Breaking code at these boundaries preserves semantic coherence. Split decisions should respect function boundaries even when that produces uneven chunk sizes. A 250-line function kept intact is more useful than the same function split awkwardly to fit a 200-line chunk limit.
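One way to respect these boundaries for Python source is to let the parser decide where chunks end. The sketch below, built on the standard `ast` module, emits one chunk per top-level function or class regardless of length and groups the remaining top-level statements together. The function name `chunk_by_definition` and the sample source are assumptions made for illustration.

```python
import ast


def chunk_by_definition(source: str) -> list[str]:
    """Split Python source into chunks along top-level definition boundaries.

    Each top-level function or class becomes its own chunk, however long.
    Consecutive other statements (imports, constants) are grouped together.
    """
    lines = source.splitlines()
    tree = ast.parse(source)
    chunks: list[str] = []
    pending: list[str] = []  # lines of non-definition statements seen so far

    for node in tree.body:
        # Decorators sit above the def/class line, so start at the first one.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        text = lines[start - 1 : node.end_lineno]
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if pending:
                chunks.append("\n".join(pending))
                pending = []
            chunks.append("\n".join(text))
        else:
            pending.extend(text)
    if pending:
        chunks.append("\n".join(pending))
    return chunks


# Hypothetical sample file used only to exercise the sketch.
SAMPLE = '''import os
LIMIT = 10

def small():
    return LIMIT

class Store:
    def get(self, key):
        return os.environ.get(key)
'''

chunks = chunk_by_definition(SAMPLE)
```

The sample yields three chunks of different sizes: the module preamble, the `small` function, and the whole `Store` class. This sketch drops comments and blank lines that sit between top-level statements and assumes one statement per line; other languages would need their own parser.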

Hybrid retrieval combines keyword and semantic matching. Some queries need exact identifier matches ("find all uses of getUserById"), while others need conceptual understanding ("find code that handles user authentication"). The retrieval pipeline should support both modes, merging results with appropriate ranking.
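The text does not prescribe a merging method. One common choice, used here as an assumption, is reciprocal rank fusion, which scores each result by the sum of 1 / (k + rank) across the individual rankings. In this sketch the keyword ranking comes from a simple substring search, while the semantic ranking is a hard-coded stand-in for the output of a vector search; the chunk IDs and contents are hypothetical.

```python
from collections import defaultdict


def keyword_search(identifier: str, chunks: dict[str, str]) -> list[str]:
    """Rank chunk IDs by how often the exact identifier appears in them."""
    hits = [(text.count(identifier), cid)
            for cid, text in chunks.items() if identifier in text]
    return [cid for _, cid in sorted(hits, key=lambda h: (-h[0], h[1]))]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked high in any list score well."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical indexed chunks, keyed by "file::entity".
CHUNKS = {
    "users.py::getUserById": "def getUserById(uid): ...",
    "auth.py::login": "def login(name, pw):\n    user = getUserById(lookup(name))\n    return verify(user, pw)",
    "auth.py::verify": "def verify(user, pw): ...",
}

keyword_ranking = keyword_search("getUserById", CHUNKS)
# Stand-in for a vector search result; a real system would query an embedding index.
semantic_ranking = ["auth.py::verify", "auth.py::login"]

merged = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
```

In this example `auth.py::login` ranks first because it appears in both lists. The constant `k = 60` is a conventional default rather than a tuned value.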

Index freshness matters for accuracy. Stale indexes produce confident but incorrect answers about code that has changed. Some implementations use Git hooks to update the index on commit, or scheduled refreshes for large repositories. Others maintain incremental updates, adding new code and marking deleted code as removed without full reindexing.
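A minimal sketch of the incremental approach, assuming the index stores a content hash per file: compare the stored hashes with the current file contents and reindex only what changed. The `plan_update` function and the file paths are illustrative; in practice the comparison might be triggered from a Git hook or a scheduled job.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(
    indexed: dict[str, str], current_files: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Decide which paths to (re)index and which to mark as removed.

    indexed:       path -> content hash recorded at the last indexing run
    current_files: path -> file contents as they are now
    """
    to_index = sorted(
        path for path, text in current_files.items()
        if indexed.get(path) != content_hash(text)  # new or modified
    )
    to_remove = sorted(path for path in indexed if path not in current_files)
    return to_index, to_remove


# Hypothetical state: what the index saw last time vs. the working tree now.
indexed = {
    "auth.py": content_hash("def login(): ..."),
    "legacy.py": content_hash("def old(): ..."),
    "users.py": content_hash("def get_user(): ..."),
}
current_files = {
    "auth.py": "def login(): ...",               # unchanged
    "users.py": "def get_user(uid): ...",        # modified
    "payments.py": "def charge(): ...",          # new
}

to_index, to_remove = plan_update(indexed, current_files)
```

Only `payments.py` and `users.py` are reindexed, and `legacy.py` is marked as removed; `auth.py` is left alone because its hash is unchanged.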

Query construction affects retrieval results. "Where is the payment processing logic?" retrieves different results than "show me payment-related code." Encourage consistent query patterns through interface design or provide query templates for common question types.
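Query templates can be as simple as a small table of parameterized phrasings. The template names and wording below are illustrative assumptions; the point is that every "locate" question reaches the retriever in the same form.

```python
# Hypothetical templates for common question types.
QUERY_TEMPLATES = {
    "locate": "Where is the {topic} logic implemented?",
    "usages": "Find all uses of {identifier}",
    "explain": "How does {topic} work?",
}


def build_query(kind: str, **fields: str) -> str:
    """Fill in a template so similar questions always share one phrasing."""
    try:
        template = QUERY_TEMPLATES[kind]
    except KeyError:
        raise ValueError(f"unknown query type: {kind!r}") from None
    return template.format(**fields)


locate_query = build_query("locate", topic="payment processing")
usages_query = build_query("usages", identifier="getUserById")
```

An interface could expose these as buttons or slash commands, so users pick a question type and supply only the topic or identifier.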

EXERCISE

Design an indexing schema for a medium-sized repository (5-15 files). Document what entities you'll index, how you'll capture relationships, and what chunking strategy you'll use. Implement a basic version using a vector database.