COURSE · FND · B014

RAG Systems: Part 2

Learn RAG Systems: Part 2 through RunLocalAI's practical lens: RAG, reranking, query optimization, hardware fit, runtime settings, verification habits, and local-vs-cloud tradeoffs.

22 chapters · 12h · Foundations track · By Fredoline Eruo
PREREQUISITES
  • B013

Course B014: RAG Systems: Part 2

Why this course exists

Part 1 covered the fundamentals: chunking, embedding, vector storage, and basic retrieval. Those foundations are necessary but not sufficient for production systems. Real queries fail in predictable ways. Synonymy causes recall gaps. Polysemy causes precision drops. Long documents get retrieved without the sections that actually matter. Dense embeddings miss exact keyword matches. Rankings based on raw semantic similarity don't always align with relevance.

This course addresses those gaps systematically. It covers reranking to improve ranking quality after initial retrieval. It covers query transformation to bridge the gap between how users ask and what's in your documents. It covers hybrid search combining dense and sparse methods. It covers adaptive retrieval that adjusts strategy based on query characteristics. It covers agentic retrieval that decomposes complex questions into multi-step searches.

Build on Part 1 foundations. Expect real code, real failure modes, and concrete best practices.

What you will know after

After completing both parts of this course, you will be able to design and implement production-grade RAG pipelines that are substantially more accurate than naive implementations. You will understand when to apply reranking versus query rewriting. You will be able to combine multiple retrieval strategies using rank fusion. You will be able to build systems that decompose complex queries and iteratively refine results. You will have working code for each major component.

CHAPTERS
  1. 01 · Part 1 Recap · 20 min
     Part 1 built the ingestion-to-generation pipeline; Part 2 optimizes every stage of that pipeline for production.

     This chapter consolidates the foundational concepts from Part 1 that the rest of this course builds upon. If any concept feels unfamiliar, return to Part 1 before continuing.

     ### Vector Retrieval Pipeline

     A basic RAG pipeline works like this: documents get chunked, each chunk gets embedded into a dense vector, chunks get stored in a vector database, user queries get embedded using the same model, and the database returns the k-nearest chunks by cosine similarity. Those chunks get injected into an LLM prompt, and the LLM generates an answer.

     ```python
     from langchain_community.vectorstores import Chroma
     from langchain_openai import OpenAIEmbeddings

     # Basic retrieval setup from Part 1.
     # `chunks` is assumed to be the list of documents produced by the chunking step.
     embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
     vectorstore = Chroma.from_documents(
         documents=chunks,
         embedding=embeddings,
         persist_directory="./chroma_db"
     )

     # Query execution: embed the query with the same model, then fetch
     # the 10 nearest chunks. `user_query` is assumed to be the question string.
     query_embedding = embeddings.embed_query(user_query)
     results = vectorstore.similarity_search_by_vector(
         query_embedding, k=10
     )
     ```

     This pipeline has a hard ceiling on accuracy. The embedding model makes a one-time decision about what semantic content each chunk represents. Retrieval then just does nearest-neighbor search. Errors compound: a suboptimal chunk may contain relevant information but with distracting context, or the query may use different vocabulary than the chunk.

     ### Chunking Strategies

     Chunk size affects retrieval quality. Chunks that are too small lose context; chunks that are too large dilute relevance. Hierarchical chunking stores parent chunks for context while using child chunks for retrieval. Semantic chunking groups related content even when it crosses what would otherwise be arbitrary size-based boundaries.

     ```python
     # From Part 1: hierarchical chunking.
     # `text_chunker` and `semantic_split` are splitting helpers assumed to be
     # defined elsewhere; they are not shown in this recap.
     def hierarchical_chunk(text, parent_size=1000, child_size=200):
         parent_chunks = text_chunker(text, chunk_size=parent_size)
         chunks_with_metadata = []
         for i, parent in enumerate(parent_chunks):
             # Split parent into contextually coherent children
             children = semantic_split(parent, chunk_size=child_size)
             for child in children:
                 # Each child keeps a pointer to its parent for later context lookup
                 chunks_with_metadata.append({
                     "content": child,
                     "parent_idx": i,
                     "parent_content": parent
                 })
         return chunks_with_metadata
     ```

     ### Embedding Models

     Embedding models transform text into vectors that capture semantic meaning. The choice of model affects both retrieval accuracy and latency. OpenAI's text-embedding-3-small offers good quality at low cost. BGE models provide strong open-source alternatives. Commercial models like Cohere and Voyage offer specialized optimizations.

     ### Limitations of Basic Retrieval

     Three failure modes dominate basic retrieval:

     **1. Vocabulary Mismatch.** A user asks about "vehicle insurance" but the documents say "auto coverage." Dense embeddings handle synonyms reasonably well but struggle with domain-specific terminology that wasn't well represented during training.

     **2. Precision vs. Recall Tension.** Setting k=5 can retrieve too few chunks when relevant information spans multiple sections. Setting k=50 can retrieve too many, drowning relevant content in noise.

     **3. Ranking vs. Relevance Mismatch.** Cosine similarity in embedding space doesn't perfectly correlate with task-specific relevance. The chunk most semantically similar to the query may not be the chunk most useful for answering it.

     The following chapters address each of these failure modes with targeted techniques.
  2. 02 · Why Reranking Matters · 20 min
     Reranking trades computation for accuracy: retrieve more than needed, then use a computationally expensive but more accurate scorer to select the best results.
  3. 03 · Cross-Encoder Setup · 20 min
     Cross-encoders compute joint query-document attention, which is slower than bi-encoder vector comparison but captures relevance that bi-encoders miss.
  4. 04 · Local Cross-Encoder Models · 25 min
     Local cross-encoders give full control and eliminate API costs, but require careful model selection and optimization for production throughput.
  5. 05 · Reranking Pipeline · 25 min
     The reranking pipeline's value lies in decoupling recall (retrieve widely) from precision (select intelligently), but requires tuning k parameters against your actual evaluation data.
  6. 06 · Query Rewriting · 25 min
     Query rewriting addresses vocabulary mismatch by transforming user queries into document-like language, improving retrieval precision without sacrificing recall.
  7. 07 · Query Expansion · 25 min
     Query expansion trades latency for recall, which is often a good trade when baseline retrieval misses relevant documents. Use selective expansion to apply effort only where it helps.
  8. 08 · Hybrid Search (Dense + Sparse) · 25 min
     Hybrid search combines dense (semantic) and sparse (keyword) retrieval to capture complementary strengths, with optimal weighting determined empirically on your specific data and queries.
  9. 09 · Reciprocal Rank Fusion · 25 min
     Reciprocal Rank Fusion reliably combines rankings from multiple retrieval methods by converting ranks to scores, avoiding the need for score normalization across heterogeneous methods.
  10. 10 · Adaptive Retrieval · 25 min
     Adaptive retrieval applies different strategies to different query types, optimizing the tradeoff between retrieval quality and computational cost for each query.
  11. 11 · Agentic Retrieval · 30 min
     Agentic retrieval enables dynamic, self-correcting search by putting the LLM in control, allowing it to recognize failures, decompose complex questions, and chain multiple retrievals.
  12. 12 · Multi-Hop RAG · 20 min
     Multi-hop RAG uses the answer from each retrieval step to inform the next search, enabling reasoning across document boundaries.
  13. 13 · Context Compression · 20 min
     Context compression uses relevance scoring to remove redundant or irrelevant content while preserving key facts needed to answer the query.
  14. 14 · Sliding Window Context · 20 min
     Sliding windows search long documents by creating overlapping chunks, but you must handle cases where relevant information spans multiple chunks.
  15. 15 · Document Re-ranking · 20 min
     Re-ranking applies a more expensive scoring model to initial retrieval results, improving relevance at the cost of additional latency.
  16. 16 · Caching Strategies · 25 min
     Caching reduces latency and cost by storing embeddings and responses, with semantic caching handling near-duplicate queries.
  17. 17 · Batch Processing · 20 min
     Batch processing groups queries to reduce API calls and uses parallel workers for throughput, while rate limiting prevents exceeding API quotas.
  18. 18 · Production Pipeline · 20 min
     Production pipelines combine retrieval stages with error handling, fallbacks, and health checks to ensure reliable operation under failure conditions.
  19. 19 · Monitoring RAG Quality · 20 min
     Monitoring RAG quality requires tracking retrieval metrics (precision, recall, MRR) alongside latency, with automated alerts when quality degrades.
  20. 20 · A/B Testing Retrieval · 20 min
     A/B testing compares retrieval strategies using real user queries, with statistical tests determining whether observed differences are significant or due to chance.
  21. 21 · Advanced RAG Evaluation · 20 min
     Advanced RAG evaluation measures not just retrieval quality but whether the retrieved context actually enables accurate, hallucination-free answers.
  22. 22 · Part 2 Final Project · 25 min
     Production RAG combines multi-hop retrieval, re-ranking, compression, caching, and monitoring into a single system that improves over time through A/B testing.
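
The chapter list describes Reciprocal Rank Fusion as converting ranks to scores so that results from different retrievers can be merged without score normalization. As a rough illustration of that idea (not the course's own implementation), the sketch below sums `1 / (k + rank)` for each document across the input rankings. The function name `reciprocal_rank_fusion`, the sample document IDs, and the constant `k=60` (a commonly used default) are assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed default of 60); larger values flatten
       the advantage of top-ranked items.
    Returns a list of (doc_id, fused_score) pairs, best-first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs from a dense retriever and a sparse (keyword) retriever
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Only rank positions enter the formula, so the dense retriever's cosine similarities and the sparse retriever's keyword scores never need to be put on a common scale. Documents that appear in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever.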