RAG Systems: Part 1
Learn RAG Systems: Part 1 through RunLocalAI's practical lens, covering RAG, retrieval, chunking and ingestion, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.
Why this course matters
RAG Systems: Part 1 is for new local AI users who need clean mental models before changing settings. It connects RAG, retrieval, chunking, ingestion and pipelines to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, which setting changes the result, and how do you verify the answer instead of trusting a demo?
What you will be able to do
By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.
How to use this course
Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as "What is RAG?", "RAG Architecture Overview", "PDF Ingestion with PyMuPDF" and "HTML Ingestion", and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.
- 01 What is RAG? (15 min)
  RAG gives LLMs real-time access to your documents by retrieving relevant chunks at query time instead of relying on training data.

  ```python
  # The three stages in pseudo-code
  documents = ingest("your/documents/")
  chunks = chunk(documents)
  index(chunks)
  context = retrieve("user query")
  answer = generate("user query", context)
  ```
- 02 RAG Architecture Overview (20 min)
  RAG architecture flows from documents through loading, chunking, embedding, and storage to retrieval, with metadata tracking provenance at every step.

  ```python
  # Class skeleton showing the architecture
  class RAGPipeline:
      def __init__(self):
          self.loader = None           # Document loader
          self.splitter = None         # Text splitter
          self.embedding_model = None  # Embedding model
          self.vector_store = None     # ChromaDB

      def ingest(self, documents):
          texts = self.loader.load(documents)
          chunks = self.splitter.split(texts)
          embeddings = self.embedding_model.embed(chunks)
          self.vector_store.add(chunks, embeddings)

      def query(self, user_query, top_k=5):
          query_embedding = self.embedding_model.embed([user_query])
          results = self.vector_store.similarity_search(query_embedding, k=top_k)
          return results
  ```
- 03 PDF Ingestion with PyMuPDF (25 min)
  PDF extraction quality depends on layout analysis. Use position-based sorting for multi-column documents and validate output to catch encoding and OCR failures.

  ```python
  import fitz

  # Minimal working example
  doc = fitz.open("document.pdf")
  for page in doc:
      print(page.get_text())
  doc.close()
  ```
- 04 HTML Ingestion (25 min)
  HTML's semantic structure lets you extract content by heading boundaries, preserving contextual relationships that PDFs lack.

  ```python
  from bs4 import BeautifulSoup

  # Minimal working example
  html = "<html><body><h1>Title</h1><p>Content here</p></body></html>"
  # The "lxml" parser needs the lxml package; "html.parser" is the built-in alternative
  soup = BeautifulSoup(html, "lxml")
  print(soup.find("p").get_text())
  ```
- 05 Markdown Ingestion (25 min)
  Markdown's heading syntax maps directly to document hierarchy, making heading-based chunking natural and semantically coherent.

  ```python
  # Minimal working example
  from pathlib import Path

  md_content = Path("readme.md").read_text(encoding="utf-8")
  # Collect the heading lines that mark section boundaries
  headings = [line for line in md_content.split("\n") if line.startswith("#")]
  print(headings)
  ```
- 06 Fixed-Size Chunking (25 min)
  Fixed-size chunking is fast but ignores semantic boundaries, often splitting paragraphs and separating headers from their content.

  ```python
  import tiktoken

  encoder = tiktoken.get_encoding("cl100k_base")
  text = "Hello world"
  tokens = encoder.encode(text)
  print(f"Token count: {len(tokens)}")  # Token count: 2
  ```
- 07 Semantic Chunking (25 min)
  Semantic chunking keeps related sentences together by measuring embedding similarity, producing internally coherent chunks even when token counts vary.

  ```python
  from sentence_transformers import SentenceTransformer

  encoder = SentenceTransformer("all-MiniLM-L6-v2")
  sentences = ["The cat sat on the mat.", "It was a sunny day."]
  embeddings = encoder.encode(sentences)
  print(f"Embedding shape: {embeddings.shape}")  # (2, 384)
  ```
- 08 Recursive Character Splitter (30 min)
  Recursive character splitting respects document structure (paragraphs, lines) while guaranteeing chunk sizes stay within limits by falling back to smaller separators.

  ```python
  # Minimal working example
  text = "Paragraph one.\n\nParagraph two.\n\nParagraph three."
  separator = "\n\n"
  chunks = text.split(separator)
  print(chunks)  # ['Paragraph one.', 'Paragraph two.', 'Paragraph three.']
  ```
- 09 Document Metadata Extraction (30 min)
  Metadata turns retrieval from keyword matching into intelligent filtering, enabling queries like "only documents from this year" or "only from this section."

  ```python
  # Minimal metadata example
  chunk = {
      "text": "The return policy...",
      "metadata": {
          "source": "policies.pdf",
          "year": 2024,
          "section": "electronics"
      }
  }
  ```
- 10 Embedding Pipeline (30 min)
  Embedding quality determines retrieval quality. Batch by token limits to prevent context overflow, and normalize vectors for consistent similarity calculations.

  ```python
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")
  emb = model.encode("Hello world")
  print(f"Embedding: {emb[:5]}... (384 dims)")
  ```
- 11 Storing Embeddings in ChromaDB (30 min)
  ChromaDB stores embeddings alongside metadata, enabling fast similarity search with metadata filtering. Batch insertion and proper indexing are essential for handling large document sets.

  ```python
  import chromadb

  client = chromadb.Client()
  collection = client.get_or_create_collection("test")
  collection.add(ids=["1"], embeddings=[[1.0, 2.0]], documents=["hello"])
  print(collection.query(query_embeddings=[[1.0, 2.0]], n_results=1))
  ```
- 12 Retrieval Strategies (20 min)
  Hybrid search with reranking tends to outperform any single retrieval method across diverse query types.
- 13 Dense Retrieval (20 min)
  Dense retrieval quality depends more on embedding model choice and fine-tuning than on index parameters.
- 14 Sparse Retrieval (BM25) (20 min)
  BM25 excels at exact-term queries but needs to be paired with dense retrieval to handle semantic queries effectively.
- 15 Context Assembly (20 min)
  Context assembly quality matters as much as retrieval quality. Well-organized context reduces hallucinations caused by confusing source ordering.
- 16 Prompt with Retrieved Context (25 min)
  Requiring citations in the prompt reduces hallucination by forcing the model to attribute each claim to retrieved context.
- 17 Basic Generation Pipeline (20 min)
  Pipeline quality depends on the weakest link. Optimize retrieval quality first, because generation cannot fix poor context.
- 18 RAG Evaluation: Hit Rate (25 min)
  Set hit rate targets based on application tolerance for missed information, not arbitrary thresholds.
- 19 RAG Evaluation: MRR (25 min)
  MRR captures ranking quality. A system with a high hit rate but low MRR retrieves relevant content but ranks it poorly, which is the problem reranking addresses.
- 20 Common RAG Failures (25 min)
  80% of RAG failures trace to retrieval problems, not generation problems. Debug retrieval before adjusting prompts or models.
- 21 RAG Pipeline Optimization (25 min)
  Optimize the bottleneck stage first. For most RAG systems, LLM generation is the latency bottleneck, so try a faster model before tuning retrieval speed.
- 22 Part 1 Final Project (30 min)
  This capstone integrates all course concepts. A well-organized pipeline with proper configuration management is more maintainable than clever one-liners.
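
The Recursive Character Splitter chapter describes falling back to smaller separators when a piece is still too large. The sketch below is one possible reading of that idea, not the course's implementation: the function name `recursive_split`, the character-based limit `max_chars`, and the separator order are assumptions chosen for illustration.

```python
from typing import List, Sequence


def recursive_split(text: str, max_chars: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones if needed."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut by character count.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


text = ("Paragraph one.\n\n"
        "Paragraph two is a much longer paragraph.\nIt has two lines.")
chunks = recursive_split(text, max_chars=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Every chunk stays within `max_chars`, and the short first paragraph survives intact because the paragraph separator is tried first. Pieces produced at different recursion levels are not merged back together here, which a production splitter would normally do.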
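
The Retrieval Strategies chapter recommends combining retrieval methods. The course summary does not say how the dense and BM25 rankings are merged, so the sketch below assumes reciprocal rank fusion (RRF), a common choice. The function name, the constant `k=60`, and the chunk IDs are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: Dict[str, float] = {}
    for ranked in rankings:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical rankings from a dense retriever and a BM25 retriever.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Chunks that appear near the top of both lists rise to the front, while chunks found by only one retriever are kept but ranked lower. A reranker, if used, would then reorder the fused candidates.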
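
The two evaluation chapters describe hit rate and MRR in words. A minimal sketch of both metrics follows, assuming each query has a ranked list of retrieved chunk IDs and a set of known-relevant IDs. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def hit_rate(results: Dict[str, List[str]],
             relevant: Dict[str, Set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(
        1 for query, ranked in results.items()
        if any(chunk_id in relevant[query] for chunk_id in ranked[:k])
    )
    return hits / len(results)


def mean_reciprocal_rank(results: Dict[str, List[str]],
                         relevant: Dict[str, Set[str]]) -> float:
    """Mean of 1 / rank of the first relevant chunk (0 if none is retrieved)."""
    total = 0.0
    for query, ranked in results.items():
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(results)


# Hypothetical evaluation set: ranked retrievals and ground-truth relevance.
results = {
    "q1": ["c3", "c1", "c7"],  # first relevant chunk at rank 2
    "q2": ["c4", "c5", "c6"],  # first relevant chunk at rank 3
    "q3": ["c9", "c8", "c2"],  # nothing relevant retrieved
}
relevant = {"q1": {"c1"}, "q2": {"c6"}, "q3": {"c0"}}

print(f"hit rate@3: {hit_rate(results, relevant, k=3):.3f}")
print(f"MRR:        {mean_reciprocal_rank(results, relevant):.3f}")
```

In this toy set two of three queries are hits at k=3, yet the relevant chunks sit at ranks 2 and 3, so MRR is much lower than hit rate. That gap is the pattern the MRR chapter associates with poor ranking.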