RAG Architecture Overview — RAG Systems: Part 1 (Chapter 2)

Before writing code, you need a mental model of how the pieces fit together. A RAG pipeline has six components that run in a fixed order: the first four build the index ahead of time, and the last two run for every user query.

Component diagram

Documents (PDF, HTML, MD)
    │
    ▼
┌─────────────────┐
│  Document       │
│  Loader         │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Text Splitter  │
│  (Chunking)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Embedding      │
│  Model          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Store   │
│  (ChromaDB)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Retriever      │◄── User Query
│  (Similarity)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLM            │───► Answer
│  (Generator)    │
└─────────────────┘

The data flow

Document Loader: Reads files from disk or URLs. Returns raw text and metadata (filename, page number, source URL).
Text Splitter: Breaks raw text into chunks. This is where domain knowledge matters. A legal document needs different chunking than a blog post.
Embedding Model: Converts each chunk into a vector (list of numbers). Chunks with similar meaning get similar vectors. This is the mathematical heart of retrieval.
Vector Store: Stores chunks alongside their embeddings. Provides fast similarity search. ChromaDB is the vector database used in this course.
Retriever: Takes a user query, embeds it with the same embedding model used for the chunks, finds the most similar chunks in the vector store, and returns them as context.
LLM: Takes the user query plus the retrieved context and generates the answer.
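
The flow above can be sketched end to end in a few lines. This is a toy illustration, not the course's implementation: the word-count `embed` function stands in for a real embedding model, a plain list stands in for ChromaDB, and the sample texts and the `retrieve` and `build_prompt` helpers are made up for the example. The final LLM call is left out; the sketch stops at the prompt that would be sent to it.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=2):
    """Embed the query and return the texts of the k most similar chunks."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item["embedding"]), reverse=True)
    return [item["text"] for item in ranked[:k]]


def build_prompt(query, context_chunks):
    """Combine retrieved context and the user query into one prompt string."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Indexing: chunks are embedded once and stored next to their vectors.
chunks = [
    "Electronics can be returned within 30 days.",  # invented sample text
    "Shipping takes five business days.",
    "Gift cards are not refundable.",
]
store = [{"text": c, "embedding": embed(c)} for c in chunks]

# Query time: retrieve context, then hand the prompt to the LLM (not shown).
query = "can electronics be returned"
top = retrieve(query, store, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The split between the indexing lines and the query-time lines mirrors the diagram: everything down to the vector store happens before any user query arrives.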

Chunk size matters more than you think

Chunk size determines what the retriever can return. Too small and you lose context. Too large and you dilute relevance.

Typical range: 256 to 2048 tokens per chunk.

256 tokens: High precision, low recall. Good for factual queries with exact matches.
512 tokens: Balanced. Good starting point for most use cases.
1024 tokens: More context, lower precision. Good for summaries and narratives.
2048 tokens: High recall, low precision. Risk of including irrelevant surrounding text.

Overlap between adjacent chunks (for example, 20% of the chunk size) helps keep text that straddles a chunk boundary intact in at least one chunk.
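
A minimal sketch of fixed-size chunking with overlap may make the mechanics concrete. It assumes the text is already tokenized into a list; the function name `chunk_tokens` and its parameters are illustrative, and a real splitter would typically count tokens with the embedding model's own tokenizer and respect sentence or section boundaries.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.2):
    """Split a token list into fixed-size chunks that share a tail/head overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers so the overlap is easy to see: 25 tokens, size 10, 20% overlap.
tokens = [f"tok{i}" for i in range(25)]
chunks = chunk_tokens(tokens, chunk_size=10, overlap_ratio=0.2)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} tokens)")
```

With these settings the window advances 8 tokens at a time, so the last 2 tokens of each chunk reappear as the first 2 tokens of the next one.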

Embedding dimensions

Modern embedding models produce vectors with roughly 384 to 3072 dimensions. Higher dimensions can capture more nuanced relationships but cost more storage and compute. At the small end, sentence-transformers/all-MiniLM-L6-v2 produces 384-dimensional vectors. At the large end, OpenAI's text-embedding-3-large produces 3072-dimensional vectors.

For most use cases, 384 to 768 dimensions is sufficient. The marginal improvement from higher dimensions rarely justifies the storage cost.
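
The storage side of that trade-off is simple arithmetic. The sketch below assumes each dimension is stored as a 4-byte float32 and counts only the raw vectors; index structures, chunk text, and metadata add more on top. The function name `index_size_bytes` and the one-million-chunk corpus are illustrative choices.

```python
def index_size_bytes(num_chunks, dimensions, bytes_per_float=4):
    """Raw vector storage only: chunks x dimensions x bytes per value."""
    return num_chunks * dimensions * bytes_per_float


# Compare common dimension counts for a hypothetical corpus of 1M chunks.
for dims in (384, 768, 1536, 3072):
    size_mb = index_size_bytes(1_000_000, dims) / 1_000_000
    print(f"{dims} dims: {size_mb:,.0f} MB")
```

Storage grows linearly with dimension count, so moving from 384 to 3072 dimensions multiplies raw vector storage by eight.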

Metadata: the underappreciated component

Every chunk should carry metadata: source document, page number, chunk index, section, creation date. This metadata serves two purposes:

Debugging: When a retrieval fails, metadata tells you which document caused the problem.
Filtering: You can filter retrieval by metadata. "Only search in documents created after 2024" or "Only search in the FAQ section."

# Example chunk with metadata
{
    "text": "The return policy applies to all electronics...",
    "metadata": {
        "source": "policies.pdf",
        "page": 3,
        "chunk_index": 7,
        "section": "electronics"
    },
    "embedding": [0.123, -0.456, ...]  # 384 floats
}
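
Metadata filtering can be sketched in plain Python using the chunk shape shown above. This is an illustration under assumptions, not a vector store's API: the `filter_chunks` helper, the `created` field, and the sample dates and texts are hypothetical, and embeddings are omitted for brevity. A vector database would normally apply such a filter inside the similarity query itself.

```python
from datetime import date


def filter_chunks(chunks, predicate):
    """Keep only the chunks whose metadata satisfies the predicate."""
    return [c for c in chunks if predicate(c["metadata"])]


# Hypothetical chunks; "created" is an assumed metadata field.
chunks = [
    {"text": "The return policy applies to all electronics...",
     "metadata": {"source": "policies.pdf", "section": "electronics",
                  "created": date(2023, 6, 1)}},
    {"text": "How do I reset my password?",
     "metadata": {"source": "faq.md", "section": "faq",
                  "created": date(2024, 3, 15)}},
    {"text": "Standard shipping is available for all orders.",
     "metadata": {"source": "policies.pdf", "section": "shipping",
                  "created": date(2024, 9, 1)}},
]

# "Only search in the FAQ section."
faq_only = filter_chunks(chunks, lambda m: m["section"] == "faq")

# Only chunks created on or after 1 January 2024.
recent = filter_chunks(chunks, lambda m: m["created"] >= date(2024, 1, 1))

print([c["metadata"]["source"] for c in faq_only])
print([c["metadata"]["section"] for c in recent])
```

Filtering first and ranking by similarity second keeps the retriever from returning chunks that are semantically close but come from the wrong document set.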