01. Why Evaluate RAG?
A RAG system has two failure modes: retrieval fetches the wrong documents, or the language model generates poor answers from correct documents. Without evaluation, these failures remain invisible until they cause problems in production.
The naive approach is eyeballing outputs: ask a few questions, glance at the answers, and declare success if nothing looks egregious. This method has systematic flaws. First, human judgment drifts across dozens of queries: the same answer can seem acceptable or problematic depending on mood and context. Second, eyeballing cannot catch gradual degradation, the subtle quality erosion that builds up over weeks of incremental changes. Third, manual inspection gives no repeatable signal for whether a change helps or hurts.
Evaluation solves these problems by converting quality into numbers. When every change runs through an evaluation suite, developers see concrete scores before and after. A modification that raises average retrieval relevance by 0.12 is clearly beneficial. A prompt change that drops faithfulness from 0.94 to 0.81 is grounds for rollback. Numbers enable decisions that intuition cannot support.
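The before-and-after comparison described above can be sketched as a small regression gate. This is a minimal illustration and not part of RAGAS: the function name `compare_runs`, the `max_drop` threshold of 0.05, and the per-query scores are hypothetical, chosen only to mirror the numbers in the text.

```python
from statistics import mean


def compare_runs(baseline, candidate, max_drop=0.05):
    """Compare mean scores per metric between two evaluation runs.

    baseline, candidate: dicts mapping metric name -> list of per-query scores.
    max_drop: hypothetical tolerance; a larger drop in the mean counts as a regression.
    Returns (deltas, regressions).
    """
    deltas = {}
    regressions = []
    for metric, base_scores in baseline.items():
        delta = mean(candidate[metric]) - mean(base_scores)
        deltas[metric] = round(delta, 4)
        if delta < -max_drop:
            regressions.append(metric)
    return deltas, regressions


# Illustrative scores only (not measured results).
baseline = {
    "faithfulness": [0.95, 0.93, 0.94],
    "relevance": [0.70, 0.72, 0.71],
}
candidate = {
    "faithfulness": [0.80, 0.82, 0.81],
    "relevance": [0.83, 0.84, 0.82],
}

deltas, regressions = compare_runs(baseline, candidate)
print("deltas:", deltas)
print("regressed metrics:", regressions)
```

With these illustrative inputs, relevance improves while faithfulness falls past the tolerance, so the change would be flagged for review even though one metric went up. The right tolerance depends on how noisy your scores are from run to run.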
The cost of poor evaluation compounds over time. RAG systems that bypass evaluation accumulate technical debt. Architecture decisions get made without evidence. Teams cannot distinguish between valid approaches and superstition. Eventually, the system reaches a state where no one knows what it actually does, only what it sometimes appears to do.
RAGAS provides a principled framework for measuring generation quality. The library uses LLMs to assess each answer against its query and the retrieved contexts, so not every evaluation needs human labels. This makes evaluation sustainable at scale and repeatable across runs.
The metrics covered in this course form a complete picture. Retrieval metrics tell you whether the right information enters the system. RAGAS metrics tell you whether the system produces correct information from what it retrieved.
Draft three queries relevant to your application domain. For each, manually assess whether a human would find the correct answer in your document set. Write down the queries and their expected sources; this becomes your evaluation seed set.
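One way to record the seed set from this exercise is as a small structured file that can be versioned with your code. The sketch below is an assumption about layout, not a required format: the `SeedQuery` class, the `hit_rate` helper, the placeholder queries, the document paths, and the `retrieved` dictionary standing in for retriever output are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SeedQuery:
    query: str                   # a question a user might ask
    expected_sources: list[str]  # documents where a human would find the answer


# Placeholder entries; replace them with queries from your own domain.
seed_set = [
    SeedQuery("How do I reset my password?", ["docs/account.md"]),
    SeedQuery("What is the refund window?", ["docs/billing.md", "docs/terms.md"]),
    SeedQuery("Which regions are supported?", ["docs/regions.md"]),
]


def hit_rate(seed_set, retrieved):
    """Fraction of seed queries with at least one expected source retrieved."""
    hits = sum(
        1
        for item in seed_set
        if set(item.expected_sources) & set(retrieved.get(item.query, []))
    )
    return hits / len(seed_set)


# Serialize so the seed set can be stored alongside the code.
seed_json = json.dumps([asdict(item) for item in seed_set], indent=2)

# Hypothetical retriever output, keyed by query text.
retrieved = {
    "How do I reset my password?": ["docs/account.md", "docs/faq.md"],
    "What is the refund window?": ["docs/faq.md"],
    "Which regions are supported?": ["docs/regions.md"],
}

print(seed_json)
print("hit rate:", hit_rate(seed_set, retrieved))
```

Even three entries are enough to run a first retrieval check; the set can grow as you find queries that expose weaknesses.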