How to build a RAG evaluation pipeline
Requires a running RAG system and ragas or a comparable evaluation library.
What this does
A RAG evaluation pipeline measures the quality of a Retrieval-Augmented Generation system across several dimensions: context relevance, answer faithfulness, answer relevance, and retrieval precision. The pipeline runs a set of test queries through the RAG system, collects the retrieved documents and generated answers, then scores each dimension using a combination of LLM-as-judge metrics and traditional information retrieval metrics. The output is a dashboard of scores that guides decisions about retrieval tuning, prompt engineering, and chunking strategy.
Steps
Prepare the evaluation dataset. Build a Hugging Face Dataset with the fields question, answer (the RAG system's generated answer), contexts (the list of retrieved document chunks), and ground_truth (the expected answer). Column names have changed between ragas releases, so check them against the installed version.
Create the evaluation script. Import the ragas entry point and metrics: from ragas import evaluate and from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall.
Define a RAGEvaluator class. Pass the RAG query function in as a dependency.
Collect the outputs. For each test question, run the RAG system and record the generated answer and the retrieved contexts.
Run the metrics. Call result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall]). Faithfulness measures how well the answer is grounded in the retrieved context by decomposing the answer into claims and checking each claim against the context. Answer relevancy checks whether the answer addresses the question. Context precision measures how much of the retrieved context is relevant, and context recall measures how much of the relevant information was retrieved.
Aggregate the scores. Compute overall scores and per-category breakdowns.
Store the results. Write them to a JSON file with a timestamp so trends can be tracked.
Schedule the evaluation. Set up a recurring job (cron or a CI pipeline) that runs daily and alerts if any metric drops below a threshold.
Generate a report. Produce a simple HTML report with metric trend charts.
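The collection, aggregation, storage, and threshold steps above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the `(answer, contexts)` return shape of the RAG function, the helper names `aggregate`, `save_results`, and `metrics_below_threshold`, and the JSON layout are all assumptions. The ragas call is shown only as a comment because it needs `datasets`, `ragas`, and a configured judge LLM.

```python
import json
import math
from collections import defaultdict
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

# Assumed signature: the RAG system returns (answer, retrieved_chunks) for a question.
RagQueryFn = Callable[[str], Tuple[str, List[str]]]


class RAGEvaluator:
    """Runs test questions through an injected RAG function and collects the outputs."""

    def __init__(self, rag_query: RagQueryFn):
        self.rag_query = rag_query

    def collect(self, test_cases: List[dict]) -> Dict[str, list]:
        """Return columns in the question/answer/contexts/ground_truth layout."""
        columns: Dict[str, list] = {
            "question": [], "answer": [], "contexts": [], "ground_truth": [],
        }
        for case in test_cases:
            answer, contexts = self.rag_query(case["question"])
            columns["question"].append(case["question"])
            columns["answer"].append(answer)
            columns["contexts"].append(list(contexts))
            columns["ground_truth"].append(case["ground_truth"])
        return columns


# Scoring is not executed here. With the dependencies installed it would look like:
#   from datasets import Dataset
#   from ragas import evaluate
#   from ragas.metrics import (faithfulness, answer_relevancy,
#                              context_precision, context_recall)
#   result = evaluate(Dataset.from_dict(columns),
#                     metrics=[faithfulness, answer_relevancy,
#                              context_precision, context_recall])
#   rows = result.to_pandas().to_dict("records")  # one dict of scores per question


def aggregate(rows: List[dict], categories: List[str], metrics: List[str]) -> dict:
    """Mean score per metric, overall and per category, skipping missing or NaN scores."""
    buckets = defaultdict(lambda: defaultdict(list))
    for row, category in zip(rows, categories):
        for name in metrics:
            score = row.get(name)
            if score is not None and not math.isnan(score):
                buckets["overall"][name].append(score)
                buckets[category][name].append(score)
    return {
        group: {name: sum(vals) / len(vals) for name, vals in per_metric.items()}
        for group, per_metric in buckets.items()
    }


def save_results(summary: dict, path: str) -> None:
    """Write the summary to JSON with a UTC timestamp for trend tracking."""
    payload = {"timestamp": datetime.now(timezone.utc).isoformat(), "scores": summary}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)


def metrics_below_threshold(summary: dict, thresholds: Dict[str, float]) -> List[str]:
    """Names of metrics whose overall score is missing, NaN, or under its threshold."""
    overall = summary.get("overall", {})
    return [
        name for name, limit in thresholds.items()
        if not overall.get(name, math.nan) >= limit
    ]
```

Injecting the query function keeps the evaluator testable against a stub, and a scheduled job could call `metrics_below_threshold` to decide whether to raise an alert.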
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
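One way to capture the starting state and run evidence is a small helper that gathers the interpreter path, package versions, and observed output into a single record. This sketch uses only the standard library. The function names, the field names, and the default package list are illustrative assumptions, not part of any library.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def package_version(name: str) -> str:
    """Installed version of a package, or a marker string if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def run_evidence(command: str, judge_model: str, observed_output: str,
                 packages=("ragas", "datasets")) -> dict:
    """Collect what is needed to reproduce a run later (field names are illustrative)."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python_executable": sys.executable,
        "python_version": platform.python_version(),
        "packages": {name: package_version(name) for name in packages},
        "judge_model": judge_model,
        "observed_output": observed_output,
    }


def save_evidence(evidence: dict, path: str) -> None:
    """Persist the evidence record as JSON next to the evaluation results."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(evidence, fh, indent=2)
```

Printing `run_evidence(...)` before changing anything also covers the starting-state check, since it shows which interpreter and package versions are active.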
Verification
- Run the evaluation on a small test set of 5 queries and confirm that all 4 metrics return scores between 0 and 1.
- Spot-check the outputs against the retrieved contexts and ground truth: answers with a high faithfulness score should closely match what the retrieved contexts say.
- Test with intentionally bad retrieval (empty contexts): faithfulness should drop to near 0.
- Verify that the pipeline runs end-to-end in under 5 minutes for a 50-query dataset.
- Check the output JSON file for valid structure and all expected metric fields.
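The range, structure, and timing checks above could be automated with a few helpers. This is a sketch under stated assumptions: the results file is taken to hold the metric scores under `scores` then `overall`, which is an illustrative layout, and the function names are hypothetical.

```python
import json
import math
import time

EXPECTED_METRICS = ("faithfulness", "answer_relevancy",
                    "context_precision", "context_recall")


def check_scores(scores: dict, expected=EXPECTED_METRICS) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name in expected:
        if name not in scores:
            problems.append(f"missing metric: {name}")
            continue
        value = scores[name]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or math.isnan(value):
            problems.append(f"not a valid number: {name}")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"out of range [0, 1]: {name}={value}")
    return problems


def check_results_file(path: str) -> list:
    """Load the results JSON (assumed layout: scores -> overall) and check it."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    overall = payload.get("scores", {}).get("overall")
    if not isinstance(overall, dict):
        return ["missing scores.overall section"]
    return check_scores(overall)


def run_within_budget(fn, budget_seconds: float = 300.0):
    """Run fn() and report whether it finished within the time budget."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed <= budget_seconds
```

Returning a list of problems instead of raising on the first one makes it easier to see every failed check from a single run.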
Common failures
- Ragas metrics return NaN - This typically means the judge LLM's output could not be parsed. Check that the model returns JSON in the expected schema.
- Evaluation dataset too small - As a rule of thumb, at least 30 queries per category are needed for statistically meaningful scores; use ragas synthetic test generation to expand small datasets.
- Context and answer length mismatch - Token limits on the evaluation model can truncate long contexts. Split contexts and answers into chunks that fit within the model's context window.
- Inconsistent scores between runs - LLM-as-judge approaches have inherent variance. Set the judge temperature to 0, or run the evaluation 3 times and average the scores.
- Metric misalignment with human judgment - Periodically have a human reviewer score 10 samples and check how well those scores correlate with the metric scores.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
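Two of the mitigations above, averaging repeated runs and comparing metric scores with human ratings, can be sketched in plain Python. The helper names are hypothetical. Pearson correlation is used here as one reasonable choice; the guide does not prescribe a particular correlation measure, and a rank correlation would also be defensible.

```python
import math
from statistics import mean


def average_runs(runs: list) -> dict:
    """Average each metric across repeated runs, skipping NaN scores."""
    names = sorted({name for run in runs for name in run})
    averaged = {}
    for name in names:
        values = [run[name] for run in runs
                  if name in run and not math.isnan(run[name])]
        # If every run produced NaN for this metric, keep NaN so the gap is visible.
        averaged[name] = mean(values) if values else math.nan
    return averaged


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation between metric scores and human scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return math.nan  # correlation is undefined when one side is constant
    return cov / (sx * sy)
```

A low or negative correlation on the human-reviewed samples suggests the metric is not tracking what reviewers care about, which is worth investigating before relying on its trend.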
Related guides
- use-dspy-prompt-optimization
- setup-prompt-layer-prompt-management
- implement-ab-testing-model-responses