How to Evaluate RAG Pipeline Performance
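Prerequisites
A running RAG pipeline and the ragas library installed. The evaluator LLM used in the steps also needs a local Ollama server and the langchain-ollama package.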
What this does
Measuring a RAG pipeline requires more than asking a few questions manually. Quantitative evaluation reveals where retrieval fails, where generation drifts, and whether the system handles edge cases. This guide covers using the ragas library to compute faithfulness, answer relevance, context precision, and context recall against a curated test set.
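To make the scores less abstract, here is a minimal sketch of the arithmetic behind three of them, assuming the per-item judgments are already known. In ragas those judgments come from the evaluator LLM; the hand-labeled boolean lists and the function names below are hypothetical and only illustrate the ratios. Answer relevancy is left out because it depends on embedding similarity rather than a simple ratio.

```python
from typing import List


def supported_fraction(verdicts: List[bool]) -> float:
    """Share of items judged as supported (used here for faithfulness and recall)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def rank_aware_precision(relevant: List[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Hypothetical hand-labeled judgments for a single test sample.
response_claims_supported = [True, True, False]  # answer claims backed by the context
reference_claims_covered = [True, False]         # reference claims found in the context
retrieved_chunk_relevant = [True, False, True]   # relevance of each chunk, in rank order

faithfulness_score = supported_fraction(response_claims_supported)
context_recall_score = supported_fraction(reference_claims_covered)
context_precision_score = rank_aware_precision(retrieved_chunk_relevant)

print(round(faithfulness_score, 3), context_recall_score, round(context_precision_score, 3))
```

The rank-aware form rewards retrievers that place relevant chunks early: moving the irrelevant chunk from rank 2 to rank 3 would raise the precision score in this sketch to 1.0.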
Steps
Set up the evaluation dataset. Structure the test data as a dictionary of lists.
    import os

    # Record the Ollama endpoint; the evaluator LLM below is pointed at the same URL.
    os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

    from ragas import EvaluationDataset

    eval_data = {
        "user_input": [
            "What is retrieval-augmented generation?",
            "How does embedding work?",
        ],
        "retrieved_contexts": [
            ["RAG combines retrieval with generation models."],
            ["Embeddings convert text into numerical vectors."],
        ],
        "response": [
            "RAG augments generation with retrieved documents.",
            "Embeddings map text to vectors for similarity search.",
        ],
        "reference": [
            "RAG is a technique that uses retrieved documents to ground LLM responses.",
            "An embedding is a numerical representation of text used in similarity comparisons.",
        ],
    }

    # from_list takes one dict per sample, so transpose the dictionary of lists first.
    samples = [dict(zip(eval_data, row)) for row in zip(*eval_data.values())]
    dataset = EvaluationDataset.from_list(samples)

Import the evaluation metrics. Ragas provides multiple metric evaluators.
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )

Initialize an evaluator LLM. Use Ollama as the evaluation backend.
    from ragas.llms import LangchainLLMWrapper
    from langchain_ollama import ChatOllama

    # Wrap the LangChain chat model so ragas can use it as the judge LLM.
    eval_llm = LangchainLLMWrapper(
        ChatOllama(model="llama3", base_url="http://localhost:11434")
    )

Run the evaluation. Score each metric across the dataset.
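    from ragas import evaluate

    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        llm=eval_llm,
    )
    print(result)

Expected output: a summary with one aggregate score per metric. Call result.to_pandas() to get a table of per-sample scores. Note that answer_relevancy also relies on an embedding model; evaluate accepts an embeddings argument for this, and without it ragas falls back to its default embedding provider, which may require separate credentials.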
Interpret the results. Aim for context precision above 0.8 and faithfulness above 0.9 in well-tuned systems. Low recall signals missing relevant documents; low faithfulness signals hallucination.
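The targets above can be turned into a small check that runs after each evaluation. This is a minimal sketch, assuming the aggregate scores have already been copied into a plain metric-name-to-value dictionary; the helper name and the example numbers are hypothetical and do not come from a real run.

```python
from typing import Dict, List

# Targets taken from the guidance above; other metrics get no threshold here.
TARGETS: Dict[str, float] = {"context_precision": 0.8, "faithfulness": 0.9}


def flag_low_scores(scores: Dict[str, float], targets: Dict[str, float] = TARGETS) -> List[str]:
    """Return a message for each metric that is missing or not above its target."""
    messages = []
    for metric, target in targets.items():
        score = scores.get(metric)
        if score is None:
            messages.append(f"{metric}: no score reported")
        elif score <= target:
            messages.append(f"{metric}: {score:.2f} is not above target {target:.2f}")
    return messages


# Hypothetical aggregate scores, for illustration only.
example_scores = {"faithfulness": 0.84, "context_precision": 0.91, "context_recall": 0.62}
for line in flag_low_scores(example_scores):
    print(line)
```

Keeping the thresholds in one dictionary makes it easy to tighten them as the pipeline improves, or to fail a CI job when the returned list is not empty.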
Verification
python -c "
from ragas.metrics import faithfulness
print(hasattr(faithfulness, 'name'))
# Expected: True
"
Common failures
- Insufficient test samples. Fewer than 10 samples produce noisy scores; aim for 20 or more.
- Ragas LLM not connecting. Verify Ollama is running at http://localhost:11434 before evaluation.
- Reference answers missing. Without ground-truth answers, context recall cannot be computed reliably.
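- Metric timeout on long responses. Raise the timeout in the ragas run configuration (a RunConfig with a larger timeout value, passed to evaluate as run_config) when the evaluator LLM generates long explanations.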
- Version mismatch. The installed package or runtime differs from the one this guide assumes; check the installed version first, then rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
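Several of the failures above can be caught before any metric runs. The following is a rough preflight sketch using only the standard library; the function names, the 20-sample default, and the package names checked are assumptions for illustration, not part of ragas.

```python
import sys
import urllib.error
import urllib.request
from importlib import metadata
from typing import Dict, List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def dataset_problems(data: Dict[str, list], min_samples: int = 20) -> List[str]:
    """List structural issues in a dictionary-of-lists test set."""
    problems = []
    lengths = {key: len(values) for key, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"columns have different lengths: {lengths}")
    n = min(lengths.values(), default=0)
    if n < min_samples:
        problems.append(f"only {n} samples; scores may be noisy below {min_samples}")
    references = data.get("reference", [])
    if not references or any(not str(r).strip() for r in references):
        problems.append("missing or empty reference answers")
    return problems


def endpoint_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP request to url gets any response, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (OSError, ValueError):
        return False  # connection refused, timeout, or malformed URL


if __name__ == "__main__":
    print("python:", sys.executable)  # shows which environment is active
    for name in ("ragas", "langchain-ollama"):
        print(name, installed_version(name) or "not installed")
    print("ollama reachable:", endpoint_reachable("http://localhost:11434"))
    tiny = {"user_input": ["q"], "response": ["a"], "retrieved_contexts": [["c"]]}
    for problem in dataset_problems(tiny):
        print("dataset:", problem)
```

Running this before the evaluation separates environment problems (wrong interpreter, missing package, Ollama not running) from problems in the test set itself.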