What this does

Measuring a RAG pipeline requires more than asking a few questions manually. Quantitative evaluation reveals where retrieval fails, where generation drifts, and whether the system handles edge cases. This guide covers using the ragas library to compute faithfulness, answer relevance, context precision, and context recall against a curated test set.

Steps

Set up the evaluation dataset. Structure the test data as a dictionary of lists, one list per field, where position i in every list belongs to sample i.

import os
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

from ragas import EvaluationDataset

eval_data = {
    "user_input": [
        "What is retrieval-augmented generation?",
        "How does embedding work?",
    ],
    "retrieved_contexts": [
        ["RAG combines retrieval with generation models."],
        ["Embeddings convert text into numerical vectors."],
    ],
    "response": [
        "RAG augments generation with retrieved documents.",
        "Embeddings map text to vectors for similarity search.",
    ],
    "reference": [
        "RAG is a technique that uses retrieved documents to ground LLM responses.",
        "An embedding is a numerical representation of text used in similarity comparisons.",
    ],
}
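# Convert the column-oriented dict into one record per sample,
# then build the dataset from that list of records.
samples = [dict(zip(eval_data, row)) for row in zip(*eval_data.values())]
dataset = EvaluationDataset.from_list(samples)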
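
Because the columns must line up sample by sample, it can help to check them before building the dataset. The following is a minimal sketch in plain Python; the helper name `validate_columns` and the `REQUIRED_FIELDS` tuple are hypothetical and not part of ragas. The field names are taken from the dictionary above.

```python
# Hypothetical pre-flight check, not part of the ragas API.
REQUIRED_FIELDS = ("user_input", "retrieved_contexts", "response", "reference")


def validate_columns(columns, required=REQUIRED_FIELDS):
    """Return the number of samples, or raise ValueError on a malformed dict."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    lengths = {name: len(columns[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()))


# Tiny illustrative input with one sample per column.
demo = {
    "user_input": ["What is retrieval-augmented generation?"],
    "retrieved_contexts": [["RAG combines retrieval with generation models."]],
    "response": ["RAG augments generation with retrieved documents."],
    "reference": ["RAG grounds LLM responses in retrieved documents."],
}
print("samples:", validate_columns(demo))
```

Calling the check on eval_data before constructing the dataset surfaces a missing or short column as a clear error instead of a failure deep inside the evaluation run.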

Import the evaluation metrics. Ragas exposes each metric as a ready-made object that can be passed to the evaluation call.

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
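
To make one of these metrics concrete: the ragas documentation describes context precision as a rank-weighted precision over the retrieved chunks. The sketch below covers only that arithmetic and assumes the per-chunk relevance verdicts are already known (in ragas they come from the evaluator LLM). The function name is hypothetical, and the library's actual implementation may differ in detail.

```python
def rank_weighted_precision(verdicts):
    """Average of precision@k taken at each rank k that holds a relevant chunk.

    verdicts[i] is 1 if the chunk at rank i + 1 was judged relevant, else 0.
    Illustrative only; not the ragas implementation.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank  # precision@rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# Same two relevant chunks, ranked differently.
print(rank_weighted_precision([1, 1, 0]))
print(rank_weighted_precision([0, 1, 1]))
```

The second call scores lower than the first even though both retrieve the same number of relevant chunks, which is the point of the metric: it rewards placing relevant context near the top.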

Initialize an evaluator LLM. Use Ollama as the evaluation backend.

from ragas.llms import LangchainLLMWrapper
from langchain_ollama import ChatOllama

# Wrap the LangChain chat model so ragas can use it as the judge LLM.
eval_llm = LangchainLLMWrapper(ChatOllama(model="llama3", base_url="http://localhost:11434"))
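
Since every metric call goes through this server, a quick reachability check before a long run can save time. The following is a minimal sketch using only the standard library. It assumes Ollama's model-listing endpoint `/api/tags` is available at the base URL used above; the helper name `ollama_reachable` is hypothetical.

```python
import urllib.request


def ollama_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if a GET to the server's /api/tags endpoint answers with 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_reachable())
```

A False result means the evaluation would fail on its first LLM call, so it is worth fixing the server before continuing.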

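Run the evaluation. Score each metric across the dataset. Note that answer_relevancy also relies on an embedding model; if none is supplied through the embeddings argument of evaluate, ragas falls back to its default provider, which may not be the local Ollama instance.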

from ragas import evaluate

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=eval_llm,
)
print(result)

Expected output: the aggregate score for each metric, averaged over all test samples. For a per-sample table, call result.to_pandas().

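Interpret the results. Each score lies between 0 and 1, and higher is better. As a rule of thumb, aim for context precision above 0.8 and faithfulness above 0.9 in well-tuned systems. Low context recall signals that retrieval is missing relevant documents; low faithfulness signals that the response makes claims the retrieved context does not support (hallucination).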
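
The rule of thumb above can be turned into an automated check. The following is a minimal sketch that assumes the aggregate scores have been copied into a plain dictionary; `TARGETS`, `below_target`, and the example scores are hypothetical and made up for illustration, not real evaluation output.

```python
# Targets taken from the rule of thumb in this guide; adjust for your system.
TARGETS = {"context_precision": 0.8, "faithfulness": 0.9}


def below_target(scores, targets=TARGETS):
    """Return {metric: (score, target)} for every metric scoring under its target."""
    return {
        name: (scores[name], target)
        for name, target in targets.items()
        if name in scores and scores[name] < target
    }


# Made-up scores for illustration only.
example_scores = {
    "context_precision": 0.74,
    "faithfulness": 0.93,
    "context_recall": 0.61,
}
for name, (score, target) in below_target(example_scores).items():
    print(f"{name}: {score:.2f} is below the {target:.2f} target")
```

Metrics without a target, such as context_recall here, are ignored rather than flagged; add an entry to the targets dictionary once a sensible threshold is known for the system.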

Verification

python -c "
from ragas.metrics import faithfulness
print(hasattr(faithfulness, 'name'))
# Expected: True
"

Common failures

Insufficient test samples. Fewer than 10 samples produce noisy scores; aim for 20 or more.
Evaluator LLM not connecting. Verify that Ollama is running at http://localhost:11434 and that the llama3 model is available before starting the evaluation.
Reference answers missing. Without ground-truth answers in the reference field, context recall cannot be computed reliably.
Metric timeout on long responses. Long responses take the evaluator LLM longer to judge; raise the timeout in the ragas run configuration (for example, a RunConfig with a larger timeout passed to evaluate through run_config).
Version mismatch. The installed package or runtime differs from the one the commands assume; check the installed version first, then rerun the smallest verification command.
Local environment drift. A different service, virtual environment, model, or path is in use; print the active binary path and configuration before changing the guide steps.
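
To see why a small test set gives noisy scores, compare the standard error of the mean for different sample counts. The following is a minimal sketch using only the standard library; the helper name `mean_with_stderr` is hypothetical and the per-sample scores are made up for illustration.

```python
import math
import statistics


def mean_with_stderr(per_sample_scores):
    """Return (mean, standard error of the mean) for a list of per-sample scores."""
    n = len(per_sample_scores)
    if n < 2:
        raise ValueError("Need at least two samples to estimate spread")
    mean = statistics.fmean(per_sample_scores)
    stderr = statistics.stdev(per_sample_scores) / math.sqrt(n)
    return mean, stderr


# Made-up per-sample scores: 5 samples, then the same pattern with 20 samples.
small = [1.0, 0.5, 1.0, 0.0, 1.0]
large = small * 4

for label, scores in (("5 samples", small), ("20 samples", large)):
    mean, stderr = mean_with_stderr(scores)
    print(f"{label}: mean={mean:.2f}, standard error={stderr:.2f}")
```

Both sets have the same mean, but the larger one has a much smaller standard error, so a change in the aggregate score is less likely to be an artifact of which samples happened to be in the test set.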

How to Evaluate RAG Pipeline Performance
