RAGAS Introduction — RAG Evaluation and Metrics (Chapter 5)

RAGAS provides automated evaluation for RAG pipelines using LLM-based metrics. Instead of relying on exact string matching or requiring human-rated exemplars for every test case, RAGAS prompts language models to assess generation quality against reference contexts and questions.

The framework evaluates along four dimensions. Faithfulness measures whether generated answers stay grounded in the retrieved context without fabricating information. Answer Relevance measures whether generated answers address the original query intent. Context Precision measures whether the retrieved context contains only relevant information, with the relevant items ranked first. Context Recall measures whether the retrieved context captures all the information needed to answer the query.
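To make three of these definitions concrete, the sketch below computes them from hand-labelled verdicts. In RAGAS an LLM produces the verdicts; the function names and boolean inputs here are hypothetical, and the ratios are a simplified reading of the definitions above, not the library's implementation. Answer Relevance is left out because it depends on embeddings.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(reference_claim_found: list[bool]) -> float:
    """Fraction of reference-answer claims that can be found in the retrieved context."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)


def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Mean precision@k, averaged over the ranks that hold a relevant chunk.

    chunk_relevant[i] is True when the chunk at rank i + 1 is relevant.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision@rank at a relevant position
    return total / hits if hits else 0.0


# Three of four answer claims are supported by the context.
print(faithfulness_score([True, True, True, False]))  # 0.75

# Ranking matters: the same single relevant chunk scores lower at rank 2.
print(context_precision_score([True, False]))  # 1.0
print(context_precision_score([False, True]))  # 0.5
```

The precision example shows why ranking matters: the same set of chunks scores lower when the relevant one is retrieved second.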

RAGAS takes three core inputs: the user query, the retrieved context, and the generated answer. Reference-based metrics such as Context Recall also need a ground-truth reference answer. In return, it produces a score between 0 and 1 for each dimension being evaluated.
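Checking a record before spending LLM calls on it can catch missing fields early. The sketch below is a hypothetical helper, not part of RAGAS. The metric-to-field mapping is an assumption based on the field names used in this chapter, so verify it against the RAGAS version you have installed.

```python
# ASSUMPTION: required fields per metric; not read from the RAGAS library.
REQUIRED_FIELDS = {
    "faithfulness": {"user_input", "retrieved_contexts", "response"},
    "answer_relevancy": {"user_input", "response"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
}


def missing_fields(record: dict, metric_names: list[str]) -> dict[str, list[str]]:
    """Return, for each metric, the required fields that the record lacks."""
    gaps = {}
    for name in metric_names:
        absent = sorted(REQUIRED_FIELDS[name] - set(record))
        if absent:
            gaps[name] = absent
    return gaps


record = {
    "user_input": "What is the refund policy?",
    "retrieved_contexts": ["All purchases come with a 30-day return window."],
    "response": "There is a 30-day return window.",
}
print(missing_fields(record, ["faithfulness", "context_recall"]))
# {'context_recall': ['reference']}
```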

# pip install ragas langchain langchain-openai

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset
import os

os.environ["OPENAI_API_KEY"] = "your-api-key"

# Define evaluation data for a single RAG pipeline response
evaluation_data = [
    {
        "user_input": "What is the refund policy for items purchased in December?",
        "retrieved_contexts": [
            "All purchases come with a 30-day return window. "
            "Refunds are processed within 5-7 business days. "
            "Items must be in original packaging."
        ],
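        "response": "Items purchased in December have a standard 30-day return "
                   "window. Refunds are processed within 5-7 business days, and "
                   "items must be in original packaging.",
        # Ground-truth answer (illustrative); reference-based metrics such as
        # context_recall need it
        "reference": "Items can be returned within 30 days in original "
                     "packaging; refunds are processed within 5-7 business days."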
    }
]

# Wrap the records in a Hugging Face Dataset, which evaluate() accepts
dataset = Dataset.from_list(evaluation_data)

# Run evaluation with specific metrics
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(result)
# Example output (illustrative; actual scores vary with the judge model and run):
# {
#     'faithfulness': 1.0,
#     'answer_relevancy': 0.9,
#     'context_precision': 1.0,
#     'context_recall': 1.0
# }

The output shows how RAGAS quantifies evaluation quality. The answer earns a perfect Faithfulness score because it contains no information beyond the retrieved context. The Answer Relevance score of 0.9 indicates the answer mostly addresses the query but may have minor gaps or tangents. Context Precision at 1.0 means every retrieved document was relevant. Context Recall at 1.0 means the retrieved context fully captured the information needed to answer; this metric can only be computed when a ground-truth answer or an equivalent annotation is available.

RAGAS relies on the evaluating LLM's judgment. Scores reflect that model's assessment of quality, not guaranteed ground truth. This introduces variance, both between model providers and over time as model versions change. Use RAGAS scores as relative indicators and trend monitors rather than absolute quality guarantees.
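One way to treat scores as relative indicators is to repeat each evaluation a few times and compare the shift between two pipeline variants against the run-to-run spread. The sketch below is a minimal illustration. The helper names, the threshold factor k, and the score lists are all made up for the example, and the rule is a rough heuristic, not a formal significance test.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of repeated scores for one metric."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


def is_meaningful_change(baseline: list[float], candidate: list[float], k: float = 2.0) -> bool:
    """Treat a shift as real only if it exceeds k times the larger run-to-run spread."""
    base_mean, base_sd = summarize_runs(baseline)
    cand_mean, cand_sd = summarize_runs(candidate)
    return abs(cand_mean - base_mean) > k * max(base_sd, cand_sd)


# Made-up faithfulness scores from three repeated runs per pipeline variant.
baseline_runs = [0.80, 0.82, 0.78]
small_shift = [0.81, 0.79, 0.83]
large_shift = [0.90, 0.92, 0.91]

print(is_meaningful_change(baseline_runs, small_shift))  # False
print(is_meaningful_change(baseline_runs, large_shift))  # True
```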

The framework integrates with LangChain for flexible pipeline evaluation. Chaining RAGAS evaluation into LangChain allows measuring quality across pipeline variations without rewriting evaluation logic.
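The idea of reusing one evaluation path across pipeline variants can be sketched without committing to any particular LangChain API. In the example below, the Pipeline type, collect_records, and stub_pipeline are hypothetical names introduced for illustration. A real LangChain chain would need a thin wrapper that returns its retrieved contexts and final answer in this shape.

```python
from typing import Callable

# Hypothetical interface: a pipeline takes a question and returns
# (retrieved_contexts, response).
Pipeline = Callable[[str], tuple[list[str], str]]


def collect_records(pipeline: Pipeline, questions: list[str]) -> list[dict]:
    """Run a pipeline over the questions and build records in the evaluation schema."""
    records = []
    for question in questions:
        contexts, answer = pipeline(question)
        records.append(
            {
                "user_input": question,
                "retrieved_contexts": contexts,
                "response": answer,
            }
        )
    return records


def stub_pipeline(question: str) -> tuple[list[str], str]:
    """Stand-in for a real RAG pipeline; returns canned context and answer."""
    contexts = ["All purchases come with a 30-day return window."]
    return contexts, "Purchases can be returned within 30 days."


records = collect_records(stub_pipeline, ["What is the return window?"])
print(records[0]["response"])  # Purchases can be returned within 30 days.
```

Records built this way can be passed to Dataset.from_list and evaluate exactly as in the earlier example, so swapping in a different pipeline variant changes only the callable.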