Answer Relevance — RAG Evaluation and Metrics (Chapter 7)

Answer Relevance measures whether a generated answer actually addresses the user's question. The RAGAS implementation prompts an LLM to generate several candidate questions from the answer, embeds them, and averages their cosine similarity to the original query. The score therefore reflects whether the content of the answer matches the requested topic.

The metric penalizes answers that address the wrong topic, contain accurate information but miss the point, or bury relevant content in tangents. A technically correct answer about shipping methods that ignores the specific cost question scores low on relevance.

from ragas.metrics import answer_relevancy
from ragas import evaluate
from datasets import Dataset  # Hugging Face datasets library

# Example comparing relevant and irrelevant answers
relevant_data = [
    {
        "user_input": "What are the weekend checkout times?",
        "retrieved_contexts": [
            "Pool hours are 7am-10pm daily. Gym hours are 6am-11pm daily. "
            "Front desk checkout is 11am on weekdays and noon on weekends."
        ],
        "response": "Checkout on weekends is at noon at the front desk. "
                   "Weekday checkout remains at 11am."
    }
]

tangent_data = [
    {
        "user_input": "What are the weekend checkout times?",
        "retrieved_contexts": [
            "Pool hours are 7am-10pm daily. Gym hours are 6am-11pm daily. "
            "Front desk checkout is 11am on weekdays and noon on weekends."
        ],
        "response": "The gym is open from 6am to 11pm on weekends, and "
                   "the pool runs 7am to 10pm. Both facilities are available "
                   "throughout the day."
    }
]

relevant_ds = Dataset.from_list(relevant_data)
tangent_ds = Dataset.from_list(tangent_data)

# Scoring calls an LLM and an embedding model, so model credentials
# (by default an OpenAI API key) must be configured before running this
relevant_result = evaluate(relevant_ds, metrics=[answer_relevancy])
tangent_result = evaluate(tangent_ds, metrics=[answer_relevancy])

# Assumes ragas 0.2+, where indexing by metric name returns per-sample scores
print(f"\nDirect answer: {relevant_result['answer_relevancy'][0]:.2f}")
print(f"Tangent answer: {tangent_result['answer_relevancy'][0]:.2f}")

The tangent answer provides accurate information but fails to address the question. The user asked about checkout times; the answer describes gym and pool hours. A human reading this exchange would immediately identify the mismatch. The Answer Relevance metric captures this programmatically by measuring semantic alignment between query and answer content.

Low Answer Relevance has different causes than low Faithfulness. An answer can be highly faithful (every claim matches the context) while missing the point entirely. Typical causes are retrieval failures, where relevant information exists in the corpus but the wrong documents are retrieved, and prompt issues, where the generation prompt does not sufficiently constrain topic focus.
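One way to act on this distinction is to read the two scores together. The following is a minimal sketch rather than part of RAGAS: the `triage` function, its labels, and the 0.7 cutoff are all hypothetical and chosen only for illustration.

```python
def triage(faithfulness, relevance, threshold=0.7):
    """Map a (faithfulness, relevance) score pair to a coarse diagnosis.

    The default threshold is an arbitrary illustrative value,
    not a recommended setting.
    """
    faithful = faithfulness >= threshold
    relevant = relevance >= threshold
    if faithful and relevant:
        return "ok"
    if faithful:
        # Grounded in the context but off topic: the case described above.
        return "check retrieval and prompt focus"
    if relevant:
        # On topic, but the claims are not supported by the context.
        return "check grounding"
    return "off topic and unsupported"


# Hypothetical score pairs, not output from a real evaluation run.
for name, (faith, rel) in {
    "direct answer": (0.95, 0.92),
    "tangent answer": (0.97, 0.31),
}.items():
    print(f"{name}: {triage(faith, rel)}")
```

In practice the cutoff would need to be calibrated against labeled examples from the target domain before the labels are trusted.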

The metric uses cosine similarity under the hood, comparing the embedding of the query with embeddings of questions generated from the answer. This means it captures topical alignment rather than factual correctness or completeness. An answer that addresses a different but related question scores lower than a direct answer, even if its content would be accurate for its own question.
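The similarity step can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, assuming the answer has already been turned into candidate questions and embedded. The function names are hypothetical and the three-dimensional vectors are made up, standing in for real embedding-model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def mean_similarity(query_vec, candidate_vecs):
    """Average cosine similarity between the query and each candidate."""
    if not candidate_vecs:
        raise ValueError("need at least one candidate vector")
    sims = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sum(sims) / len(sims)


# Made-up vectors for illustration only.
query = [1.0, 0.0, 0.0]                         # the checkout-time question
on_topic = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]   # questions implied by the direct answer
off_topic = [[0.1, 0.9, 0.0], [0.0, 0.8, 0.3]]  # questions implied by the tangent answer

on_topic_score = mean_similarity(query, on_topic)
off_topic_score = mean_similarity(query, off_topic)
print(f"on-topic: {on_topic_score:.2f}, off-topic: {off_topic_score:.2f}")
```

Because the score depends only on the direction of the vectors, nothing in this calculation checks whether the answer's claims are true. That is the job of a separate metric such as Faithfulness.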