RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
RAG Evaluation and Metrics

07. Answer Relevance

Chapter 7 of 18 · 15 min
KEY INSIGHT

Answer Relevance measures topical alignment between query and answer, catching cases where accurate content addresses the wrong question.

Answer Relevance measures whether a generated answer actually addresses the user's question. The RAGAS implementation (exposed in code as answer_relevancy) prompts an LLM to generate several questions that the answer would plausibly respond to, embeds them, and averages their cosine similarity to the original query. A high score means the questions implied by the answer match the question that was actually asked.

The metric penalizes answers that address the wrong topic, contain accurate information but miss the point, or bury relevant content in tangents. A technically correct answer about shipping methods that ignores the specific cost question scores low on relevance.

# Assumes ragas 0.2 or later, where samples use the user_input /
# retrieved_contexts / response fields. answer_relevancy calls an LLM and an
# embedding model (OpenAI by default), so set OPENAI_API_KEY or pass your own
# `llm` and `embeddings` arguments to evaluate().
from ragas import EvaluationDataset, evaluate
from ragas.metrics import answer_relevancy

# Both samples share the same question and retrieved context;
# only the generated response differs.
question = "What are the weekend checkout times?"
contexts = [
    "Pool hours are 7am-10pm daily. Gym hours are 6am-11pm daily. "
    "Front desk checkout is 11am on weekdays and noon on weekends."
]

# Response that answers the question directly
relevant_data = [
    {
        "user_input": question,
        "retrieved_contexts": contexts,
        "response": "Checkout on weekends is at noon at the front desk. "
                    "Weekday checkout remains at 11am.",
    }
]

# Response that is accurate but drifts to a different topic
tangent_data = [
    {
        "user_input": question,
        "retrieved_contexts": contexts,
        "response": "The gym is open from 6am to 11pm on weekends, and "
                    "the pool runs 7am to 10pm. Both facilities are available "
                    "throughout the day.",
    }
]

relevant_ds = EvaluationDataset.from_list(relevant_data)
tangent_ds = EvaluationDataset.from_list(tangent_data)

relevant_result = evaluate(relevant_ds, metrics=[answer_relevancy])
tangent_result = evaluate(tangent_ds, metrics=[answer_relevancy])


def first_score(result, metric_name="answer_relevancy"):
    # to_pandas() returns one row per sample; each dataset here holds one sample.
    return result.to_pandas()[metric_name].iloc[0]


print(f"\nDirect answer: {first_score(relevant_result):.2f}")
print(f"Tangent answer: {first_score(tangent_result):.2f}")

The tangent answer provides accurate information but fails to address the question. The user asked about checkout times; the answer describes gym and pool hours. A human reading this exchange would immediately identify the mismatch. The Answer Relevance metric captures this programmatically by measuring semantic alignment between query and answer content.

Low Answer Relevance has distinct causes from low Faithfulness. An answer can be highly faithful (every claim matches context) while missing the point entirely. Causes include retrieval failures where relevant information exists in the corpus but the wrong documents get retrieved, or prompt issues where the generation task does not sufficiently constrain topic focus.
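Because the two metrics fail for different reasons, it can help to read them as a pair. The sketch below is one hypothetical way to turn a score pair into a first debugging step. The `triage` function, its labels, and the 0.7 threshold are illustrative assumptions, not part of RAGAS, and the cutoff should be tuned on your own data.

```python
def triage(faithfulness, relevance, threshold=0.7):
    """Map a (faithfulness, relevance) score pair to a first place to look.

    The threshold is an arbitrary illustrative cutoff, not a RAGAS default.
    """
    faithful = faithfulness >= threshold
    relevant = relevance >= threshold
    if faithful and relevant:
        return "ok"
    if faithful:
        # Claims match the context, but the answer misses the question.
        return "off-topic: check retrieval and prompt focus"
    if relevant:
        # The answer is on topic, but the context does not support it.
        return "unsupported claims: check grounding"
    return "both low: check retrieval first"


# Hypothetical score pairs, chosen only to exercise each branch.
for scores in [(0.95, 0.90), (0.95, 0.20), (0.30, 0.90), (0.20, 0.10)]:
    print(scores, "->", triage(*scores))
```

The faithful-but-off-topic branch is the case this chapter focuses on: every claim checks out against the context, yet the question goes unanswered.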

The metric uses cosine similarity under the hood, comparing the embedding of the original query with embeddings of the questions generated from the answer. This means it captures topical alignment rather than factual correctness or completeness. An answer that addresses a different but related question scores lower, even if its content would be accurate for that other question.
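The scoring step can be sketched in plain Python. This is a minimal illustration and not RAGAS's actual code: the three-dimensional vectors are hand-made stand-ins for real embeddings, and `mean_cosine_relevance` is a hypothetical helper name. It assumes the answer has already been turned into comparison vectors, which in a real pipeline would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no alignment
    return dot / (norm_a * norm_b)


def mean_cosine_relevance(query_vec, answer_side_vecs):
    """Average similarity between the query and each answer-derived vector."""
    if not answer_side_vecs:
        return 0.0
    sims = [cosine_similarity(query_vec, v) for v in answer_side_vecs]
    return sum(sims) / len(sims)


# Made-up 3-d "embeddings": the axes loosely stand for checkout, gym, pool.
query = [1.0, 0.0, 0.0]
on_topic = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
off_topic = [[0.1, 0.9, 0.3], [0.0, 0.6, 0.8]]

direct_score = mean_cosine_relevance(query, on_topic)
tangent_score = mean_cosine_relevance(query, off_topic)
print(f"direct: {direct_score:.2f}, tangent: {tangent_score:.2f}")
```

Nothing in this computation looks at whether the answer is true. Two vectors that point the same way score high regardless of factual accuracy, which is why the metric is paired with Faithfulness.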

EXERCISE

Create an answer that is faithful to context but irrelevant to the query. Run Answer Relevance evaluation. Then modify the answer to address the query while adding a factual error. Run both Faithfulness and Answer Relevance. Notice how these metrics can move independently: the answer may score high on one and low on the other.
