19. RAG Evaluation: MRR
Mean Reciprocal Rank (MRR) measures retrieval ranking quality, not just presence. A system that finds the answer at position 3 outperforms one that finds it at position 10.
MRR Definition
MRR = (1 / Q) × Σ (1 / rank_i)
Here Q is the number of queries and rank_i is the 1-indexed position of the first relevant chunk in query i's results. If no relevant chunk appears in the results, the rank is treated as ∞ and the reciprocal as 0.
Query 1: First relevant at position 2 → 1/2 = 0.5
Query 2: First relevant at position 1 → 1/1 = 1.0
Query 3: No relevant in top 10 → 0
MRR = (0.5 + 1.0 + 0) / 3 = 0.5
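The arithmetic above can be checked with a minimal sketch that works directly from each query's first-relevant rank. The helper name mrr_from_ranks and the choice to return 0.0 for an empty query set are assumptions made for illustration, not part of the implementation that follows.

```python
from typing import Optional


def mrr_from_ranks(first_relevant_ranks: list[Optional[int]]) -> float:
    """Average the reciprocal of each query's first relevant rank.

    A rank of None means no relevant chunk was retrieved, which
    contributes a reciprocal of 0.
    """
    if not first_relevant_ranks:
        return 0.0  # assumption: define MRR of an empty query set as 0
    reciprocals = [
        1.0 / rank if rank is not None else 0.0
        for rank in first_relevant_ranks
    ]
    return sum(reciprocals) / len(reciprocals)


# The three queries from the worked example: ranks 2, 1, and a miss.
example_ranks = [2, 1, None]
example_mrr = mrr_from_ranks(example_ranks)
print(f"MRR = {example_mrr:.3f}")  # MRR = 0.500
```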
MRR penalizes systems that retrieve relevant content but rank it poorly. Hit rate@10 might be 1.0 while MRR@10 is 0.5 - this signals ranking needs improvement.
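To make that contrast concrete, here is a small sketch on made-up data in which every query retrieves its relevant chunk, but always in second place. The helper first_relevant_rank and the toy_labels / toy_retrieved values are hypothetical; the layout assumed is that labels are indexed by chunk index and retrieved lists hold chunk indices, best first.

```python
from typing import Optional


def first_relevant_rank(
    labels: list[int], retrieved: list[int], k: int
) -> Optional[int]:
    """Return the 1-indexed rank of the first relevant chunk in the top k, or None."""
    for position, chunk_idx in enumerate(retrieved[:k], start=1):
        if labels[chunk_idx] > 0:
            return position
    return None


# Toy data (made up for illustration): 4 queries over a 5-chunk corpus.
# toy_labels[q][c] > 0 means chunk c is relevant to query q.
toy_labels = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
# Retrieved chunk indices per query, best first. Each relevant chunk is
# found, but never ranked first.
toy_retrieved = [
    [0, 1, 2],
    [3, 2, 0],
    [4, 0, 1],
    [1, 4, 3],
]

ranks = [
    first_relevant_rank(labels, retrieved, k=10)
    for labels, retrieved in zip(toy_labels, toy_retrieved)
]
hit_rate = sum(rank is not None for rank in ranks) / len(ranks)
mrr = sum(1.0 / rank if rank else 0.0 for rank in ranks) / len(ranks)
print(f"hit rate@10 = {hit_rate:.2f}, MRR@10 = {mrr:.2f}")
# hit rate@10 = 1.00, MRR@10 = 0.50
```

Hit rate alone would report this system as perfect; only the MRR shows that the top slot is consistently wasted.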
Implementing MRR
def calculate_mrr(
    queries: list[str],
    relevance_labels: list[list[int]],
    retrieval_results: list[list[int]],
    k: int = 10
) -> float:
    """Calculate Mean Reciprocal Rank at top K.

    relevance_labels[q][c] > 0 marks chunk c as relevant to query q;
    retrieval_results[q] lists retrieved chunk indices, best first.
    """
    if not relevance_labels:
        return 0.0  # no queries to average over
    reciprocal_ranks = []
    for query_idx, labels in enumerate(relevance_labels):
        retrieved = retrieval_results[query_idx][:k]
        # Find rank of first relevant chunk (relevance > 0)
        rank = None
        for position, chunk_idx in enumerate(retrieved):
            if labels[chunk_idx] > 0:
                rank = position + 1  # 1-indexed
                break
        # Handle miss: reciprocal = 0
        reciprocal = 1.0 / rank if rank else 0.0
        reciprocal_ranks.append(reciprocal)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
# Calculate MRR@10 for example queries
mrr = calculate_mrr(queries, labels, retrieved, k=10)
print(f"MRR@10: {mrr:.3f}")
# MRR@10: 0.389
MRR vs Hit Rate Comparison
import pandas as pd

def full_retrieval_evaluation(
    queries: list[str],
    relevance_labels: list[list[int]],
    retrieval_results: list[list[int]],
    k_values: tuple[int, ...] = (1, 3, 5, 10)
) -> pd.DataFrame:
    """Full retrieval evaluation metrics as a single-row DataFrame."""
    metrics = {}
    # Hit Rate at each k: a query is a hit if any of its top-k
    # retrieved chunk indices is labeled relevant
    for k in k_values:
        hits = sum(
            any(labels[chunk_idx] > 0 for chunk_idx in r[:k])
            for r, labels in zip(retrieval_results, relevance_labels)
        )
        metrics[f"hit_rate@{k}"] = [hits / len(queries)]
    # MRR is calculated at the maximum k
    max_k = max(k_values)
    mrr = calculate_mrr(queries, relevance_labels, retrieval_results, k=max_k)
    metrics[f"mrr@{max_k}"] = [mrr]
    return pd.DataFrame(metrics)
# Example comparison between two retrieval strategies
results_dense = calculate_mrr(queries, labels, dense_retrieved, k=10)
results_hybrid = calculate_mrr(queries, labels, hybrid_retrieved, k=10)
print(f"Dense MRR@10: {results_dense:.3f}")
print(f"Hybrid MRR@10: {results_hybrid:.3f}")
# Dense MRR@10: 0.312
# Hybrid MRR@10: 0.389
The 24.7% relative improvement in MRR from hybrid search (0.312 to 0.389) demonstrates why ranking quality matters, not just retrieval.
MRR@K Calculation
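Standard MRR evaluates at K=∞ (or the dataset maximum), so a relevant chunk at any depth still earns some credit. Whether a given rank is acceptable depends on context: position 12 in a 100-article dataset may be acceptable, while position 12 in a 1000-article dataset may not be. MRR@K makes the cutoff explicit by treating anything beyond position K as a miss: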
def calculate_mrr_at_k(
    relevance_labels: list[list[int]],
    retrieval_results: list[list[int]],
    k: int = 10
) -> float:
    """MRR calculated only considering results within top K."""
    if not relevance_labels:
        return 0.0  # no queries to average over
    reciprocal_ranks = []
    for labels, retrieved in zip(relevance_labels, retrieval_results):
        # Limit to top K
        truncated = retrieved[:k]
        rank = None
        for pos, chunk_idx in enumerate(truncated):
            if labels[chunk_idx] > 0:
                rank = pos + 1  # 1-indexed
                break
        # A miss within the top K contributes 0
        reciprocal = 1.0 / rank if rank else 0.0
        reciprocal_ranks.append(reciprocal)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
Use MRR@K when you want to measure ranking quality only within what the generation model can actually use.
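As a rough illustration of how the cutoff changes the score, the sketch below recomputes MRR@K for the same set of first-relevant ranks at several values of K. The ranks are hypothetical and the helper mrr_at_k_from_ranks is an assumption for this example; it works from ranks rather than from labels and retrieved lists.

```python
from typing import Optional


def mrr_at_k_from_ranks(
    first_relevant_ranks: list[Optional[int]], k: int
) -> float:
    """MRR@K: a first-relevant rank beyond K counts as a miss."""
    if not first_relevant_ranks:
        return 0.0
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Hypothetical first-relevant ranks for four queries (None = never retrieved).
ranks = [1, 2, 12, None]
for k in (1, 5, 20):
    print(f"MRR@{k}: {mrr_at_k_from_ranks(ranks, k):.3f}")
# MRR@1: 0.250
# MRR@5: 0.375
# MRR@20: 0.396
```

If only the top 5 chunks are passed to the generation model, the hit at position 12 is invisible to it, so MRR@5 is the more honest number for that pipeline.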
Calculate both hit rate@10 and MRR@10 for your retrieval system. If hit rate is high (>0.9) but MRR is low (<0.5), implement a reranking stage and measure improvement. Expect 20-40% MRR improvement from reranking.
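One way to structure the before/after measurement is sketched below. Everything here is illustrative: the data is made up, and reranker_scores is a stand-in for whatever a real reranking model (for example a cross-encoder) would return. The size of the gain on this toy data says nothing about what to expect in practice.

```python
def mrr_at_k(labels: list[list[int]], retrieved: list[list[int]], k: int = 10) -> float:
    """MRR@K where retrieved[q] holds chunk indices into labels[q]."""
    if not labels:
        return 0.0
    total = 0.0
    for query_labels, chunk_ids in zip(labels, retrieved):
        for position, chunk_idx in enumerate(chunk_ids[:k], start=1):
            if query_labels[chunk_idx] > 0:
                total += 1.0 / position
                break
    return total / len(labels)


def rerank(chunk_ids: list[int], scores: dict[int, float]) -> list[int]:
    """Reorder one query's candidates by reranker score, highest first."""
    return sorted(chunk_ids, key=lambda idx: scores[idx], reverse=True)


# Toy data, made up for illustration: 2 queries, 4 candidate chunks each.
labels = [[0, 0, 1, 0], [0, 1, 0, 0]]
retrieved = [[0, 3, 2, 1], [2, 0, 3, 1]]  # first-stage retrieval order
# Hypothetical reranker scores per chunk index, one dict per query.
reranker_scores = [
    {0: 0.2, 1: 0.1, 2: 0.9, 3: 0.4},
    {0: 0.3, 1: 0.7, 2: 0.8, 3: 0.1},
]

reranked = [rerank(ids, scores) for ids, scores in zip(retrieved, reranker_scores)]
before = mrr_at_k(labels, retrieved)
after = mrr_at_k(labels, reranked)
print(f"MRR@10 before: {before:.3f}, after: {after:.3f}")
# MRR@10 before: 0.292, after: 0.750
```

Because reranking only reorders the candidates already retrieved, hit rate@10 is unchanged when all candidates fit within the top 10; any change shows up in MRR alone.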