Mean Reciprocal Rank — RAG Evaluation and Metrics (Chapter 3)

Mean Reciprocal Rank (MRR) improves on Hit Rate by accounting for position. Instead of only checking whether any relevant document appears in the results, MRR rewards systems that rank the first relevant document as close to the top as possible.

The reciprocal rank for a single query is 1 divided by the position of the first relevant document. If the first document is relevant, the reciprocal rank is 1. If relevant documents appear at positions 3 and 5, the reciprocal rank is 1/3, because only the first relevant document counts. If no relevant document appears, the reciprocal rank is 0. MRR is the average of these per-query values.
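
The definition can be written compactly. The notation here is introduced only for illustration and is not from the original text: Q is the set of evaluation queries, and rank_i is the position of the first relevant document retrieved for query i.

```latex
\mathrm{RR}_i =
\begin{cases}
\dfrac{1}{\mathrm{rank}_i} & \text{if a relevant document is retrieved for query } i \\[6pt]
0 & \text{otherwise}
\end{cases}
\qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathrm{RR}_i
```

For three queries with first relevant documents at positions 1 and 3 and no relevant document for the third, this gives (1 + 1/3 + 0) / 3 = 4/9, or about 0.444.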

def mean_reciprocal_rank(results: list[list[str]], relevance: list[set[str]]) -> float:
    """
    Calculate Mean Reciprocal Rank.
    
    Args:
        results: List of ranked doc IDs for each query
        relevance: List of sets containing relevant doc IDs for each query
    Returns:
        Average reciprocal rank across all queries
    """
    reciprocal_ranks = []
    
    for retrieved, relevant in zip(results, relevance):
        rr = 0.0
        for position, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / position
                break
        reciprocal_ranks.append(rr)
    
    # Guard against an empty query list to avoid dividing by zero
    if not reciprocal_ranks:
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Example with comparison to Hit Rate
results = [
    ["doc_A", "doc_B", "doc_C"],
    ["doc_D", "doc_E", "doc_F"],
    ["doc_G", "doc_H", "doc_I"],
]

relevance = [
    {"doc_A"},  # Relevant at position 1
    {"doc_F"},  # Relevant at position 3
    {"doc_K"},  # No relevant documents
]

# hit_rate is assumed to be the Hit Rate function defined earlier in the chapter
hr = hit_rate(results, relevance)
mrr = mean_reciprocal_rank(results, relevance)

print(f"Hit Rate: {hr:.3f}")   # 0.667 (2 hits out of 3)
print(f"MRR: {mrr:.3f}")       # 0.444 ((1 + 1/3 + 0) / 3)

The example shows why MRR matters. Query 2 has a hit (doc_F appears), so it counts fully toward Hit Rate. But doc_F appears at position 3, so it contributes only 1/3 to MRR. A system that retrieves doc_F at position 3 is worse than one that retrieves it at position 1, and of the two metrics only MRR reflects the difference.
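
To make the comparison concrete, the sketch below scores two hypothetical systems that differ only in where they rank doc_F for query 2. The `hit_rate` function here is an assumed definition (the fraction of queries with at least one relevant document retrieved) and may differ in detail from the chapter's own version. The names `system_a` and `system_b` are illustrative.

```python
def hit_rate(results, relevance):
    """Fraction of queries with at least one relevant doc retrieved (assumed definition)."""
    hits = [
        any(doc_id in relevant for doc_id in retrieved)
        for retrieved, relevant in zip(results, relevance)
    ]
    return sum(hits) / len(hits)


def mean_reciprocal_rank(results, relevance):
    """Average of 1/position of the first relevant doc per query (0 if none)."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(results, relevance):
        rr = next(
            (1.0 / pos for pos, doc_id in enumerate(retrieved, start=1) if doc_id in relevant),
            0.0,
        )
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


relevance = [{"doc_A"}, {"doc_F"}, {"doc_K"}]

# System A ranks doc_F third for query 2; system B ranks it first.
system_a = [
    ["doc_A", "doc_B", "doc_C"],
    ["doc_D", "doc_E", "doc_F"],
    ["doc_G", "doc_H", "doc_I"],
]
system_b = [
    ["doc_A", "doc_B", "doc_C"],
    ["doc_F", "doc_D", "doc_E"],
    ["doc_G", "doc_H", "doc_I"],
]

for name, results in (("A", system_a), ("B", system_b)):
    hr = hit_rate(results, relevance)
    mrr = mean_reciprocal_rank(results, relevance)
    print(f"System {name}: Hit Rate {hr:.3f}, MRR {mrr:.3f}")
# System A: Hit Rate 0.667, MRR 0.444
# System B: Hit Rate 0.667, MRR 0.667
```

Both systems score the same Hit Rate, so that metric alone cannot separate them. MRR can.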

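MRR is most sensitive to changes near the top of the ranking. A relevant document at position 1 contributes 1.0 to the average, one at position 2 contributes 0.5, and one at position 10 contributes 0.1. Moving the first relevant document from position 1 to position 2 halves that query's contribution, while moving it from position 9 to position 10 barely changes it. Hit Rate is identical in all of these cases as long as the document stays within the retrieved results.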
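
The steepness of this drop-off is easy to see by sliding one relevant document through a ranking of ten. This is a minimal sketch: `reciprocal_rank`, `doc_X`, and the filler document IDs are hypothetical names used only for illustration.

```python
def reciprocal_rank(retrieved, relevant):
    """Return 1/position of the first relevant doc, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


relevant = {"doc_X"}
fillers = [f"doc_{i}" for i in range(1, 10)]  # nine irrelevant documents

# Place doc_X at each position from 1 to 10 and record the reciprocal rank.
contributions = {}
for position in range(1, 11):
    ranking = fillers[: position - 1] + ["doc_X"] + fillers[position - 1 :]
    contributions[position] = reciprocal_rank(ranking, relevant)

for position, rr in contributions.items():
    # Every ranking contains doc_X, so each one counts as a hit for Hit Rate.
    print(f"position {position:2d}: reciprocal rank = {rr:.3f}")
```

The step from position 1 to position 2 costs 0.5, while the step from position 9 to position 10 costs about 0.011.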

For RAG applications, MRR matters most when the first retrieved document disproportionately influences the generated answer. If the RAG pipeline passes only the top-k documents to the language model, a relevant document ranked below position k never reaches the model at all, so ranking the best document first directly benefits downstream generation quality.
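
One way to reflect a top-k cutoff in the metric is to score only the first k retrieved documents per query. The function below is a sketch of that idea, not something defined in this chapter. `mrr_at_k` and its `k` parameter are hypothetical names, and a relevant document ranked below k is treated as a miss.

```python
def mrr_at_k(results, relevance, k):
    """MRR computed over only the top-k retrieved documents for each query."""
    if k < 1:
        raise ValueError("k must be at least 1")

    reciprocal_ranks = []
    for retrieved, relevant in zip(results, relevance):
        rr = 0.0
        # Documents beyond position k are ignored, as if never passed to the model.
        for position, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                rr = 1.0 / position
                break
        reciprocal_ranks.append(rr)

    if not reciprocal_ranks:
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


results = [
    ["doc_A", "doc_B", "doc_C"],
    ["doc_D", "doc_E", "doc_F"],
    ["doc_G", "doc_H", "doc_I"],
]
relevance = [{"doc_A"}, {"doc_F"}, {"doc_K"}]

for k in (1, 2, 3):
    print(f"MRR@{k}: {mrr_at_k(results, relevance, k):.3f}")
# MRR@1: 0.333
# MRR@2: 0.333
# MRR@3: 0.444
```

With k set to 1 or 2, query 2 scores zero because doc_F sits at position 3 and would never reach the model.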