18. RAG Evaluation: Hit Rate
Hit Rate measures whether at least one relevant chunk appears in the top k retrieved results for a query. It is among the simplest retrieval metrics, which makes it a natural first check on a RAG retriever.
Hit Rate Definition
Hit Rate@k = (Number of queries with at least one relevant chunk in the top k) / (Total number of queries)
Hit Rate@10 = 0.7 means that 70% of queries retrieve at least one relevant chunk in their top 10 results.
Hit Rate is binary per query: it tells you nothing about where in the ranking the relevant chunk appeared or how many relevant chunks were retrieved, only that at least one was.
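Because the check is binary per query, a relevant chunk at rank 1 and one at rank 10 contribute equally to Hit Rate@10. A minimal sketch that makes this concrete, assuming each query's relevant chunks are given as a set of chunk IDs; the helpers `is_hit` and `hit_rate_at_k` and all chunk IDs are hypothetical:

```python
def is_hit(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> bool:
    """Return True if at least one relevant chunk ID appears in the top k results."""
    return any(chunk_id in relevant_ids for chunk_id in retrieved_ids[:k])


def hit_rate_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of queries whose top k contains at least one relevant chunk."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(is_hit(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical runs of (ranked chunk IDs, relevant chunk IDs), 10 results each.
runs = [
    (["a"] + [f"x{i}" for i in range(9)], {"a"}),  # relevant chunk at rank 1
    ([f"y{i}" for i in range(9)] + ["b"], {"b"}),  # relevant chunk at rank 10
    ([f"z{i}" for i in range(10)], {"c"}),         # relevant chunk not retrieved
]
print(hit_rate_at_k(runs, k=10))  # 0.6666666666666666 (ranks 1 and 10 both count)
print(hit_rate_at_k(runs, k=1))   # 0.3333333333333333
```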
Implementing Hit Rate
```python
def calculate_hit_rate(
    queries: list[str],
    relevance_labels: list[list[int]],   # Ground truth: relevance_labels[q][c] is the label of chunk c for query q
    retrieval_results: list[list[int]],  # Retrieved chunk indices per query, best first
    k_values: tuple[int, ...] = (1, 3, 5, 10, 20),
) -> dict[str, float]:
    """Calculate hit rate at multiple k values."""
    results = {}
    for k in k_values:
        hits = 0
        for labels_of_query, retrieved in zip(relevance_labels, retrieval_results):
            # Hit if any chunk in the top k has relevance > 0.
            # Labels are indexed by chunk index, not by rank position.
            hits += any(labels_of_query[chunk] > 0 for chunk in retrieved[:k])
        results[f"hit_rate@{k}"] = hits / len(queries)
    return results


# Example usage
queries = [
    "How do I configure OAuth?",
    "What are the retry limits?",
    "How to scale workers horizontally?",
]
# For each query, relevance of chunks 0-4 (0 = not relevant, 1 = somewhat, 2 = highly)
labels = [
    [0, 1, 2, 0, 0],  # Chunk 2 highly relevant, chunk 1 somewhat relevant
    [0, 0, 1, 0, 0],  # Chunk 2 somewhat relevant
    [2, 0, 0, 1, 0],  # Chunk 0 highly relevant, chunk 3 somewhat relevant
]
# Retrieved chunk indices for each query, in rank order
retrieved = [
    [0, 1, 2, 3, 4],
    [2, 0, 1, 3, 4],
    [1, 2, 0, 3, 4],
]
hit_rates = calculate_hit_rate(queries, labels, retrieved, k_values=(1, 3, 5))
print(hit_rates)
# {'hit_rate@1': 0.3333333333333333, 'hit_rate@3': 1.0, 'hit_rate@5': 1.0}
```
Ground Truth Annotation
Hit Rate requires ground truth labels. For production systems, annotate relevance manually:
```python
# Ground truth format
ground_truth = [
    {
        "query_id": "q001",
        "query": "How do I configure OAuth2?",
        "relevant_chunks": [
            {"chunk_id": "auth.md:0", "relevance": 2, "notes": "Primary OAuth2 setup"},
            {"chunk_id": "auth.md:5", "relevance": 1, "notes": "Token refresh details"},
        ],
    },
    # 50-200 query-chunk pairs minimum for meaningful evaluation
]
```
Relevance levels: 0 (not relevant), 1 (somewhat relevant), 2 (highly relevant). Hit rate itself only distinguishes relevant (> 0) from not relevant; use a finer-grained scale when analyzing specific retrieval failures.
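The ground truth format above stores chunk IDs rather than positional labels. A minimal sketch of computing hit rate directly from that format, assuming retrieval output is a dict mapping `query_id` to a ranked list of chunk IDs; that structure, the function name, the `min_relevance` parameter, and the second query are all hypothetical. Raising `min_relevance` to 2 counts only highly relevant chunks as hits:

```python
def hit_rate_from_ground_truth(
    ground_truth: list[dict],
    retrieved_by_query: dict[str, list[str]],
    k: int,
    min_relevance: int = 1,
) -> float:
    """Hit rate@k using chunk IDs from the ground truth format.

    retrieved_by_query (hypothetical structure) maps query_id -> ranked chunk IDs.
    A chunk counts as relevant if its label is >= min_relevance.
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    hits = 0
    for item in ground_truth:
        relevant = {
            c["chunk_id"] for c in item["relevant_chunks"] if c["relevance"] >= min_relevance
        }
        # Queries missing from the retrieval output count as misses
        top_k = retrieved_by_query.get(item["query_id"], [])[:k]
        hits += any(chunk_id in relevant for chunk_id in top_k)
    return hits / len(ground_truth)


ground_truth = [
    {
        "query_id": "q001",
        "query": "How do I configure OAuth2?",
        "relevant_chunks": [
            {"chunk_id": "auth.md:0", "relevance": 2},
            {"chunk_id": "auth.md:5", "relevance": 1},
        ],
    },
    {
        "query_id": "q002",  # hypothetical second query
        "query": "What are the retry limits?",
        "relevant_chunks": [{"chunk_id": "retries.md:2", "relevance": 1}],
    },
]
# Hypothetical retrieval output
retrieved_by_query = {
    "q001": ["auth.md:5", "setup.md:1", "auth.md:0"],
    "q002": ["config.md:4", "retries.md:7", "faq.md:1"],
}
print(hit_rate_from_ground_truth(ground_truth, retrieved_by_query, k=1))                   # 0.5
print(hit_rate_from_ground_truth(ground_truth, retrieved_by_query, k=1, min_relevance=2))  # 0.0
```

With `k=1`, the first query's top result is the somewhat relevant `auth.md:5`, so it is a hit at `min_relevance=1` but a miss at `min_relevance=2`.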
Hit Rate Benchmarks
Appropriate targets depend on the application type:
| Application | Target Hit Rate@10 | Notes |
|---|---|---|
| Customer support | 0.95+ | High recall expected - every missed chunk risks customer frustration |
| Research summarization | 0.85+ | Some misses acceptable since users expect to browse |
| Code search | 0.90+ | Exact answers expected - code chunks don't have alternatives |
| Policy compliance | 0.98+ | Missed policy = legal risk |
```python
def check_hit_rate_meets_target(hit_rates: dict[str, float], application_type: str) -> dict[str, dict]:
    # Per-application targets for hit_rate@10 and hit_rate@5 (inner keys are k values)
    targets = {
        "customer_support": {10: 0.95, 5: 0.85},
        "research": {10: 0.85, 5: 0.70},
        "code_search": {10: 0.90, 5: 0.80},
        "compliance": {10: 0.98, 5: 0.90},
    }
    # Unknown application types fall back to the most lenient ("research") targets
    target = targets.get(application_type, targets["research"])
    results = {}
    for k, target_value in target.items():
        # A hit rate that was not computed counts as 0.0 and therefore fails
        actual = hit_rates.get(f"hit_rate@{k}", 0.0)
        results[f"k{k}"] = {
            "target": target_value,
            "actual": actual,
            "pass": actual >= target_value,
        }
    return results
```
Create a ground truth dataset of 50 query-chunk pairs and calculate hit rate@10 for your current retrieval system. Since a hit is binary per query, look at each query that misses at k=10 and analyze why its relevant chunks were not retrieved: is it a semantic mismatch, a keyword gap, or a metadata filter issue?
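For this failure analysis it helps to list exactly which queries missed, not just the aggregate rate. A minimal sketch, assuming the same index-based labels and retrieval lists as `calculate_hit_rate`; the function name `find_misses` and the sample data are hypothetical, and classifying each miss (semantic mismatch, keyword gap, metadata filter) remains a manual step:

```python
def find_misses(
    queries: list[str],
    relevance_labels: list[list[int]],
    retrieval_results: list[list[int]],
    k: int = 10,
) -> list[dict]:
    """Return the queries whose top k contains no chunk with relevance > 0."""
    misses = []
    for query, labels, retrieved in zip(queries, relevance_labels, retrieval_results):
        top_k = retrieved[:k]
        if not any(labels[chunk] > 0 for chunk in top_k):
            misses.append({
                "query": query,
                "retrieved": top_k,
                # Relevant chunks that retrieval failed to surface: inspect these by hand
                "missed_relevant": [c for c, label in enumerate(labels) if label > 0],
            })
    return misses


queries = ["How do I configure OAuth?", "What are the retry limits?"]
labels = [[0, 1, 2, 0, 0], [0, 0, 0, 0, 1]]
retrieved = [[2, 0, 1], [0, 1, 2]]
for miss in find_misses(queries, labels, retrieved, k=3):
    print(miss)
# {'query': 'What are the retry limits?', 'retrieved': [0, 1, 2], 'missed_relevant': [4]}
```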