RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
RAG Systems: Part 1

18. RAG Evaluation: Hit Rate

Chapter 18 of 22 · 25 min
KEY INSIGHT

Set hit rate targets based on application tolerance for missed information, not arbitrary thresholds.

Hit Rate measures whether at least one relevant chunk appears in the top-k retrieved results. It is the most basic retrieval metric.

Hit Rate Definition

Hit Rate@k = (Number of queries where at least one relevant chunk appears in the top k) / (Total queries)

Hit Rate@10 = 0.7 means 70% of queries retrieve at least one relevant chunk in top 10 results.

Hit Rate tells you nothing about where in the top k the relevant chunk is ranked, or how many relevant chunks were retrieved. It only tells you that relevant content exists somewhere in the results.
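
To make that limitation concrete, here is a minimal sketch comparing two hypothetical rankings for the same query. Both count as a hit at k=10, although one puts the relevant chunk first and the other puts it last. The helper name `is_hit` and the chunk IDs are assumptions made up for illustration.

```python
def is_hit(retrieved: list[str], relevant: set[str], k: int) -> bool:
    """Return True if any of the top-k retrieved chunk IDs is relevant."""
    return any(chunk_id in relevant for chunk_id in retrieved[:k])

# Hypothetical ground truth for a single query
relevant = {"auth.md:0"}

# Ranking A puts the relevant chunk first; ranking B puts it last.
ranking_a = ["auth.md:0"] + [f"other.md:{i}" for i in range(9)]
ranking_b = [f"other.md:{i}" for i in range(9)] + ["auth.md:0"]

hit_a = is_hit(ranking_a, relevant, k=10)
hit_b = is_hit(ranking_b, relevant, k=10)
print(hit_a, hit_b)  # True True

# 1-based rank of the relevant chunk in each ranking
rank_a = ranking_a.index("auth.md:0") + 1
rank_b = ranking_b.index("auth.md:0") + 1
print(rank_a, rank_b)  # 1 10
```

Both rankings contribute equally to Hit Rate@10. Telling them apart requires a rank-aware metric such as MRR.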

Implementing Hit Rate

from collections.abc import Sequence

def calculate_hit_rate(
    queries: list[str],
    relevance_labels: list[list[int]],  # relevance_labels[q][c] = relevance of chunk index c for query q
    retrieval_results: list[list[int]],  # Ranked chunk indices retrieved for each query
    k_values: Sequence[int] = (1, 3, 5, 10, 20),
) -> dict[str, float]:
    """Calculate hit rate at multiple k values."""
    if not queries:
        raise ValueError("queries must not be empty")
    results = {}

    for k in k_values:
        hits = 0
        for query_idx, labels_of_query in enumerate(relevance_labels):
            retrieved = retrieval_results[query_idx][:k]
            # Hit if any retrieved chunk has relevance > 0.
            # Look up each retrieved chunk's label by its index; zipping the
            # labels with the results would only inspect the first k labels.
            hits += any(labels_of_query[chunk] > 0 for chunk in retrieved)
        results[f"hit_rate@{k}"] = hits / len(queries)

    return results

# Example usage
queries = [
    "How do I configure OAuth?",
    "What are the retry limits?",
    "How to scale workers horizontally?"
]

# For each query: relevance of each chunk, by 0-based chunk index
# (0=not relevant, 1=somewhat, 2=highly)
labels = [
    [0, 1, 2, 0, 0],  # Chunk 1 is somewhat relevant, chunk 2 is highly relevant
    [0, 0, 1, 0, 0],  # Chunk 2 is somewhat relevant
    [2, 0, 0, 1, 0]   # Chunk 0 is highly relevant, chunk 3 is somewhat relevant
]

# Retrieved chunk indices for each query, best match first
retrieved = [
    [0, 1, 2, 3, 4],
    [2, 0, 1, 3, 4],
    [1, 2, 0, 3, 4]
]

hit_rates = calculate_hit_rate(queries, labels, retrieved, k_values=[1, 3, 5])
print(hit_rates)
# {'hit_rate@1': 0.3333333333333333, 'hit_rate@3': 1.0, 'hit_rate@5': 1.0}
# Only the second query ranks a relevant chunk first, so hit_rate@1 is 1/3.
# By k=3, every query has at least one relevant chunk in its results.

Ground Truth Annotation

Hit Rate requires ground truth labels. For production systems, annotate relevance manually:

# Ground truth format
ground_truth = [
    {
        "query_id": "q001",
        "query": "How do I configure OAuth2?",
        "relevant_chunks": [
            {"chunk_id": "auth.md:0", "relevance": 2, "notes": "Primary OAuth2 setup"},
            {"chunk_id": "auth.md:5", "relevance": 1, "notes": "Token refresh details"}
        ]
    },
    # 50-200 query-chunk pairs minimum for meaningful evaluation
]

Relevance levels: 0 (not relevant), 1 (somewhat relevant), 2 (highly relevant). Hit Rate only needs the binary distinction (relevance > 0); use finer-grained levels when analyzing specific retrieval failures.
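
The annotation format above identifies chunks by string ID instead of list position, so it can be scored directly against a retriever's output. The following is a minimal sketch under stated assumptions: the function name `hit_rate_from_ground_truth`, the `min_relevance` parameter, the `retrieved_by_query` mapping, and the second record (`q002`, `retries.md:2`) are all hypothetical and only serve the illustration.

```python
def hit_rate_from_ground_truth(
    ground_truth: list[dict],
    retrieved_by_query: dict[str, list[str]],
    k: int,
    min_relevance: int = 1,
) -> float:
    """Fraction of annotated queries with at least one relevant chunk in the top k."""
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    hits = 0
    for record in ground_truth:
        # Chunk IDs that count as relevant at the chosen threshold
        relevant_ids = {
            chunk["chunk_id"]
            for chunk in record["relevant_chunks"]
            if chunk["relevance"] >= min_relevance
        }
        # A query with no retrieval output counts as a miss
        top_k = retrieved_by_query.get(record["query_id"], [])[:k]
        hits += any(chunk_id in relevant_ids for chunk_id in top_k)
    return hits / len(ground_truth)

ground_truth = [
    {"query_id": "q001", "query": "How do I configure OAuth2?",
     "relevant_chunks": [
         {"chunk_id": "auth.md:0", "relevance": 2},
         {"chunk_id": "auth.md:5", "relevance": 1},
     ]},
    # Hypothetical second record, added for illustration
    {"query_id": "q002", "query": "What are the retry limits?",
     "relevant_chunks": [{"chunk_id": "retries.md:2", "relevance": 2}]},
]

# Hypothetical retriever output: ranked chunk IDs per query_id
retrieved_by_query = {
    "q001": ["intro.md:1", "auth.md:5", "auth.md:0"],
    "q002": ["intro.md:0", "faq.md:3", "retries.md:2"],
}

rate_k2 = hit_rate_from_ground_truth(ground_truth, retrieved_by_query, k=2)
rate_k3 = hit_rate_from_ground_truth(ground_truth, retrieved_by_query, k=3)
strict_k2 = hit_rate_from_ground_truth(
    ground_truth, retrieved_by_query, k=2, min_relevance=2
)
print(rate_k2, rate_k3, strict_k2)  # 0.5 1.0 0.0
```

Raising `min_relevance` to 2 counts only highly relevant chunks as hits, which is one way to see how much of a score depends on "somewhat relevant" matches.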

Hit Rate Benchmarks

Different application types expect different hit rates:

| Application | Target Hit Rate@10 | Notes |
| Customer support | 0.95+ | High recall expected; every missed chunk risks customer frustration |
| Research summarization | 0.85+ | Some misses acceptable since users expect to browse |
| Code search | 0.90+ | Exact answers expected; code chunks don't have alternatives |
| Policy compliance | 0.98+ | Missed policy = legal risk |
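
def check_hit_rate_meets_target(hit_rates: dict[str, float], application_type: str) -> dict[str, dict]:
    """Compare measured hit rates against the targets for an application type."""
    # Targets per application type; "k10" is Hit Rate@10, "k5" is Hit Rate@5
    targets = {
        "customer_support": {"k10": 0.95, "k5": 0.85},
        "research": {"k10": 0.85, "k5": 0.70},
        "code_search": {"k10": 0.90, "k5": 0.80},
        "compliance": {"k10": 0.98, "k5": 0.90}
    }

    # Unknown application types fall back to the research targets
    target = targets.get(application_type, targets["research"])
    results = {}

    for k, target_value in target.items():
        # "k10" -> "hit_rate@10"; a metric missing from hit_rates counts as 0.0 and fails
        actual = hit_rates.get(f"hit_rate@{k.removeprefix('k')}", 0.0)
        results[k] = {
            "target": target_value,
            "actual": actual,
            "pass": actual >= target_value
        }

    return results
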
EXERCISE

Create a ground truth dataset of 50 query-chunk pairs. Calculate Hit Rate@10 for your current retrieval system. For each query that misses (no relevant chunk in the top 10), analyze why the relevant chunk was missed: is it a semantic mismatch, a keyword gap, or a metadata filter issue?
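
One possible starting point for the analysis step is to list the missed queries together with the rank at which a relevant chunk first shows up, if it shows up at all. This is a sketch only: the function name `find_misses`, the simplified ground truth shape (query ID mapped to a set of relevant chunk IDs), and all sample data are assumptions.

```python
def find_misses(
    ground_truth: dict[str, set[str]],
    retrieved: dict[str, list[str]],
    k: int = 10,
) -> list[dict]:
    """List queries with no relevant chunk in the top k, for manual review."""
    misses = []
    for query_id, relevant_ids in ground_truth.items():
        ranked = retrieved.get(query_id, [])
        if any(chunk_id in relevant_ids for chunk_id in ranked[:k]):
            continue  # hit, nothing to analyze
        # 1-based rank of the first relevant chunk in the full result list,
        # or None if the retriever never returned one
        first_rank = next(
            (i + 1 for i, chunk_id in enumerate(ranked) if chunk_id in relevant_ids),
            None,
        )
        misses.append({
            "query_id": query_id,
            "first_relevant_rank": first_rank,
            "top_k": ranked[:k],
        })
    return misses

# Hypothetical data: three queries, each with three retrieved chunks
ground_truth = {
    "q001": {"auth.md:0"},
    "q002": {"retries.md:2"},
    "q003": {"scaling.md:1"},
}
retrieved = {
    "q001": ["auth.md:0", "intro.md:1", "faq.md:0"],
    "q002": ["intro.md:0", "faq.md:3", "retries.md:2"],
    "q003": ["intro.md:0", "faq.md:1", "faq.md:2"],
}

misses = find_misses(ground_truth, retrieved, k=2)
for miss in misses:
    print(miss["query_id"], miss["first_relevant_rank"])
# q002 3
# q003 None
```

A relevant chunk that appears just beyond k may point to a ranking problem, while a chunk that never appears may point to a keyword gap or a metadata filter that excluded it. Either reading still needs to be confirmed by inspecting the query and chunk text.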
