RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

HOW-TO · RAG

How to Evaluate Reranking Quality with NDCG

Intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Reranking pipeline running, labeled relevance judgments

What this does

Normalized Discounted Cumulative Gain (NDCG) measures ranking quality by comparing where relevant documents land in your reranked output against an ideal ordering of the same judged documents. With non-negative relevance scores, NDCG ranges from 0 to 1, where 1 means the reranker's ordering matches the ideal one, with the most relevant documents first. It is a standard evaluation metric for retrieval and reranking tasks because it uses graded relevance and discounts each document's gain logarithmically by position, so placing a highly relevant document near the top counts for more than placing it further down.
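
As a small worked illustration, assume the linear-gain DCG used in the steps of this guide and a hypothetical query whose three returned documents carry relevance labels 3, 0, 2 in ranked order. The numbers are made up for the example; the ideal ordering of the same labels is 3, 2, 0.

```latex
\[
\mathrm{DCG} = \frac{3}{\log_2 2} + \frac{0}{\log_2 3} + \frac{2}{\log_2 4} = 3 + 0 + 1 = 4
\]
\[
\mathrm{IDCG} = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} \approx 3 + 1.262 + 0 = 4.262
\]
\[
\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}} \approx \frac{4}{4.262} \approx 0.939
\]
```

The score falls short of 1 only because the relevance-2 document sits at position 3 instead of position 2.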

Steps

  1. Compile your labeled dataset into a dictionary mapping each query ID to a list of (document_id, relevance_score) tuples.
  2. Run your reranking pipeline for each query in the evaluation set and collect the ranked list of document IDs in the returned order.
  3. Implement the DCG calculation: DCG = sum_{i=1}^{N} (rel_i / log2(i + 1)), where rel_i is the relevance score of the document at position i.
  4. Compute the Ideal DCG (IDCG) by sorting the documents for each query by their relevance score in descending order and applying the same DCG formula.
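  5. Calculate NDCG for each query: NDCG = DCG / IDCG. When IDCG is zero (no document for that query has a relevance score above 0), the ratio is undefined: either exclude the query from the average or set its NDCG to 0.0, and report which convention you used. Do not score such queries as 1.0, which would inflate the mean.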
  6. Aggregate NDCG across all queries by computing the mean (Mean NDCG) to get a single quality score for your pipeline.
  7. Optionally report NDCG@k for k=5 and k=10, truncating both the DCG and the IDCG sums at position k, to evaluate top-position quality. This matters most for systems where only the top results are shown to users.
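
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names (`dcg`, `ndcg`, `mean_ndcg`), the `zero_idcg_value` parameter, and the sample data are all hypothetical. It assumes the linear-gain DCG from step 3, treats unlabeled documents as relevance 0, and computes the ideal ordering over all labeled documents for the query.

```python
import math
from statistics import mean


def dcg(relevances, k=None):
    """DCG with linear gain: sum of rel_i / log2(i + 1), positions start at 1."""
    if k is not None:
        relevances = relevances[:k]  # NDCG@k: only the first k positions count
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(ranked_ids, judgments, k=None, zero_idcg_value=0.0):
    """NDCG for one query.

    ranked_ids: document IDs in the order the reranker returned them.
    judgments:  dict mapping document ID -> relevance score.
    zero_idcg_value: score used when no labeled document is relevant
                     (assumption: 0.0; adjust to match your convention).
    """
    # Unlabeled documents are treated as irrelevant (relevance 0).
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids]
    # Ideal ordering: all labeled documents sorted by relevance, descending.
    ideal = sorted(judgments.values(), reverse=True)
    idcg = dcg(ideal, k)
    if idcg == 0:
        return zero_idcg_value
    return dcg(gains, k) / idcg


def mean_ndcg(runs, labels, k=None):
    """Mean NDCG over every query in `labels`.

    runs:   dict query ID -> ranked list of document IDs.
    labels: dict query ID -> list of (document_id, relevance_score) tuples.
    """
    return mean(ndcg(runs[qid], dict(labels[qid]), k) for qid in labels)


# Hypothetical two-query evaluation set.
labels = {
    "q1": [("d1", 3), ("d2", 0), ("d3", 2)],
    "q2": [("d4", 1), ("d5", 0)],
}
runs = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5"],
}

print(f"Mean NDCG:   {mean_ndcg(runs, labels):.3f}")
print(f"Mean NDCG@5: {mean_ndcg(runs, labels, k=5):.3f}")
```

Because the ideal ordering here is built from every labeled document rather than only the returned ones, a run that fails to retrieve a relevant labeled document scores below 1 even if the documents it did return are perfectly ordered. If you prefer to normalize over the returned candidates only, build `ideal` from `gains` instead.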

Verification

Run the evaluation on a test set of 30 queries and confirm that every per-query NDCG score falls between 0 and 1. Then compare your reranking pipeline's NDCG against the baseline vector search without reranking, using the same queries and relevance labels, to quantify the improvement.

Expected output: Mean NDCG: 0.847 (reranked). Mean NDCG without reranking: 0.712. NDCG@5: 0.891. NDCG@10: 0.863. All 30 queries processed. Reranking improves NDCG by 0.135 (19% relative improvement).
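
A comparison like the one above could be scripted roughly as shown here. This is a sketch under assumptions: `compare_runs` is a hypothetical helper, it expects per-query NDCG scores you have already computed for both systems, and the sample scores are invented for illustration.

```python
from statistics import mean


def compare_runs(baseline_scores, reranked_scores):
    """Compare two runs given dicts of query ID -> per-query NDCG.

    Returns the two means plus the absolute and relative improvement.
    """
    if baseline_scores.keys() != reranked_scores.keys():
        raise ValueError("both runs must cover the same set of queries")

    # Sanity check: with non-negative relevance labels NDCG lies in [0, 1].
    for name, scores in (("baseline", baseline_scores), ("reranked", reranked_scores)):
        for qid, score in scores.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{name} NDCG out of range for {qid}: {score}")

    base = mean(baseline_scores.values())
    reranked = mean(reranked_scores.values())
    absolute = reranked - base
    # Relative improvement is undefined when the baseline mean is zero.
    relative = absolute / base if base > 0 else float("nan")
    return {
        "baseline": base,
        "reranked": reranked,
        "absolute": absolute,
        "relative": relative,
        "queries": len(baseline_scores),
    }


# Hypothetical per-query scores for two queries.
result = compare_runs(
    {"q1": 0.70, "q2": 0.60},
    {"q1": 0.80, "q2": 0.75},
)
print(f"Mean NDCG without reranking: {result['baseline']:.3f}")
print(f"Mean NDCG (reranked): {result['reranked']:.3f}")
print(f"Improvement: {result['absolute']:.3f} ({result['relative']:.0%} relative)")
print(f"Queries processed: {result['queries']}")
```

The relative figure is the absolute gain divided by the baseline mean, which is how a 0.135 gain over 0.712 works out to roughly 19%.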

Common failures

  1. Missing relevance labels for many query-document pairs: NDCG assumes a relevance judgment for every candidate in your test set. Pairs without labels are typically assigned relevance 0, so if many relevant documents are unlabeled, a reranker that surfaces them gets no credit and your NDCG misstates (usually underestimates) actual performance. Audit your label coverage and aim for complete coverage of the top-100 candidates per query.
  2. Graded relevance not properly weighted: Using binary relevance (0/1) loses the nuance that graded relevance provides. Ensure your DCG formula uses the actual graded scores, not a binarized version, to reflect real retrieval preferences (highly relevant documents should rank above marginally relevant ones).
  3. Inconsistent document identifiers across systems: If your reranking pipeline returns document IDs that do not match the IDs in your evaluation dataset, relevance lookups fail silently and produce incorrect NDCG. Normalize IDs to a consistent format (e.g., string-based chunk IDs) before evaluating and verify a sample of ID mappings before running the full evaluation.
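
The first and third failures can be caught with a quick audit before computing any scores. The sketch below is one possible check, with hypothetical names (`normalize_id`, `audit_run`) and an assumed normalization rule (cast to string and strip whitespace); adapt the rule to however your pipeline and dataset actually format IDs.

```python
def normalize_id(doc_id):
    """Assumed normalization: string form with surrounding whitespace removed."""
    return str(doc_id).strip()


def audit_run(ranked_ids, judgments, top_n=100):
    """Report label coverage for the top-N returned documents of one query.

    ranked_ids: document IDs in returned order (any type, e.g. int or str).
    judgments:  dict mapping document ID -> relevance score.
    Returns (coverage, unlabeled), where coverage is the fraction of the
    top-N documents that have a label and unlabeled lists the ones that do not.
    """
    top = [normalize_id(doc_id) for doc_id in ranked_ids[:top_n]]
    labeled = {normalize_id(doc_id) for doc_id in judgments}
    unlabeled = [doc_id for doc_id in top if doc_id not in labeled]
    coverage = (len(top) - len(unlabeled)) / len(top) if top else 0.0
    return coverage, unlabeled


# Hypothetical case: the pipeline returns mixed int/str IDs,
# while the evaluation set stores string IDs.
coverage, unlabeled = audit_run([101, " 102", "103"], {"101": 2, "102": 0})
print(f"Label coverage: {coverage:.0%}")
print(f"Unlabeled or unmatched IDs: {unlabeled}")
```

Coverage near zero for every query usually points to an ID format mismatch rather than genuinely missing labels, so inspect a few entries of `unlabeled` by hand before relabeling anything.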

Related guides

  • setup-cross-encoder-reranking
  • build-bm25-vector-hybrid-retrieval