What this does

Normalized Discounted Cumulative Gain (NDCG) measures the quality of a ranking by comparing the discounted gain of your reranked output against that of an ideal ordering of the same labeled documents. With non-negative relevance scores, NDCG ranges from 0 to 1, where 1 means the reranker returned the documents in the ideal order, most relevant first. It is a standard evaluation metric for retrieval and reranking tasks because it rewards placing highly relevant documents at the top and gives progressively less credit to relevant documents that appear lower in the list.

Steps

Compile your labeled dataset into a dictionary mapping each query ID to a list of (document_id, relevance_score) tuples.
Run your reranking pipeline for each query in the evaluation set and collect the ranked list of document IDs in the returned order.
Implement the DCG calculation: DCG = sum_{i=1}^{N} (rel_i / log2(i + 1)), where rel_i is the relevance score of the document at position i.
Compute the Ideal DCG (IDCG) by sorting the documents for each query by their relevance score in descending order and applying the same DCG formula.
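Calculate NDCG for each query: NDCG = DCG / IDCG. Handle edge cases where IDCG is zero (all documents irrelevant for a query) by setting NDCG to 0.0 or by excluding the query from the mean. Do not set it to 1.0, because that credits the pipeline with a perfect ranking on queries that have nothing relevant to rank and inflates the aggregate score.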
Aggregate NDCG across all queries by computing the mean (Mean NDCG) to get a single quality score for your pipeline.
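Optionally report NDCG@k for k=5 and k=10 to evaluate top-position quality, which matters most for systems where only the top results are shown to users. For NDCG@k, truncate both the ranked list and the ideal ordering to the first k positions before applying the DCG formula.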
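
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names dcg, ndcg, mean_ndcg, and the empty_score parameter are hypothetical, the gain is the linear rel_i from the formula in the steps, unlabeled documents are assumed to have relevance 0, and each ranked list is assumed to contain no duplicate IDs. The score used when IDCG is zero is a parameter, defaulting to 0.0 in this sketch.

```python
import math
from typing import Dict, List, Optional, Tuple

Labels = Dict[str, List[Tuple[str, float]]]  # query_id -> [(document_id, relevance_score)]
Run = Dict[str, List[str]]                   # query_id -> document IDs in returned order


def dcg(relevances: List[float], k: Optional[int] = None) -> float:
    """DCG = sum(rel_i / log2(i + 1)) over 1-based positions i, truncated at k."""
    if k is not None:
        relevances = relevances[:k]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(ranked_ids: List[str],
         judged: List[Tuple[str, float]],
         k: Optional[int] = None,
         empty_score: float = 0.0) -> float:
    """NDCG for one query. empty_score is returned when IDCG is zero."""
    rel_by_id = dict(judged)
    # Assumption: documents without a label count as relevance 0.
    gains = [rel_by_id.get(doc_id, 0.0) for doc_id in ranked_ids]
    # Ideal ordering: all labeled documents sorted by relevance, descending.
    ideal = sorted(rel_by_id.values(), reverse=True)
    idcg = dcg(ideal, k)
    if idcg == 0:
        return empty_score
    return dcg(gains, k) / idcg


def mean_ndcg(labels: Labels, run: Run, k: Optional[int] = None) -> float:
    """Mean NDCG over every labeled query; a query missing from the run scores 0."""
    scores = [ndcg(run.get(qid, []), judged, k) for qid, judged in labels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Calling mean_ndcg(labels, run) gives the overall score, and mean_ndcg(labels, run, k=5) or k=10 gives the truncated variants. To exclude all-irrelevant queries from the mean instead of scoring them 0.0, filter them out of labels before calling mean_ndcg.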

Verification

Run the evaluation on a test set of 30 queries and confirm that NDCG scores fall between 0 and 1. Compare your reranking pipeline's NDCG against the baseline vector search without reranking to quantify the improvement.

Example output (illustrative values; your numbers will differ): Mean NDCG: 0.847 (reranked). Mean NDCG without reranking: 0.712. NDCG@5: 0.891. NDCG@10: 0.863. All 30 queries processed. Reranking improves NDCG by 0.135 absolute (19% relative improvement).
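
The verification checks can be sketched as a small helper that takes per-query NDCG scores for both systems. The function name summarize and the dictionary layout (query ID mapped to NDCG) are assumptions made for illustration; it checks the 0 to 1 range and reports the absolute and relative improvement of the reranked run over the baseline.

```python
from typing import Dict


def summarize(reranked: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Compare per-query NDCG of a reranked run against a baseline run."""
    if not reranked or reranked.keys() != baseline.keys():
        raise ValueError("both runs must cover the same non-empty set of queries")

    # Every per-query score must lie in [0, 1], allowing for float rounding.
    for name, scores in (("reranked", reranked), ("baseline", baseline)):
        for qid, score in scores.items():
            if not -1e-9 <= score <= 1.0 + 1e-9:
                raise ValueError(f"{name} NDCG out of range for query {qid}: {score}")

    mean_reranked = sum(reranked.values()) / len(reranked)
    mean_baseline = sum(baseline.values()) / len(baseline)
    absolute = mean_reranked - mean_baseline
    # Relative improvement is undefined when the baseline mean is zero.
    relative = absolute / mean_baseline if mean_baseline > 0 else float("nan")
    return {
        "queries": float(len(reranked)),
        "mean_reranked": mean_reranked,
        "mean_baseline": mean_baseline,
        "absolute_improvement": absolute,
        "relative_improvement": relative,
    }
```

Raising on mismatched query sets is a deliberate choice in this sketch: comparing means computed over different queries would make the improvement figure meaningless.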

Common failures

Missing relevance labels for many query-document pairs: NDCG requires relevance judgments for all candidates in your test set. Pairs without labels are typically assigned relevance 0, so if many relevant documents are unlabeled, the reranker gets no credit for ranking them highly and your NDCG no longer reflects actual performance, typically understating it. Audit your label coverage and aim for complete coverage of the top-100 candidates per query.
Graded relevance not properly weighted: Using binary relevance (0/1) loses the nuance that graded relevance provides. Ensure your DCG formula uses the actual graded scores, not a binarized version, to reflect real retrieval preferences (highly relevant documents should rank above marginally relevant ones).
Inconsistent document identifiers across systems: If your reranking pipeline returns document IDs that do not match the IDs in your evaluation dataset, relevance lookups fail silently and produce incorrect NDCG. Normalize IDs to a consistent format (e.g., string-based chunk IDs) before evaluating and verify a sample of ID mappings before running the full evaluation.
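
The label-coverage and ID-consistency checks described above can be sketched as a pre-evaluation audit. The helpers normalize_id and label_coverage are hypothetical names, and the normalization shown (cast to string, strip whitespace) is only an assumed example; adapt it to however your systems format IDs.

```python
from typing import Dict, List, Tuple


def normalize_id(doc_id: object) -> str:
    """Assumed normalization: cast to str and strip surrounding whitespace."""
    return str(doc_id).strip()


def label_coverage(labels: Dict[str, List[Tuple[object, float]]],
                   run: Dict[str, List[object]],
                   top_n: int = 100) -> Dict[str, float]:
    """Fraction of each query's top_n returned documents that have a label."""
    coverage: Dict[str, float] = {}
    for qid, ranked in run.items():
        judged = {normalize_id(doc_id) for doc_id, _ in labels.get(qid, [])}
        top = [normalize_id(doc_id) for doc_id in ranked[:top_n]]
        # An empty result list is reported as zero coverage.
        coverage[qid] = sum(doc_id in judged for doc_id in top) / len(top) if top else 0.0
    return coverage
```

A query with coverage of 0.0 is a strong hint that the pipeline and the dataset use different ID formats, while low but nonzero coverage points to missing labels. Reviewing the lowest-coverage queries before running the full evaluation catches both problems early.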

Related guides

setup-cross-encoder-reranking
build-bm25-vector-hybrid-retrieval