NDCG Explained — RAG Evaluation and Metrics (Chapter 4)

Normalized Discounted Cumulative Gain (NDCG) extends MRR by supporting graded relevance. Instead of binary relevance (relevant/not relevant), NDCG handles multi-level relevance judgments: a document can be slightly relevant, moderately relevant, highly relevant, or perfectly relevant.

The "Discounted" part penalizes documents appearing in lower positions. The "Normalized" part scales scores to a 0-1 range, making results comparable across different query difficulty levels and result set sizes.

import math

def dcg(gains: list[float], positions: list[int] = None) -> float:
    """Calculate DCG from relevance gains."""
    if positions is None:
        positions = range(1, len(gains) + 1)
    
    return sum(gain / math.log2(pos + 1) for gain, pos in zip(gains, positions))


def ndcg(results: list[list[str]], 
         graded_relevance: list[dict[str, float]],
         k: int = 10) -> float:
    """
    Calculate Normalized Discounted Cumulative Gain.
    
    Args:
        results: List of ranked doc IDs for each query
        graded_relevance: List of dicts mapping doc_id to relevance grade (0-4)
        k: Consider only top-k results
    """
    ndcg_scores = []
    
    for retrieved, relevance in zip(results, graded_relevance):
        # Observed gains from retrieved documents
        gains = [relevance.get(doc_id, 0.0) for doc_id in retrieved[:k]]
        
        # Ideal gains: best possible ordering of same documents
        ideal_gains = sorted([relevance[d] for d in retrieved], reverse=True)[:k]
        
        observed_dcg = dcg(gains)
        ideal_dcg = dcg(ideal_gains)
        
        if ideal_dcg > 0:
            ndcg_scores.append(observed_dcg / ideal_dcg)
        else:
            ndcg_scores.append(0.0)  # No relevant docs in results
        
    return sum(ndcg_scores) / len(ndcg_scores)


# Example with graded relevance
results = [
    ["doc_X", "doc_Y", "doc_Z", "doc_W"],
]

# Context-aware grading: perfect match = 4, good = 3, partial = 2, weak = 1
graded_relevance = [
    {
        "doc_X": 4.0,  # Perfect match for query
        "doc_Y": 2.0,  # Partially relevant
        "doc_Z": 0.0,  # Not relevant
        "doc_W": 3.0,  # Highly relevant but retrieved late
    }
]

print(f"NDCG: {ndcg(results, graded_relevance):.4f}")

The example demonstrates graded relevance. doc_X (relevance 4) appears at position 1, contributing full gain with minimal discount. doc_W (relevance 3) appears at position 4, suffering a substantial discount for its poor position. doc_Z (relevance 0) appears at position 3, still receiving a discount despite not being relevant because it bumps more relevant documents.

Document ordering directly affects NDCG. Swapping doc_W and doc_Z positions would improve the score despite retrieving the same documents. This sensitivity makes NDCG the preferred metric when ranking quality is the primary concern.

The choice of grading scale affects NDCG meaningfully. Binary relevance (0 or 1) produces scores similar to MRR. Fine-grained scales (0-4 or 0-10) capture more nuance but require more careful calibration of grade boundaries. Design the grading scale around human judges' ability to distinguish relevance levels consistently.