Advanced RAG — Chunking, Retrieval, Re-ranking

23. RAGAS Context Precision

Chapter 23 of 24 · 20 min

KEY INSIGHT

Retrieval precision measures whether the top-ranked chunks in the context window contain what is needed to answer, penalizing relevant content buried in lower positions.

### Definition

Context precision evaluates the "density" of relevant information at the top of the retrieval ordering. Of the chunks needed to fully answer the question, how many appear in the top-k window, and how early? For each position i that holds a relevant chunk, compute precision@i (relevant chunks seen so far divided by i), then average those values over the relevant chunks in the window. Perfect precision (1.0) means the needed chunks occupy the first positions with no irrelevant chunk ranked above them. Poor precision means needed chunks are pushed to positions 7–10 of a top-10 window.

### Implementation

```python
from statistics import mean


def computeContextPrecision(
    retrievedChunks: list[dict],
    relevantChunkIds: set[str],
    k: int = 10
) -> dict:
    """
    Compute position-weighted context precision over the top-k results.

    retrievedChunks: list of {"id", "text", "score"} ordered by retrieval rank
    relevantChunkIds: set of chunk IDs known to answer the question
    k: evaluation window size
    """
    retrievedK = retrievedChunks[:k]
    precisions = []
    relevantSeen = 0

    for i, chunk in enumerate(retrievedK):
        position = i + 1  # 1-indexed
        if chunk["id"] in relevantChunkIds:
            relevantSeen += 1
            # Precision at this position = relevant items seen / position
            precisions.append(relevantSeen / position)

    relevantItems = relevantSeen  # relevant chunks inside the top-k window
    totalRelevant = len(relevantChunkIds)

    # Simplified RAGAS precision formula: average precision@i over the
    # relevant positions in the window. An irrelevant chunk adds nothing
    # itself but lowers precision@i for every relevant chunk ranked after it.
    # With no relevant chunk in the window the score is 0.0.
    precisionScore = mean(precisions) if precisions else 0.0

    return {
        "precision": round(precisionScore, 3),
        "retrieved_relevant": relevantItems,
        "total_relevant": totalRelevant,
        "position_breakdown": [
            {"position": i + 1, "id": c["id"], "relevant": c["id"] in relevantChunkIds}
            for i, c in enumerate(retrievedK)
        ]
    }
```

### Computing Relevant Chunk IDs

For evaluation, relevant chunks are identified by an oracle or LLM annotation. In practice, use the ground-truth answer to identify which chunks contain the facts needed:

```python
from openai import OpenAI

# Assumes the OpenAI Python SDK (v1+); the API key is read from OPENAI_API_KEY.
client = OpenAI()


def identifyRelevantChunks(
    chunks: list[dict],
    groundTruthAnswer: str,
    question: str,
    model: str = "gpt-4o-mini"
) -> set[str]:
    """
    Use an LLM to identify which retrieved chunks contain facts
    needed to answer the question.
    """
    check_prompt = (
        "Given the question and the ground truth answer, determine if "
        "the following passage contains at least one fact needed to answer the question.\n\n"
        "Question: {question}\n\nAnswer: {answer}\n\nPassage:\n{passage}\n\n"
        "Answer YES or NO."
    )
    relevant = set()

    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer only YES or NO."},
                {"role": "user", "content": check_prompt.format(
                    question=question,
                    answer=groundTruthAnswer,
                    passage=chunk["text"]
                )}
            ],
            temperature=0.0,
            max_tokens=16
        )
        # content can be None; match on the leading token so a stray "YES"
        # later in the reply does not count as a positive verdict
        verdict = (response.choices[0].message.content or "").strip().upper()
        if verdict.startswith("YES"):
            relevant.add(chunk["id"])

    return relevant
```

### Aggregated Context Precision

```python
def evaluateContextPrecisionOnDataset(
    dataset: list[dict],
    k: int = 10
) -> dict:
    if not dataset:
        raise ValueError("dataset is empty")

    results = []
    for item in dataset:
        retrieved = item["retrieved_chunks"]  # already ordered by rank
        relevant = identifyRelevantChunks(
            retrieved, item["ground_truth"], item["question"]
        )
        precResult = computeContextPrecision(retrieved, relevant, k=k)
        results.append(precResult)

    return {
        "mean_precision": mean(r["precision"] for r in results),
        "min_precision": min(r["precision"] for r in results),
        "max_precision": max(r["precision"] for r in results),
        "results": results
    }
```

### Failure Modes

Plain (unweighted) precision at k does not differentiate between a needed chunk at position 3 versus position 10 if both are in the window; the position-weighted score does, but it still treats relevance as binary. For higher-fidelity evaluation, use NDCG or DCG with relevance grades instead of binary relevance. Oracle-identified relevant chunks are approximations; human annotation remains the gold standard. Chunk boundaries affect precision: when a single fact is split across two retrieved chunks, both may be marked relevant, so the fact is double-counted in evaluation.
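
The graded alternative mentioned above can be sketched as follows. This is a minimal illustration, not part of RAGAS: the function names `dcgAtK` and `ndcgAtK`, the `gradeById` mapping, and the 0 to 3 grade scale are assumptions introduced here, and the discount uses the common log2(rank + 1) form with linear gain.

```python
import math


def dcgAtK(grades: list[int], k: int) -> float:
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(
        grade / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcgAtK(retrievedIds: list[str], gradeById: dict[str, int], k: int = 10) -> float:
    """
    NDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking.

    retrievedIds: chunk IDs in retrieval order
    gradeById: hypothetical annotations, chunk ID -> relevance grade
               (assumed scale: 0 = irrelevant ... 3 = essential)
    """
    actualGrades = [gradeById.get(chunkId, 0) for chunkId in retrievedIds]
    idealGrades = sorted(gradeById.values(), reverse=True)
    idealDcg = dcgAtK(idealGrades, k)
    return dcgAtK(actualGrades, k) / idealDcg if idealDcg > 0 else 0.0


# Hypothetical annotations: c1 is essential, c2 is marginally useful.
grades = {"c1": 3, "c2": 1}
early = ndcgAtK(["c1", "c2", "c9"], grades)  # essential chunk ranked first
late = ndcgAtK(["c9", "c2", "c1"], grades)   # essential chunk ranked last
print(early, late)
```

With graded relevance, pushing the essential chunk down the ranking lowers the score more than pushing down a marginal one, which binary relevance cannot express.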

EXERCISE

Implement context precision with relevant chunk identification. Evaluate on 20 queries with ground-truth answers. Report mean precision and identify queries where needed chunks ranked below position 5. (15 min)
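
As a starting point for the reporting step, here is a minimal sketch of a helper that flags queries whose needed chunks ranked below a cutoff. The name `findLowRankedQueries` is hypothetical; it assumes each per-query result carries the `position_breakdown` list produced by `computeContextPrecision`, and it reads "below position 5" as positions 6 and later.

```python
def findLowRankedQueries(results: list[dict], cutoff: int = 5) -> list[dict]:
    """Return one entry per query that has a relevant chunk ranked after `cutoff`."""
    flagged = []
    for queryIndex, result in enumerate(results):
        lowPositions = [
            entry["position"]
            for entry in result.get("position_breakdown", [])
            if entry["relevant"] and entry["position"] > cutoff
        ]
        if lowPositions:
            flagged.append({"query_index": queryIndex, "positions": lowPositions})
    return flagged


# Hand-written sample results (not real evaluation output).
sampleResults = [
    {"position_breakdown": [
        {"position": 1, "id": "a", "relevant": True},
        {"position": 2, "id": "b", "relevant": False},
    ]},
    {"position_breakdown": [
        {"position": 1, "id": "c", "relevant": False},
        {"position": 7, "id": "d", "relevant": True},
    ]},
]
print(findLowRankedQueries(sampleResults))
```

Applied to the `results` list returned by the dataset-level evaluation, the flagged indices point to the queries worth inspecting by hand.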