KEY INSIGHT
Context precision measures whether the top-ranked chunks in the context window contain what is needed to answer, penalizing relevant content buried in lower positions.
### Definition
Context precision evaluates the "density" of relevant information at the top of the retrieval ordering. Of the top-k retrieved chunks, how many are needed to answer the question, and are they ranked ahead of the chunks that are not? Perfect precision means every needed chunk in the window is ranked ahead of every irrelevant one, so the needed chunks occupy positions 1, 2, ..., n. Poor precision means needed chunks are pushed to positions 7–10 of a top-10 window.
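To make the definition concrete, here is a minimal sketch that scores two orderings of the same five chunks. The helper name `rank_weighted_precision` and the chunk IDs are invented for illustration, and the sketch assumes binary relevance with one common weighting: the average of precision@i over the positions i that hold a relevant chunk.

```python
def rank_weighted_precision(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Average of precision@i over the positions i that hold a relevant chunk."""
    hits = 0
    precisions = []
    for position, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            hits += 1
            precisions.append(hits / position)
    # No relevant chunk in the ranking: score 0.0 rather than dividing by zero
    return sum(precisions) / hits if hits else 0.0


# Hypothetical chunk IDs: "a" and "b" are needed, "x", "y", "z" are not
relevant = {"a", "b"}
front_loaded = ["a", "b", "x", "y", "z"]
buried = ["x", "y", "z", "a", "b"]

print(round(rank_weighted_precision(front_loaded, relevant), 3))  # 1.0
print(round(rank_weighted_precision(buried, relevant), 3))  # 0.325
```

Both orderings contain two relevant chunks out of five, so an unweighted precision@5 would be 0.4 for each; only the rank-weighted score separates them.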
### Implementation
```python
from statistics import mean


def computeContextPrecision(
    retrievedChunks: list[dict],
    relevantChunkIds: set[str],
    k: int = 10,
) -> dict:
    """
    Compute rank-weighted context precision over the top-k retrieved chunks.

    retrievedChunks: list of {"id", "text", "score"} ordered by retrieval rank
    relevantChunkIds: set of chunk IDs known to answer the question
    k: evaluation window size
    """
    retrievedK = retrievedChunks[:k]
    precisions = []
    relevantItems = 0  # relevant chunks seen so far while walking down the ranking
    for position, chunk in enumerate(retrievedK, start=1):  # 1-indexed
        if chunk["id"] in relevantChunkIds:
            relevantItems += 1
            # Precision at this position = relevant items seen / position,
            # recorded only at positions that hold a relevant chunk
            precisions.append(relevantItems / position)
    totalRelevant = len(relevantChunkIds)
    # Simplified RAGAS precision formula: average the recorded precisions over
    # the relevant chunks in the window, so early hits score higher than late ones.
    # Scores 0.0 when no relevant chunk was retrieved.
    precisionScore = sum(precisions) / relevantItems if relevantItems else 0.0
    return {
        "precision": round(precisionScore, 3),
        "retrieved_relevant": relevantItems,
        "total_relevant": totalRelevant,
        "position_breakdown": [
            {"position": i + 1, "id": c["id"], "relevant": c["id"] in relevantChunkIds}
            for i, c in enumerate(retrievedK)
        ],
    }
```
### Computing Relevant Chunk IDs
For evaluation, relevant chunks are identified by human annotation or by an LLM acting as an oracle. When a ground-truth answer is available, use it to identify which chunks contain the facts needed:
```python
from openai import OpenAI

# Assumption: `client` is an OpenAI Python SDK (v1+) client,
# which reads OPENAI_API_KEY from the environment.
client = OpenAI()

# Prompt template, built once and filled in per chunk
CHECK_PROMPT = (
    "Given the question and the ground truth answer, determine if "
    "the following passage contains at least one fact needed to answer the question.\n\n"
    "Question: {question}\n\nAnswer: {answer}\n\nPassage:\n{passage}\n\n"
    "Answer YES or NO."
)


def identifyRelevantChunks(
    chunks: list[dict],
    groundTruthAnswer: str,
    question: str,
    model: str = "gpt-4o-mini",
) -> set[str]:
    """
    Use an LLM to identify which retrieved chunks contain
    facts needed to answer the question.
    """
    relevant = set()
    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer only YES or NO."},
                {"role": "user", "content": CHECK_PROMPT.format(
                    question=question,
                    answer=groundTruthAnswer,
                    passage=chunk["text"],
                )},
            ],
            temperature=0.0,
            max_tokens=16,
        )
        # content can be None; match the start of the reply so that a verdict
        # such as "NO. YES would require ..." is not counted as relevant
        verdict = (response.choices[0].message.content or "").strip().upper()
        if verdict.startswith("YES"):
            relevant.add(chunk["id"])
    return relevant
```
### Aggregated Context Precision
```python
def evaluateContextPrecisionOnDataset(
    dataset: list[dict],
    k: int = 10,
) -> dict:
    results = []
    for item in dataset:
        retrieved = item["retrieved_chunks"]  # already ordered by retrieval rank
        relevant = identifyRelevantChunks(
            retrieved, item["ground_truth"], item["question"]
        )
        precResult = computeContextPrecision(retrieved, relevant, k=k)
        results.append(precResult)
    if not results:
        # mean(), min() and max() all raise on empty input
        return {
            "mean_precision": None,
            "min_precision": None,
            "max_precision": None,
            "results": [],
        }
    scores = [r["precision"] for r in results]
    return {
        "mean_precision": mean(scores),
        "min_precision": min(scores),
        "max_precision": max(scores),
        "results": results,
    }
```
### Failure Modes
Plain precision at k does not differentiate between a needed chunk at position 3 and one at position 10 as long as both fall inside the window, which is why the implementation above weights by rank. Even with rank weighting, relevance is still binary; for higher-fidelity evaluation, use DCG or NDCG with graded relevance instead. LLM-identified relevant chunks are approximations; human annotation remains the gold standard. Chunk boundaries also affect precision: when a single fact is split across two retrieved chunks, both may be marked relevant and the fact is counted twice.
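As a rough sketch of the graded alternative, the snippet below computes NDCG from per-chunk relevance grades. The function names, the 0/1/2 grade scale, and the linear-gain form of DCG (grade divided by log2 of rank + 1) are illustrative assumptions; other gain functions, such as 2^grade - 1, are also in common use.

```python
import math


def dcg_at_k(grades: list[float], k: int) -> float:
    """Discounted cumulative gain: the grade at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))


def ndcg_at_k(ranked_ids: list[str], grade_by_id: dict[str, float], k: int = 10) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    # Chunks without a grade are treated as irrelevant (grade 0)
    actual = [grade_by_id.get(chunk_id, 0.0) for chunk_id in ranked_ids]
    # The ideal ranking places all graded chunks in descending grade order,
    # including graded chunks that were never retrieved
    ideal = sorted(grade_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(actual, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical grades: 2 = essential to the answer, 1 = supporting
grades = {"a": 2, "b": 1}
print(round(ndcg_at_k(["a", "b", "x"], grades, k=3), 3))  # 1.0
print(round(ndcg_at_k(["x", "b", "a"], grades, k=3), 3))  # 0.62
```

Unlike binary scoring, this penalizes ranking the essential chunk below the supporting one, even when both are inside the window.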