14. Human Annotation
Chapter 14 of 18 · 20 min
Automated metrics miss nuanced quality issues that trained human raters often catch. An annotation pipeline with clear guidelines and quality control produces ground-truth data against which automated evaluation can be calibrated.
Designing Annotation Guidelines
Clear guidelines reduce inter-annotator disagreement. Each guideline document should include question types with examples, scoring rubrics with boundary cases, and a feedback mechanism for reporting ambiguous cases.
# Relevance Annotation Guidelines
## Scoring Scale
- **4 - Perfect**: Answer fully satisfies the query with appropriate scope
- **3 - Acceptable**: Answer addresses core intent but has minor gaps
- **2 - Partial**: Answer covers only part of what was asked
- **1 - Poor**: Answer is tangentially related or contains significant errors
- **0 - Irrelevant**: Answer does not address the query at all
## Boundary Cases
### Query: "How do I reset the admin password?"
- 4: Step-by-step reset instructions for admin password
- 3: Reset instructions that require contacting support (partial workaround)
- 2: General password reset instructions (not specific to admin)
- 1: Mentions that a password reset exists, with no instructions
- 0: Answer about user password management
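The scoring scale above can also be encoded as data so that annotation tooling rejects out-of-range scores. The following is a minimal sketch; `RELEVANCE_SCALE` and `describe_score` are hypothetical names introduced for illustration, and the labels are copied from the guidelines.

```python
from typing import Dict

# Labels copied from the scoring scale in the guidelines
RELEVANCE_SCALE: Dict[int, str] = {
    4: "Perfect",
    3: "Acceptable",
    2: "Partial",
    1: "Poor",
    0: "Irrelevant",
}


def describe_score(score: int) -> str:
    """Return 'N - Label' for a valid score, or raise ValueError."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"Score must be an integer, got {score!r}")
    if score not in RELEVANCE_SCALE:
        raise ValueError(f"Score {score} is outside the 0-4 scale")
    return f"{score} - {RELEVANCE_SCALE[score]}"


print(describe_score(3))  # 3 - Acceptable
```

Keeping the scale in one place means the interface, the validator, and the written guidelines are less likely to drift apart.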
Building an Annotation Interface
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AnnotationTask:
    """One query/answer pair presented to an annotator."""
    query_id: str
    query: str
    context: List[str]
    answer: str
    annotation_id: str
    annotator_id: Optional[str] = None


@dataclass
class AnnotationResult:
    """Scores submitted by one annotator for one task."""
    task_id: str
    annotator_id: str
    relevance_score: int  # 0-4
    faithfulness_score: int  # 0-4
    conciseness_score: int  # 0-4
    notes: Optional[str] = None
    flagged: bool = False
    timestamp: Optional[str] = None


def export_annotation_batch(
    tasks: List[AnnotationTask],
    output_file: str
) -> None:
    """Export tasks for annotation as JSON Lines (one task per line)."""
    with open(output_file, "w", encoding="utf-8") as f:
        for task in tasks:
            # asdict() converts the dataclass into a JSON-serializable dict
            f.write(json.dumps(asdict(task)) + "\n")
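# Streamlit-based annotation interface: write this string to a file
# (for example app.py) and launch it with `streamlit run app.py`
ANNOTATION_APP = """
import json
import streamlit as st

def save_annotation(query_id, relevance, faithfulness, conciseness):
    # Append one result per line; the output file name is a placeholder
    record = {"query_id": query_id, "relevance_score": relevance,
              "faithfulness_score": faithfulness, "conciseness_score": conciseness}
    with open("annotation_results.jsonl", "a") as f:
        print(json.dumps(record), file=f)

st.title("RAG Response Annotation")

with open("annotation_tasks.jsonl") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

if "current_idx" not in st.session_state:
    st.session_state.current_idx = 0

# Stop once every task is annotated instead of indexing past the end
if st.session_state.current_idx >= len(tasks):
    st.success("All tasks annotated.")
    st.stop()

task = tasks[st.session_state.current_idx]
st.subheader(f"Query: {task['query']}")
st.markdown("**Retrieved Context:**")
for ctx in task['context']:
    st.markdown(f"> {ctx}")
st.markdown(f"**Answer:** {task['answer']}")

col1, col2, col3 = st.columns(3)
relevance = col1.slider("Relevance", 0, 4, 2)
faithfulness = col2.slider("Faithfulness", 0, 4, 2)
conciseness = col3.slider("Conciseness", 0, 4, 2)

if st.button("Submit & Next"):
    save_annotation(task['query_id'], relevance, faithfulness, conciseness)
    st.session_state.current_idx += 1
    st.rerun()
"""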
Calculating Inter-Annotator Agreement
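Cohen's Kappa (for two raters) and Krippendorff's Alpha (for any number of raters, tolerant of missing ratings) measure agreement beyond what chance alone would produce. As a rule of thumb, a value below 0.6 suggests the guidelines are ambiguous and need revision.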
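To make "agreement beyond chance" concrete: Cohen's Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed fraction of identical scores and p_e is the agreement expected if both raters assigned labels independently at their own observed rates. The following from-scratch sketch is for illustration only; `cohen_kappa` is a hypothetical name, and returning 1.0 when p_e equals 1 is a convention chosen here, not a standard.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical scores
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both pick the same label independently
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # assumption: both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


rater_a = [4, 4, 3, 2, 0, 1, 3, 4, 2, 2]
rater_b = [4, 3, 3, 2, 0, 1, 2, 4, 2, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.615 (p_o = 0.7, p_e = 0.22)
```

The two raters agree on 7 of 10 items, but 22% agreement would be expected by chance, so kappa lands just above the 0.6 threshold rather than at 0.7.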
from typing import Dict

import numpy as np
from sklearn.metrics import cohen_kappa_score


def calculate_annotation_agreement(
    annotations_by_annotator: Dict[str, Dict[str, dict]]
) -> dict:
    """Calculate inter-annotator agreement across raters.

    Expects {annotator_id: {task_id: {"relevance_score": int, ...}}}.
    """
    annotators = list(annotations_by_annotator.keys())
    if len(annotators) < 2:
        return {"error": "Need at least two annotators"}

    # Find tasks annotated by every rater
    common_tasks = set(annotations_by_annotator[annotators[0]].keys())
    for annotator in annotators[1:]:
        common_tasks &= set(annotations_by_annotator[annotator].keys())

    if len(common_tasks) < 3:
        return {"error": "Insufficient common annotations"}

    # Fix the task order so both score lists line up item by item
    task_order = sorted(common_tasks)

    # Pairwise kappa for relevance. This is unweighted, so every disagreement
    # counts equally; for an ordinal 0-4 scale, weights="quadratic" is a
    # common alternative.
    kappa_scores = []
    for i, a1 in enumerate(annotators):
        for a2 in annotators[i + 1:]:
            scores1 = [
                annotations_by_annotator[a1][t]["relevance_score"]
                for t in task_order
            ]
            scores2 = [
                annotations_by_annotator[a2][t]["relevance_score"]
                for t in task_order
            ]
            kappa = cohen_kappa_score(scores1, scores2)
            kappa_scores.append(float(kappa))

    return {
        "mean_kappa": float(np.mean(kappa_scores)),
        "all_pairwise_kappas": kappa_scores,
        "common_tasks_count": len(task_order),
        "annotators": annotators
    }
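A single kappa value says how much raters disagree but not where. One way to locate the ambiguous part of a rubric is to count which score pairs get confused most often between two raters. This is a sketch under that assumption; `confusion_pairs` is a hypothetical helper, not part of any library.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def confusion_pairs(
    rater_a: Sequence[int], rater_b: Sequence[int]
) -> List[Tuple[Tuple[int, ...], int]]:
    """Count disagreeing score pairs, most frequent first."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Score lists must be the same length")
    # Sort each pair so (2, 3) and (3, 2) count as the same confusion
    pairs = Counter(
        tuple(sorted((a, b))) for a, b in zip(rater_a, rater_b) if a != b
    )
    return pairs.most_common()


scores_a = [3, 2, 3, 4, 2, 3, 1]
scores_b = [2, 3, 2, 4, 2, 4, 1]
confusions = confusion_pairs(scores_a, scores_b)
print(confusions)  # [((2, 3), 3), ((3, 4), 1)]
```

In this made-up sample, most disagreements sit on the boundary between 2 (Partial) and 3 (Acceptable), which would point to that boundary case as the one to clarify in the guidelines.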
EXERCISE
Recruit two colleagues to annotate 50 test cases independently. Calculate Cohen's Kappa, identify the highest-disagreement examples, and use those cases to refine the annotation guidelines.
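For the "highest-disagreement examples" step, one simple approach is to rank tasks by the spread between the highest and lowest score they received. The sketch below assumes annotations are stored as {annotator_id: {task_id: {"relevance_score": int}}}; `rank_disagreements` and the annotator names are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_disagreements(
    annotations_by_annotator: Dict[str, Dict[str, dict]],
    field: str = "relevance_score",
    top_n: int = 5,
) -> List[Tuple[str, int, List[int]]]:
    """Return (task_id, spread, scores) for the widest-spread tasks."""
    annotators = list(annotations_by_annotator)
    if not annotators:
        return []
    # Only tasks scored by every annotator can be compared
    common = set(annotations_by_annotator[annotators[0]])
    for name in annotators[1:]:
        common &= set(annotations_by_annotator[name])
    ranked = []
    for task_id in common:
        scores = [annotations_by_annotator[name][task_id][field] for name in annotators]
        ranked.append((task_id, max(scores) - min(scores), scores))
    # Widest spread first; ties broken by task id for a stable order
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


annotations = {
    "alice": {"q1": {"relevance_score": 4}, "q2": {"relevance_score": 1}, "q3": {"relevance_score": 3}},
    "bob": {"q1": {"relevance_score": 4}, "q2": {"relevance_score": 3}, "q3": {"relevance_score": 2}},
    "carol": {"q1": {"relevance_score": 3}, "q2": {"relevance_score": 4}, "q3": {"relevance_score": 2}},
}
worst = rank_disagreements(annotations, top_n=2)
print(worst)  # [('q2', 3, [1, 3, 4]), ('q1', 1, [4, 4, 3])]
```

The top-ranked tasks are good candidates for a group review session, and the resolved cases can be added to the guidelines as new boundary examples.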