Human Annotation — RAG Evaluation and Metrics (Chapter 14)

Automated metrics can miss nuanced quality issues that human raters catch. An annotation pipeline with clear guidelines and quality control produces ground-truth data against which automated evaluation can be calibrated.

Designing Annotation Guidelines

Clear guidelines reduce inter-annotator disagreement. Each guide should include the question types with examples, a scoring rubric with boundary cases, and a way for annotators to flag or escalate ambiguous cases.

# Relevance Annotation Guidelines

## Scoring Scale
- **4 - Perfect**: Answer fully satisfies the query with appropriate scope
- **3 - Acceptable**: Answer addresses core intent but has minor gaps
- **2 - Partial**: Answer covers only part of what was asked
- **1 - Poor**: Answer is tangentially related or contains significant errors
- **0 - Irrelevant**: Answer does not address the query at all

## Boundary Cases

### Query: "How do I reset the admin password?"
- 4: Step-by-step reset instructions for admin password
- 3: Reset instructions that require contacting support (partial workaround)
- 2: General password reset instructions (not specific to admin)
- 1: Mentions that a password reset exists but gives no instructions
- 0: Answer about user password management
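
One way to keep the scale above consistent between the written guidelines and downstream code is to encode it once and validate every submitted score against it. The sketch below is illustrative only; `RelevanceScore` and `validate_score` are hypothetical names introduced here, not part of any library.

```python
from enum import IntEnum


class RelevanceScore(IntEnum):
    """The 0-4 relevance scale from the guidelines (hypothetical helper)."""
    IRRELEVANT = 0
    POOR = 1
    PARTIAL = 2
    ACCEPTABLE = 3
    PERFECT = 4


def validate_score(raw: int) -> RelevanceScore:
    """Return the matching scale entry, or raise ValueError if out of range."""
    try:
        return RelevanceScore(raw)
    except ValueError:
        valid = [int(s) for s in RelevanceScore]
        raise ValueError(f"score must be one of {valid}, got {raw!r}") from None


print(validate_score(3).name)  # ACCEPTABLE
```

Rejecting out-of-range values at submission time keeps malformed scores out of the agreement statistics computed later.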

Building an Annotation Interface

from dataclasses import dataclass
from typing import List, Optional
import json

@dataclass
class AnnotationTask:
    query_id: str
    query: str
    context: List[str]
    answer: str
    annotation_id: str
    annotator_id: Optional[str] = None

@dataclass
class AnnotationResult:
    task_id: str
    annotator_id: str
    relevance_score: int  # 0-4
    faithfulness_score: int  # 0-4
    conciseness_score: int  # 0-4
    notes: Optional[str] = None
    flagged: bool = False
    timestamp: Optional[str] = None

def export_annotation_batch(
    tasks: List[AnnotationTask],
    output_file: str
):
    """Export tasks for annotation in standard format."""
    with open(output_file, "w") as f:
        for task in tasks:
            f.write(json.dumps(task.__dict__) + "\n")

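# Streamlit-based annotation interface.
# Write this string to a file (for example app.py) and launch it with `streamlit run app.py`.
ANNOTATION_APP = """
import json
from datetime import datetime, timezone

import streamlit as st

RESULTS_FILE = "annotation_results.jsonl"  # assumed output path


def save_annotation(task_id, annotator_id, relevance, faithfulness, conciseness):
    # Append one JSON record per line, mirroring the AnnotationResult fields.
    record = {
        "task_id": task_id,
        "annotator_id": annotator_id,
        "relevance_score": relevance,
        "faithfulness_score": faithfulness,
        "conciseness_score": conciseness,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(RESULTS_FILE, "a") as f:
        print(json.dumps(record), file=f)


st.title("RAG Response Annotation")
annotator_id = st.sidebar.text_input("Annotator ID")

with open("annotation_tasks.jsonl") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

if "current_idx" not in st.session_state:
    st.session_state.current_idx = 0

# Stop cleanly once every task is annotated instead of indexing past the end.
if st.session_state.current_idx >= len(tasks):
    st.success("All tasks annotated.")
    st.stop()

task = tasks[st.session_state.current_idx]

st.subheader(f"Query: {task['query']}")
st.markdown("**Retrieved Context:**")
for ctx in task['context']:
    st.markdown(f"> {ctx}")
st.markdown(f"**Answer:** {task['answer']}")

col1, col2, col3 = st.columns(3)
relevance = col1.slider("Relevance", 0, 4, 2)
faithfulness = col2.slider("Faithfulness", 0, 4, 2)
conciseness = col3.slider("Conciseness", 0, 4, 2)

if st.button("Submit & Next"):
    save_annotation(task['query_id'], annotator_id, relevance, faithfulness, conciseness)
    st.session_state.current_idx += 1
    st.rerun()
"""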

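The agreement calculation in the next section indexes annotations by annotator and then by task. Assuming results are stored one JSON object per line with `annotator_id` and `task_id` fields (mirroring `AnnotationResult`), a loader might look like the following sketch. The function name `load_results_by_annotator` and the file layout are assumptions made for illustration.

```python
import json
from collections import defaultdict
from typing import Dict


def load_results_by_annotator(path: str) -> Dict[str, Dict[str, dict]]:
    """Group JSONL annotation results as {annotator_id: {task_id: result}}."""
    grouped: Dict[str, Dict[str, dict]] = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            result = json.loads(line)
            # A later record for the same annotator and task overwrites the earlier one.
            grouped[result["annotator_id"]][result["task_id"]] = result
    return dict(grouped)
```

Letting the latest record win is one possible policy for resubmitted tasks; a stricter pipeline could reject duplicates instead.
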
Calculating Inter-Annotator Agreement

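Cohen's kappa (for a pair of raters) and Krippendorff's alpha (for any number of raters, including cases with missing ratings) measure agreement beyond what chance alone would produce. As a rule of thumb, values below about 0.6 suggest ambiguity in the guidelines that needs resolution.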
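
To make "agreement beyond chance" concrete, here is a minimal from-scratch sketch of unweighted Cohen's kappa for two raters, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected from each rater's label frequencies. The function name `cohen_kappa` is hypothetical, and the sketch treats scores as unordered categories.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same non-zero number of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both pick the same label
    # independently, based on each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# The raters agree on 3 of 4 items; chance alone accounts for 0.5 agreement.
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

Because the 0-4 scale is ordinal, a weighted variant that penalizes a 4-versus-0 disagreement more than a 4-versus-3 one may fit better; scikit-learn's `cohen_kappa_score` accepts `weights="linear"` or `weights="quadratic"` for that purpose.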

from sklearn.metrics import cohen_kappa_score
from typing import Dict, List
import numpy as np

def calculate_annotation_agreement(
    annotations_by_annotator: Dict[str, Dict[str, dict]]
) -> dict:
    """Calculate pairwise inter-annotator agreement on relevance scores.

    Expects {annotator_id: {task_id: annotation_dict}}.
    """
    annotators = list(annotations_by_annotator.keys())
    if len(annotators) < 2:
        return {"error": "Need at least two annotators"}

    # Find tasks annotated by every annotator; sort for a stable order
    common_tasks = set(annotations_by_annotator[annotators[0]].keys())
    for annotator in annotators[1:]:
        common_tasks &= set(annotations_by_annotator[annotator].keys())
    common_tasks = sorted(common_tasks)

    if len(common_tasks) < 3:
        return {"error": "Insufficient common annotations"}
    
    # Pairwise kappa for relevance
    kappa_scores = []
    for i, a1 in enumerate(annotators):
        for a2 in annotators[i+1:]:
            scores1 = [
                annotations_by_annotator[a1][t]["relevance_score"] 
                for t in common_tasks
            ]
            scores2 = [
                annotations_by_annotator[a2][t]["relevance_score"] 
                for t in common_tasks
            ]
            kappa = cohen_kappa_score(scores1, scores2)
            kappa_scores.append(kappa)
    
    return {
        "mean_kappa": np.mean(kappa_scores),
        "all_pairwise_kappas": kappa_scores,
        "common_tasks_count": len(common_tasks),
        "annotators": annotators
    }
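
When mean kappa is low, it helps to see which tasks drive the disagreement so the guidelines can be revised around them. The following sketch lists tasks whose scores differ by at least a chosen spread across annotators. The function `find_contested_tasks` and the default threshold of 2 points are illustrative assumptions, not an established standard.

```python
from typing import Dict, List, Tuple


def find_contested_tasks(
    annotations_by_annotator: Dict[str, Dict[str, dict]],
    field: str = "relevance_score",
    min_spread: int = 2,
) -> List[Tuple[str, int]]:
    """Return (task_id, spread) for tasks whose scores differ by >= min_spread."""
    scores_by_task: Dict[str, List[int]] = {}
    for results in annotations_by_annotator.values():
        for task_id, result in results.items():
            scores_by_task.setdefault(task_id, []).append(result[field])
    contested = [
        (task_id, max(scores) - min(scores))
        for task_id, scores in scores_by_task.items()
        if len(scores) >= 2 and max(scores) - min(scores) >= min_spread
    ]
    # Largest disagreements first; ties broken by task id for stable output.
    return sorted(contested, key=lambda item: (-item[1], item[0]))


example = {
    "ann_a": {"t1": {"relevance_score": 4}, "t2": {"relevance_score": 3}},
    "ann_b": {"t1": {"relevance_score": 1}, "t2": {"relevance_score": 3}},
}
print(find_contested_tasks(example))  # [('t1', 3)]
```

Tasks surfaced this way are natural candidates for new boundary cases in the annotation guidelines.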