16. Evaluation

Chapter 16 of 24 · 20 min

KEY INSIGHT

Evaluation determines whether fine-tuning achieved its intended goal. Generic perplexity scores don't capture task-specific performance: a model with excellent perplexity might fail at the actual use case, while another with worse perplexity performs exactly as needed.

Design evaluation to match downstream behavior. If the model will answer questions, evaluate question answering. If it will generate code, evaluate code generation with unit tests. Metrics must proxy real-world success.

A baseline evaluation loop reports loss, perplexity, and token accuracy:

```python
import math

import torch
import torch.nn.functional as F


def evaluate_model(model, dataloader, tokenizer, device):
    """Return mean token loss, perplexity, and next-token accuracy."""
    model.eval()
    total_loss, total_correct, total_tokens = 0.0, 0, 0

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            logits = model(input_ids, attention_mask=attention_mask).logits

            # Assumes a causal LM whose labels are aligned with input_ids:
            # the logits at position t predict the token at position t + 1.
            logits = logits[:, :-1, :]
            labels = labels[:, 1:]

            # Score only real tokens: skip padding and the -100 ignore index.
            mask = (labels != tokenizer.pad_token_id) & (labels != -100)

            # Summed cross-entropy over scored tokens; perplexity is derived below.
            total_loss += F.cross_entropy(
                logits[mask], labels[mask], reduction="sum"
            ).item()

            # Token accuracy
            total_correct += (logits[mask].argmax(dim=-1) == labels[mask]).sum().item()
            total_tokens += mask.sum().item()

    mean_loss = total_loss / total_tokens
    return {
        "loss": mean_loss,
        "perplexity": math.exp(mean_loss),
        "accuracy": total_correct / total_tokens,
    }
```

**Task-specific evaluation** requires different metrics:

| Task | Metrics |
|------|---------|
| Classification | F1, Precision, Recall, AUROC |
| Generation | BLEU, ROUGE, BERTScore |
| Code Generation | Pass@k, Compilation Rate |
| Summarization | ROUGE-L, Factuality |

```python
# Example: Code generation evaluation
# generate_code and capture_stdout are helpers assumed to be defined elsewhere:
# the first samples a completion, the second runs it and returns its stdout.
def evaluate_code_generation(model, test_cases, tokenizer, device):
    results = {"compiled": 0, "passed": 0}
    for prompt, expected_output in test_cases:
        generated = generate_code(model, prompt, tokenizer, device)
        try:
            # Attempt to compile
            compile(generated, "<string>", "exec")
        except (SyntaxError, ValueError):
            continue  # does not compile, so it cannot pass
        results["compiled"] += 1

        try:
            # Check output. Run generated code only inside a sandbox.
            exec_output = capture_stdout(generated)
        except Exception:
            continue  # runtime failure: compiled but did not pass
        if exec_output.strip() == expected_output.strip():
            results["passed"] += 1

    return {k: v / len(test_cases) for k, v in results.items()}
```

**Failure mode**: Data contamination. If evaluation samples appear in training data, metrics will be artificially inflated. Use strict data separation and report both in-distribution and out-of-distribution performance.
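One rough way to screen for contamination is to flag evaluation samples that share a long word n-gram with any training sample. The sketch below is illustrative only: the names `ngrams` and `find_contaminated`, the whitespace tokenization, and the default window of 8 words are assumptions, not part of this chapter's codebase. It catches verbatim overlap but not paraphrases.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set and are never flagged.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(train_texts, eval_texts, n=8):
    """Return indices of eval samples sharing at least one n-gram with training data."""
    train_ngrams = set()
    for text in train_texts:
        train_ngrams |= ngrams(text, n)
    return [
        i for i, text in enumerate(eval_texts)
        if ngrams(text, n) & train_ngrams
    ]


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    held_out = ["a quick brown fox jumps high", "completely unrelated sentence here now"]
    # A short window (n=3) is used here only because the toy texts are short.
    print(find_contaminated(train, held_out, n=3))  # prints [0]
```

Flagged samples can be dropped from the evaluation set or reported separately, so that the headline numbers reflect only clean held-out data.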

EXERCISE: Implement Task-Specific Evaluation

Create a custom evaluation function for text classification that computes per-class precision and recall:

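```python
import torch
from sklearn.metrics import classification_report


def evaluate_classification(model, dataloader, id2label, device):
    """Returns per-class metrics."""
    model.eval()
    predictions = []
    references = []

    with torch.no_grad():
        for batch in dataloader:
            # Implementation here: run the model on the batch, then extend
            # `predictions` and `references` with integer class ids.
            pass

    # Compute confusion matrix and derive metrics
    labels = sorted(id2label)
    return classification_report(
        references,
        predictions,
        labels=labels,
        target_names=[id2label[i] for i in labels],  # names ordered by class id
        output_dict=True,  # return a dict, as the docstring promises
        zero_division=0,
    )
```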
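To see what the report computes, per-class precision and recall can also be derived by hand from true-positive, false-positive, and false-negative counts. This is a minimal sketch, not the exercise solution: the function name `per_class_precision_recall` is hypothetical, and it assumes `references` and `predictions` are already lists of integer class ids.

```python
from collections import Counter


def per_class_precision_recall(references, predictions, id2label):
    """Compute precision and recall for each class from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[ref] += 1   # true class gains a false negative

    metrics = {}
    for class_id in sorted(id2label):
        predicted = tp[class_id] + fp[class_id]  # times the class was predicted
        actual = tp[class_id] + fn[class_id]     # times the class was the truth
        metrics[id2label[class_id]] = {
            "precision": tp[class_id] / predicted if predicted else 0.0,
            "recall": tp[class_id] / actual if actual else 0.0,
        }
    return metrics


if __name__ == "__main__":
    refs = [0, 0, 1, 1, 1]
    preds = [0, 1, 1, 1, 0]
    print(per_class_precision_recall(refs, preds, {0: "neg", 1: "pos"}))
```

Precision answers "of the samples predicted as this class, how many were right", while recall answers "of the samples that truly belong to this class, how many were found". A class that is never predicted gets a precision of 0.0 here by convention.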