08. Quantitative Evaluation

Chapter 8 of 18 · 15 min

KEY INSIGHT

Quantitative evaluation is only meaningful when accompanied by statistical rigor. Report means, standard deviations, and significance tests, never single-run numbers. Rigorous evaluation requires multiple seeds, proper statistical tests, and transparent reporting of both improvements and regressions.

**Statistical Framework:**

```python
import numpy as np
import scipy.stats as stats


def evaluate_significance(results_dict, alpha=0.05):
    """
    Compare treatment (our method) against baseline.
    Returns significance status and effect size.
    """
    baseline_scores = np.array(results_dict["baseline"])
    treatment_scores = np.array(results_dict["treatment"])

    # Welch's t-test (does not assume equal variances)
    t_stat, p_value = stats.ttest_ind(treatment_scores, baseline_scores, equal_var=False)

    # Cohen's d for effect size, pooling the sample variances (ddof=1)
    pooled_std = np.sqrt(
        (np.var(baseline_scores, ddof=1) + np.var(treatment_scores, ddof=1)) / 2
    )
    baseline_mean = np.mean(baseline_scores)
    treatment_mean = np.mean(treatment_scores)
    cohens_d = (treatment_mean - baseline_mean) / pooled_std

    return {
        "p_value": p_value,
        "significant": p_value < alpha,
        "effect_size": cohens_d,
        "baseline_mean": baseline_mean,
        "treatment_mean": treatment_mean,
        "improvement_pct": (treatment_mean - baseline_mean) / baseline_mean * 100,
    }
```

**Evaluation Reporting Template:**

| Metric | Baseline | Ours | Δ | p-value | Notes |
|--------|----------|------|---|---------|-------|
| BLEU | 28.4 | 29.8 | +1.4 | 0.003 | Significant |
| Params (M) | 250 | 245 | -2% | - | - |
| Latency (ms) | 45 | 38 | -16% | 0.012 | Significant |
| Memory (GB) | 12 | 11.5 | -4% | 0.08 | Not significant |

**Common Mistakes:**

- Reporting test set metrics without a validation set check (overfitting to the test set)
- Ignoring failed runs (survivorship bias in results)
- Comparing single runs across different random seeds (incomparable)

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
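One way to capture these details is a small run record written next to the exercise. The sketch below is illustrative only: `build_run_record`, the package list, the `data/eval.jsonl` path, and the stand-in computation are all hypothetical names chosen for this example, not part of the chapter's pipeline.

```python
import json
import platform
import sys
import time
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(packages, data_path, observed_output, runtime_seconds, notes=""):
    """Collect the facts needed to reproduce a local run (hypothetical helper)."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": str(data_path),
        "runtime_seconds": round(runtime_seconds, 3),
        "observed_output": observed_output,
        "notes": notes,  # e.g. model size, CPU/GPU backend, available memory
    }


start = time.perf_counter()
observed = sum(range(10))  # stand-in for the chapter's smallest example
elapsed = time.perf_counter() - start

record = build_run_record(
    packages=["numpy", "scipy"],      # assumed dependencies
    data_path="data/eval.jsonl",      # placeholder path
    observed_output=observed,
    runtime_seconds=elapsed,
    notes="CPU only",
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON alongside your results keeps the constraint notes attached to the numbers they qualify.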

EXERCISE

Run your evaluation pipeline with 3 different random seeds. Report mean and standard deviation for each metric. If std is >10% of mean, investigate instability.
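A minimal sketch of the aggregation step, assuming each seed's run yields a dictionary of metric values. The function name `summarize_across_seeds`, the metric names, and the numbers in `runs` are placeholders for illustration, not real results. The 10% rule is applied as std divided by the absolute mean.

```python
import statistics


def summarize_across_seeds(runs, cv_threshold=0.10):
    """Compute mean, sample std, and an instability flag per metric.

    runs: dict mapping seed -> dict of metric name -> value.
    A metric is flagged unstable when std / |mean| exceeds cv_threshold.
    """
    metric_names = sorted({name for result in runs.values() for name in result})
    summary = {}
    for name in metric_names:
        values = [result[name] for result in runs.values() if name in result]
        mean = statistics.mean(values)
        # Sample standard deviation needs at least two values
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        if mean != 0:
            cv = std / abs(mean)
        else:
            cv = 0.0 if std == 0 else float("inf")
        summary[name] = {
            "n": len(values),
            "mean": mean,
            "std": std,
            "cv": cv,
            "unstable": cv > cv_threshold,
        }
    return summary


# Placeholder values for three seeds (not measured results)
runs = {
    0: {"bleu": 29.6, "latency_ms": 38.0},
    1: {"bleu": 29.9, "latency_ms": 45.0},
    2: {"bleu": 29.8, "latency_ms": 31.0},
}

summary = summarize_across_seeds(runs)
for name, stats_row in summary.items():
    flag = "investigate" if stats_row["unstable"] else "ok"
    print(f"{name}: {stats_row['mean']:.2f} ± {stats_row['std']:.2f} ({flag})")
```

With only three seeds the standard deviation is itself a noisy estimate, so treat the flag as a prompt to look closer rather than a verdict.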