RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Capstone: Research AI System
  6. /Ch. 8
Capstone: Research AI System

08. Quantitative Evaluation

Chapter 8 of 18 · 15 min
KEY INSIGHT

Quantitative evaluation is only meaningful when accompanied by statistical rigor. Report means, standard deviations, and significance tests—never single-run numbers. Rigorous evaluation requires multiple seeds, proper statistical tests, and transparent reporting of both improvements and regressions. **Statistical Framework:** ```python import scipy.stats as stats import numpy as np def evaluate_significance(results_dict, alpha=0.05): """ Compare treatment (our method) against baseline. Returns significance status and effect size. """ baseline_scores = np.array(results_dict["baseline"]) treatment_scores = np.array(results_dict["treatment"]) # Welch's t-test (does not assume equal variances) t_stat, p_value = stats.ttest_ind(treatment_scores, baseline_scores, equal_var=False) # Cohen's d for effect size pooled_std = np.sqrt((np.std(baseline_scores)**2 + np.std(treatment_scores)**2) / 2) cohens_d = (np.mean(treatment_scores) - np.mean(baseline_scores)) / pooled_std return { "p_value": p_value, "significant": p_value < alpha, "effect_size": cohens_d, "baseline_mean": np.mean(baseline_scores), "treatment_mean": np.mean(treatment_scores), "improvement_pct": (np.mean(treatment_scores) - np.mean(baseline_scores)) / np.mean(baseline_scores) * 100 } ``` **Evaluation Reporting Template:** | Metric | Baseline | Ours | Δ | p-value | Notes | |--------|----------|------|---|---------|-------| | BLEU | 28.4 | 29.8 | +1.4 | 0.003 | Significant | | Params (M) | 250 | 245 | -2% | - | - | | Latency (ms) | 45 | 38 | -16% | 0.012 | Significant | | Memory (GB) | 12 | 11.5 | -4% | 0.08 | Not significant | **Common Mistakes:** - Reporting test set metrics without validation set check (overfitting to test) - Ignoring failed runs (survivorship bias in results) - Comparing single runs across different random seeds (incomparable)

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Run your evaluation pipeline with 3 different random seeds. Report mean and standard deviation for each metric. If std is >10% of mean, investigate instability.

← Chapter 7
Ablation Study Design
Chapter 9 →
Qualitative Analysis