HOW-TO · DEV
How to run statistical significance tests on A/B prompt test results
Target environment
Ubuntu 24.04 · Python 3.12
PREREQUISITES
A/B prompt test results with success/failure counts per variant; Python 3.10+ with scipy, statsmodels, and numpy installed
What this does
After collecting A/B test data, statistical significance tests determine whether the observed performance difference between prompt variants is likely real or attributable to random chance. For binary outcome metrics (success/failure), a chi-squared test or two-proportion z-test is appropriate. For continuous metrics (latency, token count), a Welch's t-test or Mann-Whitney U test is used.
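For the continuous-metric case, a minimal sketch might look like the following. The latency values are made up for illustration, and the variable names are hypothetical; only `scipy.stats.ttest_ind` and `scipy.stats.mannwhitneyu` are taken from the tests named above.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Hypothetical per-request latencies in seconds for each prompt variant
latency_a = [1.21, 1.35, 1.28, 1.40, 1.33, 1.25, 1.38, 1.30]
latency_b = [1.10, 1.05, 1.18, 1.12, 1.08, 1.15, 1.02, 1.11]

# Welch's t-test: compares means without assuming equal variances
t_stat, p_welch = ttest_ind(latency_a, latency_b, equal_var=False)

# Mann-Whitney U: rank-based, makes no normality assumption
u_stat, p_mwu = mannwhitneyu(latency_a, latency_b, alternative="two-sided")

print(f"Welch t = {t_stat:.3f}, p = {p_welch:.4g}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mwu:.4g}")
```

Welch's variant is selected with `equal_var=False`; leaving that argument at its default runs Student's t-test, which assumes equal variances.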
Steps
- Collect test data: organize results into a structure with variant labels, total requests, and success counts per variant.
- For binary outcomes (task success, error rate), use scipy.stats.chi2_contingency or statsmodels.stats.proportion.proportions_ztest. For continuous outcomes, use scipy.stats.ttest_ind (with equal_var=False for Welch's t-test) or scipy.stats.mannwhitneyu.
- Define the null hypothesis: there is no difference in performance between the variants.
- Compute the test statistic and associated p-value using the function selected for the metric type.
- Compare the p-value to the chosen significance level (alpha, typically 0.05). If p < alpha, reject the null hypothesis and declare the difference statistically significant.
- Report the effect size alongside the p-value: for proportions, compute the absolute difference and relative lift; for continuous metrics, report Cohen's d.
- If sample size is small (fewer than 30 per variant, or any expected cell count below 5), prefer an exact test such as Fisher's exact test (scipy.stats.fisher_exact) over the chi-squared approximation.
- Visualize results: plot the distribution of outcomes per variant and overlay the significance verdict on a results dashboard.
- Document the conclusion with test parameters, p-value, effect size, and confidence interval.
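The effect-size and confidence-interval steps above can be sketched with the standard library alone. The function names here are hypothetical, and the interval is a Wald (normal-approximation) interval, which is an assumption on our part since the steps do not name a specific method.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def proportion_effect(successes_a, total_a, successes_b, total_b, confidence=0.95):
    """Absolute difference, relative lift, and Wald CI for (B minus A)."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    diff = rate_b - rate_a
    lift = diff / rate_a  # relative change with variant A as the baseline
    # Unpooled standard error, the usual choice for an interval estimate
    se = sqrt(rate_a * (1 - rate_a) / total_a + rate_b * (1 - rate_b) / total_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "abs_diff": diff,
        "relative_lift": lift,
        "ci": (diff - z_crit * se, diff + z_crit * se),
    }


def cohens_d(sample_a, sample_b):
    """Cohen's d for (B minus A) using the pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * stdev(sample_a) ** 2 + (n_b - 1) * stdev(sample_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(sample_b) - mean(sample_a)) / sqrt(pooled_var)


effect = proportion_effect(380, 500, 430, 500)
print(effect)
```

With 380/500 versus 430/500 successes, the absolute difference is 10 percentage points. An interval that excludes zero agrees with a significant two-sided test at the matching alpha.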
Verification
python3 -c "
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest
import numpy as np
# Example A/B test results
variant_a_successes, variant_a_total = 380, 500
variant_b_successes, variant_b_total = 430, 500
# Chi-squared test of independence on the 2x2 table of successes and failures
observed = np.array([[variant_a_successes, variant_a_total - variant_a_successes],
                     [variant_b_successes, variant_b_total - variant_b_successes]])
# correction=False disables the Yates continuity correction so the result matches the z-test
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f'Chi-squared p-value: {p:.4f}')
# Two-proportion z-test (pooled variance, two-sided by default)
z, p_z = proportions_ztest([variant_a_successes, variant_b_successes],
                           [variant_a_total, variant_b_total])
print(f'Z-statistic: {z:.4f}, two-sided p-value: {p_z:.4f}')
significant = p_z < 0.05
print(f'Statistically significant at alpha=0.05: {significant}')
"
Expected output: chi-squared p-value of 0.0001, z-statistic around -4.03 (negative because variant A has the lower success rate), two-sided p-value of 0.0001, and Statistically significant at alpha=0.05: True.
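For the small-sample case mentioned in the steps, the same kind of 2x2 table can be passed to an exact test instead. This is a sketch with made-up counts; `small_table` is a hypothetical name, and `scipy.stats.fisher_exact` is the only library call assumed.

```python
from scipy.stats import fisher_exact

# Hypothetical small-sample results: [successes, failures] per variant
small_table = [
    [9, 6],   # variant A: 9 of 15 succeeded
    [14, 1],  # variant B: 14 of 15 succeeded
]

# Fisher's exact test computes the p-value from the hypergeometric
# distribution rather than from a large-sample approximation
odds_ratio, p_exact = fisher_exact(small_table, alternative="two-sided")
print(f"Fisher exact two-sided p-value: {p_exact:.4f}")
```

Even a large-looking gap in success rates may fail to reach significance at this sample size, which is why the power analysis described under common failures is worth doing up front.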
Common failures
- Low sample size causing false negatives: small sample sizes produce wide confidence intervals, making significance undetectable even when a real difference exists. Solution: perform power analysis before running the test to determine the minimum sample size needed.
- Wrong test type for data distribution: applying a t-test or z-test to heavily skewed continuous data, especially with small samples, can distort Type I error rates. Solution: inspect the distribution histogram before selecting a test; use non-parametric tests (Mann-Whitney) when normality assumptions are violated.
- Multiple comparisons without correction: testing many metrics separately without adjusting alpha inflates the family-wise error rate. Solution: apply Bonferroni correction or use Benjamini-Hochberg FDR correction when running multiple simultaneous tests.
- Misinterpreting p-value as effect size: a statistically significant result does not indicate practical importance. Solution: always report effect size alongside p-value to contextualize the magnitude of the difference.
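Two of the remedies above, a pre-test sample size estimate and multiple-comparison correction, can be sketched without extra dependencies. All function names are hypothetical. The sample size formula is the common normal-approximation formula for comparing two proportions, chosen here as an assumption; a dedicated power-analysis library may give slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p1, p2, alpha=0.05, power=0.8):
    """Approximate n per variant to detect p1 vs p2 with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k being the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= max_rank
    return rejected


# Hypothetical p-values from five metrics tested on the same experiment
p_values = [0.001, 0.012, 0.025, 0.041, 0.20]
print("n per variant:", sample_size_per_variant(0.76, 0.86))
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni controls the family-wise error rate and is the stricter of the two; Benjamini-Hochberg controls the false discovery rate and typically rejects more hypotheses on the same inputs.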