What this does

After collecting A/B test data, statistical significance tests determine whether the observed performance difference between prompt variants is likely real or attributable to random chance. For binary outcome metrics (success/failure), a chi-squared test or two-proportion z-test is appropriate. For continuous metrics (latency, token count), a Welch's t-test or Mann-Whitney U test is used.
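
For the continuous case, a minimal sketch of the two tests named above could look like the following. The latency values are made-up illustration data and the variable names are hypothetical; only scipy.stats.ttest_ind and scipy.stats.mannwhitneyu come from the library itself.

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Hypothetical per-request latencies in seconds (illustration data only)
latency_a = [1.21, 1.35, 1.18, 1.42, 1.30, 1.27, 1.39, 1.25, 1.33, 1.29]
latency_b = [1.02, 1.10, 0.98, 1.15, 1.07, 1.05, 1.12, 1.01, 1.09, 1.04]

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_stat, p_welch = ttest_ind(latency_a, latency_b, equal_var=False)

# Mann-Whitney U: rank-based, so it does not assume normality
u_stat, p_mwu = mannwhitneyu(latency_a, latency_b, alternative="two-sided")

print(f"Welch t = {t_stat:.3f}, p = {p_welch:.4g}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mwu:.4g}")
```

With real data the two tests can disagree; when they do, the shape of the distribution is usually the reason, and the rank-based result is the safer one to report for heavily skewed metrics.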

Steps

Collect test data: organize results into a structure with variant labels, total requests, and success counts per variant.
Select the test: for binary outcomes (task success, error rate), use scipy.stats.chi2_contingency or statsmodels.stats.proportion.proportions_ztest. For continuous outcomes, use scipy.stats.ttest_ind with equal_var=False (Welch's t-test) or scipy.stats.mannwhitneyu.
Define the null hypothesis: there is no difference in performance between the variants.
Compute the test statistic and associated p-value using the appropriate scipy or statsmodels function.
Compare the p-value to the chosen significance level (alpha, typically 0.05). If p < alpha, reject the null hypothesis and declare the difference statistically significant.
Report the effect size alongside the p-value: for proportions, compute the absolute difference and relative lift; for continuous metrics, report Cohen's d.
If sample size is small (fewer than 30 per variant, or any expected cell count below 5), prefer an exact test such as Fisher's exact test (scipy.stats.fisher_exact) over the chi-squared approximation.
Visualize results: plot the distribution of outcomes per variant and overlay the significance verdict on a results dashboard.
Document the conclusion with test parameters, p-value, effect size, and confidence interval.
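
The effect-size and confidence-interval steps above can be sketched with the standard library alone. This is a hand-rolled illustration, not a library API: the function names proportion_effect and cohens_d are hypothetical, the interval is a simple Wald (normal-approximation) interval, and Cohen's d here uses the pooled standard deviation.

```python
import math
from statistics import NormalDist, mean, stdev

def proportion_effect(successes_a, total_a, successes_b, total_b, confidence=0.95):
    """Absolute difference, relative lift, and Wald CI for B minus A."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    diff = p_b - p_a
    lift = diff / p_a  # relative lift of B over A
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, lift, (diff - z_crit * se, diff + z_crit * se)

def cohens_d(sample_a, sample_b):
    """Cohen's d for B minus A, using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * stdev(sample_a) ** 2 + (n_b - 1) * stdev(sample_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(sample_b) - mean(sample_a)) / math.sqrt(pooled_var)

# Same counts as the Verification example: 380/500 vs 430/500
diff, lift, ci = proportion_effect(380, 500, 430, 500)
print(f"absolute difference: {diff:.3f}, relative lift: {lift:.1%}")
print(f"95% CI for the difference: ({ci[0]:.3f}, {ci[1]:.3f})")
```

An interval that excludes zero agrees with a significant two-sided test at the matching alpha, and its width shows how precisely the difference has been estimated.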

Verification

python3 -c "
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Example A/B test results
variant_a_successes, variant_a_total = 380, 500
variant_b_successes, variant_b_total = 430, 500

# Chi-squared test of independence on the 2x2 table
# (rows: variants; columns: successes, failures)
observed = np.array([[variant_a_successes, variant_a_total - variant_a_successes],
                     [variant_b_successes, variant_b_total - variant_b_successes]])
chi2, p, dof, expected = chi2_contingency(observed)
print(f'Chi-squared p-value: {p:.4f}')

# Two-proportion z-test (proportions_ztest lives in statsmodels, not scipy)
z, p_z = proportions_ztest([variant_a_successes, variant_b_successes],
                           [variant_a_total, variant_b_total])
print(f'Z-statistic: {z:.4f}, two-sided p-value: {p_z:.4f}')
significant = p_z < 0.05
print(f'Statistically significant at alpha=0.05: {significant}')
"

Expected output: both p-values near 0.0001, a z-statistic around -4.03 (negative because variant A, listed first, has the lower success rate), and Statistically significant at alpha=0.05: True. The command requires numpy, scipy, and statsmodels to be installed.

Common failures

Low sample size causing false negatives: small sample sizes produce wide confidence intervals, making significance undetectable even when a real difference exists. Solution: perform power analysis before running the test to determine the minimum sample size needed.
Wrong test type for data distribution: applying a t-test or z-test to heavily skewed continuous data, especially with small samples, can distort Type I error rates and reduce power. Solution: inspect the distribution histogram before selecting a test; use non-parametric tests (Mann-Whitney U) when normality assumptions are violated.
Multiple comparisons without correction: testing many metrics separately without adjusting alpha inflates the family-wise error rate. Solution: apply Bonferroni correction or use Benjamini-Hochberg FDR correction when running multiple simultaneous tests.
Misinterpreting p-value as effect size: a statistically significant result does not indicate practical importance. Solution: always report effect size alongside p-value to contextualize the magnitude of the difference.
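
Two of the remedies above, power analysis and multiple-comparison correction, can be sketched without extra dependencies. Both functions are hypothetical hand-rolled helpers written for illustration: required_sample_size uses the standard normal-approximation formula for comparing two proportions, and benjamini_hochberg implements the step-up FDR procedure. In practice statsmodels offers equivalents (for example statsmodels.stats.multitest.multipletests), which are likely a better choice than maintaining these by hand.

```python
import math
from statistics import NormalDist

def required_sample_size(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)

def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per p-value under the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject

# Sample size needed to detect 76% vs 86% success with 80% power
n_per_variant = required_sample_size(0.76, 0.86)
print(f"requests needed per variant: {n_per_variant}")

# Hypothetical p-values from four metrics tested on the same experiment
metric_p_values = [0.01, 0.02, 0.03, 0.04]
bh_reject = benjamini_hochberg(metric_p_values)
bonferroni_reject = [p <= 0.05 / len(metric_p_values) for p in metric_p_values]
print(f"Benjamini-Hochberg rejects: {bh_reject}")
print(f"Bonferroni rejects:         {bonferroni_reject}")
```

The comparison at the end illustrates the trade-off: Bonferroni controls the family-wise error rate and is more conservative, while Benjamini-Hochberg controls the false discovery rate and typically rejects more hypotheses.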

How to run statistical significance tests on A/B prompt test results
