What this does

A/B testing model responses lets you compare model versions, prompts, or configurations using production data. Traffic is split between a control variant (the current production model) and one or more treatment variants (candidate improvements). User feedback, task completion rates, and quality metrics are collected for each variant. Statistical analysis then determines whether a treatment outperforms the control at a chosen significance level, which guides the deployment decision.

Steps

Define the experiment configuration. Create an experiments table: CREATE TABLE experiments (id UUID PRIMARY KEY, name TEXT, control_variant TEXT, treatment_variants JSONB, traffic_split JSONB, start_time TIMESTAMP, end_time TIMESTAMP, status TEXT). Example entry: traffic_split = {"control": 0.5, "treatment_a": 0.5}. The split values should sum to 1.
Implement the traffic router. In the API middleware, hash the user or session ID of each incoming request to a consistent bucket: bucket = int(hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest(), 16) % 100. Assign the variant by bucket range, derived from traffic_split; for a 50/50 split, 0-49 → control and 50-99 → treatment_a. Because the hash is deterministic, the same user always sees the same variant (sticky assignment), and including experiment_id in the hash key keeps assignments independent across experiments. For anonymous traffic, assign a variant at random and store the assignment in a cookie.
Implement the variant execution. Route the request to the model and prompt configured for the assigned variant.
Log each request with its variant assignment. INSERT INTO experiment_events (experiment_id, user_id, variant, request_tokens, response_tokens, latency_ms, task_type, timestamp) VALUES (...).
Define success metrics. For a chatbot: user satisfaction rating (1-5). For code generation: whether the user accepted the generated code. For summarization: the ROUGE score against a reference. Store the metric value so it can be joined to the logged event and its variant.
Create a results analyzer. After collecting data (at least 100 samples per variant, and more when the expected effect is small), run the analysis. Calculate the mean and standard deviation of the success metric for each variant. Run a two-sample t-test: from scipy.stats import ttest_ind; t_stat, p_value = ttest_ind(control_scores, treatment_scores, equal_var=False). Passing equal_var=False selects Welch's t-test, which does not assume the two variants have equal variance. If p_value < 0.05, the difference is statistically significant at the 5% level. Also compute the effect size using Cohen's d: effect_size = (mean_treatment - mean_control) / pooled_std, where pooled_std = sqrt(((n_control - 1) * std_control**2 + (n_treatment - 1) * std_treatment**2) / (n_control + n_treatment - 2)).
Generate a report. Include sample sizes, mean values, confidence intervals, the p-value, the effect size, and a recommendation.
Implement an automatic stop mechanism. If the treatment is clearly harmful (p < 0.01 and a negative effect), halt the experiment early and route all traffic to control. Treat this as a safety guardrail, not as a way to declare a winner early.
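
The router step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names bucket_for and assign_variant and the experiment ID "exp-001" are hypothetical, and the sketch assumes traffic_split is a mapping whose values sum to 1, as in the example entry.

```python
import hashlib
from typing import Mapping


def bucket_for(experiment_id: str, user_id: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    key = f"{experiment_id}:{user_id}".encode()
    # MD5 is used only to spread IDs evenly, not for security.
    return int(hashlib.md5(key).hexdigest(), 16) % buckets


def assign_variant(
    experiment_id: str, user_id: str, traffic_split: Mapping[str, float]
) -> str:
    """Pick a variant by walking cumulative bucket ranges from traffic_split."""
    bucket = bucket_for(experiment_id, user_id)
    upper = 0.0
    for variant, share in traffic_split.items():
        upper += share * 100
        if bucket < upper:
            return variant
    # Fallback for floating-point shortfall when shares sum to just under 1.
    return list(traffic_split)[-1]


if __name__ == "__main__":
    split = {"control": 0.5, "treatment_a": 0.5}
    counts = {name: 0 for name in split}
    for i in range(1000):
        counts[assign_variant("exp-001", f"user-{i}", split)] += 1
    print(counts)  # per-variant counts for 1000 mock users
```

Walking cumulative ranges keeps the router correct for uneven splits and for more than one treatment, as long as the iteration order of traffic_split stays the same for the lifetime of the experiment.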

Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
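
One way to capture the starting state and run evidence is a small snapshot script. This is a sketch only: the function name environment_snapshot and the default package list are assumptions, and the model name or configuration path would need to be added from wherever the application stores them.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=("scipy", "numpy")):
    """Collect interpreter and package versions for the run record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python_executable": sys.executable,
        "python_version": platform.python_version(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Save this output next to the command that was run.
    print(json.dumps(environment_snapshot(), indent=2))
```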

Verification

Create a test experiment with a 50/50 split. Send 100 requests with distinct user IDs and verify that each variant receives roughly 50. At this sample size the count has a standard deviation of 5, so allow about ±15; a ±5 band fails often by chance. Use 1000 or more requests for a tighter check.
Check that the same user ID always gets the same variant across multiple requests.
Verify that experiment events are logged with correct variant labels and timestamps.
Run the statistical analyzer on a small dataset and confirm it produces p-values and effect sizes.
Simulate a clear winner by making one variant always return higher scores and verify the analyzer recommends it.
Test the automatic stop: configure a 10-request check, make the treatment score 0 every time, and verify that traffic reverts to control.
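
The analyzer checks above can be exercised with a sketch like the following. It assumes numeric scores per variant and uses Welch's t-test (ttest_ind with equal_var=False) plus Cohen's d; the function names, the action labels, and the thresholds passed as alpha and harm_alpha are illustrative assumptions rather than a fixed interface.

```python
import math
import random
from statistics import mean, stdev

from scipy.stats import ttest_ind


def cohens_d(control, treatment):
    """Standardized mean difference (treatment minus control), pooled SD."""
    n_c, n_t = len(control), len(treatment)
    pooled_var = (
        (n_c - 1) * stdev(control) ** 2 + (n_t - 1) * stdev(treatment) ** 2
    ) / (n_c + n_t - 2)
    if pooled_var == 0:
        return 0.0  # both variants are constant; no spread to standardize by
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def analyze(control, treatment, alpha=0.05, harm_alpha=0.01):
    """Summarize one treatment against control and suggest an action."""
    # Welch's t-test does not assume equal variances across variants.
    p_value = float(ttest_ind(control, treatment, equal_var=False).pvalue)
    d = cohens_d(control, treatment)
    if p_value < harm_alpha and d < 0:
        action = "stop_and_route_to_control"
    elif p_value < alpha and d > 0:
        action = "ship_treatment"
    else:
        action = "keep_control"
    return {
        "n_control": len(control),
        "n_treatment": len(treatment),
        "mean_control": mean(control),
        "mean_treatment": mean(treatment),
        "p_value": p_value,
        "effect_size": d,
        "action": action,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    control = [rng.gauss(3.0, 0.5) for _ in range(200)]
    winner = [rng.gauss(3.5, 0.5) for _ in range(200)]
    print(analyze(control, winner)["action"])
    # Harmful treatment: 10 requests that all score 0.
    print(analyze(control[:10], [0.0] * 10)["action"])
```

The harmful-treatment branch is checked first so that a strongly negative result triggers the stop regardless of the shipping threshold.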

Common failures

Uneven traffic split - Verify that the hash bucket assignment produces a uniform distribution: test with 1000 mock user IDs and count the assignments per variant.
Simpson's paradox - Aggregate results may show the wrong winner if user segments have different base rates and are unevenly represented across variants; segment the analysis by user type or task category.
Peeking problem - Checking results repeatedly and stopping as soon as p < 0.05 inflates the false positive rate; pre-register the required sample size and run the significance test only once it is reached.
Sticky variant breaks when user clears cookies - Use the authenticated user ID instead of a cookie for sticky assignment wherever one is available.
Statistical test assumptions violated - If the metric is binary (like/dislike), use a chi-squared test or Fisher's exact test instead of a t-test.

Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
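
For the binary-metric case in the list above, a comparison could look like the sketch below. The function name compare_binary is hypothetical, and the switch to Fisher's exact test when any expected cell count is below 5 is a common rule of thumb, not something this guide prescribes.

```python
from scipy.stats import chi2_contingency, fisher_exact


def compare_binary(control_success, control_total, treatment_success, treatment_total):
    """Test whether success rates differ between two variants."""
    table = [
        [control_success, control_total - control_success],
        [treatment_success, treatment_total - treatment_success],
    ]
    # Expected cell counts under the hypothesis of no difference.
    total = control_total + treatment_total
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    row_totals = [control_total, treatment_total]
    expected_min = min(r * c / total for r in row_totals for c in col_totals)

    if expected_min < 5:
        # Small expected counts: the chi-squared approximation is unreliable.
        _, p_value = fisher_exact(table)
        test = "fisher_exact"
    else:
        p_value = chi2_contingency(table)[1]  # index 1 is the p-value
        test = "chi2"
    return {
        "test": test,
        "p_value": float(p_value),
        "control_rate": control_success / control_total,
        "treatment_rate": treatment_success / treatment_total,
    }


if __name__ == "__main__":
    # 120 of 400 accepted in control vs 160 of 400 in treatment.
    print(compare_binary(120, 400, 160, 400))
```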

Related guides

build-rag-evaluation-pipeline
use-dspy-prompt-optimization
setup-prompt-layer-prompt-management