HOW-TO · DEV
How to set up A/B testing for AI prompts using a traffic splitting framework
Target environment
Ubuntu 24.04 · Python 3.12
PREREQUISITES
Prompt variants (at least 2), traffic routing capability (custom proxy, feature flag system, or load balancer), logging infrastructure
What this does
A/B testing for AI prompts routes incoming requests to different prompt variants and records outcomes for each. This enables data-driven prompt optimization by measuring which variant produces better results on real workloads. A traffic splitting proxy sits in front of the AI API and distributes requests according to a configured ratio (e.g., 50/50 or 80/20).
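The routing decision itself can be quite small. The sketch below is one possible way to pick a variant by weighted random choice; the names VARIANT_WEIGHTS and pick_variant are hypothetical, and the 80/20 ratio is just one of the example ratios mentioned above.

```python
import random

# Hypothetical configuration: variant ID -> share of traffic (assumed 80/20).
VARIANT_WEIGHTS = {"variant_a": 0.8, "variant_b": 0.2}


def pick_variant(weights=VARIANT_WEIGHTS, rng=random):
    """Return one variant ID, chosen with probability proportional to its weight."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[v] for v in ids], k=1)[0]
```

Calling pick_variant() once per incoming request gives a stateless split. Passing a seeded random.Random instance as rng makes the sequence repeatable in tests.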
Steps
- Define the hypothesis: what metric does the test aim to improve? (e.g., task success rate, user satisfaction score, latency).
- Create the prompt variants as separate string constants or configuration objects with unique identifiers (e.g., variant_a, variant_b).
- Implement the traffic splitter. For a Python-based approach, create a FastAPI application that wraps the AI API call: on each request, select a variant using random.choices with configured weights.
- Tag every outgoing request and logged event with the selected variant ID so outcomes can be grouped during analysis.
- Instrument the splitter to log: request ID, variant ID, timestamp, user ID or session ID, and the outcome metric.
- Set the traffic split ratio (e.g., 50% variant A, 50% variant B). For new prompts, start with an even split to collect unbiased baseline data.
- Deploy the splitter in front of the AI API. Validate that traffic is correctly distributed by checking variant counts over a short period.
- Run the experiment for a predetermined minimum duration or until each variant reaches the sample size calculated in advance for the desired statistical power.
- Export the logged data to a structured format (for example CSV or JSON Lines) and analyze it in a companion metrics pipeline or with a statistical test.
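The selection, tagging, and logging steps above might be combined roughly as follows. This is a minimal, framework-free sketch under stated assumptions: PROMPTS, WEIGHTS, call_model, and handle_request are hypothetical names, call_model is a stub standing in for the real AI API call, and the log format (one JSON object per line) is an assumption rather than something the steps prescribe.

```python
import json
import random
import time
import uuid

# Hypothetical prompt variants keyed by unique ID.
PROMPTS = {
    "variant_a": "Summarize the following text in one sentence:\n{text}",
    "variant_b": "You are a concise editor. Give a one-sentence summary of:\n{text}",
}
WEIGHTS = {"variant_a": 0.5, "variant_b": 0.5}  # even split for a new test


def call_model(prompt):
    """Stub for the real AI API call; returns a canned response."""
    return {"text": "stub response", "tokens": len(prompt.split())}


def handle_request(text, session_id, log_file, rng=random):
    """Select a variant, call the model, and append one JSON line to the log."""
    ids = list(WEIGHTS)
    variant_id = rng.choices(ids, weights=[WEIGHTS[v] for v in ids], k=1)[0]
    record = {
        "request_id": str(uuid.uuid4()),
        "variant_id": variant_id,
        "timestamp": time.time(),
        "session_id": session_id,
        "outcome": None,  # to be filled in once the outcome metric is known
    }
    try:
        response = call_model(PROMPTS[variant_id].format(text=text))
        record["tokens"] = response["tokens"]
        record["status"] = "ok"
    except Exception as exc:  # failed requests are logged with their variant too
        response = None
        record["status"] = f"error: {exc}"
    log_file.write(json.dumps(record) + "\n")
    return variant_id, response
```

A FastAPI route handler could call handle_request with an open log file; the web layer is left out here to keep the sketch dependency-free. Because each log line is a standalone JSON object, the file can be loaded directly for analysis.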
Verification
python3 -c "
import random
from collections import Counter

# Draw 200 assignments with equal weights, as the splitter would per request.
results = random.choices(['variant_a', 'variant_b'], weights=[0.5, 0.5], k=200)
counts = Counter(results)
print('variant_a:', counts['variant_a'], 'variant_b:', counts['variant_b'])
# Bounds of 100 +/- 20 are about 2.8 standard deviations for n=200, p=0.5.
assert 80 < counts['variant_a'] < 120, 'variant_a outside expected range'
assert 80 < counts['variant_b'] < 120, 'variant_b outside expected range'
print('Traffic split validated: roughly 50/50')
"
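Expected output: a line of counts that sum to 200, each near 100, followed by the validation message. The assertion bounds sit about 2.8 standard deviations from the mean (the standard deviation is about 7 for 200 draws at 50/50), so a correct splitter fails this check in fewer than 1 run in 100. Rerun once before investigating a failure.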
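The same idea can be applied to variant IDs read back from real logs, including uneven ratios such as 80/20. The sketch below is one possible check using a normal approximation to the binomial; split_looks_right is a hypothetical helper, and the default tolerance of 3 standard deviations is an assumption, not a recommended value.

```python
import math
from collections import Counter


def split_looks_right(variant_ids, weights, z_limit=3.0):
    """Return True if every variant's count is within z_limit standard
    deviations of its expected share (normal approximation, assumed tolerance)."""
    counts = Counter(variant_ids)
    n = len(variant_ids)
    total_weight = sum(weights.values())
    for variant, weight in weights.items():
        p = weight / total_weight
        expected = n * p
        std = math.sqrt(n * p * (1 - p))
        if abs(counts[variant] - expected) > z_limit * std:
            return False
    return True
```

For example, passing the logged variant IDs from a short validation window together with the configured weights returns False when one variant is clearly over- or under-served.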
Common failures
- Variant assignment not recorded: if the splitter does not log which variant served each request, outcomes cannot be attributed to a variant. Solution: always log the variant ID alongside every logged event, including failed requests.
- Traffic split biased by session persistence: returning users always see the same variant, so a few heavy users can dominate one variant's results and their correlated requests add noise. Solution: assign variants at the request level (stateless) rather than the user level, or use stratified sampling to balance repeat visitors.
- Experiment runs too short for statistical power: insufficient samples produce inconclusive results. Solution: calculate required sample size upfront using an effect size estimate and run the experiment until that threshold is met.
- Prompt cost skew across variants: if one variant uses more tokens, cost differences may outweigh quality improvements. Solution: log token usage per variant and compute cost-adjusted metrics alongside quality metrics.
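The last two failures lend themselves to small calculations. The sketch below assumes the outcome metric is a success rate and uses the standard normal-approximation formula for a two-sided two-proportion test; sample_size_per_variant and cost_per_success are hypothetical helpers, and the default alpha, power, and any token price passed in are assumptions to be replaced with real values.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_baseline, p_target, alpha=0.05, power=0.80):
    """Approximate requests needed per variant to detect p_baseline vs p_target."""
    if p_baseline == p_target:
        raise ValueError("effect size must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_baseline - p_target) ** 2)


def cost_per_success(total_tokens, successes, price_per_1k_tokens):
    """Token cost divided by successful outcomes; infinite if there were none."""
    if successes == 0:
        return math.inf
    return (total_tokens / 1000) * price_per_1k_tokens / successes
```

With these defaults, detecting an improvement from a 10% to a 12% success rate needs on the order of a few thousand requests per variant, which shows why short experiments are often inconclusive. Comparing cost_per_success across variants puts a quality gain and a token-usage increase on the same scale.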