13. Regression Testing
Chapter 13 of 18 · 20 min
Regression testing ensures prompt modifications don't break existing functionality. Unlike software regressions, prompt regressions manifest as subtle quality degradation that standard metrics miss.
Defining Regression Criteria
A regression occurs when:
- Correct behavior stops working
- Response quality drops below threshold
- Latency increases beyond acceptable bounds
- Output format becomes inconsistent
# regression_detector.py
import statistics
from datetime import datetime, timedelta
class PromptRegressionDetector:
def __init__(self, baseline_results):
self.baseline = baseline_results
self.thresholds = {
"quality_score": 0.85, # 85% minimum
"latency_p99": 2000, # 2 seconds max
"format_compliance": 0.95
}
def check_regression(self, new_results):
regressions = []
# Quality regression
quality_delta = new_results["quality_score"] - self.baseline["quality_score"]
if quality_delta < -0.05:
regressions.append({
"type": "quality_degradation",
"delta": quality_delta,
"severity": "high"
})
# Latency regression
if new_results["latency_p99"] > self.thresholds["latency_p99"]:
regressions.append({
"type": "latency_exceeded",
"value": new_results["latency_p99"],
"threshold": self.thresholds["latency_p99"]
})
return regressions
Automated Regression Suites
# Run regression tests nightly
import schedule
def nightly_regression_check():
results = run_test_suite(
prompt_version=get_current_production_version(),
test_suite="regression_tests.yaml",
model="llama3:70b"
)
regressions = detector.check_regression(results)
if regressions:
alert_ops_team(regressions)
# Don't auto-deploy if regressions detected
return False
return True
schedule.every().day.at("02:00").do(nightly_regression_check)
Golden Set Maintenance
Maintain a "golden set" of inputs with expected outputs. Update it when correct behavior changes:
# golden_set.yaml
- input: "What is my return policy?"
expected_patterns:
- "30 days"
- "original receipt"
updated_version: "2.1"
- input: "I need to speak to a manager"
expected_patterns:
- "escalat"
updated_version: "2.2"
Shadow Mode Testing
Test new prompt versions against production traffic without affecting users:
def shadow_mode_test(prompt_version, duration_hours=24):
"""Compare new prompt against production in shadow mode."""
results = {"current": [], "new": []}
for request in stream_production_requests(duration_hours):
# Run current production prompt
current_response = invoke_prompt(
get_production_prompt(),
request,
model="gpt-4"
)
results["current"].append(current_response)
# Run new version in parallel (shadow)
new_response = invoke_prompt(
load_prompt("prompts", prompt_version),
request,
model="gpt-4"
)
results["new"].append(new_response)
return compare_results(results["current"], results["new"])
EXERCISE
Build a regression detector that compares response quality between two prompt versions using embedding similarity (cosine distance). Run against at least 100 test cases and identify the threshold for flagging regressions.