13. Regression Testing

Chapter 13 of 18 · 20 min

Regression testing ensures prompt modifications don't break existing functionality. Unlike software regressions, prompt regressions manifest as subtle quality degradation that standard metrics miss.

Defining Regression Criteria

A regression occurs when:

  • Correct behavior stops working
  • Response quality drops below threshold
  • Latency increases beyond acceptable bounds
  • Output format becomes inconsistent
# regression_detector.py
import statistics
from datetime import datetime, timedelta

class PromptRegressionDetector:
    def __init__(self, baseline_results):
        self.baseline = baseline_results
        self.thresholds = {
            "quality_score": 0.85,  # 85% minimum
            "latency_p99": 2000,     # 2 seconds max
            "format_compliance": 0.95
        }
    
    def check_regression(self, new_results):
        regressions = []
        
        # Quality regression
        quality_delta = new_results["quality_score"] - self.baseline["quality_score"]
        if quality_delta < -0.05:
            regressions.append({
                "type": "quality_degradation",
                "delta": quality_delta,
                "severity": "high"
            })
        
        # Latency regression
        if new_results["latency_p99"] > self.thresholds["latency_p99"]:
            regressions.append({
                "type": "latency_exceeded",
                "value": new_results["latency_p99"],
                "threshold": self.thresholds["latency_p99"]
            })
        
        return regressions

Automated Regression Suites

# Run regression tests nightly
import schedule

def nightly_regression_check():
    results = run_test_suite(
        prompt_version=get_current_production_version(),
        test_suite="regression_tests.yaml",
        model="llama3:70b"
    )
    
    regressions = detector.check_regression(results)
    
    if regressions:
        alert_ops_team(regressions)
        # Don't auto-deploy if regressions detected
        return False
    
    return True

schedule.every().day.at("02:00").do(nightly_regression_check)

Golden Set Maintenance

Maintain a "golden set" of inputs with expected outputs. Update it when correct behavior changes:

# golden_set.yaml
- input: "What is my return policy?"
  expected_patterns:
    - "30 days"
    - "original receipt"
  updated_version: "2.1"
  
- input: "I need to speak to a manager"
  expected_patterns:
    - "escalat"
  updated_version: "2.2"

Shadow Mode Testing

Test new prompt versions against production traffic without affecting users:

def shadow_mode_test(prompt_version, duration_hours=24):
    """Compare new prompt against production in shadow mode."""
    results = {"current": [], "new": []}
    
    for request in stream_production_requests(duration_hours):
        # Run current production prompt
        current_response = invoke_prompt(
            get_production_prompt(),
            request,
            model="gpt-4"
        )
        results["current"].append(current_response)
        
        # Run new version in parallel (shadow)
        new_response = invoke_prompt(
            load_prompt("prompts", prompt_version),
            request,
            model="gpt-4"
        )
        results["new"].append(new_response)
    
    return compare_results(results["current"], results["new"])
EXERCISE

Build a regression detector that compares response quality between two prompt versions using embedding similarity (cosine distance). Run against at least 100 test cases and identify the threshold for flagging regressions.