12. Prompt Testing Framework

Chapter 12 of 18 · 20 min

A reliable testing framework catches failures before deployment. This chapter covers building test suites that validate prompt behavior systematically.

Test Structure

Organize tests into three categories:

Unit tests verify individual components—a single instruction or format requirement.

Integration tests verify the full prompt with realistic inputs.

Regression tests verify behavior hasn't changed after modifications.

# tests/test_customer_service.py
import pytest
from prompty import load_prompt

@pytest.fixture
def prompt():
    return load_prompt("customer-service/v2.3")

class TestPolicyCompliance:
    def test_refund_timeline_mentioned(self, prompt):
        response = generate(prompt, {"query": "I want a refund"})
        assert "5 business days" in response
    
    def test_escalation_path_exists(self, prompt):
        response = generate(prompt, {"query": "This is my fourth complaint"})
        assert any(kw in response.lower() for kw in ["supervisor", "escalate", "manager"])

class TestFormatRequirements:
    def test_numbered_list_format(self, prompt):
        response = generate(prompt, {"query": "Where is my order?"})
        lines = response.split("\n")
        # Check that numbered items appear
        has_numbers = any(line.strip()[0].isdigit() for line in lines if line.strip())
        assert has_numbers, f"Expected numbered list, got:\n{response}"

Building Test Cases from Production Data

Extract failure cases from logs to create regression tests:

def extract_failure_cases(log_file, count=50):
    """Build test cases from production failures."""
    failures = []
    with open(log_file) as f:
        for line in f:
            entry = json.loads(line)
            if entry["quality_score"] < 0.5:
                failures.append({
                    "input": entry["query"],
                    "output": entry["response"],
                    "failure_type": classify_failure(entry["response"])
                })
    
    return failures[:count]

Local Model Testing with Ollama

import ollama

def generate(prompt, context):
    response = ollama.chat(
        model=prompt.metadata["model"],
        messages=[{
            "role": "user", 
            "content": prompt.template.format(**context)
        }],
        options={
            "temperature": prompt.metadata.get("temperature", 0.7),
            "num_ctx": prompt.metadata.get("context_window", 4096)
        }
    )
    return response["message"]["content"]

Continuous Integration for Prompts

# .github/workflows/prompt-tests.yml
name: Prompt Testing
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build test environment
        run: pip install prompty pytest
      - name: Run prompt tests against Ollama
        run: |
          ollama serve &
          sleep 5
          pytest tests/ -v --model=llama3:70b
EXERCISE

Create a test suite for a prompt with at least 10 test cases covering happy paths, edge cases, and known failure modes. Run against both Ollama and a cloud model, then compare results.