12. Prompt Testing Framework
Chapter 12 of 18 · 20 min
A reliable testing framework catches failures before deployment. This chapter covers building test suites that validate prompt behavior systematically.
Test Structure
Organize tests into three categories:
Unit tests verify individual components—a single instruction or format requirement.
Integration tests verify the full prompt with realistic inputs.
Regression tests verify behavior hasn't changed after modifications.
# tests/test_customer_service.py
import re

import pytest
from prompty import load_prompt

# generate() is the helper defined under "Local Model Testing with Ollama";
# import it from wherever your project keeps it.


@pytest.fixture
def prompt():
    return load_prompt("customer-service/v2.3")


class TestPolicyCompliance:
    def test_refund_timeline_mentioned(self, prompt):
        response = generate(prompt, {"query": "I want a refund"})
        assert "5 business days" in response

    def test_escalation_path_exists(self, prompt):
        response = generate(prompt, {"query": "This is my fourth complaint"})
        assert any(kw in response.lower() for kw in ["supervisor", "escalate", "manager"])


class TestFormatRequirements:
    def test_numbered_list_format(self, prompt):
        response = generate(prompt, {"query": "Where is my order?"})
        lines = response.split("\n")
        # A numbered item starts with digits followed by "." or ")", e.g. "1. " or "2) ".
        # Checking only for a leading digit would also accept "5 business days".
        has_numbers = any(re.match(r"\s*\d+[.)]\s", line) for line in lines)
        assert has_numbers, f"Expected numbered list, got:\n{response}"
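The tests above cover unit-style checks; the regression category can be sketched with a stored baseline. The following is a minimal illustration, not the chapter's own method: the helper name check_against_baseline, the JSON file layout, and the 0.8 similarity threshold are all assumptions, and difflib's character-level ratio is only a rough stand-in for a real semantic comparison.

```python
import json
from difflib import SequenceMatcher
from pathlib import Path


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def check_against_baseline(name, response, baseline_dir, threshold=0.8):
    """Compare a response with its stored baseline (hypothetical helper).

    Returns (ok, score). On the first run no baseline exists, so the
    response is recorded and the check passes with a score of 1.0.
    """
    path = Path(baseline_dir) / f"{name}.json"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"response": response}))
        return True, 1.0
    baseline = json.loads(path.read_text())["response"]
    score = similarity(baseline, response)
    return score >= threshold, score
```

In a test, this could be called with the output of generate() after a prompt change, failing the test when ok is False. Committing the baseline files alongside the prompt keeps each comparison tied to a known version.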
Building Test Cases from Production Data
Extract failure cases from logs to create regression tests:
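import json


def extract_failure_cases(log_file, count=50):
    """Build test cases from production failures.

    Expects a JSON Lines log: one JSON object per line with
    "query", "response", and "quality_score" fields.
    """
    failures = []
    with open(log_file) as f:
        for line in f:
            entry = json.loads(line)
            if entry["quality_score"] < 0.5:
                failures.append({
                    "input": entry["query"],
                    "output": entry["response"],
                    # classify_failure() is not defined in this chapter;
                    # supply your own labeling function.
                    "failure_type": classify_failure(entry["response"])
                })
    return failures[:count]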
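The extraction function calls classify_failure without defining it. One way it might look is sketched below, purely as an assumption: the label names and the heuristics are hypothetical and would need to match the failure modes that actually appear in your logs.

```python
import re

# Hypothetical patterns for spotting a refusal in a response.
REFUSAL_PATTERN = re.compile(r"\b(i can(no|')t|i am unable|i'm unable)\b")
# A numbered item such as "1. " or "2) " at the start of any line.
NUMBERED_ITEM_PATTERN = re.compile(r"^\s*\d+[.)]\s", flags=re.MULTILINE)


def classify_failure(response: str) -> str:
    """Assign a coarse, heuristic label to a failed response."""
    text = response.strip()
    if not text:
        return "empty"
    if REFUSAL_PATTERN.search(text.lower()):
        return "refusal"
    if len(text.split()) < 5:
        return "too_short"
    if not NUMBERED_ITEM_PATTERN.search(text):
        return "missing_numbered_list"
    return "other"
```

Grouping the extracted cases by failure_type makes it easier to write one targeted regression test per failure mode instead of one per log entry.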
Local Model Testing with Ollama
import ollama


def generate(prompt, context):
    response = ollama.chat(
        model=prompt.metadata["model"],
        messages=[{
            "role": "user",
            "content": prompt.template.format(**context)
        }],
        options={
            "temperature": prompt.metadata.get("temperature", 0.7),
            "num_ctx": prompt.metadata.get("context_window", 4096)
        }
    )
    return response["message"]["content"]
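Because generate() samples at a default temperature of 0.7, the same input can produce different outputs across runs, so a single assertion may pass or fail intermittently. One possible mitigation, sketched here as an assumption rather than part of the chapter's framework, is to sample several times and assert on the pass rate. The helper name pass_rate and the thresholds in the usage comment are hypothetical.

```python
from typing import Callable


def pass_rate(sample: Callable[[], str], check: Callable[[str], bool], n: int = 5) -> float:
    """Call `sample` n times and return the fraction of outputs that satisfy `check`."""
    if n <= 0:
        raise ValueError("n must be positive")
    passed = sum(1 for _ in range(n) if check(sample()))
    return passed / n


# Possible use inside a test (generate and prompt come from the chapter's code):
#
#     rate = pass_rate(
#         lambda: generate(prompt, {"query": "I want a refund"}),
#         lambda r: "5 business days" in r,
#         n=5,
#     )
#     assert rate >= 0.8
```

Setting the temperature to 0 in the prompt metadata is the simpler alternative when reproducibility matters more than testing the sampled distribution.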
Continuous Integration for Prompts
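# .github/workflows/prompt-tests.yml
name: Prompt Testing
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build test environment
        # The ollama Python client is required by generate()
        run: pip install prompty pytest ollama
      - name: Install Ollama
        run: curl -fsSL https://ollama.com/install.sh | sh
      - name: Run prompt tests against Ollama
        run: |
          ollama serve &
          sleep 5
          # The model must be pulled before it can be used. A 70B model may
          # exceed a hosted runner's disk and memory; a smaller tag or a
          # self-hosted runner may be needed.
          ollama pull llama3:70b
          # --model is a custom option and must be registered in conftest.py
          pytest tests/ -v --model=llama3:70b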
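The workflow passes --model to pytest, which pytest rejects as an unrecognized argument unless the option is registered. The chapter does not show that registration, so the conftest.py below is a hypothetical sketch: the model_name argument and the DEFAULT_MODELS fallback are assumptions. It uses pytest's pytest_addoption and pytest_generate_tests hooks so that repeating the flag runs each test once per model.

```python
# tests/conftest.py (hypothetical; not shown in the chapter)

# Fallback when --model is not given on the command line (assumed default).
DEFAULT_MODELS = ["llama3:70b"]


def pytest_addoption(parser):
    """Register --model so `pytest --model=llama3:70b` is accepted."""
    parser.addoption(
        "--model",
        action="append",
        default=None,  # a list default would be appended to, not replaced
        help="Model name to test against; repeat the flag to compare models.",
    )


def pytest_generate_tests(metafunc):
    """Run every test that requests `model_name` once per selected model."""
    if "model_name" in metafunc.fixturenames:
        models = metafunc.config.getoption("model") or DEFAULT_MODELS
        metafunc.parametrize("model_name", models)
```

A test opts in by accepting a model_name argument. Since the chapter's generate() reads the model from prompt.metadata, passing model_name through to it is left as a change for your own helper.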
EXERCISE
Create a test suite for a prompt with at least 10 test cases covering happy paths, edge cases, and known failure modes. Run the suite against both Ollama and a cloud model, then compare the results.