HOW-TO · DEV
How to test and iterate on system prompt designs using structured evaluation datasets
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
System prompts to evaluate, a test dataset with known inputs and expected outputs, an AI assistant, and a scriptable environment to run automated evaluation (Python with subprocess or curl).
What this does
System prompt quality depends on consistent performance across diverse inputs. This guide describes how to build a structured evaluation dataset, run the AI assistant against it programmatically, measure adherence to expected output formats and behaviors, and use the results to drive iterative prompt improvements.
Steps
- Create an evaluation dataset: write 20–30 query scenarios covering the expected use cases, and define for each a scoring rubric (format check, correctness check, and prohibited content check).
- Export the dataset to
eval_dataset.jsonwith fields forquery,expected_format, andrequired_keywords. - Write an evaluation script that reads the dataset, calls the AI for each query with the current system prompt, and records the raw response.
- Add scoring logic to the script that compares each response against the expected format and checks for required keywords.
- Run the evaluation script and record the pass rate and per-category scores as a baseline.
- Identify the lowest-scoring categories and modify the system prompt to address those specific failures.
- Run the evaluation again with the updated prompt and compare scores to confirm improvement.
- Repeat steps 6–7 until the overall pass rate reaches the target threshold (typically 85% or higher).
Verification
python eval_prompts.py --prompt-file system_v2.txt --dataset eval_dataset.json
Expected output:
Total: 25 queries
Passed: 22 (88%)
Failed categories: ["format", "security"]
Top failure: queries 7, 14 — missing required YAML block
A pass rate above the target threshold and a clear list of remaining failure categories confirms the evaluation loop is working.
Common failures
- The evaluation script produces no output — The AI service is not reachable or the request is timing out. Add verbose logging to the script to capture the HTTP status code and response body from each call.
- Pass rate is near zero for all queries — The expected output criteria in the dataset may not match what the current prompt instructs. Review one raw response and the corresponding
expected_formatfield to identify the mismatch. - Scores improve for some categories but degrade for others — The updated prompt fix in one area is introducing regression in another. Add a regression test case to the dataset for the affected category before making further changes.
- Evaluation script crashes on parsing the response — The AI output includes extra formatting characters or preamble text. Strip known prefixes such as model name labels before parsing the response.
Related guides
- Design effective system prompts (guide 30)
- Create multi-role system prompts with distinct personas (guide 31)