What this does

A system prompt is only as good as its consistency across diverse inputs. This guide describes how to build a structured evaluation dataset, run the AI assistant against it programmatically, measure adherence to expected output formats and behaviors, and use the results to drive iterative prompt improvements.

Steps

1. Create an evaluation dataset: write 20–30 query scenarios covering the expected use cases, and define a scoring rubric for each (format check, correctness check, and prohibited-content check).
2. Export the dataset to eval_dataset.json with fields for query, expected_format, and required_keywords.
3. Write an evaluation script that reads the dataset, calls the AI for each query with the current system prompt, and records the raw response.
4. Add scoring logic to the script that compares each response against the expected format and checks for required keywords.
5. Run the evaluation script and record the overall pass rate and per-category scores as a baseline.
6. Identify the lowest-scoring categories and modify the system prompt to address those specific failures.
7. Run the evaluation again with the updated prompt and compare the scores against the baseline to confirm improvement.
8. Repeat steps 6–7 until the overall pass rate reaches the target threshold (typically 85% or higher).
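
The dataset, evaluation loop, and scoring steps can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: call_model is a hypothetical stub standing in for whatever AI client you use, the category and prohibited_phrases fields are assumed additions to the three dataset fields named above, and the "yaml" format check is a deliberately crude heuristic.

```python
import json
from collections import defaultdict


def load_dataset(path):
    """Read the evaluation dataset (a JSON list of case objects) from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def call_model(system_prompt, query):
    """Hypothetical stub: replace with a real call to your AI service."""
    return "summary: stub response\nstatus: ok"


def score_response(response, case):
    """Return one boolean per rubric check for a single response."""
    text = response.lower()
    # Assumed convention: "yaml" means at least one "key: value" line is present.
    if case["expected_format"] == "yaml":
        format_ok = any(": " in line for line in response.splitlines())
    else:
        format_ok = bool(response.strip())
    keywords_ok = all(k.lower() in text for k in case["required_keywords"])
    # "prohibited_phrases" is an assumed optional field, not one named in the guide.
    prohibited_ok = not any(
        p.lower() in text for p in case.get("prohibited_phrases", [])
    )
    return {"format": format_ok, "keywords": keywords_ok, "prohibited": prohibited_ok}


def run_eval(system_prompt, dataset, model=call_model):
    """Call the model once per case and record the raw response and its checks."""
    results = []
    for i, case in enumerate(dataset, start=1):
        response = model(system_prompt, case["query"])
        checks = score_response(response, case)
        results.append({
            "id": i,
            # "category" is an assumed field used for per-category scores.
            "category": case.get("category", "uncategorized"),
            "response": response,
            "checks": checks,
            "passed": all(checks.values()),
        })
    return results


def pass_rates(results):
    """Compute the fraction of passing cases per category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        totals[r["category"]][0] += r["passed"]
        totals[r["category"]][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}


SAMPLE_DATASET = [
    {"query": "Report service status", "expected_format": "yaml",
     "required_keywords": ["status"], "category": "format"},
    {"query": "Explain the refund policy", "expected_format": "text",
     "required_keywords": ["refund"], "category": "correctness"},
]

if __name__ == "__main__":
    results = run_eval("You are a helpful assistant.", SAMPLE_DATASET)
    print(pass_rates(results))
```

Passing the model function as a parameter keeps the scoring logic testable without network access; saving the output of pass_rates for the first run gives you the baseline to compare later runs against.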

Verification

python eval_prompts.py --prompt-file system_v2.txt --dataset eval_dataset.json

Expected output:

Total: 25 queries
Passed: 22 (88%)
Failed categories: ["format", "security"]
Top failure: queries 7, 14 — missing required YAML block

A pass rate at or above the target threshold and a clear list of remaining failure categories confirm that the evaluation loop is working.
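
A summary in the format shown above could be produced by a small helper like the one below. It is a sketch under assumptions: each result record is taken to be a dict with a category string and a passed boolean (a hypothetical shape, not something the guide prescribes), the 0.85 default mirrors the target threshold mentioned in the steps, and the "Top failure" line is left out because it depends on how failures are described in your script.

```python
import json


def summarize(results, threshold=0.85):
    """Build the summary text and report whether the pass rate meets the threshold.

    Each result is assumed to look like {"category": str, "passed": bool}.
    """
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    rate = passed / total if total else 0.0
    # Categories that still contain at least one failing query.
    failed_categories = sorted({r["category"] for r in results if not r["passed"]})
    report = "\n".join([
        f"Total: {total} queries",
        f"Passed: {passed} ({rate:.0%})",
        f"Failed categories: {json.dumps(failed_categories)}",
    ])
    return rate >= threshold, report


if __name__ == "__main__":
    # Synthetic results for illustration only: 25 queries, 3 failures.
    demo = [{"category": "correctness", "passed": True} for _ in range(22)]
    demo += [
        {"category": "format", "passed": False},
        {"category": "format", "passed": False},
        {"category": "security", "passed": False},
    ]
    meets_target, report = summarize(demo)
    print(report)
    print("Target met" if meets_target else "Target not met")
```

Returning the boolean alongside the text lets the script exit with a non-zero status when the target is missed, which is convenient if the evaluation runs in an automated pipeline.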

Common failures

The evaluation script produces no output: The AI service is unreachable or the request is timing out. Add verbose logging to the script to capture the HTTP status code and response body from each call.
Pass rate is near zero for all queries: The expected output criteria in the dataset may not match what the current prompt instructs. Compare one raw response with its expected_format field to identify the mismatch.
Scores improve for some categories but degrade for others: A prompt change that fixes one area is causing a regression in another. Add a regression test case to the dataset for the affected category before making further changes.
Evaluation script crashes while parsing the response: The AI output includes extra formatting characters or preamble text. Strip known prefixes, such as model name labels, before parsing the response.
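
For the parsing crash in particular, one possible cleanup step is sketched below. The label names matched (assistant, ai, model) and the handling of fenced blocks are assumptions chosen for illustration; adjust them to the prefixes your own responses actually contain.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built here to avoid a literal one

# Assumed speaker labels at the very start of a response, e.g. "Assistant: ".
LABEL_RE = re.compile(r"^\s*(?:assistant|ai|model)\s*:\s*", re.IGNORECASE)
# First fenced block, with an optional language tag after the opening fence.
FENCE_RE = re.compile(FENCE + r"[\w-]*\n(.*?)" + FENCE, re.DOTALL)


def clean_response(raw):
    """Strip a leading label and any preamble around a fenced block."""
    text = LABEL_RE.sub("", raw, count=1)
    match = FENCE_RE.search(text)
    if match:
        # Keep only the fenced payload; surrounding preamble is discarded.
        return match.group(1).strip()
    return text.strip()


if __name__ == "__main__":
    raw = "Assistant: Here is the result:\n" + FENCE + "yaml\nstatus: ok\n" + FENCE
    print(clean_response(raw))
```

Note that a response whose real content begins with one of the matched labels (for example a YAML document with a top-level model key) would be truncated by this pattern, so keep the label list as narrow as your data allows.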

Related guides

Design effective system prompts (guide 30)
Create multi-role system prompts with distinct personas (guide 31)