RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /How-to
  5. /How to test and iterate on system prompt designs using structured evaluation datasets
HOW-TO · DEV

How to test and iterate on system prompt designs using structured evaluation datasets

advanced·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

System prompts to evaluate, a test dataset with known inputs and expected outputs, an AI assistant, and a scriptable environment to run automated evaluation (Python with subprocess or curl).

What this does

System prompt quality depends on consistent performance across diverse inputs. This guide describes how to build a structured evaluation dataset, run the AI assistant against it programmatically, measure adherence to expected output formats and behaviors, and use the results to drive iterative prompt improvements.

Steps

  1. Create an evaluation dataset: write 20–30 query scenarios covering the expected use cases, and define for each a scoring rubric (format check, correctness check, and prohibited content check).
  2. Export the dataset to eval_dataset.json with fields for query, expected_format, and required_keywords.
  3. Write an evaluation script that reads the dataset, calls the AI for each query with the current system prompt, and records the raw response.
  4. Add scoring logic to the script that compares each response against the expected format and checks for required keywords.
  5. Run the evaluation script and record the pass rate and per-category scores as a baseline.
  6. Identify the lowest-scoring categories and modify the system prompt to address those specific failures.
  7. Run the evaluation again with the updated prompt and compare scores to confirm improvement.
  8. Repeat steps 6–7 until the overall pass rate reaches the target threshold (typically 85% or higher).

Verification

python eval_prompts.py --prompt-file system_v2.txt --dataset eval_dataset.json

Expected output:

Total: 25 queries
Passed: 22 (88%)
Failed categories: ["format", "security"]
Top failure: queries 7, 14 — missing required YAML block

A pass rate above the target threshold and a clear list of remaining failure categories confirms the evaluation loop is working.

Common failures

  1. The evaluation script produces no output — The AI service is not reachable or the request is timing out. Add verbose logging to the script to capture the HTTP status code and response body from each call.
  2. Pass rate is near zero for all queries — The expected output criteria in the dataset may not match what the current prompt instructs. Review one raw response and the corresponding expected_format field to identify the mismatch.
  3. Scores improve for some categories but degrade for others — The updated prompt fix in one area is introducing regression in another. Add a regression test case to the dataset for the affected category before making further changes.
  4. Evaluation script crashes on parsing the response — The AI output includes extra formatting characters or preamble text. Strip known prefixes such as model name labels before parsing the response.

Related guides

  • Design effective system prompts (guide 30)
  • Create multi-role system prompts with distinct personas (guide 31)
← All how-to guidesCourses →