What this does

This guide builds a structured test-case suite in JSON, runs a local model against each case, and scores the responses with automated criteria: a case-insensitive substring match and character-length bounds. The result is a report with a pass/fail verdict for each test case.

Steps

Define test cases in a JSON file. Each case specifies an id, a prompt, the substring the response must contain, and the minimum and maximum response length in characters.

```
[
  {"id": "capital_q", "prompt": "What is the capital of Japan?", "expected_substring": "Tokyo", "min_length": 10, "max_length": 300},
  {"id": "code_fib", "prompt": "Write a Python fibonacci function.", "expected_substring": "def", "min_length": 50, "max_length": 500}
]
```

Save as test_cases.json.
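
A malformed case (a missing field, or min_length greater than max_length) would otherwise only surface as a KeyError or a silent FAIL inside the runner. The following is a minimal sketch of a pre-flight check, assuming the five fields shown above are all required; validate_cases and REQUIRED_FIELDS are hypothetical names, not part of the original runner.

```python
# Hypothetical helper: the field list mirrors the example test cases above.
REQUIRED_FIELDS = {
    "id": str,
    "prompt": str,
    "expected_substring": str,
    "min_length": int,
    "max_length": int,
}


def validate_cases(cases):
    """Return a list of problems found; an empty list means the suite looks usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = str(case.get("id", f"case #{index}"))
        # Every required field must be present and have the expected type.
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(case.get(field), expected_type):
                problems.append(f"{label}: '{field}' missing or not {expected_type.__name__}")
        # The length bounds must describe a non-empty range.
        lo, hi = case.get("min_length"), case.get("max_length")
        if isinstance(lo, int) and isinstance(hi, int) and lo > hi:
            problems.append(f"{label}: min_length exceeds max_length")
        # Ids are used as report keys, so they should be unique.
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems
```

One way to use it is to call validate_cases(json.load(f)) right after opening test_cases.json and stop if the returned list is not empty.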

Write the evaluation runner. It loads the test suite, sends each prompt to the local model server, and scores the response against that case's criteria. Save it as eval_quality.py.

```
import json

import requests

# Load the test suite saved in the previous step.
with open("test_cases.json") as f:
    tests = json.load(f)

results = []
for tc in tests:
    # stream=False asks the server for one JSON object instead of a stream of chunks.
    resp = requests.post("http://localhost:11434/api/generate", json={"model": "llama3.2:3b", "prompt": tc["prompt"], "stream": False}, timeout=60)
    response = resp.json().get("response", "")
    # A case passes when the length is within bounds and the substring appears (case-insensitive).
    in_bounds = tc["min_length"] <= len(response) <= tc["max_length"]
    passed = in_bounds and tc["expected_substring"].lower() in response.lower()
    results.append({"id": tc["id"], "passed": passed})
    print(f"[{'PASS' if passed else 'FAIL'}] {tc['id']}")
print(f"Score: {sum(r['passed'] for r in results)}/{len(results)}")
```
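
The runner reports only PASS or FAIL, which does not say whether a case failed on length or on the substring. As a sketch under the same criteria, the scoring can be pulled into a function that also returns the reasons; score_response is a hypothetical name and is not part of the runner above.

```python
def score_response(case, response):
    """Score one response; returns (passed, reasons), where reasons lists each failed check."""
    reasons = []
    length = len(response)
    # Length bounds are measured in characters, as in the runner.
    if length < case["min_length"]:
        reasons.append(f"too short: {length} < {case['min_length']}")
    if length > case["max_length"]:
        reasons.append(f"too long: {length} > {case['max_length']}")
    # Substring match is case-insensitive, as in the runner.
    if case["expected_substring"].lower() not in response.lower():
        reasons.append(f"missing substring: {case['expected_substring']!r}")
    return (not reasons, reasons)


case = {"id": "capital_q", "expected_substring": "Tokyo", "min_length": 10, "max_length": 300}
print(score_response(case, "The capital of Japan is Tokyo."))  # (True, [])
print(score_response(case, "Kyoto"))  # (False, ['too short: 5 < 10', "missing substring: 'Tokyo'"])
```

Keeping the scoring free of network calls also makes it possible to test the criteria without a running model.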

Run the evaluation. The script prints a PASS or FAIL line for each case, followed by the overall score.
```
python3 eval_quality.py
```
Example output: [PASS] capital_q, [FAIL] code_fib, Score: 1/2. Actual results depend on the model.
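
Export results to a report file. The results list exists only inside the runner process, so a separate python3 -c command cannot see it. Append these lines to the end of eval_quality.py and run it again to persist the per-case record.
```
with open("eval_report.json", "w") as f:
    json.dump(results, f, indent=2)
```
Expected output: eval_report.json written with one entry per test case, each holding the case id and its passed flag.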

Verification

Confirm that the report loads and that its totals agree with the runner.
```
python3 -c "import json; d=json.load(open('eval_report.json')); print(f\"Passed: {sum(1 for r in d if r.get('passed'))}/{len(d)}\")"
```
Expected: the passed count matches the score printed by the runner.
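
For a check that also names the failing cases, the same report can be summarized in a few lines. This is a sketch that assumes the report is a JSON list of objects with id and passed keys, as the runner produces; summarize_report is a hypothetical name.

```python
def summarize_report(report):
    """Summarize a list of {"id": ..., "passed": ...} entries."""
    total = len(report)
    failed_ids = [entry["id"] for entry in report if not entry.get("passed")]
    passed = total - len(failed_ids)
    # Guard against an empty report to avoid dividing by zero.
    pass_rate = passed / total if total else 0.0
    return {"total": total, "passed": passed, "failed_ids": failed_ids, "pass_rate": pass_rate}


report = [{"id": "capital_q", "passed": True}, {"id": "code_fib", "passed": False}]
print(summarize_report(report))
# {'total': 2, 'passed': 1, 'failed_ids': ['code_fib'], 'pass_rate': 0.5}
```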

Common failures

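all cases fail with empty response - The model may not be loaded yet; send a warm-up request before the evaluation loop and confirm that the server is running.
expected_substring false negatives - The runner already compares case-insensitively, so casing is not the cause; check whether the model phrases the answer differently and choose a shorter, less ambiguous substring.
min_length too restrictive - Very small models tend to produce short answers; adjust the bounds per model.
timeout on long responses - Increase the timeout passed to requests.post, or cap generation length on the server side. Lowering max_length in the test cases only changes scoring and does not shorten generation.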
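
The warm-up request mentioned in the first item can be sketched as follows. The endpoint and model name are copied from the runner; build_warmup_request and warm_up are hypothetical names, the "Hello" prompt and 120-second timeout are arbitrary choices, and the sketch uses the standard library's urllib.request instead of requests so that it has no extra dependencies.

```python
import json
import urllib.request

# Assumption: same endpoint as the runner above.
GENERATE_URL = "http://localhost:11434/api/generate"


def build_warmup_request(model, url=GENERATE_URL):
    """Build a small non-streaming generate request used only to load the model."""
    payload = {"model": model, "prompt": "Hello", "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model, timeout=120):
    """Return True if the server answered with a non-empty response, else False."""
    try:
        with urllib.request.urlopen(build_warmup_request(model), timeout=timeout) as resp:
            return bool(json.load(resp).get("response"))
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers invalid JSON.
        return False
```

Calling warm_up("llama3.2:3b") once before the loop, and stopping if it returns False, separates "server not ready" from genuine test failures.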

How to evaluate model response quality using predefined test cases
