How to evaluate model response quality using predefined test cases
Prerequisites
A model downloaded in Ollama (the runner uses llama3.2:3b), Python 3.10+ with the requests package installed, and a curated set of test cases with expected outputs.
What this does
This guide creates a structured test-case suite in JSON, runs a local model against each case, and scores the responses using automated criteria such as keyword matching and length bounds. The result is a quality report with a pass/fail verdict for each test case.
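The scoring rule described above can be sketched as a small standalone function. This is a minimal sketch, not the guide's runner: the function name score_response and the extra length_ok and keyword_ok fields are illustrative assumptions, added so that a failure shows which criterion was missed.

```python
def score_response(response: str, case: dict) -> dict:
    """Score one response against a test case's length bounds and keyword."""
    # Length check: the character count must fall inside the inclusive bounds.
    length_ok = case["min_length"] <= len(response) <= case["max_length"]
    # Keyword check: case-insensitive substring match.
    keyword_ok = case["expected_substring"].lower() in response.lower()
    return {
        "id": case["id"],
        "passed": length_ok and keyword_ok,
        "length_ok": length_ok,    # illustrative extra field
        "keyword_ok": keyword_ok,  # illustrative extra field
    }


case = {"id": "capital_q", "expected_substring": "Tokyo",
        "min_length": 10, "max_length": 300}
print(score_response("The capital of Japan is Tokyo.", case))
print(score_response("Tokyo", case))  # keyword present, but shorter than min_length
```

Keeping the two checks separate makes it easier to tell a wrong answer from a correct answer that is merely too short or too long.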
Steps
1. Define test cases in a JSON file. Each case specifies the input prompt, the substring the response must contain, and the length bounds that make up the passing conditions.

    [
      {"id": "capital_q", "prompt": "What is the capital of Japan?", "expected_substring": "Tokyo", "min_length": 10, "max_length": 300},
      {"id": "code_fib", "prompt": "Write a Python fibonacci function.", "expected_substring": "def", "min_length": 50, "max_length": 500}
    ]

   Save as test_cases.json.

2. Write the evaluation runner. It loads the test suite, sends each prompt to the local Ollama API, and scores each response against that case's criteria.

    import json
    import requests

    with open("test_cases.json") as f:
        tests = json.load(f)

    results = []
    for tc in tests:
        # "stream": False returns the whole answer as a single JSON object.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.2:3b", "prompt": tc["prompt"], "stream": False},
            timeout=60,
        )
        response = resp.json().get("response", "")
        # Pass only if the length is in bounds and the substring appears (case-insensitive).
        passed = (tc["min_length"] <= len(response) <= tc["max_length"]
                  and tc["expected_substring"].lower() in response.lower())
        results.append({"id": tc["id"], "passed": passed})
        print(f"[{'PASS' if passed else 'FAIL'}] {tc['id']}")

    print(f"Score: {sum(1 for r in results if r['passed'])}/{len(results)}")

   Save as eval_quality.py.

3. Run the evaluation. This executes the runner and prints the result of each case.

    python3 eval_quality.py

   Expected output (a sample; actual results depend on the model):

    [PASS] capital_q
    [FAIL] code_fib
    Score: 1/2

4. Export results to a report file. This persists the evaluation record for documentation. The results list exists only inside the runner process, so a separate python3 -c command cannot see it. Append these lines to the end of eval_quality.py and run it again:

    with open("eval_report.json", "w") as f:
        json.dump(results, f, indent=2)

   Expected output: eval_report.json written with one id/passed entry per test case.
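Before running the suite, it can help to check that every case in test_cases.json has the fields the runner reads. The following is a minimal sketch under that assumption. The helper name validate_cases and the specific checks (required keys, duplicate ids, inverted bounds) are illustrative and not part of the original runner.

```python
REQUIRED_KEYS = {"id", "prompt", "expected_substring", "min_length", "max_length"}


def validate_cases(cases: list) -> list:
    """Return human-readable problems; an empty list means the suite looks usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
            continue  # the remaining checks need all keys to be present
        if case["id"] in seen_ids:
            problems.append(f"case {index}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        if case["min_length"] > case["max_length"]:
            problems.append(f"case {index}: min_length exceeds max_length")
    return problems


cases = [
    {"id": "capital_q", "prompt": "What is the capital of Japan?",
     "expected_substring": "Tokyo", "min_length": 10, "max_length": 300},
    {"id": "broken", "prompt": "No criteria here"},
]
for problem in validate_cases(cases):
    print(problem)
```

In the runner, this check could be called on tests right after json.load, stopping early if the returned list is not empty.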
Verification
python3 -c "import json; d=json.load(open('eval_report.json')); print(f\"Passed: {sum(1 for r in d if r.get('passed'))}/{len(d)}\")"
# Expected: pass rate matching the score from the runner output
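The same check can be written as a short script that also lists which cases failed. This is a sketch that assumes eval_report.json holds a list of objects with id and passed keys, as produced by the runner. The function names summarize and load_report are hypothetical.

```python
import json


def summarize(records: list) -> dict:
    """Count passes and collect the ids of failing cases."""
    failed_ids = [r["id"] for r in records if not r.get("passed")]
    return {
        "passed": len(records) - len(failed_ids),
        "total": len(records),
        "failed_ids": failed_ids,
    }


def load_report(path: str = "eval_report.json") -> list:
    """Read the per-case records written by the runner."""
    with open(path) as f:
        return json.load(f)


# Sample records in the same shape as the runner's output.
sample = [{"id": "capital_q", "passed": True}, {"id": "code_fib", "passed": False}]
summary = summarize(sample)
print(f"Passed: {summary['passed']}/{summary['total']}, failed: {summary['failed_ids']}")
```

To apply it to a real report, call summarize(load_report()) from the directory that contains eval_report.json.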
Common failures
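- all cases fail with empty response - The model may not be pulled or loaded; confirm that the model tag appears in ollama list, inspect the "error" field of the JSON reply, and send a warm-up request before the evaluation loop.
- expected_substring false negatives - Match case-insensitively, as the runner does with .lower(), because models may output different casing; also choose a substring short enough to survive rephrasing.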
- min_length too restrictive - Very small models produce short answers; adjust bounds per model.
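- timeout on long responses - Increase the timeout passed to requests.post, or cap generation with Ollama's num_predict option. Lowering max_length in a test case only changes scoring and does not shorten what the model generates.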
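Several of the failures above can be caught at the request layer. The sketch below shows one possible arrangement: a warm-up call, a cap on generated tokens through Ollama's num_predict option, and an explicit error when the reply is empty. The helper names (build_payload, post_json, generate, warm_up) and the timeout values are assumptions for illustration. The sketch uses urllib from the standard library in place of requests so that it has no third-party dependency.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint used by the runner


def build_payload(prompt, model="llama3.2:3b", num_predict=None):
    """Build the request body; num_predict (an Ollama option) caps generated tokens."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_predict is not None:
        payload["options"] = {"num_predict": num_predict}
    return payload


def post_json(url, payload, timeout):
    """POST a JSON body and return the decoded JSON reply (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


def generate(prompt, send=post_json, timeout=120, num_predict=None):
    """Return the model's text, raising instead of silently scoring an empty string."""
    body = send(OLLAMA_URL, build_payload(prompt, num_predict=num_predict), timeout)
    text = body.get("response", "")
    if not text:
        raise RuntimeError(f"empty response from model: {body.get('error', body)}")
    return text


def warm_up(send=post_json, timeout=300):
    """Send one tiny request so the model is loaded before the evaluation loop."""
    send(OLLAMA_URL, build_payload("ping", num_predict=1), timeout)
```

A runner built on this sketch would call warm_up() once before the loop and generate(tc["prompt"], num_predict=...) for each case. The send parameter exists only so the HTTP call can be replaced in tests.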