HOW-TO · INF

How to evaluate model response quality using predefined test cases

advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

A model downloaded in Ollama, Python 3.10+, and a curated set of test cases with expected outputs

What this does

Creates a structured test-case suite in JSON format, runs a local model against each case, and scores the responses using automated criteria such as keyword matching and length bounds. By the end of this guide you will have a quality report with a pass/fail result for each test case.

Steps

  1. Define test cases in a JSON file. Structures each case with an id, a prompt, a substring the response must contain, and minimum and maximum response lengths in characters.

    [
      {"id": "capital_q", "prompt": "What is the capital of Japan?", "expected_substring": "Tokyo", "min_length": 10, "max_length": 300},
      {"id": "code_fib", "prompt": "Write a Python fibonacci function.", "expected_substring": "def", "min_length": 50, "max_length": 500}
    ]
    

    Save as test_cases.json.
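A malformed case (a missing field, or a min_length larger than max_length) would otherwise only surface as a KeyError or a guaranteed FAIL in the runner. The sketch below is one possible pre-flight check; the helper name validate_cases and the REQUIRED_FIELDS table are illustrative and not part of the original guide.

```python
# Hypothetical pre-flight check for the test_cases.json structure shown above.
REQUIRED_FIELDS = {
    "id": str,
    "prompt": str,
    "expected_substring": str,
    "min_length": int,
    "max_length": int,
}

def validate_cases(cases):
    """Return a list of problems; an empty list means the suite looks usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case #{index}: not a JSON object")
            continue
        label = str(case.get("id", f"case #{index}"))
        # Every field must be present and have the expected type.
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(case.get(field), expected_type):
                problems.append(f"{label}: '{field}' missing or not {expected_type.__name__}")
        # Duplicate ids make the report ambiguous.
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        # Inverted bounds can never be satisfied.
        lo, hi = case.get("min_length"), case.get("max_length")
        if isinstance(lo, int) and isinstance(hi, int) and lo > hi:
            problems.append(f"{label}: min_length exceeds max_length")
    return problems
```

To use it, load test_cases.json with json.load and print whatever validate_cases returns before starting the evaluation loop.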

  2. Write the evaluation runner. Loads the test suite, queries the model for each case, and scores each response against the case's criteria.

    import json
    import requests

    MODEL = "llama3.2:3b"  # replace with the model you pulled in Ollama

    with open("test_cases.json") as f:
        tests = json.load(f)

    results = []
    for tc in tests:
        # stream=False returns the whole completion as a single JSON object
        resp = requests.post("http://localhost:11434/api/generate", json={"model": MODEL, "prompt": tc["prompt"], "stream": False}, timeout=60)
        response = resp.json().get("response", "")
        # Pass only if the length is within bounds and the substring appears (case-insensitive)
        passed = tc["min_length"] <= len(response) <= tc["max_length"] and tc["expected_substring"].lower() in response.lower()
        results.append({"id": tc["id"], "passed": passed})
        print(f"[{'PASS' if passed else 'FAIL'}] {tc['id']}")
    print(f"Score: {sum(1 for r in results if r['passed'])}/{len(results)}")

    Save as eval_quality.py.
    
  3. Run the evaluation. Executes and reviews individual case results.

    python3 eval_quality.py
    

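    Example output (results vary by model and run): [PASS] capital_q, [FAIL] code_fib, Score: 1/2. A FAIL means the response either lacked the expected substring or fell outside the length bounds.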
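The runner prints only PASS or FAIL, which does not say which criterion a case missed. As a minimal sketch, the same length and substring checks can be pulled into a function that also returns the reasons; score_response is a hypothetical helper name, not something the guide defines.

```python
def score_response(case, response):
    """Apply the guide's length and substring checks; return (passed, reasons)."""
    reasons = []
    length = len(response)
    if length < case["min_length"]:
        reasons.append(f"too short: {length} < {case['min_length']}")
    if length > case["max_length"]:
        reasons.append(f"too long: {length} > {case['max_length']}")
    # Case-insensitive match, as in the runner.
    if case["expected_substring"].lower() not in response.lower():
        reasons.append(f"missing substring: {case['expected_substring']!r}")
    return (not reasons, reasons)

case = {"id": "capital_q", "expected_substring": "Tokyo", "min_length": 10, "max_length": 300}
print(score_response(case, "The capital of Japan is Tokyo."))
# (True, [])
print(score_response(case, "Kyoto"))
# (False, ['too short: 5 < 10', "missing substring: 'Tokyo'"])
```

Storing the reasons list next to the passed flag in each results entry would make a FAIL easier to diagnose without re-running the model.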

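  4. Export results to a report file. Persists the full evaluation record for documentation. The results list exists only inside the runner process, so a separate python3 -c command cannot see it; append these lines to the end of eval_quality.py and run the script again.

    with open("eval_report.json", "w") as f:
        json.dump(results, f, indent=2)
    

    Expected output: eval_report.json written with one entry per test case, each holding the case id and its passed flag.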

Verification

python3 -c "import json; d=json.load(open('eval_report.json')); print(f\"Passed: {sum(1 for r in d if r.get('passed'))}/{len(d)}\")"
# Expected: pass rate matching the score from the runner output
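The one-liner above reports only a count. A slightly longer sketch of the same check, assuming eval_report.json is the plain list of {"id", "passed"} entries written by the runner, also lists which cases failed; summarize is an illustrative name.

```python
def summarize(report):
    """Summarize a list of {"id": ..., "passed": ...} entries."""
    passed_ids = [r["id"] for r in report if r.get("passed")]
    failed_ids = [r["id"] for r in report if not r.get("passed")]
    total = len(report)
    return {
        "passed": len(passed_ids),
        "total": total,
        # Guard against an empty report to avoid dividing by zero.
        "pass_rate": len(passed_ids) / total if total else 0.0,
        "failed_ids": failed_ids,
    }

sample = [{"id": "capital_q", "passed": True}, {"id": "code_fib", "passed": False}]
print(summarize(sample))
# {'passed': 1, 'total': 2, 'pass_rate': 0.5, 'failed_ids': ['code_fib']}
```

In practice the argument would come from json.load on eval_report.json rather than the inline sample.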

Common failures

  • all cases fail with empty response - Model may not be loaded; send a warm-up request before the evaluation loop.
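  • expected_substring false negatives - The runner already compares case-insensitively; remaining misses usually come from spelling or formatting variants, so pick a substring that every correct answer must contain.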
  • min_length too restrictive - Very small models produce short answers; adjust bounds per model.
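  • timeout on long responses - Increase the timeout passed to requests.post, or cap generation with the num_predict option in the request. Lowering max_length in a test case only changes scoring; it does not shorten what the model generates.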
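Two of the fixes above, the warm-up request and a cap on generated tokens, can be sketched as small helpers. This assumes the Ollama /api/generate endpoint on the default port, that a request without a prompt just loads the model, and that options.num_predict limits output length; build_payload and warm_up are hypothetical names, and the standard library's urllib is used here only to keep the sketch dependency-free.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def build_payload(model, prompt, num_predict=None):
    """Build a non-streaming generate request, optionally capping output tokens."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_predict is not None:
        # num_predict is counted in tokens, not characters.
        payload["options"] = {"num_predict": num_predict}
    return payload

def warm_up(model, timeout=300):
    """Ask Ollama to load the model before the evaluation loop starts."""
    body = json.dumps({"model": model, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # A generous timeout, since the first load can take much longer than a reply.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status == 200
```

In the runner, warm_up would be called once before the loop, and the dictionary from build_payload passed as the json argument of requests.post.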
