What this does

Builds a structured evaluation framework for scoring multiple models against defined criteria such as latency, accuracy, output length, and memory usage. By the end of this guide you will have a CSV matrix for systematic side-by-side analysis and reporting.

Steps

Define evaluation criteria and test prompts. Creates a structured test suite in which each prompt is paired with a substring that a correct response should contain.

TEST_PROMPTS = [
    {"id": "factual_q", "prompt": "What is the capital of France?", "expected_contains": "Paris"},
    {"id": "code_gen", "prompt": "Write a Python function to reverse a string.", "expected_contains": "def"},
]
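
The expected_contains field suggests a simple substring check as the accuracy criterion. A minimal sketch of such a check follows; the helper name contains_expected and the case-insensitive comparison are assumptions, not something the guide prescribes.

```python
def contains_expected(test: dict, response: str) -> bool:
    """Return True if the response mentions the expected substring (case-insensitive)."""
    return test["expected_contains"].lower() in response.lower()


# Self-contained copy of one test case in the shape used by this guide.
sample_test = {
    "id": "factual_q",
    "prompt": "What is the capital of France?",
    "expected_contains": "Paris",
}

print(contains_expected(sample_test, "The capital of France is Paris."))
print(contains_expected(sample_test, "I am not sure."))
```

A substring match is a crude proxy for accuracy: it accepts any response that mentions the expected text, so stricter tests may need a different scoring rule.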

Build the benchmark loop. Iterates over models and prompts, collecting metrics per cell.

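import time
import pandas as pd
import requests
MODELS = ["llama3.2:3b", "mistral:7b", "phi3:3.8b"]
rows = []
for model in MODELS:
    for test in TEST_PROMPTS:
        start = time.perf_counter()
        resp = requests.post("http://localhost:11434/api/generate", json={"model": model, "prompt": test["prompt"], "stream": False}, timeout=60)
        latency_s = round(time.perf_counter() - start, 2)  # wall-clock seconds for the full request
        text = resp.json().get("response", "")  # empty string if the server returned an error body
        passed = test["expected_contains"].lower() in text.lower()  # crude accuracy check
        # One row per (model, test) cell; the response is truncated to a 120-character preview.
        rows.append({"model": model, "test_id": test["id"], "latency_s": latency_s,
                     "passed": passed, "output_chars": len(text), "response": text[:120]})
df = pd.DataFrame(rows)
print(df.to_string(index=False))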
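
The loop yields one row per model and prompt. If each row also carries a timing and a pass flag, a per-model summary is a short aggregation. The sketch below uses only the standard library; the field names latency_s and passed are assumptions, and the sample values are made up purely for illustration.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows only: latency_s and passed are hypothetical fields, values are made up.
sample_rows = [
    {"model": "model-a", "test_id": "factual_q", "latency_s": 1.0, "passed": True},
    {"model": "model-a", "test_id": "code_gen", "latency_s": 3.0, "passed": False},
    {"model": "model-b", "test_id": "factual_q", "latency_s": 2.0, "passed": True},
    {"model": "model-b", "test_id": "code_gen", "latency_s": 4.0, "passed": True},
]


def summarize(rows):
    """Group rows by model and compute mean latency and pass rate for each."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)
    return {
        model: {
            "mean_latency_s": mean(r["latency_s"] for r in group),
            "pass_rate": sum(r["passed"] for r in group) / len(group),
        }
        for model, group in by_model.items()
    }


summary = summarize(sample_rows)
for model, stats in summary.items():
    print(model, stats)
```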

Pivot into a comparison matrix and export. Reshapes data and saves to CSV.

pivot = df.pivot(index="model", columns="test_id", values="response")
pivot.to_csv("model_comparison_matrix.csv")
print(pivot)
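
df.pivot raises a ValueError when the same model and test_id pair occurs more than once, so it can help to check the rows before reshaping. A minimal sketch of such a pre-check, using only the standard library; the function name duplicate_cells and the sample rows are illustrative.

```python
from collections import Counter


def duplicate_cells(rows):
    """Return the (model, test_id) pairs that occur more than once."""
    counts = Counter((row["model"], row["test_id"]) for row in rows)
    return [pair for pair, n in counts.items() if n > 1]


# Made-up rows: the first cell is repeated, as would happen after a re-run.
sample_rows = [
    {"model": "model-a", "test_id": "factual_q", "response": "Paris"},
    {"model": "model-a", "test_id": "factual_q", "response": "Paris."},
    {"model": "model-a", "test_id": "code_gen", "response": "def reverse(s): ..."},
]

print(duplicate_cells(sample_rows))
```

An empty list means every cell is unique and the pivot can proceed; otherwise the repeated rows need to be dropped or aggregated first.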

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
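
One way to capture this evidence is a small JSON file written next to the CSV. The sketch below is an assumption about what such a record could look like: the file name run_record.json, the field names, and the helper functions are all hypothetical.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(models, command):
    """Collect the details needed to reproduce a benchmark run."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in ("pandas", "requests")},
        "models": list(models),
    }


record = build_run_record(["llama3.2:3b", "mistral:7b", "phi3:3.8b"], command=" ".join(sys.argv))
with open("run_record.json", "w", encoding="utf-8") as fh:
    json.dump(record, fh, indent=2)
print(json.dumps(record, indent=2))
```

The observed output itself is already preserved in model_comparison_matrix.csv, so the record only needs to point at the environment that produced it.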

Verification

python3 -c "import pandas; print(pandas.read_csv('model_comparison_matrix.csv').to_string())"
# Expected: Matrix with models as rows and test cases as columns
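
For a check that does not rely on reading the printed table by eye, the matrix can also be inspected programmatically. The following sketch uses the standard csv module and assumes the first column holds model names and the remaining header cells hold test ids; check_matrix and sample_matrix.csv are hypothetical names, and the sample file contains placeholder data.

```python
import csv


def check_matrix(path, expected_models, expected_tests):
    """Return a list of problems found in the matrix CSV (an empty list means none)."""
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    found_tests = set(header[1:])  # the first header cell labels the model column
    found_models = {row[0] for row in body if row}
    problems = []
    for test_id in expected_tests:
        if test_id not in found_tests:
            problems.append(f"missing test column: {test_id}")
    for model in expected_models:
        if model not in found_models:
            problems.append(f"missing model row: {model}")
    return problems


# Tiny stand-in file so the sketch runs on its own; cell text is placeholder data.
with open("sample_matrix.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["model", "code_gen", "factual_q"])
    writer.writerow(["model-a", "def reverse(s): ...", "Paris"])

print(check_matrix("sample_matrix.csv", ["model-a"], ["factual_q", "code_gen"]))
print(check_matrix("sample_matrix.csv", ["model-b"], ["factual_q"]))
```

Pointing check_matrix at model_comparison_matrix.csv with the MODELS list and the test ids from TEST_PROMPTS would apply the same check to the real output.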

Common failures

Request timeout for large models - Raise the timeout argument of requests.post from 60 to 180 seconds for slow models or long prompts.
pandas pivot fails with duplicates - df.pivot raises a ValueError when a model and test_id pair appears more than once; ensure each combination appears exactly once.
Empty responses - Some smaller models decline certain prompts; log the full responses and filter out empty rows before pivoting.
Missing model names - Verify that model names match the output of ollama list exactly; capitalization and tag suffixes matter.
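
Two of these failures can be caught with small pure-Python helpers before or after the benchmark runs. The sketch below is illustrative: missing_models and split_empty are hypothetical names, and the installed list is made-up sample data standing in for the names reported by ollama list.

```python
def missing_models(wanted, installed):
    """Return names in wanted that have no exact, case-sensitive match in installed."""
    installed_set = set(installed)
    return [name for name in wanted if name not in installed_set]


def split_empty(rows):
    """Separate rows with a non-blank response from rows whose response is empty."""
    kept = [row for row in rows if row["response"].strip()]
    empty = [row for row in rows if not row["response"].strip()]
    return kept, empty


# Made-up sample data: one name differs only by capitalization, one response is blank.
installed = ["llama3.2:3b", "mistral:7b", "phi3:3.8b"]
wanted = ["llama3.2:3b", "Mistral:7b", "phi3"]
print("not installed:", missing_models(wanted, installed))

sample_rows = [
    {"model": "llama3.2:3b", "test_id": "factual_q", "response": "Paris"},
    {"model": "phi3:3.8b", "test_id": "code_gen", "response": "   "},
]
kept, empty = split_empty(sample_rows)
print("empty cells:", [(row["model"], row["test_id"]) for row in empty])
```

Logging the empty cells rather than silently dropping them keeps the matrix honest about which models declined which prompts.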
