HOW-TO · INF

How to create a systematic model comparison matrix

advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Three or more models downloaded in Ollama, Python 3.10+ with pandas and requests libraries

What this does

Builds a structured evaluation framework that runs multiple models against a shared set of test prompts, with room for criteria such as latency, accuracy, output length, and memory usage. By the end of this guide you will have a CSV matrix for side-by-side analysis and reporting.

Steps

  1. Define evaluation criteria and test prompts. This creates a structured test suite in which each prompt has a unique id and a substring that a correct response is expected to contain.

    TEST_PROMPTS = [
        {"id": "factual_q", "prompt": "What is the capital of France?", "expected_contains": "Paris"},
        {"id": "code_gen", "prompt": "Write a Python function to reverse a string.", "expected_contains": "def"},
    ]
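
Because the pivot in step 3 needs each test id to be unique, it can help to validate the suite before sending any requests. The following is a minimal sketch: the helper name validate_prompts and the set of required keys are assumptions made for illustration, inferred from the entries above.

```python
# Assumed schema, inferred from the TEST_PROMPTS entries in this guide.
REQUIRED_KEYS = {"id", "prompt", "expected_contains"}


def validate_prompts(prompts):
    """Return a list of problems; an empty list means the suite looks usable."""
    problems = []
    seen = set()
    for i, test in enumerate(prompts):
        missing = REQUIRED_KEYS - test.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if test["id"] in seen:
            problems.append(f"entry {i}: duplicate id {test['id']!r}")
        seen.add(test["id"])
        if not test["prompt"].strip():
            problems.append(f"entry {i}: empty prompt")
    return problems


TEST_PROMPTS = [
    {"id": "factual_q", "prompt": "What is the capital of France?", "expected_contains": "Paris"},
    {"id": "code_gen", "prompt": "Write a Python function to reverse a string.", "expected_contains": "def"},
]
print(validate_prompts(TEST_PROMPTS))
```

An empty list means every entry has the three keys, a non-empty prompt, and an id that no other entry shares.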
    
  2. Build the benchmark loop. Iterates over models and prompts, collecting metrics per cell.

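    import time
    import pandas as pd
    import requests
    MODELS = ["llama3.2:3b", "mistral:7b", "phi3:3.8b"]
    rows = []
    for model in MODELS:
        for test in TEST_PROMPTS:
            payload = {"model": model, "prompt": test["prompt"], "stream": False}
            start = time.perf_counter()
            resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=60)
            # Wall-clock time; the first call per model also includes model load.
            latency_s = round(time.perf_counter() - start, 2)
            resp.raise_for_status()  # surface HTTP errors such as an unknown model name
            text = resp.json().get("response", "")
            # Accuracy is a case-insensitive substring check against expected_contains.
            passed = test["expected_contains"].lower() in text.lower()
            rows.append({"model": model, "test_id": test["id"], "latency_s": latency_s,
                         "passed": passed, "output_chars": len(text), "response": text[:120]})
    df = pd.DataFrame(rows)
    print(df.to_string(index=False))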
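
The loop above can be extended with throughput and memory figures. The sketch below assumes the /api/generate response carries eval_count and eval_duration (in nanoseconds), and that GET /api/ps lists loaded models with a size_vram field. Treat both as assumptions to check against the Ollama version in use. The helper names and the sample payloads are hypothetical.

```python
def throughput_tokens_per_s(data):
    """Generated tokens per second from an /api/generate response dict, or None."""
    count = data.get("eval_count")
    duration_ns = data.get("eval_duration")  # assumed to be in nanoseconds
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)


def vram_bytes_for(ps_payload, model):
    """Look up size_vram for a loaded model in a /api/ps payload, or None if absent."""
    for entry in ps_payload.get("models", []):
        if entry.get("name") == model:
            return entry.get("size_vram")
    return None


# Hand-written sample payloads for illustration only; these are not measured results.
sample_generate = {"response": "Paris", "eval_count": 50, "eval_duration": 2_000_000_000}
sample_ps = {"models": [{"name": "llama3.2:3b", "size_vram": 3_000_000_000}]}

print(throughput_tokens_per_s(sample_generate))
print(vram_bytes_for(sample_ps, "llama3.2:3b"))
```

Both helpers return None instead of raising when a field is missing, so a row can still be recorded for a model that omits the metadata.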
    
  3. Pivot into a comparison matrix and export. Reshapes data and saves to CSV.

    pivot = df.pivot(index="model", columns="test_id", values="response")
    pivot.to_csv("model_comparison_matrix.csv")
    print(pivot)
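
If the rows also carry numeric metrics, or if a model-test pair is run more than once, pivot_table with an aggregation function is a tolerant alternative to pivot. The sketch below uses hand-written sample rows. The latency_s and passed columns are assumed names, and the values are illustrative, not measurements.

```python
import pandas as pd

# Hand-written sample rows for illustration; latency_s and passed are assumed columns.
sample = pd.DataFrame([
    {"model": "model-a", "test_id": "factual_q", "latency_s": 1.0, "passed": True},
    {"model": "model-a", "test_id": "factual_q", "latency_s": 3.0, "passed": True},  # repeated run
    {"model": "model-a", "test_id": "code_gen", "latency_s": 4.0, "passed": False},
    {"model": "model-b", "test_id": "factual_q", "latency_s": 2.0, "passed": True},
    {"model": "model-b", "test_id": "code_gen", "latency_s": 5.0, "passed": True},
])

# Repeated runs of the same cell are averaged instead of raising an error.
latency_matrix = sample.pivot_table(
    index="model", columns="test_id", values="latency_s", aggfunc="mean"
)
# Fraction of rows per model whose response contained the expected substring.
pass_rate = sample.groupby("model")["passed"].mean()

print(latency_matrix)
print(pass_rate)
```

Averaging hides run-to-run variance, so it is worth keeping the long-form rows alongside the matrix.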
    
  4. Record the local run evidence. Save the exact command, the runtime and package versions, the model names, and the observed output so the result can be reproduced later.
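
One way to capture that evidence is a small JSON record written next to the CSV. This is a sketch under stated assumptions: build_run_record and its field names are hypothetical, the script name benchmark.py is a placeholder, and the Ollama version is passed in by hand (for example from `ollama --version`), not detected.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(models, command, ollama_version=None):
    """Collect the details needed to reproduce a benchmark run."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in ("pandas", "requests")},
        "ollama_version": ollama_version,  # supplied by the caller, not auto-detected
        "models": list(models),
    }


# "python3 benchmark.py" is a placeholder for whatever command was actually run.
record = build_run_record(
    ["llama3.2:3b", "mistral:7b", "phi3:3.8b"], command="python3 benchmark.py"
)
print(json.dumps(record, indent=2))
```

Writing the same dictionary to a file with json.dump keeps the record next to model_comparison_matrix.csv.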

Verification

python3 -c "import pandas; print(pandas.read_csv('model_comparison_matrix.csv').to_string())"
# Expected: Matrix with models as rows and test cases as columns
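
A visual check can be backed by a programmatic one. The sketch below assumes the CSV has a first column named model followed by one column per test id, as produced by the pivot in step 3. The function name check_matrix and the sample CSV text are hypothetical.

```python
import csv
import io


def check_matrix(csv_text, models, test_ids):
    """Return a list of problems found in the matrix; an empty list means it looks complete."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    missing_cols = set(test_ids) - set(reader.fieldnames or [])
    if missing_cols:
        problems.append(f"missing test columns: {sorted(missing_cols)}")
    seen_models = set()
    for row in reader:
        seen_models.add(row.get("model"))
        for test_id in test_ids:
            # A present column with a blank value means the model returned nothing.
            if test_id in row and not (row[test_id] or "").strip():
                problems.append(f"empty cell: {row.get('model')} / {test_id}")
    missing_rows = set(models) - seen_models
    if missing_rows:
        problems.append(f"missing model rows: {sorted(missing_rows)}")
    return problems


# Hand-written sample text for illustration; not real model output.
sample_csv = "model,code_gen,factual_q\nmodel-a,def reverse(s): ...,Paris\nmodel-b,,Paris\n"
print(check_matrix(sample_csv, ["model-a", "model-b", "model-c"], ["code_gen", "factual_q"]))
```

For the real file, pass the contents of model_comparison_matrix.csv together with the MODELS list and the ids from TEST_PROMPTS.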

Common failures

  • request timeout for large models - Increase timeout parameter to 180 seconds for slow models or long prompts.
  • pandas pivot fails with duplicates - Ensure each model-test combination appears exactly once.
  • empty responses - Some smaller models decline certain prompts; log full responses and filter empty rows.
  • missing model names - Verify that model names match the output of ollama list exactly; capitalization and tag suffixes matter.
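
Two of these failures can be caught before or right after the run. The sketch below assumes GET /api/tags returns a models list whose entries have a name field. The helper names and the sample data are hypothetical.

```python
from collections import Counter


def missing_models(requested, tags_payload):
    """Requested model names that do not appear, case-sensitively, in a /api/tags payload."""
    available = {entry.get("name") for entry in tags_payload.get("models", [])}
    return [name for name in requested if name not in available]


def duplicate_cells(rows):
    """(model, test_id) pairs that occur more than once and would break DataFrame.pivot."""
    counts = Counter((row["model"], row["test_id"]) for row in rows)
    return sorted(cell for cell, n in counts.items() if n > 1)


# Hand-written samples for illustration only.
sample_tags = {"models": [{"name": "llama3.2:3b"}, {"name": "mistral:7b"}]}
sample_rows = [
    {"model": "llama3.2:3b", "test_id": "factual_q"},
    {"model": "llama3.2:3b", "test_id": "factual_q"},
    {"model": "mistral:7b", "test_id": "factual_q"},
]

print(missing_models(["llama3.2:3b", "Mistral:7b"], sample_tags))
print(duplicate_cells(sample_rows))
```

Running missing_models before the benchmark loop avoids spending time on requests that would fail, and duplicate_cells points at the exact pairs to deduplicate before pivoting.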
