HOW-TO · INF
How to create a systematic model comparison matrix
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Three or more models downloaded in Ollama, Python 3.10+ with pandas and requests libraries
What this does
Builds a structured evaluation framework for comparing multiple models across defined criteria such as latency, accuracy, output length, and memory usage. The steps below record each model's response to every test prompt and export the results as a CSV matrix for side-by-side analysis and reporting.
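As a minimal sketch of how those criteria could become per-cell numbers, the function below scores one non-streaming /api/generate reply. The helper name score_response and the output column names are hypothetical. eval_count and eval_duration (in nanoseconds) are timing fields that Ollama includes in the final response. The sample reply is hand-made for illustration, not real model output. Memory usage is not covered here.

```python
def score_response(data: dict, test: dict, wall_seconds: float) -> dict:
    """Derive comparison metrics from one Ollama /api/generate reply.

    `data` is the parsed JSON body, `test` is one test-suite entry, and
    `wall_seconds` is the request time measured by the caller.
    """
    text = data.get("response", "")
    eval_count = data.get("eval_count", 0)   # number of generated tokens
    eval_ns = data.get("eval_duration", 0)   # generation time in nanoseconds
    tokens_per_s = eval_count / (eval_ns / 1e9) if eval_ns else None
    return {
        # Case-insensitive substring match: a crude proxy for accuracy.
        "passed": test["expected_contains"].lower() in text.lower(),
        "output_chars": len(text),
        "latency_s": round(wall_seconds, 3),
        "tokens_per_s": round(tokens_per_s, 1) if tokens_per_s is not None else None,
    }


# Hand-made reply for illustration only (not real model output).
sample_reply = {
    "response": "The capital of France is Paris.",
    "eval_count": 8,
    "eval_duration": 400_000_000,
}
sample_test = {"id": "factual_q", "expected_contains": "Paris"}
sample_scores = score_response(sample_reply, sample_test, wall_seconds=0.52)
print(sample_scores)
```

A substring match is cheap to compute but only confirms that a keyword appears, so it is worth spot-checking the full responses before trusting the pass rate.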
Steps
- Define evaluation criteria and test prompts. This creates a structured test suite covering the relevant dimensions.
    TEST_PROMPTS = [
        {"id": "factual_q", "prompt": "What is the capital of France?", "expected_contains": "Paris"},
        {"id": "code_gen", "prompt": "Write a Python function to reverse a string.", "expected_contains": "def"},
    ]
- Build the benchmark loop. This iterates over models and prompts, collecting metrics per cell.
    import time
    import pandas as pd
    import requests
    MODELS = ["llama3.2:3b", "mistral:7b", "phi3:3.8b"]
    rows = []
    for model in MODELS:
        for test in TEST_PROMPTS:
            start = time.perf_counter()
            resp = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": test["prompt"], "stream": False},
                timeout=60,
            )
            resp.raise_for_status()  # stop early on an unknown model or server error
            data = resp.json()
            rows.append({
                "model": model,
                "test_id": test["id"],
                "latency_s": round(time.perf_counter() - start, 2),  # wall-clock seconds
                "response": data.get("response", "")[:120],  # truncated for display
            })
    df = pd.DataFrame(rows)
    print(df.to_string(index=False))
- Pivot into a comparison matrix and export. This reshapes the data to one row per model and one column per test, then saves it to CSV.
    pivot = df.pivot(index="model", columns="test_id", values="response")
    pivot.to_csv("model_comparison_matrix.csv")
    print(pivot)
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
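One way to capture that evidence is to write a small JSON record next to the CSV. This is a sketch under stated assumptions: the helper names build_run_record and package_version and the file name run_evidence.json are hypothetical, and the model list simply repeats the names used in the steps.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_run_record(models: list[str], command: str) -> dict:
    """Collect the details needed to reproduce a benchmark run."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in ("pandas", "requests")},
        "models": models,
    }


record = build_run_record(
    ["llama3.2:3b", "mistral:7b", "phi3:3.8b"],
    command=" ".join(sys.argv),
)

# run_evidence.json is a hypothetical file name chosen for this sketch.
with open("run_evidence.json", "w", encoding="utf-8") as fh:
    json.dump(record, fh, indent=2)
```

The Ollama version is not captured here. It can be added by hand from the output of ollama --version, and the observed output is already preserved in model_comparison_matrix.csv.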
Verification
python3 -c "import pandas; print(pandas.read_csv('model_comparison_matrix.csv').to_string())"
# Expected: Matrix with models as rows and test cases as columns
Common failures
- request timeout for large models - Increase timeout parameter to 180 seconds for slow models or long prompts.
- pandas pivot fails with duplicates - Ensure each model-test combination appears exactly once.
- empty responses - Some smaller models decline certain prompts; log full responses and filter empty rows.
- missing model names - Verify that model names match the output of ollama list exactly; capitalization and tag suffixes matter.
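The duplicate and empty-response failures above can both be handled before pivoting. The sketch below uses hand-made rows (not real model output) that contain one repeated model-test pair and one empty response. Keeping the last run of each pair is an assumption made here; keeping the first is just as valid.

```python
import pandas as pd

# Hand-made rows for illustration: one duplicate cell and one empty response.
rows = [
    {"model": "llama3.2:3b", "test_id": "factual_q", "response": "Paris"},
    {"model": "llama3.2:3b", "test_id": "factual_q", "response": "Paris."},
    {"model": "llama3.2:3b", "test_id": "code_gen", "response": ""},
    {"model": "mistral:7b", "test_id": "factual_q", "response": "Paris"},
    {"model": "mistral:7b", "test_id": "code_gen", "response": "def rev(s): return s[::-1]"},
]
df = pd.DataFrame(rows)

# Flag empty or whitespace-only responses, then exclude them from the matrix input.
is_empty = df["response"].str.strip().eq("")
print(f"Dropping {is_empty.sum()} empty response(s)")
clean = df[~is_empty]

# Keep the last run of each model/test pair so pivot() sees unique cells.
clean = clean.drop_duplicates(subset=["model", "test_id"], keep="last")

pivot = clean.pivot(index="model", columns="test_id", values="response")
print(pivot)
```

Cells for dropped rows appear as NaN in the matrix, which keeps a missing answer visibly distinct from a short one.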