What this does

Sends the same prompt to two locally hosted models and captures both responses along with timing metadata. After completing this guide, you will have a direct quality and speed comparison of the two models in a structured format.

Steps

Write a comparison script. Save it as compare_models.py; it sends the prompt to both models sequentially and prints a formatted report.

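```
import requests

PROMPT = "Explain the difference between a transformer and an RNN in two sentences."
MODELS = ["llama3.2:3b", "mistral:7b"]
URL = "http://localhost:11434/api/generate"

for model in MODELS:
    # stream=False returns a single JSON object instead of a stream of chunks
    payload = {"model": model, "prompt": PROMPT, "stream": False}
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()  # surface HTTP errors such as an unknown model
    data = resp.json()
    # total_duration is reported in nanoseconds; convert to milliseconds
    elapsed_ms = data.get("total_duration", 0) // 1_000_000
    print(f"=== {model} ({elapsed_ms}ms) ===")
    print(data.get("response", "").strip())
```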

Run the comparison script. Execute it from the terminal and inspect both outputs.
```
python3 compare_models.py
```
Expected output: one header line per model showing its name and total generation time in milliseconds, followed by that model's response text.
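To turn the printed report into a structured record, the timing fields of each response can be collected into a dictionary. The sketch below assumes the response JSON carries `total_duration`, `eval_count`, and `eval_duration` (durations in nanoseconds), as the generate endpoint used in this guide reports them. The helper name `summarize` and the sample values are hypothetical and only stand in for a real `resp.json()`.

```python
import json


def summarize(model: str, data: dict) -> dict:
    """Reduce one /api/generate response to a comparable record."""
    total_ns = data.get("total_duration", 0)
    eval_ns = data.get("eval_duration", 0)
    tokens = data.get("eval_count", 0)
    return {
        "model": model,
        "total_ms": total_ns // 1_000_000,
        # Generated tokens per second; None when timing data is missing
        "tokens_per_s": round(tokens / (eval_ns / 1e9), 1) if eval_ns else None,
        "response": data.get("response", "").strip(),
    }


# Made-up sample standing in for resp.json(); not real model output
sample = {
    "response": " Hi there. ",
    "total_duration": 2_500_000_000,
    "eval_count": 50,
    "eval_duration": 2_000_000_000,
}
record = summarize("llama3.2:3b", sample)
print(json.dumps(record, indent=2))
```

Collecting one such record per model gives a list that can be dumped to JSON or compared field by field.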

Vary parameters per model for deeper comparison. Pass a different temperature in each payload's options field to see how sampling randomness changes the output.

```
payload_a = {"model": "llama3.2:3b", "prompt": PROMPT, "options": {"temperature": 0.0}, "stream": False}
payload_b = {"model": "mistral:7b", "prompt": PROMPT, "options": {"temperature": 0.7}, "stream": False}
```
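Because the two payloads differ only in model name and temperature, a small helper can keep every other field identical. This is a minimal sketch; `build_payload` and `SETTINGS` are hypothetical names introduced for illustration.

```python
PROMPT = "Explain the difference between a transformer and an RNN in two sentences."


def build_payload(model: str, prompt: str, temperature: float) -> dict:
    """Build a non-streaming generate request with a per-model temperature."""
    return {
        "model": model,
        "prompt": prompt,
        "options": {"temperature": temperature},
        "stream": False,
    }


# Same prompt for both models, different sampling temperature per model
SETTINGS = {"llama3.2:3b": 0.0, "mistral:7b": 0.7}
payloads = [build_payload(m, PROMPT, t) for m, t in SETTINGS.items()]
for p in payloads:
    print(p["model"], p["options"])
```

When the temperature differs between the two models, any difference in output reflects both the model and the sampling setting. Giving both entries in `SETTINGS` the same value isolates the model itself.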

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
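One way to capture that evidence is to assemble it into a JSON document. The sketch below is an assumption about what such a record could look like, not a prescribed format; `build_evidence` and its field names are hypothetical, and the output strings are placeholders to be replaced with what was actually observed.

```python
import json
import platform
from datetime import datetime, timezone


def build_evidence(command: str, models: list, outputs: dict) -> dict:
    """Collect the details needed to reproduce a comparison run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python_version": platform.python_version(),
        "models": models,
        "outputs": outputs,
    }


evidence = build_evidence(
    "python3 compare_models.py",
    ["llama3.2:3b", "mistral:7b"],
    # Placeholders: paste the text each model actually printed
    {"llama3.2:3b": "<observed output>", "mistral:7b": "<observed output>"},
)
text = json.dumps(evidence, indent=2)
print(text)
```

Redirecting the printed JSON to a file stored next to the script keeps the record together with the code that produced it.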

Verification

```
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Hello","stream":false}' | jq -r '.response' && \
curl -s http://localhost:11434/api/generate -d '{"model":"mistral:7b","prompt":"Hello","stream":false}' | jq -r '.response'
# Expected: two different response texts from the two models
```
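The same check can be expressed in Python once both JSON responses are parsed. This sketch assumes a failed request returns an object with an `error` key and a successful one returns a `response` key; `verify_pair` is a hypothetical helper, and the sample dictionaries are made up rather than real model output.

```python
def verify_pair(resp_a: dict, resp_b: dict) -> list:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for label, resp in (("first", resp_a), ("second", resp_b)):
        if "error" in resp:
            problems.append(f"{label} model returned an error: {resp['error']}")
        elif not resp.get("response", "").strip():
            problems.append(f"{label} model returned an empty response")
    # Only compare the texts when both responses are usable
    if not problems and resp_a["response"].strip() == resp_b["response"].strip():
        problems.append("both models returned identical text")
    return problems


# Made-up samples standing in for the two parsed responses
ok = verify_pair({"response": "Hello!"}, {"response": "Hi, how can I help?"})
bad = verify_pair({"error": "model 'x' not found"}, {"response": "Hi"})
print(ok)
print(bad)
```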

Common failures

model not found - Model names must exactly match the names shown by ollama list; pull any missing model with ollama pull.
identical responses - Expected across repeated runs at temperature=0.0 for simple prompts; increase temperature to 0.7 or higher for varied output.
timing variance between runs - Cold-start overhead (loading the model into memory) inflates the first call; run each model twice and discard the first measurement.
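Two of the failures above can be guarded against in code. The sketch below assumes the server's model listing endpoint (GET /api/tags) returns a JSON object of the form `{"models": [{"name": ...}]}`; `missing_models` and `timed_runs` are hypothetical helpers, and the sample listing and the stand-in callable replace real network calls.

```python
import time


def missing_models(required: list, tags: dict) -> list:
    """Return the required model names absent from a parsed model listing."""
    installed = {m.get("name") for m in tags.get("models", [])}
    return [name for name in required if name not in installed]


def timed_runs(run_once, repeats: int = 2) -> list:
    """Call run_once `repeats` times and return durations in seconds,
    dropping the first call, which includes cold-start overhead."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations[1:]


# Made-up listing standing in for the parsed /api/tags response
sample_tags = {"models": [{"name": "llama3.2:3b"}, {"name": "phi3:mini"}]}
absent = missing_models(["llama3.2:3b", "mistral:7b"], sample_tags)
print(absent)

# Stand-in for a function that sends one request to a model
warm = timed_runs(lambda: time.sleep(0.01), repeats=3)
print(len(warm))
```

In a real run, `run_once` would wrap the request for a single model, and `repeats` needs to be at least 2 so that one measurement remains after the first is discarded.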

How to compare two models side-by-side with identical prompts
