HOW-TO · INF
How to compare two models side-by-side with identical prompts
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Two models downloaded in Ollama, Python 3.10+ with requests library
What this does
Sends the exact same prompt to two different locally hosted models and captures both responses along with timing metadata. After completing this guide, you will have both responses and their timings in a structured report, so quality and speed can be compared directly.
Steps
Write a comparison script. Save the following as compare_models.py; it sends the prompt to both models sequentially and prints a formatted report.

    import requests

    PROMPT = "Explain the difference between a transformer and an RNN in two sentences."
    MODELS = ["llama3.2:3b", "mistral:7b"]

    for model in MODELS:
        payload = {"model": model, "prompt": PROMPT, "stream": False}
        resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
        resp.raise_for_status()  # surface HTTP errors such as a missing model
        data = resp.json()
        # total_duration is reported in nanoseconds; convert to milliseconds
        print(f"=== {model} ({data.get('total_duration', 0) // 1_000_000}ms) ===")
        print(data.get("response", "").strip())

Run the comparison script. Executes it from the terminal and inspects both outputs.
    python3 compare_models.py

Expected output: Two model names with response text and generation time in milliseconds.
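The printed report shows only the total time. A minimal sketch of turning one response body into a structured record follows. It assumes the reply carries Ollama's nanosecond timing fields (total_duration, load_duration, eval_count, eval_duration). The summarize helper and the sample body are hypothetical and are not part of compare_models.py.

```python
def summarize(model: str, data: dict) -> dict:
    """Turn one /api/generate response body into a flat comparison record."""
    ns_per_ms = 1_000_000
    eval_count = data.get("eval_count", 0)
    eval_ns = data.get("eval_duration", 0)
    return {
        "model": model,
        "total_ms": data.get("total_duration", 0) // ns_per_ms,
        "load_ms": data.get("load_duration", 0) // ns_per_ms,
        "output_tokens": eval_count,
        # Generated tokens per second; 0.0 when timing data is missing.
        "tokens_per_s": round(eval_count / (eval_ns / 1e9), 1) if eval_ns else 0.0,
        "response": data.get("response", "").strip(),
    }


# Hypothetical response body, shaped like a non-streaming reply.
sample = {
    "response": " Transformers use attention. ",
    "total_duration": 2_500_000_000,
    "load_duration": 500_000_000,
    "eval_count": 40,
    "eval_duration": 1_600_000_000,
}
record = summarize("llama3.2:3b", sample)
print(record)
```

Reporting load_ms separately keeps model loading time from being mistaken for slow generation.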
Vary parameters per model for deeper comparison. Apply different temperature settings to test creativity.
    payload_a = {"model": "llama3.2:3b", "prompt": PROMPT, "options": {"temperature": 0.0}, "stream": False}
    payload_b = {"model": "mistral:7b", "prompt": PROMPT, "options": {"temperature": 0.7}, "stream": False}
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
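One possible shape for that record is a single JSON line per run. The sketch below is only an illustration: make_run_record and its field names are hypothetical, and the Ollama version is passed in by hand (for example, copied from the output of `ollama --version`) rather than detected.

```python
import json
import platform
from datetime import datetime, timezone


def make_run_record(command: str, ollama_version: str, model: str, output: str) -> dict:
    """Collect the details needed to reproduce one comparison run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python_version": platform.python_version(),
        "ollama_version": ollama_version,
        "model": model,
        "output": output,
    }


record = make_run_record(
    command="python3 compare_models.py",
    ollama_version="0.4.x",  # placeholder; replace with the installed version
    model="llama3.2:3b",
    output="(paste the observed response here)",
)
line = json.dumps(record)
print(line)  # append this line to a log file of your choice
```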
Verification
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Hello","stream":false}' | jq -r '.response' && curl -s http://localhost:11434/api/generate -d '{"model":"mistral:7b","prompt":"Hello","stream":false}' | jq -r '.response'
# Expected: two different response texts from the two models
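If either call returns an error instead of text, it may help to confirm that both model names are installed. The sketch below assumes the installed-model list has already been fetched from Ollama's GET /api/tags endpoint and has the shape {"models": [{"name": ...}]}. The missing_models helper and the sample body are hypothetical.

```python
def missing_models(tags_body: dict, wanted: list[str]) -> list[str]:
    """Return the wanted model names that are absent from an /api/tags body."""
    installed = {m.get("name") for m in tags_body.get("models", [])}
    return [name for name in wanted if name not in installed]


# Hypothetical /api/tags body with only one of the two models installed.
sample_tags = {"models": [{"name": "llama3.2:3b"}, {"name": "qwen2.5:7b"}]}
missing = missing_models(sample_tags, ["llama3.2:3b", "mistral:7b"])
print(missing)  # ['mistral:7b']
```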
Common failures
- model not found - Model names must exactly match the names printed by ollama list; pull any missing model.
- identical responses - Expected at temperature=0.0 for simple prompts; increase temperature to 0.7+ for varied output.
- timing variance between runs - Cold-start overhead affects the first call; run each model twice and discard the first.
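The warm-up advice in the last item can be sketched as a small timing helper. This is an illustration under stated assumptions: warm_median_ms is a hypothetical name, it measures client-side wall time rather than Ollama's reported durations, and fake_generate stands in for a real request so the example runs without a server. With runs=1 it matches the "run twice, discard the first" advice.

```python
import statistics
import time
from typing import Callable


def warm_median_ms(call: Callable[[], object], runs: int = 3) -> float:
    """Call once to warm up, then return the median wall time of `runs` calls in ms."""
    call()  # warm-up run; its timing is discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


calls = []


def fake_generate() -> None:
    """Stand-in for a request to the model; sleeps briefly instead."""
    calls.append(1)
    time.sleep(0.01)


elapsed = warm_median_ms(fake_generate, runs=3)
print(f"{elapsed:.1f} ms median over {len(calls) - 1} timed runs")
```

In a real run, replace fake_generate with a function that posts the payload for one model, and call the helper once per model.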
Related guides