What this does

Benchmarks identical prompts against the same model at multiple quantization levels, measuring latency and tokens per second to find the best trade-off between resource usage and speed. By the end of this guide you will have a comparison table of Q4_K_M, Q5_K_M, and Q8_0 performance.

Steps

Pull model quantizations. Ensures all variants are available locally.
```
ollama pull mistral:q4_K_M && ollama pull mistral:q5_K_M && ollama pull mistral:q8_0
```
Each pull shows a progress indicator. Verify with ollama list.

Create a standard test prompt. Saves a shared prompt for consistent input.

```
echo "Explain the mechanism of photosynthesis in three sentences." > /tmp/test_prompt.txt
```

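Benchmark each quantization with timing. Records wall-clock time for each variant. The first run of each tag also includes model load time, so run each command at least twice and compare the later runs. For tokens per second, add the --verbose flag to ollama run, which prints timing statistics including the eval rate.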

```
time ollama run mistral:q4_K_M "$(cat /tmp/test_prompt.txt)"
time ollama run mistral:q5_K_M "$(cat /tmp/test_prompt.txt)"
time ollama run mistral:q8_0 "$(cat /tmp/test_prompt.txt)"
```

Expected output: Each command prints the response followed by real/user/sys time.
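
The time output covers wall-clock latency only. The following is a minimal sketch of capturing tokens per second as well, assuming that ollama run --verbose prints its statistics to stderr and that they include a line of the form "eval rate: N tokens/s"; check this against the output of your installed version. The helper names extract_eval_rate and bench_tag are hypothetical, not part of Ollama.

```shell
#!/bin/sh
# Sketch: pull the generation speed out of `ollama run --verbose` statistics.
# Assumption: the stats contain a line such as "eval rate:   <number> tokens/s".

# Read verbose statistics on stdin and print only the eval rate number.
# The "prompt eval rate:" line is skipped because it does not start with "eval".
extract_eval_rate() {
    awk '/^eval rate:/ { print $3 }'
}

# Run one tag against the shared prompt and print "<tag> <tokens/s>".
# The response text is discarded; the stats (assumed to be on stderr) are parsed.
bench_tag() {
    tag=$1
    rate=$(ollama run --verbose "$tag" "$(cat /tmp/test_prompt.txt)" 2>&1 >/dev/null | extract_eval_rate)
    printf '%s %s\n' "$tag" "$rate"
}
```

With these functions loaded, calling bench_tag mistral:q4_K_M (and likewise for the other two tags) prints one line per run. Repeating each call a few times and ignoring the first run reduces the effect of model load time.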

Tabulate results. Record latency and tokens per second for each tag in one table, then judge response quality by reading the outputs side by side.
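
One way to build the table is sketched below. It assumes the measurements were saved by hand or by script as whitespace-separated lines of the form "tag seconds tokens_per_second"; both that layout and the function name print_table are illustrative choices, not something Ollama produces.

```shell
#!/bin/sh
# Sketch: format "<tag> <wall-clock seconds> <tokens/s>" lines as an aligned table.
# Lines that do not have exactly three fields are ignored.
print_table() {
    awk 'BEGIN { printf "%-20s %10s %10s\n", "tag", "seconds", "tokens/s" }
         NF == 3 { printf "%-20s %10.2f %10.2f\n", $1, $2, $3 }'
}
```

For example, print_table < /tmp/bench_results.txt would render a results file (a hypothetical path) with a header row followed by one row per measurement.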

Verification

```
ollama list | grep mistral
# Expected: mistral with q4_K_M, q5_K_M, and q8_0 tags listed with size differences visible (Q8_0 largest, Q4_K_M smallest)
```
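
The grep check relies on reading the list by eye. A small sketch of an automated version follows, assuming the first column of ollama list output is the model name including its tag; missing_tags is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print every expected tag that does not appear in `ollama list` output.
# Assumption: the model name (with tag) is the first column of each row.
missing_tags() {
    listing=$(cat)   # capture the listing from stdin once
    for tag in "$@"; do
        # awk exits 0 when a row's first column equals the tag, 1 otherwise
        printf '%s\n' "$listing" \
            | awk -v t="$tag" '$1 == t { found = 1 } END { exit !found }' \
            || printf '%s\n' "$tag"
    done
}
```

Running ollama list | missing_tags mistral:q4_K_M mistral:q5_K_M mistral:q8_0 prints nothing when all three variants are present and otherwise names the ones still to pull.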

Common failures

model not found: Quantization tag may not match what the registry publishes; look up the exact tag names on the model's page in the Ollama library, then confirm the local copy with ollama list.
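out of memory with Q8_0: Q8_0 stores roughly one byte per weight, so it needs noticeably more RAM or VRAM than Q4_K_M or Q5_K_M; close other applications or use a smaller base model.
inconsistent prompts: Even trailing whitespace can change the tokenized input and therefore the output; always use the same prompt file. Sampling is also randomized by default, so responses can differ between runs even with an identical prompt.
slow inference on Q4_K_M: A quantization can run slowly when the model falls back to CPU or is only partly offloaded to the GPU; run ollama ps while the model is loaded to see the CPU/GPU split.
outdated Ollama version: Older versions may not support all quantization tags; check the installed version with ollama --version and upgrade to 0.4.x or later.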
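
To guard against the inconsistent-prompt failure, the prompt file can be fingerprinted and the fingerprint stored next to each result. This is a minimal sketch using the POSIX cksum utility; prompt_fingerprint is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: fingerprint the prompt file so each run can be tied to identical input.
# With input redirected, cksum prints "<checksum> <byte count>"; the two values
# are joined into a single token such as "<checksum>-<bytes>".
prompt_fingerprint() {
    cksum < "$1" | awk '{ print $1 "-" $2 }'
}
```

Calling prompt_fingerprint /tmp/test_prompt.txt before each benchmark run and comparing the values shows immediately whether the file changed, including by a single trailing space.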

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
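
Part of that record can be gathered automatically. The sketch below collects the date, operating system, and Ollama version; record_env is a hypothetical helper name, and hardware or GPU backend details still need to be added by hand because the commands for those differ by platform.

```shell
#!/bin/sh
# Sketch: print the basic environment details the checkpoint asks for.
record_env() {
    printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'os: %s\n' "$(uname -srm)"
    if command -v ollama >/dev/null 2>&1; then
        printf 'ollama: %s\n' "$(ollama --version 2>&1)"
    else
        printf 'ollama: not found on PATH\n'
    fi
}
```

Redirecting the output, for example record_env > /tmp/bench_env.txt (a hypothetical path), keeps the environment notes alongside the verification output.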
