How to compare vision model outputs across different model sizes
Prerequisites
Multiple vision models of different sizes installed in Ollama (e.g., llava:7b and llava:13b), a set of test images, and terminal access or a timing script.
What this does
Runs identical image-plus-prompt requests through vision models of different sizes and compares description quality, inference time, and resource consumption. By the end of this guide, you should be able to identify the best model size for a specific use case.
Steps
Verify all models are available. Confirms both sizes are present locally.
ollama list | grep llava
Expected output: Both llava:7b and llava:13b listed with distinct sizes.
Define a standard evaluation prompt. Consistency is essential for fair comparison.
echo "Provide a detailed description of this image, including objects, setting, colors, and any text visible." > /tmp/vision_prompt.txt
Benchmark the 7B model with timing. Records wall-clock time and captures response.
time ollama run llava:7b "$(cat /tmp/vision_prompt.txt)" /path/to/test_image.jpg
Expected output: Description text followed by real/user/sys timing.
Benchmark the 13B model with identical inputs. Uses same prompt and image.
time ollama run llava:13b "$(cat /tmp/vision_prompt.txt)" /path/to/test_image.jpg
Expected output: Typically more detailed description but longer inference time.
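The shell's time keyword writes its report to stderr, which makes the numbers awkward to log next to the responses. A minimal sketch of an alternative, assuming whole-second resolution is good enough: the hypothetical helper time_cmd runs any command, saves its stdout to a file, and prints a label with the elapsed seconds. The helper name and the label,seconds output format are illustrative choices, not part of Ollama.

```shell
# time_cmd LABEL OUT_FILE CMD [ARGS...]
# Hypothetical helper: runs CMD, writes its stdout to OUT_FILE,
# and prints "LABEL,SECONDS" (whole seconds of wall-clock time).
time_cmd() {
  local label="$1" out_file="$2"
  shift 2
  local start end status
  start=$(date +%s)
  "$@" > "$out_file"
  status=$?
  end=$(date +%s)
  printf '%s,%d\n' "$label" "$((end - start))"
  return "$status"
}

# Example call (not executed here), reusing the prompt file from the steps above:
#   time_cmd llava:7b /tmp/out_7b.txt \
#     ollama run llava:7b "$(cat /tmp/vision_prompt.txt)" /path/to/test_image.jpg
```

Because the response lands in a file and the timing on stdout, the two models' results can be diffed and tabulated separately.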
Evaluate results across multiple images. Run on 5+ diverse test images. Score responses on detail, accuracy, and coherence.
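Running every model on every image by hand invites copy-paste mistakes, so the evaluation can be scripted. The following is a sketch under stated assumptions: test images are .jpg files in one directory, whole-second timing is sufficient, and the names run_matrix, OLLAMA_BIN, and PROMPT_FILE are hypothetical conveniences introduced here (OLLAMA_BIN only exists so the command can be swapped for a stub when testing).

```shell
# run_matrix IMAGE_DIR OUT_DIR MODEL...
# Runs every MODEL on every .jpg in IMAGE_DIR with the same prompt.
# Responses go to OUT_DIR/<model>_<image>.txt, timings to OUT_DIR/timings.csv.
# OLLAMA_BIN and PROMPT_FILE are hypothetical overrides for this sketch.
run_matrix() {
  local image_dir="$1" out_dir="$2"
  shift 2
  local prompt model image name safe start end
  prompt=$(cat "${PROMPT_FILE:-/tmp/vision_prompt.txt}") || return 1
  mkdir -p "$out_dir"
  echo "model,image,seconds" > "$out_dir/timings.csv"
  for model in "$@"; do
    # Replace ':' and '/' so the tag is safe to use in a file name.
    safe=$(printf '%s' "$model" | tr ':/' '__')
    for image in "$image_dir"/*.jpg; do
      [ -e "$image" ] || continue   # skip the literal pattern if nothing matches
      name=$(basename "$image" .jpg)
      start=$(date +%s)
      "${OLLAMA_BIN:-ollama}" run "$model" "$prompt" "$image" \
        > "$out_dir/${safe}_${name}.txt"
      end=$(date +%s)
      echo "$model,$name,$((end - start))" >> "$out_dir/timings.csv"
    done
  done
}

# Example call (not executed here):
#   run_matrix /path/to/test_images /tmp/vision_results llava:7b llava:13b
```

Each response file can then be scored against the rubric, and timings.csv gives one row per model and image for comparison.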
Verification
ollama list | grep -E "llava:7b|llava:13b" && ollama run llava:7b "Describe this image" /path/to/test.jpg
# Expected: Both tags listed with distinct file sizes; 7B produces a reasonable description
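One caveat with the grep check: grep -E succeeds as soon as either tag matches, and it also matches longer tags that merely contain the string. A stricter check is sketched below, assuming the first column of ollama list output is the model name; check_tags is a hypothetical helper name, not an Ollama command.

```shell
# check_tags TAG...
# Reads `ollama list` output on stdin and verifies that each TAG appears
# as an exact match in the first (NAME) column.
# Returns non-zero if any tag is missing.
check_tags() {
  local listing tag missing
  missing=0
  listing=$(cat)
  for tag in "$@"; do
    if printf '%s\n' "$listing" \
        | awk -v t="$tag" '$1 == t { found = 1 } END { exit !found }'; then
      echo "ok: $tag"
    else
      echo "missing: $tag" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (not executed here):
#   ollama list | check_tags llava:7b llava:13b
```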
Common failures
- inconsistent prompts - Store the prompt in a file and read it with $(cat file) for exact reproducibility.
- cold start penalty - The first run after a model load incurs overhead; discard the first run or run each model twice.
- different context windows - Set num_ctx to the same value for both models, for example with /set parameter num_ctx 4096 in an interactive session or a PARAMETER num_ctx line in a Modelfile, for fair comparison.
- insufficient RAM for 13B - The larger model may fail on memory-constrained systems; check with free -h before benchmarking.
- subjective scoring bias - Define a rubric (e.g., number of objects counted, colors described) before reading outputs.
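Two of the items above lend themselves to small helpers. The sketch below is illustrative only: count_colors applies one possible rubric criterion (distinct color words mentioned, using a short hypothetical word list), and mean_warm averages timings after dropping the first, cold-start run. Both names and the word list are assumptions introduced here.

```shell
# count_colors FILE
# Prints the number of distinct color words found in FILE.
# The word list is a small, hypothetical example; extend it to fit the rubric.
count_colors() {
  tr '[:upper:]' '[:lower:]' < "$1" | tr -cs '[:alpha:]' '\n' | awk '
    BEGIN {
      n = split("red orange yellow green blue purple pink brown black white gray", w, " ")
      for (i = 1; i <= n; i++) colors[w[i]] = 1
    }
    ($0 in colors) && !seen[$0]++ { count++ }
    END { print count + 0 }'
}

# mean_warm
# Reads one timing (in seconds) per line on stdin, discards the first line
# as the cold-start run, and prints the mean of the remaining values.
mean_warm() {
  awk 'NR > 1 { sum += $1; n++ }
       END { if (n) printf "%.2f\n", sum / n; else print "n/a" }'
}
```

For example, a response reading "A Red car next to a blue house; the RED door." counts two distinct colors, and timings of 12.0, 4.0, and 6.0 seconds give a warm mean of 5.00.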