What this does

Runs identical image-plus-prompt requests through vision models of different sizes and compares description quality, inference time, and resource consumption. By the end of this guide, you should be able to pick the model size that best fits a specific use case.

Steps

Verify both models are available. Confirms that both sizes are present locally.
```
ollama list | grep llava
```
Expected output: Both llava:7b and llava:13b listed with distinct sizes.
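
The availability check above can also be scripted so that a benchmark run stops early when a tag is missing. The following is a minimal sketch, not part of the original guide: `require_models` is a hypothetical helper name, and it assumes the model tag appears in the first column of the `ollama list` output.

```shell
# Hypothetical helper: succeeds only if every given tag appears in `ollama list`.
require_models() {
  rm_listing=$(ollama list) || return 1
  rm_missing=0
  for rm_tag in "$@"; do
    # Compare each tag against the first column (NAME) of the listing exactly.
    if ! printf '%s\n' "$rm_listing" |
        awk -v t="$rm_tag" '$1 == t { found = 1 } END { exit !found }'; then
      echo "missing model: $rm_tag (try: ollama pull $rm_tag)" >&2
      rm_missing=1
    fi
  done
  return "$rm_missing"
}

# Usage: require_models llava:7b llava:13b && echo "both models present"
```

Unlike a plain `grep llava`, this returns a non-zero status unless every requested tag is present, which makes it usable as a guard at the top of a script.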

Define a standard evaluation prompt. Consistency is essential for fair comparison.

```
echo "Provide a detailed description of this image, including objects, setting, colors, and any text visible." > /tmp/vision_prompt.txt
```

Benchmark the 7B model with timing. Records wall-clock time and captures response.
```
time ollama run llava:7b "$(cat /tmp/vision_prompt.txt)" /path/to/test_image.jpg
```
Expected output: Description text followed by real/user/sys timing.

Benchmark the 13B model with identical inputs. Uses same prompt and image.
```
time ollama run llava:13b "$(cat /tmp/vision_prompt.txt)" /path/to/test_image.jpg
```
Expected output: Typically more detailed description but longer inference time.

Evaluate results across multiple images. Run on 5+ diverse test images. Score responses on detail, accuracy, and coherence.
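
Running both models over a set of images by hand is tedious, so the loop can be scripted. The sketch below is one possible way to do it, under stated assumptions: `bench_one`, `bench_all`, `OUT_DIR`, and `RESULTS_CSV` are hypothetical names, test images are assumed to be `*.jpg` files in a single directory, and timing uses `date +%s`, so it is only accurate to whole seconds. Each model gets one discarded warm-up run so that model load time is not counted.

```shell
# Assumed locations; override these variables before calling the functions.
PROMPT_FILE=${PROMPT_FILE:-/tmp/vision_prompt.txt}
OUT_DIR=${OUT_DIR:-/tmp/vision_bench}
RESULTS_CSV=${RESULTS_CSV:-$OUT_DIR/results.csv}

# Hypothetical helper: run one model on one image, save the description,
# and append a CSV row: model,image,elapsed_seconds,output_file
bench_one() {
  b_model=$1
  b_image=$2
  mkdir -p "$OUT_DIR"
  # Build a filesystem-safe output name such as llava_7b__cat.jpg.txt
  b_out="$OUT_DIR/$(printf '%s' "$b_model" | tr ':/' '__')__$(basename "$b_image").txt"
  b_start=$(date +%s)
  ollama run "$b_model" "$(cat "$PROMPT_FILE")" "$b_image" > "$b_out"
  b_end=$(date +%s)
  printf '%s,%s,%s,%s\n' "$b_model" "$b_image" "$((b_end - b_start))" "$b_out" >> "$RESULTS_CSV"
}

# Hypothetical helper: benchmark both models on every *.jpg in directory $1.
bench_all() {
  for model in llava:7b llava:13b; do
    # Warm-up run, discarded, so load time does not skew the first image.
    ollama run "$model" "Reply with OK." > /dev/null
    for image in "$1"/*.jpg; do
      [ -e "$image" ] || continue   # skip when the glob matches nothing
      bench_one "$model" "$image"
    done
  done
}

# Usage: bench_all /path/to/test_images && cat "$RESULTS_CSV"
```

Looping over models in the outer loop keeps one model loaded while all images are processed, which avoids reloading on machines that cannot hold both models in memory at once. The saved description files can then be scored against the rubric.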

Verification

```
ollama list | grep -E "llava:7b|llava:13b" && ollama run llava:7b "Describe this image" /path/to/test.jpg
# Expected: Both tags listed with distinct file sizes; 7B produces a reasonable description
```

Common failures

inconsistent prompts - Store prompt in a file and read with $(cat file) for exact reproducibility.
cold start penalty - The first run includes model load time; discard it, or run each model twice and keep the second timing.
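different context windows - Pin the same num_ctx on both models for a fair comparison. ollama run has no --num-ctx flag; set it with /set parameter num_ctx 4096 in an interactive session, a PARAMETER num_ctx 4096 line in a Modelfile, or options.num_ctx in an API request.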
insufficient RAM for 13B - The larger model may fail to load on memory-constrained systems; check available memory (free -h on Linux) before benchmarking.
subjective scoring bias - Define a rubric (counts objects, describes colors) before reading outputs.
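
For the context-window item above, one way to guarantee both models use the same setting is to derive a variant of each with `num_ctx` baked in through a Modelfile. This is a sketch under assumptions: `write_ctx_modelfile` is a hypothetical helper, the `-ctx4096` model names are illustrative rather than existing tags, and 4096 is only an example value.

```shell
# Hypothetical helper: write a Modelfile that derives from a base model
# and pins the context window size.
#   $1 = base model tag, $2 = num_ctx value, $3 = Modelfile path to write
write_ctx_modelfile() {
  mf_base=$1
  mf_ctx=$2
  mf_path=$3
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$mf_base" "$mf_ctx" > "$mf_path"
}

# Usage (derived model names are illustrative):
#   write_ctx_modelfile llava:7b 4096 /tmp/Modelfile.7b
#   ollama create llava-7b-ctx4096 -f /tmp/Modelfile.7b
#   write_ctx_modelfile llava:13b 4096 /tmp/Modelfile.13b
#   ollama create llava-13b-ctx4096 -f /tmp/Modelfile.13b
```

Benchmarking the two derived models instead of the base tags keeps the context setting identical without relying on per-session commands.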

How to compare vision model outputs across different model sizes