HOW-TO · INF

How to compare vision model outputs across different model sizes

Advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Multiple vision models of different sizes installed in Ollama (e.g., llava:7b and llava:13b), test images, timing script or terminal access

What this does

Runs identical image-plus-prompt requests through vision models of different sizes and compares description quality, inference time, and resource consumption. By the end of this guide you should be able to pick the model size that best fits a specific use case.

Steps

  1. Verify all models are available. Confirms both sizes are present locally.

    ollama list | grep llava
    

    Expected output: Both llava:7b and llava:13b listed with distinct sizes.

  2. Define a standard evaluation prompt. Consistency is essential for fair comparison.

    echo "Provide a detailed description of this image, including objects, setting, colors, and any text visible." > /tmp/vision_prompt.txt
    
  3. Benchmark the 7B model with timing. Records wall-clock time and captures response.

    time ollama run llava:7b "$(cat /tmp/vision_prompt.txt)" /path/to/test_image.jpg
    

    Expected output: Description text followed by real/user/sys timing.

  4. Benchmark the 13B model with identical inputs. Uses same prompt and image.

    time ollama run llava:13b "$(cat /tmp/vision_prompt.txt)" /path/to/test_image.jpg
    

    Expected output: Typically more detailed description but longer inference time.

  5. Evaluate results across multiple images. Repeat steps 3 and 4 on at least five diverse test images, then score each response on detail, accuracy, and coherence.
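
Steps 3 to 5 can be scripted so every model sees exactly the same prompt and images. The following is a minimal sketch, not a definitive harness: the function names `time_run` and `benchmark_all`, the output directory layout, and the `*.jpg` glob are assumptions made for illustration, and `date +%s.%N` assumes GNU date as shipped with Ubuntu. One throwaway request per model is sent first so the load time is not counted.

```shell
#!/usr/bin/env bash
# Sketch only: function names, file layout, and the *.jpg glob are illustrative.

# time_run MODEL PROMPT_FILE IMAGE OUT_DIR
# Saves the description to a file and prints "model<TAB>image<TAB>seconds".
time_run() {
  local model="$1" prompt_file="$2" image="$3" out_dir="$4"
  local safe out_file start end secs
  # Turn "llava:7b" into "llava_7b" so it is safe in a file name.
  safe=$(printf '%s' "$model" | tr ':/' '__')
  out_file="$out_dir/${safe}__$(basename "$image").txt"
  start=$(date +%s.%N)
  ollama run "$model" "$(cat "$prompt_file")" "$image" > "$out_file"
  end=$(date +%s.%N)
  secs=$(awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f", e - s }')
  printf '%s\t%s\t%s\n' "$model" "$image" "$secs"
}

# benchmark_all PROMPT_FILE IMAGE_DIR OUT_DIR MODEL...
# Prints one TSV line per (model, image) pair on stdout.
benchmark_all() {
  local prompt_file="$1" image_dir="$2" out_dir="$3"
  shift 3
  local model image
  mkdir -p "$out_dir"
  for model in "$@"; do
    # Warm-up: load the model once and discard the result (cold start penalty).
    ollama run "$model" "Reply with OK." > /dev/null
    for image in "$image_dir"/*.jpg; do
      [ -e "$image" ] || continue   # skip if the glob matched nothing
      time_run "$model" "$prompt_file" "$image" "$out_dir"
    done
  done
}

# Example call (paths are placeholders; to be safe, pass an absolute image
# directory or one starting with ./ so the path is recognized as a file):
#   benchmark_all /tmp/vision_prompt.txt /path/to/test_images /tmp/vision_out \
#     llava:7b llava:13b > /tmp/vision_timings.tsv
```

For resource consumption, `ollama ps` in a second terminal lists the models currently loaded and their memory footprint, and `ollama run --verbose` prints per-request timing statistics alongside the response.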

Verification

ollama list | grep -E "llava:7b|llava:13b" && ollama run llava:7b "Describe this image" /path/to/test.jpg
# Expected: Both tags listed with distinct file sizes; 7B produces a reasonable description
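
The `grep -E` check above succeeds as soon as either tag matches, and a plain `grep llava` would also match unrelated tags that merely contain the string. A stricter check is sketched below, assuming the first column of `ollama list` is the model name; the function name `require_models` is hypothetical.

```shell
# require_models TAG...
# Succeeds only if every tag appears as an exact name in `ollama list`.
require_models() {
  local listing tag missing=0
  listing=$(ollama list) || return 2
  for tag in "$@"; do
    # Compare against column 1 (NAME) exactly, e.g. "llava:7b".
    if printf '%s\n' "$listing" |
       awk -v t="$tag" '$1 == t { found = 1 } END { exit !found }'; then
      echo "ok       $tag"
    else
      echo "missing  $tag" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_models llava:7b llava:13b || echo "pull the missing tags first"
```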

Common failures

  • inconsistent prompts - Store prompt in a file and read with $(cat file) for exact reproducibility.
  • cold start penalty - First run after model load incurs overhead; discard first run or run each model twice.
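  • different context windows - ollama run has no --num-ctx flag; pin num_ctx to the same value for both models, for example with a PARAMETER num_ctx 4096 line in a Modelfile (built with ollama create) or with options.num_ctx in an API request.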
  • insufficient RAM for 13B - Larger model may fail on memory-constrained systems; check with free -h before benchmarking.
  • subjective scoring bias - Define a rubric (e.g., number of objects identified, colors described correctly) before reading outputs.
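
Once each response has been scored against the rubric, the per-model averages can be computed mechanically. The sketch below assumes a hand-written, tab-separated score file with no header and the columns model, image, detail, accuracy, coherence, scored on whatever numeric scale the rubric defines; both the file layout and the function name `summarize_scores` are assumptions for illustration.

```shell
# summarize_scores FILE
# Prints "model<TAB>mean detail<TAB>mean accuracy<TAB>mean coherence" per model.
# Expected input columns (tab-separated, no header):
#   model  image  detail  accuracy  coherence
summarize_scores() {
  awk -F '\t' '
    { n[$1]++; d[$1] += $3; a[$1] += $4; c[$1] += $5 }
    END {
      for (m in n)
        printf "%s\t%.2f\t%.2f\t%.2f\n", m, d[m]/n[m], a[m]/n[m], c[m]/n[m]
    }' "$1" | sort
}

# Example: summarize_scores /tmp/vision_scores.tsv
```

Reading the averages next to the timings from the benchmark runs makes the trade-off explicit: a larger model is only worth its extra latency if its scores are higher on the criteria that matter for the use case.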
