by UW-Madison + Microsoft Research + community
The pioneering open-weight VLM family: LLaVA-1.5, LLaVA-NeXT, LLaVA-OneVision. Established the VLM training recipe that Qwen-VL + InternVL later refined.
Start with LLaVA 1.6 13B at Q4_K_M via Ollama — fits on a single RTX 3060 12 GB at ~9 GB VRAM (~8 GB quantized text backbone + ~600 MB vision encoder + projector). LLaVA 1.6 (formerly LLaVA-NeXT) is the reference open-weight vision-language model — the LLaVA family pioneered the CLIP-vision-encoder + Llama-language-backbone + projector architecture that most subsequent open VLMs have adapted. The 13B variant, built on a Vicuna-13B backbone, delivers solid general VQA and image description. For higher quality, LLaVA 1.6 34B Q4 (~22 GB) fits on an RTX 4090 24 GB. Skip LLaVA 1.5 — the 1.6 release adds dynamic high-resolution input (AnyRes), which is essential for document/chart understanding. However, InternVL2 now outperforms LLaVA on most VQA benchmarks — LLaVA's remaining advantages are a simpler architecture and broader runtime support. The code is Apache 2.0; the weights inherit the base LLM's license (Llama 2 for the Vicuna variants).
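A minimal single-image VQA sketch using the Ollama Python client — assuming `pip install ollama`, a running Ollama daemon, and that `ollama pull llava:13b` has already fetched the weights; the image path is a placeholder:

```python
# Minimal LLaVA 1.6 VQA sketch via the Ollama Python client.
# Assumes: the Ollama daemon is running and `ollama pull llava:13b`
# has already downloaded the Q4_K_M weights.
import ollama

response = ollama.chat(
    model="llava:13b",
    messages=[{
        "role": "user",
        "content": "Describe this chart and summarize its main trend.",
        "images": ["./example_chart.png"],  # placeholder path
    }],
)
print(response["message"]["content"])
```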
For single-user local: Ollama + llava:13b Q4_K_M on an RTX 3060 12 GB. Ollama's LLaVA support is the most mature VLM deployment path — ollama run llava:13b downloads both the LLM backbone and the CLIP vision encoder with correctly configured projector weights. For multi-user serving: vLLM 0.6.1+ with its LLaVA multimodal backend on an L40S 48 GB — it can batch on the order of 200 concurrent VQA requests. The CLIP vision encoder (ViT-L/14-336, ~300M parameters, ~600 MB at FP16) is lightweight and should be kept in GPU memory alongside the text backbone — don't offload it. For document-understanding pipelines: deploy LLaVA with tiled AnyRes preprocessing (max 4 tiles for 13B, 6 tiles for 34B) — each tile is a separate vision-encoder forward pass, adding roughly 100 ms per tile on an RTX 4090. For image-only workloads that don't need text generation, skip LLaVA and use the CLIP vision encoder directly via Transformers.
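For the vLLM path, a sketch of offline batched inference with vLLM's multimodal API — model name and prompt template follow the llava-hf conversions on Hugging Face; the image path and sampling settings are illustrative assumptions, and a production multi-user deployment would use the OpenAI-compatible `vllm serve` endpoint instead:

```python
# Offline VQA sketch with vLLM's multimodal API (vLLM 0.6.x).
# Assumes: `pip install vllm pillow`; llava-hf checkpoint naming.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-v1.6-vicuna-13b-hf", max_model_len=4096)
image = Image.open("./example_doc.png")  # placeholder path

# LLaVA 1.6 Vicuna prompt format; <image> marks where tiles are injected.
prompt = "USER: <image>\nWhat does this document say? ASSISTANT:"
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)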
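And for the image-only case, a sketch of running LLaVA's vision tower (CLIP ViT-L/14-336) standalone via Transformers to get patch embeddings, with no language model loaded — the image path is a placeholder:

```python
# Standalone CLIP vision-tower sketch via Transformers.
# Assumes: `pip install transformers pillow torch` and a CUDA GPU.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
vision_tower = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-large-patch14-336", torch_dtype=torch.float16
).to("cuda").eval()

image = Image.open("./example.jpg")  # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to("cuda", torch.float16)

with torch.no_grad():
    # 336 px / 14 px patches -> 24x24 = 576 patches + 1 CLS token
    features = vision_tower(pixel_values=pixel_values).last_hidden_state
print(features.shape)  # torch.Size([1, 577, 1024])
```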
Verify LLaVA runs on your specific hardware before committing money.