Molmo 72B
Molmo flagship. Apache 2.0 VLM rivaling proprietary models on UI pointing and visual reasoning.
Positioning
Molmo 72B is a dense 72-billion-parameter vision-language model (VLM) released by the Allen Institute for AI (Ai2) under the permissive Apache 2.0 license. It has a 4,096-token context window, targets datacenter-tier deployments, and is particularly suited to agent UI tasks such as visual pointing and reasoning. As an open-weight model, it offers a transparent alternative to proprietary VLMs.
Strengths
- Fully open Apache 2.0 license: Unrestricted commercial use, modification, and redistribution — no gating or royalties.
- Dense architecture at 72B parameters: Full parameter count is active for every forward pass, providing maximum representational capacity for complex visual-language tasks.
- Designed for UI agent use cases: The model is specifically optimized for visual pointing and reasoning, making it a strong candidate for automating graphical user interfaces.
- Usable quantizations at datacenter scale: At Q4_K_M (40.5 GB) or Q8_0 (77 GB), the model should retain most of its precision while fitting into multi-GPU datacenter configurations.
Limitations
- Datacenter-class deployment at full precision: FP16 weights take ~144 GB on disk, and GPU memory needs run another ~30-50% higher once KV cache and overhead are included, so full-precision inference is out of reach for consumer or workstation hardware.
- Short context window (4,096 tokens): Compared to many modern LLMs offering 32K–128K contexts, this limits the model's ability to process long documents or multi-turn interactions.
- No community benchmarks available: We do not yet have independent measurements of real-world performance; published vendor metrics should be treated as best-case.
- Dense architecture means high compute cost: Unlike Mixture-of-Experts models that activate only a fraction of parameters per token, Molmo 72B uses all 72B parameters for every forward pass, requiring substantial GPU compute and memory bandwidth.
What it takes to run this locally
Quantized model file sizes (disk):
- FP16: ~144 GB
- Q8_0: ~77 GB
- Q6_K: ~59.4 GB
- Q5_K_M: ~51.3 GB
- Q4_K_M: ~40.5 GB
- Q3_K_M: ~35.1 GB
- Q2_K: ~23.4 GB
Add approximately 30–50% overhead for KV cache and framework memory at typical context lengths. At FP16 or Q8_0 this model is firmly in the datacenter deployment class: it requires multiple high-end GPUs (e.g., 4–8× A100 80GB or H100). No single consumer GPU can accommodate it, and a single 48 GB workstation card is workable only at aggressive quantization (Q3_K_M or lower) with a short context.
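The overhead rule above can be turned into a quick check. This is a minimal sketch, assuming the 30–50% figure is applied to the whole file size; the function names `memory_range_gb` and `quants_that_fit` are illustrative, not part of any tool, and the file sizes are copied from the list above.

```python
# File sizes in GB, copied from the quantization list above.
QUANT_FILE_GB = {
    "FP16": 144.0,
    "Q8_0": 77.0,
    "Q6_K": 59.4,
    "Q5_K_M": 51.3,
    "Q4_K_M": 40.5,
    "Q3_K_M": 35.1,
    "Q2_K": 23.4,
}


def memory_range_gb(file_gb, low=0.30, high=0.50):
    """Return (best, worst) memory estimates: file size plus 30-50% overhead."""
    return file_gb * (1 + low), file_gb * (1 + high)


def quants_that_fit(total_vram_gb):
    """List quantizations whose worst-case estimate fits in total_vram_gb."""
    return [
        name
        for name, size in QUANT_FILE_GB.items()
        if memory_range_gb(size)[1] <= total_vram_gb
    ]


if __name__ == "__main__":
    for name, size in QUANT_FILE_GB.items():
        best, worst = memory_range_gb(size)
        print(f"{name:7s} {size:6.1f} GB file -> {best:6.1f} to {worst:6.1f} GB")
    print("Fits in one 80 GB card:", quants_that_fit(80))
    print("Fits in 4 x 80 GB:", quants_that_fit(4 * 80))
```

Because the filter uses the worst-case (50%) estimate, it is deliberately conservative; a quantization it rejects may still load at a short context.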
Should you run this locally?
Yes if you have access to multi-GPU datacenter hardware and need a fully open, Apache 2.0 licensed VLM for UI agent or visual reasoning tasks where transparency and customizability are paramount.
No if you lack multi-GPU infrastructure, require longer context windows, or need a model that can run on a single consumer GPU. For those cases, consider smaller VLMs or quantized models in the 7B–13B range.
Catalog cross-links
- Molmo 7B — smaller sibling for consumer hardware
- Qwen2-VL 72B — alternative open VLM with longer context
- A100 GPU — typical datacenter GPU for running 72B models
How to run it
Molmo 72B is Ai2's vision-language model: a 72B dense backbone with a custom vision encoder, designed for strong visual understanding with a focus on pointing/grounding. Its distinguishing feature is pixel-precise pointing. It can identify regions in images by coordinates, which is useful for UI automation, visual QA with grounding, and robotics. The license is permissive (Apache 2.0).

Run at Q4_K_M via llama.cpp with its llava-style multimodal server for vision. Q4_K_M file size is ~41 GB (text) + ~3-5 GB (vision). Minimum VRAM: 48 GB (RTX A6000 at Q3_K_M with vision). Recommended: A100 80GB at AWQ-INT4. Throughput: ~12-20 tok/s on A6000 at Q4_K_M text-only; vision adds 1-3s of encoding per image.

Ecosystem support is narrower than for Llama/Qwen vision models, so verify llama.cpp Molmo support before you start. Ollama may not have Molmo; use raw llama.cpp instead. For serving, use vLLM if Molmo is registered as a supported architecture.
Hardware guidance
Minimum: RTX A6000 48GB at Q3_K_M + vision (4K context). Recommended: A100 80GB at AWQ-INT4.
VRAM math: 72B dense at Q4_K_M ≈ 41 GB, plus 3-5 GB for the Molmo vision encoder and ~10 GB of KV cache at 8K, for a total of ~54-56 GB. The 8K and 16K+ figures exceed the 4,096-token context window listed for this model, so treat them as headroom rather than usable context.
- RTX A6000 48GB: Q3_K_M (31 GB) + vision at 4K context.
- A100 80GB: comfortable for Q4 + vision + 8K. AWQ-INT4 enables 16K+ context.
- Dual RTX 4090: row-split the text model, with vision VRAM split across cards.
- Mac Studio M4 Max 128GB: Q4_K_M + vision, 2-5 tok/s (Molmo support on Apple Silicon uncertain).
- Cloud: A100 at $5-10/hr.
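The VRAM math above is a simple component sum, sketched here so the figures can be swapped for your own. The helper names are hypothetical, and the 5 GB KV cache value used for the 4K case is an assumption (half the quoted 8K figure); the other numbers are the ones quoted in this section.

```python
def total_vram_gb(weights_gb, vision_gb, kv_cache_gb):
    """Sum the three components of the VRAM budget, in GB."""
    return weights_gb + vision_gb + kv_cache_gb


def fits(card_gb, weights_gb, vision_gb, kv_cache_gb):
    """True if the summed budget fits within the card's VRAM."""
    return total_vram_gb(weights_gb, vision_gb, kv_cache_gb) <= card_gb


if __name__ == "__main__":
    # Q4_K_M + vision + 8K KV cache, using the figures quoted above.
    low = total_vram_gb(41, 3, 10)
    high = total_vram_gb(41, 5, 10)
    print(f"Q4_K_M at 8K: {low}-{high} GB")
    print("fits A100 80GB:", fits(80, 41, 5, 10))
    print("fits A6000 48GB:", fits(48, 41, 3, 10))
    # Q3_K_M at 4K; the 5 GB KV figure is an assumption (half the 8K value).
    print("Q3_K_M at 4K fits A6000 48GB:", fits(48, 31, 5, 5))
```

The sum leaves no margin for framework buffers or fragmentation, so a result that only just fits should be treated as a likely failure in practice.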
What breaks first
- Molmo GGUF availability. Pre-converted Molmo GGUFs are rare. You may need to convert from the Hugging Face weights using Ai2's conversion script. Verify GGUF or AWQ availability before provisioning hardware.
- Pointing/grounding in local inference. Molmo's coordinate outputs rely on specific output formatting tokens. llama.cpp may not parse these correctly, so verify that coordinate outputs are well-formed before trusting results.
- Vision encoder compatibility. Molmo uses a custom vision encoder (not CLIP, not InternViT). llama.cpp's standard llava implementation may not support it without model-specific patches.
- Apache 2.0 but verify. While Molmo is Apache 2.0 licensed, the vision encoder or training data may have additional restrictions. Check the full license on huggingface.co/allenai/Molmo-72B-0924.
Common beginner mistakes
- Mistake: Expecting Molmo to work with standard Ollama vision commands. Fix: Molmo requires custom model registration in llama.cpp. Test with raw llama.cpp and verify the multimodal GGUF.
- Mistake: Ignoring the pointing/grounding output format. Fix: Molmo outputs coordinates in a specific format. Parse these explicitly rather than treating them as regular text.
- Mistake: Using a Llama 3.2 Vision mmproj with Molmo. Fix: Vision projectors are architecture-specific. Download or convert the Molmo-specific projector.
- Mistake: Assuming Molmo's text quality matches a same-sized general-purpose model such as Qwen2.5 72B. Fix: Molmo is optimized for vision grounding, so general text quality may be lower. Test text-only tasks before deploying.
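Parsing the pointing output explicitly might look like the sketch below. It assumes Molmo emits XML-like tags such as `<point x="50.0" y="25.0" alt="...">` and `<points x1=".." y1=".." x2=".." y2=".." ...>`, with coordinates given as percentages of image width and height. That format is an assumption based on Ai2's published examples, so confirm it against real output from your runtime. The function name `parse_points` is illustrative.

```python
import re

# Assumed tag shapes: <point x=".." y=".."> and <points x1=".." y1=".." ...>.
_TAG = re.compile(r"<points?\s+([^>]*)>", re.IGNORECASE)
# Captures the axis (x or y), an optional index, and the numeric value.
_COORD = re.compile(r'\b([xy])(\d*)\s*=\s*"(-?\d+(?:\.\d+)?)"')


def parse_points(text, width, height):
    """Return a list of (x_px, y_px) tuples for every point found in text.

    Coordinates are assumed to be percentages (0-100) of the image size.
    Raises ValueError on unpaired or out-of-range coordinates.
    """
    points = []
    for tag in _TAG.finditer(text):
        xs, ys = {}, {}
        for axis, idx, value in _COORD.findall(tag.group(1)):
            (xs if axis == "x" else ys)[idx] = float(value)
        if not xs or xs.keys() != ys.keys():
            raise ValueError(f"unpaired coordinates in {tag.group(0)!r}")
        for idx in sorted(xs, key=lambda k: int(k or 0)):
            x, y = xs[idx], ys[idx]
            if not (0 <= x <= 100 and 0 <= y <= 100):
                raise ValueError(f"coordinate out of range in {tag.group(0)!r}")
            # Convert percentages to pixel positions.
            points.append((x * width / 100, y * height / 100))
    return points


if __name__ == "__main__":
    reply = '<point x="50.0" y="25.0" alt="Submit button">Submit button</point>'
    print(parse_points(reply, width=1920, height=1080))
```

Raising on malformed tags, rather than silently skipping them, makes a runtime that mangles the formatting tokens fail loudly before a UI agent clicks on a bad coordinate.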
Strengths
- Apache 2.0
- Frontier UI grounding
Weaknesses
- 48GB+ VRAM tier
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 41.0 GB | 48 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
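What's the minimum VRAM to run Molmo 72B?
48 GB, for example an RTX A6000 at Q3_K_M with vision at 4K context. An A100 80GB is recommended.
Can I use Molmo 72B commercially?
Yes. Molmo 72B is released under Apache 2.0, which permits commercial use, modification, and redistribution. Check the full license on the model page for any additional restrictions.
What's the context length of Molmo 72B?
4,096 tokens.
Does Molmo 72B support images?
Yes. It is a vision-language model and can point to image regions by coordinates.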
Source: huggingface.co/allenai/Molmo-72B-0924
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.