InternVL 2.5 78B
InternVL 2.5 flagship. Approaches frontier proprietary VLMs on document and OCR tasks.
Overview
InternVL 2.5 78B is OpenGVLab's flagship open-weight vision-language model. It pairs the 6B-parameter InternViT vision encoder with a ~72B-parameter language backbone and approaches frontier proprietary VLMs on document-understanding and OCR-heavy tasks.
How to run it
InternVL 2.5 78B is OpenGVLab's flagship multimodal model: a ~72B-parameter language backbone (Qwen2.5-72B at this size) paired with the 6B-parameter InternViT vision encoder, ~78B parameters total. Run it at Q4_K_M via llama.cpp's server (llama-server with the matching mmproj file) or a vLLM multimodal pipeline. Q4_K_M weights are ~45 GB (text) plus ~4-6 GB (vision). Minimum VRAM is 48 GB: an RTX A6000 handles Q3_K_M with vision, or text-only Q4_K_M. Recommended: an A100 80GB at AWQ-INT4 for full vision serving. Expect ~8-15 tok/s on an A6000 at Q4_K_M text-only; vision encoding adds ~2-4 s per image. Unlike Llama-based vision models, InternVL uses a custom architecture (InternViT plus an InternLM2.5 or Qwen2.5 backbone, depending on size), so ecosystem support is narrower; check llama.cpp's InternVL support before provisioning. Ollama may not carry InternVL 2.5, so use llama.cpp directly. For production serving, use vLLM with custom model registration (if supported). InternVL is known for strong vision-language benchmark results, especially on document understanding and OCR-heavy tasks.
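Once the server is up, a quick smoke test is to send an image through the OpenAI-compatible chat endpoint that both llama-server and vLLM expose. This is a minimal sketch, assuming defaults that may not match your setup: the port, endpoint path, served-model name, and filename below are placeholders to adjust to your actual launch flags.

```python
import base64
import requests

# Assumes llama-server (or vLLM) is serving the InternVL GGUF with its mmproj
# on localhost:8080 via the OpenAI-compatible chat API. Port and model name
# are placeholders; match them to your launch command.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "internvl2_5-78b",  # hypothetical served-model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```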
Hardware guidance
Minimum: RTX A6000 48GB at Q3_K_M with vision (tight). Recommended: A100 80GB at AWQ-INT4. VRAM math: the quantized text weights at Q4_K_M take ≈45 GB; the InternViT encoder adds ~5-8 GB (varies with input resolution); KV cache at 8K context adds ~12 GB, for a total of ~62-65 GB with vision. A single A6000 48GB is 15+ GB short, so it must drop to Q3_K_M or run text-only at Q4_K_M. Dual RTX 3090s (48 GB total) manage Q4_K_M text-only or Q3_K_M with vision. An A100 80GB is comfortable for Q4 plus vision plus 8K context. A Mac Studio with 128GB unified memory can hold Q4_K_M with vision at 2-5 tok/s, though Apple Silicon support for InternVL is uncertain. Cloud: A100 at $5-10/hr. InternViT is large; expect 2-3× the vision-encoder VRAM of Llama 3.2 Vision's CLIP encoder.
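The budget above is simple arithmetic, sketched below so you can plug in your own quantization and context length. The bits-per-weight figures, the vision-encoder midpoint, and the linear KV-cache scaling are rough assumptions mirroring the numbers in this section, not measurements.

```python
# Back-of-envelope VRAM budget for InternVL 2.5 78B. Constants mirror the
# guidance above (~45 GB text weights at Q4_K_M, 5-8 GB vision, ~12 GB KV
# cache at 8K context); treat the output as a sanity check, not a guarantee.
GB = 1e9

def vram_estimate(bits_per_weight: float, ctx_tokens: int) -> float:
    text_weights = 72e9 * bits_per_weight / 8 / GB  # Qwen2.5-72B backbone
    vision = 6.5                                    # InternViT, midpoint of 5-8 GB
    kv_cache = 12.0 * ctx_tokens / 8192             # linear-scaling assumption
    return text_weights + vision + kv_cache

print(f"Q4_K_M (~5.0 bpw), 8K ctx: {vram_estimate(5.0, 8192):.0f} GB")  # ~64 GB
print(f"Q3_K_M (~3.9 bpw), 2K ctx: {vram_estimate(3.9, 2048):.0f} GB")  # ~45 GB
```

The second line shows why the A6000 configuration is tight: Q3_K_M with vision only fits 48 GB if you also shrink the context.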
What breaks first
1. InternVL architecture support. llama.cpp's InternVL support is experimental; vision features may not project correctly, producing garbled image descriptions. Validate against reference outputs from the official InternVL GitHub repo (a minimal self-check harness is sketched below).
2. InternViT VRAM bloat. The InternViT encoder is 6B+ parameters, far larger than typical vision encoders (CLIP is ~300M). At high resolutions, InternViT activations can spike to 10-15 GB.
3. Tokenizer incompatibility. InternVL may use a different tokenizer than standard LLaMA. The wrong tokenizer silently produces incorrect image-token embeddings.
4. Multimodal GGUF availability. Pre-converted multimodal GGUFs for InternVL are less common than for Llama 3.2 Vision; you may need to convert from the Hugging Face weights yourself.
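One cheap way to catch a broken vision projector is to render an image whose content you control and check that the model reads it back. A minimal sketch, assuming the placeholder server endpoint and model name from the earlier example; the default Pillow font is small, so swap in a larger TrueType font for a more reliable check.

```python
# Generate a test image with known text, send it to the local server, and
# check the transcription. Garbled or unrelated output suggests the mmproj
# is wrong or the architecture is not fully supported.
import base64, io
import requests
from PIL import Image, ImageDraw

KNOWN_TEXT = "QUARTERLY REVENUE 2024"

img = Image.new("RGB", (448, 448), "white")
ImageDraw.Draw(img).text((20, 200), KNOWN_TEXT, fill="black")
buf = io.BytesIO()
img.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder endpoint
    json={
        "model": "internvl2_5-78b",  # hypothetical served-model name
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "What text appears in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
        "max_tokens": 64,
    },
    timeout=120,
)
answer = resp.json()["choices"][0]["message"]["content"]
print("PASS" if "QUARTERLY" in answer.upper() else f"FAIL: {answer}")
```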
Runtime recommendation
Use llama.cpp (llama-server with the InternVL mmproj) for single-user local inference; move to vLLM with custom model registration for production serving, if your vLLM version supports the architecture.
Common beginner mistakes
- Mistake: Using a Llama 3.2 Vision mmproj with the InternVL text GGUF. Fix: Vision projectors are architecture-specific; download the InternVL mmproj from the InternVL Hugging Face repo.
- Mistake: Assuming InternVL works with standard Ollama vision tags. Fix: InternVL requires custom model registration; use llama.cpp directly with the correct multimodal GGUF.
- Mistake: Sending high-resolution images and expecting InternViT to handle them. Fix: InternViT is large but has fixed input-resolution limits; resize images to the encoder's expected size to avoid OOM (see the sketch after this list).
- Mistake: Expecting InternVL to match Llama 3.2 Vision's VRAM footprint. Fix: InternViT is 5-10× larger than CLIP, so vision VRAM is proportionally higher; budget an extra 5-10 GB.
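A simple pre-send guard keeps InternViT's activation memory bounded. The 448-pixel tile size below matches InternViT's published input resolution, but the maximum-tile cap is an assumption to tune against your VRAM headroom.

```python
# Downscale an image so it yields at most MAX_TILES 448x448 tiles when the
# server does dynamic-resolution tiling, capping InternViT activation memory.
from PIL import Image

TILE = 448        # InternViT input resolution
MAX_TILES = 6     # assumed budget; lower it if you see vision-side OOM

def bound_image(path: str, out_path: str) -> None:
    img = Image.open(path)
    w, h = img.size
    tiles = (w // TILE + 1) * (h // TILE + 1)  # rough tile count
    if tiles > MAX_TILES:
        scale = (MAX_TILES / tiles) ** 0.5
        img = img.resize((max(TILE, int(w * scale)),
                          max(TILE, int(h * scale))))
    img.save(out_path)

bound_image("scan_full_res.png", "scan_bounded.png")
```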
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent/child edges record direct distillation or fine-tune relationships.
Strengths
- MIT license
- Frontier-tier OCR
Weaknesses
- 48GB+ VRAM tier
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 45.0 GB | 52 GB |
Get the model
HuggingFace
Original weights
Source repository. Pre-converted GGUFs are scarce, so expect to quantize from these weights yourself.
Hardware that runs this
Cards with enough VRAM for at least one quantization of InternVL 2.5 78B.
Frequently asked
What's the minimum VRAM to run InternVL 2.5 78B? 48 GB, and only at Q3_K_M with vision or Q4_K_M text-only; 80 GB is the comfortable tier.
Can I use InternVL 2.5 78B commercially? Yes. The weights are MIT-licensed.
What's the context length of InternVL 2.5 78B? The VRAM budgets on this page assume 8K context; check the Hugging Face model card for the architecture's maximum.
Does InternVL 2.5 78B support images? Yes. Vision-language input is the model's core strength, particularly document understanding and OCR.
Source: huggingface.co/OpenGVLab/InternVL2_5-78B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify InternVL 2.5 78B runs on your specific hardware before committing money.