InternVL 2.5 26B
Overview
InternVL 2.5 mid-tier — Shanghai AI Lab vision-language model with strong document and chart understanding.
How to run it
InternVL 2.5 26B is OpenGVLab's 26B vision-language model, the smaller sibling of InternVL 2.5 78B. It pairs a 26B text backbone with an InternViT vision encoder and targets document understanding, OCR, and visual QA. Run it at Q4_K_M via llama.cpp, using its llava multimodal tooling for vision; check llama.cpp's InternVL 26B support first, since it may differ from 78B support.

- File size: ~15 GB (Q4_K_M text) plus ~3-5 GB (vision encoder).
- Minimum VRAM: 16 GB. An RTX 4080 (16 GB) runs Q4_K_M text-only, or Q3_K_M with vision.
- Recommended: RTX 4090 (24 GB) at Q4_K_M with vision.
- Throughput: ~30-50 tok/s on an RTX 4090 at Q4_K_M text-only; vision encoding adds ~1-3 s per image.
- Architecture: the InternViT encoder is large (6B), so vision VRAM is proportionally higher than in Llama/Qwen vision models with the same text backbone size.
- Use for: document OCR, chart understanding, visual QA, UI screenshot analysis. Not for text-only general chat; use a standard ~26B text model instead.
- Context: 32K advertised; the practical range with vision at Q4 on 24 GB is 4-8K.

For a larger vision model, see InternVL 2.5 78B.
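If your llama.cpp build lacks InternVL multimodal support, the text backbone still runs like any other GGUF. A minimal sketch using llama-cpp-python, assuming a locally converted Q4_K_M file (the filename is hypothetical):

```python
# Minimal text-only sketch via llama-cpp-python. Vision needs a separate
# projector file plus InternVL support in your llama.cpp build -- verify first.
from llama_cpp import Llama

llm = Llama(
    model_path="internvl2_5-26b-q4_k_m.gguf",  # hypothetical local path
    n_gpu_layers=-1,  # offload every layer; ~15 GB VRAM at Q4_K_M
    n_ctx=8192,       # practical ceiling with vision on 24 GB is 4-8K
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three common OCR failure modes on scanned invoices."}]
)
print(out["choices"][0]["message"]["content"])
```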
Hardware guidance
Minimum: RTX 4080 16 GB at Q3_K_M with vision (tight; consistent with the 16 GB floor above). Recommended: RTX 4090 24 GB at Q4_K_M with vision (8K context).

- VRAM math: 26B text at Q4 ≈ 15 GB; InternViT encoder ~4-6 GB; KV cache at 8K ~5 GB. Total with vision: ~24-26 GB.
- RTX 4090 24 GB: Q4 with vision at 4K context is tight. Offload vision encoder activations for headroom.
- RTX 4080 16 GB: Q3_K_M with vision at 4K.
- MacBook Pro M4 Max 36 GB+: Q4 with vision at 5-10 tok/s.
- Cloud: A10 24 GB at Q4_K_M with vision.

InternViT is the bottleneck; budget 4-6 GB specifically for the vision encoder. AWQ-INT4 drops the text weights to ~13 GB, which helps the fit.
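The VRAM math above is simple enough to script. A back-of-envelope helper using this section's estimates (rough planning figures, not measurements):

```python
# Rough VRAM budget for InternVL 2.5 26B, using this page's estimates.
def vram_estimate_gb(ctx_tokens: int,
                     text_gb: float = 15.0,    # 26B text at Q4_K_M
                     vision_gb: float = 5.0,   # InternViT, midpoint of 4-6 GB
                     kv_gb_per_8k: float = 5.0) -> float:
    kv_gb = kv_gb_per_8k * ctx_tokens / 8192   # KV cache grows ~linearly with context
    return text_gb + vision_gb + kv_gb

for ctx in (4096, 8192):
    print(f"{ctx} tokens: ~{vram_estimate_gb(ctx):.1f} GB")
# 4096 tokens: ~22.5 GB  -> tight on a 24 GB RTX 4090
# 8192 tokens: ~25.0 GB  -> over 24 GB; shrink context or drop vision
```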
What breaks first
1. InternViT VRAM domination. The vision encoder is proportionally larger than the text backbone suggests: at 26B text, the 6B vision encoder takes 25-30% of total VRAM, a much higher ratio than Llama/Qwen vision models.
2. Multimodal GGUF scarcity. Pre-converted InternVL 26B GGUFs with vision are rare. You may need to convert from the Hugging Face weights or run text-only.
3. Resolution sensitivity. InternViT's quality degrades sharply on low-resolution inputs, but high-resolution inputs spike vision encoder VRAM by 3-5 GB. Find the resolution sweet spot for your use case (see the resize sketch below).
4. Tokenizer format. InternVL uses a custom vision+text tokenizer format. Standard llama.cpp llava handling may not process InternVL's multimodal token embedding correctly; validate vision outputs against a reference implementation.
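On point 3, one way to control vision VRAM is to cap image resolution before encoding. A hedged Pillow sketch; the 448 px base tile size is an assumption drawn from InternViT-448 checkpoints, and the model's own image processor should be preferred when available:

```python
# Cap the longest image side before it reaches the vision encoder.
# max_side=896 assumes 2x the 448 px InternViT tile size -- tune per use case.
from PIL import Image

def cap_resolution(path: str, max_side: int = 896) -> Image.Image:
    img = Image.open(path).convert("RGB")
    scale = min(1.0, max_side / max(img.size))  # downscale only, never upscale
    if scale < 1.0:
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    return img

page = cap_resolution("scanned_invoice.png")  # hypothetical input file
```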
Runtime recommendation
llama.cpp with a Q4_K_M GGUF is the default path, but verify InternVL 26B vision support in your build before downloading. Since pre-converted multimodal GGUFs are rare, the Transformers route (with AWQ-INT4 for the text backbone) is the fallback.
Common beginner mistakes
- Mistake: expecting InternVL 26B to have the same vision-to-text VRAM ratio as Llama 3.2 Vision. Fix: InternViT is ~6B, roughly 20× larger than a CLIP-class encoder. Budget 4-6 GB for the vision encoder alone; a 16 GB GPU may not fit vision plus text at Q4.
- Mistake: using the InternVL 26B vision projector with the 78B GGUF. Fix: different model sizes use different projectors. Match files to the exact model.
- Mistake: assuming 26B gives half the quality of 78B. Fix: the 26B is significantly weaker at complex visual reasoning. 78B is the recommendation for demanding document understanding and OCR; 26B is the budget option.
- Mistake: sending images without preprocessing. Fix: InternVL expects specific image preprocessing. Use the model's own image processor (see the Transformers sketch below) or resize to the encoder's expected input size.
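For the preprocessing mistake, the safest route is to let the repo's own code handle images. A sketch of the Transformers load path; InternVL ships custom modeling code, so trust_remote_code is required, and the exact inference helpers vary by repo revision:

```python
# Load InternVL 2.5 26B through Transformers so the repo's own image
# pipeline handles preprocessing. Unquantized bf16 needs far more memory
# than a Q4 GGUF; see the repo card for quantized loading options.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "OpenGVLab/InternVL2_5-26B"
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # pulls the custom InternVL modeling/image code
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```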
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- MIT license
- Strong on charts and documents
Weaknesses
- Smaller community than Qwen-VL
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 16.0 GB | 20 GB |
Get the model
HuggingFace
Original weights
Source repository; no pre-quantized files here, so quantize directly from these weights.
Hardware that runs this
Cards with enough VRAM for at least one quantization of InternVL 2.5 26B.
Frequently asked
What's the minimum VRAM to run InternVL 2.5 26B?
16 GB for Q4_K_M text-only (e.g., an RTX 4080). Add roughly 4-6 GB for the InternViT vision encoder, which pushes vision workloads toward 24 GB cards.
Can I use InternVL 2.5 26B commercially?
Yes. It is released under the MIT license.
What's the context length of InternVL 2.5 26B?
32K advertised; with vision at Q4 on a 24 GB card, 4-8K is the practical range.
Does InternVL 2.5 26B support images?
Yes. It is a vision-language model built for document OCR, chart understanding, visual QA, and UI screenshot analysis.
Source: huggingface.co/OpenGVLab/InternVL2_5-26B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify that InternVL 2.5 26B runs on your specific hardware before spending money.