InternVL 2.5 78B

78B parameters · Commercial OK · Multimodal · Reviewed May 2026


License: MIT·Released Dec 5, 2024·Context: 32,768 tokens

Overview

The InternVL 2.5 flagship. It approaches frontier proprietary VLMs on document and OCR tasks.

How to run it

InternVL 2.5 78B is OpenGVLab's flagship multimodal model: a ~72B language backbone (Qwen2.5-72B-Instruct in the official release) paired with the ~6B InternViT vision encoder, roughly 78B parameters in total. Run it at Q4_K_M via llama.cpp's multimodal path (llama-server with the matching mmproj file) or vLLM's multimodal pipeline. Q4_K_M file size: ~45 GB (text) + ~4-6 GB (vision). Minimum VRAM: 48 GB — an RTX A6000 at Q3_K_M with vision, or text-only at Q4_K_M. Recommended: A100 80GB at AWQ-INT4 for full vision serving. Throughput: ~8-15 tok/s on an A6000 at Q4_K_M text-only; vision encoding adds ~2-4 s per image. InternVL is a custom architecture (InternViT plus an InternLM2.5 or Qwen2.5 backbone, depending on model size), so ecosystem support is narrower than for Llama-based vision models — check your llama.cpp version's InternVL support before provisioning. Ollama may not carry InternVL 2.5; use llama.cpp directly. For production serving, use vLLM if your version registers the InternVL architecture. InternVL is known for strong vision-language benchmarks, especially document understanding and OCR-heavy tasks.
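A minimal serving sketch with vLLM, assuming a build that registers the InternVL architecture; the AWQ repo name, parallelism, and context cap are assumptions to adjust for your own rig:

```python
# Hypothetical vLLM serving sketch. Assumes InternVL is a registered
# architecture in your vLLM version and the AWQ repo exists on the Hub.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="OpenGVLab/InternVL2_5-78B-AWQ",  # assumed AWQ-INT4 repo; verify on the Hub
    trust_remote_code=True,                 # InternVL ships custom modeling code
    tensor_parallel_size=2,                 # e.g. 2x A100 80GB; match your GPUs
    max_model_len=8192,                     # keep the KV cache inside budget
)

image = Image.open("invoice.png").convert("RGB")
prompt = "<image>\nExtract every line item and the total from this document."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=512, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```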

Hardware guidance

Minimum: RTX A6000 48GB at Q3_K_M + vision (tight). Recommended: A100 80GB at AWQ-INT4. VRAM math: 78B dense weights at Q4_K_M ≈ 45 GB; InternViT encoder ~5-8 GB (varies with input resolution); KV cache at 8K context ~12 GB; total with vision ~62-65 GB. A single A6000 48GB is 15+ GB short — drop to Q3_K_M or run text-only at Q4_K_M. Dual RTX 3090 (48 GB total): Q4_K_M text-only or Q3_K_M + vision. A100 80GB: comfortable for Q4 + vision + 8K context. Mac Studio M4 Ultra 128GB: Q4_K_M + vision at 2-5 tok/s (Apple Silicon InternVL support is uncertain). Cloud: A100 at $5-10/hr. InternViT is large — expect 2-3× the vision-encoder VRAM of Llama 3.2 Vision's much smaller encoder.
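The VRAM arithmetic above, as a back-of-envelope script; every constant is this page's estimate rather than a measured value:

```python
# Back-of-envelope VRAM estimate using this page's own numbers.
def vram_estimate_gb(
    params_b: float = 72.0,        # language-backbone parameters, in billions
    bits_per_weight: float = 4.7,  # Q4_K_M averages slightly above 4 bits/weight
    vision_gb: float = 6.0,        # InternViT weights + activations (resolution-dependent)
    kv_cache_gb: float = 12.0,     # ~8K context, per the guidance above
    overhead_gb: float = 2.0,      # CUDA context, buffers, fragmentation
) -> float:
    weights_gb = params_b * bits_per_weight / 8  # ~42 GB at Q4_K_M
    return weights_gb + vision_gb + kv_cache_gb + overhead_gb

print(f"Q4_K_M + vision + 8K KV: ~{vram_estimate_gb():.0f} GB")  # ~62 GB
print(f"Text-only, short context: ~{vram_estimate_gb(vision_gb=0, kv_cache_gb=4):.0f} GB")  # ~48 GB
```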

What breaks first

  1. InternVL architecture support. llama.cpp's InternVL support is experimental — vision features may not project correctly, causing garbled image descriptions. Validate against reference outputs from the official InternVL GitHub repo.
  2. InternViT VRAM bloat. The InternViT encoder is 6B+ parameters, significantly larger than typical vision encoders (CLIP ViT-L is ~300M). At high resolutions, InternViT activations can spike to 10-15 GB.
  3. Tokenizer incompatibility. InternVL's language backbones do not all share the standard LLaMA tokenizer. Using the wrong tokenizer silently produces incorrect image-token embeddings; see the sketch below for a quick check.
  4. Multimodal GGUF availability. Pre-converted multimodal GGUFs for InternVL are less common than for Llama 3.2 Vision. You may need to convert from the Hugging Face weights yourself.
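A quick tokenizer sanity check, sketched with transformers; the probe string's image tokens are assumptions, since InternVL's special tokens vary by release, so compare against the official repo's tokenizer_config.json:

```python
# Sanity-check that a converted/quantized model kept the right tokenizer.
# The image-wrapping tokens below are assumptions; confirm them against
# the official InternVL tokenizer_config.json.
from transformers import AutoTokenizer

official = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL2_5-78B", trust_remote_code=True
)
local = AutoTokenizer.from_pretrained("./my-converted-model", trust_remote_code=True)

probe = "<img>describe this</img>"  # assumed image tokens
a, b = official.encode(probe), local.encode(probe)
assert a == b, f"token-id mismatch: {a} vs {b}; a wrong tokenizer corrupts image embeddings"
print("vocab:", official.vocab_size, "| specials:", official.additional_special_tokens)
```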

Runtime recommendation

llama.cpp with an InternVL-compatible multimodal build (llama-server plus the matching mmproj). Verify InternVL support in your llama.cpp version before provisioning. Use vLLM if InternVL is registered as a supported architecture in your version. Avoid Ollama — InternVL is unlikely to be in the standard catalog. Fall back to OpenGVLab's reference serving code if neither works.
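To see which architecture string a runtime must have registered, you can read the model config from the Hub without downloading any weights; a minimal sketch:

```python
# Read the architecture name from the Hub config (no weight download).
# A runtime such as vLLM can only load the model if this string is in
# its supported-architecture registry.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("OpenGVLab/InternVL2_5-78B", trust_remote_code=True)
print(cfg.architectures)  # e.g. ['InternVLChatModel']; match against your runtime's list
```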

Common beginner mistakes

Mistake: Using a Llama 3.2 Vision mmproj with an InternVL text GGUF. Fix: Vision projectors are architecture-specific. Download the InternVL mmproj from the InternVL Hugging Face repo.

Mistake: Assuming InternVL works with standard Ollama vision tags. Fix: InternVL requires custom model registration. Use llama.cpp directly with the correct multimodal GGUF.

Mistake: Sending high-res images and expecting InternViT to handle them. Fix: InternViT is large but has fixed input-resolution limits. Resize images to the encoder's expected size to avoid OOM (see the sketch below).

Mistake: Expecting InternVL to run at Llama 3.2 Vision's VRAM footprint. Fix: InternViT is far larger than typical vision encoders, so vision VRAM is proportionally higher. Budget an extra 5-10 GB for InternViT.
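A resize sketch with Pillow, assuming the 448 px tile size InternViT uses in InternVL 2.x; the tile cap is an illustrative guess, so check the preprocessing config of your exact release:

```python
# Downscale before the vision encoder sees the image. InternVL tiles
# inputs into 448x448 patches; huge images mean many tiles and a VRAM spike.
from PIL import Image

TILE = 448       # InternViT tile size in InternVL 2.x
MAX_TILES = 12   # assumed cap; check your release's preprocessing config

def shrink_for_internvl(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    max_area = MAX_TILES * TILE * TILE  # cap total pixels the tiler will see
    if w * h > max_area:
        scale = (max_area / (w * h)) ** 0.5
        img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    return img

print(shrink_for_internvl("scan_4k.png").size)
```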

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
  • InternVL 2.5 26B · 26B · Consumer

Family siblings (internvl-2.5)
  • InternVL 2.5 26B · 26B · Consumer
  • InternVL 2.5 78B · 78B · you are here

Strengths

  • MIT license
  • Frontier-tier OCR

Weaknesses

  • 48GB+ VRAM tier

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M | 45.0 GB | 52 GB

Get the model

HuggingFace

Original weights

huggingface.co/OpenGVLab/InternVL2_5-78B

Source repository — direct quantization required.
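A download sketch using huggingface_hub; the local path is illustrative, and the full-precision weights run to roughly 150+ GB on disk:

```python
# Pull the original weights for local conversion or quantization.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="OpenGVLab/InternVL2_5-78B",
    local_dir="./InternVL2_5-78B",  # illustrative path; needs ~150+ GB free
)
print("weights at:", path)
```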

Hardware that runs this

Cards with enough VRAM for at least one quantization of InternVL 2.5 78B.

  • NVIDIA GB200 NVL72 · 13824 GB
  • AMD Instinct MI355X · 288 GB
  • AMD Instinct MI325X · 256 GB
  • AMD Instinct MI300X · 192 GB
  • NVIDIA B200 · 192 GB
  • NVIDIA H100 NVL · 188 GB
  • NVIDIA H200 · 141 GB
  • Intel Gaudi 3 · 128 GB

Frequently asked

What's the minimum VRAM to run InternVL 2.5 78B?

52GB of VRAM is enough to run InternVL 2.5 78B at the Q4_K_M quantization (file size 45.0 GB) for text-only use; budget roughly 5-10 GB more for the InternViT vision encoder (see Hardware guidance above). Higher-quality quantizations need more.

Can I use InternVL 2.5 78B commercially?

Yes — InternVL 2.5 78B ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of InternVL 2.5 78B?

InternVL 2.5 78B supports a context window of 32,768 tokens (32K).

Does InternVL 2.5 78B support images?

Yes — InternVL 2.5 78B is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.
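If your runner exposes an OpenAI-compatible endpoint (both llama-server and vLLM can), an image request looks roughly like this; the port, model name, and image path are placeholders:

```python
# Text + image request against a local OpenAI-compatible endpoint.
# Port, model name, and file path are placeholders for your setup.
import base64
import requests

with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "internvl2_5-78b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```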

Source: huggingface.co/OpenGVLab/InternVL2_5-78B


Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier
Models in the same parameter band as this one
  • Llama 3.3 70B Instruct
    llama · 70B
    9.1/10
  • DeepSeek R1 Distill Llama 70B
    deepseek · 70B
    9.0/10
  • Qwen 2.5 72B Instruct
    qwen · 72B
    9.0/10
  • Llama 3.1 70B Instruct
    llama · 70B
    8.0/10
Step up
More capable — bigger memory footprint
  • DeepSeek V4 Pro (1.6T MoE)
    deepseek · 1600B
    unrated
  • Qwen 3.5 235B-A17B (MoE)
    qwen · 397B
    unrated
Step down
Smaller — faster, runs on weaker hardware
  • Qwen 3 30B-A3B
    qwen · 30B
    unrated
  • Gemma 4 31B Dense
    gemma · 31B
    unrated