InternVL 2.5 78B
InternVL 2.5 flagship. Approaches frontier proprietary VLMs on document and OCR tasks.
Overview
InternVL 2.5 78B is OpenGVLab's flagship open-weight vision-language model. It pairs the 6B-parameter InternViT vision encoder with a ~72B-parameter language backbone and approaches frontier proprietary VLMs on document-understanding and OCR-heavy tasks.
How to run it
InternVL 2.5 78B is OpenGVLab's flagship multimodal model: a ~72B-parameter language backbone (Qwen2.5-72B at this size) paired with the 6B-parameter InternViT vision encoder, ~78B parameters total. Run it at Q4_K_M via llama.cpp's server (llama-server with the matching mmproj file) or a vLLM multimodal pipeline. Q4_K_M weights are ~45 GB (text) plus ~4-6 GB (vision). Minimum VRAM is 48 GB: an RTX A6000 handles Q3_K_M with vision, or text-only Q4_K_M. Recommended: an A100 80GB at AWQ-INT4 for full vision serving. Expect ~8-15 tok/s on an A6000 at Q4_K_M text-only; vision encoding adds ~2-4 s per image. Unlike Llama-based vision models, InternVL uses a custom architecture (InternViT plus an InternLM2.5 or Qwen2.5 backbone, depending on size), so ecosystem support is narrower; check llama.cpp's InternVL support before provisioning. Ollama may not carry InternVL 2.5, so use llama.cpp directly. For production serving, use vLLM with custom model registration (if supported). InternVL is known for strong vision-language benchmark results, especially on document understanding and OCR-heavy tasks.
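Once the server is up, a quick smoke test is to send an image through the OpenAI-compatible chat endpoint that both llama-server and vLLM expose. This is a minimal sketch, assuming defaults that may not match your setup: the port, endpoint path, served-model name, and filename below are placeholders to adjust to your actual launch flags.

```python
import base64
import requests

# Assumes llama-server (or vLLM) is serving the InternVL GGUF with its mmproj
# on localhost:8080 via the OpenAI-compatible chat API. Port and model name
# are placeholders; match them to your launch command.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "internvl2_5-78b",  # hypothetical served-model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```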
Hardware guidance
Minimum: RTX A6000 48GB at Q3_K_M with vision (tight). Recommended: A100 80GB at AWQ-INT4. VRAM math: the quantized text weights at Q4_K_M take ≈45 GB; the InternViT encoder adds ~5-8 GB (varies with input resolution); KV cache at 8K context adds ~12 GB, for a total of ~62-65 GB with vision. A single A6000 48GB is 15+ GB short, so it must drop to Q3_K_M or run text-only at Q4_K_M. Dual RTX 3090s (48 GB total) manage Q4_K_M text-only or Q3_K_M with vision. An A100 80GB is comfortable for Q4 plus vision plus 8K context. A Mac Studio with 128GB unified memory can hold Q4_K_M with vision at 2-5 tok/s, though Apple Silicon support for InternVL is uncertain. Cloud: A100 at $5-10/hr. InternViT is large; expect 2-3× the vision-encoder VRAM of Llama 3.2 Vision's CLIP encoder.
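The budget above is simple arithmetic, sketched below so you can plug in your own quantization and context length. The bits-per-weight figures, the vision-encoder midpoint, and the linear KV-cache scaling are rough assumptions mirroring the numbers in this section, not measurements.

```python
# Back-of-envelope VRAM budget for InternVL 2.5 78B. Constants mirror the
# guidance above (~45 GB text weights at Q4_K_M, 5-8 GB vision, ~12 GB KV
# cache at 8K context); treat the output as a sanity check, not a guarantee.
GB = 1e9

def vram_estimate(bits_per_weight: float, ctx_tokens: int) -> float:
    text_weights = 72e9 * bits_per_weight / 8 / GB  # Qwen2.5-72B backbone
    vision = 6.5                                    # InternViT, midpoint of 5-8 GB
    kv_cache = 12.0 * ctx_tokens / 8192             # linear-scaling assumption
    return text_weights + vision + kv_cache

print(f"Q4_K_M (~5.0 bpw), 8K ctx: {vram_estimate(5.0, 8192):.0f} GB")  # ~64 GB
print(f"Q3_K_M (~3.9 bpw), 2K ctx: {vram_estimate(3.9, 2048):.0f} GB")  # ~45 GB
```

The second line shows why the A6000 configuration is tight: Q3_K_M with vision only fits 48 GB if you also shrink the context.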
What breaks first
1. InternVL architecture support. llama.cpp's InternVL support is experimental; vision features may not project correctly, producing garbled image descriptions. Validate against reference outputs from the official InternVL GitHub repo (a minimal self-check harness is sketched below).
2. InternViT VRAM bloat. The InternViT encoder is 6B+ parameters, far larger than typical vision encoders (CLIP is ~300M). At high resolutions, InternViT activations can spike to 10-15 GB.
3. Tokenizer incompatibility. InternVL may use a different tokenizer than standard LLaMA. The wrong tokenizer silently produces incorrect image-token embeddings.
4. Multimodal GGUF availability. Pre-converted multimodal GGUFs for InternVL are less common than for Llama 3.2 Vision; you may need to convert from the Hugging Face weights yourself.
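One cheap way to catch a broken vision projector is to render an image whose content you control and check that the model reads it back. A minimal sketch, assuming the placeholder server endpoint and model name from the earlier example; the default Pillow font is small, so swap in a larger TrueType font for a more reliable check.

```python
# Generate a test image with known text, send it to the local server, and
# check the transcription. Garbled or unrelated output suggests the mmproj
# is wrong or the architecture is not fully supported.
import base64, io
import requests
from PIL import Image, ImageDraw

KNOWN_TEXT = "QUARTERLY REVENUE 2024"

img = Image.new("RGB", (448, 448), "white")
ImageDraw.Draw(img).text((20, 200), KNOWN_TEXT, fill="black")
buf = io.BytesIO()
img.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder endpoint
    json={
        "model": "internvl2_5-78b",  # hypothetical served-model name
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "What text appears in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
        "max_tokens": 64,
    },
    timeout=120,
)
answer = resp.json()["choices"][0]["message"]["content"]
print("PASS" if "QUARTERLY" in answer.upper() else f"FAIL: {answer}")
```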
Runtime recommendation
Use llama.cpp (llama-server with the InternVL mmproj) for single-user local inference; move to vLLM with custom model registration for production serving, if your vLLM version supports the architecture.
Common beginner mistakes
- Mistake: Using a Llama 3.2 Vision mmproj with the InternVL text GGUF. Fix: Vision projectors are architecture-specific; download the InternVL mmproj from the InternVL Hugging Face repo.
- Mistake: Assuming InternVL works with standard Ollama vision tags. Fix: InternVL requires custom model registration; use llama.cpp directly with the correct multimodal GGUF.
- Mistake: Sending high-resolution images and expecting InternViT to handle them. Fix: InternViT is large but has fixed input-resolution limits; resize images to the encoder's expected size to avoid OOM (see the sketch after this list).
- Mistake: Expecting InternVL to match Llama 3.2 Vision's VRAM footprint. Fix: InternViT is 5-10× larger than CLIP, so vision VRAM is proportionally higher; budget an extra 5-10 GB.
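A simple pre-send guard keeps InternViT's activation memory bounded. The 448-pixel tile size below matches InternViT's published input resolution, but the maximum-tile cap is an assumption to tune against your VRAM headroom.

```python
# Downscale an image so it yields at most MAX_TILES 448x448 tiles when the
# server does dynamic-resolution tiling, capping InternViT activation memory.
from PIL import Image

TILE = 448        # InternViT input resolution
MAX_TILES = 6     # assumed budget; lower it if you see vision-side OOM

def bound_image(path: str, out_path: str) -> None:
    img = Image.open(path)
    w, h = img.size
    tiles = (w // TILE + 1) * (h // TILE + 1)  # rough tile count
    if tiles > MAX_TILES:
        scale = (MAX_TILES / tiles) ** 0.5
        img = img.resize((max(TILE, int(w * scale)),
                          max(TILE, int(h * scale))))
    img.save(out_path)

bound_image("scan_full_res.png", "scan_bounded.png")
```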
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent/child edges record direct distillation or fine-tune relationships.
Strengths
- MIT license
- Frontier-tier OCR
Weaknesses
- 48GB+ VRAM tier
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 45.0 GB | 52 GB |
Get the model
HuggingFace
Original weights
Source repository. Pre-converted GGUFs are scarce, so expect to quantize from these weights yourself.
Hardware that runs this
Cards with enough VRAM for at least one quantization of InternVL 2.5 78B.
Frequently asked
What's the minimum VRAM to run InternVL 2.5 78B? 48 GB, and only at Q3_K_M with vision or Q4_K_M text-only; 80 GB is the comfortable tier.
Can I use InternVL 2.5 78B commercially? Yes. The weights are MIT-licensed.
What's the context length of InternVL 2.5 78B? The VRAM budgets on this page assume 8K context; check the Hugging Face model card for the architecture's maximum.
Does InternVL 2.5 78B support images? Yes. Vision-language input is the model's core strength, particularly document understanding and OCR.
Source: huggingface.co/OpenGVLab/InternVL2_5-78B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify InternVL 2.5 78B runs on your specific hardware before committing money.