0.77B parameters

Commercial OK

Reviewed May 2026

Florence-2 Large

770M-parameter unified vision foundation model with a DaViT image encoder and BART-style seq2seq decoder. One model, one set of weights — handles captioning, OCR, region/grounding, segmentation, and dense detection via task-prompt tokens. Trained on FLD-5B (5.4B annotations over 126M images).

License: MIT · Context: not listed

Our verdict

OP · Eruo Fredoline | Verified May 29, 2026

unrated

Absurd value for 770M parameters, and the most underrated vision model Microsoft has shipped. Use it when you need many vision tasks on cheap hardware and a chat interface is not the point.

Strengths

One 770M checkpoint for caption / detailed-caption / OCR / OCR-with-region / grounding / detection / segmentation
Outperforms many task-specialist models 10x its size on COCO, RefCOCO, TextVQA
MIT license, no usage restrictions
Tiny by VLM standards — runs in <2GB VRAM at FP16, viable on CPU and edge devices
Task-prompt API: <CAPTION>, <OD>, <OCR>, <REFERRING_EXPRESSION_SEGMENTATION>, etc.
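
The task-prompt interface in the last bullet can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes the usage pattern shown on the Hugging Face model card (AutoProcessor and AutoModelForCausalLM loaded with trust_remote_code=True, plus the post_process_generation helper that ships with the model's remote code). The names TASK_TOKENS, build_prompt and run_task, and the generation settings, are illustrative and not part of Florence-2.

```python
# Task tokens named in this review; Florence-2 defines more than these four.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detection": "<OD>",
    "ocr": "<OCR>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

# Assumption: these tasks expect a text phrase appended after the task token.
TASKS_WITH_TEXT = {"referring_segmentation"}


def build_prompt(task: str, text: str = "") -> str:
    """Return the prompt string: the task token plus optional text input."""
    token = TASK_TOKENS[task]
    if task in TASKS_WITH_TEXT:
        if not text:
            raise ValueError(f"{task} needs a text input, e.g. 'a green car'")
        return token + text
    if text:
        raise ValueError(f"{task} does not take extra text")
    return token


def run_task(image_path: str, task: str, text: str = "",
             model_id: str = "microsoft/Florence-2-large"):
    """Run one task on one image and return the parsed result.

    Heavy imports are deferred so build_prompt works without torch installed.
    """
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(task, text)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,  # illustrative setting
        num_beams=3,          # illustrative setting
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Turn the raw token string into task-specific output (text, boxes, polygons).
    return processor.post_process_generation(
        raw, task=TASK_TOKENS[task], image_size=(image.width, image.height)
    )
```

Calling run_task("photo.jpg", "detection") would switch the same checkpoint to object detection; only the prompt token changes, which is the point of the single-model design. The file name here is a placeholder.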

Weaknesses

Not a conversational VLM — pure task-prompt, no free-form chat
Caption outputs are short and factual; verbose narration weaker than Qwen2.5-VL
OCR is good for English print but lags GOT-OCR2 on formulas, complex tables, CJK
Requires trust_remote_code=True in transformers, which adds friction for locked-down deployments
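
One way to reduce the remote-code friction is to pin the repository to a specific commit, so the code you reviewed is the code that runs. The sketch below assumes the standard revision and trust_remote_code arguments of from_pretrained in transformers; the helper names remote_code_kwargs and load_pinned are hypothetical, and you have to supply the commit hash yourself after auditing the repository.

```python
import re


def remote_code_kwargs(revision: str) -> dict:
    """Build from_pretrained kwargs that pin remote code to a full commit hash."""
    # Branch names and tags can move; a 40-character hash cannot.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("pin a full 40-character commit hash, not a branch or tag")
    return {"trust_remote_code": True, "revision": revision}


def load_pinned(model_id: str, revision: str):
    """Load model and processor from one audited commit (hypothetical helper)."""
    # Deferred import so the validation helper works without transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor

    kwargs = remote_code_kwargs(revision)
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    processor = AutoProcessor.from_pretrained(model_id, **kwargs)
    return model, processor
```

Pinning does not remove the need to review the remote code once, but it keeps later upstream changes from running unreviewed.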

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	0.4 GB	1 GB
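
The figures in this table can be sanity-checked with back-of-envelope arithmetic: weight size is roughly parameter count times bits per weight. The sketch below is an estimate under stated assumptions, not a measurement. It uses a nominal 4 bits per weight for the 4-bit row (real 4-bit formats also store scales, so files come out somewhat larger), and it ignores activations and runtime overhead, which is why the VRAM column is higher than the file size.

```python
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 0.77e9  # parameter count from this page

fp16_gb = weight_size_gb(PARAMS, 16)  # 2 bytes per weight
q4_gb = weight_size_gb(PARAMS, 4)     # nominal 4 bits per weight (assumption)

print(f"FP16 weights:  ~{fp16_gb:.2f} GB")  # ~1.54 GB
print(f"4-bit weights: ~{q4_gb:.2f} GB")    # ~0.39 GB
```

The FP16 estimate of about 1.54 GB is consistent with the under-2GB claim above, and the 4-bit estimate rounds to the 0.4 GB file size listed in the table.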

Get the model

HuggingFace

Original weights

huggingface.co/microsoft/Florence-2-large

Source repository. No pre-quantized files are listed here, so you will need to quantize the weights yourself.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Florence-2 Large.

NVIDIA GB200 NVL72

13824GB · nvidia

AMD Instinct MI350X

NVIDIA B300 (Blackwell Ultra)

288GB · nvidia

AMD Instinct MI355X

AMD Instinct MI325X

AMD Instinct MI300X

192GB · amd

NVIDIA H100 NVL

188GB · nvidia

Frequently asked

What's the minimum VRAM to run Florence-2 Large?

1GB of VRAM is enough to run Florence-2 Large at the Q4_K_M quantization (file size 0.4 GB). Higher-quality quantizations need more.

Can I use Florence-2 Large commercially?

Yes. Florence-2 Large ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of Florence-2 Large?

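This listing does not record a context length for Florence-2 Large; the field shows 0 tokens. As a task-prompt vision model, it takes an image plus a short task prompt rather than a long text context.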

Source: huggingface.co/microsoft/Florence-2-large

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
