0.77B parameters
Commercial OK
Reviewed May 2026

Florence-2 Large

770M-parameter unified vision foundation model with a DaViT image encoder and BART-style seq2seq decoder. One model, one set of weights — handles captioning, OCR, region/grounding, segmentation, and dense detection via task-prompt tokens. Trained on FLD-5B (5.4B annotations over 126M images).

License: MIT · Context: not applicable (listed as 0 tokens)

Our verdict

OP · Fredoline Eruo | Verified May 29, 2026
unrated

Absurd value for 770M params, and the most underrated vision model Microsoft has shipped. Use it when you need many vision tasks on cheap hardware and a chat interface is not the point.

Strengths

  • One 770M checkpoint for caption / detailed-caption / OCR / OCR-with-region / grounding / detection / segmentation
  • Outperforms many task-specialist models 10x its size on COCO, RefCOCO, TextVQA
  • MIT license, no usage restrictions
  • Tiny by VLM standards — runs in <2GB VRAM at FP16, viable on CPU and edge devices
  • Task-prompt API: <CAPTION>, <OD>, <OCR>, <REFERRING_EXPRESSION_SEGMENTATION>, etc.
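
The task-prompt interface can be sketched roughly as follows. This is a minimal sketch that follows the usage pattern published on the Hugging Face model card, not a tested deployment recipe. The helper names `build_prompt`, `run_task`, and `detections` are hypothetical, the generation settings are illustrative defaults, and the `<OD>` result layout (`bboxes` and `labels` keyed by the task token) is assumed from the model card. Behavior may vary with your transformers version.

```python
from typing import Any, Optional

MODEL_ID = "microsoft/Florence-2-large"


def build_prompt(task: str, text_input: Optional[str] = None) -> str:
    """A Florence-2 prompt is a task token, optionally followed by free text."""
    if not (task.startswith("<") and task.endswith(">")):
        raise ValueError(f"task must be a token like '<OD>', got {task!r}")
    return task if text_input is None else task + text_input


def run_task(image_path: str, task: str, text_input: Optional[str] = None) -> Any:
    """Run one task on one image (hypothetical helper; downloads weights on first use)."""
    # Heavy imports live here so the pure helpers work without torch installed.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # trust_remote_code=True executes Python code shipped in the model repo.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=build_prompt(task, text_input), images=image, return_tensors="pt"
    ).to(device, dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,  # illustrative setting, not a tuned value
        num_beams=3,
        do_sample=False,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Converts the raw token string into task-specific structured output.
    return processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )


def detections(result: dict) -> list:
    """Pair labels with boxes, assuming the '<OD>' layout described above."""
    od = result["<OD>"]
    return list(zip(od["labels"], od["bboxes"]))
```

Calling `run_task("photo.jpg", "<OD>")` and passing the result to `detections` would give a list of `(label, [x1, y1, x2, y2])` pairs under these assumptions; swapping the token for `<CAPTION>` or `<OCR>` reuses the same weights.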

Weaknesses

  • Not a conversational VLM — pure task-prompt, no free-form chat
  • Caption outputs are short and factual; verbose narration weaker than Qwen2.5-VL
  • OCR is good for English print but lags GOT-OCR2 on formulas, complex tables, CJK
  • Requires trust_remote_code=True in transformers, which adds friction for locked-down deployments
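
One way to reduce the remote-code friction is to download a pinned snapshot once, review the Python files it contains, and then load only from disk. The sketch below assumes that workflow is acceptable in your environment. The helper names (`is_full_commit_sha`, `fetch_pinned`, `load_offline`) are hypothetical, and the commit SHA must come from your own review of the repository.

```python
import re

REPO_ID = "microsoft/Florence-2-large"


def is_full_commit_sha(revision: str) -> bool:
    """True only for a 40-character lowercase hex commit hash."""
    return re.fullmatch(r"[0-9a-f]{40}", revision) is not None


def fetch_pinned(revision: str, local_dir: str) -> str:
    """Download one exact commit so the reviewed code cannot change later."""
    if not is_full_commit_sha(revision):
        raise ValueError("pin to a full commit SHA, not a branch or tag name")
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=REPO_ID, revision=revision, local_dir=local_dir)


def load_offline(local_dir: str):
    """Load model and processor from the reviewed local copy only."""
    from transformers import AutoModelForCausalLM, AutoProcessor

    # The repo's custom code still runs, but from the files already on disk.
    model = AutoModelForCausalLM.from_pretrained(
        local_dir, trust_remote_code=True, local_files_only=True
    )
    processor = AutoProcessor.from_pretrained(
        local_dir, trust_remote_code=True, local_files_only=True
    )
    return model, processor
```

Pinning to a commit rather than a branch means a later push to the repository cannot silently change the code that runs on your machine.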

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M | 0.4 GB | 1 GB
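
As a rough sanity check on these figures, weight memory can be estimated as parameters times bits per weight. The sketch below counts weights only and ignores activations, image features, and runtime overhead. The 4.5 bits-per-weight value for a 4-bit K-quant is an assumed average, not a measured figure for this model.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 0.77e9  # parameter count from this page

fp16_gb = weight_memory_gb(PARAMS, 16)
q4_gb = weight_memory_gb(PARAMS, 4.5)  # assumed average bits per weight

print(f"FP16 weights: {fp16_gb:.2f} GB")            # FP16 weights: 1.54 GB
print(f"4-bit weights (assumed): {q4_gb:.2f} GB")   # 4-bit weights (assumed): 0.43 GB
```

The gap between these weight-only numbers and the VRAM figures quoted on this page is the headroom needed for activations and runtime buffers.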

Get the model

HuggingFace

Original weights

huggingface.co/microsoft/Florence-2-large

Source repository. No pre-quantized files are linked here, so you need to quantize the original weights yourself.

Frequently asked

What's the minimum VRAM to run Florence-2 Large?

1 GB of VRAM is enough to run Florence-2 Large at the Q4_K_M quantization (file size 0.4 GB). Higher-quality formats need more; FP16 runs in under 2 GB.

Can I use Florence-2 Large commercially?

Yes. Florence-2 Large ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of Florence-2 Large?

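This listing records a context window of 0 tokens, which is a placeholder rather than a real limit. Florence-2 Large is a task-prompt vision model, not a chat model: its input is an image plus a short task prompt, so a chat-style context length does not apply.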

Source: huggingface.co/microsoft/Florence-2-large

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
