RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Neural network architectures

Vision Transformer (ViT)

A Vision Transformer (ViT) is a neural network architecture that applies the Transformer, originally designed for text, directly to images. Instead of convolutional layers, a ViT splits an image into fixed-size patches (e.g., 16×16 pixels), flattens each patch into a vector, and processes the resulting sequence with self-attention layers. Operators encounter ViT when running multimodal models (e.g., LLaVA, Qwen-VL), which use a ViT as the vision encoder to convert images into embeddings the language model can attend to. Variants such as ViT-L/14 (large, 14×14-pixel patches) are common, and their size (~300M parameters for ViT-L) adds to VRAM usage alongside the language model.
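The patch-splitting step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: `num_patches` and `patchify` are hypothetical helper names, the image is assumed to be a square height × width × channels nested list, and real encoders do the same thing with tensor operations.

```python
def num_patches(image_size, patch_size):
    """Number of patches (tokens) for a square image, excluding any class token."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def patchify(image, patch_size):
    """Split an H x W x C nested list into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: every channel value of every pixel inside it.
            flat = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(flat)
    return patches


# A 4x4 RGB toy image cut into 2x2 patches gives 4 vectors of 2*2*3 = 12 values.
toy = [[[r * 4 + c] * 3 for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)
```

With these definitions, a 224-pixel image at patch size 16 yields 196 patches, and a 336-pixel image at patch size 14 yields 576. Each patch becomes one token in the sequence the encoder processes, so smaller patches or larger inputs mean longer sequences.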

Deeper dive

ViT was introduced by Dosovitskiy et al. (2020) as a direct alternative to convolutional neural networks (CNNs) for image classification. The core idea: treat an image as a sequence of patches, embed each patch with a linear projection, add positional embeddings, and feed the sequence into a standard Transformer encoder. ViT lacks the inductive biases of CNNs (translation equivariance, locality), so it needs large-scale pretraining (e.g., ImageNet-21k, JFT-300M) to match CNN performance. Once pretrained at that scale, it scales well with additional compute and data.

In local AI, ViT serves as the vision backbone in multimodal LLMs. Common variants include ViT-B/16 (base, 16×16 patches, ~86M params), ViT-L/14 (large, 14×14 patches, ~304M params), and ViT-H/14 (huge, 14×14 patches, ~632M params). The ViT's parameter count is additive to the language model's, so a 7B LLM with a ViT-L encoder carries ~7.3B parameters in total. Quantization (e.g., Q4) can be applied to both components, though the language model dominates the footprint, so quantizing it yields most of the VRAM savings.
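The parameter counts quoted above can be roughly reproduced from the encoder's hyperparameters. The sketch below is an estimate under stated assumptions: the hidden size, depth, and MLP width for each variant are taken from the original ViT paper's model table, the input is assumed to be 224 pixels, the classification head is left out, and `vit_param_count` is a hypothetical helper. Real checkpoints (CLIP-style encoders in particular) differ by small amounts.

```python
def vit_param_count(hidden, layers, mlp, patch, image=224, channels=3):
    """Approximate parameter count of a plain ViT encoder (no classification head)."""
    n_patches = (image // patch) ** 2
    patch_embed = patch * patch * channels * hidden + hidden  # linear projection + bias
    class_token = hidden
    pos_embed = (n_patches + 1) * hidden                      # one vector per token
    attention = 4 * (hidden * hidden + hidden)                # Q, K, V, output projections
    feed_forward = hidden * mlp + mlp + mlp * hidden + hidden # two dense layers
    layer_norms = 2 * 2 * hidden                              # two norms per block
    per_layer = attention + feed_forward + layer_norms
    final_norm = 2 * hidden
    return patch_embed + class_token + pos_embed + layers * per_layer + final_norm


# (hidden size, layers, MLP width, patch size) per variant, assumed from the paper.
VARIANTS = {
    "ViT-B/16": (768, 12, 3072, 16),
    "ViT-L/14": (1024, 24, 4096, 14),
    "ViT-H/14": (1280, 32, 5120, 14),
}

counts = {name: vit_param_count(*config) for name, config in VARIANTS.items()}

for name, count in counts.items():
    print(f"{name}: ~{count / 1e6:.0f}M parameters")
```

Almost all of the parameters sit in the Transformer blocks; the patch projection and positional embeddings together contribute well under 1% of the total, which is why patch size barely changes the parameter count even though it changes the token count a great deal.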

Practical example

Running LLaVA 1.5 7B on an RTX 3090 (24 GB VRAM) requires loading both the Vicuna-7B language model and a ViT-L/14 vision encoder. At FP16, the ViT alone uses ~600 MB and the language model ~14 GB, about 14.6 GB of weights in total, which fits in 24 GB with headroom for the KV cache and image tokens. With Q4 quantization, the ViT drops to ~150 MB and the LLM to roughly 4 to 5 GB, fitting comfortably with room for long contexts. The FP16 pair becomes a problem on cards with 16 GB or less, where weights plus context can spill into system RAM; that kind of offload can cut throughput several-fold (for example, from ~30 to ~5 tokens/sec).
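The weight-memory arithmetic behind these figures is simple enough to check by hand. The sketch below assumes 1 GB = 10^9 bytes, uses the round parameter counts from the text (304M for the ViT, 7B for the language model), and counts weights only: KV cache, activations, and runtime overhead come on top. `weights_gb` is a hypothetical helper, and the 4.5 bits per weight used for Q4 is an assumption meant to reflect the scale metadata that block-quantized formats carry.

```python
def weights_gb(params, bits_per_weight):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


VIT_PARAMS = 304e6  # ViT-L/14 vision encoder, from the text
LLM_PARAMS = 7e9    # 7B language model, rounded

fp16_total = weights_gb(VIT_PARAMS, 16) + weights_gb(LLM_PARAMS, 16)
q4_total = weights_gb(VIT_PARAMS, 4.5) + weights_gb(LLM_PARAMS, 4.5)

print(f"FP16 weights: {fp16_total:.1f} GB")
print(f"Q4 weights (4.5 bits/weight assumed): {q4_total:.1f} GB")
```

Under these assumptions the FP16 pair comes to about 14.6 GB and the Q4 pair to about 4.1 GB of weights. In both cases the vision encoder is only about 4% of the total, so the language model's precision is what decides whether a given card has room.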

Workflow example

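In LM Studio or Ollama, when you load a multimodal model like llava:7b, the runtime loads both the language model and the ViT encoder into VRAM. nvidia-smi does not list individual model files; it reports the total VRAM held by the runtime process, so the ViT shows up as extra memory on top of the language model. To see the two components separately, look at the model's files: llama.cpp-based runtimes typically ship the language model as one GGUF file and the vision encoder plus projector as a second one (often named mmproj), while Hugging Face checkpoints store both in safetensors shards. If VRAM is tight, choose a more heavily quantized build of the language model (llama-quantize can produce one from a GGUF file) or rely on the model's built-in quantization config; the ViT is small by comparison. In Hugging Face Transformers, loading LlavaForConditionalGeneration automatically instantiates the ViT and exposes it as a vision_tower attribute (the exact attribute path varies between library versions).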
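One way to check the combined footprint is to sample GPU memory before and after the model loads. The sketch below is a hedged example, not part of any runtime: `parse_memory_csv` and `read_gpu_memory` are hypothetical helpers, and they rely on nvidia-smi's CSV query mode, which reports memory in MiB when `nounits` is set. The reading covers everything on the GPU, so other processes will distort it.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn lines like '5123, 24576' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Sample text in the format the query produces, parsed without needing a GPU.
sample = parse_memory_csv("5123, 24576\n")
```

Calling `read_gpu_memory()` once with the runtime idle and again after sending the first image prompt gives a difference that approximates language model plus vision encoder plus context, which is the number that matters when deciding whether a card has headroom.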

Reviewed by Fredoline Eruo. See our editorial policy.
