Vision Transformer (ViT)
A Vision Transformer (ViT) is a neural network architecture that applies the Transformer model, originally designed for text, directly to image patches. Instead of using convolutional layers, ViT splits an image into fixed-size patches (e.g., 16×16 pixels), flattens each patch into a vector, and processes the resulting sequence with self-attention layers. Operators encounter ViT when running multimodal models (e.g., LLaVA, Qwen-VL) that use a ViT as the vision encoder to convert images into embeddings the language model can attend to. Variants like ViT-L/14 (large, 14×14-pixel patches) are common, and their size (~300M parameters for ViT-L) adds to VRAM usage alongside the language model.
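The patch size determines how many tokens an image becomes, which in turn determines how much of the context the image occupies. The sketch below computes that count for a square image; the input resolutions (224 and 336 pixels) are assumptions chosen for illustration, and the helper name num_patches is hypothetical.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# 16x16 patches on an assumed 224x224 input: 14 patches per side.
print(num_patches(224, 16))  # 196

# 14x14 patches on an assumed 336x336 input: 24 patches per side.
print(num_patches(336, 14))  # 576
```

Smaller patches or larger inputs increase the token count quadratically, so the choice of resolution matters as much as the choice of variant.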
Deeper dive
ViT was introduced by Dosovitskiy et al. (2020) as a direct alternative to convolutional neural networks (CNNs) for image classification. The core idea: treat an image as a sequence of patches, embed each patch with a linear projection, add positional embeddings, and feed the sequence into a standard Transformer encoder. ViT lacks the inductive biases of CNNs (translation equivariance, locality), so it requires large-scale pretraining (e.g., ImageNet-21k, JFT-300M) to match CNN performance. Once pretrained, however, ViT scales efficiently with compute and data. In local AI, ViT is used as the vision backbone in multimodal LLMs. Common variants include ViT-B/16 (base, 16×16 patches, ~86M params), ViT-L/14 (large, 14×14 patches, ~304M params), and ViT-H/14 (huge, 14×14 patches, ~632M params). Operators should note that ViT's parameter count is additive to the language model's, so a 7B LLM with a ViT-L encoder has ~7.3B total parameters. Quantization (e.g., Q4) can be applied to both components to reduce the VRAM footprint, although some runtimes keep the comparatively small vision encoder at higher precision.
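The patch-embedding step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the weights are random rather than pretrained, the class token and the Transformer encoder itself are omitted, and the names patchify, W_proj, and pos are hypothetical. The sizes follow ViT-B/16 (16×16 patches, hidden width 768) on an assumed 224×224 RGB input.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    # Expose the patch grid as separate axes, then group each patch's pixels together.
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # stand-in for a real image

patch, dim = 16, 768
patches = patchify(image, patch)  # one row per patch

# Linear projection of each flattened patch (random weights for illustration).
W_proj = (rng.standard_normal((patch * patch * 3, dim)) * 0.02).astype(np.float32)

# One positional embedding per patch position (learned in a real model).
pos = (rng.standard_normal((patches.shape[0], dim)) * 0.02).astype(np.float32)

tokens = patches @ W_proj + pos  # the sequence a Transformer encoder would consume
print(patches.shape, tokens.shape)
```

In a multimodal LLM, the encoder's output tokens are then mapped by a small projector into the language model's embedding space.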
Practical example
Running LLaVA 1.5 7B on an RTX 3090 (24 GB VRAM) requires loading both the Vicuna-7B language model and a ViT-L/14 vision encoder. At FP16, the ViT alone uses ~600 MB and the language model ~14 GB, so the weights total roughly 14.6 GB and fit on a 24 GB card with room left for the KV cache. With Q4 quantization, the ViT drops to ~150 MB and the LLM to ~4-5 GB, fitting comfortably with room for long contexts and also fitting on much smaller cards. The FP16 weights would not fit on a 12 GB card, however; there the runtime would have to offload part of the model to system RAM, which can cut throughput severalfold (on the order of ~30 tokens/sec down to ~5).
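The weight-memory figures above follow from parameter count times bits per weight. The sketch below reproduces that arithmetic using the parameter counts quoted in this entry; the helper weight_gb is hypothetical, and the estimate covers weights only (no KV cache, activations, or runtime overhead). Real 4-bit formats store scales alongside the weights, so actual quantized files come out somewhat larger than an idealized 4 bits per weight.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), ignoring runtime overhead."""
    return params * bits_per_weight / 8 / 1e9


VIT_L_PARAMS = 304e6  # ViT-L/14, as quoted in this entry
LLM_PARAMS = 7e9      # nominal size of a "7B" language model

for label, bits in [("FP16", 16), ("idealized 4-bit", 4)]:
    vit = weight_gb(VIT_L_PARAMS, bits)
    llm = weight_gb(LLM_PARAMS, bits)
    print(f"{label}: ViT {vit:.2f} GB, LLM {llm:.2f} GB, total {vit + llm:.2f} GB")
```

Comparing the total against the card's VRAM, minus headroom for the KV cache, gives a quick first check on whether offloading will be needed.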
Workflow example
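In LM Studio or Ollama, when you load a multimodal model like llava:7b, the runtime loads both the language model and the ViT encoder into VRAM. You can confirm the combined footprint with nvidia-smi, which reports the memory used per GPU and per process but does not list individual model files. To see the two components, look at the model's files instead: Hugging Face checkpoints are often sharded (e.g., model-00001-of-00002.safetensors), while llama.cpp-based runtimes typically ship the vision encoder and projector as a separate GGUF file alongside the language model. If VRAM is tight, you can quantize the language model with llama-quantize or use a build that is already quantized; the vision encoder is small and is often left at FP16. In Hugging Face Transformers, loading LlavaForConditionalGeneration automatically instantiates the ViT, exposed as model.vision_tower (the exact attribute path can vary between library versions).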
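One way to measure the footprint is to read GPU memory before and after loading the model and take the difference. The sketch below wraps nvidia-smi's CSV query mode; the helper names parse_memory and gpu_memory are hypothetical, and the sample string is illustrative rather than a real measurement. It assumes an NVIDIA driver is installed when gpu_memory is called.

```python
import subprocess

# Query used and total memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)


# Illustrative sample in the same format (not a real measurement).
sample = "5123, 24576\n"
for used, total in parse_memory(sample):
    print(f"{used} MiB used of {total} MiB ({100 * used / total:.1f}%)")
```

Calling gpu_memory() once before and once after the runtime loads the model gives the combined cost of the language model and the vision encoder; it cannot separate the two.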
Reviewed by Fredoline Eruo. See our editorial policy.