Computer Vision (domain)
Computer vision is the field of AI that enables machines to interpret and process visual data (images, videos, or live camera feeds) by assigning labels, detecting objects, or reconstructing 3D scenes. In local AI, operators encounter computer vision through models like YOLO for real-time object detection or CLIP for image-text similarity. These models typically run inference from GPU VRAM: a 7B-parameter vision-language model at FP16 needs ~14 GB for its weights alone (7 billion parameters × 2 bytes each), while small detectors such as YOLOv8 nano (about 3 million parameters) can run on CPU at ~30 FPS. The runtime loads image tensors, runs them through a convolutional or transformer backbone, and outputs bounding boxes, class probabilities, or embeddings.
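The ~14 GB figure follows from simple arithmetic, sketched below. This is a rough estimate for weights only: the function name and the bytes-per-parameter table are illustrative assumptions, and activations, image buffers, and runtime overhead are not counted.

```python
# Approximate bytes per parameter for common weight formats.
# Assumption: "q4" ignores per-block scale overhead, so real files are a bit larger.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("fp16", "int8", "q4"):
        print(f"7B @ {dtype}: ~{weight_memory_gb(7e9, dtype):.1f} GB")
```

Treat the result as a lower bound when checking whether a model fits on a given card.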
Deeper dive
Modern computer vision relies on deep neural networks, primarily convolutional neural networks (CNNs) and vision transformers (ViTs). CNNs slide small learned filters across the image to detect edges, textures, and progressively higher-level features, while ViTs split images into fixed-size patches and apply self-attention across them, often achieving higher accuracy at the cost of more compute. Operators choose models based on latency and accuracy trade-offs: YOLOv8 (CNN) runs at ~100 FPS on an RTX 4090, whereas the transformer-based DETR may run at ~10 FPS. Quantization to INT8 stores each weight in 1 byte instead of 2 (FP16) or 4 (FP32), which can reduce VRAM usage by 2–4× with minor accuracy loss. Common tasks include classification (e.g., ResNet), object detection (YOLO, DETR), segmentation (SAM), and image generation (Stable Diffusion). Local deployment avoids cloud latency and privacy concerns, but VRAM limits often force smaller batch sizes or lower resolution.
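To make the two backbone ideas concrete, here is a minimal pure-Python sketch of a sliding filter and of ViT-style patch counting. The tiny image and the hand-written vertical-edge kernel are hypothetical; real CNNs learn their filters during training, and real frameworks use optimized tensor operations rather than nested loops.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and return the response map.

    Like deep-learning frameworks, this computes cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def num_patches(height, width, patch):
    """Number of non-overlapping patch tokens a ViT would attend over."""
    return (height // patch) * (width // patch)


# Hypothetical 4x4 image: dark left half (0), bright right half (1).
IMAGE = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical vertical-edge filter: responds where brightness rises left to right.
VERTICAL_EDGE = [[-1, 1], [-1, 1]]

if __name__ == "__main__":
    print(conv2d_valid(IMAGE, VERTICAL_EDGE))  # strongest response at the edge column
    print(num_patches(224, 224, 16))           # tokens for a 224x224 input, 16x16 patches
```

The filter responds only where the dark and bright halves meet, which is the sense in which a CNN "detects" an edge. The patch count shows why ViT cost grows quickly with resolution: self-attention compares every patch with every other patch.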
Practical example
On an RTX 3060 12 GB, running YOLOv8n (nano) at 640×640 achieves ~80 FPS, using ~1 GB VRAM. Switching to YOLOv8x (extra-large) at the same resolution uses ~6 GB and drops to ~15 FPS. For a 4K video stream, operators may need to downscale frames or use a lighter model to maintain real-time performance.
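The real-time check above is a per-frame time budget. The sketch below assumes a 30 FPS stream (the text does not state the stream rate) and uses hypothetical helper names. It also shows how far a 4K frame shrinks when fitted into a 640×640 model input.

```python
def frame_budget_ms(stream_fps: float) -> float:
    """Time available per frame, in milliseconds, to keep up with the stream."""
    return 1000.0 / stream_fps


def keeps_up(model_fps: float, stream_fps: float) -> bool:
    """True if the model processes frames at least as fast as they arrive."""
    return model_fps >= stream_fps


def letterbox_size(src_w: int, src_h: int, target: int = 640):
    """Scale a frame to fit inside target x target, preserving aspect ratio."""
    scale = min(target / src_w, target / src_h)
    return round(src_w * scale), round(src_h * scale)


if __name__ == "__main__":
    stream_fps = 30  # assumed camera rate, not taken from the text
    print(f"budget: {frame_budget_ms(stream_fps):.1f} ms per frame")
    for name, fps in (("YOLOv8n", 80), ("YOLOv8x", 15)):
        print(name, "keeps up" if keeps_up(fps, stream_fps) else "falls behind")
    w, h = letterbox_size(3840, 2160)
    print(f"4K frame -> {w}x{h} ({3840 * 2160 // (w * h)}x fewer pixels)")
```

With these numbers the nano model has headroom at 30 FPS while the extra-large model would need frame skipping or a lower stream rate.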
Workflow example
In LM Studio, an operator can load a vision-language model like LLaVA 7B (Q4) and provide an image via the UI. The runtime encodes the image into embeddings using a CLIP vision encoder (~2 GB VRAM), then feeds them to the language model for captioning or question answering. In Ollama, the image is passed by including its file path in the prompt, for example ollama run llava:7b "describe this image: ./photo.jpg". Running with the --verbose flag prints token generation speed (~20 tok/s on an M2 Max), and memory usage is visible in the system monitor.
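The same image-plus-prompt request can also be sent to Ollama's local HTTP API, which accepts base64-encoded images in an images list. The sketch below assumes the default endpoint on localhost:11434 and a pulled llava:7b model. The helper names are hypothetical, and describe_image needs a running Ollama server, so it is defined but not called here.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava:7b") -> dict:
    """Build a non-streaming generate request with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def tokens_per_second(response: dict) -> float:
    """Generation speed from the response; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def describe_image(path: str, prompt: str = "describe this image") -> dict:
    """Send the image to a running Ollama server and return the parsed JSON reply."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())
```

With a server running, reply = describe_image("photo.jpg") returns the generated text in reply["response"], and tokens_per_second(reply) gives a figure comparable to the tok/s number an operator watches.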
Reviewed by Fredoline Eruo. See our editorial policy.