RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated

Computer Vision (domain)

Computer vision is the field of AI that enables machines to interpret and process visual data (images, videos, or live camera feeds) by assigning labels, detecting objects, or reconstructing 3D scenes. In local AI, operators encounter computer vision through models like YOLO for real-time object detection or CLIP for image-text similarity. These models typically rely on GPU VRAM for inference: a 7B-parameter vision-language model at FP16 needs ~14 GB for weights alone (7 billion parameters × 2 bytes each), while small detectors such as YOLOv8 nano (a few million parameters) can run on CPU at ~30 FPS. The runtime loads image tensors, runs them through a convolutional or transformer backbone, and outputs bounding boxes, class probabilities, or embeddings.
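The ~14 GB figure is just parameter count times bytes per parameter. A minimal sketch of that arithmetic follows; the function name and the dtype table are illustrative, and the result covers weights only (activations, image buffers, and runtime overhead come on top).

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return memory for model weights only, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # A 7B-parameter model at several precisions (weights only).
    for dtype in ("fp16", "int8", "int4"):
        print(f"7B @ {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

Treat the output as a lower bound when checking whether a model fits on a given card.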

Deeper dive

Modern computer vision relies on deep neural networks, primarily convolutional neural networks (CNNs) and vision transformers (ViTs). CNNs use sliding filters to detect edges, textures, and higher-level features, while ViTs split images into patches and apply self-attention, often achieving higher accuracy at the cost of more compute. Operators choose models based on latency and accuracy trade-offs: a YOLOv8 model (CNN) runs at ~100 FPS on an RTX 4090, whereas DETR, which pairs a CNN backbone with a transformer encoder-decoder, may run at ~10 FPS. Quantizing weights to INT8 cuts their memory by 2× relative to FP16 and 4× relative to FP32, usually with minor accuracy loss. Common tasks include classification (e.g., ResNet), object detection (YOLO, DETR), segmentation (SAM), and image generation (Stable Diffusion). Local deployment avoids cloud latency and privacy concerns, but VRAM limits often force smaller batch sizes or lower resolution.
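One way to see why patch-based self-attention costs more compute as resolution grows: the number of patch tokens scales with image area, and attention compares every token with every other. The sketch below assumes square, non-overlapping 16×16 patches (a common ViT setting, not a property of any specific model named here); the function names are illustrative.

```python
def vit_token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(tokens: int) -> int:
    """Token-to-token comparisons in one full self-attention layer."""
    return tokens * tokens


if __name__ == "__main__":
    for side in (224, 640):
        n = vit_token_count(side, side)
        print(f"{side}x{side}: {n} tokens, {attention_pairs(n)} attention pairs")
```

Going from 224×224 to 640×640 multiplies the token count by roughly 8 and the attention pairs by roughly 67, which is one reason lower resolution is a common lever on constrained hardware.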

Practical example

On an RTX 3060 12 GB, running YOLOv8n (nano) at 640×640 achieves ~80 FPS, using ~1 GB VRAM. Switching to YOLOv8x (extra-large) at the same resolution uses ~6 GB and drops to ~15 FPS. For a 4K video stream, operators may need to downscale frames or use a lighter model to maintain real-time performance.
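The FPS figures above translate directly into a per-frame time budget, and downscaling a 4K frame to the model's input size is simple aspect-ratio arithmetic. The following is a generic letterbox sketch under stated assumptions (a square 640-pixel target, symmetric padding); it is not the exact preprocessing of any particular YOLO implementation, and the function names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per frame at a given frame rate."""
    return 1000.0 / fps


def letterbox_size(src_w: int, src_h: int, target: int = 640):
    """Scale a frame to fit a square target, preserving aspect ratio.

    Returns (new_w, new_h, pad_x, pad_y), where the pads are the borders
    added on each side to fill the square.
    """
    scale = min(target / src_w, target / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y


if __name__ == "__main__":
    print(f"80 FPS -> {frame_budget_ms(80):.1f} ms per frame")
    print(f"15 FPS -> {frame_budget_ms(15):.1f} ms per frame")
    print("4K frame ->", letterbox_size(3840, 2160))
```

A 3840×2160 frame shrinks to 640×360 with 140-pixel bars above and below, so small objects in the original lose a factor of six in linear resolution.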

Workflow example

In LM Studio, an operator can load a vision-language model like LLaVA 7B (Q4) and provide an image via the UI. The runtime encodes the image into embeddings using a CLIP vision encoder (~2 GB VRAM), then feeds them to the language model for captioning or question answering. In Ollama, the image path goes inside the prompt itself, for example ollama run llava:7b "describe this image: ./photo.jpg". Adding --verbose prints the generation rate after the response (~20 tok/s on an M2 Max), and VRAM usage is visible in the system monitor.
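The same workflow can be scripted against Ollama's local HTTP API instead of the CLI. The sketch below assumes an Ollama server on its default port (11434) with llava:7b already pulled; photo.jpg is a placeholder path, and the helper names are illustrative. It sends the image as base64 in the images field of /api/generate and derives tokens per second from the eval_count and eval_duration fields of the response.

```python
import base64
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava:7b") -> dict:
    """Build a non-streaming generate request carrying one base64 image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def tokens_per_second(body: dict) -> float:
    """Generation rate from the response; eval_duration is in nanoseconds."""
    return body["eval_count"] / (body["eval_duration"] / 1e9)


def describe_image(path: str, prompt: str = "Describe this image.") -> dict:
    """Send one image to the local server and return the parsed JSON reply."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    reply = describe_image("photo.jpg")  # placeholder image path
    print(reply["response"])
    print(f"{tokens_per_second(reply):.1f} tok/s")
```

Setting stream to False keeps the example short by returning a single JSON object; a streaming client would read one JSON object per line instead.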

Reviewed by Fredoline Eruo. See our editorial policy.
