RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts; there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated

Object Detection

Object detection is a computer vision task that identifies and localizes objects within an image or video frame. Unlike classification, which assigns one label to the entire image, detection outputs a bounding box for each object along with a class label (e.g., 'person', 'car') and a confidence score. Operators encounter object detection when running models like YOLO, DETR, or SSD via frameworks such as Ultralytics, Hugging Face Transformers, or ONNX Runtime. The task matters for local AI because inference latency and VRAM usage scale with input resolution and model size, and post-processing cost grows with the number of candidate boxes; real-time detection (e.g., 30 FPS) requires efficient models and often quantization.
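To make the difference between classification and detection output concrete, here is a minimal sketch of what a single detection carries. The `Detection` class and the sample values are hypothetical and not tied to any particular framework; real libraries return their own result types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a class label, a confidence score, and a pixel-space box."""
    label: str
    score: float
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def width(self) -> float:
        return self.xmax - self.xmin

    @property
    def height(self) -> float:
        return self.ymax - self.ymin

    @property
    def center(self) -> tuple[float, float]:
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


# A classifier returns one label for the whole image...
classification_output = "street scene"

# ...while a detector returns one entry per object it finds (values are made up).
detection_output = [
    Detection("person", 0.91, 40, 60, 120, 300),
    Detection("car", 0.84, 200, 150, 520, 330),
]

for det in detection_output:
    print(f"{det.label} ({det.score:.2f}): {det.width:.0f}x{det.height:.0f} px, center {det.center}")
```

Storing corner coordinates and deriving width, height, and center from them mirrors the (xmin, ymin, xmax, ymax) convention that many detection APIs use.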

Deeper dive

Object detection models typically consist of a backbone (e.g., ResNet, EfficientNet) for feature extraction, a neck (e.g., FPN) that combines features at multiple scales, and a head that predicts bounding boxes and class probabilities. Two main paradigms exist: two-stage detectors (e.g., Faster R-CNN) first propose regions and then classify each one; one-stage detectors (e.g., YOLO, SSD) predict boxes and classes directly in a single pass, typically trading some accuracy for speed. Transformer-based detectors like DETR treat detection as a set prediction problem, removing hand-crafted components such as anchor boxes and non-maximum suppression. For local AI operators, the choice depends on hardware: YOLOv8-nano runs at ~100 FPS on an RTX 3060, while DETR may require a 24 GB card for high-resolution inputs. Quantization to INT8 shrinks weight memory by about 2x relative to FP16 and 4x relative to FP32, usually with minor accuracy loss.
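The INT8 savings mentioned above follow from the number of bytes stored per weight. The sketch below shows that arithmetic only; the parameter count is a hypothetical round number, and the estimate covers weights alone, not activations or runtime buffers, so real VRAM usage will be higher.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mib(num_params: int, precision: str) -> float:
    """Memory needed to hold the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 2)


# Hypothetical parameter count, chosen only for illustration.
num_params = 50_000_000

for precision in ("fp32", "fp16", "int8"):
    print(f"{precision}: {weight_memory_mib(num_params, precision):.1f} MiB")

# Ratios behind the 2x (from FP16) and 4x (from FP32) figures.
fp32_to_int8 = weight_memory_mib(num_params, "fp32") / weight_memory_mib(num_params, "int8")
fp16_to_int8 = weight_memory_mib(num_params, "fp16") / weight_memory_mib(num_params, "int8")
print(f"FP32 -> INT8: {fp32_to_int8:.0f}x, FP16 -> INT8: {fp16_to_int8:.0f}x")
```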

Practical example

A rig with an RTX 3060 (12 GB VRAM) running YOLOv8n (nano) via ONNX Runtime can process 640x640 images at ~100 FPS, using ~1 GB VRAM. Switching to YOLOv8x (extra-large) at the same resolution uses ~6 GB VRAM and runs at ~15 FPS. For higher accuracy on small objects, operators may increase input resolution to 1280x1280, which quadruples the pixel count and therefore roughly quadruples activation memory and per-frame latency; weight memory stays the same.
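The resolution trade-off can be checked with a few lines of arithmetic. This is a rough sketch that assumes per-frame cost scales linearly with pixel count, which is a simplification; the ~100 FPS baseline is the YOLOv8n figure quoted above, and the function names are hypothetical.

```python
def pixel_ratio(base_side: int, new_side: int) -> float:
    """How many times more pixels a square input has after changing its side length."""
    return (new_side / base_side) ** 2


def scaled_estimate(base_value: float, base_side: int, new_side: int) -> float:
    """Scale a per-frame cost linearly with pixel count (a simplifying assumption)."""
    return base_value * pixel_ratio(base_side, new_side)


ratio = pixel_ratio(640, 1280)

# ~100 FPS at 640x640 corresponds to ~10 ms per frame.
base_latency_ms = 1000 / 100
est_latency_ms = scaled_estimate(base_latency_ms, 640, 1280)
est_fps = 1000 / est_latency_ms

print(f"pixel ratio: {ratio:.0f}x")
print(f"estimated latency: {est_latency_ms:.0f} ms/frame, ~{est_fps:.0f} FPS")
```

Under that assumption, a model that reaches ~100 FPS at 640x640 would land near 25 FPS at 1280x1280, which is below the 30 FPS real-time target; measured numbers on a given rig may differ.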

Workflow example

In Hugging Face Transformers, operators load a detection model via pipeline('object-detection', model='facebook/detr-resnet-50'). The pipeline returns a list of dicts, each with a 'score', a 'label', and a 'box' dict holding xmin, ymin, xmax, and ymax. For real-time video, operators use YOLO via Ultralytics: from ultralytics import YOLO; model = YOLO('yolov8n.pt'); results = model(frame). The call returns a list of result objects, one per input image; results[0].boxes exposes xyxy (corner coordinates), cls (class indices), and conf (confidence scores). Monitoring VRAM with nvidia-smi helps avoid OOM errors when processing high-resolution streams.
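The workflow above can be sketched as a small script. The helper names (`run_detr`, `keep_confident`, `parse_memory_line`, `gpu_memory_mib`) and the 0.9 score cutoff are assumptions made for illustration, and the sample detections are made up. `run_detr` needs the transformers package (and, depending on the version, extras such as Pillow and timm) and downloads weights on first use; `gpu_memory_mib` needs nvidia-smi on the PATH.

```python
import subprocess


def run_detr(image_path: str) -> list[dict]:
    """Run the DETR pipeline named in the text on one image.

    transformers is imported lazily so the helpers below work without it.
    """
    from transformers import pipeline

    detector = pipeline("object-detection", model="facebook/detr-resnet-50")
    return detector(image_path)


def keep_confident(detections: list[dict], min_score: float = 0.9) -> list[dict]:
    """Drop detections whose score is below min_score (cutoff is an assumption)."""
    return [d for d in detections if d["score"] >= min_score]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line (values in MiB) into integers."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def gpu_memory_mib() -> list[tuple[int, int]]:
    """Return (used, total) VRAM in MiB for each GPU reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]


# Sample data in the pipeline's output shape (values are made up).
sample = [
    {"score": 0.98, "label": "cat", "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"score": 0.31, "label": "remote", "box": {"xmin": 5, "ymin": 5, "xmax": 40, "ymax": 30}},
]
confident = keep_confident(sample)

for det in confident:
    box = det["box"]
    print(det["label"], det["score"], (box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
```

On a machine with the dependencies installed, `keep_confident(run_detr("photo.jpg"))` would replace the sample list (the file name is a placeholder), and calling `gpu_memory_mib()` between frames gives a simple way to watch VRAM headroom during a long stream.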

Reviewed by Fredoline Eruo. See our editorial policy.
