Computer vision

Object Detection

Object detection is a computer vision task that identifies and localizes specific objects within an image or video frame. Unlike classification, which assigns one label to the entire image, detection outputs a bounding box around each object along with a class label (e.g., 'person', 'car') and a confidence score. Operators encounter object detection when running models like YOLO, DETR, or SSD via frameworks such as Hugging Face Transformers or ONNX Runtime. The task matters for local AI because inference latency and VRAM usage grow with input resolution, and post-processing cost grows with the number of candidate boxes; real-time detection (e.g., 30 FPS) requires efficient models and often quantization.

Deeper dive

Object detection models typically consist of a backbone (e.g., ResNet, EfficientNet) for feature extraction, a neck (e.g., FPN) for multi-scale features, and a head that predicts bounding boxes and class probabilities. Two main paradigms exist: two-stage detectors (e.g., Faster R-CNN) first propose regions, then classify each; one-stage detectors (e.g., YOLO, SSD) predict boxes and classes directly in a single pass, historically trading some accuracy for speed. Transformer-based detectors like DETR treat detection as a set prediction problem, removing hand-crafted components such as anchor boxes and non-maximum suppression. For local AI operators, the choice depends on hardware: YOLOv8-nano runs at ~100 FPS on an RTX 3060, while DETR may require a 24 GB card for high-resolution inputs. Quantization to INT8 can reduce weight memory by about 2x relative to FP16 or 4x relative to FP32, usually with minor accuracy loss.
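The 2-4x range for quantization follows from bytes per parameter. The sketch below is only that arithmetic: it covers weights alone and ignores activations and runtime overhead. The parameter count is a hypothetical figure chosen for illustration, not a measurement of any model named above.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mib(num_params: int, precision: str) -> float:
    """Approximate memory for model weights only, in MiB."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 2)


# Hypothetical parameter count, for illustration only.
example_params = 50_000_000

for precision in ("fp32", "fp16", "int8"):
    mib = weight_memory_mib(example_params, precision)
    print(f"{precision}: {mib:.1f} MiB of weights")
```

Real VRAM savings are usually smaller than the ratio suggests, because activations, workspace buffers, and the runtime itself do not shrink in the same proportion.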

Practical example

A rig with an RTX 3060 (12 GB VRAM) running YOLOv8n (nano) via ONNX Runtime can process 640x640 images at ~100 FPS, using ~1 GB VRAM. Switching to YOLOv8x (extra-large) at the same resolution uses ~6 GB VRAM and runs at ~15 FPS. For higher accuracy on small objects, operators may increase input resolution to 1280x1280, which quadruples the pixel count and therefore roughly quadruples activation memory and latency; memory used by the weights themselves does not change.
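The figures above can be put against a real-time budget with a little arithmetic. This sketch only converts the quoted FPS numbers to per-frame time and computes the pixel-count ratio between resolutions; the function names are illustrative, and the 4x factor is a rough proxy for cost rather than a measured result.

```python
def frame_time_ms(fps: float) -> float:
    """Time available (or spent) per frame, in milliseconds."""
    return 1000.0 / fps


def pixel_scale(old_side: int, new_side: int) -> float:
    """Ratio of pixel counts between two square input resolutions."""
    return (new_side / old_side) ** 2


budget_ms = frame_time_ms(30)    # real-time target from the text
nano_ms = frame_time_ms(100)     # YOLOv8n figure quoted above
xlarge_ms = frame_time_ms(15)    # YOLOv8x figure quoted above
scale = pixel_scale(640, 1280)   # 640x640 -> 1280x1280

print(f"30 FPS budget: {budget_ms:.1f} ms per frame")
print(f"YOLOv8n: {nano_ms:.1f} ms, fits budget: {nano_ms <= budget_ms}")
print(f"YOLOv8x: {xlarge_ms:.1f} ms, fits budget: {xlarge_ms <= budget_ms}")
print(f"Pixel count grows {scale:.0f}x at 1280x1280")
```

On these numbers the nano model leaves headroom at 640x640, while the extra-large model already misses a 30 FPS budget before any resolution increase.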

Workflow example

In Hugging Face Transformers, operators load a detection model via pipeline('object-detection', model='facebook/detr-resnet-50'). The pipeline returns a list of dicts, each with a 'score', a 'label', and a 'box' dict holding xmin, ymin, xmax, and ymax. For real-time video, operators use YOLO via Ultralytics: from ultralytics import YOLO; model = YOLO('yolov8n.pt'); results = model(frame). The call returns a list with one result per input image, so results[0].boxes.xyxy holds the box corners and results[0].boxes.cls holds the class indices. VRAM monitoring with nvidia-smi helps avoid OOM errors when processing high-resolution streams.
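A minimal sketch of post-processing the Transformers pipeline output described above. To stay runnable without a GPU or a model download, it works on a hand-written sample in the same shape (a list of dicts with 'score', 'label', and 'box'); the sample values, the filter_detections helper, and the 0.5 threshold are all illustrative assumptions, not part of any library.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) in pixels


def filter_detections(
    detections: List[Dict], min_score: float = 0.5
) -> List[Tuple[str, float, Box]]:
    """Keep detections at or above min_score and flatten each box dict."""
    kept = []
    for det in detections:
        if det["score"] >= min_score:
            box = det["box"]
            corners = (box["xmin"], box["ymin"], box["xmax"], box["ymax"])
            kept.append((det["label"], det["score"], corners))
    return kept


# Hand-written sample in the pipeline's output shape (values are made up).
sample = [
    {"score": 0.98, "label": "person",
     "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"score": 0.35, "label": "car",
     "box": {"xmin": 200, "ymin": 50, "xmax": 400, "ymax": 150}},
    {"score": 0.71, "label": "car",
     "box": {"xmin": 300, "ymin": 60, "xmax": 500, "ymax": 180}},
]

for label, score, corners in filter_detections(sample, min_score=0.5):
    print(f"{label} ({score:.2f}) at {corners}")
```

With a real model, the sample list would be replaced by the return value of the pipeline call on an image; the filtering logic stays the same.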

Reviewed by Fredoline Eruo. See our editorial policy.
