Semantic Segmentation
Semantic segmentation is a computer vision task that assigns a class label (e.g., 'car', 'road', 'person') to every pixel in an image. Unlike object detection, which draws bounding boxes around objects, segmentation produces a pixel‑wise mask. Operators encounter it when running models such as SegFormer or DeepLab on local hardware. Closely related models such as YOLOv8‑seg (instance segmentation) and SAM (Segment Anything Model, which produces masks without class labels) are often grouped under the same heading. The output is a segmentation map, often visualized as a color‑coded overlay. VRAM matters because activation tensors grow with pixel count: a 1024×1024 input has four times as many pixels as a 512×512 input, and inference may need 2–4 GB of VRAM, depending on model size and batch size.
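The link between resolution and memory can be made concrete with a rough calculation. The sketch below estimates the size of a single dense logits tensor only. The 150‑class count and float32 precision are illustrative assumptions, and logits_bytes is a hypothetical helper, not part of any library. Real usage is higher, since model weights and intermediate activations also occupy VRAM, and many models compute logits at reduced resolution before upsampling.

```python
def logits_bytes(height, width, num_classes, batch_size=1, bytes_per_value=4):
    """Bytes needed to hold one dense per-pixel logits tensor.

    bytes_per_value=4 assumes float32; use 2 for float16.
    """
    return batch_size * num_classes * height * width * bytes_per_value


# Assumed setup for illustration: 150 classes, float32, batch size 1.
size_1024 = logits_bytes(1024, 1024, num_classes=150)
size_512 = logits_bytes(512, 512, num_classes=150)

print(f"1024x1024 logits: {size_1024 / 2**20:.0f} MiB")
print(f"512x512 logits:   {size_512 / 2**20:.0f} MiB")
```

Doubling both image dimensions quadruples this tensor, which is why lowering input resolution is usually the first lever for fitting a model into limited VRAM.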
Deeper dive
Semantic segmentation models typically use encoder‑decoder architectures. The encoder (a backbone such as ResNet or EfficientNet) compresses the image into feature maps, and the decoder upsamples them back to the original resolution, outputting per‑pixel logits. U‑Net and DeepLab are well‑known architectures built on this pattern. A softmax (for mutually exclusive classes) or sigmoid (for binary or multi‑label masks) converts logits to probabilities, and taking the argmax over classes gives each pixel its label. Related tasks include instance segmentation (separates each object instance) and panoptic segmentation (combines semantic and instance segmentation). Operators running local AI often use quantized versions (e.g., INT8) to reduce VRAM footprint. Inference speed depends on resolution and model depth: as a rough guide, a lightweight model like Fast‑SCNN runs at ~30 FPS on an RTX 3060, while a heavy model like Mask2Former may drop to ~5 FPS on the same hardware.
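The last step of the decoder, turning per‑pixel logits into a label map, can be sketched in NumPy. This is a minimal illustration that assumes logits laid out as (num_classes, height, width) and uses softmax for mutually exclusive classes. The function name logits_to_class_map and the toy values are made up for the example.

```python
import numpy as np


def logits_to_class_map(logits):
    """Convert (num_classes, H, W) logits into probabilities and a class-ID map."""
    # Subtract the per-pixel maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=0, keepdims=True)  # softmax over the class axis
    class_map = probs.argmax(axis=0)              # most likely class per pixel
    return probs, class_map


# Toy 2x2 image with 3 classes (values chosen by hand for illustration).
logits = np.array([
    [[2.0, 0.1], [0.3, 0.2]],  # class 0
    [[0.5, 1.5], [0.1, 0.4]],  # class 1
    [[0.1, 0.2], [3.0, 0.3]],  # class 2
])

probs, class_map = logits_to_class_map(logits)
print(class_map)
```

Because softmax preserves ordering, the argmax of the probabilities equals the argmax of the raw logits, so the softmax can be skipped when only labels are needed.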
Practical example
An operator wants to segment street scenes from a dashcam feed. Using YOLOv8n‑seg (nano, 3.4M parameters) at 640×640 resolution, inference takes ~10 ms per frame on an RTX 3060 and uses ~1.5 GB VRAM. YOLOv8‑seg is an instance segmentation model, so its output is a set of per‑object masks, each with a class ID and a confidence score. These can be merged into a single 640×640 array of class IDs and overlaid on the original image. For higher accuracy, switching to YOLOv8x‑seg (71.8M parameters) at the same resolution increases VRAM to ~3 GB and latency to ~40 ms per frame.
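The overlay step can be sketched with NumPy, assuming the model output is already available as a single array of class IDs with the same height and width as the image. The palette, the alpha value, and the overlay_class_map helper are illustrative choices, not part of any model's API.

```python
import numpy as np


def overlay_class_map(image, class_map, palette, alpha=0.5):
    """Blend a color-coded class map over an RGB image.

    image:     (H, W, 3) uint8 RGB frame
    class_map: (H, W) integer class IDs
    palette:   (num_classes, 3) uint8, one RGB color per class
    alpha:     opacity of the color layer, 0.0 to 1.0
    """
    colors = palette[class_map]  # look up one color per pixel -> (H, W, 3)
    blended = (1.0 - alpha) * image.astype(np.float32) + alpha * colors.astype(np.float32)
    return blended.round().astype(np.uint8)


# Toy data: a uniform gray 2x2 frame and three hypothetical classes.
palette = np.array([[0, 0, 0], [200, 0, 0], [0, 200, 0]], dtype=np.uint8)
image = np.full((2, 2, 3), 100, dtype=np.uint8)
class_map = np.array([[0, 1], [2, 1]])

overlay = overlay_class_map(image, class_map, palette)
print(overlay[0, 1])  # pixel labeled class 1, tinted toward red
```

For a real 640×640 frame the same function applies unchanged, since the color lookup and blend are vectorized over all pixels.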
Workflow example
LM Studio and Ollama are built to serve language models and do not run segmentation models such as SAM 2, so operators typically load segmentation models through Python libraries instead. In Hugging Face Transformers, the pipeline is: from transformers import pipeline; segmenter = pipeline('image-segmentation', model='nvidia/segformer-b0-finetuned-ade-512-512'); result = segmenter('image.jpg'). The result is a list of dictionaries, one per segment, each with a label, a score (not always populated for semantic models), and a mask returned as a PIL image. Operators often post‑process masks with OpenCV to extract contours or calculate area.
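Mask post‑processing can be sketched as follows. To keep the example free of extra dependencies it uses NumPy rather than OpenCV, and it covers area and bounding box only; contour extraction would need something like OpenCV's findContours and is not shown. The results list is hand‑built to resemble pipeline output, with a NumPy array standing in for the mask image, and mask_stats is a hypothetical helper.

```python
import numpy as np


def mask_stats(mask):
    """Return pixel area, covered fraction, and bounding box of a binary mask.

    Any nonzero pixel counts as foreground. The bounding box is
    (x_min, y_min, x_max, y_max) in pixel coordinates, inclusive.
    """
    binary = np.asarray(mask) > 0
    area = int(binary.sum())
    if area == 0:
        return {"area": 0, "fraction": 0.0, "bbox": None}
    rows = np.flatnonzero(binary.any(axis=1))  # rows containing foreground
    cols = np.flatnonzero(binary.any(axis=0))  # columns containing foreground
    bbox = (int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1]))
    return {"area": area, "fraction": area / binary.size, "bbox": bbox}


# Hand-built stand-in for one pipeline result (assumed structure).
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 255  # a 2x3 block of foreground pixels
results = [{"label": "road", "score": None, "mask": mask}]

stats = {r["label"]: mask_stats(r["mask"]) for r in results}
print(stats["road"])
```

With real pipeline output, passing the PIL mask through np.asarray, as the helper already does, should yield the same kind of array.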
Reviewed by Fredoline Eruo. See our editorial policy.