Computer vision

Image Segmentation

Image segmentation is a computer vision task that partitions an image into multiple segments or regions, each corresponding to a distinct object or part of an object. Unlike object detection (bounding boxes) or classification (whole-image labels), segmentation assigns a label to every pixel. Two main types exist: semantic segmentation (all pixels of the same class get the same label, e.g., all 'road' pixels) and instance segmentation (each individual object instance gets a separate label, e.g., 'car 1', 'car 2'). Operators encounter segmentation when they need pixel-level precision for tasks like background removal, medical imaging, or autonomous driving. Models like SAM (Segment Anything) or YOLOv8-seg run on local hardware, with VRAM demands scaling with image resolution and model size.
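To make the semantic versus instance distinction concrete, here is a minimal sketch on a toy 3x6 mask. The class ids and the `label_instances` helper are hypothetical, and connected-component labelling is only a stand-in: real instance segmentation models predict instances directly and can separate objects that touch.

```python
from collections import deque

# Toy semantic mask: 0 = background, 1 = "car" (class ids are hypothetical).
SEMANTIC = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]


def label_instances(semantic, target_class):
    """Give each 4-connected blob of target_class its own id (1, 2, ...)."""
    rows, cols = len(semantic), len(semantic[0])
    instances = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            # Skip pixels of other classes and pixels already assigned.
            if semantic[r][c] != target_class or instances[r][c]:
                continue
            next_id += 1
            instances[r][c] = next_id
            queue = deque([(r, c)])
            # Breadth-first flood fill over the 4-neighbourhood.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and semantic[ny][nx] == target_class and not instances[ny][nx]:
                        instances[ny][nx] = next_id
                        queue.append((ny, nx))
    return instances, next_id


instance_map, count = label_instances(SEMANTIC, target_class=1)
```

Both blobs share the semantic label 1, while the instance map gives the left blob id 1 and the right blob id 2.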

Deeper dive

Image segmentation models typically use encoder-decoder architectures. The encoder extracts features, and the decoder upsamples them to the original resolution, outputting a per-pixel class probability map (semantic) or a set of per-instance masks with class labels (instance). Common architectures include U-Net (widely used in medical imaging), Mask R-CNN (instance segmentation), and transformer-based models like SAM. For local inference, operators must consider resolution: activation memory grows with pixel count, so a 1024x1024 image needs roughly 4x the activation memory of a 512x512 image, while the memory taken by the model weights stays fixed. Quantization (e.g., FP16 to INT8) can reduce memory but may slightly degrade mask accuracy. SAM, for example, runs at ~2-3 seconds per image on an RTX 3090 with FP16, and can be sped up by using fewer prompts or a lower resolution. Some models (like YOLOv8-seg) are optimized for real-time segmentation on consumer GPUs.
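The decoder output described above can be turned into a label map by taking the most probable class at each pixel. The following is a minimal sketch in plain Python, assuming the decoder emits one list of raw class scores (logits) per pixel; the `logits_to_labels` helper and the toy 2x2 input are hypothetical, and real pipelines would do this on tensors.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_labels(logit_map):
    """logit_map[y][x] is a list of per-class scores for that pixel.

    Returns (label_map, confidence_map) with the winning class index and
    its probability for every pixel.
    """
    label_map, confidence_map = [], []
    for row in logit_map:
        label_row, conf_row = [], []
        for pixel_logits in row:
            probs = softmax(pixel_logits)
            best = max(range(len(probs)), key=probs.__getitem__)
            label_row.append(best)
            conf_row.append(probs[best])
        label_map.append(label_row)
        confidence_map.append(conf_row)
    return label_map, confidence_map


# Toy 2x2 "image" with 3 classes per pixel (values are made up).
LOGITS = [
    [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]],
    [[0.0, 1.5, 0.3], [0.2, 0.9, 0.4]],
]
labels, confidences = logits_to_labels(LOGITS)
```

The confidence map is useful in practice for thresholding uncertain pixels, for example along object boundaries.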

Practical example

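Running SAM (Segment Anything) on an RTX 3060 12GB: loading the ViT-H model (about 636M params; the checkpoint file is roughly 2.4 GB) at FP16 uses ~6 GB VRAM. Segmenting a 1024x1024 image adds ~2 GB for intermediate tensors, totaling ~8 GB, which fits comfortably. Processing a 2048x2048 input at native resolution would need roughly 4x the intermediate memory (~8 GB), pushing the total past 12 GB, forcing CPU offload and slowing inference from ~3 seconds to ~30 seconds. Note that SAM's default preprocessing resizes the longest image side to 1024 pixels, so with the stock pipeline the encoder cost stays roughly constant and extra memory comes mainly from dense prompting and from upscaling masks to the original size. Operators can reduce resolution or use the smaller ViT-B model (91M params) to stay within VRAM.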
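The budget arithmetic in this example can be sketched as a small helper. It uses the figures quoted above (~6 GB for the loaded model, ~2 GB of intermediate tensors at 1024x1024) and assumes intermediate memory scales with pixel count; the function name and the scaling rule are illustrative assumptions, not measurements.

```python
def estimate_vram_gb(model_gb, base_activation_gb, base_side, side):
    """Rough VRAM estimate for a square input of the given side length.

    Assumes model memory is fixed and activation memory scales with the
    number of pixels relative to a measured baseline resolution.
    """
    scale = (side / base_side) ** 2
    return model_gb + base_activation_gb * scale


VRAM_GB = 12  # RTX 3060 12GB

for side in (512, 1024, 2048):
    total = estimate_vram_gb(model_gb=6, base_activation_gb=2, base_side=1024, side=side)
    verdict = "fits" if total <= VRAM_GB else "exceeds VRAM"
    print(f"{side}x{side}: ~{total:.1f} GB ({verdict})")
```

An estimate like this is only a first check; actual usage also depends on framework overhead and memory fragmentation, so it is worth confirming on the target GPU.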

Workflow example

LM Studio is designed for language models and does not offer a drag-and-drop 'Segment' workflow, so a SAM checkpoint such as 'facebook/sam-vit-base' from Hugging Face is typically run from Python or a dedicated segmentation tool. In Python with Hugging Face Transformers, the workflow is: from transformers import SamModel, SamProcessor; model = SamModel.from_pretrained('facebook/sam-vit-base'); processor = SamProcessor.from_pretrained('facebook/sam-vit-base'); inputs = processor(images=image, input_points=[[[x, y]]], return_tensors='pt'); outputs = model(**inputs); masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs['original_sizes'].cpu(), inputs['reshaped_input_sizes'].cpu()). Here image is a PIL image and (x, y) is the pixel coordinate of a point on the object to segment; the result holds one mask tensor per input image, resized to the original resolution. VRAM usage can be monitored with nvidia-smi.
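For the monitoring step, one option is to poll nvidia-smi from Python using its CSV query mode. This is a minimal sketch: the helper names are hypothetical, it assumes nvidia-smi is on the PATH, and with the `nounits` option the values are reported in MiB.

```python
import subprocess

# Query used and total memory per GPU as bare CSV numbers (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines like '8123, 12288' into (used_mib, total_mib) tuples."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def read_gpu_memory():
    """Run nvidia-smi once and return one (used_mib, total_mib) per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling `read_gpu_memory()` before and after a segmentation run gives a rough view of how much VRAM the model and its intermediate tensors occupy.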

Reviewed by Fredoline Eruo.