Computer vision

R-CNN family (Fast/Faster/Mask)

The R-CNN family is a series of two-stage object detection architectures built around region proposals: candidate boxes are generated first, then classified and refined. R-CNN (Region-based CNN) extracts region proposals via selective search, then classifies each with a CNN. Fast R-CNN improves speed by sharing CNN computation across proposals and by training the classifier and box regressor jointly in a single stage. Faster R-CNN replaces selective search with a Region Proposal Network (RPN), making detection nearly real-time. Mask R-CNN extends Faster R-CNN to also predict a pixel-level segmentation mask for each detected object. Operators typically encounter these as pre-trained models in torchvision or detectron2, for tasks like counting objects in images.

Deeper dive

The R-CNN family marks a progression in object detection, with each step removing a bottleneck of the previous one. The original R-CNN (2014) used selective search to generate about 2000 region proposals per image, then ran a CNN separately on each cropped and warped region, which made it slow (about 47 s per image on a GPU with a VGG16 backbone). Fast R-CNN (2015) introduced RoI pooling: the entire image passes through the CNN once, and a fixed-size feature is then extracted for each proposal from the shared feature map, reducing runtime to ~0.3 s per image, not counting the time to generate proposals. Faster R-CNN (2015) replaced the external proposal method with a learned RPN that shares convolutional layers with the detection network, achieving ~0.2 s per image including proposal generation. Mask R-CNN (2017) adds a parallel branch that predicts segmentation masks and replaces RoI pooling with RoIAlign, which avoids coordinate rounding and so preserves spatial detail. These models are typically paired with backbones like ResNet-50 or ResNet-101 and are available in frameworks like detectron2, torchvision, and MMDetection. For local inference they require moderate VRAM: a Faster R-CNN with ResNet-50 may need ~2-4 GB, while Mask R-CNN can require ~4-6 GB, depending on image size.
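To make the RoI pooling idea concrete, here is a minimal sketch in plain Python. It assumes a single-channel feature map and proposals already expressed in feature-map cells; the function name roi_max_pool and the toy 4x4 map are hypothetical, chosen only for illustration. Library implementations such as torchvision.ops.roi_pool also handle batches, channels, and the scale between image and feature-map coordinates.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a 2D feature map into an out_h x out_w grid.

    feature_map: list of rows (single channel, for simplicity).
    roi: (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    if h < 1 or w < 1:
        raise ValueError("region must cover at least one cell")

    pooled = []
    for i in range(out_h):
        # Floor for the bin start, ceil for the bin end, so every bin
        # covers at least one cell even when the region is small.
        r0 = y1 + (i * h) // out_h
        r1 = y1 + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = x1 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# The "expensive" feature map is computed once per image...
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]

# ...and every proposal reuses it, yielding a fixed-size output per region.
proposals = [(0, 0, 4, 4), (1, 1, 4, 3)]
pooled = [roi_max_pool(feature_map, roi, 2, 2) for roi in proposals]
for roi, grid in zip(proposals, pooled):
    print(roi, grid)
# (0, 0, 4, 4) [[5, 7], [13, 15]]
# (1, 1, 4, 3) [[6, 7], [10, 11]]
```

Both proposals produce a 2x2 output regardless of their size, which is what lets a single fully connected head process all regions. The integer bin boundaries are the rounding step that RoIAlign removes by sampling the feature map with bilinear interpolation instead.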

Practical example

An operator running a security camera feed on an RTX 3060 (12 GB VRAM) might use a pre-trained Faster R-CNN from torchvision to detect people and vehicles. Loading the model (torchvision.models.detection.fasterrcnn_resnet50_fpn) uses ~2.5 GB VRAM, leaving room for video frames. Inference on a 640x480 frame takes ~50-100 ms, which corresponds to ~10-20 FPS. Note that torchvision's default transform resizes each input so its shorter side is 800 pixels, so a small frame is upscaled before detection unless min_size is lowered. For segmentation, Mask R-CNN would use ~4 GB VRAM and run at ~5-10 FPS on the same hardware.
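The latency and FPS figures above are related by simple arithmetic, and it is worth measuring them on your own hardware rather than relying on estimates. The sketch below uses only the standard library; the helper names fps_from_latency_ms and measure_latency_ms are hypothetical, and the 12 GB and 2.5 GB values are the approximate figures from the example, not measurements.

```python
import time


def fps_from_latency_ms(latency_ms):
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


def measure_latency_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds.

    Warm-up calls are excluded because the first few runs are often slower.
    For GPU models, fn should wait for the device to finish (for example
    by calling torch.cuda.synchronize()), otherwise the timing is too low.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


# Figures quoted in the example above (approximate, not measured here).
print(fps_from_latency_ms(50), fps_from_latency_ms(100))  # 20.0 10.0

total_vram_gb, model_vram_gb = 12.0, 2.5
print(total_vram_gb - model_vram_gb)  # 9.5 GB of headroom for frames
```

In practice, pass a function that runs one inference step on a representative frame to measure_latency_ms, then feed the result to fps_from_latency_ms.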

Workflow example

In a Python script using torchvision, operators load a pre-trained Faster R-CNN: model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT"). Older releases used pretrained=True, which is deprecated since torchvision 0.13. They switch the model to inference mode with model.eval(), then run it on a list of image tensors: with torch.no_grad(): outputs = model(images). The output is a list with one dictionary per image, containing bounding boxes ("boxes"), class labels ("labels"), and confidence scores ("scores"). For custom training, operators can fine-tune on their own dataset using detectron2's config system, adjusting batch size and image size to fit VRAM. For example, training on an RTX 3090 (24 GB) might use batch size 2 with images resized to 800x1333.
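The inference flow described above can be sketched as follows. This assumes torchvision 0.13 or newer (for the weights API) and a working torch install; the function names load_detector, filter_detections, and run_demo, the 0.5 score threshold, and the random placeholder frame are all illustrative choices, not part of any library.

```python
def load_detector():
    """Load a COCO-pretrained Faster R-CNN plus its class-name list."""
    # Imported lazily so the pure-Python helper below works without torch.
    import torch
    from torchvision.models.detection import (
        FasterRCNN_ResNet50_FPN_Weights,
        fasterrcnn_resnet50_fpn,
    )

    weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    model = fasterrcnn_resnet50_fpn(weights=weights)
    model.eval()  # inference mode: returns detections instead of losses
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return model.to(device), weights.meta["categories"], device


def filter_detections(boxes, labels, scores, categories, wanted, min_score=0.5):
    """Keep detections whose class name is in `wanted` and score >= min_score.

    boxes, labels, scores are plain Python lists (use tensor.tolist()).
    Returns a list of (class_name, score, box) tuples.
    """
    kept = []
    for box, label, score in zip(boxes, labels, scores):
        name = categories[label]
        if score >= min_score and name in wanted:
            kept.append((name, score, box))
    return kept


def run_demo():
    """Run one inference pass on a placeholder frame and print the hits."""
    import torch

    model, categories, device = load_detector()
    # Placeholder: a real script would load a camera frame as a float
    # tensor of shape (3, H, W) with values in [0, 1].
    frame = torch.rand(3, 480, 640)
    with torch.no_grad():
        output = model([frame.to(device)])[0]  # one dict per input image
    hits = filter_detections(
        output["boxes"].tolist(),
        output["labels"].tolist(),
        output["scores"].tolist(),
        categories,
        wanted={"person", "car", "truck", "bus"},
    )
    for name, score, box in hits:
        print(f"{name}: {score:.2f} at {box}")
```

Calling run_demo() downloads the weights on first use. A random frame will usually yield no confident detections, so swap in a real image to see results. The score threshold is a judgment call: lower values catch more objects at the cost of more false positives.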

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work