COCO
COCO (Common Objects in Context) is a large-scale image dataset created by Microsoft for object detection, segmentation, and captioning. It contains about 330K images, more than 200K of them labeled, covering 80 object categories with instance-level segmentation masks and bounding boxes. Operators encounter COCO as the standard benchmark for evaluating vision models: a model's COCO mAP (mean Average Precision) score summarizes how well it detects or segments objects on that benchmark. When selecting a vision model for local inference, such as YOLOv8 or DETR, the COCO mAP on the model card tells you its baseline accuracy before fine-tuning on custom data.
Deeper dive
COCO was released in 2014 and has become the de facto benchmark for object detection and segmentation. Its 80 categories cover everyday objects (person, car, dog, etc.) in complex scenes with occlusions and varying scales. Its annotation types include bounding boxes, instance segmentation masks, and captions (5 per image). The headline metric is AP (Average Precision) averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05; AP50 and AP75, the AP at IoU 0.50 and 0.75, are commonly reported alongside it. For operators, COCO matters because most pre-trained vision models (e.g., YOLOv8, EfficientDet, Mask R-CNN) report COCO AP as their primary accuracy metric. A model trained on COCO can be fine-tuned on custom datasets, but its COCO AP gives only a rough estimate of general-purpose detection quality. When running inference locally, COCO-trained models can detect the 80 COCO classes out of the box, which is useful for prototyping before fine-tuning on domain-specific data.
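To make the IoU thresholds concrete, here is a minimal sketch of how intersection over union is computed for two boxes in COCO's [x, y, width, height] layout, and which of the ten thresholds a single prediction would clear. The function names and the IOU_THRESHOLDS constant are illustrative, not part of any library.

```python
def iou_xywh(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds used for the headline AP: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Return the IoU thresholds at which pred_box would count as a match for gt_box."""
    iou = iou_xywh(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

A loosely placed box may count as correct at IoU 0.50 but fail at 0.75, which is why AP50 is typically higher than the averaged AP for the same model.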
Practical example
When you download a YOLOv8n model from Ultralytics, the model card states 'COCO AP 37.3', meaning it achieves 37.3% average precision on the COCO validation set. If you run yolo predict model=yolov8n.pt source=image.jpg, the model detects objects from COCO's 80 classes (person, car, etc.). The AP figure is an average over the whole validation set, so accuracy on any single image of yours may be better or worse. To detect custom objects (e.g., industrial defects), you'd fine-tune the COCO-pretrained model on your own dataset.
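For intuition about what a figure like 37.3 summarizes, the sketch below computes average precision for one class at one IoU threshold, using 101 evenly spaced recall levels in the style of the reference COCO evaluation. It is a simplification under stated assumptions: detections are already matched to ground truth (the is_true_positive flag is supplied by the caller), and crowd regions, object-size ranges, and per-image detection limits are ignored. The function name and input shape are hypothetical.

```python
def average_precision(scored_matches, num_ground_truth):
    """AP for one class at one IoU threshold.

    scored_matches: list of (confidence, is_true_positive) pairs, one per detection.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Make precision non-increasing as recall grows (the precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sample the envelope at recall = 0.00, 0.01, ..., 1.00 and average.
    total = 0.0
    for k in range(101):
        level = k / 100
        total += next((p for p, r in zip(precisions, recalls) if r >= level), 0.0)
    return total / 101
```

The reported number is then the mean of this quantity over all 80 classes and all ten IoU thresholds, which is why a single missed or misplaced box on one image barely moves it.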
Workflow example
In Hugging Face Transformers, loading a DETR model for inference with from transformers import DetrForObjectDetection; model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50') uses a COCO-pretrained checkpoint. When you run inference on an image, the model outputs bounding boxes and class labels from COCO's 80 categories. To evaluate your own model, you download the COCO 2017 validation set (about 1 GB of images, plus the annotations archive), write your detections to results.json, and run an evaluation script, for example python coco_eval.py --results results.json --ann_file instances_val2017.json, to compute mAP. Such scripts are typically thin wrappers around the pycocotools library.
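The results.json file in the evaluation step follows COCO's detection results format: a flat list of records, each with image_id, category_id, bbox as [x, y, width, height], and score. The sketch below converts corner-style boxes into that shape. The input layout (dicts with box_xyxy, category_id, and score keys) and both function names are assumptions made for illustration; adapt them to whatever your model actually returns.

```python
import json


def to_coco_results(image_id, detections):
    """Convert detections for one image into COCO results records.

    detections: list of dicts with hypothetical keys 'box_xyxy'
    ([x1, y1, x2, y2] in pixels), 'category_id', and 'score'.
    """
    results = []
    for det in detections:
        x1, y1, x2, y2 = det["box_xyxy"]
        results.append({
            "image_id": image_id,
            "category_id": det["category_id"],
            # COCO expects top-left corner plus width and height, not two corners.
            "bbox": [x1, y1, x2 - x1, y2 - y1],
            "score": det["score"],
        })
    return results


def write_results(results, path="results.json"):
    """Write the accumulated records for all images to a single JSON file."""
    with open(path, "w") as f:
        json.dump(results, f)
```

One detail that often trips people up: category_id must be the ID used in the annotation file, and COCO's 80 category IDs are not contiguous, so a model's 0 to 79 class index usually needs to be mapped back before writing the file.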
Reviewed by Fredoline Eruo. See our editorial policy.