Computer vision

Image Classification

Image classification is a computer vision task in which a model assigns a single label from a predefined set to an input image. For operators running local AI, this means loading a vision model (e.g., ResNet or ViT) and passing an image through it to get a class prediction. A multimodal model like LLaVA can also be prompted to name a category, though it generates text rather than scoring a fixed label set. A classifier outputs a probability distribution over classes, and the highest-probability class is the prediction. Practical considerations include model size (e.g., ResNet-50 at ~100 MB in FP32) and inference speed, which depends on the GPU's compute and memory bandwidth, the numeric precision, and the batch size; VRAM capacity mainly limits which models and batch sizes fit. Quantization reduces model size and often speeds up inference, usually with only a small accuracy loss.
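To make the "probability distribution, then pick the highest" step concrete, here is a minimal sketch in plain Python. The label names and logit values are made up for illustration; a real model would produce one logit per class in its label set.

```python
import math

def softmax(logits):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, labels):
    """Return (label, probability) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical logits for a three-class toy problem.
labels = ["cat", "dog", "fox"]
logits = [1.2, 3.4, 0.3]

label, prob = predict(logits, labels)
print(f"{label}: {prob:.3f}")
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits gives the same predicted class; the softmax step only matters when the probability itself is needed, for example to apply a confidence threshold.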

Deeper dive

Image classification models typically consist of a backbone (e.g., convolutional layers or a vision transformer) that extracts features, followed by a classifier head (usually a fully connected layer) that maps those features to class scores. Training uses cross-entropy loss on labeled datasets like ImageNet-1k (1,000 classes). For local deployment, operators often use pretrained models from Hugging Face or torchvision. Inference can be done with frameworks like PyTorch or ONNX Runtime, or with llama.cpp's multimodal support for models such as LLaVA. Key operator concerns are VRAM usage (e.g., ViT-L/14 has roughly 300 million parameters, or about 0.6 GB of weights at FP16, with total usage higher once activations and framework overhead are counted), latency (e.g., ~10 ms on an RTX 4090 for ResNet-50), and accuracy trade-offs when quantizing to int8 or int4. Batch inference improves throughput but requires more VRAM, because activation memory grows with batch size.
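The weight-memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes an approximate count of 25.6 million parameters for ResNet-50 and counts weights only; activations, the CUDA context, and any quantization metadata come on top, so treat the results as a lower bound.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_mib(num_params, precision):
    """Estimate weight storage in MiB (weights only, no activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**20

# Approximate parameter count for ResNet-50 (an assumption for illustration).
RESNET50_PARAMS = 25_600_000

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_mib(RESNET50_PARAMS, precision):.1f} MiB")
```

The same function works for any model once its parameter count is known, which makes it a quick check on whether several models can share one GPU.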

Practical example

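An operator wants to identify the dog breed in a photo. They load a pretrained ResNet-50 from torchvision with torchvision.models.resnet50(weights=ResNet50_Weights.DEFAULT); the older pretrained=True argument is deprecated. ImageNet's 1,000 classes include over a hundred dog breeds, so the stock classifier head can be used as-is. The weights are ~98 MB in FP32. On an RTX 3060 12GB, inference takes ~15 ms per image at batch size 1. To fit more models or run faster, they export the model to ONNX and quantize it to int8 with ONNX Runtime, reducing size to ~25 MB and latency to ~8 ms, at the cost of about a 0.5% accuracy drop on ImageNet.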
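The trade-off in this example can be put in numbers. The short sketch below takes the size and latency figures quoted above as given (they are the example's numbers, not new measurements) and derives the size reduction, the speedup, and the images-per-second throughput at batch size 1.

```python
def throughput(latency_ms, batch_size=1):
    """Images processed per second for a given per-batch latency."""
    return batch_size * 1000.0 / latency_ms

# Figures from the example above: ResNet-50 on an RTX 3060 12GB.
fp32 = {"size_mb": 98, "latency_ms": 15}
int8 = {"size_mb": 25, "latency_ms": 8}

size_ratio = fp32["size_mb"] / int8["size_mb"]
speedup = fp32["latency_ms"] / int8["latency_ms"]

print(f"size reduction: {size_ratio:.2f}x")
print(f"speedup:        {speedup:.2f}x")
print(f"fp32 throughput: {throughput(fp32['latency_ms']):.1f} images/s")
print(f"int8 throughput: {throughput(int8['latency_ms']):.1f} images/s")
```

Note that the size shrinks by close to 4x while latency improves by less than 2x: weight storage scales directly with precision, whereas latency also includes preprocessing, kernel launch, and memory transfer costs that quantization does not remove.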

Workflow example

In a local AI workflow using Hugging Face Transformers, the operator runs: from transformers import AutoImageProcessor, AutoModelForImageClassification; processor = AutoImageProcessor.from_pretrained('google/vit-base-patch16-224'); model = AutoModelForImageClassification.from_pretrained('google/vit-base-patch16-224'). They then preprocess an image with inputs = processor(images=image, return_tensors='pt') and run outputs = model(**inputs). The predicted class index is obtained via predicted_class_idx = outputs.logits.argmax(-1).item(), and model.config.id2label[predicted_class_idx] maps it to a human-readable label. By default from_pretrained loads the weights into system RAM; calling model.to('cuda') and inputs.to('cuda') moves the model and the inputs into VRAM. On an 8 GB GPU, ViT-base (~330 MB in FP32) fits easily.
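Put together, the workflow above might look like the following sketch. It assumes torch, transformers, and Pillow are installed; the function names classify and top_k and the file name in the usage note are hypothetical, chosen for illustration. The heavy libraries are imported inside classify so that the small ranking helper can be used on its own.

```python
def top_k(probs, id2label, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

def classify(image_path, model_id="google/vit-base-patch16-224", k=5):
    """Classify one image file and return its top-k labels with probabilities."""
    # Imported here so top_k() stays usable without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModelForImageClassification.from_pretrained(model_id)
    model = model.to(device).eval()

    # Convert to RGB so grayscale or RGBA files match the expected input.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)

    # inference_mode() disables gradient tracking to save memory and time.
    with torch.inference_mode():
        logits = model(**inputs).logits

    probs = logits.softmax(dim=-1)[0].tolist()
    return top_k(probs, model.config.id2label, k)
```

Calling classify("dog.jpg") on a local photo would return a list of (label, probability) pairs. The first call downloads the model weights into the local Hugging Face cache; later calls reuse the cached copy. For repeated use, loading the processor and model once and reusing them avoids paying the load time on every image.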

Reviewed by Fredoline Eruo. See our editorial policy.