Face Recognition

Face recognition is a computer vision task that identifies or verifies a person from an image or video frame by comparing facial features against a database of known faces. In local AI, operators run face recognition models (e.g., InsightFace, FaceNet) to tag photos, secure access, or monitor video feeds. These models extract a face embedding (a fixed-size vector) and match it against stored embeddings using a similarity or distance measure (e.g., cosine similarity). Throughput depends on model size and on available GPU VRAM, which caps the batch size; lighter models run faster on consumer GPUs but may give up some accuracy.

Deeper dive

Face recognition pipelines typically involve three stages: detection, alignment, and recognition. Detection locates faces in an image (e.g., using MTCNN or RetinaFace). Alignment normalizes each face (rotation, scale) to a canonical pose. Recognition passes the aligned face through a deep neural network (e.g., a model trained with the ArcFace loss, or FaceNet) to produce a 128- to 512-dimensional embedding.

During enrollment, embeddings are stored per identity. During inference, the system computes the distance between the query embedding and each enrolled embedding and returns the closest identity only if that distance falls below a threshold (equivalently, if the similarity rises above one); otherwise the face is treated as unknown.

Operators using local AI must weigh model size (e.g., MobileFaceNet at ~4 MB vs. ResNet-100 at ~250 MB) against inference latency. Batch processing on a GPU can handle multiple faces per frame, but VRAM limits the batch size. Reduced precision and quantization (FP16, INT8) cut memory use and speed up inference, typically with little accuracy loss. Popular local frameworks include InsightFace (PyTorch training code, ONNX models for inference), DeepFace (a wrapper around several models), and OpenCV's DNN module.
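The enrollment and matching steps described above can be sketched in plain Python. This is a minimal illustration, not any library's API: the `Gallery` class, its method names, the choice to average several samples into one template per identity, and the toy 4-dimensional vectors are all assumptions made for the example. Real embeddings would come from a recognition model and have 128 to 512 dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity in [-1, 1]; higher means more alike."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


class Gallery:
    """Hypothetical in-memory store of one template embedding per identity."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # minimum similarity to accept a match
        self.templates = {}

    def enroll(self, identity, embeddings):
        # Assumption: average the normalized samples into a single template.
        normed = [l2_normalize(e) for e in embeddings]
        dim = len(normed[0])
        mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
        self.templates[identity] = l2_normalize(mean)

    def identify(self, query):
        """Return (identity, score), or (None, score) if nothing clears the threshold."""
        best_name, best_score = None, -1.0
        for name, template in self.templates.items():
            score = cosine_similarity(query, template)
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= self.threshold:
            return best_name, best_score
        return None, best_score


# Toy data standing in for real model output.
gallery = Gallery(threshold=0.6)
gallery.enroll("alice", [[0.9, 0.1, 0.0, 0.1], [1.0, 0.0, 0.1, 0.0]])
gallery.enroll("bob", [[0.0, 1.0, 0.2, 0.0]])
print(gallery.identify([0.95, 0.05, 0.05, 0.05]))  # close to alice's template
print(gallery.identify([0.0, 0.0, 0.0, 1.0]))      # far from both templates
```

Returning the score alongside the identity, even for rejected queries, makes it easier to tune the threshold against real footage.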

Practical example

An operator runs InsightFace on an RTX 3060 12 GB to recognize family members in a home security camera feed. With the lightweight MobileFaceNet model (about 2 MB in FP16), the pipeline runs at 30 FPS at 640x480 resolution. Each detected face is compared against a local database of 50 embeddings, with the cosine similarity threshold set to 0.6. VRAM usage stays under 2 GB, leaving room for other tasks. Switching to the more accurate ResNet-100 model (about 125 MB in FP16, half of its roughly 250 MB FP32 size) raises VRAM usage to 4 GB and drops throughput to 15 FPS, but reduces false positives.
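The size figures in this example follow from simple arithmetic, sketched below under stated assumptions. Weight storage scales with bytes per parameter, so the approximate FP32 sizes quoted earlier (about 4 MB and about 250 MB) are halved for FP16 and quartered for INT8. The helper names and the 512-dimensional float32 embedding are assumptions for illustration, and exported model files can deviate from these estimates.

```python
# Bytes used to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(fp32_size_mb, precision):
    """Estimate weight size at another precision from the FP32 size."""
    return fp32_size_mb * BYTES_PER_PARAM[precision] / BYTES_PER_PARAM["fp32"]


def embedding_db_bytes(num_identities, dim=512, bytes_per_value=4):
    """Size of a database holding one float32 embedding per identity."""
    return num_identities * dim * bytes_per_value


for model, fp32_mb in [("MobileFaceNet", 4), ("ResNet-100", 250)]:
    for precision in ("fp32", "fp16", "int8"):
        print(f"{model:14s} {precision}: ~{weight_size_mb(fp32_mb, precision):g} MB")

# 50 identities at 512 float32 values each.
print(f"database: {embedding_db_bytes(50) / 1024:g} KiB")
```

Both the weights and a 50-entry database are small next to the multi-gigabyte VRAM figures, which suggests that most of the memory in a run like this goes to the inference runtime, activation buffers, and the detector rather than to the recognition weights themselves.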

Workflow example

In a local AI workflow using Python and InsightFace, the operator creates the pipeline with insightface.app.FaceAnalysis(name='buffalo_l'), calls app.prepare() to select the device and detection size, and builds a database of known face embeddings. For each webcam frame, app.get(img) returns the detected faces, each already carrying its embedding. A custom script compares each embedding against the database, using np.dot on L2-normalized vectors for cosine similarity or scipy.spatial.distance.cosine for cosine distance (1 minus the similarity). If the best similarity exceeds the threshold, the script logs the identity and timestamp. Ollama does not support face recognition natively; operators instead run InsightFace through ONNX Runtime, or pair a detector from Hugging Face Transformers with a separate embedding model. Note that hustvl/yolos-small is a general-purpose object detector that finds people rather than faces, so locating faces requires a face-specific detector or checkpoint.
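A minimal sketch of this loop follows, assuming a recent insightface release with onnxruntime and opencv-python installed. The function names `best_match` and `run`, the print-based logging, and the layout of the database (a list of names plus an array of normalized embeddings) are assumptions for illustration; only `FaceAnalysis`, `prepare`, `get`, and `normed_embedding` come from InsightFace.

```python
import datetime

import numpy as np


def best_match(embedding, names, gallery, threshold=0.6):
    """Return (name, similarity) for the closest enrolled identity.

    gallery is an (N, D) array of L2-normalized embeddings and embedding is a
    normalized (D,) vector, so each dot product is a cosine similarity.
    The name is None when the best similarity is below the threshold.
    """
    if len(names) == 0:
        return None, 0.0
    sims = gallery @ embedding
    idx = int(np.argmax(sims))
    score = float(sims[idx])
    return (names[idx] if score >= threshold else None), score


def run(names, gallery, camera_index=0):
    """Log recognized identities from a webcam until the stream ends."""
    # Imported lazily so best_match stays usable without a camera or GPU.
    import cv2
    from insightface.app import FaceAnalysis

    app = FaceAnalysis(
        name="buffalo_l",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0: first GPU, -1: CPU

    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()  # BGR frame, the layout InsightFace expects
            if not ok:
                break
            for face in app.get(frame):
                name, score = best_match(face.normed_embedding, names, gallery)
                if name is not None:
                    stamp = datetime.datetime.now().isoformat(timespec="seconds")
                    print(f"{stamp} {name} similarity={score:.2f}")
    finally:
        cap.release()
```

The database can be built the same way: run app.get() on one reference photo per person, keep each face's normed_embedding, and stack the vectors with np.stack before calling run(names, gallery). Stacking them lets a single matrix product score a face against every identity at once.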

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work