Computer vision

Feature Extraction

Feature extraction is the process of converting raw input data (such as an image) into a compact set of numerical representations, or features, that capture the information most relevant to a task. In vision models, a convolutional neural network (CNN) or vision transformer processes an image through its layers, and the output of an intermediate layer, or of the final layer before any classification head, is taken as the feature vector. These vectors can be used for downstream tasks such as similarity search, clustering, or transfer learning. Operators encounter feature extraction when they use a model to generate embeddings rather than class predictions, for example when extracting a 512-dimensional vector from an image with a CLIP model. The quality of the features drives retrieval accuracy, and their dimensionality drives storage requirements.
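To make the idea of "stopping before the classification head" concrete, the following is a minimal sketch using only the Python standard library. TinyClassifier, its layer sizes, and its random weights are hypothetical and chosen purely for illustration; a real vision model would use a trained CNN or transformer backbone instead.

```python
import random


def linear(x, weights, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]


def random_matrix(rng, rows, cols):
    """Random weights; a stand-in for trained parameters (illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class TinyClassifier:
    """Hypothetical toy model: one backbone layer followed by a classification head."""

    def __init__(self, in_dim, feat_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.w_backbone = random_matrix(rng, feat_dim, in_dim)
        self.b_backbone = [0.0] * feat_dim
        self.w_head = random_matrix(rng, num_classes, feat_dim)
        self.b_head = [0.0] * num_classes

    def features(self, x):
        # Stop before the head: this activation is the feature vector.
        return relu(linear(x, self.w_backbone, self.b_backbone))

    def logits(self, x):
        # Full classification path: backbone, then head.
        return linear(self.features(x), self.w_head, self.b_head)


model = TinyClassifier(in_dim=8, feat_dim=4, num_classes=3)
x = [0.1 * i for i in range(8)]
feature_vector = model.features(x)  # length feat_dim, reusable downstream
class_scores = model.logits(x)      # length num_classes, task-specific
```

The feature vector has the backbone's width (4 here) regardless of how many classes the head predicts, which is why the same features can be reused for search or clustering.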

Practical example

Using CLIP ViT-B/32 via Hugging Face Transformers, an operator can pass an image through the model's image encoder to get a 512-element feature vector. CLIP has no classification head; the vector is the vision encoder's pooled output after a learned projection into the embedding space shared with text. This vector can be indexed with FAISS for image similarity search. On an RTX 3090, extracting features from a batch of 32 images at 224x224 takes about 0.5 seconds and produces 32 vectors of 512 floats each (64 KiB in total when stored as 32-bit floats). The same model can also extract text features in the same 512-dimensional space, enabling cross-modal retrieval.
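The storage figure and the similarity search step can be sketched without any external libraries. The snippet below assumes float32 storage (4 bytes per value) and uses random placeholder vectors in place of real CLIP embeddings; the brute-force top_k function is a hypothetical stand-in for what an exact inner-product index does, not FAISS itself.

```python
import math
import random

DIM = 512            # embedding size of CLIP ViT-B/32
BYTES_PER_FLOAT = 4  # assumption: vectors are stored as float32


def storage_bytes(num_vectors, dim=DIM, bytes_per_value=BYTES_PER_FLOAT):
    """Raw size of an uncompressed embedding matrix, in bytes."""
    return num_vectors * dim * bytes_per_value


def l2_normalize(v):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def top_k(query, index, k=3):
    """Return (position, score) pairs for the k highest inner products."""
    q = l2_normalize(query)
    scores = [(i, sum(a * b for a, b in zip(q, vec)))
              for i, vec in enumerate(index)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Placeholder vectors standing in for real image embeddings.
rng = random.Random(0)
index = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(DIM)])
         for _ in range(32)]

batch_kib = storage_bytes(32) / 1024  # 32 * 512 * 4 bytes = 64 KiB
hits = top_k(index[5], index, k=3)    # a stored vector should rank itself first
```

The same arithmetic scales linearly: one million such vectors would need roughly 2 GB uncompressed, which is the point at which compressed or approximate indexes tend to become attractive.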

Workflow example

In a typical workflow, an operator loads a vision model in Hugging Face Transformers with model = AutoModel.from_pretrained('openai/clip-vit-base-patch32') and calls model.get_image_features(pixel_values=pixel_values) to extract features, where pixel_values is the image tensor produced by the model's matching processor. In LM Studio, one can load a CLIP model and use the 'Embeddings' tab to generate image features. The resulting vectors are often saved to a vector store such as Chroma or a FAISS index for later retrieval. This pattern is common in multimodal RAG pipelines, where text queries are embedded with the same model and matched against the stored image vectors.
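The Transformers part of this workflow might look like the sketch below. It assumes that torch, transformers, and Pillow are installed, that the checkpoint can be downloaded, and that the installed Transformers release returns a plain tensor from get_image_features. The function name extract_image_features and the example file names are hypothetical.

```python
MODEL_NAME = "openai/clip-vit-base-patch32"


def extract_image_features(image_paths, model_name=MODEL_NAME):
    """Return L2-normalized CLIP image embeddings as a (N, D) NumPy array.

    D is 512 for the default checkpoint.
    """
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    # The processor resizes, crops, and normalizes images to the model's input format.
    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(images=images, return_tensors="pt")

    with torch.no_grad():  # inference only, no gradients needed
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])

    # Unit-length vectors make inner product equivalent to cosine similarity.
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()


# Example call (file names are placeholders):
# vectors = extract_image_features(["cat.jpg", "dog.jpg"])
# vectors.shape would be (2, 512) for this checkpoint
```

The returned array could then be added to an exact inner-product index such as faiss.IndexFlatIP(512) or written to a Chroma collection. Text queries would be embedded with the same model's get_text_features so that both modalities land in one space.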

Reviewed by Fredoline Eruo. See our editorial policy.