Neural network architectures

Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN) is a neural network architecture that uses convolutional layers to process grid-like data, such as images. Convolution applies a small filter (kernel) across the input, detecting local patterns like edges or textures. CNNs are common in computer vision tasks (classification, detection, segmentation) but rarely used in text generation. Operators encounter them mainly when running multimodal models (e.g., LLaVA) that pair a vision encoder (e.g., CLIP) with a language model. That encoder can be a CNN such as ResNet, although many current models use a Vision Transformer (ViT) instead. Either way, the encoder turns images into embeddings that the language model can attend to. VRAM usage depends on image resolution and the encoder's depth: higher resolution means more VRAM for feature maps.
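To make "applies a small filter across the input" concrete, here is a minimal pure-Python sketch of a 2D convolution (strictly, the cross-correlation that deep learning frameworks call convolution). The toy image, the hand-picked vertical-edge kernel, and the function name conv2d are illustrative assumptions; a real CNN learns its kernels and runs on tensors, not nested lists.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding 2D cross-correlation over a list-of-lists image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the result.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Toy 4x6 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [0, 3, 3, 0]: the filter fires only at the edge
```

The output is nonzero only where the kernel straddles the boundary between dark and bright columns, which is the sense in which a filter "detects" a local pattern.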

Deeper dive

CNNs are built from convolutional layers, pooling layers, and fully connected layers. Convolutional layers apply learned filters that slide (convolve) over the input, producing feature maps. Pooling (e.g., max pooling) downsamples these maps, reducing spatial dimensions and computational load. Early layers detect simple patterns (edges, colors); deeper layers combine them into complex features (faces, objects). Key hyperparameters are kernel size (e.g., 3x3), stride (step size), padding (to preserve dimensions), and number of filters (output channels). In local AI, CNNs appear in multimodal models: CLIP ships with either a ViT (Vision Transformer) or a ResNet CNN image encoder, and LLaVA uses a CLIP vision encoder (the official releases use ViT-L/14, which is a Transformer rather than a CNN). Operators rarely train CNNs from scratch; they use pretrained encoders from Hugging Face or torchvision. VRAM impact: as a rough figure that varies with precision and framework, a 224x224 image through ResNet-50 uses ~200 MB for feature maps. Doubling the resolution to 448x448 gives four times as many pixels, so feature-map memory roughly quadruples.
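The effect of kernel size, stride, and padding on feature-map size follows the standard formula out = floor((in + 2*padding - kernel) / stride) + 1. The sketch below applies it to the well-known ResNet stem (a 7x7 stride-2 convolution with 64 filters, then a 3x3 stride-2 max pool). The helper names are illustrative, and the FP32 assumption of 4 bytes per value is ours.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output size along one axis for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_bytes(channels, height, width, bytes_per_value=4):
    """Memory for one feature map; 4 bytes per value assumes FP32."""
    return channels * height * width * bytes_per_value


for image_size in (224, 448):
    # ResNet stem: 7x7 conv, stride 2, padding 3, 64 output channels.
    after_conv = conv_out_size(image_size, kernel=7, stride=2, padding=3)
    # Followed by a 3x3 max pool, stride 2, padding 1.
    after_pool = conv_out_size(after_conv, kernel=3, stride=2, padding=1)
    stem_mib = feature_map_bytes(64, after_conv, after_conv) / 2**20
    print(f"{image_size}px -> conv {after_conv} -> pool {after_pool}, "
          f"stem feature map {stem_mib:.2f} MiB")
```

A 224-pixel input shrinks to 112 and then 56 per side; a 448-pixel input gives 224 and then 112, so every feature map holds four times as many values. That is where the quadrupling of activation memory comes from.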

Practical example

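Running LLaVA 1.6 (7B) on an RTX 3090: the vision encoder (CLIP ViT-L/14) processes a 336x336 image into 576 tokens, a 24x24 grid of 14x14-pixel patches. LLaVA 1.6 can also split larger images into several such tiles, which multiplies the token count. This encoder is a Vision Transformer, not a CNN; its only convolution is the patch-embedding layer. For comparison, a CNN backbone such as ResNet-101 downsamples by a factor of 32, so a 224x224 image produces a 7x7 feature map (49 tokens if each spatial position becomes one token). VRAM: as a rough budget, allow ~1 GB for encoder weights and ~0.5 GB for activations at batch size 1. If you increase resolution to 448x448, the ResNet feature map becomes 14x14 (196 tokens), which roughly quadruples activation memory while leaving weight memory unchanged. This matters because total VRAM (24 GB) must fit the encoder, the language model (7B at Q4, ~4 GB), and the KV cache.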
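The token counts in this example are simple grid arithmetic, sketched below. The function names are illustrative, the stride of 32 is the standard total downsampling of a ResNet, and the 1.5 GB encoder figure is only the rough budget quoted in the text (weights plus activations), not a measurement.

```python
def vit_tokens(image_size, patch_size):
    """Patch tokens for a square image split into square patches."""
    side = image_size // patch_size
    return side * side


def cnn_tokens(image_size, total_stride=32):
    """Spatial positions in the final feature map of a stride-32 CNN."""
    side = image_size // total_stride
    return side * side


print(vit_tokens(336, 14))                # 576 (24 x 24 patches)
print(cnn_tokens(224))                    # 49  (7 x 7 feature map)
print(cnn_tokens(448))                    # 196 (14 x 14 feature map)
print(cnn_tokens(448) / cnn_tokens(224))  # 4.0: doubling resolution quadruples tokens

# Rough VRAM budget on a 24 GB card, using the approximate figures from the text.
TOTAL_VRAM_GB = 24.0
LANGUAGE_MODEL_GB = 4.0   # 7B model at Q4 (approximate)
ENCODER_GB = 1.5          # assumed: ~1 GB weights + ~0.5 GB activations
headroom_gb = TOTAL_VRAM_GB - LANGUAGE_MODEL_GB - ENCODER_GB
print(headroom_gb)        # 18.5 GB left for KV cache and runtime overhead
```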

Workflow example

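In LM Studio, loading a multimodal model like LLaVA: the UI shows two model files, a vision encoder (e.g., clip-vit-large-patch14) and a language model. When you drop an image, LM Studio runs the vision encoder first: it resizes the image, passes it through the encoder (for CLIP ViT, a patch-embedding convolution followed by Transformer layers), and outputs embeddings. These embeddings are prepended to the text prompt tokens. In llama.cpp, the same pipeline needs both files on the command line, with the encoder/projector passed via --mmproj, for example ./llama-mtmd-cli -m llava-v1.6-7b.gguf --mmproj mmproj.gguf --image cat.jpg (mmproj.gguf is a placeholder filename; older builds ship the tool as llama-llava-cli). The runtime log reports how many tokens the image was encoded into (e.g., 576); that is the encoder output. To our knowledge llama.cpp has no --image-size flag: the encoder's input resolution is fixed by the model, so if VRAM is tight, pick a model with a lower-resolution encoder, or downscale the image beforehand so that tiling models encode fewer tiles.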
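The "prepended to the text prompt tokens" step can be pictured as plain sequence concatenation. This is a toy sketch, not how LM Studio or llama.cpp is implemented: the tuples stand in for embedding vectors, splitting on whitespace stands in for a real tokenizer, and the 4096-token context window is an assumed value.

```python
N_IMAGE_TOKENS = 576   # encoder output for one 336x336 image
N_CTX = 4096           # assumed context window, for illustration only


def build_sequence(image_tokens, prompt_tokens):
    """Place image embeddings ahead of the text prompt, as described above."""
    return image_tokens + prompt_tokens


# Placeholders standing in for real embedding vectors.
image_tokens = [("img", i) for i in range(N_IMAGE_TOKENS)]
# Whitespace split is a stand-in for a real tokenizer.
prompt_tokens = [("txt", word) for word in "Describe this image .".split()]

sequence = build_sequence(image_tokens, prompt_tokens)
remaining = N_CTX - len(sequence)

print(len(sequence))   # 580 positions consumed before generation starts
print(remaining)       # 3516 positions left for the reply
```

The practical point is that every image costs context as well as VRAM: the image tokens occupy KV cache slots just like text tokens do.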

Reviewed by Fredoline Eruo. See our editorial policy.
