Neural network architectures

Autoencoder

An autoencoder is a neural network trained to reconstruct its input after passing it through a bottleneck layer. The bottleneck forces the network to learn a compressed representation (the latent space) of the data. In local AI, autoencoders appear in anomaly detection (e.g., flagging unusual system logs by their high reconstruction error) and as building blocks for larger models like Stable Diffusion, where a VAE compresses images into latent space for efficient generation. The key operator concern is that the encoder and decoder are separate sub-networks with their own weights: a workflow that needs both (such as image-to-image) must keep both in VRAM, while one that only decodes latents (such as text-to-image) can in principle get by with the decoder alone.

Deeper dive

Autoencoders consist of an encoder that maps the input to a lower-dimensional latent code and a decoder that reconstructs the input from that code. Training minimizes reconstruction error (e.g., MSE). Variants include denoising autoencoders (corrupt the input, learn to recover the clean version) and variational autoencoders (VAEs), which output a distribution over latent space and so enable generative sampling. In practice, VAEs are used in image generation pipelines: the VAE encoder compresses a 512x512 image to a 64x64 latent with 4 channels, reducing compute for the diffusion model. Operators running Stable Diffusion locally usually see the VAE as a single weights file (or bundled inside the main checkpoint) that contains both encoder and decoder; with roughly 84M parameters it takes about 170 MB of VRAM for weights at fp16, or about 335 MB at fp32. Autoencoders are also used for dimensionality reduction (a linear autoencoder trained with MSE learns the same subspace as PCA) and for pretraining feature extractors.
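The encoder/decoder structure and the reconstruction objective can be sketched without any framework. The following is a minimal, dependency-free illustration rather than a production implementation: it assumes a toy dataset of 2-D points on the line y = 2x, a single latent value as the bottleneck, and hand-written gradients. The names DATA, encode, decode, mse, and train, along with the learning rate and step count, are hypothetical choices made for this example; a real autoencoder would normally be built with a library such as PyTorch.

```python
# Minimal linear autoencoder: 2-D input -> 1-D latent -> 2-D reconstruction.
# All names and hyperparameters here are illustrative assumptions.

# Toy data: points on the line y = 2x, so one latent number is enough.
DATA = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(w_enc, x):
    """Project a 2-D point onto a single latent value (the bottleneck)."""
    return w_enc[0] * x[0] + w_enc[1] * x[1]


def decode(w_dec, z):
    """Map the latent value back to a 2-D reconstruction."""
    return (w_dec[0] * z, w_dec[1] * z)


def mse(w_enc, w_dec, data):
    """Mean squared reconstruction error over all samples and dimensions."""
    total = 0.0
    for x in data:
        x_hat = decode(w_dec, encode(w_enc, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / (2 * len(data))


def train(data, steps=500, lr=0.1):
    """Full-batch gradient descent on the reconstruction MSE."""
    w_enc, w_dec = [0.1, 0.1], [0.1, 0.1]
    scale = 2.0 / (2 * len(data))  # derivative of the mean of squared errors
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(w_enc, x)
            x_hat = decode(w_dec, z)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            # Gradient with respect to the decoder weights.
            g_dec[0] += scale * err[0] * z
            g_dec[1] += scale * err[1] * z
            # Backpropagate through the decoder into the latent, then the encoder.
            g_z = scale * (err[0] * w_dec[0] + err[1] * w_dec[1])
            g_enc[0] += g_z * x[0]
            g_enc[1] += g_z * x[1]
        w_enc = [w - lr * g for w, g in zip(w_enc, g_enc)]
        w_dec = [w - lr * g for w, g in zip(w_dec, g_dec)]
    return w_enc, w_dec


initial_loss = mse([0.1, 0.1], [0.1, 0.1], DATA)
w_enc, w_dec = train(DATA)
final_loss = mse(w_enc, w_dec, DATA)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Because the toy data is exactly one-dimensional, a single latent value can reconstruct it almost perfectly. On real data the bottleneck discards information, and the remaining reconstruction error is what anomaly detectors threshold on.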

Practical example

When running Stable Diffusion 1.5 locally, the VAE compresses a 512x512 RGB image (786,432 values) into a 64x64x4 latent (16,384 values), a 48x reduction. The diffusion model operates on this latent, then the VAE decoder reconstructs the final image. Loading the full VAE adds roughly 170 MB of fp16 weights (about 335 MB at fp32) on top of the ~4 GB used by the rest of the pipeline, whose UNet has roughly 860M parameters. On an RTX 3060 12 GB, this fits comfortably; on an 8 GB card it also fits at 512x512, though higher resolutions or larger batches inflate activation memory and may force system-RAM offload.
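The compression arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch: the helper names tensor_values and weight_megabytes are hypothetical, and the weight-size estimate assumes a VAE parameter count of about 84 million, which is an approximation introduced here rather than a figure taken from a specific model card. It counts weights only and ignores activation memory.

```python
def tensor_values(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels


def weight_megabytes(num_params, bytes_per_param):
    """Memory for weights alone, in MB (10^6 bytes); activations excluded."""
    return num_params * bytes_per_param / 1e6


image_values = tensor_values(512, 512, 3)   # RGB image
latent_values = tensor_values(64, 64, 4)    # 4-channel latent
reduction = image_values / latent_values

# Assumption: approximate parameter count for the VAE.
VAE_PARAMS = 84_000_000
vae_fp16_mb = weight_megabytes(VAE_PARAMS, 2)  # 2 bytes per fp16 weight
vae_fp32_mb = weight_megabytes(VAE_PARAMS, 4)  # 4 bytes per fp32 weight

print(f"image values:  {image_values}")
print(f"latent values: {latent_values}")
print(f"reduction:     {reduction:.0f}x")
print(f"VAE weights:   {vae_fp16_mb:.0f} MB fp16, {vae_fp32_mb:.0f} MB fp32")
```

The same two helpers can be reused for other resolutions or precisions, for example to see how the latent grows when generating at 768x768.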

Workflow example

In Ollama, autoencoders are not directly exposed, but a related component appears in models like llava (vision-language), where a CLIP vision encoder extracts image features. That encoder is trained contrastively rather than as an autoencoder and has no decoder, but it plays the same role of mapping an image to a compact representation. When you run ollama run llava:7b and provide an image, the runtime loads the vision encoder (several hundred MB) alongside the language model. In Hugging Face Diffusers, you can load a VAE with AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-mse') and call vae.encode(pixel_values).latent_dist.sample() to get latents. Operators monitoring VRAM usage via nvidia-smi will see only per-process totals, so encoder and decoder weights do not show up as separate entries; a finer breakdown requires framework-level tools such as torch.cuda.memory_allocated().
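The Diffusers workflow can be sketched as a round trip through the VAE. This is a hedged example, not the only way to do it: it assumes the torch and diffusers packages are installed and that the stabilityai/sd-vae-ft-mse weights can be downloaded. The function names latent_shape and encode_decode are hypothetical helpers written for this illustration; AutoencoderKL, encode, latent_dist.sample, and decode are the library calls being demonstrated.

```python
def latent_shape(height, width, downscale=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size.

    Assumes the 8x spatial downscale and 4 latent channels used by the
    Stable Diffusion 1.x VAE.
    """
    return (latent_channels, height // downscale, width // downscale)


def encode_decode(pixel_values, model_id="stabilityai/sd-vae-ft-mse"):
    """Round-trip a batch of images through the VAE.

    pixel_values: float tensor of shape (B, 3, H, W), scaled to [-1, 1].
    Returns the sampled latents and the reconstructed images.
    """
    # Imported lazily so the shape helper works without these packages.
    import torch
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained(model_id)
    vae.eval()
    with torch.no_grad():
        # encode() returns a distribution over latents; draw one sample.
        latents = vae.encode(pixel_values).latent_dist.sample()
        # decode() maps latents back to image space.
        reconstruction = vae.decode(latents).sample
    return latents, reconstruction


print(latent_shape(512, 512))
```

Passing a (1, 3, 512, 512) tensor to encode_decode should yield latents of shape (1, 4, 64, 64), matching latent_shape(512, 512) with a batch dimension added. The first call downloads the weights, and everything runs on the CPU unless the model and the input are moved to a GPU.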

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work