Multi-Layer Perceptron (MLP)
A Multi-Layer Perceptron (MLP) is a feedforward neural network composed of at least three layers: an input layer, one or more hidden layers, and an output layer. Each layer consists of neurons fully connected to the next, with nonlinear activation functions (e.g., ReLU) between layers. In transformer-based language models, an MLP block follows the attention mechanism in each layer and transforms each token's representation independently of the other tokens. For operators, MLP layers are a major contributor to model size and VRAM usage: in Llama 2 7B, for example, the MLP weights account for roughly two-thirds (about 64%) of total parameters.
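The two-thirds figure can be checked roughly from published model dimensions. The sketch below uses the Llama 2 7B configuration (hidden size 4096, MLP width 11008, 32 layers, vocabulary 32000) and assumes a gated MLP with three matrices per layer, ignoring the small normalization weights. Variable names are illustrative.

```python
# Llama 2 7B dimensions (published configuration values).
hidden = 4096          # model (embedding) width
intermediate = 11008   # MLP inner width
layers = 32
vocab = 32000

# Per layer: attention has four hidden x hidden projections (Q, K, V, output);
# the gated MLP has three hidden x intermediate matrices (gate, up, down).
attn_per_layer = 4 * hidden * hidden
mlp_per_layer = 3 * hidden * intermediate

# Input embeddings plus an untied output projection.
# Assumption: normalization weights are ignored because they are tiny.
embeddings = 2 * vocab * hidden

mlp_total = layers * mlp_per_layer
total = layers * (attn_per_layer + mlp_per_layer) + embeddings
mlp_share = mlp_total / total

print(f"MLP parameters: {mlp_total / 1e9:.2f}B of {total / 1e9:.2f}B ({mlp_share:.0%})")
```

Under these assumptions the MLP share comes out near 64%. Models with different attention layouts or vocabulary sizes will shift the ratio somewhat.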
Deeper dive
The classic transformer MLP consists of two linear transformations with a nonlinear activation in between, often expressed as MLP(x) = W2 * GELU(W1 * x). The first linear layer expands the hidden dimension, typically by 4× (e.g., from 768 to 3072 in GPT-2 small), and the second projects back. Gated variants such as SwiGLU (used in Llama 2 and Llama 3) replace the single activation with a gated one, adding a third weight matrix: MLP(x) = W_down * (SiLU(W_gate * x) ⊙ (W_up * x)), where ⊙ is elementwise multiplication. To keep the parameter count comparable, these models use a smaller expansion factor, about 2.7× in Llama 2 7B (from 4096 to 11008). The expansion factor is a key design choice affecting model capacity and memory footprint. For operators, the MLP's size directly impacts quantization decisions: Q4-quantized MLP weights reduce VRAM use but may slightly degrade output quality. During inference, MLP computations are matrix multiplications that benefit from GPU tensor cores when many tokens are processed at once (prompt processing or batching); token-by-token generation is typically memory-bandwidth bound, especially on CPU.
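To make the two forms concrete, here is a minimal pure-Python sketch of a GELU MLP and a SwiGLU-style gated MLP on toy dimensions. It assumes no bias terms and uses SiLU as the gate activation; the function and weight names are illustrative, not taken from any particular library.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gelu(v):
    """Exact GELU, using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def mlp_gelu(x, W1, W2):
    """Classic block: expand, apply GELU, project back."""
    return matvec(W2, [gelu(h) for h in matvec(W1, x)])

def mlp_swiglu(x, W_gate, W_up, W_down):
    """Gated block: SiLU(gate) scales the up projection elementwise."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])

def rand_matrix(rows, cols, rng):
    """Small random weights, for demonstration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff = 8, 32  # toy sizes; real models use thousands
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

y_gelu = mlp_gelu(x, rand_matrix(d_ff, d_model, rng), rand_matrix(d_model, d_ff, rng))
y_swiglu = mlp_swiglu(
    x,
    rand_matrix(d_ff, d_model, rng),   # gate projection
    rand_matrix(d_ff, d_model, rng),   # up projection
    rand_matrix(d_model, d_ff, rng),   # down projection
)
print(len(y_gelu), len(y_swiglu))  # both outputs have d_model entries
```

The gated version needs three matrices instead of two, which is why models that use it tend to pick a smaller inner width to keep the parameter budget similar.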
Practical example
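In Llama 3.1 8B, each transformer layer has an MLP with three weight matrices (gate, up, down) due to SwiGLU. Each matrix is 4096 × 14336, so at FP16 the three together take about 0.35 GB per layer. With 32 layers, the MLP weights alone come to about 11.3 GB, roughly 70% of the model's ~16 GB FP16 footprint. That fits on a 24 GB RTX 4090 but not on a 16 GB card once the KV cache and activations are included. Quantizing to Q4_K_M (roughly 4.8 bits per weight on average) reduces the MLP to about 0.1 GB per layer and the full model to roughly 5 GB, which typically fits on GPUs with 8 GB of VRAM at modest context lengths.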
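A rough way to check figures like these is to compute them from the published dimensions. The sketch below assumes the Llama 3.1 8B configuration (hidden size 4096, MLP width 14336, 32 layers) and treats Q4_K_M as averaging about 4.85 bits per weight, which is an approximation; the helper name is illustrative.

```python
hidden = 4096
intermediate = 14336
layers = 32

params_per_matrix = hidden * intermediate
params_per_layer = 3 * params_per_matrix   # gate, up, down
params_total = layers * params_per_layer

def gigabytes(n_params, bits_per_weight):
    """Storage in decimal gigabytes for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: 4.85 bits per weight is an approximate average for Q4_K_M.
for label, bpw in [("FP16", 16.0), ("Q4_K_M (approx.)", 4.85)]:
    per_layer = gigabytes(params_per_layer, bpw)
    total = gigabytes(params_total, bpw)
    print(f"{label}: {per_layer:.2f} GB per layer, {total:.1f} GB for all MLP weights")
```

These are weight sizes only. Actual VRAM use is higher because of the KV cache, activations, and runtime buffers.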
Workflow example
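When loading a model in llama.cpp, you can inspect the MLP structure in the model load log (run with --verbose for more detail), which prints hyperparameters such as n_embd (hidden size) and n_ff (MLP intermediate size); the exact line prefixes vary between versions. In GGUF files, the MLP tensors of layer N are named blk.N.ffn_gate.weight, blk.N.ffn_up.weight, and blk.N.ffn_down.weight. In LM Studio, the model info panel shows parameter counts per component. When quantizing with llama-quantize, the MLP weights are compressed alongside attention weights. Uniform types such as Q4_0 or Q5_1 apply the same format to nearly all weight matrices, whereas mixed schemes such as Q4_K_M keep selected tensors, including some ffn_down matrices, at higher precision.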
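Once tensor names and shapes are available from a load log or a GGUF inspection tool, grouping them by component is straightforward. The sketch below is a hypothetical example: the shape table is hand-written for one Llama 3.1 8B layer rather than read from a file, the (rows, cols) ordering is illustrative, and the classification relies on the GGUF naming convention of ffn_ and attn_ prefixes.

```python
import math
from collections import defaultdict

def component(name):
    """Map a GGUF-style tensor name to a coarse component label."""
    if "norm" in name:
        return "other"          # normalization weights are tiny
    if ".ffn_" in name:
        return "mlp"
    if ".attn_" in name:
        return "attention"
    return "other"              # embeddings, output head, etc.

# Hypothetical listing: one layer plus the token embedding (hand-written).
shapes = {
    "token_embd.weight": (128256, 4096),
    "blk.0.attn_norm.weight": (4096,),
    "blk.0.attn_q.weight": (4096, 4096),
    "blk.0.attn_k.weight": (1024, 4096),
    "blk.0.attn_v.weight": (1024, 4096),
    "blk.0.attn_output.weight": (4096, 4096),
    "blk.0.ffn_norm.weight": (4096,),
    "blk.0.ffn_gate.weight": (14336, 4096),
    "blk.0.ffn_up.weight": (14336, 4096),
    "blk.0.ffn_down.weight": (4096, 14336),
}

# Sum parameter counts per component.
totals = defaultdict(int)
for name, shape in shapes.items():
    totals[component(name)] += math.prod(shape)

for label, count in sorted(totals.items()):
    print(f"{label}: {count / 1e6:.1f}M parameters")
```

Within a single layer of this model, the MLP holds roughly four times as many parameters as attention, which is why the choice of quantization type for the ffn tensors has such a large effect on file size.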
Reviewed by Fredoline Eruo. See our editorial policy.