Neural network architectures

Feedforward Neural Network

A feedforward neural network (FFNN) is the simplest type of neural network where connections between nodes do not form cycles. Data moves in one direction from input to output through hidden layers. Each layer applies a linear transformation followed by a nonlinear activation function. In local AI, FFNNs are the building blocks of transformer models—every attention and MLP layer is a feedforward subnetwork. Operators encounter FFNNs indirectly: the 'MLP' layers in a transformer are feedforward networks that process each token's representation independently.

Deeper dive

A feedforward network consists of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is connected to every neuron in the next layer (fully connected). The network computes a function f(x) = W₃·σ(W₂·σ(W₁·x + b₁) + b₂) + b₃, where W are weight matrices, b are biases, and σ is an activation like ReLU. Training adjusts these weights via backpropagation. In transformers, the feedforward sublayer (often called FFN or MLP) expands the dimension from d_model to 4×d_model and then projects back. For example, Llama 3.1 8B uses an intermediate size of 14336 with d_model=4096. This expansion accounts for roughly two-thirds of the model's parameters. Operators tuning inference may notice that FFN layers are compute-bound and benefit from quantization (e.g., Q4_K_M reduces their memory footprint by 4×).

Practical example

Consider Llama 3.1 8B: each transformer layer has a feedforward network with two weight matrices (gate and up) of shape [4096, 14336] and one down matrix of shape [14336, 4096]. At FP16, these three matrices consume about 3 * 4096 * 14336 * 2 bytes ≈ 352 MB per layer. With 32 layers, the total FFN weight memory is ~11 GB out of the model's ~16 GB. Quantizing to Q4_K_M reduces this to ~3 GB, allowing the model to fit on a 12 GB GPU.

Workflow example

When running llama-cli -m llama-3.1-8b-Q4_K_M.gguf, the runtime loads the model and executes each transformer layer. For each token, the feedforward sublayer computes: x = down(GELU(gate(x)) * up(x)). Operators can inspect FFN dimensions using llama.cpp's --print-details flag, which outputs layer shapes. In LM Studio, the model info panel shows parameter counts per layer type—the FFN portion is listed as 'mlp' parameters.

Reviewed by Fredoline Eruo. See our editorial policy.