Natural language processing

GPT (architecture)

GPT (Generative Pre-trained Transformer) is a decoder-only Transformer architecture that predicts the next token in a sequence by attending to all previous tokens via masked self-attention. Operators encounter GPT in local AI runtimes like llama.cpp, Ollama, and LM Studio when loading models such as Llama, Mistral, or GPT-2. The architecture's autoregressive nature means inference is sequential: each new token requires a forward pass through the entire model, making VRAM and compute latency critical for real-time use.

Deeper dive

The GPT architecture, introduced by OpenAI in 2018, uses a stack of Transformer decoder blocks. Each block contains masked multi-head self-attention (preventing attention to future tokens) and feed-forward layers, with layer normalization and residual connections. Unlike encoder-decoder models (e.g., T5), GPT is decoder-only: it generates text left-to-right, token by token. Pre-training on large text corpora learns language patterns; fine-tuning adapts to specific tasks. Variants like GPT-2, GPT-3, and open-source models (Llama, Mistral) follow this design. For local operators, the key trade-off is model size vs. context length: larger models (e.g., 70B parameters) require more VRAM and compute, often necessitating quantization or offloading.

Practical example

A 7B-parameter GPT-style model (e.g., Mistral 7B) at 16-bit precision requires ~14 GB VRAM. On an RTX 3090 (24 GB), it fits with room for a 4K context. Quantizing to 4-bit reduces VRAM to ~4 GB, fitting on an RTX 3060 (12 GB) and enabling ~30-50 tok/s. A 70B model at 4-bit needs ~40 GB VRAM, exceeding consumer GPUs; operators then offload layers to system RAM via llama.cpp, dropping speed to ~3-5 tok/s.

Workflow example

When running ollama run llama3.1:8b, Ollama loads the GPT-style model into VRAM. The runtime first checks available VRAM; if insufficient, it falls back to CPU offload. Operators see token generation speed in the terminal (e.g., ~40 tok/s on an RTX 4070). In LM Studio, the 'Model' tab shows parameter count and quantization level, directly reflecting the GPT architecture's size and precision trade-offs.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work