RUNLOCALAI · v38
© 2026 runlocalai.co · Independently operated
Natural language processing

GPT (architecture)

GPT (Generative Pre-trained Transformer) is a decoder-only Transformer architecture that predicts the next token in a sequence by attending to all previous tokens through masked self-attention. Operators encounter it in local AI runtimes such as llama.cpp, Ollama, and LM Studio when loading models such as Llama, Mistral, or GPT-2. Because the architecture is autoregressive, inference is sequential: each new token requires a forward pass through the entire model, so VRAM capacity and per-token latency determine whether real-time use is practical.
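The sequential loop described above can be sketched in a few lines. This is a minimal illustration, not how any runtime implements it: toy_next_token_scores is a hypothetical stand-in for a real model (a tiny lookup table that only inspects the last token), and greedy selection is assumed in place of the sampling strategies real runtimes offer.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]


def toy_next_token_scores(context: List[str]) -> Scores:
    """Hypothetical stand-in for a model: scores candidates for the next token.

    A real GPT computes these scores with a full forward pass through every
    Transformer block; this toy table only looks at the last token.
    """
    table = {
        "<s>": {"the": 0.9, "a": 0.1},
        "the": {"model": 0.7, "token": 0.3},
        "model": {"runs": 0.8, "</s>": 0.2},
        "runs": {"</s>": 1.0},
    }
    return table.get(context[-1], {"</s>": 1.0})


def generate(
    score_fn: Callable[[List[str]], Scores],
    prompt: List[str],
    max_new_tokens: int = 8,
) -> List[str]:
    """Autoregressive decoding: one call to score_fn per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = score_fn(tokens)                 # one "forward pass"
        next_token = max(scores, key=scores.get)  # greedy: take the top score
        if next_token == "</s>":                  # end-of-sequence marker
            break
        tokens.append(next_token)                 # output becomes new input
    return tokens


print(generate(toy_next_token_scores, ["<s>"]))
# ['<s>', 'the', 'model', 'runs']
```

Each pass through the loop depends on the token produced by the previous one, which is why generation cannot be parallelized across output positions.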

Deeper dive

The GPT architecture, introduced by OpenAI in 2018, uses a stack of Transformer decoder blocks. Each block contains masked multi-head self-attention (which prevents attention to future tokens) and feed-forward layers, with layer normalization and residual connections. Unlike encoder-decoder models (e.g., T5), GPT is decoder-only: it generates text left-to-right, token by token. Pre-training on large text corpora teaches the model general language patterns; fine-tuning then adapts it to specific tasks. GPT-2, GPT-3, and open-source models such as Llama and Mistral follow this design. For local operators, the key trade-off is model size vs. context length, because the weights and the attention cache for the context share the same VRAM: larger models (e.g., 70B parameters) require more VRAM and compute, often necessitating quantization or offloading.
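The masking step can be made concrete with a small sketch. It assumes the raw attention scores are already computed and shows only the causal part: position i takes a softmax over positions 0..i, and later positions receive zero weight. The function name and the plain-list representation are illustrative choices, not taken from any runtime.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Apply a causal mask, then a softmax, to a square score matrix.

    Row i holds the scores of query position i against every key position.
    Only keys 0..i are visible; keys after i get weight 0.0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                    # drop future positions
        peak = max(visible)                       # subtract max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)  # zero weight on the future
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# Equal raw scores make the effect of the mask easy to see.
uniform = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(uniform):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

The first token can only attend to itself, while the last token attends to the whole sequence, which is the left-to-right behavior the decoder-only design relies on.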

Practical example

A 7B-parameter GPT-style model (e.g., Mistral 7B) at 16-bit precision needs ~14 GB of VRAM for its weights alone (7 billion parameters × 2 bytes). On an RTX 3090 (24 GB), it fits with room for a 4K context. Quantizing to 4-bit cuts the weights to ~4 GB, which fits on an RTX 3060 (12 GB) and enables ~30-50 tok/s. A 70B model at 4-bit needs ~40 GB of VRAM, more than any single consumer GPU offers; operators then offload some layers to system RAM via llama.cpp, which drops speed to ~3-5 tok/s.
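The arithmetic behind these figures can be sketched as follows. Two assumptions are baked in and should be adjusted for a specific model: 4-bit formats are treated as about 4.5 effective bits per weight (to account for stored scaling factors), and the attention-cache example uses layer and head counts assumed for a Mistral-7B-like model. Runtime overhead is not modeled.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights only: parameter count x bytes per parameter, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # assumes a 16-bit cache
) -> float:
    """Attention cache: one K and one V vector per layer per cached token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bytes_per_value / 1e9


# 7B at 16-bit: 7 x 16 / 8 = 14 GB
print(weight_memory_gb(7, 16))        # 14.0

# 7B and 70B at ~4.5 effective bits (assumed overhead of a 4-bit format)
print(weight_memory_gb(7, 4.5))       # 3.9375
print(weight_memory_gb(70, 4.5))      # 39.375

# 4K context with assumed dimensions: 32 layers, 8 KV heads, head size 128
print(round(kv_cache_gb(32, 8, 128, 4096), 2))  # 0.54
```

The weights dominate at short contexts, but the cache term grows linearly with context length, so long contexts can push a model that nominally fits over the VRAM limit.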

Workflow example

When an operator runs ollama run llama3.1:8b, Ollama first checks available VRAM and then loads the GPT-style model into it; if VRAM is insufficient, it falls back to CPU offload. With the --verbose flag, the terminal reports token generation speed after each response (e.g., ~40 tok/s on an RTX 4070). In LM Studio, the 'Model' tab shows parameter count and quantization level, directly reflecting the GPT architecture's size and precision trade-offs.
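Generation speed can also be read programmatically. The sketch below assumes a local Ollama server on its default port and uses the eval_count and eval_duration (nanoseconds) fields returned by its /api/generate endpoint; the sample payload is made up for illustration and is not a measured result.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generated tokens divided by generation time (eval_duration is in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate_once(
    prompt: str,
    model: str = "llama3.1:8b",
    host: str = "http://localhost:11434",  # assumed default local address
) -> dict:
    """Send one non-streaming request to a running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Made-up payload: 200 tokens generated in 5 seconds.
sample = {"eval_count": 200, "eval_duration": 5_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")  # 40.0 tok/s

# With a server running, the same calculation applies to a live response:
# print(tokens_per_second(generate_once("Why is the sky blue?")))
```

Keeping the calculation separate from the network call makes it easy to log speeds across models or quantization levels without changing the measurement itself.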

Reviewed by Fredoline Eruo. See our editorial policy.
