RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Transformer & LLM components

Self-Attention

Self-attention computes a weighted representation of every position in a sequence by comparing each token against every other token within the same sequence; queries, keys, and values all derive from the same input. For a sequence of length n, it builds an n×n attention matrix in which entry (i,j) is the weight token i places on token j. This O(n²) computation is the transformer's core mechanism for modeling long-range dependencies: a token can directly access information from any other token regardless of distance (unlike RNNs, where distant information must propagate through many steps). In decoder-only LLMs a causal mask limits each token to itself and earlier positions, but the cost still grows quadratically with n.
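
To make the n×n matrix concrete, here is a minimal single-head sketch in plain Python. It assumes the standard scaled dot-product form (scores divided by the square root of the key dimension, then a softmax per row); the function names, the projection matrices w_q, w_k, w_v, and the causal flag are illustrative choices, not something this entry specifies.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain nested-list matrix product: (n x d) @ (d x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v, causal=False):
    # Queries, keys, and values are all projections of the same input x.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    n, d_k = len(x), len(k[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                # Causal mask: token i may not attend to later tokens.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d_k))
        weights.append(softmax(scores))  # row i: how much i attends to each j
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2 features each
out, attn = self_attention(x, identity, identity, identity, causal=True)
print(len(attn), len(attn[0]))  # 3 3
```

The double loop over i and j is where the O(n²) cost comes from; real implementations do the same work with batched matrix multiplies on the GPU.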

Practical example

The "self" in self-attention means every token attends to the other tokens in the same sequence, which is what makes transformers good at using context. The cost is quadratic: 2048 tokens ≈ 4.2M attention pairs; 32768 tokens ≈ 1.07B attention pairs, i.e. 256× more attention compute for 16× more tokens (per layer and per head). This quadratic growth is the fundamental scaling wall for long contexts.
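
The arithmetic can be checked in a few lines. This is only a sketch of the pair count for one attention matrix; the helper name attention_pairs is made up for illustration, and the count ignores causal masking, layers, and heads.

```python
def attention_pairs(n_tokens: int) -> int:
    # One score per (query token, key token) pair: an n x n matrix.
    return n_tokens * n_tokens

short_ctx, long_ctx = 2048, 32768

print(attention_pairs(short_ctx))  # 4194304
print(attention_pairs(long_ctx))   # 1073741824
print(long_ctx // short_ctx)       # 16
print(attention_pairs(long_ctx) // attention_pairs(short_ctx))  # 256
```

A 16× longer sequence gives 16² = 256× as many pairs.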

Workflow example

Self-attention is a main bottleneck in long-context inference. Monitor: (1) time_to_first_token (TTFT), which includes all the self-attention computation on the prompt, and (2) tokens_per_second (TPS) during generation. With a KV cache, each generation step computes attention only for the new token against all previous tokens, so per-token compute is linear in context length, but every step must read the whole cache, and TPS falls as context grows. A TTFT of around 5s for a 4K prompt can be normal on modest hardware. If TPS drops from 50 to 5 tok/s at 32K context, you are likely hitting the memory-bandwidth wall.
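
One way to collect both numbers is to time a token stream yourself. The sketch below assumes only that your runtime exposes generated tokens as a Python iterable; measure_stream and its clock parameter are hypothetical names, not part of any particular tool's API.

```python
import time
from typing import Callable, Iterable

def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    # TTFT: time from starting to iterate until the first token arrives.
    # TPS: tokens after the first, divided by the time spent generating them.
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    tps = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return {"ttft_s": first - start, "tokens": count, "tps": tps}

# Deterministic demo with a fake clock: the prompt takes 2.0s to process,
# then tokens arrive every 0.5s.
ticks = iter([0.0, 2.0, 2.5, 3.0])
stats = measure_stream(["a", "b", "c"], clock=lambda: next(ticks))
print(stats)  # {'ttft_s': 2.0, 'tokens': 3, 'tps': 2.0}
```

Running this at several prompt lengths shows whether TTFT grows faster than linearly and whether TPS falls off as context fills. The first token is excluded from TPS so that prompt processing does not skew the generation rate.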

Reviewed by Fredoline Eruo. See our editorial policy.
