RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

Tensor Core

Tensor Cores are specialized hardware units on NVIDIA GPUs (Volta architecture and later) that perform fused matrix multiply-accumulate operations on small matrix tiles; on Volta, each Tensor Core completes one 4×4×4 multiply-accumulate per clock cycle. They accelerate the matrix math used in neural network training and inference, particularly for mixed-precision workloads (FP16/BF16 inputs with FP32 accumulation). For local AI operators, Tensor Cores matter because they deliver roughly 2–8× higher throughput than standard CUDA cores on matrix-heavy operations like attention and feed-forward layers, which raises tokens-per-second throughput in llama.cpp, vLLM, and other runtimes. However, they pay off mainly when the workload is compute-bound, as in prompt processing or batched generation. Single-stream token generation is usually limited by memory bandwidth rather than compute, so it often sees limited benefit, especially on small models.

Deeper dive

Tensor Cores were introduced with the NVIDIA Volta V100 in 2017 and have since appeared in Turing (RTX 20-series), Ampere (RTX 30-series), Ada Lovelace (RTX 40-series), and Blackwell (RTX 50-series) architectures. A first-generation Tensor Core computes D = A × B + C, where A, B, C, and D are 4×4 matrices; later generations process larger tiles per instruction. Kernels reach the hardware through warp-level matrix multiply-and-accumulate (WMMA) instructions, in which the 32 threads of a warp cooperatively compute a larger tile (16×16×16 is a common shape) that the SM (streaming multiprocessor) splits across its Tensor Cores over several cycles. The key operator-relevant detail is that Tensor Cores take lower-precision inputs than FP32 CUDA cores, trading numerical precision (and, for FP16, range) for throughput. Supported formats depend on the generation: Volta handles FP16 with FP32 accumulation, Turing added INT8 and INT4, Ampere added BF16 and TF32, and Ada Lovelace added FP8. Modern runtimes like llama.cpp and vLLM generally use Tensor Cores automatically when the model is loaded in half-precision (FP16/BF16) or quantized (INT8/INT4) formats. On consumer GPUs, the number of Tensor Cores scales with the GPU tier: an RTX 4090 has 512 Tensor Cores (4th gen, 4 per SM across 128 SMs), while an RTX 3060 has 112 (3rd gen, 4 per SM across 28 SMs). For inference, the practical speedup depends on batch size, because larger batches reuse each weight across more sequences and keep the Tensor Cores busy. At batch size 1, memory bandwidth is usually the bottleneck, which reduces the advantage.
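The D = A × B + C operation with FP16 inputs and FP32 accumulation can be emulated in plain Python to show where rounding happens. This is only an illustrative sketch: the helper names (to_fp16, to_fp32, mma_4x4) are hypothetical, and real Tensor Core hardware may order and round its internal additions differently.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 nested lists.

    Inputs A and B are rounded to FP16; the running sum is kept in FP32,
    mimicking mixed-precision accumulation (an emulation, not the hardware).
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values fits exactly in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[0.5 * (i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

d = mma_4x4(identity, b, ones)  # equals b + 1 element-wise

# FP16 cannot represent 0.1 exactly, which is the precision cost of the inputs.
fp16_error = abs(to_fp16(0.1) - 0.1)
```

Keeping the accumulator in FP32 is the reason mixed precision stays accurate over long dot products: only the inputs lose precision, not the running sum.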

Practical example

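On an RTX 4090 (512 Tensor Cores), running Llama 3.1 8B with llama.cpp achieves ~120 tok/s at batch size 1, but at batch size 8 it reaches ~400 tok/s, a 3.3× gain. The gain comes from batching: each weight read from VRAM is reused across eight sequences, which moves the workload from bandwidth-bound toward compute-bound, where Tensor Cores help. On an RTX 3060 (112 Tensor Cores), the same model at batch size 1 yields ~25 tok/s, and batch size 8 yields ~70 tok/s, a 2.8× gain. The 3060 trails in absolute terms mainly because of its memory bandwidth (360 GB/s vs 1008 GB/s on the 4090), which caps single-stream throughput; it also has fewer Tensor Cores to absorb the extra work that batching creates. Treat these figures as illustrative and read them as quantized-model numbers rather than FP16: at FP16 the weights of an 8B model occupy about 16 GB, which does not fit in the 3060's 12 GB of VRAM and would cap a single stream on the 4090 at roughly 1008 / 16 ≈ 63 tok/s.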
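The memory-bandwidth limit on single-stream throughput can be estimated with simple arithmetic. The sketch below assumes that generating one token reads every model weight from VRAM exactly once and ignores the KV cache, activations, and quantization metadata, so it gives an optimistic ceiling rather than a prediction. The function name decode_ceiling_tok_s and the round 8-billion parameter count are assumptions made for illustration.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         params_billion: float,
                         bytes_per_param: float) -> float:
    """Upper bound on batch-size-1 decode speed in tokens per second.

    Assumes each generated token streams all weights from VRAM once.
    """
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


# Bandwidth figures from the text: RTX 4090 = 1008 GB/s, RTX 3060 = 360 GB/s.
rtx4090_fp16 = decode_ceiling_tok_s(1008, 8, 2.0)  # 63.0 tok/s
rtx3060_fp16 = decode_ceiling_tok_s(360, 8, 2.0)   # 22.5 tok/s

# A 4-bit model (0.5 bytes per weight, ignoring scale overhead) raises the ceiling 4x.
rtx4090_int4 = decode_ceiling_tok_s(1008, 8, 0.5)  # 252.0 tok/s
rtx3060_int4 = decode_ceiling_tok_s(360, 8, 0.5)   # 90.0 tok/s
```

A measured batch-size-1 figure well above the FP16 ceiling usually means the model was quantized. Batching raises total throughput because the same weight traffic is shared across several sequences.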

Workflow example

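In llama.cpp, Tensor Cores are engaged automatically by the CUDA backend. The precision is set by the GGUF file you load (for example an F16 file or a Q4_0 INT4-quantized file), not by a load-time flag. With a CUDA build, the startup log lists each detected GPU together with its compute capability (8.9 for Ada), which confirms that the card belongs to a Tensor Core generation. The log does not reliably state which kernels used Tensor Cores; to confirm actual utilization, profile the run with a tool such as NVIDIA Nsight Compute. In vLLM, Tensor Cores are used by default for FP16/BF16 models; you can force FP32 with --dtype float32, which largely bypasses them (slower but higher precision). In LM Studio, the 'GPU Offload' slider controls how many layers run on the GPU; offloaded layers use Tensor Cores under the same conditions as in llama.cpp, with no separate setting.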
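Outside of any runtime, the compute capability can also be read from nvidia-smi and mapped to an architecture. The sketch below assumes a driver recent enough to support the compute_cap query field; the helper names and the lookup table are hypothetical conveniences, not part of any NVIDIA tool. Compute capability is only a heuristic here: GTX 16-series cards report 7.5 (Turing) but ship without Tensor Cores.

```python
import subprocess

# Compute capability (major, minor) -> architecture name; not exhaustive.
ARCH_BY_CC = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
    (12, 0): "Blackwell",
}


def parse_gpu_line(line: str):
    """Parse one CSV line such as 'NVIDIA GeForce RTX 4090, 8.9'."""
    name, cap = line.rsplit(",", 1)
    major, minor = cap.strip().split(".")
    return name.strip(), (int(major), int(minor))


def may_have_tensor_cores(cc) -> bool:
    """Heuristic: Volta (7.0) and later. GTX 16-series (7.5) is an exception."""
    return cc >= (7, 0)


def query_gpus():
    """Ask nvidia-smi for GPU names and compute capabilities (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(ln) for ln in out.splitlines() if ln.strip()]


# Example on a captured line, so the sketch runs without a GPU present.
name, cc = parse_gpu_line("NVIDIA GeForce RTX 4090, 8.9")
arch = ARCH_BY_CC.get(cc, "unknown")
```

On a machine with an NVIDIA driver installed, calling query_gpus() returns one (name, capability) pair per GPU.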

Reviewed by Fredoline Eruo. See our editorial policy.
