Tensor Core
Tensor Cores are specialized hardware units on NVIDIA GPUs (Volta architecture and later) that perform fused multiply-add (FMA) operations on small matrix tiles; in the original Volta design, each unit completes one 4×4 matrix multiply-accumulate per clock cycle. They accelerate the matrix math used in neural network training and inference, particularly for mixed-precision workloads (FP16 or BF16 inputs with FP32 accumulation). For local AI operators, Tensor Cores matter because they deliver 2–8× higher throughput than standard CUDA cores for matrix-heavy operations like attention and feed-forward layers, which raises tokens-per-second throughput in llama.cpp, vLLM, and other runtimes. However, they need enough parallel work to stay busy, which usually means batch sizes above 1 or long prompts processed in parallel; single-stream token generation on small models is typically limited by memory bandwidth and often sees limited benefit.
Deeper dive
Tensor Cores were introduced with the NVIDIA Volta V100 in 2017 and have since appeared in the Turing (RTX 20-series), Ampere (RTX 30-series), Ada Lovelace (RTX 40-series), and Blackwell (RTX 50-series) architectures. Each first-generation Tensor Core computes D = A × B + C, where A, B, C, and D are 4×4 matrices. Software does not address these 4×4 operations individually: warp-level matrix multiply-and-accumulate (WMMA) instructions expose larger tiles (for example 16×16×16), which the hardware decomposes across the Tensor Cores of a streaming multiprocessor (SM). The key operator-relevant detail is that Tensor Cores operate on lower-precision inputs than FP32 CUDA cores, enabling higher throughput at the cost of numerical range and precision. Volta supported FP16 inputs only; later generations added INT8 and INT4 (Turing), BF16 and TF32 (Ampere), and FP8 (Ada Lovelace). Modern runtimes like llama.cpp and vLLM generally use Tensor Cores automatically when the model is loaded in half-precision (FP16/BF16) or quantized (INT8/INT4) formats. On consumer GPUs, the number of Tensor Cores scales with the GPU tier: an RTX 4090 has 512 Tensor Cores (4th gen, 4 per SM across 128 SMs), while an RTX 3060 has 112 (3rd gen, 4 per SM across 28 SMs). For inference, the practical speedup depends on batch size, because larger batches reuse each loaded weight for more arithmetic and so keep the Tensor Cores busier. At batch size 1, memory bandwidth usually becomes the bottleneck, which reduces the advantage.
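The tile operation D = A × B + C can be emulated in plain Python to make the mixed-precision behavior concrete. This is an illustrative sketch only, not how the hardware or any runtime implements it: the function names (to_fp16, to_fp32, tile_mma) are hypothetical, rounding is emulated with the standard struct module, and real Tensor Cores may accumulate in a different order.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    Raises OverflowError if x is outside the FP16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tile_mma(a, b, c):
    """Return D = A x B + C for 4x4 tiles (lists of lists).

    Inputs A and B are rounded to FP16; the running sum is kept in FP32,
    mirroring the FP16-input / FP32-accumulate mode described above.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values fits exactly in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b = [[0.0] * 4 for _ in range(4)]
    c = [[0.0] * 4 for _ in range(4)]
    b[0][0] = 1.0
    c[0][0] = 2048.0
    d = tile_mma(identity, b, c)
    # The FP32 accumulator holds 2048 + 1 exactly.
    print(d[0][0])
    # FP16 values near 2048 are spaced 2 apart, so 2049 is not representable.
    print(to_fp16(d[0][0]))
```

The last two lines show why the accumulator is kept in FP32: a small product added to a large running sum survives in FP32 but would be rounded away if the sum were stored in FP16.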
Practical example
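On an RTX 4090 (512 Tensor Cores), running Llama 3.1 8B at FP16 with llama.cpp achieves ~120 tok/s at batch size 1, but at batch size 8 it reaches ~400 tok/s, a 3.3× gain in aggregate throughput from better Tensor Core utilization. On an RTX 3060 (112 Tensor Cores), the same model yields ~25 tok/s at batch size 1 and ~70 tok/s at batch size 8, a 2.8× gain. The 3060's lower memory bandwidth (360 GB/s vs 1008 GB/s on the 4090) limits its single-stream throughput, while its smaller Tensor Core count limits how much batching can add. Treat these figures as illustrative rather than as benchmarks: FP16 weights for an 8B model occupy roughly 16 GB, which exceeds the 3060's 12 GB of VRAM and implies a bandwidth ceiling of about 63 tok/s per stream on the 4090, so single-stream numbers in this range are more typical of a quantized build.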
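A back-of-envelope check helps when reading figures like these. At batch size 1, each generated token has to read roughly every weight once, so memory bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below assumes a rounded 8 billion parameters, ignores the KV cache and activations, and uses hypothetical function names; it is an estimate, not a benchmark.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return params * bits_per_weight / 8


def decode_upper_bound_tok_s(bandwidth_gb_s: float, params: float,
                             bits_per_weight: float) -> float:
    """Upper bound on single-stream tokens/s if every token reads all weights once."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params, bits_per_weight)


if __name__ == "__main__":
    params = 8e9  # assumption: rounded parameter count for an "8B" model
    gpus = (("RTX 4090", 1008.0), ("RTX 3060", 360.0))  # bandwidth in GB/s
    for name, bandwidth in gpus:
        for label, bits in (("FP16", 16.0), ("4-bit", 4.0)):
            bound = decode_upper_bound_tok_s(bandwidth, params, bits)
            print(f"{name} {label}: at most ~{bound:.1f} tok/s per stream")
```

By this estimate, a single-stream figure above the FP16 bound implies that fewer bytes are read per token, for example because the weights are quantized. Batching raises aggregate throughput because the same weight read serves several sequences at once, which is where Tensor Core capacity starts to matter.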
Workflow example
In llama.cpp, there is no separate switch for Tensor Cores: a CUDA build uses them where its kernels support it, and the weight precision comes from the GGUF file you load (for example an F16 or a Q4_0 file) rather than from a runtime flag. To confirm which GPU generation is in use, check the startup log for the reported compute capability (8.9 for Ada Lovelace); the exact wording of log lines varies between versions. In vLLM, Tensor Cores are used by default for FP16/BF16 models; passing --dtype float32 runs the model in FP32 instead, which is slower but higher precision and largely bypasses the half-precision Tensor Core paths. In LM Studio, the 'GPU Offload' slider sets how many layers run on the GPU; offloaded layers use Tensor Cores implicitly when the hardware and model format allow it.
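The compute capability reported in logs can be mapped to an architecture and Tensor Core generation with a small lookup. The sketch below is an assumption-laden helper, not part of llama.cpp, vLLM, or LM Studio: the function names are hypothetical, the mapping is a summary of NVIDIA's public architecture naming, and the nvidia-smi compute_cap query field is assumed to be available (it is missing on older drivers, in which case the helper returns an empty list).

```python
import shutil
import subprocess


def tensor_core_generation(compute_capability: str):
    """Map a compute capability string such as "8.9" to an architecture label.

    Returns None for pre-Volta GPUs or unparseable input. Caveat: some
    Turing GPUs (the GTX 16-series) report 7.5 but have no Tensor Cores.
    """
    try:
        major, minor = (int(p) for p in compute_capability.strip().split("."))
    except ValueError:
        return None
    cc = (major, minor)
    if cc < (7, 0):
        return None
    if cc < (7, 5):
        return "Volta (1st gen)"
    if cc < (8, 0):
        return "Turing (2nd gen)"
    if cc < (8, 9):
        return "Ampere (3rd gen)"
    if cc < (9, 0):
        return "Ada Lovelace (4th gen)"
    if cc < (10, 0):
        return "Hopper (4th gen)"
    return "Blackwell (5th gen)"


def query_compute_capabilities():
    """Return one compute capability string per GPU, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for cc in query_compute_capabilities():
        print(cc, "->", tensor_core_generation(cc))
```

A result such as "Ada Lovelace (4th gen)" only says that the hardware has Tensor Cores; whether a given runtime actually uses them still depends on the build and the model format.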
Reviewed by Fredoline Eruo. See our editorial policy.