INT8

INT8 (8-bit integer) is a numerical format that uses 8 bits to represent integers: [-128, 127] when signed, [0, 255] when unsigned. In local AI, INT8 is used to quantize model weights, and sometimes activations, to reduce memory footprint and speed up inference. Compared to FP16 (16-bit float), INT8 halves weight storage and can raise throughput, up to roughly 2x for compute-bound workloads on hardware with INT8 tensor cores, such as NVIDIA GPUs on Turing or newer architectures. Operators encounter INT8 when choosing quantization levels (e.g., Q8_0 in llama.cpp) to fit larger models into VRAM or increase token generation speed.

INT8 quantization converts floating-point values to 8-bit integers using a scale factor and, in asymmetric schemes, a zero-point that together map the original range onto the integer range. Scales can be assigned at different granularities. Per-tensor quantization uses one scale for the entire tensor, while per-channel quantization assigns a scale per output channel, which preserves more accuracy when channel magnitudes differ. In practice, quantizing weights alone (weight-only) reduces model size by ~50% compared to FP16, with minimal accuracy loss for many models. Quantizing activations as well reduces memory further; static activation quantization requires calibration data to fix the scales in advance, whereas dynamic quantization computes them at runtime. Hardware support varies: NVIDIA GPUs from Turing (RTX 20 series) onward have INT8 tensor cores that accelerate matrix multiplications, while AMD RX 7000 series and Apple M-series support INT8 via different instructions. In llama.cpp, Q8_0 stores weights as signed 8-bit integers in blocks of 32, each with its own FP16 scale and no zero-point, which works out to 8.5 bits per weight and is near-lossless for 7B-70B models. The trade-off for operators: INT8 saves memory and often time, but may cause a slight perplexity increase compared to FP16.
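The difference between per-tensor and per-channel scaling can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names (affine_params, quantize, dequantize, row_errors) and the toy weight matrix are hypothetical, and each row is assumed to stand for one output channel.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed INT8 range


def affine_params(values: List[float]) -> Tuple[float, int]:
    """Pick (scale, zero_point) so that [min, max] of values maps onto [QMIN, QMAX]."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 stays exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(QMIN - lo / scale)
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to INT8 codes, clamping anything that lands outside the range."""
    return [max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values]


def dequantize(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from INT8 codes."""
    return [(c - zero_point) * scale for c in codes]


def row_errors(matrix: List[List[float]], per_channel: bool) -> List[float]:
    """Worst round-trip error per row; each row is treated as one output channel."""
    shared = affine_params([v for row in matrix for v in row])
    errors = []
    for row in matrix:
        scale, zp = affine_params(row) if per_channel else shared
        restored = dequantize(quantize(row, scale, zp), scale, zp)
        errors.append(max(abs(a - b) for a, b in zip(row, restored)))
    return errors


weights = [
    [0.01, -0.02, 0.015, -0.005],  # small-magnitude channel
    [4.0, -3.5, 2.25, -1.0],       # large-magnitude channel
]
per_tensor = row_errors(weights, per_channel=False)
per_channel = row_errors(weights, per_channel=True)
print("per-tensor :", per_tensor)
print("per-channel:", per_channel)
```

With a single shared scale, the small-magnitude row is squeezed into a handful of integer codes, so its round-trip error is far larger than when it gets its own scale.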

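A 7B parameter model in FP16 requires ~14 GB of VRAM for its weights alone. Quantizing to INT8 reduces this to ~7 GB, so the model fits on an RTX 3060 (12 GB) with room left for the KV cache, instead of requiring a larger card such as an RTX 3090 (24 GB). Inference speed may also increase, for example from ~20 tok/s to ~35 tok/s on an RTX 4090, though actual figures depend on the model and runtime. Because single-stream token generation is usually limited by memory bandwidth, much of that gain comes from reading half as many bytes per token; INT8 compute paths help mainly in compute-bound phases such as prompt processing.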
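The arithmetic behind these figures is easy to reproduce. The sketch below uses a hypothetical helper, weight_memory_gb, and counts weights only (1 GB taken as 10^9 bytes); the KV cache, activations, and runtime overhead come on top and are not modeled here. The 8.5 bits-per-weight figure assumes a Q8_0-style layout of 32 one-byte weights plus one 16-bit scale per block.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal parameter count of a "7B" model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int8_gb = weight_memory_gb(N_PARAMS, 8)
# Block of 32 INT8 weights + one FP16 scale: (32 * 8 + 16) / 32 = 8.5 bits per weight
q8_0_gb = weight_memory_gb(N_PARAMS, (32 * 8 + 16) / 32)

print(f"FP16: {fp16_gb:.2f} GB")
print(f"INT8: {int8_gb:.2f} GB")
print(f"Q8_0: {q8_0_gb:.2f} GB")
```

The block scales add roughly 6% on top of pure INT8, which is why a Q8_0 file is slightly larger than half the FP16 size.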

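In llama.cpp, run ./llama-cli -m model.gguf -ngl 35 (the binary is named ./main in older builds) with a Q8_0 quantized model; -ngl 35 offloads 35 layers to the GPU. The Q8_0 quantization level indicates INT8 weights with block-wise scaling. In Ollama, pull a Q8_0 tag, e.g. ollama pull llama3.1:8b-instruct-q8_0 (check the model's tag list for the exact name). In LM Studio, select a Q8_0 GGUF file from the model browser. The runtime loads the INT8 weights into VRAM; whether matrix multiplications then run on integer kernels or on weights dequantized to a float type depends on the backend and hardware. Either way, the effect shows up in the tokens/sec output.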
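To make "block-wise scaling" concrete, here is a rough Python sketch of a Q8_0-style round trip. It is not llama.cpp's code: the function names are hypothetical, the scale is kept as a Python float rather than the FP16 value the real format stores, and Python's round() breaks ties differently from C's roundf. The block size of 32 and the symmetric, zero-point-free mapping follow the Q8_0 layout.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 32  # weights per block in a Q8_0-style layout

Block = Tuple[float, List[int]]  # (scale, signed 8-bit codes)


def quantize_blockwise(weights: List[float]) -> List[Block]:
    """Give every block of 32 weights its own scale and store codes in [-127, 127]."""
    if len(weights) % BLOCK_SIZE != 0:
        raise ValueError("weight count must be a multiple of the block size")
    blocks = []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        amax = max(abs(w) for w in block)      # largest magnitude in this block
        scale = amax / 127.0                   # symmetric: no zero-point needed
        inv = 1.0 / scale if scale else 0.0    # guard against an all-zero block
        blocks.append((scale, [round(w * inv) for w in block]))
    return blocks


def dequantize_blockwise(blocks: List[Block]) -> List[float]:
    """Rebuild approximate weights by multiplying each code by its block's scale."""
    return [scale * code for scale, codes in blocks for code in codes]


# Toy data: the second block is 100x larger than the first.
weights = [0.01 * math.sin(i) for i in range(32)] + [math.sin(i) for i in range(32)]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("blocks:", len(blocks), "worst round-trip error:", worst)
```

Because each block carries its own scale, the small-valued first block is not forced onto the coarse grid needed by the large-valued second block, which is the main reason this scheme stays close to the original weights.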

Reviewed by Fredoline Eruo. See our editorial policy.
