RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

INT8

INT8 (8-bit integer) is a numerical format that stores each value in 8 bits: [-128, 127] when signed, [0, 255] when unsigned. In local AI, INT8 is used to quantize model weights and activations, which cuts memory footprint and can speed up inference. Compared with FP16 (16-bit float), INT8 halves the storage per value and can roughly double matrix-multiply throughput on hardware with INT8 tensor cores, such as NVIDIA GPUs from the Turing architecture onward. Operators meet INT8 when choosing a quantization level (e.g., Q8_0 in llama.cpp) to fit a larger model into VRAM or raise token generation speed.

Deeper dive

INT8 quantization converts floating-point values to 8-bit integers by mapping the original range through a scaling factor and, in asymmetric schemes, a zero-point. Two common granularities are per-tensor and per-channel. Per-tensor uses one scale for the entire tensor; per-channel assigns a scale to each output channel, which preserves more accuracy when channels differ in magnitude. Quantizing weights alone (weight-only) cuts model size by ~50% compared to FP16, with minimal accuracy loss for many models. Activation quantization reduces memory further: static quantization needs calibration data to fix activation ranges ahead of time, while dynamic quantization computes them at runtime. Hardware support varies: NVIDIA GPUs from Turing (RTX 20 series) onward have INT8 tensor cores that accelerate matrix multiplications, while AMD RX 7000 series and Apple M-series support INT8 via different instructions. In llama.cpp, Q8_0 stores weights as signed 8-bit integers in blocks of 32, each block with its own 16-bit scale (about 8.5 bits per weight), which is near-lossless for 7B-70B models. The trade-off for operators: INT8 saves memory and often time, at the cost of a possible slight perplexity increase compared to FP16.
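The difference between per-tensor and per-channel scaling can be seen in a minimal sketch. It assumes symmetric quantization (zero-point fixed at 0) and a made-up 2x4 weight matrix; the function names and values are illustrative, not taken from any particular runtime.

```python
def quantize_symmetric(values):
    """Map floats to signed INT8 using one scale (zero-point assumed to be 0)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from INT8 values and their scale."""
    return [x * scale for x in q]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights: row 0 holds small values, row 1 holds large ones.
weights = [
    [0.01, -0.02, 0.015, 0.005],
    [1.5, -2.0, 0.75, 1.0],
]

# Per-tensor: a single scale shared by every element, set by the largest value.
flat = [v for row in weights for v in row]
q_flat, s_tensor = quantize_symmetric(flat)
restored_flat = dequantize(q_flat, s_tensor)
per_tensor_error_row0 = max_abs_error(weights[0], restored_flat[:4])

# Per-channel: one scale per output channel (here, per row).
q0, s0 = quantize_symmetric(weights[0])
per_channel_error_row0 = max_abs_error(weights[0], dequantize(q0, s0))

print(f"row 0 error, per-tensor:  {per_tensor_error_row0:.6f}")
print(f"row 0 error, per-channel: {per_channel_error_row0:.6f}")
```

With one shared scale, the small-valued row is rounded onto a grid sized for the large row, so its error is much higher than when it gets a scale of its own.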

Practical example

A 7B parameter model in FP16 needs ~14 GB of VRAM for the weights alone. Quantizing to INT8 cuts this to ~7 GB, so the model fits on an RTX 3060 (12 GB) instead of needing a larger card such as an RTX 3090 (24 GB). Inference speed may increase from ~20 tok/s to ~35 tok/s on an RTX 4090, since each generated token reads about half as many weight bytes and supported kernels can use INT8 tensor cores.
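The arithmetic behind these figures can be checked with a short sketch. It counts weight storage only (no KV cache or runtime overhead) and assumes 1 GB = 10^9 bytes; the helper name is hypothetical. The Q8_0 line uses llama.cpp's block layout of 32 INT8 values plus one 16-bit scale.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes); excludes KV cache and overhead."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # a 7B parameter model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int8_gb = weight_memory_gb(N_PARAMS, 8)

# Q8_0: each block of 32 weights carries 32 INT8 values and a 16-bit scale.
q8_0_bits_per_weight = (32 * 8 + 16) / 32
q8_0_gb = weight_memory_gb(N_PARAMS, q8_0_bits_per_weight)

print(f"FP16: {fp16_gb:.2f} GB")
print(f"INT8: {int8_gb:.2f} GB")
print(f"Q8_0: {q8_0_gb:.2f} GB at {q8_0_bits_per_weight} bits per weight")
```

Actual VRAM use is higher than these numbers because the runtime also allocates the KV cache and scratch buffers, which grow with context length.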

Workflow example

In llama.cpp, run ./llama-cli -m model.gguf -ngl 35 (./main in older builds) with a Q8_0 quantized model; -ngl sets how many layers are offloaded to the GPU. The Q8_0 quantization level indicates INT8 weights with block-wise scaling. In Ollama, pull a Q8_0 tag, for example ollama pull llama3.1:8b-instruct-q8_0; exact tag names vary by model, so check the model's tag list. In LM Studio, select a Q8_0 GGUF file from the model browser. The runtime loads the INT8 weights into VRAM and, where the backend supports it, uses integer kernels or INT8 tensor cores for matrix multiplication. The effect shows up in the tokens/sec output.
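What "INT8 weights with block-wise scaling" means can be sketched in a few lines. This is an illustration in the spirit of Q8_0, not llama.cpp's code: it assumes blocks of 32 values and a symmetric scale of max|x| / 127, and it keeps each scale as a Python float where the real format stores a 16-bit float. All names and sample weights are hypothetical.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def quantize_blockwise(values, block_size=BLOCK_SIZE):
    """Return one (scale, int8_values) pair per block of weights."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate floats by multiplying each value by its block scale."""
    return [scale * x for scale, q in blocks for x in q]


# Hypothetical weights: the second block is about 100x larger than the first.
weights = [0.001 * (i % 7 - 3) for i in range(32)]
weights += [0.1 * (i % 5 - 2) for i in range(32)]

blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"blocks: {len(blocks)}, worst absolute error: {worst_error:.6f}")
```

Because each block carries its own scale, a run of small weights is not forced onto the coarse grid needed by a run of large ones, which is one reason Q8_0 stays close to the original values.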

Reviewed by Fredoline Eruo. See our editorial policy.
