02. Calculating VRAM Needs
Accurate VRAM calculation prevents both overspending on hardware and out-of-memory failures at load or inference time. The estimate has three components: model weights, context buffer, and operational overhead.
Model Weights
Parameters × bytes per parameter = base VRAM
- FP32 (32-bit float): 4 bytes per parameter
- FP16 (16-bit float): 2 bytes per parameter
- INT8 (8-bit integer): 1 byte per parameter
- INT4 (4-bit integer): 0.5 bytes per parameter
A 70B parameter model in FP16 requires at least 140GB (70B × 2 bytes) for the weights alone.
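The weight calculation above can be sketched in a few lines of Python. The function name `weight_vram_gb` and the `BYTES_PER_PARAM` table are illustrative names, not part of any library, and the sketch assumes 1GB = 10^9 bytes so that billions of parameters times bytes per parameter gives gigabytes directly.

```python
# Bytes per parameter for each precision listed above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Return weight memory in GB (weights only, no KV cache or overhead).

    Assumes 1 GB = 10^9 bytes, so billions of parameters multiplied by
    bytes per parameter is already a figure in GB.
    """
    return params_billions * BYTES_PER_PARAM[precision]


# Weight footprint of a 70B model at each precision.
for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_vram_gb(70, precision):.0f} GB")
```

For the 70B case this gives 280, 140, 70, and 35 GB for FP32, FP16, INT8, and INT4 respectively.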
KV Cache and Context
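The context buffer (KV cache) stores the attention keys and values for every token in the context, including the conversation history, so it grows linearly with context length. For a given context length (for example, 8K tokens), the approximate size for a single sequence is shown below; the factor of 2 covers one key tensor and one value tensor per layer: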
KV_cache ≈ 2 × layers × hidden_size × context_length × bytes_per_param
In practice, expect 512MB to 4GB for the context buffer, depending on model architecture and context length.
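As a rough check on that range, the KV cache formula can be evaluated directly. This is a minimal sketch: `kv_cache_bytes` is a hypothetical helper, and the 32-layer, 4096-hidden-size configuration is an assumed example of a 7B-8B class model, not a figure from a specific model card. It also assumes standard multi-head attention; architectures that share key/value heads across query heads (grouped-query attention) need proportionally less.

```python
def kv_cache_bytes(layers: int, hidden_size: int, context_length: int,
                   bytes_per_param: float = 2.0) -> float:
    """KV cache size in bytes for one sequence.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return 2 * layers * hidden_size * context_length * bytes_per_param


GIB = 1024 ** 3  # bytes in one GiB

# Assumed example configuration: 32 layers, hidden size 4096, FP16 cache.
for context_length in (1024, 2048, 8192):
    size = kv_cache_bytes(32, 4096, context_length)
    print(f"{context_length:>5} tokens: {size / GIB:.2f} GiB")
```

With these assumed dimensions the cache is 0.5 GiB at 1K tokens, 1 GiB at 2K, and 4 GiB at 8K, which lines up with the 512MB to 4GB range quoted above.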
Operational Overhead
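Allocate an additional 20-30% of VRAM on top of weights and KV cache for the inference engine, tokenization, and output processing. For a model with 16GB of weights, that is roughly 3-5GB, so leaving it out of the budget can be the difference between a model that loads and one that does not.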
Worked Example: Llama 3 8B
- Model weights in FP16: 8B × 2 = 16GB
- KV cache for 2K context: ~1GB
- Overhead: ~2GB
- Total: ~19GB recommended minimum
With INT4 quantization, the weights drop to 8B × 0.5 = 4GB; keeping the same ~1GB KV cache and ~2GB overhead gives a total of ~7GB.
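The worked example can be reproduced with a short script. This is a sketch under the same assumptions as the example (a ~1GB KV cache and a flat ~2GB overhead); `estimate_total_gb` is an illustrative name. The last lines apply the 20-30% overhead guideline instead of the flat figure, to show how much the choice of overhead rule moves the result.

```python
def estimate_total_gb(params_billions: float, bytes_per_param: float,
                      kv_cache_gb: float, overhead_gb: float) -> float:
    """Sum the three components: weights, KV cache, and overhead (all in GB)."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb + kv_cache_gb + overhead_gb


# Figures from the worked example: 8B parameters, ~1GB KV cache, ~2GB overhead.
fp16_total = estimate_total_gb(8, 2.0, kv_cache_gb=1.0, overhead_gb=2.0)
int4_total = estimate_total_gb(8, 0.5, kv_cache_gb=1.0, overhead_gb=2.0)
print(f"FP16 total: ~{fp16_total:.0f} GB")
print(f"INT4 total: ~{int4_total:.0f} GB")

# Alternative: apply the 20-30% overhead guideline to weights + KV cache.
base_gb = 8 * 2.0 + 1.0
low_gb, high_gb = base_gb * 1.20, base_gb * 1.30
print(f"FP16 with 20-30% overhead: {low_gb:.1f} to {high_gb:.1f} GB")
```

The flat-overhead totals match the ~19GB and ~7GB figures above. The percentage rule gives about 20.4 to 22.1GB for FP16, so the ~19GB figure is better read as a floor than as a comfortable target.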
Common Model VRAM Requirements
| Model | FP16 | INT8 | INT4 |
|---|---|---|---|
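| 7B | 14GB | 8GB | 6GB |
| 13B | 26GB | 14GB | 10GB |
| 33B | 66GB | 36GB | 20GB |
| 70B | 140GB | 74GB | 40GB |

FP16 values are weights only (parameters × 2 bytes). The INT8 and INT4 values are slightly above the raw weight size (for example, 13B at INT8 is 13GB of weights), so treat them as rounded estimates rather than exact results of the formula.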
Calculate the VRAM requirements for a 13B model at INT8 precision with 20% overhead. Show your math step by step.