Calculating VRAM Needs — Hardware Planning for Local AI (Chapter 2)

Accurate VRAM calculation prevents both overspending and frustrating performance failures. The formula has three components: model weights, context buffer, and operational overhead.

Model Weights

Parameters × bytes per parameter = base VRAM

FP32 (32-bit float): 4 bytes per parameter
FP16 (16-bit float): 2 bytes per parameter
INT8 (8-bit integer): 1 byte per parameter
INT4 (4-bit integer): 0.5 bytes per parameter

A 70B-parameter model in FP16 requires at least 140GB for the weights alone.
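
The weight calculation above can be sketched in a few lines of Python. The names `BYTES_PER_PARAM` and `weight_vram_gb` are illustrative, not part of any library, and the result is in decimal gigabytes (1 GB = 10^9 bytes) to match the figures in this chapter.

```python
# Bytes per parameter for each precision listed above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate VRAM needed for the model weights, in decimal GB.

    Billions of parameters times bytes per parameter gives GB directly,
    because the factor of 1e9 cancels out.
    """
    return params_billions * BYTES_PER_PARAM[precision]


print(weight_vram_gb(70, "FP16"))  # 140.0
print(weight_vram_gb(8, "INT4"))   # 4.0
```

This covers weights only; the context buffer and overhead described in the next sections come on top.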

KV Cache and Context

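The context buffer holds the KV cache: the attention keys and values stored for every token of the conversation so far. Its size grows linearly with context length (for example, context_length = 8192 for an 8K context):

KV_cache ≈ 2 × layers × hidden_size × context_length × bytes_per_param

The factor of 2 covers one set of keys and one set of values per layer, and the result is in bytes.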

Practically, expect 512MB to 4GB for context buffers depending on model architecture and context length.
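
As a rough check on that range, the KV cache formula can be evaluated directly. This is a minimal sketch: `kv_cache_bytes` is a hypothetical helper, and the architecture values (32 layers, hidden size 4096, FP16 cache) are assumed for illustration rather than taken from a specific model card.

```python
def kv_cache_bytes(layers: int, hidden_size: int, context_length: int,
                   bytes_per_param: float = 2.0) -> float:
    """Approximate KV cache size in bytes, following the formula above."""
    # Factor of 2: one set of keys and one set of values per layer.
    return 2 * layers * hidden_size * context_length * bytes_per_param


# Assumed architecture: 32 layers, hidden size 4096, FP16 cache, 8K context.
size = kv_cache_bytes(layers=32, hidden_size=4096, context_length=8192)
print(f"{size / 2**30:.2f} GiB")  # 4.00 GiB
```

Models that use grouped-query attention keep fewer key and value heads than the hidden size implies, so this formula should be read as an upper bound for them.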

Operational Overhead

Allocate 20-30% additional VRAM for the inference engine, intermediate activations, and output processing. Without this margin, a model that fits on paper can still run out of memory when it loads or during generation.

Worked Example: Llama 3 8B

Model weights in FP16: 8B × 2 = 16GB
KV cache for 2K context: ~1GB
Overhead: ~2GB
Total: ~19GB recommended minimum

With INT4 quantization: 8B × 0.5 = 4GB for weights, total ~7GB
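
The worked example can be reproduced with a small helper. This is a sketch under stated assumptions: `total_vram_gb` is a hypothetical function, the defaults (1GB KV cache, 2GB overhead) are the figures used in the example above, and the optional `overhead_fraction` assumes the percentage overhead is applied to weights plus KV cache, which is one possible reading of the 20-30% guideline.

```python
from typing import Optional


def total_vram_gb(params_billions: float, bytes_per_param: float,
                  kv_cache_gb: float = 1.0, overhead_gb: float = 2.0,
                  overhead_fraction: Optional[float] = None) -> float:
    """Sum weights, KV cache, and overhead into a recommended minimum (GB)."""
    weights_gb = params_billions * bytes_per_param
    base_gb = weights_gb + kv_cache_gb
    if overhead_fraction is not None:
        # Assumption: the percentage applies to weights plus KV cache.
        overhead = base_gb * overhead_fraction
    else:
        overhead = overhead_gb
    return base_gb + overhead


print(total_vram_gb(8, 2.0))  # 19.0  (FP16)
print(total_vram_gb(8, 0.5))  # 7.0   (INT4)
print(total_vram_gb(8, 2.0, overhead_fraction=0.25))  # 21.25
```

The fixed 2GB overhead matches the example's totals, while the fractional variant gives a more conservative figure for the FP16 case.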

Common Model VRAM Requirements

Model	FP16	INT8	INT4
7B	14GB	8GB	6GB
13B	26GB	14GB	10GB
33B	66GB	36GB	20GB
70B	140GB	74GB	40GB
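
To turn figures like these into a quick go/no-go check for a particular card, something like the following may help. It is an illustrative sketch, not a sizing tool: `precisions_that_fit` is a hypothetical helper, and the 3GB reserve (1GB KV cache plus 2GB overhead) is borrowed from the worked example rather than measured.

```python
# Bytes per parameter, as listed under Model Weights.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def precisions_that_fit(params_billions: float, vram_gb: float,
                        reserve_gb: float = 3.0) -> list:
    """Return the precisions whose weights plus a fixed reserve fit in vram_gb."""
    return [
        name
        for name, bytes_per_param in BYTES_PER_PARAM.items()
        if params_billions * bytes_per_param + reserve_gb <= vram_gb
    ]


# A 13B model on a card with 24GB of VRAM.
print(precisions_that_fit(13, 24))  # ['INT8', 'INT4']
```

Longer contexts need a larger reserve, so `reserve_gb` should be raised accordingly before relying on the result.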