RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Hardware Planning for Local AI
  6. /Ch. 2
Hardware Planning for Local AI

02. Calculating VRAM Needs

Chapter 2 of 20 · 15 min
KEY INSIGHT

Quantization reduces VRAM requirements dramatically at acceptable quality loss—INT4 quantization runs 7B models on 6GB VRAM. ```python # VRAM Calculator Template def calculate_vram_requirement( model_size_billions: float, precision: str = "FP16", context_length: int = 2048, overhead_factor: float = 1.25 ) -> float: """Calculate VRAM requirement in GB.""" precision_bytes = { "FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5 } weights_gb = model_size_billions * precision_bytes[precision] return weights_gb * overhead_factor # Example: Mistral 7B in INT4 print(f"{calculate_vram_requirement(7, 'INT4'):.1f} GB minimum") ```

Accurate VRAM calculation prevents both overspending and frustrating performance failures. The formula has three components: model weights, context buffer, and operational overhead.

Model Weights

Parameters × bytes per parameter = base VRAM

  • FP32 (32-bit float): 4 bytes per parameter
  • FP16 (16-bit float): 2 bytes per parameter
  • INT8 (8-bit integer): 1 byte per parameter
  • INT4 (4-bit integer): 0.5 bytes per parameter

A 70B parameter model in FP16 requires 140GB minimum—just for weights.

KV Cache and Context

The context buffer stores conversation history and attention calculations. For a model with 8K context length:

KV_cache ≈ 2 × layers × hidden_size × context_length × bytes_per_param

Practically, expect 512MB to 4GB for context buffers depending on model architecture and context length.

Operational Overhead

Allocate 20-30% additional VRAM for the inference engine, tokenization, and output processing. This overhead is non-trivial.

Worked Example: Llama 3 8B

  • Model weights in FP16: 8B × 2 = 16GB
  • KV cache for 2K context: ~1GB
  • Overhead: ~2GB
  • Total: ~19GB recommended minimum

With INT4 quantization: 8B × 0.5 = 4GB for weights, total ~7GB

Common Model VRAM Requirements

Model FP16 INT8 INT4
7B 16GB 8GB 6GB
13B 26GB 14GB 10GB
33B 66GB 36GB 20GB
70B 140GB 74GB 40GB
EXERCISE

Calculate the VRAM requirements for a 13B model at INT8 precision with 20% overhead. Show your math step by step.

← Chapter 1
VRAM is Everything
Chapter 3 →
GPU Selection: Budget Tier