VRAM is Everything — Hardware Planning for Local AI (Chapter 1)

When running AI models locally, one specification dominates all others: VRAM (Video Random Access Memory). This dedicated memory on your GPU stores the model weights, activations, and inference buffers during generation. If VRAM runs out, inference fails or crawls to unusable speeds.

Modern large language models have billions of parameters. Each parameter takes 2 bytes in FP16 precision or 4 bytes in FP32 precision, so a 7-billion-parameter model needs about 14GB in FP16 (7 billion × 2 bytes) just to load the weights. Activations, the KV cache, and batch processing all add to that figure, so your actual requirement exceeds the base calculation.
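The arithmetic above can be sketched in a few lines of Python. This is a rough estimate for the weights only, not a measurement: the function name and the precision labels are illustrative, and the 8-bit and 4-bit entries assume exactly 1 and 0.5 bytes per parameter, ignoring the small amount of extra metadata that real quantized formats store.

```python
# Bytes needed to store one parameter at each precision.
# The int8/int4 values are an assumption: real quantized formats add
# a little overhead for scales and other metadata.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(7e9, precision)
        print(f"7B model at {precision}: {size:.1f} GB")
```

For a 7B model this gives 28.0 GB at FP32, 14.0 GB at FP16, 7.0 GB at 8-bit, and 3.5 GB at 4-bit. Treat each number as a floor; the KV cache and activations come on top.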

Consider the RTX 4060 Ti with 16GB versus the RTX 4060 with 8GB. The 8GB card cannot hold a 7B model in FP16, because the weights alone need about 14GB, so it has to rely on quantized weights (4-bit or 8-bit) to fit. The 16GB variant can load the same model in FP16, though with only about 2GB to spare, and can handle 13B models with 4-bit quantization. That single spec, VRAM, determines which models you can run effectively.
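To make the comparison concrete, here is a minimal sketch that computes how much VRAM is left after the weights are loaded. The helper name is hypothetical, and the result is only headroom: the KV cache, activations, and the runtime itself still have to fit in whatever remains, so a small positive number does not guarantee a usable setup.

```python
def headroom_gb(num_params: float, bits_per_param: float, vram_gb: float) -> float:
    """VRAM left after loading the weights, in decimal GB.

    A negative result means the weights alone do not fit.
    """
    weights_gb = num_params * bits_per_param / 8 / 1e9
    return vram_gb - weights_gb


if __name__ == "__main__":
    cases = [
        ("7B FP16", 7e9, 16),
        ("7B 8-bit", 7e9, 8),
        ("7B 4-bit", 7e9, 4),
        ("13B 4-bit", 13e9, 4),
    ]
    for label, params, bits in cases:
        for vram in (8, 16):
            left = headroom_gb(params, bits, vram)
            print(f"{label:>9} on {vram:>2}GB card: {left:+.1f} GB headroom")
```

The 7B FP16 case comes out at -6.0 GB on the 8GB card and +2.0 GB on the 16GB card, which matches the comparison above.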

CPU-only inference is technically possible but typically delivers 1-5 tokens per second versus 20-50+ tokens per second on a modern GPU. For interactive use cases, that gap makes CPU-only inference impractical. VRAM is not the only factor, but it is the gating factor.
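A quick calculation shows what those rates mean for waiting time. The 300-token reply length is an assumption chosen for illustration, and the rates are simply the endpoints of the ranges quoted above, not benchmarks.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce num_tokens at a steady generation rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 300  # assumed length of one reply, for illustration only
    for label, rate in [("CPU, slow", 1), ("CPU, fast", 5),
                        ("GPU, slow", 20), ("GPU, fast", 50)]:
        wait = seconds_to_generate(reply_tokens, rate)
        print(f"{label}: {wait:.0f} s for {reply_tokens} tokens")
```

At these rates a 300-token reply takes between 60 and 300 seconds on CPU and between 6 and 15 seconds on GPU.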

The NVLink interconnect on high-end NVIDIA GPUs lets a model be spread across the VRAM of multiple cards, but this adds cost and complexity, and the inference software has to support splitting the model. For most users, selecting a single GPU with adequate VRAM is simpler and more cost-effective than a multi-GPU configuration.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
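One way to capture those details is a small script that gathers them into a single record. This is a sketch using only the Python standard library. The function name, the field names, and the "numpy" placeholder are assumptions made for illustration; list whichever packages your exercise actually imports.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages, data_path=None, observed_output=None):
    """Collect the details the checkpoint asks for into one dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
        "data_path": data_path,
        "observed_output": observed_output,
    }


if __name__ == "__main__":
    # "numpy" is only a placeholder package name.
    print(json.dumps(environment_record(["numpy"]), indent=2))
```

Saving the printed JSON next to your notes keeps the constraints (backend, memory, package versions) attached to the result they produced.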