Hardware & infrastructure

HBM (High Bandwidth Memory)

HBM (High Bandwidth Memory) is a 3D-stacked DRAM design that vertically layers memory dies with through-silicon vias (TSVs) to achieve much higher bandwidth per watt than traditional GDDR memory. In local AI, HBM appears on high-end GPUs like the NVIDIA H100 (80 GB HBM3, 3.35 TB/s) and AMD MI300X (192 GB HBM3, ~5.2 TB/s). The key operator implication: HBM’s bandwidth directly determines how fast a model’s weights can be fed into compute units, making it the primary bottleneck for prompt processing and token generation in large models. Consumer GPUs (RTX 4090, RX 7900 XTX) use GDDR6X or GDDR6, which have lower bandwidth (1 TB/s) and higher power draw per GB transferred.

Deeper dive

HBM differs from GDDR in architecture: GDDR uses a wide, parallel bus on a PCB, while HBM stacks DRAM dies vertically and connects them through a silicon interposer with thousands of TSVs. This reduces physical trace length and allows much wider memory buses (e.g., 1024-bit vs 384-bit for GDDR). The result is bandwidth per watt roughly 3-5x higher than GDDR at the same process node. HBM generations: HBM2 (up to 2.0 GT/s per pin, 256 GB/s per stack), HBM2e (up to 3.2 GT/s), HBM3 (up to 6.4 GT/s, with per-pin bandwidth up to 819 GB/s per stack). For operators, the practical difference shows in large model inference: a 70B parameter model at Q4 (40 GB) on an H100 with HBM3 can achieve ~100 tok/s, while the same model on an RTX 4090 with GDDR6X (1 TB/s) would be limited to ~30 tok/s due to bandwidth constraints. HBM is also more expensive per GB, which is why consumer cards stick with GDDR.

Practical example

Consider running Llama 3.1 70B Q4_K_M (40 GB). On an NVIDIA H100 (80 GB HBM3, 3.35 TB/s), the model fits entirely in VRAM and inference runs at ~100 tok/s. On an RTX 4090 (24 GB GDDR6X, ~1 TB/s), the model cannot fit; even if offloaded to system RAM, the bandwidth bottleneck from PCIe and DDR5 (50 GB/s) drops throughput to ~2 tok/s. The bandwidth difference (3.35 TB/s vs 1 TB/s) directly limits how fast weights can be loaded for each token.

Workflow example

When selecting hardware for local inference, operators check memory bandwidth specs. In llama.cpp, the --memory-limit flag can cap usage, but the runtime's performance is bound by VRAM bandwidth. For example, running ./llama-cli -m llama-70b-q4.gguf -n 256 on an H100 will show ~100 tok/s, while the same command on an RTX 4090 (if the model could fit) would show ~30 tok/s. In practice, operators targeting large models (30B+) often choose HBM-equipped hardware (e.g., used A100s, H100s) or accept slower GDDR speeds.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work