How to measure memory usage during model inference
Prerequisites
An NVIDIA GPU with nvidia-smi installed, or an AMD GPU with rocm-smi installed, plus an inference session you can start on demand. The commands in the steps use nvidia-smi.
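Before starting, it can help to confirm which GPU query tool is actually on PATH. The following is a minimal sketch, not part of the original procedure; the function name first_available is hypothetical, and it only checks that a command exists, not that the driver behind it works.

```shell
#!/bin/sh
# Print the first command from the argument list that exists on PATH.
# Returns non-zero (and prints a message to stderr) if none is found.
# first_available is a hypothetical helper name used for illustration.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "none of these tools were found on PATH: $*" >&2
    return 1
}

# Example:
#   first_available nvidia-smi rocm-smi
```

Passing the candidates as arguments keeps the helper independent of any one vendor's tooling.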
What this does
Captures GPU and system RAM consumption before, during, and after model inference, producing baseline and peak memory readings. The end state is a clear record of how much memory a model consumes at rest and under load.
Steps
1. Capture baseline GPU memory before inference. This checks free VRAM with no model loaded.
    nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,noheader,nounits
   Expected output:
    450, 7361, 7808
   (Used: 450 MiB, Free: 7361 MiB, Total: 7808 MiB.)
2. Load the model and measure GPU memory during inference. Run this from a second terminal while the model generates output.
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
   Expected output: a value higher than baseline, e.g.
    1850
   (1850 MiB used under load.)
3. Check system-wide RAM with free. This verifies host RAM consumption alongside GPU usage.
    free -h
   Expected output: total, used, and available memory in the Mem row, plus swap usage in the Swap row.
4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
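A single reading in step 2 can miss the true peak, so one option is to poll repeatedly and keep the largest value. The sketch below assumes a single GPU and uses two hypothetical helper names, sample_gpu_mib and peak_mib; the sample count, interval, and the samples.txt file name are arbitrary choices, not values from this guide.

```shell
#!/bin/sh
# Read one memory.used sample (MiB) per line on stdin and print the largest.
# Lines that are not a single non-negative integer are ignored.
# Exits non-zero if no valid sample was seen.
peak_mib() {
    awk '
        NF == 1 && $1 ~ /^[0-9]+$/ {
            v = $1 + 0
            if (!seen || v > max) { max = v; seen = 1 }
        }
        END { if (seen) print max; else exit 1 }
    '
}

# Poll nvidia-smi COUNT times, INTERVAL seconds apart, one reading per line.
# With several GPUs nvidia-smi prints one line per GPU; add "-i 0" to pin one.
sample_gpu_mib() {
    count=$1
    interval=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example (second terminal, while the model generates output):
#   sample_gpu_mib 30 1 | tee samples.txt | peak_mib
```

Keeping the raw samples in a file alongside the peak also serves as the run evidence from the last step.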
Verification
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
# Expected: value exceeds baseline from step 1 by at least the model size
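The comparison against the baseline can be scripted rather than checked by eye. This is a sketch under the assumption that you supply the expected minimum increase yourself (for example, the approximate size of the model weights in MiB); check_delta is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed when PEAK exceeds BASELINE by at least MIN_DELTA (all in MiB).
# Always prints the observed difference so it can be recorded.
check_delta() {
    baseline=$1
    peak=$2
    min_delta=$3
    delta=$((peak - baseline))
    echo "delta: ${delta} MiB"
    [ "$delta" -ge "$min_delta" ]
}

# Example with the readings shown in the steps (450 baseline, 1850 under load)
# and an assumed minimum increase of 1000 MiB:
#   check_delta 450 1850 1000
```

A non-zero exit status here corresponds to the "baseline and peak identical" case listed under common failures.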
Common failures
- nvidia-smi not found: NVIDIA driver not installed or not in PATH; install the driver and confirm with: which nvidia-smi
- model still loading: Wait for the model to finish loading before measuring; VRAM usage spikes initially during KV cache allocation.
- memory reported as zero: Inference runtime may be running on CPU only; verify GPU selection in the runtime config.
- baseline and peak identical: Model is likely running on CPU only; check that the GPU is visible by running nvidia-smi directly.
- process not found: Ollama process may run under a different user; check with: sudo ps aux | grep ollama
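When a reading is zero or unchanged, it may help to check whether the inference process holds any GPU memory at all. The sketch below parses the per-process listing from nvidia-smi; gpu_mib_for is a hypothetical helper name, and matching on the substring "ollama" assumes that is how the process appears in the listing, which can differ inside containers or under another user.

```shell
#!/bin/sh
# Sum used GPU memory (MiB) for processes whose name contains NAME.
# stdin: lines of "pid, process_name, used_memory" as printed by
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#              --format=csv,noheader,nounits
# Prints 0 and exits non-zero when no matching process is listed.
gpu_mib_for() {
    awk -F', *' -v name="$1" '
        index($2, name) { total += $3; found = 1 }
        END { if (found) print total; else { print 0; exit 1 } }
    '
}

# Example:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#              --format=csv,noheader,nounits | gpu_mib_for ollama
```

A result of 0 with the model loaded suggests the runtime has fallen back to CPU.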