What this does

Captures GPU and system RAM consumption before, during, and after model inference, producing baseline and peak memory readings. The end state is a clear record of how much memory a model consumes at rest and under load.

Steps

Capture baseline GPU memory before inference. Checks free VRAM with no model loaded.
```
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,noheader,nounits
```
Example output: 450, 7361, 7808 (used, free, and total, in MiB). Exact values depend on the GPU and on whatever else is using it.
Load the model and measure GPU memory during inference. Run this in a second terminal while the model is generating output.
```
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
```
Expected output: a single number higher than the baseline, e.g. 1850 (in MiB; the nounits option strips the unit from the output).
Check system-wide RAM with free. Verifies host RAM consumption alongside GPU usage.
```
free -h
```
Expected output: a Mem row and a Swap row with total, used, and free columns; the Mem row also shows available memory.
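
A single reading taken during inference can miss the true peak, so it may help to sample repeatedly and keep the highest value. The sketch below is one way to do that; the function names max_of and sample_gpu_used are hypothetical helpers, not part of nvidia-smi, and the sample count and interval are assumptions to tune for your workload.

```shell
# Hypothetical helper: read one integer (MiB) per line on stdin, print the largest.
max_of() {
  local max=0 value
  while read -r value; do
    case "$value" in
      ''|*[!0-9]*) continue ;;  # skip blank or non-numeric lines
    esac
    if [ "$value" -gt "$max" ]; then max="$value"; fi
  done
  echo "$max"
}

# Hypothetical helper: take COUNT readings of used VRAM, INTERVAL seconds apart.
# On a multi-GPU host nvidia-smi prints one line per GPU for each reading.
sample_gpu_used() {
  local count="$1" interval="$2" i=0
  while [ "$i" -lt "$count" ]; do
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    sleep "$interval"
    i=$((i + 1))
  done
}
```

For example, running sample_gpu_used 30 1 | max_of in the second terminal takes 30 readings one second apart and prints the highest. On a multi-GPU host this reports the maximum across all GPUs, not a per-GPU peak.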

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
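
One possible way to keep that record is to append each observation to a plain-text log. This is a minimal sketch: record_run, the log layout, and the field names are assumptions chosen for illustration, not a required format.

```shell
# Hypothetical helper: append one run record to a log file.
# Arguments: log path, model name, exact command, observed output.
record_run() {
  local log="$1" model="$2" cmd="$3" output="$4"
  {
    printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'model: %s\n' "$model"
    printf 'command: %s\n' "$cmd"
    printf 'output: %s\n' "$output"
    echo '---'
  } >> "$log"
}
```

A call might look like record_run memory-runs.log "example-model" "nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits" "1850", where the log file name and model name are placeholders. The driver version can be captured for the record with nvidia-smi --query-gpu=driver_version --format=csv,noheader.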

Verification

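```
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
# Expected: a value above the baseline captured in the first step.
# With the model fully on the GPU, the increase is roughly the size of the
# model weights plus KV cache and runtime overhead.
```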
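
The comparison can be made explicit by subtracting the baseline from the reading under load. The sketch below assumes both values are whole MiB as printed with nounits; vram_delta is a hypothetical helper name.

```shell
# Hypothetical helper: print peak minus baseline (MiB).
# Returns success only if usage actually rose above the baseline.
vram_delta() {
  local baseline="$1" peak="$2" delta
  delta=$((peak - baseline))
  echo "$delta"
  [ "$delta" -gt 0 ]
}
```

With the example readings from the steps, vram_delta 450 1850 prints 1400 and succeeds. A delta of 0 fails the check, which matches the "baseline and peak identical" case under Common failures.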

Common failures

nvidia-smi not found: NVIDIA driver not installed or not in PATH; install the driver and confirm with which nvidia-smi.
model still loading: Wait for the model to finish loading before measuring; VRAM usage climbs while weights load and can spike when the KV cache is allocated.
memory reported as zero: Inference runtime may be running on CPU only; verify GPU selection in the runtime config.
baseline and peak identical: Model is likely running on CPU only; confirm the GPU is visible by running nvidia-smi directly and check that the inference process appears in its process list.
process not found: Ollama process may run under a different user; check with sudo ps aux | grep ollama.
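
Several of these checks can be run together before measuring. The following is a rough sketch, not a complete diagnostic: preflight is a hypothetical function name, and the CUDA_VISIBLE_DEVICES check assumes the inference runtime uses CUDA.

```shell
# Hypothetical preflight covering the first few failure cases listed above.
preflight() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found on PATH"
    return 1
  fi
  # An empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA programs.
  echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
  # List processes currently holding GPU memory; an empty list while the
  # model is generating suggests inference is running on the CPU.
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
}
```

Run preflight in the same shell and as the same user as the inference runtime, since environment variables and process visibility can differ between users.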

How to measure memory usage during model inference
