How to monitor GPU utilization for AI inference servers with nvidia-ml-py
Prerequisites: an NVIDIA GPU with a working driver, and nvidia-ml-py or the older pynvml package installed (both are imported as pynvml)
What this does
This guide uses pynvml, the Python bindings for the NVIDIA Management Library (NVML), to collect real-time GPU metrics (utilization, memory usage, temperature, and power draw) and expose them as Prometheus metrics. Operators can track GPU saturation, detect memory leaks in inference servers, and set alerts for thermal throttling. The solution works with any NVIDIA GPU supported by the driver and does not require DCGM.
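Temperature and power draw are named above, while the steps in this guide export only utilization and memory. The following is a minimal sketch of how the two extra readings could be taken. The function name read_temp_power and the keys of the returned dict are illustrative and not part of the guide; the pynvml module is passed in as an argument so the function can be exercised without a GPU.

```python
def read_temp_power(nvml, index=0):
    """Return temperature (deg C) and power draw (W) for one GPU.

    `nvml` is the imported pynvml module (or a stand-in exposing the same
    functions); nvmlInit() must already have been called on it.
    """
    handle = nvml.nvmlDeviceGetHandleByIndex(index)
    # NVML_TEMPERATURE_GPU selects the on-die GPU temperature sensor.
    temp_c = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
    # NVML reports power usage in milliwatts; convert to watts.
    power_w = nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    return {"temperature_celsius": temp_c, "power_watts": power_w}
```

On a machine with a GPU this would be called as read_temp_power(pynvml, 0) after import pynvml and pynvml.nvmlInit(). The two values could then feed additional gauges registered the same way as the utilization gauge. Some boards do not support power readings, in which case NVML raises an error that the caller would need to handle.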
Steps
1. Verify that the NVIDIA driver and nvidia-smi are functional:

       nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv

   Expected output: a CSV header followed by one data row per GPU, e.g., 45 %, 8192 MiB, 24576 MiB.

2. Install the Python libraries:

       pip install nvidia-ml-py prometheus-client

3. Create a gpu_metrics.py module that initializes NVML and defines Prometheus gauges:

       import time

       from pynvml import nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex
       from pynvml import nvmlDeviceGetUtilizationRates, nvmlDeviceGetMemoryInfo
       from prometheus_client import Gauge, generate_latest, CollectorRegistry

       # Dedicated registry so /metrics exposes only the GPU gauges.
       registry = CollectorRegistry()
       gpu_util = Gauge("nvidia_gpu_utilization_percent", "GPU utilization", ["gpu"], registry=registry)
       gpu_mem_used = Gauge("nvidia_gpu_memory_used_bytes", "GPU memory used", ["gpu"], registry=registry)
       gpu_mem_total = Gauge("nvidia_gpu_memory_total_bytes", "GPU memory total", ["gpu"], registry=registry)

       nvmlInit()  # must run once before any other NVML call

4. Add a collection loop that updates the gauges every 5 seconds:

       def collect_gpu_metrics():
           """Read every GPU once and update the gauges."""
           count = nvmlDeviceGetCount()
           for i in range(count):
               handle = nvmlDeviceGetHandleByIndex(i)
               util = nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
               mem = nvmlDeviceGetMemoryInfo(handle)         # .used and .total are bytes
               gpu_util.labels(gpu=str(i)).set(util.gpu)
               gpu_mem_used.labels(gpu=str(i)).set(mem.used)
               gpu_mem_total.labels(gpu=str(i)).set(mem.total)

       def collect_loop(interval=5.0):
           """Refresh the gauges forever, sleeping between passes."""
           while True:
               collect_gpu_metrics()
               time.sleep(interval)

5. Serve the metrics over HTTP on port 9400. Use FastAPI or a minimal http.server with a /metrics route that returns generate_latest(registry). Run collect_loop as a background thread or asyncio task so that the metrics stay current between scrapes.

6. Add a Prometheus scrape config (under scrape_configs in prometheus.yml):

       - job_name: "gpu-metrics"
         scrape_interval: 5s
         static_configs:
           - targets: ["localhost:9400"]

7. Query GPU utilization in Prometheus:

       nvidia_gpu_utilization_percent

   Expected: a value between 0 and 100 for each GPU index.
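The HTTP-serving step is described only in prose, so here is one possible sketch of the minimal http.server variant. The names make_handler and start_metrics_server are hypothetical. The render argument is assumed to be a zero-argument callable returning the exposition bytes, which in this guide would be lambda: generate_latest(registry). The default port of 9400 is taken from the scrape config.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(render):
    """Build a request handler whose /metrics route returns render()."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render()  # bytes in Prometheus text exposition format
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep per-scrape request logs out of stderr

    return MetricsHandler


def start_metrics_server(render, collect_loop=None, host="127.0.0.1", port=9400):
    """Start the optional collector thread and the HTTP server; return the server."""
    if collect_loop is not None:
        # Daemon thread: it stops when the main program exits.
        threading.Thread(target=collect_loop, daemon=True).start()
    server = ThreadingHTTPServer((host, port), make_handler(render))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With the module from the steps, this might be wired up as start_metrics_server(lambda: generate_latest(registry), collect_loop). Both threads are daemons, so the main thread has to stay alive, for example by blocking on threading.Event().wait(). The default host of 127.0.0.1 only accepts local scrapes; a Prometheus server on another machine would need host="0.0.0.0" or a specific interface address.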
Verification
curl -s http://localhost:9400/metrics | grep nvidia_gpu_utilization_percent
Expected output: nvidia_gpu_utilization_percent{gpu="0"} 45.0 or similar. The value tracks current load, so an idle GPU legitimately reports 0.0.
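The same check can be scripted. The sketch below parses the text returned by /metrics and flags any GPU whose utilization falls outside 0 to 100. The function names parse_gauge and out_of_range are illustrative, and the parser assumes the simple single-label lines shown above, with no timestamp after the value.

```python
def parse_gauge(text, name):
    """Map each gpu label to its value for one gauge in exposition text."""
    values = {}
    for line in text.splitlines():
        # Skip "# HELP" / "# TYPE" comments and lines for other metrics.
        if not line.startswith(name + "{"):
            continue
        labels, _, value = line.rpartition(" ")
        gpu = labels.split('gpu="', 1)[1].split('"', 1)[0]
        values[gpu] = float(value)
    return values


def out_of_range(values, low=0.0, high=100.0):
    """Return the gpu labels whose value falls outside [low, high]."""
    return sorted(g for g, v in values.items() if not low <= v <= high)
```

Feeding it the body of the curl response and the metric name nvidia_gpu_utilization_percent gives a dict keyed by GPU index. An empty dict suggests the collector has not run yet, and a non-empty result from out_of_range points at a parsing or collection bug.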
Common failures
- NVML initialization fails: confirm the NVIDIA driver is loaded with lsmod | grep nvidia. If no modules are listed, install the driver (on Ubuntu, sudo ubuntu-drivers autoinstall) and reboot.
- Gauges always report zero: the collection function may never be called. Verify that the background thread is started: threading.Thread(target=collect_loop, daemon=True).start().
- Permission denied on /dev/nvidia0: the user running the Python process must be in the video group. Run sudo usermod -aG video $USER and log in again.
- nvidia-smi works but pynvml fails: version mismatch between the installed nvidia-ml-py and the driver. Check compatibility: run pip show nvidia-ml-py and compare with nvidia-smi --version.
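Several of these failures can be told apart from Python before the collector starts. The sketch below is one hedged way to do that: diagnose_nvml is a hypothetical helper that reports whether the bindings are importable, whether NVML initializes, and which driver and NVML versions are in use. It is meant to be run on its own, not inside the collector process.

```python
def diagnose_nvml():
    """Return a one-line status string describing the NVML setup."""
    try:
        import pynvml
    except ImportError:
        return "pynvml module not found: install nvidia-ml-py"

    def text(value):
        # Older bindings return bytes, newer ones return str.
        return value.decode() if isinstance(value, bytes) else str(value)

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as err:
        # Typical causes: driver not loaded, or the NVML shared library is missing.
        return f"NVML init failed: {err}"
    try:
        driver = text(pynvml.nvmlSystemGetDriverVersion())
        nvml = text(pynvml.nvmlSystemGetNVMLVersion())
        count = pynvml.nvmlDeviceGetCount()
        return f"driver {driver}, NVML {nvml}, {count} GPU(s) visible"
    except pynvml.NVMLError as err:
        return f"NVML query failed: {err}"
    finally:
        pynvml.nvmlShutdown()
```

Running print(diagnose_nvml()) as the same user that runs the collector shows at a glance whether the problem is a missing package, a driver that is not loaded, or a permissions issue surfacing as an NVML error.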