What this does

This guide uses the NVIDIA Management Library (NVML), through the pynvml module shipped in the nvidia-ml-py package, to collect real-time GPU metrics and expose them as Prometheus metrics. The steps below cover utilization and memory usage; temperature and power draw can be read from the same NVML device handles. Operators can track GPU saturation, detect memory leaks in inference servers, and set alerts for thermal throttling. The solution works with any NVIDIA GPU supported by the installed driver and does not require DCGM.

Steps

Verify the NVIDIA driver and nvidia-smi are functional:
```
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
```
Expected output: a CSV header followed by one data row per GPU, e.g., 45 %, 8192 MiB, 24576 MiB.
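If the exporter should refuse to start when nvidia-smi is broken, the same check can be run from Python. This is a minimal sketch, not part of the original guide: the function names are hypothetical, and it assumes every queried field is numeric (some GPUs report a placeholder such as [N/A] for unsupported fields, which would make int() raise).

```python
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_smi_rows(text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "utilization_percent": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return rows


def query_nvidia_smi() -> list[dict]:
    """Run nvidia-smi; raises if the binary is missing or exits non-zero."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_smi_rows(result.stdout)

# Usage on a GPU host:
#   for index, row in enumerate(query_nvidia_smi()):
#       print(index, row)
```

The noheader and nounits options drop the header row and the % / MiB suffixes so each field parses as a plain integer.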

Install the Python libraries:

pip install nvidia-ml-py prometheus-client

Create a gpu_metrics.py module that initializes NVML and defines Prometheus gauges:

# pynvml is the module name provided by the nvidia-ml-py package.
from pynvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
    nvmlDeviceGetMemoryInfo,
)
from prometheus_client import Gauge, generate_latest, CollectorRegistry

# A dedicated registry keeps the default process/platform collectors out of /metrics.
registry = CollectorRegistry()
gpu_util = Gauge("nvidia_gpu_utilization_percent", "GPU utilization", ["gpu"], registry=registry)
gpu_mem_used = Gauge("nvidia_gpu_memory_used_bytes", "GPU memory used", ["gpu"], registry=registry)
gpu_mem_total = Gauge("nvidia_gpu_memory_total_bytes", "GPU memory total", ["gpu"], registry=registry)

# NVML must be initialized once before any device query.
nvmlInit()

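Add a collection function and a loop that calls it every 5 seconds:

import time

def collect_gpu_metrics():
    """Read every GPU once and update the gauges."""
    count = nvmlDeviceGetCount()
    for i in range(count):
        handle = nvmlDeviceGetHandleByIndex(i)
        util = nvmlDeviceGetUtilizationRates(handle)  # util.gpu is a percentage
        mem = nvmlDeviceGetMemoryInfo(handle)  # mem.used and mem.total are in bytes
        gpu_util.labels(gpu=str(i)).set(util.gpu)
        gpu_mem_used.labels(gpu=str(i)).set(mem.used)
        gpu_mem_total.labels(gpu=str(i)).set(mem.total)

def collect_loop(interval_seconds=5):
    """Refresh the gauges forever; run this in a background thread."""
    while True:
        collect_gpu_metrics()
        time.sleep(interval_seconds)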
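The collection code above reads only utilization and memory. Temperature and power draw can be read from the same device handle with nvmlDeviceGetTemperature and nvmlDeviceGetPowerUsage. The helper below is a hedged sketch: its name, the returned keys, and the gauge names in the usage comment are hypothetical, and the pynvml module is passed in as a parameter only so the function can be exercised without a GPU.

```python
def read_temperature_and_power(nvml, handle) -> dict:
    """Return GPU temperature (Celsius) and power draw (watts) for one device.

    `nvml` is the imported pynvml module; `handle` comes from
    nvmlDeviceGetHandleByIndex().
    """
    temperature_c = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
    power_mw = nvml.nvmlDeviceGetPowerUsage(handle)  # NVML reports milliwatts
    return {
        "temperature_celsius": float(temperature_c),
        "power_watts": power_mw / 1000.0,
    }

# Hypothetical wiring inside collect_gpu_metrics(), assuming two extra gauges
# named gpu_temp and gpu_power were defined next to the existing ones:
#   import pynvml
#   reading = read_temperature_and_power(pynvml, handle)
#   gpu_temp.labels(gpu=str(i)).set(reading["temperature_celsius"])
#   gpu_power.labels(gpu=str(i)).set(reading["power_watts"])
```

Some GPUs do not support power readings, in which case the NVML call raises an NVMLError; catching it per device keeps one unsupported field from stopping the whole collection pass.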

Serve the metrics over HTTP on port 9400, the port used in the scrape config below. Use FastAPI or a minimal http.server with a /metrics route that returns generate_latest(registry).
Run the collector in a background thread or asyncio task so the gauges stay current between scrapes.
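One way to wire up the http.server option is sketched below. The helper name make_metrics_server and its parameters are assumptions for illustration; it takes a render callable so that, in gpu_metrics.py, it can be given lambda: generate_latest(registry).

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_metrics_server(render: Callable[[], bytes], host: str = "0.0.0.0",
                        port: int = 9400) -> ThreadingHTTPServer:
    """Build an HTTP server whose /metrics route returns render()."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render()
            self.send_response(200)
            # Content type of the Prometheus text exposition format.
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep scrape requests out of stderr

    return ThreadingHTTPServer((host, port), MetricsHandler)

# Assumed wiring at the bottom of gpu_metrics.py, where collect_loop is a
# function that calls collect_gpu_metrics() every few seconds:
#   threading.Thread(target=collect_loop, daemon=True).start()
#   make_metrics_server(lambda: generate_latest(registry)).serve_forever()
```

Marking the collector thread as a daemon lets the process exit when the server stops, without a separate shutdown signal for the loop.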

Add a scrape job under scrape_configs in the Prometheus configuration file:

- job_name: "gpu-metrics"
  scrape_interval: 5s
  static_configs:
    - targets: ["localhost:9400"]

Query GPU utilization in Prometheus:
```
nvidia_gpu_utilization_percent
```
Expected: one series per GPU index, each with a value between 0 and 100.
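The same query can be issued programmatically through the Prometheus HTTP API. The sketch below assumes Prometheus listens on localhost:9090 (its default port, not stated in this guide) and uses hypothetical function names; it maps each gpu label to its current value.

```python
import json
import urllib.parse
import urllib.request


def parse_instant_vector(payload: dict) -> dict[str, float]:
    """Map each series' `gpu` label to its current sample value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        values[series["metric"].get("gpu", "")] = float(raw_value)
    return values


def query_gpu_utilization(base_url: str = "http://localhost:9090") -> dict[str, float]:
    """Run an instant query for nvidia_gpu_utilization_percent."""
    params = urllib.parse.urlencode({"query": "nvidia_gpu_utilization_percent"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=5) as resp:
        return parse_instant_vector(json.load(resp))

# Usage against a running Prometheus:
#   print(query_gpu_utilization())
```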

Verification

curl -s http://localhost:9400/metrics | grep nvidia_gpu_utilization_percent

Expected output: nvidia_gpu_utilization_percent{gpu="0"} 45.0 or a similar value between 0 and 100. An idle GPU legitimately reports 0.0.
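For an automated check, for example in a deployment smoke test, the curl and grep step can be approximated in Python. This is a sketch with hypothetical function names; the regular expression assumes the gauge carries exactly one label, gpu, as defined in this guide.

```python
import re
import urllib.request

SAMPLE_RE = re.compile(r'^nvidia_gpu_utilization_percent\{gpu="([^"]+)"\}\s+(\S+)$')


def parse_utilization(metrics_text: str) -> dict[str, float]:
    """Extract {gpu index: utilization} from Prometheus exposition text."""
    readings = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            readings[match.group(1)] = float(match.group(2))
    return readings


def check_exporter(url: str = "http://localhost:9400/metrics") -> dict[str, float]:
    """Fetch /metrics and fail loudly if utilization is missing or out of range."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        readings = parse_utilization(resp.read().decode("utf-8"))
    if not readings:
        raise RuntimeError("no nvidia_gpu_utilization_percent samples found")
    for gpu, value in readings.items():
        if not 0.0 <= value <= 100.0:
            raise RuntimeError(f"GPU {gpu} reports out-of-range utilization {value}")
    return readings

# Usage while the exporter is running:
#   print(check_exporter())
```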

Common failures

NVML initialization fails: confirm the NVIDIA kernel module is loaded with lsmod | grep nvidia. If it is absent, install the driver (on Ubuntu, sudo ubuntu-drivers autoinstall) and reboot.
Gauges are missing or never change: the collection function is probably not being called. Verify that the background thread is started, e.g. threading.Thread(target=collect_loop, daemon=True).start(), where collect_loop is the function that calls collect_gpu_metrics() every 5 seconds.
Permission denied on /dev/nvidia0: check the device node's owner and mode with ls -l /dev/nvidia0. On systems that restrict access to the video group, add the user that runs the Python process with sudo usermod -aG video $USER and log in again.
nvidia-smi works but pynvml fails: the installed nvidia-ml-py may not match the driver. Compare the package version from pip show nvidia-ml-py with the driver version from nvidia-smi --query-gpu=driver_version --format=csv,noheader.
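The NVML-related failures above can be narrowed down with a short standalone script. The sketch below uses a hypothetical function name and imports pynvml lazily so it also reports a missing package. It calls nvmlShutdown at the end, so it is meant to run on its own and not inside the exporter process.

```python
def diagnose_nvml() -> str:
    """Return a one-line description of the NVML state on this host."""
    try:
        import pynvml  # module provided by the nvidia-ml-py package
    except ImportError:
        return "pynvml is not importable: run `pip install nvidia-ml-py`"
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as exc:
        # Covers a missing driver library, a driver mismatch, and permission errors.
        return f"nvmlInit failed: {exc}"
    try:
        version = pynvml.nvmlSystemGetDriverVersion()
        if isinstance(version, bytes):  # older bindings return bytes
            version = version.decode()
        count = pynvml.nvmlDeviceGetCount()
        return f"NVML OK: driver {version}, {count} GPU(s) visible"
    except pynvml.NVMLError as exc:
        return f"NVML query failed: {exc}"
    finally:
        pynvml.nvmlShutdown()

# Usage:
#   print(diagnose_nvml())
```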

How to monitor GPU utilization for AI inference servers with nvidia-ml-py
