HOW-TO · OPS

How to use Prometheus remote_write for long-term AI metrics archival

Advanced · 25 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Prometheus running, remote storage endpoint (Thanos/VictoriaMetrics)

What this does

This guide configures Prometheus remote_write to ship AI service metrics to a long-term storage backend for historical analysis, cost auditing, and capacity planning. Local Prometheus retains high-resolution data for 30 days (set with --storage.tsdb.retention.time=30d; the default is 15 days) while remote storage retains the data for years, optionally deduplicated or downsampled. This enables year-over-year comparisons of token usage trends, model performance regressions, and infrastructure utilization patterns without bloating the local Prometheus TSDB.
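
To put a rough number on local TSDB growth, the Prometheus storage documentation gives the estimate needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample, with 1 to 2 bytes per sample on average. A minimal sketch of that arithmetic follows; the function name tsdb_disk_bytes and the ingestion rate of 10,000 samples per second are illustrative assumptions, not measurements from this setup.

```shell
#!/usr/bin/env bash
# Rough local TSDB size estimate:
#   bytes = retention_days * 86400 * samples_per_second * bytes_per_sample
# tsdb_disk_bytes is a hypothetical helper; all inputs are assumptions.
tsdb_disk_bytes() {
  local retention_days="$1" samples_per_second="$2" bytes_per_sample="$3"
  echo $(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
}

# Example: 30 days at an assumed 10,000 samples/s and 2 bytes per sample.
tsdb_disk_bytes 30 10000 2   # prints 51840000000 (about 52 GB)
```

Keeping the same ingestion rate locally for two years would need roughly 24 times as much space, which is the case for moving long-term retention to remote storage.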

Steps

  1. Identify the remote storage endpoint URL. For Thanos Receive:

    http://thanos-receive:19291/api/v1/receive
    

    For VictoriaMetrics:

    http://victoriametrics:8428/api/v1/write
    
  2. Add the remote_write block to prometheus.yml:

    remote_write:
      - url: "http://victoriametrics:8428/api/v1/write"
        queue_config:
          capacity: 10000
          max_shards: 20
          min_shards: 1
          max_samples_per_send: 5000
          batch_send_deadline: 5s
          min_backoff: 30ms
          max_backoff: 5s
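        write_relabel_configs:
          # Relabel rules run in sequence, so both prefixes must be
          # matched by a single keep rule.
          - source_labels: [__name__]
            regex: "(ai_|vllm:).*"
            action: keep
    
  3. Configure the write relabel config to filter which metrics are sent. Relabel rules are applied in order and a sample must pass every keep rule, so two separate keep rules for ai_.* and vllm:.* would drop everything. The example above uses a single regex to keep only metrics with the ai_ or vllm: prefix, reducing remote storage costs by excluding irrelevant metrics.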

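  4. Restart Prometheus and confirm the remote write configuration was loaded. The config endpoint returns the whole configuration as a YAML string in .data.yaml, so extract it before searching:

    sudo systemctl restart prometheus
    curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A1 '^remote_write:'
    

    Expected output: the remote_write key followed by the configured URL.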

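  5. Monitor remote write health. Remote write endpoints are not scrape targets, so they do not appear under Status > Targets; check Prometheus's own remote storage metrics instead:

    curl -s 'http://localhost:9090/api/v1/query?query=prometheus_remote_storage_samples_total'
    

    Expected output: a non-zero, increasing value confirming samples are being sent. On Prometheus releases before 2.23 the equivalent metric is prometheus_remote_storage_succeeded_samples_total. Also check that prometheus_remote_storage_samples_failed_total stays at zero.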

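  6. Configure retention and deduplication on the remote storage. In VictoriaMetrics, -dedup.minScrapeInterval=30s keeps at most one sample per series in each 30-second window, and -retentionPeriod=24 keeps data for 24 months (a value without a unit suffix is counted in months). Deduplication thins the data but is not true downsampling; VictoriaMetrics offers that separately through -downsampling.period in its enterprise edition.

    -dedup.minScrapeInterval=30s -retentionPeriod=24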
    
  7. Verify historical data is queryable through the remote storage's query API. For VictoriaMetrics:

    curl "http://victoriametrics:8428/api/v1/query?query=ai_token_input_total&time=$(date -d '30 days ago' +%s)"
    

    Expected output: metric data from 30 days ago if the remote write has been running that long.
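
Because relabel rules are applied in sequence, a metric name must pass every keep rule to be sent. The sketch below tries a keep regex against a list of metric names before it is deployed. It emulates the rule with grep -E -x, on the assumption that an anchored extended regex behaves like Prometheus's fully anchored regex for simple patterns such as these. The helper names and sample metric names are hypothetical.

```shell
#!/usr/bin/env bash
# keep_metrics: emulate a relabel "keep" on __name__ for names read from stdin.
# Prometheus anchors relabel regexes at both ends; grep -x anchors per line.
# Hypothetical helper, not part of Prometheus or promtool.
keep_metrics() {
  grep -E -x -- "$1" || true   # no match is not an error here
}

# Illustrative metric names (assumed, not taken from a real scrape).
sample_names() {
  printf '%s\n' ai_token_input_total vllm:num_requests_running \
    node_cpu_seconds_total go_goroutines
}

# One combined rule keeps both prefixes.
sample_names | keep_metrics '(ai_|vllm:).*'

# Two chained keep rules: nothing survives both, so this prints nothing.
sample_names | keep_metrics 'ai_.*' | keep_metrics 'vllm:.*'
```

The first pipeline prints ai_token_input_total and vllm:num_requests_running. The second prints nothing, which is what happens when the two prefixes are written as separate keep rules.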

Verification

curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'

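Expected output: the number of series in the local TSDB head. This number does not change when write_relabel_configs is added, because those rules filter only what is sent to remote storage. To confirm the filter is active, check that prometheus_remote_storage_samples_dropped_total is increasing.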
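
Remote write health can also be summarized from the shell by reading Prometheus's own /metrics output. The sketch below sums attempted and failed samples across all remote write queues and prints the ratio. It assumes the metric names used by Prometheus 2.23 and later and that the sample value is the last field on each line. The function name is hypothetical, and because the attempted counter includes retries the ratio is only approximate.

```shell
#!/usr/bin/env bash
# remote_write_failure_ratio: read Prometheus exposition text on stdin and
# report attempted vs. failed remote write samples, summed over all queues.
# Hypothetical helper; metric names assume Prometheus 2.23 or later.
remote_write_failure_ratio() {
  awk '
    /^prometheus_remote_storage_samples_total[{ ]/        { attempted += $NF }
    /^prometheus_remote_storage_samples_failed_total[{ ]/ { failed    += $NF }
    END {
      if (attempted == 0) { print "no samples attempted"; exit 1 }
      printf "attempted=%.0f failed=%.0f ratio=%.4f\n", attempted, failed, failed / attempted
    }'
}

# Usage against a live server (not run here):
#   curl -s http://localhost:9090/metrics | remote_write_failure_ratio
```

A ratio that stays at zero while the attempted count keeps rising suggests the remote endpoint is accepting writes.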

Common failures

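  • Remote storage is unreachable: check network connectivity from the Prometheus host to the remote endpoint. Use curl -v http://victoriametrics:8428/api/v1/write; any HTTP response, even an error status, confirms the endpoint is reachable.
  • Remote write falls behind: the remote endpoint is accepting data too slowly, so pending samples and memory use grow, and data older than the WAL's roughly two-hour window can be lost. Increase max_shards and capacity, or reduce the number of metrics being sent via write_relabel_configs.
  • Data gaps in long-term storage: samples that are still queued when Prometheus restarts, or that age out of the WAL during a long remote outage, may never be sent. Keep restarts brief and alert on remote write failures. Separately, leave send_exemplars at false (the default) if the remote endpoint does not support exemplars.
  • Filter excludes important metrics: promtool check config prometheus.yml validates syntax only and does not show which metrics a relabel rule keeps. After deploying, query the remote storage for one metric from each prefix to verify the intended metrics arrive.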
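
Before raising max_shards and capacity for a slow endpoint, it helps to estimate the memory cost. The Prometheus remote write tuning notes describe queue memory as proportional to shards × (capacity + max_samples_per_send). The sketch below computes that worst-case buffered sample count. The helper name is hypothetical, and the result counts samples, not bytes.

```shell
#!/usr/bin/env bash
# queue_buffer_samples: worst-case samples held in remote write queues,
#   max_shards * (capacity + max_samples_per_send)
# Hypothetical helper for back-of-envelope sizing only.
queue_buffer_samples() {
  local max_shards="$1" capacity="$2" max_samples_per_send="$3"
  echo $(( max_shards * (capacity + max_samples_per_send) ))
}

# Values from the queue_config in this guide.
queue_buffer_samples 20 10000 5000   # prints 300000

# Doubling max_shards doubles the worst case.
queue_buffer_samples 40 10000 5000   # prints 600000
```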
