16. Cost Analysis
Understanding cluster costs enables capacity planning, charge-back reporting, and identification of waste from idle resources.
Hardware Cost Calculation
Calculate amortized hardware cost per node:
def amortized_cost(capital_cost, years=4, utilization=0.7):
"""
Calculate effective cost per hour accounting for amortization
and utilization efficiency.
"""
annual_hours = 365 * 24
total_hours = annual_hours * years
hourly_capital = capital_cost / total_hours
effective_hourly = hourly_capital / utilization
return {
'capital_cost_per_hour': hourly_capital,
'effective_cost_per_hour': effective_hourly,
'annual_cost': effective_hourly * annual_hours
}
# Example: RTX 4090 workstation (~$4000) over 4 years at 70% utilization
costs = amortized_cost(4000)
print(f"Effective hourly cost: ${costs['effective_cost_per_hour']:.2f}")
Electricity costs add to the calculation:
def power_cost(watts, rate_per_kwh=0.12, hours_per_month=730):
"""Calculate monthly electricity cost"""
kw = watts / 1000
return kw * rate_per_kwh * hours_per_month
# Single RTX 4090 at 450W average
print(f"Monthly electricity: ${power_cost(450):.2f}")
GPU Utilization Efficiency
GPU utilization directly impacts cost-effectiveness. An idle GPU costs as much to power as a utilized one:
def cost_per_token(gpu_cost_per_hour, avg_tokens_per_second, utilization_percent):
"""Calculate cost per generated token"""
tokens_per_hour = avg_tokens_per_second * 3600 * (utilization_percent / 100)
return gpu_cost_per_hour / tokens_per_hour if tokens_per_hour > 0 else float('inf')
Batch processing dramatically improves tokens-per-dollar by overlapping GPU computation across multiple requests.
Waste Identification
Common waste sources include:
| Waste Source | Indicator | Typical Savings |
|---|---|---|
| Idle nodes | $node_memory_pressure_bytes = 0 for extended periods | 40-60% |
| Oversized pods | memory limit >> actual usage | 20-30% |
| Model duplication | Same model on multiple nodes | 50%+ storage |
| Test/development clusters | Running 24/7 | 60% reduction possible |
Query Prometheus to identify waste:
# Average GPU utilization across all nodes (last 7 days)
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d])) by ( exported_pod)
# Memory requests vs actual usage
avg(container_memory_working_set_bytes) by (pod)
/ avg(kube_pod_container_resource_requests{resource="memory"}) by (pod)
Charge-Back Reporting
Assign costs to teams or projects using Kubernetes labels:
# Query cost by team
kubectl get pods -l team=research -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | sort | uniq -c
Detailed charge-back requires integrating resource metrics with billing rates per team.
Instrument a running inference deployment with resource monitoring, calculate the effective cost-per-request from GPU utilization data, and propose three specific changes that would reduce costs.