16. Cost Analysis

Chapter 16 of 18 · 20 min

Understanding cluster costs enables capacity planning, charge-back reporting, and identification of waste from idle resources.

Hardware Cost Calculation

Calculate amortized hardware cost per node:

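def amortized_cost(capital_cost, years=4, utilization=0.7):
    """
    Calculate hourly hardware cost from straight-line amortization,
    plus the effective cost per utilized hour.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    annual_hours = 365 * 24
    total_hours = annual_hours * years

    # Cost per wall-clock hour over the amortization period
    hourly_capital = capital_cost / total_hours
    # Cost per hour of useful work: busy hours also pay for idle ones
    effective_hourly = hourly_capital / utilization

    return {
        'capital_cost_per_hour': hourly_capital,
        'effective_cost_per_hour': effective_hourly,
        # Annual spend does not depend on utilization
        'annual_cost': hourly_capital * annual_hours
    }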

# Example: RTX 4090 workstation (~$4000) over 4 years at 70% utilization
costs = amortized_cost(4000)
print(f"Effective hourly cost: ${costs['effective_cost_per_hour']:.2f}")

Electricity costs add to the calculation:

def power_cost(watts, rate_per_kwh=0.12, hours_per_month=730):
    """Calculate monthly electricity cost"""
    kw = watts / 1000
    return kw * rate_per_kwh * hours_per_month

# Single RTX 4090 at 450W average
print(f"Monthly electricity: ${power_cost(450):.2f}")
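
The two figures can be combined into a single cost per utilized hour. The following is a minimal sketch, not a definitive model: the function name total_hourly_cost is hypothetical, and it assumes the node draws its average wattage around the clock, which overstates power cost for hardware that idles at lower draw.

```python
def total_hourly_cost(capital_cost, watts, years=4, utilization=0.7,
                      rate_per_kwh=0.12):
    """Amortized capital plus electricity, per hour of useful work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    wall_clock_hours = 365 * 24 * years
    capital_per_hour = capital_cost / wall_clock_hours
    power_per_hour = (watts / 1000) * rate_per_kwh

    # Assumption: both costs accrue every wall-clock hour, so they are
    # spread over the fraction of hours that do useful work.
    return (capital_per_hour + power_per_hour) / utilization


# Same example as above: ~$4000 workstation, 450W average draw
cost = total_hourly_cost(4000, 450)
print(f"Total effective hourly cost: ${cost:.2f}")
```

Cooling, networking, rack space, and staff time are left out here; add them as further per-hour terms if they matter for your cluster.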

GPU Utilization Efficiency

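GPU utilization directly impacts cost-effectiveness. An idle GPU draws less power than a busy one, but its amortized capital cost accrues at the same rate, so every idle hour raises the cost of the work that does get done: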

def cost_per_token(gpu_cost_per_hour, avg_tokens_per_second, utilization_percent):
    """Calculate cost per generated token"""
    tokens_per_hour = avg_tokens_per_second * 3600 * (utilization_percent / 100)
    return gpu_cost_per_hour / tokens_per_hour if tokens_per_hour > 0 else float('inf')

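Batching improves tokens-per-dollar because one forward pass serves several requests at once: the model weights are read from GPU memory once per step and shared across the whole batch, so throughput rises while the hourly cost stays fixed.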
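
To see how throughput feeds into cost, the sketch below expresses the same calculation per million tokens and compares two scenarios. The throughput numbers and the $0.24 hourly rate are hypothetical placeholders, not benchmarks; substitute values measured on your own deployment.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second,
                            utilization_percent=100):
    """Cost of generating one million tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600 * (utilization_percent / 100)
    if tokens_per_hour <= 0:
        return float("inf")
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical aggregate throughput (tokens/second) at two batch sizes
scenarios = {"batch=1": 40, "batch=8": 250}

for name, tokens_per_second in scenarios.items():
    cost = cost_per_million_tokens(0.24, tokens_per_second, 70)
    print(f"{name}: ${cost:.2f} per million tokens")
```

Because the hourly cost is fixed, cost per token falls in direct proportion to any throughput gain.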

Waste Identification

Common waste sources include:

| Waste Source | Indicator | Typical Savings |
|---|---|---|
| Idle nodes | $node_memory_pressure_bytes = 0 for extended periods | 40-60% |
| Oversized pods | memory limit >> actual usage | 20-30% |
| Model duplication | Same model on multiple nodes | 50%+ storage |
| Test/development clusters | Running 24/7 | 60% reduction possible |

Query Prometheus to identify waste:

# Average GPU utilization per pod (last 7 days)
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d])) by (exported_pod)

# Ratio of actual memory usage to memory requests, per pod
# (values well below 1 suggest oversized requests)
sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
/ sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace, pod)
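
Once the usage-to-request ratio has been exported from Prometheus, flagging candidates for right-sizing is a simple filter. This is a minimal sketch: the function find_oversized_pods, the 0.5 threshold, and the sample pod names and figures are all hypothetical, chosen only for illustration.

```python
def find_oversized_pods(samples, threshold=0.5):
    """Return pods whose memory usage is below `threshold` of their request.

    samples maps pod name -> (working_set_bytes, requested_bytes).
    """
    oversized = {}
    for pod, (used, requested) in samples.items():
        if requested <= 0:
            continue  # no request set, so the ratio is undefined
        ratio = used / requested
        if ratio < threshold:
            oversized[pod] = ratio
    return oversized


GIB = 1024 ** 3
# Hypothetical sample data standing in for query results
samples = {
    "llm-a": (6 * GIB, 32 * GIB),
    "llm-b": (28 * GIB, 32 * GIB),
    "batch-job": (1 * GIB, 0),
}

for pod, ratio in find_oversized_pods(samples).items():
    print(f"{pod}: using {ratio:.0%} of requested memory")
```

Check peak usage as well as the average before lowering a request, since a pod that spikes above its limit will be OOM-killed.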

Charge-Back Reporting

Assign costs to teams or projects using Kubernetes labels:

# Count pods per node for one team (a starting point for cost attribution)
kubectl get pods -l team=research -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | sort | uniq -c

Detailed charge-back requires integrating resource metrics with billing rates per team.
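
One simple way to do this is to split the total monthly cluster cost in proportion to each team's GPU-hours, so that idle capacity is shared rather than left unbilled. The sketch below assumes that policy; the function allocate_cluster_cost and the team figures are hypothetical, and other policies such as a flat rate per GPU-hour are equally valid.

```python
def allocate_cluster_cost(total_monthly_cost, gpu_hours_by_team):
    """Split a monthly cost across teams in proportion to GPU-hours used."""
    total_hours = sum(gpu_hours_by_team.values())
    if total_hours <= 0:
        raise ValueError("no GPU-hours recorded; nothing to allocate")
    return {
        team: total_monthly_cost * hours / total_hours
        for team, hours in gpu_hours_by_team.items()
    }


# Hypothetical usage figures, e.g. aggregated from GPU metrics by team label
usage = {"research": 300.0, "product": 100.0}
bill = allocate_cluster_cost(1000.0, usage)

for team, amount in bill.items():
    print(f"{team}: ${amount:.2f}")
```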

EXERCISE

Instrument a running inference deployment with resource monitoring, calculate the effective cost-per-request from GPU utilization data, and propose three specific changes that would reduce costs.