21. Cost Optimization

Chapter 21 of 24 · 20 min

KEY INSIGHT

Cost optimization without measurement produces either minimal savings or service degradation; instrumentation must precede any infrastructure changes that affect capacity or performance.

### GPU Utilization Analysis

```python
# scripts/cost_analysis.py


def calculate_gpu_power_cost(utilization_data: list) -> dict:
    """Estimate GPU energy cost from utilization samples.

    NVIDIA A10G: 150 W TDP, about $0.015/hour at full load at $0.10/kWh.
    This covers energy only, not instance rental, and assumes power draw
    scales linearly with utilization (real GPUs also draw idle power).
    """
    GPU_POWER_WATTS = 150
    ENERGY_COST_PER_KWH = 0.10

    total_watt_hours = 0.0
    for period in utilization_data:
        utilization_pct = period['gpu_utilization']
        duration_hours = period['duration_seconds'] / 3600
        power_draw = GPU_POWER_WATTS * utilization_pct / 100
        # Accumulate into the running total (the name must match the
        # variable initialized above).
        total_watt_hours += power_draw * duration_hours

    kwh = total_watt_hours / 1000
    cost = kwh * ENERGY_COST_PER_KWH

    return {
        'total_kwh': round(kwh, 2),
        'total_cost_dollars': round(cost, 4),
        'utilization_samples': len(utilization_data)
    }


def recommend_instance_rightsizing(current_metrics: dict) -> dict:
    """Map average GPU utilization to a rightsizing recommendation."""
    avg_gpu_util = current_metrics['avg_gpu_utilization']

    if avg_gpu_util < 30:
        recommendation = "Downgrade to smaller GPU or batch requests"
        potential_savings = 0.40  # 40% cost reduction
    elif avg_gpu_util < 50:
        recommendation = "Consolidate workloads to improve utilization"
        potential_savings = 0.20
    elif avg_gpu_util < 70:
        recommendation = "Current utilization acceptable"
        potential_savings = 0.0
    else:
        recommendation = "Consider additional capacity"
        potential_savings = -0.20  # Cost increase

    return {
        'recommendation': recommendation,
        'potential_savings_pct': potential_savings,
        'avg_utilization': avg_gpu_util
    }
```

### Spot Instance Strategy

```yaml
# deployment-spot.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server-batch
  namespace: production
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value below is an assumed name.
  selector:
    matchLabels:
      app: inference-server-batch
  template:
    metadata:
      labels:
        app: inference-server-batch
    spec:
      priorityClassName: spot-instance
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
        - key: "node.kubernetes.io/lifecycle"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      containers:
        - name: inference
          image: registry.internal/inference-server:v2.1.0
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "4"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spot-instance
value: -1000
globalDefault: false
description: "Spot instances with interruption risk"
```

### Batch Inference Scheduling

```bash
# Note: `kubectl create queue` and `kubectl submit job` are not built-in
# kubectl subcommands. These commands assume a batch-scheduler plugin that
# provides them; adapt them to your scheduler's CLI or manifests.

# Create batch inference queue with lower priority
kubectl create queue inference-batch \
  --min-allocatable-resources="nvidia.com/gpu=0" \
  --max-resources="nvidia.com/gpu=4" \
  --priority=10

# Submit batch job
kubectl submit job batch-inference \
  --queue=inference-batch \
  --image=registry.internal/inference-server:v2.1.0 \
  --batch-size=32 \
  --input=s3://data/inference-requests/
```

Inference serving costs scale with GPU utilization, memory consumption, and infrastructure redundancy. Optimizing them means trading cost reduction against SLO compliance, and those trade-offs should be decided from measurements rather than estimates.
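One way to make the cost-versus-SLO trade-off concrete is to filter candidate configurations by a latency SLO and then pick the one with the lowest cost per request. The sketch below is illustrative only: the `Candidate` type, the `cheapest_within_slo` helper, and all figures are assumptions for the example, not measured values or part of any library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A measured deployment option (hypothetical structure)."""
    name: str
    hourly_cost_usd: float
    p99_latency_ms: float
    requests_per_hour: float


def cheapest_within_slo(candidates: List[Candidate],
                        slo_p99_ms: float) -> Optional[Candidate]:
    """Return the SLO-compliant candidate with the lowest cost per request.

    Returns None when no candidate meets the latency SLO.
    """
    eligible = [
        c for c in candidates
        if c.p99_latency_ms <= slo_p99_ms and c.requests_per_hour > 0
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.hourly_cost_usd / c.requests_per_hour)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    options = [
        Candidate("small-gpu", 0.50, 180.0, 1000),
        Candidate("large-gpu", 1.00, 60.0, 4000),
        Candidate("xl-gpu", 3.00, 40.0, 6000),
    ]
    choice = cheapest_within_slo(options, slo_p99_ms=100.0)
    print(choice.name if choice else "no candidate meets the SLO")
```

Filtering on the SLO first and minimizing cost second keeps cost reduction from silently degrading service; the inputs should come from load tests or production metrics.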

EXERCISE

Instrument cost tracking for an inference deployment. Calculate the cost per inference request based on GPU utilization data. Implement batch inference scheduling for offline workloads and compare the cost per request against real-time processing costs.
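As a possible starting point for the comparison, the sketch below computes cost per request from an hourly price, a run duration, and a request count, then reports the relative saving of batch over real-time serving. The function names, parameter names, and sample figures are hypothetical; substitute measurements from your own deployment.

```python
def cost_per_request(hourly_cost_usd: float,
                     duration_hours: float,
                     requests_served: int) -> float:
    """Total spend over the period divided by requests served."""
    if requests_served <= 0:
        raise ValueError("requests_served must be positive")
    return hourly_cost_usd * duration_hours / requests_served


def compare_modes(realtime: dict, batch: dict) -> dict:
    """Compare real-time and batch cost per request.

    Each argument is a dict with the keyword arguments of cost_per_request.
    """
    rt = cost_per_request(**realtime)
    bt = cost_per_request(**batch)
    return {
        "realtime_usd_per_request": rt,
        "batch_usd_per_request": bt,
        # Positive means batch is cheaper per request.
        "batch_savings_pct": round((1 - bt / rt) * 100, 1),
    }


if __name__ == "__main__":
    # Illustrative inputs only: an always-on real-time replica versus a
    # short batch run on a cheaper (for example, spot) instance.
    result = compare_modes(
        realtime={"hourly_cost_usd": 1.00, "duration_hours": 24,
                  "requests_served": 240_000},
        batch={"hourly_cost_usd": 0.40, "duration_hours": 2,
               "requests_served": 16_000},
    )
    print(result)
```

Real-time serving pays for idle capacity across the whole period, so its request count should cover the same window as its billed hours; otherwise the comparison will understate its cost.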