05. Cost-Aware Selection
Cost modeling transforms abstract monetary concerns into explicit routing parameters. The hybrid system quantifies per-request expenses and incorporates these values into routing decisions. Operators establish cost budgets, and the system allocates requests across backends to maximize value within financial constraints.
Token-based cost accounting applies to cloud inference where providers charge per token processed. Input and output tokens carry distinct pricing in most provider models. Prompt caching introduces additional accounting complexity. Operators track consumption against account budgets and throttle routing toward expensive backendswhen allocations approach exhaustion.
Hardware depreciation schedules allocate upfront infrastructure investment across expected operational lifetime. Cost per token for local inference combines power consumption, hardware depreciation, and maintenance overhead. This total cost of ownership enables apples-to-apples comparison against cloud pricing. Amortized costs reveal the true economics driving local versus cloud decisions.
python
import httpx
from dataclasses import dataclass
from typing import Optional
@dataclass
class CostEstimate:
"""Represents cost breakdown for a single inference request."""
input_tokens: int
output_tokens: int
provider: str
input_cost_per_1k: float
output_cost_per_1k: float
def total_cost(self) -> float:
"""Calculate total estimated cost in dollars."""
input_cost = (self.input_tokens / 1000) * self.input_cost_per_1k
output_cost = (self.output_tokens / 1000) * self.output_cost_per_1k
return input_cost + output_cost
def estimate_request_cost(
model: str,
input_text: str,
max_output_tokens: int,
provider_rates: dict
) -> CostEstimate:
"""Estimate cost for a hypothetical request."""
model_config = provider_rates.get(model, provider_rates["default"])
# Approximate token counts (simplified for demonstration)
input_tokens = len(input_text) // 4 # Rough approximation
output_tokens = min(max_output_tokens, 500)
return CostEstimate(
input_tokens=input_tokens,
output_tokens=output_tokens,
provider=model_config["provider"],
input_cost_per_1k=model_config["input_rate"],
output_cost_per_1k=model_config["output_rate"]
)
# Example provider rate configuration
PROVIDER_RATES = {
"gpt-4-turbo": {
"provider": "openai",
"input_rate": 0.01,
"output_rate": 0.03
},
"claude-3-opus": {
"provider": "anthropic",
"input_rate": 0.015,
"output_rate": 0.075
},
"llama-3-70b": {
"provider": "local",
"input_rate": 0.0001,
"output_rate": 0.0001 # Amortized hardware cost
}
}
Dynamic cost optimization adjusts routing based on real-time budget status. Daily spend caps trigger increased local routing. Monthly projections inform policy relaxation or tightening. Anomaly detection identifies unexpected consumption spikes warranting investigation.
Cost categorization enables departmental or customer-specific billing. Different organizational units maintain separate budgets. Usage attribution allocates charges to responsible parties. Premium tiers access more capable models at correspondingly higher per-request costs.
Margin-based routing considers revenue alongside cost. Requests associated with high-value customers warrant premium service. Conversions tied to inference quality justify expensive backends. This commercial perspective complements pure cost minimization.
Build a cost tracking dashboard that displays per-model spend, per-customer allocation, and real-time routing distribution against model costs. Identify opportunities for cost reduction through routing adjustment.