16. Cost-Per-Token Optimization
Chapter 16 of 18 · 20 min
Every token has a cost. Optimizing token usage reduces expenses without degrading output quality, and tighter prompts often improve it.
Token Cost Breakdown
| Model | Input Cost/1M tokens | Output Cost/1M tokens | Notes |
|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | Higher quality, higher cost |
| GPT-3.5-turbo | $0.50 | $1.50 | Lower cost, acceptable quality |
| Llama 3 70B (Ollama) | ~$0.00 | ~$0.00 | No per-token fee; self-hosting costs (GPU + electricity) still apply |
| Mistral 7B (Ollama) | ~$0.00 | ~$0.00 | No per-token fee; lower resource requirements |
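To make the table concrete, the cost of one request is the input and output token counts, each multiplied by its per-million price. The following is a minimal sketch that assumes the prices listed above (which may be out of date); `request_cost` and `PRICES_PER_MILLION` are illustrative names, not part of any library.

```python
# Prices in USD per 1M tokens, copied from the table above (verify current pricing).
PRICES_PER_MILLION = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
}


def request_cost(model, input_tokens, output_tokens):
    """Return the estimated USD cost of a single request."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Example: 1,000 input tokens and 500 output tokens per request
for model in PRICES_PER_MILLION:
    print(model, round(request_cost(model, 1_000, 500), 6))
```

At the prices in the table, an output token costs three times as much as an input token, so capping response length can matter as much as trimming the prompt.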
Prompt Compression Techniques
Remove redundancy while preserving meaning:
# Before: 287 tokens
"""
You are an expert data analyst working for a Fortune 500 company.
Your role is to analyze datasets and provide insights. You have
access to tools that can help you process data. When given a
dataset, first explore its structure, then identify key patterns,
and finally present your findings in a clear format.
Dataset: {dataset}
"""
# After: 89 tokens (69% reduction)
"""
Analyze this dataset and report key patterns: {dataset}
"""
# Preserved: Task definition, input/output specification
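The reduction figure can be checked with a small helper. This sketch is illustrative only: `approx_token_count` is a hypothetical stand-in that counts whitespace-separated words, which is not how a real tokenizer counts, so pass in a real tokenizer's counting function for actual measurements.

```python
def percent_reduction(before_tokens, after_tokens):
    """Percentage of tokens removed going from the original to the compressed prompt."""
    if before_tokens <= 0:
        raise ValueError("before_tokens must be positive")
    return 100 * (before_tokens - after_tokens) / before_tokens


def approx_token_count(text):
    """Crude stand-in for a tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def compare_prompts(before, after, count_tokens=approx_token_count):
    """Return (before_count, after_count, percent_reduction) for two prompts."""
    before_count = count_tokens(before)
    after_count = count_tokens(after)
    return before_count, after_count, percent_reduction(before_count, after_count)


# The counts quoted above: 287 tokens reduced to 89 tokens
print(round(percent_reduction(287, 89)))  # 69
```

Keeping the counter pluggable means the same comparison works with whichever tokenizer matches the target model.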
Structure Optimization
Reduce tokens by consolidating redundant messages:
# Verbose structure
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "system", "content": "Answer questions accurately."},
    {"role": "system", "content": "Be concise in your responses."},
    {"role": "user", "content": "What is Python?"}
]

# Optimized structure
messages = [
    {"role": "system", "content": "Helpful assistant. Answer accurately, be concise."},
    {"role": "user", "content": "What is Python?"}
]
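The merging step can also be automated. The sketch below uses a hypothetical helper, `merge_system_messages`, that collapses every system message into a single one; it only joins the text and does not tighten the wording, so the manual rewrite above would still save more.

```python
def merge_system_messages(messages, separator=" "):
    """Collapse all system messages into one, placed first; keep the rest in order."""
    system_parts = [m["content"].strip() for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return others
    merged = {"role": "system", "content": separator.join(system_parts)}
    return [merged] + others


verbose = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "system", "content": "Answer questions accurately."},
    {"role": "system", "content": "Be concise in your responses."},
    {"role": "user", "content": "What is Python?"},
]
for message in merge_system_messages(verbose):
    print(message)
```

Chat APIs typically add a few tokens of formatting overhead per message, so fewer messages means slightly fewer tokens even before any rewording.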
Dynamic Few-Shot Selection
Only include relevant examples:
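def select_few_shot_examples(query, example_bank, max_examples=2):
    """Choose the examples most similar to the query to minimize token use."""
    # embed() and cosine_similarity() are assumed to be provided elsewhere
    # (for example, by an embedding model client and a vector math library).
    query_embedding = embed(query)
    scored_examples = []
    for example in example_bank:
        example_embedding = embed(example["input"])
        similarity = cosine_similarity(query_embedding, example_embedding)
        scored_examples.append((similarity, example))
    # Select the top-k most similar, not all. Sort on the score only:
    # sorting whole tuples would compare the example dicts on ties and raise TypeError.
    scored_examples.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored_examples[:max_examples]]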
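The function above relies on `embed` and `cosine_similarity`, which the chapter does not define. To try the idea end to end, here is a self-contained sketch that substitutes a toy bag-of-words "embedding" for a real embedding model. The toy `embed`, the example bank, and the query are all made up for illustration; a real system would likely use learned embeddings instead.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): bag-of-words counts, not a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_few_shot_examples(query, example_bank, max_examples=2):
    """Return the max_examples examples whose input is most similar to the query."""
    query_embedding = embed(query)
    ranked = sorted(
        example_bank,
        key=lambda ex: cosine_similarity(query_embedding, embed(ex["input"])),
        reverse=True,
    )
    return ranked[:max_examples]


# Made-up example bank for illustration
example_bank = [
    {"input": "convert celsius to fahrenheit", "output": "F = C * 9/5 + 32"},
    {"input": "sort a list in python", "output": "sorted(items)"},
    {"input": "reverse a list in python", "output": "items[::-1]"},
]
chosen = select_few_shot_examples("how do I sort a python list", example_bank)
print([ex["input"] for ex in chosen])
```

In this sketch the example embeddings are recomputed on every call; caching them once per example bank would avoid repeated work when an embedding call is slow or billed.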
Cost Monitoring
Track cost per request to identify optimization opportunities:
import tiktoken

def estimate_cost(prompt, model="gpt-4o"):
    """Estimate the input-side cost of a prompt in USD (output tokens are not included)."""
    # Note: recognizing "gpt-4o" requires a recent tiktoken release.
    encoder = tiktoken.encoding_for_model(model)
    input_tokens = len(encoder.encode(prompt))
    # Rough cost estimates in USD per token, matching the table above (verify current pricing)
    pricing = {
        "gpt-4o": {"input": 0.000005, "output": 0.000015},
        "gpt-3.5-turbo": {"input": 0.0000005, "output": 0.0000015}
    }
    return input_tokens * pricing[model]["input"]
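Estimating a single prompt covers only the input side. To see where money actually goes, it helps to accumulate input and output costs per task over many requests. The following is a minimal sketch: `CostTracker` and the task label are hypothetical, the prices are the ones from the table in this chapter, and the token counts are passed in by the caller (in practice they would come from whatever usage reporting the provider offers).

```python
from collections import defaultdict

# USD per token, derived from the table in this chapter (assumed; verify current pricing)
PRICING = {
    "gpt-4o": {"input": 5.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "gpt-3.5-turbo": {"input": 0.50 / 1_000_000, "output": 1.50 / 1_000_000},
}


class CostTracker:
    """Accumulates estimated cost and call counts per task label."""

    def __init__(self, pricing=PRICING):
        self.pricing = pricing
        self.totals = defaultdict(lambda: {"calls": 0, "cost": 0.0})

    def record(self, task, model, input_tokens, output_tokens):
        """Add one request's cost to the running total for the task and return it."""
        price = self.pricing[model]
        cost = input_tokens * price["input"] + output_tokens * price["output"]
        self.totals[task]["calls"] += 1
        self.totals[task]["cost"] += cost
        return cost

    def cost_per_call(self, task):
        """Average estimated cost per request for a task (0.0 if none recorded)."""
        entry = self.totals[task]
        return entry["cost"] / entry["calls"] if entry["calls"] else 0.0


tracker = CostTracker()
tracker.record("summarize", "gpt-4o", input_tokens=1_200, output_tokens=300)
tracker.record("summarize", "gpt-4o", input_tokens=800, output_tokens=200)
print(round(tracker.cost_per_call("summarize"), 6))
```

Grouping by task label makes it easier to spot which prompts are worth compressing first.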
EXERCISE
Take a 500-token prompt and reduce it to under 300 tokens while preserving functionality. Measure output quality before and after using a standardized evaluation. Document what can be safely removed.