RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced Prompt Engineering

16. Cost-Per-Token Optimization

Chapter 16 of 18 · 20 min
KEY INSIGHT

Token optimization is bounded by minimum information content—removing tokens beyond that threshold degrades output quality faster than it reduces cost.

Every token has a cost. Removing tokens that carry no information lowers spend without degrading output quality, and a more concise prompt often improves it.

Token Cost Breakdown

Model | Input cost / 1M tokens | Output cost / 1M tokens | Notes
GPT-4o | $5.00 | $15.00 | Higher quality, higher cost
GPT-3.5-turbo | $0.50 | $1.50 | Lower cost, acceptable quality
Llama 3 70B (Ollama) | ~$0.00 | ~$0.00 | Self-hosted costs (GPU + electricity)
Mistral 7B (Ollama) | ~$0.00 | ~$0.00 | Lower resource requirements
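
To make the table concrete: the cost of one request is input tokens times the input rate plus output tokens times the output rate. The sketch below assumes the list prices in the table above (verify current pricing). The function name `request_cost` and the sample token counts are illustrative and not part of the original chapter.

```python
# USD per 1M tokens, copied from the table above (verify current pricing).
PRICE_PER_MILLION = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
}

def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one request at the table's per-1M-token rates."""
    price = PRICE_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical request: 1,000 input tokens and 500 output tokens.
print(f"{request_cost('gpt-4o', 1000, 500):.4f}")          # 0.0125
print(f"{request_cost('gpt-3.5-turbo', 1000, 500):.5f}")   # 0.00125
```

For both hosted models in the table, an output token costs three times as much as an input token, so capping response length can save more than trimming the prompt.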

Prompt Compression Techniques

Remove redundancy while preserving meaning:

# Before: 287 tokens
# (counts are illustrative; measure with your model's tokenizer)
verbose_prompt = """
You are an expert data analyst working for a Fortune 500 company.
Your role is to analyze datasets and provide insights. You have
access to tools that can help you process data. When given a
dataset, first explore its structure, then identify key patterns,
and finally present your findings in a clear format.

Dataset: {dataset}
"""

# After: 89 tokens (69% reduction)
compressed_prompt = """
Analyze this dataset and report key patterns: {dataset}
"""

# Preserved: task definition and the {dataset} input slot
# Removed: persona, tool mention, step list, formatting request.
# Restore any of these if your evaluation shows the output depends on them.
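
The saving from a compression like this scales with request volume. A rough sketch of the arithmetic, using the token counts quoted in the example above and the GPT-4o input rate from the table; the helper name `prompt_savings` and the volume of 100,000 requests are assumptions for illustration:

```python
def prompt_savings(tokens_before, tokens_after, input_price_per_million, requests):
    """Return (fraction of tokens removed, USD saved on input across `requests` calls)."""
    removed = tokens_before - tokens_after
    fraction = removed / tokens_before
    usd_saved = removed * requests * input_price_per_million / 1_000_000
    return fraction, usd_saved

# 287 -> 89 tokens, $5.00 per 1M input tokens, hypothetical 100,000 requests.
fraction, usd_saved = prompt_savings(287, 89, 5.00, requests=100_000)
print(f"{fraction:.0%} fewer input tokens, ${usd_saved:.2f} saved per 100,000 requests")
# 69% fewer input tokens, $99.00 saved per 100,000 requests
```

The figure covers input tokens only. If the shorter prompt changes how long the responses are, output cost moves too and needs to be measured separately.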

Structure Optimization

Reduce tokens by consolidating messages and tightening their wording:

# Verbose structure
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "system", "content": "Answer questions accurately."},
    {"role": "system", "content": "Be concise in your responses."},
    {"role": "user", "content": "What is Python?"}
]

# Optimized structure
messages = [
    {"role": "system", "content": "Helpful assistant. Answer accurately, be concise."},
    {"role": "user", "content": "What is Python?"}
]
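
The message-merging half of this optimization can be automated; tightening the wording still has to be done by hand. A minimal sketch, assuming all system messages are meant to apply from the start of the conversation (the helper name `merge_system_messages` is hypothetical):

```python
def merge_system_messages(messages):
    """Combine all system messages into one, placed first; keep other messages in order."""
    system_parts = [m["content"].strip() for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return others
    return [{"role": "system", "content": " ".join(system_parts)}] + others

verbose = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "system", "content": "Answer questions accurately."},
    {"role": "system", "content": "Be concise in your responses."},
    {"role": "user", "content": "What is Python?"},
]

merged = merge_system_messages(verbose)
print(len(verbose), "->", len(merged))   # 4 -> 2
print(merged[0]["content"])
# You are a helpful assistant. Answer questions accurately. Be concise in your responses.
```

If a system message is deliberately injected mid-conversation, moving it to the front changes its meaning, so this helper is only appropriate when that is not the case.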

Dynamic Few-Shot Selection

Only include relevant examples:

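# embed(text) and cosine_similarity(a, b) are assumed helpers supplied by
# your embedding model and vector library; they are not defined here.
def select_few_shot_examples(query, example_bank, max_examples=2):
    """Choose examples similar to the query to minimize token use."""
    query_embedding = embed(query)

    scored_examples = []
    for example in example_bank:
        example_embedding = embed(example["input"])
        similarity = cosine_similarity(query_embedding, example_embedding)
        scored_examples.append((similarity, example))

    # Sort on the score alone: on a tie, comparing the example dicts raises TypeError
    scored_examples.sort(key=lambda pair: pair[0], reverse=True)
    # Select top-k most similar, not all
    return [ex for _, ex in scored_examples[:max_examples]]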
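
To try the selection logic without an embedding model, the sketch below swaps in a toy word-count embedding. `toy_embed`, the sparse `cosine_similarity`, and the example bank are all stand-ins invented for illustration; a real embedding model would capture meaning rather than word overlap.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: lowercase word counts. Swap in a real embedding model."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def select_few_shot_examples(query, example_bank, max_examples=2):
    """Return the max_examples entries whose input is most similar to the query."""
    query_embedding = toy_embed(query)
    scored = [
        (cosine_similarity(query_embedding, toy_embed(ex["input"])), ex)
        for ex in example_bank
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:max_examples]]

example_bank = [
    {"input": "convert celsius to fahrenheit", "output": "F = C * 9/5 + 32"},
    {"input": "sort a list in python", "output": "sorted(items)"},
    {"input": "reverse a list in python", "output": "items[::-1]"},
    {"input": "capital of france", "output": "Paris"},
]

chosen = select_few_shot_examples("how do I sort a python list", example_bank)
print([ex["input"] for ex in chosen])
# ['sort a list in python', 'reverse a list in python']
```

Only two of the four examples reach the prompt, so the few-shot section costs roughly half the tokens of sending the whole bank.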

Cost Monitoring

Track cost per output to identify optimization opportunities:

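import tiktoken

# USD per token, derived from the per-1M rates in the table above (verify current pricing)
PRICING = {
    "gpt-4o": {"input": 5.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "gpt-3.5-turbo": {"input": 0.50 / 1_000_000, "output": 1.50 / 1_000_000},
}

def estimate_cost(prompt, model="gpt-4o", expected_output_tokens=0):
    """Estimate the USD cost of one request.

    Counts the prompt text only; chat APIs add a few formatting tokens per message.
    """
    encoder = tiktoken.encoding_for_model(model)
    input_tokens = len(encoder.encode(prompt))
    price = PRICING[model]
    return input_tokens * price["input"] + expected_output_tokens * price["output"]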
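
Estimates from a tokenizer are useful before sending a request. For ongoing monitoring it is usually more accurate to record the token counts the API reports after each call. The sketch below assumes you pass those counts in yourself (for example, `usage.prompt_tokens` and `usage.completion_tokens` from an OpenAI chat completion, or `prompt_eval_count` and `eval_count` from Ollama). The `CostTracker` class and the label names are hypothetical, and the prices are the ones from the table earlier in this chapter.

```python
from collections import defaultdict

# USD per 1M tokens, from the table earlier in this chapter (verify current pricing).
PRICE_PER_MILLION = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
}

class CostTracker:
    """Accumulate spend per label (for example, per prompt version)."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"requests": 0, "usd": 0.0})

    def record(self, label, model, input_tokens, output_tokens):
        """Add one request's cost to the label's running total and return it."""
        price = PRICE_PER_MILLION[model]
        usd = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        entry = self.totals[label]
        entry["requests"] += 1
        entry["usd"] += usd
        return usd

    def cost_per_request(self, label):
        """Mean USD per request for a label, or 0.0 if nothing was recorded."""
        entry = self.totals[label]
        return entry["usd"] / entry["requests"] if entry["requests"] else 0.0

tracker = CostTracker()
# Hypothetical usage numbers for two prompt versions with equal output length.
tracker.record("verbose-prompt", "gpt-4o", input_tokens=287, output_tokens=200)
tracker.record("compressed-prompt", "gpt-4o", input_tokens=89, output_tokens=200)

for label in tracker.totals:
    print(label, f"{tracker.cost_per_request(label):.6f}")
```

Grouping by label makes it easy to see which prompt version or feature drives spend, which is where optimization effort pays off first.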
EXERCISE

Take a 500-token prompt and reduce it to under 300 tokens while preserving functionality. Measure output quality before and after using a standardized evaluation. Document what can be safely removed.
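
One way to set up the before/after measurement is a small harness that runs each prompt version over the same fixed test cases and reports pass rate alongside mean prompt size. This is a sketch under stated assumptions: `evaluate_prompt`, `fake_model`, and `whitespace_tokens` are hypothetical names, and the last two are offline stand-ins to be replaced with a real model call and a real tokenizer.

```python
def evaluate_prompt(prompt_template, test_cases, run_model, count_tokens):
    """Return (pass rate, mean prompt tokens) for one prompt over a fixed test set."""
    passed = 0
    total_tokens = 0
    for case in test_cases:
        prompt = prompt_template.format(**case["inputs"])
        total_tokens += count_tokens(prompt)
        if case["check"](run_model(prompt)):
            passed += 1
    return passed / len(test_cases), total_tokens / len(test_cases)

# Stand-ins so the sketch runs offline; replace both with real implementations.
def fake_model(prompt):
    return "4" if "2 + 2" in prompt else "unknown"

def whitespace_tokens(prompt):
    return len(prompt.split())

test_cases = [
    {"inputs": {"question": "2 + 2"}, "check": lambda out: out.strip() == "4"},
    {"inputs": {"question": "3 + 3"}, "check": lambda out: out.strip() == "6"},
]

long_prompt = "You are a careful math tutor. Please answer the following question: {question}"
short_prompt = "Answer: {question}"

for name, template in [("long", long_prompt), ("short", short_prompt)]:
    pass_rate, mean_tokens = evaluate_prompt(template, test_cases, fake_model, whitespace_tokens)
    print(f"{name}: pass rate {pass_rate:.0%}, mean prompt tokens {mean_tokens:.1f}")
```

A compression is safe to keep when the pass rate holds steady while the mean token count falls. If the pass rate drops, restore the most recently removed text and test again.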
