20. Model Comparison Project
This final chapter guides you through a complete model comparison workflow, applying everything learned in the course.
Project overview:
You will evaluate 3-5 models for a specified use case, produce a comparison report, and make a final recommendation. This mimics real-world model selection decisions.
Step 1: Define the use case
Write a clear problem statement:
use_case: "Legal document summarization for a small law firm"
requirements:
  - Can run on local hardware (budget: $2000 GPU)
  - Context length: Must handle contracts up to 50 pages
  - Language: English legal text
  - Output: 1-2 paragraph summaries
  - Latency: <30 seconds for typical document
  - Privacy: No data leaves local machine
constraints:
  budget: 2000       # USD
  max_latency: 30    # seconds
  privacy: local_only
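The "50 pages" requirement only becomes testable once it is converted into tokens. The sketch below shows one way to do that conversion. The words-per-page figure, the tokens-per-word ratio, and the summary allowance are all rough assumptions chosen for illustration; measure them on your own documents with the tokenizer of each candidate model before relying on the result.

```python
# Rough token budget for the longest expected contract.
# All three constants are illustrative assumptions, not measurements.
WORDS_PER_PAGE = 500     # assumed density of a typical contract page
TOKENS_PER_WORD = 1.3    # assumed average for English text
SUMMARY_TOKENS = 400     # assumed room for a 1-2 paragraph summary


def required_context(pages: int) -> int:
    """Estimate the context window needed to summarize a document of `pages` pages."""
    prompt_tokens = round(pages * WORDS_PER_PAGE * TOKENS_PER_WORD)
    return prompt_tokens + SUMMARY_TOKENS


def fits_context(pages: int, context_window: int) -> bool:
    """Return True if the estimated prompt plus summary fits in the window."""
    return required_context(pages) <= context_window


if __name__ == "__main__":
    print(required_context(50))
    print(fits_context(50, 128_000))
```

Under these assumptions a 50-page contract needs roughly 33K tokens, which is why the context window is treated as a hard requirement rather than a nice-to-have.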
Step 2: Define evaluation criteria
evaluation_criteria:
  accuracy:
    - Does summary capture key points?
    - Are legal terms preserved correctly?
    - Does it avoid hallucination?
  speed:
    - Time to process 10K token document
    - Throughput in tokens/second
  memory:
    - VRAM required
    - Does it fit in available hardware?
  cost:
    - Hardware cost
    - Electricity cost estimate
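The first accuracy question, whether a summary captures the key points, can be approximated automatically by comparing it with a reference summary. The following is a minimal sketch of a token-overlap F1 score, similar in spirit to ROUGE-1. The function names are hypothetical, and a metric this crude cannot detect hallucination or misuse of legal terms, so those two criteria still need human review.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(summary: str, reference: str) -> float:
    """Token-overlap F1 between a generated summary and a reference (0.0 to 1.0)."""
    predicted = Counter(tokenize(summary))
    expected = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((predicted & expected).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted.values())
    recall = overlap / sum(expected.values())
    return 2 * precision * recall / (precision + recall)
```

A function with this signature can be passed wherever the benchmark code in Step 4 expects a `scorer(summary, reference)` callable.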
Step 3: Select candidate models
Based on use case requirements, select 3-5 candidates:
candidates = [
    {
        "name": "Llama 3.1 8B Instruct",
        "params": "8B",
        "quantization": "Q4_K_M",
        "context": 128000,
        "estimated_vram": "5.2GB",
        "source": "huggingface"
    },
    {
        "name": "Mistral 7B Instruct v0.3",
        "params": "7B",
        "quantization": "Q4_K_M",
        "context": 32768,
        "estimated_vram": "4.5GB",
        "source": "huggingface"
    },
    # Add more candidates...
]
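Before benchmarking, it can save time to drop candidates that fail a hard constraint. The sketch below filters a candidate list on context length and estimated VRAM. The helper names and the thresholds in the usage example are hypothetical, and the VRAM check covers the weights only, so it is a necessary condition rather than a guarantee that long inputs will fit.

```python
def parse_gb(value: str) -> float:
    """Convert a string such as "5.2GB" into a float number of gigabytes."""
    return float(value.strip().upper().removesuffix("GB"))  # Python 3.9+


def filter_candidates(candidates: list[dict], min_context: int,
                      vram_budget_gb: float) -> list[dict]:
    """Keep only the candidates that satisfy both hard constraints."""
    kept = []
    for candidate in candidates:
        if candidate["context"] < min_context:
            continue  # cannot hold the longest document in one pass
        if parse_gb(candidate["estimated_vram"]) > vram_budget_gb:
            continue  # weights alone exceed the available VRAM
        kept.append(candidate)
    return kept


if __name__ == "__main__":
    sample = [
        {"name": "Llama 3.1 8B Instruct", "context": 128000, "estimated_vram": "5.2GB"},
        {"name": "Mistral 7B Instruct v0.3", "context": 32768, "estimated_vram": "4.5GB"},
    ]
    # 33_000 tokens and 12 GB are illustrative thresholds
    for candidate in filter_candidates(sample, min_context=33_000, vram_budget_gb=12.0):
        print(candidate["name"])
```

You may still want to benchmark a model that narrowly fails a constraint, since the resulting numbers make the final report's exclusion easier to justify.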
Step 4: Run benchmarks
# benchmark_suite.py
import statistics


def run_full_benchmark(models: list, test_cases: list, scorer) -> dict:
    """Benchmark every model on the same test cases; results are keyed by model name."""
    results = {}
    for model in models:
        # measure_latency and measure_peak_memory are helpers not shown here
        results[model.name] = {
            "latency": measure_latency(model, test_cases),
            "accuracy": measure_accuracy(model, test_cases, scorer),
            "memory": measure_peak_memory(model),
        }
    return results


def measure_accuracy(model, cases, scorer):
    """Score each generated summary against its reference with scorer(summary, reference)."""
    scores = []
    for case in cases:
        summary = model.generate(case["document"])  # assumes generate() returns the summary text
        scores.append(scorer(summary, case["reference_summary"]))
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),  # averages the two middle values for even-length lists
        "per_case": scores,
    }
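The benchmark code calls `measure_latency` without defining it. One possible implementation is sketched below: it times each call with `time.perf_counter` and reports the p50 and p95 values used in the Step 5 report. The return keys and the nearest-rank percentile helper are assumptions, and `model.generate` is taken to be a blocking call, as in the code above.

```python
import statistics
import time


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending list; pct is an integer from 1 to 100."""
    rank = max(1, -(-pct * len(sorted_values) // 100))  # integer ceiling division
    return sorted_values[rank - 1]


def measure_latency(model, test_cases: list) -> dict:
    """Time model.generate on each test case and summarize the results in seconds."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    timings = []
    for case in test_cases:
        start = time.perf_counter()
        model.generate(case["document"])
        timings.append(time.perf_counter() - start)
    ordered = sorted(timings)
    return {
        "p50": statistics.median(ordered),
        "p95": percentile(ordered, 95),
        "mean": statistics.fmean(ordered),
    }
```

The first call often includes model loading or cache warm-up, so consider running one untimed request before measuring. With only a handful of test cases the p95 value is effectively the slowest run.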
Step 5: Create comparison report
# Model Comparison Report: Legal Document Summarization
## Candidate Models
| Model | VRAM | Context | Latency (p50) | Latency (p95) | Accuracy |
|-------|------|---------|---------------|---------------|----------|
| Llama 3.1 8B | 5.2GB | 128K | 8s | 12s | 0.78 |
| Mistral 7B | 4.5GB | 32K | 7s | 10s | 0.74 |
| Phi-3-medium | 9GB | 128K | 12s | 18s | 0.82 |
## Analysis
### Strengths
- Llama 3.1: 128K context (tied for longest), fits in budget GPU
- Phi-3-medium: Highest accuracy (0.82) but requires a larger GPU
### Weaknesses
- Mistral: 32K context limit excludes long contracts
- Phi-3-medium: 9GB of weights leaves too little VRAM for long inputs on the budget GPU
## Recommendation
**Winner: Llama 3.1 8B Q4_K_M**
Rationale:
1. Handles 50-page documents (128K context)
2. Fits in RTX 4070 (12GB VRAM)
3. Accuracy (0.78) acceptable for use case
4. p95 latency (12s) well under the 30-second requirement
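The candidate table in the Step 5 report does not have to be typed by hand. As a minimal sketch, the function below renders the same columns from a list of per-model result dictionaries. The field names (`vram_gb`, `p50`, `p95`, and so on) are hypothetical and would need to match whatever your benchmark suite actually produces.

```python
HEADER = "| Model | VRAM | Context | Latency (p50) | Latency (p95) | Accuracy |"
DIVIDER = "|-------|------|---------|---------------|---------------|----------|"


def render_table(rows: list[dict]) -> str:
    """Render benchmark results as a markdown table with the report's columns."""
    lines = [HEADER, DIVIDER]
    for row in rows:
        lines.append(
            f"| {row['name']} | {row['vram_gb']:g}GB | {row['context'] // 1000}K "
            f"| {row['p50']:.0f}s | {row['p95']:.0f}s | {row['accuracy']:.2f} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    example = [
        {"name": "Llama 3.1 8B", "vram_gb": 5.2, "context": 128000,
         "p50": 8.0, "p95": 12.0, "accuracy": 0.78},
    ]
    print(render_table(example))
```

Generating the table from the raw results keeps the report in sync with the benchmark when you rerun it with new candidates or test cases.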
Step 6: Validate with production test
Before finalizing, run the recommended model on real documents for a week:
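# production_validation.py
import logging

logger = logging.getLogger(__name__)


def validate_in_production(model, documents: list, feedback_loop: int = 50):
    """Summarize real documents and review them in batches of `feedback_loop`."""
    results = []
    for i, doc in enumerate(documents):
        summary = model.generate(doc)
        results.append({
            "doc_id": i,
            "summary": summary,
            "user_feedback": None,  # Collect after review
        })
        # Review each full batch, plus the final partial batch
        if (i + 1) % feedback_loop == 0 or i == len(documents) - 1:
            batch = results[(i // feedback_loop) * feedback_loop:]
            # collect_feedback is a helper not shown here; assumed to return the batch's average score
            avg_score = collect_feedback(batch)
            logger.info("Batch %d average: %s", i // feedback_loop + 1, avg_score)
    return aggregate_results(results)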
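The validation loop ends by calling `aggregate_results`, which is left undefined. A minimal sketch of what it might compute follows. It assumes reviewers record a numeric rating from 1 to 5 in each `user_feedback` field, which is an assumption made here for illustration, as is the threshold for flagging weak summaries.

```python
from statistics import fmean

LOW_RATING_THRESHOLD = 2  # assumed cutoff on a 1-5 scale


def aggregate_results(results: list[dict]) -> dict:
    """Summarize reviewer feedback collected during the validation period."""
    rated = [r for r in results if r["user_feedback"] is not None]
    ratings = [r["user_feedback"] for r in rated]
    return {
        "total": len(results),
        "reviewed": len(rated),
        "mean_rating": fmean(ratings) if ratings else None,
        # Documents to re-read when deciding whether the recommendation holds
        "low_rated_ids": [r["doc_id"] for r in rated
                          if r["user_feedback"] <= LOW_RATING_THRESHOLD],
    }
```

Reporting the reviewed count alongside the mean makes it obvious when the average rests on only a few rated documents.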
Complete the full model comparison project for a use case relevant to your work. Document each step and produce a final recommendation with rationale.