13. Model Selection for Chat
Chat applications have specific requirements: response quality, coherence across turns, and personality. This chapter provides selection criteria for conversational models.
Chat model requirements:
- Instruction following: Understand implicit and explicit user intent
- Context awareness: Maintain conversation history without hallucinating
- Appropriate tone: Professional but approachable, not robotic
- Latency tolerance: Chat users expect responses within 2-3 seconds
Model selection by capability tier:
Low VRAM (<8GB):
- Phi-3-mini (3.8B): Surprisingly capable, good for quick interactions
- Gemma 2B: Lightweight, acceptable quality
- Qwen 2.5 1.5B: Minimal quality, useful for basic tasks
Mid VRAM (8-16GB):
- Llama 3.2 3B: Excellent quality per parameter, fast
- Mistral 7B: Good all-around, well-tuned versions available
- Phi-3-medium (14B): Strong reasoning if VRAM allows
High VRAM (24GB+):
- Llama 3.1 8B: Strong general capability
- Mistral Large: Superior reasoning for complex conversations
- Command R+: Optimized for RAG and tool use
Evaluation checklist for chat:
Test scenarios:
- Simple questions: "What is the capital of France?"
- Clarification requests: "Can you make that shorter?"
- Multi-turn context: "Earlier you mentioned X, expand on that"
- Edge cases: "I'm not sure what I need-help me decide"
- Length control: "Give me a one-sentence summary"
Common failure modes:
- Repetition loops: Model repeats same phrase or idea
- System prompt leakage: Includes instructions in response
- Personality shifts: Different tone across conversation
- Generic answers: Safe but unhelpful responses
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Define your chat use case with specific requirements. Filter candidate models by VRAM needs, then run your own 10-question chat benchmark to select the best performer.