Model Selection for Chat — Understanding AI Models (Chapter 13)

Chat applications have specific requirements: response quality, coherence across turns, and personality. This chapter provides selection criteria for conversational models.

Chat model requirements:

Instruction following: Understand implicit and explicit user intent
Context awareness: Maintain conversation history without hallucinating
Appropriate tone: Professional but approachable, not robotic
Latency tolerance: Chat users expect responses within 2-3 seconds

Model selection by capability tier:

Low VRAM (<8GB):

Phi-3-mini (3.8B): Surprisingly capable, good for quick interactions
Gemma 2B: Lightweight, acceptable quality
Qwen 2.5 1.5B: Minimal quality, useful for basic tasks

Mid VRAM (8-16GB):

Llama 3.2 3B: Excellent quality per parameter, fast
Mistral 7B: Good all-around, well-tuned versions available
Phi-3-medium (14B): Strong reasoning if VRAM allows

High VRAM (24GB+):

Llama 3.1 8B: Strong general capability
Mistral Large: Superior reasoning for complex conversations
Command R+: Optimized for RAG and tool use

Evaluation checklist for chat:

Test scenarios:
  - Simple questions: "What is the capital of France?"
  - Clarification requests: "Can you make that shorter?"
  - Multi-turn context: "Earlier you mentioned X, expand on that"
  - Edge cases: "I'm not sure what I need-help me decide"
  - Length control: "Give me a one-sentence summary"

Common failure modes:

Repetition loops: Model repeats same phrase or idea
System prompt leakage: Includes instructions in response
Personality shifts: Different tone across conversation
Generic answers: Safe but unhelpful responses

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.