02. Architecture Design
The architecture follows a layered approach: clients communicate with an API gateway, which routes to backend services, which communicate with the model serving layer. Each layer has distinct scaling characteristics and failure modes.
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│ API Gateway │────▶│   Backend   │
│   (React)   │◀────│   (nginx)   │◀────│  (FastAPI)  │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                    ┌──────────────────────────┼──────────────────────────┐
                    │                          │                          │
                    ▼                          ▼                          ▼
             ┌─────────────┐            ┌─────────────┐            ┌─────────────┐
             │ Model Serve │            │ PostgreSQL  │            │    Redis    │
             │ (llama.cpp) │            │ (metadata)  │            │   (cache)   │
             └─────────────┘            └─────────────┘            └─────────────┘
The API gateway handles TLS termination, rate limiting, and request logging. It rewrites URLs to route traffic to the FastAPI backend and injects a trace ID header into each request so that requests can be correlated across services.
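As a rough sketch of how the backend might carry the gateway's trace ID forward, the helper below reuses an incoming ID or generates one if it is missing. The header name X-Request-ID and the function names are assumptions for illustration; the chapter does not specify them.

```python
import uuid
from typing import Dict, Mapping

# Assumed header name; the chapter does not say which header carries the trace ID.
TRACE_HEADER = "X-Request-ID"


def get_or_create_trace_id(headers: Mapping[str, str]) -> str:
    """Return the trace ID set by the gateway, or a fresh one if absent.

    HTTP header names are case-insensitive, so the lookup ignores case.
    """
    for name, value in headers.items():
        if name.lower() == TRACE_HEADER.lower() and value:
            return value
    # No usable ID arrived (for example, a request that bypassed the gateway).
    return uuid.uuid4().hex


def outbound_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Headers to attach to downstream calls so logs can be joined on one ID."""
    return {TRACE_HEADER: get_or_create_trace_id(incoming)}
```

Passing the result of outbound_headers to every call the backend makes to the model server keeps one ID attached to the whole request path.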
FastAPI handles HTTP requests and manages async task queues. Uploaded files are saved to temporary storage and then queued for processing. The /ask endpoint streams the response using Server-Sent Events as the model generates it.
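To make the streaming format concrete, here is a minimal sketch of Server-Sent Events framing using only the standard library. The function names and the closing "done" event are assumptions, not part of the chapter's design.

```python
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events frame.

    Each line of the payload gets its own "data:" field, and a blank line
    terminates the frame.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_answer(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one frame per generated token, then a final (assumed) 'done' event."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event("", event="done")
```

In a FastAPI handler, a generator like stream_answer could be wrapped in StreamingResponse with media_type="text/event-stream"; the token source would be whatever the model client yields.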
Model serving runs separately from the API layer. The FastAPI backend communicates with llama.cpp or vLLM over HTTP. This separation allows independent scaling: more model servers can be added when GPU capacity is available.
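The sketch below illustrates the HTTP boundary between the backend and the model servers. It assumes the model server exposes an OpenAI-style /v1/completions route and returns the text under choices[0].text; verify both against the server actually deployed. The server addresses, the model name, and the round-robin selection are hypothetical.

```python
import itertools
import json
import urllib.request
from typing import Iterator, List

# Hypothetical model server addresses; add entries as GPU capacity grows.
MODEL_SERVERS = ["http://model-1:8080", "http://model-2:8080"]


def round_robin(servers: List[str]) -> Iterator[str]:
    """Cycle through the servers endlessly, one per request."""
    return itertools.cycle(servers)


def build_completion_request(
    base_url: str, prompt: str, model: str = "local-model", max_tokens: int = 256
) -> urllib.request.Request:
    """Build (but do not send) a POST to the assumed /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str, timeout: float = 30.0) -> str:
    """Send the request and return the generated text (assumed response shape)."""
    request = build_completion_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]
```

Because the backend only knows a list of base URLs, scaling the model layer amounts to extending that list; the API layer itself does not change.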
The database schema uses three core tables: users, documents, and messages. The documents table stores file paths and processing status. Each message links to a document and records the full prompt/response cycle for audit and retraining.
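One possible shape for these three tables is sketched below. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server; all column names, the 'queued' status value, and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical columns; only the three table names come from the chapter.
SCHEMA = """
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE documents (
    id        INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL REFERENCES users(id),
    file_path TEXT NOT NULL,
    status    TEXT NOT NULL DEFAULT 'queued'
);
CREATE TABLE messages (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    prompt      TEXT NOT NULL,
    response    TEXT
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


def demo() -> tuple:
    """Insert one user, document, and message, then join them back together."""
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    user_id = conn.execute(
        "INSERT INTO users (email) VALUES (?)", ("a@example.com",)
    ).lastrowid
    doc_id = conn.execute(
        "INSERT INTO documents (user_id, file_path) VALUES (?, ?)",
        (user_id, "/tmp/uploads/report.pdf"),
    ).lastrowid
    conn.execute(
        "INSERT INTO messages (document_id, prompt, response) VALUES (?, ?, ?)",
        (doc_id, "Summarize the report.", "A short summary."),
    )
    row = conn.execute(
        "SELECT d.status, m.prompt, m.response "
        "FROM messages m JOIN documents d ON d.id = m.document_id"
    ).fetchone()
    conn.close()
    return row
```

Storing the prompt and response on the same row keeps each exchange self-contained, which suits the audit and retraining use the chapter describes.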
Redis caches embeddings and session state. A 15-minute TTL on cached embeddings avoids recomputing them for repeated inputs. Session tokens expire after 24 hours.
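The expiry behavior can be illustrated with a small in-memory stand-in for Redis. This is a sketch of TTL semantics only, not a Redis client; the class name, key names, and injectable clock are assumptions made so the example runs and can be tested without a server.

```python
import time
from typing import Callable, Dict, Optional, Tuple

EMBEDDING_TTL_SECONDS = 15 * 60      # 15 minutes, as stated above
SESSION_TTL_SECONDS = 24 * 60 * 60   # 24 hours, as stated above


class TTLCache:
    """In-memory stand-in that mimics set-with-expiry and get."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, object]] = {}

    def set(self, key: str, value: object, ttl_seconds: float) -> None:
        # Store the absolute expiry time next to the value.
        self._items[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[object]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self._items[key]
            return None
        return value
```

With the redis-py client, the equivalent write would be a call such as setex(key, EMBEDDING_TTL_SECONDS, value), and Redis itself handles the expiry.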
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Create an architecture diagram with request flows for upload, question-asking, and streaming response. Label failure points at each hop.