Capstone Scope — Capstone: Full-Stack AI App (Chapter 1)

This capstone brings together every layer of a production AI application. The goal is not to build a demo but to ship something that handles real traffic, fails gracefully, and can be maintained by a team. The scope covers model serving infrastructure, API gateway design, frontend implementation, integration testing, security hardening, Docker Compose deployment, CI/CD automation, monitoring, and documentation.

The project is a document Q&A application. Users upload PDFs and ask questions in natural language. The backend uses a local LLM for inference. This setup exercises every component in a realistic scenario: file handling, async processing, streaming responses, rate limiting, and queue-depth monitoring.
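To make the async processing and queue depth ideas concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the course's implementation: the names `worker` and `run_jobs`, the `max_depth` parameter, and the placeholder "inference" step are all hypothetical. A bounded `asyncio.Queue` applies backpressure when full, and `qsize()` gives the depth a monitoring system could sample.

```python
import asyncio
from typing import List, Tuple


async def worker(queue: "asyncio.Queue[str]", results: List[str]) -> None:
    """Consume questions one at a time (hypothetical stand-in for inference)."""
    while True:
        question = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder: a real app would call the model here
            results.append(f"answered: {question}")
        finally:
            queue.task_done()


async def run_jobs(questions: List[str], max_depth: int = 2) -> Tuple[List[str], int]:
    """Feed questions through a bounded queue and report the peak depth seen."""
    queue: "asyncio.Queue[str]" = asyncio.Queue(maxsize=max_depth)
    results: List[str] = []
    peak_depth = 0
    task = asyncio.create_task(worker(queue, results))

    for question in questions:
        await queue.put(question)  # waits while the queue is full (backpressure)
        peak_depth = max(peak_depth, queue.qsize())  # value a metrics exporter could sample

    await queue.join()  # wait until every queued question has been processed
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return results, peak_depth


if __name__ == "__main__":
    answers, peak = asyncio.run(run_jobs(["q1", "q2", "q3"]))
    print(answers, "peak queue depth:", peak)
```

A single worker keeps answers in submission order. A real deployment would likely run several workers and export the depth as a gauge rather than returning it.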

Key technical decisions shape the scope. The model serving layer uses llama.cpp for CPU inference or vLLM for GPU acceleration. vLLM provides better throughput for multi-user scenarios but requires CUDA. The API gateway runs nginx with Lua scripting for request routing and authentication. The frontend is a single-page React application that uses Server-Sent Events for streaming. PostgreSQL stores user data and conversation history. Redis handles session state and caching. All services are containerized and run with Docker Compose, with Kubernetes manifests provided for production clusters.
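Since the frontend consumes Server-Sent Events, it may help to see what the backend has to emit on the wire. The sketch below formats messages in the `text/event-stream` framing (optional `event:` field, one `data:` field per payload line, blank line as terminator). The event names `token` and `done`, the JSON payload shape, and the `[DONE]` sentinel are assumptions chosen for illustration, not something this chapter prescribes.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events message as text/event-stream framing."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload needs one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE message per generated token, then a final 'done' event."""
    for token in tokens:
        # JSON-encode the token so embedded newlines cannot break the framing.
        yield format_sse(json.dumps({"token": token}), event="token")
    yield format_sse("[DONE]", event="done")  # hypothetical end-of-stream marker


if __name__ == "__main__":
    for message in stream_tokens(["Hel", "lo"]):
        print(message, end="")
```

On the browser side, an `EventSource` listener registered for the `token` event would receive each JSON payload in `event.data`.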

Failure modes determine scope boundaries. Long-running inference jobs need timeout handling. Large PDF uploads require size limits and virus scanning. Rate limiting prevents resource exhaustion. Circuit breakers protect against downstream service failures. These concerns drive architecture decisions throughout the course.
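Of the failure modes above, the circuit breaker is the least self-explanatory, so here is a minimal sketch of the pattern in plain Python. The class name `CircuitBreaker`, its `failure_threshold` and `reset_after` parameters, and the injectable `clock` are hypothetical choices for illustration. A production service would more likely rely on an existing library or on gateway-level configuration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half-open"  # cool-down elapsed: allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("downstream call rejected; circuit is open")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping the inference call in `breaker.call(...)` lets the API fail fast with a clear error while the model server recovers, instead of stacking up requests that will time out anyway.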

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
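One way to capture that record consistently is a small helper that gathers the runtime details automatically and leaves the rest for you to fill in. This is a sketch using only the standard library. The function names, the record fields, and the example values (`data/sample.pdf`, the `pip` package) are hypothetical placeholders to adapt to your own exercise.

```python
import json
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def checkpoint_record(
    packages: Iterable[str],
    data_path: str,
    observed_output: str,
    notes: str = "",
) -> Dict[str, object]:
    """Collect the details the checkpoint asks for into one JSON-serializable dict."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,  # e.g. model size, CPU/GPU backend, available memory
    }


if __name__ == "__main__":
    record = checkpoint_record(
        packages=["pip"],                 # replace with the packages the exercise uses
        data_path="data/sample.pdf",      # hypothetical path
        observed_output="(paste the output you saw)",
        notes="CPU backend",
    )
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise keeps the constraint notes and the observed output together.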