RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Capstone: Full-Stack AI App
  6. /Ch. 2
Capstone: Full-Stack AI App

02. Architecture Design

Chapter 2 of 18 · 15 min
KEY INSIGHT

Draw the data flow first, then design each service interface around that flow, not the other way around.

The architecture follows a layered approach: clients communicate with an API gateway, which routes to backend services, which communicate with the model serving layer. Each layer has distinct scaling characteristics and failure modes.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│ API Gateway │────▶│   Backend   │
│  (React)    │◀────│   (nginx)   │◀────│  (FastAPI)  │
└─────────────┘     └─────────────┘     └─────────────┘
                                              │
                   ┌──────────────────────────┼──────────────────────────┐
                   │                          │                          │
                   ▼                          ▼                          ▼
           ┌─────────────┐           ┌─────────────┐           ┌─────────────┐
           │ Model Serve │           │ PostgreSQL  │           │    Redis    │
           │  (llama.cpp)│           │  (metadata) │           │   (cache)   │
           └─────────────┘           └─────────────┘           └─────────────┘

The API gateway handles TLS termination, rate limiting, and request logging. It rewrites URLs to route traffic to the FastAPI backend. Headers inject trace IDs for request correlation across services.

FastAPI handles HTTP requests and manages async task queues. File uploads save to temporary storage, then queue for processing. The /ask endpoint streams responses using Server-Sent Events while the model processes the request.

Model serving runs separately from the API layer. The FastAPI backend communicates with llama.cpp or vLLM via HTTP. This separation allows independent scaling—more model servers can add when GPU capacity exists.

Database schema uses three core tables: users, documents, and messages. Documents store file paths and processing status. Messages link to documents and include the full prompt/response cycle for audit and retraining.

Redis caches embeddings and session state. A 15-minute TTL on embeddings reduces repeated embedding computation. Session tokens expire after 24 hours.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Create an architecture diagram with request flows for upload, question-asking, and streaming response. Label failure points at each hop.

← Chapter 1
Capstone Scope
Chapter 3 →
Model Serving Setup