Backend Implementation — Capstone: First AI Product (Chapter 6)

The backend provides the intelligence that makes your product valuable. This chapter covers implementing the server-side components of a local AI product: model integration, data handling, API design, and performance optimization.

Model Integration

Local AI models require different integration patterns than cloud APIs. You handle model selection, loading, inference optimization, and hardware utilization yourself.

# src/models/inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LocalModel:
    def __init__(self, model_path: str, device: str = "auto"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        # device_map="auto" relies on the accelerate package to place weights.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,  # half precision: roughly half the memory of float32
            device_map=device
        )

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True
        )

        # For causal models the output starts with the prompt tokens;
        # slice them off so only the newly generated text is returned.
        prompt_length = inputs["input_ids"].shape[1]
        new_tokens = outputs[0][prompt_length:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

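This pattern loads the model once, at construction time. Create a single LocalModel instance at startup and reuse it across requests rather than constructing one per request. For production, consider quantization to reduce memory requirements and request batching to improve throughput.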
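One way to make sure the model is loaded only once is a cached factory function. The sketch below is a minimal illustration, not the chapter's implementation: FakeModel is a stand-in for LocalModel so the example runs without model weights, and the get_model name and the "models/example" path are hypothetical.

```python
from functools import lru_cache


class FakeModel:
    """Stand-in for LocalModel so the sketch runs without model weights."""

    load_count = 0  # how many times a "model" has been constructed

    def __init__(self, model_path: str):
        FakeModel.load_count += 1
        self.model_path = model_path

    def generate(self, prompt: str) -> str:
        return f"[{self.model_path}] {prompt}"


@lru_cache(maxsize=1)
def get_model(model_path: str = "models/example") -> FakeModel:
    """Construct the model on the first call; later calls reuse the same object."""
    return FakeModel(model_path)


first = get_model()
second = get_model()
print(first is second, FakeModel.load_count)
```

Request handlers would call get_model() instead of constructing the class themselves. Because maxsize is 1, asking for a different path evicts the previous model, which is usually what you want when memory only fits one.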

API Design

Design APIs that are simple to use correctly and difficult to misuse. Follow REST conventions for resource-oriented interfaces. Use clear endpoint names, consistent response formats, and appropriate HTTP status codes.

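# src/api/routes.py
import time

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

class QueryRequest(BaseModel):
    text: str
    # The upper bound of 100 is an assumed limit; tune it for your product.
    max_results: int = Field(default=10, ge=1, le=100)

class QueryResponse(BaseModel):
    results: list[dict]
    query: str
    processing_time_ms: float

def perform_search(text: str, max_results: int) -> list[dict]:
    # Placeholder so the module runs; replace with your real search backend.
    return []

# A plain def (not async def) lets FastAPI run the handler in a worker
# thread, so a slow, blocking search does not stall the event loop.
@app.post("/search", response_model=QueryResponse)
def search(request: QueryRequest):
    if not request.text.strip():
        raise HTTPException(status_code=400, detail="Query text required")

    start = time.perf_counter()
    results = perform_search(request.text, request.max_results)
    elapsed_ms = (time.perf_counter() - start) * 1000

    return QueryResponse(
        results=results,
        query=request.text,
        processing_time_ms=elapsed_ms
    )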

Document your API with examples showing successful requests, error cases, and expected response formats.
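One way to keep documented examples accurate is to store them as data and check their shape in a test. The sketch below assumes the /search endpoint and field names from the listing above; the result item keys ("path", "score"), the sample values, and the check_example helper are hypothetical. The error body follows FastAPI's default of serializing an HTTPException as {"detail": ...}.

```python
import json

# Example payloads for POST /search, one success case and one error case.
EXAMPLES = {
    "success": {
        "request": {"text": "quarterly report", "max_results": 2},
        "status": 200,
        "response": {
            "results": [
                {"path": "docs/q1.txt", "score": 0.91},
                {"path": "docs/q2.txt", "score": 0.84},
            ],
            "query": "quarterly report",
            "processing_time_ms": 12.5,
        },
    },
    "empty_query": {
        "request": {"text": "   "},
        "status": 400,
        "response": {"detail": "Query text required"},
    },
}


def check_example(example: dict) -> bool:
    """Return True if the response has exactly the keys its status implies."""
    body = example["response"]
    if example["status"] == 200:
        return set(body) == {"results", "query", "processing_time_ms"}
    return set(body) == {"detail"}


for name, example in EXAMPLES.items():
    print(name, example["status"], check_example(example))
    print(json.dumps(example["request"]))
```

The same dictionary can feed both the written documentation and an automated test, so the two are less likely to drift apart.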

Data Storage

Local products need local data management. SQLite handles most use cases without external dependencies, and Python ships a driver for it in the standard library. For more complex needs, consider SQLite's FTS5 full-text search extension or an embedded document store.

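# src/data/storage.py
import sqlite3
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def get_db(db_path: str = "data/app.db"):
    # sqlite3 creates the database file but not its parent directory.
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows support access by column name
    try:
        yield conn
        conn.commit()  # persist writes made inside the with-block
    except Exception:
        conn.rollback()  # discard partial writes on error
        raise
    finally:
        conn.close()

def init_schema():
    with get_db() as conn:
        conn.executescript("""
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY,
                path TEXT UNIQUE,
                content TEXT,
                indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            );

            CREATE TABLE IF NOT EXISTS search_logs (
                id INTEGER PRIMARY KEY,
                query TEXT,
                results_count INTEGER,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            );
        """)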

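Because path is declared UNIQUE, re-indexing a file can update its row instead of creating a duplicate. The sketch below shows one way to do that with an upsert, assuming SQLite 3.24 or newer for the ON CONFLICT clause. It uses an in-memory database so it runs on its own, and the upsert_document helper and the sample path are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE documents ("
    " id INTEGER PRIMARY KEY, path TEXT UNIQUE, content TEXT,"
    " indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)


def upsert_document(conn: sqlite3.Connection, path: str, content: str) -> None:
    """Insert a document, or refresh its content if the path already exists."""
    conn.execute(
        """
        INSERT INTO documents (path, content) VALUES (?, ?)
        ON CONFLICT(path) DO UPDATE SET
            content = excluded.content,
            indexed_at = CURRENT_TIMESTAMP
        """,
        (path, content),  # bound parameters avoid SQL injection
    )
    conn.commit()


upsert_document(conn, "notes/a.txt", "first draft")
upsert_document(conn, "notes/a.txt", "second draft")

rows = conn.execute("SELECT path, content FROM documents").fetchall()
print(len(rows), rows[0]["content"])
```

Passing values as bound parameters rather than formatting them into the SQL string matters here, since document paths and content come from the user's files.
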
Performance Considerations

Local AI inference is computationally expensive. Optimize by:

Caching frequent queries and their results
Using smaller models for simple tasks
Loading models once and keeping them in memory
Processing documents in batches when indexing
Using quantization to fit larger models in memory
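Two of the items above, caching and batching, can be sketched with the standard library alone. This is an illustration under stated assumptions: run_inference is a placeholder for an expensive model call, and the function names, cache size, and batch size are hypothetical.

```python
from functools import lru_cache
from typing import Iterable, Iterator


def run_inference(query: str) -> str:
    """Placeholder for an expensive model call."""
    return query.upper()


@lru_cache(maxsize=256)
def cached_answer(normalized_query: str) -> str:
    return run_inference(normalized_query)


def answer(query: str) -> str:
    # Normalize case and whitespace so near-identical queries share one entry.
    return cached_answer(" ".join(query.lower().split()))


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield lists of up to `size` items, for indexing documents in chunks."""
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


answer("Hello  world")
answer("hello world ")
print(cached_answer.cache_info().hits)
print(list(batched(["a", "b", "c", "d", "e"], 2)))
```

Caching only makes sense when a repeated query should return the same result. With sampling enabled during generation, a cache pins whichever sample was produced first.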

Measure performance with profiling tools and optimize only the bottlenecks your measurements reveal.
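A lightweight way to start measuring is to time each stage of a request and compare medians. The sketch below assumes two hypothetical stage names, and the sleep call stands in for model inference. For function-level detail, the standard library's cProfile module is the usual next step.

```python
import time
from contextlib import contextmanager
from statistics import median

timings: dict[str, list[float]] = {}


@contextmanager
def timed(stage: str):
    """Record wall-clock milliseconds for the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings.setdefault(stage, []).append(elapsed_ms)


for _ in range(5):
    with timed("tokenize"):
        "some query text".split()
    with timed("inference"):
        time.sleep(0.01)  # stand-in for the model call

for stage, samples in timings.items():
    print(f"{stage}: median {median(samples):.2f} ms over {len(samples)} runs")
```

Medians are less sensitive than means to a single slow run, such as the first request after the model loads.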