06. Backend Implementation
The backend provides the intelligence that makes your product valuable. This chapter covers implementing the server-side components of a local AI product: model integration, data handling, API design, and performance optimization.
Model Integration
Local AI models require different integration patterns than cloud APIs. You handle model selection, loading, inference optimization, and hardware utilization yourself.
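# src/models/inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class LocalModel:
    """Loads a causal language model once and serves generation requests."""

    def __init__(self, model_path: str, device: str = "auto"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        # float16 halves memory use relative to float32 and is best suited to GPUs.
        # device_map="auto" requires the `accelerate` package to be installed.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map=device,
        )

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        # Move the input tensors to the device that holds the model weights.
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
        )
        # The decoded text contains the prompt followed by the completion.
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)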
This pattern loads the model in the constructor, so create a single LocalModel instance at startup and reuse it across requests rather than constructing one per request. For production, consider quantization to reduce memory requirements and request batching to improve throughput.
API Design
Design APIs that are simple to use correctly and difficult to misuse. Follow REST conventions for resource-oriented interfaces. Use clear endpoint names, consistent response formats, and appropriate HTTP status codes.
# src/api/routes.py
import time

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    text: str
    max_results: int = 10


class QueryResponse(BaseModel):
    results: list[dict]
    query: str
    processing_time_ms: float


def perform_search(text: str, max_results: int) -> list[dict]:
    # Placeholder: replace with your product's search implementation.
    return []


@app.post("/search", response_model=QueryResponse)
def search(request: QueryRequest):
    # A plain `def` lets FastAPI run the blocking search in a worker thread.
    if not request.text.strip():
        raise HTTPException(status_code=400, detail="Query text required")
    start = time.perf_counter()
    results = perform_search(request.text, request.max_results)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return QueryResponse(
        results=results,
        query=request.text,
        processing_time_ms=elapsed_ms,
    )
Document your API with examples showing successful requests, error cases, and expected response formats.
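As one possible way to keep such examples next to the code, the sketch below stores a success case and an error case for `POST /search` as plain data and renders them as text. The `EXAMPLES` table, the `render_example` helper, and the sample values (paths, scores, timings) are assumptions made for illustration. The error body follows FastAPI's default `{"detail": ...}` shape for `HTTPException`.

```python
import json

# Example payloads for POST /search, mirroring QueryRequest and QueryResponse.
# All concrete values below are made up for illustration.
EXAMPLES = {
    "success": {
        "request": {"text": "quarterly report", "max_results": 2},
        "status": 200,
        "response": {
            "results": [{"path": "docs/q3.md", "score": 0.91}],
            "query": "quarterly report",
            "processing_time_ms": 12.4,
        },
    },
    "empty_query": {
        "request": {"text": "   "},
        "status": 400,
        "response": {"detail": "Query text required"},
    },
}


def render_example(name: str) -> str:
    """Format one example as plain text suitable for pasting into API docs."""
    example = EXAMPLES[name]
    return "\n".join([
        "POST /search",
        json.dumps(example["request"], indent=2),
        f"-> HTTP {example['status']}",
        json.dumps(example["response"], indent=2),
    ])
```

Calling `print(render_example("empty_query"))` produces a request and response pair that can be pasted into a README. The same table can double as test fixtures, which helps keep the documentation from drifting away from the actual behavior.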
Data Storage
Local products need local data management. SQLite handles most use cases without external dependencies. For more complex needs, consider SQLite's FTS5 full-text search extension or an embedded document store.
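To make the full-text search option concrete, here is a minimal sketch using SQLite's FTS5 extension through the standard `sqlite3` module. It assumes your Python build ships SQLite with FTS5 compiled in, which is common but not guaranteed. The table name `documents_fts` and the helper functions are hypothetical and independent of the schema in this section.

```python
import sqlite3


def create_fts_index(conn: sqlite3.Connection) -> None:
    """Create an FTS5 table; `path` is stored but excluded from the index."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts "
        "USING fts5(path UNINDEXED, content)"
    )


def add_document(conn: sqlite3.Connection, path: str, content: str) -> None:
    conn.execute(
        "INSERT INTO documents_fts (path, content) VALUES (?, ?)",
        (path, content),
    )


def search_documents(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[str]:
    """Return paths of matching documents, best match first."""
    # Quote the input so FTS5 treats it as one phrase instead of parsing
    # operators such as AND, NOT, or column filters out of user text.
    phrase = '"' + query.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT path FROM documents_fts WHERE documents_fts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (phrase, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Treating the whole query as a phrase is a conservative choice: it avoids syntax errors from raw user input at the cost of stricter matching. A product that wants keyword-style search could split the query into terms and quote each one separately.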
# src/data/storage.py
import sqlite3
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def get_db(db_path: str = "data/app.db"):
    # Create the parent directory on first run; sqlite3 will not create it.
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        yield conn
        conn.commit()  # persist writes when the block exits without an error
    finally:
        conn.close()


def init_schema():
    with get_db() as conn:
        conn.executescript("""
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY,
                path TEXT UNIQUE,
                content TEXT,
                indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            );
            CREATE TABLE IF NOT EXISTS search_logs (
                id INTEGER PRIMARY KEY,
                query TEXT,
                results_count INTEGER,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            );
        """)
Performance Considerations
Local AI inference is computationally expensive. Optimize by:
- Caching frequent queries and their results
- Using smaller models for simple tasks
- Loading models once and keeping them in memory
- Processing documents in batches when indexing
- Using quantization to fit larger models in memory
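Two of the items above, query caching and batch processing, can be sketched with the standard library alone. This is an illustration under assumptions, not a prescribed design: `run_search` is a hypothetical placeholder for an expensive search or inference call, and the cache size of 256 is arbitrary.

```python
from functools import lru_cache
from typing import Iterable, Iterator


def run_search(query: str) -> tuple[str, ...]:
    """Hypothetical placeholder for an expensive search or inference call."""
    run_search.calls += 1
    return (f"result for {query}",)


run_search.calls = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=256)
def _cached_search(normalized_query: str) -> tuple[str, ...]:
    return run_search(normalized_query)


def cached_search(query: str) -> tuple[str, ...]:
    """Normalize case and whitespace so equivalent queries share a cache entry."""
    return _cached_search(" ".join(query.lower().split()))


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield lists of up to `size` items, for example documents to index together."""
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

Cached results go stale when the underlying index changes, so a real implementation would call `_cached_search.cache_clear()` after re-indexing. Results are returned as tuples so callers cannot mutate the cached value.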
Measure performance with profiling tools and optimize only the bottlenecks your measurements reveal.
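A minimal measurement sketch using only the standard library follows. The helper names `timed` and `profile_call` are hypothetical: `timed` records wall-clock time for a block, and `profile_call` runs a function under `cProfile` and returns the report as text.

```python
import cProfile
import io
import pstats
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, sink: dict):
    """Record wall-clock milliseconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = (time.perf_counter() - start) * 1000


def profile_call(func, *args, top: int = 5) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

A typical workflow is to wrap each stage of a request (tokenization, inference, database access) in `timed` to find the slow stage, then use `profile_call` on that stage alone to see which functions dominate.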
Implement the core backend functionality for your product with a working API endpoint, data storage, and model integration. Verify it works before adding any frontend.