Security Boundaries — Hybrid Local-Cloud AI Architecture (Chapter 16)

Security boundaries define the perimeter between your AI gateway and untrusted environments. Inference requests carry sensitive data, such as user prompts, organizational information, and proprietary context, that must be protected throughout the entire processing pipeline.

Authentication and authorization form the outermost boundary. Hash API keys before storing them, and never store plaintext credentials. Token-based authentication with short-lived access tokens, renewed through refresh tokens, limits how long a stolen credential remains useful. Role-based access control (RBAC) restricts which models and providers each user can access.

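```python
from fastapi import APIRouter, Depends, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from jose import JWTError, jwt
from passlib.context import CryptContext
from pydantic import BaseModel

router = APIRouter()
security = HTTPBearer()
# Slow, salted hashing for stored credentials, per the hashing guidance above.
pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

# Hypothetical settings: load the real secret from a secret manager or env var.
JWT_SECRET = "replace-with-a-secret-from-your-vault"
JWT_ALGORITHM = "HS256"

# Hypothetical user-to-role assignments and role-to-permission mapping (RBAC).
USER_ROLES = {"alice": "analyst", "bob": "admin"}
ROLE_PERMISSIONS = {
    "analyst": {"inference:local"},
    "admin": {"inference:local", "inference:cloud"},
}


class TokenPayload(BaseModel):
    user_id: str
    permissions: set[str]


class CompletionRequest(BaseModel):
    prompt: str


def get_user_permissions(user_id: str) -> set[str]:
    # Unknown users or roles get no permissions (deny by default).
    role = USER_ROLES.get(user_id)
    return ROLE_PERMISSIONS.get(role, set())


async def verify_token(
    credentials: HTTPAuthorizationCredentials = Depends(security),
) -> TokenPayload:
    try:
        # Verifies the signature and, when present, the "exp" (expiry) claim.
        payload = jwt.decode(
            credentials.credentials, JWT_SECRET, algorithms=[JWT_ALGORITHM]
        )
    except JWTError:
        raise HTTPException(status.HTTP_401_UNAUTHORIZED, "Token verification failed") from None

    user_id = payload.get("sub")
    if user_id is None:
        raise HTTPException(status.HTTP_401_UNAUTHORIZED, "Invalid token")

    # Resolve the caller's permissions from their role.
    return TokenPayload(user_id=user_id, permissions=get_user_permissions(user_id))


def require_permission(permission: str):
    """Build a dependency that rejects callers lacking `permission`."""
    async def check_permission(payload: TokenPayload = Depends(verify_token)) -> TokenPayload:
        if permission not in payload.permissions:
            raise HTTPException(status.HTTP_403_FORBIDDEN, "Permission denied")
        return payload
    return check_permission


# Protected endpoint example
@router.post("/v1/completions")
async def completions(
    request: CompletionRequest,
    auth: TokenPayload = Depends(require_permission("inference:local")),
):
    # Only reached when the token is valid and grants "inference:local".
    return {"user": auth.user_id, "status": "accepted"}
```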
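The advice to hash API keys before storage can be sketched with the standard library alone. This is a minimal illustration, not a prescribed scheme: it assumes the gateway generates keys itself as long random strings, which is what makes a fast SHA-256 digest acceptable here. For low-entropy secrets such as passwords, a slow salted hash like the bcrypt `CryptContext` in the example above is the safer choice. The `gw_` prefix and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical prefix that makes keys recognizable in logs and secret scanners.
KEY_PREFIX = "gw_"


def hash_api_key(key: str) -> str:
    # A fast hash is acceptable only because the key is long and random.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def issue_api_key() -> tuple[str, str]:
    """Return (plaintext_key, stored_digest). Show the key once; persist only the digest."""
    key = KEY_PREFIX + secrets.token_urlsafe(32)  # 32 random bytes, base64url-encoded
    return key, hash_api_key(key)


def verify_api_key(presented: str, stored_digest: str) -> bool:
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(hash_api_key(presented), stored_digest)
```

A caller would run `key, digest = issue_api_key()`, hand `key` to the client once, and store only `digest`; `verify_api_key(key, digest)` then succeeds, while any altered key fails. Because only digests are stored, a database leak does not expose usable credentials.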

Data isolation ensures that requests do not leak between tenants or sessions. Running sensitive workloads on dedicated inference resources means their prompts are never processed on the same model instance as other tenants' requests, so co-resident inference cannot expose them. Network segmentation isolates inference traffic from general application traffic.
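One way to enforce the dedicated-resource rule is at routing time. The sketch below is illustrative only: `Tenant`, `IsolationRouter`, and the pool names are hypothetical, and a real gateway would read tenant sensitivity and pool assignments from configuration. The key design choice is failing closed: a sensitive tenant without a dedicated pool is rejected rather than silently sent to shared hardware. The `cache_key` helper shows one assumed way to keep any cached context scoped to a single tenant and session.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    sensitive: bool = False


@dataclass
class IsolationRouter:
    """Hypothetical router: sensitive tenants get dedicated pools, others share one."""
    shared_pool: str
    dedicated_pools: dict[str, str] = field(default_factory=dict)

    def pool_for(self, tenant: Tenant) -> str:
        if tenant.sensitive:
            pool = self.dedicated_pools.get(tenant.tenant_id)
            if pool is None:
                # Fail closed: never fall back to the shared pool for sensitive work.
                raise PermissionError(f"no dedicated pool for tenant {tenant.tenant_id}")
            return pool
        return self.shared_pool

    @staticmethod
    def cache_key(tenant: Tenant, session_id: str, prompt_hash: str) -> str:
        # Namespace cached entries by tenant and session so they are never shared.
        return f"{tenant.tenant_id}:{session_id}:{prompt_hash}"
```

For example, with `IsolationRouter(shared_pool="gpu-shared", dedicated_pools={"acme": "gpu-acme"})`, a sensitive `acme` request is routed to `gpu-acme`, an ordinary tenant goes to `gpu-shared`, and a sensitive tenant with no configured pool raises `PermissionError`.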

Prompt injection is a subtle attack vector: a malicious user embeds instructions in a prompt to manipulate model behavior or to extract information from the context. Input validation, output filtering, and sandboxed execution reduce injection risk, although no single one of them eliminates it.
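Input validation and output filtering can be sketched as two small gateway hooks. Treat this as a heuristic illustration under stated assumptions, not a defense on its own: the patterns, the length limit, and the `gw_` credential format are all hypothetical, and attackers can rephrase around any fixed pattern list. Findings are best used as signals for logging or rejection, alongside sandboxing and least-privilege access for any tools the model can call.

```python
import re

# Hypothetical heuristics; rephrased attacks will evade them, so treat matches as signals.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the |your )?system prompt", re.IGNORECASE),
]
# Hypothetical credential format to redact from model output.
SECRET_PATTERN = re.compile(r"gw_[A-Za-z0-9_-]{20,}")
# Hypothetical upper bound on prompt size.
MAX_PROMPT_CHARS = 8000


def validate_input(prompt: str) -> list[str]:
    """Return reasons to flag or reject the prompt; an empty list means no findings."""
    findings = []
    if len(prompt) > MAX_PROMPT_CHARS:
        findings.append("prompt too long")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(prompt):
            findings.append(f"matched: {pattern.pattern}")
    return findings


def filter_output(completion: str) -> str:
    # Redact anything shaped like a credential before the response leaves the gateway.
    return SECRET_PATTERN.sub("[REDACTED]", completion)
```

Running `validate_input` before dispatch and `filter_output` on every completion places both checks at the gateway, so they apply uniformly to local and cloud providers.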

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
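Part of the checkpoint record can be captured automatically. The sketch below uses only the standard library to gather the interpreter version, platform, CPU count, installed package versions, and the resolved data path; the package list and `./data` path are hypothetical placeholders for whatever your example uses. The observed output, GPU backend, and available memory still have to be noted by hand.

```python
import importlib.metadata
import json
import os
import platform
import sys


def environment_record(packages: list[str], data_path: str) -> dict:
    """Collect reproducibility details for the local verification checkpoint."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "packages": versions,
        "data_path": os.path.abspath(data_path),
    }


if __name__ == "__main__":
    # Hypothetical package list and data path; adjust to the example you ran.
    record = environment_record(["fastapi", "python-jose", "passlib"], "./data")
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise output makes it easy to tell later whether a different result came from a code change or from a different environment.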