What this does

Setting up authentication for local AI endpoints protects inference APIs from unauthorized access. The authentication layer verifies client identity through API keys for programmatic access and JWTs for user-facing applications. Authenticated requests proceed to the model server, while unauthenticated requests receive a 401 Unauthorized response. This prevents resource theft, enables usage tracking per client, and provides the foundation for tiered access control and billing.

Steps

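Design the authentication scheme. For API key auth, generate keys using secrets.token_urlsafe(32) and store only the SHA-256 hash in the database; never store raw keys. The database table: CREATE TABLE api_keys (id INTEGER PRIMARY KEY, key_hash TEXT UNIQUE, user_id INTEGER, tier TEXT DEFAULT 'free', created_at TIMESTAMP, expires_at TIMESTAMP). The expires_at column stays NULL for active keys and is set during key rotation.
Implement the API key verification dependency. In FastAPI, create a dependency: async def verify_api_key(x_api_key: str | None = Header(None)). If the header is missing, raise HTTPException(status_code=401, detail="Not authenticated"); declaring the parameter as Header(...) would make FastAPI answer a missing header with 422 instead of 401. Otherwise compute key_hash = hashlib.sha256(x_api_key.encode()).hexdigest(), run result = await db.fetch_one("SELECT * FROM api_keys WHERE key_hash = ?", key_hash), and raise HTTPException(status_code=401, detail="Invalid API key") if no row is found or its expires_at has passed. Return the row on success. Apply to routes: @app.post("/v1/chat", dependencies=[Depends(verify_api_key)]).
Implement JWT login. Define @app.post("/auth/login") with a request-body model (a Pydantic model with username and password fields, called LoginRequest here) so credentials are not sent as query parameters, which is what def login(username: str, password: str) would produce. Look up user = get_user(username) and raise HTTPException(status_code=401) if the user does not exist or verify_password(password, user.hashed_password) fails. Issue token = jwt.encode({"sub": str(user.id), "exp": datetime.now(timezone.utc) + timedelta(hours=24)}, SECRET_KEY, algorithm="HS256") and return {"access_token": token}. The sub claim is a string per the JWT specification.
Create a JWT verification dependency similar to the API key dependency. Read the token from the Authorization: Bearer <token> header, decode it with jwt.decode(token, SECRET_KEY, algorithms=["HS256"]), and raise 401 on a missing, invalid, or expired token.
Add role-based access for admin endpoints. Check tier == "admin" within the dependency and raise HTTPException(status_code=403) for authenticated callers on any other tier.
Implement key rotation. When rotating a key, add the new key to the database and mark the old one with expires_at = datetime.now(timezone.utc) + timedelta(days=7) for a grace period during which both keys are accepted.
Add key management endpoints accessible only to admins. Use POST /admin/keys for generation and DELETE /admin/keys/{id} for revocation.
Log all authentication failures with timestamp and IP for security monitoring.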

Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

Verification

Missing credentials. Send a request without any auth header to the protected endpoint and verify a 401 response with {"detail": "Not authenticated"}.
Invalid API key. Send a request with an invalid API key and verify 401.
Valid API key. Send a request with a valid API key and verify a 200 response with the expected inference output.
JWT flow. Log in with correct credentials to receive a token, then send it in the Authorization: Bearer <token> header and verify access.
Token expiration. Create a token with a 1-second expiry, wait until it has passed, then verify 401.
Admin-only access. Call an admin-only endpoint with a free-tier API key and verify 403 Forbidden.

Common failures

API key stored in plaintext - The verification step must hash the incoming key and compare hashes; never store or compare raw keys.
JWT secret key hardcoded in source - Read the secret from an environment variable (os.environ["JWT_SECRET"]) and fail at startup with a clear error if it is not set.
Token validation slow on every request - Measure first, since HMAC signature checks are usually cheap. If validation is a real bottleneck, cache decoded tokens in memory with a TTL no longer than the token's remaining lifetime so an expired token is never served from cache.
Missing HTTPS - Always run auth endpoints behind TLS in production to prevent token capture.
Brute force vulnerability on login - Implement rate limiting on the login endpoint, for example 5 attempts per minute per IP.

Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.

Related guides

implement-rate-limiting-ai-apis
build-multi-tenant-ai-serving
implement-guardrails-ai-agents