How to build a multi-tenant AI serving infrastructure

Advanced · 45 min · By Fredoline Eruo

Target environment

Ubuntu 24.04 · Ollama 0.4.x

Prerequisites

Kubernetes or Docker, AI inference service

What this does

Building a multi-tenant AI serving infrastructure allows multiple users, teams, or customers to share a single set of GPU resources while maintaining isolation, fair scheduling, and per-tenant cost tracking. The infrastructure provides each tenant with a dedicated API endpoint (or namespace), enforces usage quotas, and ensures one tenant's heavy workload does not degrade performance for others. The architecture spans the API gateway, request queue, model server, and monitoring layers.

Steps

Design the tenant data model. Create a tenants table: CREATE TABLE tenants (id UUID PRIMARY KEY, name TEXT, api_key_hash TEXT, tier TEXT, max_concurrent_requests INT DEFAULT 5, daily_token_limit BIGINT, created_at TIMESTAMP). Store only a hash of each API key in api_key_hash, never the key itself.

Implement the API gateway configuration. For each tenant, generate an API key and create a Kong route with the key-auth plugin. Configure per-tenant rate limiting at the gateway through the Kong Admin API. Kong has no "kong plugin add" CLI command; plugins are attached with a request such as: curl -X POST http://localhost:8001/consumers/<tenant>/plugins --data name=rate-limiting --data config.minute=100 --data config.policy=local, where <tenant> is the Kong consumer that represents the tenant and 8001 is the default Admin API port. Use minute=100 for the free tier and higher limits for paid tiers.

Build the request scheduler. This component pulls requests from the gateway, assigns them to available inference slots, and enforces per-tenant concurrency limits. Use a priority queue in which a tenant's priority is determined by its tier and current utilization. Implement it in Redis with a sorted set: ZADD request_queue <priority> <request_id> to enqueue a job and ZPOPMAX request_queue to fetch the highest-priority job.

Deploy the inference layer with multi-instance support. Run vLLM with --max-num-seqs 256 to allow concurrent generation across tenants; --max-num-seqs is a vLLM flag, so if you use TGI instead, set its equivalent concurrency limit. Implement namespace isolation: each tenant's requests carry a tenant_id header, which the scheduler uses to track GPU time. For additional isolation, run separate model instances per tenant tier, for example T4 GPUs for the free tier and A100 GPUs for enterprise.

Add a usage tracking layer that logs each request's token count, latency, and tenant ID. Aggregate hourly: INSERT INTO usage_records (tenant_id, tokens_used, request_count, timestamp) SELECT tenant_id, SUM(prompt_tokens + completion_tokens), COUNT(*), date_trunc('hour', timestamp) FROM request_logs GROUP BY tenant_id, date_trunc('hour', timestamp). Restrict the SELECT to the most recently completed hour (for example with a WHERE clause on timestamp) so that repeated runs do not double-count earlier logs.

Enforce quotas by checking cumulative usage before admitting new requests. If a tenant exceeds the daily token limit, return 429 with {"error": "quota_exceeded", "reset_at": "tomorrow 00:00 UTC"}. In practice, send reset_at as an ISO 8601 timestamp for the next 00:00 UTC so that clients can parse it.
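The scheduler step above can be sketched in plain Python. This is a minimal illustration, not the guide's implementation: it uses an in-memory heap as a stand-in for the Redis sorted set (ZADD / ZPOPMAX), and the TIER_PRIORITY values, the TenantScheduler class, and the priority formula are hypothetical choices made for the example.

```python
import heapq
import itertools
from collections import defaultdict

# Hypothetical tier weights (an assumption); a higher value is served first.
TIER_PRIORITY = {"free": 1, "pro": 5, "enterprise": 10}


class TenantScheduler:
    """In-memory stand-in for the Redis sorted-set queue (ZADD / ZPOPMAX)."""

    def __init__(self, limits):
        self.limits = limits                # tenant_id -> max_concurrent_requests
        self.in_flight = defaultdict(int)   # tenant_id -> requests currently running
        self.heap = []                      # (-priority, seq, tenant_id, request_id)
        self.seq = itertools.count()        # tie-breaker: FIFO within equal priority

    def submit(self, tenant_id, tier, request_id):
        # Priority combines tier and current utilization, measured at submit time.
        priority = TIER_PRIORITY[tier] - self.in_flight[tenant_id]
        heapq.heappush(self.heap, (-priority, next(self.seq), tenant_id, request_id))

    def next_request(self):
        """Pop the highest-priority request whose tenant is under its limit."""
        skipped, chosen = [], None
        while self.heap:
            entry = heapq.heappop(self.heap)
            tenant_id, request_id = entry[2], entry[3]
            if self.in_flight[tenant_id] < self.limits[tenant_id]:
                self.in_flight[tenant_id] += 1
                chosen = (tenant_id, request_id)
                break
            skipped.append(entry)           # tenant is at its concurrency limit
        for entry in skipped:
            heapq.heappush(self.heap, entry)  # keep skipped requests queued
        return chosen

    def complete(self, tenant_id):
        """Release the inference slot held by a finished request."""
        self.in_flight[tenant_id] -= 1


if __name__ == "__main__":
    sched = TenantScheduler({"a": 1, "b": 2})
    sched.submit("a", "enterprise", "a-1")
    sched.submit("a", "enterprise", "a-2")
    sched.submit("b", "free", "b-1")
    print(sched.next_request())  # tenant a gets its single slot
    print(sched.next_request())  # a is at its limit, so b's request runs next
```

Skipped requests are pushed back rather than dropped, so a tenant that has hit its concurrency limit keeps its place in the queue until one of its running requests calls complete.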

  • Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.

  • Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.

  • Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
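As a companion to the usage tracking step, the hourly aggregation can be prototyped in Python before it is moved into SQL. This is a sketch under assumptions: the aggregate_hourly function is hypothetical, logs are plain dicts whose keys mirror the request_logs columns named in the guide, and timestamps are assumed to be timezone-aware.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_hourly(request_logs):
    """Group request logs by (tenant_id, hour), like the SQL GROUP BY in the guide.

    Each log is a dict with tenant_id, prompt_tokens, completion_tokens and a
    timezone-aware timestamp (an assumed shape for this example).
    """
    buckets = defaultdict(lambda: {"tokens_used": 0, "request_count": 0})
    for log in request_logs:
        # Equivalent of date_trunc('hour', timestamp), normalised to UTC.
        hour = log["timestamp"].astimezone(timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        bucket = buckets[(log["tenant_id"], hour)]
        bucket["tokens_used"] += log["prompt_tokens"] + log["completion_tokens"]
        bucket["request_count"] += 1
    # One row per (tenant, hour), ordered by tenant and then by hour.
    return [
        {"tenant_id": tenant_id, "timestamp": hour, **totals}
        for (tenant_id, hour), totals in sorted(buckets.items(), key=lambda kv: kv[0])
    ]


if __name__ == "__main__":
    logs = [
        {"tenant_id": "a", "prompt_tokens": 10, "completion_tokens": 5,
         "timestamp": datetime(2025, 1, 1, 10, 5, tzinfo=timezone.utc)},
        {"tenant_id": "a", "prompt_tokens": 1, "completion_tokens": 2,
         "timestamp": datetime(2025, 1, 1, 10, 59, tzinfo=timezone.utc)},
    ]
    for row in aggregate_hourly(logs):
        print(row)
```

Normalising every timestamp to UTC before truncating keeps the hourly buckets consistent with the UTC-based daily quota described in the steps.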

Verification

  • Create two tenants with different API keys.
  • With Tenant A's max_concurrent_requests set to at least 10, send 10 concurrent requests from Tenant A and 2 from Tenant B. Both batches should complete, and Tenant B's latency should stay close to what it sees when the system is otherwise idle.
  • Exceed Tenant A's concurrency limit and verify that the additional requests return 429.
  • Exceed the daily token quota and verify that requests are rejected with the correct reset time.
  • Check the usage_records table for accurate per-tenant token counts.
  • Verify that a request without a valid API key returns 401.
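To check the quota rejection and its reset time, it helps to compute the expected values independently of the service. The sketch below is illustrative only: check_quota and next_utc_midnight are hypothetical helpers, reset_at is emitted as an ISO 8601 string rather than the literal text in the example payload, and treating "usage equal to the limit" as exhausted is an assumption.

```python
from datetime import datetime, timedelta, timezone


def next_utc_midnight(now):
    """Return the next 00:00 UTC strictly after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    midnight = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


def check_quota(tokens_used_today, daily_token_limit, now):
    """Return (status_code, body) for an admission decision.

    Assumption: a tenant that has reached the limit is treated as exhausted.
    """
    if tokens_used_today >= daily_token_limit:
        return 429, {
            "error": "quota_exceeded",
            "reset_at": next_utc_midnight(now).isoformat(),
        }
    return 200, {}


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 23, 30, tzinfo=timezone.utc)
    print(check_quota(1000, 1000, now))
    print(check_quota(10, 1000, now))
```

Converting to UTC before truncating matters for callers in other time zones: a request at 20:00 in UTC-5 is already past midnight UTC, so its reset falls at the end of the following UTC day.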

Common failures

  • Tenant isolation breached - A tenant that can read another tenant's data means namespace isolation is insufficient; scope every database query by tenant ID, using Row-Level Security (RLS) or an enforced WHERE clause.
  • Scheduler starvation - Low-priority tenants never get GPU time; implement fair queuing with weighted round-robin so that a lower-priority request is served after every 5 high-priority requests.
  • Quota reset not happening at midnight - Group usage with date_trunc('day', timestamp) and use the UTC timezone consistently for both logging and quota checks.
  • GPU memory fragmentation with many concurrent small requests - Set --max-model-len appropriately and enable prefix caching in the inference engine.
  • Database connection exhaustion - Use connection pooling (PgBouncer) with at least 20 connections to handle multi-tenant load.

  • Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
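The fix for scheduler starvation listed above (one lower-priority request after every 5 high-priority ones) can be sketched as a two-queue dispatcher. This is a simplified illustration: the WeightedFairQueue class and its burst parameter are hypothetical names, and a production scheduler would apply the same rule on top of the Redis queue rather than in-process deques.

```python
from collections import deque


class WeightedFairQueue:
    """Serve up to `burst` high-priority requests, then one low-priority one."""

    def __init__(self, burst=5):
        self.burst = burst
        self.high = deque()
        self.low = deque()
        self.high_streak = 0  # high-priority requests served since the last low one

    def put(self, request_id, high_priority):
        (self.high if high_priority else self.low).append(request_id)

    def get(self):
        """Return the next request id, or None when both queues are empty."""
        low_turn = self.high_streak >= self.burst and bool(self.low)
        if self.high and not low_turn:
            self.high_streak += 1
            return self.high.popleft()
        if self.low:
            self.high_streak = 0  # a low-priority request resets the streak
            return self.low.popleft()
        return None


if __name__ == "__main__":
    q = WeightedFairQueue(burst=5)
    for i in range(7):
        q.put(f"h{i}", high_priority=True)
    for i in range(2):
        q.put(f"l{i}", high_priority=False)
    order = []
    while (item := q.get()) is not None:
        order.append(item)
    print(order)
```

With seven high-priority and two low-priority requests queued, the first low-priority request is dispatched after five high-priority ones instead of waiting for the high-priority queue to drain.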

Related guides

  • implement-rate-limiting-ai-apis
  • setup-authentication-local-ai-endpoints
  • deploy-ai-kubernetes-gpu-nodes