RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Production Local AI Deployment

22. Multi-Tenant Serving

Chapter 22 of 24 · 25 min
KEY INSIGHT

Multi-tenant serving architectures must enforce resource fairness explicitly, because some tenants will inevitably submit workloads that attempt to monopolize shared resources.

### Tenant Isolation with Kubernetes Namespaces

```yaml
# tenant-a-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    tenant: tenant-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    # Extended resources such as GPUs accept only the requests.* quota prefix
    requests.nvidia.com/gpu: "4"
    requests.memory: "64Gi"
    limits.memory: "64Gi"
    requests.cpu: "16"
    pods: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-a-limits
  namespace: tenant-a
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: 1
      defaultRequest:
        nvidia.com/gpu: 1
      max:
        nvidia.com/gpu: 2
```

### Model Multitenancy with Shared GPU

```python
# multi_tenant_inference.py
import time
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict

import torch


@dataclass
class TenantConfig:
    tenant_id: str
    model_name: str
    max_batch_size: int
    memory_limit_gb: int
    rate_limit_rpm: int


class MultiTenantInferenceServer:
    def __init__(self):
        self.tenants: Dict[str, TenantConfig] = {}
        # Timestamps of each tenant's requests within the last 60 seconds
        self.recent_requests: Dict[str, Deque[float]] = defaultdict(deque)
        self.model_cache: Dict[str, torch.nn.Module] = {}

    async def route_request(self, tenant_id: str, request_data: dict) -> dict:
        tenant = self.tenants.get(tenant_id)
        if not tenant:
            raise ValueError(f"Unknown tenant: {tenant_id}")

        # Rate limiting: sliding one-minute window per tenant
        now = time.monotonic()
        window = self.recent_requests[tenant_id]
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= tenant.rate_limit_rpm:
            raise ValueError(f"Rate limit exceeded for {tenant_id}")
        window.append(now)

        # Memory enforcement. torch.cuda.memory_allocated() reports the whole
        # process on the current device, not this tenant's share.
        gpu_memory = torch.cuda.memory_allocated()
        if gpu_memory > (tenant.memory_limit_gb * 1e9):
            self._evict_lru_models(tenant_id)  # helper not shown here

        # Process with tenant's assigned model
        model = self._load_model(tenant.model_name)
        return await self._predict(model, request_data)  # helper not shown here

    def _load_model(self, model_name: str) -> torch.nn.Module:
        # Load each model once and share it across requests
        if model_name not in self.model_cache:
            self.model_cache[model_name] = self._load_from_disk(model_name)  # helper not shown here
        return self.model_cache[model_name]
```

### Isolated Inference with Model Partitioning

```python
# partitioned_inference.py
from typing import Dict, Tuple


class PartitionedInference:
    """GPU memory partitioning for isolated tenant workloads."""

    @staticmethod
    def calculate_partition_sizes(
        total_memory_gb: float,
        tenant_allocations: Dict[str, float],
    ) -> Dict[str, Tuple[float, float]]:
        """Map each tenant to (offset_gb, size_gb); allocations are percentages."""
        if sum(tenant_allocations.values()) > 100:
            raise ValueError("Tenant allocations exceed 100% of GPU memory")

        partitions = {}
        current_offset_gb = 0.0

        # Place the largest allocations first
        for tenant_id, allocation_pct in sorted(
            tenant_allocations.items(),
            key=lambda x: x[1],
            reverse=True,
        ):
            partition_size = total_memory_gb * allocation_pct / 100
            partitions[tenant_id] = (current_offset_gb, partition_size)
            current_offset_gb += partition_size

        return partitions

    def allocate_tenant_memory(
        self,
        tenant_id: str,
        partition_start: float,
        partition_size: float,
    ):
        """Set CUDA memory allocator for specific tenant."""
        # In production, use custom CUDA memory allocator
        # that respects tenant boundaries
        pass
```

### Tenant Billing Metrics

```python
# tenant_billing.py
from prometheus_client import Counter

tenant_compute_usage = Counter(
    'tenant_gpu_compute_seconds_total',
    'Total GPU compute time per tenant',
    ['tenant_id', 'model_name']
)

tenant_request_count = Counter(
    'tenant_requests_total',
    'Total requests per tenant',
    ['tenant_id', 'status']
)


def generate_tenant_invoice(tenant_id: str, period_days: int) -> dict:
    """Generate billing report for tenant."""
    # get_metric_sum is not defined in this chapter; it is assumed to query
    # the metrics backend for a counter's increase over the period.
    compute_seconds = get_metric_sum(
        'tenant_gpu_compute_seconds_total',
        labels={'tenant_id': tenant_id},
        period=f'{period_days}d'
    )
    requests = get_metric_sum(
        'tenant_requests_total',
        labels={'tenant_id': tenant_id},
        period=f'{period_days}d'
    )

    # Flat-rate pricing example
    compute_cost = compute_seconds * 0.0001  # $0.36/hour
    request_cost = requests * 0.0002  # $0.20/1000 requests

    return {
        'tenant_id': tenant_id,
        'compute_seconds': compute_seconds,
        'request_count': requests,
        'compute_cost': compute_cost,
        'request_cost': request_cost,
        'total_cost': compute_cost + request_cost
    }
```
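
The per-second and per-request rates in the billing example can be checked by hand. The sketch below reuses those two rates on a made-up usage total; `price_usage`, the constant names, and the tenant's usage figures are hypothetical and exist only for this illustration. Results are rounded to four decimals so floating-point noise does not leak into the invoice.

```python
# Worked check of the billing rates (rate values copied from the example above).
COMPUTE_RATE_PER_SECOND = 0.0001  # USD per GPU-second, i.e. $0.36/hour
REQUEST_RATE = 0.0002             # USD per request, i.e. $0.20/1000 requests


def price_usage(compute_seconds: float, requests: int) -> dict:
    """Price raw usage totals with the two rates above (hypothetical helper)."""
    compute_cost = compute_seconds * COMPUTE_RATE_PER_SECOND
    request_cost = requests * REQUEST_RATE
    return {
        "compute_cost": round(compute_cost, 4),
        "request_cost": round(request_cost, 4),
        "total_cost": round(compute_cost + request_cost, 4),
    }


# Hypothetical tenant: 10 GPU-hours and 50,000 requests in the billing period.
invoice = price_usage(compute_seconds=10 * 3600, requests=50_000)
print(invoice)
```

For this usage the compute charge is 36,000 s × $0.0001 = $3.60 and the request charge is 50,000 × $0.0002 = $10.00, so requests dominate the bill even though the tenant held a GPU for ten hours.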

Serving multiple tenants on shared infrastructure reduces per-tenant costs through resource multiplexing while introducing isolation, fair resource allocation, and billing challenges. Different isolation levels serve different use cases: logical isolation suits trusted tenants while hardware separation may be required for security-sensitive workloads.
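
One way to picture fair allocation under logical isolation is a scheduler that rotates through tenants rather than serving requests in arrival order. The following is a minimal sketch, not the course's implementation: `RoundRobinScheduler`, its methods, and the request IDs are all hypothetical, and it assumes every request costs roughly the same to serve.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class RoundRobinScheduler:
    """Serve one queued request per active tenant per turn (hypothetical sketch)."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {}
        self.order: Deque[str] = deque()  # tenants that currently have queued work

    def submit(self, tenant_id: str, request_id: str) -> None:
        queue = self.queues.setdefault(tenant_id, deque())
        if not queue:
            self.order.append(tenant_id)  # tenant just became active
        queue.append(request_id)

    def next_request(self) -> Tuple[str, str]:
        """Pop the next request, rotating through active tenants."""
        if not self.order:
            raise LookupError("no queued requests")
        tenant_id = self.order.popleft()
        queue = self.queues[tenant_id]
        request_id = queue.popleft()
        if queue:
            self.order.append(tenant_id)  # still has work: back of the line
        return tenant_id, request_id


scheduler = RoundRobinScheduler()
for i in range(1, 6):
    scheduler.submit("tenant-a", f"a{i}")  # tenant-a floods the queue first
for i in range(1, 3):
    scheduler.submit("tenant-b", f"b{i}")

served: List[str] = []
while scheduler.order:
    served.append(scheduler.next_request()[1])
print(served)  # ['a1', 'b1', 'a2', 'b2', 'a3', 'a4', 'a5']
```

Although tenant-a enqueued five requests before tenant-b submitted anything, tenant-b's two requests are served second and fourth instead of sixth and seventh. If request costs vary widely, a weighted or cost-aware variant would be needed; that is outside this sketch.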


EXERCISE

Implement a multi-tenant inference service that enforces per-tenant rate limits and caps each tenant's GPU count and memory with a Kubernetes ResourceQuota. Add tenant-specific Prometheus metrics for compute usage and request counts. Verify that one tenant's runaway workload does not impact another tenant's latency SLO.
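
For the verification step, one possible approach is to record a well-behaved tenant's latencies twice, once alone and once while another tenant floods the service, and compare the 95th percentiles. The sketch below assumes that setup; `p95`, `isolation_holds`, the 200 ms SLO, the 20% regression allowance, and the sample values are all hypothetical.

```python
from typing import List


def p95(latencies_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers only
    return ordered[rank - 1]


def isolation_holds(
    baseline_ms: List[float],
    under_load_ms: List[float],
    slo_ms: float,
    max_regression: float = 1.2,
) -> bool:
    """True if the victim's p95 stays within the SLO and close to its baseline."""
    base, loaded = p95(baseline_ms), p95(under_load_ms)
    return loaded <= slo_ms and loaded <= base * max_regression


# Hypothetical samples for the quiet tenant, in milliseconds.
baseline = [100.0] * 19 + [150.0]
isolated = [110.0] * 19 + [400.0]  # one outlier, p95 barely moves
starved = [100.0] * 15 + [500.0] * 5  # a quarter of requests stall

print(isolation_holds(baseline, isolated, slo_ms=200.0))  # True
print(isolation_holds(baseline, starved, slo_ms=200.0))   # False
```

Checking against both the absolute SLO and the tenant's own baseline catches regressions that stay under the SLO but still show the noisy tenant leaking through.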
