Production Local AI Deployment
Learn production local AI deployment through RunLocalAI's practical lens: deployment, Docker, Kubernetes and monitoring, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.
Why this course matters
Production Local AI Deployment is for builders turning local models into working tools, agents and retrieval systems. It connects deployment, Docker, Kubernetes, monitoring and scaling to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?
What you will be able to do
By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.
How to use this course
Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as Production Mindset, Dockerfile Optimization, Multi-Stage Builds and Docker Compose for AI Stack and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.
- 01 Production Mindset: Production deployments require automation, observability, and documentation *before* deployment, not after. (15 min)
- 02 Dockerfile Optimization: Dockerfile optimization prioritizes layer caching efficiency and minimal runtime footprint, not code organization preferences. (15 min)
- 03 Multi-Stage Builds: Multi-stage builds separate compilation from execution, enabling minimal production images while preserving the build pipeline. (20 min)
- 04 Docker Compose for AI Stack: Docker Compose definitions become infrastructure as code, establishing reproducible multi-service deployments that mirror production Kubernetes topologies. (15 min)
- 05 GPU Access in Docker: GPU access requires a host driver that is compatible with the CUDA version inside the container, declared through Docker runtime configuration and resource requests. (20 min)
- 06 Resource Limits: Resource limits make resource consumption predictable by establishing ceilings that trigger remedial action rather than silent degradation. (20 min)
- 07 Kubernetes Basics: Kubernetes operates through declarative specifications applied by reconciliation loops, not imperative commands executed at request time. (20 min)
- 08 GPU Node Selection: GPU node selection combines labels for matching, taints for isolation, and topology spread for availability across failure domains. (20 min)
- 09 Kubernetes Deployments: Deployments abstract ReplicaSet management, providing controlled updates and rollback capabilities that maintain service availability during infrastructure changes. (20 min)
- 10 Services and Ingress: Service and Ingress resources create stable network abstractions over transient pods, enabling service discovery without coupling to pod lifecycle details. (20 min)
- 11 ConfigMaps and Secrets: ConfigMaps and Secrets decouple configuration from container images, enabling environment-specific behavior without rebuilds or credential exposure. (25 min)
- 12 Horizontal Pod Autoscaling: Horizontal Pod Autoscaler matches replica count to demand, maintaining quality of service through automatic capacity adjustment while respecting scaling bounds. (25 min)
- 13 Load Balancing: Load balancers for inference serving must route based on predicted compute cost, not request volume, because identical request counts can produce dramatically different computational loads. (20 min)

### Traffic Distribution Patterns

The `nginx` load balancer handles inference routing with upstream blocks that track backend health:

```nginx
upstream inference_cluster {
    least_conn;
    server model-server-1:8000 weight=3;
    server model-server-2:8000 weight=3;
    server model-server-3:8000 weight=2;
    keepalive 32;
}

server {
    listen 443 ssl;
    # ssl_certificate and ssl_certificate_key are also required; paths omitted here.

    location /predict {
        proxy_pass http://inference_cluster;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header X-Request-Length $request_length;
        # The connect timeout usually cannot exceed 75s; long inference
        # waits belong on the read timeout instead.
        proxy_connect_timeout 10s;
        proxy_read_timeout 300s;
    }
}
```

The `least_conn` directive routes new requests to the backend with the fewest active connections (taking server weights into account), which provides better distribution for variable-latency inference workloads than round-robin algorithms.

### gRPC Load Balancing Considerations

gRPC's HTTP/2 multiplexing complicates load balancing because a single TCP connection carries multiple streams, so a connection-level balancer pins all of a client's requests to one backend. One solution is client-side load balancing with gRPC's built-in `round_robin` policy:

```python
import json

import grpc

# Client-side round robin: the DNS resolver returns every backend address
# and the channel opens one subchannel per address.
service_config = json.dumps({"loadBalancingConfig": [{"round_robin": {}}]})

channel = grpc.insecure_channel(
    "dns:///inference-cluster.consul:8001",
    options=[
        ("grpc.lb_policy_name", "round_robin"),
        ("grpc.service_config", service_config),
    ],
)
```

### Health Check Configuration

Effective health checks prevent routing requests to failing or overloaded model servers:

```yaml
# Illustrative schema; field names depend on the load balancer or sidecar in use.
health_check:
  enabled: true
  interval: 5s
  timeout: 3s
  healthy_threshold: 2
  unhealthy_threshold: 3
  # Inference-specific checks
  check:
    path: /health
    expected_status: 200
    expected_response: "OK"
  # Abort if GPU memory or queue depth exceeds threshold
  abort_on:
    gpu_memory_percent: 95
    queue_depth: 100
```
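
To make the `least_conn` behaviour concrete, the sketch below models weighted least-connections selection in plain Python. It is a simplified illustration, not nginx's implementation: `Backend` and `pick_backend` are hypothetical names, and the selection rule (lowest ratio of active connections to weight) is an assumption about how weights enter the decision.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    weight: int
    active: int = 0  # requests currently in flight


def pick_backend(backends):
    """Return the backend with the lowest active/weight ratio (first wins on ties)."""
    return min(backends, key=lambda b: b.active / b.weight)


backends = [
    Backend("model-server-1", weight=3, active=6),
    Backend("model-server-2", weight=3, active=3),
    Backend("model-server-3", weight=2, active=3),
]

chosen = pick_backend(backends)
chosen.active += 1  # the new request is now in flight on the chosen backend
print(chosen.name)
```

Because the choice depends on in-flight work rather than on how many requests each server has received, a backend stuck on one long generation stops receiving traffic until it catches up.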
- 14 Prometheus Metrics: Metrics collection overhead must remain below 1% of inference compute capacity; aggressive sampling and aggregated histograms prevent instrumentation from becoming a bottleneck. (20 min)

### Prometheus Integration

The Triton Inference Server exposes Prometheus metrics on port 8002:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'inference-servers'
    static_configs:
      - targets:
          - 'model-server-1:8002'
          - 'model-server-2:8002'
          - 'model-server-3:8002'
    metrics_path: /metrics
    relabel_configs:
      # Strip the port so the instance label is just the host name
      - source_labels: [__address__]
        target_label: instance
        regex: '(.*):\d+'
        replacement: '${1}'
```

### Custom Metrics with Prometheus Client

Expose application-specific metrics for model inference:

```python
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
)
from starlette.applications import Starlette
from starlette.responses import Response
from starlette.routing import Route

# Request metrics
inference_requests = Counter(
    'inference_requests_total',
    'Total inference requests',
    ['model_name', 'status']
)
inference_latency = Histogram(
    'inference_latency_seconds',
    'Inference request latency',
    ['model_name'],
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)
)

# Resource metrics
gpu_memory_used = Gauge(
    'gpu_memory_used_bytes',
    'GPU memory currently in use',
    ['device_id']
)
request_queue_depth = Gauge(
    'inference_queue_depth',
    'Number of requests waiting for processing',
    ['model_name']
)


async def metrics_endpoint(request):
    # Serve all registered metrics in the Prometheus text format
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)


routes = [
    Route('/metrics', metrics_endpoint),
]
app = Starlette(routes=routes)
```

### Alerting Rules

Define alerting thresholds for inference infrastructure:

```yaml
groups:
  - name: inference_alerts
    rules:
      - alert: HighInferenceLatency
        # histogram_quantile needs the per-bucket rates, aggregated by le
        expr: histogram_quantile(0.95, sum by (le) (rate(inference_latency_seconds_bucket[5m]))) > 5.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency exceeds 5 seconds"
      - alert: GPUMemoryExhausted
        # Assumes a gpu_memory_total_bytes gauge is exported alongside gpu_memory_used_bytes
        expr: (gpu_memory_used_bytes / gpu_memory_total_bytes) > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage above 95%"
      - alert: RequestQueueBacklog
        expr: inference_queue_depth > 100
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Request queue depth above 100"
```
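
Latency alerts and dashboards built on histograms rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts. The sketch below shows the idea with linear interpolation inside the matching bucket. It is a simplified model written for illustration, not Prometheus's code: `estimate_quantile` is a hypothetical name, the sample counts are invented, and edge cases that Prometheus handles are ignored.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
    with at least one finite bucket and a final +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: report the highest finite bound
                return buckets[index - 1][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up counts: 10 requests <= 0.1s, 60 <= 0.5s, 90 <= 1.0s, 100 in total
sample = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = estimate_quantile(0.50, sample)
p95 = estimate_quantile(0.95, sample)
print(p50, p95)
```

The estimate can never be finer than the bucket boundaries, which is why bucket edges should bracket the latency thresholds you alert on.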
- 15 Grafana Dashboards: Dashboard design should prioritize the critical path: model availability, inference latency, and error rates appear first; secondary metrics appear in expandable sections. (20 min)

### Dashboard Architecture

Organize dashboards hierarchically from overview to detail:

```
Production Inference
├── Overview (system-wide health)
├── Model Performance
│   ├── LLM Metrics
│   ├── Embedding Metrics
│   └── Classification Metrics
├── Infrastructure
│   ├── GPU Utilization
│   ├── Memory Usage
│   └── Network I/O
└── Cost Analysis
```

### Template Variables

Enable flexible dashboard filtering with template variables (excerpt from `dashboard.json`):

```json
{
  "templating": {
    "list": [
      {
        "name": "model",
        "type": "query",
        "query": "label_values(inference_requests_total, model_name)",
        "multi": true
      },
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(gpu_utilization_percent, instance)",
        "multi": true
      }
    ]
  }
}
```

### Essential Panels

**Inference Latency Distribution Panel:**

```json
{
  "title": "Inference Latency P50/P95/P99",
  "type": "graph",
  "targets": [
    {
      "expr": "histogram_quantile(0.50, sum by (le) (rate(inference_latency_seconds_bucket{model_name=~\"$model\"}[5m])))",
      "legendFormat": "P50"
    },
    {
      "expr": "histogram_quantile(0.95, sum by (le) (rate(inference_latency_seconds_bucket{model_name=~\"$model\"}[5m])))",
      "legendFormat": "P95"
    },
    {
      "expr": "histogram_quantile(0.99, sum by (le) (rate(inference_latency_seconds_bucket{model_name=~\"$model\"}[5m])))",
      "legendFormat": "P99"
    }
  ],
  "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
}
```

The queries filter on `model_name`, the label the metrics are exported with, and use `=~` because `$model` is a multi-value variable.

**GPU Utilization Heatmap:**

```json
{
  "title": "GPU Utilization Heatmap",
  "type": "heatmap",
  "targets": [
    {
      "expr": "gpu_utilization_percent{instance=~\"$instance\"}",
      "bucketSet": {
        "colors": ["#00b894", "#fdcb6e", "#d63031"]
      }
    }
  ]
}
```

Utilization is a gauge, so it is plotted directly rather than wrapped in `rate()`.

### Automated Alert Visualization

Add status panels showing active alert counts:

```json
{
  "title": "Alert Status",
  "type": "stat",
  "targets": [
    {
      "expr": "count(ALERTS{alertstate=\"firing\"})",
      "legendFormat": "Firing"
    },
    {
      "expr": "count(ALERTS{alertstate=\"pending\"})",
      "legendFormat": "Pending"
    }
  ],
  "options": {
    "colorMode": "background",
    "colorValue": true
  }
}
```
- 16 CI/CD Pipeline: Pipeline design must handle both application code updates and model artifact updates as first-class citizens, with separate validation stages for each artifact type. (20 min)

### GitHub Actions Pipeline

```yaml
# .github/workflows/inference-deploy.yml
name: Inference Model CI/CD

on:
  push:
    branches: [main]
    paths:
      - 'models/**'
      - 'src/**'
      - 'Dockerfile'
      - 'requirements.txt'
  pull_request:
    branches: [main]

env:
  REGISTRY: registry.internal
  IMAGE_NAME: inference-server
  MODEL_REGISTRY: s3://model-artifacts

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build application image
        uses: docker/build-push-action@v5
        with:
          context: ./src
          push: false
          load: true  # load into the local daemon so `docker run` can find the image
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:test
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
      - name: Run unit tests
        run: |
          docker run --rm ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:test \
            pytest tests/unit -v

  validate-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download model artifacts
        run: |
          aws s3 sync ${{ env.MODEL_REGISTRY }}/staging/ ./models/
      - name: Validate model schema
        run: |
          python scripts/validate_model.py \
            --model-dir ./models \
            --expected-input "input_ids:float32[?,512]" \
            --expected-output "logits:float32[?,512,vocab_size]"
      - name: Benchmark model performance
        run: |
          python scripts/benchmark.py \
            --model ./models/model.pt \
            --batch-sizes 1,4,8,16 \
            --target-throughput 100

  deploy-staging:
    needs: [build, validate-model]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      # Assumes cluster credentials are configured for this job and that an
      # image tagged with the commit SHA has been pushed to the registry.
      - name: Deploy to staging
        run: |
          kubectl set image deployment/inference-server \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=staging
```

### Model Registry Integration

Automate model promotion through stages based on validation results:

```python
# scripts/promote_model.py
import json

import boto3


def promote_model(model_name: str, from_stage: str, to_stage: str):
    s3 = boto3.client('s3')
    bucket = 'model-artifacts'

    # Get model metadata (get_object returns a stream that must be read and parsed)
    metadata_key = f"{from_stage}/{model_name}/metadata.json"
    response = s3.get_object(Bucket=bucket, Key=metadata_key)
    metadata = json.loads(response['Body'].read())

    # Check validation results. The key names are assumptions; match them
    # to whatever the validation and benchmark jobs write.
    validation_passed = (
        metadata.get('Schema') == 'PASSED'
        and metadata.get('Benchmark') == 'PASSED'
    )
    if not validation_passed:
        raise ValueError(f"Model {model_name} validation incomplete")

    # Copy every object under the model prefix to the target stage
    source_prefix = f"{from_stage}/{model_name}/"
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=source_prefix):
        for obj in page.get('Contents', []):
            target_key = f"{to_stage}/{model_name}/" + obj['Key'][len(source_prefix):]
            s3.copy({'Bucket': bucket, 'Key': obj['Key']}, bucket, target_key)

    # Update latest pointer
    s3.put_object(
        Bucket=bucket,
        Key=f"latest/{model_name}",
        Body=f"{to_stage}/{model_name}".encode()
    )
```
- 17 Canary Deployments: Effective canary deployments treat traffic percentage as a dynamic control, starting at 1-5% and increasing only when real-time metrics confirm equivalent or improved performance. (20 min)

### Argo Rollouts Implementation

```yaml
# canary-deployment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inference-rollout
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      # nginx traffic routing needs a canary and a stable Service;
      # these two names are placeholders.
      canaryService: inference-canary
      stableService: inference-stable
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: inference-analysis
            args:
              - name: service-name
                value: inference-rollout
      canaryMetadata:
        labels:
          version: canary
      stableMetadata:
        labels:
          version: stable
      trafficRouting:
        nginx:
          stableIngress: inference-stable-internal
          annotationPrefix: nginx.ingress.kubernetes.io
          # The controller creates the canary Ingress and manages its
          # canary-weight annotation from the setWeight steps.
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: registry.internal/inference-server:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```

### Analysis Templates

Define automated validation criteria:

```yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: inference-analysis
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: latency-check
      interval: 5m
      # The metric is recorded in seconds, so the threshold is 1.5s
      successCondition: result[0] <= 1.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(inference_latency_seconds_bucket{
                export_service="{{args.service-name}}"
              }[5m])) by (le)
            )
    - name: error-rate-check
      interval: 5m
      successCondition: result[0] < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(inference_requests_total{
              export_service="{{args.service-name}}",
              status="error"
            }[5m]))
            /
            sum(rate(inference_requests_total{
              export_service="{{args.service-name}}"
            }[5m]))
```

### Manual Promotion

```bash
# Pause canary progression for manual review
kubectl argo rollouts pause inference-rollout -n production

# Advance to the next step once the review passes
kubectl argo rollouts promote inference-rollout -n production

# Full promotion, skipping any remaining steps
kubectl argo rollouts promote inference-rollout --full -n production
```
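
The idea of treating traffic weight as a control that only moves when metrics allow can be sketched as a small decision function. This is an illustration of the logic, not how Argo Rollouts is implemented: `CanaryState`, `next_action`, the weight steps and the default thresholds are all assumptions chosen to mirror the analysis criteria above.

```python
from dataclasses import dataclass


@dataclass
class CanaryState:
    weight: int = 5    # percent of traffic currently on the canary
    failures: int = 0  # failed checks so far


def next_action(state, p95_seconds, error_rate,
                max_p95=1.5, max_error_rate=0.01,
                failure_limit=3, steps=(5, 25, 50, 100)):
    """Return 'advance', 'hold', 'abort' or 'promoted' and update state in place."""
    healthy = p95_seconds <= max_p95 and error_rate < max_error_rate
    if not healthy:
        state.failures += 1
        if state.failures > failure_limit:
            state.weight = 0  # send all traffic back to stable
            return "abort"
        return "hold"
    higher = [step for step in steps if step > state.weight]
    if not higher:
        return "promoted"
    state.weight = higher[0]
    return "advance"


state = CanaryState()
first = next_action(state, p95_seconds=1.2, error_rate=0.001)   # healthy check
second = next_action(state, p95_seconds=2.0, error_rate=0.001)  # latency too high
print(first, second, state.weight)
```

Keeping the decision in a pure function makes the promotion policy easy to unit test before it is encoded in an analysis template.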
- 18 Rollback Strategies: Rollback automation must be tested as frequently as deployment automation; procedures that fail when invoked under pressure defeat their purpose entirely. (20 min)

### Kubernetes Rollback

```bash
# Immediate rollback to previous revision
kubectl rollout undo deployment/inference-server -n production

# Rollback to specific revision
kubectl rollout history deployment/inference-server -n production
kubectl rollout undo deployment/inference-server \
  --to-revision=3 -n production

# Watch rollback progress
kubectl rollout status deployment/inference-server -n production --timeout=300s
```

### Automated Rollback with Argo Rollouts

```yaml
# analysis-template.yaml (abort criteria)
spec:
  metrics:
    - name: error-rate-critical
      interval: 2m
      failureCondition: result[0] > 0.05
      provider:
        prometheus:
          address: http://prometheus:9090
          # The query returns the raw ratio; the threshold lives in failureCondition
          query: |
            sum(rate(inference_requests_total{status="error"}[2m]))
            /
            sum(rate(inference_requests_total[2m]))
    - name: latency-critical
      interval: 2m
      # The latency metric is in seconds, so 3 means 3000 ms
      failureCondition: result[0] > 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(inference_latency_seconds_bucket[2m])) by (le))
    - name: throughput-degraded
      interval: 5m
      failureCondition: result[0] < 50
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(inference_requests_total[5m]))
```

### Grace Period Configuration

Configure appropriate rollback windows that allow post-deployment observation:

```yaml
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: inference-analysis
        startingStep: 2  # background analysis begins after the second step
        args:
          - name: service-name
            value: inference-rollout
```

### Rolling Update Rollback

```python
# scripts/rollback_handler.py
import subprocess
from datetime import datetime, timedelta, timezone

from kubernetes import config
from kubernetes.client import AppsV1Api

# get_error_rate and log_event are project helpers defined elsewhere.
config.load_kube_config()  # use config.load_incluster_config() inside a pod


def check_rollback_criteria(deployment_name: str, namespace: str) -> bool:
    """Evaluate whether automatic rollback should trigger."""
    api = AppsV1Api()
    rollout = api.read_namespaced_deployment(
        name=deployment_name, namespace=namespace
    )

    # Check for excessive error rates
    error_rate = get_error_rate(deployment_name, namespace)
    if error_rate > 0.05:
        return True

    # Check for timeout conditions
    last_update = rollout.status.conditions[-1].last_transition_time
    elapsed = datetime.now(timezone.utc) - last_update
    if elapsed > timedelta(minutes=30):
        available = rollout.status.available_replicas or 0
        if available < rollout.spec.replicas:
            return True
    return False


def execute_rollback(deployment_name: str, namespace: str):
    """Execute rollback procedure."""
    # apps/v1 has no rollback endpoint, so reuse kubectl's implementation.
    # Patching the restartedAt annotation would only restart the same
    # (possibly broken) revision.
    subprocess.run(
        ["kubectl", "rollout", "undo",
         f"deployment/{deployment_name}", "-n", namespace],
        check=True,
    )

    # Log rollback event
    log_event(
        f"Automatic rollback executed for {deployment_name}",
        severity="critical"
    )
```
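
Rollback criteria are easier to test frequently when the decision is separated from the cluster calls. The sketch below restates the criteria from this chapter (error rate above 5%, or an update stalled for more than 30 minutes with missing replicas) as a pure function. `DeploymentSnapshot` and `should_roll_back` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentSnapshot:
    error_rate: float            # fraction of failed requests
    minutes_since_update: float  # time since the last status transition
    available_replicas: int
    desired_replicas: int


def should_roll_back(snapshot, max_error_rate=0.05, stall_minutes=30):
    """Return True when the snapshot meets either rollback criterion."""
    if snapshot.error_rate > max_error_rate:
        return True
    stalled = snapshot.minutes_since_update > stall_minutes
    return stalled and snapshot.available_replicas < snapshot.desired_replicas


cases = {
    "error spike": DeploymentSnapshot(0.10, 1, 3, 3),
    "stalled rollout": DeploymentSnapshot(0.01, 45, 2, 3),
    "healthy": DeploymentSnapshot(0.01, 45, 3, 3),
    "still rolling": DeploymentSnapshot(0.01, 5, 2, 3),
}
decisions = {name: should_roll_back(snap) for name, snap in cases.items()}
print(decisions)
```

A table of cases like this can run in CI on every change, so the rollback policy is exercised as often as the deployment pipeline itself.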
- 19 High Availability: High availability is not achieved by simply running multiple replicas; each component (networking, storage, and compute) must have redundant paths with automatic failover. (20 min)

### Multi-AZ Model Server Deployment

```yaml
# inference-ha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      # Spread replicas across zones and prefer distinct nodes
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-server
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: inference-server
                topologyKey: kubernetes.io/hostname
      containers:
        - name: inference
          image: registry.internal/inference-server:v2.1.0
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: MODEL_NAME
              value: "production-model"
            - name: GPU_DEVICE_IDS
              value: "0"
```

### Redis HA for Request Caching

```yaml
# redis-ha.yaml
# Assumes the OpsTree Redis operator; check the API group, version and
# field names against the CRD installed in your cluster.
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: inference-cache
  namespace: production
spec:
  clusterSize: 3
  persistence:
    enabled: false
  kubernetesConfig:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
  tls:
    enabled: true
    secretName: redis-tls-cert
```

### Database Connection Pooling for Model Metadata

```python
# database_pool.py
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# pool_mode is a PgBouncer setting, not a PostgreSQL parameter, so it is
# configured on the pooler rather than passed through connect_args.
engine = create_engine(
    "postgresql://user:pass@pg-primary:5432/inference",
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,  # drop dead connections after a failover
)

# Create the replica engine once so its connection pool is reused
read_engine = create_engine(
    "postgresql://user:pass@pg-replica:5432/inference",
    poolclass=QueuePool,
    pool_size=10,
    pool_pre_ping=True,
)


def get_read_node():
    """Route read queries to replica."""
    return read_engine
```
- 20 Disaster Recovery: Disaster recovery testing must include actual restoration procedures, not just backup verification; plans that have never been executed contain undocumented failure modes. (20 min)

### Backup Strategy

```bash
#!/bin/bash
# backup-models.sh: model artifact backup script
set -euo pipefail

S3_BUCKET="s3://inference-backups/models"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="${S3_BUCKET}/point-in-time/${TIMESTAMP}/"

# Create point-in-time backup.
# STANDARD_IA keeps objects immediately readable for verification and restore;
# GLACIER objects need a restore request before they can be read.
aws s3 sync s3://model-artifacts/production/ "${BACKUP_PATH}" \
  --storage-class STANDARD_IA

# Verify the backup: a dry run lists any object still missing or different in size
aws s3 sync s3://model-artifacts/production/ "${BACKUP_PATH}" --dryrun

# Create backup manifest
MODELS_JSON=$(aws s3api list-objects-v2 \
  --bucket model-artifacts --prefix production/ \
  --query 'Contents[].{key: Key, size: Size, etag: ETag}' --output json)
OBJECT_COUNT=$(aws s3 ls "${BACKUP_PATH}" --recursive --summarize \
  | awk '/Total Objects/ {print $3}')

cat > /tmp/backup_manifest.json <<EOF
{
  "timestamp": "${TIMESTAMP}",
  "models": ${MODELS_JSON},
  "object_count": ${OBJECT_COUNT}
}
EOF

aws s3 cp /tmp/backup_manifest.json "${S3_BUCKET}/manifests/${TIMESTAMP}.json"
```

### Database Backup

```bash
#!/bin/bash
# backup-db.sh: database point-in-time recovery backup
set -euo pipefail

PGHOST="pg-primary.internal"
PGDATABASE="inference"
WAL_S3_PATH="s3://inference-backups/wal/"

# Configure continuous archiving.
# wal_level and archive_mode only take effect after a server restart.
psql -h "$PGHOST" -U postgres <<EOF
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 3;
ALTER SYSTEM SET wal_keep_size = 1024;
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'aws s3 cp %p ${WAL_S3_PATH}%f';
EOF

# Base backup (writes base.tar.gz and pg_wal.tar.gz)
pg_basebackup \
  -h "$PGHOST" \
  -U postgres \
  -D "/tmp/basebackup_$(date +%Y%m%d)" \
  -Ft \
  -z \
  -P \
  -Xs
```

### Recovery Runbook

````markdown
# DR-001: Full System Recovery

## Prerequisites

- [ ] New infrastructure provisioned
- [ ] Network connectivity verified
- [ ] Access credentials validated

## Restore Order

### 1. Database (RPO: 5 minutes target)

```
cd /tmp/restoration
aws s3 sync s3://inference-backups/db/latest/ ./db/
rm -rf /var/lib/postgresql/data/*
tar -xzf ./db/base.tar.gz -C /var/lib/postgresql/data/
tar -xzf ./db/pg_wal.tar.gz -C /var/lib/postgresql/data/pg_wal/
pg_ctl start -D /var/lib/postgresql/data/
```

### 2. Model Artifacts

```
# Replace <TIMESTAMP> with the backup chosen from the manifests
aws s3 sync s3://inference-backups/models/point-in-time/<TIMESTAMP>/ \
  s3://model-artifacts/production/
```

### 3. Configuration State

```
kubectl apply -f ./configs/namespace.yaml
kubectl apply -f ./configs/secrets.yaml
kubectl apply -f ./configs/configmaps.yaml
```

### 4. Inference Services

```
kubectl apply -f ./inference/deployment.yaml
kubectl apply -f ./inference/service.yaml
kubectl rollout status deployment/inference-server
```

### Verification

- [ ] Health endpoints responding
- [ ] Basic inference test passes
- [ ] Prometheus metrics flowing
- [ ] Alert channels active
````
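
A restore test is only meaningful if the restored bytes are compared against what was backed up. The sketch below is one minimal way to do that with SHA-256 checksums on local files. It is an illustration under stated assumptions: `build_manifest` and `verify_restore` are hypothetical helpers, and a real run would point them at the downloaded model directory instead of temporary files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large model files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*")) if path.is_file()
    }


def verify_restore(manifest, restored_root):
    """Return the sorted list of files that are missing or differ after a restore."""
    restored_root = Path(restored_root)
    problems = []
    for relative, expected in manifest.items():
        candidate = restored_root / relative
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(relative)
    return sorted(problems)


with tempfile.TemporaryDirectory() as tmp:
    source, restored = Path(tmp, "source"), Path(tmp, "restored")
    source.mkdir()
    restored.mkdir()
    (source / "model.bin").write_bytes(b"weights-v1")
    (restored / "model.bin").write_bytes(b"weights-v1")
    manifest = build_manifest(source)
    clean = verify_restore(manifest, restored)
    (restored / "model.bin").write_bytes(b"corrupted")  # simulate a bad restore
    damaged = verify_restore(manifest, restored)

print(clean, damaged)
```

Storing the manifest next to each point-in-time backup lets the verification step in the runbook become an automated check rather than a manual tick box.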
- 21 Cost Optimization: Cost optimization without measurement produces either minimal savings or service degradation; instrumentation must precede any infrastructure changes that affect capacity or performance. (20 min)

### GPU Utilization Analysis

```python
# scripts/cost_analysis.py


def calculate_gpu_power_cost(utilization_data: list) -> dict:
    """Calculate the energy cost of inference from GPU utilization samples.

    NVIDIA A10G: 150W TDP, about $0.015/hour in electricity at $0.10/kWh
    when fully loaded. Instance rental is a separate, larger cost.
    """
    GPU_POWER_WATTS = 150
    ENERGY_COST_PER_KWH = 0.10

    total_watt_hours = 0
    for period in utilization_data:
        utilization_pct = period['gpu_utilization']
        duration_hours = period['duration_seconds'] / 3600
        # Approximate power draw as proportional to utilization
        power_draw = GPU_POWER_WATTS * utilization_pct / 100
        total_watt_hours += power_draw * duration_hours

    kwh = total_watt_hours / 1000
    cost = kwh * ENERGY_COST_PER_KWH
    return {
        'total_kwh': round(kwh, 2),
        'total_cost_dollars': round(cost, 4),
        'utilization_samples': len(utilization_data)
    }


def recommend_instance_rightsizing(current_metrics: dict) -> dict:
    """Compare current utilization to potential savings."""
    avg_gpu_util = current_metrics['avg_gpu_utilization']
    if avg_gpu_util < 30:
        recommendation = "Downgrade to smaller GPU or batch requests"
        potential_savings = 0.40  # 40% cost reduction
    elif avg_gpu_util < 50:
        recommendation = "Consolidate workloads to improve utilization"
        potential_savings = 0.20
    elif avg_gpu_util < 70:
        recommendation = "Current utilization acceptable"
        potential_savings = 0
    else:
        recommendation = "Consider additional capacity"
        potential_savings = -0.20  # Cost increase
    return {
        'recommendation': recommendation,
        'potential_savings_pct': potential_savings,
        'avg_utilization': avg_gpu_util
    }
```

### Spot Instance Strategy

```yaml
# deployment-spot.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server-batch
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server-batch
  template:
    metadata:
      labels:
        app: inference-server-batch
    spec:
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
        - key: "node.kubernetes.io/lifecycle"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      priorityClassName: spot-instance
      containers:
        - name: inference
          image: registry.internal/inference-server:v2.1.0
          resources:
            requests:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: 1
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spot-instance
value: -1000
globalDefault: false
description: "Spot instances with interruption risk"
```

### Batch Inference Scheduling

```bash
# Illustrative pseudo-commands: stock kubectl has no `create queue` or
# `submit` subcommand. Queue semantics come from a batch scheduler such as
# Volcano or Kueue, each with its own resources and CLI.

# Create batch inference queue with lower priority
kubectl create queue inference-batch \
  --min-allocatable-resources="nvidia.com/gpu=0" \
  --max-resources="nvidia.com/gpu=4" \
  --priority=10

# Submit batch job
kubectl submit job batch-inference \
  --queue=inference-batch \
  --image=registry.internal/inference-server:v2.1.0 \
  --batch-size=32 \
  --input=s3://data/inference-requests/
```
- 22 Multi-Tenant Serving: Multi-tenant serving architectures must enforce resource fairness explicitly because some tenants will inevitably submit workloads that attempt to monopolize shared resources. (25 min)

### Tenant Isolation with Kubernetes Namespaces

```yaml
# tenant-a-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    tenant: tenant-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    # Extended resources such as GPUs only support the requests. prefix
    requests.nvidia.com/gpu: "4"
    requests.memory: "64Gi"
    limits.memory: "64Gi"
    requests.cpu: "16"
    pods: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-a-limits
  namespace: tenant-a
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: 1
      defaultRequest:
        nvidia.com/gpu: 1
      max:
        nvidia.com/gpu: 2
```

### Model Multitenancy with Shared GPU

```python
# multi_tenant_inference.py
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict

import torch


@dataclass
class TenantConfig:
    tenant_id: str
    model_name: str
    max_batch_size: int
    memory_limit_gb: int
    rate_limit_rpm: int


class MultiTenantInferenceServer:
    # _load_from_disk, _predict and _evict_lru_models are defined elsewhere.

    def __init__(self):
        self.tenants: Dict[str, TenantConfig] = {}
        # Requests seen in the current one-minute window; a background
        # task must reset these counters every minute.
        self.active_requests: Dict[str, int] = defaultdict(int)
        self.model_cache: Dict[str, torch.nn.Module] = {}

    async def route_request(
        self,
        tenant_id: str,
        request_data: dict
    ) -> dict:
        tenant = self.tenants.get(tenant_id)
        if not tenant:
            raise ValueError(f"Unknown tenant: {tenant_id}")

        # Rate limiting: count the request before doing any work
        if self.active_requests[tenant_id] >= tenant.rate_limit_rpm:
            raise ValueError(f"Rate limit exceeded for {tenant_id}")
        self.active_requests[tenant_id] += 1

        # Memory enforcement (memory_allocated is process-wide, not per tenant)
        gpu_memory = torch.cuda.memory_allocated()
        if gpu_memory > (tenant.memory_limit_gb * 1e9):
            self._evict_lru_models(tenant_id)

        # Process with tenant's assigned model
        model = self._load_model(tenant.model_name)
        return await self._predict(model, request_data)

    def _load_model(self, model_name: str) -> torch.nn.Module:
        if model_name not in self.model_cache:
            self.model_cache[model_name] = self._load_from_disk(model_name)
        return self.model_cache[model_name]
```

### Isolated Inference with Model Partitioning

```python
# partitioned_inference.py
from typing import Dict


class PartitionedInference:
    """GPU memory partitioning for isolated tenant workloads."""

    @staticmethod
    def calculate_partition_sizes(
        total_memory_gb: float,
        tenant_allocations: Dict[str, float]
    ) -> Dict[str, tuple]:
        """Calculate GPU memory partitions for each tenant.

        Returns {tenant_id: (offset_gb, size_gb)}, largest allocation first.
        """
        partitions = {}
        current_offset_gb = 0.0
        for tenant_id, allocation_pct in sorted(
            tenant_allocations.items(),
            key=lambda x: x[1],
            reverse=True
        ):
            partition_size = total_memory_gb * allocation_pct / 100
            partitions[tenant_id] = (current_offset_gb, partition_size)
            current_offset_gb += partition_size
        return partitions

    def allocate_tenant_memory(
        self,
        tenant_id: str,
        partition_start: float,
        partition_size: float
    ):
        """Set CUDA memory allocator for specific tenant."""
        # In production, use custom CUDA memory allocator
        # that respects tenant boundaries
        pass
```

### Tenant Billing Metrics

```python
# tenant_billing.py
from prometheus_client import Counter

# get_metric_sum is a project helper (defined elsewhere) that queries
# Prometheus for the increase of a counter over the period.

tenant_compute_usage = Counter(
    'tenant_gpu_compute_seconds_total',
    'Total GPU compute time per tenant',
    ['tenant_id', 'model_name']
)
tenant_request_count = Counter(
    'tenant_requests_total',
    'Total requests per tenant',
    ['tenant_id', 'status']
)


def generate_tenant_invoice(tenant_id: str, period_days: int) -> dict:
    """Generate billing report for tenant."""
    compute_seconds = get_metric_sum(
        'tenant_gpu_compute_seconds_total',
        labels={'tenant_id': tenant_id},
        period=f'{period_days}d'
    )
    requests = get_metric_sum(
        'tenant_requests_total',
        labels={'tenant_id': tenant_id},
        period=f'{period_days}d'
    )

    # Flat-rate pricing example
    compute_cost = compute_seconds * 0.0001  # $0.36/hour
    request_cost = requests * 0.0002  # $0.20/1000 requests

    return {
        'tenant_id': tenant_id,
        'compute_seconds': compute_seconds,
        'request_count': requests,
        'compute_cost': compute_cost,
        'request_cost': request_cost,
        'total_cost': compute_cost + request_cost
    }
```
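
Explicit fairness usually starts with a per-tenant rate limiter whose budget refills over time. The sketch below shows a token bucket sized from a requests-per-minute limit. It is an illustration rather than part of the course code: `TokenBucket` is a hypothetical class, and the injectable `clock` argument is an assumption added so the behaviour can be tested without sleeping.

```python
import time


class TokenBucket:
    """Allow up to rate_per_minute requests per minute, refilling continuously."""

    def __init__(self, rate_per_minute, clock=time.monotonic):
        self.capacity = float(rate_per_minute)
        self.refill_per_second = rate_per_minute / 60.0
        self.tokens = self.capacity  # start with a full budget
        self.clock = clock
        self.updated = clock()

    def allow(self):
        """Consume one token if available and report whether the request may run."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Fake clock so the example runs instantly and deterministically
fake_now = [0.0]
bucket = TokenBucket(rate_per_minute=2, clock=lambda: fake_now[0])

burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call exceeds the budget
fake_now[0] = 45.0  # 45 seconds later the bucket has refilled by about 1.5 tokens
after_wait = bucket.allow()
print(burst, after_wait)
```

One bucket per tenant, keyed by `tenant_id`, gives each tenant an independent budget, so a noisy tenant exhausts only its own tokens.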
- 23 Security Hardening: Security controls introduce friction; effective hardening applies defense-in-depth only where friction has acceptable operational overhead, avoiding controls that drive users toward workarounds. (20 min)

### TLS Configuration

```yaml
# ingress-tls.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  tls:
    - hosts:
        - inference.prod.internal
        - inference-api.example.com
      secretName: inference-tls-cert
  rules:
    - host: inference.prod.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: inference-server
                port:
                  number: 8000
```

### Mutual TLS for Internal Services

```yaml
# mtls-policy.yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: inference-mtls
  namespace: production
spec:
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: inference-authz
  namespace: production
spec:
  selector:
    matchLabels:
      app: inference-server
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/production/sa/internal-client"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/predict"]
    - from:
        - source:
            namespaces: ["gateway"]
      to:
        - operation:
            methods: ["GET"]
            paths: ["/health"]
```

### API Authentication

```python
# auth.py
import hashlib
import time
from collections import defaultdict

from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader
from starlette.requests import Request

API_KEY_HEADER = APIKeyHeader(name="X-API-Key", auto_error=False)


class TenantAuthenticator:
    def __init__(self):
        self.api_keys: dict[str, dict] = {}
        self.cache_validity_seconds = 300
        # Requests per tenant per one-minute window (in-memory, single process)
        self.rate_counts: dict[str, int] = defaultdict(int)

    def register_tenant(self, tenant_id: str, api_key: str, rate_limit: int):
        # Store only the hash so a leaked table does not expose usable keys
        key_hash = hashlib.sha256(api_key.encode()).hexdigest()
        self.api_keys[key_hash] = {
            'tenant_id': tenant_id,
            'rate_limit': rate_limit,
            'created_at': time.time()
        }

    def authenticate(
        self,
        request: Request,
        api_key: str = Security(API_KEY_HEADER)
    ) -> str:
        if not api_key:
            raise HTTPException(status_code=401, detail="API key required")

        key_hash = hashlib.sha256(api_key.encode()).hexdigest()
        credentials = self.api_keys.get(key_hash)
        if not credentials:
            raise HTTPException(status_code=401, detail="Invalid API key")

        # Per-tenant rate limiting, keyed by the current minute
        tenant_id = credentials['tenant_id']
        rate_key = f"{tenant_id}:{int(time.time() // 60)}"
        if self.rate_counts[rate_key] >= credentials['rate_limit']:
            raise HTTPException(
                status_code=429,
                detail="Rate limit exceeded"
            )
        self.rate_counts[rate_key] += 1
        return tenant_id
```

### Secrets Management

```bash
# Fetch secrets from vault
kubectl create secret generic inference-secrets \
  --from-literal=api-key="$(vault kv get -field=value secret/inference/api-key)" \
  --from-literal=model-storage-key="$(vault kv get -field=key secret/inference/storage)" \
  -n production

# View secret references in deployment
# NOTE: secrets should never appear in container logs or stdout
```

### Security Scanning

```dockerfile
# Dockerfile with multi-stage build and security hardening
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Production stage with minimal attack surface
FROM python:3.11-slim AS production
RUN useradd --create-home --shell /bin/false appuser
# Copy packages into appuser's home; /root is not readable after USER appuser
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser ./app /app
USER appuser
ENV PATH=/home/appuser/.local/bin:$PATH
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
WORKDIR /app
CMD ["python", "server.py"]
```
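
Hashing API keys before storage works best when keys are generated with enough entropy and compared in constant time. The sketch below shows both steps using only the standard library. `new_api_key` and `key_matches` are hypothetical helper names, and the 32-byte key length is an assumption rather than a requirement from this course.

```python
import hashlib
import hmac
import secrets


def new_api_key():
    """Return (plaintext_key, stored_hash); only the hash should be persisted."""
    key = secrets.token_urlsafe(32)
    return key, hashlib.sha256(key.encode()).hexdigest()


def key_matches(presented_key, stored_hash):
    """Compare hashes in constant time to avoid leaking match length via timing."""
    presented_hash = hashlib.sha256(presented_key.encode()).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)


key, stored = new_api_key()
valid = key_matches(key, stored)
invalid = key_matches(key + "x", stored)
print(valid, invalid)
```

The plaintext key is shown to the tenant once at creation time; after that the service only ever handles the hash.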
- 24 Production Stack Project: Production inference systems are not assembled once; they require continuous refinement as traffic patterns evolve, models update, and infrastructure components require maintenance. (20 min)

### Final Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
│                     (nginx + health checks)                     │
└─────────────────────────────────────────────────────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐
│  Model Server 1   │  │  Model Server 2   │  │  Model Server 3   │
│ (Triton/PyTorch)  │  │ (Triton/PyTorch)  │  │ (Triton/PyTorch)  │
│     GPU: A10G     │  │     GPU: A10G     │  │     GPU: A10G     │
└───────────────────┘  └───────────────────┘  └───────────────────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Redis Cache                           │
│                        (result caching)                         │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           PostgreSQL                            │
│                        (metadata store)                         │
└─────────────────────────────────────────────────────────────────┘
```