17. Canary Deployments

Chapter 17 of 24 · 20 min

KEY INSIGHT

Effective canary deployments treat traffic percentage as a dynamic control, starting at 1-5% and increasing only when real-time metrics confirm equivalent or improved performance. ### Argo Rollouts Implementation ```yaml # canary-deployment.yaml apiVersion: argoproj.io/v1alpha1 kind: Rollout metadata: name: inference-rollout namespace: production spec: replicas: 10 strategy: canary: steps: - setWeight: 5 - pause: {duration: 10m} - analysis: templates: - templateName: inference-analysis args: - name: service-name value: inference-rollout canaryMetadata: labels: version: canary stableMetadata: labels: version: stable trafficRouting: nginx: stableIngress: inference-stable-internal additionalIngress: inference-canary-internal annotationPrefix: nginx.ingress.kubernetes.io routeSpecificMetadata: - name: inference-canary-internal annotations: canary-weight: "5" selector: matchLabels: app: inference-server template: metadata: labels: app: inference-server spec: containers: - name: inference image: registry.internal/inference-server:latest resources: limits: nvidia.com/gpu: 1 ``` ### Analysis Templates Define automated validation criteria: ```yaml # analysis-template.yaml apiVersion: argoproj.io/v1alpha1 kind: AnalysisTemplate metadata: name: inference-analysis namespace: production spec: args: - name: service-name metrics: - name: latency-check interval: 5m successCondition: result[0] <= 1.5 * 1000 failureLimit: 3 provider: prometheus: address: http://prometheus:9090 query: | histogram_quantile(0.95, sum(rate(inference_latency_seconds_bucket{ export_service="{.{args.service-name}}" }[5m])) by (le) ) - name: error-rate-check interval: 5m successCondition: result[0] < 0.01 failureLimit: 1 provider: prometheus: address: http://prometheus:9090 query: | sum(rate(inference_requests_total{ export_service="{.{args.service-name}}", status="error" }[5m])) / sum(rate(inference_requests_total{ export_service="{.{args.service-name}}" }[5m])) ``` ### Manual Promotion ```bash # Pause canary progression for manual review kubectl argo rollouts pause inference-rollout -n production # Manually increase traffic weight kubectl argo rollouts set weight inference-rollout 25 -n production # Full promotion kubectl argo rollouts promote inference-rollout -n production ```

Canary deployments reduce risk by routing a small percentage of production traffic to newly deployed model versions, enabling real-world validation before full rollout. This technique catches regressions that benchmarks miss because synthetic test data rarely captures production distribution accurately.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Deploy Argo Rollouts to a local Kubernetes cluster. Configure a canary deployment strategy for an inference server with automated analysis checks for error rate and latency thresholds. Generate load against the deployment and observe automatic traffic shifting as metrics remain healthy.