RUNLOCALAI · v38
OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Production Local AI Deployment

17. Canary Deployments

Chapter 17 of 24 · 20 min
KEY INSIGHT

Effective canary deployments treat traffic percentage as a dynamic control, starting at 1-5% and increasing only when real-time metrics confirm equivalent or improved performance.

### Argo Rollouts Implementation

```yaml
# canary-deployment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inference-rollout
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      # The NGINX integration needs a Service for each side of the split.
      # These two Service names are assumed; create them alongside the Rollout.
      canaryService: inference-canary
      stableService: inference-stable
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: inference-analysis
            args:
              - name: service-name
                value: inference-rollout
      canaryMetadata:
        labels:
          version: canary
      stableMetadata:
        labels:
          version: stable
      trafficRouting:
        nginx:
          # The controller derives the canary Ingress from this one and
          # manages its canary-weight annotation from the setWeight steps.
          stableIngress: inference-stable-internal
          annotationPrefix: nginx.ingress.kubernetes.io
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: registry.internal/inference-server:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```

### Analysis Templates

Define automated validation criteria:

```yaml
# analysis-template.yaml
# Note: with interval set and no count, a metric keeps measuring until it
# fails or the run is stopped; add count to bound an inline analysis step.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: inference-analysis
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: latency-check
      interval: 5m
      # The histogram is recorded in seconds, so the p95 limit is 1.5 (seconds).
      successCondition: result[0] <= 1.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(inference_latency_seconds_bucket{
                export_service="{{args.service-name}}"
              }[5m])) by (le)
            )
    - name: error-rate-check
      interval: 5m
      successCondition: result[0] < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(inference_requests_total{
              export_service="{{args.service-name}}",
              status="error"
            }[5m]))
            /
            sum(rate(inference_requests_total{
              export_service="{{args.service-name}}"
            }[5m]))
```

### Manual Promotion

```bash
# Pause canary progression for manual review
kubectl argo rollouts pause inference-rollout -n production

# Resume and advance to the next step; traffic weight follows the
# setWeight steps in the Rollout spec
kubectl argo rollouts promote inference-rollout -n production

# Full promotion, skipping any remaining steps
kubectl argo rollouts promote inference-rollout --full -n production
```

Canary deployments reduce risk by routing a small percentage of production traffic to a newly deployed model version, enabling real-world validation before full rollout. This technique catches regressions that benchmarks miss, because synthetic test data rarely matches the distribution of real production requests.
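The promotion decision itself can be sketched in a few lines of shell. This is only an illustration of the gating idea: `canary_gate` is a hypothetical helper, not an Argo Rollouts command, and the thresholds assumed here are a p95 latency of at most 1.5 seconds and an error rate below 1%.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of Argo Rollouts: decide whether a canary
# may take more traffic.
#   $1 = p95 latency in seconds
#   $2 = error rate as a fraction (0.01 = 1%)
canary_gate() {
  awk -v p95="$1" -v err="$2" 'BEGIN {
    # Assumed limits: p95 <= 1.5 s and error rate strictly below 1%.
    if (p95 + 0 <= 1.5 && err + 0 < 0.01) print "promote"
    else print "hold"
  }'
}

canary_gate 1.2 0.004   # healthy canary, prints "promote"
canary_gate 2.3 0.004   # latency regression, prints "hold"
```

In a real rollout the controller makes this decision from Prometheus queries; the helper only mirrors the logic so the pass/fail conditions are easy to reason about.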

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Deploy Argo Rollouts to a local Kubernetes cluster. Configure a canary deployment strategy for an inference server with automated analysis checks for error rate and latency thresholds. Generate load against the deployment and observe automatic traffic shifting as metrics remain healthy.
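One possible starting point for the exercise is sketched below. It assumes `kubectl`, the Argo Rollouts kubectl plugin, and `curl` are installed and that the current context points at a local cluster. The function names are hypothetical, and the URL passed to `generate_load` is whatever address your inference server is reachable at locally.

```bash
#!/usr/bin/env bash
# Exercise helpers (hypothetical names). Nothing runs until a function is called.

# Install the Argo Rollouts controller from the project's release manifest.
install_argo_rollouts() {
  kubectl create namespace argo-rollouts
  kubectl apply -n argo-rollouts \
    -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
}

# Follow the rollout's steps, weights, and analysis runs as they change.
watch_rollout() {
  kubectl argo rollouts get rollout inference-rollout -n production --watch
}

# Print errors/total as a decimal fraction with four places.
#   $1 = failed requests, $2 = total requests
error_rate() {
  awk -v errors="$1" -v total="$2" 'BEGIN {
    if (total + 0 == 0) print "0.0000"
    else printf "%.4f\n", errors / total
  }'
}

# Send $2 sequential requests to URL $1 and report the observed error rate.
generate_load() {
  local url="$1"
  local total="$2"
  local errors=0
  local i=0
  local code
  while [ "$i" -lt "$total" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
    # Count transport failures (000) and 5xx responses as errors.
    case "$code" in
      000|5??) errors=$((errors + 1)) ;;
    esac
    i=$((i + 1))
  done
  echo "sent=$total errors=$errors rate=$(error_rate "$errors" "$total")"
}
```

A typical session would call `install_argo_rollouts` once, apply the Rollout and AnalysisTemplate, run `watch_rollout` in one terminal, and call `generate_load <url> 200` in another, where `<url>` is a placeholder for your own endpoint. The loop is sequential, so it produces steady background traffic rather than a stress test.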
