Deployment Strategies: Rolling, Blue-Green, and Canary Releases

Compare and implement deployment strategies—rolling updates, blue-green deployments, and canary releases—to reduce risk and enable safe production releases.


Choosing the right deployment strategy balances risk, speed, and resource cost. This guide compares rolling updates, blue-green, and canary deployments with implementation examples.

Rolling Update Mechanics

Rolling updates gradually replace old pods with new ones. Kubernetes handles this natively for Deployments.

Basic rolling update configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # Allow 1 extra pod during update
      maxUnavailable: 0 # Never have fewer than desired replicas
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          ports:
            - containerPort: 8080

Monitor rolling update progress:

# Watch rollout status
kubectl rollout status deployment/myapp

# View deployment details
kubectl describe deployment myapp

# Check revision history
kubectl rollout history deployment/myapp

Rolling update behavior:

| Parameter | 6 Replicas | Effect |
| --- | --- | --- |
| maxSurge: 1, maxUnavailable: 0 | 7 pods during transition | Maximum availability, slower |
| maxSurge: 2, maxUnavailable: 0 | 8 pods during transition | Faster, more resources |
| maxSurge: 0, maxUnavailable: 1 | 5 pods during transition | Minimum resources, some downtime |
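Both limits can also be given as percentages; Kubernetes rounds a percentage maxSurge up and a percentage maxUnavailable down. The resulting pod-count bounds can be sketched with a small illustrative helper (not a Kubernetes API, just the arithmetic):

```javascript
// Illustrative helper: the pod-count bounds Kubernetes enforces during a
// rolling update. Percentage values follow the documented rounding:
// maxSurge rounds up, maxUnavailable rounds down.
function resolve(value, replicas, roundUp) {
  if (typeof value === "string" && value.endsWith("%")) {
    const fraction = parseInt(value, 10) / 100;
    return roundUp ? Math.ceil(replicas * fraction) : Math.floor(replicas * fraction);
  }
  return value;
}

function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolve(maxSurge, replicas, true);
  const unavailable = resolve(maxUnavailable, replicas, false);
  return {
    maxPods: replicas + surge, // upper bound during the transition
    minAvailable: replicas - unavailable, // lower bound on ready pods
  };
}

console.log(rolloutBounds(6, 1, 0)); // { maxPods: 7, minAvailable: 6 }
console.log(rolloutBounds(6, "25%", 0)); // 25% of 6 rounds up to 2 → maxPods: 8
```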

Rollback a rolling update:

# Immediate rollback to previous version
kubectl rollout undo deployment/myapp

# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=3

# Watch rollback
kubectl rollout status deployment/myapp

Blue-Green Deployment Setup

Blue-green deployments run two identical environments and switch traffic between them. This enables instant rollback and zero-downtime deployments.

Infrastructure setup:

Internet → Load Balancer → Blue (v1)  OR  Green (v2)
                              ↓               ↓
                       [active slot]    [standby slot]

Kubernetes implementation with two Deployments:

# Blue deployment (current version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    slot: blue
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: blue
  template:
    metadata:
      labels:
        app: myapp
        slot: blue
        version: v1
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v1.0.0
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    slot: green
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: green
  template:
    metadata:
      labels:
        app: myapp
        slot: green
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0

Service switching between slots:

# Initial state: traffic to blue
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    slot: blue
  ports:
    - port: 80
      targetPort: 8080

# Switch to green (update selector)
# kubectl patch service myapp -p '{"spec":{"selector":{"slot":"green"}}}'

Blue-green with Argo Rollouts:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    blueGreen:
      activeService: myapp-active # Service receiving production traffic
      previewService: myapp-preview # Service for validating the new version
      autoPromotionEnabled: false # Manual promotion
      scaleDownDelaySeconds: 600 # Keep old version for 10 min after switch
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
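With autoPromotionEnabled: false, the rollout pauses once the preview version is up; promotion and abort are manual, via the Argo Rollouts kubectl plugin (assuming the plugin is installed):

```shell
# Inspect the rollout and both revisions
kubectl argo rollouts get rollout myapp --watch

# Promote: switch the active service to the new version
kubectl argo rollouts promote myapp

# Abort: route traffic back to the stable version
kubectl argo rollouts abort myapp
```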

Canary Deployment with Argo Rollouts

Canary deployments gradually shift traffic to the new version, monitoring metrics to detect issues.

Argo Rollouts canary configuration:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5 # Start with 5% traffic to new version
        - pause: {} # Wait for manual inspection
        - setWeight: 20
        - pause: { duration: 10m } # Auto-proceed after 10 minutes
        - setWeight: 50
        - pause: {}
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      trafficRouting:
        nginx:
          stableIngress: myapp-stable
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: myapp-canary
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0

Analysis template for automated checks:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 2m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01
      failureLimit: 5
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

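To make the interval and failureLimit semantics concrete, here is an illustrative sketch (not Argo Rollouts code) of how a sequence of measurements is judged: each measurement passes or fails the success condition, and the metric fails once the number of failed measurements exceeds failureLimit (Argo's default failureLimit is 0, so a single failed measurement fails the analysis).

```javascript
// Illustrative sketch (not Argo Rollouts code): judge a metric from a
// sequence of measurements. The metric fails once failed measurements
// exceed failureLimit; otherwise it is successful.
function judgeMetric(measurements, successCondition, failureLimit) {
  let failures = 0;
  for (const value of measurements) {
    if (!successCondition(value)) {
      failures += 1;
      if (failures > failureLimit) return "Failed";
    }
  }
  return "Successful";
}

// Success-rate metric from the template above: result[0] >= 0.95, failureLimit: 3
const successRate = (v) => v >= 0.95;
console.log(judgeMetric([0.99, 0.97, 0.96], successRate, 3)); // Successful
console.log(judgeMetric([0.99, 0.9, 0.91, 0.89, 0.88], successRate, 3)); // Failed
```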
Feature Flags Integration

Feature flags decouple deployment from release, enabling precise control over who sees new features.

LaunchDarkly in Kubernetes:

# Inject feature flag context into pod
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
data:
  LD_CLIENT_KEY: "sdk-xxxxx" # Store real SDK keys in a Secret, not a ConfigMap

---
# Pod spec with flag evaluation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          env:
            - name: LD_CLIENT_KEY
              valueFrom:
                configMapKeyRef:
                  name: feature-flags
                  key: LD_CLIENT_KEY
          # App reads flags and shows/hides features

Progressive percentage rollout with flags:

// Example: gradual rollout of new checkout
const launchDarkly = require("@launchdarkly/node-server-sdk");

const client = launchDarkly.init(process.env.LD_CLIENT_KEY);
// In production, await client.waitForInitialization() before serving traffic

async function shouldShowNewCheckout(userId) {
  return client.variation("new-checkout-flow", { key: userId }, false);
}

// Route based on flag
app.get("/checkout", async (req, res) => {
  const userId = req.user.id;
  const useNewCheckout = await shouldShowNewCheckout(userId);

  if (useNewCheckout) {
    res.redirect("/checkout/new");
  } else {
    res.redirect("/checkout/legacy");
  }
});
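Under the hood, percentage rollouts bucket each user deterministically so the same user always sees the same variant, and raising the percentage only ever adds users. A minimal sketch of that idea (illustrative; not the actual LaunchDarkly algorithm):

```javascript
// Illustrative sketch of deterministic percentage bucketing (not the
// actual LaunchDarkly algorithm). The same userId always hashes to the
// same bucket, so the rollout is stable and monotonic as it grows.
function bucketOf(userId) {
  // FNV-1a 32-bit hash, reduced to a bucket in [0, 100)
  let hash = 0x811c9dc5;
  for (const ch of userId) {
    hash ^= ch.codePointAt(0);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

function inRollout(userId, percentage) {
  return bucketOf(userId) < percentage;
}

console.log(inRollout("user-1234", 0)); // false for everyone at 0%
console.log(inRollout("user-1234", 100)); // true for everyone at 100%
```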

Rollback Triggers and Automation

Automated rollback prevents bad releases from affecting users.

Prometheus metrics-triggered rollback:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: error-rate-check
        # Analysis failure automatically aborts the rollout and shifts
        # traffic back to the stable version
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.05 # Error ratio must stay below 5%
      failureLimit: 2 # Abort the rollout once failed measurements exceed this limit
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="myapp-canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="myapp-canary"}[5m]))

GitHub Actions automated rollback:

rollback:
  runs-on: ubuntu-latest
  if: failure()
  steps:
    - name: Rollback deployment
      run: |
        # Rollback in Kubernetes
        kubectl rollout undo deployment/myapp -n production

        # Or rollback Helm
        helm rollback myapp -n production

    - name: Notify
      uses: slackapi/slack-github-action@v1
      with:
        channel-id: "deployments"
        payload: |
          {
            "text": "Production deployment failed, rolled back automatically",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":x: *Production deployment failed*\nRolling back to previous version."
                }
              }
            ]
          }

Choosing the Right Strategy

| Strategy | Risk | Speed | Cost | Best For |
| --- | --- | --- | --- | --- |
| Rolling | Low | Medium | Low | Stateless services, Kubernetes native |
| Blue-Green | Very Low | Fast | High (2x resources) | Database migrations, zero-downtime requirements |
| Canary | Low-Medium | Slow | Medium | New features, A/B testing, gradual rollout |

Decision factors:

  1. Application state: Stateful apps may have issues with rolling updates
  2. Traffic sensitivity: User-facing apps benefit from blue-green or canary
  3. Resource budget: Blue-green requires double the capacity
  4. Rollback speed: How fast must you recover from a bad deploy?
  5. Testing confidence: Low confidence = canary with analysis

When to Use / When Not to Use

When rolling updates make sense

Rolling updates work best in Kubernetes for stateless services that tolerate multiple versions running at once. If your application handles traffic gracefully while some instances run the old version and others run the new one, rolling updates are the simplest choice.

Use rolling updates when you need zero-downtime deployments and cannot afford double the infrastructure for blue-green. They are the Kubernetes default for a reason.

When blue-green makes sense

Blue-green is the right choice when you need instant switchover and instant rollback. Database migrations are the classic use case. You run the migration against the blue environment, validate it works, then switch all traffic to green in one atomic operation. If something goes wrong, you switch back to blue.

Blue-green also makes sense when you need to validate a full environment before taking traffic. You can run smoke tests against green before switching, and keep blue warm for a fast rollback.

When canary makes sense

Canary deployments are best for risky changes where you want real production traffic validation before committing fully. A new algorithm, a major UI redesign, a significant infrastructure change — these are all good canary candidates.

Use canary when you have the metrics infrastructure to validate the change automatically. Without metrics, canary is just slow blue-green.

Production Failure Scenarios

Common Deployment Failures

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Rolling update pods crash during transition | Service degraded during deploy | Set maxUnavailable: 0, monitor closely |
| Blue-green traffic switch fails | Half traffic goes to old version | Test traffic switch in staging, use weighted routing |
| Canary analysis triggers on unrelated metric | Healthy deploy blocked | Use metrics specific to the change |
| PDB blocks necessary eviction | Cluster upgrade blocked | Set PDB appropriately, do not overprotect |
| Service selector mismatch after switch | Traffic routed to wrong pods | Validate selectors match before switching |

Deployment Rollback Flow

flowchart TD
    A[Deploy New Version] --> B{Health Check Pass?}
    B -->|No| C[Rollback to Previous]
    B -->|Yes| D[Monitor for 10 min]
    D --> E{Metrics OK?}
    E -->|Yes| F[Deployment Complete]
    E -->|No| G[Auto Rollback]
    C --> H[Alert Team]
    G --> H
    H --> I[Investigate Root Cause]

Observability Hooks

Track deployments to catch failures early and measure deployment health.

What to monitor:

  • Deployment duration (spot stuck deployments)
  • Pod restart count during rollout
  • Error rate spike during transition
  • Traffic distribution after switch
  • Rollback frequency per service

# Check rollout status
kubectl rollout status deployment/myapp --timeout=5m

# Check pod age during rollout
kubectl get pods -l app=myapp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'

# View rollout history
kubectl rollout history deployment/myapp

Common Pitfalls / Anti-Patterns

Not testing the rollback procedure

A rollback strategy you have never tested is not a rollback strategy. Practice rolling back in staging so you know what happens when you call kubectl rollout undo in production at 2am.

Setting PDB too aggressively

PodDisruptionBudgets that require 100% availability block legitimate cluster operations like node upgrades. A PDB that says “always keep 3 pods available” on a 3-replica deployment means no pod can ever be evicted.
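A safer setting for a 3-replica deployment allows one pod to be evicted at a time (names follow the examples above):

```yaml
# Allow at most one myapp pod to be down from voluntary disruptions,
# so node drains and rolling updates can still make progress
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
```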

Using the same strategy for all services

A simple stateless API and a complex stateful service with database connections need different deployment strategies. Cookie-cutter approaches lead to either over-engineering simple services or under-protecting complex ones.

Ignoring database schema changes

Deployment strategies handle application versions, not schema migrations. If your new version requires a new column that the old version cannot handle, deploying the new version before the migration is a disaster. Treat database migrations as a separate release concern.

Trade-off Summary

| Strategy | Deployment Time | Resource Cost | Rollback Speed | Risk Level |
| --- | --- | --- | --- | --- |
| Rolling update | Moderate (proportional to batch size) | Low (no extra capacity) | Minutes (reverse batch) | Low-Medium |
| Blue-green | Fast (instant switchover) | 2x (double infrastructure) | Instant (switch traffic back) | Low |
| Canary | Gradual (traffic shifting) | Low-Medium (few extra pods) | Fast (drop traffic to new) | Low |
| Recreate | Fast (no orchestration) | Zero extra | Minutes (redeploy old version) | High |

Quick Recap

Key Takeaways

  • Rolling updates are the Kubernetes default for a reason — they work for stateless services
  • Blue-green gives you instant switchover and instant rollback for higher confidence
  • Canary reduces risk by validating with real traffic before full rollout
  • Always test your rollback procedure in staging, not for the first time in production
  • Monitor deployment metrics: duration, error rate, pod restarts

Deployment Checklist

# Before deployment
kubectl rollout history deployment/myapp
kubectl get pdb myapp -o yaml

# During deployment
kubectl rollout status deployment/myapp --timeout=10m
kubectl get pods -l app=myapp --watch

# After deployment
kubectl rollout status deployment/myapp
kubectl logs -l app=myapp --tail=100 | grep ERROR
kubectl get events --sort-by='.lastTimestamp' | grep myapp

Interview Questions

Q: You need to deploy a database migration as part of a new version. The migration is backwards-incompatible. How do you handle the deployment?
A: Backwards-incompatible migrations require a multi-phase approach. Option 1: deploy the new application version alongside the old, run the migration while both versions are running, then cut over traffic once the migration completes. Option 2: use the expand-contract pattern — first deploy schema changes that are backwards-compatible (new columns with defaults, new tables), then deploy the new application code, then clean up old schema. For truly incompatible changes, blue-green with a migration freeze window is often the safest. Never run migrations as part of the deployment pipeline without a rollback plan.

Q: A canary deployment is sending 10% of traffic to the new version, and error rates spike. What do you do?
A: Immediately halt the rollout: reduce canary traffic to 0% or revert to the previous version using your traffic management tool (Argo Rollouts, Flagger, or your service mesh). Do not try to debug while serving traffic to users. After reverting, investigate: check application logs and metrics for the new version, look for differences in configuration or environment variables, verify the new version is reading from the correct data stores. Common causes: the new version has a subtle bug that only manifests at scale, dependency connectivity issues, or incorrect resource configuration.

Q: How do you design a deployment strategy for a stateful service like Kafka that requires zero data loss?
A: Stateful services need careful sequencing: scale up the new brokers before decommissioning old ones, wait for topic replication to catch up, then migrate partition leadership. Use Kafka’s built-in partition reassignment tool to move partitions safely. Set unclean.leader.election.enable=false to prevent data loss during broker failures. For Kafka specifically, use Strimzi or Kafka Operator on Kubernetes for managed StatefulSets. Always test the failure scenario in a staging environment first. Incremental rollout with careful monitoring of replication lag is essential.

Q: What are PodDisruptionBudgets and why do they matter during deployments?
A: PDBs ensure a minimum number of pods remain available during voluntary disruptions like node drains and deployments. Without PDBs, Kubernetes could evict too many pods simultaneously, causing service disruption. Set minAvailable or maxUnavailable based on your availability requirements. For stateful services with replication, minAvailable: 1 ensures at least one replica stays up. During deployments with multiple replicas, PDBs prevent Kubernetes from terminating too many pods at once, maintaining quorum for clustered applications.

Q: You want to deploy to 1000 nodes but avoid a thundering herd problem. How do you approach this?
A: The thundering herd problem occurs when many nodes pull images or restart simultaneously, overwhelming the registry or network. Avoid by: configuring RollingUpdate with maxSurge: 10-25% and maxUnavailable: 0 so updates happen in controlled batches, staggering deployments across node pools if you have multiple pools, using a wave-based deployment approach where you tag nodes and deploy to wave 1, wait for stability, then proceed. For image pulls specifically, use a local registry mirror or cache (Harbor, Amazon ECR), pre-pull images onto nodes, and set imagePullPolicy: IfNotPresent.

Conclusion

Each deployment strategy serves different needs. Rolling updates work well in Kubernetes and require minimal extra resources. Blue-green deployments provide instant switchover and easy rollback at the cost of double infrastructure. Canary deployments offer granular control and risk reduction through gradual traffic shifting. For more on automated deployments, see our CI/CD Pipelines guide, and for GitOps patterns, see our GitOps article.
