Deployment Strategies: Rolling, Blue-Green, and Canary Releases
Compare and implement deployment strategies—rolling updates, blue-green deployments, and canary releases—to reduce risk and enable safe production releases.
Choosing the right deployment strategy balances risk, speed, and resource cost. This guide compares rolling updates, blue-green, and canary deployments with implementation examples.
Rolling Update Mechanics
Rolling updates gradually replace old pods with new ones. Kubernetes handles this natively for Deployments.
Basic rolling update configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # Never have fewer than desired replicas
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          ports:
            - containerPort: 8080
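One thing the manifest above leaves implicit: a rolling update only preserves availability if Kubernetes can tell when a new pod is actually ready. With maxUnavailable: 0, the rollout gates on readiness before terminating old pods, so a probe is what makes that guarantee real. A sketch to add under the container spec (the /healthz path and timings are illustrative):

```yaml
# Without a readiness probe, Kubernetes considers a pod ready
# as soon as its container starts, defeating the rollout gate.
readinessProbe:
  httpGet:
    path: /healthz      # illustrative endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```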
Monitor rolling update progress:
# Watch rollout status
kubectl rollout status deployment/myapp
# View deployment details
kubectl describe deployment myapp
# Check revision history
kubectl rollout history deployment/myapp
Rolling update behavior:
| Parameter | 6 Replicas | Effect |
|---|---|---|
| maxSurge: 1, maxUnavailable: 0 | 7 pods during transition | Maximum availability, slower |
| maxSurge: 2, maxUnavailable: 0 | 8 pods during transition | Faster, more resources |
| maxSurge: 0, maxUnavailable: 1 | As few as 5 pods during transition | Minimum resources, briefly reduced capacity |
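The arithmetic behind this table is easy to make concrete. Here is an illustrative helper (not Kubernetes source code) that resolves the two parameters into pod-count bounds. Note that Kubernetes rounds percentage values for maxSurge up and for maxUnavailable down:

```javascript
// Sketch: how maxSurge / maxUnavailable translate into pod counts.
// Values may be absolute numbers or percentage strings.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const resolve = (v, roundUp) => {
    if (typeof v === "string" && v.endsWith("%")) {
      const frac = (parseInt(v, 10) / 100) * replicas;
      return roundUp ? Math.ceil(frac) : Math.floor(frac);
    }
    return v;
  };
  const surge = resolve(maxSurge, true);          // rounds up
  const unavailable = resolve(maxUnavailable, false); // rounds down
  return {
    maxPods: replicas + surge,            // upper bound during the transition
    minAvailable: replicas - unavailable, // lower bound on ready pods
  };
}

console.log(rolloutBounds(6, 1, 0));     // { maxPods: 7, minAvailable: 6 }
console.log(rolloutBounds(6, "25%", 0)); // { maxPods: 8, minAvailable: 6 }
```

The second call shows the rounding rule: 25% of 6 replicas is 1.5, which rounds up to a surge of 2.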
Rollback a rolling update:
# Immediate rollback to previous version
kubectl rollout undo deployment/myapp
# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=3
# Watch rollback
kubectl rollout status deployment/myapp
Blue-Green Deployment Setup
Blue-green deployments run two identical environments and switch traffic between them. This enables instant rollback and zero-downtime deployments.
Infrastructure setup:
Internet → Load Balancer → Blue (v1)   OR   Green (v2)
                               ↓                ↓
                          [Production]     [Production]
Kubernetes implementation with two Deployments:
# Blue deployment (current version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    slot: blue
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: blue
  template:
    metadata:
      labels:
        app: myapp
        slot: blue
        version: v1
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v1.0.0
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    slot: green
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: green
  template:
    metadata:
      labels:
        app: myapp
        slot: green
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
Service switching between slots:
# Initial state: traffic to blue
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    slot: blue
  ports:
    - port: 80
      targetPort: 8080

# Switch to green (update selector)
# kubectl patch service myapp -p '{"spec":{"selector":{"slot":"green"}}}'
Blue-green with Argo Rollouts:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    blueGreen:
      activeService: myapp-blue
      previewService: myapp-preview
      autoPromotionEnabled: false  # Manual promotion
      scaleDownDelaySeconds: 600   # Keep old version for 10 min after switch
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
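Argo Rollouts does not create the two Services the strategy references; they are assumed to exist, and the controller retargets their selectors between ReplicaSets (by injecting a pod-template-hash). A minimal sketch of those assumed objects, with illustrative ports:

```yaml
# Receives live traffic (activeService in the Rollout above)
apiVersion: v1
kind: Service
metadata:
  name: myapp-blue
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
---
# Routes to the new ReplicaSet for pre-switch smoke testing (previewService)
apiVersion: v1
kind: Service
metadata:
  name: myapp-preview
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
```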
Canary Deployment with Argo Rollouts
Canary deployments gradually shift traffic to the new version, monitoring metrics to detect issues.
Argo Rollouts canary configuration:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: myapp-canary  # Required when using nginx traffic routing
      stableService: myapp-stable
      steps:
        - setWeight: 5             # Start with 5% traffic to new version
        - pause: {}                # Wait for manual inspection
        - setWeight: 20
        - pause: { duration: 10m } # Auto-proceed after 10 minutes
        - setWeight: 50
        - pause: {}
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      trafficRouting:
        nginx:
          stableIngress: myapp-stable
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: myapp-canary
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
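The nginx traffic routing assumes an Ingress named myapp-stable already exists; Argo Rollouts generates a parallel canary Ingress from it and adjusts the canary-weight annotation as the setWeight steps execute. A sketch of that assumed Ingress (the host and backend Service name are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-stable
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com        # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-stable   # the stable Service (assumed to exist)
                port:
                  number: 80
```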
Analysis template for automated checks:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 2m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01
      failureLimit: 5
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
Feature Flags Integration
Feature flags decouple deployment from release, enabling precise control over who sees new features.
LaunchDarkly in Kubernetes:
# SDK keys are credentials: store them in a Secret, not a ConfigMap
apiVersion: v1
kind: Secret
metadata:
  name: feature-flags
type: Opaque
stringData:
  LD_CLIENT_KEY: "sdk-xxxxx"
---
# Pod spec with flag evaluation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          env:
            - name: LD_CLIENT_KEY
              valueFrom:
                secretKeyRef:
                  name: feature-flags
                  key: LD_CLIENT_KEY
# App reads flags and shows/hides features
Progressive percentage rollout with flags:
// Example: gradual rollout of a new checkout flow
const launchDarkly = require("@launchdarkly/node-server-sdk");

const client = launchDarkly.init(process.env.LD_CLIENT_KEY);

async function shouldShowNewCheckout(userId) {
  // Make sure the SDK has fetched its initial flag payload
  await client.waitForInitialization();
  return client.variation("new-checkout-flow", { key: userId }, false);
}

// Route based on flag
app.get("/checkout", async (req, res) => {
  const useNewCheckout = await shouldShowNewCheckout(req.user.id);
  if (useNewCheckout) {
    res.redirect("/checkout/new");
  } else {
    res.redirect("/checkout/legacy");
  }
});
Rollback Triggers and Automation
Automated rollback prevents bad releases from affecting users.
Prometheus metrics-triggered rollback:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    canary:
      # A failed analysis run automatically aborts the rollout;
      # no extra flag is needed. Steps, selector, and pod template
      # omitted for brevity.
      analysis:
        templates:
          - templateName: error-rate-check
---
# Rollback if error rate exceeds threshold
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureCondition: result[0] > 0.05
      failureLimit: 2  # Abort once the failure limit is exceeded
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="myapp-canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="myapp-canary"}[5m]))
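The failureLimit semantics are worth internalizing: each interval produces one measurement, and once the count of failed measurements exceeds the limit, the analysis run, and with it the rollout, is marked Failed. A toy model of that loop (illustrative, not Argo's implementation):

```javascript
// Sketch of failureLimit semantics: evaluate each measurement and
// fail the run once failures exceed the configured limit.
function runAnalysis(measurements, { successCondition, failureLimit }) {
  let failures = 0;
  for (const value of measurements) {
    if (!successCondition(value)) {
      failures += 1;
      if (failures > failureLimit) return "Failed"; // aborts the rollout
    }
  }
  return "Successful";
}

const errorRateOk = (v) => v < 0.05; // mirrors the condition above

console.log(runAnalysis([0.01, 0.02, 0.01],
  { successCondition: errorRateOk, failureLimit: 2 })); // Successful
console.log(runAnalysis([0.08, 0.09, 0.12],
  { successCondition: errorRateOk, failureLimit: 2 })); // Failed
```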
GitHub Actions automated rollback:
rollback:
  runs-on: ubuntu-latest
  needs: deploy  # assumes the deploy job is named "deploy"
  if: failure()  # runs only when a needed job failed
  steps:
    - name: Rollback deployment
      run: |
        # Rollback in Kubernetes
        kubectl rollout undo deployment/myapp -n production
        # Or, if the release was installed with Helm:
        # helm rollback myapp -n production
    - name: Notify
      uses: slackapi/slack-github-action@v1
      with:
        channel-id: "deployments"
        payload: |
          {
            "text": "Production deployment failed, rolled back automatically",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":x: *Production deployment failed*\nRolling back to previous version."
                }
              }
            ]
          }
Choosing the Right Strategy
| Strategy | Risk | Speed | Cost | Best For |
|---|---|---|---|---|
| Rolling | Low | Medium | Low | Stateless services, Kubernetes native |
| Blue-Green | Very Low | Fast | High (2x resources) | Database migrations, zero-downtime requirements |
| Canary | Low-Medium | Slow | Medium | New features, A/B testing, gradual rollout |
Decision factors:
- Application state: Stateful apps may have issues with rolling updates
- Traffic sensitivity: User-facing apps benefit from blue-green or canary
- Resource budget: Blue-green requires double the capacity
- Rollback speed: How fast must you recover from a bad deploy?
- Testing confidence: Low confidence = canary with analysis
When to Use / When Not to Use
When rolling updates make sense
Rolling updates work best in Kubernetes for stateless services that tolerate multiple versions running at once. If your application handles traffic gracefully while some instances run the old version and others run the new one, rolling updates are the simplest choice.
Use rolling updates when you need zero-downtime deployments and cannot afford double the infrastructure for blue-green. They are the Kubernetes default for a reason.
When blue-green makes sense
Blue-green is the right choice when you need instant switchover and instant rollback. Database migrations are the classic use case. You run the migration against the blue environment, validate it works, then switch all traffic to green in one atomic operation. If something goes wrong, you switch back to blue.
Blue-green also makes sense when you need to validate a full environment before taking traffic. You can run smoke tests against green before switching, and keep blue warm for a fast rollback.
When canary makes sense
Canary deployments are best for risky changes where you want real production traffic validation before committing fully. A new algorithm, a major UI redesign, a significant infrastructure change — these are all good canary candidates.
Use canary when you have the metrics infrastructure to validate the change automatically. Without metrics, canary is just slow blue-green.
Production Failure Scenarios
Common Deployment Failures
| Failure | Impact | Mitigation |
|---|---|---|
| Rolling update pods crash during transition | Service degraded during deploy | Set maxUnavailable: 0, monitor closely |
| Blue-green traffic switch fails | Half traffic goes to old version | Test traffic switch in staging, use weighted routing |
| Canary analysis triggers on unrelated metric | Healthy deploy blocked | Use metrics specific to the change |
| PDB blocks necessary eviction | Cluster upgrade blocked | Set PDB appropriately, do not overprotect |
| Service selector mismatch after switch | Traffic routed to wrong pods | Validate selectors match before switching |
Deployment Rollback Flow
flowchart TD
    A[Deploy New Version] --> B{Health Check Pass?}
    B -->|No| C[Rollback to Previous]
    B -->|Yes| D[Monitor for 10 min]
    D --> E{Metrics OK?}
    E -->|Yes| F[Deployment Complete]
    E -->|No| G[Auto Rollback]
    C --> H[Alert Team]
    G --> H
    H --> I[Investigate Root Cause]
Observability Hooks
Track deployments to catch failures early and measure deployment health.
What to monitor:
- Deployment duration (spot stuck deployments)
- Pod restart count during rollout
- Error rate spike during transition
- Traffic distribution after switch
- Rollback frequency per service
# Check rollout status
kubectl rollout status deployment/myapp --timeout=5m
# Check pod age during rollout
kubectl get pods -l app=myapp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
# View rollout history
kubectl rollout history deployment/myapp
Common Pitfalls / Anti-Patterns
Not testing the rollback procedure
A rollback strategy you have never tested is not a rollback strategy. Practice rolling back in staging so you know what happens when you call kubectl rollout undo in production at 2am.
Setting PDB too aggressively
PodDisruptionBudgets that require 100% availability block legitimate cluster operations like node upgrades. A PDB that says “always keep 3 pods available” on a 3-replica deployment means no pod can ever be evicted.
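A sketch of a budget that protects availability while still letting the cluster do its job (assuming the myapp labels used throughout this guide):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  maxUnavailable: 1   # always leaves headroom for a node drain to proceed
  selector:
    matchLabels:
      app: myapp
```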
Using the same strategy for all services
A simple stateless API and a complex stateful service with database connections need different deployment strategies. Cookie-cutter approaches lead to either over-engineering simple services or under-protecting complex ones.
Ignoring database schema changes
Deployment strategies handle application versions, not schema migrations. If your new version requires a new column that the old version cannot handle, deploying the new version before the migration is a disaster. Treat database migrations as a separate release concern.
Trade-off Summary
| Strategy | Deployment Time | Resource Cost | Rollback Speed | Risk Level |
|---|---|---|---|---|
| Rolling update | Moderate (proportional to batch size) | Low (no extra capacity) | Minutes (reverse batch) | Low-Medium |
| Blue-green | Fast (instant switchover) | 2x (double infrastructure) | Instant (switch traffic back) | Low |
| Canary | Gradual (traffic shifting) | Low-Medium (few extra pods) | Fast (drop traffic to new) | Low |
| Recreate | Fast (no orchestration) | Zero extra | Minutes (redeploy old version) | High |
Quick Recap
Key Takeaways
- Rolling updates are the Kubernetes default for a reason — they work for stateless services
- Blue-green gives you instant switchover and instant rollback for higher confidence
- Canary reduces risk by validating with real traffic before full rollout
- Always test your rollback procedure in staging, not for the first time in production
- Monitor deployment metrics: duration, error rate, pod restarts
Deployment Checklist
# Before deployment
kubectl rollout history deployment/myapp
kubectl get pdb myapp -o yaml
# During deployment
kubectl rollout status deployment/myapp --timeout=10m
kubectl get pods -l app=myapp --watch
# After deployment
kubectl rollout status deployment/myapp
kubectl logs -l app=myapp --tail=100 | grep ERROR
kubectl get events --sort-by='.lastTimestamp' | grep myapp
Interview Questions
Q: You need to deploy a database migration as part of a new version. The migration is backwards-incompatible. How do you handle the deployment?
A: Backwards-incompatible migrations require a multi-phase approach. Option 1: deploy the new application version alongside the old, run the migration while both versions are running, then cut over traffic once the migration completes. Option 2: use the expand-contract pattern — first deploy schema changes that are backwards-compatible (new columns with defaults, new tables), then deploy the new application code, then clean up old schema. For truly incompatible changes, blue-green with a migration freeze window is often the safest. Never run migrations as part of the deployment pipeline without a rollback plan.
Q: A canary deployment is sending 10% of traffic to the new version, and error rates spike. What do you do?
A: Immediately halt the rollout: reduce canary traffic to 0% or revert to the previous version using your traffic management tool (Argo Rollouts, Flagger, or your service mesh). Do not try to debug while serving traffic to users. After reverting, investigate: check application logs and metrics for the new version, look for differences in configuration or environment variables, verify the new version is reading from the correct data stores. Common causes: the new version has a subtle bug that only manifests at scale, dependency connectivity issues, or incorrect resource configuration.
Q: How do you design a deployment strategy for a stateful service like Kafka that requires zero data loss?
A: Stateful services need careful sequencing: scale up the new brokers before decommissioning old ones, wait for topic replication to catch up, then migrate partition leadership. Use Kafka’s built-in partition reassignment tool to move partitions safely. Set unclean.leader.election.enable=false to prevent data loss during broker failures. For Kafka specifically, use Strimzi or Kafka Operator on Kubernetes for managed StatefulSets. Always test the failure scenario in a staging environment first. Incremental rollout with careful monitoring of replication lag is essential.
Q: What are PodDisruptionBudgets and why do they matter during deployments?
A: PDBs ensure a minimum number of pods remain available during voluntary disruptions like node drains and deployments. Without PDBs, Kubernetes could evict too many pods simultaneously, causing service disruption. Set minAvailable or maxUnavailable based on your availability requirements. For stateful services with replication, minAvailable: 1 ensures at least one replica stays up. During deployments with multiple replicas, PDBs prevent Kubernetes from terminating too many pods at once, maintaining quorum for clustered applications.
Q: You want to deploy to 1000 nodes but avoid a thundering herd problem. How do you approach this?
A: The thundering herd problem occurs when many nodes pull images or restart simultaneously, overwhelming the registry or network. Avoid by: configuring RollingUpdate with maxSurge: 10-25% and maxUnavailable: 0 so updates happen in controlled batches, staggering deployments across node pools if you have multiple pools, using a wave-based deployment approach where you tag nodes and deploy to wave 1, wait for stability, then proceed. For image pulls specifically, use a local registry mirror or cache (Harbor, Amazon ECR), pre-pull images onto nodes, and set imagePullPolicy: IfNotPresent.
Conclusion
Each deployment strategy serves different needs. Rolling updates work well in Kubernetes and require minimal extra resources. Blue-green deployments provide instant switchover and easy rollback at the cost of double infrastructure. Canary deployments offer granular control and risk reduction through gradual traffic shifting. For more on automated deployments, see our CI/CD Pipelines guide, and for GitOps patterns, see our GitOps article.