Deployment Strategies: Rolling, Blue-Green, and Canary Releases
Compare and implement deployment strategies—rolling updates, blue-green deployments, and canary releases—to reduce risk and enable safe production releases.
Introduction
Production deployments are one of the riskiest things you do as an engineer. A bad release can take down a service thousands of users depend on, and rolling back takes time you often do not have. Deployment strategies control how code reaches users — whether that’s gradual traffic shifts, instant cutovers, or maintaining hot spare environments. The strategy you pick affects how fast you ship, how much infrastructure you need, and how quickly you can recover if something goes wrong.
Rolling updates, blue-green, and canary each handle risk differently. Rolling updates replace instances one by one, keeping old versions running throughout. Blue-green runs two complete environments and switches all traffic at once. Canary sends a fraction of traffic to the new version and grows that fraction based on real metrics. None is universally better — the right choice depends on how much downtime your app tolerates, your infrastructure budget, and how fast you need to roll back.
This guide covers the mechanics of each strategy with Kubernetes-native implementations and Argo Rollouts examples. You’ll learn how to configure rolling updates for zero-downtime deploys, implement blue-green cutovers for instant rollback, set up canary analysis with automated metric gates, and pick the right approach for your situation. Each strategy includes production failure scenarios so you know what can go wrong before it does.
Rolling Update Mechanics
Rolling updates gradually replace old pods with new ones. Kubernetes handles this natively for Deployments — you configure a few parameters and it takes care of the rest.
Basic rolling update configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # Never have fewer than desired replicas
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          ports:
            - containerPort: 8080
Monitor rolling update progress:
# Watch rollout status
kubectl rollout status deployment/myapp
# View deployment details
kubectl describe deployment myapp
# Check revision history
kubectl rollout history deployment/myapp
Rolling update behavior:
| Setting | Pod count with 6 replicas | Effect |
|---|---|---|
| maxSurge: 1, maxUnavailable: 0 | Up to 7 pods, never fewer than 6 available | Full capacity throughout, slowest |
| maxSurge: 2, maxUnavailable: 0 | Up to 8 pods, never fewer than 6 available | Faster, more resources |
| maxSurge: 0, maxUnavailable: 1 | Never more than 6 pods, as few as 5 available | No extra resources, reduced capacity during the rollout |
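The pod counts in the table follow from simple arithmetic. The sketch below is an illustration only: `rolloutBounds` and `resolve` are hypothetical helper names, not part of any Kubernetes client. It accepts absolute values or percentage strings and applies the rounding Kubernetes uses (surge percentages round up, unavailable percentages round down). It simplifies one edge case by rejecting a pair that resolves to zero on both sides.

```javascript
// Resolve an absolute count or a percentage string ("25%") against the replica count.
// Kubernetes rounds maxSurge percentages up and maxUnavailable percentages down.
function resolve(value, replicas, roundUp) {
  if (typeof value === "number") return value;
  const fraction = (parseFloat(value) / 100) * replicas;
  return roundUp ? Math.ceil(fraction) : Math.floor(fraction);
}

// Compute the most pods that can exist and the fewest that stay available
// while a rolling update is in progress.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolve(maxSurge, replicas, true);
  const unavailable = resolve(maxUnavailable, replicas, false);
  if (surge === 0 && unavailable === 0) {
    // Simplification: with both at zero the rollout could never make progress.
    throw new Error("maxSurge and maxUnavailable cannot both resolve to 0");
  }
  return {
    maxTotalPods: replicas + surge,
    minAvailablePods: replicas - unavailable,
  };
}

console.log(rolloutBounds(6, 1, 0));         // first table row
console.log(rolloutBounds(6, 0, 1));         // third table row
console.log(rolloutBounds(6, "25%", "25%")); // percentages, the Kubernetes defaults
```

With six replicas, the default 25% / 25% setting resolves to a surge of 2 and an unavailability of 1, which sits between the conservative and the resource-saving rows of the table.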
Rollback a rolling update:
# Immediate rollback to previous version
kubectl rollout undo deployment/myapp
# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=3
# Watch rollback
kubectl rollout status deployment/myapp
Blue-Green Deployment Setup
Blue-green deployments run two identical environments and switch traffic between them. This enables instant rollback and zero-downtime deployments.
Infrastructure setup:
Internet → Load Balancer → Blue (v1)    OR    Green (v2)
                               ↓                  ↓
                          [Production]       [Production]
Kubernetes implementation with two Deployments:
# Blue deployment (current version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    slot: blue
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: blue
  template:
    metadata:
      labels:
        app: myapp
        slot: blue
        version: v1
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v1.0.0
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    slot: green
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: green
  template:
    metadata:
      labels:
        app: myapp
        slot: green
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
Service switching between slots:
# Initial state: traffic to blue
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    slot: blue
  ports:
    - port: 80
      targetPort: 8080
# Switch to green (update selector)
# kubectl patch service myapp -p '{"spec":{"selector":{"slot":"green"}}}'
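Cutover and rollback are the same operation pointed in opposite directions, which is easy to miss when the patch is typed by hand. A minimal sketch, assuming the `blue`/`green` slot labels used above; `otherSlot` and `switchCommand` are hypothetical helper names that only build the command string and do not talk to the cluster:

```javascript
// Slots used by the two Deployments above; the Service selects one of them.
const SLOTS = ["blue", "green"];

// Return the slot that is currently idle, i.e. the switch target.
function otherSlot(activeSlot) {
  if (!SLOTS.includes(activeSlot)) {
    throw new Error(`unknown slot: ${activeSlot}`);
  }
  return activeSlot === "blue" ? "green" : "blue";
}

// Build the kubectl command that repoints the Service at the target slot.
// Only the "slot" key is patched; the "app: myapp" selector key is left alone.
function switchCommand(serviceName, activeSlot) {
  const patch = { spec: { selector: { slot: otherSlot(activeSlot) } } };
  return `kubectl patch service ${serviceName} -p '${JSON.stringify(patch)}'`;
}

console.log(switchCommand("myapp", "blue"));  // cut over to green
console.log(switchCommand("myapp", "green")); // roll back to blue
```

Generating the command from the currently active slot, rather than hard-coding the target, avoids patching the Service to the slot it already points at.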
Blue-green with Argo Rollouts:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    blueGreen:
      activeService: myapp-blue
      previewService: myapp-preview
      autoPromotionEnabled: false # Manual promotion
      scaleDownDelaySeconds: 600  # Keep old version for 10 min after switch
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
Canary Deployment with Argo Rollouts
Canary deployments gradually shift traffic to the new version, monitoring metrics to detect issues.
Argo Rollouts canary configuration:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      # Traffic routing needs one Service per version; these names are assumed
      canaryService: myapp-canary
      stableService: myapp-stable
      steps:
        - setWeight: 5 # Start with 5% traffic to new version
        - pause: {} # Wait for manual inspection
        - setWeight: 20
        - pause: { duration: 10m } # Auto-proceed after 10 minutes
        - setWeight: 50
        - pause: {}
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      trafficRouting:
        nginx:
          stableIngress: myapp-stable
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: myapp-canary
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp # Must match spec.selector
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
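It can help to read the `steps` list as a timeline: each `setWeight` changes the canary's share of traffic, and each `pause` holds that share either until someone promotes the rollout or until the duration elapses. The sketch below is only a reading aid; `describeSteps` is a hypothetical helper, not an Argo Rollouts API, and the step objects simply mirror the YAML above.

```javascript
// The canary steps from the Rollout above, written as plain objects.
const steps = [
  { setWeight: 5 },
  { pause: {} },
  { setWeight: 20 },
  { pause: { duration: "10m" } },
  { setWeight: 50 },
  { pause: {} },
];

// Turn the step list into one entry per hold point:
// which traffic weight the canary has, and what ends the hold.
function describeSteps(stepList) {
  let weight = 0;
  const plan = [];
  for (const step of stepList) {
    if (step.setWeight !== undefined) {
      weight = step.setWeight;
    } else if (step.pause !== undefined) {
      const wait = step.pause.duration
        ? `auto-resume after ${step.pause.duration}`
        : "wait for manual promotion";
      plan.push({ canaryWeight: weight, wait });
    }
  }
  // Once every step has completed, the new version takes all traffic.
  plan.push({ canaryWeight: 100, wait: "rollout complete" });
  return plan;
}

for (const entry of describeSteps(steps)) {
  console.log(`${entry.canaryWeight}% canary: ${entry.wait}`);
}
```

The two indefinite pauses are released with `kubectl argo rollouts promote myapp`, so this configuration needs a person in the loop at 5% and again at 50%.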
Analysis template for automated checks:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 2m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01
      failureLimit: 5
      provider:
        prometheus:
          address: http://prometheus:9090
          # Divide by total requests so the result is a ratio, matching the 0.01 threshold
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
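The interaction between `successCondition` and `failureLimit` is easy to misread: a single bad measurement does not fail the metric; the metric fails once the number of failed measurements exceeds `failureLimit`. The sketch below models only that counting rule. `evaluateMetric` is a hypothetical function, the sample values are made up for illustration, and it deliberately ignores inconclusive and errored measurements.

```javascript
// Simplified model of one analysis metric: each measurement either meets the
// success condition or counts as a failure, and the metric fails as soon as
// the failure count exceeds failureLimit (default 0, i.e. no failures allowed).
function evaluateMetric(measurements, { isSuccess, failureLimit = 0 }) {
  let failures = 0;
  for (const [index, value] of measurements.entries()) {
    if (!isSuccess(value)) failures += 1;
    if (failures > failureLimit) {
      return { phase: "Failed", failures, stoppedAt: index };
    }
  }
  return { phase: "Successful", failures, stoppedAt: measurements.length - 1 };
}

// Mirrors the success-rate metric above: result[0] >= 0.95, failureLimit: 3.
const successRate = { isSuccess: (v) => v >= 0.95, failureLimit: 3 };

// Illustrative measurements, one per 2-minute interval.
console.log(evaluateMetric([0.99, 0.93, 0.94, 0.99], successRate));
console.log(evaluateMetric([0.99, 0.93, 0.94, 0.99, 0.92, 0.90], successRate));
```

In the second run the failures are not consecutive, yet the fourth one still fails the metric. A higher `failureLimit` therefore buys tolerance for noise at the cost of a slower abort.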
Feature Flags Integration
Feature flags decouple deployment from release, enabling precise control over who sees new features.
LaunchDarkly in Kubernetes:
# Store the LaunchDarkly SDK key in a Secret, since it is a credential
apiVersion: v1
kind: Secret
metadata:
  name: feature-flags
stringData:
  LD_CLIENT_KEY: "sdk-xxxxx"
---
# Pod spec with flag evaluation (selector and pod labels omitted for brevity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          env:
            - name: LD_CLIENT_KEY
              valueFrom:
                secretKeyRef:
                  name: feature-flags
                  key: LD_CLIENT_KEY
          # App reads flags and shows/hides features
Progressive percentage rollout with flags:
// Example: gradual rollout of new checkout
const express = require("express");
const LaunchDarkly = require("@launchdarkly/node-server-sdk");

const app = express();
const client = LaunchDarkly.init(process.env.LD_CLIENT_KEY);

async function shouldShowNewCheckout(userId) {
  // The last argument is the fallback value if the flag cannot be evaluated
  return client.variation("new-checkout-flow", { kind: "user", key: userId }, false);
}

// Route based on flag
app.get("/checkout", async (req, res) => {
  const userId = req.user.id; // assumes auth middleware has populated req.user
  const useNewCheckout = await shouldShowNewCheckout(userId);
  if (useNewCheckout) {
    res.redirect("/checkout/new");
  } else {
    res.redirect("/checkout/legacy");
  }
});
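Percentage rollouts stay stable per user because the decision comes from a deterministic hash rather than a random draw. The sketch below shows the general idea only. It is not LaunchDarkly's actual bucketing algorithm; `fnv1a`, `bucketFor`, and `inRollout` are hypothetical names, and the hash is an FNV-1a-style function over UTF-16 code units chosen because it needs no imports.

```javascript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a (flag, user) pair to a stable bucket in the range 0..99.
// Including the flag key keeps different flags from sharing the same cohort.
function bucketFor(flagKey, userId) {
  return fnv1a(`${flagKey}:${userId}`) % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function inRollout(flagKey, userId, percentage) {
  return bucketFor(flagKey, userId) < percentage;
}

const users = ["alice", "bob", "carol", "dave"];
for (const user of users) {
  console.log(user, bucketFor("new-checkout-flow", user), inRollout("new-checkout-flow", user, 20));
}
```

Because a user's bucket never changes, raising the percentage from 20 to 50 only adds users; nobody who already saw the new checkout is moved back to the legacy flow.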
Rollback Triggers and Automation
Automated rollback prevents bad releases from affecting users.
Prometheus metrics-triggered rollback:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: error-rate-check
        # No extra flag is needed: a failed analysis run aborts the rollout
        # and traffic returns to the stable version
---
# Rollback if error rate exceeds threshold
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.05
      failureCondition: result[0] >= 0.05
      failureLimit: 2 # Metric fails on the 3rd failed measurement (failures need not be consecutive)
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="myapp-canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="myapp-canary"}[5m]))
GitHub Actions automated rollback:
rollback:
  runs-on: ubuntu-latest
  needs: deploy # assumes the deployment job is named "deploy"
  if: failure()
  steps:
    - name: Rollback deployment
      run: |
        # Rollback in Kubernetes
        kubectl rollout undo deployment/myapp -n production
        # Or, for Helm-managed releases, use this instead:
        # helm rollback myapp -n production
    - name: Notify
      uses: slackapi/slack-github-action@v1
      with:
        channel-id: "deployments"
        payload: |
          {
            "text": "Production deployment failed, rolled back automatically",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":x: *Production deployment failed*\nRolling back to previous version."
                }
              }
            ]
          }
      env:
        SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} # assumed secret name
Choosing the Right Strategy
| Strategy | Risk | Speed | Cost | Best For |
|---|---|---|---|---|
| Rolling | Low | Medium | Low | Stateless services, Kubernetes native |
| Blue-Green | Very Low | Fast | High (2x resources) | Database migrations, zero-downtime requirements |
| Canary | Low-Medium | Slow | Medium | New features, A/B testing, gradual rollout |
Decision factors:
- Application state: Stateful apps may have issues with rolling updates
- Traffic sensitivity: User-facing apps benefit from blue-green or canary
- Resource budget: Blue-green requires double the capacity
- Rollback speed: How fast must you recover from a bad deploy
- Testing confidence: Low confidence = canary with analysis
When to Use / When Not to Use
When rolling updates make sense
Rolling updates work best in Kubernetes for stateless services where multiple versions can run side by side. If your application handles traffic gracefully while some instances run the old version and others run the new one, rolling updates are the simplest choice.
Use rolling updates when you need zero-downtime deployments and cannot afford double the infrastructure for blue-green. They are the Kubernetes default for a reason.
When blue-green makes sense
Blue-green is the right choice when you need instant switchover and instant rollback. Database migrations are the classic use case. You apply the migration for the green environment while blue keeps serving traffic, validate that green works, then switch all traffic to green in one atomic operation. If something goes wrong, you switch back to blue, which only works if the migrated schema is still compatible with the old version.
Blue-green also makes sense when you need to validate a full environment before taking traffic. You can run smoke tests against green before switching, and keep blue warm for a fast rollback.
When canary makes sense
Canary deployments are best for risky changes where you want real production traffic validation before committing fully. A new algorithm, a major UI redesign, a significant infrastructure change — these are all good canary candidates.
Use canary when you have the metrics infrastructure to validate the change automatically. Without metrics, canary is just slow blue-green.
Production Failure Scenarios
Common Deployment Failures
| Failure | Impact | Mitigation |
|---|---|---|
| Rolling update pods crash during transition | Service degraded during deploy | Set maxUnavailable: 0, monitor closely |
| Blue-green traffic switch fails | Half traffic goes to old version | Test traffic switch in staging, use weighted routing |
| Canary analysis triggers on unrelated metric | Healthy deploy blocked | Use metrics specific to the change |
| PDB blocks necessary eviction | Cluster upgrade blocked | Set PDB appropriately, do not overprotect |
| Service selector mismatch after switch | Traffic routed to wrong pods | Validate selectors match before switching |
Deployment Rollback Flow
flowchart TD
A[Deploy New Version] --> B{Health Check Pass?}
B -->|No| C[Rollback to Previous]
B -->|Yes| D[Monitor for 10 min]
D --> E{Metrics OK?}
E -->|Yes| F[Deployment Complete]
E -->|No| G[Auto Rollback]
C --> H[Alert Team]
G --> H
H --> I[Investigate Root Cause]
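The flowchart can also be read as a small decision function, which is a convenient shape for a pipeline script or a runbook test. A minimal sketch, with the node labels taken from the diagram and `deploymentPath` as a hypothetical name:

```javascript
// Walk the rollback flowchart and return the sequence of nodes visited.
// healthCheckPassed: result of the initial health check after deploying.
// metricsOk: result of the metrics check after the 10 minute monitoring window.
function deploymentPath({ healthCheckPassed, metricsOk }) {
  const path = ["Deploy New Version"];
  if (!healthCheckPassed) {
    return path.concat("Rollback to Previous", "Alert Team", "Investigate Root Cause");
  }
  path.push("Monitor for 10 min");
  if (metricsOk) {
    return path.concat("Deployment Complete");
  }
  return path.concat("Auto Rollback", "Alert Team", "Investigate Root Cause");
}

console.log(deploymentPath({ healthCheckPassed: true, metricsOk: true }).join(" -> "));
console.log(deploymentPath({ healthCheckPassed: true, metricsOk: false }).join(" -> "));
console.log(deploymentPath({ healthCheckPassed: false, metricsOk: false }).join(" -> "));
```

Both failure branches end at the same two nodes, so alerting and root-cause investigation apply whether the rollback was triggered by the health check or by the metrics.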
Observability Hooks
Track deployments to catch failures early and measure deployment health.
What to monitor:
- Deployment duration (spot stuck deployments)
- Pod restart count during rollout
- Error rate spike during transition
- Traffic distribution after switch
- Rollback frequency per service
# Check rollout status
kubectl rollout status deployment/myapp --timeout=5m
# Check pod age during rollout
kubectl get pods -l app=myapp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
# View rollout history
kubectl rollout history deployment/myapp
Common Pitfalls / Anti-Patterns
Not testing the rollback procedure
A rollback strategy you have never tested is not a rollback strategy. Practice rolling back in staging so you know what happens when you call kubectl rollout undo in production at 2am.
Setting PDB too aggressively
PodDisruptionBudgets that require 100% availability block legitimate cluster operations like node upgrades. A PDB that says “always keep 3 pods available” on a 3-replica deployment means no pod can ever be evicted.
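The arithmetic behind a blocked eviction is short enough to check before applying a PDB. The sketch below assumes integer values only (percentages are not handled), and `allowedDisruptions` is a hypothetical helper that mirrors how the allowed disruption count is derived from healthy pods and the budget:

```javascript
// How many pods may be voluntarily evicted right now under a PDB.
// Exactly one of minAvailable or maxUnavailable must be set.
function allowedDisruptions({ expectedPods, healthyPods, minAvailable, maxUnavailable }) {
  if ((minAvailable === undefined) === (maxUnavailable === undefined)) {
    throw new Error("set exactly one of minAvailable or maxUnavailable");
  }
  // Number of pods that must stay healthy for the budget to hold.
  const desiredHealthy =
    minAvailable !== undefined ? minAvailable : expectedPods - maxUnavailable;
  return Math.max(0, healthyPods - desiredHealthy);
}

// The anti-pattern from the text: 3 replicas, "always keep 3 available".
console.log(allowedDisruptions({ expectedPods: 3, healthyPods: 3, minAvailable: 3 }));
// Relaxing the budget by one pod lets a node drain proceed one pod at a time.
console.log(allowedDisruptions({ expectedPods: 3, healthyPods: 3, minAvailable: 2 }));
console.log(allowedDisruptions({ expectedPods: 3, healthyPods: 3, maxUnavailable: 1 }));
```

A result of zero means every drain of a node hosting one of these pods will wait indefinitely, which is exactly how an overprotective PDB stalls a cluster upgrade.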
Using the same strategy for all services
A simple stateless API and a complex stateful service with database connections need different deployment strategies. Cookie-cutter approaches lead to either over-engineering simple services or under-protecting complex ones.
Ignoring database schema changes
Deployment strategies handle application versions, not schema migrations. If your new version requires a new column that the old version cannot handle, deploying the new version before the migration is a disaster. Treat database migrations as a separate release concern.
Trade-off Summary
| Strategy | Deployment Time | Resource Cost | Rollback Speed | Risk Level |
|---|---|---|---|---|
| Rolling update | Moderate (proportional to batch size) | Low (at most maxSurge extra pods) | Minutes (reverse batch) | Low-Medium |
| Blue-green | Fast (instant switchover) | 2x (double infrastructure) | Instant (switch traffic back) | Low |
| Canary | Gradual (traffic shifting) | Low-Medium (few extra pods) | Fast (drop traffic to new) | Low |
| Recreate | Fast (no orchestration) | Zero extra | Minutes (redeploy old version) | High |
Interview Questions
Backwards-incompatible migrations require a multi-phase approach. Option 1: deploy the new application version alongside the old, run the migration while both versions are running, then cut over traffic once the migration completes. Option 2: use the expand-contract pattern — first deploy schema changes that are backwards-compatible (new columns with defaults, new tables), then deploy the new application code, then clean up old schema. For truly incompatible changes, blue-green with a migration freeze window is often the safest. Never run migrations as part of the deployment pipeline without a rollback plan.
Immediately halt the rollout: reduce canary traffic to 0% or revert to the previous version using your traffic management tool (Argo Rollouts, Flagger, or your service mesh). Do not try to debug while serving traffic to users. After reverting, investigate: check application logs and metrics for the new version, look for differences in configuration or environment variables, verify the new version is reading from the correct data stores. Common causes: the new version has a subtle bug that only manifests at scale, dependency connectivity issues, or incorrect resource configuration.
Stateful services need careful sequencing: scale up the new brokers before decommissioning old ones, wait for topic replication to catch up, then migrate partition leadership. Use Kafka's built-in partition reassignment tool to move partitions safely. Set unclean.leader.election.enable=false to prevent data loss during broker failures. For Kafka specifically, use Strimzi or Kafka Operator on Kubernetes for managed StatefulSets. Always test the failure scenario in a staging environment first. Incremental rollout with careful monitoring of replication lag is essential.
PDBs ensure a minimum number of pods remain available during voluntary disruptions such as node drains and cluster upgrades. Without PDBs, a drain could evict too many pods simultaneously, causing service disruption. Set minAvailable or maxUnavailable based on your availability requirements. For stateful services with replication, minAvailable: 1 ensures at least one replica stays up, and a higher value helps maintain quorum for clustered applications. PDBs do not limit a Deployment's own rolling update; that pace is set by maxSurge and maxUnavailable, so configure both.
The thundering herd problem occurs when many nodes pull images or restart simultaneously, overwhelming the registry or network. Avoid by: configuring RollingUpdate with maxSurge: 10-25% and maxUnavailable: 0 so updates happen in controlled batches, staggering deployments across node pools if you have multiple pools, using a wave-based deployment approach where you tag nodes and deploy to wave 1, wait for stability, then proceed. For image pulls specifically, use a local registry mirror or cache (Harbor, Amazon ECR), pre-pull images onto nodes, and set imagePullPolicy: IfNotPresent.
Blue-green provides instant switchover with two identical environments — all traffic moves at once. This is ideal when you need instant rollback capability or are dealing with database migrations. Canary sends a percentage of traffic gradually, allowing real-world validation before full rollout. Choose canary when you want to validate with production traffic before committing fully, when you have robust metrics to detect issues early, or when you need A/B testing capability. Blue-green is simpler operationally but requires double the infrastructure.
For backward-compatible changes: add new columns/tables first (with defaults), deploy new application code that uses the new schema, then clean up old columns after all instances run the new version. During the transition, both old and new code must work — old code ignores new columns, new code can read both. Never remove a column or change a column type in a backward-incompatible way while old pods still run. The expand-contract pattern separates schema migration from code deployment.
Key metrics: error rate (5xx responses), latency percentiles (p50, p95, p99), throughput (requests per second), pod restart count, CPU/memory utilization, and business metrics like conversion rate or error counts in logs. Set automated gates that compare canary metrics against baseline: if error rate exceeds 2x baseline or latency increases by 50%, halt the rollout. Also monitor infrastructure metrics like registry pull times and network throughput to catch resource constraints.
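The gate described above (halt if the error rate exceeds 2x baseline or latency rises by 50%) can be sketched as a pure comparison. `canaryGate` and the metric field names are hypothetical, the sample numbers are invented for illustration, and a real gate would pull both sets of values from the metrics backend over the same time window.

```javascript
// Compare canary metrics against the stable baseline.
// Default limits follow the text: error rate above 2x baseline, or p95
// latency more than 50% above baseline, halts the rollout.
function canaryGate(baseline, canary, limits = { errorRateFactor: 2, latencyIncrease: 0.5 }) {
  const reasons = [];
  if (canary.errorRate > baseline.errorRate * limits.errorRateFactor) {
    reasons.push("error rate exceeds baseline limit");
  }
  if (canary.p95LatencyMs > baseline.p95LatencyMs * (1 + limits.latencyIncrease)) {
    reasons.push("p95 latency exceeds baseline limit");
  }
  return { proceed: reasons.length === 0, reasons };
}

const baseline = { errorRate: 0.01, p95LatencyMs: 200 };

console.log(canaryGate(baseline, { errorRate: 0.015, p95LatencyMs: 290 }));
console.log(canaryGate(baseline, { errorRate: 0.025, p95LatencyMs: 310 }));
```

One caveat of a purely relative threshold: when the baseline error rate is zero, any canary error trips the gate, so some teams add an absolute floor alongside it.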
Kubernetes native rollback: kubectl rollout undo deployment/myapp reverts to the previous revision. For Argo Rollouts, use kubectl argo rollouts abort myapp and it automatically routes traffic back to stable. Key prerequisites: have revision history enabled (default), keep previous ReplicaSets available for quick switchback. Automate rollback triggers based on metrics: when an Argo Rollouts analysis run records more failed measurements than failureLimit allows, the rollout aborts and traffic returns to the stable version. Always test rollback in staging.
Feature flags decouple deployment from release — you deploy code to production but control who sees it via flags. This allows gradual rollout independent of deployment strategy: deploy canary 5% but enable new feature for internal users only. Flags provide kill switches per feature, not just per version. They complement canary by adding another dimension of control: canary controls which version serves traffic, flags control which features are active within that version. Combine them for maximum control over risky releases.
Diagnosis steps: kubectl describe pod shows events and exit codes, kubectl logs shows application output. Common causes: application crashes on startup due to missing environment variables or config maps, health check failing due to incorrect probe configuration, dependency connection failures, or permission issues with service accounts. Fix by checking the deployment spec matches the application's expectations, verify ConfigMaps and Secrets are mounted correctly, adjust readiness/liveness probes if they fail too aggressively, and ensure the container image tag points to the correct version.
For applications with long-lived connections (WebSockets, gRPC streams, database connections): use graceful shutdown to drain existing connections before terminating pods. Set terminationGracePeriodSeconds long enough to complete in-flight requests. Use preStop hooks to wait for load balancer to deregister the pod before stopping the container. For databases, use connection pooling with health-check aware connections. Blue-green is often better for stateful connections since you switch entire environments atomically.
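On the application side, graceful shutdown usually means reacting to SIGTERM in three stages: report "not ready", wait briefly so the load balancer stops sending new requests, then close the server and exit. The sketch below assumes a Node.js HTTP server. `installGracefulShutdown` and its option names are hypothetical, and the delays are placeholders that must stay below the pod's terminationGracePeriodSeconds.

```javascript
// Register a SIGTERM handler that drains the server before exiting.
// `proc` defaults to the real process and is injectable for testing.
function installGracefulShutdown(server, options = {}) {
  const { proc = process, drainDelayMs = 5000, forceAfterMs = 25000 } = options;
  let shuttingDown = false;

  proc.on("SIGTERM", () => {
    if (shuttingDown) return;
    shuttingDown = true; // the readiness endpoint should now report "not ready"

    // Hard deadline in case connections never finish; keep it shorter than
    // terminationGracePeriodSeconds so the process exits before SIGKILL.
    const force = setTimeout(() => proc.exit(1), forceAfterMs);
    force.unref();

    // Give the load balancer time to deregister the pod, then stop accepting
    // connections and exit once in-flight requests have completed.
    setTimeout(() => {
      server.close(() => {
        clearTimeout(force);
        proc.exit(0);
      });
    }, drainDelayMs);
  });

  return { isReady: () => !shuttingDown };
}

// Usage sketch (not executed here):
//   const server = app.listen(8080);
//   const lifecycle = installGracefulShutdown(server);
//   app.get("/ready", (req, res) => res.sendStatus(lifecycle.isReady() ? 200 : 503));
```

Long-lived connections such as WebSockets will not end on their own when the server stops accepting new ones, so they typically need an explicit close message before the hard deadline fires.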
maxSurge controls how many extra pods can run above the desired count during an update, which adds capacity during the transition. maxUnavailable controls how many pods can be missing from the desired count, which sets the minimum availability. Settings: maxSurge: 1, maxUnavailable: 0 = maximum availability, slower (up to 7 pods during transition with 6 desired). maxSurge: 0, maxUnavailable: 1 = minimum resources, reduced capacity rather than downtime (as few as 5 pods available during transition). Balance based on whether you can tolerate temporary extra resource consumption versus a brief capacity reduction.
Multi-region deployment strategy: use canary in one region first (e.g., us-east-1 at 5%), monitor metrics, then expand to other regions progressively. For blue-green, maintain complete environments per region with separate load balancers. Ensure database migrations are backward-compatible since different regions may run different versions during transition. Set up cross-region traffic management via global load balancers. Consider deploying to secondary regions first if they have lower traffic — lower risk exposure.
Non-backward-compatible config changes require a multi-step approach: 1) Deploy new application version with old config accepted (dual compatibility), 2) Update config in staging, validate new version works with new config, 3) Update config in production — application already deployed and ready. Never change config and application simultaneously if incompatible. If the config change requires application code changes too, use blue-green to switch atomically. Consider feature flags to toggle behavior during transition period.
Native Kubernetes: simpler, built-in, no additional components. Works well for basic rolling updates with health checks. Argo Rollouts: adds progressive delivery (canary percentages, automated analysis, blue-green), integrates with service meshes and ingress controllers for fine-grained traffic splitting, provides rollback automation based on metrics. Trade-offs: Argo adds operational complexity and requires installing the operator. Use native Kubernetes for simpler deployments where basic rolling updates with health checks suffice. Use Argo when you need canary analysis, metric-driven rollback, or advanced traffic management.
Testing approach: 1) Staging environment that mirrors production topology, 2) Run the exact same deployment procedure in staging before production, 3) Include rollback testing — deploy, trigger simulated failure, verify rollback works, 4) Test failure scenarios: what happens if a pod crashes mid-deployment, if the registry is unreachable, if the database migration fails, 5) Load test during deployment to ensure rolling updates handle production traffic, 6) Practice on a non-production cluster first if available. Document the procedure and verify the team knows how to execute it.
Progressive canary implementation with Argo Rollouts: define steps with weights and pause durations. Example: setWeight: 5% → pause (manual) → setWeight: 20% → pause 10m (auto-proceed if metrics OK) → setWeight: 50% → pause (manual) → setWeight: 100%. At each step, analysis templates run Prometheus queries to check error rate, latency, and custom metrics. If metrics exceed thresholds, rollout aborts automatically. Use traffic routing annotations for nginx/istio to split traffic at each step. Start with internal users at 5%, then beta users, then gradual public rollout.
Recreate strategy terminates all old pods before creating new ones — causes downtime but is simple. Use recreate when: application is stateless and can tolerate downtime, you are deploying infrastructure changes that cannot be done incrementally (e.g., changing the underlying network), you need to ensure no two versions run simultaneously for compliance reasons, or you are in a development environment where simplicity matters more than availability. Avoid recreate in production for user-facing services. Rolling update is almost always better for production since it maintains availability.
When deployment pipeline fails and blocks the team: 1) Immediately communicate status to stakeholders, 2) Identify whether it's a pipeline infrastructure issue or an application issue, 3) If infrastructure (registry down, runner failures), use fallback: deploy from last known good artifact stored locally, 4) If application issue, check if rollback to previous version is faster than fixing forward, 5) Enable bypass for critical fixes with appropriate approval and documentation, 6) Post-mortem to prevent recurrence: improve monitoring to catch issues earlier, add redundant systems for critical path components, create rollback runbooks that work.
Further Reading
Official Documentation
- Kubernetes Deployment Documentation - Official guide to Deployments and rolling updates
- Argo Rollouts Documentation - Progressive delivery with Argo Rollouts
- LaunchDarkly Feature Flags Documentation - Feature flag management best practices
Related Guides
- CI/CD Pipeline Design - Pipeline architecture patterns and optimization
- GitOps Implementation - GitOps workflows with ArgoCD and Flux
- Container Registry Setup - Image storage and scanning strategies
- Automated Testing in CI/CD - Testing strategies and quality gates
Tools and References
- Argo Rollouts GitHub - Open source progressive delivery controller
- Flagger - Progressive delivery Kubernetes operator
- Weave Flux - GitOps operator for Kubernetes
- Prometheus Operator - Monitoring for Kubernetes deployments
Conclusion
Key Takeaways
- Rolling updates are the Kubernetes default for a reason — they work for stateless services
- Blue-green gives you instant switchover and instant rollback for higher confidence
- Canary reduces risk by validating with real traffic before full rollout
- Always test your rollback procedure in staging, not for the first time in production
- Monitor deployment metrics: duration, error rate, pod restarts
Deployment Checklist
# Before deployment
kubectl rollout history deployment/myapp
kubectl get pdb myapp -o yaml
# During deployment
kubectl rollout status deployment/myapp --timeout=10m
kubectl get pods -l app=myapp --watch
# After deployment
kubectl rollout status deployment/myapp
kubectl logs -l app=myapp --tail=100 | grep ERROR
kubectl get events --sort-by='.lastTimestamp' | grep myapp