Deployment Strategies: Rolling, Blue-Green, and Canary Releases
Compare and implement deployment strategies—rolling updates, blue-green deployments, and canary releases—to reduce risk and enable safe production releases.
Choosing the right deployment strategy balances risk, speed, and resource cost. This guide compares rolling updates, blue-green, and canary deployments with implementation examples.
Rolling Update Mechanics
Rolling updates gradually replace old pods with new ones. Kubernetes handles this natively for Deployments.
Basic rolling update configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # Never have fewer than desired replicas
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          ports:
            - containerPort: 8080
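One thing the manifest above leaves implicit: a rolling update only preserves availability if Kubernetes can tell when a new pod is actually ready. With maxUnavailable: 0, the rollout gates on readiness before terminating old pods, so a probe is what makes that guarantee real. A sketch to add under the container spec (the /healthz path and timings are illustrative):

```yaml
# Without a readiness probe, Kubernetes considers a pod ready
# as soon as its container starts, defeating the rollout gate.
readinessProbe:
  httpGet:
    path: /healthz      # illustrative endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```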
Monitor rolling update progress:
# Watch rollout status
kubectl rollout status deployment/myapp
# View deployment details
kubectl describe deployment myapp
# Check revision history
kubectl rollout history deployment/myapp
Rolling update behavior:
| Parameter | 6 Replicas | Effect |
|---|---|---|
| maxSurge: 1, maxUnavailable: 0 | 7 pods during transition | Maximum availability, slower |
| maxSurge: 2, maxUnavailable: 0 | 8 pods during transition | Faster, more resources |
| maxSurge: 0, maxUnavailable: 1 | As few as 5 pods during transition | Minimum resources, briefly reduced capacity |
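The arithmetic behind this table is easy to make concrete. Here is an illustrative helper (not Kubernetes source code) that resolves the two parameters into pod-count bounds. Note that Kubernetes rounds percentage values for maxSurge up and for maxUnavailable down:

```javascript
// Sketch: how maxSurge / maxUnavailable translate into pod counts.
// Values may be absolute numbers or percentage strings.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const resolve = (v, roundUp) => {
    if (typeof v === "string" && v.endsWith("%")) {
      const frac = (parseInt(v, 10) / 100) * replicas;
      return roundUp ? Math.ceil(frac) : Math.floor(frac);
    }
    return v;
  };
  const surge = resolve(maxSurge, true);          // rounds up
  const unavailable = resolve(maxUnavailable, false); // rounds down
  return {
    maxPods: replicas + surge,            // upper bound during the transition
    minAvailable: replicas - unavailable, // lower bound on ready pods
  };
}

console.log(rolloutBounds(6, 1, 0));     // { maxPods: 7, minAvailable: 6 }
console.log(rolloutBounds(6, "25%", 0)); // { maxPods: 8, minAvailable: 6 }
```

The second call shows the rounding rule: 25% of 6 replicas is 1.5, which rounds up to a surge of 2.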
Rollback a rolling update:
# Immediate rollback to previous version
kubectl rollout undo deployment/myapp
# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=3
# Watch rollback
kubectl rollout status deployment/myapp
Blue-Green Deployment Setup
Blue-green deployments run two identical environments and switch traffic between them. This enables instant rollback and zero-downtime deployments.
Infrastructure setup:
Internet → Load Balancer → Blue (v1)   OR   Green (v2)
                               ↓                ↓
                          [Production]     [Production]
Kubernetes implementation with two Deployments:
# Blue deployment (current version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    slot: blue
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: blue
  template:
    metadata:
      labels:
        app: myapp
        slot: blue
        version: v1
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v1.0.0
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    slot: green
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      slot: green
  template:
    metadata:
      labels:
        app: myapp
        slot: green
        version: v2
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
Service switching between slots:
# Initial state: traffic to blue
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    slot: blue
  ports:
    - port: 80
      targetPort: 8080

# Switch to green (update selector)
# kubectl patch service myapp -p '{"spec":{"selector":{"slot":"green"}}}'
Blue-green with Argo Rollouts:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    blueGreen:
      activeService: myapp-blue
      previewService: myapp-preview
      autoPromotionEnabled: false  # Manual promotion
      scaleDownDelaySeconds: 600   # Keep old version for 10 min after switch
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
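Argo Rollouts does not create the two Services the strategy references; they are assumed to exist, and the controller retargets their selectors between ReplicaSets (by injecting a pod-template-hash). A minimal sketch of those assumed objects, with illustrative ports:

```yaml
# Receives live traffic (activeService in the Rollout above)
apiVersion: v1
kind: Service
metadata:
  name: myapp-blue
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
---
# Routes to the new ReplicaSet for pre-switch smoke testing (previewService)
apiVersion: v1
kind: Service
metadata:
  name: myapp-preview
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
```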
Canary Deployment with Argo Rollouts
Canary deployments gradually shift traffic to the new version, monitoring metrics to detect issues.
Argo Rollouts canary configuration:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: myapp-canary  # Required when using nginx traffic routing
      stableService: myapp-stable
      steps:
        - setWeight: 5             # Start with 5% traffic to new version
        - pause: {}                # Wait for manual inspection
        - setWeight: 20
        - pause: { duration: 10m } # Auto-proceed after 10 minutes
        - setWeight: 50
        - pause: {}
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      trafficRouting:
        nginx:
          stableIngress: myapp-stable
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: myapp-canary
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
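The nginx traffic routing assumes an Ingress named myapp-stable already exists; Argo Rollouts generates a parallel canary Ingress from it and adjusts the canary-weight annotation as the setWeight steps execute. A sketch of that assumed Ingress (the host and backend Service name are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-stable
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com        # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-stable   # the stable Service (assumed to exist)
                port:
                  number: 80
```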
Analysis template for automated checks:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 2m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01
      failureLimit: 5
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
Feature Flags Integration
Feature flags decouple deployment from release, enabling precise control over who sees new features.
LaunchDarkly in Kubernetes:
# SDK keys are credentials: store them in a Secret, not a ConfigMap
apiVersion: v1
kind: Secret
metadata:
  name: feature-flags
type: Opaque
stringData:
  LD_CLIENT_KEY: "sdk-xxxxx"
---
# Pod spec with flag evaluation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:v2.0.0
          env:
            - name: LD_CLIENT_KEY
              valueFrom:
                secretKeyRef:
                  name: feature-flags
                  key: LD_CLIENT_KEY
# App reads flags and shows/hides features
Progressive percentage rollout with flags:
// Example: gradual rollout of a new checkout flow
const launchDarkly = require("@launchdarkly/node-server-sdk");

const client = launchDarkly.init(process.env.LD_CLIENT_KEY);

async function shouldShowNewCheckout(userId) {
  // Make sure the SDK has fetched its initial flag payload
  await client.waitForInitialization();
  return client.variation("new-checkout-flow", { key: userId }, false);
}

// Route based on flag
app.get("/checkout", async (req, res) => {
  const useNewCheckout = await shouldShowNewCheckout(req.user.id);
  if (useNewCheckout) {
    res.redirect("/checkout/new");
  } else {
    res.redirect("/checkout/legacy");
  }
});
Rollback Triggers and Automation
Automated rollback prevents bad releases from affecting users.
Prometheus metrics-triggered rollback:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    canary:
      # A failed analysis run automatically aborts the rollout;
      # no extra flag is needed. Steps, selector, and pod template
      # omitted for brevity.
      analysis:
        templates:
          - templateName: error-rate-check
---
# Rollback if error rate exceeds threshold
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureCondition: result[0] > 0.05
      failureLimit: 2  # Abort once the failure limit is exceeded
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="myapp-canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="myapp-canary"}[5m]))
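The failureLimit semantics are worth internalizing: each interval produces one measurement, and once the count of failed measurements exceeds the limit, the analysis run, and with it the rollout, is marked Failed. A toy model of that loop (illustrative, not Argo's implementation):

```javascript
// Sketch of failureLimit semantics: evaluate each measurement and
// fail the run once failures exceed the configured limit.
function runAnalysis(measurements, { successCondition, failureLimit }) {
  let failures = 0;
  for (const value of measurements) {
    if (!successCondition(value)) {
      failures += 1;
      if (failures > failureLimit) return "Failed"; // aborts the rollout
    }
  }
  return "Successful";
}

const errorRateOk = (v) => v < 0.05; // mirrors the condition above

console.log(runAnalysis([0.01, 0.02, 0.01],
  { successCondition: errorRateOk, failureLimit: 2 })); // Successful
console.log(runAnalysis([0.08, 0.09, 0.12],
  { successCondition: errorRateOk, failureLimit: 2 })); // Failed
```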
GitHub Actions automated rollback:
rollback:
  runs-on: ubuntu-latest
  needs: deploy  # assumes the deploy job is named "deploy"
  if: failure()  # runs only when a needed job failed
  steps:
    - name: Rollback deployment
      run: |
        # Rollback in Kubernetes
        kubectl rollout undo deployment/myapp -n production
        # Or, if the release was installed with Helm:
        # helm rollback myapp -n production
    - name: Notify
      uses: slackapi/slack-github-action@v1
      with:
        channel-id: "deployments"
        payload: |
          {
            "text": "Production deployment failed, rolled back automatically",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":x: *Production deployment failed*\nRolling back to previous version."
                }
              }
            ]
          }
Choosing the Right Strategy
| Strategy | Risk | Speed | Cost | Best For |
|---|---|---|---|---|
| Rolling | Low | Medium | Low | Stateless services, Kubernetes native |
| Blue-Green | Very Low | Fast | High (2x resources) | Database migrations, zero-downtime requirements |
| Canary | Low-Medium | Slow | Medium | New features, A/B testing, gradual rollout |
Decision factors:
- Application state: Stateful apps may have issues with rolling updates
- Traffic sensitivity: User-facing apps benefit from blue-green or canary
- Resource budget: Blue-green requires double the capacity
- Rollback speed: How fast must you recover from a bad deploy?
- Testing confidence: Low confidence = canary with analysis
When to Use / When Not to Use
When rolling updates make sense
Rolling updates work best in Kubernetes for stateless services that tolerate multiple versions running at once. If your application handles traffic gracefully while some instances run the old version and others run the new one, rolling updates are the simplest choice.
Use rolling updates when you need zero-downtime deployments and cannot afford double the infrastructure for blue-green. They are the Kubernetes default for a reason.
When blue-green makes sense
Blue-green is the right choice when you need instant switchover and instant rollback. Database migrations are the classic use case. You run the migration against the blue environment, validate it works, then switch all traffic to green in one atomic operation. If something goes wrong, you switch back to blue.
Blue-green also makes sense when you need to validate a full environment before taking traffic. You can run smoke tests against green before switching, and keep blue warm for a fast rollback.
When canary makes sense
Canary deployments are best for risky changes where you want real production traffic validation before committing fully. A new algorithm, a major UI redesign, a significant infrastructure change — these are all good canary candidates.
Use canary when you have the metrics infrastructure to validate the change automatically. Without metrics, canary is just slow blue-green.
Production Failure Scenarios
Common Deployment Failures
| Failure | Impact | Mitigation |
|---|---|---|
| Rolling update pods crash during transition | Service degraded during deploy | Set maxUnavailable: 0, monitor closely |
| Blue-green traffic switch fails | Half traffic goes to old version | Test traffic switch in staging, use weighted routing |
| Canary analysis triggers on unrelated metric | Healthy deploy blocked | Use metrics specific to the change |
| PDB blocks necessary eviction | Cluster upgrade blocked | Set PDB appropriately, do not overprotect |
| Service selector mismatch after switch | Traffic routed to wrong pods | Validate selectors match before switching |
Deployment Rollback Flow
flowchart TD
    A[Deploy New Version] --> B{Health Check Pass?}
    B -->|No| C[Rollback to Previous]
    B -->|Yes| D[Monitor for 10 min]
    D --> E{Metrics OK?}
    E -->|Yes| F[Deployment Complete]
    E -->|No| G[Auto Rollback]
    C --> H[Alert Team]
    G --> H
    H --> I[Investigate Root Cause]
Observability Hooks
Track deployments to catch failures early and measure deployment health.
What to monitor:
- Deployment duration (spot stuck deployments)
- Pod restart count during rollout
- Error rate spike during transition
- Traffic distribution after switch
- Rollback frequency per service
# Check rollout status
kubectl rollout status deployment/myapp --timeout=5m
# Check pod age during rollout
kubectl get pods -l app=myapp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
# View rollout history
kubectl rollout history deployment/myapp
Common Pitfalls / Anti-Patterns
Not testing the rollback procedure
A rollback strategy you have never tested is not a rollback strategy. Practice rolling back in staging so you know what happens when you call kubectl rollout undo in production at 2am.
Setting PDB too aggressively
PodDisruptionBudgets that require 100% availability block legitimate cluster operations like node upgrades. A PDB that says “always keep 3 pods available” on a 3-replica deployment means no pod can ever be evicted.
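A sketch of a budget that protects availability while still letting the cluster do its job (assuming the myapp labels used throughout this guide):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  maxUnavailable: 1   # always leaves headroom for a node drain to proceed
  selector:
    matchLabels:
      app: myapp
```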
Using the same strategy for all services
A simple stateless API and a complex stateful service with database connections need different deployment strategies. Cookie-cutter approaches lead to either over-engineering simple services or under-protecting complex ones.
Ignoring database schema changes
Deployment strategies handle application versions, not schema migrations. If your new version requires a new column that the old version cannot handle, deploying the new version before the migration is a disaster. Treat database migrations as a separate release concern.
Trade-off Summary
| Strategy | Deployment Time | Resource Cost | Rollback Speed | Risk Level |
|---|---|---|---|---|
| Rolling update | Moderate (proportional to batch size) | Low (no extra capacity) | Minutes (reverse batch) | Low-Medium |
| Blue-green | Fast (instant switchover) | 2x (double infrastructure) | Instant (switch traffic back) | Low |
| Canary | Gradual (traffic shifting) | Low-Medium (few extra pods) | Fast (drop traffic to new) | Low |
| Recreate | Fast (no orchestration) | Zero extra | Minutes (redeploy old version) | High |
Quick Recap
Key Takeaways
- Rolling updates are the Kubernetes default for a reason — they work for stateless services
- Blue-green gives you instant switchover and instant rollback for higher confidence
- Canary reduces risk by validating with real traffic before full rollout
- Always test your rollback procedure in staging, not for the first time in production
- Monitor deployment metrics: duration, error rate, pod restarts
Deployment Checklist
# Before deployment
kubectl rollout history deployment/myapp
kubectl get pdb myapp -o yaml
# During deployment
kubectl rollout status deployment/myapp --timeout=10m
kubectl get pods -l app=myapp --watch
# After deployment
kubectl rollout status deployment/myapp
kubectl logs -l app=myapp --tail=100 | grep ERROR
kubectl get events --sort-by='.lastTimestamp' | grep myapp
Interview Questions
Q: You need to deploy a database migration as part of a new version. The migration is backwards-incompatible. How do you handle the deployment?
A: Backwards-incompatible migrations require a multi-phase approach. Option 1: deploy the new application version alongside the old, run the migration while both versions are running, then cut over traffic once the migration completes. Option 2: use the expand-contract pattern — first deploy schema changes that are backwards-compatible (new columns with defaults, new tables), then deploy the new application code, then clean up old schema. For truly incompatible changes, blue-green with a migration freeze window is often the safest. Never run migrations as part of the deployment pipeline without a rollback plan.
Q: A canary deployment is sending 10% of traffic to the new version, and error rates spike. What do you do?
A: Immediately halt the rollout: reduce canary traffic to 0% or revert to the previous version using your traffic management tool (Argo Rollouts, Flagger, or your service mesh). Do not try to debug while serving traffic to users. After reverting, investigate: check application logs and metrics for the new version, look for differences in configuration or environment variables, verify the new version is reading from the correct data stores. Common causes: the new version has a subtle bug that only manifests at scale, dependency connectivity issues, or incorrect resource configuration.
Q: How do you design a deployment strategy for a stateful service like Kafka that requires zero data loss?
A: Stateful services need careful sequencing: scale up the new brokers before decommissioning old ones, wait for topic replication to catch up, then migrate partition leadership. Use Kafka’s built-in partition reassignment tool to move partitions safely. Set unclean.leader.election.enable=false to prevent data loss during broker failures. For Kafka specifically, use Strimzi or Kafka Operator on Kubernetes for managed StatefulSets. Always test the failure scenario in a staging environment first. Incremental rollout with careful monitoring of replication lag is essential.
Q: What are PodDisruptionBudgets and why do they matter during deployments?
A: PDBs ensure a minimum number of pods remain available during voluntary disruptions like node drains and deployments. Without PDBs, Kubernetes could evict too many pods simultaneously, causing service disruption. Set minAvailable or maxUnavailable based on your availability requirements. For stateful services with replication, minAvailable: 1 ensures at least one replica stays up. During deployments with multiple replicas, PDBs prevent Kubernetes from terminating too many pods at once, maintaining quorum for clustered applications.
Q: You want to deploy to 1000 nodes but avoid a thundering herd problem. How do you approach this?
A: The thundering herd problem occurs when many nodes pull images or restart simultaneously, overwhelming the registry or network. Avoid by: configuring RollingUpdate with maxSurge: 10-25% and maxUnavailable: 0 so updates happen in controlled batches, staggering deployments across node pools if you have multiple pools, using a wave-based deployment approach where you tag nodes and deploy to wave 1, wait for stability, then proceed. For image pulls specifically, use a local registry mirror or cache (Harbor, Amazon ECR), pre-pull images onto nodes, and set imagePullPolicy: IfNotPresent.
Conclusion
Each deployment strategy serves different needs. Rolling updates work well in Kubernetes and require minimal extra resources. Blue-green deployments provide instant switchover and easy rollback at the cost of double infrastructure. Canary deployments offer granular control and risk reduction through gradual traffic shifting. For more on automated deployments, see our CI/CD Pipelines guide, and for GitOps patterns, see our GitOps article.