Chaos Engineering: Breaking Things on Purpose
Chaos engineering injects failures into production systems to find weaknesses before they cause outages. Learn chaos experiments, game days, and fault injection.
Introduction
You cannot test resilience by reading code. A system that handles normal traffic perfectly can fail spectacularly under partial failures. Dependencies fail in combinations you did not anticipate. Caches expire simultaneously. Network partitions isolate data centers.
Traditional testing rarely finds these weaknesses. Unit tests test components. Integration tests test connections. Load tests test performance. None of them test what happens when parts of your system fail.
Chaos engineering fills this gap. You deliberately break things to discover what breaks, then fix the weakness before users discover it for you.
Core Concepts
Netflix formalized chaos engineering principles. Rather than guessing what might fail, you design experiments to discover weaknesses systematically. Each principle builds toward a culture of proactive resilience testing.
Building a Hypothesis
Before running an experiment, state what you expect to happen. “If we kill one instance of Service A, requests will reroute to other instances with less than 1% errors.” This gives your experiment a clear pass/fail criterion.
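One way to make that pass/fail criterion explicit is to write the hypothesis down as data. The sketch below is illustrative only: the `Hypothesis` class, its field names, and the `error_rate` metric key are assumptions, not part of any chaos tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A falsifiable statement about one metric during an experiment (hypothetical helper)."""
    description: str
    metric: str
    max_allowed: float  # the experiment passes while observed <= max_allowed

    def holds(self, observed: dict) -> bool:
        # A missing metric counts as a failure: we cannot confirm what we did not measure.
        value = observed.get(self.metric)
        return value is not None and value <= self.max_allowed


reroute = Hypothesis(
    description="Killing one instance of Service A keeps errors under 1%",
    metric="error_rate",
    max_allowed=0.01,
)

print(reroute.holds({"error_rate": 0.004}))  # True
print(reroute.holds({"error_rate": 0.03}))   # False
```

Treating a missing measurement as a failure is a deliberate choice here: an experiment nobody measured should not count as a pass.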
Varying Real-World Events
Focus on real failures. Server crashes, network latency, disk full, CPU exhaustion. Not theoretical attacks on hypothetical vulnerabilities.
Running in Production
Test environments never match production. Traffic patterns differ. Dependencies differ. Load differs. Run experiments in production, but with safeguards.
Automating Experiments
Manual experiments are a reasonable way to start, but they do not scale and they stop when people get busy. Run experiments continuously. Automate the injection and measurement.
Minimizing Blast Radius
Start small. Kill one instance, not three. Introduce 100ms latency, not 10 seconds. You want to learn something without causing an actual outage.
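A blast radius limit can be enforced in code rather than left to judgment. The following is a minimal sketch under stated assumptions: the 10% cap, the function name `pick_targets`, and the instance IDs are all made up for illustration.

```python
import math
import random


def pick_targets(instances, max_fraction=0.1, rng=random):
    """Choose a small random subset of instances to disrupt.

    Never selects more than max_fraction of the fleet (but at least one),
    and never the last remaining instance of a service.
    """
    if len(instances) < 2:
        return []  # nothing redundant to fail over to, so do not experiment
    limit = max(1, math.floor(len(instances) * max_fraction))
    limit = min(limit, len(instances) - 1)
    return rng.sample(list(instances), limit)


fleet = [f"i-{n:03d}" for n in range(30)]  # hypothetical instance IDs
print(len(pick_targets(fleet)))      # 3
print(len(pick_targets(["i-000"])))  # 0
```

Refusing to act on a single-instance service is the important part: there the experiment would be an outage, not a test of failover.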
Types of Chaos Experiments
Different failure modes reveal different weaknesses. Infrastructure failures expose gaps in your deployment and recovery logic. Network failures test how your services handle unreliable communication. Application failures validate your error handling and resource management. Dependency failures check whether your system degrades gracefully when upstream services misbehave.
Infrastructure Chaos
Kill servers. Stop containers. Fill disks. Exhaust memory. These experiments test whether your infrastructure recovers from node failures.
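```python
import random
import boto3

ec2 = boto3.client('ec2')

# Chaos experiment: kill a random instance of the payment service
def kill_random_instance():
    response = ec2.describe_instances(Filters=[{'Name': 'tag:Service', 'Values': ['payment']}])
    # describe_instances groups instances under Reservations, so flatten first
    instances = [i for r in response['Reservations'] for i in r['Instances']]
    target = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[target['InstanceId']])
    return target
```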
Network Chaos
Introduce latency. Drop packets. Partition the network. Break DNS. These experiments test how your system handles network issues.
```bash
# Introduce 200ms latency to a service
tc qdisc add dev eth0 root netem delay 200ms
# Remove latency
tc qdisc del dev eth0 root netem
```
Application Chaos
Crash application processes. Throw exceptions. Consume resources. These experiments test application-level fault tolerance.
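```python
import random
import psutil

# Kill one randomly chosen process of a service
def kill_process(service_name):
    # p.info['name'] can be None for processes we are not allowed to inspect
    matches = [p for p in psutil.process_iter(['pid', 'name', 'cmdline'])
               if service_name in (p.info['name'] or '')]
    if not matches:
        return None
    victim = random.choice(matches)
    victim.kill()
    return victim.info['pid']
```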
Dependency Chaos
Fail external services. Introduce latency to databases. Return errors from APIs. These experiments test how your system handles failures in the services it depends on.
```python
import subprocess

# Chaos experiment: make a dependency unreachable (requires root)
def inject_service_failure(service_ip):
    # DROP makes outbound calls hang until they time out; it does not return a 503
    subprocess.run(
        ['iptables', '-A', 'OUTPUT', '-p', 'tcp', '-d', service_ip, '-j', 'DROP'],
        check=True,
    )
```
Running a Chaos Experiment
A chaos experiment follows a structured lifecycle: define what normal looks like, form a hypothesis about what will happen, inject the failure, observe the results, decide whether to stop or continue, then restore the system. This loop builds confidence that your system handles real failures gracefully.
Experiment Steps
Step 1: Define Steady State
Before breaking things, measure normal behavior. What is your baseline error rate, latency, and throughput? Know what “normal” looks like so you recognize problems when they appear.
```python
import requests

def measure_steady_state(samples=1000):
    errors = 0
    for _ in range(samples):
        try:
            response = requests.get('http://api.example.com/health', timeout=5)
            if response.status_code != 200:
                errors += 1
        except requests.RequestException:
            # Timeouts and connection failures count as errors too
            errors += 1
    return errors / samples  # Baseline error rate
```
Step 2: Form a Hypothesis
State what you expect: “If we introduce 500ms latency to the database, error rate will stay below 1% because connections will time out and retry.”
Step 3: Inject Failure
Start small. Introduce the failure.
```python
# Introduce 500ms latency to database
chaos_controller.inject_latency('database', 500)
```
Step 4: Observe
Measure the same metrics you measured for steady state. Did the system behave as you expected?
```python
error_rate = measure_steady_state()
if error_rate > 0.01:  # Hypothesis violated
    alert_team("Chaos experiment failed: error rate exceeded threshold")
    rollback()
```
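A fixed threshold is the simplest check. When several metrics are tracked, it can help to compare each one against its own steady-state value instead. The sketch below assumes every metric is "lower is better" and uses a 20% relative tolerance; the function name, metric names, and the tolerance are illustrative choices.

```python
def deviations(baseline, observed, tolerance=0.2):
    """Return metrics whose observed value exceeds baseline by more than `tolerance` (relative)."""
    out = {}
    for name, base in baseline.items():
        value = observed[name]
        allowed = base * (1 + tolerance)
        if value > allowed:
            out[name] = (base, value)
    return out


baseline = {"error_rate": 0.002, "p99_latency_ms": 180.0}
during = {"error_rate": 0.0021, "p99_latency_ms": 410.0}
print(deviations(baseline, during))  # {'p99_latency_ms': (180.0, 410.0)}
```

Here the error rate stayed within tolerance while P99 latency more than doubled, which is the kind of result a single error-rate threshold would miss.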
Step 5: Stop or Mitigate
If the system handled the failure, document the finding. If the system failed catastrophically, stop the experiment and escalate. Fix the weakness before running more experiments.
Step 6: Restore
Always restore the system to normal state. Automated rollbacks ensure nothing stays broken.
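One way to make restoration hard to forget is to tie it to the injection itself. This is a minimal sketch using a context manager; `inject` and `restore` stand in for whatever your tooling provides, and the `state` dictionary is only a stand-in for a real system.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(inject, restore):
    """Run a block with a fault active and always undo it afterwards."""
    inject()
    try:
        yield
    finally:
        # Runs on normal exit, on exceptions, and on KeyboardInterrupt.
        restore()


state = {"latency_ms": 0}  # stand-in for the real system

def add_latency():
    state["latency_ms"] = 500

def remove_latency():
    state["latency_ms"] = 0

try:
    with injected_fault(add_latency, remove_latency):
        raise RuntimeError("observation step crashed")
except RuntimeError:
    pass

print(state["latency_ms"])  # 0
```

The fault is removed even though the observation step raised. This does not cover a crash of the process running the experiment, so a time-limited fault or an external watchdog is still worth having.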
Game Days
A game day turns chaos engineering into a team sport. Rather than running experiments in isolation, you bring the whole team together to practice failure scenarios. The goal is to build muscle memory for incident response so that when real failures occur, everyone knows their role and executes without panic.
Preparing for a Game Day
- Define scenarios to test
- Assign roles: injector, observers, incident commander
- Set start and end times
- Prepare rollback procedures
- Notify stakeholders
Running the Exercise
- Start with a scenario walkthrough
- Inject the failure
- Observe and document
- Declare an incident if needed
- Debrief after
Game Day Scenarios
- Single data center failure
- Database becomes read-only
- API rate limit exceeded
- TLS certificate expires
- Message queue backs up
Tools for Chaos Engineering
The chaos tooling landscape spans from simple scripts to enterprise platforms. Netflix built the first widely-known tool, but the ecosystem has grown to support Kubernetes-native chaos, cloud-provider managed services, and custom automation. Choosing the right tool depends on your environment, team size, and maturity with chaos engineering.
Tool Categories
Chaos Monkey
Netflix’s original. Randomly kills EC2 instances. Simple but effective for testing basic resilience.
Gremlin
A commercial chaos tool with Kubernetes support, targeting options, and safety features. Suitable for teams beginning with chaos engineering.
Litmus
Open source chaos for Kubernetes. Chaos experiments defined as custom resources. Integrates with Prometheus for monitoring.
Your Own Scripts
For simple experiments, scripts work fine. You do not need a full chaos platform to get started.
Tool Evaluation Matrix
| Tool | Type | Kubernetes Support | Target Options | Safety Features | Cost | Best For |
|---|---|---|---|---|---|---|
| Chaos Monkey | Open Source | No (AWS only) | Instance-based | Basic | Free | Netflix-style EC2 killing |
| Gremlin | Commercial | Yes | Service, Pod, Node | Attack visualizer, halt button | Paid | Teams starting with chaos; enterprise support |
| Litmus | Open Source | Yes (native) | Pod, Node, Network | Custom resources, Argo integration | Free | Kubernetes-native environments; GitOps workflows |
| AWS FIS | Managed Service | EKS only (AWS-native) | AWS resources | CloudWatch integration, IAM controls | Pay-per-use | AWS workloads without third-party tools |
| Azure Chaos Studio | Managed Service | AKS only (Azure-native) | VM, Kubernetes, PaaS | Logic Apps integration | Pay-per-use | Azure workloads |
| Custom Scripts | DIY | Varies | Highly flexible | Depends on implementation | Free (dev time) | Simple experiments; specific edge cases |
Reference and Operations
Selection Criteria
- AWS workloads: Use AWS Fault Injection Simulator (native, no extra tools)
- Azure workloads: Use Azure Chaos Studio (native, no extra tools)
- Kubernetes-first: Litmus for open source, Gremlin for enterprise features
- Simple EC2 chaos: Chaos Monkey or custom scripts
- Enterprise requirements: Gremlin provides SLA, support, and advanced targeting
CI/CD Integration
Integrate chaos experiments into your deployment pipeline to validate resilience automatically:
```yaml
# GitHub Actions workflow for chaos testing
name: Chaos Engineering Pipeline
on:
  push:
    branches: [main]
  schedule:
    # Run chaos experiments weekly in production
    - cron: "0 2 * * 0"
jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Steady State Check
        run: |
          ./scripts/measure_steady_state.sh
          echo "baseline_error_rate=$(cat steady_state.txt)" >> "$GITHUB_ENV"
      - name: Inject Chaos - Database Latency
        run: |
          kubectl label namespace chaos-testing experiment=active --overwrite
          helm install chaos-litmus litmuschaos/litmus -n chaos-testing
          # Wait for steady state to stabilize
          sleep 30
          # Run the experiment
          kubectl apply -f experiments/database-latency.yaml -n payment
      - name: Monitor Impact
        run: |
          ./scripts/monitor_during_chaos.sh
          ERROR_RATE=$(cat error_rate.txt)
          # Compare against the 1% threshold from the hypothesis
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "Experiment failed: error rate $ERROR_RATE exceeds threshold"
            exit 1
          fi
      - name: Cleanup
        if: always()
        run: |
          kubectl delete -f experiments/database-latency.yaml -n payment || true
          helm uninstall chaos-litmus -n chaos-testing || true
      - name: Document Results
        if: always()
        run: |
          ./scripts/upload_chaos_results.sh
```
```yaml
# Litmus experiment definition for CI/CD
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: db-latency-chaos
  namespace: payment
spec:
  appinfo:
    appns: payment
    applabel: "app=checkout"
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: db-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: LATENCY
              value: "500"
            - name: DB_component
              value: "postgres"
```
Cloud Provider Chaos Services
AWS and Azure both ship built-in chaos testing that ties directly into their infrastructure. These managed services handle agent deployment, IAM integration, and rollback automation out of the box. The tradeoff is vendor lock-in, but for teams running predominantly on one cloud, native tools remove operational overhead.
AWS Fault Injection Simulator (FIS)
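AWS FIS lets you inject faults into EC2, ECS, EKS, and RDS without managing separate tooling. You define experiments as JSON templates and run them against target resources. The fragment below shows only the targets and actions of a template that simulates an EC2 instance failure; a complete template also needs an IAM role and stop conditions.
```json
{
  "description": "Terminate one random EC2 instance",
  "targets": {
    "target-group": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Service": "payment" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "terminate": { "actionId": "aws:ec2:terminate-instances", "targets": { "Instances": "target-group" } }
  }
}
```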
AWS FIS uses CloudWatch alarms as stop conditions: if a metric degrades beyond its threshold, the experiment halts automatically. This is safer than open-loop chaos, where nothing watches the impact while the fault is active.
Azure Chaos Studio
Azure Chaos Studio works similarly, with support for virtual machines, AKS, Service Bus, and Cosmos DB. Experiments use a visual designer or programmatic API.
```json
{
  "actions": [
    {
      "type": "faults.azure.vms.terminate",
      "parameters": {
        "nodeCount": 1,
        "abruptionDuration": "PT30S"
      }
    }
  ],
  "selectors": [
    {
      "type": "TagSelector",
      "tags": [{ "key": "chaos", "value": "enabled" }]
    }
  ]
}
```
Cloud-native tools have a real advantage: no agent deployment, native IAM integration, and rollback automation built in. The tradeoff is being locked to one cloud provider.
When to Use Cloud-Native vs Third-Party
| Criteria | Cloud-Native (FIS, Chaos Studio) | Third-Party (Gremlin, Litmus) |
|---|---|---|
| Multi-cloud | No | Yes |
| Deep infrastructure integration | Yes | Requires agents |
| Managed service | Yes | No |
| Cost | Pay-per-use | Subscription (Gremlin) or free plus operating effort (Litmus) |
Steady-State Automation
After running chaos experiments, feed the results back into your system automatically. Steady-state automation means your system continuously validates its own resilience instead of treating chaos testing as a one-off exercise.
A minimal steady-state loop:
```python
import time

# Define what "healthy" looks like
BASELINE = {
    'p99_latency_ms': 200,
    'error_rate': 0.01,
    'availability': 0.999,
}

def steady_state_loop(interval_seconds=3600):
    while True:
        # Run a small chaos experiment
        run_minimal_chaos()
        # Measure actual state
        actual = measure_system_health()
        # Compare against baseline
        if mismatches(actual, BASELINE):
            alert_oncall()
            rollback_experiment()
        time.sleep(interval_seconds)  # Run every hour by default
```
This does not replace full chaos engineering. It catches regressions before they reach production.
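The loop leaves the comparison step undefined. One possible shape for `mismatches` is sketched below, assuming latency and error rate are upper bounds while availability is a lower bound; the `HIGHER_IS_BETTER` set and the sample values are illustrative.

```python
# Metrics where a higher value is better; everything else is treated as "lower is better".
HIGHER_IS_BETTER = {"availability"}


def mismatches(actual, baseline):
    """Return the names of metrics that violate their baseline threshold."""
    violated = []
    for name, threshold in baseline.items():
        value = actual.get(name)
        if value is None:
            violated.append(name)  # an unmeasured metric is treated as a violation
        elif name in HIGHER_IS_BETTER and value < threshold:
            violated.append(name)
        elif name not in HIGHER_IS_BETTER and value > threshold:
            violated.append(name)
    return violated


baseline = {"p99_latency_ms": 200, "error_rate": 0.01, "availability": 0.999}
healthy = {"p99_latency_ms": 150, "error_rate": 0.002, "availability": 0.9995}
degraded = {"p99_latency_ms": 450, "error_rate": 0.002, "availability": 0.99}

print(mismatches(healthy, baseline))   # []
print(mismatches(degraded, baseline))  # ['p99_latency_ms', 'availability']
```

An empty list is falsy, so the result can be used directly in an `if` while still telling the on-call engineer which metrics drifted.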
Building a Chaos Practice
Moving from occasional experiments to a mature chaos practice requires structure. You start small to build confidence, then gradually increase scope as you learn what breaks. Automation keeps experiments running continuously, and sharing findings ensures the whole organization benefits from what you discover.
Start Small
Kill one server in a test environment. See what happens. Fix what breaks. This teaches more than any book.
Graduate to Production
Once comfortable in test, run experiments in production. Start with low-risk experiments: kill instances in a redundant service, introduce small latency.
Automate
Move from manual to automated experiments. Run them continuously in production. The more you automate, the more you learn.
Share Findings
When an experiment reveals a weakness, share the finding. Write post-mortems. Add to runbooks. The goal is organizational learning.
When to Use / When Not to Use Chaos Engineering
Use chaos engineering when:
- You have production systems with real users who would be affected by outages
- Your team has monitoring and observability in place to measure impact
- You have rollback procedures that can stop experiments safely
- You have identified specific failure modes you want to validate
Do not use chaos engineering when:
- Your system is unstable and cannot handle normal load even without injected failures
- You lack observability to distinguish experiment impact from real failures
- You have no way to stop experiments if things go wrong
- Your organization is not prepared to act on findings
Trade-off Analysis
Kubernetes-based vs VM-based Chaos
| Aspect | Kubernetes Chaos | VM/Instance Chaos |
|---|---|---|
| Blast radius control | Namespace + pod selectors | Tag-based instance targeting |
| Injection precision | Container-level resource exhaustion | Instance-level termination |
| Recovery speed | Auto-healing via controllers | Manual or ASG-based recovery |
| Network chaos depth | Pod-to-pod, service-level partitioning | Subnet, security group blocking |
| Tool requirements | Litmus, Chaos Mesh, service mesh faults | Chaos Monkey, Gremlin, custom scripts |
| Learning curve | Higher (Kubernetes knowledge required) | Lower (standard cloud DevOps) |
| Failure mode coverage | Container crashes, OOMKilled, evicted pods | Instance termination, network partition |
Trade-off Summary
| Factor | With Chaos Engineering | Without Chaos Engineering |
|---|---|---|
| Failure Discovery | Proactive - find weaknesses before users | Reactive - users discover failures first |
| Confidence | High - validated under controlled conditions | Moderate - assumes architecture is sound |
| Risk | Controlled experiments with kill switches | Uncontrolled production failures |
| Cost | Tooling, training, experiment time | Outages, incident response, recovery |
| Team Skills | Requires chaos expertise and experiment design | Standard DevOps skills |
| Culture | Requires blameless post-mortem culture | Standard incident response |
| Time to Value | Weeks to months for meaningful experiments | Immediate, but reactive |
| Coverage | Discovers unknown-unknowns | Only known-knowns from monitoring |
Real-world Failure Scenarios
Actual chaos engineering deployments have uncovered critical weaknesses that traditional testing missed. These documented cases illustrate how controlled experiments reveal systemic risks.
Netflix - Cascading Database Failures
During a chaos experiment targeting database availability, Netflix discovered that losing a single database instance caused a cascading failure across multiple services. The root cause: connection pool exhaustion when retries flooded surviving instances. The fix was implementing connection pool limits and per-service circuit breakers.
LinkedIn - Service Mesh Partition Blind Spots
LinkedIn’s team ran network partition experiments and found that their service mesh’s health checks passed while actual inter-service communication was failing. The mesh marked pods healthy even when they could not receive traffic. This revealed a gap between liveness and availability that health checks alone did not catch.
Slack - Memory Leak Cascade
A chaos experiment that killed worker processes revealed a hidden memory leak in Slack’s job queue. When workers restarted, the queue would flood them with pending work, causing memory exhaustion again. The experiment exposed a restart loop that only occurred under specific load conditions impossible to reproduce in staging.
Amazon - Availability Zone Failure Simulation
Amazon’s internal chaos practice simulated full availability zone failures and discovered that their auto-scaling policies were too conservative for AZ-level loss. The scaling policies assumed gradual traffic increases, not sudden 33% capacity loss. This finding led to multi-AZ deployment requirements for all critical services.
Chess.com - DNS Chaos Revealed Cache Dependency
When Chess.com injected DNS failures, they discovered their application incorrectly cached DNS lookups for critical service discovery. Services failed to reconnect after DNS TTL expired because the application treated DNS failures as permanent rather than transient.
Common Pitfalls / Anti-Patterns
Teams new to chaos engineering often make predictable mistakes. These pitfalls can turn valuable experiments into real outages or wasted effort. Recognizing them helps you design safer experiments and extract more value from your chaos practice.
No Rollback Plan
Running chaos without a way to stop is reckless. Always have a rollback plan. If things go wrong, you need to be able to stop.
Testing in a Vacuum
Experiments that run but nobody watches teach nothing. Monitor the system during experiments. If you do not see the failure, you did not learn anything.
Too Aggressive Too Fast
Starting with large-scale failures causes real outages. Start with small failures. Gradually increase scope as you build confidence.
Ignoring Results
Experiments that reveal weaknesses but nobody fixes them are wasted. Follow through. The point is to improve resilience, not to watch things break.
Interview Questions
Use an external chaos approach: inject faults at the network layer with tc (traffic control) on the node itself, or use a service mesh's traffic management capabilities (such as Envoy's fault injection), which do not require per-service chaos agents. For Kubernetes, Chaos Mesh's NetworkChaos can inject faults without per-service sidecars; it relies on a node-level daemon that applies traffic rules inside the target pod's network namespace. Alternatively, use firewall rules on the node to drop or delay packets to specific pod IPs. If you truly cannot install anything, the service mesh's built-in fault injection is the cleanest approach.
Your application has an undocumented assumption about downstream service latency. This is a resilience gap: the application should have timeouts and retry logic that handles variable latency gracefully. The 200ms threshold is likely close to your current timeout configuration. When latency exceeds the timeout, requests fail. The fix involves setting explicit timeouts on all outbound calls, implementing retry with exponential backoff and jitter, adding circuit breakers to stop calling failing services, and documenting the latency SLOs your application depends on. This is also a chaos engineering success. You found a hidden weakness before users did.
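Retry with exponential backoff and jitter, mentioned above, can be sketched in a few lines. This is an illustration, not a production client: the function name, the default delays, and the injectable `sleep` and `rng` parameters (there to keep the example testable) are assumptions, and real code would catch only errors known to be transient.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on failure retry with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            # Cap the exponential delay, then pick a random point below the cap
            # so that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)


calls = {"count": 0}

def flaky():
    """Fails twice with a timeout, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("dependency too slow")
    return "ok"

print(call_with_retries(flaky, sleep=lambda s: None))  # ok
```

The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment, which is how retries turn a slow dependency into an overloaded one.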
Frame it around risk reduction and reliability metrics. Run a pre-chaos baseline: measure your system's steady-state metrics (error rate, latency, throughput). Then run an experiment that simulates a real failure mode, like a pod crash or network partition, and measure the blast radius. Compare the impact of that failure when handled versus unhandled. Present the findings: "When we kill a pod unannounced, error rates spike to X% for Y minutes. After adding readiness probes and proper pod disruption budgets, the same failure causes no observable impact." Quantify the risk in terms of potential downtime cost and translate to business impact.
Immediately stop the experiment using your chaos tool's abort mechanism (Litmus, Chaos Mesh, Gremlin all have clean abort). If the tool is not responding, use kubectl rollout undo to revert any deployment changes, and manually remove any injected faults (delete chaos CRDs). Scale up unaffected services to handle load if cascading failures are causing capacity issues. Once stable, run a post-mortem. Chaos engineering should always have a correlation ID and a rollback plan before starting. The cascading failure itself is a valuable finding. It means your circuit breakers and fallback mechanisms are not working as intended.
Chaos engineering proactively discovers weaknesses in a controlled, iterative experiment. You start small, measure the blast radius, and design fixes before failures occur. Disaster recovery testing validates that recovery procedures actually work after a failure. You simulate a full outage and execute your runbook to recover. Chaos engineering is proactive discovery; DR testing is reactive validation. Both are necessary. DR tests are typically lower frequency (quarterly or annually), while chaos experiments can run continuously in production.
A steady-state hypothesis defines what "normal" looks like for your system before you run an experiment. You measure metrics like error rate, latency P99, and throughput under normal conditions. These metrics become your baseline.
The hypothesis states what you expect to happen when you inject failure. You then run the experiment and compare the metrics against your baseline. If metrics deviate beyond acceptable thresholds, the hypothesis is disproven—your system is less resilient than expected.
Without a steady-state hypothesis, you cannot measure impact objectively. You are just causing chaos without learning anything.
Blast radius measures how much of your system is affected by an experiment. Start by identifying what components could be impacted: direct dependencies, downstream services, and shared resources.
Quantify impact before running: what percentage of requests would fail if this component goes down? What is the business impact per minute of degradation? Start experiments at the smallest scope that gives meaningful signal—kill one instance, not all instances.
Monitor blast radius during the experiment. If impact exceeds thresholds, abort immediately. The goal is to find weaknesses without causing real damage.
Infrastructure chaos: kill instances, simulate network partition, add latency, consume resources like CPU and memory. Tests infrastructure-level resilience.
Network chaos: inject network latency, DNS failures, packet loss, blackhole routing. Tests how the system handles network instability.
Application chaos: simulate service errors, inject exceptions, delay responses. Tests application-level fault tolerance.
Dependency chaos: fail dependent services, simulate downstream timeouts. Tests how the system handles upstream or downstream failures.
Most mature chaos programs progress from infrastructure to application chaos as team confidence grows.
SRE focuses on reliability targets (SLOs, SLIs, SLAs) and error budgets. Chaos engineering is a tool for validating that your system can maintain reliability under failure conditions. Game days, which are planned chaos experiment sessions, are a core SRE practice.
SRE's emphasis on measuring reliability and making data-driven decisions aligns with chaos engineering's hypothesis-driven approach. Both practices value learning from failures over assuming systems work.
Many organizations start chaos engineering as an SRE initiative because SRE teams already have the reliability mindset and metrics expertise needed to design meaningful experiments.
Chaos engineering can cause real harm if not practiced responsibly. Experiments in production without proper safeguards can cause real outages affecting real users. Always have a kill switch and the ability to abort immediately.
Consider stakeholder impact: notify relevant teams before running production experiments. Consider customer impact: even "safe" experiments might violate SLAs if they cause degradation. Consider safety-critical systems: chaos engineering in healthcare, aviation, or financial trading systems requires extreme caution.
The principle of "first, do no harm" applies. Start in staging. Automate rollback. Only move to production when you have confidence in your tooling and process.
LitmusChaos is Kubernetes-native and open source, with chaos experiments defined as CRDs you can manage via GitOps. AWS FIS is a managed service that runs on AWS infrastructure without installing agents on your targets. The core difference is ecosystem: Litmus works anywhere Kubernetes runs, while FIS only works on AWS resources.
Choose Litmus when you run multi-cloud or hybrid environments, already use Kubernetes extensively, or want to contribute to an open source project. Choose AWS FIS when your workload is entirely on AWS, you want native CloudWatch integration, and you prefer managed services over self-managed tooling.
Start in staging with a single, predictable failure mode. Choose an experiment with a tight blast radius: kill one non-production instance, introduce 100ms latency to a non-critical service. Define the hypothesis clearly and set automatic rollback triggers. Run the pilot, measure, and document findings.
Present results to management before proposing production experiments. Quantify what you learned: "In staging, our payment service degraded gracefully when the email service added 200ms latency because we already had a circuit breaker in place. This confirmed our architecture handles this failure mode." The pilot proves value while managing perception risk.
Service meshes like Linkerd and Istio provide traffic management primitives that double as chaos injection points. In Envoy-based meshes such as Istio, the proxy's fault injection feature can delay or abort requests at the proxy layer without code changes or a separate chaos agent for every service. This is network-level chaos at its cleanest.
Using a service mesh for chaos has advantages: you do not install chaos agents per service, faults are defined declaratively, and the mesh handles cleanup automatically. The tradeoff is coupling your chaos tooling to your service mesh choice. If you are already running a service mesh, its built-in fault injection is often the fastest path to chaos experiments.
DR testing validates that recovery procedures work; chaos engineering discovers what needs recovering. Use them together: run a chaos experiment that simulates your target failure mode (data center loss, database corruption, network partition), then execute your DR runbook to recover. Measure the recovery time against your RTO target.
For example, inject a failure that makes your primary database unreachable. Start your DR procedure: promote the standby, update DNS, verify replication. If recovery completes within your RTO, your DR procedure is validated. If not, you have a specific gap to fix before a real disaster.
Run experiments during maintenance windows when SLAs may be temporarily relaxed. Define blast radius boundaries that stay well below observable degradation thresholds: if your SLA is 99.9% (43 minutes monthly downtime), a 5-minute experiment causing 0.1% extra error rate might still breach that window. Better to run in staging or use synthetic traffic that does not count against SLA metrics.
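The arithmetic behind that budget is worth writing out, because the answer depends on how the contract counts downtime. The sketch below assumes a 30-day month and shows two readings: one weighted by failed requests, and one where every minute of the experiment window counts as downtime. Function names are illustrative.

```python
def downtime_budget_minutes(slo, days=30):
    """Minutes of full downtime allowed per period for a given availability target."""
    return (1 - slo) * days * 24 * 60


budget = downtime_budget_minutes(0.999)

# Reading 1: budget is consumed in proportion to failed requests.
request_weighted = 5 * 0.001
# Reading 2: any degraded minute counts as a full minute of downtime.
window_based = 5

print(round(budget, 1))                       # 43.2
print(round(request_weighted, 3))             # 0.005
print(round(window_based / budget * 100, 1))  # 11.6
```

Under the request-weighted reading the experiment is negligible; under the window-based reading a single five-minute run uses about 11.6% of the monthly budget. Check which definition your SLA uses before scheduling production experiments.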
Get legal sign-off on your chaos program before running in production. Some enterprise contracts require advance notice of any production changes, including experiments. Treat chaos engineering as a change management event that requires documentation and approval.
Measure the same metrics you use for steady-state baseline: error rate, latency P99, throughput, and resource utilization. The key is comparing experiment metrics against baseline. If error rate jumps from 0.1% to 2%, your hypothesis likely failed unless you predicted that degradation.
Also watch for cascading effects: if latency on Service A triggers retries that exhaust Service B's connection pool, you have a cascade. This is a valuable finding even if it was not your target failure. Set automated alerts with abort thresholds: if error rate exceeds 5%, stop the experiment immediately.
Chaos experiments run against production, not against the deployment pipeline. What CI/CD integration means is scheduling automated experiments that run between deployments or on a cron schedule, not as gate-keeping steps that block deployments.
A deployment pipeline that blocks on chaos experiments will create flaky pipelines when experiments fail for reasons unrelated to the code change. Instead, run steady-state chaos loops in production and use pipeline failures only when a specific chaos regression is detected that maps directly to the change being deployed.
A steady-state hypothesis is what you expect to happen during a specific experiment: "if we kill one instance, error rate will stay below 1%." It is experiment-scoped and hypothesis-driven. Steady-state automation is a continuous loop that runs small chaos experiments repeatedly to catch regressions before they reach production.
Steady-state automation runs continuously and compares system health against baseline on an ongoing basis. The hypothesis is implicit: if the system degrades beyond threshold, something is wrong. You do not need a specific failure mode in mind for automation; the loop just validates that healthy looks like healthy.
Chaos experiments reveal how your system behaves under stress, which informs capacity planning. When you kill an instance and latency spikes because remaining instances are overwhelmed, you have found a capacity gap. The experiment tells you how many instances you need to maintain acceptable performance during partial failures.
Run experiments that simulate peak load conditions combined with partial infrastructure failure. If your system degrades gracefully at 50% capacity under normal load but fails catastrophically at 50% capacity with one instance down, you know you need more headroom or better load distribution.
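The headroom question reduces to simple arithmetic. The figures below are made up, and the model assumes traffic is balanced evenly across instances, which real load balancers only approximate.

```python
import math


def instances_needed(peak_rps, rps_per_instance, tolerated_failures=1):
    """Instances required to serve peak load with `tolerated_failures` instances down."""
    serving = math.ceil(peak_rps / rps_per_instance)
    return serving + tolerated_failures


def load_factor_after_loss(total_instances, lost):
    """How much more traffic each surviving instance takes, assuming even balancing."""
    return total_instances / (total_instances - lost)


print(instances_needed(peak_rps=4500, rps_per_instance=500))  # 10
print(load_factor_after_loss(total_instances=3, lost=1))      # 1.5
```

The second number is the one chaos experiments tend to surface: with three instances, losing one means each survivor takes 1.5 times its usual traffic, so each must normally run below two thirds of its capacity.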
Start with education: show them real post-mortems from major outages where chaos engineering would have caught the weakness. The framing matters: chaos engineering is not about breaking things for fun; it is about learning what breaks before users experience it.
Make the first experiment collaborative: involve skeptics in designing and observing the experiment. When they see the process work and witness findings they did not expect, buy-in follows. Frame success as "we found X and fixed it," not "we broke things on purpose." The goal is confidence, not chaos for chaos's sake.
Further Reading
Internal Resources
- Disaster Recovery — Planning for failures at the infrastructure and data layer
- Resilience Patterns — Retry, timeout, circuit breaker, and bulkhead patterns that work with chaos engineering
- Circuit Breaker Pattern — Preventing cascade failures that chaos experiments may uncover
- System Design Roadmap — Complete learning path covering chaos engineering in context
External Resources
- Principles of Chaos Engineering — The foundational document defining chaos engineering scope and methodology
- Chaos Monkey — Netflix’s original chaos engineering tool that randomly kills production instances
- Gremlin — Commercial chaos engineering platform with guided experiments and safety features
- AWS Fault Injection Simulator — Managed chaos engineering service with built-in safety controls and automation
- LitmusChaos — CNCF-hosted chaos engineering platform for Kubernetes and cloud-native workloads
- Chaos Engineering at Netflix and Beyond (PDF) — Academic treatment of chaos engineering principles, Netflix’s implementation, and lessons learned
- The Veritas Session — Netflix talk on the evolution from Chaos Monkey to proactive failure injection
Conclusion
Chaos engineering is not about breaking things. It is about finding weaknesses before users find them. The goal is to build confidence that your system will survive failures.
The first time you run chaos, you will find problems. Fix them. Run again. Eventually, your system handles chaos and your team handles incidents with confidence.
Key Bullets:
- Chaos engineering is about finding weaknesses before users find them
- Every experiment needs a hypothesis and measurable pass/fail criteria
- Start small: one instance, small latency injection, limited scope
- Automate experiments for continuous validation
- Follow through on findings. Track to completion and fix the weaknesses
Copy/Paste Checklist:
Before running chaos experiment:
[ ] Define steady-state baseline metrics
[ ] State hypothesis: what do you expect to happen?
[ ] Define pass/fail criteria
[ ] Assign roles: injector, observer, incident commander
[ ] Prepare rollback/kill switch
[ ] Notify stakeholders if production
[ ] Verify monitoring is capturing all metrics
[ ] Document expected duration
During experiment:
[ ] Monitor steady-state metrics continuously
[ ] Watch for cascading effects
[ ] Stop if criteria exceeded
After experiment:
[ ] Restore system to normal state
[ ] Document findings
[ ] Schedule follow-up for any weaknesses found
[ ] Update runbooks with lessons learned
Observability Checklist
- Metrics:
  - Error rate (baseline vs experiment)
  - Latency P50/P95/P99
  - Throughput (requests per second)
  - Resource utilization (CPU, memory, disk, network)
  - Dependency health indicators
- Logs:
  - Experiment start/end timestamps with hypothesis
  - System behavior observations during experiment
  - Any anomalies detected
  - Post-experiment findings and recommendations
- Alerts:
  - Error rate exceeds 1% during experiment (warning) / 5% (stop experiment)
  - Latency P99 doubles during experiment
  - Any resource exhaustion detected
  - Experiment duration exceeds planned window
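The alert rules above can be encoded as a small decision function that an experiment runner polls. This is a sketch: treating an overrun of the planned window as a stop is an assumption, resource exhaustion is left out because it depends on what your monitoring exposes, and the function and parameter names are illustrative.

```python
def evaluate_alerts(error_rate, p99_ms, baseline_p99_ms, elapsed_s, planned_s):
    """Map the alert rules to an action: 'continue', 'warn', or 'stop'."""
    if error_rate > 0.05 or elapsed_s > planned_s:
        return "stop"
    if error_rate > 0.01 or p99_ms >= 2 * baseline_p99_ms:
        return "warn"
    return "continue"


print(evaluate_alerts(0.002, 190, 180, 30, 60))  # continue
print(evaluate_alerts(0.02, 190, 180, 30, 60))   # warn
print(evaluate_alerts(0.002, 400, 180, 30, 60))  # warn
print(evaluate_alerts(0.08, 190, 180, 30, 60))   # stop
```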
Security Checklist
- Limit chaos tooling access to authorized personnel only
- Audit log all experiment executions with timestamps and owners
- Prevent experiments from affecting security controls (authentication, encryption)
- Ensure experiments cannot exfiltrate data or breach isolation
- Validate that rollback procedures do not introduce security gaps
- Test chaos tooling for vulnerabilities (injection, privilege escalation)
For more on resilience, see Disaster Recovery and Resilience Patterns.