Chaos Engineering: Breaking Things on Purpose
Netflix did something strange. They created a program that intentionally killed servers in production. They called it Chaos Monkey. The idea was straightforward: find out what breaks before your users do.
Chaos engineering is the practice of injecting failures into systems to test their resilience. You make controlled chaos to discover the uncontrolled chaos lurking in your architecture.
This article covers the principles, how to run chaos experiments, game days, and building a chaos engineering practice.
Why Chaos Engineering
You cannot test resilience by reading code. A system that handles normal traffic perfectly can fail spectacularly under partial failures. Dependencies fail in combinations you did not anticipate. Caches expire simultaneously. Network partitions isolate data centers.
Traditional testing cannot find these weaknesses. Unit tests test components. Integration tests test connections. Load tests test performance. None test what happens when parts of your system fail.
Chaos engineering fills this gap. You deliberately break things to discover what fails, then fix it before your users discover it for you.
Principles of Chaos Engineering
Netflix formalized chaos engineering principles:
Building a Hypothesis
Before running an experiment, state what you expect to happen. “If we kill one instance of Service A, requests will reroute to other instances with less than 1% errors.” This gives your experiment a clear pass/fail criterion.
Varying Real-World Events
Focus on real failures. Server crashes, network latency, disk full, CPU exhaustion. Not theoretical attacks on hypothetical vulnerabilities.
Running in Production
Test environments never match production. Traffic patterns differ. Dependencies differ. Load differs. Run experiments in production, but with safeguards.
Automating Experiments
Manual chaos is not chaos engineering. Run experiments continuously. Automate the injection and measurement.
Minimizing Blast Radius
Start small. Kill one instance, not three. Introduce 100ms latency, not 10 seconds. You want to learn something without causing an actual outage.
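One way to enforce this in code is a helper that caps how many targets an experiment may touch. This is a minimal sketch; the `max_fraction` parameter and the instance IDs are illustrative:

```python
import math
import random

def select_blast_radius(instances, max_fraction=0.1):
    """Pick a small, bounded subset of targets for an experiment.

    Caps the selection at max_fraction of the fleet, but always
    includes at least one instance so the experiment can run.
    """
    count = max(1, math.floor(len(instances) * max_fraction))
    return random.sample(instances, count)

# With a 10-instance fleet and a 10% cap, only one instance is at risk
targets = select_blast_radius([f"i-{n:03d}" for n in range(10)])
print(targets)  # e.g. ['i-007']
```

As confidence grows, you raise `max_fraction` deliberately rather than letting an experiment touch the whole fleet by accident.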
Types of Chaos Experiments
Infrastructure Chaos
Kill servers. Stop containers. Fill disks. Exhaust memory. These experiments test whether your infrastructure recovers from node failures.
```python
# Chaos experiment: kill a random instance of the payment service
def kill_random_instance():
    # describe_instances nests instances under Reservations
    reservations = ec2.describe_instances(
        Filters=[{'Name': 'tag:Service', 'Values': ['payment']}]
    )['Reservations']
    instances = [i for r in reservations for i in r['Instances']]
    target = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[target['InstanceId']])
    return target
```
Network Chaos
Introduce latency. Drop packets. Partition network. DNS failure. These experiments test how your system handles network issues.
```bash
# Introduce 200ms latency on the egress interface
tc qdisc add dev eth0 root netem delay 200ms

# Remove the latency
tc qdisc del dev eth0 root netem
```
Application Chaos
Crash application processes. Throw exceptions. Consume resources. These experiments test application-level fault tolerance.
```python
# Kill a process belonging to a service
def kill_process(service_name):
    for p in psutil.process_iter(['pid', 'name']):
        if service_name in p.info['name']:
            p.kill()
            return p.info['pid']
    return None  # No matching process found
```
Dependency Chaos
Fail external services. Introduce latency to databases. Return errors from APIs. These experiments test how your system handles upstream failures.
```python
# Chaos experiment: make a downstream service unreachable
def inject_service_failure(service_ip):
    # Drop all outbound TCP to the service; callers see connection timeouts
    subprocess.run(
        ['iptables', '-A', 'OUTPUT', '-p', 'tcp', '-d', service_ip, '-j', 'DROP'],
        check=True,
    )
```
Running a Chaos Experiment
Step 1: Define Steady State
Before breaking things, measure normal behavior. What is your baseline error rate, latency, and throughput? Know what “normal” looks like so you recognize problems when they appear.
```python
def measure_steady_state():
    errors = 0
    total = 1000
    for _ in range(total):
        try:
            response = requests.get('http://api.example.com/health', timeout=2)
            if response.status_code != 200:
                errors += 1
        except requests.RequestException:
            errors += 1
    return errors / total  # Baseline error rate
```
Step 2: Form a Hypothesis
State what you expect: “If we introduce 500ms latency to the database, error rate will stay below 1% because connections will timeout and retry.”
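A hypothesis like this can be captured as data, so the pass/fail check is explicit rather than a judgment call made mid-incident. The field names here are illustrative, not from any particular tool:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    description: str
    metric: str
    threshold: float  # Experiment fails if the metric exceeds this

    def evaluate(self, observed: float) -> bool:
        """Return True if the system behaved as hypothesized."""
        return observed <= self.threshold

h = Hypothesis(
    description="500ms DB latency keeps error rate below 1%",
    metric="error_rate",
    threshold=0.01,
)
print(h.evaluate(0.004))  # True: hypothesis holds
print(h.evaluate(0.03))   # False: hypothesis violated
```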
Step 3: Inject Failure
Start small. Introduce the failure.
```python
# Introduce 500ms latency to the database
chaos_controller.inject_latency('database', 500)
```
Step 4: Observe
Measure the same metrics you measured for steady state. Did the system behave as you expected?
```python
error_rate = measure_steady_state()
if error_rate > 0.01:  # Hypothesis violated
    alert_team("Chaos experiment failed: error rate exceeded threshold")
    rollback()
Step 5: Stop or Mitigate
If the system handled the failure, document the finding. If the system failed catastrophically, stop the experiment and escalate. Fix the weakness before running more experiments.
Step 6: Restore
Always restore the system to normal state. Automated rollbacks ensure nothing stays broken.
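One way to make restoration automatic is a context manager that removes the fault even if the observation code raises. This is a sketch; the `inject_latency`/`remove_latency` hooks stand in for your real chaos tooling and are stubbed out here:

```python
from contextlib import contextmanager

# Stub hooks standing in for real chaos tooling (hypothetical)
events = []
def inject_latency(target, delay_ms): events.append(('inject', target, delay_ms))
def remove_latency(target): events.append(('remove', target))

@contextmanager
def latency_experiment(target, delay_ms):
    """Inject latency for the duration of the block, then always remove it."""
    inject_latency(target, delay_ms)
    try:
        yield
    finally:
        remove_latency(target)  # runs even if the observation step raises

# Even when the experiment body crashes, the fault is cleaned up
try:
    with latency_experiment('database', 500):
        raise RuntimeError("observation step crashed")
except RuntimeError:
    pass

print(events)  # [('inject', 'database', 500), ('remove', 'database')]
```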
Game Days
A game day is a planned chaos exercise with the whole team. The goal is to practice failure scenarios and improve incident response.
Preparing for a Game Day
- Define scenarios to test
- Assign roles: injector, observers, incident commander
- Set start and end times
- Prepare rollback procedures
- Notify stakeholders
Running the Exercise
- Start with a scenario walkthrough
- Inject the failure
- Observe and document
- Call the incident if needed
- Debrief after
Game Day Scenarios
- Single data center failure
- Database becomes read-only
- API rate limit exceeded
- TLS certificate expires
- Message queue backs up
Tools for Chaos Engineering
Chaos Monkey
Netflix’s original. Randomly kills EC2 instances. Simple but effective for testing basic resilience.
Gremlin
A commercial chaos tool with Kubernetes support, targeting options, and safety features. Suitable for teams beginning with chaos engineering.
Litmus
Open source chaos for Kubernetes. Chaos experiments defined as custom resources. Integrates with Prometheus for monitoring.
Your Own Scripts
For simple experiments, scripts work fine. You do not need a full chaos platform to get started.
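A complete getting-started script can be this small: pick a victim container by label and stop it, then watch your dashboards. This is a sketch assuming Docker; the `service=payment` label is illustrative:

```python
import random
import subprocess

def choose_victim(container_ids):
    """Pure selection step: pick one container, or None if the list is empty."""
    return random.choice(container_ids) if container_ids else None

def stop_random_container(label='service=payment'):
    """Minimal DIY chaos: stop one randomly chosen container matching a label."""
    listing = subprocess.run(
        ['docker', 'ps', '-q', '--filter', f'label={label}'],
        capture_output=True, text=True, check=True,
    )
    victim = choose_victim(listing.stdout.split())
    if victim:
        subprocess.run(['docker', 'stop', victim], check=True)
    return victim
```

Keeping the selection step pure makes it easy to test the script's logic without actually stopping anything.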
Tool Comparison Matrix
| Tool | Type | Kubernetes Support | Target Options | Safety Features | Cost | Best For |
|---|---|---|---|---|---|---|
| Chaos Monkey | Open Source | No (AWS only) | Instance-based | Basic | Free | Netflix-style EC2 killing |
| Gremlin | Commercial | Yes | Service, Pod, Node | Attack visualizer, halt button | Paid | Teams starting with chaos; enterprise support |
| Litmus | Open Source | Yes (native) | Pod, Node, Network | Custom resources, Argo integration | Free | Kubernetes-native environments; GitOps workflows |
| AWS FIS | Managed Service | N/A (AWS-native) | AWS resources | CloudWatch integration, IAM controls | Pay-per-use | AWS workloads without third-party tools |
| Azure Chaos Studio | Managed Service | N/A (Azure-native) | VM, Kubernetes, PaaS | Logic Apps integration | Pay-per-use | Azure workloads |
| Custom Scripts | DIY | Varies | Highly flexible | Depends on implementation | Free (dev time) | Simple experiments; specific edge cases |
Selection Criteria
- AWS workloads: Use AWS Fault Injection Simulator (native, no extra tools)
- Azure workloads: Use Azure Chaos Studio (native, no extra tools)
- Kubernetes-first: Litmus for open source, Gremlin for enterprise features
- Simple EC2 chaos: Chaos Monkey or custom scripts
- Enterprise requirements: Gremlin provides SLA, support, and advanced targeting
CI/CD Integration
Integrate chaos experiments into your deployment pipeline to validate resilience automatically:
```yaml
# GitHub Actions workflow for chaos testing
name: Chaos Engineering Pipeline

on:
  push:
    branches: [main]
  schedule:
    # Run chaos experiments weekly in production
    - cron: "0 2 * * 0"

jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Steady State Check
        run: |
          ./scripts/measure_steady_state.sh
          echo "baseline_error_rate=$(cat steady_state.txt)" >> $GITHUB_ENV

      - name: Inject Chaos - Database Latency
        run: |
          kubectl label namespace chaos-testing experiment=active
          helm install chaos-litmus litmuschaos/litmus -n chaos-testing
          # Wait for steady state to stabilize
          sleep 30
          # Run the experiment
          kubectl apply -f experiments/database-latency.yaml -n payment

      - name: Monitor Impact
        run: |
          ./scripts/monitor_during_chaos.sh
          ERROR_RATE=$(cat error_rate.txt)
          # Compare to baseline
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "Experiment failed: error rate $ERROR_RATE exceeds threshold"
            exit 1
          fi

      - name: Cleanup
        if: always()
        run: |
          kubectl delete -f experiments/database-latency.yaml -n payment || true
          helm uninstall chaos-litmus -n chaos-testing || true

      - name: Document Results
        if: always()
        run: |
          ./scripts/upload_chaos_results.sh
```
```yaml
# Litmus experiment definition for CI/CD
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: db-latency-chaos
  namespace: payment
spec:
  appinfo:
    appns: payment
    applabel: "app=checkout"
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: db-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: LATENCY
              value: "500"
            - name: DB_component
              value: "postgres"
```
Cloud Provider Chaos Services
AWS and Azure both ship built-in chaos testing that ties directly into their infrastructure.
AWS Fault Injection Simulator (FIS)
AWS FIS lets you inject faults into EC2, ECS, EKS, and RDS without managing separate tooling. You define experiments in JSON templates and run them against target resources.
```yaml
# AWS FIS: simulate an EC2 instance failure
- action: aws:ec2:terminate-instances
  targets:
    instances: target-group
  description: "Terminate random EC2 instance"
  parameters:
    instanceTerminationPercentage: 50
```
AWS FIS integrates with CloudWatch alarms to automatically roll back experiments if metrics degrade beyond thresholds. This is safer than open-loop chaos.
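In an FIS experiment template, that automatic rollback is configured through `stopConditions`: the experiment halts when the referenced CloudWatch alarm fires. A minimal fragment, with a placeholder alarm ARN:

```yaml
stopConditions:
  - source: aws:cloudwatch:alarm
    value: arn:aws:cloudwatch:us-east-1:123456789012:alarm:payment-error-rate
```

Defining the stop condition up front means the kill switch exists before the first fault is injected, not after.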
Azure Chaos Studio
Azure Chaos Studio works similarly, with support for virtual machines, AKS, Service Bus, and Cosmos DB. Experiments use a visual designer or programmatic API.
```json
{
  "actions": [
    {
      "type": "faults.azure.vms.terminate",
      "parameters": {
        "nodeCount": 1,
        "abruptionDuration": "PT30S"
      }
    }
  ],
  "selectors": [
    {
      "type": "TagSelector",
      "tags": [{ "key": "chaos", "value": "enabled" }]
    }
  ]
}
```
Cloud-native tools have a real advantage: no agent deployment, native IAM integration, and rollback automation built in. The tradeoff is being locked to one cloud provider.
When to Use Cloud-Native vs Third-Party
| Criteria | Cloud-Native (FIS, Chaos Studio) | Third-Party (Gremlin, Litmus) |
|---|---|---|
| Multi-cloud | No | Yes |
| Deep infrastructure integration | Yes | Requires agents |
| Managed service | Yes | No |
| Cost | Pay-per-use | Subscription |
Steady-State Automation
After running chaos experiments, feed the results back into your system automatically. Steady-state automation means your system continuously validates its own resilience instead of treating chaos testing as a one-off exercise.
A minimal steady-state loop:
```python
def steady_state_loop():
    while True:
        # Define what "healthy" looks like
        baseline = {
            'p99_latency_ms': 200,
            'error_rate': 0.01,
            'availability': 0.999,
        }
        # Run a small chaos experiment
        result = run_minimal_chaos()
        # Measure actual state
        actual = measure_system_health()
        # Compare against baseline
        if mismatches(actual, baseline):
            alert_oncall()
            rollback_experiment()
        sleep(3600)  # Run every hour
```
This does not replace full chaos engineering. It catches regressions before they reach production.
Building a Chaos Practice
Start Small
Kill one server in a test environment. See what happens. Fix what breaks. This teaches more than any book.
Graduate to Production
Once comfortable in test, run experiments in production. Start with low-risk experiments: kill instances in a redundant service, introduce small latency.
Automate
Move from manual to automated experiments. Run them continuously in production. The more you automate, the more you learn.
Share Findings
When an experiment reveals a weakness, share the finding. Write post-mortems. Add to runbooks. The goal is organizational learning.
Common Mistakes
No Rollback Plan
Running chaos without a way to stop is reckless. Always have a rollback plan so you can halt the experiment the moment things go wrong.
Testing in a Vacuum
Experiments that run but nobody watches teach nothing. Monitor the system during experiments. If you do not see the failure, you did not learn anything.
Too Aggressive Too Fast
Starting with large-scale failures causes real outages. Start with small failures. Gradually increase scope as you build confidence.
Ignoring Results
Experiments that reveal weaknesses but nobody fixes them are wasted. Follow through. The point is to improve resilience, not to watch things break.
When to Use / When Not to Use Chaos Engineering
Use chaos engineering when:
- You have production systems with real users who would be affected by outages
- Your team has monitoring and observability in place to measure impact
- You have rollback procedures that can stop experiments safely
- You have identified specific failure modes you want to validate
Do not use chaos engineering when:
- Your system is unstable and cannot handle normal load without chaos
- You lack observability to distinguish experiment impact from real failures
- You have no way to stop experiments if things go wrong
- Your organization is not prepared to act on findings
Trade-off Analysis
| Factor | With Chaos Engineering | Without Chaos Engineering |
|---|---|---|
| Failure Discovery | Proactive - find weaknesses before users | Reactive - users discover failures first |
| Confidence | High - validated under controlled conditions | Moderate - assumes architecture is sound |
| Risk | Controlled experiments with kill switches | Uncontrolled production failures |
| Cost | Tooling, training, experiment time | Outages, incident response, recovery |
| Team Skills | Requires chaos expertise and experiment design | Standard DevOps skills |
| Culture | Requires blameless post-mortem culture | Standard incident response |
| Time to Value | Weeks to months for meaningful experiments | Immediate, but reactive |
| Coverage | Discovers unknown-unknowns | Only known-knowns from monitoring |
Chaos Engineering Experiment Flow
```mermaid
graph LR
    A[Define Steady State] --> B[Form Hypothesis]
    B --> C[Inject Failure]
    C --> D{Observe Impact}
    D -->|Within Expected| E[Document Finding]
    D -->|Exceeds Threshold| F[Stop Experiment]
    F --> G[Rollback]
    G --> H[Fix Weakness]
    E --> I[Graduate Experiment]
    I --> C
    H --> C
```
Chaos Engineering Architecture
```mermaid
graph TB
    subgraph ChaosPlatform["Chaos Engineering Platform"]
        Controller[Experiment Controller]
        Scheduler[Experiment Scheduler]
        Monitor[Monitoring Integration]
    end
    subgraph Target["Target System"]
        subgraph Services["Microservices"]
            S1[Service A]
            S2[Service B]
            S3[Service C]
        end
        subgraph Infra["Infrastructure"]
            LB[Load Balancer]
            DB[(Database)]
            Cache[Cache]
        end
    end
    subgraph Observability["Observability Stack"]
        Metrics[Prometheus Metrics]
        Logs[Log Aggregation]
        Traces[Distributed Traces]
        Dashboards[Grafana Dashboards]
    end
    Controller -->|Inject| Services
    Controller -->|Inject| Infra
    Monitor -->|Read| Metrics
    Monitor -->|Read| Traces
    Services -->|Emit| Metrics
    Services -->|Emit| Logs
    Services -->|Emit| Traces
```
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Experiment causes actual outage | Users experience real downtime | Set automatic timeouts; have kill switches; start with minimal blast radius |
| Experiment reveals unmonitored weakness | Failure spreads before team notices | Ensure comprehensive monitoring before running experiments |
| Experiment runs too long | Extended degradation affects users | Use automated stops; define maximum experiment duration |
| Team ignores findings | Weaknesses remain; future outages inevitable | Track findings to completion; include in sprint planning |
| Experiment tooling itself fails | False confidence in system resilience | Test chaos tooling itself; have manual fallback |
Observability Checklist
- Metrics:
  - Error rate (baseline vs experiment)
  - Latency P50/P95/P99
  - Throughput (requests per second)
  - Resource utilization (CPU, memory, disk, network)
  - Dependency health indicators
- Logs:
  - Experiment start/end timestamps with hypothesis
  - System behavior observations during experiment
  - Any anomalies detected
  - Post-experiment findings and recommendations
- Alerts:
  - Error rate exceeds 1% during experiment (warning) / 5% (stop experiment)
  - Latency P99 doubles during experiment
  - Any resource exhaustion detected
  - Experiment duration exceeds planned window
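Those alert thresholds can be encoded as an automated abort decision that the experiment loop evaluates on every tick. A minimal sketch, mirroring the thresholds above; the function and dict keys are illustrative:

```python
def abort_decision(baseline, current):
    """Decide whether to continue, warn, or abort, per the alert thresholds above.

    baseline and current are dicts with 'error_rate' and 'p99_latency_ms'.
    """
    if current['error_rate'] > 0.05:
        return 'abort'     # hard stop: error rate above 5%
    if current['p99_latency_ms'] > 2 * baseline['p99_latency_ms']:
        return 'abort'     # hard stop: P99 latency doubled
    if current['error_rate'] > 0.01:
        return 'warn'      # soft threshold: keep running, page a human
    return 'continue'

baseline = {'error_rate': 0.001, 'p99_latency_ms': 180}
print(abort_decision(baseline, {'error_rate': 0.002, 'p99_latency_ms': 200}))  # continue
print(abort_decision(baseline, {'error_rate': 0.02,  'p99_latency_ms': 200}))  # warn
print(abort_decision(baseline, {'error_rate': 0.002, 'p99_latency_ms': 400}))  # abort
```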
Compliance Considerations for Regulated Industries
Chaos engineering in regulated environments (finance, healthcare, aviation, government) requires extra guardrails. You are not just protecting availability; you are protecting data integrity, audit trails, and regulatory obligations.
Finance
Financial systems must maintain data integrity above all else. A chaos experiment that causes a payment to be processed twice, or a trade to be dropped, has direct financial consequences.
Key constraints:
- SLA-backed services: If your chaos experiment violates an SLA, you may owe penalties. Run experiments during maintenance windows.
- Audit logging requirements: Regulated financial services require complete audit logs. Chaos experiments that disrupt logging infrastructure can put you out of compliance even if no data is lost.
- Data consistency over availability: Your compliance posture may require strict consistency guarantees. A chaos experiment that causes stale reads in a trading system may be unacceptable regardless of availability impact.
Before running chaos on financial systems, get explicit sign-off from compliance and risk teams. Document the blast radius boundaries and ensure rollback procedures are tested.
Healthcare
Healthcare systems processing PHI (Protected Health Information) must comply with HIPAA in the US. Key considerations:
- PHI access logging: Chaos experiments that disrupt logging services can create compliance gaps. You must be able to demonstrate who accessed PHI and when.
- Availability vs data integrity: Medical devices and clinical systems may prioritize availability differently than you expect. A chaos experiment that makes an EHR unreachable could affect patient care.
- Fallback procedures: For critical healthcare systems, you need a tested fallback that maintains patient safety during degradation.
HIPAA does not prohibit chaos engineering, but you must ensure your chaos tooling does not become a vector for PHI exposure.
General Regulatory Framework
| Consideration | What to Do |
|---|---|
| Audit trail continuity | Ensure chaos experiments cannot disrupt your logging infrastructure; validate logs are being written before running |
| Data integrity validation | After any chaos experiment, verify data across systems is consistent; have rollback procedures for data repair |
| Notification requirements | Some regulated industries require notifying regulators of significant outages; chaos experiments that exceed defined blast radius may trigger those requirements |
| Change management | Chaos experiments are changes to production. Document them in your change management system |
| Insurance implications | Confirm your cyber insurance policy covers intentional experiments vs. excludes them |
The right approach in regulated industries: run chaos in dedicated staging environments that mirror production’s regulatory posture, get compliance sign-off on experiment designs, and maintain an audit log of all experiments and findings.
Running Chaos in Regulated Environments
# Pre-experiment compliance checklist
def pre_chaos_compliance_check():
checks = {
'audit_logs_active': verify_log_drain_not_disrupted(),
'data_backup_recent': verify_backup_age_hours() < 24,
'compliance_signoff': get_signed_approval('compliance-team'),
'change_ticket': get_change_ticket_number(),
'rollback_tested': run_rollback_in_staging(),
'monitoring_intact': verify_all_metrics_streams_active(),
}
for check, result in checks.items():
if not result:
raise ComplianceBlock(f"Cannot run chaos: {check} failed")
return True
In regulated industries, chaos engineering is still worth doing. The findings prevent real outages that would be far more disruptive. But the experiment design and approval process takes longer, and you should expect narrower blast radius boundaries.
Security Checklist
- Limit chaos tooling access to authorized personnel only
- Audit log all experiment executions with timestamps and owners
- Prevent experiments from affecting security controls (authentication, encryption)
- Ensure experiments cannot exfiltrate data or breach isolation
- Validate that rollback procedures do not introduce security gaps
- Test chaos tooling for vulnerabilities (injection, privilege escalation)
Common Pitfalls / Anti-Patterns
Running Chaos Without Hypothesis
Breaking things randomly teaches you nothing. Every experiment needs a stated hypothesis and pass/fail criteria. Without this, you cannot learn from the exercise.
Skipping Steady-State Measurement
Starting an experiment without measuring baseline behavior means you cannot determine impact. Measure steady state before injecting any failure.
Ignoring Rollback Procedures
Chaos engineering without safe rollback is reckless. If something goes wrong, you must be able to stop. Define rollback procedures before starting.
Experiments Nobody Watches
Running an experiment without observing its impact wastes the opportunity. Someone must actively monitor during experiments.
Not Following Up on Findings
Discovering a weakness but not fixing it defeats the purpose. Track all findings and ensure they reach sprint planning.
Interview Questions
Q: You want to inject latency into a specific service but your chaos tool requires running an agent inside the cluster. How do you handle a scenario where you cannot install agents? A: Use an external chaos approach: inject faults at the network layer using tools like TC (traffic control) on the node itself, or use a service mesh’s traffic management capabilities (Envoy’s fault injection) which does not require per-service agents. For Kubernetes, Chaos Mesh network chaos operator can inject faults without sidecar agents using CNI-level interference. Alternatively, use firewall rules on the node to drop or delay packets to specific pod IPs. If you truly cannot install anything, chaos engineering via the service mesh’s built-in fault injection is the cleanest approach.
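For reference, the service-mesh fault injection mentioned in that answer looks like this in an Istio VirtualService; the sidecar proxies apply the delay, so no per-service agent is needed. The service host and values are placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-latency-fault
spec:
  hosts:
    - payment.default.svc.cluster.local
  http:
    - fault:
        delay:
          percentage:
            value: 50        # affect half of requests
          fixedDelay: 200ms  # injected latency
      route:
        - destination:
            host: payment.default.svc.cluster.local
```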
Q: A chaos experiment reveals that your application fails when a dependent service returns errors after 200ms instead of the usual 50ms. What does this tell you about your system? A: Your application has an undocumented assumption about downstream service latency. This is a resilience gap: the application should have timeouts and retry logic that handles variable latency gracefully. The 200ms threshold is likely close to your current timeout configuration. When latency exceeds the timeout, requests fail. The fix involves setting explicit timeouts on all outbound calls, implementing retry with exponential backoff and jitter, adding circuit breakers to stop calling failing services, and documenting the latency SLOs your application depends on. This is also a chaos engineering success. You found a hidden weakness before users did.
Q: How do you justify chaos engineering investment to skeptical stakeholders? A: Frame it around risk reduction and reliability metrics. Run a pre-chaos baseline: measure your system’s steady-state metrics (error rate, latency, throughput). Then run an experiment that simulates a real failure mode, like a pod crash or network partition, and measure the blast radius. Compare the impact of that failure when handled versus unhandled. Present the findings: “When we kill a pod unannounced, error rates spike to X% for Y minutes. After adding readiness probes and proper pod disruption budgets, the same failure causes no observable impact.” Quantify the risk in terms of potential downtime cost and translate to business impact.
Q: Your chaos experiment causes cascading failures across your system. How do you safely abort and recover?
A: Immediately stop the experiment using your chaos tool’s abort mechanism (Litmus, Chaos Mesh, Gremlin all have clean abort). If the tool is not responding, use kubectl rollout undo to revert any deployment changes, and manually remove any injected faults (delete chaos CRDs). Scale up unaffected services to handle load if cascading failures are causing capacity issues. Once stable, run a post-mortem. Chaos engineering should always have a correlation ID and a rollback plan before starting. The cascading failure itself is a valuable finding. It means your circuit breakers and fallback mechanisms are not working as intended.
Q: What is the difference between chaos engineering and disaster recovery testing? A: Chaos engineering proactively discovers weaknesses in a controlled, iterative experiment. You start small, measure the blast radius, and design fixes before failures occur. Disaster recovery testing validates that recovery procedures actually work after a failure. You simulate a full outage and execute your runbook to recover. Chaos engineering is proactive discovery; DR testing is reactive validation. Both are necessary. DR tests are typically lower frequency (quarterly or annually), while chaos experiments can run continuously in production.
Quick Recap
Key Bullets:
- Chaos engineering is about finding weaknesses before users find them
- Every experiment needs a hypothesis and measurable pass/fail criteria
- Start small: one instance, small latency injection, limited scope
- Automate experiments for continuous validation
- Follow through on findings. Track to completion and fix the weaknesses
Copy/Paste Checklist:
Before running chaos experiment:
- [ ] Define steady-state baseline metrics
- [ ] State hypothesis: what do you expect to happen?
- [ ] Define pass/fail criteria
- [ ] Assign roles: injector, observer, incident commander
- [ ] Prepare rollback/kill switch
- [ ] Notify stakeholders if production
- [ ] Verify monitoring is capturing all metrics
- [ ] Document expected duration

During experiment:
- [ ] Monitor steady-state metrics continuously
- [ ] Watch for cascading effects
- [ ] Stop if criteria exceeded

After experiment:
- [ ] Restore system to normal state
- [ ] Document findings
- [ ] Schedule follow-up for any weaknesses found
- [ ] Update runbooks with lessons learned
The Point
Chaos engineering is not about breaking things. It is about finding weaknesses before users find them. The goal is to build confidence that your system will survive failures.
The first time you run chaos, you will find problems. Fix them. Run again. Eventually, your system handles chaos and your team handles incidents with confidence.
For more on resilience, see Disaster Recovery and Resilience Patterns.