Circuit Breaker Pattern: Fail Fast, Recover Gracefully
The Circuit Breaker pattern prevents cascading failures in distributed systems. Learn states, failure thresholds, half-open recovery, and implementation.
In distributed systems, failures cascade. A slow database causes timeouts. Timeouts cause requests to queue. Queued requests cause memory exhaustion. Memory exhaustion causes the service to crash. What started as a database problem takes down your entire application.
The circuit breaker pattern stops this chain reaction. It watches for persistent failures and “opens” to stop further requests, giving the failing service time to recover.
The Problem with Cascading Failures
Consider a typical web application. Your application calls a payment service. Normally, the payment service responds in 50ms. One day, it starts responding in 5 seconds.
Your application has a timeout of 3 seconds. Requests start failing. But the payment service is not just slow; it is overwhelmed. More requests pile up, waiting for responses. Threads get exhausted. Memory fills with queued requests.
Eventually, your application cannot serve new requests at all. Not just requests to the payment service, but all requests. Your application is dead, killed by a dependency.
The circuit breaker prevents this. When failure rates exceed a threshold, the circuit breaker opens. Subsequent requests fail immediately without consuming resources. The failing service gets breathing room. Eventually, the circuit breaker tests whether the service has recovered.
Circuit Breaker States
A circuit breaker has three states:
graph LR
A[Closed] -->|failure threshold| B[Open]
B -->|timeout elapsed| C[Half-Open]
C -->|success| A
C -->|failure| B
Closed State
In closed state, requests pass through normally. The circuit breaker monitors for failures. When failures exceed a threshold within a time window, the circuit transitions to open state.
Failures typically include:
- Timeouts
- Connection errors
- HTTP 5xx responses from the downstream service
You might use a sliding window of 100 requests. If 50 fail, open the circuit. Or you might use a time window: if more than 10 requests fail in 10 seconds, open the circuit.
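The sliding-window bookkeeping can be sketched in a few lines. This is a minimal count-based window, assuming we only track pass/fail outcomes for the last N calls (the `SlidingWindow` name and 50% default are illustrative, not from any particular library):

```python
from collections import deque

class SlidingWindow:
    """Count-based window: signal an open when the recent failure rate
    meets the threshold."""

    def __init__(self, window_size=100, failure_rate_threshold=0.5):
        self.window = deque(maxlen=window_size)  # True = failure
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed):
        self.window.append(failed)

    def should_open(self):
        if not self.window:
            return False
        failure_rate = sum(self.window) / len(self.window)
        return failure_rate >= self.failure_rate_threshold
```

A time-based window works the same way, except entries older than the window duration are evicted before computing the rate.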
Open State
In open state, requests fail immediately. No actual call is made to the failing service. The circuit breaker returns an error to the caller.
This is the “fail fast” behavior. You save resources by not calling a service that is likely to fail.
After a configurable timeout, the circuit transitions to half-open state.
Half-Open State
In half-open state, the circuit breaker allows a limited number of requests through. If these requests succeed, the circuit transitions to closed. If they fail, the circuit transitions back to open.
Half-open is the “test” state. You let some traffic through to see if the downstream service has recovered.
Implementation
Basic Circuit Breaker
import time
from enum import Enum
from threading import RLock

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 timeout_seconds: float = 30.0,
                 half_open_requests: int = 3):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.half_open_requests = half_open_requests
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_successes = 0
        self.lock = RLock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_successes = 0
                else:
                    raise CircuitOpenError("Circuit is OPEN")

        # Call the function outside the lock so a slow downstream call
        # does not block other threads checking the circuit. In half-open,
        # requests pass through as test traffic; a production breaker would
        # also cap how many run concurrently.
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        if self.last_failure_time is None:
            return True
        return time.time() - self.last_failure_time >= self.timeout_seconds

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_successes += 1
                # Close only after enough consecutive test requests succeed
                if self.half_open_successes >= self.half_open_requests:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                # Any failure during the recovery test reopens the circuit
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
Using a Decorator
A decorator makes the circuit breaker cleaner to use:
from functools import wraps

def circuit_breaker(failure_threshold=5, timeout_seconds=30.0):
    breaker = CircuitBreaker(failure_threshold, timeout_seconds)
    def decorator(func):
        @wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)
        return wrapper
    return decorator
@circuit_breaker(failure_threshold=10, timeout_seconds=60.0)
def call_payment_service(order_id):
# This call is protected by the circuit breaker
return payments.charge(order_id)
Configuration Considerations
Failure Threshold
Set the failure threshold high enough to avoid false positives from normal variance. Set it low enough to catch real failures quickly.
For a service that normally has 99% success rate, you might set threshold at 50% failure. For a more sensitive service, 30% might be appropriate.
Timeout Duration
The timeout determines how long the circuit stays open before testing recovery. Too short and you overwhelm a struggling service. Too long and you delay recovery unnecessarily.
Start with 30-60 seconds. Adjust based on your service’s typical recovery time.
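One refinement worth considering (an assumption on my part, not something every library does by default) is backing off the open duration when repeated half-open tests fail, so a service that stays down is probed less and less often:

```python
def open_duration(base_seconds=30.0, consecutive_opens=1, cap_seconds=600.0):
    """Exponential backoff for the open-state timeout, capped so the
    circuit still retests within a bounded interval."""
    return min(base_seconds * (2 ** (consecutive_opens - 1)), cap_seconds)
```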
Half-Open Request Count
Allow 1-5 requests in half-open state. More requests give better signal about recovery. Fewer requests minimize impact if the service is still failing.
Half-Open: Consecutive Successes vs Percentage
When a circuit is half-open, it needs a signal to know when to fully close. Two common approaches work here.
Consecutive successes: Require N consecutive successful requests before closing. Simple and intuitive — 3 consecutive successes is a common choice. Works well when traffic is steady.
Percentage-based: Require X% of requests succeed over a window. Better when traffic is bursty — one success in a quiet period does not mean the service is healthy.
| Approach | Pros | Cons |
|---|---|---|
| Consecutive successes | Simple to reason about | Brittle in low-traffic scenarios |
| Percentage (e.g., 60% over 10 calls) | Handles burst traffic | Requires sampling window |
Library defaults vary: Polly closes the circuit after a successful trial call in half-open, while Resilience4j evaluates the failure rate over a configured number of permitted half-open calls. Check your library's half-open policy before relying on it.
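The two close policies can be sketched as predicates over a list of recent outcomes (True = success); the function names and defaults here are illustrative:

```python
def close_by_consecutive(results, n=3):
    """Close when the last n results are all successes."""
    return len(results) >= n and all(results[-n:])

def close_by_percentage(results, window=10, threshold=0.6):
    """Close when at least `threshold` of the last `window` results succeeded."""
    recent = results[-window:]
    return len(recent) == window and sum(recent) / window >= threshold
```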
Common Threshold Configurations
The following table provides starting points for threshold configurations based on service criticality:
| Service Criticality | Failure Threshold | Timeout (seconds) | Half-Open Requests | Window Size |
|---|---|---|---|---|
| Critical (payments, auth) | 50% over 10s | 30 | 3 | Sliding |
| High (inventory, orders) | 50% over 20s | 60 | 5 | Sliding |
| Medium (recommendations) | 70% over 30s | 120 | 3 | Sliding |
| Low (analytics, logging) | 80% over 60s | 180 | 5 | Sliding |
Adjust based on observed error rates and service recovery times. Critical services should have lower thresholds for faster detection.
Library Comparisons
Production systems typically use established libraries rather than building from scratch:
| Library | Language | Features | State Persistence | Active |
|---|---|---|---|---|
| Polly | .NET | Retry, circuit breaker, bulkhead, timeout, fallback | In-memory (per policy instance) | Yes |
| Resilience4j | Java | Retry, circuit breaker, bulkhead, rate limiter, timeout | In-memory (registry) | Yes |
| opossum | Node.js | Circuit breaker with statistics | In-memory only | Yes |
| Hystrix | Java | Circuit breaker, bulkhead, fallback, metrics | In-memory | Deprecated (Netflix recommends Resilience4j) |
| seneca | Node.js | Circuit breaker, retry, timeout | In-memory | Yes |
| pybreaker | Python | Circuit breaker with state listeners | Yes (via Redis) | Yes |
Key Selection Criteria
When choosing a library:
- State persistence: If your application restarts frequently, choose a library that can persist circuit state to Redis or another distributed store
- Language: Match your application stack (Polly for .NET, Resilience4j for Java, opossum for Node.js)
- Integration: Look for integration with your existing frameworks (Spring Boot has built-in Resilience4j support)
- Metrics: Ensure the library exposes metrics for monitoring (circuit state, failure rates, latency)
For most languages, the de facto standard library is the most mature choice: Polly for .NET, Resilience4j for Java.
Circuit Breaker vs Bulkhead
Circuit breakers and bulkheads protect against failures but in different ways.
A bulkhead isolates failures so they do not spread. If one part of your system fails, bulkheads prevent that failure from affecting other parts.
A circuit breaker detects failures and stops making requests to a failing service. It saves resources and prevents cascade.
Use both. Bulkheads for structural isolation. Circuit breakers for failure detection.
graph TD
subgraph Bulkhead["Bulkhead Pattern"]
A[Service A] --> B[Pool 1]
A --> C[Pool 2]
A --> D[Pool 3]
end
subgraph Circuit["Circuit Breaker"]
E[Request] --> F{Circuit Closed?}
F -->|Yes| G[Call Service]
F -->|No| H[Fail Fast]
end
For more on resilience patterns, see Bulkhead Pattern and Resilience Patterns.
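To make the contrast concrete, here is a minimal bulkhead sketch using a semaphore to cap concurrency (a thread-based simplification; real bulkheads usually use separate pools or queues):

```python
from threading import BoundedSemaphore

class Bulkhead:
    def __init__(self, max_concurrent=10):
        self._slots = BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing: structural isolation
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

A circuit breaker would wrap the call inside `run`, so the bulkhead bounds resource usage while the breaker tracks failures.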
State Persistence
Circuit breaker state lives in memory, so it resets to closed when your application restarts. This can cause a thundering herd: a still-struggling downstream service gets flooded with requests before the circuit trips open again.
To survive restarts or to share state across instances, persist circuit state to durable storage. Options:
- Redis for distributed state: store circuit state centrally so all instances see the same state
- Local file or database for single-instance deployments
- Sidecar process that maintains circuit state independently
The tradeoff: centralized state adds latency on every circuit check call. A local circuit breaker is fast but does not share state across instances.
# Redis-backed circuit state (read side). Assumes a connected redis-py
# client in `redis` and an `open_duration` setting in seconds.
def check_circuit_redis(service_name):
    state = redis.get(f"circuit:{service_name}")
    if state == b"OPEN":
        # Check if it's time to try again
        opened_at = float(redis.get(f"circuit:{service_name}:opened_at") or 0)
        if time.time() - opened_at > open_duration:
            return "HALF_OPEN"  # let a test request through
        return "OPEN"
    return "CLOSED"
Testing Circuit Breaker Behavior
You need to test three things: that the circuit opens on failures, that it transitions to half-open after the recovery timeout, and that it closes after successes. The tests below exercise the CircuitBreaker class defined earlier.
# Test: circuit opens after N failures
def test_circuit_opens_on_failures():
    cb = CircuitBreaker(failure_threshold=3)

    def boom():
        raise ConnectionError("downstream failure")

    for _ in range(3):
        try:
            cb.call(boom)
        except ConnectionError:
            pass
    assert cb.state == CircuitState.OPEN

# Test: circuit half-opens after the recovery timeout
def test_circuit_half_opens_after_timeout():
    cb = CircuitBreaker(failure_threshold=3, timeout_seconds=1)
    cb.state = CircuitState.OPEN
    cb.last_failure_time = time.time() - 2
    cb.call(lambda: "ok")  # first call after the timeout runs as a test request
    assert cb.state == CircuitState.HALF_OPEN

# Test: circuit closes after enough half-open successes
def test_circuit_closes_after_successes():
    cb = CircuitBreaker(half_open_requests=3)
    cb.state = CircuitState.HALF_OPEN
    for _ in range(3):
        cb.call(lambda: "ok")
    assert cb.state == CircuitState.CLOSED
Use chaos engineering to simulate failures in staging: inject network errors or latency to trigger circuit transitions. Make sure metrics are emitted for every state change.
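A crude latency-and-error injector can stand in for a chaos tool in staging tests. This wrapper is an illustration (the `inject_faults` helper is hypothetical, not part of any chaos framework):

```python
import random
import time

def inject_faults(func, error_rate=0.3, max_delay=0.5):
    """Wrap func so calls randomly fail or slow down."""
    def wrapper(*args, **kwargs):
        if random.random() < error_rate:
            raise ConnectionError("injected fault")
        time.sleep(random.uniform(0, max_delay))  # injected latency
        return func(*args, **kwargs)
    return wrapper
```

Wrapping a downstream client with this and running load through the breaker should show it opening, half-opening, and closing in your metrics.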
Common Mistakes
Not Having Fallbacks
When the circuit is open, requests fail. Your code must handle this. Return cached data, default values, or a graceful error. Do not just let exceptions propagate.
def get_product(product_id):
try:
return circuit_breaker.call(product_service.get, product_id)
except CircuitOpenError:
return get_cached_product(product_id) # Fallback
Monitoring Only Success
Monitor circuit breaker state transitions. An opening circuit is an early warning sign. A circuit that cycles between open and half-open indicates deeper problems.
Setting Thresholds Too Tight
Thresholds that are too sensitive create false positives. Your circuit opens because of normal retry patterns or expected errors. Tune thresholds based on observed behavior.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Circuit opens on transient error | Users see failures for otherwise healthy service | Tune failure threshold based on normal error rates; use percentage-based thresholds |
| Circuit never closes | Service marked as failed when it has recovered | Implement proper half-open state with success thresholds |
| Fallback returns stale data | Business logic uses outdated information | Set TTL on cached fallbacks; monitor data freshness; alert on fallback usage |
| Circuit state not persisted | After restart, circuit resets to closed | Persist circuit state in distributed store; restore on startup |
| Timeout during half-open test | Circuit oscillates between open and half-open | Implement success threshold before closing; require consecutive successes |
Observability Checklist
Metrics:
- Circuit state per downstream service (closed/open/half-open)
- Failure rate per circuit
- Request latency per circuit (when closed)
- Fallback invocation rate
- Half-open to closed transition success rate
Logs:
- Circuit state transitions with reason
- Fallback activations with context
- Half-open test results
- Threshold breaches leading to open
Alerts:
- Circuit enters open state (early warning of downstream issue)
- Circuit cycles rapidly between states
- Fallback activation rate exceeds threshold
- Half-open test failures increasing
Security Checklist
- Circuit breaker configuration not exposed to clients
- Fallback data properly sanitized (no data leakage)
- Circuit breaker state does not reveal internal system details
- Rate limiting combined with circuit breakers to prevent abuse
- Monitoring does not log sensitive request/response data
- Timeouts properly enforced to prevent resource exhaustion
Common Anti-Patterns to Avoid
One Circuit Breaker Per Application
Using a single circuit breaker for all downstream services means one failing service opens the circuit for everything. Per-service or per-group circuit breakers isolate failures.
Ignoring Circuit Health in Dashboard
A circuit breaker dashboard showing only current state misses trends. Track time in each state, transition frequency, and aggregate failure rates.
Not Testing Circuit Breaker Behavior
Circuit breakers have complex state machines. Test all transitions: normal operation, threshold breach, half-open behavior, successful recovery, and failure to recover.
Using Circuit Breakers as Substitute for Timeouts
Circuit breakers do not replace timeouts. A request waiting for a slow response consumes resources. Both are needed.
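A sketch of the timeout side, using a worker thread so the caller stops waiting even when the downstream call cannot be cancelled (the helper and pool size are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=8)

def call_with_timeout(func, timeout_seconds=3.0):
    """Give up waiting after timeout_seconds. Note: the worker thread
    keeps running; the timeout only frees the caller."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        raise TimeoutError(f"call exceeded {timeout_seconds}s")
```

Timeouts bound the cost of one slow call; the circuit breaker bounds how many such calls you keep attempting.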
Quick Recap
Key Bullets:
- Circuit breakers prevent cascading failures by stopping requests to failing services
- Three states: closed (normal), open (fail fast), half-open (testing recovery)
- Set failure thresholds based on normal error rates; start with 50% over 10 seconds
- Always implement fallbacks when circuit opens
- Monitor state transitions and failure rates to detect problems early
Copy/Paste Checklist:
Circuit Breaker Implementation:
[ ] Define failure threshold based on normal error rates
[ ] Set timeout duration for half-open recovery test
[ ] Configure success threshold for closing (require N consecutive successes)
[ ] Implement fallback for all protected calls
[ ] Add circuit state to monitoring dashboard
[ ] Log all state transitions with reason
[ ] Set alerts for circuit opening events
[ ] Test all state transitions in staging
[ ] Never expose circuit internal state to clients
[ ] Combine with bulkheads for defense in depth
When to Use / When Not to Use
When to Use Circuit Breakers
Circuit breakers shine in these scenarios:
- Calls to external services that can become slow or unavailable (payment gateways, third-party APIs, remote microservices)
- Resource protection where you need to prevent thread/connection exhaustion during downstream outages
- Graceful degradation where you want to fail fast and use fallbacks rather than block waiting
- Cascading failure prevention where a failing service could take down your entire application
- Systems with async processing where you can queue failed requests for later retry
When Not to Use Circuit Breakers
Circuit breakers add complexity. Consider alternatives when:
- Local operations only with no external dependencies (database calls within the same process)
- No fallback available where failing fast provides no benefit since the operation must succeed
- Latency is acceptable where waiting for a slow response is preferable to immediate failure
- Very simple services where the overhead of implementing circuit breaker state management is not justified
- Operations with built-in retry that already handle failures internally
Decision Flow
graph TD
A[Circuit Breaker Decision] --> B{Calls External Service?}
B -->|No| C[Probably Not Needed]
B -->|Yes| D{Can Service Become Unavailable?}
D -->|No| E[Timeout May Suffice]
D -->|Yes| F{Has Fallback?}
F -->|Yes| G[Circuit Breaker Recommended]
F -->|No| H{Resource Protection Needed?}
H -->|Yes| G
H -->|No| I[Evaluate Complexity vs Benefit]
Pattern Combinations
Circuit breakers work best combined with:
- Timeouts - circuit breakers do not replace timeouts; both are needed
- Bulkheads - circuit breakers protect against service failures; bulkheads protect against resource exhaustion
- Fallbacks - circuit breaker opening without a fallback just returns errors
- Retries - retry before circuit breaker opens for transient failures; circuit breaker prevents retry storms
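How retries and the breaker compose can be sketched as follows. The snippet redefines CircuitOpenError locally so it is self-contained; `protected_call` stands for a breaker-protected invoker such as `breaker.call` from earlier:

```python
import time

class CircuitOpenError(Exception):
    """Stand-in for the breaker's open-circuit error."""

def call_with_retry(protected_call, func, retries=2, backoff=0.1):
    """Retry transient failures, but never retry an open circuit."""
    for attempt in range(retries + 1):
        try:
            return protected_call(func)
        except CircuitOpenError:
            raise  # fail fast: retrying an open circuit causes retry storms
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
```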
For more on related patterns, see Resilience Patterns and Bulkhead Pattern.
Related Posts
Bulkhead Pattern: Isolate Failures Before They Spread
The Bulkhead pattern prevents resource exhaustion by isolating workloads. Learn how to implement bulkheads, partition resources, and use them with circuit breakers.
Resilience Patterns: Retry, Timeout, Bulkhead, and Fallback
Build systems that survive failures. Learn retry with backoff, timeout patterns, bulkhead isolation, circuit breakers, and fallback strategies.
Graceful Degradation: Systems That Bend Instead of Break
Design systems that maintain core functionality when components fail through fallback strategies, degradation modes, and progressive service levels.