Circuit Breaker Pattern: Fail Fast, Recover Gracefully

The Circuit Breaker pattern prevents cascading failures in distributed systems. Learn states, failure thresholds, half-open recovery, and implementation.

In distributed systems, failures cascade. A slow database causes timeouts. Timeouts cause requests to queue. Queued requests cause memory exhaustion. Memory exhaustion causes the service to crash. What started as a database problem takes down your entire application.

The circuit breaker pattern stops this chain reaction. It watches for persistent failures and “opens” to stop further requests, giving the failing service time to recover.

The Problem with Cascading Failures

Consider a typical web application. Your application calls a payment service. Normally, the payment service responds in 50ms. One day, it starts responding in 5 seconds.

Your application has a timeout of 3 seconds. Requests start failing. But the payment service is not just slow; it is overwhelmed. More requests pile up, waiting for responses. Threads get exhausted. Memory fills with queued requests.

Eventually, your application cannot serve new requests at all. Not just requests to the payment service, but all requests. Your application is dead, killed by a dependency.

The circuit breaker prevents this. When failure rates exceed a threshold, the circuit breaker opens. Subsequent requests fail immediately without consuming resources. The failing service gets breathing room. Eventually, the circuit breaker tests whether the service has recovered.

Circuit Breaker States

A circuit breaker has three states:

graph LR
    A[Closed] -->|failure threshold| B[Open]
    B -->|timeout elapsed| C[Half-Open]
    C -->|success| A
    C -->|failure| B

Closed State

In closed state, requests pass through normally. The circuit breaker monitors for failures. When failures exceed a threshold within a time window, the circuit transitions to open state.

Failures typically include:

  • Timeouts
  • Connection errors
  • HTTP 5xx responses from the downstream service

You might use a sliding window of 100 requests. If 50 fail, open the circuit. Or you might use a time window: if more than 10 requests fail in 10 seconds, open the circuit.
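The count-based sliding window can be sketched as a ring buffer of recent outcomes. This is an illustrative sketch, not tied to any particular library; the class name and defaults are assumptions:

```python
from collections import deque

class SlidingWindowPolicy:
    """Opens the circuit when the failure rate over the last N calls
    crosses a threshold (count-based sliding window)."""
    def __init__(self, window_size: int = 100, failure_rate: float = 0.5):
        self.results = deque(maxlen=window_size)  # True = success, False = failure
        self.failure_rate = failure_rate

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_open(self) -> bool:
        if not self.results:
            return False
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results) >= self.failure_rate
```

A time-based window works the same way, except each entry also carries a timestamp and stale entries are evicted before computing the rate.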

Open State

In open state, requests fail immediately. No actual call is made to the failing service. The circuit breaker returns an error to the caller.

This is the “fail fast” behavior. You save resources by not calling a service that is likely to fail.

After a configurable timeout, the circuit transitions to half-open state.

Half-Open State

In half-open state, the circuit breaker allows a limited number of requests through. If these requests succeed, the circuit transitions to closed. If they fail, the circuit transitions back to open.

Half-open is the “test” state. You let some traffic through to see if the downstream service has recovered.

Implementation

Basic Circuit Breaker

import time
from enum import Enum
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 timeout_seconds: float = 30.0,
                 half_open_requests: int = 3):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.half_open_requests = half_open_requests

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_attempts = 0
        self.half_open_successes = 0
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_attempts = 0
                    self.half_open_successes = 0
                else:
                    raise CircuitOpenError("Circuit is OPEN")

            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_attempts >= self.half_open_requests:
                    # Enough probes are already in flight; keep failing fast.
                    raise CircuitOpenError("Circuit is HALF_OPEN, probe limit reached")
                self.half_open_attempts += 1

        # Call the actual function outside the lock so a slow call
        # does not block other threads checking the circuit.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _should_attempt_reset(self):
        # Caller must hold self.lock.
        if self.last_failure_time is None:
            return True
        return time.time() - self.last_failure_time >= self.timeout_seconds

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_successes += 1
                if self.half_open_successes >= self.half_open_requests:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()

            # Any failure during a half-open probe re-opens the circuit;
            # otherwise open once the failure threshold is reached.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

Using a Decorator

A decorator makes the circuit breaker cleaner to use:

from functools import wraps

def circuit_breaker(failure_threshold=5, timeout_seconds=30.0):
    breaker = CircuitBreaker(failure_threshold, timeout_seconds)

    def decorator(func):
        @wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)
        return wrapper
    return decorator

@circuit_breaker(failure_threshold=10, timeout_seconds=60.0)
def call_payment_service(order_id):
    # This call is protected by the circuit breaker
    return payments.charge(order_id)

Configuration Considerations

Failure Threshold

Set the failure threshold high enough to avoid false positives from normal variance. Set it low enough to catch real failures quickly.

For a service that normally has a 99% success rate, you might set the threshold at a 50% failure rate. For a more sensitive service, 30% might be appropriate.

Timeout Duration

The timeout determines how long the circuit stays open before testing recovery. Too short and you overwhelm a struggling service. Too long and you delay recovery unnecessarily.

Start with 30-60 seconds. Adjust based on your service’s typical recovery time.

Half-Open Request Count

Allow 1-5 requests in half-open state. More requests give better signal about recovery. Fewer requests minimize impact if the service is still failing.

Half-Open: Consecutive Successes vs Percentage

When a circuit is half-open, it needs a signal to know when to fully close. Two common approaches work here.

Consecutive successes: Require N consecutive successful requests before closing. Simple and intuitive — 3 consecutive successes is a common choice. Works well when traffic is steady.

Percentage-based: Require X% of requests succeed over a window. Better when traffic is bursty — one success in a quiet period does not mean the service is healthy.

| Approach | Pros | Cons |
| --- | --- | --- |
| Consecutive successes | Simple to reason about | Brittle in low-traffic scenarios |
| Percentage (e.g., 60% over 10 calls) | Handles burst traffic | Requires sampling window |

Most production implementations (Polly, Resilience4j) default to consecutive successes.
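The two closing policies can be contrasted in a few lines. The function names and defaults here are illustrative; probe results are booleans, newest last:

```python
def close_by_consecutive(results, n=3):
    # Close only if the most recent n probes all succeeded.
    return len(results) >= n and all(results[-n:])

def close_by_percentage(results, window=10, min_rate=0.6):
    # Close if the success rate over the last `window` probes is high enough.
    sample = results[-window:]
    return bool(sample) and sum(sample) / len(sample) >= min_rate
```

Note the low-traffic brittleness: one failure among recent probes resets the consecutive counter, while the percentage policy tolerates it.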

Common Threshold Configurations

The following table provides starting points for threshold configurations based on service criticality:

| Service Criticality | Failure Threshold | Timeout (seconds) | Half-Open Requests | Window Size |
| --- | --- | --- | --- | --- |
| Critical (payments, auth) | 50% over 10s | 30 | 3 | Sliding |
| High (inventory, orders) | 50% over 20s | 60 | 5 | Sliding |
| Medium (recommendations) | 70% over 30s | 120 | 3 | Sliding |
| Low (analytics, logging) | 80% over 60s | 180 | 5 | Sliding |

Adjust based on observed error rates and service recovery times. Critical services should have lower thresholds for faster detection.

Library Comparisons

Production systems typically use established libraries rather than building from scratch:

| Library | Language | Features | State Persistence | Active |
| --- | --- | --- | --- | --- |
| Polly | .NET | Retry, circuit breaker, bulkhead, timeout, fallback | In-memory (per policy instance) | Yes |
| Resilience4j | Java | Retry, circuit breaker, bulkhead, rate limiter, timeout | In-memory | Yes |
| opossum | Node.js | Circuit breaker with statistics | In-memory only | Yes |
| Hystrix | Java | Circuit breaker, bulkhead, fallback, metrics | In-memory | Deprecated (Netflix recommends Resilience4j) |
| seneca | Node.js | Circuit breaker, retry, timeout | In-memory | Yes |
| pybreaker | Python | Circuit breaker with state listeners | Yes (via Redis storage backend) | Yes |

Key Selection Criteria

When choosing a library:

  • State persistence: If your application restarts frequently, choose a library that can persist circuit state to Redis or another distributed store
  • Language: Match your application stack (Polly for .NET, Resilience4j for Java, opossum for Node.js)
  • Integration: Look for integration with your existing frameworks (Spring Boot has built-in Resilience4j support)
  • Metrics: Ensure the library exposes metrics for monitoring (circuit state, failure rates, latency)

For most languages, the de facto standard library is the most mature choice: Polly for .NET, Resilience4j for Java.

Circuit Breaker vs Bulkhead

Circuit breakers and bulkheads protect against failures but in different ways.

A bulkhead isolates failures so they do not spread. If one part of your system fails, bulkheads prevent that failure from affecting other parts.

A circuit breaker detects failures and stops making requests to a failing service. It saves resources and prevents cascade.

Use both. Bulkheads for structural isolation. Circuit breakers for failure detection.

graph TD
    subgraph Bulkhead["Bulkhead Pattern"]
        A[Service A] --> B[Pool 1]
        A --> C[Pool 2]
        A --> D[Pool 3]
    end
    subgraph Circuit["Circuit Breaker"]
        E[Request] --> F{Circuit Closed?}
        F -->|Yes| G[Call Service]
        F -->|No| H[Fail Fast]
    end

For more on resilience patterns, see Bulkhead Pattern and Resilience Patterns.

State Persistence

Circuit breaker state lives in memory, so it resets when your application restarts. This can cause a thundering herd problem: the recovering service gets flooded with requests before the circuit even has a chance to re-close.

To survive restarts, or to share state across instances, persist circuit state to durable storage. Options:

  • Redis for distributed state: store circuit state centrally so all instances see the same state
  • Local file or database for single-instance deployments
  • Sidecar process that maintains circuit state independently

The tradeoff: centralized state adds latency on every circuit check call. A local circuit breaker is fast but does not share state across instances.

# Redis-backed circuit state (sketch; assumes a redis-py client created
# with decode_responses=True, passed in along with the open duration)
def check_circuit_redis(redis, service_name, open_duration=30.0):
    state = redis.get(f"circuit:{service_name}")
    if state == "OPEN":
        # Check if it's time to try again
        opened_at = float(redis.get(f"circuit:{service_name}:opened_at"))
        if time.time() - opened_at > open_duration:
            # Try half-open
            return "HALF_OPEN"
        return "OPEN"
    return "CLOSED"

Testing Circuit Breaker Behavior

You need to test three things: that the circuit opens on failures, that it transitions to half-open after the timeout, and that it closes after successes.

# Test: circuit opens after N failures
def test_circuit_opens_on_failures():
    cb = CircuitBreaker(failure_threshold=3)

    def failing():
        raise RuntimeError("boom")

    for _ in range(3):
        try:
            cb.call(failing)
        except RuntimeError:
            pass
    assert cb.state == CircuitState.OPEN

# Test: circuit half-opens after the recovery timeout
def test_circuit_half_opens_after_timeout():
    cb = CircuitBreaker(failure_threshold=3, timeout_seconds=1.0)
    cb.state = CircuitState.OPEN
    cb.last_failure_time = time.time() - 2  # pretend the timeout elapsed
    cb.call(lambda: "ok")  # first probe is let through
    assert cb.state == CircuitState.HALF_OPEN

# Test: circuit closes after enough half-open successes
def test_circuit_closes_after_successes():
    cb = CircuitBreaker(half_open_requests=3)
    cb.state = CircuitState.HALF_OPEN
    for _ in range(3):
        cb.call(lambda: "ok")
    assert cb.state == CircuitState.CLOSED

Use chaos engineering to simulate failures in staging: inject network errors or latency to trigger circuit transitions. Make sure metrics are emitted for every state change.
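
One low-tech way to make every transition observable is a logging hook; counting the resulting lines downstream yields transition-rate metrics. This is a sketch, and the function and logger names are illustrative:

```python
import logging

logger = logging.getLogger("circuit_breaker")

def log_state_change(service: str, old_state: str, new_state: str,
                     reason: str = "") -> None:
    # One structured line per transition; feeding these into a metrics
    # pipeline gives transition counts and flapping detection for free.
    logger.warning("circuit %s: %s -> %s (%s)", service, old_state, new_state, reason)
```

Call it from wherever your breaker mutates `self.state`, with the threshold or timeout that triggered the change as the reason.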

Common Mistakes

Not Having Fallbacks

When the circuit is open, requests fail. Your code must handle this. Return cached data, default values, or a graceful error. Do not just let exceptions propagate.

def get_product(product_id):
    try:
        return circuit_breaker.call(product_service.get, product_id)
    except CircuitOpenError:
        return get_cached_product(product_id)  # Fallback

Monitoring Only Success

Monitor circuit breaker state transitions. An opening circuit is an early warning sign. A circuit that cycles between open and half-open indicates deeper problems.

Setting Thresholds Too Tight

Thresholds that are too sensitive create false positives. Your circuit opens because of normal retry patterns or expected errors. Tune thresholds based on observed behavior.

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Circuit opens on transient error | Users see failures for otherwise healthy service | Tune failure threshold based on normal error rates; use percentage-based thresholds |
| Circuit never closes | Service marked as failed when it has recovered | Implement proper half-open state with success thresholds |
| Fallback returns stale data | Business logic uses outdated information | Set TTL on cached fallbacks; monitor data freshness; alert on fallback usage |
| Circuit state not persisted | After restart, circuit resets to closed | Persist circuit state in distributed store; restore on startup |
| Timeout during half-open test | Circuit oscillates between open and half-open | Implement success threshold before closing; require consecutive successes |

Observability Checklist

  • Metrics:
    • Circuit state per downstream service (closed/open/half-open)
    • Failure rate per circuit
    • Request latency per circuit (when closed)
    • Fallback invocation rate
    • Half-open to closed transition success rate
  • Logs:
    • Circuit state transitions with reason
    • Fallback activations with context
    • Half-open test results
    • Threshold breaches leading to open
  • Alerts:
    • Circuit enters open state (early warning of downstream issue)
    • Circuit cycles rapidly between states
    • Fallback activation rate exceeds threshold
    • Half-open test failures increasing

Security Checklist

  • Circuit breaker configuration not exposed to clients
  • Fallback data properly sanitized (no data leakage)
  • Circuit breaker state does not reveal internal system details
  • Rate limiting combined with circuit breakers to prevent abuse
  • Monitoring does not log sensitive request/response data
  • Timeouts properly enforced to prevent resource exhaustion

Common Anti-Patterns to Avoid

One Circuit Breaker Per Application

Using a single circuit breaker for all downstream services means one failing service opens the circuit for everything. Per-service or per-group circuit breakers isolate failures.
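
A small registry sketch makes the per-service approach concrete. The class name is an assumption, and any breaker object with a `call` method would work:

```python
class BreakerRegistry:
    """One circuit breaker per downstream service, created lazily."""
    def __init__(self, factory):
        self.factory = factory  # zero-arg callable returning a new breaker
        self.breakers = {}

    def get(self, service_name: str):
        # A failure in one service opens only that service's circuit.
        if service_name not in self.breakers:
            self.breakers[service_name] = self.factory()
        return self.breakers[service_name]
```

In multi-threaded code, guard `get` with a lock so two threads do not race to create the same breaker.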

Ignoring Circuit Health in Dashboard

A circuit breaker dashboard showing only current state misses trends. Track time in each state, transition frequency, and aggregate failure rates.

Not Testing Circuit Breaker Behavior

Circuit breakers have complex state machines. Test all transitions: normal operation, threshold breach, half-open behavior, successful recovery, and failure to recover.

Using Circuit Breakers as Substitute for Timeouts

Circuit breakers do not replace timeouts. A request waiting for a slow response consumes resources. Both are needed.

Quick Recap

Key Bullets:

  • Circuit breakers prevent cascading failures by stopping requests to failing services
  • Three states: closed (normal), open (fail fast), half-open (testing recovery)
  • Set failure thresholds based on normal error rates; start with 50% over 10 seconds
  • Always implement fallbacks when circuit opens
  • Monitor state transitions and failure rates to detect problems early

Copy/Paste Checklist:

Circuit Breaker Implementation:
[ ] Define failure threshold based on normal error rates
[ ] Set timeout duration for half-open recovery test
[ ] Configure success threshold for closing (require N consecutive successes)
[ ] Implement fallback for all protected calls
[ ] Add circuit state to monitoring dashboard
[ ] Log all state transitions with reason
[ ] Set alerts for circuit opening events
[ ] Test all state transitions in staging
[ ] Never expose circuit internal state to clients
[ ] Combine with bulkheads for defense in depth

When to Use / When Not to Use

When to Use Circuit Breakers

Circuit breakers shine in these scenarios:

  • Calls to external services that can become slow or unavailable (payment gateways, third-party APIs, remote microservices)
  • Resource protection where you need to prevent thread/connection exhaustion during downstream outages
  • Graceful degradation where you want to fail fast and use fallbacks rather than block waiting
  • Cascading failure prevention where a failing service could take down your entire application
  • Systems with async processing where you can queue failed requests for later retry

When Not to Use Circuit Breakers

Circuit breakers add complexity. Consider alternatives when:

  • Local operations only with no external dependencies (database calls within the same process)
  • No fallback available where failing fast provides no benefit since the operation must succeed
  • Latency is acceptable where waiting for a slow response is preferable to immediate failure
  • Very simple services where the overhead of implementing circuit breaker state management is not justified
  • Operations with built-in retry that already handle failures internally

Decision Flow

graph TD
    A[Circuit Breaker Decision] --> B{Calls External Service?}
    B -->|No| C[Probably Not Needed]
    B -->|Yes| D{Can Service Become Unavailable?}
    D -->|No| E[Timeout May Suffice]
    D -->|Yes| F{Has Fallback?}
    F -->|Yes| G[Circuit Breaker Recommended]
    F -->|No| H{Resource Protection Needed?}
    H -->|Yes| G
    H -->|No| I[Evaluate Complexity vs Benefit]

Pattern Combinations

Circuit breakers work best combined with:

  • Timeouts - circuit breakers do not replace timeouts; both are needed
  • Bulkheads - circuit breakers protect against service failures; bulkheads protect against resource exhaustion
  • Fallbacks - circuit breaker opening without a fallback just returns errors
  • Retries - retry before circuit breaker opens for transient failures; circuit breaker prevents retry storms

For more on related patterns, see Resilience Patterns and Bulkhead Pattern.
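
As a sketch of the retry combination: the breaker should wrap the retry loop, so a whole request, not each attempt, counts once against the failure threshold. The `with_retry` helper here is illustrative:

```python
import time

def with_retry(func, attempts=3, delay=0.1):
    """Retry a zero-arg callable on any exception, with a fixed delay."""
    for i in range(attempts):
        try:
            return func()
        except Exception:
            if i == attempts - 1:
                raise  # exhausted: let the failure count against the breaker
            time.sleep(delay)

# Layering order: the breaker is outermost, so persistent failures
# contribute one failure per request, and an open circuit skips the
# retry loop entirely, preventing retry storms.
# result = breaker.call(lambda: with_retry(lambda: payments.charge(order_id)))
```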

Related Posts

Bulkhead Pattern: Isolate Failures Before They Spread

The Bulkhead pattern prevents resource exhaustion by isolating workloads. Learn how to implement bulkheads, partition resources, and use them with circuit breakers.

#patterns #resilience #fault-tolerance

Resilience Patterns: Retry, Timeout, Bulkhead, and Fallback

Build systems that survive failures. Learn retry with backoff, timeout patterns, bulkhead isolation, circuit breakers, and fallback strategies.

#patterns #resilience #fault-tolerance

Graceful Degradation: Systems That Bend Instead of Break

Design systems that maintain core functionality when components fail through fallback strategies, degradation modes, and progressive service levels.

#distributed-systems #fault-tolerance #resilience