Resilience Patterns: Retry, Timeout, Bulkhead, and Fallback

Build systems that survive failures. Learn retry with backoff, timeout patterns, bulkhead isolation, circuit breakers, and fallback strategies.



Distributed systems fail. Networks partition. Services crash. Databases slow down. Your system will encounter failures. The question is whether it survives gracefully.

Resilience patterns give your system tools to handle failures without cascading into outages. This article covers retries, timeouts, bulkheads, circuit breakers, and fallbacks. Used together, they create systems that bend but do not break.

Retries with Backoff

When a request fails, retry it. Simple enough, except that naive retries amplify problems. If a service is overloaded, your retries add load and make things worse.

Retries work for transient failures. Network hiccups, temporary unavailability, brief lock contentions. Retries do not work when failures are fundamental. If the service is down, retries just create more load.

Exponential Backoff

Wait longer between each retry attempt. First retry after 1 second, second after 2 seconds, third after 4 seconds. This gives the service time to recover.

import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_retries):
        try:
            return func()
        except (ConnectionError, TimeoutError):  # transient errors; adjust to your client library
            if attempt == max_retries - 1:
                raise

            # Double the delay each attempt, capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Add jitter to prevent thundering herd
            delay = delay * (0.5 + random.random())
            time.sleep(delay)

Jitter

Without jitter, all clients retry at the same time. Add randomness to spread out retry attempts:

import time
import random

def retry_with_jitter(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except (ConnectionError, TimeoutError):  # transient errors; adjust to your client library
            if attempt == max_retries - 1:
                raise

            delay = base_delay * (2 ** attempt)
            # Full jitter: pick uniformly between 0 and the backoff ceiling
            delay = random.uniform(0, delay)
            time.sleep(delay)

What to Retry

Retry on transient failures:

  • Network timeouts
  • Connection refused
  • Service unavailable (503)
  • Gateway timeout (504)

Do not retry on client errors:

  • Bad request (400)
  • Unauthorized (401)
  • Not found (404)
  • Validation errors (422)

Do not retry non-idempotent operations that may have already succeeded: if you get a timeout but the operation actually completed, retrying creates duplicate work.
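That classification can be centralized in a small helper so every call site agrees on what is retryable. A sketch; the function and status sets below are illustrative, not a library API:

```python
# Status codes worth retrying: transient server-side or throttling conditions.
RETRYABLE_STATUS_CODES = {429, 502, 503, 504}

# Client errors that will fail the same way on every attempt.
NON_RETRYABLE_STATUS_CODES = {400, 401, 403, 404, 422}

def is_retryable(status_code):
    """Return True only for failures a retry might actually fix."""
    if status_code in RETRYABLE_STATUS_CODES:
        return True
    if status_code in NON_RETRYABLE_STATUS_CODES:
        return False
    # Default: retry server errors (5xx), never client errors (4xx)
    return 500 <= status_code < 600
```

Network-level failures (timeouts, connection refused) never carry a status code; treat those as retryable by exception type instead.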

Timeout Patterns

Timeouts prevent your system from waiting forever. A service that never responds holds a thread, a connection, memory. Eventually resources run out and the system dies.

Setting Timeouts

Set timeouts based on what the operation actually needs. A simple database query might need 1-5 seconds. A call to an external API might need 10-30 seconds. Know your services and set appropriate timeouts.

from concurrent.futures import ThreadPoolExecutor

# Per-operation timeouts (seconds)
TIMEOUTS = {
    'database_query': 5.0,
    'external_payment': 30.0,
    'cache_lookup': 0.5,
    'file_upload': 120.0,
}

def call_with_timeout(service_name, func, *args, **kwargs):
    timeout = TIMEOUTS.get(service_name, 30.0)
    # Run in a worker thread so the caller can give up after the deadline
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(func, *args, **kwargs)
        return future.result(timeout=timeout)

Timeout vs Retry

Timeouts and retries solve different problems. Timeouts prevent waiting indefinitely. Retries handle transient failures.

Use both. A request times out, you retry. The retry succeeds. Without timeout, you would have waited forever. Without retry, the timeout would have been a permanent failure.
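One way to combine them, as a sketch using only the standard library (the wrapper and its parameters are assumptions, and `func` is assumed to take no arguments): each attempt gets its own deadline, and a timed-out attempt is treated as a retryable failure.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout_and_retry(func, timeout=5.0, max_retries=3, base_delay=1.0):
    """Give each attempt its own timeout; retry timed-out attempts with backoff."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        for attempt in range(max_retries):
            future = executor.submit(func)
            try:
                return future.result(timeout=timeout)
            except FutureTimeout:
                future.cancel()  # no-op once the call is running; see caveat below
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with full jitter between attempts
                time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Caveat: a timed-out worker thread keeps running until `func` returns on its own, so prefer protocol-level timeouts (socket or HTTP client timeouts) where your library supports them.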

Bulkhead Pattern

Bulkheads partition resources so that problems in one area do not drain resources from other areas. If one service fails and holds threads, bulkheads prevent that failure from affecting other services.

See the Bulkhead Pattern article for details.
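For a sense of the mechanics, a minimal bulkhead can be a semaphore that rejects work once a partition is at capacity. A sketch with illustrative names, not a full implementation:

```python
import threading

class BulkheadFullError(Exception):
    """Raised when a partition has no capacity left."""

class Bulkhead:
    """Cap concurrent calls to one dependency; reject excess immediately."""
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def __enter__(self):
        # Non-blocking acquire: fail fast instead of queueing behind a slow dependency
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("partition at capacity")
        return self

    def __exit__(self, *exc_info):
        self._slots.release()
        return False
```

Each downstream service gets its own `Bulkhead` instance, so exhausting one partition cannot starve the others.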

Circuit Breaker Pattern

Circuit breakers detect persistent failures and stop making requests. When a service is repeatedly failing, the circuit opens. Requests fail immediately without consuming resources. This gives the failing service time to recover.

See the Circuit Breaker Pattern article for details.
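The core mechanism fits in a few lines. A minimal sketch (thresholds and names are illustrative; production breakers also need thread safety and a proper half-open state):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cooldown."""
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("failing fast")
            self.opened_at = None  # cooldown elapsed: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success resets the failure count
        return result
```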

Fallback Patterns

When a service fails and retries are exhausted, have a plan B. Return cached data, default values, or a graceful error. Do not let exceptions propagate.

Cached Data Fallback

def get_product(product_id):
    try:
        return product_service.get(product_id)
    except ServiceError:
        cached = cache.get(f"product:{product_id}")
        if cached:
            return cached
        raise ProductServiceUnavailable()

Default Value Fallback

def get_user_permissions(user_id):
    try:
        return permission_service.get_permissions(user_id)
    except ServiceError:
        # Safe default: minimal permissions
        return ['read:own_data']

Graceful Degradation

def get_recommendations(user_id):
    try:
        return ml_service.get_recommendations(user_id)
    except ServiceError:
        # Fall back to popularity-based recommendations
        return get_popular_items(limit=10)

Rate Limiting: Controlling Request Volume

Rate limiting protects services from being overwhelmed by too many requests. Unlike circuit breakers (which react to failures), rate limiters proactively reject excess load before it causes problems.

import time

class TokenBucketRateLimiter:
    def __init__(self, rate, capacity):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.time()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

How rate limiting combines with other resilience patterns:

| Rate Limiter Position | What It Protects | Combines Well With |
|---|---|---|
| Per-client | Individual tenants from overwhelming shared services | Bulkhead, circuit breaker |
| Per-service | A service from its downstream dependencies | Timeout, circuit breaker |
| Per-endpoint | Specific expensive operations | Bulkhead, retry |
| Global | Entire system from external traffic | Circuit breaker, fallback |

When rate limiting fires, return HTTP 429 (Too Many Requests) with a Retry-After header so clients know when to retry.
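The response itself is simple to construct. A framework-agnostic sketch (the dict shape is an assumption; adapt it to your web framework's response type):

```python
def rate_limited_response(retry_after_seconds=1):
    """Build a 429 response telling the client when it may safely retry."""
    return {
        "status": 429,
        "headers": {"Retry-After": str(retry_after_seconds)},
        "body": {"error": "rate_limited", "message": "Too many requests"},
    }
```

Well-behaved clients honor `Retry-After` instead of hammering the endpoint, which pairs naturally with retry-with-backoff on the caller side.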

P99-Based Timeout Selection

Setting timeouts properly is one of the hardest parts of resilience engineering. Too short and you get false failures. Too long and you waste resources waiting on dead services.

A good starting formula: timeout = P99_latency * multiplier + headroom

def calculate_timeout(p99_latency_ms, multiplier=2.0, headroom_ms=100):
    """Set timeout based on observed P99 latency."""
    return (p99_latency_ms * multiplier) + headroom_ms

# Example: downstream service P99 is 200ms
# timeout = 200 * 2 + 100 = 500ms
timeout = calculate_timeout(200)  # 500ms

Monitor your timeouts against actual latency distributions. If you are regularly hitting timeouts, either the downstream service is slow (fix it) or your timeout is too tight.

The key insight: set timeouts based on what the 99th percentile of your callers experiences, not the average. The slowest calls are where timeouts matter most.

Combining Patterns

These patterns work best together. A resilient request might look like:

def resilient_get_product(product_id):
    try:
        # 1. Bulkhead: limit concurrent calls
        with bulkhead:
            # 2. Timeout: don't wait forever
            with timeout(seconds=5.0):
                # 3. Retry: transient failures might succeed
                for attempt in range(3):
                    try:
                        return product_service.get(product_id)
                    except TransientError:
                        if attempt == 2:
                            raise
                        sleep_with_jitter(base_delay=1.0)
    except (TransientError, TimeoutError, BulkheadFullError):
        # 4. Fallback: something went wrong
        return get_cached_product(product_id)
The same flow as a diagram:

graph TD
    A[Request] --> B{Bulkhead available?}
    B -->|No| J[Reject]
    B -->|Yes| C[Call with Timeout]
    C -->|Timeout| D[Retry?]
    D -->|Yes| E[Retry with Backoff]
    E --> C
    D -->|No| F[Circuit Open?]
    F -->|Yes| G[Fallback]
    F -->|No| H[Return Error]
    G --> I[Return Cached/Default]
    C -->|Success| I

Common Mistakes

Retrying Everything

Retrying non-idempotent operations can create duplicate work. Charging a credit card twice is bad. Design operations to be idempotent or do not retry them.

No Timeout

Without timeouts, retries can wait forever. If the service is truly down, you just repeat the wait. Always set timeouts.

Ignoring Fallbacks

When all else fails, your system should degrade gracefully. Cached data, default values, reduced functionality. Pick something over an error page.

Over-Engineering

You do not need every pattern for every call. Simple operations might only need timeouts. More critical operations might need bulkhead + timeout + retry + fallback. Match complexity to criticality.

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Retry amplifies outage | Retries overwhelm recovering service | Use exponential backoff with jitter; set max retries low |
| Timeout too long | Resources held while waiting for dead service | Set timeouts based on P99 latency; add circuit breakers |
| Fallback returns stale data | Business decisions made on outdated information | Monitor fallback freshness; alert when fallback used |
| Bulkhead exhaustion | Partition isolated but related functionality fails | Monitor all partitions; implement fallback for rejected work |
| Combination failure | Multiple patterns interact unexpectedly | Test patterns in combination; monitor cross-pattern metrics |

Observability Checklist

  • Metrics:

    • Retry success rate (first attempt vs after retries)
    • Timeout rate per service
    • Circuit state per downstream service
    • Bulkhead pool utilization per partition
    • Fallback activation rate
  • Logs:

    • Retry attempts with attempt number and delay
    • Timeout events with service and duration
    • Circuit state transitions
    • Bulkhead rejections per partition
    • Fallback invocations with context
  • Alerts:

    • Retry success rate drops below threshold
    • Timeouts increasing for specific service
    • Circuit enters open state
    • Bulkhead pool utilization exceeds 80%
    • Fallback activation rate spikes

Security Checklist

  • Retries do not expose sensitive data in logs
  • Fallback data properly sanitized
  • Timeouts enforced to prevent resource exhaustion attacks
  • Circuit breaker state not exposed to clients
  • Bulkhead configuration respects security boundaries
  • Monitoring logs do not contain credentials or PII

Common Anti-Patterns to Avoid

Applying All Patterns Everywhere

Not every call needs bulkhead + timeout + retry + fallback. Start simple. Add complexity only where measurements show it is needed.

Ignoring Retry Costs

Each retry consumes client and server resources. Know the cost and set limits accordingly.

Setting Timeouts Too Generously

A 30-second timeout on a service that normally responds in 100ms defeats the purpose. Timeouts should be based on actual service capability plus headroom.

Not Testing Resilience Behaviors

Test what happens when retries fail, when timeouts trigger, when circuits open. Simulate failures in staging.

Quick Recap

Key Bullets:

  • Retries handle transient failures; always use exponential backoff with jitter
  • Timeouts prevent waiting forever; set based on P99 latency plus headroom
  • Bulkheads partition resources to contain failure; monitor each partition
  • Circuit breakers detect persistent failures and stop calling; always implement fallbacks
  • Combine patterns thoughtfully; match complexity to criticality

Copy/Paste Checklist:

Resilience Pattern Implementation:
[ ] Identify all external service calls
[ ] Classify calls by criticality (critical, important, best-effort)
[ ] Set appropriate timeout per call based on service capability
[ ] Implement retry with exponential backoff + jitter for transient failures
[ ] Add circuit breaker for calls to external services
[ ] Implement bulkheads for partitioned workloads
[ ] Define fallback for each critical call
[ ] Monitor retry rate, timeout rate, circuit state, fallback usage
[ ] Test resilience behaviors in staging
[ ] Document expected behavior for each failure mode

When to Apply Each Pattern

Use retries for:

  • Transient failures (network hiccups, brief unavailability)
  • Idempotent operations
  • Operations where failure is more expensive than waiting

Use timeouts for:

  • All external calls
  • Any operation with a known acceptable response time

Use bulkheads for:

  • Multiple services sharing resources
  • Critical vs non-critical operations
  • Tenant isolation in SaaS applications

Use circuit breakers for:

  • Calls to external services that can become persistently unavailable
  • Preventing resource waste on doomed requests

Use fallbacks for:

  • Operations where stale data is acceptable
  • Non-critical functionality
  • When you want to avoid user-facing errors

When to Use / When Not to Use

When to Use Each Pattern

Retries with Backoff:

  • Use for transient failures (network timeouts, temporary unavailability, 503 errors)
  • Use for idempotent operations where retrying is safe
  • Use when failure is more expensive than the retry cost (e.g., critical transactions)
  • Do not use for non-idempotent operations without careful handling
  • Do not use when failures indicate fundamental problems (service down, validation errors)

Timeouts:

  • Use for ALL external service calls without exception
  • Use when operations have known acceptable response times
  • Set based on P99 latency plus headroom, not average case
  • Do not set too generously (defeats purpose) or too tightly (false positives)

Bulkheads:

  • Use when multiple services share thread pools or connection limits
  • Use for critical vs non-critical operation isolation
  • Use for tenant isolation in SaaS applications
  • Do not use when overhead outweighs benefit (single service, low traffic)

Circuit Breakers:

  • Use for calls to external services that can become persistently unavailable
  • Use to prevent resource exhaustion during outages
  • Use when you need to fail fast rather than wait for timeouts
  • Do not use as a substitute for timeouts; the two patterns complement each other
  • Do not use when failing operations have no acceptable fallback

Fallbacks:

  • Use when stale data is acceptable (cached results, default values)
  • Use for non-critical functionality where errors create poor UX
  • Use to provide graceful degradation instead of hard errors
  • Do not use when fresh data is required (financial transactions, real-time decisions)
  • Do not use when fallback data could cause incorrect business logic

Decision Flow

graph TD
    A[Adding Resilience] --> B{Operation Type?}
    B -->|External Service Call| C{Need to Handle Outage?}
    C -->|Yes| D[Timeout + Circuit Breaker]
    D --> E{Has Fallback?}
    E -->|Yes| F[Add Fallback]
    E -->|No| G[Error Handling]
    C -->|No| H[Timeout Only]
    B -->|Shared Resources| I[Bulkhead Isolation]
    I --> C
    B -->|Transient Failure| J[Retry with Backoff + Jitter]
    J --> D
    H --> K[Monitor & Alert]
    G --> K
    F --> K

Which Pattern Handles Which Failure

Use this table to determine which resilience pattern addresses which failure type:

| Failure Type | Primary Pattern | Supporting Patterns | Why |
|---|---|---|---|
| Transient network error | Retry | Timeout | Retries recover from brief network hiccups |
| Slow response | Timeout | Circuit Breaker | Timeout prevents waiting indefinitely |
| Service permanently down | Circuit Breaker | Fallback | Circuit breaker stops calling; fallback provides alternative |
| Resource exhaustion | Bulkhead | Circuit Breaker | Bulkhead isolates consumption; circuit breaker stops calling |
| Dependent service failing | Fallback | Circuit Breaker | Fallback provides alternative; circuit breaker prevents cascade |
| Thundering herd | Retry with Jitter | Bulkhead | Jitter spreads retries; bulkhead limits concurrent calls |
| Noisy neighbor | Bulkhead | Rate Limiter | Bulkhead isolates partitions; rate limiter controls volume |
| Cascading failure | Circuit Breaker | Bulkhead | Circuit breaker stops propagation; bulkhead contains blast radius |


For more on resilience patterns, see Rate Limiting, Circuit Breaker Pattern, and Bulkhead Pattern.


Related Posts

Bulkhead Pattern: Isolate Failures Before They Spread

The Bulkhead pattern prevents resource exhaustion by isolating workloads. Learn how to implement bulkheads, partition resources, and use them with circuit breakers.

#patterns #resilience #fault-tolerance

Circuit Breaker Pattern: Fail Fast, Recover Gracefully

The Circuit Breaker pattern prevents cascading failures in distributed systems. Learn states, failure thresholds, half-open recovery, and implementation.

#patterns #resilience #fault-tolerance

Graceful Degradation: Systems That Bend Instead of Break

Design systems that maintain core functionality when components fail through fallback strategies, degradation modes, and progressive service levels.

#distributed-systems #fault-tolerance #resilience