Resilience Patterns: Retry, Timeout, Bulkhead, and Fallback
Build systems that survive failures. Learn retry with backoff, timeout patterns, bulkhead isolation, circuit breakers, and fallback strategies.
Distributed systems fail. Networks partition. Services crash. Databases slow down. Your system will encounter failures. The question is whether it survives gracefully.
Resilience patterns give your system tools to handle failures without cascading into outages. This article covers retries, timeouts, bulkheads, circuit breakers, and fallbacks. Used together, they create systems that bend but do not break.
Retries with Backoff
When a request fails, retry it. Simple enough, except that naive retries amplify problems. If a service is overloaded, your retries add load and make things worse.
Retries work for transient failures: network hiccups, temporary unavailability, brief lock contention. Retries do not work when failures are fundamental. If the service is down, retries just create more load.
Exponential Backoff
Wait longer between each retry attempt. First retry after 1 second, second after 2 seconds, third after 4 seconds. This gives the service time to recover.
```python
import random
import time

class TransientError(Exception):
    """Placeholder for your service's transient-failure exceptions."""

def retry_with_backoff(func, max_retries=3, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Add jitter to prevent thundering herd
            delay = delay * (0.5 + random.random())
            time.sleep(delay)
```
Jitter
Without jitter, all clients retry at the same time. Add randomness to spread out retry attempts:
```python
def retry_with_jitter(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            # Full jitter: pick a random delay anywhere up to the backoff cap
            delay = random.uniform(0, delay)
            time.sleep(delay)
```
What to Retry
Retry on transient failures:
- Network timeouts
- Connection refused
- Service unavailable (503)
- Gateway timeout (504)
Do not retry on client errors:
- Bad request (400)
- Unauthorized (401)
- Not found (404)
- Validation errors (422)
Also beware of retrying after ambiguous failures: if a request times out but the operation actually completed, retrying a non-idempotent operation creates duplicate work.
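One way to encode this classification is a small helper keyed on HTTP status. The status sets below mirror the lists above and are illustrative, not exhaustive; many teams also treat 429 (Too Many Requests) as retryable after honoring the Retry-After delay.

```python
# Status sets mirror the retry/no-retry lists above; tune them per service.
RETRYABLE_STATUSES = {503, 504}                      # service unavailable, gateway timeout
NON_RETRYABLE_STATUSES = {400, 401, 403, 404, 422}   # client errors: retrying cannot help

def is_retryable(status_code: int) -> bool:
    """Return True only for failures that a retry might fix."""
    if status_code in NON_RETRYABLE_STATUSES:
        return False
    return status_code in RETRYABLE_STATUSES
```

Defaulting to "not retryable" for unknown statuses is the safer choice: an unexpected retry costs more than an unexpected failure.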
Timeout Patterns
Timeouts prevent your system from waiting forever. A service that never responds holds a thread, a connection, memory. Eventually, resources are exhausted and the system dies.
Setting Timeouts
Set timeouts based on what the operation actually needs. A simple database query might need 1-5 seconds. A call to an external API might need 10-30 seconds. Know your services and set appropriate timeouts.
```python
from concurrent.futures import ThreadPoolExecutor

# Per-operation timeouts (seconds)
TIMEOUTS = {
    'database_query': 5.0,
    'external_payment': 30.0,
    'cache_lookup': 0.5,
    'file_upload': 120.0,
}

def call_with_timeout(service_name, func, *args, **kwargs):
    timeout = TIMEOUTS.get(service_name, 30.0)
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(func, *args, **kwargs)
        # Raises concurrent.futures.TimeoutError if the call runs too long
        return future.result(timeout=timeout)
```
Timeout vs Retry
Timeouts and retries solve different problems. Timeouts prevent waiting indefinitely. Retries handle transient failures.
Use both. A request times out, you retry. The retry succeeds. Without timeout, you would have waited forever. Without retry, the timeout would have been a permanent failure.
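As a sketch of that combination (the helper name and defaults are illustrative), each attempt is bounded by a timeout, and timed-out attempts are retried with exponential backoff and jitter:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout_and_retry(func, timeout=5.0, max_retries=3, base_delay=1.0):
    """Bound each attempt with a timeout; retry timed-out attempts
    with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            with ThreadPoolExecutor(max_workers=1) as executor:
                return executor.submit(func).result(timeout=timeout)
        except FutureTimeout:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
```

One caveat with this thread-based approach: the timed-out worker keeps running until `func` returns, and the executor's context exit waits for it. True cancellation requires cooperative checks inside `func` or a timeout on the underlying I/O call.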
Bulkhead Pattern
Bulkheads partition resources so that problems in one area do not drain resources from other areas. If one service fails and holds threads, bulkheads prevent that failure from affecting other services.
See the Bulkhead Pattern article for details.
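As a minimal illustration of the idea (not the full implementation from that article), a bulkhead can be sketched as a semaphore that rejects calls once the partition is full:

```python
import threading

class Bulkhead:
    """Cap concurrent calls with a semaphore so one slow dependency
    cannot consume every thread. Illustrative sketch."""
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: fail fast instead of queueing behind a slow call
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```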
Circuit Breaker Pattern
Circuit breakers detect persistent failures and stop making requests. When a service is repeatedly failing, the circuit opens. Requests fail immediately without consuming resources. This gives the failing service time to recover.
See the Circuit Breaker Pattern article for details.
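A minimal sketch of that behavior (names and thresholds are illustrative; see the dedicated article for half-open handling and production concerns):

```python
import time

class CircuitBreaker:
    """After `failure_threshold` consecutive failures, the circuit opens and
    calls fail fast until `reset_timeout` elapses. Illustrative sketch."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow a trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success closes the circuit
        return result
```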
Fallback Patterns
When a service fails and retries are exhausted, have a plan B. Return cached data, default values, or a graceful error. Do not let exceptions propagate.
Cached Data Fallback
```python
def get_product(product_id):
    try:
        return product_service.get(product_id)
    except ServiceError:
        cached = cache.get(f"product:{product_id}")
        if cached:
            return cached
        raise ProductServiceUnavailable()
```
Default Value Fallback
```python
def get_user_permissions(user_id):
    try:
        return permission_service.get_permissions(user_id)
    except ServiceError:
        # Safe default: minimal permissions
        return ['read:own_data']
```
Graceful Degradation
```python
def get_recommendations(user_id):
    try:
        return ml_service.get_recommendations(user_id)
    except ServiceError:
        # Fall back to popularity-based recommendations
        return get_popular_items(limit=10)
```
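The three fallbacks above share a shape: try the primary, and on failure return something safer. One way to factor that out is a small decorator; the decorator and the simulated outage below are illustrative:

```python
from functools import wraps

def with_fallback(fallback):
    """On failure, call `fallback` with the same arguments.
    In production, catch your specific service exceptions, not bare Exception."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                return fallback(*args, **kwargs)
        return wrapper
    return decorator

@with_fallback(lambda user_id: ['read:own_data'])
def get_user_permissions(user_id):
    raise ConnectionError("permission service down")  # simulated outage
```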
Rate Limiting: Controlling Request Volume
Rate limiting protects services from being overwhelmed by too many requests. Unlike circuit breakers (which react to failures), rate limiters proactively reject excess load before it causes problems.
```python
import time

class TokenBucketRateLimiter:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last_refill = time.time()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
```
How rate limiting combines with other resilience patterns:
| Rate Limiter Position | What It Protects | Combines Well With |
|---|---|---|
| Per-client | Individual tenants from overwhelming shared services | Bulkhead, circuit breaker |
| Per-service | A service from its downstream dependencies | Timeout, circuit breaker |
| Per-endpoint | Specific expensive operations | Bulkhead, retry |
| Global | Entire system from external traffic | Circuit breaker, fallback |
When rate limiting fires, return HTTP 429 (Too Many Requests) with a Retry-After header so clients know when to retry.
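A framework-agnostic sketch of that response path (the handler shape is illustrative, and the inline token bucket repeats the refill logic shown above so the example is self-contained):

```python
import time

class SimpleTokenBucket:
    """Inline token bucket (same refill logic as the limiter above)."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.time()

    def allow_request(self):
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def handle_request(limiter, process):
    # Reject excess load with 429 plus a Retry-After hint derived
    # from the refill rate (seconds until roughly one token returns).
    if limiter.allow_request():
        return {"status": 200, "body": process()}
    retry_after = max(1, round(1 / limiter.rate))
    return {"status": 429, "headers": {"Retry-After": str(retry_after)}}
```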
P99-Based Timeout Selection
Setting timeouts properly is one of the hardest parts of resilience engineering. Too short and you get false failures. Too long and you waste resources waiting on dead services.
A good starting formula: timeout = P99_latency * multiplier + headroom
```python
def calculate_timeout(p99_latency_ms, multiplier=2.0, headroom_ms=100):
    """Set timeout based on observed P99 latency."""
    return (p99_latency_ms * multiplier) + headroom_ms

# Example: downstream service P99 is 200ms
# timeout = 200 * 2 + 100 = 500ms
timeout = calculate_timeout(200)  # 500ms
```
Monitor your timeouts against actual latency distributions. If you are regularly hitting timeouts, either the downstream service is slow (fix it) or your timeout is too tight.
The key insight: set timeouts based on the 99th percentile of observed latency, not the average. The slowest calls are where timeouts matter most.
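One way to derive that number from raw measurements is the nearest-rank percentile; `calculate_timeout` is restated here so the sketch is self-contained:

```python
import math

def p99(latencies_ms):
    """Nearest-rank 99th percentile of observed latency samples."""
    ordered = sorted(latencies_ms)
    index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[index]

def calculate_timeout(p99_latency_ms, multiplier=2.0, headroom_ms=100):
    # Restated from the formula above for a runnable example
    return (p99_latency_ms * multiplier) + headroom_ms

# 100 samples: mostly 100ms with a slow tail
samples = [100] * 98 + [180, 200]
timeout_ms = calculate_timeout(p99(samples))  # p99 = 180ms -> timeout = 460ms
```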
Combining Patterns
These patterns work best together. A resilient request might look like:
```python
def resilient_get_product(product_id):
    try:
        # 1. Bulkhead: limit concurrent calls
        with bulkhead:
            # 2. Timeout: don't wait forever
            with timeout(seconds=5.0):
                # 3. Retry: transient failures might succeed
                for attempt in range(3):
                    try:
                        return product_service.get(product_id)
                    except TransientError:
                        if attempt == 2:
                            raise
                        sleep_with_jitter(base_delay=1.0)
    except Exception:
        # 4. Fallback: bulkhead full, timed out, or retries exhausted
        return get_cached_product(product_id)
```
```mermaid
graph TD
    A[Request] --> B{Bulkhead available?}
    B -->|No| J[Reject]
    B -->|Yes| C[Call with Timeout]
    C -->|Timeout| D[Retry?]
    D -->|Yes| E[Retry with Backoff]
    E --> C
    D -->|No| F[Circuit Open?]
    F -->|Yes| G[Fallback]
    F -->|No| H[Return Error]
    G --> I[Return Cached/Default]
    C -->|Success| I
```
Common Mistakes
Retrying Everything
Retrying non-idempotent operations can create duplicate work. Charging a credit card twice is bad. Design operations to be idempotent or do not retry them.
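A common way to make such an operation safe to retry is a client-generated idempotency key; the in-memory store and charge function below are illustrative stand-ins:

```python
import uuid

# In-memory idempotency store; use a durable store (database, Redis)
# in production. All names here are illustrative.
_processed = {}  # idempotency_key -> original result

def charge_card(idempotency_key, amount_cents):
    if idempotency_key in _processed:
        # Replay the original result instead of charging again
        return _processed[idempotency_key]
    result = {"charged": amount_cents, "charge_id": str(uuid.uuid4())}
    _processed[idempotency_key] = result
    return result
```

The client generates the key once per logical operation (for example, per order) and reuses it on every retry, so a retry after an ambiguous timeout replays the first charge rather than creating a second one.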
No Timeout
Without timeouts, retries can wait forever. If the service is truly down, you just repeat the wait. Always set timeouts.
Ignoring Fallbacks
When all else fails, your system should degrade gracefully. Cached data, default values, reduced functionality. Pick something over an error page.
Over-Engineering
You do not need every pattern for every call. Simple operations might only need timeouts. More critical operations might need bulkhead + timeout + retry + fallback. Match complexity to criticality.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Retry amplifies outage | Retries overwhelm recovering service | Use exponential backoff with jitter; set max retries low |
| Timeout too long | Resources held while waiting for dead service | Set timeouts based on P99 latency; add circuit breakers |
| Fallback returns stale data | Business decisions made on outdated information | Monitor fallback freshness; alert when fallback used |
| Bulkhead exhaustion | Partition isolated but related functionality fails | Monitor all partitions; implement fallback for rejected work |
| Combination failure | Multiple patterns interact unexpectedly | Test patterns in combination; monitor cross-pattern metrics |
Observability Checklist
Metrics:
- Retry success rate (first attempt vs after retries)
- Timeout rate per service
- Circuit state per downstream service
- Bulkhead pool utilization per partition
- Fallback activation rate
Logs:
- Retry attempts with attempt number and delay
- Timeout events with service and duration
- Circuit state transitions
- Bulkhead rejections per partition
- Fallback invocations with context
Alerts:
- Retry success rate drops below threshold
- Timeouts increasing for specific service
- Circuit enters open state
- Bulkhead pool utilization exceeds 80%
- Fallback activation rate spikes
Security Checklist
- Retries do not expose sensitive data in logs
- Fallback data properly sanitized
- Timeouts enforced to prevent resource exhaustion attacks
- Circuit breaker state not exposed to clients
- Bulkhead configuration respects security boundaries
- Monitoring logs do not contain credentials or PII
Common Anti-Patterns to Avoid
Applying All Patterns Everywhere
Not every call needs bulkhead + timeout + retry + fallback. Start simple. Add complexity only where measurements show it is needed.
Ignoring Retry Costs
Each retry consumes client and server resources. Know the cost and set limits accordingly.
Setting Timeouts Too Generously
A 30-second timeout on a service that normally responds in 100ms defeats the purpose. Timeouts should be based on actual service capability plus headroom.
Not Testing Resilience Behaviors
Test what happens when retries fail, when timeouts trigger, when circuits open. Simulate failures in staging.
Quick Recap
Key Bullets:
- Retries handle transient failures; always use exponential backoff with jitter
- Timeouts prevent waiting forever; set based on P99 latency plus headroom
- Bulkheads partition resources to contain failure; monitor each partition
- Circuit breakers detect persistent failures and stop calling; always implement fallbacks
- Combine patterns thoughtfully; match complexity to criticality
Copy/Paste Checklist:
Resilience Pattern Implementation:
[ ] Identify all external service calls
[ ] Classify calls by criticality (critical, important, best-effort)
[ ] Set appropriate timeout per call based on service capability
[ ] Implement retry with exponential backoff + jitter for transient failures
[ ] Add circuit breaker for calls to external services
[ ] Implement bulkheads for partitioned workloads
[ ] Define fallback for each critical call
[ ] Monitor retry rate, timeout rate, circuit state, fallback usage
[ ] Test resilience behaviors in staging
[ ] Document expected behavior for each failure mode
When to Apply Each Pattern
Use retries for:
- Transient failures (network hiccups, brief unavailability)
- Idempotent operations
- Operations where failure is more expensive than waiting
Use timeouts for:
- All external calls
- Any operation with a known acceptable response time
Use bulkheads for:
- Multiple services sharing resources
- Critical vs non-critical operations
- Tenant isolation in SaaS applications
Use circuit breakers for:
- Calls to external services that can become persistently unavailable
- Preventing resource waste on doomed requests
Use fallbacks for:
- Operations where stale data is acceptable
- Non-critical functionality
- When you want to avoid user-facing errors
When to Use / When Not to Use
When to Use Each Pattern
Retries with Backoff:
- Use for transient failures (network timeouts, temporary unavailability, 503 errors)
- Use for idempotent operations where retrying is safe
- Use when failure is more expensive than the retry cost (e.g., critical transactions)
- Do not use for non-idempotent operations without careful handling
- Do not use when failures indicate fundamental problems (service down, validation errors)
Timeouts:
- Use for ALL external service calls without exception
- Use when operations have known acceptable response times
- Set based on P99 latency plus headroom, not average case
- Do not set too generously (defeats purpose) or too tightly (false positives)
Bulkheads:
- Use when multiple services share thread pools or connection limits
- Use for critical vs non-critical operation isolation
- Use for tenant isolation in SaaS applications
- Do not use when overhead outweighs benefit (single service, low traffic)
Circuit Breakers:
- Use for calls to external services that can become persistently unavailable
- Use to prevent resource exhaustion during outages
- Use when you need to fail fast rather than wait for timeouts
- Do not use as a substitute for timeouts; the two patterns complement each other
- Do not use when failing operations have no acceptable fallback
Fallbacks:
- Use when stale data is acceptable (cached results, default values)
- Use for non-critical functionality where errors create poor UX
- Use to provide graceful degradation instead of hard errors
- Do not use when fresh data is required (financial transactions, real-time decisions)
- Do not use when fallback data could cause incorrect business logic
Decision Flow
```mermaid
graph TD
    A[Adding Resilience] --> B{Operation Type?}
    B -->|External Service Call| C{Need to Handle Outage?}
    C -->|Yes| D[Timeout + Circuit Breaker]
    D --> E{Has Fallback?}
    E -->|Yes| F[Add Fallback]
    E -->|No| G[Error Handling]
    C -->|No| H[Timeout Only]
    B -->|Shared Resources| I[Bulkhead Isolation]
    I --> C
    B -->|Transient Failure| J[Retry with Backoff + Jitter]
    J --> D
    H --> K[Monitor & Alert]
    G --> K
    F --> K
```
Which Pattern Handles Which Failure
Use this table to determine which resilience pattern addresses which failure type:
| Failure Type | Primary Pattern | Supporting Patterns | Why |
|---|---|---|---|
| Transient network error | Retry | Timeout | Retries recover from brief network hiccups |
| Slow response | Timeout | Circuit Breaker | Timeout prevents waiting indefinitely |
| Service permanently down | Circuit Breaker | Fallback | Circuit breaker stops calling; fallback provides alternative |
| Resource exhaustion | Bulkhead | Circuit Breaker | Bulkhead isolates consumption; circuit breaker stops calling |
| Dependent service failing | Fallback | Circuit Breaker | Fallback provides alternative; circuit breaker prevents cascade |
| Thundering herd | Retry with Jitter | Bulkhead | Jitter spreads retries; bulkhead limits concurrent calls |
| Noisy neighbor | Bulkhead | Rate Limiter | Bulkhead isolates partitions; rate limiter controls volume |
| Cascading failure | Circuit Breaker | Bulkhead | Circuit breaker stops propagation; bulkhead contains blast radius |
Summary
For more on resilience patterns, see Rate Limiting, Circuit Breaker Pattern, and Bulkhead Pattern.