Resilience Patterns: Retry, Timeout, Bulkhead & Fallback
Build systems that survive failures. Learn retry with backoff, timeout patterns, bulkhead isolation, circuit breakers, and fallback strategies.
Introduction
Resilience patterns are techniques that help distributed systems recover from failures gracefully. Rather than treating failures as exceptional events to handle reactively, resilience patterns embed recovery mechanisms into the system’s architecture from the ground up.
The core challenge: distributed systems involve multiple services communicating over networks. Networks partition. Services crash. Databases slow down. Any component can fail at any time. Your system must anticipate these failures and respond in a way that preserves overall functionality.
Resilience patterns address this by providing structured approaches to:
- Detect failures quickly (timeouts, circuit breakers)
- Recover from transient issues (retries with backoff)
- Contain failure blast radius (bulkheads)
- Gracefully degrade when full functionality is unavailable (fallbacks)
Each pattern addresses a specific failure mode. Together, they form a comprehensive toolkit for building systems that survive production failures.
Core Concepts
Before diving into individual patterns, several core concepts apply across multiple resilience strategies. These foundational ideas shape how you evaluate and combine patterns—understanding them prevents common misapplications. They also provide a vocabulary for discussing resilience trade-offs with your team.
Understanding the distinction between different failure modes is essential for selecting the right resilience strategy. Not all failures respond to the same treatment, and applying the wrong pattern wastes resources or leaves you unprotected. The following concepts form the mental model that guides pattern selection throughout this guide.
Transient vs Permanent Failures
Not all failures are created equal. Transient failures are temporary—network hiccups, brief unavailability, lock contentions. Retrying after a delay often succeeds. Permanent failures are fundamental—if a service is down or a request is malformed, retrying just wastes resources. Understanding which failure type you’re dealing with determines which pattern to apply.
Failure Isolation
A system without isolation is a system where a failure in one component cascades to affect everything. Bulkheads partition resources so that problems in one area do not drain resources from other areas. Circuit breakers stop calling failing services entirely. Both patterns contain failure blast radius.
Designing for Graceful Degradation
When things go wrong, your system should degrade elegantly. Return cached data. Use default values. Provide reduced functionality. The goal is to keep the system working even when individual components fail. This requires planning—fallback strategies must be designed before failures occur.
Observability
Patterns only help if you can see when they’re activated. Monitoring retry rates, timeout frequencies, circuit breaker states, and fallback usage tells you when resilience mechanisms are working—and when they’re not. Without observability, you’re flying blind.
Cost vs Benefit
Every resilience pattern has overhead. Retries multiply load. Bulkheads reserve capacity. Circuit breakers add latency checks. Match complexity to criticality—simple operations might only need timeouts, while critical transactions may need the full stack.
Retries with Backoff
When a request fails, retry it. Simple enough, except that naive retries amplify problems. If a service is overloaded, your retries add load and make things worse.
Retries work for transient failures. Network hiccups, temporary unavailability, brief lock contentions. Retries do not work when failures are fundamental. If the service is down, retries just create more load.
The key to effective retries lies in knowing what to retry, how long to wait between attempts, and when to give up entirely. Blind retry loops waste resources and can trigger cascading failures; well-designed retry logic requires exponential backoff with jitter to avoid thundering herds. The subsections below break down each dimension of retry strategy.
Exponential Backoff
Wait longer between each retry attempt. First retry after 1 second, second after 2 seconds, third after 4 seconds. This gives the service time to recover.
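import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError)):
    # retry_on: exception types treated as transient (an assumption; adjust
    # it to whatever your client library raises)
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... capped at max_delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Add jitter (50-150% of the delay) to prevent thundering herd
            delay = delay * (0.5 + random.random())
            time.sleep(delay)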
Jitter
Without jitter, all clients retry at the same time. Add randomness to spread out retry attempts:
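import random
import time

def retry_with_jitter(func, max_retries=3, base_delay=1.0,
                      retry_on=(ConnectionError, TimeoutError)):
    # retry_on: exception types treated as transient (an assumption)
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            # Full jitter: pick any wait between 0 and the backoff delay
            delay = random.uniform(0, delay)
            time.sleep(delay)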
What to Retry
Retry on transient failures:
- Network timeouts
- Connection refused
- Service unavailable (503)
- Gateway timeout (504)
Do not retry on client errors:
- Bad request (400)
- Unauthorized (401)
- Not found (404)
- Validation errors (422)
Do not blindly retry non-idempotent operations: if you get a timeout but the operation actually completed, retrying creates duplicate work. Idempotent operations are safe to repeat because a second execution has no additional effect.
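The retry rules above can be sketched as a small classifier. This is a minimal illustration rather than a complete policy: the name `should_retry`, the `idempotent` flag, and the decision to treat unlisted status codes as non-retryable are assumptions made for this example.

```python
# Status codes taken from the lists above.
RETRYABLE_STATUS = frozenset({503, 504})
NON_RETRYABLE_STATUS = frozenset({400, 401, 404, 422})


def should_retry(status_code, idempotent):
    """Return True only when a retry is likely to help and safe to repeat."""
    if status_code in NON_RETRYABLE_STATUS:
        return False  # the request itself is wrong; retrying cannot fix it
    if status_code in RETRYABLE_STATUS:
        # Assumption: only repeat the call when doing so cannot duplicate work.
        return idempotent
    return False  # assumption: be conservative about codes not listed above


print(should_retry(503, idempotent=True))   # True
print(should_retry(422, idempotent=True))   # False
print(should_retry(504, idempotent=False))  # False
```

Keeping the decision in one function makes it easy to log why a request was or was not retried.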
Timeout Patterns
Timeouts prevent your system from waiting forever. A service that never responds holds a thread, a connection, memory. Eventually, resources exhaust and the system dies.
Setting Timeouts
Set timeouts based on what the operation actually needs. A simple database query might need 1-5 seconds. A call to an external API might need 10-30 seconds. Know your services and set appropriate timeouts.
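from concurrent.futures import ThreadPoolExecutor

# Per-operation timeouts (seconds)
TIMEOUTS = {
    'database_query': 5.0,
    'external_payment': 30.0,
    'cache_lookup': 0.5,
    'file_upload': 120.0,
}

def call_with_timeout(service_name, func, *args, **kwargs):
    timeout = TIMEOUTS.get(service_name, 30.0)
    # No "with" block here: leaving it would wait for the worker to finish,
    # which defeats the timeout
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(func, *args, **kwargs)
        # Raises concurrent.futures.TimeoutError once the deadline passes
        return future.result(timeout=timeout)
    finally:
        # The caller moves on; a stuck call keeps running in the background
        executor.shutdown(wait=False)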
Timeout vs Retry
Timeouts and retries solve different problems. Timeouts prevent waiting indefinitely. Retries handle transient failures.
Use both. A request times out, you retry. The retry succeeds. Without timeout, you would have waited forever. Without retry, the timeout would have been a permanent failure.
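One way to combine the two is to give each attempt its own timeout and cap the whole call with an overall deadline. The sketch below assumes a hypothetical `operation(timeout=...)` callable that raises `TimeoutError` when an attempt runs out of time; the function name, parameters, and backoff values are illustrative, and jitter is left out for brevity.

```python
import time


def call_with_deadline(operation, per_try_timeout, overall_deadline,
                       max_attempts=3, sleep=time.sleep):
    """Retry timed-out attempts without exceeding overall_deadline seconds."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        remaining = overall_deadline - (time.monotonic() - start)
        if remaining <= 0:
            break  # total time budget spent: stop retrying
        try:
            # Never give a single attempt more time than the budget has left.
            return operation(timeout=min(per_try_timeout, remaining))
        except TimeoutError:
            if attempt < max_attempts - 1:
                sleep(0.1 * (2 ** attempt))  # short backoff before the next try
    raise TimeoutError("no attempt succeeded within the deadline")
```

The overall deadline keeps the worst case bounded: three attempts at two seconds each can never hold a caller longer than the budget it was given.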
Bulkhead Pattern
Bulkheads partition resources so that problems in one area do not drain resources from other areas. If one service fails and holds threads, bulkheads prevent that failure from affecting other services.
See the Bulkhead Pattern article for details.
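As a rough illustration of the idea, a bulkhead can be as small as a semaphore that caps concurrent calls per partition. The `Bulkhead` and `BulkheadFull` names below are hypothetical, and rejecting immediately when the partition is full (rather than queueing) is one possible design choice among several.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a partition has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Reject at once instead of queueing, so callers fail fast.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no capacity left in this partition")
        try:
            yield
        finally:
            self._slots.release()


# One bulkhead per downstream dependency keeps a slow one from starving the rest.
payments = Bulkhead(max_concurrent=10)
with payments.slot():
    pass  # call the payment service here
```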
Circuit Breaker Pattern
Circuit breakers detect persistent failures and stop making requests. When a service is repeatedly failing, the circuit opens. Requests fail immediately without consuming resources. This gives the failing service time to recover.
See the Circuit Breaker Pattern article for details.
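For orientation, here is a deliberately small breaker sketch. It is a simplification under stated assumptions: it counts consecutive failures instead of a failure rate over a time window, it is not thread-safe, and in the half-open state it lets every call through rather than a limited number of probes. The class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe, or too many failures in a row, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Injecting the clock keeps the state transitions testable without real waiting.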
Fallback Patterns
When a service fails and retries are exhausted, have a plan B. Return cached data, default values, or a graceful error. Do not let exceptions propagate.
Fallbacks are the last line of defense before a failure reaches the user. A well-designed fallback preserves core functionality by substituting degraded-but-acceptable responses for ideal ones. Planning these alternatives ahead of time means your system degrades gracefully instead of crashing spectacularly.
Cached Data Fallback
def get_product(product_id):
    try:
        return product_service.get(product_id)
    except ServiceError:
        cached = cache.get(f"product:{product_id}")
        if cached:
            return cached
        raise ProductServiceUnavailable()
Default Value Fallback
def get_user_permissions(user_id):
    try:
        return permission_service.get_permissions(user_id)
    except ServiceError:
        # Safe default: minimal permissions
        return ['read:own_data']
Graceful Degradation
def get_recommendations(user_id):
    try:
        return ml_service.get_recommendations(user_id)
    except ServiceError:
        # Fall back to popularity-based recommendations
        return get_popular_items(limit=10)
Rate Limiting: Controlling Request Volume
Rate limiting protects services from being overwhelmed by too many requests. Unlike circuit breakers (which react to failures), rate limiters proactively reject excess load before it causes problems.
import time

class TokenBucketRateLimiter:
    def __init__(self, rate, capacity):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        # Monotonic clock: unaffected by system clock adjustments
        self.last_refill = time.monotonic()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Add tokens for the elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
How rate limiting combines with other resilience patterns:
| Rate Limiter Position | What It Protects | Combines Well With |
|---|---|---|
| Per-client | Individual tenants from overwhelming shared services | Bulkhead, circuit breaker |
| Per-service | A service from its downstream dependencies | Timeout, circuit breaker |
| Per-endpoint | Specific expensive operations | Bulkhead, retry |
| Global | Entire system from external traffic | Circuit breaker, fallback |
When rate limiting fires, return HTTP 429 (Too Many Requests) with a Retry-After header so clients know when to retry.
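A sketch of how the rejection could be built from token-bucket state follows. The `(status, headers, body)` tuple is an assumed shape for illustration and not any particular framework's API; `rate_limit_response` is a hypothetical helper, and `rate` must be positive.

```python
import math


def rate_limit_response(tokens, rate):
    """Build a 429 response for a bucket holding `tokens`, refilled at `rate`/s."""
    missing = max(0.0, 1.0 - tokens)  # tokens still needed for one request
    # Retry-After is expressed in whole seconds; round up and never send 0.
    retry_after = max(1, math.ceil(missing / rate))
    return 429, {"Retry-After": str(retry_after)}, "Too Many Requests"


status, headers, body = rate_limit_response(tokens=0.0, rate=0.5)
print(status, headers["Retry-After"])  # 429 2
```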
P99-Based Timeout Selection
Setting timeouts properly is one of the hardest parts of resilience engineering. Too short and you get false failures. Too long and you waste resources waiting on dead services.
A good starting formula: timeout = P99_latency * multiplier + headroom
def calculate_timeout(p99_latency_ms, multiplier=2.0, headroom_ms=100):
    """Set timeout based on observed P99 latency."""
    return (p99_latency_ms * multiplier) + headroom_ms

# Example: downstream service P99 is 200ms
# timeout = 200 * 2 + 100 = 500ms
timeout = calculate_timeout(200)  # 500ms
Monitor your timeouts against actual latency distributions. If you are regularly hitting timeouts, either the downstream service is slow (fix it) or your timeout is too tight.
The key insight: base timeouts on the 99th-percentile latency your callers actually observe, not the average. The slowest calls are where timeouts matter most.
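To see why the average is misleading, the sketch below derives P99 from raw latency samples with the standard library and feeds it into the same formula. The sample data is synthetic and the helper names are hypothetical; in practice the percentile would usually come from your metrics system.

```python
import statistics


def p99(latencies_ms):
    """99th percentile: the last of the 99 cut points from quantiles(n=100)."""
    return statistics.quantiles(latencies_ms, n=100)[-1]


def suggested_timeout_ms(latencies_ms, multiplier=2.0, headroom_ms=100):
    return p99(latencies_ms) * multiplier + headroom_ms


# Synthetic samples: 98% of calls take 50 ms, 2% take 400 ms.
samples = [50] * 980 + [400] * 20
print(statistics.mean(samples))       # the average is only about 57 ms
print(p99(samples))                   # the tail sits at 400 ms
print(suggested_timeout_ms(samples))  # 400 * 2 + 100
```

A timeout derived from the 57 ms average would cut off the slowest 2% of perfectly healthy calls.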
Combining Patterns
These patterns work best together. A resilient request might look like:
def resilient_get_product(product_id):
    # bulkhead, timeout, transient_error, and sleep_with_jitter are
    # placeholders for your own helpers
    try:
        # 1. Bulkhead: limit concurrent calls
        with bulkhead:
            # 2. Timeout: don't wait forever
            with timeout(seconds=5.0):
                # 3. Retry: transient failures might succeed
                for attempt in range(3):
                    try:
                        return product_service.get(product_id)
                    except transient_error:
                        if attempt == 2:
                            raise
                        sleep_with_jitter(base_delay=1.0)
    except Exception:
        # 4. Fallback: something went wrong
        return get_cached_product(product_id)
graph TD
A[Request] --> B{Bulkhead available?}
B -->|No| J[Reject]
B -->|Yes| C[Call with Timeout]
C -->|Timeout| D[Retry?]
D -->|Yes| E[Retry with Backoff]
E --> C
D -->|No| F[Circuit Open?]
F -->|Yes| G[Fallback]
F -->|No| H[Return Error]
G --> I[Return Cached/Default]
C -->|Success| K[Return Response]
When to Use / When Not to Use
Resilience patterns are tools, and like any tool, they have specific use cases where they excel and others where they add unnecessary complexity. Choosing the right pattern—or deciding not to use one at all—requires understanding the failure modes you’re protecting against and the cost of protection.
This section provides concrete guidance on which patterns to apply in different scenarios, including decision trees and comparison tables to help you make informed choices for your specific context.
When to Use Each Pattern
Retries with Backoff:
- Use for transient failures (network timeouts, temporary unavailability, 503 errors)
- Use for idempotent operations where retrying is safe
- Use when failure is more expensive than the retry cost (e.g., critical transactions)
- Do not use for non-idempotent operations without careful handling
- Do not use when failures indicate fundamental problems (service down, validation errors)
Timeouts:
- Use for ALL external service calls without exception
- Use when operations have known acceptable response times
- Set based on P99 latency plus headroom, not average case
- Do not set too generously (defeats purpose) or too tightly (false positives)
Bulkheads:
- Use when multiple services share thread pools or connection limits
- Use for critical vs non-critical operation isolation
- Use for tenant isolation in SaaS applications
- Do not use when overhead outweighs benefit (single service, low traffic)
Circuit Breakers:
- Use for calls to external services that can become persistently unavailable
- Use to prevent resource exhaustion during outages
- Use when you need to fail fast rather than wait for timeouts
- Do not use as a substitute for timeouts; the two patterns complement each other
- Do not use when failing operations have no acceptable fallback
Fallbacks:
- Use when stale data is acceptable (cached results, default values)
- Use for non-critical functionality where errors create poor UX
- Use to provide graceful degradation instead of hard errors
- Do not use when fresh data is required (financial transactions, real-time decisions)
- Do not use when fallback data could cause incorrect business logic
Decision Flow
graph TD
A[Adding Resilience] --> B{Operation Type?}
B -->|External Service Call| C{Need to Handle Outage?}
C -->|Yes| D[Timeout + Circuit Breaker]
D --> E{Has Fallback?}
E -->|Yes| F[Add Fallback]
E -->|No| G[Error Handling]
C -->|No| H[Timeout Only]
B -->|Shared Resources| I[Bulkhead Isolation]
I --> C
B -->|Transient Failure| J[Retry with Backoff + Jitter]
J --> D
H --> K[Monitor & Alert]
G --> K
F --> K
Which Pattern Handles Which Failure
Use this table to determine which resilience pattern addresses which failure type:
| Failure Type | Primary Pattern | Supporting Patterns | Why |
|---|---|---|---|
| Transient network error | Retry | Timeout | Retries recover from brief network hiccups |
| Slow response | Timeout | Circuit Breaker | Timeout prevents waiting indefinitely |
| Service permanently down | Circuit Breaker | Fallback | Circuit breaker stops calling; fallback provides alternative |
| Resource exhaustion | Bulkhead | Circuit Breaker | Bulkhead isolates consumption; circuit breaker stops calling |
| Dependent service failing | Fallback | Circuit Breaker | Fallback provides alternative; circuit breaker prevents cascade |
| Thundering herd | Retry with Jitter | Bulkhead | Jitter spreads retries; bulkhead limits concurrent calls |
| Noisy neighbor | Bulkhead | Rate Limiter | Bulkhead isolates partitions; rate limiter controls volume |
| Cascading failure | Circuit Breaker | Bulkhead | Circuit breaker stops propagation; bulkhead contains blast radius |
Service Mesh Integration
When running in a service mesh like Istio or Linkerd, resilience patterns are handled partly by the mesh itself. Understanding what the mesh provides and what you still need to implement yourself matters.
Service meshes shift many resilience concerns from application code to infrastructure configuration, but this division of responsibility requires clarity. The mesh handles network-level retries, circuit breaking, and load balancing, while your application remains responsible for business-logic-level fallbacks and health reporting.
What the Mesh Handles
Service meshes provide retry budgets (max retries, per-retry timeouts), circuit breaking via traffic policies, and outlier detection that ejects unhealthy pods. They also handle load balancing with automatic health-aware routing.
# Istio VirtualService example with resilience config
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
    - product-service
  http:
    - route:
        - destination:
            host: product-service
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: gateway-error,connect-failure,reset
      timeout: 10s
What You Still Need to Handle
The mesh manages network-level retries and circuit breaking, but your application needs fallback logic when all endpoints fail. Cached data, default values, and graceful degradation remain your responsibility. Health endpoint implementation is critical too: on Kubernetes, readiness and liveness probes determine pod health, and the mesh only routes to pods that report ready.
Implement /health/readiness to check downstream dependencies (circuit breaker state, database connectivity). Implement /health/liveness to confirm your process is running. The mesh routes traffic away from pods that fail readiness checks, so this integration is essential.
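The two endpoints can be sketched as framework-agnostic handlers that return a status code and a body. The function names, the `checks` mapping, and the response shape are assumptions for illustration; wire them into whatever HTTP framework you use.

```python
def liveness():
    """Liveness: the process is up and able to answer at all."""
    return 200, {"alive": True}


def readiness(checks):
    """Readiness: every dependency check must pass before taking traffic.

    `checks` maps a dependency name to a zero-argument callable returning bool.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failed dependency
        if not ok:
            failed.append(name)
    status = 503 if failed else 200
    return status, {"ready": not failed, "failed": failed}


print(readiness({"database": lambda: True, "payments_circuit": lambda: False}))
# (503, {'ready': False, 'failed': ['payments_circuit']})
```

Keeping liveness free of dependency checks avoids restarting a healthy process just because a downstream service is unavailable.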
Bulkhead Isolation in the Mesh
A service mesh can enforce bulkhead isolation using destination rules that limit concurrent connections and requests per service. This works at the network level but does not replace application-level bulkheads that limit concurrent calls in your code.
Use both. The mesh-level limits protect against network-level overload. Application-level bulkheads protect against logical overload within your process.
Trade-off Analysis
When selecting resilience patterns, teams face inherent tensions. This section maps common trade-offs to help make informed decisions.
Resilience always involves trade-offs: more protection often means more latency, complexity, or resource cost. The key is understanding which tensions matter most for your system and configuring patterns accordingly. There’s no universal best choice—only the right choice for your specific constraints.
Latency vs. Reliability
| Approach | Latency Impact | Reliability Gain | When to Choose |
|---|---|---|---|
| No retry, no timeout | Lowest (fails fast) | None | Internal calls with aggressive SLAs |
| Retry without backoff | Moderate (immediate retry) | Low | Non-critical, idempotent operations |
| Retry with exponential backoff | Variable (waits before retry) | High | External services, transient failures |
| Retry + circuit breaker | Controlled (fails fast after threshold) | Highest | Services that can stay down |
Complexity vs. Protection
| Pattern Combination | Implementation Complexity | Failure Mode Coverage |
|---|---|---|
| Timeout only | Low | Slow response |
| Timeout + retry | Medium | Slow response + transient failure |
| Timeout + retry + fallback | Medium-High | Slow response + transient + permanent |
| Timeout + retry + fallback + circuit breaker | High | All common failure modes |
Resource Cost vs. Isolation
| Isolation Strategy | Resource Overhead | Blast Radius Containment | When to Use |
|---|---|---|---|
| Shared thread pool | Low | None | Single service, low traffic |
| Bulkhead per service | Medium | Per-service | Multiple downstream services |
| Bulkhead per tenant | High | Per-tenant | Multi-tenant SaaS |
| Bulkhead + circuit breaker | Highest | Maximum | Critical paths with external dependencies |
Consistency vs. Availability
Resilience patterns often involve the classic consistency/availability trade-off:
- Cached data fallbacks → improved availability, potentially stale data
- Default value fallbacks → improved availability, reduced functionality
- Circuit breakers → fail fast, degraded mode rather than waiting
Choose based on what your users can tolerate. A temporary inconsistency is often better than a timeout.
Operational Cost Trade-offs
| Pattern | Monitoring Complexity | Configuration Burden | Debugging Difficulty |
|---|---|---|---|
| Retry | Low (count retries) | Low (max_attempts, backoff) | Easy |
| Timeout | Medium (false positives vs. failures) | Medium (per-service tuning) | Medium |
| Bulkhead | High (per-partition metrics) | High (allocate resources) | Hard |
| Circuit Breaker | Medium (state transitions) | Medium (threshold tuning) | Medium |
| Fallback | High (freshness, correctness) | Low (define alternatives) | Hard |
Real-world Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Retry amplifies outage | Retries overwhelm recovering service | Use exponential backoff with jitter; set max retries low |
| Timeout too long | Resources held while waiting for dead service | Set timeouts based on P99 latency; add circuit breakers |
| Fallback returns stale data | Business decisions made on outdated information | Monitor fallback freshness; alert when fallback used |
| Bulkhead exhaustion | Partition isolated but related functionality fails | Monitor all partitions; implement fallback for rejected work |
| Combination failure | Multiple patterns interact unexpectedly | Test patterns in combination; monitor cross-pattern metrics |
Common Pitfalls / Anti-Patterns
Even experienced teams make predictable mistakes when implementing resilience patterns. These pitfalls are common enough that they’re worth discussing explicitly so you can avoid them in your own systems.
Many resilience failures stem not from missing patterns but from misapplying them—using retry logic where it shouldn’t be used, setting timeouts too aggressively, or neglecting the operational burden that additional complexity creates.
Overview of Common Pitfalls
Retrying Everything
Retrying non-idempotent operations can create duplicate work. Charging a credit card twice is bad. Design operations to be idempotent or do not retry them.
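One common way to make such an operation safe to retry is an idempotency key: the client sends the same key with every retry, and the server replays the stored result instead of charging again. The sketch below is an in-memory illustration with hypothetical names; a real service would persist the keys and handle concurrent requests.

```python
class PaymentProcessor:
    def __init__(self):
        self._results = {}      # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        # A repeated key means this is a retry: replay the earlier result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
first = processor.charge("order-1001", 2500)
retry = processor.charge("order-1001", 2500)  # e.g. after a client-side timeout
print(processor.charges_made)  # 1
```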
No Timeout
Without timeouts, retries can wait forever. If the service is truly down, you just repeat the wait. Always set timeouts.
Ignoring Fallbacks
When all else fails, your system should degrade gracefully. Cached data, default values, reduced functionality. Pick something over an error page.
Over-Engineering
You do not need every pattern for every call. Simple operations might only need timeouts. More critical operations might need bulkhead + timeout + retry + fallback. Match complexity to criticality.
Applying All Patterns Everywhere
Not every call needs bulkhead + timeout + retry + fallback. Start simple. Add complexity only where measurements show it is needed.
Ignoring Retry Costs
Each retry consumes client and server resources. Know the cost and set limits accordingly.
Setting Timeouts Too Generously
A 30-second timeout on a service that normally responds in 100ms defeats the purpose. Timeouts should be based on actual service capability plus headroom.
Not Testing Resilience Behaviors
Test what happens when retries fail, when timeouts trigger, when circuits open. Simulate failures in staging.
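A lightweight way to start is a fault-injecting wrapper in unit tests, before reaching for full chaos tooling. Everything below is a hypothetical sketch: `FaultInjector` makes the first few calls fail, and `get_price_with_fallback` stands in for the code under test.

```python
class FaultInjector:
    """Wrap a callable and make its first `failures` calls raise `error`."""

    def __init__(self, func, failures, error=ConnectionError):
        self.func = func
        self.failures = failures
        self.error = error
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls <= self.failures:
            raise self.error("injected failure")
        return self.func(*args, **kwargs)


def get_price_with_fallback(fetch, attempts=3, default=0):
    """Code under test: retry a few times, then fall back to a default."""
    for _ in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            continue
    return default


def test_recovers_from_transient_failures():
    flaky = FaultInjector(lambda: 42, failures=2)
    assert get_price_with_fallback(flaky) == 42
    assert flaky.calls == 3


def test_falls_back_when_outage_persists():
    down = FaultInjector(lambda: 42, failures=10)
    assert get_price_with_fallback(down, default=-1) == -1
    assert down.calls == 3
```

Asserting on the call count as well as the result catches retry loops that silently stop retrying or retry more than intended.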
Quick Recap Checklist
Key Bullets:
- Retries handle transient failures; always use exponential backoff with jitter
- Timeouts prevent waiting forever; set based on P99 latency plus headroom
- Bulkheads partition resources to contain failure; monitor each partition
- Circuit breakers detect persistent failures and stop calling; always implement fallbacks
- Combine patterns thoughtfully; match complexity to criticality
Copy/Paste Checklist:
Resilience Pattern Implementation:
[ ] Identify all external service calls
[ ] Classify calls by criticality (critical, important, best-effort)
[ ] Set appropriate timeout per call based on service capability
[ ] Implement retry with exponential backoff + jitter for transient failures
[ ] Add circuit breaker for calls to external services
[ ] Implement bulkheads for partitioned workloads
[ ] Define fallback for each critical call
[ ] Monitor retry rate, timeout rate, circuit state, fallback usage
[ ] Test resilience behaviors in staging
[ ] Document expected behavior for each failure mode
Observability Checklist
- Metrics:
  - Retry success rate (first attempt vs after retries)
  - Timeout rate per service
  - Circuit state per downstream service
  - Bulkhead pool utilization per partition
  - Fallback activation rate
- Logs:
  - Retry attempts with attempt number and delay
  - Timeout events with service and duration
  - Circuit state transitions
  - Bulkhead rejections per partition
  - Fallback invocations with context
- Alerts:
  - Retry success rate drops below threshold
  - Timeouts increasing for specific service
  - Circuit enters open state
  - Bulkhead pool utilization exceeds 80%
  - Fallback activation rate spikes
Security Checklist
- Retries do not expose sensitive data in logs
- Fallback data properly sanitized
- Timeouts enforced to prevent resource exhaustion attacks
- Circuit breaker state not exposed to clients
- Bulkhead configuration respects security boundaries
- Monitoring logs do not contain credentials or PII
Interview Questions
Exponential backoff increases the wait time between retries exponentially: 1 second, then 2, then 4, then 8. This gives the failing service time to recover without overwhelming it with immediate retries.
Jitter adds randomness to these intervals. Without jitter, all clients retry at the same times—synchronized waves of traffic hitting a recovering service. With jitter, each client's retry schedule is spread out, smoothing the load.
Full jitter randomizes the entire delay: wait = random(0, base * 2^n). Decorrelated jitter derives each wait from the previous one instead of the attempt number: wait = min(cap, random(base, previous_wait * 3)). Use jitter in all production retry implementations.
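As a sketch of decorrelated jitter in its commonly described form, each wait is drawn between the base delay and three times the previous wait, then capped. The function name and the `rng` parameter (included so the sequence can be seeded) are assumptions for this example.

```python
import random


def decorrelated_jitter(base, cap, attempts, rng=random):
    """Return a list of waits, each based on the previous wait, not the attempt."""
    wait = base
    delays = []
    for _ in range(attempts):
        # Draw between base and 3x the previous wait, never above the cap.
        wait = min(cap, rng.uniform(base, wait * 3))
        delays.append(wait)
    return delays


print(decorrelated_jitter(base=1.0, cap=30.0, attempts=5))
```

Because every client draws its own sequence, two clients that failed at the same instant drift apart after the first retry.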
Authentication and authorization errors (401, 403) should not be retried: the credentials are invalid or insufficient, so retrying will not help. Client errors (400) indicate a malformed request that needs fixing, not retrying. A 404 on a read might be worth retrying if the resource could be temporarily unavailable, but a 404 on a write should not be retried.
Any error that will certainly fail the same way on retry should not be retried. Malformed responses and timeouts caused by a deadline shorter than the service's normal response time are examples. Retrying these wastes resources.
Timeout errors and 500-level server errors are good candidates for retry—these are often transient.
Start from the source: what is the P99 latency of your backend service under normal load? Add headroom for variance: if P99 is 200ms, adding 100ms gives you 300ms. Set your timeout at P99 plus headroom—enough to cover normal variance without waiting for true failures.
Different operations might need different timeouts. A simple database lookup might have a 100ms timeout. A complex aggregation query might need 5 seconds. An async job submission might need only a 2-second timeout—you only care that the job was submitted, not that it completed.
Monitor timeouts in production. If you are frequently hitting timeouts, either the timeout is too short or the backend is overloaded.
Closed state: requests pass through normally. The breaker tracks failure rate. When failures exceed the threshold (say 50% over 10 seconds), the circuit opens.
Open state: requests immediately fail without hitting the backend. This protects the backend from being hammered by failing requests and protects your service from resource exhaustion. After a reset timeout (30 seconds is common), the circuit moves to half-open.
Half-open state: a limited number of requests pass through to test if the backend has recovered. Success closes the circuit. Failure reopens it. Half-open prevents thrashing and gives the backend time to fully recover before full traffic resumes.
Use a bulkhead when you need to contain resource consumption, not just stop calling a failing service. If one tenant's queries are consuming all database connections, you need bulkheads to partition connections per tenant—not circuit breakers.
Use a circuit breaker when a service is failing and you want to stop calling it. Use a bulkhead when you want to limit how much of a resource any single caller can consume.
The patterns are complementary. A bulkhead prevents one caller from exhausting your thread pool. A circuit breaker stops you from calling a service that is returning errors. Use both together.
A fallback is a predefined alternative response when a call fails: return cached data, return a default value, or call an alternative service. The fallback is specific to the failing operation.
Graceful degradation is a broader strategy: when full functionality is unavailable, the system continues operating in a reduced mode. This might involve multiple fallbacks working together. An e-commerce site might show cached product pages and disable reviews when the database is slow.
Fallbacks are implementation patterns; graceful degradation is a design philosophy.
Consider a service with multiple clients. Without bulkheads, one client's requests can exhaust the shared thread pool. Even with circuit breakers on each client, if the service is slow rather than failing, the breakers might not open while threads are consumed waiting.
Bulkheads partition the thread pool by client. Even if one client saturates their partition, other clients continue working. Circuit breakers detect when a backend is genuinely failing and stop calling it entirely. Together, they provide both resource isolation and failure detection.
Cache stampede happens when a popular cache entry expires. All concurrent requests miss the cache simultaneously and hit the backend. If the backend is slow, this can cause a thundering herd.
Prevention strategies: probabilistic early expiration—before the cache expires, some requests proactively refresh the cache in the background. This spreads the refresh load over time rather than having all requests wait for expiration.
Another strategy: lock the cache entry while refreshing. The first request acquires the lock and refreshes. Other requests wait or return stale data. Requires distributed locking (Redis) if across multiple servers.
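Probabilistic early expiration can be sketched in a few lines. This follows the commonly described approach in which the chance of an early refresh rises as expiry approaches and scales with how long recomputation takes; the function name and parameters are hypothetical, and `beta` greater than 1 makes refreshes more eager.

```python
import math
import random
import time


def should_refresh_early(expires_at, recompute_seconds, beta=1.0,
                         now=None, rng=random):
    """Decide whether this request should refresh the entry before it expires."""
    now = time.time() if now is None else now
    # 1 - random() lies in (0, 1], so log() never receives zero.
    # The result is a random, non-negative head start proportional to the
    # time it takes to recompute the value.
    head_start = -recompute_seconds * beta * math.log(1.0 - rng.random())
    return now + head_start >= expires_at


# One second before expiry, with a one-second recompute, only some requests refresh.
rng = random.Random(7)
hits = sum(should_refresh_early(101.0, 1.0, now=100.0, rng=rng) for _ in range(1000))
print(hits)
```

Only the requests that draw a large head start refresh early, so the backend sees a trickle of refreshes instead of a burst at expiry.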
Always use retry with circuit breaker, but configure them carefully. Retries handle transient failures—brief network hiccups that succeed on the second try. Circuit breakers handle persistent failures—services that are not recovering.
The danger is retry storms: many clients retry simultaneously after a transient failure, overwhelming the recovering service and causing failures again. Circuit breakers prevent this by stopping calls to persistently failing services. Keep retry limits low so that the retries for a single request do not by themselves push the breaker over its failure threshold.
Use retries with exponential backoff and jitter for transient failures. Use circuit breakers to stop calling services that are genuinely down. Use both: retries handle transient issues, circuit breakers handle persistent ones.
Test in staging first: inject failures using chaos engineering tools. Kill services, add network latency, return 500s. Verify your timeouts trigger, circuit breakers open, retries back off, and fallbacks activate.
Test failure modes you cannot easily inject: slow backend (add artificial latency), partial degradation (backend returns partial data). Test at realistic load—patterns that work at low traffic might fail under pressure.
Game days: simulate failure scenarios in production during low-traffic windows. Chaos Monkey and similar tools inject real failures in production. Start with non-critical services and work toward critical ones as your confidence grows.
A bulkhead isolates thread pools so that failures in one partition do not drain threads from other partitions. If you have three downstream services and share one thread pool across all calls, one slow service can exhaust threads and block calls to the other two.
Sizing depends on your traffic patterns and service SLAs. A common approach: allocate threads based on the importance of the operation. Critical operations get dedicated thread pools with enough threads to handle expected load plus headroom. Non-critical operations share a smaller pool.
Monitor partition utilization. If one partition regularly hits 80%+ thread usage while others are at 20%, either rebalance or add more threads to that partition.
Closed state tracks failures. When failures exceed a threshold within a time window, the circuit opens. Threshold configuration matters: too sensitive and you get false positives (circuit opens on normal variance), too lenient and the circuit stays closed during real outages.
Time window is key. A 10-second window with 5-failure threshold means: if you get 5 failures in any 10-second period, the circuit opens. After the reset timeout (say 30 seconds), the circuit enters half-open.
Half-open allows a limited number of probe requests. If they succeed, the circuit closes. If they fail, the circuit opens again for another reset period. This prevents rapid cycling between open and closed states.
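The closed, open, and half-open transitions described above might look like this in code. This is a simplified sketch under several assumptions: it is not thread-safe, it closes after a single successful probe rather than a configurable number, and the class and parameter names are invented for the example.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=10.0, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = None
        self.state = "closed"

    def call(self, operation):
        now = self._clock()
        if self.state == "open":
            if now - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"  # reset timeout elapsed: allow a probe
        try:
            result = operation()
        except Exception:
            self._on_failure(now)
            raise
        if self.state == "half_open":
            self.state = "closed"  # probe succeeded
            self._failures.clear()
        return result

    def _on_failure(self, now):
        if self.state == "half_open":
            self._open(now)  # probe failed: wait another reset period
            return
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```

Injecting the clock keeps the time-dependent transitions testable without real waiting.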
A hard timeout is absolute: the call fails after the specified duration regardless of what is happening. A soft timeout gives the operation a grace period to complete before declaring failure.
Soft timeouts are useful when you cannot interrupt a blocking call cleanly. You set the soft timeout slightly lower than the hard timeout, and when it triggers, you start a graceful shutdown of the operation rather than killing it immediately. The hard timeout remains the final cutoff.
In practice, most resilience implementations use hard timeouts because soft timeouts require the operation to support cancellation. If your downstream service supports cancellation tokens, soft timeouts can reduce resource waste during shutdown.
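To make the cancellation requirement concrete, here is a small sketch of a soft timeout built on cooperative cancellation. It assumes the operation accepts a `threading.Event` and checks it periodically. The function name and its parameters are hypothetical.

```python
import threading
import time


def run_with_soft_timeout(operation, soft_timeout, grace):
    """Run operation(cancel_event) in a worker thread.

    After soft_timeout seconds the cancel event is set, asking the operation
    to stop. If it has still not returned after a further `grace` seconds,
    the caller gives up and raises TimeoutError. The worker thread itself
    cannot be killed; it is simply abandoned.
    """
    cancel = threading.Event()
    outcome = {}

    def worker():
        try:
            outcome["value"] = operation(cancel)
        except Exception as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(soft_timeout)
    if thread.is_alive():
        cancel.set()        # soft timeout: request cooperative shutdown
        thread.join(grace)  # grace period before giving up
    if thread.is_alive():
        raise TimeoutError("operation ignored cancellation")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

An operation that never checks the event gains nothing from the grace period, which is the practical reason many implementations fall back to a plain hard timeout.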
Chain fallbacks. Primary fails, use cached data. Cache miss, use default value. Default unavailable, return a graceful error with partial information. Each fallback should be independent and not depend on the same infrastructure.
Return partial data when possible. A recommendation engine might return popular items when the ML service is down. A product service might return basic product info from a static file when the database is slow.
Monitor fallback chains. If you are frequently hitting your third-level fallback, either the primary service is more fragile than you thought, or your capacity planning needs adjustment.
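A fallback chain along these lines can be sketched in a few lines. The helper name `with_fallbacks` is an assumption for this example; it returns the level that answered so that the monitoring described above has something to count.

```python
def with_fallbacks(*sources):
    """Try each zero-argument callable in order.

    Returns (level, value) for the first source that succeeds, where level 0
    is the primary. Raises RuntimeError if every source fails.
    """
    errors = []
    for level, source in enumerate(sources):
        try:
            return level, source()
        except Exception as exc:
            errors.append(exc)  # remember why this level failed, then move on
    raise RuntimeError(f"all {len(sources)} sources failed: {errors!r}")
```

A caller might pass a live service call, a cache lookup, and a static default, then increment a per-level counter with the returned `level`. A rising count at level 2 is the warning sign the text mentions.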
The thundering herd happens when a large number of requests all fail simultaneously and then retry simultaneously. The recovery of the downstream service causes a traffic spike that can knock it offline again.
Jitter spreads out retry timing so clients do not retry at the same moment. Bulkheads limit how many retries can be in flight at once. Circuit breakers stop retries entirely when the downstream is genuinely failing.
Probabilistic early expiration for caches prevents all requests from missing the cache simultaneously when a popular entry expires.
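One common formulation of probabilistic early expiration (often called XFetch) decides on each read whether to refresh the entry before it actually expires. The sketch below assumes you track the entry's expiry time and how long it takes to recompute. The function and parameter names are illustrative.

```python
import math
import random
import time


def should_refresh_early(expiry, recompute_cost, beta=1.0, now=None,
                         rng=random.random):
    """Return True if this reader should recompute the entry now.

    expiry and now are timestamps in seconds; recompute_cost is how long a
    recompute takes. Larger beta values trigger refreshes earlier.
    """
    now = time.time() if now is None else now
    # 1 - rng() lies in (0, 1], so log() is defined and never positive.
    head_start = -recompute_cost * beta * math.log(1.0 - rng())
    return now + head_start >= expiry
```

Because each reader draws its own random head start, the chance of refreshing rises as expiry approaches, and in practice a few requests recompute the value early while the rest keep using the cached copy.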
The safest answer: make the operation idempotent. Add a unique idempotency key to the request, store the key with the result, and return the cached result if you see the same key again.
If you cannot make the operation idempotent, track the state of in-flight requests. If a retry is about to happen, check if the original request is still processing. If it completed, do not retry. If it failed, retry with a new attempt.
For payment-style operations, consider an outbox pattern: write the intent to a durable log first, then process. On retry, check the log to see if the operation already succeeded.
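The idempotency-key approach described above can be illustrated with a small sketch. It uses an in-memory dict as a stand-in for a durable store and does not guard against two concurrent requests with the same key, which a production version would need to handle (for example with an atomic insert). All names here are hypothetical.

```python
import uuid


def new_idempotency_key():
    """Generated once by the client and reused on every retry of a request."""
    return str(uuid.uuid4())


class IdempotentProcessor:
    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # in-memory stand-in for a durable key -> result store

    def process(self, idempotency_key, payload):
        # A repeated key means a retry: return the stored result, do not redo the work.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result
```

The important design point is that the client generates the key once per logical operation, so a retry after a timeout carries the same key as the original attempt.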
Request coalescing prevents multiple simultaneous requests from all hitting a failing backend. When the circuit is open, instead of failing immediately, requests wait for a short window. During that window, if multiple requests come in for the same operation, only one actually calls the backend. The others wait for and receive the same result.
This is particularly useful during startup: when the circuit first closes, the backend might still be warming up. Coalescing prevents dozens of requests from hitting it simultaneously before it is ready.
Implement request coalescing with a lock or semaphore per operation key. Be careful with the window size: too long and you add unnecessary latency, too short and coalescing does not help.
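A per-key coalescer might be sketched as below. This version has no explicit waiting window: callers that arrive while a call for the same key is in flight simply share its result. The class names are assumptions for the example.

```python
import threading


class _Call:
    """Shared state for one in-flight backend call."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class Coalescer:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def do(self, key, operation):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
        if leader:
            # Only the first caller for this key actually hits the backend.
            try:
                call.value = operation()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()
        else:
            call.done.wait()  # followers wait for the leader's result
        if call.error is not None:
            raise call.error
        return call.value
```

Followers also receive the leader's exception, so a single failure is reported to every waiting caller instead of triggering a retry storm.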
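A timeout is a relative duration: the maximum time you will wait for an operation, measured from when you start it. A deadline is an absolute point in time by which the operation must complete for the result to be useful. If you start a request at time T with a timeout of 5 seconds, you will wait until T+5, and every downstream call that sets its own 5-second timeout starts a fresh clock. If you start at T with a deadline of T+5, the whole operation, including downstream calls, must finish by T+5 or the result is worthless.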
Deadlines propagate. If your request has a deadline of T+5, you might set a timeout of 4 seconds to give yourself 1 second to process the response. When calling a downstream service, you pass the remaining deadline so it can make smart decisions about whether to continue.
Use deadlines when the request is part of a larger workflow where the overall completion time matters. Use timeouts for individual operations where you just need to know if they succeeded.
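Deadline propagation can be sketched as a small object that converts an absolute deadline into the timeout for each downstream call. The `Deadline` class and its `reserve` parameter are assumptions made for this illustration.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when there is no budget left to start another call."""


class Deadline:
    def __init__(self, seconds_from_now, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + seconds_from_now  # absolute point in time

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, reserve=0.0):
        """Timeout to use for a downstream call, keeping `reserve` seconds
        back for processing the response."""
        budget = self.remaining() - reserve
        if budget <= 0:
            raise DeadlineExceeded("no time left for downstream call")
        return budget
```

With a deadline 5 seconds away and a 1-second reserve, the first downstream call gets a 4-second timeout, matching the example in the text. Later calls get whatever is left, and a call with no budget is refused before it is even sent.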
Bulkheads and connection pools work together. A bulkhead limits how many threads can be waiting on a given downstream at once. A connection pool limits how many connections are available to that downstream.
If a bulkhead partition has 10 threads and the connection pool has 5 connections, at most 5 threads can have active connections while 5 are waiting. The bulkhead prevents thread exhaustion even when connections are exhausted.
Sizing both requires knowing your normal load and your failure behavior. During normal operation, connection pool utilization should be low. During failure, bulkheads prevent the failure from consuming all threads.
Per-pattern metrics: retry success rate (first attempt vs eventual success), timeout rate per downstream, circuit breaker state and transition frequency, bulkhead partition utilization, fallback activation rate and latency impact.
Cross-pattern interactions: how often do retries trigger circuit breakers? How often do timeouts lead to fallbacks? These interactions reveal whether your patterns are configured to work together or against each other.
Business impact: track error rates and latency percentiles for your APIs. If resilience patterns are working, error rates should be stable even when downstream services are failing.
Further Reading
- Rate Limiting - Controlling request volume
- Circuit Breaker Pattern - Preventing cascading failures
- Bulkhead Pattern - Resource isolation
Conclusion
- Retries handle transient failures; always use exponential backoff with jitter
- Timeouts prevent waiting forever; set based on P99 latency plus headroom
- Bulkheads partition resources to contain failure; monitor each partition
- Circuit breakers detect persistent failures and stop calling; always implement fallbacks
- Combine patterns thoughtfully; match complexity to criticality