Cache Patterns: Stampede, Thundering Herd, Tiered Caching
The basics of caching are straightforward. Read from cache, fall back to database, populate cache. But production systems reveal problems that basic tutorials don’t cover.
Cache stampede. Thundering herd. Cold starts. Tiered latency. These aren’t edge cases at scale. They hit you regularly and they’ll ruin your day if you haven’t thought about them.
This covers the patterns that matter once you move past “add Redis and hope for the best.”
Cache Stampede
When a popular cache entry expires, every request that hits it tries to rebuild from the database simultaneously. The database wasn’t designed for that kind of concurrent load. It slows down, requests pile up, and now you’ve turned a caching problem into a cascading failure.
graph TD
A[Cache Miss] --> B[Request DB]
B --> C[100 more requests hit expired key]
C --> D[All 100 try DB simultaneously]
D --> E[DB overwhelmed]
E --> F[Requests timeout]
F --> G[Cache never repopulated]
This is the cache stampede problem. There are three main fixes.
Distributed Locking
Only one process rebuilds the cache; others wait or serve stale data.
import json
import time

from redis import Redis

redis = Redis()  # Shared client instance, reused by the snippets below

def get_with_lock(key, fetch_func, lock_timeout=5):
    # Try to get from cache
    value = redis.get(key)
    if value:
        return json.loads(value)

    # Try to acquire the lock
    lock_key = f"lock:{key}"
    lock = redis.lock(lock_key, timeout=lock_timeout)
    if lock.acquire(blocking=True, blocking_timeout=1):
        try:
            # Double-check the cache (another process might have populated it)
            value = redis.get(key)
            if value:
                return json.loads(value)
            # Fetch from the source
            result = fetch_func(key)
            redis.setex(key, 3600, json.dumps(result))
            return result
        finally:
            lock.release()
    else:
        # Didn't get the lock - wait briefly and retry the cache
        time.sleep(0.1)
        value = redis.get(key)
        if value:
            return json.loads(value)
        # Still nothing - either serve stale or error
        return None
This prevents the stampede but adds latency for the first request. The others wait or get stale data.
Probabilistic Early Expiration
Refresh the cache before it expires, based on probability that decreases as TTL decreases.
import json
import random
import threading

def get_with_probabilistic_early_expiration(key, fetch_func, ttl=3600, beta=0.3):
    cached = redis.get(key)
    if cached is None:
        # Plain miss - fetch synchronously and populate
        result = fetch_func(key)
        redis.setex(key, ttl, json.dumps(result))
        return result

    # Hit - decide whether to refresh early, before the key expires
    current_ttl = redis.ttl(key)
    if current_ttl <= 0:
        current_ttl = ttl

    # Probability of an early refresh rises as the remaining TTL falls
    # Lower beta = less aggressive early refresh
    remaining_ratio = current_ttl / ttl
    probability = 1 - (remaining_ratio ** beta)

    if random.random() < probability:
        # Refresh in the background (don't block this request)
        def background_refresh():
            try:
                result = fetch_func(key)
                redis.setex(key, ttl, json.dumps(result))
            except Exception:
                pass  # Log but don't fail the request

        # In production, submit to a thread pool or queue instead
        threading.Thread(target=background_refresh, daemon=True).start()

    return json.loads(cached)
The math: as the remaining TTL shrinks, the probability of an early refresh grows, so popular items get refreshed before they ever expire. The beta parameter controls how aggressive the early refresh is.
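To get a feel for the curve, here is a small sketch (pure Python, no Redis required; the `refresh_probability` helper is illustrative) evaluating the formula at a few remaining-TTL ratios:

```python
# Evaluate probability = 1 - remaining_ratio ** beta at sample points
def refresh_probability(remaining_ratio, beta=0.3):
    return 1 - (remaining_ratio ** beta)

for ratio in (1.0, 0.5, 0.1, 0.01):
    p = refresh_probability(ratio)
    print(f"remaining={ratio:.0%}  refresh probability={p:.2f}")
```

At beta=0.3 a freshly set key is never refreshed early, a half-expired key refreshes on roughly 1 request in 5, and a nearly expired key on about 3 in 4.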
Hysteresis
Keep serving stale data while refreshing in background, even after expiration.
import json
import threading

def get_with_hysteresis(key, fetch_func, ttl=3600, stale_ttl=60):
    value = redis.get(key)
    if value:
        return json.loads(value)

    # Fresh copy expired - check for a stale version
    stale_value = redis.get(f"{key}:stale")
    if stale_value:
        # Return stale data and trigger a background refresh
        def background_refresh():
            try:
                payload = json.dumps(fetch_func(key))
                redis.setex(key, ttl, payload)
                # Refresh the stale copy too; it outlives the fresh one by stale_ttl
                redis.setex(f"{key}:stale", ttl + stale_ttl, payload)
            except Exception:
                pass  # Log but don't fail

        threading.Thread(target=background_refresh, daemon=True).start()
        return json.loads(stale_value)

    # No stale version - must refresh synchronously
    result = fetch_func(key)
    payload = json.dumps(result)
    redis.setex(key, ttl, payload)
    redis.setex(f"{key}:stale", ttl + stale_ttl, payload)  # Write the stale copy alongside
    return result
This ensures users never hit a cache miss but requires careful handling of staleness.
Thundering Herd
Thundering herd is cache stampede at a larger scale, usually when an entire cache tier goes down or is restarted. Every request hits the database at once because nothing is cached.
graph TD
A[Cache goes down] --> B[All requests hit DB]
B --> C[DB overwhelmed]
C --> D[System crashes]
D --> E[Cache never recovers]
Prevention: Graceful Degradation
Design your cache to degrade gracefully under failure.
import json
import time

class CircuitBreakerCache:
    def __init__(self, redis_client, db, failure_threshold=5, timeout=30):
        self.redis = redis_client
        self.db = db
        self.failure_count = 0
        self.last_failure_time = 0
        self.timeout = timeout
        self.failure_threshold = failure_threshold
        self.state = "closed"  # closed, open, half-open

    def get(self, key):
        # If the circuit is open, skip the cache and go to the DB directly
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # Allow one probe through
            else:
                return self.db.query(key)

        try:
            value = self.redis.get(key)
            if value:
                self._record_success()
                return json.loads(value)
            # Cache miss - get from DB and repopulate
            result = self.db.query(key)
            self.redis.setex(key, 3600, json.dumps(result))
            self._record_success()
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            # Fall back to the DB directly
            return self.db.query(key)

    def _record_success(self):
        # A successful cache call closes the circuit again
        self.failure_count = 0
        self.state = "closed"
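To see the state machine in action without a real Redis, here is a sketch with stubs (`AlwaysFailingCache`, `StubDB`, and the condensed `MiniBreaker` are all hypothetical names mirroring the class above):

```python
import time

class AlwaysFailingCache:
    """Stub standing in for an unreachable Redis."""
    def get(self, key):
        raise ConnectionError("cache down")

class StubDB:
    def query(self, key):
        return {"key": key}

class MiniBreaker:
    def __init__(self, cache, db, failure_threshold=3, timeout=30):
        self.cache, self.db = cache, db
        self.failure_threshold, self.timeout = failure_threshold, timeout
        self.failure_count, self.last_failure_time = 0, 0.0
        self.state = "closed"

    def get(self, key):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # Allow one probe through
            else:
                return self.db.query(key)
        try:
            return self.cache.get(key)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            return self.db.query(key)

breaker = MiniBreaker(AlwaysFailingCache(), StubDB(), failure_threshold=3)
for _ in range(3):
    breaker.get("user:1")   # Each call fails against the cache
print(breaker.state)        # "open" - further calls skip the cache entirely
```

Every call still returns data (from the DB fallback); the breaker only changes *where* it comes from.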
Prevention: Request Coalescing
Multiple requests for the same key get coalesced into a single database query.
import asyncio

class RequestCoalescer:
    def __init__(self):
        self.pending = {}  # key -> asyncio.Future

    async def get(self, key, fetch_func):
        # Check if there's already a pending request for this key
        if key in self.pending:
            # Wait for the existing request instead of issuing another
            return await self.pending[key]

        # Create a future that concurrent callers can await
        future = asyncio.get_running_loop().create_future()
        self.pending[key] = future

        async def fetch_and_store():
            try:
                result = await fetch_func(key)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)
            finally:
                del self.pending[key]

        asyncio.create_task(fetch_and_store())
        return await future
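A quick way to verify coalescing is to count upstream fetches under concurrency. This sketch re-includes a compact coalescer so it runs standalone; `slow_fetch` is a stand-in for a slow database query:

```python
import asyncio

class RequestCoalescer:
    def __init__(self):
        self.pending = {}  # key -> asyncio.Future

    async def get(self, key, fetch_func):
        if key in self.pending:
            return await self.pending[key]
        future = asyncio.get_running_loop().create_future()
        self.pending[key] = future
        try:
            future.set_result(await fetch_func(key))
        except Exception as e:
            future.set_exception(e)
        finally:
            del self.pending[key]
        return await future

calls = 0

async def slow_fetch(key):
    global calls
    calls += 1
    await asyncio.sleep(0.05)  # Simulate a slow DB query
    return f"value:{key}"

async def main():
    c = RequestCoalescer()
    # 100 concurrent requests for the same hot key
    return await asyncio.gather(*(c.get("hot", slow_fetch) for _ in range(100)))

results = asyncio.run(main())
print(calls)  # 1 - all 100 requests shared a single upstream fetch
```

The first caller starts the fetch and parks a future in `pending`; the other 99 find the future and await it, so the database sees one query instead of one hundred.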
Cache Warming
On startup or after cache invalidation, the cache is empty. Users experience slow requests until popular items are repopulated. Cache warming pre-loads the cache before users need it.
Proactive Warming on Startup
import json

def warm_cache_on_startup(redis_client, db, popular_keys):
    """
    Load popular items into the cache before users arrive.
    Run this after a cache restart or deployment.
    """
    print(f"Warming cache with {len(popular_keys)} keys...")
    for key in popular_keys:
        try:
            # Fetch from the database
            data = db.query(key)
            # Store in cache with a longer TTL
            redis_client.setex(key, 86400, json.dumps(data))  # 24hr TTL
        except Exception as e:
            print(f"Failed to warm key {key}: {e}")
    print("Cache warming complete")

# Common patterns for popular keys:
# - Top 1000 products by sales
# - Top 100 users by activity
# - Category listing pages
# - Feature flags and config
Background Warming
Warm cache entries in the background as they expire or as traffic patterns emerge.
import json
import threading
import time
from collections import Counter

class BackgroundWarmingCache:
    def __init__(self, redis_client, db, warm_threshold=100):
        self.redis = redis_client
        self.db = db
        self.access_count = Counter()
        self.warm_threshold = warm_threshold
        self.running = True
        # Start the background warmer thread
        self.warmer_thread = threading.Thread(target=self._warm_loop)
        self.warmer_thread.daemon = True
        self.warmer_thread.start()

    def track_access(self, key):
        """Track cache accesses for warming decisions"""
        self.access_count[key] += 1

    def get(self, key):
        value = self.redis.get(key)
        if value:
            self.track_access(key)
            return json.loads(value)
        return None

    def _warm_loop(self):
        while self.running:
            # Find hot keys that aren't in the cache
            for key, count in self.access_count.most_common(100):
                if count >= self.warm_threshold and not self.redis.exists(key):
                    try:
                        data = self.db.query(key)
                        self.redis.setex(key, 86400, json.dumps(data))
                        self.access_count[key] = 0  # Reset after warming
                    except Exception:
                        pass
            time.sleep(60)  # Check every minute
Tiered Caching
Not all caches are equal. Using multiple cache tiers with different characteristics gives you the best of both worlds: fast access for hot data, larger capacity for warm data.
graph TD
A[Request] --> B[L1 Cache<br/>CPU Cache / In-Memory<br/>~MB, <1ms]
B -->|miss| C[L2 Cache<br/>Redis / Memcached<br/>~GB, <10ms]
C -->|miss| D[L3 Cache<br/>CDN / Edge<br/>~TB, <100ms]
D -->|miss| E[Database<br/>~TB, 10-100ms]
L1 + L2 Pattern
from collections import OrderedDict

class LRUCache:
    """Simple LRU used as the L1 (in-process) tier."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        return None

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

class TieredCache:
    def __init__(self, l1_size=1000, l2_client=None, ttl=3600):
        self.l1 = LRUCache(l1_size)  # Local in-memory tier
        self.l2 = l2_client          # Redis
        self.ttl = ttl

    def get(self, key):
        # Check L1 first
        value = self.l1.get(key)
        if value is not None:
            return value
        # Check L2
        if self.l2:
            value = self.l2.get(key)
            if value is not None:
                # Promote to L1
                self.l1.put(key, value)
                return value
        return None

    def put(self, key, value):
        # Write to both tiers
        self.l1.put(key, value)
        if self.l2:
            self.l2.setex(key, self.ttl, value)
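The promotion behavior is easiest to see with an in-memory stand-in for Redis. This self-contained sketch (the `tiered_get`/`tiered_put` helpers and dict-backed L2 are illustrative) shows a key evicted from L1 surviving in L2 and being promoted back on the next read:

```python
from collections import OrderedDict

class LRU:
    def __init__(self, capacity):
        self.capacity, self.cache = capacity, OrderedDict()
    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        return None
    def put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

l1 = LRU(capacity=2)
l2 = {}  # Stand-in for Redis; TTLs omitted for brevity

def tiered_get(key):
    value = l1.get(key)
    if value is not None:
        return value
    value = l2.get(key)
    if value is not None:
        l1.put(key, value)  # Promote to L1
    return value

def tiered_put(key, value):
    l1.put(key, value)
    l2[key] = value

tiered_put("a", "1")
tiered_put("b", "2")
tiered_put("c", "3")   # "a" is evicted from L1 but survives in L2
print(l1.get("a"))     # None - gone from the L1 tier
print(tiered_get("a")) # "1" - served from L2 and promoted back into L1
```

The tiny L1 keeps only the hottest keys; anything it evicts can still be served from the larger L2 without touching the database.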
CDN + Origin Pattern
graph TD
A[User Request] --> B[CDN Edge<br/>~100s TTL]
B -->|miss| C[CDN Origin<br/>~1hr TTL]
C -->|miss| D[Application<br/>~24hr TTL]
D -->|miss| E[Database]
For web content:
# CDN cache control headers
Cache-Control: public, max-age=100, s-maxage=100 # Edge cache 100s
Cache-Control: public, max-age=3600, s-maxage=3600 # Origin cache 1hr
# Private content (user-specific)
Cache-Control: private, max-age=3600 # Only browser caches this
Sharding Strategies
When a single cache instance isn’t enough, you need to distribute across multiple nodes.
Consistent Hashing
import hashlib
from bisect import bisect

class ConsistentHashRing:
    def __init__(self, nodes, replicas=150):
        self.replicas = replicas
        self.ring = {}
        self.sorted_keys = []
        for node in nodes:
            for i in range(replicas):
                key = f"{node}:{i}"
                hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
                self.ring[hash_key] = node
                self.sorted_keys.append(hash_key)
        self.sorted_keys.sort()

    def get_node(self, key):
        if not self.ring:
            return None
        hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect(self.sorted_keys, hash_key)
        if idx >= len(self.sorted_keys):
            idx = 0  # Wrap around the ring
        return self.ring[self.sorted_keys[idx]]

    def add_node(self, node):
        for i in range(self.replicas):
            key = f"{node}:{i}"
            hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
            self.ring[hash_key] = node
            self.sorted_keys.append(hash_key)
        self.sorted_keys.sort()

    def remove_node(self, node):
        for i in range(self.replicas):
            key = f"{node}:{i}"
            hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
            del self.ring[hash_key]
            self.sorted_keys.remove(hash_key)
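The payoff of consistent hashing is that adding a node remaps only about 1/N of the keys, where a naive `hash(key) % N` scheme would remap most of them. A self-contained sketch (re-implementing the ring as plain functions) measures this directly:

```python
import hashlib
from bisect import bisect

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=150):
    # Sorted (hash, node) pairs: each node appears `replicas` times on the ring
    return sorted((_hash(f"{n}:{i}"), n) for n in nodes for i in range(replicas))

def get_node(ring, key):
    hashes = [h for h, _ in ring]
    idx = bisect(hashes, _hash(key))
    if idx >= len(ring):
        idx = 0  # Wrap around the ring
    return ring[idx][1]

keys = [f"user:{i}" for i in range(10000)]
before = build_ring(["cache-1", "cache-2", "cache-3"])
after = build_ring(["cache-1", "cache-2", "cache-3", "cache-4"])

moved = sum(get_node(before, k) != get_node(after, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # Roughly 1/4, not ~3/4
```

With the fourth node added, only the keys now owned by `cache-4` move; everything else stays put, so the surviving nodes keep their warm caches.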
When to Use / When Not to Use
| Pattern | When to Use | When Not to Use |
|---|---|---|
| Distributed Locking | Preventing cache stampede on popular keys; ensuring single-flight rebuilds | High-frequency writes; when lock contention is a problem |
| Probabilistic Early Expiration | Popular items with predictable access; reducing miss latency | Unpredictable access patterns; items with short TTLs |
| Hysteresis (Stale-While-Refresh) | Latency-sensitive applications; users must never wait | When stale data is unacceptable; very short TTLs |
| Circuit Breaker | External cache service; preventing cascade failures | Local/in-process caches; when cache is always available |
| Request Coalescing | High concurrency on same keys; thundering herd prevention | Low traffic; unique keys per request |
| Cache Warming | After deployments; known hot dataset; reducing cold-start latency | Unknown access patterns; dynamic content |
| Tiered Caching (L1+L2) | Multiple application instances; sub-millisecond access needed for hot data | Single instance; uniform access patterns |
| CDN + Origin | Static content; global user base; reducing origin load | Dynamic, personalized content; real-time data |
Decision Guide
graph TD
A[Problem] --> B{Cache unavailable?}
B -->|Yes| C[Use Circuit Breaker]
B -->|No| D{Stampede on miss?}
D -->|Yes| E{Hot key access pattern?}
D -->|No| F{Tiered Latency Needed?}
E -->|Predictable| G[Probabilistic Early Expiration]
E -->|High Concurrency| H[Request Coalescing]
E -->|Low Latency Critical| I[Locking + Hysteresis]
F -->|Yes| J[L1 + L2 or CDN + Origin]
F -->|No| K[Standard Cache-Aside]
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Lock holder crashes | Lock never released; key permanently blocked | Use TTL on locks; implement lock timeout/auto-release |
| Hysteresis serves too-stale data | Users see outdated content beyond acceptable window | Set reasonable stale_ttl limit; monitor staleness metrics |
| Circuit breaker stuck open | Cache permanently bypassed; database always hit | Implement half-open state for testing; set maximum open time |
| Tiered cache inconsistency | L1 has stale data while L2 has updated | Use invalidation broadcast; prefer L2 as source of truth |
| Request coalescing memory leak | Pending futures accumulate if requests fail | Implement timeout on futures; cleanup on error |
| CDN serves stale after purge | Users see old content | Use versioned URLs; implement proper cache invalidation |
| Probabilistic early expiration overload | Too many background refreshes | Tune beta parameter; limit concurrent refreshes |
Observability Checklist
Metrics to Track
- Stampede Rate: Number of simultaneous cache misses for same key
- Lock Wait Time: How long requests wait for cache lock acquisition
- Circuit Breaker State: Time spent in open/closed/half-open states
- Stale Hit Ratio: Percentage of cache hits served while refreshing
- L1 vs L2 Hit Distribution: Split between local and distributed cache tiers
- Background Refresh Rate: How often early expiration triggers
import structlog

logger = structlog.get_logger()

# Stampede detection
def detect_stampede(redis_client, key, threshold=10):
    """
    Detect multiple requests trying to rebuild the same key.
    Uses INCR with a short TTL as a lightweight counter.
    """
    stampede_key = f"stampede:{key}"
    count = redis_client.incr(stampede_key)
    if count == 1:
        redis_client.expire(stampede_key, 5)  # 5 second window
    if count > threshold:
        logger.warning("cache_stampede_detected", key=key, concurrent_requests=count)
        return True
    return False

# Hysteresis monitoring
def monitor_staleness(redis_client, key):
    stale_key = f"{key}:stale"
    if redis_client.exists(stale_key):
        ttl = redis_client.ttl(key)
        logger.info("serving_stale_data", key=key, fresh_ttl=ttl)
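The counter logic can be exercised without Redis by stubbing `incr`/`expire` (the `StubRedis` class here is purely illustrative, and stdlib logging stands in for structlog):

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("cache")

class StubRedis:
    """Minimal in-memory stand-in for INCR/EXPIRE."""
    def __init__(self):
        self.counters = {}
    def incr(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]
    def expire(self, key, seconds):
        pass  # TTL ignored in this stub

def detect_stampede(client, key, threshold=10):
    stampede_key = f"stampede:{key}"
    count = client.incr(stampede_key)
    if count == 1:
        client.expire(stampede_key, 5)
    if count > threshold:
        logger.warning("cache_stampede_detected key=%s concurrent=%d", key, count)
        return True
    return False

r = StubRedis()
hits = [detect_stampede(r, "product:99", threshold=10) for _ in range(15)]
print(hits.count(True))  # 5 - requests 11 through 15 crossed the threshold
```

The first ten rebuild attempts inside the window pass silently; everything past the threshold fires the alert.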
Logs to Capture
import time

import structlog

logger = structlog.get_logger()

# Circuit breaker state changes
def log_circuit_breaker(previous_state, new_state, key=None):
    logger.warning("circuit_breaker_state_change",
                   previous_state=previous_state,
                   new_state=new_state,
                   cache_key=key,
                   timestamp=time.time())

# Cache tier miss tracking
def log_tiered_miss(l1_hit=False, l2_hit=False, origin_required=True):
    logger.info("cache_tier_miss",
                l1_hit=l1_hit,
                l2_hit=l2_hit,
                origin_required=origin_required,
                latency_tier="l1" if l1_hit else "l2" if l2_hit else "origin")
Alert Rules
- alert: HighLockContention
  expr: rate(cache_lock_wait_seconds_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High lock contention on cache keys"

- alert: CircuitBreakerOpen
  expr: cache_circuit_breaker_open_duration_seconds > 60
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker open for more than 60 seconds"

- alert: HighStaleHitRatio
  expr: cache_stale_hits_total / cache_total_hits > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 10% of cache hits are serving stale data"
Security Checklist
- Lock key injection - Validate key names in lock acquisition; prevent large/special key injection
- Rate limit lock attempts - Prevent denial of service via lock exhaustion
- Stale data exposure - Ensure stale cache doesn’t expose sensitive data between users
- Monitor circuit breaker state - An open circuit breaker may cause database overload
- Tiered cache data isolation - Ensure L1 (local) cache doesn’t leak data between users/tenants
- Background refresh access control - Background refresh threads should have same permission checks
- CDN edge security - Don’t cache sensitive content at CDN edge
import re

# Secure lock implementation
class SecureLockingCache:
    def __init__(self, redis_client, max_lock_timeout=5):
        self.redis = redis_client
        self.max_lock_timeout = max_lock_timeout

    def acquire_lock(self, key):
        # Validate key format - no special characters that could cause issues
        if not self._is_valid_key(key):
            raise ValueError(f"Invalid cache key: {key}")
        lock_key = f"lock:{key}"
        lock_timeout = min(self.max_lock_timeout, 5)  # Cap at 5 seconds
        # Use SET NX EX for atomic lock acquisition
        acquired = self.redis.set(lock_key, "1", nx=True, ex=lock_timeout)
        return acquired

    def _is_valid_key(self, key):
        # Basic validation - alphanumeric, colons, dots, dashes, underscores
        return bool(re.match(r'^[\w:.-]+$', key))
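One subtlety with the validation regex: Python's `$` also matches just before a trailing newline, so `\Z` is the stricter anchor if newline injection is a concern. A quick check:

```python
import re

# Note: \Z (not $) so a trailing newline can't sneak past the check
KEY_RE = re.compile(r'^[\w:.-]+\Z')

def is_valid_key(key):
    return bool(KEY_RE.match(key))

print(is_valid_key("user:42:profile"))  # True
print(is_valid_key("key with spaces"))  # False
print(is_valid_key("abc\n"))            # False - '$' would have allowed this
```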
Common Pitfalls / Anti-Patterns
1. Lock Without Timeout
Locks without TTL become permanent if the holder crashes.
# BAD: Lock without timeout
redis.set(f"lock:{key}", "1") # No TTL - forever if process dies
# GOOD: Lock with TTL
redis.set(f"lock:{key}", "1", nx=True, ex=5) # Auto-expires in 5 seconds
2. Not Handling Lock Acquisition Failure
Assuming you always get the lock leads to deadlocks.
# BAD: Assume the lock is always acquired
lock.acquire()
try:
    result = fetch_and_cache()
finally:
    lock.release()

# GOOD: Handle failure gracefully
if not lock.acquire(blocking=True, blocking_timeout=1):
    # Wait and retry from cache, or serve stale
    time.sleep(0.1)
    return redis.get(key)
try:
    result = fetch_and_cache()
finally:
    lock.release()
3. Probabilistic Early Expiration Beta Too Aggressive
Beta controls refresh aggressiveness. Too high causes unnecessary refreshes.
# BAD: Beta too high (aggressive) - refreshes constantly
get_with_probabilistic_early_expiration(key, fetch_func, beta=1.0)
# Almost every request triggers refresh for low TTL items
# GOOD: Beta around 0.3 - measured refresh
get_with_probabilistic_early_expiration(key, fetch_func, beta=0.3)
# Only refreshes when TTL gets low enough
4. Hysteresis Without Staleness Limit
Serving stale forever defeats the purpose of cache invalidation.
# BAD: No limit on how long to serve stale
if redis.get(key):
    return json.loads(redis.get(key))  # Always return whatever is there

# GOOD: Bound the staleness window via the stale copy's own TTL
def get_with_hysteresis(key, fetch_func, ttl=3600, max_stale_ttl=60):
    value = redis.get(key)
    if value:
        return json.loads(value)
    # The stale copy expires max_stale_ttl after the fresh one does,
    # so its existence means we're within the acceptable window
    stale_value = redis.get(f"{key}:stale")
    if stale_value:
        trigger_background_refresh(key, fetch_func)
        return json.loads(stale_value)
    # No stale copy available - must refresh synchronously
    result = fetch_func(key)
    payload = json.dumps(result)
    redis.setex(key, ttl, payload)
    redis.setex(f"{key}:stale", ttl + max_stale_ttl, payload)
    return result
5. Tiered Cache Without Invalidation Strategy
Updates to L2 don’t propagate to L1, causing stale reads.
# BAD: Write to L2 but L1 still has the old data
def update(key, value):
    l2.setex(key, 3600, value)
    # L1 still has the old value - it will be served until evicted

# GOOD: Write to L2 and invalidate L1
def update(key, value):
    l2.setex(key, 3600, value)
    l1.delete(key)  # Invalidate L1
Quick Recap
Key Bullets
- Stampede protection is essential for production caches with popular expiring keys
- Locking, hysteresis, and probabilistic early expiration all solve stampede differently
- Circuit breakers prevent cascade failures when cache is unavailable
- Tiered caching (L1+L2) gives sub-millisecond access for hottest data
- Request coalescing reduces thundering herd by collapsing concurrent requests
- Cache warming prevents cold-start latency after deployments or restarts
Copy/Paste Checklist
# Stampede protection - locking pattern
lock = redis.lock(f"lock:{key}", timeout=5)
if lock.acquire(blocking=True, blocking_timeout=1):
    try:
        value = redis.get(key)
        if not value:
            value = fetch_from_db()
            redis.setex(key, 3600, value)
    finally:
        lock.release()

# Hysteresis pattern
value = redis.get(key)
if not value:
    stale = redis.get(f"{key}:stale")
    if stale:
        trigger_background_refresh(key)
        return json.loads(stale)
    value = fetch_from_db()
    redis.setex(key, 3600, value)

# Circuit breaker
if circuit_breaker.state == "open":
    return db.query(key)  # Skip cache
try:
    value = redis.get(key)
    if value:
        return json.loads(value)
except Exception:
    circuit_breaker.record_failure()
    return db.query(key)

# Tiered cache get
value = l1.get(key)
if value:
    return value
value = l2.get(key)
if value:
    l1.put(key, value)  # Promote to L1
    return value
value = db.query(key)
l2.setex(key, 3600, value)
l1.put(key, value)
return value
# Deployment checklist:
# [ ] Implement stampede protection before going to production
# [ ] Monitor lock contention and circuit breaker state
# [ ] Plan cache warming strategy for deployments
# [ ] Set appropriate TTLs for each cache tier
# [ ] Log tiered cache hit distribution (L1 vs L2)
# [ ] Have circuit breaker fallback to direct DB access
See Also
- Caching Strategies — The five strategies these patterns work with
- Cache Eviction Policies — How to decide what to evict
- CDN Deep Dive — Tiered caching at the edge
- Distributed Caching — Multi-node caching patterns
Conclusion
Cache stampede protection matters when you have popular expiring entries. Cache warming matters after deployments. Tiered caching matters when you’re big enough that one cache tier isn’t enough.
Start with the basics. Add stampede protection early because it bites you unexpectedly. The other patterns you add when you have the operational need.
Don’t over-engineer before you have the problem.