Cache Patterns: Stampede, Thundering Herd, Tiered Caching

Learn advanced cache patterns for production systems. Solve cache stampede, implement cache warming, and design tiered caching architectures.


Cache Patterns

The basics of caching are straightforward. Read from cache, fall back to database, populate cache. But production systems reveal problems that basic tutorials don’t cover.

Cache stampede. Thundering herd. Cold starts. Tiered latency. These aren’t edge cases at scale. They hit you regularly and they’ll ruin your day if you haven’t thought about them.

This covers the patterns that matter once you move past “add Redis and hope for the best.”


Cache Stampede

When a popular cache entry expires, every request that hits it tries to rebuild from the database simultaneously. The database wasn’t designed for that kind of concurrent load. It slows down, requests pile up, and now you’ve turned a caching problem into a cascading failure.

graph TD
    A[Cache Miss] --> B[Request DB]
    B --> C[100 more requests hit expired key]
    C --> D[All 100 try DB simultaneously]
    D --> E[DB overwhelmed]
    E --> F[Requests timeout]
    F --> G[Cache never repopulated]

This is the cache stampede problem. There are three common fixes.

Distributed Locking

Only one process rebuilds the cache; others wait or serve stale data.

import json
import time

import redis

r = redis.Redis()

def get_with_lock(key, fetch_func, lock_timeout=5, ttl=3600):
    # Try to get from cache
    value = r.get(key)
    if value:
        return json.loads(value)

    # Try to acquire the rebuild lock
    lock = r.lock(f"lock:{key}", timeout=lock_timeout)

    if lock.acquire(blocking=True, blocking_timeout=1):
        try:
            # Double-check cache (another process may have populated it)
            value = r.get(key)
            if value:
                return json.loads(value)

            # Fetch from source and repopulate
            result = fetch_func(key)
            r.setex(key, ttl, json.dumps(result))
            return result
        finally:
            lock.release()
    else:
        # Didn't get the lock - wait briefly and retry the cache
        time.sleep(0.1)
        value = r.get(key)
        if value:
            return json.loads(value)

        # Still nothing - serve stale if you keep a stale copy, or fail fast
        return None

This prevents the stampede but adds latency for the first request. The others wait or get stale data.

Probabilistic Early Expiration

Refresh the cache before it expires, based on probability that decreases as TTL decreases.

import time
import random
import hashlib

def get_with_probabilistic_early_expiration(key, fetch_func, ttl=3600, beta=0.3):
    cached = redis.get(key)
    if cached:
        return json.loads(cached)

    # Get current TTL
    current_ttl = redis.ttl(key)
    if current_ttl <= 0:
        current_ttl = ttl

    # Calculate probability of early expiration
    # Lower beta = less aggressive early refresh
    remaining_ratio = current_ttl / ttl
    probability = 1 - (remaining_ratio ** beta)

    if random.random() < probability:
        # Refresh in background (don't block)
        def background_refresh():
            try:
                result = fetch_func(key)
                redis.setex(key, ttl, json.dumps(result))
            except Exception as e:
                pass  # Log but don't fail

        # In production, submit to thread pool or queue
        import threading
        threading.Thread(target=background_refresh).start()
    else:
        # Normal cache miss handling
        result = fetch_func(key)
        redis.setex(key, ttl, json.dumps(result))
        return result

    return fetch_func(key)  # Fallback

The intuition: as the remaining TTL shrinks, the probability of a refresh rises, so popular items get refreshed before they expire. The beta parameter controls how aggressive this is.
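To make the curve concrete, here is a standalone calculation of the refresh probability at several remaining TTLs (same formula as above; the sample values are illustrative):

```python
def refresh_probability(remaining_ttl, ttl, beta=0.3):
    """Probability of triggering an early refresh on a cache hit."""
    remaining_ratio = remaining_ttl / ttl
    return 1 - remaining_ratio ** beta

# As the key nears expiry, an early refresh becomes increasingly likely
for remaining in (3600, 1800, 360, 36):
    p = refresh_probability(remaining, ttl=3600)
    print(f"{remaining:>5}s left -> refresh probability {p:.2f}")
```

With a full hour remaining the probability is 0; by the time only 36 seconds remain it is around 0.75, so hot keys are almost always rebuilt before they vanish.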

Hysteresis

Keep serving stale data while refreshing in background, even after expiration.

import json

import redis

r = redis.Redis()

def get_with_hysteresis(key, fetch_func, ttl=3600, stale_ttl=60):
    value = r.get(key)
    if value:
        return json.loads(value)

    # Fresh copy expired - check for the stale copy
    stale_value = r.get(f"{key}:stale")
    if stale_value:
        # Return stale, trigger background refresh
        def background_refresh():
            try:
                result = fetch_func(key)
                payload = json.dumps(result)
                r.setex(key, ttl, payload)
                # Stale copy outlives the fresh one by stale_ttl seconds
                r.setex(f"{key}:stale", ttl + stale_ttl, payload)
            except Exception:
                pass

        import threading
        threading.Thread(target=background_refresh).start()
        return json.loads(stale_value)

    # No stale version - must refresh synchronously
    result = fetch_func(key)
    payload = json.dumps(result)
    r.setex(key, ttl, payload)
    r.setex(f"{key}:stale", ttl + stale_ttl, payload)
    return result

This ensures users never hit a cache miss but requires careful handling of staleness.


Thundering Herd

Thundering herd is cache stampede at a larger scale, usually when an entire cache tier goes down or is restarted. Every request hits the database at once because nothing is cached.

graph TD
    A[Cache goes down] --> B[All requests hit DB]
    B --> C[DB overwhelmed]
    C --> D[System crashes]
    D --> E[Cache never recovers]

Prevention: Graceful Degradation

Design your cache to degrade gracefully under failure.

import json
import time

class CircuitBreakerCache:
    def __init__(self, redis_client, db, failure_threshold=5, timeout=30):
        self.redis = redis_client
        self.db = db
        self.failure_count = 0
        self.last_failure_time = 0
        self.timeout = timeout
        self.failure_threshold = failure_threshold
        self.state = "closed"  # closed, open, half-open

    def get(self, key):
        # If circuit open, skip cache and go to DB directly
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # let one attempt probe the cache
            else:
                return self.db.query(key)

        try:
            value = self.redis.get(key)
            if value:
                self._record_success()
                return json.loads(value)

            # Cache miss - get from DB
            result = self.db.query(key)
            self.redis.setex(key, 3600, json.dumps(result))
            self._record_success()
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = "open"

            # Fallback to DB directly
            return self.db.query(key)

    def _record_success(self):
        # A successful cache operation closes the circuit again
        self.failure_count = 0
        self.state = "closed"

Prevention: Request Coalescing

Multiple requests for the same key get coalesced into a single database query.

import asyncio

class RequestCoalescer:
    def __init__(self):
        self.pending = {}  # key -> asyncio.Future

    async def get(self, key, fetch_func):
        # If there's already an in-flight request for this key, await it
        if key in self.pending:
            return await self.pending[key]

        # First caller creates the future and performs the fetch
        future = asyncio.get_running_loop().create_future()
        self.pending[key] = future

        try:
            result = await fetch_func(key)
            future.set_result(result)
            return result
        except Exception as e:
            future.set_exception(e)
            raise
        finally:
            del self.pending[key]
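The coalescing behavior is easy to verify with a compact, self-contained sketch; `fetch_slow`, `SingleFlight`, and the key name below are illustrative stand-ins, not part of any library:

```python
import asyncio

fetch_calls = 0

async def fetch_slow(key):
    # Simulated expensive backend query (illustrative)
    global fetch_calls
    fetch_calls += 1
    await asyncio.sleep(0.05)
    return f"value-for-{key}"

class SingleFlight:
    """Minimal coalescer: first caller starts the fetch, the rest share it."""
    def __init__(self):
        self.pending = {}  # key -> asyncio.Task

    async def get(self, key, fetch_func):
        if key not in self.pending:
            self.pending[key] = asyncio.create_task(self._run(key, fetch_func))
        return await self.pending[key]

    async def _run(self, key, fetch_func):
        try:
            return await fetch_func(key)
        finally:
            del self.pending[key]

async def main():
    sf = SingleFlight()
    # 100 concurrent requests for the same key collapse into one fetch
    return await asyncio.gather(*(sf.get("user:42", fetch_slow) for _ in range(100)))

results = asyncio.run(main())
print(f"{len(results)} callers, {fetch_calls} backend call(s)")
```

All 100 callers receive the same value while the backend is queried only once.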

Cache Warming

On startup or after cache invalidation, the cache is empty. Users experience slow requests until popular items are repopulated. Cache warming pre-loads the cache before users need it.

Proactive Warming on Startup

def warm_cache_on_startup(redis_client, db, popular_keys):
    """
    Load popular items into cache before users arrive.
    Run this after cache restart or deployment.
    """
    print(f"Warming cache with {len(popular_keys)} keys...")

    for key in popular_keys:
        try:
            # Fetch from database
            data = db.query(key)

            # Store in cache with longer TTL
            redis_client.setex(key, 86400, json.dumps(data))  # 24hr TTL
        except Exception as e:
            print(f"Failed to warm key {key}: {e}")

    print("Cache warming complete")

# Common patterns for popular keys:
# - Top 1000 products by sales
# - Top 100 users by activity
# - Category listing pages
# - Feature flags and config

Background Warming

Warm cache entries in the background as they expire or as traffic patterns emerge.

import threading
import time
from collections import Counter

class BackgroundWarmingCache:
    def __init__(self, redis_client, db, warm_threshold=100):
        self.redis = redis_client
        self.db = db
        self.access_count = Counter()
        self.warm_threshold = warm_threshold
        self.running = True

        # Start background thread
        self.warmer_thread = threading.Thread(target=self._warm_loop)
        self.warmer_thread.daemon = True
        self.warmer_thread.start()

    def track_access(self, key):
        """Track cache access for warming decisions"""
        self.access_count[key] += 1

    def get(self, key):
        value = self.redis.get(key)
        if value:
            self.track_access(key)
            return json.loads(value)
        return None

    def _warm_loop(self):
        while self.running:
            # Find hot keys that aren't in cache
            for key, count in self.access_count.most_common(100):
                if count >= self.warm_threshold and not self.redis.exists(key):
                    try:
                        data = self.db.query(key)
                        self.redis.setex(key, 86400, json.dumps(data))
                        self.access_count[key] = 0  # Reset after warming
                    except Exception:
                        pass

            time.sleep(60)  # Check every minute
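
The threshold logic can be exercised without threads or Redis. This dict-backed sketch (the fake cache, db, and key names are stand-ins) warms a key once its miss count crosses the threshold — a simplified variant of the class above, which counts hits rather than misses:

```python
from collections import Counter

class WarmingSketch:
    """Dict-backed stand-in for the Redis/DB pair (illustrative names)."""
    def __init__(self, warm_threshold=3):
        self.cache = {}
        self.db = {"product:1": "widget"}
        self.access_count = Counter()
        self.warm_threshold = warm_threshold

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.access_count[key] += 1  # count misses as the demand signal
        return None

    def warm_pass(self):
        # One iteration of the background loop: warm hot-but-missing keys
        for key, count in self.access_count.most_common():
            if count >= self.warm_threshold and key not in self.cache:
                self.cache[key] = self.db[key]
                self.access_count[key] = 0

sketch = WarmingSketch()
for _ in range(3):
    sketch.get("product:1")  # three misses accumulate demand
sketch.warm_pass()           # threshold reached -> key warmed from the db
print(sketch.get("product:1"))
```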

Tiered Caching

Not all caches are equal. Using multiple cache tiers with different characteristics gives you the best of both worlds: fast access for hot data, larger capacity for warm data.

graph TD
    A[Request] --> B[L1 Cache<br/>CPU Cache / In-Memory<br/>~MB, <1ms]
    B -->|miss| C[L2 Cache<br/>Redis / Memcached<br/>~GB, <10ms]
    C -->|miss| D[L3 Cache<br/>CDN / Edge<br/>~TB, <100ms]
    D -->|miss| E[Database<br/>~TB, 10-100ms]

L1 + L2 Pattern

import redis
from collections import OrderedDict

class TieredCache:
    def __init__(self, l1_size=1000, l2_client=None, ttl=3600):
        # L1: local in-process LRU cache (wrap with a lock if shared across threads)

        class LRUCache:
            def __init__(self, capacity):
                self.capacity = capacity
                self.cache = OrderedDict()

            def get(self, key):
                if key in self.cache:
                    self.cache.move_to_end(key)
                    return self.cache[key]
                return None

            def put(self, key, value):
                if key in self.cache:
                    self.cache.move_to_end(key)
                self.cache[key] = value
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)

        self.l1 = LRUCache(l1_size)
        self.l2 = l2_client  # Redis
        self.ttl = ttl

    def get(self, key):
        # Check L1 first
        value = self.l1.get(key)
        if value:
            return value

        # Check L2
        if self.l2:
            value = self.l2.get(key)
            if value:
                # Promote to L1
                self.l1.put(key, value)
                return value

        return None

    def put(self, key, value):
        # Write to both tiers
        self.l1.put(key, value)
        if self.l2:
            self.l2.setex(key, self.ttl, value)
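
The promotion step is the heart of the pattern. A self-contained sketch, with a plain dict standing in for Redis and no TTL handling:

```python
from collections import OrderedDict

class L1:
    """Tiny LRU, capacity-bounded (illustrative)."""
    def __init__(self, capacity=2):
        self.capacity, self.cache = capacity, OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as recently used
            return self.cache[key]
        return None

    def put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

l1, l2 = L1(capacity=2), {"a": 1, "b": 2, "c": 3}

def tiered_get(key):
    value = l1.get(key)
    if value is not None:
        return value
    value = l2.get(key)
    if value is not None:
        l1.put(key, value)  # promote hot key into L1
    return value

tiered_get("a"); tiered_get("b"); tiered_get("c")  # fills and overflows L1
print(list(l1.cache))  # only the two most recently promoted keys survive
```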

CDN + Origin Pattern

graph TD
    A[User Request] --> B[CDN Edge<br/>~100ms TTL]
    B -->|miss| C[CDN Origin<br/>~1hr TTL]
    C -->|miss| D[Application<br/>~24hr TTL]
    D -->|miss| E[Database]

For web content:

# CDN cache-control headers (max-age applies to browsers, s-maxage to shared caches like the CDN)
Cache-Control: public, max-age=100, s-maxage=100    # edge caches for 100s
Cache-Control: public, max-age=3600, s-maxage=3600  # origin tier caches for 1hr

# Private content (user-specific)
Cache-Control: private, max-age=3600  # only the browser caches this
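
In application code this often reduces to a helper that picks a header per content class. A hedged sketch — the categories and TTL values here are examples, not a standard:

```python
def cache_control_for(content_type: str, user_specific: bool) -> str:
    """Pick a Cache-Control header for a response (example policy only)."""
    if user_specific:
        # Shared caches (CDN) must never store per-user responses
        return "private, max-age=300"
    if content_type.startswith("image/") or content_type == "text/css":
        # Static assets: long browser TTL, long edge TTL
        return "public, max-age=86400, s-maxage=86400"
    # HTML and API responses: revalidate in the browser, short edge TTL
    return "public, max-age=0, s-maxage=100"

print(cache_control_for("text/html", user_specific=False))
```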

Sharding Strategies

When a single cache instance isn’t enough, you need to distribute across multiple nodes.

Consistent Hashing

import hashlib
from bisect import bisect

class ConsistentHashRing:
    def __init__(self, nodes, replicas=150):
        self.replicas = replicas
        self.ring = {}
        self.sorted_keys = []

        for node in nodes:
            for i in range(replicas):
                key = f"{node}:{i}"
                hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
                self.ring[hash_key] = node
                self.sorted_keys.append(hash_key)

        self.sorted_keys.sort()

    def get_node(self, key):
        if not self.ring:
            return None

        hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect(self.sorted_keys, hash_key)
        if idx >= len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

    def add_node(self, node):
        for i in range(self.replicas):
            key = f"{node}:{i}"
            hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
            self.ring[hash_key] = node
            self.sorted_keys.append(hash_key)
        self.sorted_keys.sort()

    def remove_node(self, node):
        for i in range(self.replicas):
            key = f"{node}:{i}"
            hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
            del self.ring[hash_key]
            self.sorted_keys.remove(hash_key)
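
The payoff of consistent hashing shows up when you add a node: only the keys adjacent to the new virtual nodes move. A standalone experiment with a condensed ring (node and key names are illustrative):

```python
import hashlib
from bisect import bisect

def build_ring(nodes, replicas=150):
    # (hash, node) pairs for every virtual node, sorted by hash
    return sorted(
        (int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16), node)
        for node in nodes for i in range(replicas)
    )

def locate(ring, key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    idx = bisect(ring, (h,))          # first virtual node clockwise of the key
    return ring[idx % len(ring)][1]   # wrap around at the end of the ring

keys = [f"user:{i}" for i in range(10_000)]
ring3 = build_ring(["a", "b", "c"])
ring4 = build_ring(["a", "b", "c", "d"])

before = {k: locate(ring3, k) for k in keys}
after = {k: locate(ring4, k) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved after adding one node")
```

Roughly a quarter of the keys move (the new node's fair share), and every moved key lands on the new node — with naive modulo hashing, adding a fourth node would remap about three quarters of them.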

When to Use / When Not to Use

| Pattern | When to Use | When Not to Use |
| --- | --- | --- |
| Distributed Locking | Preventing cache stampede on popular keys; ensuring single-flight rebuilds | High-frequency writes; when lock contention is a problem |
| Probabilistic Early Expiration | Popular items with predictable access; reducing miss latency | Unpredictable access patterns; items with short TTLs |
| Hysteresis (Stale-While-Refresh) | Latency-sensitive applications; users must never wait | When stale data is unacceptable; very short TTLs |
| Circuit Breaker | External cache service; preventing cascade failures | Local/in-process caches; when cache is always available |
| Request Coalescing | High concurrency on same keys; thundering herd prevention | Low traffic; unique keys per request |
| Cache Warming | After deployments; known hot dataset; reducing cold-start latency | Unknown access patterns; dynamic content |
| Tiered Caching (L1+L2) | Multiple application instances; sub-millisecond access needed for hot data | Single instance; uniform access patterns |
| CDN + Origin | Static content; global user base; reducing origin load | Dynamic, personalized content; real-time data |

Decision Guide

graph TD
    A[Problem] --> B{Cache unavailable?}
    B -->|Yes| C[Use Circuit Breaker]
    B -->|No| D{Stampede on miss?}
    D -->|Yes| E{Hot key access pattern?}
    D -->|No| F{Tiered Latency Needed?}
    E -->|Predictable| G[Probabilistic Early Expiration]
    E -->|High Concurrency| H[Request Coalescing]
    E -->|Low Latency Critical| I[Locking + Hysteresis]
    F -->|Yes| J[L1 + L2 or CDN + Origin]
    F -->|No| K[Standard Cache-Aside]

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Lock holder crashes | Lock never released; key permanently blocked | Use TTL on locks; implement lock timeout/auto-release |
| Hysteresis serves too-stale data | Users see outdated content beyond acceptable window | Set reasonable stale_ttl limit; monitor staleness metrics |
| Circuit breaker stuck open | Cache permanently bypassed; database always hit | Implement half-open state for testing; set maximum open time |
| Tiered cache inconsistency | L1 has stale data while L2 has updated | Use invalidation broadcast; prefer L2 as source of truth |
| Request coalescing memory leak | Pending futures accumulate if requests fail | Implement timeout on futures; cleanup on error |
| CDN serves stale after purge | Users see old content | Use versioned URLs; implement proper cache invalidation |
| Probabilistic early expiration overload | Too many background refreshes | Tune beta parameter; limit concurrent refreshes |

Observability Checklist

Metrics to Track

  • Stampede Rate: Number of simultaneous cache misses for same key
  • Lock Wait Time: How long requests wait for cache lock acquisition
  • Circuit Breaker State: Time spent in open/closed/half-open states
  • Stale Hit Ratio: Percentage of cache hits served while refreshing
  • L1 vs L2 Hit Distribution: Split between local and distributed cache tiers
  • Background Refresh Rate: How often early expiration triggers

# Stampede detection
def detect_stampede(redis_client, key, threshold=10):
    """
    Detect if multiple requests trying to rebuild same key.
    Uses INCR with short TTL as a lightweight counter.
    """
    stampede_key = f"stampede:{key}"
    count = redis_client.incr(stampede_key)
    if count == 1:
        redis_client.expire(stampede_key, 5)  # 5 second window

    if count > threshold:
        logger.warning("cache_stampede_detected", key=key, concurrent_requests=count)
        return True
    return False

# Hysteresis monitoring
def monitor_staleness(redis_client, key):
    stale_key = f"{key}:stale"
    if redis_client.exists(stale_key):
        ttl = redis_client.ttl(key)
        logger.info("serving_stale_data", key=key, fresh_ttl=ttl)

Logs to Capture

import structlog

logger = structlog.get_logger()

# Circuit breaker state changes
def log_circuit_breaker(previous_state, new_state, key=None):
    logger.warning("circuit_breaker_state_change",
        previous_state=previous_state,
        new_state=new_state,
        cache_key=key,
        timestamp=time.time())

# Cache tier miss tracking
def log_tiered_miss(l1_hit=False, l2_hit=False, origin_required=True):
    logger.info("cache_tier_miss",
        l1_hit=l1_hit,
        l2_hit=l2_hit,
        origin_required=origin_required,
        latency_tier="l1" if l1_hit else "l2" if l2_hit else "origin")

Alert Rules

- alert: HighLockContention
  expr: rate(cache_lock_wait_seconds_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High lock contention on cache keys"

- alert: CircuitBreakerOpen
  expr: cache_circuit_breaker_open_duration_seconds > 60
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker open for more than 60 seconds"

- alert: HighStaleHitRatio
  expr: cache_stale_hits_total / cache_total_hits > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 10% of cache hits are serving stale data"

Security Checklist

  • Lock key injection - Validate key names in lock acquisition; prevent large/special key injection
  • Rate limit lock attempts - Prevent denial of service via lock exhaustion
  • Stale data exposure - Ensure stale cache doesn’t expose sensitive data between users
  • Monitor circuit breaker state - An open circuit breaker may cause database overload
  • Tiered cache data isolation - Ensure L1 (local) cache doesn’t leak data between users/tenants
  • Background refresh access control - Background refresh threads should have same permission checks
  • CDN edge security - Don’t cache sensitive content at CDN edge

# Secure lock implementation
class SecureLockingCache:
    def __init__(self, redis_client, max_lock_timeout=5):
        self.redis = redis_client
        self.max_lock_timeout = max_lock_timeout

    def acquire_lock(self, key):
        # Validate key format - no special characters that could cause issues
        if not self._is_valid_key(key):
            raise ValueError(f"Invalid cache key: {key}")

        lock_key = f"lock:{key}"
        lock_timeout = min(self.max_lock_timeout, 5)  # Cap at 5 seconds

        # Use SET NX EX for atomic lock acquisition
        acquired = self.redis.set(lock_key, "1", nx=True, ex=lock_timeout)
        return acquired

    def _is_valid_key(self, key):
        # Basic validation - alphanumeric, colons, dashes, underscores
        import re
        return bool(re.match(r'^[\w:.-]+$', key))

Common Pitfalls / Anti-Patterns

1. Lock Without Timeout

Locks without TTL become permanent if the holder crashes.

# BAD: Lock without timeout
redis.set(f"lock:{key}", "1")  # No TTL - forever if process dies

# GOOD: Lock with TTL
redis.set(f"lock:{key}", "1", nx=True, ex=5)  # Auto-expires in 5 seconds

2. Not Handling Lock Acquisition Failure

Assuming you always get the lock leads to deadlocks.

# BAD: Assume lock always acquired
lock.acquire()
try:
    result = fetch_and_cache()
finally:
    lock.release()

# GOOD: Handle failure gracefully
if not lock.acquire(blocking=True, blocking_timeout=1):
    # Wait and retry from cache, or serve stale
    time.sleep(0.1)
    return redis.get(key)

try:
    result = fetch_and_cache()
finally:
    lock.release()

3. Probabilistic Early Expiration Beta Too Aggressive

Beta controls refresh aggressiveness. Too high causes unnecessary refreshes.

# BAD: Beta too high (aggressive) - refreshes constantly
get_with_probabilistic_early_expiration(key, fetch_func, beta=1.0)
# Almost every request triggers refresh for low TTL items

# GOOD: Beta around 0.3 - measured refresh
get_with_probabilistic_early_expiration(key, fetch_func, beta=0.3)
# Only refreshes when TTL gets low enough

4. Hysteresis Without Staleness Limit

Serving stale forever defeats the purpose of cache invalidation.

# BAD: stale copy stored without a TTL - may be served forever
redis.set(f"{key}:stale", json.dumps(result))  # no expiry on the stale copy

# GOOD: bound the staleness window via the stale copy's TTL
def get_with_hysteresis(key, fetch_func, ttl=3600, max_stale_ttl=60):
    value = redis.get(key)
    if value:
        return json.loads(value)

    # The stale copy expires max_stale_ttl seconds after the fresh copy,
    # so it can only be served within that window
    stale_value = redis.get(f"{key}:stale")
    if stale_value:
        trigger_background_refresh(key, fetch_func)
        return json.loads(stale_value)

    # No stale available - must refresh synchronously
    result = fetch_func(key)
    payload = json.dumps(result)
    redis.setex(key, ttl, payload)
    redis.setex(f"{key}:stale", ttl + max_stale_ttl, payload)
    return result

5. Tiered Cache Without Invalidation Strategy

Updates to L2 don’t propagate to L1, causing stale reads.

# BAD: Write to L2 but L1 still has old data
def update(key, value):
    l2.setex(key, 3600, value)
    # L1 still has old value - will be served until evicted

# GOOD: Delete from both tiers
def update(key, value):
    l2.setex(key, 3600, value)
    l1.delete(key)  # Invalidate L1

Quick Recap

Key Bullets

  • Stampede protection is essential for production caches with popular expiring keys
  • Locking, hysteresis, and probabilistic early expiration all solve stampede differently
  • Circuit breakers prevent cascade failures when cache is unavailable
  • Tiered caching (L1+L2) gives sub-millisecond access for hottest data
  • Request coalescing reduces thundering herd by collapsing concurrent requests
  • Cache warming prevents cold-start latency after deployments or restarts

Copy/Paste Checklist

# Stampede protection - Locking pattern
lock = redis.lock(f"lock:{key}", timeout=5)
if lock.acquire(blocking=True, blocking_timeout=1):
    try:
        value = redis.get(key)
        if not value:
            value = fetch_from_db()
            redis.setex(key, 3600, value)
    finally:
        lock.release()

# Hysteresis pattern
value = redis.get(key)
if not value:
    stale = redis.get(f"{key}:stale")
    if stale:
        trigger_background_refresh(key)
        return json.loads(stale)
    value = fetch_from_db()
    redis.setex(key, 3600, value)

# Circuit breaker
if circuit_breaker.state == "open":
    return db.query(key)  # Skip cache
try:
    value = redis.get(key)
    if value:
        return json.loads(value)
except Exception as e:
    circuit_breaker.record_failure()
    return db.query(key)

# Tiered cache get
value = l1.get(key)
if value:
    return value
value = l2.get(key)
if value:
    l1.put(key, value)  # Promote to L1
    return value
value = db.query(key)
l2.setex(key, 3600, value)
l1.put(key, value)
return value

# Deployment checklist:
# [ ] Implement stampede protection before going to production
# [ ] Monitor lock contention and circuit breaker state
# [ ] Plan cache warming strategy for deployments
# [ ] Set appropriate TTLs for each cache tier
# [ ] Log tiered cache hit distribution (L1 vs L2)
# [ ] Have circuit breaker fallback to direct DB access


Conclusion

Cache stampede protection matters when you have popular expiring entries. Cache warming matters after deployments. Tiered caching matters when you’re big enough that one cache tier isn’t enough.

Start with the basics. Add stampede protection early because it bites you unexpectedly. The other patterns you add when you have the operational need.

Don’t over-engineer before you have the problem.
