Cache Stampede Prevention: Protecting Your Cache

Learn how single-flight, request coalescing, and probabilistic early expiration prevent cache stampedes that can overwhelm your database.

Reading time: 7 min

Cache Stampede Prevention: Protecting Your Cache from Thundering Herds

A cache stampede happens when a popular cache entry expires and dozens (or hundreds) of requests all miss simultaneously, each one hitting your database to rebuild the same value. Your cache was supposed to protect the database — but instead, the expiration creates a thundering herd that overwhelms whatever you were trying to shield.

This happens more often than you’d think. A single cache key for “top 100 products” or “user session data” can expire at a bad moment and send a flood of requests to your database. The fix isn’t just “use a longer TTL.” You need actual stampede prevention patterns.

This guide covers the main approaches: single-flight, request coalescing, probabilistic early expiration, and lock-based cache refresh.

Why Cache Stampedes Happen

The classic scenario: you cache a result with a 60-second TTL. At second 60, 200 requests arrive within milliseconds of each other. Every single one checks the cache, finds nothing, and rushes to the database.

# This code stampedes
# (assumes `redis` and `db` are preconfigured client objects and json is imported)
def get_product_catalog():
    cached = redis.get("product_catalog")
    if cached:
        return json.loads(cached)

    # 200 requests arrive here at the same time:
    # every one of them misses and runs the same query
    result = db.query("SELECT * FROM products")
    redis.setex("product_catalog", 60, json.dumps(result))
    return result

The cache didn’t help at all. Every request still hit the database.

Single-Flight Pattern

The single-flight pattern ensures only one request fetches the data while all others wait for that same result. Concurrent requests to the same key share a single in-flight request.

Instead of 200 database calls, you get 1. The other 199 requests wait on the same task.

import asyncio
from typing import Any

class SingleFlight:
    def __init__(self):
        self.in_flight: dict[str, asyncio.Task] = {}
        self.lock = asyncio.Lock()

    async def get(self, key: str, fetch_fn) -> Any:
        # Check if a request for this key is already in flight
        async with self.lock:
            if key in self.in_flight:
                # Another request is already fetching this key
                task = self.in_flight[key]
            else:
                # No in-flight request, create one
                task = asyncio.create_task(self._fetch(key, fetch_fn))
                self.in_flight[key] = task
                task.add_done_callback(
                    lambda _: self._cleanup(key)
                )

        return await task

    async def _fetch(self, key: str, fetch_fn) -> Any:
        result = await fetch_fn()
        return result

    def _cleanup(self, key: str):
        self.in_flight.pop(key, None)
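To see the effect end to end, here is a compact, self-contained variant of the same pattern (a module-level dict instead of a class, and a hypothetical fetch_catalog standing in for the database): 200 concurrent callers produce exactly one fetch.

```python
import asyncio

in_flight: dict[str, asyncio.Task] = {}
fetch_count = 0

async def fetch_catalog() -> list[str]:
    global fetch_count
    fetch_count += 1
    await asyncio.sleep(0.01)  # stand-in for a slow database query
    return ["product-1", "product-2"]

async def get(key: str) -> list[str]:
    # All concurrent callers for `key` share one in-flight task.
    # No lock needed here: there is no await between the check and the insert,
    # and the event loop is single-threaded.
    task = in_flight.get(key)
    if task is None:
        task = asyncio.create_task(fetch_catalog())
        in_flight[key] = task
        task.add_done_callback(lambda _: in_flight.pop(key, None))
    return await task

async def main() -> None:
    results = await asyncio.gather(*(get("product_catalog") for _ in range(200)))
    assert all(r == ["product-1", "product-2"] for r in results)

asyncio.run(main())
```

Awaiting the same `asyncio.Task` from many coroutines is safe; every awaiter receives the same result (or the same exception, which is usually what you want: one failed fetch fails the whole batch instead of retrying 200 times).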

Go Single-Flight Implementation

Go’s golang.org/x/sync/singleflight is the canonical implementation:

var sf singleflight.Group

func getProductCatalog() ([]Product, error) {
    v, err, _ := sf.Do("product_catalog", func() (interface{}, error) {
        // Only one request reaches the database
        return fetchFromDB()
    })
    if err != nil {
        // Check the error before the type assertion: asserting a nil
        // interface would panic
        return nil, err
    }
    return v.([]Product), nil
}

Request Coalescing

Request coalescing takes single-flight further by batching multiple requests that arrive within a window. Instead of one request per key, you wait a short time (e.g., 5ms) to collect all requests for the same key, then make a single database call.

The window_ms tuning matters. Too short and you don’t coalesce enough. Too long and latency suffers.

import asyncio
from typing import Any

class RequestCoalescer:
    def __init__(self, window_ms: float = 5.0):
        self.window_ms = window_ms
        self.pending: dict[str, asyncio.Future] = {}
        self.lock = asyncio.Lock()

    async def get(self, key: str, fetch_fn) -> Any:
        async with self.lock:
            future = self.pending.get(key)
            if future is None:
                # First request for this key becomes the batch leader
                future = asyncio.get_running_loop().create_future()
                self.pending[key] = future
                is_leader = True
            else:
                is_leader = False

        if not is_leader:
            # Join the open batch and wait for the leader's result
            return await future

        try:
            # Hold the coalescing window open so more requests can join
            await asyncio.sleep(self.window_ms / 1000)

            # Fetch once, then wake everyone waiting on the future
            result = await fetch_fn()
            future.set_result(result)
            return result
        except Exception as exc:
            future.set_exception(exc)
            raise
        finally:
            async with self.lock:
                self.pending.pop(key, None)
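A self-contained sketch of the window behavior (hypothetical `coalesced_get` with module-level state, and a 50 ms window for timing headroom): four requests arriving staggered inside one window still trigger a single fetch.

```python
import asyncio

pending: dict[str, asyncio.Future] = {}
fetch_count = 0

async def coalesced_get(key: str, window_s: float = 0.05) -> str:
    global fetch_count
    fut = pending.get(key)
    if fut is not None:
        return await fut  # join the batch already open for this key
    fut = asyncio.get_running_loop().create_future()
    pending[key] = fut
    await asyncio.sleep(window_s)   # hold the window open for late arrivals
    pending.pop(key, None)          # close the batch before fetching
    fetch_count += 1                # stand-in for the single database call
    fut.set_result(f"value:{key}")
    return fut.result()

async def main() -> None:
    async def arrive(i: int) -> str:
        await asyncio.sleep(0.01 * i)  # staggered arrivals inside the window
        return await coalesced_get("top100")

    results = await asyncio.gather(*(arrive(i) for i in range(4)))
    assert results == ["value:top100"] * 4

asyncio.run(main())
```

Note the trade-off the window creates: every request in the batch, including the first, pays the full window as added latency. That is why the window is millisecond-scale in practice.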

Probabilistic Early Expiration

Instead of waiting for the TTL to expire, you probabilistically refresh the cache before it actually expires. Keys that are frequently accessed get refreshed more often, spreading the load over time rather than having all requests hit at once.

The canonical rule (the "XFetch" algorithm) refreshes when now - delta * beta * ln(random()) >= expiry, where delta is how long the last recomputation took and beta (typically around 1.0) tunes how eager the early refresh is. The example below uses a simpler frequency-based heuristic in the same spirit. For more on eviction policies that interact with this pattern, see cache eviction policies.
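A minimal sketch of that rule (names are assumptions here: `delta` is the cost of the last recomputation, `beta` an eagerness knob):

```python
import math
import random

def should_recompute(now: float, expiry: float, delta: float, beta: float = 1.0) -> bool:
    # XFetch-style check: math.log(random.random()) is negative, so the
    # subtraction pushes `now` forward by a random amount. The probability
    # of firing rises smoothly as expiry approaches; a larger delta or
    # beta makes early refreshes more likely.
    return now - delta * beta * math.log(random.random()) >= expiry
```

At or past the expiry this always fires; far from expiry with a small `delta`, it almost never does, so refreshes spread out instead of clustering at the TTL boundary.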

import asyncio
import json
import random

def should_refresh_early(access_frequency: float, random_factor: float = 0.3) -> bool:
    """
    Probabilistic early expiration.
    Higher access frequency = higher chance of early refresh.
    """
    probability = min(1.0, access_frequency * random_factor)
    return random.random() < probability

async def get_with_probabilistic_refresh(key: str, fetch_fn, ttl: int = 60):
    # Assumes `redis` is a preconfigured client and refresh_async is a
    # helper that re-runs fetch_fn and rewrites the key
    value = redis.get(key)
    if value:
        access_count = redis.hincrby("access_counts", key, 1)
        access_frequency = access_count / ttl

        if should_refresh_early(access_frequency):
            # Refresh in the background while callers keep the cached value
            asyncio.create_task(refresh_async(key, fetch_fn, ttl))

        return json.loads(value)

    # Cache miss - must wait for fetch
    result = fetch_fn()
    redis.setex(key, ttl, json.dumps(result))
    return result

This works well for read-heavy workloads where some keys are much more popular than others.

Lock-Based Cache Refresh

Instead of letting all requests miss when a key expires, you give one request the “lock” to rebuild the cache while others either return stale data or wait.

import json
import threading
import time

# `redis` below is assumed to be a configured client instance (e.g. redis.Redis())

STALE_TTL = 30  # Keep serving stale data for 30 extra seconds

def get_with_lock_refresh(key: str, fetch_fn, ttl: int = 60):
    cached = redis.get(key)
    if cached:
        value, expires_at = json.loads(cached)
        is_stale = time.time() > expires_at

        if not is_stale:
            return value

        # Stale - try to acquire refresh lock
        lock_key = f"lock:{key}"
        lock_acquired = redis.set(lock_key, "1", nx=True, ex=10)

        if lock_acquired:
            # We got the lock - refresh in background
            def background_refresh():
                try:
                    result = fetch_fn()
                    new_expires = time.time() + ttl
                    redis.setex(key, ttl + STALE_TTL, json.dumps([result, new_expires]))
                finally:
                    redis.delete(lock_key)

            threading.Thread(target=background_refresh, daemon=True).start()

        # Return stale data while refresh happens in background
        return value

    # No cache at all - must fetch synchronously
    result = fetch_fn()
    expires_at = time.time() + ttl
    redis.setex(key, ttl + STALE_TTL, json.dumps([result, expires_at]))
    return result

This is essentially the stale-while-revalidate pattern that Cloudflare, Fastly, and CDNs use — serve stale content while refreshing in the background.
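In HTTP terms, this is the `stale-while-revalidate` Cache-Control extension standardized in RFC 5861. A response that opts in looks like:

```http
Cache-Control: max-age=60, stale-while-revalidate=30
```

For up to 30 seconds after the 60-second freshness window ends, a cache may serve the stale copy while it refetches in the background, which is exactly the behavior the Python code above implements by hand.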

Choosing a Strategy

Strategy                       | When it works                           | When it doesn't
------------------------------ | --------------------------------------- | -----------------------------------------
Single-flight                  | Low to medium concurrency, simple cases | High concurrency with many different keys
Request coalescing             | Bursts of identical requests            | Very low latency requirements
Probabilistic early expiration | Popular keys with high variance         | Uniform access patterns
Lock-based refresh             | Willing to serve stale data             | Fresh data always required

Combining Approaches

Most production systems layer multiple strategies:

  1. Use single-flight at the application layer to deduplicate concurrent requests
  2. Add probabilistic early expiration on popular keys to spread load over time
  3. Keep a short stale TTL as a safety net for the long tail

# Sketch: single_flight, should_probabilistic_refresh, background_refresh,
# is_stale, acquire_refresh_lock, and get_cached_or_fetch are thin wrappers
# around the patterns built in the sections above
async def get_cached_with_layers(key: str, fetch_fn, ttl: int = 60):
    # Layer 1: Single-flight deduplication
    result = await single_flight.get(key, lambda: refresh_with_layers(key, fetch_fn, ttl))
    return result

async def refresh_with_layers(key: str, fetch_fn, ttl: int):
    # Layer 2: Probabilistic early refresh
    if should_probabilistic_refresh(key, ttl):
        asyncio.create_task(background_refresh(key, fetch_fn, ttl))

    # Layer 3: Lock-based refresh if the entry has gone stale
    if is_stale(key):
        await acquire_refresh_lock(key, fetch_fn, ttl)

    return get_cached_or_fetch(key, fetch_fn)

Common Mistakes

Setting TTLs too uniformly: If every key expires at the same time (e.g., all set at application startup), you create synchronized stampedes. Add jitter:

import random
ttl_with_jitter = base_ttl + random.randint(0, int(base_ttl * 0.1))

Ignoring the lock timeout: If your lock acquisition doesn’t have a timeout, a crashed refresh process leaves the lock hanging forever. Always set EX on the lock key.

Serving stale data without bounds: The stale-while-revalidate pattern requires a maximum staleness threshold. Without one, you could serve data that’s hours old.

Monitoring for Stampedes

Watch these signals:

# Redis: detect many keys expiring at once
# (requires notify-keyspace-events to include "Ex")
redis-cli psubscribe '__keyevent@0__:expired'

-- Database: Detect thundering herds
SELECT COUNT(*), DATE_TRUNC('second', created_at) as t
FROM queries
WHERE created_at > NOW() - INTERVAL '10 seconds'
GROUP BY t
HAVING COUNT(*) > 100;

Set alerts on cache hit rate drops — a sudden drop often signals a stampede.

Next Steps

Cache stampede prevention is one piece of a larger caching strategy. For more on caching patterns, read about caching strategies and distributed caching.

For handling the database side of concurrent writes, the locking and concurrency guide covers pessimistic and optimistic approaches.

For understanding how cache stampedes relate to overall database scaling, it’s worth knowing how caches fit into the scale-out picture.

Related Posts

Cache Eviction Policies: LRU, LFU, FIFO, and More Explained

Learn LRU, LFU, FIFO, and TTL eviction policies. Understand trade-offs with real-world performance implications for caching.

#caching #algorithms #system-design

Caching Strategies: Cache-Aside, Write-Through, and More

Master five caching strategies for production systems. Learn cache-aside vs write-through, avoid cache stampede, and scale with these patterns.

#caching #system-design #performance

Distributed Caching: Multi-Node Cache Clusters

Scale caching across multiple nodes. Learn about cache clusters, consistency models, session stores, and cache coherence patterns.

#distributed-systems #caching #scalability