Rate Limiting: Token Bucket, Sliding Window, and Distributed Strategies
Every API has a capacity limit. Physical servers have finite CPU and memory. Databases can only handle so many connections. Rate limiting protects your services from being overwhelmed—whether by malicious attackers, runaway scripts, or too many legitimate users at once.
Without it, one misconfigured client can take down your entire API. A retry loop hitting a temporary backend hiccup creates a retry storm. Your service goes from degraded to dead.
Why Rate Limit?
Rate limiting serves several purposes. Protection from abuse is the obvious one. But there are subtler benefits.
Fairness. Without rate limits, heavy users crowd out light users. A single tenant uploading large files could saturate bandwidth for everyone else.
Cost control. Your infrastructure costs money. Rate limits keep each customer's consumption in line with what their plan is priced for.
Predictability. Knowing your API has hard limits lets you plan client-side behavior. Users can implement retry logic with appropriate backoff.
Protection from cascading failures. When upstream services are slow, clients retry aggressively. Rate limiting prevents retry storms from compounding the original problem.
Core Algorithms
Token Bucket
Token bucket is the most common rate limiting algorithm. Imagine a bucket that holds a fixed number of tokens. Tokens get added to the bucket at a constant rate. Each request consumes one token. If the bucket is empty, requests are rejected.
import time
import threading

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now
The key properties: bursts allowed up to bucket capacity, long-term rate equals refill rate. A bucket with capacity 100 and refill rate of 10 per second lets you burst to 100 requests, then sustains 10 per second.
Users get burst capacity when they need it, then settle into the sustainable rate.
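To make the burst-then-sustain behavior concrete, here is a condensed sketch of the bucket above with an injectable clock (a testing convenience, not part of the class as shown) so the refill can be simulated without sleeping:

```python
import threading

class TokenBucket:
    """Same refill logic as above, but the clock is injectable."""
    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.clock = clock
        self.last_refill = clock()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = self.clock()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

t = [0.0]  # simulated clock, in seconds
bucket = TokenBucket(capacity=5, refill_rate=1.0, clock=lambda: t[0])
burst = sum(bucket.allow_request() for _ in range(10))  # 5 allowed, 5 rejected
t[0] += 3.0  # three seconds pass -> three tokens refill
later = sum(bucket.allow_request() for _ in range(10))  # only 3 allowed now
print(burst, later)  # 5 3
```

The first ten requests drain the full capacity; after that, throughput is bounded by the refill rate alone.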
Sliding Window
Sliding window rate limiting addresses the boundary issues of fixed windows. Instead of resetting counters at window boundaries, sliding window tracks requests within a rolling time period.
import time
import threading
from collections import deque

class SlidingWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = deque()  # timestamps of accepted requests
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds
            # Remove entries that have aged out of the window
            while self.requests and self.requests[0] < cutoff:
                self.requests.popleft()
            if len(self.requests) < self.max_requests:
                self.requests.append(now)
                return True
            return False
The advantage over fixed window: no spike at window boundaries. With fixed windows, a client could make 100 requests at 11:59:59 and another 100 at 12:00:00, effectively doubling the rate. Sliding window prevents this.
The memory cost is proportional to the number of requests in the window. For high-volume APIs, this can add up.
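To illustrate the boundary claim, here is the same deque logic with an injectable clock (a testing convenience): three requests land just before a would-be boundary at t=10, and a fourth just after it is still rejected because the window slides rather than resets:

```python
import threading
from collections import deque

class SlidingWindow:
    """Same deque logic as above, with an injectable clock for simulation."""
    def __init__(self, max_requests, window_seconds, clock):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = deque()
        self.clock = clock
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = self.clock()
            cutoff = now - self.window_seconds
            while self.requests and self.requests[0] < cutoff:
                self.requests.popleft()
            if len(self.requests) < self.max_requests:
                self.requests.append(now)
                return True
            return False

t = [9.9]  # simulated clock: just before a fixed window would reset at t=10
limiter = SlidingWindow(max_requests=3, window_seconds=10, clock=lambda: t[0])
for _ in range(3):
    limiter.allow_request()             # budget used up at t=9.9
t[0] = 10.1
at_boundary = limiter.allow_request()   # False: still 3 requests in the last 10 s
t[0] = 20.0
after_window = limiter.allow_request()  # True: the earlier requests aged out
print(at_boundary, after_window)  # False True
```

A fixed window would have allowed the request at t=10.1, doubling the effective rate around the boundary.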
Fixed Window
Fixed window is the simplest algorithm. Divide time into fixed intervals. Each interval has a request budget. Count resets at interval boundaries.
import time
import threading

class FixedWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = time.time()
        self.count = 0
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            if now >= self.window_start + self.window_seconds:
                self.window_start = now
                self.count = 0
            if self.count < self.max_requests:
                self.count += 1
                return True
            return False
Fixed window is easy to implement and understand. The boundary spike problem is real, but for many applications it is acceptable. Memory usage stays constant.
Leaky Bucket
Leaky bucket works like a bucket with a hole in the bottom. Water (requests) enters at varying rates but leaks out at a constant rate. If the bucket overflows, requests are rejected.
import time
import threading

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests drained per second
        self.water = 0
        self.last_leak = time.time()
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            self._leak()
            if self.water < self.capacity:
                self.water += 1
                return True
            return False

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        leaked = elapsed * self.leak_rate
        self.water = max(0, self.water - leaked)
        self.last_leak = now
Leaky bucket outputs at a steady rate, making it useful for rate-limiting requests to downstream services. If your API calls a third-party service with rate limits, leaky bucket smooths your outbound traffic.
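A quick simulation (same logic as above, with an injectable clock added for the demo) shows the smoothing: a burst of five only partially fits, and capacity frees up at exactly the leak rate:

```python
import threading

class LeakyBucket:
    """Same leak logic as above, with an injectable clock for simulation."""
    def __init__(self, capacity, leak_rate, clock):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.water = 0.0
        self.clock = clock
        self.last_leak = clock()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = self.clock()
            self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
            self.last_leak = now
            if self.water < self.capacity:
                self.water += 1
                return True
            return False

t = [0.0]  # simulated clock
bucket = LeakyBucket(capacity=3, leak_rate=1.0, clock=lambda: t[0])
first = sum(bucket.allow_request() for _ in range(5))   # burst of 5: only 3 fit
t[0] = 2.0                                              # two seconds later, 2 have drained
second = sum(bucket.allow_request() for _ in range(5))  # room for exactly 2 more
print(first, second)  # 3 2
```

However the input arrives, admitted traffic drains downstream at the configured leak rate.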
Implementing Rate Limits in APIs
HTTP Headers
APIs communicate rate limit status through headers. There is no universal standard, but common conventions exist.
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
Retry-After: 3600
When a client exceeds the limit, return 429 Too Many Requests:
@app.route('/api/resource')
def get_resource():
    if not rate_limiter.allow_request():
        retry_after = int(rate_limiter.get_reset_time() - time.time())
        return {
            'error': 'Rate limit exceeded',
            'retry_after': retry_after
        }, 429, {'Retry-After': str(retry_after)}
    return {'data': 'result'}
Clients should read these headers and back off accordingly. Implement exponential backoff with jitter to avoid thundering herd problems.
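On the client side, one common scheme is "full jitter": each retry waits a random amount between zero and an exponentially growing, capped ceiling. A minimal sketch (function name and defaults are illustrative):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=random.random):
    """Full-jitter backoff: for attempt a, wait a random amount in
    [0, min(cap, base * 2**a)] before retrying."""
    return [rng() * min(cap, base * (2 ** a)) for a in range(attempts)]

delays = backoff_delays()
# Ceilings grow 1, 2, 4, 8, 16 s; the randomization spreads out clients
# that were rejected together so they do not all retry at the same instant
```

When the server supplies a Retry-After header, treat it as the minimum wait and apply jitter on top.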
Distributed Rate Limiting
In a single-server environment, the algorithms above work fine. But modern applications run across multiple servers. A client could hit different servers with separate counters, bypassing the rate limit entirely.
Distributed rate limiting uses a shared store to coordinate limits across instances.
Redis-Based Token Bucket
-- Redis token bucket implementation
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp
-- ARGV[4] = requested tokens (usually 1)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
return {allowed, tokens}
This Lua script executes atomically in Redis, preventing race conditions. The expiry ensures the key does not persist after the client stops making requests.
Redis-Based Sliding Window
Sliding window with Redis uses a sorted set. Each request gets a timestamp as the score and a unique member. To check the limit: remove entries older than the window, count remaining entries, add the new entry if within limit.
import redis
import time
import uuid

class RedisSlidingWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.redis = redis.Redis()

    def allow_request(self, key: str) -> bool:
        now = time.time()
        window_start = now - self.window_seconds
        request_id = str(uuid.uuid4())
        pipe = self.redis.pipeline()
        # Remove entries older than the window
        pipe.zremrangebyscore(key, 0, window_start)
        # Count the remaining entries
        pipe.zcard(key)
        results = pipe.execute()
        current_count = results[1]
        if current_count < self.max_requests:
            # Note: this check-then-add is not atomic across clients; for
            # strict enforcement, move the logic into a Lua script as above
            self.redis.zadd(key, {request_id: now})
            self.redis.expire(key, self.window_seconds + 1)
            return True
        return False
Rate Limiting Strategies
Per-User vs Per-IP
Rate limits can apply to authenticated users or to IP addresses. Per-user limits are fairer to shared IPs (multiple users behind corporate NAT). Per-IP limits protect against attacks from single sources.
Use both. Per-IP limits catch unauthenticated abuse. Per-user limits ensure authenticated users share resources fairly.
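A sketch of the combined check (CountingLimiter is a toy stand-in for any of the algorithms above; all names here are illustrative):

```python
class CountingLimiter:
    """Toy stand-in for any real limiter: allows N requests, total."""
    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.count = 0

    def allow_request(self):
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False

def allow(ip, user_id, limiters, ip_max=5, user_max=3):
    ip_limiter = limiters.setdefault(f"ip:{ip}", CountingLimiter(ip_max))
    user_limiter = limiters.setdefault(f"user:{user_id}", CountingLimiter(user_max))
    # Both layers must agree. Note the short-circuit: if the IP check
    # passes but the user check rejects, the IP token is still consumed.
    return ip_limiter.allow_request() and user_limiter.allow_request()

limiters = {}
results = [allow("10.0.0.1", "alice", limiters) for _ in range(4)]
print(results)  # [True, True, True, False] -- the per-user limit trips first
```

Whichever layer has the tighter budget ends up being the effective limit for a given client.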
Tiered Limits
Different subscription tiers get different rate limits. Free tier might get 100 requests per minute, paid 1,000, and enterprise 10,000.

def get_rate_limit(user: User) -> tuple[int, int]:
    tier = user.subscription_tier
    if tier == 'free':
        return (100, 60)  # 100 requests per 60 seconds
    elif tier == 'paid':
        return (1000, 60)
    elif tier == 'enterprise':
        return (10000, 60)
    return (100, 60)  # unknown tiers fall back to the free limit
Global vs Endpoint-Specific
You might want different limits for different endpoints. Public data endpoints could have generous limits. Write operations should have stricter limits. Admin endpoints might be limited to internal networks only.
ENDPOINT_LIMITS = {
    '/api/public/search': (1000, 60),
    '/api/private/write': (100, 60),
    '/api/admin/': (10000, 60),  # prefix: covers every admin endpoint
}

def check_rate_limit(endpoint: str, user: User) -> bool:
    # Prefix match so entries like '/api/admin/' apply to all paths under them
    limit = next(
        (v for k, v in ENDPOINT_LIMITS.items() if endpoint.startswith(k)),
        (100, 60),  # default limit for unlisted endpoints
    )
    return sliding_window.allow_request(f"user:{user.id}:{endpoint}", *limit)
Handling Rate Limit Violations
When a client exceeds the limit, return a proper response.
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 3600

{
    "error": "rate_limit_exceeded",
    "message": "API rate limit exceeded. Please retry after 3600 seconds.",
    "limit": 1000,
    "reset_at": "2026-03-22T13:00:00Z"
}
Log rate limit violations. A client repeatedly hitting limits might be malicious, or it might need a higher tier. Either way, you want to know.
Consider graduated responses. First violation gets a warning header. Repeated violations get actual rejections. This helps legitimate clients that are misconfigured.
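A sketch of graduated enforcement (the wrapped limiter and the warn budget are illustrative; OneShotLimiter is a toy stand-in for any limiter above):

```python
class OneShotLimiter:
    """Toy stand-in for any real limiter: allows n requests, total."""
    def __init__(self, n):
        self.n = n

    def allow_request(self):
        self.n -= 1
        return self.n >= 0

class GraduatedLimiter:
    """Warn on the first few violations, reject only after that."""
    def __init__(self, limiter, warn_budget=2):
        self.limiter = limiter
        self.warn_budget = warn_budget
        self.violations = {}

    def check(self, client_id):
        if self.limiter.allow_request():
            return "ok"
        v = self.violations.get(client_id, 0) + 1
        self.violations[client_id] = v
        # Map "warn" to a warning header on a 200, "reject" to a 429
        return "warn" if v <= self.warn_budget else "reject"

g = GraduatedLimiter(OneShotLimiter(1), warn_budget=2)
statuses = [g.check("client-1") for _ in range(4)]
print(statuses)  # ['ok', 'warn', 'warn', 'reject']
```

A misconfigured but well-intentioned client sees the warnings and can fix its behavior before it starts receiving hard rejections.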
Common Pitfalls
Not Being Atomic
Race conditions in distributed rate limiting cause limit bypass. Two requests arriving simultaneously on different servers both check the counter, both see room, both increment. Both get through when only one should.
Use atomic operations. Redis Lua scripts, conditional writes, or distributed locks. Pick one that matches your consistency requirements.
Ignoring Header Consistency
When using multiple rate-limiting instances, headers might show different values to the same client. Request hits Server A, sees 99 remaining. Request hits Server B, sees 100 remaining. This confuses clients tracking their usage.
Use a shared counter store and return values from it. Accept the latency cost.
Allowing Unrestricted Bursts
Token bucket capacity sets the maximum burst. Large bursts can overwhelm downstream services even if the long-term average is within limits.
Set burst limits based on what your infrastructure can actually handle, not on what seems generous.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Rate limit counter overflow | Limits enforced incorrectly; bypassed or too restrictive | Use appropriate data types; set maximum values; monitor for unusual patterns |
| Redis unavailable for distributed limits | Rate limiting fails open or closed | Decide on fail-open vs fail-closed; implement local fallback with reduced limits |
| Clock skew in sliding window | Requests incorrectly counted or allowed | Use Redis server time; NTP sync all rate-limiting servers |
| Thundering herd on limit reset | All clients retry simultaneously at window boundary | Use jitter in retry logic; consider sliding window instead of fixed |
| Client bypass via IP spoofing | Rate limits circumvented by malicious actors | Combine per-IP with per-user limits; implement multiple enforcement layers |
Observability Checklist
Metrics:
- Requests allowed vs rejected per endpoint
- Rate limit headroom (remaining vs total)
- Distributed counter sync latency
- 429 response rate by endpoint and client
- Token bucket utilization over time

Logs:
- Rate limit violations with client identifier
- Counter sync operations between nodes
- Algorithm selection per endpoint
- Burst detection (sudden spikes in requests)

Alerts:
- 429 response rate exceeds threshold (indicates abuse or misconfiguration)
- Rate limit headroom approaching zero
- Counter desynchronization between Redis instances
- Unusual spike in requests from a single source
Security Checklist
- Rate limiting enforced server-side (not just client-side)
- Per-IP limits combined with per-user limits
- Distributed rate limiting synchronized across all instances
- Atomic operations prevent counter bypass
- Rate limit headers do not expose internal implementation details
- Graduated response (warn before blocking) for misconfigured clients
- Anonymize PII in rate limiting logs
Common Anti-Patterns to Avoid
Rate Limiting Only at API Gateway
If the gateway is your only rate limiter, anything that bypasses it faces no limits at all. Implement defense in depth.
Fixed Window for High-Value APIs
The boundary spike in fixed window allows double the rate momentarily. Use sliding window for critical endpoints.
Not Considering Retry-After
Clients need to know when to retry. Always include Retry-After header and respect it in client implementations.
Ignoring Burst Behavior
Setting bucket capacity to the same as refill rate eliminates all burst capability. Allow some burst headroom.
Quick Recap
Key Bullets:
- Token bucket suits burst-friendly APIs; sliding window suits consistent-rate APIs
- Distributed rate limiting requires atomic operations via Redis Lua or similar
- Combine per-IP and per-user limits for defense in depth
- Always include Retry-After header; clients should respect it
- Monitor rate limit metrics to detect abuse and misconfiguration
Copy/Paste Checklist:
Rate Limiting Implementation:
[ ] Choose algorithm based on use case
[ ] Implement atomic distributed counter (Redis Lua)
[ ] Set burst capacity based on downstream tolerance
[ ] Implement per-IP and per-user limits
[ ] Add X-RateLimit-* headers to all responses
[ ] Include Retry-After on 429 responses
[ ] Implement graceful degradation (fail-open vs fail-closed)
[ ] Add jitter to client retry logic
[ ] Monitor violation rates per client
[ ] Test with concurrent requests from multiple nodes
When to Use Which Algorithm
Token bucket suits APIs where burst usage is normal. Users make many requests quickly while actively using a feature, then go idle.
Sliding window suits APIs needing smooth, predictable limiting. Better for services calling downstream APIs with strict rate limits.
Fixed window suits when you need simple, debuggable limits. The boundary spike might be acceptable for your use case.
Leaky bucket suits when you need to send requests at a steady rate to a downstream service.
For most web APIs, a token bucket with a Redis backend handles the job. It is well understood, memory-efficient, and extends naturally to a distributed deployment.
For more on related topics, see API Gateway Patterns, Resilience Patterns, and Circuit Breaker Pattern.