Rate Limiting: Token Bucket, Sliding Window, and Distributed Strategies
Every API has a capacity limit. Physical servers have finite CPU and memory. Databases can only handle so many connections. Rate limiting protects your services from being overwhelmed—whether by malicious attackers, runaway scripts, or too many legitimate users at once.
Without it, one misconfigured client can take down your entire API. A retry loop hitting a temporary backend hiccup creates a retry storm. Your service goes from degraded to dead.
Why Rate Limit?
Rate limiting serves several purposes. Protection from abuse is the obvious one. But there are subtler benefits.
Fairness. Without rate limits, heavy users crowd out light users. A single tenant uploading large files could saturate bandwidth for everyone else.
Cost control. Your infrastructure costs money. Rate limits keep each customer's consumption in line with what their plan is priced for.
Predictability. Knowing your API has hard limits lets you plan client-side behavior. Users can implement retry logic with appropriate backoff.
Protection from cascading failures. When upstream services are slow, clients retry aggressively. Rate limiting prevents retry storms from compounding the original problem.
Core Algorithms
Token Bucket
Token bucket is the most common rate limiting algorithm. Imagine a bucket that holds a fixed number of tokens. Tokens get added to the bucket at a constant rate. Each request consumes one token. If the bucket is empty, requests are rejected.
import time
import threading

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now
The key properties: bursts allowed up to bucket capacity, long-term rate equals refill rate. A bucket with capacity 100 and refill rate of 10 per second lets you burst to 100 requests, then sustains 10 per second.
Users get burst capacity when they need it, then settle into the sustainable rate.
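To make the burst-then-sustain behavior concrete, here is a condensed sketch of the bucket above with an injectable clock (a testing convenience, not part of the class as shown) so the refill can be simulated without sleeping:

```python
import threading

class TokenBucket:
    """Same refill logic as above, but the clock is injectable."""
    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.clock = clock
        self.last_refill = clock()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = self.clock()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

t = [0.0]  # simulated clock, in seconds
bucket = TokenBucket(capacity=5, refill_rate=1.0, clock=lambda: t[0])
burst = sum(bucket.allow_request() for _ in range(10))  # 5 allowed, 5 rejected
t[0] += 3.0  # three seconds pass -> three tokens refill
later = sum(bucket.allow_request() for _ in range(10))  # only 3 allowed now
print(burst, later)  # 5 3
```

The first ten requests drain the full capacity; after that, throughput is bounded by the refill rate alone.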
Sliding Window
Sliding window rate limiting addresses the boundary issues of fixed windows. Instead of resetting counters at window boundaries, sliding window tracks requests within a rolling time period.
import time
import threading
from collections import deque

class SlidingWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = deque()  # timestamps of accepted requests
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds
            # Remove entries that have aged out of the window
            while self.requests and self.requests[0] < cutoff:
                self.requests.popleft()
            if len(self.requests) < self.max_requests:
                self.requests.append(now)
                return True
            return False
The advantage over fixed window: no spike at window boundaries. With fixed windows, a client could make 100 requests at 11:59:59 and another 100 at 12:00:00, effectively doubling the rate. Sliding window prevents this.
The memory cost is proportional to the number of requests in the window. For high-volume APIs, this can add up.
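To illustrate the boundary claim, here is the same deque logic with an injectable clock (a testing convenience): three requests land just before a would-be boundary at t=10, and a fourth just after it is still rejected because the window slides rather than resets:

```python
import threading
from collections import deque

class SlidingWindow:
    """Same deque logic as above, with an injectable clock for simulation."""
    def __init__(self, max_requests, window_seconds, clock):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = deque()
        self.clock = clock
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = self.clock()
            cutoff = now - self.window_seconds
            while self.requests and self.requests[0] < cutoff:
                self.requests.popleft()
            if len(self.requests) < self.max_requests:
                self.requests.append(now)
                return True
            return False

t = [9.9]  # simulated clock: just before a fixed window would reset at t=10
limiter = SlidingWindow(max_requests=3, window_seconds=10, clock=lambda: t[0])
for _ in range(3):
    limiter.allow_request()             # budget used up at t=9.9
t[0] = 10.1
at_boundary = limiter.allow_request()   # False: still 3 requests in the last 10 s
t[0] = 20.0
after_window = limiter.allow_request()  # True: the earlier requests aged out
print(at_boundary, after_window)  # False True
```

A fixed window would have allowed the request at t=10.1, doubling the effective rate around the boundary.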
Fixed Window
Fixed window is the simplest algorithm. Divide time into fixed intervals. Each interval has a request budget. Count resets at interval boundaries.
import time
import threading

class FixedWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = time.time()
        self.count = 0
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            if now >= self.window_start + self.window_seconds:
                self.window_start = now
                self.count = 0
            if self.count < self.max_requests:
                self.count += 1
                return True
            return False
Fixed window is easy to implement and understand. The boundary spike problem is real, but for many applications it is acceptable. Memory usage stays constant.
Leaky Bucket
Leaky bucket works like a bucket with a hole in the bottom. Water (requests) enters at varying rates but leaks out at a constant rate. If the bucket overflows, requests are rejected.
import time
import threading

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests drained per second
        self.water = 0
        self.last_leak = time.time()
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            self._leak()
            if self.water < self.capacity:
                self.water += 1
                return True
            return False

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        leaked = elapsed * self.leak_rate
        self.water = max(0, self.water - leaked)
        self.last_leak = now
Leaky bucket outputs at a steady rate, making it useful for rate-limiting requests to downstream services. If your API calls a third-party service with rate limits, leaky bucket smooths your outbound traffic.
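A quick simulation (same logic as above, with an injectable clock added for the demo) shows the smoothing: a burst of five only partially fits, and capacity frees up at exactly the leak rate:

```python
import threading

class LeakyBucket:
    """Same leak logic as above, with an injectable clock for simulation."""
    def __init__(self, capacity, leak_rate, clock):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.water = 0.0
        self.clock = clock
        self.last_leak = clock()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = self.clock()
            self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
            self.last_leak = now
            if self.water < self.capacity:
                self.water += 1
                return True
            return False

t = [0.0]  # simulated clock
bucket = LeakyBucket(capacity=3, leak_rate=1.0, clock=lambda: t[0])
first = sum(bucket.allow_request() for _ in range(5))   # burst of 5: only 3 fit
t[0] = 2.0                                              # two seconds later, 2 have drained
second = sum(bucket.allow_request() for _ in range(5))  # room for exactly 2 more
print(first, second)  # 3 2
```

However the input arrives, admitted traffic drains downstream at the configured leak rate.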
Implementing Rate Limits in APIs
HTTP Headers
APIs communicate rate limit status through headers. There is no universal standard, but common conventions exist.
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
Retry-After: 3600
When a client exceeds the limit, return 429 Too Many Requests:
@app.route('/api/resource')
def get_resource():
    if not rate_limiter.allow_request():
        retry_after = int(rate_limiter.get_reset_time() - time.time())
        return {
            'error': 'Rate limit exceeded',
            'retry_after': retry_after
        }, 429, {'Retry-After': str(retry_after)}
    return {'data': 'result'}
Clients should read these headers and back off accordingly. Implement exponential backoff with jitter to avoid thundering herd problems.
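On the client side, one common scheme is "full jitter": each retry waits a random amount between zero and an exponentially growing, capped ceiling. A minimal sketch (function name and defaults are illustrative):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=random.random):
    """Full-jitter backoff: for attempt a, wait a random amount in
    [0, min(cap, base * 2**a)] before retrying."""
    return [rng() * min(cap, base * (2 ** a)) for a in range(attempts)]

delays = backoff_delays()
# Ceilings grow 1, 2, 4, 8, 16 s; the randomization spreads out clients
# that were rejected together so they do not all retry at the same instant
```

When the server supplies a Retry-After header, treat it as the minimum wait and apply jitter on top.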
Distributed Rate Limiting
In a single-server environment, the algorithms above work fine. But modern applications run across multiple servers. A client could hit different servers with separate counters, bypassing the rate limit entirely.
Distributed rate limiting uses a shared store to coordinate limits across instances.
Redis-Based Token Bucket
-- Redis token bucket implementation
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp
-- ARGV[4] = requested tokens (usually 1)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
return {allowed, tokens}
This Lua script executes atomically in Redis, preventing race conditions. The expiry ensures the key does not persist after the client stops making requests.
Redis-Based Sliding Window
Sliding window with Redis uses a sorted set. Each request gets a timestamp as the score and a unique member. To check the limit: remove entries older than the window, count remaining entries, add the new entry if within limit.
import redis
import time
import uuid

class RedisSlidingWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.redis = redis.Redis()

    def allow_request(self, key: str) -> bool:
        now = time.time()
        window_start = now - self.window_seconds
        request_id = str(uuid.uuid4())
        pipe = self.redis.pipeline()
        # Remove entries older than the window
        pipe.zremrangebyscore(key, 0, window_start)
        # Count the remaining entries
        pipe.zcard(key)
        results = pipe.execute()
        current_count = results[1]
        if current_count < self.max_requests:
            # Note: this check-then-add is not atomic across clients; for
            # strict enforcement, move the logic into a Lua script as above
            self.redis.zadd(key, {request_id: now})
            self.redis.expire(key, self.window_seconds + 1)
            return True
        return False
Rate Limiting Strategies
Per-User vs Per-IP
Rate limits can apply to authenticated users or to IP addresses. Per-user limits are fairer to shared IPs (multiple users behind corporate NAT). Per-IP limits protect against attacks from single sources.
Use both. Per-IP limits catch unauthenticated abuse. Per-user limits ensure authenticated users share resources fairly.
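A sketch of the combined check (CountingLimiter is a toy stand-in for any of the algorithms above; all names here are illustrative):

```python
class CountingLimiter:
    """Toy stand-in for any real limiter: allows N requests, total."""
    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.count = 0

    def allow_request(self):
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False

def allow(ip, user_id, limiters, ip_max=5, user_max=3):
    ip_limiter = limiters.setdefault(f"ip:{ip}", CountingLimiter(ip_max))
    user_limiter = limiters.setdefault(f"user:{user_id}", CountingLimiter(user_max))
    # Both layers must agree. Note the short-circuit: if the IP check
    # passes but the user check rejects, the IP token is still consumed.
    return ip_limiter.allow_request() and user_limiter.allow_request()

limiters = {}
results = [allow("10.0.0.1", "alice", limiters) for _ in range(4)]
print(results)  # [True, True, True, False] -- the per-user limit trips first
```

Whichever layer has the tighter budget ends up being the effective limit for a given client.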
Tiered Limits
Different subscription tiers get different rate limits. Free tier might get 100 requests per minute, paid 1,000, and enterprise 10,000.

def get_rate_limit(user: User) -> tuple[int, int]:
    tier = user.subscription_tier
    if tier == 'free':
        return (100, 60)  # 100 requests per 60 seconds
    elif tier == 'paid':
        return (1000, 60)
    elif tier == 'enterprise':
        return (10000, 60)
    return (100, 60)  # unknown tiers fall back to the free limit
Global vs Endpoint-Specific
You might want different limits for different endpoints. Public data endpoints could have generous limits. Write operations should have stricter limits. Admin endpoints might be limited to internal networks only.
ENDPOINT_LIMITS = {
    '/api/public/search': (1000, 60),
    '/api/private/write': (100, 60),
    '/api/admin/': (10000, 60),  # prefix: covers every admin endpoint
}

def check_rate_limit(endpoint: str, user: User) -> bool:
    # Prefix match so entries like '/api/admin/' apply to all paths under them
    limit = next(
        (v for k, v in ENDPOINT_LIMITS.items() if endpoint.startswith(k)),
        (100, 60),  # default limit for unlisted endpoints
    )
    return sliding_window.allow_request(f"user:{user.id}:{endpoint}", *limit)
Handling Rate Limit Violations
When a client exceeds the limit, return a proper response.
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 3600

{
    "error": "rate_limit_exceeded",
    "message": "API rate limit exceeded. Please retry after 3600 seconds.",
    "limit": 1000,
    "reset_at": "2026-03-22T13:00:00Z"
}
Log rate limit violations. A client repeatedly hitting limits might be malicious, or it might need a higher tier. Either way, you want to know.
Consider graduated responses. First violation gets a warning header. Repeated violations get actual rejections. This helps legitimate clients that are misconfigured.
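A sketch of graduated enforcement (the wrapped limiter and the warn budget are illustrative; OneShotLimiter is a toy stand-in for any limiter above):

```python
class OneShotLimiter:
    """Toy stand-in for any real limiter: allows n requests, total."""
    def __init__(self, n):
        self.n = n

    def allow_request(self):
        self.n -= 1
        return self.n >= 0

class GraduatedLimiter:
    """Warn on the first few violations, reject only after that."""
    def __init__(self, limiter, warn_budget=2):
        self.limiter = limiter
        self.warn_budget = warn_budget
        self.violations = {}

    def check(self, client_id):
        if self.limiter.allow_request():
            return "ok"
        v = self.violations.get(client_id, 0) + 1
        self.violations[client_id] = v
        # Map "warn" to a warning header on a 200, "reject" to a 429
        return "warn" if v <= self.warn_budget else "reject"

g = GraduatedLimiter(OneShotLimiter(1), warn_budget=2)
statuses = [g.check("client-1") for _ in range(4)]
print(statuses)  # ['ok', 'warn', 'warn', 'reject']
```

A misconfigured but well-intentioned client sees the warnings and can fix its behavior before it starts receiving hard rejections.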
Common Pitfalls
Not Being Atomic
Race conditions in distributed rate limiting cause limit bypass. Two requests arriving simultaneously on different servers both check the counter, both see room, both increment. Both get through when only one should.
Use atomic operations. Redis Lua scripts, conditional writes, or distributed locks. Pick one that matches your consistency requirements.
Ignoring Header Consistency
When using multiple rate-limiting instances, headers might show different values to the same client. Request hits Server A, sees 99 remaining. Request hits Server B, sees 100 remaining. This confuses clients tracking their usage.
Use a shared counter store and return values from it. Accept the latency cost.
Allowing Unrestricted Bursts
Token bucket capacity sets the maximum burst. Large bursts can overwhelm downstream services even if the long-term average is within limits.
Set burst limits based on what your infrastructure can actually handle, not on what seems generous.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Rate limit counter overflow | Limits enforced incorrectly; bypassed or too restrictive | Use appropriate data types; set maximum values; monitor for unusual patterns |
| Redis unavailable for distributed limits | Rate limiting fails open or closed | Decide on fail-open vs fail-closed; implement local fallback with reduced limits |
| Clock skew in sliding window | Requests incorrectly counted or allowed | Use Redis server time; NTP sync all rate-limiting servers |
| Thundering herd on limit reset | All clients retry simultaneously at window boundary | Use jitter in retry logic; consider sliding window instead of fixed |
| Client bypass via IP spoofing | Rate limits circumvented by malicious actors | Combine per-IP with per-user limits; implement multiple enforcement layers |
Observability Checklist
Metrics:
- Requests allowed vs rejected per endpoint
- Rate limit headroom (remaining vs total)
- Distributed counter sync latency
- 429 response rate by endpoint and client
- Token bucket utilization over time

Logs:
- Rate limit violations with client identifier
- Counter sync operations between nodes
- Algorithm selection per endpoint
- Burst detection (sudden spikes in requests)

Alerts:
- 429 response rate exceeds threshold (indicates abuse or misconfiguration)
- Rate limit headroom approaching zero
- Counter desynchronization between Redis instances
- Unusual spike in requests from a single source
Security Checklist
- Rate limiting enforced server-side (not just client-side)
- Per-IP limits combined with per-user limits
- Distributed rate limiting synchronized across all instances
- Atomic operations prevent counter bypass
- Rate limit headers do not expose internal implementation details
- Graduated response (warn before blocking) for misconfigured clients
- Anonymize PII in rate limiting logs
Common Anti-Patterns to Avoid
Rate Limiting Only at API Gateway
If the gateway is your only rate limiter, anything that bypasses it faces no limits at all. Implement defense in depth.
Fixed Window for High-Value APIs
The boundary spike in fixed window allows double the rate momentarily. Use sliding window for critical endpoints.
Not Considering Retry-After
Clients need to know when to retry. Always include Retry-After header and respect it in client implementations.
Ignoring Burst Behavior
Setting bucket capacity to the same as refill rate eliminates all burst capability. Allow some burst headroom.
Quick Recap
Key Bullets:
- Token bucket suits burst-friendly APIs; sliding window suits consistent-rate APIs
- Distributed rate limiting requires atomic operations via Redis Lua or similar
- Combine per-IP and per-user limits for defense in depth
- Always include Retry-After header; clients should respect it
- Monitor rate limit metrics to detect abuse and misconfiguration
Copy/Paste Checklist:
Rate Limiting Implementation:
[ ] Choose algorithm based on use case
[ ] Implement atomic distributed counter (Redis Lua)
[ ] Set burst capacity based on downstream tolerance
[ ] Implement per-IP and per-user limits
[ ] Add X-RateLimit-* headers to all responses
[ ] Include Retry-After on 429 responses
[ ] Implement graceful degradation (fail-open vs fail-closed)
[ ] Add jitter to client retry logic
[ ] Monitor violation rates per client
[ ] Test with concurrent requests from multiple nodes
When to Use Which Algorithm
Token bucket suits APIs where burst usage is normal. Users make many requests quickly while actively using a feature, then go idle.
Sliding window suits APIs needing smooth, predictable limiting. Better for services calling downstream APIs with strict rate limits.
Fixed window suits when you need simple, debuggable limits. The boundary spike might be acceptable for your use case.
Leaky bucket suits when you need to send requests at a steady rate to a downstream service.
For most web APIs, a token bucket with a Redis backend handles the job. It is well understood, memory-efficient, and extends naturally to a distributed deployment.
For more on related topics, see API Gateway Patterns, Resilience Patterns, and Circuit Breaker Pattern.