Key-Value Stores: Redis and DynamoDB Patterns

Learn Redis and DynamoDB key-value patterns for caching, sessions, leaderboards, TTL eviction policies, and storage tradeoffs.



The key-value store is the simplest kind of database. You have a key, you get a value. No queries, no joins, no complex schema. This simplicity is the feature.

Redis and DynamoDB represent two ends of the key-value spectrum. Redis is an in-memory store with optional persistence. DynamoDB is a fully managed, persistent, distributed key-value store with configurable consistency. Understanding both helps you choose the right tool.


The Key-Value Model

At its core, the model is:

# Basic operations
store.set(key, value)
value = store.get(key)
store.delete(key)
exists = store.exists(key)

No WHERE clauses. No aggregations. You know exactly where your data is.

This simplicity enables two things: extremely fast operations and horizontal scalability. When every lookup goes directly to a specific location, there is no query planner overhead, no index traversal, no join computation.
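
To make the interface concrete, here is a toy in-memory sketch of a key-value store with optional TTL. It is purely illustrative (a stand-in for the model, not how Redis or DynamoDB is implemented); the lazy-expiration-on-read trick is one strategy real stores combine with background sweeps.

```python
import time

class ToyKVStore:
    """Minimal key-value store with optional per-key TTL (illustrative only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._data[key]  # lazy expiration on read
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)

    def exists(self, key):
        return self.get(key) is not None
```

Every operation is a single dictionary lookup: O(1), no planner, no indexes.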

flowchart LR
    Client["Client"]
    Redis["Redis Cluster<br/>(3 Nodes)"]
    A["Node A<br/>hash_tag: {user}"]
    B["Node B<br/>hash_tag: {order}"]
    C["Node C<br/>hash_tag: {product}"]
    Client --> Redis
    Redis --> A & B & C

Redis Cluster shards data by hash slot (16384 slots total). The hash tag in curly braces determines which slot a key maps to — {user}:123 and {user}:456 go to the same node. DynamoDB works similarly under the hood: partition keys determine which storage node holds your data.
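
The slot computation is CRC16 of the key modulo 16384, using the hash tag instead of the whole key when one is present. A small Python sketch of the algorithm from the Redis Cluster specification:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem variant), the checksum Redis Cluster uses for slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Map a key to one of 16384 cluster slots, honoring {hash tags}."""
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end != -1 and end != start + 1:  # only a non-empty tag counts
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

Keys sharing a tag land on the same slot: hash_slot('{user}:123') == hash_slot('{user}:456'), which is what makes multi-key operations on them possible.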


Redis: In-Memory Speed

Redis stores data primarily in memory, with optional durability to disk. This makes it fast for read-heavy workloads.

Basic Operations

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# String operations
r.set('user:123:session', json.dumps(session_data))
r.set('rate:limit:192.168.1.1', 100, ex=60)  # With 60-second TTL
value = r.get('user:123:session')

# Multiple operations (pipelines reduce round trips)
pipe = r.pipeline()
pipe.hset('user:123', mapping={'name': 'Alice', 'email': 'alice@example.com'})
pipe.expire('user:123', 3600)
pipe.execute()

# Atomic counters
r.incr('api:request:count')
r.incrby('user:123:balance', 100)

Data Structures Beyond Strings

Redis is actually a data structure server. Values can be more than strings.

# Lists (queues, activity feeds)
r.lpush('queue:jobs', 'job1', 'job2', 'job3')
next_job = r.rpop('queue:jobs')

# Sets (unique items, tags)
r.sadd('user:123:likes', 'item1', 'item2', 'item3')
is_member = r.sismember('user:123:likes', 'item1')
all_items = r.smembers('user:123:likes')

# Sorted sets (leaderboards, priorities)
r.zadd('leaderboard', {'alice': 100, 'bob': 200, 'charlie': 150})
top_players = r.zrevrange('leaderboard', 0, 9, withscores=True)
rank = r.zrank('leaderboard', 'alice')

# Hashes (objects)
r.hset('product:SKU-001', mapping={
    'name': 'Gaming Laptop',
    'price': 1299.99,
    'stock': 50
})
product = r.hgetall('product:SKU-001')

TTL and Expiration

Redis handles time-limited data well.

# Session with 30-minute expiry
r.setex('session:abc123', 1800, json.dumps(session_data))

# Rate limiting: allow 100 requests per minute (fixed window)
def rate_limit(identifier, limit=100, window=60):
    key = f'ratelimit:{identifier}'
    count = r.incr(key)
    if count == 1:
        # First request in this window: start the TTL clock.
        # Calling EXPIRE on every request would keep pushing the
        # window forward, and the key might never expire.
        r.expire(key, window)
    return count <= limit  # False once the limit is exceeded

# Distributed locks (simplified)
def acquire_lock(lock_name, timeout=10):
    lock_key = f'lock:{lock_name}'
    # NX: only set if not already held; EX: auto-release after timeout
    acquired = r.set(lock_key, '1', nx=True, ex=timeout)
    return acquired is not None

def release_lock(lock_name):
    # Unsafe if the lock expired and another client re-acquired it:
    # this deletes regardless of owner. The token-based release
    # shown later fixes that.
    lock_key = f'lock:{lock_name}'
    r.delete(lock_key)

DynamoDB: Managed Persistent Storage

DynamoDB is a fully managed NoSQL database by AWS. It offers persistent storage with automatic sharding, eventual or strong consistency options, and pay-per-request pricing.

Table Structure

DynamoDB has a simple primary key: either a partition key alone, or a partition key plus sort key.

import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('UserSessions')

# Put and get items
table.put_item(Item={
    'userId': 'user-123',
    'sessionId': 'session-abc',
    'data': {'preferences': {'theme': 'dark'}},
    'expiresAt': int(time.time()) + 3600
})

response = table.get_item(
    Key={
        'userId': 'user-123',
        'sessionId': 'session-abc'
    }
)

Access Patterns Drive Key Design

In DynamoDB, you design keys based on how you query. Unlike relational databases where you can freely query any column, DynamoDB queries are limited to key attributes.

# Single table design: multiple entity types in one table
# Key: PK (partition key), SK (sort key)

# User entity
{'PK': 'USER#alice', 'SK': 'PROFILE', 'name': 'Alice', 'email': 'alice@example.com'}

# Orders for user
{'PK': 'USER#alice', 'SK': 'ORDER#2024-01-01', 'total': 129.99, 'items': [...]}
{'PK': 'USER#alice', 'SK': 'ORDER#2024-01-15', 'total': 49.99, 'items': [...]}

# Products
{'PK': 'PRODUCT#LAPTOP-001', 'SK': 'METADATA', 'name': 'Gaming Laptop', 'price': 1299.99}

# Query all orders for a user
from boto3.dynamodb.conditions import Key

response = table.query(
    KeyConditionExpression=Key('PK').eq('USER#alice') & Key('SK').begins_with('ORDER#')
)

# Query single user profile
response = table.get_item(
    Key={'PK': 'USER#alice', 'SK': 'PROFILE'}
)

Global Secondary Indexes

When you need alternative access patterns, use GSIs.

# Create table with GSI for email lookup
table = dynamodb.create_table(
    TableName='Users',
    KeySchema=[
        {'AttributeName': 'userId', 'KeyType': 'HASH'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'userId', 'AttributeType': 'S'},
        {'AttributeName': 'email', 'AttributeType': 'S'}
    ],
    GlobalSecondaryIndexes=[
        {
            'IndexName': 'EmailIndex',
            'KeySchema': [{'AttributeName': 'email', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'ALL'}
        }
    ],
    BillingMode='PAY_PER_REQUEST'
)

# Later, query by email
response = table.query(
    IndexName='EmailIndex',
    KeyConditionExpression=Key('email').eq('alice@example.com')
)

Use Cases: Where Key-Value Stores Excel

Caching

The classic use case. Store frequently accessed data in Redis to reduce database load.

def get_user_profile(user_id):
    cache_key = f'user:profile:{user_id}'

    # Try cache first
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss: fetch from the primary database
    profile = database.fetch_user(user_id)

    # Store in cache for next time (5-minute TTL)
    r.setex(cache_key, 300, json.dumps(profile))

    return profile

Session Storage

Sessions are naturally key-value: session ID maps to session data.

# Web session
session_id = cookies.get('session_id')
if not session_id:
    session_id = generate_session_id()
    r.setex(f'session:{session_id}', 86400, json.dumps({}))

raw = r.get(f'session:{session_id}')
session_data = json.loads(raw) if raw else {}  # key may have expired
session_data['page_views'] = session_data.get('page_views', 0) + 1
r.setex(f'session:{session_id}', 86400, json.dumps(session_data))

Leaderboards

Sorted sets make ranked lists straightforward.

def update_leaderboard(player_id, score):
    r.zadd('leaderboard', {player_id: score})

def get_top_players(n=10):
    return r.zrevrange('leaderboard', 0, n - 1, withscores=True)

def get_player_rank(player_id):
    # zrevrank counts from the highest score, so rank 0 is first place
    rank = r.zrevrank('leaderboard', player_id)
    return rank + 1 if rank is not None else None

Rate Limiting

Atomic operations enable distributed rate limiting.

def is_allowed(client_id, limit=100, window=60):
    key = f'ratelimit:{client_id}'

    # Lua script for atomic check-and-increment
    script = """
    local current = redis.call('GET', KEYS[1])
    if current and tonumber(current) >= tonumber(ARGV[1]) then
        return 0
    end
    current = redis.call('INCR', KEYS[1])
    if tonumber(current) == 1 then
        redis.call('EXPIRE', KEYS[1], ARGV[2])
    end
    return 1
    """

    result = r.eval(script, 1, key, limit, window)
    return bool(result)

Distributed Locks

Redis implements coordination primitives.

import uuid
import time

class RedisLock:
    def __init__(self, redis_client, lock_name, timeout=10):
        self.redis = redis_client
        self.lock_key = f'lock:{lock_name}'
        self.timeout = timeout
        self.token = str(uuid.uuid4())

    def acquire(self, blocking=True, blocking_timeout=10):
        start = time.time()
        while True:
            if self.redis.set(self.lock_key, self.token, nx=True, ex=self.timeout):
                return True
            if not blocking:
                return False
            if time.time() - start >= blocking_timeout:
                return False
            time.sleep(0.01)

    def release(self):
        # Only release if we own the lock
        script = """
        if redis.call('GET', KEYS[1]) == ARGV[1] then
            return redis.call('DEL', KEYS[1])
        else
            return 0
        end
        """
        self.redis.eval(script, 1, self.lock_key, self.token)

Eviction Policies

Redis offers multiple eviction policies when memory limits are reached.

# maxmemory set to 2gb
# maxmemory-policy defines how eviction works:

# noeviction:      reject writes, allow reads (default)
# allkeys-lru:     evict least recently used keys across all keys
# allkeys-lfu:     evict least frequently used keys across all keys
# allkeys-random:  evict random keys across all keys
# volatile-lru:    evict least recently used keys among keys with a TTL
# volatile-lfu:    evict least frequently used keys among keys with a TTL
# volatile-random: evict random keys among keys with a TTL
# volatile-ttl:    evict keys with the shortest remaining TTL

For caching, allkeys-lru is typically the best choice. For session storage with TTL, volatile-lru ensures only expiring keys get evicted.
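
To see what LRU eviction does, here is a toy cache built on Python's OrderedDict. This illustrates the policy only; Redis does not keep a true LRU list but approximates one by sampling a few keys and evicting the least recently used of the sample.

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least recently used key once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the LRU entry
```

With capacity 2, writing a third key evicts whichever of the first two was touched least recently, which is exactly the behavior you want from a cache under memory pressure.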


In-Memory vs Persistent: Tradeoffs

Characteristic | Redis (In-Memory)           | DynamoDB (Persistent)
Latency        | Sub-millisecond             | Single-digit milliseconds
Durability     | Optional (RDB/AOF)          | Always durable
Scalability    | Vertical plus read replicas | Automatic sharding
Cost           | Pay for memory              | Pay for throughput
Complexity     | Single instance is simple   | Requires understanding capacity
Data size      | Limited by memory           | Virtually unlimited

Redis with persistence is not equivalent to DynamoDB. Redis persistence (RDB snapshots, AOF logs) guards against crashes but not against disk failures. DynamoDB’s architecture distributes data across multiple AZs by default.

For true durability in Redis, you need replication to follower instances that can take over if the primary fails.


Common Production Failures

Redis OOM kills crashing the database: Redis runs out of memory and the OS kernel kills it via OOM killer. Your application starts returning errors and recovery takes time to reload data. Set maxmemory and maxmemory-policy, configure vm.overcommit_memory=1 at the OS level, and have a reload plan from your primary database ready.

Keys without TTL filling up Redis: Someone writes SET mykey value instead of SET mykey value EX 3600 in the rate limiter. The key never expires. Redis fills up, allkeys-lru starts evicting keys you actually need. Default to TTL on every key — a missing expiration is a bug, not a feature.

DynamoDB hot partition throttling: Your session table uses userId as the partition key. Turns out user-0000 is the test account that every load test hits. That partition maxes out at 3,000 RCU while the rest of the table sits idle. Spread traffic across partition keys, and implement exponential backoff with jitter when you hit ProvisionedThroughputExceededException.
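
A minimal full-jitter backoff helper for retrying throttled requests (the function name and defaults are ours; the full-jitter strategy itself is AWS's published recommendation):

```python
import random

def backoff_delay(attempt, base=0.05, cap=5.0):
    """Full-jitter exponential backoff: return a random sleep duration in
    [0, min(cap, base * 2**attempt)] before retry number `attempt`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Sketch of the retry loop (table, key, and the exception path are
# illustrative; in boto3 the error surfaces as a ClientError):
# for attempt in range(max_retries):
#     try:
#         return table.get_item(Key=key)
#     except ProvisionedThroughputExceededException:
#         time.sleep(backoff_delay(attempt))
```

The jitter matters: without it, every throttled client retries at the same instant and the thundering herd re-creates the spike it is backing off from.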

AOF always mode killing Redis write throughput: You enable AOF with appendfsync always for “maximum durability”. Under write load, Redis blocks on every sync to disk and throughput collapses. Switch to appendfsync everysec or appendfsync no, and match your disk I/O to the write rate if you use everysec.

Distributed lock released by wrong owner: The lock code calls DEL on expiry. But the lock expired while your work was still running, another client grabbed it, and now two processes hold the same lock. The token-based release (compare-and-delete via Lua script) shown in the code example prevents this — only the lock owner can release it.

DynamoDB single-table design gone wrong: You pack 15 entity types into one table to avoid “wasting” tables. The GSI situation becomes a mess and access patterns start colliding. Single-table design shines with 3-5 entity types that share access patterns. Beyond that, the operational complexity wins.


Capacity Estimation: Memory-per-Key and DynamoDB Partition Math

Redis memory is the primary constraint. Each key-value pair has overhead on top of the value size. A string key with a 100-byte value uses roughly 100 bytes plus 40-60 bytes of key metadata (keyspace overhead, pointer to value, expiry metadata if set). A hash with 10 fields uses more — each field name is stored alongside the value.

Rough formula for Redis memory: total_keys * (avg_key_size + avg_value_size + 40 bytes_overhead). For a cache with 1 million keys averaging 500 bytes each: roughly 540 MB plus Redis’s own internal fragmentation. Set maxmemory with headroom — at 80% used, Redis starts evicting and you lose predictability.
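
The arithmetic above as a quick sizing helper (the 40-byte per-key overhead is a rough approximation, not an exact figure, and real usage will be higher due to fragmentation):

```python
def redis_memory_estimate(num_keys, avg_entry_bytes, overhead_bytes=40):
    """Rough Redis footprint in bytes: key + value payload plus per-key
    overhead. Ignores allocator fragmentation, so treat it as a floor."""
    return num_keys * (avg_entry_bytes + overhead_bytes)

# The example from the text: 1 million keys averaging 500 bytes each
total = redis_memory_estimate(1_000_000, 500)
print(f"{total / 1_000_000:.0f} MB")  # 540 MB
```

Run this against your own key counts before picking an instance size, then leave headroom on top for fragmentation and eviction slack.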

For TTLs, the expiration queue has a cost: Redis scans the keyspace periodically to expire keys. With millions of keys with TTLs, ACTIVE_EXPIRE_CYCLE_SLOW runs every 100ms. This is generally fast but if you set TTLs on millions of keys at once (a batch import, for example), Redis can briefly block. The latency-sensitive work — expiring the keys you care about — happens incrementally.

DynamoDB partition sizing: DynamoDB creates partitions based on partition key values and throughput. Each partition supports up to 3,000 RCU, 1,000 WCU, and 10 GB of data. For a table with 50 GB and 5,000 RCU, storage alone requires at least 5 partitions (10 GB each), while throughput alone would need only 2 (5,000 / 3,000, rounded up), so storage is the binding constraint in this example.

If a partition grows past 10 GB, DynamoDB splits it automatically. If you push more than 1,000 WCU (or 3,000 RCU) at a single partition, you get throttled; these are hard per-partition limits. Adaptive capacity can shift unused table throughput toward a hot partition and will eventually split one that stays hot, but sustained traffic above the per-partition limits requires better key distribution, such as write sharding with a random key suffix.
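
The partition math from the paragraphs above, as a helper. It assumes the documented per-partition limits (3,000 RCU, 1,000 WCU, 10 GB); the answer is whichever dimension demands the most partitions:

```python
import math

def min_partitions(size_gb, rcu, wcu):
    """Minimum DynamoDB partition count implied by storage and throughput,
    using the per-partition limits of 10 GB, 3,000 RCU, and 1,000 WCU."""
    by_size = math.ceil(size_gb / 10)
    by_rcu = math.ceil(rcu / 3000)
    by_wcu = math.ceil(wcu / 1000)
    return max(by_size, by_rcu, by_wcu)

# 50 GB table with 5,000 RCU and 500 WCU:
print(min_partitions(50, 5000, 500))  # 5 (storage is the binding constraint)
```

Knowing the implied partition count tells you how thinly your provisioned throughput is spread: 5,000 RCU over 5 partitions is only 1,000 RCU per partition if traffic is even, and far less headroom if it is not.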

Observability Hooks: Redis INFO and DynamoDB CloudWatch

For Redis, INFO is the primary observability interface. The output has sections: Server (Redis version, build info, uptime), Clients (connected clients, blocked clients), Memory (used_memory, used_memory_peak, mem_fragmentation_ratio), Persistence (RDB/AOF status), Stats (total commands processed, keyspace hits/misses), Replication (role, master link status), CPU, and latency stats.

The metrics to watch in production: mem_fragmentation_ratio above 1.5 means Redis is using 50% more memory than it needs — restart Redis or adjust activedefrag settings. keyspace_hits / (keyspace_hits + keyspace_misses) is your cache hit ratio. Below 80% and your cache is not earning its memory. instantaneous_ops_per_sec shows current throughput. connected_clients approaching maxclients (default 10,000) is a connection exhaustion warning.
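
Computing the hit ratio from those counters is a one-liner worth wiring into your monitoring (the helper name and the 80% threshold from the rule of thumb above are ours):

```python
def cache_hit_ratio(keyspace_hits, keyspace_misses):
    """Hit ratio from the Stats section of INFO; None if no lookups yet."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else None

# With a live redis-py client this would look like:
# stats = r.info('stats')
# ratio = cache_hit_ratio(stats['keyspace_hits'], stats['keyspace_misses'])
# if ratio is not None and ratio < 0.80:
#     alert('cache hit ratio below 80%')  # alert() is hypothetical

print(cache_hit_ratio(900, 100))  # 0.9
```

Note that the counters are cumulative since startup, so for a meaningful ratio compute it over deltas between two samples rather than over the raw totals.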

For DynamoDB, CloudWatch metrics matter: ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits against ProvisionedThroughput. If you consistently consume above 80% of provisioned, DynamoDB autoscales or you need to increase. ThrottledRequests is the key alarm — non-zero throttling means your application is failing requests. UserErrors with ProvisionedThroughputExceededException means your keys are unevenly distributed. ReplicationLatency matters for global tables — cross-region replication lag should stay under 1 second.

Real-World Case Study: Twitter’s Redis Timeline Storage

Twitter’s early architecture used Redis to store user timelines — the list of tweets a user sees when they open their home timeline. The problem was timeline size: celebrities with millions of followers had timelines that could not be computed on read (pulling from all followed users’ tweets in real time was too slow).

Their solution was write-time fanout: when a user posts a tweet, Twitter pushes that tweet into the timelines of all their followers via a Redis pipeline. Reading a timeline became a simple Redis range read — fast, predictable. The tradeoff was write amplification: one tweet from a celebrity required writes to millions of Redis instances.

The operational consequence: the Redis cluster storing timelines was among the largest in production. Memory pressure was constant. When a Redis node failed, millions of users noticed missing timeline entries until repair ran. Twitter’s fix was a hybrid model — push timelines for active users, pull from graph for inactive ones. The lesson: Redis makes reads cheap at the cost of write amplification and memory pressure. Design for the read-to-write ratio of your actual workload.

Interview Questions

Q: Your Redis instance is using 8 GB of memory out of 10 GB max. You restart the process and memory drops to 5 GB. What happened?

Redis does not immediately return freed memory to the OS — used_memory reflects the allocator’s view, not the OS’s. After a restart, the allocator gets a fresh heap and fragmentation memory is recovered. The 3 GB drop was likely memory fragmentation (the allocator holding freed chunks that have not been coalesced) and dirty pages waiting to write to AOF or RDB snapshots. Check mem_fragmentation_ratio — if it was above 1.5, fragmentation was eating your memory. The fix is either restarting Redis periodically or enabling activedefrag (Redis 4.0+).

Q: You are using DynamoDB with a composite primary key (userId, timestamp). Your application is reading a user’s data range within a time window but latency has spiked. What do you check?

First check CloudWatch ThrottledRequests and ConsumedReadCapacityUnits. If the table is in provisioned mode and the partition key userId has hot spots (one user receiving far more reads than others), that partition maxes out at 3,000 RCU and throttles. Check key distribution with CloudWatch Contributor Insights; if one key is consuming disproportionate capacity, redesign the key, for example by adding a random suffix to userId to spread the load across partitions. If the table is on-demand, check whether DynamoDB adaptive capacity is catching up; it splits hot partitions but takes a few minutes under sustained load.

Q: When would you choose Redis over DynamoDB for storing session data?

Redis when session data is ephemeral enough that losing it is acceptable (users can re-authenticate) and when sub-millisecond latency matters. Redis sessions work well in single-region deployments where recovery from a Redis failure is fast (users re-login quickly). DynamoDB when sessions must survive regional failures (cross-region replication with global tables), when you need auditability (DynamoDB Streams captures session changes), or when your session data has complex querying needs (you need to query sessions by arbitrary attributes). For most web applications, the deciding factor is whether a session lookup failure should require re-authentication or not.

Security Checklist

  • Use TLS for all Redis client connections to prevent credential interception on the wire
  • Set bind to localhost or an internal interface only; never expose Redis to the internet
  • Enable Redis AUTH or ACLs to require passwords; use named ACL users with least-privilege command permissions in production
  • For DynamoDB: apply IAM policies with least-privilege access; use VPC endpoints and endpoint policies to restrict DynamoDB access to traffic from within your VPC
  • Encrypt DynamoDB tables at rest using AWS-managed or customer-managed KMS keys; for Redis, keep protected mode enabled and configure TLS for encryption in transit
  • Implement key expiration policies (TTL) to automatically purge sensitive session data rather than relying on manual deletion

Common Pitfalls and Anti-Patterns

Using a single key for everything: Storing serialized objects (e.g., JSON blobs) under one key prevents efficient partial updates and makes TTL management impossible. Fix: use hierarchical key naming (user:{id}:session:{session_id}) for granular control.

Ignoring memory eviction under memory pressure: Redis maxmemory eviction policies can silently drop data. noeviction will reject writes instead. Fix: choose allkeys-lru or volatile-lru based on whether all keys have TTL, and monitor evicted_keys metric.

Storing unbounded growth data in Redis: Using Redis lists or sorted sets without size limits leads to unbounded memory growth. Fix: enforce maximum list/set sizes using LTRIM or ZREMRANGEBYRANK on every write, or switch to a bounded data store for bulk data.

DynamoDB hot partitions: A partition key with low cardinality (e.g., a fixed attribute like region = "US") concentrates all read/write throughput on a single partition. Fix: add a high-cardinality suffix to partition keys or use a composite partition key.

Forgetting TTL for ephemeral data: Session data and caches that do not expire consume memory indefinitely. Fix: always set TTL on ephemeral data; use Redis EXPIRE or DynamoDB’s TTL attribute.

Quick Recap Checklist

  • Redis suits sub-millisecond reads, caching, and ephemeral data; DynamoDB suits managed scalable persistent storage
  • Design partition keys with high cardinality to avoid hot partitions on DynamoDB
  • Use TTL on all ephemeral data to prevent unbounded memory growth
  • Monitor eviction rates and memory fragmentation; set maxmemory-policy appropriate to your access patterns
  • Secure Redis with TLS, ACLs, and bind restrictions; secure DynamoDB with IAM policies and VPC endpoints

Conclusion

Key-value stores trade query flexibility for raw speed and simplicity. Redis works well as a cache, session store, and real-time data structure server where sub-millisecond latency matters. DynamoDB works well as a managed, scalable, durable store for application data with predictable access patterns.

The key insight is that key-value stores require knowing your access patterns upfront. You cannot query on arbitrary fields. Design your keys around how you actually read data, not around the structure of the data itself.

For more on NoSQL databases, see the NoSQL overview. To learn about caching strategies, see Caching Strategies. For comparison with Redis and Memcached, see Redis vs Memcached.

Related Posts

Column-Family Databases: Cassandra and HBase Architecture

Cassandra and HBase data storage explained. Learn partition key design, column families, time-series modeling, and consistency tradeoffs.

#database #nosql #column-family

Document Databases: MongoDB and CouchDB Data Modeling

Learn MongoDB and CouchDB data modeling, embedding vs referencing, schema validation, and when document stores fit better than relational databases.

#database #nosql #document-database

Graph Databases: Neo4j and Graph Traversal Patterns

Learn Neo4j graph database modeling with Cypher. Covers nodes, edges, social networks, recommendation engines, fraud detection, and when graphs are not the right fit.

#database #nosql #graph-database