Key-Value Stores: Redis and DynamoDB Patterns

Learn Redis and DynamoDB key-value patterns for caching, sessions, leaderboards, TTL eviction policies, and storage tradeoffs.

Reading time: 27 min read. Author: GeekWorkBench


The key-value store is the simplest kind of database. You have a key, you get a value. No queries, no joins, no complex schema. This simplicity is the feature.

Redis and DynamoDB represent two ends of the key-value spectrum. Redis is an in-memory store with optional persistence. DynamoDB is a fully managed, persistent, distributed key-value store with configurable consistency. Understanding both helps you choose the right tool.


The Key-Value Model

At its core, the model is:

# Basic operations
store.set(key, value)
value = store.get(key)
store.delete(key)
exists = store.exists(key)

No WHERE clauses. No aggregations. You know exactly where your data is.

This simplicity enables two things: extremely fast operations and horizontal scalability. When every lookup goes directly to a specific location, there is no query planner overhead, no index traversal, no join computation.

flowchart LR
    Client["Client"]
    Redis["Redis Cluster<br/>(3 Nodes)"]
    A["Node A<br/>hash_tag: {user}"]
    B["Node B<br/>hash_tag: {order}"]
    C["Node C<br/>hash_tag: {product}"]
    Client --> Redis
    Redis --> A & B & C

Redis Cluster shards data by hash slot (16384 slots total). The hash tag in curly braces determines which slot a key maps to — {user}:123 and {user}:456 go to the same node. DynamoDB works similarly under the hood: partition keys determine which storage node holds your data.
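To make the slot mapping concrete, here is a minimal sketch of how a key can be mapped to one of the 16384 slots. It assumes the scheme described in the Redis Cluster specification (CRC16, XMODEM variant, of the key or of its hash tag, modulo 16384) and uses Python's binascii.crc_hqx for the checksum. The function names are illustrative, not part of any Redis client.

```python
import binascii

HASH_SLOTS = 16384  # total number of slots in Redis Cluster


def hash_tag_or_key(key: str) -> str:
    """Return the part of the key that gets hashed."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        # A tag counts only if at least one character sits between the braces.
        if end != -1 and end > start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the hashed part, modulo the slot count."""
    data = hash_tag_or_key(key).encode("utf-8")
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


# Both keys hash only the tag "user", so they land in the same slot.
print(key_slot("{user}:123") == key_slot("{user}:456"))
```

Keys that share a tag always share a slot, which is what makes multi-key commands on them possible in a cluster. The flip side is that a very popular tag concentrates load on one node.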


Redis: In-Memory Speed

Redis stores data primarily in memory, with optional durability to disk. Serving both reads and writes from RAM is what makes it fast.

Basic Operations

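import json

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Example payload; in a real application this comes from the request
session_data = {'user_id': 123, 'theme': 'dark'}

# String operations
r.set('user:123:session', json.dumps(session_data))
r.set('rate:limit:192.168.1.1', 100, ex=60)  # With 60-second TTL
value = r.get('user:123:session')  # bytes, or None if the key is missing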

# Multiple operations (pipelines reduce round trips)
pipe = r.pipeline()
pipe.hset('user:123', mapping={'name': 'Alice', 'email': 'alice@example.com'})
pipe.expire('user:123', 3600)
pipe.execute()

# Atomic counters
r.incr('api:request:count')
r.incrby('user:123:balance', 100)

Data Structures Beyond Strings

Redis is actually a data structure server. Values can be more than strings.

# Lists (queues, activity feeds)
r.lpush('queue:jobs', 'job1', 'job2', 'job3')
next_job = r.rpop('queue:jobs')

# Sets (unique items, tags)
r.sadd('user:123:likes', 'item1', 'item2', 'item3')
is_member = r.sismember('user:123:likes', 'item1')
all_items = r.smembers('user:123:likes')

# Sorted sets (leaderboards, priorities)
r.zadd('leaderboard', {'alice': 100, 'bob': 200, 'charlie': 150})
top_players = r.zrevrange('leaderboard', 0, 9, withscores=True)
rank = r.zrevrank('leaderboard', 'alice')  # 0-based, highest score first

# Hashes (objects)
r.hset('product:SKU-001', mapping={
    'name': 'Gaming Laptop',
    'price': 1299.99,
    'stock': 50
})
product = r.hgetall('product:SKU-001')

TTL and Expiration

Redis handles time-limited data well.

# Session with 30-minute expiry
r.setex('session:abc123', 1800, json.dumps(session_data))

# Rate limiting: allow 100 requests per minute (fixed window)
def rate_limit(identifier, limit=100, window=60):
    key = f'ratelimit:{identifier}'

    # INCR is atomic, so concurrent callers each see a distinct count
    current = r.incr(key)
    if current == 1:
        # First request in this window: set the expiry once, so later
        # requests do not keep pushing the window forward
        r.expire(key, window)

    return current <= limit  # False means the rate limit is exceeded

# Distributed locks
def acquire_lock(lock_name, timeout=10):
    lock_key = f'lock:{lock_name}'
    acquired = r.set(lock_key, '1', nx=True, ex=timeout)
    return acquired

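def release_lock(lock_name):
    # Naive release: deletes the key even if the lock expired and now
    # belongs to another client (see the token-based RedisLock class)
    lock_key = f'lock:{lock_name}'
    r.delete(lock_key)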

DynamoDB: Managed Persistent Storage

DynamoDB is a fully managed NoSQL database by AWS. It offers persistent storage with automatic sharding, eventual or strong consistency options, and a choice of on-demand (pay-per-request) or provisioned capacity pricing.

Table Structure

DynamoDB has a simple primary key: either a partition key alone, or a partition key plus sort key.

import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('UserSessions')

# Put and get items
table.put_item(Item={
    'userId': 'user-123',
    'sessionId': 'session-abc',
    'data': {'preferences': {'theme': 'dark'}},
    'expiresAt': int(time.time()) + 3600
})

response = table.get_item(
    Key={
        'userId': 'user-123',
        'sessionId': 'session-abc'
    }
)

Access Patterns Drive Key Design

In DynamoDB, you design keys based on how you query. Unlike relational databases where you can freely query any column, DynamoDB queries are limited to key attributes.

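from boto3.dynamodb.conditions import Key

# Single table design: multiple entity types in one table
# Key: PK (partition key), SK (sort key)
# Item shapes below are illustrative; boto3 expects Decimal, not float, for numbers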

# User entity
{'PK': 'USER#alice', 'SK': 'PROFILE', 'name': 'Alice', 'email': 'alice@example.com'}

# Orders for user
{'PK': 'USER#alice', 'SK': 'ORDER#2024-01-01', 'total': 129.99, 'items': [...]}
{'PK': 'USER#alice', 'SK': 'ORDER#2024-01-15', 'total': 49.99, 'items': [...]}

# Products
{'PK': 'PRODUCT#LAPTOP-001', 'SK': 'METADATA', 'name': 'Gaming Laptop', 'price': 1299.99}

# Query all orders for a user
response = table.query(
    KeyConditionExpression=Key('PK').eq('USER#alice') & Key('SK').begins_with('ORDER#')
)

# Query single user profile
response = table.get_item(
    Key={'PK': 'USER#alice', 'SK': 'PROFILE'}
)

Global Secondary Indexes

When you need alternative access patterns, use GSIs.

# Create table with GSI for email lookup
table = dynamodb.create_table(
    TableName='Users',
    KeySchema=[
        {'AttributeName': 'userId', 'KeyType': 'HASH'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'userId', 'AttributeType': 'S'},
        {'AttributeName': 'email', 'AttributeType': 'S'}
    ],
    GlobalSecondaryIndexes=[
        {
            'IndexName': 'EmailIndex',
            'KeySchema': [{'AttributeName': 'email', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'ALL'}
        }
    ],
    BillingMode='PAY_PER_REQUEST'
)

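# Wait until the table is ACTIVE before using it
table.wait_until_exists()

# Later, query by email (Key comes from boto3.dynamodb.conditions)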
response = table.query(
    IndexName='EmailIndex',
    KeyConditionExpression=Key('email').eq('alice@example.com')
)

Use Cases: Where Key-Value Stores Excel

Caching

The classic use case. Store frequently accessed data in Redis to reduce database load.

def get_user_profile(user_id):
    cache_key = f'user:profile:{user_id}'

    # Try cache first (r is the redis.Redis client created earlier)
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss - fetch from database
    profile = database.fetch_user(user_id)

    # Store in cache for next time
    r.setex(cache_key, 300, json.dumps(profile))  # 5 minute TTL

    return profile

Session Storage

Sessions are naturally key-value: session ID maps to session data.

# Web session (r is the redis.Redis client created earlier)
session_id = cookies.get('session_id')
raw = r.get(f'session:{session_id}') if session_id else None

if raw is None:
    # No cookie, or the session already expired in Redis: start fresh
    session_id = generate_session_id()
    session_data = {}
else:
    session_data = json.loads(raw)

session_data['page_views'] = session_data.get('page_views', 0) + 1
r.setex(f'session:{session_id}', 86400, json.dumps(session_data))

Leaderboards

Sorted sets make ranked lists straightforward.

def update_leaderboard(player_id, score):
    r.zadd('leaderboard', {player_id: score})

def get_top_players(n=10):
    return r.zrevrange('leaderboard', 0, n-1, withscores=True)

def get_player_rank(player_id):
    # ZREVRANK ranks by descending score, so the top scorer gets rank 1
    rank = r.zrevrank('leaderboard', player_id)
    return rank + 1 if rank is not None else None

Rate Limiting

Atomic operations enable distributed rate limiting.

def is_allowed(client_id, limit=100, window=60):
    key = f'ratelimit:{client_id}'

    # Lua script for atomic check-and-increment
    script = """
    local current = redis.call('GET', KEYS[1])
    if current and tonumber(current) >= tonumber(ARGV[1]) then
        return 0
    end
    current = redis.call('INCR', KEYS[1])
    if tonumber(current) == 1 then
        redis.call('EXPIRE', KEYS[1], ARGV[2])
    end
    return 1
    """

    result = r.eval(script, 1, key, limit, window)
    return bool(result)

Distributed Locks

Redis implements coordination primitives.

import uuid
import time

class RedisLock:
    def __init__(self, redis_client, lock_name, timeout=10):
        self.redis = redis_client
        self.lock_key = f'lock:{lock_name}'
        self.timeout = timeout
        self.token = str(uuid.uuid4())

    def acquire(self, blocking=True, blocking_timeout=10):
        start = time.time()
        while True:
            if self.redis.set(self.lock_key, self.token, nx=True, ex=self.timeout):
                return True
            if not blocking:
                return False
            if time.time() - start >= blocking_timeout:
                return False
            time.sleep(0.01)

    def release(self):
        # Only release if we own the lock
        script = """
        if redis.call('GET', KEYS[1]) == ARGV[1] then
            return redis.call('DEL', KEYS[1])
        else
            return 0
        end
        """
        self.redis.eval(script, 1, self.lock_key, self.token)

Eviction Policies

Redis offers multiple eviction policies when memory limits are reached.

# maxmemory set to 2gb
# maxmemory-policy defines how eviction works

# noeviction: reject writes, reads allowed (default)
# allkeys-lru: evict least recently used keys across all keys
# allkeys-random: evict random keys
# volatile-lru: evict LRU keys only in keys with TTL
# volatile-random: evict random keys with TTL
# volatile-ttl: evict keys with shortest TTL
# allkeys-lfu: evict least frequently used keys
# volatile-lfu: evict LFU keys with TTL

For caching, allkeys-lru is typically the best choice. For session storage with TTL, volatile-lru ensures only expiring keys get evicted.


In-Memory vs Persistent: Tradeoffs

| Characteristic | Redis (In-Memory) | DynamoDB (Persistent) |
| --- | --- | --- |
| Latency | Sub-millisecond | Single-digit milliseconds |
| Durability | Optional (RDB/AOF) | Always durable |
| Scalability | Vertical plus read replicas | Automatic sharding |
| Cost | Pay for memory | Pay for throughput |
| Complexity | Single instance relatively simple | Requires understanding capacity |
| Data size | Limited by memory | Virtually unlimited |

Redis with persistence is not equivalent to DynamoDB. Redis persistence (RDB snapshots, AOF logs) guards against crashes but not against disk failures. DynamoDB’s architecture distributes data across multiple AZs by default.

For true durability in Redis, you need replication to follower instances that can take over if the primary fails.


Common Production Failures

Redis OOM kills crashing the database: Redis runs out of memory and the OS kernel kills it via OOM killer. Your application starts returning errors and recovery takes time to reload data. Set maxmemory and maxmemory-policy, configure vm.overcommit_memory=1 at the OS level, and have a reload plan from your primary database ready.

Keys without TTL filling up Redis: Someone writes SET mykey value instead of SET mykey value EX 3600 in the rate limiter. The key never expires. Redis fills up, allkeys-lru starts evicting keys you actually need. Default to TTL on every key—a missing expiration is a bug, not a feature.

DynamoDB hot partition throttling: Your session table uses userId as the partition key. Turns out user-0000 is the test account that every load test hits. That one partition hits its per-partition ceiling (3,000 RCU or 1,000 WCU) while the rest of the table sits idle. Spread traffic across partition keys, and implement exponential backoff with jitter when you hit ProvisionedThroughputExceededException.

AOF always mode killing Redis write throughput: You enable AOF with appendfsync always for “maximum durability”. Under write load, Redis blocks on every sync to disk and throughput collapses. Switch to appendfsync everysec or appendfsync no, and match your disk I/O to the write rate if you use everysec.

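Distributed lock released by wrong owner: The lock code calls DEL unconditionally when the work finishes. But the lock expired while your work was still running and another client grabbed it, so your DEL removes that client's lock and a third process can acquire it while the second is still working. The token-based release (compare-and-delete via Lua script) shown in the code example prevents this: only the current owner can release the lock. It does not stop work from overrunning the expiry, so size the timeout for the critical section.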

DynamoDB single-table design gone wrong: You pack 15 entity types into one table to avoid “wasting” tables. The GSI situation becomes a mess and access patterns start colliding. Single-table design shines with 3-5 entity types that share access patterns. Beyond that, the operational complexity wins.
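The exponential backoff with jitter mentioned for hot partition throttling can be sketched as follows. This is a minimal illustration using the "full jitter" variant; the function names, base delay, cap, and attempt count are assumptions chosen for the example, and is_throttle is a caller-supplied predicate rather than anything provided by boto3.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_backoff(operation, is_throttle, max_attempts: int = 5,
                      sleep=time.sleep):
    """Run operation(), retrying while is_throttle(exc) says it was throttled."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # Re-raise non-throttling errors, or when attempts are exhausted.
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

With boto3, the predicate would typically inspect the error code on the raised client error. Note that the AWS SDKs already retry throttled calls a few times on their own, so a wrapper like this mostly matters for the outer, application-level retry.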


Capacity Estimation: Memory-per-Key and DynamoDB Partition Math

Redis memory is the primary constraint. Each key-value pair has overhead on top of the value size. A string key with a 100-byte value uses roughly 100 bytes plus 40-60 bytes of key metadata (keyspace overhead, pointer to value, expiry metadata if set). A hash with 10 fields uses more — each field name is stored alongside the value.

Rough formula for Redis memory: total_keys * (avg_key_size + avg_value_size + 40 bytes_overhead). For a cache with 1 million keys averaging 500 bytes of key plus value each: roughly 540 MB, plus allocator fragmentation. Set maxmemory with headroom above that estimate: once usage reaches maxmemory, Redis starts evicting (or rejecting writes under noeviction) and you lose predictability.
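The rough formula translates directly into a small calculator. This is a back-of-the-envelope sketch: the 40-byte overhead comes from the estimate above, while the 25% headroom default is an assumption for illustration, not a Redis recommendation.

```python
def estimate_redis_memory(total_keys: int, avg_key_value_bytes: int,
                          overhead_bytes: int = 40) -> int:
    """Estimated bytes: keys * (average key+value size + per-key overhead)."""
    return total_keys * (avg_key_value_bytes + overhead_bytes)


def suggested_maxmemory(estimated_bytes: int, headroom: float = 0.25) -> int:
    """Add headroom on top of the estimate (the 25% default is an assumption)."""
    return int(estimated_bytes * (1 + headroom))


estimate = estimate_redis_memory(1_000_000, 500)
print(estimate / 1_000_000)  # 540.0 (decimal MB), matching the example above
```

Measure against a real instance before committing to a number: actual usage depends on value encoding and allocator behaviour, which this formula ignores.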

For TTLs, expiration has a cost: Redis periodically samples keys that have a TTL and deletes the expired ones, in addition to expiring keys lazily when they are accessed. The slow active expire cycle (ACTIVE_EXPIRE_CYCLE_SLOW in the source) runs hz times per second, which is every 100ms at the default hz of 10. This is generally fast, but if millions of keys expire at the same moment (a batch import that gives every key the same TTL, for example), the cycle has more to do and latency can briefly rise. Otherwise the expiry work happens incrementally.
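One common way to keep a batch of keys from expiring at the same instant is to add a little randomness to each TTL. The sketch below is an illustrative mitigation, not something Redis does for you; the function name and the 10% spread are assumptions.

```python
import random


def jittered_ttl(base_seconds: int, spread: float = 0.1) -> int:
    """Scale base_seconds by a random factor in [1 - spread, 1 + spread]."""
    factor = random.uniform(1 - spread, 1 + spread)
    return max(1, int(base_seconds * factor))


# A one-hour TTL becomes something between roughly 54 and 66 minutes.
print(jittered_ttl(3600))
```

During a batch import, passing jittered_ttl(3600) instead of a constant 3600 as the expiry spreads the expirations over several minutes rather than a single tick.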

DynamoDB partition sizing: DynamoDB creates partitions based on data size and throughput. Each partition supports up to 3,000 RCU, 1,000 WCU, and 10 GB of data. For a table with 50 GB and 5,000 RCU, you need at minimum 5 partitions for storage (10 GB each). The read throughput alone would need only 2 (5,000 / 3,000, rounded up), so here data size is the binding constraint.

If you exceed 10 GB per partition, DynamoDB splits automatically. If you exceed 1,000 WCU on a single partition before splitting, you get throttled. Adaptive capacity lets a hot partition borrow unused throughput from the rest of the table, but not beyond the per-partition maximum, so sustained traffic above that limit requires better key distribution.
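The partition arithmetic can be written down as a small estimator. This is a sketch under stated assumptions: it uses per-partition limits of 3,000 RCU, 1,000 WCU, and 10 GB as defaults (pass different values if you are working from other figures), and it only approximates what DynamoDB does internally, since the service does not expose partition counts.

```python
import math


def min_partitions(size_gb: float, rcu: float, wcu: float,
                   max_gb: float = 10, max_rcu: float = 3000,
                   max_wcu: float = 1000) -> int:
    """Lower bound on partitions implied by data size and throughput."""
    by_size = math.ceil(size_gb / max_gb)
    by_throughput = math.ceil(rcu / max_rcu + wcu / max_wcu)
    return max(1, by_size, by_throughput)


# 50 GB and 5,000 RCU: storage needs 5 partitions, throughput only 2.
print(min_partitions(50, 5000, 0))  # 5
```

The useful output is less the number itself than which term dominates: if throughput drives the count, a skewed key distribution will hurt long before storage becomes a concern.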


Trade-off Analysis: Redis vs DynamoDB vs Memcached

| Use Case | Redis | DynamoDB | Memcached |
| --- | --- | --- | --- |
| Sub-millisecond caching | Native, in-memory | Single-digit ms, SSD-backed | Native, in-memory |
| Distributed locks | Native (SET NX EX) | Conditional writes | Not natively supported |
| Session storage | Fast, ephemeral | Durable, managed | Fast, ephemeral |
| Write-heavy workloads | Good with persistence | Excellent auto-scaling | Good, but no persistence |
| Managed service | Self-hosted or Redis Enterprise | Fully managed by AWS | Self-hosted or ElastiCache |
| Data size limits | Bounded by memory | 400KB per item, PB total | 1MB per item |
| Replication | Master-replica, Cluster | Multi-AZ by default | None built in; clients shard keys via consistent hashing |

Observability Hooks: Redis INFO and DynamoDB CloudWatch

For Redis, INFO is the primary observability interface. The output has sections: Server (Redis build info, version, uptime), Clients (connected clients, blocked clients), Memory (used_memory, used_memory_peak, mem_fragmentation_ratio), Persistence (RDB/AOF status), Stats (total commands processed, keyspace hits/misses), Replication (role, master link status), CPU, Latency stats.

The metrics to watch in production: mem_fragmentation_ratio above 1.5 means Redis is using 50% more memory than it needs — restart Redis or adjust activedefrag settings. keyspace_hits / (keyspace_hits + keyspace_misses) is your cache hit ratio. Below 80% and your cache is not earning its memory. instantaneous_ops_per_sec shows current throughput. connected_clients approaching maxclients (default 10,000) is a connection exhaustion warning.
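These checks are easy to automate against the dictionary a client returns for INFO (redis-py's r.info() returns one with these field names). The sketch below works on a plain dict so it can be tried without a server. The 80% and 1.5 thresholds are the ones discussed above, while the 90% connection threshold and the function names are assumptions.

```python
def cache_hit_ratio(info: dict):
    """keyspace_hits / (keyspace_hits + keyspace_misses), or None if no traffic."""
    hits = info.get("keyspace_hits", 0)
    misses = info.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else None


def health_warnings(info: dict, max_clients: int = 10_000) -> list:
    """Return human-readable warnings for the metrics worth watching."""
    warnings = []
    ratio = cache_hit_ratio(info)
    if ratio is not None and ratio < 0.8:
        warnings.append("low cache hit ratio")
    if info.get("mem_fragmentation_ratio", 1.0) > 1.5:
        warnings.append("high memory fragmentation")
    # Assumed threshold: warn once 90% of maxclients is in use.
    if info.get("connected_clients", 0) > 0.9 * max_clients:
        warnings.append("connections near maxclients")
    return warnings
```

In production you would call this with the live INFO output on a schedule and feed the result into whatever alerting you already use.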

For DynamoDB, CloudWatch metrics matter: compare ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits against ProvisionedReadCapacityUnits and ProvisionedWriteCapacityUnits. If you consistently consume above 80% of provisioned capacity, let auto scaling raise it or increase it yourself. ThrottledRequests is the key alarm: non-zero throttling means your application is having requests rejected. ReadThrottleEvents or WriteThrottleEvents while consumed capacity sits well below provisioned usually means your keys are unevenly distributed. ReplicationLatency matters for global tables: cross-region replication lag should stay under 1 second.
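One detail that trips people up when computing utilization: the consumed-capacity metrics are totals over the sampling period, while provisioned capacity is a per-second figure. A minimal sketch of the conversion, assuming you have already fetched the Sum statistic for a period (the function name is illustrative):

```python
def capacity_utilization(consumed_sum: float, period_seconds: int,
                         provisioned_units: float) -> float:
    """Fraction of provisioned capacity used, from a Sum over one period."""
    if period_seconds <= 0 or provisioned_units <= 0:
        raise ValueError("period_seconds and provisioned_units must be positive")
    per_second = consumed_sum / period_seconds
    return per_second / provisioned_units


# 24,000 read units consumed in a 60-second period against 500 provisioned RCU.
print(capacity_utilization(24_000, 60, 500))
```

Comparing the raw Sum directly with the provisioned number would overstate utilization by a factor equal to the period length.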

Real-World Case Study: Twitter’s Redis Timeline Storage

Twitter’s early architecture used Redis to store user timelines — the list of tweets a user sees when they open their home timeline. The problem was timeline size: celebrities with millions of followers had timelines that could not be computed on read (pulling from all followed users’ tweets in real time was too slow).

Their solution was write-time fanout: when a user posts a tweet, Twitter pushes that tweet into the timelines of all their followers via a Redis pipeline. Reading a timeline became a simple Redis range read: fast and predictable. The tradeoff was write amplification: one tweet from a celebrity required millions of timeline writes, one per follower.

The operational consequence: the Redis cluster storing timelines was among the largest in production. Memory pressure was constant. When a Redis node failed, millions of users noticed missing timeline entries until repair ran. Twitter’s fix was a hybrid model — push timelines for active users, pull from graph for inactive ones. The lesson: Redis makes reads cheap at the cost of write amplification and memory pressure. Design for the read-to-write ratio of your actual workload.
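The write-time fanout pattern described here can be sketched in a few lines. This is an illustration of the general technique, not Twitter's implementation: the key naming, the 800-entry cap, and the function names are all assumptions, and client is expected to behave like a redis-py client.

```python
def fan_out_tweet(client, tweet_id: str, follower_ids, max_len: int = 800) -> None:
    """Push one tweet ID onto every follower's timeline list, newest first."""
    pipe = client.pipeline()
    for follower_id in follower_ids:
        key = f"timeline:{follower_id}"
        pipe.lpush(key, tweet_id)
        # Cap the list so each timeline uses a bounded amount of memory.
        pipe.ltrim(key, 0, max_len - 1)
    pipe.execute()


def read_timeline(client, user_id: str, count: int = 20):
    """Reading is a single range read on a precomputed list."""
    return client.lrange(f"timeline:{user_id}", 0, count - 1)
```

The asymmetry is visible in the code: read_timeline issues one command, while fan_out_tweet issues two commands per follower, which is where the write amplification and memory pressure come from.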

Interview Questions

1. Your Redis instance is using 8 GB of memory out of 10 GB max. You restart the process and memory drops to 5 GB. What happened?

Redis does not immediately return freed memory to the OS — used_memory reflects the allocator's view, not the OS's. After a restart, the allocator gets a fresh heap and fragmentation memory is recovered. The 3 GB drop was likely memory fragmentation (the allocator holding freed chunks that have not been coalesced) and dirty pages waiting to write to AOF or RDB snapshots. Check mem_fragmentation_ratio — if it was above 1.5, fragmentation was eating your memory. The fix is either restarting Redis periodically or enabling activedefrag (Redis 4.0+).

2. You are using DynamoDB with a composite primary key (userId, timestamp). Your application is reading a user's data range within a time window but latency has spiked. What do you check?

First check CloudWatch ThrottledRequests and ConsumedReadCapacityUnits. If the table is in provisioned mode and the partition key userId has hot spots (one user receiving far more reads than others), that partition maxes out at 3,000 RCU and throttles. Check key distribution with CloudWatch Contributor Insights for DynamoDB. If one key is consuming disproportionate capacity, choose a higher-cardinality partition key or add a random suffix to userId to spread the load, at the cost of fanning reads out over every suffix. If the table is on-demand, check whether DynamoDB adaptive capacity is catching up: it splits hot partitions but takes a few minutes under sustained load.
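The random-suffix idea from the answer above looks like this in practice. It is a sketch with assumed names and an assumed shard count of 10; the important part is that writers pick one suffix while readers must query all of them and merge the results.

```python
import random


def sharded_partition_key(user_id: str, shard_count: int = 10) -> str:
    """Partition key for a write: the user ID plus a random shard suffix."""
    return f"{user_id}#{random.randrange(shard_count)}"


def all_shard_keys(user_id: str, shard_count: int = 10) -> list:
    """Every partition key a reader must query to see all of a user's items."""
    return [f"{user_id}#{i}" for i in range(shard_count)]


print(all_shard_keys("user-123", shard_count=3))
# ['user-123#0', 'user-123#1', 'user-123#2']
```

This trades one hot partition for several cooler ones plus a scatter-gather read, so it is only worth it for the handful of keys that are actually hot.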

3. When would you choose Redis over DynamoDB for storing session data?

Redis when session data is ephemeral enough that losing it is acceptable (users can re-authenticate) and when sub-millisecond latency matters. Redis sessions work well in single-region deployments where recovery from a Redis failure is fast (users re-login quickly). DynamoDB when sessions must survive regional failures (cross-region replication with global tables), when you need auditability (DynamoDB Streams captures session changes), or when your session data has complex querying needs (you need to query sessions by arbitrary attributes). For most web applications, the deciding factor is whether a session lookup failure should require re-authentication or not.

4. How does the Redis eviction policy `volatile-lru` differ from `allkeys-lru`, and when would you choose each?

volatile-lru evicts the least recently used keys only among keys that have a TTL set. allkeys-lru evicts the LRU key across all keys in the database, regardless of whether they have a TTL. Choose volatile-lru when you have a mix of permanent and temporary keys — you want Redis to prioritize evicting keys that are going to expire anyway. Choose allkeys-lru when all your data is effectively a cache and you want Redis to manage the entire working set uniformly. The classic use case for volatile-lru is session storage where some sessions have TTLs and you want to protect permanent data from eviction.

5. Explain the difference between Redis replication and Redis Cluster. When would you use each?

Redis replication (master-replica) provides read replicas for horizontal read scaling; automatic failover when the primary fails requires Redis Sentinel or a managed equivalent on top. It is a single-primary topology: all writes go to the master, and replicas copy data asynchronously. Redis Cluster shards the keyspace across multiple primaries using hash slots (16384 slots distributed across nodes). Each primary handles a subset of the keyspace and can have its own replicas, providing both write scaling and fault tolerance. Use replication when you need read scaling with a single write source and your data fits on one node. Use Cluster when you need to store more data than a single Redis instance can hold or when you need to scale writes horizontally.

6. Your Redis latency spikes to 100ms occasionally. What diagnostic steps do you take?

Check redis-cli LATENCY LATEST (and LATENCY HISTORY for a specific event) for recorded spikes; this requires latency-monitor-threshold to be set. Check maxmemory (redis-cli CONFIG GET maxmemory) and evicted_keys in INFO stats: if keys are being evicted under memory pressure, eviction work can show up as latency spikes. Check mem_fragmentation_ratio: if it is above 1.5, the allocator is fragmented. Check instantaneous_ops_per_sec: if it drops to zero periodically, Redis is blocking on something. Check slowlog (redis-cli SLOWLOG GET 10) for commands exceeding your slow-query threshold. Look for BGREWRITEAOF or BGSAVE running: these fork and can cause copy-on-write latency spikes if memory is high. Also check the OS level: vmstat 1 for context switches, iostat for disk I/O saturation.

7. How do you implement a distributed rate limiter with Redis that handles millions of requests per second across multiple application instances?

The key is atomic operations and sliding window design. Use a Lua script for atomic check-and-increment so you do not have race conditions. The script checks the current count, rejects if over limit, increments if under limit, and sets expiry on the key. For a sliding window rather than fixed window, use sorted sets with timestamps as scores and remove expired entries before counting. The key design distributes load across Redis slots — using a key per user (rate:user123) means different users hit different slots, which Redis Cluster handles automatically. Use local in-memory counting as a first pass (e.g., a token bucket) to reduce Redis round trips for the common case of being well under the limit.
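The local first pass mentioned at the end of that answer could be a small in-process token bucket. This is a minimal sketch with assumed names and parameters; it limits only the current process, so the Redis-side check remains the source of truth for the global limit.

```python
import time


class TokenBucket:
    """In-process token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller that is rejected locally can be turned away without a Redis round trip; only requests that pass the local bucket go on to the shared, authoritative limiter.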

8. Compare Redis pub/sub to Kafka for event streaming. When would you choose each?

Redis pub/sub is fire-and-forget — messages are not persisted, subscribers must be connected at publish time, and there is no replay capability. Redis Streams (a data structure in Redis) adds persistence and consumer groups, making it closer to Kafka. Kafka has guaranteed delivery, persistent storage with configurable retention, replay from any offset, partitioning, and replication. Redis is simpler operationally and works well when your event volume fits in memory, you do not need replay, and your consumers are always online. Kafka is for high-volume, durable, multi-consumer scenarios where you need to reprocess events or have consumers that come and go. For most real-time analytics use cases, Redis Streams is sufficient. For event sourcing with audit trails and complex consumption patterns, Kafka wins.

9. DynamoDB has a 400KB item size limit. How do you design around this for large objects like user profiles with unbounded activity history?

Store large data in S3 and keep the S3 pointer in DynamoDB. This is the standard pattern — DynamoDB for the metadata and index, S3 for the payload. For activity history specifically, store the last N activities inline (recent and most valuable) and archive older activities to S3 or a separate DynamoDB table with a composite key (userId + timestamp). Use a separate activity archive table with time-based bucketing on the sort key so you can query ranges efficiently. The tradeoff: you add a network hop to S3 for the full profile, but you gain predictable DynamoDB item sizes and avoid the 400KB limit entirely.

10. What is the difference between Redis Strings and Redis Hashes for storing objects? When would you use each?

Strings serialize an entire object as one value (JSON, protobuf, etc.). Hashes store field-value pairs at the top level, with each field stored separately. Use Strings when you always read the entire object together, when the object size is small (under a few KB), and when you want to fetch many objects in one command (MGET). Use Hashes when you frequently need to access or update individual fields of an object (HGET or HSET instead of GET, deserialize, modify, re-serialize, SET) and when your objects have many fields but your queries only need a few. Expiry applies to the whole hash key: per-field expiry needs either separate keys or the field-expiration commands (HEXPIRE) added in Redis 7.4. Hashes store every field name alongside its value, so for small simple objects that are always read whole, Strings are usually the simpler choice.

11. How does the Redis AOF (Append Only File) persistence work, and what are the implications for durability vs performance?

AOF logs every write operation to a file. On restart, Redis replays the log to restore state. The appendfsync setting controls when Redis calls fsync to flush the log to disk: always syncs on every write (safest, and markedly slower under write load), everysec syncs once per second (good durability, small performance impact, up to about 1 second of writes at risk), no leaves flushing to the OS (fastest, and the loss window is whatever the kernel's flush interval is, typically tens of seconds on Linux). For stronger durability without the always performance hit, use everysec plus a replica, so if the primary fails you have at least one other copy with near-current data.

12. Your DynamoDB GSI is returning stale data even after writes complete. What could be causing this?

GSI reads are always eventually consistent: writes go to the main table, and GSI updates propagate asynchronously. The propagation typically takes under a second but can take longer under high write load or if the GSI partition is hot. GSIs do not support strongly consistent reads, so if you need to read your own writes immediately, read from the base table with ConsistentRead=True rather than querying the GSI. Also check how you use ReturnValues in your writes: ALL_OLD and UPDATED_OLD return the item as it was before the write, which can confuse applications expecting the new state.

13. How do you handle Redis connection exhaustion in a high-concurrency application?

Redis has a default maxclients of 10,000. If you run out, new connections are rejected. First, use connection pooling — do not create a new connection per request; maintain a pool of long-lived connections and multiplex requests on them. Second, use pipelining to batch multiple commands into fewer round trips. Third, check your timeout setting — if connections are left open but not used, they still count toward the limit. Fourth, use INFO clients to see connected clients and which are blocked. Fifth, if you genuinely need more than 10,000 connections, use Redis Cluster to spread load across multiple nodes, each with its own connection pool.

14. Explain the tradeoffs between Redis Sorted Sets and Range Queries in DynamoDB for leaderboard implementations.

Redis Sorted Sets maintain a score-sorted order with O(log N) insert and O(log N + M) range reads for M returned elements, so ZREVRANGE returns the top 10 cheaply. DynamoDB only sorts within a partition, by sort key. A global top-N therefore needs a GSI whose partition key groups all players (or a fixed set of shards) with the score as the sort key, queried with ScanIndexForward=False and a Limit; that concentrates traffic on a few partitions, and DynamoDB has no native rank lookup for a given player. For leaderboards with frequent updates and frequent top-N queries, Redis Sorted Sets are far superior. For static leaderboards or queries that can be bounded (top N within a category), DynamoDB with a GSI works. The DynamoDB advantage is durability and managed scaling; the Redis advantage is performance and native rank operations.

15. What is Redis pipelining, why does it dramatically improve throughput, and what are the caveats?

Redis pipelining sends multiple commands to Redis in a single network round trip by batching them together. Without pipelining, each command waits for a response before the next is sent — latency is dominated by network round trips. With pipelining, you send N commands and wait for N responses at once, reducing round trips from N to 1. The caveat: all commands in a pipeline must be independent — you cannot use the result of command 1 as input to command 2 within the same pipeline without custom logic. Additionally, very large pipelines (thousands of commands) consume memory on both client and server, so size your pipelines appropriately. For most production workloads, pipelines of 100-500 commands strike a good balance.
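Keeping pipelines to a bounded size, as suggested above, is mostly a matter of flushing every N commands. The sketch below assumes a redis-py style client (pipeline(), set(..., ex=...), execute()); the function name and the default batch size of 500 are illustrative.

```python
def pipelined_set(client, items: dict, batch_size: int = 500,
                  ttl_seconds=None) -> int:
    """Write items in pipelined batches; returns the number of batches sent."""
    pipe = client.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    pending = 0
    batches = 0
    for key, value in items.items():
        pipe.set(key, value, ex=ttl_seconds)
        pending += 1
        if pending >= batch_size:
            pipe.execute()  # one round trip for the whole batch
            batches += 1
            pending = 0
    if pending:
        pipe.execute()  # flush the final partial batch
        batches += 1
    return batches
```

Writing 1,050 keys with the default batch size costs three round trips instead of 1,050, while never buffering more than 500 commands on either side.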

16. How would you design a Redis-based cache warming strategy for a new application deployment?

Cache warming is the practice of populating the cache before users hit it, avoiding cold-start cache misses. Strategy 1: batch load on startup — run a background job that reads the top-N most frequently accessed items from your primary database and writes them to Redis with appropriate TTLs. Strategy 2: lazy loading with fallback — when a cache miss occurs, populate the cache before returning the data so subsequent requests are cached. Strategy 3: hybrid — warm the hot dataset on startup (top 10,000 items), let lazy loading handle the long tail. Monitor keyspace_hits and keyspace_misses after deployment — if miss rate is high after warming, adjust the warming scope. Set appropriate TTLs during warming so cache does not grow unbounded.

17. DynamoDB On-Demand vs Provisioned capacity: when would you choose each, and how do you handle cost optimization?

DynamoDB On-Demand charges per read/write operation with no capacity planning: it scales automatically but has higher per-operation cost at high throughput. Provisioned capacity charges for reserved RCU/WCU with lower per-unit cost but requires upfront capacity planning. Choose On-Demand for unpredictable, bursty workloads; for new applications where you do not know the traffic pattern yet; for development and testing. Choose Provisioned for predictable, sustained high-throughput workloads where you can optimize cost per operation. Cost optimization: monitor ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits and aim for 70-80% utilization of provisioned capacity. Use auto-scaling to handle traffic growth without overprovisioning. For known traffic patterns (nightly batch jobs), raise provisioned capacity ahead of the peak and scale it back down afterwards rather than paying On-Demand rates for predictable workloads.

18. What is the difference between Redis BITCOUNT and HyperLogLog for cardinality estimation?

BITCOUNT counts the bits set to 1 in a string. Used with SETBIT as a bitmap (one bit per user, at the offset of a dense integer user ID), it gives exact unique counts, but memory grows with the largest ID you need to cover rather than with the number of users seen. HyperLogLog is a probabilistic data structure purpose-built for cardinality estimation: it uses at most about 12KB of memory regardless of cardinality, with a standard error of roughly 0.81%. Use bitmaps with BITCOUNT when you need exact counts and the ID space is small or dense enough that memory is not a concern. Use HyperLogLog when you need to estimate cardinality of large sets (millions to billions of unique values) where storing exact counts is prohibitive. HyperLogLog is the right choice for "how many unique visitors in the last 24 hours" style metrics.

19. How do you handle cross-region Redis replication for disaster recovery?

For cross-region disaster recovery, the standard approach is asynchronous replication to a secondary region; managed offerings package this as, for example, ElastiCache Global Datastore. The primary region receives writes, and replica regions replicate asynchronously. The key limitation: cross-region latency means replication lag, so writes in the primary may not be immediately visible in the secondary region. For true disaster recovery, you need to understand your RPO (Recovery Point Objective): how much data can you afford to lose? If it is zero, asynchronous replication cannot meet it; Redis replication is asynchronous, and the WAIT command only narrows the window at the cost of added latency. If it is seconds to minutes, async cross-region replication works. Test failover explicitly: when the primary region fails, how do you promote the secondary? DNS updates take time. Your application must handle the lag gracefully during the switchover.

20. Your Redis cluster is experiencing split-brain after a network partition. How do you diagnose and resolve it?

Split-brain occurs when the cluster fragments into two or more sub-clusters that cannot communicate, and each believes it owns the same slots. First, check redis-cli CLUSTER NODES on each node to see the cluster state and which nodes think they are the primary. If nodes in both partitions are accepting writes, you have divergent data. In Redis Cluster, only the side that can reach a majority of primaries keeps accepting writes; primaries on the minority side stop accepting writes once cluster-node-timeout has elapsed. The recovery process: restore network connectivity and let the cluster heal via the gossip protocol (Redis Cluster handles this automatically). A primary whose slots were taken over by a promoted replica rejoins as a replica of the new primary and resynchronizes from it, so any writes it accepted during the partition window are lost. CLUSTER FAILOVER is only needed afterwards if you want to move the primary role back to a particular node. Check redis-cli CLUSTER INFO to confirm cluster_state is ok and all slots are assigned.

Further Reading

For more on NoSQL databases, see the NoSQL overview. To learn about caching strategies, see Caching Strategies. For comparison with Redis and Memcached, see Redis vs Memcached.

Conclusion

Key-value stores trade query flexibility for raw speed and simplicity. Redis works well as a cache, session store, and real-time data structure server where sub-millisecond latency matters. DynamoDB works well as a managed, scalable, durable store for application data with predictable access patterns.

The key insight is that key-value stores require knowing your access patterns upfront. You cannot query on arbitrary fields. Design your keys around how you actually read data, not around the structure of the data itself.
