Database Scaling: Vertical, Horizontal, and Read Replicas
Learn strategies for scaling databases beyond a single instance: vertical scaling, read replicas, write scaling, and when to choose each approach.
Every database hits limits eventually. A single server has finite CPU, memory, and disk speed. At some traffic level, you need to scale. The question is how.
I have seen teams jump to complex sharding when vertical scaling would have bought them years. I have also seen teams keep vertically scaling past the point where horizontal approaches would have been cheaper. The right answer depends on your workload.
Vertical Scaling First
Vertical scaling means bigger servers. More CPU cores, more memory, faster disks. This is the simplest approach. You take your existing database and move it to a more powerful machine.
The advantage is simplicity. No application changes required. Your single-server transactions stay ACID-compliant. Your queries work exactly as before.
The disadvantage is limits. Server sizes have caps: a database that needs 64 cores will not fit on a server with 32. And costs scale faster than capacity; a server with twice the resources often costs more than twice as much.
Vertical scaling works best early in a system’s life. When traffic is modest and growth is predictable, throwing bigger iron at the problem is often the cheapest solution in engineering time.
Read Replicas
Most application workloads are read-heavy. Eighty percent reads, twenty percent writes is common. Read replicas let you scale read capacity by copying data to additional servers.
```mermaid
graph TD
    App[Application] --> LB[Load Balancer]
    LB --> Primary[(Primary DB)]
    LB --> Replica1[(Read Replica 1)]
    LB --> Replica2[(Read Replica 2)]
    Primary -->|Replication| Replica1
    Primary -->|Replication| Replica2
```
The primary handles writes. Replicas handle reads. Replication is typically asynchronous, meaning reads might be slightly stale. For most applications, milliseconds of staleness is acceptable.
```python
# Route reads to replicas, writes to primary
def get_user(user_id):
    # Read from a replica
    return replica_db.query("SELECT * FROM users WHERE id = ?", user_id)

def update_user(user_id, data):
    # Write to the primary
    primary_db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    # Invalidate cached copies or wait for replication
```
PostgreSQL, MySQL, and MongoDB all support read replicas. The implementation details differ but the pattern is the same. You gain read scale at the cost of replication lag.
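That trade-off can be made explicit in routing code. A minimal sketch, where the `primary`/`replica` handles and the one-second freshness budget are illustrative assumptions rather than any particular driver's API:

```python
MAX_ACCEPTABLE_LAG_SECONDS = 1.0  # assumed freshness budget for this read

def read_user(primary, replica, user_id, replica_lag_seconds):
    """Read from the replica only while its measured lag is within budget."""
    # Fall back to the primary when the replica is too stale
    source = replica if replica_lag_seconds <= MAX_ACCEPTABLE_LAG_SECONDS else primary
    return source.query("SELECT * FROM users WHERE id = ?", user_id)
```

The lag value would come from monitoring (for PostgreSQL, `pg_stat_replication`); passing it in keeps the routing decision testable.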
Write Scaling Challenges
Read replicas solve read scale. Write scaling is harder. You cannot simply add more servers to handle writes because writes must be coordinated. Every write needs to be visible to all subsequent reads.
```mermaid
graph TD
    App[Application] --> Primary[(Primary DB)]
    App2[Application 2] --> Primary
    App3[Application 3] --> Primary
    Primary --> Replica1[(Read Replica)]
    Primary --> Replica2[(Read Replica)]
```
The primary is the bottleneck for writes. Vertical scaling helps up to a point, but eventually you need to distribute writes somehow.
Write scaling approaches include:
- Partitioning writes by functionality (different tables on different servers)
- Sharding across multiple primary servers
- Using systems designed for write distribution
Most applications hit write limits later than they expect. Vertical scaling plus read replicas handles surprisingly large workloads.
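The first of those approaches, partitioning writes by functionality, can be as simple as a routing table. A sketch, with hypothetical server names:

```python
# Hypothetical mapping of table groups to dedicated primary servers
TABLE_TO_PRIMARY = {
    "users": "users-primary.internal",
    "orders": "orders-primary.internal",
    "events": "analytics-primary.internal",
}

def primary_for_table(table_name: str) -> str:
    # Tables without a dedicated server fall back to the default primary
    return TABLE_TO_PRIMARY.get(table_name, "default-primary.internal")
```

This buys write headroom without a shard key, at the cost of losing cross-server joins and transactions between the partitioned tables.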
Vendor-Specific Replication Differences
Each database handles replication differently. Understanding these differences helps you choose and operate your system correctly.
PostgreSQL Replication
PostgreSQL uses physical streaming replication. The primary sends WAL (Write-Ahead Log) records to replicas, which apply them continuously.
```sql
-- Create a physical replication slot
SELECT * FROM pg_create_physical_replication_slot('replica1_slot');

-- Check replication status
SELECT client_addr, state, sent_lsn, write_lsn
FROM pg_stat_replication;
```
Key characteristics:
- Streaming replication is asynchronous by default
- Synchronous replication is available via `synchronous_commit = on`
- Supports read replicas via `hot_standby`
- Logical replication enables table-level replication (useful for migrations, not scaling)
Connection pooling: PgBouncer is the standard. Use transaction-mode pooling for most applications:
```ini
; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```
MySQL Replication
MySQL uses binary log (binlog) replication. The primary writes transactions to the binlog, replicas request and apply them.
```sql
-- Check replication status
SHOW REPLICA STATUS\G

-- Configure binlog retention (7 days)
SET GLOBAL binlog_expire_logs_seconds = 604800;
```
Key characteristics:
- Default is asynchronous (semi-synchronous available)
- GTID (Global Transaction ID) makes replication tracking reliable
- `log_slave_updates` (now `log_replica_updates`) controls whether replicas write to their own binlog
- Row-based replication is more consistent but produces larger binlogs
Connection pooling: ProxySQL is the standard for MySQL:
```sql
-- Add a backend pool in ProxySQL
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (10, 'replica1-host', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```
MongoDB Replication
MongoDB uses a replica set with oplog (operations log). The primary applies operations to the oplog, secondaries tail the oplog and apply operations.
```javascript
// Check replica set status
rs.status();

// Configure read preference
db.getMongo().setReadPref("secondaryPreferred");

// Add a new member
rs.add("mongodb2.example.com:27017");
```
Key characteristics:
- Oplog is a capped collection that stores all write operations
- Default read preference is primary (strong consistency)
- Secondary reads are eventually consistent
- Change Streams provide real-time change notifications
Sharding architecture:
```javascript
// Add a shard to the cluster
sh.addShard("shard1-replica-set/mongodb1:27017");

// Enable balancing for a collection
sh.enableBalancing("myapp.users");
```
Database Sharding
Sharding distributes data across multiple servers. Each shard contains a subset of the data. Writes route to the appropriate shard based on a shard key.
```mermaid
graph LR
    App[Application] --> Router[(Shard Router)]
    Router -->|user:1000-1999| Shard1[(Shard 1)]
    Router -->|user:2000-2999| Shard2[(Shard 2)]
    Router -->|user:3000-3999| Shard3[(Shard 3)]
```
Choosing the shard key is critical. A bad key leads to uneven distribution. User ID often works well since writes are distributed by user.
Sharding is complex. Cross-shard queries are difficult. Joins across shards require querying multiple shards and combining results. Your application logic must understand the sharding scheme.
```python
NUM_SHARDS = 4  # total shard count (example value)

def get_user_shard(user_id):
    shard_id = int(user_id) % NUM_SHARDS
    return f"shard_{shard_id}"

def get_user(user_id):
    shard = get_user_shard(user_id)
    return shards[shard].query("SELECT * FROM users WHERE id = ?", user_id)
```
Most databases have some sharding support. MongoDB has sharding built in. PostgreSQL requires application-level sharding or an extension such as Citus. CockroachDB distributes writes automatically, with different trade-offs.
Caching as a Buffer
Before scaling the database, consider caching. Memcached and Redis sit in front of databases and serve repeated queries from memory.
```python
import json

def get_user(user_id):
    # Try the cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache miss: query the database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    # Store in the cache for next time (300-second TTL)
    redis.setex(f"user:{user_id}", 300, json.dumps(user))
    return user
```
Cache hit rates of 90% or higher are common. Caching dramatically reduces database load. Add caching before adding complexity.
The challenge is cache invalidation. When data changes, you must update or delete the cached version. TTLs provide automatic expiration but introduce staleness.
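The simplest reliable pattern is delete-on-write: update the database, then delete the cached key so the next read repopulates it. A self-contained sketch, where plain dicts stand in for the primary database and Redis:

```python
def update_user_email(db, cache, user_id, new_email):
    """Write to the primary first, then invalidate the cached copy."""
    db[user_id]["email"] = new_email      # stands in for the UPDATE on the primary
    cache.pop(f"user:{user_id}", None)    # delete-on-write; next read repopulates

db = {42: {"email": "old@example.com"}}
cache = {"user:42": {"email": "old@example.com"}}  # becomes stale after the write
update_user_email(db, cache, 42, "new@example.com")
```

Deleting rather than updating the cached value avoids writing a stale entry if two updates race; the cost is one extra cache miss per write.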
When to Choose Each Approach
Start with vertical scaling. It is the simplest option and handles most workloads for longer than you expect. Monitor your database; when CPU or memory approaches its limits, evaluate your options.
Add read replicas when read latency increases or CPU on the primary maxes out from read load. Reads scale horizontally easily. Route reads to replicas, writes to primary.
Consider caching when you have repeated queries for the same data. Hot data caches well. Immutable or rarely-changed data caches best.
Shard when you hit write limits. Sharding is complex. Only introduce that complexity when vertical scaling and replication cannot help.
Capacity Planning
Vertical Scaling Formulas
Estimate when you will need to scale vertically based on current resource utilization and growth rate:
```python
import math

def estimate_vertical_scale_time(
    current_cpu_percent: float,
    current_memory_percent: float,
    monthly_growth_rate: float,   # e.g., 0.15 for 15% monthly growth
    limit_percent: float = 90.0,  # utilization at which you must have scaled
) -> dict:
    """
    Estimate when vertical scaling will be needed.
    Returns the limiting resource and estimated days until scaling is required.
    """
    # Days until CPU hits the limit at compound monthly growth
    cpu_days = math.log(limit_percent / current_cpu_percent) / math.log(1 + monthly_growth_rate) * 30
    # Days until memory hits the limit (memory assumed to grow slower than CPU)
    memory_days = math.log(limit_percent / current_memory_percent) / math.log(1 + monthly_growth_rate * 0.7) * 30

    limiting_resource = 'CPU' if cpu_days < memory_days else 'Memory'
    days_until_scale = min(cpu_days, memory_days)
    return {
        'limiting_resource': limiting_resource,
        'estimated_days_until_scale': int(days_until_scale),
        'monthly_growth_assumption': f"{monthly_growth_rate * 100:.0f}%"
    }

# Example: 60% CPU, 70% memory, 20% monthly growth
result = estimate_vertical_scale_time(60, 70, 0.20)
# Returns: limiting_resource='Memory', estimated_days_until_scale=~57 days
```
Connection pool sizing formula:
```python
def recommended_pool_size(
    max_connections_per_query: int,
    expected_concurrent_requests: int,
    database_cpu_count: int
) -> int:
    """
    Calculate a recommended connection pool size.
    Rule of thumb: (core_count * 2) + effective_spindle_count
    For SSDs: effective_spindle_count ~= 30
    """
    # General formula from the PostgreSQL wiki
    baseline = database_cpu_count * 2
    # Add for SSD storage (rough approximation of I/O capacity)
    effective_spindle_count = 30  # for SSD-backed storage
    recommended = baseline + effective_spindle_count
    # Cap at a reasonable fraction of expected concurrent connection demand
    max_pool_size = max_connections_per_query * expected_concurrent_requests * 0.5
    return int(min(recommended, max_pool_size))

# Example: 8-core DB, 100 concurrent requests, each query uses 1 connection
pool_size = recommended_pool_size(1, 100, 8)  # Returns 46
```
Read Replica Lag Estimation
Replication lag depends on write volume, network bandwidth, and replica hardware:
```python
def estimated_replication_lag_seconds(
    write_throughput_mbps: float,   # replication bandwidth in MB/s
    network_latency_ms: float,
    replica_apply_latency_ms: float,
    wal_record_size_kb: float = 4.0
) -> float:
    """
    Estimate typical replication lag in seconds.
    """
    # Time to transfer WAL records for each write batch (KB -> MB, then s -> ms)
    transfer_time_ms = (wal_record_size_kb / 1024) / write_throughput_mbps * 1000
    # Network latency + transfer + apply time
    total_latency_ms = network_latency_ms + transfer_time_ms + replica_apply_latency_ms
    return total_latency_ms / 1000  # convert to seconds

# Example: 10 MB/s write throughput, 1ms LAN latency, 2ms apply
lag = estimated_replication_lag_seconds(10, 1.0, 2.0)
# Returns: ~0.0034 seconds (about 3.4 ms) for small records
```
Sharding Capacity Formula
```python
import math

def estimate_sharding_requirements(
    total_data_gb: int,
    write_tps: int,
    read_tps: int,
    target_shard_data_gb: int = 100  # GB per shard target
) -> dict:
    """
    Estimate the number of shards needed and their characteristics.
    Assumes ~5,000 write TPS and ~10,000 read TPS per shard.
    """
    # Data-based shard count
    data_shards = math.ceil(total_data_gb / target_shard_data_gb)
    # Write scaling: assume each shard handles ~5000 TPS for simple writes
    writes_per_shard = write_tps / data_shards
    write_shards = math.ceil(write_tps / 5000) if writes_per_shard > 5000 else data_shards
    # Read scaling: assume each shard handles ~10000 TPS with caching
    reads_per_shard = read_tps / data_shards
    read_shards = math.ceil(read_tps / 10000) if reads_per_shard > 10000 else data_shards

    total_shards = max(data_shards, write_shards, read_shards)
    return {
        'estimated_shards': total_shards,
        'data_gb_per_shard': total_data_gb / total_shards,
        'writes_per_shard_per_second': write_tps / total_shards,
        'reads_per_shard_per_second': read_tps / total_shards,
        'recommendation': 'shard' if total_shards > 1 else 'single_server'
    }

# Example: 500 GB data, 20,000 write TPS, 100,000 read TPS
result = estimate_sharding_requirements(500, 20000, 100000)
# Returns: 10 shards (read-driven), 50 GB/shard, 2,000 write TPS/shard, 10,000 read TPS/shard
```
Cost Comparison: Vertical vs Horizontal Scaling
| Dimension | Vertical Scaling | Horizontal Scaling (Replicas + Sharding) |
|---|---|---|
| Hardware cost | High (large-server prices grow non-linearly) | Lower (commodity servers scale linearly) |
| Licensing cost | Often per-socket or per-core | Per-server (can use more smaller licenses) |
| Operational complexity | Low (single server) | High (more servers, replication, failover) |
| Scaling elasticity | Discrete jumps (next server size) | Smooth (add servers incrementally) |
| Maximum scale | Limited by largest available server | Nearly unlimited (sharding) |
| Failure domain | Single point of failure | Can survive single node failures |
| Read throughput | Limited to one server’s capacity | Multiplied by replica count |
| Write throughput | Limited to one server’s capacity | Requires sharding (complex) |
| Monthly cost trajectory | Step function (big jumps) | Linear growth with usage |
Break-even comparison:
| Scenario | Vertical (r6i.4xlarge) | Horizontal (3x r6i.xlarge) |
|---|---|---|
| Specifications | 16 vCPU, 128 GB RAM | 3 × (4 vCPU, 32 GB RAM) |
| Monthly AWS cost (us-east-1) | ~$1,200 | ~$1,350 |
| Write throughput | 15,000 TPS | ~20,000 TPS (with sharding) |
| Read throughput (cached) | 50,000 QPS | 150,000 QPS |
| Failure handling | Manual failover | Automatic with read replicas |
| Best for | Write-heavy, simple ops | Read-heavy, high availability |
Scaling Timeline: When to Choose What
Use this as a rough guide for when to consider each scaling approach based on your workload:
| Scale Stage | Users / QPS | Primary Bottleneck | Recommended Approach |
|---|---|---|---|
| Startup | < 1K users, < 100 QPS | Simple queries | Single server, basic vertical |
| Growing | 1K-10K users, 100-1K QPS | Connection limits, some slow queries | Vertical + read replicas |
| Scaling | 10K-100K users, 1K-10K QPS | Read latency, replication lag | Add caching, more replicas |
| Large | 100K-1M users, 10K-50K QPS | Write throughput, data size | Sharding, connection pooling |
| Enterprise | > 1M users, > 50K QPS | Multi-region, compliance | Distributed database, read-heavy CQRS |
Decision triggers to watch:
```python
def should_scale_vertically(
    cpu_percent_avg: float,
    memory_percent_avg: float,
    connections_used_percent: float
) -> dict:
    """Return scaling recommendations based on resource utilization."""
    triggers = []
    if cpu_percent_avg > 70:
        triggers.append("CPU > 70%: consider vertical scale or read replicas")
    if memory_percent_avg > 80:
        triggers.append("Memory > 80%: vertical scale or caching")
    if connections_used_percent > 70:
        triggers.append("Connections > 70%: add connection pooling or replicas")
    return {
        'should_scale': len(triggers) > 0,
        'triggers': triggers,
        'recommendation': 'vertical' if cpu_percent_avg > 70 else 'replicas'
    }

def should_add_caching(
    cache_hit_rate: float,
    repeated_query_percent: float,
    db_cpu_percent: float
) -> dict:
    """Determine whether caching would help."""
    if repeated_query_percent > 30 and cache_hit_rate < 0.80:
        return {
            'recommend_caching': True,
            'reason': f"{repeated_query_percent:.0f}% repeated queries, "
                      f"only {cache_hit_rate:.0%} cache hit rate"
        }
    return {'recommend_caching': False}

def should_shard(
    data_size_gb: int,
    write_tps: int,
    single_server_max_tps: int,
    single_server_max_gb: int
) -> dict:
    """Determine whether sharding is needed."""
    reasons = []
    if data_size_gb > single_server_max_gb * 0.7:
        reasons.append(f"Data ({data_size_gb} GB) approaching "
                       f"single server limit ({single_server_max_gb} GB)")
    if write_tps > single_server_max_tps * 0.7:
        reasons.append(f"Write TPS ({write_tps}) approaching "
                       f"single server limit ({single_server_max_tps})")
    return {
        'should_shard': len(reasons) > 0,
        'reasons': reasons
    }
```
Start with the simplest approach (vertical). Only add complexity when measurements show it is needed.
Consistency Considerations
Scaling introduces consistency trade-offs. Read replicas are typically asynchronously replicated. Writes to primary take time to propagate to replicas.
```python
# The write goes to the primary...
primary_db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)

# ...but an immediate read from a replica may still return the old value
user = replica_db.query("SELECT * FROM users WHERE id = ?", user_id)
if user['email'] != new_email:
    pass  # replica has not applied the write yet; the user briefly sees stale data
```
Read-after-write consistency is not guaranteed with replicas. A user might write data and immediately read it from a replica that has not yet received the write. Solutions include:
- Read from primary after writes
- Synchronous replication (higher latency, lower availability)
- Session-based routing (read from primary for a window after writes)
For most applications, eventual consistency is acceptable. Users rarely notice milliseconds of staleness. Choose stronger guarantees only when your requirements demand them.
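The session-based routing option above can be sketched as a per-session "sticky" window. The five-second window and the clock injection are assumptions for illustration; the window should exceed your typical replication lag:

```python
import time

class SessionRouter:
    """Route a session's reads to the primary for a window after its last write."""

    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.last_write = {}  # session_id -> timestamp of last write

    def record_write(self, session_id):
        self.last_write[session_id] = self.clock()

    def choose_target(self, session_id):
        last = self.last_write.get(session_id)
        if last is not None and self.clock() - last < self.window:
            return "primary"   # read-your-writes within the window
        return "replica"       # otherwise spread reads across replicas
```

This gives read-after-write consistency for the writing session only, which is usually the session that notices staleness.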
When to Use and When Not to Use Each Scaling Approach
Vertical Scaling:
- Use when: Database CPU or memory approaches limits; you want simplicity
- Do not use when: You have hit maximum server specifications; cost of large servers exceeds distributed alternatives
Read Replicas:
- Use when: Read latency is increasing; primary CPU maxes out from read load; you need geographic distribution
- Do not use when: You need strong read-after-write consistency; writes are the bottleneck
Caching:
- Use when: You have repeated queries for the same data; hot data patterns exist; you want to reduce database load
- Do not use when: Data changes frequently and cache invalidation is complex; additional infrastructure is not justified
Sharding:
- Use when: You have exhausted vertical scaling and replicas; data size exceeds single server; write throughput exceeds single server capability
- Do not use when: You can solve the problem with vertical scaling or caching; your team lacks operational expertise
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Vertical scaling hitting hardware limits | Database stops scaling up, performance degrades | Plan capacity ahead, migrate to distributed architecture before crisis |
| Read replica lag | Stale reads, user sees outdated data | Monitor lag, route critical reads to primary, use synchronous replication |
| Cache stampede | Database overwhelmed when cache expires | Implement cache locking, use jittered TTLs, warm cache proactively |
| Shard hotspot | Some shards overloaded while others idle | Choose better shard key, implement virtual nodes, rebalance |
| Connection pool exhaustion | New queries fail, application errors | Size connection pools appropriately, implement backpressure, use pooling middleware |
| Cross-shard query performance | Slow aggregations, timeouts | Denormalize for common queries, use scatter-gather with timeouts, pre-aggregate |
| Cache failure | Database hit by all requests | Implement cache redundancy, graceful degradation, circuit breakers |
| Failover not working | Extended downtime during primary failure | Test failover regularly, automate with health checks, have manual runbook |
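One of the mitigations above, jittered TTLs against cache stampedes, takes only a few lines. A sketch, where the 300-second base TTL and 10% jitter are assumed values:

```python
import random

def jittered_ttl(base_seconds=300, jitter_fraction=0.10):
    """Spread expirations so keys cached together do not all expire at once."""
    jitter = base_seconds * jitter_fraction
    # Uniform jitter in [-10%, +10%] of the base TTL
    return base_seconds + random.uniform(-jitter, jitter)
```

Use it wherever you set a TTL, e.g. `redis.setex(key, int(jittered_ttl()), value)`, so a burst of keys written together does not expire together and hammer the database in one instant.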
Observability Checklist
Metrics to Monitor:
- Query latency by type (read, write, analytical)
- Database CPU, memory, disk I/O utilization
- Connection pool utilization and wait times
- Replication lag (seconds behind primary)
- Cache hit ratio and memory utilization
- Shard distribution and hotspot detection
- Throughput (queries per second, transactions per second)
- Queue depths and backpressure indicators
Logs to Capture:
- Slow query logs (queries exceeding thresholds)
- Connection events (opens, closes, failures)
- Replication status changes and errors
- Failover events and decision rationale
- Cache evictions and expiration patterns
- Application errors related to database access
Alerts to Set:
- CPU or memory utilization > 80%
- Replication lag > 30 seconds
- Connection pool > 80% utilized
- Cache hit ratio < threshold (e.g., < 80%)
- Query latency p99 exceeding SLA
- Shard skew exceeding threshold (e.g., > 2x average)
- Disk usage > 85%
```python
# Database scaling monitoring example
# (get_db_cpu and friends are metric-collection stubs to wire to your monitoring)
def check_db_health():
    metrics = {
        'cpu_percent': get_db_cpu(),
        'memory_percent': get_db_memory(),
        'connections': get_connection_count(),
        'replication_lag': get_replication_lag(),
        'cache_hit_ratio': get_cache_hit_ratio()
    }
    alerts = []
    if metrics['cpu_percent'] > 80:
        alerts.append(f"CPU critical: {metrics['cpu_percent']}%")
    if metrics['replication_lag'] > 30:
        alerts.append(f"Replication lag: {metrics['replication_lag']}s")
    return metrics, alerts
```
Security Checklist
- Use strong authentication for all database connections
- Implement TLS encryption for all database connections
- Apply principle of least privilege for application database users
- Rotate credentials regularly and use secret management
- Audit database access logs for security review
- Use network segmentation (databases not directly internet-accessible)
- Encrypt data at rest (filesystem or application-level encryption)
- Implement query whitelisting or restrictions for application users
- Monitor for unauthorized access attempts
- Test database access controls with penetration testing
- Secure backup data with encryption
- Implement read-only users separate from read-write users
Common Pitfalls and Anti-Patterns
- Scaling too late: Monitoring gaps cause sudden performance crises. Monitor proactively and scale before hitting limits.
- Premature sharding: Sharding introduces significant complexity. Vertical scaling plus caching handles most workloads longer than expected.
- Ignoring read-after-write consistency: With replicas, reads might return stale data immediately after writes. Implement appropriate consistency levels.
- Cache invalidation complexity: Caching stale data is worse than no caching. Ensure invalidation is correct before caching frequently-changing data.
- Connection pool misconfiguration: Too few connections starve the application. Too many exhaust database resources. Size appropriately.
- Not testing failover: If failover is broken, you only find out during an outage. Test regularly in staging.
- Sharding without clear access patterns: Without understanding how data is accessed, shard keys will be wrong. Design access patterns first.
- Caching without measuring: Cache adds complexity. Measure cache hit rates and only cache when it provides meaningful benefit.
Quick Recap
Key Bullets:
- Scale vertically first, then add read replicas, then cache, then shard (in that order)
- Read replicas solve read scale; sharding solves write scale
- Caching provides the best ROI for reducing database load when access patterns allow
- Monitor CPU, memory, connection counts, and replication lag
- Test failover procedures before you need them in production
- Choose scaling approach based on actual bottleneck (measure first)
Copy/Paste Checklist:
```python
# Read replica routing with connection pooling (SQLAlchemy 2.0 style)
from sqlalchemy import create_engine, text

# Primary for writes
primary_engine = create_engine(
    "postgresql://user:pass@primary:5432/mydb",
    pool_size=5, max_overflow=10
)

# Replica for reads
replica_engine = create_engine(
    "postgresql://user:pass@replica:5432/mydb",
    pool_size=20, max_overflow=20
)

# Use the primary for writes, the replica for reads
def write_operation(sql, params):
    with primary_engine.begin() as conn:  # begin() commits on success
        return conn.execute(text(sql), params)

def read_operation(sql, params):
    with replica_engine.connect() as conn:
        return conn.execute(text(sql), params)
```
Real-World Case Study: Stack Overflow’s Scaling Journey
Stack Overflow’s scaling story is a textbook example of vertical-first scaling done right. In the early years, they ran on a single server with SQL Server. Their database fit comfortably on one server with 32GB of RAM. As traffic grew, they upgraded — from 32GB to 64GB, then to 96GB, then to multiple servers with different roles.
What made Stack Overflow’s approach work was their philosophy of not over-engineering. They did not shard until they genuinely needed to. When their database CPU hit sustained 80%+ utilization, they upgraded to a larger server. When disk I/O became the bottleneck, they moved from spinning disks to SSDs. Each upgrade bought them another 12-18 months of runway.
Their read replicas came when they needed to distribute read traffic, not before. Stack Overflow’s traffic is extremely read-heavy — roughly 95% reads, 5% writes. Adding read replicas let them distribute the read load across multiple servers while keeping writes on a single powerful primary.
The lesson: Stack Overflow did not implement caching, replicas, or sharding until measurements showed they needed it. By then, they understood their actual bottleneck and chose the right tool. Many teams preemptively implement complex scaling architectures based on hypothetical future load. Stack Overflow scaled by responding to real measurements.
Stack Overflow eventually moved to a distributed architecture with specialized servers for different workloads — but that came after years of vertical scaling bought them time to understand their traffic patterns.
Interview Questions
Q: Your database CPU is at 90% but memory utilization is only 40%. Adding more RAM does not help. What is the bottleneck and what do you do?
The bottleneck is CPU, not memory. More RAM would help if the bottleneck was cache misses — if you were hitting disk because data did not fit in memory, adding RAM would reduce disk I/O and CPU time spent waiting. But with CPU at 90% and memory at 40%, your working set fits in memory and the CPU is doing compute work. Options: upgrade CPU (vertical scaling), add read replicas to distribute read queries across multiple servers, optimize slow queries to reduce CPU per transaction, or implement caching for hot queries to reduce database load.
Q: Your application is read-heavy (90% reads, 10% writes). You have a single database server. What is the first scaling step?
The first step is adding read replicas. With 90% reads, distributing reads across replicas gives you nearly 10x read capacity without modifying application logic. You add one or more read replicas, configure the application to route writes to the primary and reads to replicas, and monitor replication lag to ensure reads are not too stale. Vertical scaling is also an option, but for a 90/10 read-heavy workload, replicas give more capacity per dollar than upgrading to a larger server.
Q: You are at 80% CPU on your primary database and adding replicas does not help because writes are the bottleneck. What do you do?
If writes are the bottleneck and replicas do not help (because replicas still funnel writes through the primary), you need to reduce write volume or distribute writes. Options: optimize write-heavy queries — maybe an UPDATE runs more often than necessary or an index is causing excessive write overhead; implement caching for data that does not need to be written to the database immediately; shard the database to distribute writes across multiple primaries; or move to a write-optimized database for specific write-heavy tables while keeping the rest on the primary.
Q: At what point do you decide to shard versus continue vertical scaling?
The decision should be data-driven, not time-based. Shard when you have exhausted vertical scaling (you cannot get a bigger server, or its cost exceeds what sharding would cost), when you have measured that writes — not reads — are the bottleneck (replicas do not help writes), and when your data size exceeds what a single server can reasonably store. If you have a 2TB database and your largest available server has 4TB of storage, you have headroom today but little runway for growth. If that server is already at 80% CPU and 90% storage, you need to shard or upgrade. The key metric is whether you can achieve your SLA for write latency and availability with your current architecture under projected load.
Conclusion
Database scaling follows a progression. Vertical scaling handles early growth. Read replicas scale read-heavy workloads. Caching reduces database load for hot data. Sharding distributes writes across servers.
Most applications never need sharding. Vertical scaling plus replicas handles impressive workloads. Only introduce sharding complexity when simpler solutions are exhausted.
Monitor your database. Understand where bottlenecks develop. Scale the dimension that constrains you. Both premature optimization and premature scaling waste engineering resources.
For related reading, see Relational Databases to understand database fundamentals, and Horizontal Sharding for deep-dive on sharding strategies.