Database Scaling: Vertical, Horizontal, and Read Replicas

Learn strategies for scaling databases beyond a single instance: vertical scaling, read replicas, write scaling, and when to choose each approach.



Every database hits limits eventually. A single server has finite CPU, memory, and disk speed. At some traffic level, you need to scale. The question is how.

I have seen teams jump to complex sharding when vertical scaling would have bought them years. I have also seen teams keep vertically scaling past the point where horizontal approaches would have been cheaper. The right answer depends on your workload.

Vertical Scaling First

Vertical scaling means bigger servers. More CPU cores, more memory, faster disks. This is the simplest approach. You take your existing database and move it to a more powerful machine.

The advantage is simplicity. No application changes required. Your single-server transactions stay ACID-compliant. Your queries work exactly as before.

The disadvantage is limits. Server sizes have caps. A database that needs 64 cores will not fit on a 32-core server. And cost scales faster than capacity: a server with twice the resources usually costs more than twice as much.

Vertical scaling works best early in a system’s life. When traffic is modest and growth is predictable, throwing bigger iron at the problem is often the cheapest solution in engineering time.

Read Replicas

Most application workloads are read-heavy. Eighty percent reads, twenty percent writes is common. Read replicas let you scale read capacity by copying data to additional servers.

graph TD
    App[Application] --> LB[Load Balancer]
    LB --> Primary[(Primary DB)]
    LB --> Replica1[(Read Replica 1)]
    LB --> Replica2[(Read Replica 2)]
    Primary -->|Replication| Replica1
    Primary -->|Replication| Replica2

The primary handles writes. Replicas handle reads. Replication is typically asynchronous, meaning reads might be slightly stale. For most applications, milliseconds of staleness is acceptable.

# Route reads to replicas, writes to primary
def get_user(user_id):
    # Read from replica
    return replica_db.query("SELECT * FROM users WHERE id = ?", user_id)

def update_user(user_id, data):
    # Write to primary
    primary_db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    # Invalidate or wait for replication

PostgreSQL, MySQL, and MongoDB all support read replicas. The implementation details differ but the pattern is the same. You gain read scale at the cost of replication lag.

Write Scaling Challenges

Read replicas solve read scale. Write scaling is harder. You cannot simply add more servers to handle writes because writes must be coordinated. Every write needs to be visible to all subsequent reads.

graph TD
    App[Application] --> Primary[(Primary DB)]
    App2[Application 2] --> Primary
    App3[Application 3] --> Primary
    Primary --> Replica1[(Read Replica)]
    Primary --> Replica2[(Read Replica)]

The primary is the bottleneck for writes. Vertical scaling helps up to a point, but eventually you need to distribute writes somehow.

Write scaling approaches include:

  1. Partitioning writes by functionality (different tables on different servers)
  2. Sharding across multiple primary servers
  3. Using systems designed for write distribution

Most applications hit write limits later than they expect. Vertical scaling plus read replicas handles surprisingly large workloads.
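Approach 1 (functional partitioning) can be sketched as a simple routing table. The hostnames and table groupings below are invented for illustration:

```python
# Hypothetical functional partitioning: each table family gets its own
# write primary (hostnames are made up for illustration).
WRITE_PRIMARIES = {
    "users": "users-primary.internal",
    "orders": "orders-primary.internal",
    "events": "events-primary.internal",
}

def primary_for_table(table: str) -> str:
    """Return the server that owns writes for this table family."""
    try:
        return WRITE_PRIMARIES[table]
    except KeyError:
        raise ValueError(f"no write primary configured for table {table!r}")
```

The cost of this approach is that transactions spanning table families are no longer possible on a single server, which is why it only defers, rather than solves, the write-scaling problem.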

Vendor-Specific Replication Differences

Each database handles replication differently. Understanding these differences helps you choose and operate your system correctly.

PostgreSQL Replication

PostgreSQL uses physical streaming replication. The primary sends WAL (Write-Ahead Log) records to replicas, which apply them continuously.

-- Create a physical replication slot
SELECT * FROM pg_create_physical_replication_slot('replica1_slot');

-- Check replication status
SELECT client_addr, state, sent_lsn, write_lsn
FROM pg_stat_replication;

Key characteristics:

  • Streaming replication is asynchronous by default
  • Synchronous replication available by setting synchronous_standby_names (together with synchronous_commit)
  • Supports read replicas with hot_standby
  • Logical replication enables table-level replication (for migrations, not scaling)

Connection pooling: PgBouncer is the standard. Use transaction-mode pooling for most applications:

; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20

MySQL Replication

MySQL uses binary log (binlog) replication. The primary writes transactions to the binlog, replicas request and apply them.

-- Check replication status
SHOW REPLICA STATUS\G

-- Configure binlog retention
SET GLOBAL binlog_expire_logs_seconds = 604800;

Key characteristics:

  • Default is asynchronous (semi-synchronous available)
  • GTID (Global Transaction ID) makes replication tracking reliable
  • log_replica_updates (formerly log_slave_updates) controls whether replicas write to their own binlog
  • Row-based replication is more consistent but produces larger binlogs

Connection pooling: ProxySQL is the standard for MySQL:

-- Add a backend pool in ProxySQL
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (10, 'replica1-host', 3306);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

MongoDB Replication

MongoDB uses a replica set with oplog (operations log). The primary applies operations to the oplog, secondaries tail the oplog and apply operations.

// Check replica set status
rs.status();

// Configure read preference
db.getMongo().setReadPref("secondaryPreferred");

// Add a new member
rs.add("mongodb2.example.com:27017");

Key characteristics:

  • Oplog is a capped collection that stores all write operations
  • Default read preference is primary (strong consistency)
  • Secondary reads are eventually consistent
  • Change Streams provide real-time change notifications

Sharding architecture:

// Add a shard to the cluster
sh.addShard("shard1-replica-set/mongodb1:27017");

// Shard a collection on a hashed key
sh.shardCollection("myapp.users", { _id: "hashed" });

// Enable balancing for the collection
sh.enableBalancing("myapp.users");

Database Sharding

Sharding distributes data across multiple servers. Each shard contains a subset of the data. Writes route to the appropriate shard based on a shard key.

graph LR
    App[Application] --> Router[(Shard Router)]
    Router -->|user:1000-1999| Shard1[(Shard 1)]
    Router -->|user:2000-2999| Shard2[(Shard 2)]
    Router -->|user:3000-3999| Shard3[(Shard 3)]

Choosing the shard key is critical. A bad key leads to uneven distribution. User ID often works well since writes are distributed by user.

Sharding is complex. Cross-shard queries are difficult. Joins across shards require querying multiple shards and combining results. Your application logic must understand the sharding scheme.

def get_user_shard(user_id):
    shard_id = int(user_id) % NUM_SHARDS
    return f"shard_{shard_id}"

def get_user(user_id):
    shard = get_user_shard(user_id)
    return shards[shard].query("SELECT * FROM users WHERE id = ?", user_id)
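A query that cannot be answered from a single shard becomes scatter-gather: ask every shard for its partial result, then merge. A minimal sketch, with an in-memory `shard_data` dict (invented data and field names) standing in for real shard connections:

```python
# Cross-shard reads become scatter-gather: query every shard, then merge.
# shard_data is a stand-in for real per-shard database connections.
shard_data = {
    0: [{"id": 1, "country": "DE"}, {"id": 4, "country": "US"}],
    1: [{"id": 2, "country": "DE"}, {"id": 5, "country": "FR"}],
    2: [{"id": 3, "country": "US"}],
}

def get_users_by_country(country):
    """Ask each shard for its matching rows, then merge and sort."""
    results = []
    for rows in shard_data.values():
        results.extend(r for r in rows if r["country"] == country)
    return sorted(results, key=lambda r: r["id"])
```

Note that latency is now bounded by the slowest shard, which is why production scatter-gather usually adds per-shard timeouts.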

Most databases have some sharding support. MongoDB shards automatically. PostgreSQL requires application-level sharding or extensions. CockroachDB distributes writes automatically but with different trade-offs.
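One caveat with plain modulo routing: changing NUM_SHARDS rehashes every key, forcing a full data reshuffle. A common variant (an assumption here, not something this article prescribes) hashes into a fixed slot space and maps slots to shards, so rebalancing moves slots rather than rehashing keys:

```python
import hashlib

NUM_SLOTS = 1024  # fixed virtual-slot space, independent of shard count

def slot_for_key(user_id) -> int:
    """Hash a key into a stable slot; md5 keeps this deterministic."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SLOTS

# Slots are assigned to physical shards via a lookup table; adding a
# shard means reassigning some slots, not rehashing every key.
slot_to_shard = {slot: slot % 4 for slot in range(NUM_SLOTS)}

def shard_for_user(user_id) -> int:
    return slot_to_shard[slot_for_key(user_id)]
```

This is essentially how Redis Cluster's hash slots and MongoDB's chunk ranges avoid whole-cluster rebalancing.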

Caching as a Buffer

Before scaling the database, consider caching. Memcached and Redis sit in front of databases and serve repeated queries from memory.

def get_user(user_id):
    # Try cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Cache miss, query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # Store in cache for next time
    redis.setex(f"user:{user_id}", 300, json.dumps(user))

    return user

Cache hit rates of 90% or higher are common. Caching dramatically reduces database load. Add caching before adding complexity.

The challenge is cache invalidation. When data changes, you must update or delete the cached version. TTLs provide automatic expiration but introduce staleness.
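The simplest invalidation discipline is delete-on-write: update the database first, then delete (rather than rewrite) the cached entry so the next read repopulates it. A sketch, with a plain dict standing in for Redis so the pattern is visible without infrastructure:

```python
cache = {}  # stand-in for Redis; keys mirror the "user:{id}" scheme above

def write_user_email(user_id, new_email, execute_write):
    """Update the primary first, then drop the stale cached copy."""
    execute_write(user_id, new_email)   # e.g. an UPDATE against the primary
    cache.pop(f"user:{user_id}", None)  # delete, don't rewrite, the entry
```

Deleting instead of updating avoids a race where two concurrent writers leave the cache holding the loser's value.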

When to Choose Each Approach

Start with vertical scaling. It is the simplest approach and handles most workloads longer than you expect. Monitor your database; when CPU or memory approaches its limit, evaluate your options.

Add read replicas when read latency increases or CPU on the primary maxes out from read load. Reads scale horizontally easily. Route reads to replicas, writes to primary.

Consider caching when you have repeated queries for the same data. Hot data caches well. Immutable or rarely-changed data caches best.

Shard when you hit write limits. Sharding is complex. Only introduce that complexity when vertical scaling and replication cannot help.

Capacity Planning

Vertical Scaling Formulas

Estimate when you will need to scale vertically based on current resource utilization and growth rate:

import math

def estimate_vertical_scale_time(
    current_cpu_percent: float,
    current_memory_percent: float,
    monthly_growth_rate: float,  # e.g., 0.15 for 15% monthly growth
    limit_percent: float = 90.0  # utilization that should trigger a scale-up
) -> dict:
    """
    Estimate when vertical scaling will be needed.
    Returns the limiting resource and estimated days until scale required.
    """
    # current * (1 + rate)^months = limit  =>  months = log(limit/current) / log(1 + rate)
    cpu_months = math.log(limit_percent / current_cpu_percent) / math.log(1 + monthly_growth_rate)

    # Memory typically grows more slowly than CPU (assume 70% of the rate)
    memory_months = math.log(limit_percent / current_memory_percent) / math.log(1 + monthly_growth_rate * 0.7)

    limiting_resource = 'CPU' if cpu_months < memory_months else 'Memory'
    days_until_scale = min(cpu_months, memory_months) * 30

    return {
        'limiting_resource': limiting_resource,
        'estimated_days_until_scale': int(days_until_scale),
        'monthly_growth_assumption': f"{monthly_growth_rate * 100:.0f}%"
    }

# Example: 60% CPU, 70% memory, 20% monthly growth
result = estimate_vertical_scale_time(60, 70, 0.20)
# Returns: limiting_resource='Memory', estimated_days_until_scale=~57 days

Connection pool sizing formula:

def recommended_pool_size(
    max_connections_per_query: int,
    expected_concurrent_requests: int,
    database_cpu_count: int
) -> int:
    """
    Calculate recommended connection pool size.
    Rule of thumb: (core_count * 2) + effective_spindle_count
    For SSDs: effective_spindle_count ~= 30
    """
    # General formula from PostgreSQL documentation
    baseline = database_cpu_count * 2

    # Add for SSD storage (rough approximation of IO capacity)
    effective_spindle_count = 30  # for SSD-backed storage

    recommended = baseline + effective_spindle_count

    # Cap at reasonable fraction of max_connections
    max_pool_size = max_connections_per_query * expected_concurrent_requests * 0.5

    return min(recommended, max_pool_size)

# Example: 8-core DB, 100 concurrent requests, each query uses 1 connection
pool_size = recommended_pool_size(1, 100, 8)  # Returns ~46

Read Replica Lag Estimation

Replication lag depends on write volume, network bandwidth, and replica hardware:

def estimated_replication_lag_seconds(
    write_throughput_mbps: float,   # sustained WAL transfer rate in MB/s
    network_latency_ms: float,
    replica_apply_latency_ms: float,
    wal_record_size_kb: float = 4.0
) -> float:
    """
    Estimate typical replication lag in seconds.
    """
    # Time to ship one WAL batch: convert KB to MB, divide by MB/s, then to ms
    transfer_time_ms = (wal_record_size_kb / 1024) / write_throughput_mbps * 1000

    # Network latency + transfer + apply time on the replica
    total_latency_ms = network_latency_ms + transfer_time_ms + replica_apply_latency_ms

    return total_latency_ms / 1000  # convert to seconds

# Example: 10 MB/s write throughput, 1ms LAN latency, 2ms apply
lag = estimated_replication_lag_seconds(10, 1.0, 2.0)
# Returns: ~0.003 seconds (3ms) for small records

Sharding Capacity Formula

def estimate_sharding_requirements(
    total_data_gb: int,
    write_tps: int,
    read_tps: int,
    target_shard_data_gb: int = 100  # GB per shard target
) -> dict:
    """
    Estimate number of shards needed and their characteristics.
    """
    # Data-based shard count
    data_shards = math.ceil(total_data_gb / target_shard_data_gb)

    # Write scaling: each shard can handle ~5000 TPS for simple writes
    writes_per_shard = write_tps / data_shards
    write_shards = math.ceil(write_tps / 5000) if writes_per_shard > 5000 else data_shards

    # Read scaling: each shard can handle ~10000 TPS with caching
    reads_per_shard = read_tps / data_shards
    read_shards = math.ceil(read_tps / 10000) if reads_per_shard > 10000 else data_shards

    total_shards = max(data_shards, write_shards, read_shards)

    return {
        'estimated_shards': total_shards,
        'data_gb_per_shard': total_data_gb / total_shards,
        'writes_per_shard_per_second': write_tps / total_shards,
        'reads_per_shard_per_second': read_tps / total_shards,
        'recommendation': 'shard' if total_shards > 1 else 'single_server'
    }

# Example: 500 GB data, 20000 write TPS, 100000 read TPS
result = estimate_sharding_requirements(500, 20000, 100000)
# Returns: 10 shards (read capacity drives the count), 50 GB/shard,
#          2,000 write TPS/shard, 10,000 read TPS/shard

Cost Comparison: Vertical vs Horizontal Scaling

| Dimension | Vertical Scaling | Horizontal Scaling (Replicas + Sharding) |
|---|---|---|
| Hardware cost | High (large servers are priced non-linearly) | Lower (commodity servers scale linearly) |
| Licensing cost | Often per-socket or per-core | Per-server (can use more, smaller licenses) |
| Operational complexity | Low (single server) | High (more servers, replication, failover) |
| Scaling elasticity | Discrete jumps (next server size) | Smooth (add servers incrementally) |
| Maximum scale | Limited by largest available server | Nearly unlimited (sharding) |
| Failure domain | Single point of failure | Can survive single-node failures |
| Read throughput | Limited to one server’s capacity | Multiplied by replica count |
| Write throughput | Limited to one server’s capacity | Requires sharding (complex) |
| Monthly cost trajectory | Step function (big jumps) | Linear growth with usage |

Break-even comparison:

| Scenario | Vertical (r6i.4xlarge) | Horizontal (3× r6i.xlarge) |
|---|---|---|
| Specifications | 16 vCPU, 128 GB RAM | 3 × (4 vCPU, 32 GB RAM) |
| Monthly AWS cost (us-east-1) | ~$1,200 | ~$1,350 |
| Write throughput | 15,000 TPS | ~20,000 TPS (with sharding) |
| Read throughput (cached) | 50,000 QPS | 150,000 QPS |
| Failure handling | Manual failover | Automatic with read replicas |
| Best for | Write-heavy, simple ops | Read-heavy, high availability |
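Using the (illustrative) numbers from the table above, a quick cost-per-capacity calculation shows where the break-even lands for a read-heavy workload:

```python
# Monthly cost per 1,000 cached read QPS, from the example table above.
vertical = {"monthly_usd": 1200, "read_qps": 50_000}
horizontal = {"monthly_usd": 1350, "read_qps": 150_000}

def usd_per_1k_read_qps(option) -> float:
    return option["monthly_usd"] / (option["read_qps"] / 1000)

# Vertical: $24 per 1k QPS; horizontal: $9 per 1k QPS in this scenario.
```

The picture reverses for write-heavy workloads, where the horizontal option's throughput depends on taking on sharding complexity.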

Scaling Timeline: When to Choose What

Use this as a rough guide for when to consider each scaling approach based on your workload:

| Scale Stage | Users / QPS | Primary Bottleneck | Recommended Approach |
|---|---|---|---|
| Startup | < 1K users, < 100 QPS | Simple queries | Single server, basic vertical |
| Growing | 1K–10K users, 100–1K QPS | Connection limits, some slow queries | Vertical + read replicas |
| Scaling | 10K–100K users, 1K–10K QPS | Read latency, replication lag | Add caching, more replicas |
| Large | 100K–1M users, 10K–50K QPS | Write throughput, data size | Sharding, connection pooling |
| Enterprise | > 1M users, > 50K QPS | Multi-region, compliance | Distributed database, read-heavy CQRS |

Decision triggers to watch:

def should_scale_vertically(
    cpu_percent_avg: float,
    memory_percent_avg: float,
    connections_used_percent: float
) -> dict:
    """Return scaling recommendations based on resource utilization."""
    triggers = []

    if cpu_percent_avg > 70:
        triggers.append("CPU > 70%: consider vertical scale or read replicas")
    if memory_percent_avg > 80:
        triggers.append("Memory > 80%: vertical scale or caching")
    if connections_used_percent > 70:
        triggers.append("Connections > 70%: add connection pooling or replicas")

    return {
        'should_scale': len(triggers) > 0,
        'triggers': triggers,
        'recommendation': 'vertical' if cpu_percent_avg > 70 else 'replicas'
    }

def should_add_caching(
    cache_hit_rate: float,
    repeated_query_percent: float,
    db_cpu_percent: float
) -> dict:
    """Determine if caching would help."""
    if repeated_query_percent > 30 and cache_hit_rate < 0.80:
        return {
            'recommend_caching': True,
            'reason': f"{repeated_query_percent:.0f}% repeated queries, "
                      f"only {cache_hit_rate:.0%} cache hit rate"
        }
    return {'recommend_caching': False}

def should_shard(
    data_size_gb: int,
    write_tps: int,
    single_server_max_tps: int,
    single_server_max_gb: int
) -> dict:
    """Determine if sharding is needed."""
    reasons = []
    if data_size_gb > single_server_max_gb * 0.7:
        reasons.append(f"Data ({data_size_gb} GB) approaching "
                       f"single server limit ({single_server_max_gb} GB)")
    if write_tps > single_server_max_tps * 0.7:
        reasons.append(f"Write TPS ({write_tps}) approaching "
                       f"single server limit ({single_server_max_tps})")

    return {
        'should_shard': len(reasons) > 0,
        'reasons': reasons
    }

Start with the simplest approach (vertical). Only add complexity when measurements show it is needed.

Consistency Considerations

Scaling introduces consistency trade-offs. Read replicas are typically asynchronously replicated. Writes to primary take time to propagate to replicas.

# Writes go to the primary...
primary_db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)

# ...but an immediate read from a replica may still return the old value
user = replica_db.query("SELECT * FROM users WHERE id = ?", user_id)
if user['email'] != new_email:
    # Replica has not yet applied the change (replication lag)
    pass

Read-after-write consistency is not guaranteed with replicas. A user might write data and immediately read it from a replica that has not yet received the write. Solutions include:

  1. Read from primary after writes
  2. Synchronous replication (higher latency, lower availability)
  3. Session-based routing (read from primary for a window after writes)

For most applications, eventual consistency is acceptable. Users rarely notice milliseconds of staleness. Choose stronger guarantees only when your requirements demand them.
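Option 3 (session-based routing) can be sketched as a per-session pin: after a write, route that session's reads to the primary for a short window assumed to exceed typical replication lag. The window length below is an arbitrary tunable, not a value from the article:

```python
import time

PIN_WINDOW_SECONDS = 5.0   # assumed to comfortably exceed replication lag
_last_write_at = {}        # session_id -> monotonic timestamp of last write

def record_write(session_id):
    _last_write_at[session_id] = time.monotonic()

def choose_target(session_id) -> str:
    """Read from the primary for a window after this session last wrote."""
    last = _last_write_at.get(session_id)
    if last is not None and time.monotonic() - last < PIN_WINDOW_SECONDS:
        return "primary"
    return "replica"
```

This gives read-your-writes semantics per session while still letting the vast majority of reads hit replicas.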

When to Use and When Not to Use Each Scaling Approach

Vertical Scaling:

  • Use when: Database CPU or memory approaches limits; you want simplicity
  • Do not use when: You have hit maximum server specifications; cost of large servers exceeds distributed alternatives

Read Replicas:

  • Use when: Read latency is increasing; primary CPU maxes out from read load; you need geographic distribution
  • Do not use when: You need strong read-after-write consistency; writes are the bottleneck

Caching:

  • Use when: You have repeated queries for the same data; hot data patterns exist; you want to reduce database load
  • Do not use when: Data changes frequently and cache invalidation is complex; additional infrastructure is not justified

Sharding:

  • Use when: You have exhausted vertical scaling and replicas; data size exceeds single server; write throughput exceeds single server capability
  • Do not use when: You can solve the problem with vertical scaling or caching; your team lacks operational expertise

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Vertical scaling hitting hardware limits | Database stops scaling up, performance degrades | Plan capacity ahead, migrate to distributed architecture before crisis |
| Read replica lag | Stale reads, user sees outdated data | Monitor lag, route critical reads to primary, use synchronous replication |
| Cache stampede | Database overwhelmed when cache expires | Implement cache locking, use jittered TTLs, warm cache proactively |
| Shard hotspot | Some shards overloaded while others idle | Choose better shard key, implement virtual nodes, rebalance |
| Connection pool exhaustion | New queries fail, application errors | Size connection pools appropriately, implement backpressure, use pooling middleware |
| Cross-shard query performance | Slow aggregations, timeouts | Denormalize for common queries, use scatter-gather with timeouts, pre-aggregate |
| Cache failure | Database hit by all requests | Implement cache redundancy, graceful degradation, circuit breakers |
| Failover not working | Extended downtime during primary failure | Test failover regularly, automate with health checks, have manual runbook |
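For the cache-stampede row, jittered TTLs are the cheapest mitigation: randomize each key's expiry so a batch of hot keys does not expire in the same instant. A sketch (the 10% jitter fraction is an arbitrary choice):

```python
import random

def jittered_ttl(base_seconds: int, jitter_fraction: float = 0.1) -> int:
    """Spread expirations so a burst of hot keys doesn't expire at once."""
    jitter = base_seconds * jitter_fraction
    return int(base_seconds + random.uniform(-jitter, jitter))

# e.g. redis.setex(key, jittered_ttl(300), value) instead of a fixed 300
```

Jitter smooths the expiry curve; cache locking or proactive warming is still needed if a single key is hot enough that even one miss overwhelms the database.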

Observability Checklist

Metrics to Monitor:

  • Query latency by type (read, write, analytical)
  • Database CPU, memory, disk I/O utilization
  • Connection pool utilization and wait times
  • Replication lag (seconds behind primary)
  • Cache hit ratio and memory utilization
  • Shard distribution and hotspot detection
  • Throughput (queries per second, transactions per second)
  • Queue depths and backpressure indicators

Logs to Capture:

  • Slow query logs (queries exceeding thresholds)
  • Connection events (opens, closes, failures)
  • Replication status changes and errors
  • Failover events and decision rationale
  • Cache evictions and expiration patterns
  • Application errors related to database access

Alerts to Set:

  • CPU or memory utilization > 80%
  • Replication lag > 30 seconds
  • Connection pool > 80% utilized
  • Cache hit ratio < threshold (e.g., < 80%)
  • Query latency p99 exceeding SLA
  • Shard skew exceeding threshold (e.g., > 2x average)
  • Disk usage > 85%

# Database scaling monitoring example
def check_db_health(db_config):
    metrics = {
        'cpu_percent': get_db_cpu(),
        'memory_percent': get_db_memory(),
        'connections': get_connection_count(),
        'replication_lag': get_replication_lag(),
        'cache_hit_ratio': get_cache_hit_ratio()
    }

    alerts = []
    if metrics['cpu_percent'] > 80:
        alerts.append(f"CPU critical: {metrics['cpu_percent']}%")
    if metrics['replication_lag'] > 30:
        alerts.append(f"Replication lag: {metrics['replication_lag']}s")

    return metrics, alerts

Security Checklist

  • Use strong authentication for all database connections
  • Implement TLS encryption for all database connections
  • Apply principle of least privilege for application database users
  • Rotate credentials regularly and use secret management
  • Audit database access logs for security review
  • Use network segmentation (databases not directly internet-accessible)
  • Encrypt data at rest (filesystem or application-level encryption)
  • Implement query whitelisting or restrictions for application users
  • Monitor for unauthorized access attempts
  • Test database access controls with penetration testing
  • Secure backup data with encryption
  • Implement read-only users separate from read-write users

Common Pitfalls and Anti-Patterns

  1. Scaling too late: Monitoring gaps cause sudden performance crises. Monitor proactively and scale before hitting limits.

  2. Premature sharding: Sharding introduces significant complexity. Vertical scaling plus caching handles most workloads longer than expected.

  3. Ignoring read-after-write consistency: With replicas, reads might return stale data immediately after writes. Implement appropriate consistency levels.

  4. Cache invalidation complexity: Caching stale data is worse than no caching. Ensure invalidation is correct before caching frequently-changing data.

  5. Connection pool misconfiguration: Too few connections starve the application. Too many exhaust database resources. Size appropriately.

  6. Not testing failover: If failover is broken, you only find out during an outage. Test regularly in staging.

  7. Sharding without clear access patterns: Without understanding how data is accessed, shard keys will be wrong. Design access patterns first.

  8. Caching without measuring: Cache adds complexity. Measure cache hit rates and only cache when it provides meaningful benefit.

Quick Recap

Key Bullets:

  • Scale vertically first, then add read replicas, then cache, then shard (in that order)
  • Read replicas solve read scale; sharding solves write scale
  • Caching provides the best ROI for reducing database load when access patterns allow
  • Monitor CPU, memory, connection counts, and replication lag
  • Test failover procedures before you need them in production
  • Choose scaling approach based on actual bottleneck (measure first)

Copy/Paste Checklist:

# Read replica routing with connection pooling
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Primary for writes
primary_engine = create_engine(
    "postgresql://user:pass@primary:5432/mydb",
    pool_size=5, max_overflow=10
)

# Replica for reads
replica_engine = create_engine(
    "postgresql://user:pass@replica:5432/mydb",
    pool_size=20, max_overflow=20
)

# Use primary for writes, replica for reads
def write_operation(sql, params):
    with primary_engine.connect() as conn:
        return conn.execute(sql, params)

def read_operation(sql, params):
    with replica_engine.connect() as conn:
        return conn.execute(sql, params)

Real-World Case Study: Stack Overflow’s Scaling Journey

Stack Overflow’s scaling story is a textbook example of vertical-first scaling done right. In the early years, they ran on a single server with SQL Server. Their database fit comfortably on one server with 32GB of RAM. As traffic grew, they upgraded — from 32GB to 64GB, then to 96GB, then to multiple servers with different roles.

What made Stack Overflow’s approach work was their philosophy of not over-engineering. They did not shard until they genuinely needed to. When their database CPU hit sustained 80%+ utilization, they upgraded to a larger server. When disk I/O became the bottleneck, they moved from spinning disks to SSDs. Each upgrade bought them another 12-18 months of runway.

Their read replicas came when they needed to distribute read traffic, not before. Stack Overflow’s traffic is extremely read-heavy — roughly 95% reads, 5% writes. Adding read replicas let them distribute the read load across multiple servers while keeping writes on a single powerful primary.

The lesson: Stack Overflow did not implement caching, replicas, or sharding until measurements showed they needed it. By then, they understood their actual bottleneck and chose the right tool. Many teams preemptively implement complex scaling architectures based on hypothetical future load. Stack Overflow scaled by responding to real measurements.

Stack Overflow eventually moved to a distributed architecture with specialized servers for different workloads — but that came after years of vertical scaling bought them time to understand their traffic patterns.

Interview Questions

Q: Your database CPU is at 90% but memory utilization is only 40%. Adding more RAM does not help. What is the bottleneck and what do you do?

The bottleneck is CPU, not memory. More RAM would help if the bottleneck was cache misses — if you were hitting disk because data did not fit in memory, adding RAM would reduce disk I/O and CPU time spent waiting. But with CPU at 90% and memory at 40%, your working set fits in memory and the CPU is doing compute work. Options: upgrade CPU (vertical scaling), add read replicas to distribute read queries across multiple servers, optimize slow queries to reduce CPU per transaction, or implement caching for hot queries to reduce database load.

Q: Your application is read-heavy (90% reads, 10% writes). You have a single database server. What is the first scaling step?

The first step is adding read replicas. With 90% reads, distributing reads across replicas gives you nearly 10x read capacity without modifying application logic. You add one or more read replicas, configure the application to route writes to the primary and reads to replicas, and monitor replication lag to ensure reads are not too stale. Vertical scaling is also an option, but for a 90/10 read-heavy workload, replicas give more capacity per dollar than upgrading to a larger server.

Q: You are at 80% CPU on your primary database and adding replicas does not help because writes are the bottleneck. What do you do?

If writes are the bottleneck and replicas do not help (because replicas still funnel writes through the primary), you need to reduce write volume or distribute writes. Options: optimize write-heavy queries — maybe an UPDATE runs more often than necessary or an index is causing excessive write overhead; implement caching for data that does not need to be written to the database immediately; shard the database to distribute writes across multiple primaries; or move to a write-optimized database for specific write-heavy tables while keeping the rest on the primary.

Q: At what point do you decide to shard versus continue vertical scaling?

The decision should be data-driven, not time-based. Shard when you have exhausted vertical scaling (cannot get a bigger server or the cost exceeds what sharding would cost), when you have measured that writes — not reads — are the bottleneck (replicas do not help writes), and when your data size exceeds what a single server can reasonably store. If you have a 2TB database and your largest available server has 4TB of storage, you have headroom but not for growth. If that server is already at 80% CPU and 90% storage, you need to shard or upgrade. The key metric is whether you can achieve your SLA for write latency and availability with your current architecture under projected load.

Conclusion

Database scaling follows a progression. Vertical scaling handles early growth. Read replicas scale read-heavy workloads. Caching reduces database load for hot data. Sharding distributes writes across servers.

Most applications never need sharding. Vertical scaling plus replicas handles impressive workloads. Only introduce sharding complexity when simpler solutions are exhausted.

Monitor your database. Understand where bottlenecks develop. Scale the dimension that constrains you. Both premature optimization and premature scaling waste engineering resources.

For related reading, see Relational Databases to understand database fundamentals, and Horizontal Sharding for deep-dive on sharding strategies.
