Database Scaling: Vertical, Horizontal, and Read Replicas
Learn strategies for scaling databases beyond a single instance: vertical scaling, read replicas, write scaling, and when to choose each approach.
Every database hits limits eventually. A single server has finite CPU, memory, and disk speed. At some traffic level, you need to scale. The question is how.
I have seen teams jump to complex sharding when vertical scaling would have bought them years. I have also seen teams keep vertically scaling past the point where horizontal approaches would have been cheaper. The right answer depends on your workload.
Vertical Scaling First
Vertical scaling means bigger servers. More CPU cores, more memory, faster disks. This is the simplest approach. You take your existing database and move it to a more powerful machine.
The advantage is simplicity. No application changes required. Your single-server transactions stay ACID-compliant. Your queries work exactly as before.
The disadvantage is limits. Server sizes have caps: a database that needs 64 cores will not fit on a server with 32. And costs scale faster than capacity; a server with twice the resources often costs more than twice as much.
Vertical scaling works best early in a system’s life. When traffic is modest and growth is predictable, throwing bigger iron at the problem is often the cheapest solution in engineering time.
Read Replicas
Most application workloads are read-heavy. Eighty percent reads, twenty percent writes is common. Read replicas let you scale read capacity by copying data to additional servers.
```mermaid
graph TD
    App[Application] --> LB[Load Balancer]
    LB --> Primary[(Primary DB)]
    LB --> Replica1[(Read Replica 1)]
    LB --> Replica2[(Read Replica 2)]
    Primary -->|Replication| Replica1
    Primary -->|Replication| Replica2
```
The primary handles writes. Replicas handle reads. Replication is typically asynchronous, meaning reads might be slightly stale. For most applications, milliseconds of staleness is acceptable.
```python
# Route reads to replicas, writes to primary
def get_user(user_id):
    # Read from a replica
    return replica_db.query("SELECT * FROM users WHERE id = ?", user_id)

def update_user(user_id, data):
    # Write to the primary
    primary_db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    # Invalidate cached copies or wait for replication
```
PostgreSQL, MySQL, and MongoDB all support read replicas. The implementation details differ but the pattern is the same. You gain read scale at the cost of replication lag.
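That trade-off can be made explicit in routing code. A minimal sketch, where the `primary`/`replica` handles and the one-second freshness budget are illustrative assumptions rather than any particular driver's API:

```python
MAX_ACCEPTABLE_LAG_SECONDS = 1.0  # assumed freshness budget for this read

def read_user(primary, replica, user_id, replica_lag_seconds):
    """Read from the replica only while its measured lag is within budget."""
    # Fall back to the primary when the replica is too stale
    source = replica if replica_lag_seconds <= MAX_ACCEPTABLE_LAG_SECONDS else primary
    return source.query("SELECT * FROM users WHERE id = ?", user_id)
```

The lag value would come from monitoring (for PostgreSQL, `pg_stat_replication`); passing it in keeps the routing decision testable.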
Write Scaling Challenges
Read replicas solve read scale. Write scaling is harder. You cannot simply add more servers to handle writes because writes must be coordinated. Every write needs to be visible to all subsequent reads.
```mermaid
graph TD
    App[Application] --> Primary[(Primary DB)]
    App2[Application 2] --> Primary
    App3[Application 3] --> Primary
    Primary --> Replica1[(Read Replica)]
    Primary --> Replica2[(Read Replica)]
```
The primary is the bottleneck for writes. Vertical scaling helps up to a point, but eventually you need to distribute writes somehow.
Write scaling approaches include:
- Partitioning writes by functionality (different tables on different servers)
- Sharding across multiple primary servers
- Using systems designed for write distribution
Most applications hit write limits later than they expect. Vertical scaling plus read replicas handles surprisingly large workloads.
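The first of those approaches, partitioning writes by functionality, can be as simple as a routing table. A sketch, with hypothetical server names:

```python
# Hypothetical mapping of table groups to dedicated primary servers
TABLE_TO_PRIMARY = {
    "users": "users-primary.internal",
    "orders": "orders-primary.internal",
    "events": "analytics-primary.internal",
}

def primary_for_table(table_name: str) -> str:
    # Tables without a dedicated server fall back to the default primary
    return TABLE_TO_PRIMARY.get(table_name, "default-primary.internal")
```

This buys write headroom without a shard key, at the cost of losing cross-server joins and transactions between the partitioned tables.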
Vendor-Specific Replication Differences
Each database handles replication differently. Understanding these differences helps you choose and operate your system correctly.
PostgreSQL Replication
PostgreSQL uses physical streaming replication. The primary sends WAL (Write-Ahead Log) records to replicas, which apply them continuously.
```sql
-- Create a physical replication slot
SELECT * FROM pg_create_physical_replication_slot('replica1_slot');

-- Check replication status
SELECT client_addr, state, sent_lsn, write_lsn
FROM pg_stat_replication;
```
Key characteristics:
- Streaming replication is asynchronous by default
- Synchronous replication is available via `synchronous_commit = on`
- Supports read replicas via `hot_standby`
- Logical replication enables table-level replication (useful for migrations, not scaling)
Connection pooling: PgBouncer is the standard. Use transaction-mode pooling for most applications:
```ini
; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```
MySQL Replication
MySQL uses binary log (binlog) replication. The primary writes transactions to the binlog, replicas request and apply them.
```sql
-- Check replication status
SHOW REPLICA STATUS\G

-- Configure binlog retention (7 days)
SET GLOBAL binlog_expire_logs_seconds = 604800;
```
Key characteristics:
- Default is asynchronous (semi-synchronous available)
- GTID (Global Transaction ID) makes replication tracking reliable
- `log_slave_updates` (now `log_replica_updates`) controls whether replicas write to their own binlog
- Row-based replication is more consistent but produces larger binlogs
Connection pooling: ProxySQL is the standard for MySQL:
```sql
-- Add a backend pool in ProxySQL
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (10, 'replica1-host', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```
MongoDB Replication
MongoDB uses a replica set with oplog (operations log). The primary applies operations to the oplog, secondaries tail the oplog and apply operations.
```javascript
// Check replica set status
rs.status();

// Configure read preference
db.getMongo().setReadPref("secondaryPreferred");

// Add a new member
rs.add("mongodb2.example.com:27017");
```
Key characteristics:
- Oplog is a capped collection that stores all write operations
- Default read preference is primary (strong consistency)
- Secondary reads are eventually consistent
- Change Streams provide real-time change notifications
Sharding architecture:
```javascript
// Add a shard to the cluster
sh.addShard("shard1-replica-set/mongodb1:27017");

// Enable balancing for a collection
sh.enableBalancing("myapp.users");
```
Database Sharding
Sharding distributes data across multiple servers. Each shard contains a subset of the data. Writes route to the appropriate shard based on a shard key.
```mermaid
graph LR
    App[Application] --> Router[(Shard Router)]
    Router -->|user:1000-1999| Shard1[(Shard 1)]
    Router -->|user:2000-2999| Shard2[(Shard 2)]
    Router -->|user:3000-3999| Shard3[(Shard 3)]
```
Choosing the shard key is critical. A bad key leads to uneven distribution. User ID often works well since writes are distributed by user.
Sharding is complex. Cross-shard queries are difficult. Joins across shards require querying multiple shards and combining results. Your application logic must understand the sharding scheme.
```python
NUM_SHARDS = 4  # total shard count (example value)

def get_user_shard(user_id):
    shard_id = int(user_id) % NUM_SHARDS
    return f"shard_{shard_id}"

def get_user(user_id):
    shard = get_user_shard(user_id)
    return shards[shard].query("SELECT * FROM users WHERE id = ?", user_id)
```
Most databases have some sharding support. MongoDB has sharding built in. PostgreSQL requires application-level sharding or an extension such as Citus. CockroachDB distributes writes automatically, with different trade-offs.
Caching as a Buffer
Before scaling the database, consider caching. Memcached and Redis sit in front of databases and serve repeated queries from memory.
```python
import json

def get_user(user_id):
    # Try the cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache miss: query the database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    # Store in the cache for next time (300-second TTL)
    redis.setex(f"user:{user_id}", 300, json.dumps(user))
    return user
```
Cache hit rates of 90% or higher are common. Caching dramatically reduces database load. Add caching before adding complexity.
The challenge is cache invalidation. When data changes, you must update or delete the cached version. TTLs provide automatic expiration but introduce staleness.
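The simplest reliable pattern is delete-on-write: update the database, then delete the cached key so the next read repopulates it. A self-contained sketch, where plain dicts stand in for the primary database and Redis:

```python
def update_user_email(db, cache, user_id, new_email):
    """Write to the primary first, then invalidate the cached copy."""
    db[user_id]["email"] = new_email      # stands in for the UPDATE on the primary
    cache.pop(f"user:{user_id}", None)    # delete-on-write; next read repopulates

db = {42: {"email": "old@example.com"}}
cache = {"user:42": {"email": "old@example.com"}}  # becomes stale after the write
update_user_email(db, cache, 42, "new@example.com")
```

Deleting rather than updating the cached value avoids writing a stale entry if two updates race; the cost is one extra cache miss per write.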
When to Choose Each Approach
Start with vertical scaling. It is the simplest option and handles most workloads for longer than you expect. Monitor your database; when CPU or memory approaches its limits, evaluate your options.
Add read replicas when read latency increases or CPU on the primary maxes out from read load. Reads scale horizontally easily. Route reads to replicas, writes to primary.
Consider caching when you have repeated queries for the same data. Hot data caches well. Immutable or rarely-changed data caches best.
Shard when you hit write limits. Sharding is complex. Only introduce that complexity when vertical scaling and replication cannot help.
Capacity Planning
Vertical Scaling Formulas
Estimate when you will need to scale vertically based on current resource utilization and growth rate:
```python
import math

def estimate_vertical_scale_time(
    current_cpu_percent: float,
    current_memory_percent: float,
    monthly_growth_rate: float,   # e.g., 0.15 for 15% monthly growth
    limit_percent: float = 90.0,  # utilization at which you must have scaled
) -> dict:
    """
    Estimate when vertical scaling will be needed.
    Returns the limiting resource and estimated days until scaling is required.
    """
    # Days until CPU hits the limit at compound monthly growth
    cpu_days = math.log(limit_percent / current_cpu_percent) / math.log(1 + monthly_growth_rate) * 30
    # Days until memory hits the limit (memory assumed to grow slower than CPU)
    memory_days = math.log(limit_percent / current_memory_percent) / math.log(1 + monthly_growth_rate * 0.7) * 30

    limiting_resource = 'CPU' if cpu_days < memory_days else 'Memory'
    days_until_scale = min(cpu_days, memory_days)
    return {
        'limiting_resource': limiting_resource,
        'estimated_days_until_scale': int(days_until_scale),
        'monthly_growth_assumption': f"{monthly_growth_rate * 100:.0f}%"
    }

# Example: 60% CPU, 70% memory, 20% monthly growth
result = estimate_vertical_scale_time(60, 70, 0.20)
# Returns: limiting_resource='Memory', estimated_days_until_scale=~57 days
```
Connection pool sizing formula:
```python
def recommended_pool_size(
    max_connections_per_query: int,
    expected_concurrent_requests: int,
    database_cpu_count: int
) -> int:
    """
    Calculate a recommended connection pool size.
    Rule of thumb: (core_count * 2) + effective_spindle_count
    For SSDs: effective_spindle_count ~= 30
    """
    # General formula from the PostgreSQL wiki
    baseline = database_cpu_count * 2
    # Add for SSD storage (rough approximation of I/O capacity)
    effective_spindle_count = 30  # for SSD-backed storage
    recommended = baseline + effective_spindle_count
    # Cap at a reasonable fraction of expected concurrent connection demand
    max_pool_size = max_connections_per_query * expected_concurrent_requests * 0.5
    return int(min(recommended, max_pool_size))

# Example: 8-core DB, 100 concurrent requests, each query uses 1 connection
pool_size = recommended_pool_size(1, 100, 8)  # Returns 46
```
Read Replica Lag Estimation
Replication lag depends on write volume, network bandwidth, and replica hardware:
```python
def estimated_replication_lag_seconds(
    write_throughput_mbps: float,   # replication bandwidth in MB/s
    network_latency_ms: float,
    replica_apply_latency_ms: float,
    wal_record_size_kb: float = 4.0
) -> float:
    """
    Estimate typical replication lag in seconds.
    """
    # Time to transfer WAL records for each write batch (KB -> MB, then s -> ms)
    transfer_time_ms = (wal_record_size_kb / 1024) / write_throughput_mbps * 1000
    # Network latency + transfer + apply time
    total_latency_ms = network_latency_ms + transfer_time_ms + replica_apply_latency_ms
    return total_latency_ms / 1000  # convert to seconds

# Example: 10 MB/s write throughput, 1ms LAN latency, 2ms apply
lag = estimated_replication_lag_seconds(10, 1.0, 2.0)
# Returns: ~0.0034 seconds (about 3.4 ms) for small records
```
Sharding Capacity Formula
```python
import math

def estimate_sharding_requirements(
    total_data_gb: int,
    write_tps: int,
    read_tps: int,
    target_shard_data_gb: int = 100  # GB per shard target
) -> dict:
    """
    Estimate the number of shards needed and their characteristics.
    Assumes ~5,000 write TPS and ~10,000 read TPS per shard.
    """
    # Data-based shard count
    data_shards = math.ceil(total_data_gb / target_shard_data_gb)
    # Write scaling: assume each shard handles ~5000 TPS for simple writes
    writes_per_shard = write_tps / data_shards
    write_shards = math.ceil(write_tps / 5000) if writes_per_shard > 5000 else data_shards
    # Read scaling: assume each shard handles ~10000 TPS with caching
    reads_per_shard = read_tps / data_shards
    read_shards = math.ceil(read_tps / 10000) if reads_per_shard > 10000 else data_shards

    total_shards = max(data_shards, write_shards, read_shards)
    return {
        'estimated_shards': total_shards,
        'data_gb_per_shard': total_data_gb / total_shards,
        'writes_per_shard_per_second': write_tps / total_shards,
        'reads_per_shard_per_second': read_tps / total_shards,
        'recommendation': 'shard' if total_shards > 1 else 'single_server'
    }

# Example: 500 GB data, 20,000 write TPS, 100,000 read TPS
result = estimate_sharding_requirements(500, 20000, 100000)
# Returns: 10 shards (read-driven), 50 GB/shard, 2,000 write TPS/shard, 10,000 read TPS/shard
```
Cost Comparison: Vertical vs Horizontal Scaling
| Dimension | Vertical Scaling | Horizontal Scaling (Replicas + Sharding) |
|---|---|---|
| Hardware cost | High (large-server prices grow non-linearly) | Lower (commodity servers scale linearly) |
| Licensing cost | Often per-socket or per-core | Per-server (can use more smaller licenses) |
| Operational complexity | Low (single server) | High (more servers, replication, failover) |
| Scaling elasticity | Discrete jumps (next server size) | Smooth (add servers incrementally) |
| Maximum scale | Limited by largest available server | Nearly unlimited (sharding) |
| Failure domain | Single point of failure | Can survive single node failures |
| Read throughput | Limited to one server’s capacity | Multiplied by replica count |
| Write throughput | Limited to one server’s capacity | Requires sharding (complex) |
| Monthly cost trajectory | Step function (big jumps) | Linear growth with usage |
Break-even comparison:
| Scenario | Vertical (r6i.4xlarge) | Horizontal (3x r6i.xlarge) |
|---|---|---|
| Specifications | 16 vCPU, 128 GB RAM | 3 × (4 vCPU, 32 GB RAM) |
| Monthly AWS cost (us-east-1) | ~$1,200 | ~$1,350 |
| Write throughput | 15,000 TPS | ~20,000 TPS (with sharding) |
| Read throughput (cached) | 50,000 QPS | 150,000 QPS |
| Failure handling | Manual failover | Automatic with read replicas |
| Best for | Write-heavy, simple ops | Read-heavy, high availability |
Scaling Timeline: When to Choose What
Use this as a rough guide for when to consider each scaling approach based on your workload:
| Scale Stage | Users / QPS | Primary Bottleneck | Recommended Approach |
|---|---|---|---|
| Startup | < 1K users, < 100 QPS | Simple queries | Single server, basic vertical |
| Growing | 1K-10K users, 100-1K QPS | Connection limits, some slow queries | Vertical + read replicas |
| Scaling | 10K-100K users, 1K-10K QPS | Read latency, replication lag | Add caching, more replicas |
| Large | 100K-1M users, 10K-50K QPS | Write throughput, data size | Sharding, connection pooling |
| Enterprise | > 1M users, > 50K QPS | Multi-region, compliance | Distributed database, read-heavy CQRS |
Decision triggers to watch:
```python
def should_scale_vertically(
    cpu_percent_avg: float,
    memory_percent_avg: float,
    connections_used_percent: float
) -> dict:
    """Return scaling recommendations based on resource utilization."""
    triggers = []
    if cpu_percent_avg > 70:
        triggers.append("CPU > 70%: consider vertical scale or read replicas")
    if memory_percent_avg > 80:
        triggers.append("Memory > 80%: vertical scale or caching")
    if connections_used_percent > 70:
        triggers.append("Connections > 70%: add connection pooling or replicas")
    return {
        'should_scale': len(triggers) > 0,
        'triggers': triggers,
        'recommendation': 'vertical' if cpu_percent_avg > 70 else 'replicas'
    }

def should_add_caching(
    cache_hit_rate: float,
    repeated_query_percent: float,
    db_cpu_percent: float
) -> dict:
    """Determine whether caching would help."""
    if repeated_query_percent > 30 and cache_hit_rate < 0.80:
        return {
            'recommend_caching': True,
            'reason': f"{repeated_query_percent:.0f}% repeated queries, "
                      f"only {cache_hit_rate:.0%} cache hit rate"
        }
    return {'recommend_caching': False}

def should_shard(
    data_size_gb: int,
    write_tps: int,
    single_server_max_tps: int,
    single_server_max_gb: int
) -> dict:
    """Determine whether sharding is needed."""
    reasons = []
    if data_size_gb > single_server_max_gb * 0.7:
        reasons.append(f"Data ({data_size_gb} GB) approaching "
                       f"single server limit ({single_server_max_gb} GB)")
    if write_tps > single_server_max_tps * 0.7:
        reasons.append(f"Write TPS ({write_tps}) approaching "
                       f"single server limit ({single_server_max_tps})")
    return {
        'should_shard': len(reasons) > 0,
        'reasons': reasons
    }
```
Start with the simplest approach (vertical). Only add complexity when measurements show it is needed.
Consistency Considerations
Scaling introduces consistency trade-offs. Read replicas are typically asynchronously replicated. Writes to primary take time to propagate to replicas.
```python
# The write goes to the primary...
primary_db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)

# ...but an immediate read from a replica may still return the old value
user = replica_db.query("SELECT * FROM users WHERE id = ?", user_id)
if user['email'] != new_email:
    pass  # replica has not applied the write yet; the user briefly sees stale data
```
Read-after-write consistency is not guaranteed with replicas. A user might write data and immediately read it from a replica that has not yet received the write. Solutions include:
- Read from primary after writes
- Synchronous replication (higher latency, lower availability)
- Session-based routing (read from primary for a window after writes)
For most applications, eventual consistency is acceptable. Users rarely notice milliseconds of staleness. Choose stronger guarantees only when your requirements demand them.
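The session-based routing option above can be sketched as a per-session "sticky" window. The five-second window and the clock injection are assumptions for illustration; the window should exceed your typical replication lag:

```python
import time

class SessionRouter:
    """Route a session's reads to the primary for a window after its last write."""

    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.last_write = {}  # session_id -> timestamp of last write

    def record_write(self, session_id):
        self.last_write[session_id] = self.clock()

    def choose_target(self, session_id):
        last = self.last_write.get(session_id)
        if last is not None and self.clock() - last < self.window:
            return "primary"   # read-your-writes within the window
        return "replica"       # otherwise spread reads across replicas
```

This gives read-after-write consistency for the writing session only, which is usually the session that notices staleness.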
When to Use and When Not to Use Each Scaling Approach
Vertical Scaling:
- Use when: Database CPU or memory approaches limits; you want simplicity
- Do not use when: You have hit maximum server specifications; cost of large servers exceeds distributed alternatives
Read Replicas:
- Use when: Read latency is increasing; primary CPU maxes out from read load; you need geographic distribution
- Do not use when: You need strong read-after-write consistency; writes are the bottleneck
Caching:
- Use when: You have repeated queries for the same data; hot data patterns exist; you want to reduce database load
- Do not use when: Data changes frequently and cache invalidation is complex; additional infrastructure is not justified
Sharding:
- Use when: You have exhausted vertical scaling and replicas; data size exceeds single server; write throughput exceeds single server capability
- Do not use when: You can solve the problem with vertical scaling or caching; your team lacks operational expertise
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Vertical scaling hitting hardware limits | Database stops scaling up, performance degrades | Plan capacity ahead, migrate to distributed architecture before crisis |
| Read replica lag | Stale reads, user sees outdated data | Monitor lag, route critical reads to primary, use synchronous replication |
| Cache stampede | Database overwhelmed when cache expires | Implement cache locking, use jittered TTLs, warm cache proactively |
| Shard hotspot | Some shards overloaded while others idle | Choose better shard key, implement virtual nodes, rebalance |
| Connection pool exhaustion | New queries fail, application errors | Size connection pools appropriately, implement backpressure, use pooling middleware |
| Cross-shard query performance | Slow aggregations, timeouts | Denormalize for common queries, use scatter-gather with timeouts, pre-aggregate |
| Cache failure | Database hit by all requests | Implement cache redundancy, graceful degradation, circuit breakers |
| Failover not working | Extended downtime during primary failure | Test failover regularly, automate with health checks, have manual runbook |
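One of the mitigations above, jittered TTLs against cache stampedes, takes only a few lines. A sketch, where the 300-second base TTL and 10% jitter are assumed values:

```python
import random

def jittered_ttl(base_seconds=300, jitter_fraction=0.10):
    """Spread expirations so keys cached together do not all expire at once."""
    jitter = base_seconds * jitter_fraction
    # Uniform jitter in [-10%, +10%] of the base TTL
    return base_seconds + random.uniform(-jitter, jitter)
```

Use it wherever you set a TTL, e.g. `redis.setex(key, int(jittered_ttl()), value)`, so a burst of keys written together does not expire together and hammer the database in one instant.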
Observability Checklist
Metrics to Monitor:
- Query latency by type (read, write, analytical)
- Database CPU, memory, disk I/O utilization
- Connection pool utilization and wait times
- Replication lag (seconds behind primary)
- Cache hit ratio and memory utilization
- Shard distribution and hotspot detection
- Throughput (queries per second, transactions per second)
- Queue depths and backpressure indicators
Logs to Capture:
- Slow query logs (queries exceeding thresholds)
- Connection events (opens, closes, failures)
- Replication status changes and errors
- Failover events and decision rationale
- Cache evictions and expiration patterns
- Application errors related to database access
Alerts to Set:
- CPU or memory utilization > 80%
- Replication lag > 30 seconds
- Connection pool > 80% utilized
- Cache hit ratio < threshold (e.g., < 80%)
- Query latency p99 exceeding SLA
- Shard skew exceeding threshold (e.g., > 2x average)
- Disk usage > 85%
```python
# Database scaling monitoring example
# (get_db_cpu and friends are metric-collection stubs to wire to your monitoring)
def check_db_health():
    metrics = {
        'cpu_percent': get_db_cpu(),
        'memory_percent': get_db_memory(),
        'connections': get_connection_count(),
        'replication_lag': get_replication_lag(),
        'cache_hit_ratio': get_cache_hit_ratio()
    }
    alerts = []
    if metrics['cpu_percent'] > 80:
        alerts.append(f"CPU critical: {metrics['cpu_percent']}%")
    if metrics['replication_lag'] > 30:
        alerts.append(f"Replication lag: {metrics['replication_lag']}s")
    return metrics, alerts
```
Security Checklist
- Use strong authentication for all database connections
- Implement TLS encryption for all database connections
- Apply principle of least privilege for application database users
- Rotate credentials regularly and use secret management
- Audit database access logs for security review
- Use network segmentation (databases not directly internet-accessible)
- Encrypt data at rest (filesystem or application-level encryption)
- Implement query whitelisting or restrictions for application users
- Monitor for unauthorized access attempts
- Test database access controls with penetration testing
- Secure backup data with encryption
- Implement read-only users separate from read-write users
Common Pitfalls and Anti-Patterns
- Scaling too late: Monitoring gaps cause sudden performance crises. Monitor proactively and scale before hitting limits.
- Premature sharding: Sharding introduces significant complexity. Vertical scaling plus caching handles most workloads longer than expected.
- Ignoring read-after-write consistency: With replicas, reads might return stale data immediately after writes. Implement appropriate consistency levels.
- Cache invalidation complexity: Caching stale data is worse than no caching. Ensure invalidation is correct before caching frequently-changing data.
- Connection pool misconfiguration: Too few connections starve the application. Too many exhaust database resources. Size appropriately.
- Not testing failover: If failover is broken, you only find out during an outage. Test regularly in staging.
- Sharding without clear access patterns: Without understanding how data is accessed, shard keys will be wrong. Design access patterns first.
- Caching without measuring: Cache adds complexity. Measure cache hit rates and only cache when it provides meaningful benefit.
Quick Recap
Key Bullets:
- Scale vertically first, then add read replicas, then cache, then shard (in that order)
- Read replicas solve read scale; sharding solves write scale
- Caching provides the best ROI for reducing database load when access patterns allow
- Monitor CPU, memory, connection counts, and replication lag
- Test failover procedures before you need them in production
- Choose scaling approach based on actual bottleneck (measure first)
Copy/Paste Checklist:
```python
# Read replica routing with connection pooling (SQLAlchemy 2.0 style)
from sqlalchemy import create_engine, text

# Primary for writes
primary_engine = create_engine(
    "postgresql://user:pass@primary:5432/mydb",
    pool_size=5, max_overflow=10
)

# Replica for reads
replica_engine = create_engine(
    "postgresql://user:pass@replica:5432/mydb",
    pool_size=20, max_overflow=20
)

# Use the primary for writes, the replica for reads
def write_operation(sql, params):
    with primary_engine.begin() as conn:  # begin() commits on success
        return conn.execute(text(sql), params)

def read_operation(sql, params):
    with replica_engine.connect() as conn:
        return conn.execute(text(sql), params)
```
Real-World Case Study: Stack Overflow’s Scaling Journey
Stack Overflow’s scaling story is a textbook example of vertical-first scaling done right. In the early years, they ran on a single server with SQL Server. Their database fit comfortably on one server with 32GB of RAM. As traffic grew, they upgraded — from 32GB to 64GB, then to 96GB, then to multiple servers with different roles.
What made Stack Overflow’s approach work was their philosophy of not over-engineering. They did not shard until they genuinely needed to. When their database CPU hit sustained 80%+ utilization, they upgraded to a larger server. When disk I/O became the bottleneck, they moved from spinning disks to SSDs. Each upgrade bought them another 12-18 months of runway.
Their read replicas came when they needed to distribute read traffic, not before. Stack Overflow’s traffic is extremely read-heavy — roughly 95% reads, 5% writes. Adding read replicas let them distribute the read load across multiple servers while keeping writes on a single powerful primary.
The lesson: Stack Overflow did not implement caching, replicas, or sharding until measurements showed they needed it. By then, they understood their actual bottleneck and chose the right tool. Many teams preemptively implement complex scaling architectures based on hypothetical future load. Stack Overflow scaled by responding to real measurements.
Stack Overflow eventually moved to a distributed architecture with specialized servers for different workloads — but that came after years of vertical scaling bought them time to understand their traffic patterns.
Interview Questions
Q: Your database CPU is at 90% but memory utilization is only 40%. Adding more RAM does not help. What is the bottleneck and what do you do?
The bottleneck is CPU, not memory. More RAM would help if the bottleneck was cache misses — if you were hitting disk because data did not fit in memory, adding RAM would reduce disk I/O and CPU time spent waiting. But with CPU at 90% and memory at 40%, your working set fits in memory and the CPU is doing compute work. Options: upgrade CPU (vertical scaling), add read replicas to distribute read queries across multiple servers, optimize slow queries to reduce CPU per transaction, or implement caching for hot queries to reduce database load.
Q: Your application is read-heavy (90% reads, 10% writes). You have a single database server. What is the first scaling step?
The first step is adding read replicas. With 90% reads, distributing reads across replicas gives you nearly 10x read capacity without modifying application logic. You add one or more read replicas, configure the application to route writes to the primary and reads to replicas, and monitor replication lag to ensure reads are not too stale. Vertical scaling is also an option, but for a 90/10 read-heavy workload, replicas give more capacity per dollar than upgrading to a larger server.
Q: You are at 80% CPU on your primary database and adding replicas does not help because writes are the bottleneck. What do you do?
If writes are the bottleneck and replicas do not help (because replicas still funnel writes through the primary), you need to reduce write volume or distribute writes. Options: optimize write-heavy queries — maybe an UPDATE runs more often than necessary or an index is causing excessive write overhead; implement caching for data that does not need to be written to the database immediately; shard the database to distribute writes across multiple primaries; or move to a write-optimized database for specific write-heavy tables while keeping the rest on the primary.
Q: At what point do you decide to shard versus continue vertical scaling?
The decision should be data-driven, not time-based. Shard when you have exhausted vertical scaling (you cannot get a bigger server, or its cost exceeds what sharding would cost), when you have measured that writes — not reads — are the bottleneck (replicas do not help writes), and when your data size exceeds what a single server can reasonably store. If you have a 2TB database and your largest available server has 4TB of storage, you have headroom today but little runway for growth. If that server is already at 80% CPU and 90% storage, you need to shard or upgrade. The key metric is whether you can achieve your SLA for write latency and availability with your current architecture under projected load.
Conclusion
Database scaling follows a progression. Vertical scaling handles early growth. Read replicas scale read-heavy workloads. Caching reduces database load for hot data. Sharding distributes writes across servers.
Most applications never need sharding. Vertical scaling plus replicas handles impressive workloads. Only introduce sharding complexity when simpler solutions are exhausted.
Monitor your database. Understand where bottlenecks develop. Scale the dimension that constrains you. Both premature optimization and premature scaling waste engineering resources.
For related reading, see Relational Databases to understand database fundamentals, and Horizontal Sharding for deep-dive on sharding strategies.