Database Capacity Planning: A Practical Guide
Plan for growth before you hit walls. This guide covers growth forecasting, compute and storage sizing, IOPS requirements, and cloud vs on-prem decisions.
Capacity planning is the unsexy work that prevents the dramatic failures. Nobody writes blog posts about databases that had enough capacity. They write posts about databases that ran out of disk at 3am, or whose CPU maxed out during a product launch, or that fell over because connection pools were exhausted.
This guide covers how to size database infrastructure for current needs and future growth, with practical formulas and monitoring approaches.
Growth Forecasting
Before you can plan capacity, you need to understand growth. How is your data growing? How are your users growing? What about query volume?
Data Growth Modeling
Start with your current data volume. How large is your database today?
SELECT
pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
WHERE datname = current_database();
For per-table sizes:
SELECT
relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
Project growth by analyzing historical trends. If you have doubled in size over the past year, plan for that trajectory—but also consider whether that growth rate is likely to continue, accelerate, or slow.
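The projection above can be sketched with a small helper, assuming roughly linear growth. The function name, sample figures, and 40% headroom factor are illustrative:

```python
# Sketch: project database size from evenly spaced historical samples,
# assuming roughly linear growth. All figures are hypothetical.
def project_size(samples_gb, days_between_samples, days_ahead, headroom=1.4):
    """Estimate future size, with headroom, from evenly spaced size samples."""
    # Average daily growth observed across the sampled window.
    total_days = days_between_samples * (len(samples_gb) - 1)
    daily_growth = (samples_gb[-1] - samples_gb[0]) / total_days
    projected = samples_gb[-1] + daily_growth * days_ahead
    return projected * headroom  # provision above the raw projection

# Monthly size samples over five months: 100 GB grew to 140 GB.
samples = [100, 110, 120, 130, 140]
print(project_size(samples, days_between_samples=30, days_ahead=180))
```

Revisit the fitted rate regularly; a linear fit silently understates exponential growth.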
Workload Characterization
Data volume alone does not tell the whole story. A database with 100GB might be perfectly fine with modest resources, or it might be hopelessly underpowered depending on query patterns. Characterize your workload:
- Read-heavy vs write-heavy: Read-heavy workloads benefit from more memory (cache). Write-heavy workloads need I/O bandwidth.
- Query complexity: Simple point queries need different resources than complex analytical queries with large joins.
- Connection patterns: Many short connections vs fewer long-running queries.
Sizing Compute (CPU)
CPU sizing depends on query complexity and concurrency requirements. A database running simple CRUD operations needs less CPU than one running complex analytical queries.
For CPU, focus on single-threaded performance. PostgreSQL can parallelize only certain operations within a single query, and many queries still run on one core. A database with 8 fast cores is often better than one with 16 slow cores.
Rule of thumb for general workloads: start with 4-8 cores for a production database and scale up based on monitoring. If you consistently see CPU above 70-80%, that is a signal to upgrade.
# Check CPU usage
top -bn1 | grep "Cpu(s)"
In cloud environments, remember that burstable instance types run on CPU credits: they work for development but can throttle during sustained load.
Sizing Memory
Memory is where PostgreSQL caches data and does its work. More memory means fewer disk I/O operations, which are orders of magnitude slower than memory access.
The key metric is effective cache size:
SHOW effective_cache_size;
This setting does not allocate memory; it tells the query planner how much memory is likely available for caching (shared buffers plus the OS page cache), which influences plan choices such as index scans. A reasonable starting point is 75% of your total RAM. If you have 32GB RAM, set effective_cache_size to 24GB.
For dedicated database servers, allocate roughly:
- 25% for operating system and other processes
- 25% for PostgreSQL shared buffers (or less—modern guidance suggests 25% is often too high)
- 50%+ for operating system cache (used for PostgreSQL data files)
PostgreSQL’s shared_buffers default of 128MB is far too small for production. For a server with 32GB RAM, shared_buffers of 8-16GB is reasonable.
-- Check current shared buffers
SHOW shared_buffers;
-- Check memory settings
SHOW max_connections;
SHOW work_mem;
SHOW maintenance_work_mem;
work_mem is per-operation (sort, hash) and maintenance_work_mem is for maintenance operations like VACUUM and CREATE INDEX. Both should be sized appropriately for your workload.
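The allocation guidance above can be sketched as a small helper, assuming a dedicated database server; the function name is illustrative and the percentages are starting points, not hard rules:

```python
# Sketch: derive starting memory settings from total RAM, following the
# 25% shared_buffers / 75% effective_cache_size guidance above.
def memory_settings(total_ram_gb):
    return {
        "shared_buffers_gb": total_ram_gb * 0.25,       # PostgreSQL buffer pool
        "effective_cache_size_gb": total_ram_gb * 0.75,  # planner hint, not an allocation
        "os_reserved_gb": total_ram_gb * 0.25,           # OS and other processes
    }

# For the 32GB example in the text: shared_buffers 8GB, effective_cache_size 24GB.
print(memory_settings(32))
```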
Sizing Storage
Storage sizing requires estimating current data size, growth rate, and headroom.
Current and Projected Size
Calculate current table and index sizes (see queries above). Add 30-50% for headroom and growth between storage expansions.
Storage Type Considerations
Different storage types suit different workloads:
- SSD (NVMe): Best for high I/O mixed read/write workloads. More expensive but faster.
- HDD: Cheaper per GB, suitable for large analytical databases with sequential I/O.
- Network storage: Common in cloud environments. Latency varies—understand your cloud provider’s performance characteristics.
For most OLTP workloads, SSD or NVMe storage is the better choice. The performance difference is significant.
IOPS Requirements
IOPS (Input/Output Operations Per Second) measures how many discrete I/O operations storage can service per second; it is distinct from throughput (MB/s), which matters for sequential access. Database workloads are I/O intensive, especially with poor cache hit rates.
Estimate required IOPS by understanding your workload. The pg_stat_statements extension (which must be enabled separately) provides aggregate query timing:
-- Aggregate query execution time
SELECT
sum(total_exec_time) AS total_time,
sum(total_exec_time) / sum(calls) AS avg_time,
sum(calls) AS total_calls
FROM pg_stat_statements;
A rough sanity check: if your database is doing 10,000 IOPS and each I/O takes 0.5ms, that is 5 device-seconds of I/O per wall-clock second, which means the storage must sustain an average queue depth of about 5. That is routine for SSDs but impossible for a single HDD, which tops out around 100-200 IOPS.
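That arithmetic generalizes: I/O demand expressed in device-seconds per wall-clock second is the average queue depth the storage must sustain. A small illustrative helper:

```python
# Sketch: I/O demand as device-seconds per wall-clock second, which is
# the average queue depth the storage must sustain.
def io_queue_depth(iops, latency_ms):
    return iops * (latency_ms / 1000.0)

print(io_queue_depth(10_000, 0.5))  # SSD-class latency: ~5 concurrent I/Os in flight
print(io_queue_depth(10_000, 8.0))  # HDD-class latency: hopeless for a single disk
```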
Cloud providers specify IOPS for their storage volumes. Make sure your volume type supports your workload’s I/O requirements. PostgreSQL’s bgwriter stats give you insight into write I/O patterns:
SELECT
buffers_alloc,
buffers_backend,
buffers_backend_fsync,
buffers_clean,
maxwritten_clean
FROM pg_stat_bgwriter;
Connection Pool Sizing
PostgreSQL connections have overhead. Each connection is a separate backend process that consumes several megabytes of memory on its own, plus up to work_mem per sort or hash operation while queries run, and competes for CPU. A database with 1000 idle connections is less efficient than one with 100 busy connections.
Direct Connection Limits
max_connections sets the absolute maximum:
SHOW max_connections;
Typical production values range from 100 to 1000 depending on workload. More is not always better—each idle connection consumes resources.
Connection Poolers
For most production workloads, use a connection pooler like PgBouncer or Pgpool-II:
; PgBouncer configuration
[databases]
mydb = host=localhost dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100
PgBouncer pools connections at the transaction level. This allows many application connections to share few database connections. The tradeoff is that you cannot use session-level features (like prepared statements in some configurations, or SET statements that persist across transactions).
Pool Size Calculation
A reasonable starting formula for default_pool_size (a variant of the classic (core_count * 2) + effective_spindle_count rule, with effective_io_concurrency standing in for spindle count on SSDs):
pool_size = (core_count * 2) + effective_io_concurrency
For a 4-core database with SSD storage:
pool_size = (4 * 2) + 4 = 12
But this varies by workload. OLTP workloads with short transactions can often use larger pools. Analytical workloads with long transactions need smaller pools.
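As a quick sanity check, the starting formula above is trivial to compute; the second example's figures are hypothetical:

```python
# Sketch of the pool-size starting formula quoted above. Treat the
# result as a starting point to tune against monitoring, not a rule.
def pool_size(core_count, effective_io_concurrency):
    return core_count * 2 + effective_io_concurrency

print(pool_size(4, 4))   # the 4-core SSD example from the text: 12
print(pool_size(16, 8))  # a larger hypothetical box
```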
Monitor connection utilization:
-- Current connection counts
SELECT
state,
COUNT(*)
FROM pg_stat_activity
GROUP BY state;
Cloud vs On-Premises Tradeoffs
The cloud versus on-premises decision affects capacity planning significantly.
Cloud Advantages
- Elasticity: Scale up quickly when needed. Scale down to save money during low-traffic periods.
- Variety: Choose instance types optimized for compute, memory, or I/O.
- Managed services: RDS, Cloud SQL, and similar services handle some operational burden.
Cloud Disadvantages
- Cost at scale: At large scale, cloud database costs become significant. A $50,000/month database in the cloud might cost $10,000/month on-premises.
- Less control: You cannot tune kernel parameters or optimize storage as freely.
- IOPS costs: Cloud storage IOPS often costs extra. High-IOPS workloads can become expensive.
Hybrid Approaches
Many organizations use a hybrid approach: core production data on-premises or in a private cloud, with burst capacity or read replicas in public cloud for traffic spikes.
Monitoring Trends to Predict Capacity Needs
Capacity planning is not a one-time exercise. You need ongoing monitoring to predict when you will need to scale.
Key Metrics to Track
- Disk utilization: If consistently above 70-80%, plan storage expansion.
- CPU utilization: Sustained above 70% indicates underpowered CPU.
- Memory pressure: Check for swap usage, which indicates insufficient RAM.
- Connection counts: Watch for connection counts approaching max_connections.
- Query latency trends: Slow queries often precede capacity crises.
PostgreSQL does not expose filesystem free space directly; check it at the operating system level:
# Check disk space on the volume holding the data directory
df -h /var/lib/postgresql
Adjust the path to your data directory. Per-database and per-table sizes can be tracked from within PostgreSQL using the pg_database_size and pg_total_relation_size queries shown earlier.
Capacity Triggers
Establish thresholds that trigger review:
- Disk usage above 70%: Investigate growth causes, plan storage expansion
- CPU above 80% sustained: Plan CPU upgrade or read scaling
- Memory pressure indicators: Plan RAM upgrade or query optimization
- Connection counts above 70% of max: Review connection pooling or increase max_connections
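The triggers above can be encoded as a simple threshold check, for example in a scheduled monitoring job; the metric names, thresholds, and sample values here are illustrative:

```python
# Sketch: evaluate the capacity triggers listed above against current
# metrics. Names, thresholds, and sample values are hypothetical.
THRESHOLDS = {
    "disk_pct": 70,         # plan storage expansion
    "cpu_pct": 80,          # plan CPU upgrade or read scaling
    "conn_pct_of_max": 70,  # review pooling or raise max_connections
}

def triggered(metrics):
    """Return the names of metrics that breach their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

print(triggered({"disk_pct": 75, "cpu_pct": 60, "conn_pct_of_max": 90}))
# ['disk_pct', 'conn_pct_of_max']
```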
Practical Capacity Planning Process
flowchart TD
Start["Start: New workload or growth trigger"] --> Measure["1. Measure current:<br/>Disk, CPU, Memory,<br/>Connections, IOPS"]
Measure --> Characterize["2. Characterize workload:<br/>Read/write ratio?<br/>Query complexity?<br/>Peak patterns?"]
Characterize --> Project["3. Project growth:<br/>3, 6, 12 month<br/>trajectory"]
Project --> Size["4. Size resources:<br/>Compute, Storage,<br/>Memory, IOPS"]
Size --> Buffer["5. Add 30-50% buffer:<br/>Growth headroom +<br/>Traffic spikes"]
Buffer --> Monitor["6. Monitor trends<br/>Quarterly review:<br/>Adjust projections"]
Monitor -->|Adjust projections| Project
A practical approach to capacity planning:
- Baseline: Measure current usage—disk, CPU, memory, connections, I/O.
- Characterize: Understand your workload—read/write ratio, query types, peak patterns.
- Project: Calculate growth rate. Where will you be in 3, 6, 12 months?
- Buffer: Add 30-50% headroom for unexpected growth or traffic spikes.
- Monitor: Track utilization trends. Update projections quarterly.
- Automate: Set up alerts for thresholds. Automate responses where possible.
Production Failure Scenarios
| Failure | Cause | Mitigation |
|---|---|---|
| Disk exhaustion mid-operation | Underestimated data growth, no monitoring of disk trends | Track disk utilization weekly, trigger capacity review at 60% usage |
| CPU saturation during peak | Undersized compute for query complexity + concurrency | Profile query CPU usage, add read replicas or scale compute before peaks |
| Memory OOM crashes | shared_buffers + work_mem undersized for workload | Size effective_cache_size at 75% of RAM, monitor query-level memory usage |
| Connection pool exhaustion | max_connections too low for concurrent request volume | Deploy PgBouncer, right-size pool based on core_count * 2 + effective_io_concurrency |
| IOPS throttling | Cloud volume IOPS limits hit by mixed read/write load | Choose appropriate volume types, monitor IOPS metrics, distribute across volumes |
Quick Reference: Sizing Formulas
CPU cores: Start with 4-8 for general workloads. Scale up when CPU consistently exceeds 70-80%.
Memory allocation for a dedicated PostgreSQL server: reserve 25% of RAM for the operating system, set shared_buffers to 25% (8-16GB on a 32GB server), and effective_cache_size to 75% of RAM. For work_mem, start with 4-10MB per concurrent sort or hash operation. Set maintenance_work_mem to 512MB-2GB and scale up before large index builds.
Storage IOPS: 10,000 IOPS at 0.5ms per I/O is 5 device-seconds of I/O per wall-clock second, i.e. an average queue depth of 5. Know your storage latency to judge headroom.
Connection pool size: (core_count * 2) + effective_io_concurrency. On a 4-core server with NVMe SSD: pool_size = (4 * 2) + 4 = 12.
Storage headroom: add 30-50% above projected need to account for bloat, temporary files, and growth between provisioning cycles.
Real-World Case Study: GitHub’s Capacity Planning Failures
GitHub has published extensively about their PostgreSQL scaling journey, including several capacity planning failures that cost them significant engineering time and caused service disruptions.
Failure 1: The storage exhaustion cascade
In 2016, GitHub’s PostgreSQL database cluster approached disk capacity on their main database servers. The root cause was a combination of factors: WAL archiving for replication consumed more disk than anticipated, an unintended backup retention policy accumulated old WAL segments, and the growth rate had been underestimated by 3x due to an untracked increase in user activity. When the primary disk reached 97% capacity, PostgreSQL began experiencing write slowdowns. Their alerting was set at 90% capacity, leaving only 10% headroom at a growth rate that consumed that headroom in hours. The incident resulted in 18 hours of degraded service, including periods where the website was read-only.
What they changed: GitHub implemented a capacity planning process that tracks disk usage growth week-over-week, with alerts set at 70% and 85% thresholds to provide actionable response time. They separated WAL volume onto a dedicated volume to isolate backup I/O from production I/O. They also built an internal tool called “Bloat” that tracks waste (bloat, unused indexes, abandoned tables) as a percentage of total storage.
Failure 2: Connection pool exhaustion at 10,000 repositories
As GitHub scaled from thousands to millions of repositories, their connection pooling configuration (PgBouncer) became a bottleneck. They had sized their pool for peak connection rates at small scale, but as the number of repositories grew, the number of simultaneous connections from various internal services grew super-linearly. At 10,000 repositories, their connection pool exhausted its maximum and began refusing connections.
What they changed: GitHub moved to a connection-per-worker model where each foreground service maintains its own connection, and background jobs use a separate pool with lower limits. They also implemented pg_stat_activity monitoring to track connection states and identify connection leaks early.
The lesson: Capacity planning failures compound. A wrong growth estimate doesn’t just mean provisioning more hardware—it cascades into alerting gaps, monitoring blind spots, and configuration assumptions that worked at small scale but fail at production scale. GitHub’s public post-incident reports are a valuable reference for teams building their own capacity planning processes.
Interview Questions
Q1: Your PostgreSQL database is consuming 400GB of a 500GB disk. Growth rate is 10GB/day. A new customer onboarding next quarter will add 50% more data volume. What’s your capacity plan?
A: The critical path is time-to-capacity: 400GB used, 100GB free, 10GB/day growth means 10 days before disk is full. This is already critical. Immediate action: provision additional storage within 48 hours. For the new customer, calculate: 50% more data volume means 200GB additional at current growth rate, plus the 10GB/day ongoing growth. That’s 300GB+ needed within 30-60 days. My plan: Week 1: provision storage to get to 200GB headroom (100GB immediate + 100GB for 10 days growth). Week 2: implement or verify monitoring on table-level storage consumption to identify bloat or unnecessary data. Month 1: provision the larger storage tier accounting for new customer load (add ~300GB). Month 2-3: evaluate read replicas or archiving strategies for older data to reduce primary disk load.
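The time-to-capacity arithmetic in the answer above generalizes to a one-line helper; the figures are from the scenario in the question:

```python
# Sketch: days of runway left on a disk, given current usage and a
# steady daily growth rate. Figures match the scenario above.
def days_until_full(total_gb, used_gb, growth_gb_per_day):
    return (total_gb - used_gb) / growth_gb_per_day

print(days_until_full(500, 400, 10))  # 10.0 days of runway: already critical
```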
Q2: How do you calculate the right number of connections for your connection pool?
A: The standard formula is: pool_size = (core_count * 2) + effective_spindle_count. On a 4-core server with an NVMe SSD: pool_size = (4 × 2) + 4 = 12. This accounts for CPU cores (parallelism) and I/O concurrency (spindle count). For PostgreSQL specifically, the key constraint is max_connections (global limit) and per-query parallelism settings. Set max_connections high enough to handle your pool size plus superuser_reserved_connections (typically 3). Monitor actual connection usage with pg_stat_activity — if you consistently see connections idle in transaction for long periods, your pool is too large or your application is holding connections improperly. For applications with high connection churn (serverless, microservices), use connection poolers like PgBouncer or PgCat to multiplex hundreds of application connections over a smaller number of actual PostgreSQL connections.
Q3: What’s the difference between IOPS and throughput, and how do you size storage for a write-heavy database?
A: IOPS (Input/Output Operations Per Second) measures the number of discrete I/O operations per second, which is critical for random read/write patterns like database page accesses. Throughput (MB/s) measures data transfer rate, which is critical for sequential operations like full table scans or bulk loads. A typical 7200 RPM spinning disk delivers ~100 IOPS and ~100 MB/s throughput; an NVMe SSD can deliver hundreds of thousands of IOPS and several GB/s. For a write-heavy PostgreSQL database, start from the WAL generation rate: with the default 16MB segment size (wal_segment_size), a database generating 1GB of WAL per minute cycles through 64 segments per minute, roughly one per second. Actual write IOPS run much higher than the segment rate, because every commit can force an fsync of the WAL. NVMe is strongly recommended for any PostgreSQL write workload above 1,000 IOPS.
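The WAL segment arithmetic can be sketched as follows, assuming the default 16MB segment size:

```python
# Sketch: WAL segment turnover rate from WAL volume, assuming the
# default 16MB segment size. This is a floor, not the true write IOPS:
# commits force many smaller WAL flushes between segment switches.
def wal_segments_per_second(wal_mb_per_minute, segment_mb=16):
    return wal_mb_per_minute / segment_mb / 60.0

print(wal_segments_per_second(1024))  # ~1.07 segments/s for 1GB of WAL per minute
```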
Q4: You need to forecast capacity for a database that will grow from 1TB to 10TB over 2 years. How do you approach this?
A: Three-step approach. First, establish the growth curve: pull 90 days of historical data from pg_stat_database (tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted) and pg_stat_bgwriter to understand write volume and query patterns, and calculate growth rate per week. Second, model the curve: linear (a constant GB/week), exponential (a consistent doubling period), or step-function (acquisition-driven jumps). For 1TB to 10TB in 24 months, average growth is ~375GB/month, or ~12.5GB/day. Third, plan provisioning triggers: review at 70% of the current storage tier (e.g., at 700GB on a 1TB volume), and provision the next tier at least 60 days before you hit 90% of the current one. Budget for 3x the storage you think you need in year 2, because growth often accelerates.
Conclusion
Capacity planning is continuous, not a one-time event. The goal is not to predict perfectly—it is to have enough visibility into trends that you are not surprised.
Build monitoring into your routine. Know your growth rate. Understand your workload characteristics. Make capacity decisions based on data, not guesswork.
Good capacity planning happens before anyone notices a problem. You want boring infrastructure—the kind that simply works, reliably, as load grows.
Related Posts
- Database Scaling Strategies - Horizontal and vertical scaling approaches
- Database Monitoring - Key metrics to track
- Cost Optimization - Optimizing database costs