Relational Databases: ACID Transactions and Data Foundation
Learn how relational databases work, from ACID properties and transactions to joins, indexing, and normalization in PostgreSQL and MySQL.
Relational Databases: ACID Transactions and the Foundation of Data
Introduction
A relational database stores data in tables with rows and columns. Each row represents a record, and each column represents an attribute. Tables connect through keys, which establish relationships between records.
The most important relationship type is the foreign key. A foreign key in one table points to the primary key of another table. This creates links. You do not store the user’s name in every order record. Instead, you store a user_id that points to the users table.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER REFERENCES users(id),
total DECIMAL(10, 2) NOT NULL,
status VARCHAR(50) DEFAULT 'pending',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
This structure is what gives relational databases their power. You can ask complex questions by combining tables, filtering on multiple conditions, and aggregating results. The query optimizer figures out the most efficient way to answer your question given the available indexes.
Core Concepts
Relational databases provide guarantees around how transactions work. These guarantees have a memorable acronym: ACID.
Atomicity
Atomicity means a transaction happens completely or not at all. If you transfer money from account A to account B, either both the debit and credit happen, or neither does. You never end up with money deducted from A but not credited to B.
Database systems implement this through transaction logs. The database writes what it plans to do before doing it. If a crash happens mid-transaction, the log allows the system to either complete the operation or roll it back.
Consistency
Consistency ensures that every transaction moves the database from one valid state to another. Any constraints you define, like foreign keys or unique indexes, must hold after every transaction.
If you have a constraint that order.user_id must reference a valid users.id, the database rejects any transaction that tries to create an order with an invalid user_id. Consistency is your safety net for data integrity.
Isolation
Isolation determines how concurrent transactions interact. Multiple people querying the database at the same time should not see each other’s half-finished work.
If you transfer money from A to B while someone else reads your balance, isolation determines whether they see the old balance, the new balance, or something in between. Higher isolation levels prevent anomalies but reduce concurrency.
Durability
Durability means once a transaction commits, it survives system crashes. If you receive confirmation that a transaction succeeded, the data persists even if the database server loses power immediately after.
Most databases achieve this by writing to disk before reporting success. The write-ahead log technique ensures the database can replay incomplete transactions after a crash.
Transactions in Practice
Transactions group multiple operations into a single unit. PostgreSQL and MySQL both use the same basic syntax:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
If the second UPDATE fails, the first one rolls back. Your balance stays at its original value.
The default isolation level in most databases is READ COMMITTED, which means you only see data that has been committed by other transactions. You can set stricter isolation:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SERIALIZABLE prevents most concurrency anomalies but can cause more lock contention and rolled-back transactions under heavy load.
Understanding Joins
Joins combine data from multiple tables. Understanding how they work is fundamental to writing efficient queries.
Inner Join
An inner join returns only rows that have matches in both tables. If you join orders to users, you only see orders that have a valid user_id.
SELECT users.email, orders.total, orders.created_at
FROM users
INNER JOIN orders ON users.id = orders.user_id
WHERE orders.status = 'completed';
Left Join
A left join returns all rows from the left table, even if they have no matching rows in the right table. The right side columns show NULL for non-matching rows.
SELECT users.email, orders.total
FROM users
LEFT JOIN orders ON users.id = orders.user_id;
This query shows every user, even those who have never placed an order. Their order columns will be NULL.
Right Join and Full Outer Join
Right joins work the opposite way, returning all rows from the right table. Full outer joins return all rows from both tables, filling in NULLs where there is no match. These are less commonly used but important to understand.
Join Performance
Joins can get expensive. A naive join algorithm scans both tables and compares every row. With proper indexing and query planning, the database can do much better.
The query optimizer decides how to execute your join based on table statistics, available indexes, and estimated row counts. Sometimes the optimizer makes poor choices, which is when you need to understand EXPLAIN output.
EXPLAIN SELECT users.email, orders.total
FROM users
INNER JOIN orders ON users.id = orders.user_id;
Query Optimization Guide
The EXPLAIN family of commands is the main diagnostic tool for slow queries. Here is how to use it effectively.
-- Basic plan
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
-- With actual runtime stats (PostgreSQL)
EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT * FROM orders WHERE user_id = 42;
-- Costs and actual rows
EXPLAIN (ANALYZE, COSTS, VERBOSE) SELECT * FROM orders WHERE user_id = 42;
Common plan nodes to recognize:
| Node Type | What It Means |
|---|---|
| Seq Scan | Full table scan — usually a red flag on large tables |
| Index Scan | Index used to find rows |
| Index Only Scan | Data comes entirely from index, no heap fetch |
| Nested Loop | Joins by probing inner relation per outer row |
| Hash Join | Builds a hash table for joining large sets |
| Merge Join | Sorts both inputs then merges — efficient on pre-sorted data |
| Bitmap Heap Scan | Retrieves rows via bitmap from index |
Diagnosing slow queries:
-
Seq Scan on a large table returning few rows — You need an index. A query returning 100 rows from a 10M-row table doing a Seq Scan is a clear signal.
-
Estimated rows far from actual rows — Statistics are stale. Run
ANALYZE table_name;to update them. -
Hash Join spilling to disk — The query is running out of
work_mem. Either increase it or restructure the query. -
Filter applied after join vs during join — For INNER JOINs, predicates in the ON clause filter earlier. Move filters to the right place.
-- Tables with more seq scans than index scans (missing indexes)
SELECT schemaname, tablename, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan * 10
ORDER BY seq_scan DESC;
-- Tables with high dead tuple percentage
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric / (n_dead_tup + n_live_tup + 1) * 100, 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
-- Top queries by total time
SELECT query, calls, total_time, mean_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
Index patterns worth knowing:
-- Partial index for a common filter pattern
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Expression index when you filter on a function
CREATE INDEX idx_users_email_lower ON users(lower(email));
-- Include columns to enable index-only scans (PostgreSQL 11+)
CREATE INDEX idx_orders_user_include ON orders(user_id) INCLUDE (total, created_at);
-- This covers: SELECT total, created_at FROM orders WHERE user_id = ?
Query rewriting patterns:
-- Subquery in IN clause — rewrite as a JOIN
SELECT * FROM orders WHERE id IN (SELECT order_id FROM items WHERE product_id = 123);
-- Becomes:
SELECT o.* FROM orders o INNER JOIN items i ON o.id = i.order_id WHERE i.product_id = 123;
-- Multiple queries fetching one row each (N+1 pattern) — batch them
SELECT * FROM orders WHERE user_id = 1;
SELECT * FROM orders WHERE user_id = 2;
-- Becomes:
SELECT * FROM orders WHERE user_id IN (1, 2) ORDER BY user_id;
-- Aggregation by category — let PostgreSQL do it in one scan
SELECT status, count(*) FROM orders GROUP BY status;
Indexing for Performance
Indexes speed up data retrieval. Without indexes, the database must scan every row to find what you are looking for. With the right index, it jumps directly to the relevant pages.
How Indexes Work
Most database indexes use a structure called B-tree, which keeps data sorted and allows efficient range queries. When you create an index on a column, the database maintains a separate sorted structure pointing to the actual data rows.
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created ON orders(created_at);
These indexes let the database quickly find orders by user_id, filter by status, or range-query by date.
When Indexes Hurt
Indexes are not free. Every INSERT, UPDATE, or DELETE must update all relevant indexes. A table with many indexes writes slower than one with few indexes.
For columns with low cardinality (few unique values), indexes provide minimal benefit. An index on a boolean column rarely helps since half the rows match any query.
Composite Indexes
A composite index covers multiple columns. The order matters. A composite index on (user_id, created_at) helps queries that filter by user_id alone, or user_id plus created_at. It does not help queries that only filter by created_at.
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
Normalization and Its Trade-offs
Database normalization structures tables to reduce redundancy. Normal forms are stages of normalization, from 1NF (first normal form) to 3NF (third normal form) and beyond.
First Normal Form
1NF requires atomic values and no repeating groups. Each cell contains a single value, not a list or array. Each row is unique.
Second Normal Form
2NF removes partial dependencies. Non-key columns must depend on the entire primary key, not just part of it. This matters for composite keys.
Third Normal Form
3NF removes transitive dependencies. Non-key columns should not depend on other non-key columns. If column A determines column B, and B determines column C, you have a transitive dependency.
-- Not normalized: total depends on quantity and price
CREATE TABLE order_items_bad (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
price DECIMAL(10, 2),
total DECIMAL(10, 2) -- This depends on quantity * price
);
-- Normalized: total is calculated, not stored
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
price DECIMAL(10, 2)
);
Normalization reduces data duplication. Denormalization (deliberately introducing redundancy) can improve read performance. Both have their place.
PostgreSQL vs MySQL
Both PostgreSQL and MySQL are excellent relational databases. They share most SQL syntax but differ in implementation details.
PostgreSQL offers more advanced features: better support for JSON operations within a relational context, full-text search, window functions, and more sophisticated indexing options. It follows SQL standards more strictly.
MySQL has historically been faster for simple read-heavy workloads. Its default storage engine, InnoDB, provides ACID transactions and row-level locking. MySQL syntax sometimes diverges from SQL standards.
Both handle replication, backups, and high availability. PostgreSQL has better support for complex data types and operations. MySQL has a larger ecosystem of managed services and tooling.
For most applications, the difference matters less than you might think. Pick one and learn it well. Switching between them is rarely necessary and comes with significant migration costs.
Sharding and Replication Deep Dive
Horizontal scaling for relational databases takes two forms: sharding and replication. They solve different problems.
Read Replicas and Replication Topologies
Replication streams data from a primary to one or more replicas. Reads go to replicas, reducing load on the primary. Writes always go to the primary, which pushes them to replicas after the fact.
-- PostgreSQL streaming replication setup (primary)
ALTER DATABASE mydb SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;
-- Create replication slot on primary
SELECT pg_create_physical_replication_slot('replica1_slot');
-- On replica, create recovery.conf (or modern standby.signal)
-- primary_conninfo = 'host=primary_ip port=5432 application_name=replica1'
-- slot_name = 'replica1_slot'
-- recovery_target_timeline = 'latest'
Replication topologies:
| Topology | Pros | Cons | Best For |
|---|---|---|---|
| Single primary + async replicas | Simple, read scaling | Replica lag | Read-heavy apps, analytics |
| Synchronous replication | Strong consistency, automatic failover | Extra latency per write | Financial transactions |
| Multi-primary | No single write bottleneck | Conflict resolution complexity | Geo-distributed writes |
| Cascading replication | Takes load off primary for many replicas | Lag compounds downstream | Large replica counts |
Sharding Patterns
Sharding partitions data across multiple databases. Each shard holds a subset, selected by a shard key.
-- Application-level sharding by user_id hash
-- Shard key = hash(user_id) % num_shards
CREATE TABLE orders (
id SERIAL,
user_id INTEGER,
shard_key INTEGER GENERATED ALWAYS AS (user_id % 4) STORED,
total DECIMAL(10, 2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Shard 0, 1, 2, 3 are separate physical databases
-- Query router directs traffic based on shard_key
Sharding strategies:
| Strategy | How It Works | Challenges |
|---|---|---|
| Hash-based | shard = hash(key) % num_shards | Resharding means remapping everything |
| Range-based | shard = key / range_size | Sequential keys concentrate on one shard |
| Directory-based | Lookup table maps keys to shards | Lookup table becomes a bottleneck |
| Geo-based | Region determines shard | Uneven distribution by geography |
When to shard:
- Data no longer fits on one server
- Write throughput outpaces a single primary
- Compliance demands specific geographic data residency
When NOT to shard:
- A single server handles your data fine
- You are optimizing prematurely
- Better indexing or caching would solve the problem first
Distributed Consistency Trade-offs
Relational databases trade horizontal write scalability for strong consistency. NoSQL systems make the opposite trade. Most applications never hit the ceiling on a well-tuned relational setup — but engineers often jump to NoSQL before exhausting what relational databases can do.
Topic-Specific Deep Dives
ACID vs BASE Trade-offs
ACID and BASE are two different philosophies on how to handle consistency in distributed systems. ACID is what relational databases do. BASE showed up with NoSQL.
The ACID Model
ACID gives you four guarantees:
- Atomicity: Transaction completes entirely or not at all
- Consistency: All constraints hold after every transaction
- Isolation: Concurrent transactions stay out of each other’s way
- Durability: Committed data survives crashes
These guarantees make relational databases predictable. You know exactly what state your data is in. That predictability has a cost: keeping all nodes in sync requires coordination, and coordination takes time.
The BASE Model
BASE makes different trade-offs:
- Basically Available: The system stays up, even if some nodes are stale
- Soft state: Data might be inconsistent between nodes at any moment
- Eventually consistent: Given enough time, all nodes will agree
graph LR
A[Write to Node A] --> B{Quorum?<br/>W + R > N}
B -->|Yes| C[Strong Consistency]
B -->|No| D[Eventual Consistency]
D --> E[Read from Node B<br/>Possible stale data]
C --> F[All nodes converge]
ACID vs BASE Comparison
| Factor | ACID (Relational) | BASE (NoSQL) |
|---|---|---|
| Consistency | Strong | Eventual |
| Transactions | Full ACID | Limited or none |
| Latency | Higher | Lower |
| Availability | Lower (CP systems) | Higher (AP systems) |
| Write scaling | Hard | Natural |
| Data integrity | Enforced by database | Application logic |
| Examples | PostgreSQL, MySQL | Cassandra, DynamoDB |
When to Pick Each
Go with ACID/relational when data integrity matters more than anything else — financial records, inventory, medical data. When you need complex transactions across multiple tables, or when your queries involve joins that would be a nightmare in application code.
Go with BASE/NoSQL when write throughput is the hard problem and you can tolerate reads that might be slightly behind. Document-oriented or key-value workloads that do not map naturally to tables fit here. So do use cases that need automatic sharding across geographic regions.
Hybrid Systems
A few modern databases try to split the difference:
- Google Spanner: TrueTime hardware (specialized GPS + atomic clocks) enables strongly consistent distributed transactions with geographic replication. The hardware is the key.
- CockroachDB: Distributed SQL without special hardware — uses Hybrid Logical Clocks instead. Operationally simpler than Spanner, slightly more latency on distributed transactions.
- YugabyteDB: PostgreSQL-compatible, uses Raft consensus for distributed consistency.
- Aurora: MySQL/PostgreSQL-compatible with a distributed storage layer that replicates across availability zones. Not truly distributed writes — still a single primary — but the storage layer handles replication and durability.
These systems give you relational semantics with distributed scale. The cost is operational complexity — running a distributed SQL cluster is harder than running a single PostgreSQL instance.
Trade-off Analysis
Use this table to guide architectural decisions based on workload characteristics:
| Factor | Scenario | PostgreSQL | MySQL (InnoDB) | Trade-off Consideration |
|---|---|---|---|---|
| High write throughput | MVCC reduces writer contention | Row-level locking, faster single-thread | PostgreSQL better for mixed workloads | PostgreSQL wins for write-heavy concurrent workloads |
| Complex analytical queries | Window functions, CTEs, excellent optimizer | Limited window function support | PostgreSQL wins for analytics | PostgreSQL superior for complex analytical queries |
| Simple key-value lookups | B-tree indexed, reliable | Very fast for primary key lookups | MySQL slightly faster for point queries | MySQL slightly faster; PostgreSQL for concurrent multi-key access |
| Full-text search | Native tsvector/tsquery, built-in | MySQL Full-Text indexes available | PostgreSQL superior out of the box | PostgreSQL for production FTS; MySQL acceptable for simple cases |
| JSON data handling | JSONB with GIN indexes, full JSON support | JSON functions limited | PostgreSQL for JSON-centric workloads | PostgreSQL for JSON-heavy workloads; MySQL for basic JSON only |
| Horizontal sharding | Citus extension, or manual sharding | Vitess, manual sharding | PostgreSQL has better sharding support | PostgreSQL via Citus preferred; MySQL requires Vitess complexity |
| Managed cloud offerings | Aurora, Cloud SQL, RDS, self-managed | Aurora, Cloud SQL, RDS, self-managed | Both well-supported in cloud | Both equivalent in managed cloud; PostgreSQL has richer tooling |
| Replication | Streaming, logical decoding, extensive | Binlog-based, semi-sync available | PostgreSQL more flexible replication | PostgreSQL for flexible replication; MySQL for simplicity |
| Locking granularity | Row-level, predicate locks | Row-level, gap locks in RR mode | PostgreSQL prevents phantoms at lower cost | PostgreSQL for better concurrent write performance |
When to Use / When Not to Use
When to Use Relational Databases:
- Your data has clear entity relationships that map naturally to tables
- You need ACID transactions for financial data, inventory, or orders
- Your queries involve complex joins across multiple tables
- You need powerful ad-hoc querying with SQL for analytics
- Your workload is balanced between reads and writes
- You need mature tooling, backups, and operational procedures
When Not to Use Relational Databases:
- Your data structure varies dramatically between records (document storage fits better)
- You need to store and query massive semi-structured data like JSON logs
- Your primary access pattern is single-key lookups at extreme scale
- Your writes massively outnumber reads and you need horizontal write scaling
- You are building a graph-based application with complex relationship traversals
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Primary disk failure | Database unavailable, potential data loss | Regular backups, write-ahead logs, replication to replicas |
| Lock contention from long transactions | Queries timeout, application hangs | Keep transactions short, use appropriate isolation levels |
| Index corruption | Queries return incorrect results | Use REINDEX periodically, enable checksums, verify with pg_checksums |
| Connection pool exhaustion | New queries fail, “too many connections” | Right-size max_connections, implement connection pooling (PgBouncer) |
Accidental DROP TABLE or DELETE | Data loss | Restrict permissions, use DROP TABLE ... IF EXISTS, point-in-time recovery |
| Replication lag spikes | Read replicas serve stale data | Monitor lag, route writes to primary after updates, use synchronous replication |
| Autovacuum not keeping up | Table bloat, degraded performance | Tune autovacuum workers, manually VACUUM when needed, monitor dead tuples |
| Query planner chooses wrong index | Slow queries, high CPU | Analyze with EXPLAIN, update statistics, use index hints if needed |
Common Pitfalls / Anti-Patterns
-
Missing indexes on foreign keys: Always index foreign key columns. Without indexes, parent-child joins cause full table scans.
-
N+1 query problems: Fetching related data in a loop instead of using joins. A single query returning 100 users with their orders is better than 101 queries.
-
Over-indexing: Creating indexes “just in case” hurts write performance. Every index slows down INSERT/UPDATE/DELETE.
-
Using triggers for business logic: Triggers hide side effects and make debugging harder. Handle logic in application code.
-
Ignoring NULL semantics: NULL is not “empty string” or “zero”. Queries behave differently with NULLs, and mixed NULL handling causes subtle bugs.
-
Storing serialized objects or arrays in text columns: While convenient, this bypasses indexing and query capabilities. Use proper data types or a document store.
-
Ignoring connection pooling: Each connection consumes memory. Use PgBouncer or similar to pool connections and avoid connection exhaustion.
-
Not using prepared statements: Parsing overhead adds up. Use prepared statements for repeated queries.
-
Mismatched types in joins: Joining
VARCHARtoINTEGERcauses implicit casts and prevents index usage. Always match types.
Quick Recap
Key Takeaways
- Relational databases provide ACID guarantees that keep your data consistent and durable
- Indexes are essential for read performance but slow down writes
- Normalization reduces redundancy; denormalization improves read performance
- Transactions group operations into atomic units that commit or rollback together
- Choose PostgreSQL for advanced features and standards compliance, MySQL for simple read-heavy workloads
- Monitor query latency, connection counts, and replication lag in production
Copy/Paste Checklist:
-- Essential performance monitoring queries
SELECT * FROM pg_stat_activity WHERE state != 'idle';
SELECT * FROM pg_stat_replication;
SELECT relname, n_dead_tup, last_vacuum FROM pg_stat_user_tables;
EXPLAIN (ANALYZE, BUFFERS) your_query_here;
-- Essential security settings
ALTER DATABASE mydb SET session_preload_libraries = 'pg_stat_statements';
ALTER ROLE readonly NOLOGIN;
GRANT CONNECT ON DATABASE mydb TO app_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
-- Essential backup verification
SELECT pg_start_backup('baseline');
SELECT pg_stop_backup();
Observability Checklist
Metrics to Monitor:
- Query latency (p50, p95, p99)
- Active connections and connection saturation
- Transaction throughput (commits/rollbacks per second)
- Lock wait time and lock contention rates
- Cache hit ratio (shared_buffers)
- Disk I/O utilization and queue depth
- Replication lag in seconds
- Autovacuum activity and table bloat percentage
- Index usage and missing indexes
Logs to Capture:
- Slow query log (queries exceeding threshold, e.g., 100ms)
- Error logs with full context including query parameters
- Connection logs for authentication failures
- Lock deadlocks and lock timeout events
- Checkpoint and bgwriter activity
- Autovacuum execution details
Alerts to Set:
- Connection count > 80% of
max_connections - Replication lag > 30 seconds
- Disk usage > 85% on data volume
- Long-running queries (> 60 seconds)
- Cache hit ratio < 90%
- Lock wait time spike
- Autovacuum failing repeatedly
- Replication slot age exceeding retention
-- Quick visibility queries
SELECT * FROM pg_stat_activity WHERE state != 'idle';
SELECT * FROM pg_stat_replication;
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum FROM pg_stat_user_tables;
SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;
Security Checklist
- Use strong authentication (scram-sha-256 or certificate-based, never md5)
- Implement least-privilege access (users get only needed permissions)
- Encrypt connections with SSL/TLS (set
ssl = onand enforce in pg_hba.conf) - Encrypt data at rest (use filesystem encryption or database-level encryption)
- Enable and review audit logging for sensitive operations
- Use
ALTER DATABASE SET session_preload_libraries = 'pg_stat_statements'for query monitoring without overexposure - Restrict
pg_hba.confto specific IP ranges, deny public access - Regularly apply security patches and update minor versions
- Use network segmentation (database not directly internet-accessible)
- Implement row-level security for multi-tenant scenarios
- Backup encryption keys separately from encrypted backups
- Test restore procedures in isolation from production
Real-World Case Studies
Netflix: From Oracle to PostgreSQL at Massive Scale
Netflix ran on Oracle for years — billions of rows, thousands of tables, serious hardware. In 2012 they migrated to PostgreSQL running on AWS RDS. The migration took two years of preparation, and the main motivation was cost: Oracle licensing was pricing them out at their scale.
The migration was incremental. They did not do a big-bang switch. They ran dual-write for months — writing to both Oracle and PostgreSQL simultaneously while comparing results. They built custom validation tools to catch any divergence. When they cut over, the decision was purely operational: flip a routing switch and watch the metrics.
The lesson: database migrations at scale are infrastructure problems more than database problems. The schema transfer was straightforward. The hard part was the two years of tooling, validation, and operational runbook development that made the switch boring.
Amazon: Why They Left Oracle for PostgreSQL and MySQL
Amazon’s consumer-facing databases migrated away from Oracle around 2013-2017. The stated reasons were licensing cost and control. At Amazon’s scale, Oracle’s per-core licensing model was a serious line item.
Amazon’s internal Postgres-compatible database, Aurora, launched in 2014. The architecture is not a vanilla Postgres — it separates the storage layer and replicates across three availability zones. This gives MySQL/PostgreSQL compatibility with durability characteristics no traditional RDS could match.
The takeaway: the database you pick shapes your operational constraints for years. Amazon built their own storage engine because off-the-shelf could not give them the durability they needed at their write volume. Most companies will not hit that ceiling, but it is worth knowing where the ceiling is.
Interview Questions
MySQL InnoDB uses clustered indexes where the primary key is the physical row storage order. Secondary indexes store primary key values, not row pointers. SQL Server follows the same pattern. PostgreSQL has no concept of clustered indexes — tables are heaps and indexes point to heap locations (block number and offset). PostgreSQL does offer CLUSTER to reorder a table physically, but subsequent inserts are not maintained in order. This means PostgreSQL's index-only scans depend on visibility map freshness, while MySQL's covering indexes are more predictable. For read-heavy workloads on MySQL, choose your primary key carefully since it determines physical storage order.
First, run EXPLAIN (ANALYZE, BUFFERS) on the query and compare the plan to a known-good plan — the issue is often a stale statistics change causing the planner to pick a sequential scan instead of an index scan. Check whether the table has recently grown past an autovacuum threshold (high dead tuple count causes the planner to misestimate). Check whether an index was dropped or disabled. Verify that the connection is not hitting a different replica with stale statistics. Check for lock contention — if the query is waiting on a lock, execution time includes wait time, not just execution time.
Atomicity means a transaction commits entirely or rolls back entirely — no partial state. Consistency means the database moves from one valid state to another — all constraints and invariants hold. Isolation means concurrent transactions do not interfere with each other — the details of which isolation level determines exactly how much interference is visible. Durability means once a transaction commits, it survives crashes. Distributed databases most often relax isolation. Serializability is expensive — it requires coordination across nodes. Many distributed databases default to eventually consistent or read-committed isolation to avoid the latency penalty of synchronous coordination. Durability is also frequently traded against latency in write paths (write-ahead logs vs synchronous fsync).
A B-tree (balanced tree) keeps data sorted in pages, typically 8KB-16KB each. Root and intermediate pages contain pointers to child pages; leaf pages contain the actual index entries sorted by key. This means lookups, range scans, and sequential access are all efficient — O(log n) for point queries, O(log n + k) for range queries where k is results. B-trees stay balanced automatically through page splits on insert. They minimize disk I/O because tree depth is shallow (3-4 levels for millions of rows) and pages are large. Hash indexes are faster for equality on exact matches but cannot handle range queries, which are common in relational workloads, making B-trees the default.
A race condition occurs when two transactions operate on the same data simultaneously and the outcome depends on execution order. READ UNCOMMITTED allows dirty reads — seeing uncommitted changes. READ COMMITTED (PostgreSQL/MySQL default) prevents dirty reads but allows non-repeatable reads (re-reading the same row yields different values). REPEATABLE READ (MySQL default) prevents dirty and non-repeatable reads but allows phantom reads (new rows matching a range query). SERIALIZABLE prevents all anomalies by executing transactions as if sequentially, typically via two-phase locking or snapshot isolation. PostgreSQL additionally implements SELECT FOR UPDATE and FOR SHARE to explicitly lock rows, giving developers control at the cost of potential deadlocks.
The N+1 problem occurs when fetching a list of records (1 query) then making N additional queries — one per record — to load related data. Example: fetching 100 users (1 query) then loading each user's orders (100 queries). You identify it via query logging: a single page load generates hundreds of similar queries, or pg_stat_statements shows many calls with the same pattern. Fix approaches: use JOINs to fetch related data in one query, use batch loading (IN clause with limited batches), use an ORM's eager loading features (N+1 prevention), or use a DataLoader pattern to batch and cache individual lookups. The right fix depends on whether you need all related data or can paginate it.
DELETE is a DML operation that removes rows one by one, triggers DELETE triggers, respects transaction semantics (can be rolled back), and logging depends on the recovery model — it is logged in full recovery mode. TRUNCATE is a DDL operation that deallocates data pages, making it much faster for large tables; it resets identity columns, does not fire triggers, and is not fully logged (it logs page deallocations, not rows). DROP removes the entire table structure including indexes and constraints — it is a DDL operation, cannot be rolled back in most databases, and removes the table entirely. For large tables, TRUNCATE TABLE (table_name) is orders of magnitude faster than DELETE FROM (table_name). DROP is irreversible without backups.
A many-to-many relationship requires a join table with foreign keys to both tables. For example, students and courses: a student can take many courses and a course has many students. The join table (enrollments) has (student_id, course_id) as a composite primary key. At scale, add a surrogate key for the join table if you need to reference specific enrollments (for grades, dates, etc.). Always index both foreign key columns — a composite index on (student_id, course_id) handles queries for all courses a student takes and all students in a course. If queries are predominantly one-directional, consider separate indexes. For high-write scenarios, the join table becomes a hotspot; partitioning by one of the foreign keys can help distribute load.
A regular view is a saved query — it runs the underlying query each time you SELECT from it. A materialized view pre-computes and stores the result as a physical table, refreshing on a schedule or on-demand. Use materialized views when you have expensive aggregations or joins computed frequently (daily sales summaries, reporting dashboards) and near-real-time accuracy is not required. PostgreSQL supports materialized views with REFRESH MATERIALIZED VIEW. MySQL does not have native materialized view support — you implement the pattern manually with summary tables and scheduled updates. Trade-offs: materialized views improve read performance but add storage overhead, require refresh strategy, and may serve stale data.
Pessimistic locking assumes conflicts will happen and prevents them by acquiring locks before a transaction: SELECT FOR UPDATE in PostgreSQL or SELECT ... LOCK IN SHARE MODE in MySQL. The lock is held until the transaction commits. This prevents conflicts but can cause lock contention and deadlocks under high concurrency. Optimistic locking assumes conflicts are rare: you read a row with a version number or timestamp, and on update you check that the version has not changed (WHERE id = ? AND version = ?). If it changed, you retry or abort. This allows higher concurrency but means you must handle retry logic in application code. Use pessimistic locking when conflicts are frequent or expensive to resolve (financial transactions, inventory decrements). Use optimistic locking when conflicts are rare and concurrency is important (user profile updates, comment likes).
A stored procedure is executable SQL logic stored in the database itself. Advantages: execution plan caching and reuse, reduced network round-trips for multi-statement operations, encapsulation of business logic that must apply regardless of which application accesses the database. Disadvantages: vendor lock-in (procedures are database-specific), harder to version control and test (requires database deployment), limited debugging tools, and they shift business logic away from application code where modern testing and CI/CD practices are well-established. For most applications, keep business logic in the application layer and use stored procedures sparingly — primarily for performance-critical operations where network latency is the bottleneck, or for operations that must be atomic regardless of which client executes them.
Normalization eliminates redundant data by splitting tables and establishing relationships. In an unnormalized order entry, customer name appears on every order row; in 3NF, customer name exists only in the customers table. This eliminates update anomalies — changing a customer's name requires updating one row, not many. 1NF enforces atomicity (no repeating groups), 2NF removes partial dependencies on composite keys, 3NF removes transitive dependencies. Beyond 3NF, BC NF and 4NF handle edge cases. The practical limit is usually 3NF — higher forms matter mainly for specialized domains. Highly normalized schemas can require many joins for common queries, which hurts read performance. Denormalization (deliberately reintroducing controlled redundancy) is a deliberate trade-off for read-heavy workloads, not a failure of design.
In synchronous replication, the primary waits for at least one replica to confirm the write before reporting success to the client. If the replica does not confirm within a timeout, the primary fails the transaction. This guarantees zero data loss if the primary fails — the replica has all committed data. However, write latency increases by at least the network round-trip to the replica, which matters at scale. Asynchronous replication propagates writes to replicas after the primary confirms success to the client. Writes are faster but there is a window of data loss if the primary fails — typically measured as replication lag in seconds. Semi-synchronous replication (MySQL) waits for at least one replica to receive the write but not necessarily apply it. PostgreSQL's synchronous replication requires confirmation before commit; streaming replication can be configured for various guarantee levels.
A deadlock occurs when two or more transactions each hold locks the other needs and neither can proceed. PostgreSQL detects deadlocks via a wait-for graph — when a transaction detects it is waiting on a lock held by another transaction that is itself waiting, it checks for a cycle. If found, it aborts one transaction (the victim) with error 40P01. Prevention strategies: acquire locks in a consistent order across all transactions (if all code acquires table locks in the same order, deadlocks are impossible), keep transactions short (less lock hold time means less window for cycles), use lock timeout to bound wait time instead of waiting indefinitely, avoid user interaction inside transactions (batch operations instead).
Both enforce uniqueness, but primary key adds NOT NULL and is the clustered index in MySQL (InnoDB). PostgreSQL does not cluster by default — the primary key is just a unique constraint with a NOT NULL added. In MySQL, the primary key determines physical row order, so choose it carefully for write patterns. In PostgreSQL, you can have one PK that is NOT NULL and another unique index that allows NULLs (for composite keys where one column might be absent). Foreign keys can reference either. The distinction matters for replication: MySQL's statement-based replication replicates primary key usage for binlog formatting.
Big table migrations require a multi-phase approach. Option 1 (online migration): add new columns as nullable, backfill in batches using chunked UPDATE with sleep between batches to reduce load, add triggers to keep old and new columns in sync during transition, backfill complete, add NOT NULL constraints, drop old columns. Option 2 (shadow table): create new table with target schema, backfill from old table in batches, symlink cutover (rename old to old_backup, rename new to active in single transaction), verify, drop old_backup. The key constraint: you must be able to roll back at every step. For very large tables, tools like `pg_repack` or `ALTER TABLE ... SET TABLESPACE` help move physical storage without table rewrite locks.
Table bloat is dead tuple accumulation — rows that were updated or deleted but not reclaimed. Causes: high UPDATE/DELETE rate without sufficient autovacuum, long-running transactions holding snapshots (preventing vacuum from seeing dead tuples), abrupt connection drops without proper cleanup, aggressive indexing on frequently updated columns. Detection: `pg_stat_user_tables.n_dead_tup` growing, `pgstattuple` extension for precise bloat estimation, comparing `pg_class.relpages` (physical pages) against expected from `pg_class.reltuples` (row count). Mitigation: tune autovacuum to run more aggressively on high-write tables, manually VACUUM when needed, increase `autovacuum_vacuum_scale_factor` or lower threshold, consider `VACUUM FULL` (requires exclusive lock) or `pg_repack` for online cleanup. Prevention: right-size `work_mem` to reduce sorts spilling to disk, avoid long transactions.
A regular index scan fetches each matching row individually from the heap — one disk I/O per row. A bitmap index scan first collects all matching index entries into a bitmap (in memory), then sorts the bitmap by heap page location, then fetches each page once with all rows from that page. Bitmap scans win when many rows match and are scattered across many pages — fetching each row individually would cause random I/O. PostgreSQL chooses bitmap scans when the index selectivity is moderate (not too selective, not too unselective). For highly selective queries returning few rows, regular index scan is faster because you stop after finding them. You can force bitmap scan with `enable_bitmapscan = off` to test whether the planner is making the right choice.
Advisory locks are application-defined locks identified by bigint keys. They are not tied to data rows — you explicitly acquire and release them. Functions: `pg_advisory_lock(key)`, `pg_advisory_xact_lock(key)` (auto-release at transaction end), `pg_try_advisory_lock(key)` (non-blocking). Use cases: implement application-level resource serialization without adding database rows (e.g., ensure only one job of a certain type runs at a time), coordinate between multiple application instances without database rows as coordination points, implement distributed counters or rate limiting. Row locks block on the locked row until the transaction completes; advisory locks give you coarser-grained control. The tradeoff: row locks are automatically released on commit/rollback; advisory locks require explicit release or transaction scope.
Natural keys come from your domain data (e.g., ISBN for books, country code for countries). Surrogate keys are system-generated (auto-increment integers, UUIDs). Surrogate keys advantages: never change, no possibility of conflicts, compact storage (4 bytes for bigint vs variable for strings), no business logic leakage into foreign keys. Natural key advantages: self-documenting (a join on country_code is readable, a join on country_id is not), enables referential integrity checks at application level, sometimes eliminates the need for a join. For most OLTP workloads, surrogate keys are preferred — they are immutable and the natural key can live as a unique constraint instead of the PK. For reference tables with stable, meaningful codes (ISO country codes, currency codes), natural keys make sense. UUIDs as primary keys have write amplification issues in B-tree indexes (random insertion patterns cause page splits and index bloat) — use `gen_random_uuid()` only when truly needed, or consider `uuid_generate_v4()` with `pgcrypto` extension for sequential UUID generation.
Further Reading
- PostgreSQL Documentation: The Query Planner — Deep dive into how PostgreSQL chooses execution plans
- MySQL 8.0 Reference Manual: Optimization — MySQL-specific optimization techniques
- “Designing Data-Intensive Applications” by Martin Kleppmann — Essential reading on distributed systems and data storage trade-offs
- PostgreSQL wiki: EXPLAIN plans — Community guide to reading EXPLAIN output
- CiteScale Paper on ACID vs BASE — Why distributed systems compromise consistency
- AWS Aurora Architecture — How Aurora implements distributed relational storage
Database Engine Internals Deep Dive
Understanding the internals helps you reason about performance, concurrency, and failure modes.
MVCC and Read Consistency
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to allow readers and writers to not block each other. When you UPDATE a row, PostgreSQL creates a new version while keeping the old version until no transaction needs it. Each transaction sees a snapshot based on its start time — this is why READ COMMITTED works as it does.
-- See the current transaction snapshot
SELECT txid_current_snapshot();
-- Check which rows are visible to current transaction
SELECT * FROM my_table WHERE ctid = '(0,1)';
-- Monitor tuple visibility and age
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
The visibility map tracks which pages contain only dead tuples. When the visibility map shows a page is all-dead, future index-only scans can skip heap fetches. This is why autovacuum freshness matters for read performance.
WAL and Crash Recovery
The Write-Ahead Log (WAL) is the foundation of PostgreSQL durability. Every change is written to the WAL before being applied to data pages. On crash, PostgreSQL replays the WAL to reach a consistent state.
-- Check WAL activity and size
SELECT pg_current_wal_lsn(), pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0');
-- Force a checkpoint (admin task, use carefully)
SELECT pg_checkpoint_sync();
-- Check wal usage per database
SELECT datname, total_wal_bytes FROM pg_stat_database;
wal_level = replica enables logical decoding for change data capture and replication. For production, logical level provides enough information for point-in-time recovery plus replication.
Autovacuum Tuning
Autovacuum prevents table bloat by reclaiming dead tuples. Default settings work for moderate workloads but need tuning for high-write tables:
-- Table-specific autovacuum tuning for high-write tables
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- Trigger at 1% dead tuples vs default 20%
autovacuum_analyze_scale_factor = 0.005, -- Analyze at 0.5% changes vs default 10%
autovacuum_vacuum_cost_delay = 2 -- Lower cost delay for faster cleanup
);
-- Monitor autovacuum activity
SELECT relname, last_autovacuum, autovacuum_count,
COALESCE(ROUND(n_dead_tup::numeric / (n_live_tup + n_dead_tup + 1) * 100, 2), 0) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
-- Manual vacuum for emergency cleanup
VACUUM verbose orders;
Connection Pooling with PgBouncer
PgBouncer reduces connection overhead by pooling connections to PostgreSQL. It sits between your application and the database:
; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
With pool_mode = transaction, PgBouncer disconnects the backend connection when the client transaction ends. Your application never holds a backend connection between transactions, allowing you to support many more concurrent application connections than PostgreSQL can handle.
Conclusion
Relational databases remain relevant because they solve real problems. ACID transactions give you confidence your data stays consistent. Joins let you ask complex questions across related data. Indexes make queries fast. Normalization keeps your data from becoming a mess.
PostgreSQL and MySQL both handle these concerns well. The features overlap significantly. My preference leans toward PostgreSQL for its feature set and standards compliance, but MySQL is equally capable for most workloads.
For related reading, see my posts on NoSQL Databases to understand when alternative data models make sense, and Database Scaling to learn how to handle growth beyond a single database instance.
Category
Related Posts
Connection Pooling: HikariCP, pgBouncer, and ProxySQL
Learn connection pool sizing, HikariCP, pgBouncer, and ProxySQL, timeout settings, idle management, and when pooling helps or hurts performance.
Table Partitioning: Range, Hash, and List Strategies
Learn how table partitioning improves query performance and maintenance, including range partitioning, hash partitioning, and partition pruning.
dbt: The SQL-First Transformation Tool for Data Teams
Discover how dbt brings software engineering practices—version control, testing, documentation—to SQL-based data transformations.