Relational Databases: ACID Transactions and Data Foundation

Learn how relational databases work, from ACID properties and transactions to joins, indexing, and normalization in PostgreSQL and MySQL.


Relational Databases: ACID Transactions and the Foundation of Data

Relational databases have been the backbone of data storage for decades. Despite newer alternatives grabbing headlines, PostgreSQL and MySQL still power most production applications. Understanding how they work matters whether you are designing a new system or debugging an existing one.

I have seen plenty of engineers reach for NoSQL without really understanding what they give up. Relational databases offer something increasingly rare: predictable consistency, powerful querying, and battle-tested reliability. The trade-off is that they are harder to scale horizontally, which matters a lot less than people think until you hit serious traffic numbers.

What Makes a Database Relational

A relational database stores data in tables with rows and columns. Each row represents a record, and each column represents an attribute. Tables connect through keys, which establish relationships between records.

The most important relationship type is the foreign key. A foreign key in one table points to the primary key of another table. This creates links. You do not store the user’s name in every order record. Instead, you store a user_id that points to the users table.

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10, 2) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This structure is what gives relational databases their power. You can ask complex questions by combining tables, filtering on multiple conditions, and aggregating results. The query optimizer figures out the most efficient way to answer your question given the available indexes.
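For example, using the users and orders tables above, a single query can join, filter, and aggregate, and the optimizer decides how to execute it:

SELECT users.email,
       COUNT(*) AS completed_orders,
       SUM(orders.total) AS lifetime_total
FROM users
INNER JOIN orders ON orders.user_id = users.id
WHERE orders.status = 'completed'
GROUP BY users.email
ORDER BY lifetime_total DESC
LIMIT 10;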

ACID Properties

Relational databases provide guarantees around how transactions work. These guarantees have a memorable acronym: ACID.

Atomicity

Atomicity means a transaction happens completely or not at all. If you transfer money from account A to account B, either both the debit and credit happen, or neither does. You never end up with money deducted from A but not credited to B.

Database systems implement this through transaction logs. The database writes what it plans to do before doing it. If a crash happens mid-transaction, the log allows the system to either complete the operation or roll it back.

Consistency

Consistency ensures that every transaction moves the database from one valid state to another. Any constraints you define, like foreign keys or unique indexes, must hold after every transaction.

If you have a constraint that order.user_id must reference a valid users.id, the database rejects any transaction that tries to create an order with an invalid user_id. Consistency is your safety net for data integrity.
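With the schema above, a sketch of what that rejection looks like in PostgreSQL (the exact constraint name depends on how it was created):

-- No user with id 999 exists, so the foreign key rejects the insert:
INSERT INTO orders (user_id, total) VALUES (999, 49.99);
-- ERROR: insert or update on table "orders" violates foreign key
-- constraint "orders_user_id_fkey"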

Isolation

Isolation determines how concurrent transactions interact. Multiple people querying the database at the same time should not see each other’s half-finished work.

If you transfer money from A to B while someone else reads your balance, isolation determines whether they see the old balance, the new balance, or something in between. Higher isolation levels prevent anomalies but reduce concurrency.
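A sketch of READ COMMITTED behavior with two concurrent sessions and a hypothetical accounts table:

-- Session A:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

-- Session B, running concurrently under READ COMMITTED, still sees
-- the old balance because A has not committed:
SELECT balance FROM accounts WHERE id = 1;

-- Session A:
COMMIT;

-- Session B's next read now sees the new balance.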

Durability

Durability means once a transaction commits, it survives system crashes. If you receive confirmation that a transaction succeeded, the data persists even if the database server loses power immediately after.

Most databases achieve this by writing to disk before reporting success. The write-ahead log technique ensures the database can replay committed work and roll back incomplete transactions after a crash.

Transactions in Practice

Transactions group multiple operations into a single unit. PostgreSQL and MySQL both use the same basic syntax:

BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;

If the second UPDATE fails, the transaction aborts and the first one rolls back. Your balance stays at its original value.
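You can also abort explicitly, and savepoints let you roll back part of a transaction. Both ROLLBACK and SAVEPOINT are standard SQL supported by PostgreSQL and MySQL:

BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT after_debit;

UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- Undo only the work since the savepoint, keeping the debit:
ROLLBACK TO SAVEPOINT after_debit;

-- Or abandon the entire transaction:
ROLLBACK;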

PostgreSQL defaults to READ COMMITTED, which means you only see data that other transactions have committed. MySQL's InnoDB defaults to the stricter REPEATABLE READ. You can set stricter isolation explicitly:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SERIALIZABLE prevents most concurrency anomalies but can cause more lock contention and rolled-back transactions under heavy load.

Understanding Joins

Joins combine data from multiple tables. Understanding how they work is fundamental to writing efficient queries.

Inner Join

An inner join returns only rows that have matches in both tables. If you join orders to users, you only see orders that have a valid user_id.

SELECT users.email, orders.total, orders.created_at
FROM users
INNER JOIN orders ON users.id = orders.user_id
WHERE orders.status = 'completed';

Left Join

A left join returns all rows from the left table, even if they have no matching rows in the right table. The right side columns show NULL for non-matching rows.

SELECT users.email, orders.total
FROM users
LEFT JOIN orders ON users.id = orders.user_id;

This query shows every user, even those who have never placed an order. Their order columns will be NULL.
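The NULLs are useful on their own. Filtering on them turns the left join into an anti-join that finds users who have never placed an order:

SELECT users.email
FROM users
LEFT JOIN orders ON users.id = orders.user_id
WHERE orders.id IS NULL;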

Right Join and Full Outer Join

Right joins work the opposite way, returning all rows from the right table. Full outer joins return all rows from both tables, filling in NULLs where there is no match. These are less commonly used but important to understand. Note that MySQL has no FULL OUTER JOIN; you emulate it with a UNION of a left join and a right join.

Join Performance

Joins can get expensive. A naive join algorithm scans both tables and compares every row. With proper indexing and query planning, the database can do much better.

The query optimizer decides how to execute your join based on table statistics, available indexes, and estimated row counts. Sometimes the optimizer makes poor choices, which is when you need to understand EXPLAIN output.

EXPLAIN SELECT users.email, orders.total
FROM users
INNER JOIN orders ON users.id = orders.user_id;

Indexing for Performance

Indexes speed up data retrieval. Without indexes, the database must scan every row to find what you are looking for. With the right index, it jumps directly to the relevant pages.

How Indexes Work

Most database indexes use a structure called B-tree, which keeps data sorted and allows efficient range queries. When you create an index on a column, the database maintains a separate sorted structure pointing to the actual data rows.

CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created ON orders(created_at);

These indexes let the database quickly find orders by user_id, filter by status, or range-query by date.

When Indexes Hurt

Indexes are not free. Every INSERT, UPDATE, or DELETE must update all relevant indexes. A table with many indexes writes slower than one with few indexes.

For columns with low cardinality (few unique values), indexes provide minimal benefit. An index on a boolean column rarely helps, because either value matches such a large fraction of the table that a sequential scan is often cheaper.
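One mitigation worth knowing: PostgreSQL supports partial indexes, which index only the rows matching a predicate. If you almost always query the rare value of a low-cardinality column, a partial index stays small and useful. This sketch assumes 'pending' orders are a small fraction of the table:

-- Index only the rows you actually search for:
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';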

Composite Indexes

A composite index covers multiple columns. The order matters. A composite index on (user_id, created_at) helps queries that filter by user_id alone, or user_id plus created_at. It does not help queries that only filter by created_at.

CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
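A sketch of which queries this composite index can and cannot serve (the dates are placeholders):

-- Served by idx_orders_user_created (leading column present):
SELECT * FROM orders WHERE user_id = 42;
SELECT * FROM orders
WHERE user_id = 42 AND created_at > '2024-01-01';

-- Not served (leading column missing); needs its own index:
SELECT * FROM orders WHERE created_at > '2024-01-01';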

Normalization and Its Trade-offs

Database normalization structures tables to reduce redundancy. Normal forms are stages of normalization, from 1NF (first normal form) to 3NF (third normal form) and beyond.

First Normal Form

1NF requires atomic values and no repeating groups. Each cell contains a single value, not a list or array. Each row is unique.

Second Normal Form

2NF removes partial dependencies. Non-key columns must depend on the entire primary key, not just part of it. This matters for composite keys.

Third Normal Form

3NF removes transitive dependencies. Non-key columns should not depend on other non-key columns. If column A determines column B, and B determines column C, you have a transitive dependency.

-- Not normalized: total depends on quantity and price
CREATE TABLE order_items_bad (
    order_id INTEGER,
    product_id INTEGER,
    quantity INTEGER,
    price DECIMAL(10, 2),
    total DECIMAL(10, 2)  -- This depends on quantity * price
);

-- Normalized: total is derived (quantity * price), so it is not stored
CREATE TABLE order_items (
    order_id INTEGER,
    product_id INTEGER,
    quantity INTEGER,
    price DECIMAL(10, 2),
    PRIMARY KEY (order_id, product_id)
);

Normalization reduces data duplication. Denormalization (deliberately introducing redundancy) can improve read performance. Both have their place.
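With the normalized order_items table, the total is computed at query time rather than read from a stored column:

SELECT order_id, SUM(quantity * price) AS order_total
FROM order_items
GROUP BY order_id;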

PostgreSQL vs MySQL

Both PostgreSQL and MySQL are excellent relational databases. They share most SQL syntax but differ in implementation details.

PostgreSQL offers more advanced features: better support for JSON operations within a relational context, full-text search, window functions, and more sophisticated indexing options. It follows SQL standards more strictly.

MySQL has historically been faster for simple read-heavy workloads. Its default storage engine, InnoDB, provides ACID transactions and row-level locking. MySQL syntax sometimes diverges from SQL standards.

Both handle replication, backups, and high availability. PostgreSQL has better support for complex data types and operations. MySQL has a larger ecosystem of managed services and tooling.

For most applications, the difference matters less than you might think. Pick one and learn it well. Switching between them is rarely necessary and comes with significant migration costs.

When to Use and When Not to Use Relational Databases

When to Use Relational Databases:

  • Your data has clear entity relationships that map naturally to tables
  • You need ACID transactions for financial data, inventory, or orders
  • Your queries involve complex joins across multiple tables
  • You need powerful ad-hoc querying with SQL for analytics
  • Your workload is balanced between reads and writes
  • You need mature tooling, backups, and operational procedures

When Not to Use Relational Databases:

  • Your data structure varies dramatically between records (document storage fits better)
  • You need to store and query massive semi-structured data like JSON logs
  • Your primary access pattern is single-key lookups at extreme scale
  • Your writes massively outnumber reads and you need horizontal write scaling
  • You are building a graph-based application with complex relationship traversals

Production Failure Scenarios

  • Primary disk failure: database unavailable, potential data loss. Mitigation: regular backups, write-ahead logs, replication to replicas.
  • Lock contention from long transactions: queries time out, the application hangs. Mitigation: keep transactions short, use appropriate isolation levels.
  • Index corruption: queries return incorrect results. Mitigation: REINDEX periodically, enable checksums, verify with pg_checksums.
  • Connection pool exhaustion: new queries fail with “too many connections”. Mitigation: right-size max_connections, implement connection pooling (PgBouncer).
  • Accidental DROP TABLE or DELETE: data loss. Mitigation: restrict permissions, use DROP TABLE IF EXISTS in scripts, point-in-time recovery.
  • Replication lag spikes: read replicas serve stale data. Mitigation: monitor lag, route reads to the primary after writes, use synchronous replication.
  • Autovacuum not keeping up: table bloat, degraded performance. Mitigation: tune autovacuum workers, VACUUM manually when needed, monitor dead tuples.
  • Query planner chooses wrong index: slow queries, high CPU. Mitigation: analyze with EXPLAIN, update statistics, use index hints if needed.

Observability Checklist

Metrics to Monitor:

  • Query latency (p50, p95, p99)
  • Active connections and connection saturation
  • Transaction throughput (commits/rollbacks per second)
  • Lock wait time and lock contention rates
  • Cache hit ratio (shared_buffers)
  • Disk I/O utilization and queue depth
  • Replication lag in seconds
  • Autovacuum activity and table bloat percentage
  • Index usage and missing indexes

Logs to Capture:

  • Slow query log (queries exceeding threshold, e.g., 100ms)
  • Error logs with full context including query parameters
  • Connection logs for authentication failures
  • Lock deadlocks and lock timeout events
  • Checkpoint and bgwriter activity
  • Autovacuum execution details

Alerts to Set:

  • Connection count > 80% of max_connections
  • Replication lag > 30 seconds
  • Disk usage > 85% on data volume
  • Long-running queries (> 60 seconds)
  • Cache hit ratio < 90%
  • Lock wait time spike
  • Autovacuum failing repeatedly
  • Replication slot age exceeding retention

-- Quick visibility queries
SELECT * FROM pg_stat_activity WHERE state != 'idle';
SELECT * FROM pg_stat_replication;
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum FROM pg_stat_user_tables;
-- total_time was renamed to total_exec_time in PostgreSQL 13:
SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

Security Checklist

  • Use strong authentication (scram-sha-256 or certificate-based, never md5)
  • Implement least-privilege access (users get only needed permissions)
  • Encrypt connections with SSL/TLS (set ssl = on and enforce in pg_hba.conf)
  • Encrypt data at rest (use filesystem encryption or database-level encryption)
  • Enable and review audit logging for sensitive operations
  • Load pg_stat_statements via shared_preload_libraries (it needs shared memory at server start, so session_preload_libraries is not sufficient) for query monitoring without overexposure
  • Restrict pg_hba.conf to specific IP ranges, deny public access
  • Regularly apply security patches and update minor versions
  • Use network segmentation (database not directly internet-accessible)
  • Implement row-level security for multi-tenant scenarios
  • Backup encryption keys separately from encrypted backups
  • Test restore procedures in isolation from production
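The row-level security item from the checklist, sketched for a multi-tenant orders table in PostgreSQL. The tenant_id column and the app.tenant_id setting are hypothetical names:

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- Each session sees only rows for its own tenant:
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.tenant_id')::integer);

-- The application sets the tenant before querying:
SET app.tenant_id = '42';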

Common Pitfalls and Anti-Patterns

  1. Missing indexes on foreign keys: Always index foreign key columns. Without indexes, parent-child joins cause full table scans.

  2. N+1 query problems: Fetching related data in a loop instead of using joins. A single query returning 100 users with their orders is better than 101 queries.

  3. Over-indexing: Creating indexes “just in case” hurts write performance. Every index slows down INSERT/UPDATE/DELETE.

  4. Using triggers for business logic: Triggers hide side effects and make debugging harder. Handle logic in application code.

  5. Ignoring NULL semantics: NULL is not “empty string” or “zero”. Queries behave differently with NULLs, and mixed NULL handling causes subtle bugs.

  6. Storing serialized objects or arrays in text columns: While convenient, this bypasses indexing and query capabilities. Use proper data types or a document store.

  7. Ignoring connection pooling: Each connection consumes memory. Use PgBouncer or similar to pool connections and avoid connection exhaustion.

  8. Not using prepared statements: Parsing overhead adds up. Use prepared statements for repeated queries.

  9. Mismatched types in joins: Joining VARCHAR to INTEGER causes implicit casts and prevents index usage. Always match types.
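The N+1 fix from the list above, sketched: fetch users and their orders in one round trip instead of one query per user (the id list is a placeholder):

-- Anti-pattern: run once per user in a loop
-- SELECT * FROM orders WHERE user_id = ?;

-- Better: one query with a join
SELECT users.id, users.email, orders.id AS order_id, orders.total
FROM users
LEFT JOIN orders ON orders.user_id = users.id
WHERE users.id IN (1, 2, 3);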

Real-World Case Studies

Netflix: From Oracle to PostgreSQL at Massive Scale

Netflix ran on Oracle for years — billions of rows, thousands of tables, serious hardware. In 2012 they migrated to PostgreSQL running on AWS RDS. The migration took two years of preparation, and the main motivation was cost: Oracle licensing was pricing them out at their scale.

The migration was incremental. They did not do a big-bang switch. They ran dual-write for months — writing to both Oracle and PostgreSQL simultaneously while comparing results. They built custom validation tools to catch any divergence. When they cut over, the decision was purely operational: flip a routing switch and watch the metrics.

The lesson: database migrations at scale are infrastructure problems more than database problems. The schema transfer was straightforward. The hard part was the two years of tooling, validation, and operational runbook development that made the switch boring.

Amazon: Why They Left Oracle for PostgreSQL and MySQL

Amazon’s consumer-facing databases migrated away from Oracle around 2013-2017. The stated reasons were licensing cost and control. At Amazon’s scale, Oracle’s per-core licensing model was a serious line item.

Amazon’s internal Postgres-compatible database, Aurora, launched in 2014. The architecture is not vanilla Postgres: it separates storage from compute and replicates six copies of the data across three availability zones. This gives MySQL and PostgreSQL compatibility with durability characteristics that a traditional single-instance deployment cannot match.

The takeaway: the database you pick shapes your operational constraints for years. Amazon built their own storage engine because off-the-shelf could not give them the durability they needed at their write volume. Most companies will not hit that ceiling, but it is worth knowing where the ceiling is.

Interview Questions

Q: What is the difference between a clustered and a non-clustered index?

A: In PostgreSQL, there is no clustered index concept like in SQL Server or MySQL InnoDB. Instead, PostgreSQL stores tables as heaps and indexes reference heap locations via block number and offset. A clustered index in SQL Server physically reorders the data on disk to match the index key order. In PostgreSQL, use CLUSTER to reorder a table physically once, but subsequent inserts are not maintained in order. This is one reason why choosing the right fill factor and autovacuum tuning matters more in PostgreSQL than in databases with true clustered indexes.

Q: A query that worked yesterday is suddenly slow. List the steps you would take to diagnose it.

A: First, run EXPLAIN (ANALYZE, BUFFERS) on the query and compare the plan to a known-good plan; the issue is often stale statistics causing the planner to pick a sequential scan instead of an index scan. Check whether the table has recently grown past an autovacuum threshold (a high dead tuple count makes the planner misestimate row counts). Check whether an index was dropped or disabled. Verify that the connection is not hitting a different replica with stale statistics. Check for lock contention: if the query is waiting on a lock, the reported time includes the wait, not just the execution.

Q: Explain ACID properties and which one is most often relaxed in distributed databases.

A: Atomicity means a transaction commits entirely or rolls back entirely, with no partial state. Consistency means the database moves from one valid state to another, with all constraints and invariants holding. Isolation means concurrent transactions do not interfere with each other; the isolation level determines exactly how much interference is visible. Durability means once a transaction commits, it survives crashes.

Distributed databases most often relax isolation. Serializability is expensive because it requires coordination across nodes. Many distributed databases default to eventual consistency or read-committed isolation to avoid the latency penalty of synchronous coordination. Durability is also frequently traded against latency in the write path, for example by acknowledging writes before the write-ahead log is flushed to disk.

Quick Recap

Key Bullets:

  • Relational databases provide ACID guarantees that keep your data consistent and durable
  • Indexes are essential for read performance but slow down writes
  • Normalization reduces redundancy; denormalization improves read performance
  • Transactions group operations into atomic units that commit or roll back together
  • Choose PostgreSQL for advanced features and standards compliance, MySQL for simple read-heavy workloads
  • Monitor query latency, connection counts, and replication lag in production

Copy/Paste Checklist:

-- Essential performance monitoring queries
SELECT * FROM pg_stat_activity WHERE state != 'idle';
SELECT * FROM pg_stat_replication;
SELECT relname, n_dead_tup, last_vacuum FROM pg_stat_user_tables;
EXPLAIN (ANALYZE, BUFFERS) your_query_here;

-- Essential security settings
-- pg_stat_statements must be in shared_preload_libraries (requires restart):
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
ALTER ROLE readonly NOLOGIN;
GRANT CONNECT ON DATABASE mydb TO app_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;

-- Essential backup verification
-- (renamed pg_backup_start / pg_backup_stop in PostgreSQL 15)
SELECT pg_start_backup('baseline');
SELECT pg_stop_backup();

Conclusion

Relational databases remain relevant because they solve real problems. ACID transactions give you confidence your data stays consistent. Joins let you ask complex questions across related data. Indexes make queries fast. Normalization keeps your data from becoming a mess.

PostgreSQL and MySQL both handle these concerns well. The features overlap significantly. My preference leans toward PostgreSQL for its feature set and standards compliance, but MySQL is equally capable for most workloads.

For related reading, see my posts on NoSQL Databases to understand when alternative data models make sense, and Database Scaling to learn how to handle growth beyond a single database instance.


Related Posts

Connection Pooling: HikariCP, pgBouncer, and ProxySQL

Learn connection pool sizing, HikariCP, pgBouncer, and ProxySQL, timeout settings, idle management, and when pooling helps or hurts performance.

#database #connection-pooling #performance

Table Partitioning: Range, Hash, and List Strategies

Learn how table partitioning improves query performance and maintenance, including range partitioning, hash partitioning, and partition pruning.

#databases #partitioning #performance

dbt: The SQL-First Transformation Tool for Data Teams

Discover how dbt brings software engineering practices—version control, testing, documentation—to SQL-based data transformations.

#data-engineering #dbt #sql