Primary and Foreign Keys: A Practical Guide

Learn the difference between natural and surrogate keys, how foreign keys enforce referential integrity, cascade rules, and best practices for constraint naming.

published: reading time: 13 min read

Primary and Foreign Keys: A Practical Guide

Primary and foreign keys are how relational databases establish relationships and enforce data integrity. They shape query performance and schema maintainability. Getting them right from the start avoids pain later. This covers primary keys, foreign keys, the choices around natural versus surrogate keys, and the cascade rules that govern delete and update behavior.

Primary Keys

A primary key uniquely identifies each row in a table. No two rows can have the same key value, and the key cannot be NULL. Every table should have one.

Natural Keys

A natural key is a value that exists in the real world and uniquely identifies the entity. An email address might serve as a natural key for a users table. A Social Security Number for a taxpayers table. The value has meaning outside the database.

Natural keys make sense when the real-world identifier is stable and guaranteed unique. Email addresses change when people switch jobs. SSNs have edge cases (recent immigrants, elderly who never filed). In practice, natural keys are often problematic because real-world identifiers change more than you expect.

Surrogate Keys

A surrogate key has no meaning outside the database. It exists solely to identify rows. An auto-incrementing integer or a UUID serves as a surrogate key. The database generates it, and nothing in your application cares what the value is, as long as it is unique.

Surrogate keys are the default choice for most developers. They do not change. They have no business logic attached. You can add them to any table without coordinating with external systems.

CREATE TABLE users (
    user_id INT PRIMARY KEY AUTO_INCREMENT,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Composite Keys

A composite key uses multiple columns to uniquely identify a row. The combination of columns is unique, but individual columns may repeat.

CREATE TABLE course_enrollments (
    student_id INT,
    course_id INT,
    enrollment_date DATE,
    PRIMARY KEY (student_id, course_id)
);

Composite keys work for junction tables in many-to-many relationships. The primary key above combines student and course, ensuring a student cannot enroll in the same course twice.

Composite keys have tradeoffs. They make foreign key references verbose. A reference to course_enrollments requires both columns. This complexity adds up in large schemas.

Cascade Delete Flow

flowchart TD
    subgraph customers["customers table"]
    C1["customer_id = 1"]
    end
    subgraph orders["orders table"]
    O1["customer_id = 1\norder_id = 100"]
    O2["customer_id = 1\norder_id = 101"]
    end
    subgraph order_items["order_items table"]
    OI1["order_id = 100\nproduct_id = 5"]
    OI2["order_id = 101\nproduct_id = 7"]
    end
    C1 -->|ON DELETE CASCADE| O1
    C1 -->|ON DELETE CASCADE| O2
    O1 -->|ON DELETE CASCADE| OI1
    O2 -->|ON DELETE CASCADE| OI2

With CASCADE, deleting a customer automatically deletes their orders and order items. Without it, the deletion fails or leaves orphaned rows depending on the constraint action.

Foreign Keys

A foreign key establishes a link between two tables. The foreign key in one table references the primary key of another. This relationship enforces referential integrity: you cannot insert a value into the foreign key column that does not exist in the referenced table.

CREATE TABLE orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

The foreign key constraint prevents orphaned orders. If you try to insert an order with a customer_id that does not exist in the customers table, the database rejects it.

Foreign keys have a cost. Every insert and update must check the referenced table. For high-throughput write paths, this overhead accumulates. Some architectures defer integrity checks to the application layer. This trades safety for speed, and it works when your application enforces the rules correctly.

Referential Integrity and Cascade Rules

What happens when you delete a customer who has orders? The database must decide. Foreign key constraints include cascade rules that define the behavior.

ON DELETE

  • CASCADE: Delete the referencing rows when the referenced row is deleted. Delete a customer and all their orders disappear.

  • SET NULL: Set the foreign key to NULL when the referenced row is deleted. Delete a customer and their orders remain, but customer_id becomes NULL.

  • RESTRICT: Reject the deletion if referencing rows exist. You cannot delete a customer who has orders.

  • NO ACTION: Similar to RESTRICT, but the check happens at the end of the statement. Some databases treat these differently.

CREATE TABLE orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        ON DELETE CASCADE
);

ON UPDATE

UPDATE behaves the same way. CASCADE propagates primary key changes to foreign keys. SET NULL clears them. RESTRICT and NO ACTION block the update.

In practice, primary keys rarely change once assigned, especially surrogate keys. ON UPDATE CASCADE matters more for natural keys where the business identifier might legitimately change.

Constraint Naming

Give your constraints explicit names rather than letting the database generate cryptic identifiers.

-- Bad: constraint name is database-generated like "fk_orders_customer_id"
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)

-- Good: explicit name communicates intent
FOREIGN KEY (fk_orders_customer_id) REFERENCES customers(customer_id)

Explicit names make debugging easier. When a constraint violation occurs, the error message includes your name, not some auto-generated hash. This matters in production where you are parsing logs, not clicking through a GUI.

Naming conventions vary. Pick one and apply it consistently. A common pattern: fk_<table>_<referenced_table> for foreign keys, pk_<table> for primary keys, uq_<table>_<column> for unique constraints.

Indexes and Keys

Primary keys create clustered indexes by default in most databases. The clustered index determines the physical order of data pages. Foreign keys do not automatically create indexes, but you almost always want one.

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    FOREIGN KEY (fk_orders_customer_id) REFERENCES customers(customer_id)
);

-- You typically want this index for query performance
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Without an index on the foreign key column, queries that JOIN on customer_id must scan the orders table. For small tables this does not matter. For large tables with frequent joins, the index is essential.

Capacity Estimation: Foreign Key Index Storage

Foreign keys need indexes, and those indexes have storage and write costs. Here is how to size them.

Storage for a FK index is straightforward. A B-tree index on an INTEGER column consumes roughly 2–4 bytes per entry for the index itself, plus overhead for the index structure. For an orders table with 100 million rows and a customer_id FK, the index on customer_id is roughly 200–400 MB. A UUID FK (16 bytes) doubles that to 400–800 MB.

Write overhead is the more important factor. Every INSERT to the child table must also update the FK index on that column. Every UPDATE to the FK column updates the index. Every DELETE from the parent table that cascades must update the child’s FK index. If you have high-frequency writes and cascade deletes on large tables, the FK index write amplification adds up.

Estimate FK write cost: if you insert 10,000 orders per second and each has a customer_id FK, you are doing 10,000 index inserts per second to the FK index. On a busy write path, that index might become a bottleneck. Profile the actual write throughput before assuming it is free.

Rule of thumb: index FK columns when you routinely join on them in queries. Skip the index when the relationship is almost never queried and you only use cascade deletes for cleanup.

Key Decisions at a Glance: Trade-offs

DecisionWhen to chooseTrade-offs
Surrogate key (auto-increment INT, UUID)Most cases, especially when key has no business meaningNo natural connection to real-world data
Natural key (SSN, email, SKU)When the identifier already exists and is stableChanges in the real world force schema changes
Composite keyJunction tables, multi-column identifiersVerbose foreign key references, complex joins
Single-column surrogate + natural unique indexWhen both matterAdds a column you have to index anyway
ON DELETE CASCADEHierarchical data with clear ownershipDangerous on large tables — cascades are hard to stop
ON DELETE SET NULLOptional relationshipsRequires the FK column to be nullable
ON DELETE RESTRICTWhen child must outlive parentPrevents legitimate deletions
Defer FK enforcement to app layerUltra high-throughput write pathsYou own the integrity — mistakes are permanent

Surrogate keys win in most scenarios. The minor storage cost of an extra column is almost always worth the simplicity.

Common Production Failures

Cascade delete on large tables: Running DELETE FROM customers WHERE customer_id = 1 with CASCADE on a customer who has 50,000 orders locks every related row. On a busy database this blocks writes to the orders table for seconds or minutes. Test CASCADE behavior with realistic data volumes. For large datasets, batch deletes or soft deletes are safer.

Missing indexes on foreign keys: Every foreign key column that does not have an index causes full table scans during joins. With millions of rows, a missing FK index turns a simple lookup into a full scan on every query. Add the index.

Deferring integrity to the application: When teams disable foreign key constraints for performance and handle integrity in application code, one bug or oversight corrupts data permanently. The application layer is the wrong place for referential integrity — it is not transactional and it is not enforced for direct SQL access. If constraints are disabled for a bulk load, re-enable them immediately after and validate.

Surrogate key changes: When a surrogate key is used as a natural identifier (like an order number derived from the auto-increment ID), and that ID changes, every foreign key reference breaks. Once an ID leaves the system — in a URL, an email, a printed invoice — it cannot change. Use immutable identifiers for external-facing references.

Real-World Case Study: GitHub and Rails Primary Key Migration

GitHub ran MySQL with INT SIGNED as primary keys for years. When they hit 2^31 - 1 on their user ID sequence, they had to migrate every table to BIGINT. This is not a theoretical problem — it bit them in 2012 and forced a massive cross-schema migration.

Their lesson: surrogate key type selection has a long tail. They chose INT early because it seemed reasonable at the time. By the time the problem surfaced, they had millions of rows and hundreds of foreign key relationships. The migration took months of planning and had to be executed in ways that avoided locking production tables.

The equivalent problem with UUIDs is storage and index bloat. UUIDs are 16 bytes versus 4 for an INT. On a table with 500 million rows, switching from INT to UUID surrogate keys adds roughly 6 GB of storage per index. For composite keys involving multiple UUIDs, the overhead multiplies. GitHub’s subsequent move to BIGINT was less painful than a full UUID migration would have been.

The broader lesson: pick your key type for where you expect the data to be in five years, not where it is today. And understand what your ORM defaults to — Rails historically used INT for id columns, which worked until it did not.

Interview Questions

Q: What is the difference between a primary key and a unique index?

A primary key is a constraint that enforces uniqueness and cannot be NULL. A primary key also creates a clustered index by default in most databases (except PostgreSQL where it creates a B-tree index but does not cluster). A unique index enforces uniqueness but allows NULLs (multiple NULLs depending on the database). You can have many unique indexes on a table but only one primary key. Use a primary key when you need a true identifier for the row. Use a unique index for columns that should be unique but are not the row identifier.

Q: You need to delete a customer who has 50,000 orders. The business cannot tolerate downtime. Walk through your approach.

Never run a single DELETE with CASCADE on a large parent-child relationship in production. The database will lock every related row. Instead, set the FK on the orders table to SET NULL or SET DEFAULT for the customer_id, then delete the customer. This avoids locking the orders rows during deletion. Alternatively, archive orders in batches — update customer_id to a placeholder (like -1 for “deleted”) in batches of 1000, sleep briefly between batches to let writes through, then delete the customer. This keeps the table responsive throughout.

Q: When would you choose a natural key over a surrogate key?

Natural keys make sense when the business already has a stable identifier that lives outside the database. An ISBN for books is a decent natural key — it exists before the book enters your system and it uniquely identifies the book. A SKU that is printed on physical products and used across multiple systems is another good candidate. The risk is that natural keys can change. If the ISBN system changes, your key changes. If a customer changes their email address (the natural key), every foreign key reference breaks. Natural keys work best when the identifier is truly immutable and genuinely unique.

Q: A developer disabled foreign key constraints to improve bulk load performance. What are the risks and how would you mitigate them?

The risk is data corruption. With FK constraints disabled, you can insert orders with customer_id values that do not exist, and the database will happily accept them. Those orphaned rows silently accumulate. The mitigation is to disable constraints only for the bulk load window, then re-enable them immediately after, running a validation query to confirm no orphaned rows exist before re-enabling. Never ship production code with constraints disabled as a “performance optimization.”

When Keys Matter Most

Keys matter most in write-heavy workloads. Every insert, update, and delete must maintain key constraints. For read-heavy analytical workloads, keys add overhead without proportionate benefit if you never enforce relationships at the database level.

OLTP systems need strong key enforcement. Analytical systems and data warehouses often skip foreign keys entirely, loading data that is already cleaned and validated. Know your workload and choose accordingly.

For more on how tables connect, see the guide on joins and relationships. To understand how keys fit into broader schema design, explore schema design and relational databases.

Category

Related Posts

Understanding SQL JOINs and Database Relationships

Master SQL JOINs with this practical guide covering INNER, LEFT, RIGHT, FULL OUTER, and CROSS joins. Learn how relationship types between tables shape your queries.

#database #joins #sql

Database Normalization: From 1NF to BCNF

Learn database normalization from 1NF through BCNF. Understand how normalization eliminates redundancy, prevents update anomalies, and when denormalization makes sense for performance.

#database #normalization #relational-databases

Schema Design: Building the Foundation of Your Database

Learn how to design effective database schemas with proper data types, constraints, and relationships that scale with your application.

#database #schema-design #relational-databases