Primary and Foreign Keys: A Practical Guide

Learn the difference between natural and surrogate keys, how foreign keys enforce referential integrity, cascade rules, and best practices for constraint naming.

published: reading time: 25 min read author: GeekWorkBench

Primary and Foreign Keys: A Practical Guide

Primary and foreign keys are how relational databases establish relationships and enforce data integrity. They shape query performance and schema maintainability. Getting them right from the start avoids pain later. This covers primary keys, foreign keys, the choices around natural versus surrogate keys, and the cascade rules that govern delete and update behavior.

Primary Keys

A primary key uniquely identifies each row in a table. No two rows can have the same key value, and the key cannot be NULL. Every table should have one.

Natural Keys

A natural key is a value that exists in the real world and uniquely identifies the entity. An email address might serve as a natural key for a users table. A Social Security Number for a taxpayers table. The value has meaning outside the database.

Natural keys make sense when the real-world identifier is stable and guaranteed unique. Email addresses change when people switch jobs. SSNs have edge cases (recent immigrants, elderly who never filed). In practice, natural keys are often problematic because real-world identifiers change more than you expect.

Surrogate Keys

A surrogate key has no meaning outside the database. It exists solely to identify rows. An auto-incrementing integer or a UUID serves as a surrogate key. The database generates it, and nothing in your application cares what the value is, as long as it is unique.

Surrogate keys are the default choice for most developers. They do not change. They have no business logic attached. You can add them to any table without coordinating with external systems.

CREATE TABLE users (
    user_id INT PRIMARY KEY AUTO_INCREMENT,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Composite Keys

A composite key uses multiple columns to uniquely identify a row. The combination of columns is unique, but individual columns may repeat.

CREATE TABLE course_enrollments (
    student_id INT,
    course_id INT,
    enrollment_date DATE,
    PRIMARY KEY (student_id, course_id)
);

Composite keys work for junction tables in many-to-many relationships. The primary key above combines student and course, ensuring a student cannot enroll in the same course twice.

Composite keys have tradeoffs. They make foreign key references verbose. A reference to course_enrollments requires both columns. This complexity adds up in large schemas.

Cascade Delete Flow

flowchart TD
    subgraph customers["customers table"]
    C1["customer_id = 1"]
    end
    subgraph orders["orders table"]
    O1["customer_id = 1\norder_id = 100"]
    O2["customer_id = 1\norder_id = 101"]
    end
    subgraph order_items["order_items table"]
    OI1["order_id = 100\nproduct_id = 5"]
    OI2["order_id = 101\nproduct_id = 7"]
    end
    C1 -->|ON DELETE CASCADE| O1
    C1 -->|ON DELETE CASCADE| O2
    O1 -->|ON DELETE CASCADE| OI1
    O2 -->|ON DELETE CASCADE| OI2

With CASCADE, deleting a customer automatically deletes their orders and order items. Without it, the deletion fails or leaves orphaned rows depending on the constraint action.

Foreign Keys

A foreign key establishes a link between two tables. The foreign key in one table references the primary key of another. This relationship enforces referential integrity: you cannot insert a value into the foreign key column that does not exist in the referenced table.

CREATE TABLE orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

The foreign key constraint prevents orphaned orders. If you try to insert an order with a customer_id that does not exist in the customers table, the database rejects it.

Foreign keys have a cost. Every insert and update must check the referenced table. For high-throughput write paths, this overhead accumulates. Some architectures defer integrity checks to the application layer. This trades safety for speed, and it works when your application enforces the rules correctly.

Referential Integrity and Cascade Rules

What happens when you delete a customer who has orders? The database must decide. Foreign key constraints include cascade rules that define the behavior.

ON DELETE

  • CASCADE: Delete the referencing rows when the referenced row is deleted. Delete a customer and all their orders disappear.

  • SET NULL: Set the foreign key to NULL when the referenced row is deleted. Delete a customer and their orders remain, but customer_id becomes NULL.

  • RESTRICT: Reject the deletion if referencing rows exist. You cannot delete a customer who has orders.

  • NO ACTION: Similar to RESTRICT, but the check happens at the end of the statement. Some databases treat these differently.

CREATE TABLE orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        ON DELETE CASCADE
);

ON UPDATE

UPDATE behaves the same way. CASCADE propagates primary key changes to foreign keys. SET NULL clears them. RESTRICT and NO ACTION block the update.

In practice, primary keys rarely change once assigned, especially surrogate keys. ON UPDATE CASCADE matters more for natural keys where the business identifier might legitimately change.

Constraint Naming

Give your constraints explicit names rather than letting the database generate cryptic identifiers.

-- Bad: constraint name is database-generated like "fk_orders_customer_id"
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)

-- Good: explicit name communicates intent
FOREIGN KEY (fk_orders_customer_id) REFERENCES customers(customer_id)

Explicit names make debugging easier. When a constraint violation occurs, the error message includes your name, not some auto-generated hash. This matters in production where you are parsing logs, not clicking through a GUI.

Naming conventions vary. Pick one and apply it consistently. A common pattern: fk_<table>_<referenced_table> for foreign keys, pk_<table> for primary keys, uq_<table>_<column> for unique constraints.

Indexes and Keys

Primary keys create clustered indexes by default in most databases. The clustered index determines the physical order of data pages. Foreign keys do not automatically create indexes, but you almost always want one.

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    FOREIGN KEY (fk_orders_customer_id) REFERENCES customers(customer_id)
);

-- You typically want this index for query performance
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Without an index on the foreign key column, queries that JOIN on customer_id must scan the orders table. For small tables this does not matter. For large tables with frequent joins, the index is essential.

Capacity Estimation: Foreign Key Index Storage

Foreign keys need indexes, and those indexes have storage and write costs. Here is how to size them.

Storage for a FK index is straightforward. A B-tree index on an INTEGER column consumes roughly 2–4 bytes per entry for the index itself, plus overhead for the index structure. For an orders table with 100 million rows and a customer_id FK, the index on customer_id is roughly 200–400 MB. A UUID FK (16 bytes) doubles that to 400–800 MB.

Write overhead is the more important factor. Every INSERT to the child table must also update the FK index on that column. Every UPDATE to the FK column updates the index. Every DELETE from the parent table that cascades must update the child’s FK index. If you have high-frequency writes and cascade deletes on large tables, the FK index write amplification adds up.

Estimate FK write cost: if you insert 10,000 orders per second and each has a customer_id FK, you are doing 10,000 index inserts per second to the FK index. On a busy write path, that index might become a bottleneck. Profile the actual write throughput before assuming it is free.

Rule of thumb: index FK columns when you routinely join on them in queries. Skip the index when the relationship is almost never queried and you only use cascade deletes for cleanup.

Key Decisions at a Glance: Trade-offs

DecisionWhen to chooseTrade-offs
Surrogate key (auto-increment INT, UUID)Most cases, especially when key has no business meaningNo natural connection to real-world data
Natural key (SSN, email, SKU)When the identifier already exists and is stableChanges in the real world force schema changes
Composite keyJunction tables, multi-column identifiersVerbose foreign key references, complex joins
Single-column surrogate + natural unique indexWhen both matterAdds a column you have to index anyway
ON DELETE CASCADEHierarchical data with clear ownershipDangerous on large tables — cascades are hard to stop
ON DELETE SET NULLOptional relationshipsRequires the FK column to be nullable
ON DELETE RESTRICTWhen child must outlive parentPrevents legitimate deletions
Defer FK enforcement to app layerUltra high-throughput write pathsYou own the integrity — mistakes are permanent

Surrogate keys win in most scenarios. The minor storage cost of an extra column is almost always worth the simplicity.

Common Production Failures

Cascade delete on large tables: Running DELETE FROM customers WHERE customer_id = 1 with CASCADE on a customer who has 50,000 orders locks every related row. On a busy database this blocks writes to the orders table for seconds or minutes. Test CASCADE behavior with realistic data volumes. For large datasets, batch deletes or soft deletes are safer.

Missing indexes on foreign keys: Every foreign key column that does not have an index causes full table scans during joins. With millions of rows, a missing FK index turns a simple lookup into a full scan on every query. Add the index.

Deferring integrity to the application: When teams disable foreign key constraints for performance and handle integrity in application code, one bug or oversight corrupts data permanently. The application layer is the wrong place for referential integrity — it is not transactional and it is not enforced for direct SQL access. If constraints are disabled for a bulk load, re-enable them immediately after and validate.

Surrogate key changes: When a surrogate key is used as a natural identifier (like an order number derived from the auto-increment ID), and that ID changes, every foreign key reference breaks. Once an ID leaves the system — in a URL, an email, a printed invoice — it cannot change. Use immutable identifiers for external-facing references.

Real-World Case Study: GitHub and Rails Primary Key Migration

GitHub ran MySQL with INT SIGNED as primary keys for years. When they hit 2^31 - 1 on their user ID sequence, they had to migrate every table to BIGINT. This is not a theoretical problem — it bit them in 2012 and forced a massive cross-schema migration.

Their lesson: surrogate key type selection has a long tail. They chose INT early because it seemed reasonable at the time. By the time the problem surfaced, they had millions of rows and hundreds of foreign key relationships. The migration took months of planning and had to be executed in ways that avoided locking production tables.

The equivalent problem with UUIDs is storage and index bloat. UUIDs are 16 bytes versus 4 for an INT. On a table with 500 million rows, switching from INT to UUID surrogate keys adds roughly 6 GB of storage per index. For composite keys involving multiple UUIDs, the overhead multiplies. GitHub’s subsequent move to BIGINT was less painful than a full UUID migration would have been.

The broader lesson: pick your key type for where you expect the data to be in five years, not where it is today. And understand what your ORM defaults to — Rails historically used INT for id columns, which worked until it did not.

When Keys Matter Most

Keys matter most in write-heavy workloads. Every insert, update, and delete must maintain key constraints. For read-heavy analytical workloads, keys add overhead without proportionate benefit if you never enforce relationships at the database level.

OLTP systems need strong key enforcement. Analytical systems and data warehouses often skip foreign keys entirely, loading data that is already cleaned and validated. Know your workload and choose accordingly.

For more on how tables connect, see the guide on joins and relationships. To understand how keys fit into broader schema design, explore schema design and relational databases.

Real-world Failure Scenarios

Scenario 1: The Cascade Delete That Took Down Production

What happened: A support engineer needed to delete a test customer from the staging database. The customer account had accumulated 3 years of test orders, order items, and shipment records. When the DELETE ran with CASCADE, it locked every related row in the orders, order_items, shipments, and audit_logs tables simultaneously.

Root cause: CASCADE Delete was configured on all foreign key relationships without consideration for data volume. The staging database had grown to production scale without the team’s awareness, and the cascade path touched 8 related tables.

Impact: The DELETE blocked for 45 seconds, then the database locked all related tables. The staging environment became unresponsive, blocking deployments that relied on it for 2 hours.

Lesson learned: Always test CASCADE Delete with realistic data volumes. Consider SET NULL or RESTRICT for parent-child relationships where the child must outlive the parent or where deletion is rare and should be deliberate.

Scenario 2: The UUID Primary Key That Bloat the Index

What happened: A microservices architecture adopted UUIDs as primary keys across all services for distributed ID generation. After 6 months, storage costs doubled and query performance degraded significantly. The database team discovered B-tree indexes on UUID columns were causing severe page splits and index bloat.

Root cause: UUIDs are 16 bytes versus 4 for integers, and their random distribution causes new rows to be inserted into random positions in the B-tree, causing constant page splits and index fragmentation. The index was not fitting in memory efficiently.

Impact: Storage costs increased by 15GB across production databases. Query latency p99 increased from 20ms to 150ms due to index bloat causing extra disk I/O.

Lesson learned: UUIDs solve a real problem for distributed ID generation, but they have storage and performance costs. Consider uuid_generate_v4() with pgcrypto for sequential UUIDs, or use a composite key approach that preserves ordering. Always benchmark with realistic data volumes before standardizing on UUID primary keys.

Scenario 3: The Natural Key Change That Broke All External References

What happened: A company used ISBN as the primary key for their books table. When the publishing industry transitioned to ISBN-13, the development team updated the column but did not update the 47 places in application code, reports, and APIs that referenced the old ISBN-10 format. External systems that had stored the old ISBN format could no longer match books.

Root cause: The ISBN was used as both a natural key in the database and as an external identifier in URLs, APIs, and third-party integrations. When ISBN-13 replaced ISBN-10, the application treated it as a database schema change rather than an API contract change.

Impact: 47 integration points broke. Third-party resellers could not look up books. Revenue dropped 12% for two weeks while integrations were updated.

Lesson learned: Natural keys that appear in external systems should be treated as immutable external contracts. Use a surrogate key internally and reserve the natural key as a unique constraint or separate column. This separates the internal identifier from the external reference.

Interview Questions

1. What is the difference between a primary key and a unique index?

A primary key is a constraint that enforces uniqueness and cannot be NULL. A primary key also creates a clustered index by default in most databases (except PostgreSQL where it creates a B-tree index but does not cluster). A unique index enforces uniqueness but allows NULLs (multiple NULLs depending on the database). You can have many unique indexes on a table but only one primary key. Use a primary key when you need a true identifier for the row. Use a unique index for columns that should be unique but are not the row identifier.

2. You need to delete a customer who has 50,000 orders. The business cannot tolerate downtime. Walk through your approach.

Never run a single DELETE with CASCADE on a large parent-child relationship in production. The database will lock every related row. Instead, set the FK on the orders table to SET NULL or SET DEFAULT for the customer_id, then delete the customer. This avoids locking the orders rows during deletion. Alternatively, archive orders in batches — update customer_id to a placeholder (like -1 for "deleted") in batches of 1000, sleep briefly between batches to let writes through, then delete the customer. This keeps the table responsive throughout.

3. When would you choose a natural key over a surrogate key?

Natural keys make sense when the business already has a stable identifier that lives outside the database. An ISBN for books is a decent natural key — it exists before the book enters your system and it uniquely identifies the book. A SKU that is printed on physical products and used across multiple systems is another good candidate. The risk is that natural keys can change. If the ISBN system changes, your key changes. If a customer changes their email address (the natural key), every foreign key reference breaks. Natural keys work best when the identifier is truly immutable and genuinely unique.

4. A developer disabled foreign key constraints to improve bulk load performance. What are the risks and how would you mitigate them?

The risk is data corruption. With FK constraints disabled, you can insert orders with customer_id values that do not exist, and the database will happily accept them. Those orphaned rows silently accumulate. The mitigation is to disable constraints only for the bulk load window, then re-enable them immediately after, running a validation query to confirm no orphaned rows exist before re-enabling. Never ship production code with constraints disabled as a "performance optimization."

5. What is the difference between ON DELETE CASCADE and ON DELETE SET NULL?

ON DELETE CASCADE deletes the referencing rows when the parent row is deleted — removing a customer removes all their orders. ON DELETE SET NULL sets the foreign key column to NULL instead of deleting the rows — removing a customer leaves their orders in the table but without a customer reference. CASCADE is appropriate when the child rows should not exist without their parent. SET NULL is appropriate when the relationship is optional and the child rows should survive even if the parent is removed. SET NULL requires the FK column to be nullable.

6. What is a composite primary key and when would you use one?

A composite primary key uses multiple columns to uniquely identify a row. The combination of columns is unique, but individual columns may repeat. Use composite keys for junction tables in many-to-many relationships — PRIMARY KEY (order_id, product_id) ensures each product appears once per order. Composite keys are also appropriate when no single column is naturally unique but the combination is. Trade-offs: foreign key references to composite keys are verbose (must specify all columns), and queries that filter on only one column of the composite cannot use the index efficiently.

7. What is the performance impact of foreign key constraints?

Every INSERT and UPDATE on a child table must check that the foreign key value exists in the parent table. Every DELETE on a parent table must check for referencing rows (unless CASCADE is set). This check requires at least an index lookup, sometimes a full scan if no index exists. For high-throughput write paths with millions of inserts per day, FK constraint overhead accumulates. Some high-performance systems defer integrity checks to the application layer — this trades safety for speed and requires the application to enforce integrity perfectly.

8. Why should foreign key columns be indexed?

Without an index on the foreign key column, queries that JOIN on that column must scan the entire child table for each parent row. With millions of rows, this is catastrophic. The index allows the database to look up matching child rows efficiently. Additionally, when a parent row is deleted with CASCADE, the database must find all matching child rows — without an index, this requires a full table scan of the child table for each deleted parent row. Index FK columns when you routinely join on them in queries. Skip the index when the relationship is almost never queried and you only use cascade deletes for cleanup.

9. What is the difference between RESTRICT and NO ACTION for referential integrity?

Both block the operation when referencing rows exist. The difference is timing: RESTRICT checks immediately before the statement completes, failing fast. NO ACTION checks at the end of the statement, allowing other operations to proceed in between. Some databases treat these identically. In PostgreSQL, NO ACTION allows deferred checking while RESTRICT does not. In practice, NO ACTION is more flexible for complex transactions, while RESTRICT provides earlier failure detection. Choose RESTRICT for simpler schemas where immediate feedback on violations is preferred.

10. How do surrogate keys affect database portability?

Surrogate keys (auto-increment INT, UUID) are database-generated and have no meaning outside the database, making them portable across database engines. Natural keys (SSN, ISBN) exist independently and may have different formats or constraints in different databases. If you use INT surrogate keys and later migrate from MySQL to PostgreSQL, the key values are preserved. If you use VARCHAR natural keys that were unique in MySQL, they might not port cleanly if the target database has different case sensitivity or encoding rules. Surrogate keys reduce portability risk for the key itself, though other schema elements may still pose challenges.

11. What is the maximum size for a primary key and when does it matter?

Primary key size matters for two reasons: storage and index performance. A B-tree index on a UUID column (16 bytes) is larger than on an INT (4 bytes). For a table with 500 million rows, an INT primary key index is roughly 2 GB while a UUID index is roughly 8 GB. Additionally, UUIDs inserted randomly cause page splits and index fragmentation. Size also matters when the primary key is used as a foreign key in child tables — larger FK columns multiply storage costs. For extreme scale, consider whether a BIGINT or composite key of two INTs is more efficient than UUIDs.

12. What happens when you try to update a primary key value?

Updating a primary key value requires updating the primary key index and every foreign key index that references it. If child tables have CASCADE on the FK, the database must update every referencing row. This is expensive and locking-intensive. For surrogate keys, you should almost never update the primary key — the value has no meaning, so changing it serves no purpose. For natural keys used as primary keys, updates can be necessary but should be rare and tested with production-scale data volumes before deployment.

13. What is the relationship between primary keys and clustering?

In most databases (SQL Server, MySQL InnoDB), the primary key determines the clustered index — the physical order of data pages. Rows are stored in B-tree order by the primary key. This means that sequential primary key inserts (auto-increment) go to the end of the table efficiently, while random inserts (UUID) cause constant page splits throughout the table. PostgreSQL treats primary key and clustering separately — the primary key creates a B-tree index but does not determine physical row order. In PostgreSQL, you choose clustering separately with CLUSTER or the fillfactor storage parameter.

14. When would you use a natural key that is not the primary key?

Use a natural unique constraint (not primary key) when the business identifier has meaning but should not be the row identifier. An email address is a good example — it is unique and meaningful, but using it as the primary key means every foreign key in the system carries a 255-character string. A surrogate INT as primary key with a UNIQUE constraint on email keeps the business identifier enforced while keeping the internal reference compact. This separates the internal identifier from the external contract.

15. What is the ON UPDATE CASCADE use case?

ON UPDATE CASCADE propagates primary key changes to all foreign key references. This is necessary for natural keys that might legitimately change — like a country code that transitions from ISO 3166-1 alpha-2 to alpha-3, or a product SKU that gets reformatted. With CASCADE, updating the parent PK automatically updates all child FK values. Without it, the update either fails or leaves orphaned references. Surrogate keys (INT, UUID) do not need ON UPDATE CASCADE because they never change — once assigned, they are immutable.

16. How do you handle keys in a distributed database system?

Distributed databases cannot use auto-increment INT because different nodes might generate the same value. Options: UUID (16 bytes, globally unique but large and random), UUID v7 (time-ordered for better B-tree performance), snowflake IDs (64-bit with timestamp, machine ID, and sequence), and key generation services (centralized ticket servers). Choose based on whether you need time ordering (affects B-tree performance), whether IDs cross system boundaries (requires global uniqueness), and storage constraints (16 bytes vs 8 bytes).

17. What is the cost of CASCADE DELETE on a large table?

The cost is locking. When you delete a parent row, CASCADE must find and delete all child rows, then recursively delete their children. On a large parent with many children, this can lock the child tables for seconds or minutes, blocking other writes. The DELETE operation itself can only complete once all cascade targets are locked. On busy production systems, a single DELETE with CASCADE on a large hierarchy can cause an incident. Safer alternatives for large datasets: SET NULL (requires nullable FK), soft delete, or batched hard deletes with sleep intervals between batches.

18. What is the difference between a primary key constraint and a unique constraint in terms of NULL handling?

Primary key constraints cannot contain NULL — the constraint explicitly requires a value. Unique constraints allow one NULL (or multiple NULLs depending on the database since NULL != NULL). This matters for columns that are genuinely optional. If a column should be unique when populated but may be absent, use a unique constraint with a partial index or a filtered unique index (PostgreSQL) rather than a primary key. The database treats one NULL as unknown and allows multiple NULLs in unique constraints because NULL is not equal to itself.

19. Why should primary key names follow a convention?

Constraint names appear in error messages and system catalogs. A constraint named "fk_orders_customers" immediately communicates which tables and columns are involved. A constraint named "a3k9d2" requires a lookup. Explicit naming conventions also prevent name collisions — two tables might both have an "fk_orders_customer_id" if you do not qualify the table name. Standard naming (pk_

, fk_
_, uq_
_) makes constraint violations self-documenting and easier to debug in production logs.

20. How do you choose between INT and BIGINT for primary keys?

Choose INT (signed, max 2.1 billion) for tables that will never approach that limit. Choose BIGINT (max 9.2 quintillion) for high-volume tables, tables that serve as foreign key targets for many child tables, or when you expose the ID externally (URLs, APIs) and want headroom. BIGINT costs 8 bytes vs 4 bytes — for a table with 500 million rows, this is 2 GB vs 4 GB just for the primary key index. Only pay the BIGINT cost when you have a real need for the range. If you are unsure, INT is usually sufficient until proven otherwise.

Further Reading

Conclusion

Primary and foreign keys form the foundation of relational data modeling. Surrogate keys avoid the fragility of real-world identifiers; natural keys make sense when the identifier is stable and external. Foreign keys enforce referential integrity at the database level, but they come with write overhead that matters at scale. Cascade rules must be chosen deliberately and tested with realistic data volumes — a CASCADE DELETE that works fine in development can lock production tables for minutes.

Name your constraints explicitly. Index your foreign key columns. Choose key types for where your data will be, not where it is today. And treat any deferral of integrity checks to the application layer as a calculated risk, not a performance optimization.

Done well, keys are invisible — the database just works. Done poorly, they become the source of outages, data corruption, and migrations that take months. The time invested getting them right upfront pays dividends across the lifetime of the system.

Category

Related Posts

Understanding SQL JOINs and Database Relationships

Master SQL JOINs with this practical guide covering INNER, LEFT, RIGHT, FULL OUTER, and CROSS joins. Learn how relationship types between tables shape your queries.

#database #joins #sql

Database Normalization: From 1NF to BCNF

Learn database normalization from 1NF through BCNF. Understand how normalization eliminates redundancy, prevents update anomalies, and when denormalization makes sense for performance.

#database #normalization #relational-databases

Schema Design: Building the Foundation of Your Database

Learn how to design effective database schemas with proper data types, constraints, and relationships that scale with your application.

#database #schema-design #relational-databases