Zero-Downtime Database Migration Strategies

Learn how to evolve your database schema in production without downtime. This guide covers expand-contract patterns, backward-compatible migrations, rollback strategies, and tools.



Database migrations seem simple until you’re doing them in production with millions of rows and zero tolerance for downtime. Get it wrong and you get angry users, late-night escalations, and potentially corrupted data. Get it right and your application evolves smoothly while users don’t notice a thing.

This guide covers the strategies and tools that professional teams use to ship database changes without drama.

Understanding the Problem

When you’re working with a live database, every schema change carries risk. Adding a column seems innocuous until you realize it locks the table. Renaming a column sounds safe until the old code still running on some server tries to write to the old name. The problem isn’t the change itself—it’s the multiple versions of your application that exist during a deployment window.

Modern deployments use rolling updates or blue-green deployments. During these deployments, version N and version N+1 of your application are both running simultaneously. Your database migration must be compatible with both versions. This constraint is the foundation of all zero-downtime migration strategies.

The Expand-Contract Pattern

The expand-contract pattern, sometimes called the “two-phase migration,” works well for breaking schema changes. The core idea: never make a breaking change in a single step. Break it into three phases instead.

Phase 1: Expand

In the expand phase, you add new structure without removing the old. Suppose you want to rename a user_name column to display_name. Instead of renaming directly, you add the new column alongside the old one.

ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

Your application code now writes to both columns. Version N writes to user_name, and version N+1 writes to display_name. A background job backfills the new column with data from the old column.
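
The dual-write step can be sketched in a few lines. This is a minimal illustration using SQLite from the Python standard library as a stand-in for a production PostgreSQL database; the table and function names mirror the article's example and are otherwise illustrative.

```python
import sqlite3

# In-memory SQLite stands in for the production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.execute("INSERT INTO users (user_name) VALUES ('alice')")

# Phase 1: expand -- add the new column alongside the old one.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def save_name(conn, user_id, name):
    """Version N+1 write path: dual-write to both columns so that
    version N (which still reads user_name) keeps working."""
    conn.execute(
        "UPDATE users SET user_name = ?, display_name = ? WHERE id = ?",
        (name, name, user_id),
    )

save_name(conn, 1, "Alice Liddell")
row = conn.execute(
    "SELECT user_name, display_name FROM users WHERE id = 1"
).fetchone()
print(row)  # ('Alice Liddell', 'Alice Liddell') -- both columns in sync
```

Version N's write path is left untouched; only the new code dual-writes, which is what makes the two versions safe to run side by side.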

Phase 2: Migrate

Once the expand phase is complete and all historical data has been backfilled, you update the application code to stop writing to the old column. The old column becomes read-only.

-- Application now writes to display_name only
-- Old user_name column is kept for backward compatibility

Phase 3: Contract

After you have verified that no code is writing to the old column and all reads have been migrated, you can safely remove the old structure.

ALTER TABLE users DROP COLUMN user_name;

This three-phase approach ensures that at every point during the migration, some version of your application can function correctly.

flowchart LR
    subgraph Phase1["Phase 1: Expand"]
        A1[Add new column<br/>display_name] --> B1[Deploy vN+1 writes<br/>to both columns]
        B1 --> C1[Background job<br/>backfills new column]
    end

    subgraph Phase2["Phase 2: Migrate"]
        C1 --> D1[Stop writing to<br/>old column]
        D1 --> E1[Old column<br/>becomes read-only]
    end

    subgraph Phase3["Phase 3: Contract"]
        E1 --> F1[Verify no writes<br/>to old column]
        F1 --> G1[DROP old column<br/>user_name]
    end

    Phase1 --> Phase2 --> Phase3

The expand-contract pattern is a specific application of a broader principle: backward-compatible migrations. A migration is backward-compatible when it works correctly whether the application before or after the migration is running.

Adding Columns Without Locking

In modern PostgreSQL, most schema changes take only a brief ACCESS EXCLUSIVE lock to update catalog metadata rather than rewriting the table. However, some operations still rewrite or scan the entire table while holding that lock, and those are the ones that cause outages.

Adding a column with a default value used to force a full table rewrite in older PostgreSQL versions. Starting with PostgreSQL 11, adding a column with a non-volatile (constant) default only updates the table metadata, not the rows themselves: the default is stored once in the catalog and applied on reads. Volatile defaults such as random() still trigger a rewrite.

-- Safe in PostgreSQL 11+
ALTER TABLE users ADD COLUMN is_verified BOOLEAN DEFAULT FALSE;

For older versions, use a separate step:

-- Step 1: Add column without default
ALTER TABLE users ADD COLUMN is_verified BOOLEAN;

-- Step 2: Add default separately
ALTER TABLE users ALTER COLUMN is_verified SET DEFAULT FALSE;

-- Step 3: Backfill existing rows
UPDATE users SET is_verified = FALSE WHERE is_verified IS NULL;
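
The backfill in step 3 should not run as one giant UPDATE on a large table. Below is a sketch of a batched, keyset-paginated backfill, using stdlib SQLite as a stand-in for PostgreSQL; the tiny batch size is for demonstration only.

```python
import sqlite3

# SQLite stands in for the production database; names follow the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, is_verified INTEGER)")
conn.executemany("INSERT INTO users (is_verified) VALUES (?)", [(None,)] * 10)

BATCH_SIZE = 3  # production batches would be thousands of rows

def backfill(conn):
    """Backfill is_verified in small keyset-paginated batches so no
    single statement touches (and locks) the whole table at once."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE id > ? AND is_verified IS NULL "
            "ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        ).fetchall()
        if not rows:
            break
        ids = [r[0] for r in rows]
        conn.execute(
            f"UPDATE users SET is_verified = 0 "
            f"WHERE id IN ({','.join('?' * len(ids))})",
            ids,
        )
        conn.commit()  # release locks between batches
        last_id = ids[-1]
        # In production, also sleep briefly here to limit load.

backfill(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE is_verified IS NULL"
).fetchone()[0]
print(remaining)  # 0 -- every row has been backfilled
```

Committing between batches is the key design choice: each batch holds its locks only briefly, so concurrent traffic is never blocked for long.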

Index Creation

A plain CREATE INDEX blocks writes to the table for the duration of the build, in every PostgreSQL version. For large tables, this can mean minutes or hours of blocked writes. Use CONCURRENTLY to build indexes without blocking writes.

-- Creates index without blocking writes
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Your migration tool must be able to run such statements outside its usual per-migration transaction.

When to Use / When Not to Use

Use expand-contract migrations when you need to rename or remove columns that existing code still references, change column types requiring data transformation, split or merge tables, or make any schema change that breaks the current application version.

Use backward-compatible migrations when adding columns with safe defaults (PostgreSQL 11+), adding indexes concurrently, adding constraints as NOT VALID before validating, or working with rolling deployments where old and new app versions coexist.

Do not use expand-contract when the change is purely additive (new table, new column that nothing touches — just deploy it), when you have a maintenance window for a coordinated full migration, or when the risk of multi-version confusion outweighs the risk of a coordinated deployment.

Never do the following in production: single-step column renames or type changes on large tables; non-concurrent index creation on tables with active writes; migrations inside the same transaction as application code deployment.

Migration Tools and Rollback Support

Manual migration management does not scale. As your team grows and your database schema becomes more complex, you need tools that track migration history, enforce ordering, and help with rollback.

Flyway

Flyway is a mature, open-source migration tool that works with virtually any database. Migrations are stored as versioned SQL scripts.

V1__Create_initial_schema.sql
V2__Add_users_table.sql
V3__Add_email_index.sql

Flyway tracks which migrations have been applied and applies pending ones on startup. It supports undo migrations (in the paid tiers) and baseline migrations for existing databases.

Configuration:

flyway.url=jdbc:postgresql://localhost:5432/mydb
flyway.user=flyway
flyway.password=secret
flyway.locations=filesystem:./migrations

Liquibase

Liquibase uses a changelog file (XML, JSON, YAML, or SQL) to define migrations. This provides better cross-database compatibility since Liquibase handles the dialect translation.

<changeSet id="1" author="alex">
  <addColumn tableName="users">
    <column name="display_name" type="VARCHAR(255)"/>
  </addColumn>
</changeSet>

Liquibase tracks changes via a changelog table, allowing it to apply only new changes. It also supports rollbacks via rollback SQL or rollback scripts.

Prisma Migrate

If you are using Prisma as your ORM, Prisma Migrate integrates directly with your schema definition. You define your schema in schema.prisma, and Prisma generates and tracks migrations.

model User {
  id          String   @id @default(cuid())
  email       String   @unique
  displayName String?
  createdAt   DateTime @default(now())
}

Prisma Migrate generates SQL migrations based on your schema changes. It maintains a migration history table and supports development, staging, and production environments.

The main advantage is tight integration with your ORM layer. The downside is vendor lock-in—if you switch away from Prisma, you need to migrate your migration system too.

Adding Columns Without Locking Tables

This deserves its own section because it is such a common operation and a common source of production incidents.

The Problem with DEFAULT values

Before PostgreSQL 11, adding a column with a DEFAULT value caused a table rewrite. For a table with 100 million rows, this could take hours and lock the table the entire time.

Even in PostgreSQL 11+, certain operations still cause table locks:

  • Changing a column’s type
  • Changing a column’s NOT NULL constraint
  • Adding a CHECK constraint (unless it is added as NOT VALID)

Techniques for Safe Column Addition

  1. Add nullable columns first: Columns that allow NULL never require a table rewrite.

  2. Add columns with constant defaults: PostgreSQL 11+ applies non-volatile defaults as metadata-only changes, so no table rewrite occurs.

  3. Use concurrent index creation: As mentioned earlier, always use CONCURRENTLY for indexes on large tables.

  4. Avoid constraints initially: Add CHECK constraints as NOT VALID, then validate later:

-- Add constraint without scanning
ALTER TABLE orders ADD CONSTRAINT positive_total CHECK (total >= 0) NOT VALID;

-- Validate without locking
ALTER TABLE orders VALIDATE CONSTRAINT positive_total;

  5. Break up large alterations: For very large tables, consider a multi-step approach where you add a trigger to maintain the new column rather than backfilling all at once.
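
The trigger-based approach in the last item can be sketched as follows. This uses stdlib SQLite as a stand-in; PostgreSQL would use a trigger function plus CREATE TRIGGER, but the idea is identical: a trigger keeps the new column in sync for ongoing writes while a background job backfills historical rows.

```python
import sqlite3

# SQLite stands in for PostgreSQL; trigger and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Mirror every write to user_name into display_name. A real migration
# would also need an INSERT trigger (or application-level dual writes).
conn.execute("""
CREATE TRIGGER sync_display_name
AFTER UPDATE OF user_name ON users
BEGIN
    UPDATE users SET display_name = NEW.user_name WHERE id = NEW.id;
END
""")

conn.execute("INSERT INTO users (user_name) VALUES ('alice')")
conn.execute("UPDATE users SET user_name = 'Alice' WHERE id = 1")

row = conn.execute(
    "SELECT user_name, display_name FROM users WHERE id = 1"
).fetchone()
print(row)  # ('Alice', 'Alice') -- the trigger mirrored the write
```

Because the trigger handles live traffic, the backfill job only needs to cover rows that have not been written since the trigger was installed, and it can run at whatever pace the database tolerates.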

Best Practices

After working on this for years, a few practices consistently make migrations safer.

Test on production-size data: A migration that takes 2 seconds on your laptop might take 2 hours on production. Test against realistic data volumes.

Monitor table statistics: Run ANALYZE after major migrations to ensure the query planner has accurate information.

Have a rollback plan: Not for the migration itself, but for the deployment. Can you route traffic back to the old application version if something goes wrong?

Schedule during low traffic: Even with all precautions, migrations are safer during low-traffic windows.

Keep migrations small and focused: Large migrations are harder to review and harder to roll back. One change per migration makes debugging easier.

Production Failure Scenarios

| Failure | Cause | Mitigation |
| --- | --- | --- |
| Table lock during column addition | PostgreSQL versions < 11 rewrite the table when adding a column with a DEFAULT | Use PostgreSQL 11+ or multi-step column addition without defaults |
| Data divergence during the dual-write window | Application bug writes to the wrong column during the expand phase | Dual-write validation checks in staging; read-after-write consistency tests |
| Long-running migration holding locks | Transaction timeout on a large backfill UPDATE | Use smaller batch sizes with sleep intervals; never hold locks across batches |
| Rollback leaves partial schema state | Contract phase run before all app versions migrated | Gate the contract phase on deployment confirmation; keep the old column until fully verified |
| REINDEX blocks concurrent writes | REINDEX without CONCURRENTLY on a busy index | Always use REINDEX CONCURRENTLY in production |

Trade-Off Table: Migration Tools

| Dimension | Flyway | Liquibase | Prisma Migrate | Raw SQL |
| --- | --- | --- | --- | --- |
| Learning curve | Low | Medium | Low | Low |
| Rollback support | Paid version | Yes (changelog) | Limited | Manual |
| Cross-database support | Yes (JDBC) | Yes | No (ORM-coupled) | Yes |
| Transaction handling | Partial (non-transactional commands need special handling) | Partial | Yes | Manual |
| State tracking | Versioned files | Changelog table | Migration history | None |
| CI/CD integration | Excellent | Good | Good | Requires custom scripts |
| Large-team support | Good | Good | Good | Needs conventions |

Capacity Estimation: Migration Window Sizing

Migration window sizing determines how long your migration will take and how much downtime you actually need.

Batch size and migration time formula:

rows_per_batch = batch_size_bytes / avg_row_width_bytes
total_batches = total_rows / rows_per_batch
migration_time_minutes = total_batches × batch_interval_seconds / 60

For a 100M row table adding a NOT NULL column with a default value:

  • Average row width: 200 bytes
  • Batch size: 50MB
  • Rows per batch: 50MB / 200 = 250,000 rows
  • Total batches: 100M / 250K = 400 batches
  • Batch interval (including index rebuild): 5 seconds
  • Total migration time: 400 × 5 / 60 = ~33 minutes
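
The worked example above can be checked by transcribing the sizing formulas into a short script; all inputs are the article's own example numbers.

```python
# Direct transcription of the batch-sizing formulas, using the article's
# example inputs (100M rows, 200-byte rows, 50 MB batches, 5 s interval).
def migration_time_minutes(total_rows, avg_row_width_bytes,
                           batch_size_bytes, batch_interval_seconds):
    rows_per_batch = batch_size_bytes // avg_row_width_bytes
    total_batches = total_rows / rows_per_batch
    return total_batches * batch_interval_seconds / 60

minutes = migration_time_minutes(
    total_rows=100_000_000,
    avg_row_width_bytes=200,
    batch_size_bytes=50_000_000,   # 50 MB, treating 1 MB as 10^6 bytes
    batch_interval_seconds=5,
)
print(round(minutes, 1))  # 33.3 -- matches the ~33 minute estimate
```

Halving the batch interval or doubling the batch size each roughly halves the total time, which makes this formula useful for deciding how aggressive a backfill can safely be.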

Downtime window formula:

actual_downtime = sum(lock_time_per_switchover_step)

The expand-contract pattern keeps the expensive work fully online: the ~33-minute backfill calculated above happens while the application continues to serve traffic. Actual downtime is limited to the brief locks taken during the contract phase (dropping the old column) or, with a shadow-table approach, the final rename switchover, where each lock lasts milliseconds to seconds rather than minutes.

Validation time formula:

validation_time = table_row_count / validation_rows_per_second

Validating a migrated table against the original (row count, checksum) at 100K rows/second for a 100M row table: ~17 minutes. Plan for validation time equal to or greater than the migration itself.
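
A row-count-plus-checksum validation can be sketched compactly. This example uses stdlib SQLite and a simple XOR-of-digests aggregate; production tools often use database-side md5 aggregates, but the comparison logic is the same, and all names here are illustrative.

```python
import hashlib
import sqlite3

# SQLite stands in for the production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE users_shadow (id INTEGER PRIMARY KEY, email TEXT)")
rows = [(i, f"user{i}@example.com") for i in range(1, 1001)]
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
conn.executemany("INSERT INTO users_shadow VALUES (?, ?)", rows)

def table_fingerprint(conn, table):
    """Row count plus an order-independent digest over all rows."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    digest = 0
    for row in conn.execute(f"SELECT id, email FROM {table}"):
        h = hashlib.md5(repr(row).encode()).hexdigest()
        digest ^= int(h, 16)  # XOR makes the digest order-independent
    return count, digest

ok = table_fingerprint(conn, "users") == table_fingerprint(conn, "users_shadow")
print(ok)  # True -- counts and checksums match, safe to promote
```

Because XOR is commutative, the two tables can be scanned in any order (or in parallel batches) and still produce comparable fingerprints.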

Real-World Case Study: Stripe’s Zero-Downtime Migration Toolkit

Stripe runs one of the largest PostgreSQL deployments in the financial technology space, processing billions of dollars in transactions annually. Their migration toolkit is open-source and represents years of production experience.

The problem Stripe solved: At their scale, a naive ALTER TABLE statement blocking for even 30 seconds would result in thousands of failed payments. They needed a migration system that worked across multiple application versions simultaneously, with the ability to roll back instantly if something went wrong.

Key practices from Stripe’s approach:

  • Shadow tables: Create an identical copy of the table to migrate. Applications write to both old and new tables during the expand phase.
  • Backfill with minimal locks: Use batched updates with row-level locking rather than table-level locks. Stripe’s tooling uses WHERE id > last_processed_id ORDER BY id LIMIT batch_size pattern.
  • Validation before switchover: Compare row counts, checksums, and a random sample of rows between shadow and original tables before promoting the shadow.
  • Backward-compatible application code: Every migration is paired with an application deployment that can read both old and new schemas. Old code never breaks—new columns are simply ignored.
  • Instant rollback: If validation fails, drop the shadow table and roll back application code. No schema changes to undo.
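
The shadow-table lifecycle described in these practices can be condensed into one sketch: create, backfill with keyset pagination, validate, then promote via rename. Stdlib SQLite stands in for PostgreSQL, and all table names are illustrative rather than Stripe's actual tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 6)])

# 1. Create the shadow table with the target schema (renamed column).
conn.execute(
    "CREATE TABLE users_shadow (id INTEGER PRIMARY KEY, display_name TEXT)")

# 2. Backfill in keyset-paginated batches (batch size tiny for the demo).
last_id, BATCH = 0, 2
while True:
    batch = conn.execute(
        "SELECT id, user_name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH)).fetchall()
    if not batch:
        break
    conn.executemany("INSERT INTO users_shadow VALUES (?, ?)", batch)
    last_id = batch[-1][0]

# 3. Validate before switchover; on failure, drop the shadow and stop.
orig = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
shadow = conn.execute("SELECT COUNT(*) FROM users_shadow").fetchone()[0]
assert orig == shadow, "validation failed -- roll back by dropping the shadow"

# 4. Promote: a pair of fast renames is the only locking step.
conn.execute("ALTER TABLE users RENAME TO users_old")
conn.execute("ALTER TABLE users_shadow RENAME TO users")

names = [r[0] for r in
         conn.execute("SELECT display_name FROM users ORDER BY id")]
print(names[:2])  # ['user1', 'user2']
```

Note what instant rollback means here: before step 4, aborting is just DROP TABLE users_shadow, with no schema changes to undo on the live table.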

The lesson: Stripe’s toolkit proves that “zero-downtime” is not marketing—it’s a specific set of patterns (shadow table, dual write, validation) that require upfront investment but pay off at scale.

Interview Questions

Q1: Your team needs to add a NOT NULL column with a default value to a 500M row table. The application cannot have more than 5 minutes of downtime. Walk through your approach.

A: The key is to break the migration into zero-downtime phases using expand-contract. First, deploy application code that writes the default value to new and old columns simultaneously. Second, use a shadow table migration pattern: create a new table with the correct schema, backfill data in batches of ~10K rows using id > last_id ORDER BY id LIMIT batch_size to avoid table locks, replay live writes to both tables during the migration window. Third, validate row counts and checksums between shadow and original. Fourth, perform a quick lock switchover (typically under 30 seconds using ALTER TABLE ... RENAME). Finally, drop the old table after verifying the application works correctly.

Q2: What’s the difference between Flyway and Liquibase for managing database schema changes?

A: Flyway uses versioned SQL scripts (plain SQL files) with a simple migration history table. It’s straightforward, works with raw SQL expertise, and has minimal abstraction. Liquibase uses XML/YAML/JSON change logs with a declarative approach, tracking changes by change sets rather than raw SQL. Liquibase supports rollbacks natively through change set metadata; Flyway requires manually written undo scripts. For teams already invested in JVM ORM tooling such as Hibernate, Liquibase’s change log approach integrates well. For raw SQL shops, Flyway’s simplicity wins.

Q3: A migration fails halfway through. The application is now running against a partially-migrated schema. How do you recover?

A: Recovery depends on whether the migration was wrapped in a transaction. For expand-contract patterns without atomic transactions, the application is designed to be backward-compatible during the migration window—so the old application code continues working. Recovery steps: first, do not panic—the expand-contract pattern means the old schema still exists. Second, roll back application code to the previous version that speaks only the old schema. Third, if using a shadow table, validate data integrity between shadow and original. Fourth, after validating old schema is fully functional, clean up any partial shadow table artifacts. Finally, investigate failure root cause, adjust migration scripts, and re-run during the next maintenance window.

Q4: How do you handle schema migrations in a multi-region active-active database setup?

A: Multi-region migrations require either region-by-region rollouts or schemas that are backward-compatible across all regions simultaneously. Key challenges: DDL statements in PostgreSQL are not fully transactional across regions (certain DDL changes can cause brief locks). Solutions: use migrate-based expansion where each region validates its own schema version; use online DDL tools like pg_repack for table alterations instead of ALTER TABLE directly; avoid adding NOT NULL constraints without defaults in a single atomic step—instead, add nullable first, backfill, then add the NOT NULL constraint. For active-active, the safest pattern is schema changes that never take exclusive locks: add columns as nullable, add indexes CONCURRENTLY, and use trigger-based shadow tables for data migrations.

Conclusion

Zero-downtime database migrations are not about any single technique—they are about a mindset. Assume your application code and database schema will be out of sync during deployments. Design migrations that work with multiple application versions. Use the expand-contract pattern for breaking changes. Leverage your migration tooling to track history and automate consistency.

The investment in proper migration practices pays off every deployment. No drama, no late-night incidents, no corrupted data. Just smooth, boring deployments—the kind you want.

