Zero-Downtime Database Migration Strategies

Learn how to evolve your database schema in production without downtime. This guide covers expand-contract patterns, backward-compatible migrations, rollback strategies, and tools.

published: reading time: 28 min read author: GeekWorkBench

Zero-Downtime Database Migration Strategies

Database migrations seem simple until you’re doing them in production with millions of rows and zero tolerance for downtime. Get it wrong and you get angry users, late-night escalations, and potentially corrupted data. Get it right and your application evolves smoothly while users don’t notice a thing.

This guide covers the strategies and tools that professional teams use to ship database changes without drama.

Introduction

Database migrations are the most dangerous deployment operation you run. A bad migration in production with millions of rows and a zero-downtime requirement can cause data corruption, service outages, or require emergency rollbacks that create their own problems. The difference between a smooth migration and an incident is almost always whether you followed a proven pattern.

This guide covers the expand-contract pattern for breaking schema changes, backward-compatible migration strategies that work during rolling deployments, batch backfill techniques that do not lock tables, and rollback approaches when things go wrong. It focuses on PostgreSQL but the patterns apply broadly.

Understanding the Problem

When you’re working with a live database, every schema change carries risk. Adding a column seems innocuous until you realize it locks the table. Renaming a column sounds safe until the old code still running on some server tries to write to the old name. The problem isn’t the change itself—it’s the multiple versions of your application that exist during a deployment window.

Modern deployments use rolling updates or blue-green deployments. During these deployments, version N and version N+1 of your application are both running simultaneously. Your database migration must be compatible with both versions. This constraint is the foundation of all zero-downtime migration strategies.

The Expand-Contract Pattern

The expand-contract pattern, sometimes called the “two-phase migration,” works well for breaking schema changes. The core idea: never make a breaking change in a single step. Break it into three phases instead.

Phase 1: Expand

In the expand phase, you add new structure without removing the old. Suppose you want to rename a user_name column to display_name. Instead of renaming directly, you add the new column alongside the old one.

ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

Your application code now writes to both columns. Version N writes to user_name, and version N+1 writes to display_name. A background job backfills the new column with data from the old column.

Phase 2: Migrate

Once the expand phase is complete and all historical data has been backfilled, you update the application code to stop writing to the old column. The old column becomes read-only.

-- Application now writes to display_name only
-- Old user_name column is kept for backward compatibility

Phase 3: Contract

After you have verified that no code is writing to the old column and all reads have been migrated, you can safely remove the old structure.

ALTER TABLE users DROP COLUMN user_name;

This three-phase approach ensures that at every point during the migration, some version of your application can function correctly.

flowchart LR
    subgraph Phase1["Phase 1: Expand"]
        A1[Add new column<br/>display_name] --> B1[Deploy vN+1 writes<br/>to both columns]
        B1 --> C1[Background job<br/>backfills new column]
    end

    subgraph Phase2["Phase 2: Migrate"]
        C1 --> D1[Stop writing to<br/>old column]
        D1 --> E1[Old column<br/>becomes read-only]
    end

    subgraph Phase3["Phase 3: Contract"]
        E1 --> F1[Verify no writes<br/>to old column]
        F1 --> G1[DROP old column<br/>user_name]
    end

    Phase1 --> Phase2 --> Phase3

The expand-contract pattern is a specific application of a broader principle: backward-compatible migrations. A migration is backward-compatible when it works correctly whether the application before or after the migration is running.

Adding Columns Without Locking

PostgreSQL’s MVCC (Multi-Version Concurrency Control) means that most DDL changes no longer require exclusive locks like they used to. However, some operations still cause issues.

Adding a column with a default value used to require a table rewrite in older PostgreSQL versions. Starting with PostgreSQL 11, adding a column with a volatile default only updates the table metadata, not the rows themselves. The default is stored once in the catalog and applied on reads.

-- Safe in PostgreSQL 11+
ALTER TABLE users ADD COLUMN is_verified BOOLEAN DEFAULT FALSE;

For older versions, use a separate step:

-- Step 1: Add column without default
ALTER TABLE users ADD COLUMN is_verified BOOLEAN;

-- Step 2: Add default separately
ALTER TABLE users ALTER COLUMN is_verified SET DEFAULT FALSE;

-- Step 3: Backfill existing rows
UPDATE users SET is_verified = FALSE WHERE is_verified IS NULL;

Index Creation

Creating an index locks writes to the table in versions before PostgreSQL 11. For large tables, this can cause significant downtime. Use CONCURRENTLY to build indexes without blocking writes.

-- Creates index without blocking writes
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction. Your migration tool must support transactional migrations that can also run non-transactional commands.

When to Use / When Not to Use

Use expand-contract migrations when you need to rename or remove columns that existing code still references, change column types requiring data transformation, split or merge tables, or make any schema change that breaks the current application version.

Use backward-compatible migrations when adding columns with safe defaults (PostgreSQL 11+), adding indexes concurrently, adding constraints as NOT VALID before validating, or working with rolling deployments where old and new app versions coexist.

Do not use expand-contract when the change is purely additive (new table, new column that nothing touches — just deploy it), when you have a maintenance window for a coordinated full migration, or when the risk of multi-version confusion outweighs the risk of a coordinated deployment.

Never do the following in production: single-step column renames or type changes on large tables; non-concurrent index creation on tables with active writes; migrations inside the same transaction as application code deployment.

Rollback Strategies

Manual migration management does not scale. As your team grows and your database schema becomes more complex, you need tools that track migration history, enforce ordering, and help with rollback.

Flyway

Flyway is a mature, open-source migration tool that works with virtually any database. Migrations are stored as versioned SQL scripts.

V1__Create_initial_schema.sql
V2__Add_users_table.sql
V3__Add_email_index.sql

Flyway tracks which migrations have been applied and applies pending ones on startup. It supports undo migrations (in paid version) and baseline migrations for existing databases.

Configuration:

flyway.url=jdbc:postgresql://localhost:5432/mydb
flyway.user=flyway
flyway.password=secret
flyway.locations=filesystem:./migrations

Liquibase

Liquibase uses a changelog file (XML, JSON, YAML, or SQL) to define migrations. This provides better cross-database compatibility since Liquibase handles the dialect translation.

<changeSet id="1" author="alex">
  <addColumn tableName="users">
    <column name="display_name" type="VARCHAR(255)"/>
  </addColumn>
</changeSet>

Liquibase tracks changes via a changelog table, allowing it to apply only new changes. It also supports rollbacks via rollback SQL or rollback scripts.

Prisma Migrate

If you are using Prisma as your ORM, Prisma Migrate integrates directly with your schema definition. You define your schema in schema.prisma, and Prisma generates and tracks migrations.

model User {
  id          String   @id @default(cuid())
  email       String   @unique
  displayName String?
  createdAt   DateTime @default(now())
}

Prisma Migrate generates SQL migrations based on your schema changes. It maintains a migration history table and supports development, staging, and production environments.

The main advantage is tight integration with your ORM layer. The downside is vendor lock-in—if you switch away from Prisma, you need to migrate your migration system too.

Adding Columns Without Locking Tables

This deserves its own section because it is such a common operation and a common source of production incidents.

The Problem with DEFAULT values

Before PostgreSQL 11, adding a column with a DEFAULT value caused a table rewrite. For a table with 100 million rows, this could take hours and lock the table the entire time.

Even in PostgreSQL 11+, certain operations still cause table locks:

  • Changing a column’s type
  • Changing a column’s NOT NULL constraint
  • Adding a CHECK constraint

Techniques for Safe Column Addition

  1. Add nullable columns first: Columns that allow NULL never require a table rewrite.

  2. Add columns with non-blocking defaults: PostgreSQL 11+ handles volatile defaults efficiently for column addition.

  3. Use concurrent index creation: As mentioned earlier, always use CONCURRENTLY for indexes on large tables.

  4. Avoid constraints initially: Add CHECK constraints as NOT VALID, then validate later:

-- Add constraint without scanning
ALTER TABLE orders ADD CONSTRAINT positive_total CHECK (total >= 0) NOT VALID;

-- Validate without locking
ALTER TABLE orders VALIDATE CONSTRAINT positive_total;
  1. Break up large alterations: For very large tables, consider a multi-step approach where you add a trigger to maintain the new column rather than backfilling all at once.

Best Practices

After working on this for years, a few practices consistently make migrations safer.

Test on production-size data: A migration that takes 2 seconds on your laptop might take 2 hours on production. Test against realistic data volumes.

Monitor table statistics: Run ANALYZE after major migrations to ensure the query planner has accurate information.

Have a rollback plan: Not for the migration itself, but for the deployment. Can you route traffic back to the old application version if something goes wrong?

Schedule during low traffic: Even with all precautions, migrations are safer during low-traffic windows.

Keep migrations small and focused: Large migrations are harder to review and harder to rollback. One change per migration makes debugging easier.

Production Failure Scenarios

FailureCauseMitigation
Table lock during column renamePostgreSQL versions < 11 with DEFAULT values caused table rewriteUse PostgreSQL 11+ or multi-step column addition without defaults
Dual-write window data divergenceApplication bug writes to wrong column during expand phaseDual-write validation checks in staging, read-after-write consistency tests
Long-running migration holding locksTransaction timeout on large backfill UPDATEUse smaller batch sizes with sleep intervals, never hold locks across batches
Rollback leaves partial schema stateContract phase run before all app versions migratedGate contract phase on deployment confirmation, keep old column until fully verified
Concurrent REINDEX blocks writesREINDEX without CONCURRENTLY on busy indexAlways use REINDEX CONCURRENTLY in production

Trade-Off Table: Migration Tools

DimensionFlywayLiquibasePrisma MigrateRaw SQL
Learning curveLowMediumLowLow
Rollback supportPaid versionYes (changelog)LimitedManual
Cross-database supportYes (JDBC)YesNo (ORM-coupled)Yes
Transaction handlingPartial (non-transactional commands break it)PartialYesManual
State trackingVersioned filesChangelog tableMigration historyNone
CI/CD integrationExcellentGoodGoodRequires custom script
Large team supportGoodGoodGoodNeeds conventions

Capacity Estimation: Migration Window Sizing

Migration window sizing determines how long your migration will take and how much downtime you actually need.

Batch size and migration time formula:

rows_per_batch = batch_size_bytes / avg_row_width_bytes
total_batches = total_rows / rows_per_batch
migration_time_minutes = total_batches × batch_interval_seconds / 60

For a 100M row table adding a NOT NULL column with a default value:

  • Average row width: 200 bytes
  • Batch size: 50MB
  • Rows per batch: 50MB / 200 = 250,000 rows
  • Total batches: 100M / 250K = 400 batches
  • Batch interval (including index rebuild): 5 seconds
  • Total migration time: 400 × 5 / 60 = ~33 minutes

Downtime window formula:

actual_downtime = sum(contract_phase_locks)
contract_phase_downtime = migration_phase_batches × batch_interval_seconds

The expand-contract pattern minimizes actual downtime to only the contract phase. A 100M row table with 5-second batch intervals and 400 batches has a contract phase of ~33 minutes. With a shadow table approach, actual lock time per batch drops to milliseconds—meaning ~5 minutes of total downtime.

Validation time formula:

validation_time = table_row_count × (1 / validation_rows_per_second)

Validating a migrated table against the original (row count, checksum) at 100K rows/second for a 100M row table: ~17 minutes. Plan for validation time equal to or greater than the migration itself.

Real-World Case Study: Stripe’s Zero-Downtime Migration Toolkit

Stripe runs one of the largest PostgreSQL deployments in the financial technology space, processing billions of dollars in transactions annually. Their migration toolkit is open-source and represents years of production experience.

The problem Stripe solved: At their scale, a naive ALTER TABLE statement blocking for even 30 seconds would result in thousands of failed payments. They needed a migration system that worked across multiple application versions simultaneously, with the ability to roll back instantly if something went wrong.

Key practices from Stripe’s approach:

  • Shadow tables: Create an identical copy of the table to migrate. Applications write to both old and new tables during the expand phase.
  • Backfill with minimal locks: Use batched updates with row-level locking rather than table-level locks. Stripe’s tooling uses WHERE id > last_processed_id ORDER BY id LIMIT batch_size pattern.
  • Validation before switchover: Compare row counts, checksums, and a random sample of rows between shadow and original tables before promoting the shadow.
  • Backward-compatible application code: Every migration is paired with an application deployment that can read both old and new schemas. Old code never breaks—new columns are simply ignored.
  • Instant rollback: If validation fails, drop the shadow table and roll back application code. No schema changes to undo.

The lesson: Stripe’s toolkit proves that “zero-downtime” is not marketing—it’s a specific set of patterns (shadow table, dual write, validation) that require upfront investment but pay off at scale.

Quick Recap Checklist

  • Zero-downtime migrations require backward-compatible schema changes that work with multiple app versions
  • The expand-contract pattern breaks breaking changes into three phases: add new, migrate data, remove old
  • Always use CONCURRENTLY for index creation on tables with active writes
  • Add columns as nullable first; add NOT NULL constraints after backfilling
  • Use NOT VALID to add CHECK constraints without scanning rows
  • Migration tools (Flyway, Liquibase, Prisma Migrate) track which migrations have been applied
  • Test migrations on production-size data — 2 seconds on laptop can mean 2 hours in production
  • Schedule migrations during low-traffic windows even with zero-downtime techniques
  • Keep migrations small and focused — one change per migration is easier to debug and rollback
  • Always have a rollback plan for the deployment (not the migration itself)
  • Monitor table statistics after major migrations — run ANALYZE to update query planner

Interview Questions

1. Your team needs to add a NOT NULL column with a default value to a 500M row table. The application cannot have more than 5 minutes of downtime. Walk through your approach.

The key is to break the migration into zero-downtime phases using expand-contract. First, deploy application code that writes the default value to new and old columns simultaneously. Second, use a shadow table migration pattern: create a new table with the correct schema, backfill data in batches of ~10K rows using id > last_id ORDER BY id LIMIT batch_size to avoid table locks, replay live writes to both tables during the migration window. Third, validate row counts and checksums between shadow and original. Fourth, perform a quick lock switchover (typically under 30 seconds using ALTER TABLE ... RENAME). Finally, drop the old table after verifying the application works correctly.

2. What's the difference between Flyway and Liquibase for managing database schema changes?

Flyway uses versioned SQL scripts (plain SQL files) with a simple migration history table. It's straightforward, works with raw SQL expertise, and has minimal abstraction. Liquibase uses XML/YAML/JSON change logs with a declarative approach, tracking changes by change sets rather than raw SQL. Liquibase supports rollbacks natively through change set metadata; Flyway requires manually written undo scripts. For teams already using ORM tooling, Liquibase's change log approach integrates well with Prisma and Hibernate. For raw SQL shops, Flyway's simplicity wins.

3. A migration fails halfway through. The application is now running against a partially-migrated schema. How do you recover?

Recovery depends on whether the migration was wrapped in a transaction. For expand-contract patterns without atomic transactions, the application is designed to be backward-compatible during the migration window—so the old application code continues working. Recovery steps: first, do not panic—the expand-contract pattern means the old schema still exists. Second, roll back application code to the previous version that speaks only the old schema. Third, if using a shadow table, validate data integrity between shadow and original. Fourth, after validating old schema is fully functional, clean up any partial shadow table artifacts. Finally, investigate failure root cause, adjust migration scripts, and re-run during the next maintenance window.

4. How do you handle schema migrations in a multi-region active-active database setup?

Multi-region migrations require either region-by-region rollouts or schemas that are backward-compatible across all regions simultaneously. Key challenges: DDL statements in PostgreSQL are not fully transactional across regions (certain DDL changes can cause brief locks). Solutions: use migrate-based expansion where each region validates its own schema version; use online DDL tools like pg_repack for table alterations instead of ALTER TABLE directly; avoid adding NOT NULL constraints without defaults in a single atomic step—instead, add nullable first, backfill, then add the NOT NULL constraint. For active-active, the safest pattern is schema changes that never take exclusive locks: add columns as nullable, add indexes CONCURRENTLY, and use trigger-based shadow tables for data migrations.

5. Your application uses Prisma Migrate but you need to migrate from PostgreSQL to MySQL. What challenges arise and how do you handle them?

Prisma Migrate is ORM-coupled and generates database-specific SQL, so switching databases requires rewriting migrations entirely. Challenges: different SQL dialects (PostgreSQL's SERIAL vs MySQL's AUTO_INCREMENT), different constraint syntax, different index types, and data type mismatches (PostgreSQL's JSONB vs MySQL's JSON). Approach: export your current schema as SQL, translate to MySQL dialect manually or via a tool like pg2mysql, then create new migration files in MySQL format. For data migration, use a ETL tool or write a migration script that reads from PostgreSQL and writes to MySQL. Alternatively, consider using Flyway or Liquibase for database-agnostic migrations since they use pure SQL changelogs that can be dialect-translated.

6. A migration requires backfilling millions of rows with a computed value. The backfill runs for hours and you need to deploy application changes. How do you handle this?

Split the migration into phases: First, deploy application code that writes the computed value to both old and new columns during the expand phase. Second, run the backfill in batches using UPDATE ... WHERE id > last_id ORDER BY id LIMIT batch_size with small batches (e.g., 1000 rows) and sleep intervals (e.g., 100ms between batches) to avoid lock contention and allow the application to continue writing. Third, once backfill is complete, deploy the contract phase that drops the old column. The key is that the expand phase allows the application to continue functioning while backfill runs in the background — the application writes are the "live backfill" for new rows. Schedule the final contract phase (old column drop) during low-traffic window since it requires a brief lock.

7. What is the difference between a "migration" and a "seed" in database deployment pipelines? When would you use each?

Migrations modify schema (DDL) — adding columns, creating indexes, altering constraints. Seeds populate initial or reference data — lookup tables, default configurations, test data. Use migrations for schema evolution; use seeds for data that belongs in the database from the start. In deployment pipelines: run migrations before the application starts (schema must exist for the app to connect), run seeds after migrations complete but before traffic arrives (data must exist for the app to function correctly). Never mix seeds into migrations — if a seed fails, you should be able to re-run it independently without re-running the entire migration history.

8. You need to rename a column that is referenced by 50+ stored procedures, functions, and views. How do you approach this safely in PostgreSQL?

Use the expand-contract pattern with dependent object handling: Phase 1 (Expand): Add the new column alongside the old one. Update all stored procedures, functions, and views to write to both columns using INSTEAD OF triggers or application-level dual write. Create new versions of all dependent objects that reference the new column name, keeping old versions for backward compatibility. Phase 2 (Migrate): Backfill data from old to new column. Switch all reads to use the new column. Phase 3 (Contract): Drop the old column and delete old versions of dependent objects. For 50+ dependent objects, use pg_proc and pg_depend to find all references: SELECT proname FROM pg_proc WHERE prosrc LIKE '%old_column_name%'. Automate the generation of the migration script using this query.

9. Explain why CREATE INDEX CONCURRENTLY cannot run inside a transaction and how this affects migration tooling.

CREATE INDEX CONCURRENTLY must not run inside a transaction because it needs to commit each intermediate step separately — the index build happens in multiple phases (index creation, initial data population, index ready). If the entire operation were in one transaction and failed, rolling back would leave partial index state that PostgreSQL cannot clean up automatically. Migration tools like Flyway normally wrap migrations in transactions for atomicity, but CONCURRENTLY breaks this. Solutions: use migration tools that support non-transactional commands (Liquibase can be configured for this), separate index creation migrations from DDL migrations, or use pg_repack for non-blocking index builds instead of CONCURRENTLY. Always verify your migration tool's transaction handling before deploying index changes in production.

10. How do you handle database migrations in a blue-green deployment where two versions of the application run simultaneously?

Blue-green deployments require backward-compatible migrations because the old (blue) and new (green) application versions both connect to the same database during the transition window. Rules: never drop or rename columns/tables that old code still references; never add NOT NULL constraints without defaults on columns that old code writes to; never create indexes without CONCURRENTLY. Migrations must be deployable before traffic switches to the new version, and they must be safe to run while both versions are active. The ideal migration is one that works with both versions — adding a nullable column, adding an index concurrently, or adding a constraint as NOT VALID. After traffic switches and old code is fully retired, you can run the contract phase to remove old structures.

11. A migration fails mid-execution because it tried to add a constraint that already exists (constraint violation). How do you make migrations idempotent?

Idempotent migrations check for existence before executing. Instead of ALTER TABLE orders ADD CONSTRAINT positive_total CHECK (total >= 0), use: ALTER TABLE orders ADD CONSTRAINT positive_total CHECK (total >= 0) NOT VALID combined with a pre-check: ALTER TABLE orders VALIDATE CONSTRAINT positive_total wrapped in DO $$ BEGIN ... EXCEPTION WHEN duplicate_object THEN NULL; END $$. For columns: ALTER TABLE users ADD COLUMN IF NOT EXISTS is_verified BOOLEAN DEFAULT FALSE (PostgreSQL 9.5+). For indexes: CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email). The pattern: always check existence or handle the exception, so re-running a migration that partially succeeded or already has the structure does not fail.

12. What is the "deadlock" scenario in large-table migrations and how does batch processing prevent it?

Deadlock occurs when a migration holds locks on multiple rows or tables in a different order than the application, or when multiple batches of the same migration run simultaneously. For example, a migration that updates rows in table A then table B, while an application updates B then A, creates deadlock potential. Batch processing prevents this by: acquiring locks in a consistent order (always update lower-id rows first), using smaller batch sizes to limit lock duration, adding SELECT ... FOR UPDATE SKIP LOCKED to skip rows already locked by concurrent operations, and adding sleep intervals between batches to allow other operations to proceed. Monitor pg_locks during large migrations and set alerts for blocked queries exceeding threshold.

13. Your team uses GitOps for database migrations (migrations run automatically on deployment). What safeguards do you need?

GitOps migration safeguards: (1) Migration files must be reviewedable and versioned — no inline SQL execution; (2) Migrations must be idempotent — re-running must not fail; (3) Include rollback capability in migration history, not just the migration itself; (4) Add a dry-run or validation step before production execution; (5) Set up manual approval gates for destructive migrations (DROP COLUMN, DROP TABLE, DROP INDEX); (6) Monitor migration duration and set timeouts — a migration running longer than expected indicates a problem; (7) Have a kill-switch that stops the deployment pipeline if migration fails; (8) Test migrations on a production-data clone before running in production; (9) Maintain a migration freeze window during peak traffic or critical business periods.

14. Explain the concept of "migration debt" and how to manage it.

Migration debt accumulates when schema changes diverge from application code expectations — old columns still referenced by code that was supposed to be removed, old tables that should have been dropped but weren't, constraints that were added but never validated. It builds up over time from rushed migrations, failed contracts, and deferred cleanup. Managing it: track pending contract phases in migration documentation or a project tracker; schedule periodic "schema cleanup" sprints; enforce that every migration includes a contract phase with a deadline; use static analysis tools like pgcluu or pg_stat_statements to detect unused columns and tables. The risk: migration debt increases operational complexity, slows queries (PostgreSQL still scans dead columns), and creates security surfaces.

15. How do you test migrations before production deployment?

Testing migrations: (1) Run on production-size data clone — use a point-in-time recovery replica or snapshot for testing; (2) Measure execution time — a migration that takes 2 seconds on laptop might take 2 hours on production with real data volumes; (3) Check lock escalation — verify the migration does not escalate to table-level locks under load; (4) Test rollback — verify that if the migration fails partway, the rollback path works correctly; (5) Run concurrent load test — while migration is running, simulate application traffic to verify no deadlock or blocking; (6) Check dependency impacts — run EXPLAIN on dependent queries to verify indexes are used correctly after the migration.

16. A team wants to use "lazy migration" where the application schema version and database schema version are allowed to drift temporarily. Is this safe?

Lazy migration (where application code and database schema are allowed to be out of sync temporarily) is safe only with strong backward compatibility guarantees and short drift windows. Requirements: application must handle both old and new schema simultaneously (expand phase is complete before deploy); drift window must be minutes to hours, not days; rollback must be possible without data loss; all team members must be aware of the drift and not introduce conflicting changes during the window. The expand-contract pattern is a form of lazy migration where the expand phase intentionally creates drift. The danger is when drift becomes unbounded — multiple teams making independent changes that each assume the other has migrated, leading to incompatible schema versions running simultaneously.

17. What are the trade-offs between schema-first (Prisma) and migration-first (Flyway) database deployment approaches?

Schema-first (Prisma): schema definition is the source of truth; migrations are generated from schema diffs. Pros: single source of truth, no drift between schema and migrations. Cons: ORM lock-in, less control over SQL syntax, harder to handle complex migrations. Migration-first (Flyway): SQL scripts are the source of truth; schema is built from migrations. Pros: full SQL control, database-agnostic, works with any ORM. Cons: schema and migration history can drift if not maintained carefully, manual schema management required. For teams using Prisma as ORM and wanting tight integration, schema-first makes sense. For teams that need cross-database support, complex custom SQL, or want to treat migrations as the audit log of schema changes, migration-first is better.

18. Your database has grown to 2TB and you need to add a new column with a default. PostgreSQL 11+ handles volatile defaults efficiently, but what if you are on an older version?

On PostgreSQL versions before 11, adding a column with a DEFAULT required a table rewrite (every row was updated). For a 2TB table, this could take hours and lock the table. Workaround: add the column as nullable without a default (ALTER TABLE users ADD COLUMN is_active BOOLEAN); update in batches (UPDATE users SET is_active = FALSE WHERE is_active IS NULL AND ctid IN (SELECT ctid FROM users WHERE is_active IS NULL LIMIT 10000)); add the default after all rows are backfilled (ALTER TABLE users ALTER COLUMN is_active SET DEFAULT FALSE); finally, backfill the default for any remaining NULLs if needed. This multi-step approach avoids the table rewrite while achieving the same result.

19. A migration adds a UNIQUE constraint but there are existing duplicate values in the column. How do you handle this?

Adding a UNIQUE constraint when duplicates exist will fail. Handle it before the migration: identify duplicates: SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1. Decide on a resolution strategy: delete duplicates (keep row with highest ID, delete others), merge duplicates (combine data into one row, update references), or update duplicates to be unique (e.g., append a suffix to email). For production with duplicates, the safest approach is a multi-step expand-contract: add a nullable unique column with a temporary unique index; backfill with unique values (append a suffix to duplicates); validate uniqueness; then add the proper NOT NULL UNIQUE constraint. Always document how duplicates were resolved for audit purposes.

20. How do you handle cross-database schema migrations when you need to copy data between two different database systems?

Cross-database migrations require an ETL (Extract-Transform-Load) approach rather than a single SQL migration. Extract: read data from the source database using its native tools (pg_dump for PostgreSQL, mysqldump for MySQL). Transform: convert data types, handle dialect differences (e.g., TEXT vs VARCHAR, different date formats), and transform schemas (column renames, type mappings). Load: use a tool like pgloader, a custom ETL script, or a data integration platform. For zero-downtime cross-database migration: set up dual-write at the application layer (write to both databases simultaneously), run parallel reads during migration window, validate data integrity between source and target, then switch read path to new database. Tools: pgloader (PostgreSQL to many targets), AWS DMS (Database Migration Service), or custom scripts with batch processing.

Further Reading

Conclusion

Zero-downtime database migrations are not about any single technique—they are about a mindset. Assume your application code and database schema will be out of sync during deployments. Design migrations that work with multiple application versions. Use the expand-contract pattern for breaking changes. Leverage your migration tooling to track history and automate consistency.

The investment in proper migration practices pays off every deployment. No drama, no late-night incidents, no corrupted data. Just smooth, boring deployments—the kind you want.

Category

Related Posts

Deployment Strategies: Rolling, Blue-Green, and Canary Releases

Compare and implement deployment strategies—rolling updates, blue-green deployments, and canary releases—to reduce risk and enable safe production releases.

#deployment #devops #kubernetes

CI/CD Pipelines for Microservices

Learn how to design and implement CI/CD pipelines for microservices with automated testing, blue-green deployments, and canary releases.

#microservices #cicd #devops

Health Checks: Liveness, Readiness, and Service Availability

Master health check implementation for microservices including liveness probes, readiness probes, and graceful degradation patterns.

#microservices #health-checks #kubernetes