Zero-downtime database migrations: the expand-contract pattern in practice

Ship relational schema changes without maintenance windows using expand-contract phases, backfills, and safe cutovers—patterns that keep production APIs available under load.

Author: Matheus Palma · 7 min read
Software engineering · Architecture · Backend · PostgreSQL · DevOps

You need to split a users.full_name column into first_name and last_name before the next billing release. The naive approach—ALTER TABLE in a transaction, deploy code that reads the new shape, done—works on a laptop. In production, long-running DDL locks the table, API pods briefly see inconsistent shapes, and a rollback is no longer “revert the deploy” because the data has already moved. In client projects and long-lived products, the expensive failures are rarely the SQL syntax; they are ordering, lock duration, and mixed versions running at once.

Zero-downtime migrations are not a single tool. They are a contract between schema, application code, and deployment cadence, usually expressed as expand → migrate data → contract (often called expand/contract or parallel change). This article walks through that contract with PostgreSQL-oriented examples, explains why each phase exists, and calls out the pitfalls that still break teams who “follow the steps” without measuring locks and row counts.

Why one-step migrations fail in production

Relational databases couple metadata (schema) and data with strong consistency guarantees. When you rename a column or change a type, the database may need to rewrite the heap, rebuild indexes, or take strong locks so that concurrent sessions do not observe impossible states. Application servers rarely roll out atomically: during a rolling deploy, version N and version N+1 coexist. If N+1 expects a column that N never writes, or N reads a column N+1 stopped updating, you get subtle data loss or 500s that appear only under traffic splits.

The expand-contract pattern exists to guarantee that, for every moment in time, some deployed version’s assumptions match the schema and data that are actually present.

The three phases in plain language

Phase 1: Expand (additive, backward compatible)

Make the database strictly more capable without removing or redefining anything old:

  • Add nullable columns or new tables.
  • Add indexes concurrently where the engine supports it (for example CREATE INDEX CONCURRENTLY in PostgreSQL) so you do not block writes on large tables.
  • Add check constraints as NOT VALID first, then validate in a separate step to avoid long table scans under a blocking validation lock on some engines.

At the end of expand, old application code must still run correctly. That usually means new columns are nullable or have safe defaults, and new tables are unused until a later deploy.
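Put together, an expand migration tends to look like the following sketch. The table, column, and constraint names here are illustrative, not from the article's running example:

```sql
-- Hypothetical expand migration: purely additive, old code keeps working.
-- 1. Nullable column: a metadata-only change, no table rewrite.
ALTER TABLE orders ADD COLUMN shipped_at timestamptz;

-- 2. Index built without blocking writes.
--    Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_shipped_at ON orders (shipped_at);

-- 3. Check constraint added as NOT VALID: no scan of existing rows yet.
ALTER TABLE orders
  ADD CONSTRAINT orders_shipped_after_created
  CHECK (shipped_at IS NULL OR shipped_at >= created_at) NOT VALID;

-- 4. Validation in a separate step, under a weaker lock than a plain ADD CONSTRAINT.
ALTER TABLE orders VALIDATE CONSTRAINT orders_shipped_after_created;
```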

Phase 2: Migrate data and behavior (dual paths)

Ship application logic that:

  • Writes to both old and new representations (or writes only to new columns while still reading from old ones with a fallback query), until backfill completes.
  • Backfills existing rows in batches—keyset iteration by primary key, modest batch sizes, throttling—to avoid saturating I/O or replication lag.
  • Prefers the new representation on reads when it is populated, falling back to the old otherwise (or uses feature flags to shift cohorts gradually).

This phase is where teams win or lose on observability: you want metrics on backfill progress, replication lag, and error rates per version.
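One way to cover the dual-write leg without touching every code path is a temporary trigger. This is a hedged sketch against the full_name split from the introduction (column names assumed); it trades application branching for database-side logic that must itself be dropped during contract:

```sql
-- Hypothetical: keep new columns in sync while old code still writes full_name.
CREATE OR REPLACE FUNCTION users_sync_name_parts() RETURNS trigger AS $$
BEGIN
  IF NEW.full_name IS NOT NULL THEN
    NEW.first_name := split_part(NEW.full_name, ' ', 1);
    -- Everything after the first space becomes last_name; empty string maps to NULL.
    NEW.last_name  := nullif(substr(NEW.full_name, length(NEW.first_name) + 2), '');
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_name_parts
  BEFORE INSERT OR UPDATE OF full_name ON users
  FOR EACH ROW EXECUTE FUNCTION users_sync_name_parts();

-- Contract phase must DROP TRIGGER and DROP FUNCTION once dual-write ends.
```

Whether a trigger or explicit application code is better depends on how many write paths exist and how visible you need the sync logic to be in review.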

Phase 3: Contract (remove the old path)

When the new representation is authoritative for all rows and all running versions read it:

  • Deploy code that no longer references the old columns or tables.
  • Drop old objects in a migration that is safe under load (again favoring concurrent index drops and staying aware of lock behavior).

After contract, rolling back past the schema change is hard: treat these migrations as release events with backups and clear runbooks.
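A contract migration for the introduction's full_name scenario might look like this sketch (index name assumed; run only after metrics confirm no deployed version still reads the old column):

```sql
-- Fail fast instead of queueing behind long-running readers.
SET lock_timeout = '2s';

-- Cannot run inside a transaction block; many migration tools need a
-- per-migration flag to disable their transaction wrapper here.
DROP INDEX CONCURRENTLY IF EXISTS idx_users_full_name;

-- Takes ACCESS EXCLUSIVE, but as a metadata-only change it is brief.
ALTER TABLE users DROP COLUMN full_name;
```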

PostgreSQL-specific realities

Lock modes and ACCESS EXCLUSIVE. Operations such as ALTER TYPE, some ALTER COLUMN variants, or adding a column with a volatile default have historically caused full table rewrites or strong locks. Always consult current documentation for your exact version, and test migrations against a copy with production-like volume.

Defaults and backfills. Adding a NOT NULL column without a default fails outright on nonempty tables; the safe flow is multi-step: add the column nullable, backfill, then SET NOT NULL in a later migration once the invariant holds.

Foreign keys and cascades. New FKs can validate existing data and introduce lock contention; NOT VALID + VALIDATE CONSTRAINT splits the work in PostgreSQL.
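The two-step foreign-key flow looks like this (table and constraint names are illustrative):

```sql
-- Step 1: register the FK without scanning existing rows.
ALTER TABLE orders
  ADD CONSTRAINT orders_user_id_fkey
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

-- Step 2: validate existing rows under a weaker lock, at a time of your choosing.
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;
```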

Long transactions amplify pain. ORMs that open wide transactions around migrations interact badly with DDL. Run DDL from migration tooling with explicit session settings where appropriate, and avoid mixing interactive sessions with deployment hooks.
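Concretely, explicit session guards before DDL keep a stuck migration from stalling the whole write path. A minimal sketch (timeout values are assumptions to tune per system; the ALTER is illustrative):

```sql
SET lock_timeout = '2s';        -- give up if we queue behind a lock instead of blocking everyone behind us
SET statement_timeout = '30s';  -- abort a runaway statement rather than let it pile up

ALTER TABLE users ADD COLUMN example_col text;  -- illustrative DDL
```

Failing fast here is a feature: a migration that errors out on lock_timeout can simply be retried, while one that queues for an ACCESS EXCLUSIVE lock blocks every statement that arrives after it.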

Trade-offs and limitations

Time and complexity. A one-line RENAME COLUMN becomes three tickets and four deploys. That cost is only justified when downtime has a dollar value—usually it does for customer-facing APIs and for teams with strict SLOs.

Application complexity. Dual-write paths mean more branches, more tests, and temporary inconsistency windows if writes fail partially. Compensating logic or background repair jobs may be necessary when perfect dual-write atomicity across tables is not achievable in one transaction.

Not a substitute for capacity planning. If the table is already near disk or CPU limits, additive indexes and backfills can tip it over. Migrations and load tests belong in the same conversation.

Cross-database portability. “Concurrent” index creation and constraint validation semantics differ across engines. The pattern generalizes; the exact DDL does not copy-paste to MySQL or SQL Server without review.

Practical example: extracting a profile flag from JSON into a column

Suppose users.settings is jsonb and you read settings->>'newsletterOptIn' on every listing query, which is becoming CPU-heavy. Target: boolean users.newsletter_opt_in with an index for campaigns.

Step A — Expand (deploy migration 1).

ALTER TABLE users
  ADD COLUMN newsletter_opt_in boolean;

CREATE INDEX CONCURRENTLY idx_users_newsletter_opt_in
  ON users (newsletter_opt_in)
  WHERE newsletter_opt_in IS TRUE;

Application still ignores the column.

Step B — Application deploy 1 (dual write, read fallback).

On create/update of settings, set newsletter_opt_in from the JSON payload in the same transaction as the row update. On read paths that need the flag, use COALESCE(users.newsletter_opt_in, (settings->>'newsletterOptIn')::boolean) until backfill completes.
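As a concrete query, the fallback read described above looks like this (the final false default is an assumption about how an unset flag should be treated):

```sql
SELECT id,
       COALESCE(newsletter_opt_in,
                (settings->>'newsletterOptIn')::boolean,
                false) AS newsletter_opt_in
FROM users
WHERE id = $1;
```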

Step C — Backfill job (idempotent).

-- Run repeatedly in batches until no rows match
UPDATE users u
SET newsletter_opt_in = (u.settings->>'newsletterOptIn')::boolean
WHERE u.id IN (
  SELECT id FROM users
  WHERE newsletter_opt_in IS NULL
    AND settings ? 'newsletterOptIn'
  ORDER BY id
  LIMIT 5000
  FOR UPDATE SKIP LOCKED
);

Monitor rows remaining and replication lag; tune batch size to your storage.
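Both signals are cheap to query; a sketch of the checks a backfill dashboard or alert could poll (replication columns assume PostgreSQL 10+ naming):

```sql
-- Rows still to backfill: should trend to zero before Step D.
SELECT count(*) AS remaining
FROM users
WHERE newsletter_opt_in IS NULL
  AND settings ? 'newsletterOptIn';

-- Replication lag per standby, in bytes behind the primary's current WAL position.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```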

Step D — Enforce and simplify (deploy migration 2, after backfill metrics show zero gap).

ALTER TABLE users
  ALTER COLUMN newsletter_opt_in SET DEFAULT false;

-- Rows without the JSON key were never touched by the backfill; if they are
-- numerous, batch this statement the same way instead of one big UPDATE.
UPDATE users SET newsletter_opt_in = false WHERE newsletter_opt_in IS NULL;

ALTER TABLE users
  ALTER COLUMN newsletter_opt_in SET NOT NULL;

Application deploy 2 drops JSON fallback reads and writes only the column (JSON may remain for other keys).

Step E — Contract (deploy migration 3, after code no longer references JSON for this field).

Stop writing the redundant JSON key in application code first; then remove it in a lazy cleanup migration, or leave it inert if compatibility requires it—contract is about dependency removal, not aesthetic minimalism.

This skeleton maps directly to how I structure change tickets for APIs that must stay up through business hours: each step is deployable alone, and each rollback has a defined meaning.

Common mistakes and pitfalls

  • Deploying code before additive DDL on rolling infrastructures: new pods crash-loop while old pods keep serving, burning error budgets for no reason.
  • Non-concurrent indexes on hot tables, blocking writes during index builds.
  • Assuming backfill finished without a SQL check or metric; partial backfills plus NOT NULL are an excellent way to force an emergency revert.
  • Dual-write without a reconciliation story when one path fails: you end up with silent divergence between JSON and column.
  • Big-bang data fixes in the same release as semantic refactors—if something breaks, you cannot tell whether the query, the index, or the migration caused it.
  • Ignoring lock waits in staging: always inspect pg_locks / wait events for representative data sizes, not empty schemas.
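For the last point, PostgreSQL 9.6+ can name the blocker directly via pg_blocking_pids; a sketch of the queries worth keeping at hand while rehearsing a migration:

```sql
-- Who is waiting on a lock right now?
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- And who is blocking them?
SELECT blocked.pid   AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid  AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity blocking ON blocking.pid = b.pid;
```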

Conclusion

Zero-downtime migrations are sequenced compatibility work: expand the schema so old code keeps working, move data and traffic with measurable progress, then contract once only new assumptions remain. The pattern costs more tickets up front and saves far more in avoided outages, support churn, and emergency database surgery.

If you are designing a new service boundary or tightening an existing one, the same discipline applies—additive contracts first, observable transitions, and aggressive removal only when metrics prove it is safe. For collaboration on production-grade backends, scalable APIs, or migration planning on critical paths, the contact page is the right place to reach out; background on engineering focus areas is on about.
