Database schema changes are risky. Traditional migrations require downtime, but modern applications demand 24/7 availability. This guide shows how to evolve your database schema without taking your application offline.
The Problem with Traditional Migrations
Standard migration approaches have a critical flaw:
-- Traditional approach: Breaking change
ALTER TABLE users DROP COLUMN middle_name;
What happens:
- Deploy migration → column disappears
- Old application code tries to read middle_name → ERROR
- Application breaks until new code is deployed
Even with perfect coordination, there’s a window where the schema and code are incompatible.
Zero-Downtime Migration Principles
Follow these principles to migrate safely:
1. Backward Compatibility
Every schema change must work with both old and new application code:
-- Bad: Breaks old code immediately
ALTER TABLE users RENAME COLUMN name TO full_name;
-- Good: Add new column, keep old one
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = name WHERE full_name IS NULL;
2. Expand-Contract Pattern
The safest migration strategy, with a short sketch after the two phases:
Expand Phase:
- Add new schema elements (columns, tables, indexes)
- Old code ignores new schema
- New code uses new schema
Contract Phase (after old code is gone):
- Remove old schema elements
- Clean up redundant data
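Continuing the name → full_name example from the backward-compatibility principle above, a minimal sketch of the two phases might look like this (the contract step runs only once no deployed code still reads the old column):
-- EXPAND: add and backfill the new column; old code keeps using name
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- (deploy code that reads and writes full_name, wait until no release uses name)
-- CONTRACT: remove the old column once nothing references it
ALTER TABLE users DROP COLUMN name;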
3. Multi-Step Deployments
Never combine schema changes with code changes in a single deployment. Instead:
- Deploy schema expansion
- Deploy code that works with both old and new schema
- Migrate data (if needed)
- Deploy code that uses only new schema
- Deploy schema contraction
Common Migration Patterns
Pattern 1: Adding a Column
Safe approach:
-- Step 1: Add column with default value
ALTER TABLE products ADD COLUMN description TEXT DEFAULT '';
-- Step 2: Deploy code that writes to description
-- Old code: Ignores description
-- New code: Reads and writes description
-- Step 3: Backfill existing rows (if needed)
UPDATE products SET description = name WHERE description = '';
-- Step 4: Remove default constraint (if needed)
ALTER TABLE products ALTER COLUMN description DROP DEFAULT;
-- Step 5: Add NOT NULL constraint (if needed)
-- Only after all rows have values!
ALTER TABLE products ALTER COLUMN description SET NOT NULL;
Timeline:
- Deploy 1: Schema change (instant)
- Deploy 2: Code using new column (minutes later)
- Deploy 3: Backfill data (hours/days later, off-peak)
- Deploy 4: Final constraints (days later)
Pattern 2: Removing a Column
Use expand-contract pattern:
-- Phase 1: Stop writing to the column (code-only change)
-- Deploy code that doesn't write to legacy_field
-- Wait for all deployments to complete
-- Phase 2: Verify no writes (this example assumes an audit_log table exists)
SELECT * FROM audit_log
WHERE table_name = 'users'
AND column_name = 'legacy_field'
AND operation = 'UPDATE'
AND created_at > NOW() - INTERVAL '1 day';
-- Phase 3: CONTRACT - Drop column
ALTER TABLE users DROP COLUMN legacy_field;
Important: Wait days or weeks between phases to ensure no rollbacks are needed.
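If you don't have an audit log, a temporary trigger can catch any writes that still touch the column. This is a minimal sketch under that assumption; the legacy_field_writes scratch table and function name are hypothetical:
-- Hypothetical scratch table and trigger to catch lingering writes to legacy_field
CREATE TABLE legacy_field_writes (
    user_id INT,
    changed_at TIMESTAMPTZ DEFAULT now()
);
CREATE OR REPLACE FUNCTION log_legacy_field_write()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.legacy_field IS DISTINCT FROM OLD.legacy_field THEN
        INSERT INTO legacy_field_writes (user_id) VALUES (NEW.id);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER watch_legacy_field
AFTER UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION log_legacy_field_write();
-- Drop the trigger and scratch table before dropping legacy_field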
Pattern 3: Renaming a Column
Renaming requires a multi-step approach:
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
-- Step 2: Copy data
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- Step 3: Create trigger to keep columns in sync
CREATE OR REPLACE FUNCTION sync_email_columns()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- Whichever column the writer populated wins
        NEW.email_address := COALESCE(NEW.email_address, NEW.email);
        NEW.email := COALESCE(NEW.email, NEW.email_address);
    ELSIF TG_OP = 'UPDATE' THEN
        -- Propagate the change from whichever column was actually modified
        IF NEW.email IS DISTINCT FROM OLD.email THEN
            NEW.email_address := NEW.email;
        ELSIF NEW.email_address IS DISTINCT FROM OLD.email_address THEN
            NEW.email := NEW.email_address;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_email
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_email_columns();
-- Step 4: Deploy code that uses email_address
-- Both columns stay in sync via trigger
-- Step 5 (weeks later): Drop old column and trigger
DROP TRIGGER sync_email ON users;
DROP FUNCTION sync_email_columns();
ALTER TABLE users DROP COLUMN email;
Pattern 4: Changing Column Type
Type changes are particularly dangerous. Use expand-contract:
-- Current: price DECIMAL(10,2)
-- Target: price DECIMAL(15,4)
-- Step 1: Add new column
ALTER TABLE products ADD COLUMN price_new DECIMAL(15,4);
-- Step 2: Backfill data
UPDATE products SET price_new = price WHERE price_new IS NULL;
-- Step 3: Create sync trigger (the trigger function is sketched below)
CREATE TRIGGER sync_price
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION sync_price_columns();
-- Step 4: Deploy code reading from price_new
-- Step 5: Deploy code writing to price_new only
-- Step 6: Drop trigger and old column
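The sync_price_columns() function referenced in step 3 isn't defined above; a minimal sketch, following the same shape as the email sync trigger, might look like this (the column precedence rules are assumptions for the example):
CREATE OR REPLACE FUNCTION sync_price_columns()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.price_new := COALESCE(NEW.price_new, NEW.price);
        NEW.price := COALESCE(NEW.price, NEW.price_new);
    ELSIF TG_OP = 'UPDATE' THEN
        IF NEW.price IS DISTINCT FROM OLD.price THEN
            NEW.price_new := NEW.price;
        ELSIF NEW.price_new IS DISTINCT FROM OLD.price_new THEN
            -- Caveat: values wider than DECIMAL(10,2) cannot be copied back into price
            NEW.price := NEW.price_new;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;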
For complex type changes (e.g., STRING → JSON):
-- Parse and migrate data in batches
DO $$
DECLARE
batch_size INT := 1000;
processed INT := 0;
BEGIN
LOOP
UPDATE products
SET metadata_json = metadata_string::jsonb
WHERE id IN (
SELECT id FROM products
WHERE metadata_json IS NULL
LIMIT batch_size
);
GET DIAGNOSTICS processed = ROW_COUNT;
EXIT WHEN processed = 0;
-- Sleep to avoid overwhelming database
PERFORM pg_sleep(1);
END LOOP;
END $$;
Pattern 5: Adding an Index
Large indexes can lock tables. Use concurrent index creation:
-- Bad: Locks table for duration of index build
CREATE INDEX idx_users_email ON users(email);
-- Good: Builds index without blocking writes (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- For MySQL (5.6+):
ALTER TABLE users ADD INDEX idx_users_email (email), ALGORITHM=INPLACE, LOCK=NONE;
Important: Concurrent index creation is slower and uses more resources, but doesn’t block production traffic.
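One more caveat worth planning for: if CREATE INDEX CONCURRENTLY fails or is cancelled, PostgreSQL leaves behind an INVALID index that must be dropped and rebuilt. A quick check:
-- Find indexes left invalid by a failed concurrent build (PostgreSQL)
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
-- Drop and retry
DROP INDEX CONCURRENTLY idx_users_email;
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);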
Pattern 6: Splitting a Table
When a table becomes too large or serves multiple purposes:
-- Current: users table with orders data
-- Goal: Separate users and user_orders tables
-- Step 1: Create new table
CREATE TABLE user_orders (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL REFERENCES users(id),
order_date DATE,
total DECIMAL(10,2),
-- other order fields
);
CREATE INDEX idx_user_orders_user_id ON user_orders(user_id);
-- Step 2: Migrate data
INSERT INTO user_orders (user_id, order_date, total)
SELECT id, last_order_date, last_order_total
FROM users
WHERE last_order_date IS NOT NULL;
-- Step 3: Deploy code that writes to both places (sketched after step 6)
-- Step 4: Verify data consistency
-- Step 5: Deploy code that reads from new table
-- Step 6: Remove old columns from users table
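A minimal sketch of the dual-write code from step 3, in the same style as the application snippets later in this guide; the db helper and the exact column names on users are assumptions for the example:
def record_last_order(user_id, order_date, total):
    # Old location: columns still live on the users table
    db.execute(
        "UPDATE users SET last_order_date = %s, last_order_total = %s WHERE id = %s",
        order_date, total, user_id)
    # New location: dedicated user_orders table
    db.execute(
        "INSERT INTO user_orders (user_id, order_date, total) VALUES (%s, %s, %s)",
        user_id, order_date, total)
Ideally both writes run in a single transaction so the two locations cannot drift apart if one statement fails.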
Advanced Techniques
Feature Flags for Schema Changes
Use feature flags to control schema usage:
# Application code
def get_user_email(user_id):
    if feature_flags.is_enabled('use_new_email_column'):
        return db.query("SELECT email_address FROM users WHERE id = %s", user_id)
    else:
        return db.query("SELECT email FROM users WHERE id = %s", user_id)
Benefits:
- Test new schema with subset of traffic
- Instant rollback without deployment
- Gradual migration
Shadow Traffic
Validate new schema without risk:
def write_order(order_data):
    # Write to production schema
    db.execute("INSERT INTO orders_v1 (data) VALUES (%s)", order_data)
    # Shadow write to new schema (don't use result)
    try:
        db.execute("INSERT INTO orders_v2 (data) VALUES (%s)", transform_data(order_data))
    except Exception as e:
        log_error("Shadow write failed", e)  # Log but don't fail request
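Shadow writes are only useful if you compare the results. A simple periodic consistency check might look like this sketch, which assumes both tables carry a created_at column:
-- Compare recent write volume between the old and new schema
SELECT
    (SELECT COUNT(*) FROM orders_v1 WHERE created_at > NOW() - INTERVAL '1 hour') AS v1_count,
    (SELECT COUNT(*) FROM orders_v2 WHERE created_at > NOW() - INTERVAL '1 hour') AS v2_count;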
Online Schema Change Tools
For massive tables, use specialized tools:
PostgreSQL:
- pg_repack: Rebuild tables without locking
MySQL:
- gh-ost: GitHub’s online schema change tool
- pt-online-schema-change: Percona’s tool
Example with gh-ost:
gh-ost \
--host=prod-db \
--database=myapp \
--table=users \
--alter="ADD COLUMN bio TEXT" \
--execute
Benefits:
- Changes applied with minimal locking
- Progress monitoring
- Pause/resume capability
- Automatic rollback on error
Migration Testing
Test in Production-Like Environment
# Create staging replica (shell)
pg_basebackup -D /var/lib/postgresql/staging -Fp -Xs -P -R
# Run the migration against staging (shell)
psql staging < migration.sql
-- Verify performance
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Check table size
SELECT pg_size_pretty(pg_total_relation_size('users'));
Rollback Strategy
Every migration needs a rollback plan:
-- migration_up.sql
BEGIN;
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
CREATE INDEX idx_users_phone ON users(phone);
COMMIT;
-- migration_down.sql
BEGIN;
DROP INDEX idx_users_phone;
ALTER TABLE users DROP COLUMN phone;
COMMIT;
Test rollback before deploying:
# Apply migration
psql < migration_up.sql
# Test application
# Rollback
psql < migration_down.sql
Real-World Example: Migrating User Authentication
Goal: Move from username-based to email-based authentication
Current schema:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL
);
Target schema:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL
);
Migration steps:
-- Week 1: Add email column
ALTER TABLE users ADD COLUMN email VARCHAR(255);
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Deploy: Application collects email during registration
-- Users table: username (unique), email (nullable, becoming populated)
-- Week 2: Backfill emails for existing users
-- Prompt users to add email, or generate from username
UPDATE users SET email = username || '@legacy.example.com'
WHERE email IS NULL;
-- Week 3: Make email required
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Deploy: Application authenticates with email OR username (see sketch after the migration steps)
-- Users table: username (unique), email (unique, not null)
-- Week 4: Deprecate username-based authentication
-- Deploy: Application only authenticates with email
-- Add warning banner: "Username login will be removed"
-- Week 5+: Monitor for username usage
SELECT COUNT(*) FROM auth_logs
WHERE auth_method = 'username'
AND created_at > NOW() - INTERVAL '7 days';
-- Week 8: Remove username column (if usage == 0)
ALTER TABLE users DROP COLUMN username;
-- Dropping the column also removes its unique constraint and index
-- Deploy: Remove username support from application
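For the week-3 deploy ("authenticates with email OR username"), the application-side lookup might look like the sketch below. The db helper (returning a row or None) and verify_password are assumptions, consistent with the earlier snippets:
def authenticate(identifier, password):
    # Transitional login: try the new email column first, fall back to the legacy username
    user = db.query("SELECT id, password_hash FROM users WHERE email = %s", identifier)
    if user is None:
        user = db.query("SELECT id, password_hash FROM users WHERE username = %s", identifier)
    if user is None or not verify_password(password, user.password_hash):
        return None
    return user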
Monitoring During Migrations
Track key metrics:
-- Lock monitoring (PostgreSQL)
SELECT
pid,
usename,
pg_blocking_pids(pid) as blocked_by,
query as blocked_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
-- Long-running queries
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - pg_stat_activity.query_start > interval '5 minutes';
-- Table bloat after migration
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
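On PostgreSQL 12 and later you can also watch the progress of a concurrent index build directly, a useful companion to the queries above:
-- Progress of CREATE INDEX / CREATE INDEX CONCURRENTLY (PostgreSQL 12+)
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;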
Best Practices Checklist
Before any migration:
- Test migration on production-like data
- Measure migration time on full dataset
- Write rollback script and test it
- Plan deployment timeline (multiple phases)
- Set up monitoring and alerts
- Communicate with team about deployment
- Have database backup ready
- Schedule during low-traffic period
- Prepare feature flag for quick rollback
- Document what changed and why
Conclusion
Zero-downtime migrations require patience and planning:
- Never break backward compatibility
- Use expand-contract pattern for significant changes
- Deploy in multiple phases separated by time
- Test everything in staging first
- Monitor closely during and after migration
The key insight: Database changes can’t be atomic when you have running application code. Accept this reality and design migrations that work with it.
A week-long migration might feel slow, but it’s infinitely faster than recovering from a production outage. Take your time, test thoroughly, and migrate safely.