Dillon Browne

Zero-Downtime Database Migrations at Scale

A battle-tested playbook for executing database migrations on high-traffic production systems without downtime, featuring real-world patterns and automation.

Tags: Database Migration, DevOps, Zero Downtime, PostgreSQL, Cloud Architecture, Automation, Site Reliability, Infrastructure as Code

Database migrations are where DevOps theory meets harsh reality. You can have perfect CI/CD pipelines, immutable infrastructure, and blue-green deployments for your application layer—but when it’s time to migrate a production database serving millions of requests per second, all that elegance disappears. One wrong move and you’re looking at hours of downtime, data loss, or worse: silent data corruption that surfaces weeks later.

I’ve led database migrations ranging from simple schema changes on modest PostgreSQL instances to multi-petabyte data platform migrations across cloud providers. Each one taught me something new about what works, what fails spectacularly, and what “zero-downtime” actually means in production.

The Real Cost of Downtime

Before diving into patterns, let’s talk about why this matters. In my experience working with e-commerce, fintech, and SaaS platforms, I’ve seen the real costs:

  • Revenue loss: For a mid-sized e-commerce platform doing $50M/year, every hour of downtime costs roughly $5,700 in lost sales
  • Customer trust: Users who encounter errors during checkout have a 60%+ abandonment rate and may never return
  • SLA penalties: Enterprise contracts often include financial penalties for downtime exceeding agreed thresholds
  • Team burnout: Emergency migrations that go sideways mean 3 AM war rooms and exhausted engineers

The stakes are high enough that “we’ll just do it during a maintenance window” isn’t acceptable for most modern systems.

The Expand-Contract Pattern: Your Foundation

The expand-contract pattern is the cornerstone of zero-downtime database migrations. I’ve used this pattern successfully on databases ranging from 100GB to 50TB+, and it works because it separates concerns into distinct, reversible phases.

Phase 1: Expand (Additive Changes)

First, you make purely additive changes to the database schema. No deletions, no modifications—only additions.

-- Example: Renaming a column from 'user_name' to 'username'
-- Phase 1: Add the new column
ALTER TABLE users ADD COLUMN username VARCHAR(255);

-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_username()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    NEW.username := COALESCE(NEW.username, NEW.user_name);
    NEW.user_name := COALESCE(NEW.user_name, NEW.username);
  ELSE
    -- On UPDATE, propagate whichever column the application actually changed
    IF NEW.user_name IS DISTINCT FROM OLD.user_name THEN
      NEW.username := NEW.user_name;
    ELSIF NEW.username IS DISTINCT FROM OLD.username THEN
      NEW.user_name := NEW.username;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_username_trigger
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_username();

-- Backfill existing data (in batches to avoid locks)
-- This runs async, doesn't block the migration
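
A minimal sketch of that batched backfill, assuming an integer id primary key and using psycopg2 (any driver or job runner works). Each batch commits on its own, so row locks stay short and the backfill never blocks normal traffic:

import time
import psycopg2

def backfill_username(dsn, batch_size=10000):
    """Copy user_name into username in small, separately committed batches."""
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT COALESCE(MAX(id), 0) FROM users")
            max_id = cur.fetchone()[0]

        last_id = 0
        while last_id < max_id:
            # Each `with conn` block commits its own transaction on exit
            with conn, conn.cursor() as cur:
                cur.execute(
                    """
                    UPDATE users
                    SET username = user_name
                    WHERE id > %s AND id <= %s AND username IS NULL
                    """,
                    (last_id, last_id + batch_size),
                )
            last_id += batch_size
            time.sleep(0.1)  # brief pause to avoid overwhelming the database
    finally:
        conn.close()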

At this point, your application still uses the old column (user_name), but the new column (username) exists and stays synchronized via triggers.

Phase 2: Migrate Application Code

Deploy application changes that write to both columns but read from the new one:

# Old code
user = User(user_name="john_doe")

# Transitional code (deploy this first)
user = User(
    user_name="john_doe",  # Still writing to old column
    username="john_doe"     # Also writing to new column
)

# Application reads from new column
print(user.username)  # Not user.user_name

This phase is critical: your application is now using the new schema, but the old schema still works. You can roll back at any time by reverting the application deployment.

Phase 3: Contract (Remove Old Schema)

Only after the new code has been stable in production for days or weeks do you remove the old column:

-- Drop the trigger first
DROP TRIGGER IF EXISTS sync_username_trigger ON users;
DROP FUNCTION IF EXISTS sync_username();

-- Remove the old column
ALTER TABLE users DROP COLUMN user_name;

This three-phase approach gives you multiple rollback points and ensures you never have a moment where the database schema and application code are incompatible.

Handling Large-Scale Data Migrations

The expand-contract pattern works great for schema changes, but what about migrating 10TB of data from one database to another? Here’s the playbook I’ve used for cloud-to-cloud migrations.

The Dual-Write Pattern

For massive data migrations (like moving from AWS RDS to Google Cloud SQL), I use a dual-write strategy with eventual consistency:

┌─────────────────────────────────────────────────────┐
│                  Application Layer                  │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │  Write Logic (Synchronous)                   │   │
│  │  1. Write to Primary DB (Source)             │   │
│  │  2. Write to Secondary DB (Target)           │   │
│  │  3. Queue event for async verification       │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │  Read Logic                                  │   │
│  │  - Phase 1: Read from Source                 │   │
│  │  - Phase 2: Read from Target (after cutover) │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
         │                                    │
         ▼                                    ▼
┌──────────────────┐              ┌──────────────────┐
│   Source DB      │              │   Target DB      │
│   (AWS RDS)      │              │  (GCP Cloud SQL) │
│                  │              │                  │
│  Historical Data │◄─────────────│  Async Backfill  │
│                  │  DMS/Debezium│  + Verification  │
└──────────────────┘              └──────────────────┘

Implementation Strategy

Step 1: Initial Bulk Copy (Days to Weeks)

Use AWS Database Migration Service (DMS) or Debezium for initial replication:

# DMS Task Configuration (simplified)
migration_task:
  source: 
    engine: postgres
    endpoint: source-db.us-east-1.rds.amazonaws.com
  target:
    engine: postgres
    endpoint: target-db.us-central1.sql.gcp.com
  
  replication_config:
    full_load: true
    cdc_enabled: true  # Capture ongoing changes
    
  table_mappings:
    - schema: public
      tables: ["users", "orders", "products"]
      
  performance:
    parallel_threads: 8
    batch_size: 10000
    max_full_load_tasks: 4

Step 2: Enable Dual Writes

Modify your application to write to both databases:

from datetime import datetime

class DualWriteUserRepository:
    def __init__(self, primary_db, secondary_db, retry_queue, metrics):
        self.primary = primary_db
        self.secondary = secondary_db
        self.retry_queue = retry_queue
        self.metrics = metrics

    async def create_user(self, user_data):
        # Write to primary (source) - this must succeed
        try:
            primary_result = await self.primary.create(user_data)
        except Exception:
            self.metrics.increment('dual_write.primary.error')
            raise

        # Write to secondary (target) - failures are logged but don't block
        try:
            await self.secondary.create(user_data)
            self.metrics.increment('dual_write.secondary.success')
        except Exception:
            self.metrics.increment('dual_write.secondary.error')
            # Queue for retry, but don't fail the request
            await self.queue_retry(user_data)

        return primary_result

    async def queue_retry(self, data):
        """Queue failed secondary writes for async retry"""
        await self.retry_queue.publish({
            'operation': 'create_user',
            'data': data,
            'timestamp': datetime.utcnow().isoformat()
        })
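
The retry consumer itself isn't shown above; a rough sketch, assuming a queue client that exposes an async consume() iterator alongside the publish() call used earlier:

async def retry_secondary_writes(retry_queue, secondary_db, metrics, max_attempts=5):
    """Replay failed secondary writes until they succeed or hit max_attempts."""
    async for message in retry_queue.consume():
        attempts = message.get('attempts', 0)
        try:
            if message['operation'] == 'create_user':
                await secondary_db.create(message['data'])
            metrics.increment('dual_write.retry.success')
        except Exception:
            metrics.increment('dual_write.retry.error')
            if attempts + 1 < max_attempts:
                # Re-queue with an incremented attempt counter
                await retry_queue.publish({**message, 'attempts': attempts + 1})
            else:
                # Stop retrying and surface for manual reconciliation
                metrics.increment('dual_write.retry.dead_letter')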

Step 3: Continuous Verification

This is where most teams fail. You need automated verification that both databases stay in sync:

# Verification worker (runs continuously)
class DataConsistencyVerifier:
    def __init__(self, source_db, target_db, metrics, alert):
        self.source_db = source_db
        self.target_db = target_db
        self.metrics = metrics
        self.alert = alert  # async callable that notifies on-call

    async def verify_batch(self, table, offset, limit):
        """Compare a batch of records between source and target"""
        
        # Fetch from both databases
        source_records = await self.source_db.query(
            f"SELECT * FROM {table} ORDER BY id LIMIT {limit} OFFSET {offset}"
        )
        target_records = await self.target_db.query(
            f"SELECT * FROM {table} ORDER BY id LIMIT {limit} OFFSET {offset}"
        )
        
        # Compare checksums for efficiency
        source_checksum = self._compute_checksum(source_records)
        target_checksum = self._compute_checksum(target_records)
        
        if source_checksum != target_checksum:
            # Detailed comparison to find discrepancies
            differences = self._find_differences(source_records, target_records)
            
            for diff in differences:
                await self.handle_inconsistency(diff)
                self.metrics.increment(f'verification.{table}.inconsistent')
        else:
            self.metrics.increment(f'verification.{table}.consistent')
    
    async def handle_inconsistency(self, diff):
        """Fix inconsistencies automatically or alert"""
        if diff['type'] == 'missing_in_target':
            # Auto-repair: copy from source to target
            await self.target_db.upsert(diff['record'])
        elif diff['type'] == 'data_mismatch':
            # Alert for manual review
            await self.alert(f"Data mismatch for ID {diff['id']}")
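
The checksum and diff helpers referenced above aren't shown; one possible shape for them, as a sketch that assumes each row is dict-like and keyed by id, matching the structure handle_inconsistency expects:

import hashlib
import json

# Helper methods on DataConsistencyVerifier (sketch)
def _compute_checksum(self, records):
    """Checksum over the batch; both queries order by id, so row order matches."""
    payload = json.dumps([dict(r) for r in records], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def _find_differences(self, source_records, target_records):
    """Return per-row discrepancies keyed by type, id, and record."""
    source_by_id = {r['id']: dict(r) for r in source_records}
    target_by_id = {r['id']: dict(r) for r in target_records}

    differences = []
    for record_id, record in source_by_id.items():
        if record_id not in target_by_id:
            differences.append(
                {'type': 'missing_in_target', 'id': record_id, 'record': record}
            )
        elif target_by_id[record_id] != record:
            differences.append(
                {'type': 'data_mismatch', 'id': record_id, 'record': record}
            )
    return differences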

Step 4: The Cutover

After weeks of dual writes with verified consistency:

# Feature flag based cutover
class UserRepository:
    def __init__(self, source_db, target_db, feature_flags):
        self.source = source_db
        self.target = target_db
        self.flags = feature_flags
        
    async def get_user(self, user_id):
        # Gradual rollout: 1% -> 10% -> 50% -> 100%
        if self.flags.is_enabled('read_from_target', user_id):
            return await self.target.get(user_id)
        else:
            return await self.source.get(user_id)

This allows you to test the target database with production traffic while maintaining the ability to instantly roll back by toggling a feature flag.
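
For reference, the percentage rollout can be as simple as hashing the user ID into a bucket. A sketch, not tied to any particular feature-flag service:

import hashlib

class FeatureFlags:
    def __init__(self, rollout_percentages):
        # e.g. {'read_from_target': 10} means a 10% rollout
        self.rollout_percentages = rollout_percentages

    def is_enabled(self, flag_name, user_id):
        percentage = self.rollout_percentages.get(flag_name, 0)
        # Hash so each user lands in a stable bucket across requests
        digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percentage

Because the bucket is derived from the user ID, the same users stay on the target database as the rollout percentage ramps up, which keeps their reads consistent during the cutover.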

Schema Migrations with Minimal Locking

One of the biggest challenges with PostgreSQL (and MySQL) is that many DDL operations acquire locks that block reads or writes. Here’s how I handle this:

Adding Indexes Without Blocking

-- Bad: Acquires lock, blocks writes during index creation
CREATE INDEX idx_users_email ON users(email);

-- Good: Creates index concurrently, doesn't block writes
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

-- Even better: make it idempotent so a failed run can simply be re-executed.
-- Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- so it can't be wrapped in a DO block; let your migration tooling retry it.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email);

-- A failed concurrent build leaves an INVALID index behind; drop it before retrying:
-- DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;

For large tables (billions of rows), even CREATE INDEX CONCURRENTLY can take hours. I use a staged approach:

-- Create partial indexes first (fast)
CREATE INDEX CONCURRENTLY idx_users_email_recent 
ON users(email) 
WHERE created_at > '2024-01-01';

-- Then create the full index during lower traffic
CREATE INDEX CONCURRENTLY idx_users_email_full ON users(email);

-- Drop the partial index after full index is ready
DROP INDEX CONCURRENTLY idx_users_email_recent;

Changing Column Types Safely

Changing column types is notoriously lock-heavy. Here’s my pattern:

-- Changing users.age from INTEGER to BIGINT

-- Step 1: Add new column
ALTER TABLE users ADD COLUMN age_new BIGINT;

-- Step 2: Backfill in batches, committing between batches so row locks
-- stay short (transaction control in a DO block requires PostgreSQL 11+
-- and must run outside an explicit transaction)
DO $$
DECLARE
    batch_size INT := 10000;
    last_id BIGINT := 0;
    max_id BIGINT;
BEGIN
    SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;

    WHILE last_id < max_id LOOP
        UPDATE users
        SET age_new = age
        WHERE id > last_id
          AND id <= last_id + batch_size
          AND age_new IS NULL;

        last_id := last_id + batch_size;
        COMMIT;

        -- Brief pause to avoid overwhelming the database
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;

-- Step 3: Add constraint to ensure new column is always populated
ALTER TABLE users ADD CONSTRAINT age_new_not_null 
CHECK (age_new IS NOT NULL) NOT VALID;

-- Validate the constraint (can be done during low traffic)
ALTER TABLE users VALIDATE CONSTRAINT age_new_not_null;

-- Step 4: Deploy application code that reads and writes age_new

-- Step 5: Swap the columns in one transaction (fast, brief ACCESS EXCLUSIVE lock),
-- then deploy application code that references age again
BEGIN;
ALTER TABLE users DROP COLUMN age;
ALTER TABLE users RENAME COLUMN age_new TO age;
COMMIT;

Automation and Safety Rails

Manual migrations are error-prone. Here’s my automation framework using Terraform and CI/CD:

# terraform/database-migration.tf

resource "null_resource" "database_migration" {
  triggers = {
    migration_version = var.migration_version
  }
  
  provisioner "local-exec" {
    command = <<-EOT
      # Pre-migration checks
      ./scripts/pre-migration-check.sh
      
      # Take snapshot (AWS RDS)
      aws rds create-db-snapshot \
        --db-instance-identifier ${var.db_instance} \
        --db-snapshot-identifier pre-migration-${var.migration_version}
      
      # Run migration with timeout and rollback on failure
      timeout 3600 ./scripts/run-migration.sh ${var.migration_version} || {
        echo "Migration failed, rolling back..."
        ./scripts/rollback-migration.sh ${var.migration_version}
        exit 1
      }
      
      # Post-migration verification
      ./scripts/verify-migration.sh
    EOT
  }
}

The pre-migration check script validates everything:

#!/bin/bash
# scripts/pre-migration-check.sh

set -euo pipefail

echo "Running pre-migration checks..."

# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1" > /dev/null || {
    echo "ERROR: Cannot connect to database"
    exit 1
}

# Check replication lag (only meaningful on a replica; COALESCE to 0 on a primary)
REPLICATION_LAG=$(psql "$DATABASE_URL" -t -A -c "
    SELECT COALESCE(EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))::INT, 0)
")

if [ "$REPLICATION_LAG" -gt 60 ]; then
    echo "ERROR: Replication lag is ${REPLICATION_LAG}s (max: 60s)"
    exit 1
fi

# Check disk space (need at least 20% free)
DISK_USAGE=$(psql "$DATABASE_URL" -t -c "
