Skip to content

fix: enforce single concurrent migration with atomic cross-row lock#88

Open
thehecktour wants to merge 2 commits into
PostHog:mainfrom
thehecktour:fix/enforce-single-concurrent-migration
Open

fix: enforce single concurrent migration with atomic cross-row lock#88
thehecktour wants to merge 2 commits into
PostHog:mainfrom
thehecktour:fix/enforce-single-concurrent-migration

Conversation

@thehecktour

Copy link
Copy Markdown

What this fixes

Two async migrations could be triggered simultaneously and both would reach Running state, executing DDL operations concurrently on the same ClickHouse cluster.

The root cause: MAX_CONCURRENT_ASYNC_MIGRATIONS = 1 was defined and documented, but never actually checked before starting a migration.


The bug

When a migration is triggered, the flow is:

POST /trigger → set status = Starting → enqueue Celery task
                                              ↓
                                    Celery worker picks it up
                                              ↓
                               start_async_migration(migration)
                                              ↓
                          mark_async_migration_as_running(migration)  ← bug is here

The guard inside mark_async_migration_as_running only checked the status of the migration being started — not whether any other migration was already running:

# Before — only protects against double-starting the same migration
def mark_async_migration_as_running(migration: AsyncMigration) -> bool:
    with transaction.atomic():
        if migration.status not in [MigrationStatus.Starting, MigrationStatus.NotStarted]:
            return False          # ← checks THIS migration's own row only
        migration.status = MigrationStatus.Running
        migration.save()
    return True

transaction.atomic() acquires a row-level lock on the migration's own row. If two different migrations are triggered at the same time, each worker locks a different row — they never block each other.

Failure sequence

t=0   POST /trigger/migration-A  → A.status = Starting, Celery task A queued
t=1   POST /trigger/migration-B  → B.status = Starting, Celery task B queued

t=2   Worker 1: reads A, A.status in [Starting] ✅
      Worker 2: reads B, B.status in [Starting] ✅   (simultaneously)

t=3   Worker 1: locks row A → A.status = Running ✅
      Worker 2: locks row B → B.status = Running ✅   (different rows, no conflict)

t=4   Worker 1: ALTER TABLE foo ADD COLUMN bar   ⚠️
      Worker 2: ALTER TABLE foo ADD COLUMN baz   ⚠️  (concurrent DDL on same cluster)

ClickHouse has no DDL transactions. Interleaved schema changes can leave tables in an inconsistent state that rollback cannot recover from.

Three signals in the codebase that confirm this is a bug 🔍

1. The constant is defined but never read:

# runner.py — defined once, never referenced anywhere else
MAX_CONCURRENT_ASYNC_MIGRATIONS = 1

2. The select_for_update() pattern is already used in the same file for other operations — the developers knew about concurrent access, just didn't apply it here:

# async_migration_utils.py — halt_starting_migration
instance = AsyncMigration.objects.select_for_update().get(pk=migration.pk)

# async_migration_utils.py — update_async_migration
instance = AsyncMigration.objects.select_for_update().get(pk=migration.pk)

3. The original PostHog implementation had precheck/healthcheck/force-stop mechanisms (visible as commented-out code throughout the file) — the concurrency guard was lost when HouseWatch simplified the fork.


The fix

# After — cross-row lock enforces the "one at a time" invariant atomically
def mark_async_migration_as_running(migration: AsyncMigration) -> bool:
    with transaction.atomic():
        # Lock all rows currently in Running or Starting state.
        # Any concurrent transaction attempting the same check will block here
        # until this transaction commits, closing the race window.
        already_running = (
            AsyncMigration.objects
            .select_for_update()
            .filter(status__in=[MigrationStatus.Running, MigrationStatus.Starting])
            .exclude(pk=migration.pk)
            .exists()
        )
        if already_running:
            logger.warning(
                "Refusing to start migration: another migration is already running or starting",
                migration=migration.name,
            )
            return False

        # Re-read this migration's own row under a lock to guard against
        # double-start of the same migration by two concurrent workers.
        instance = AsyncMigration.objects.select_for_update().get(pk=migration.pk)
        if instance.status not in [MigrationStatus.Starting, MigrationStatus.NotStarted]:
            return False

        instance.status = MigrationStatus.Running
        instance.current_query_id = ""
        instance.progress = 0
        instance.current_operation_index = 0
        instance.started_at = now()
        instance.finished_at = None
        instance.save()

    return True

Why this works

Both select_for_update() calls happen inside the same transaction.atomic() block. This means:

  • The filter locks every row in Running or Starting state
  • A concurrent transaction trying to do the same read will block until this one commits
  • By the time it unblocks, already_running will return True and it will exit early

The two locks together close both gaps: concurrent migrations blocking each other, and the same migration being double-started by two workers.


Files changed

File Change
housewatch/async_migrations/async_migration_utils.py Replace mark_async_migration_as_running with cross-row atomic check
housewatch/async_migrations/runner.py Update log message to reflect both possible failure reasons

⚠️ Note: select_for_update() requires PostgreSQL (or MySQL with InnoDB). SQLite — Django's default test database — uses database-level locking instead of row-level locking, so tests for this change should be run against PostgreSQL to validate locking semantics accurately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant