fix: enforce single concurrent migration with atomic cross-row lock #88
Open
thehecktour wants to merge 2 commits into
What this fixes
Two async migrations could be triggered simultaneously and both would reach Running state, executing DDL operations concurrently on the same ClickHouse cluster.
The root cause: MAX_CONCURRENT_ASYNC_MIGRATIONS = 1 was defined and documented, but never actually checked before starting a migration.
The bug
When a migration is triggered, the flow is:
The guard inside mark_async_migration_as_running only checked the status of the migration being started, not whether any other migration was already running.
The transaction.atomic() block only holds a row-level lock on the migration's own row. If two different migrations are triggered at the same time, each worker locks a different row, so they never block each other.
Failure sequence
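The race can be illustrated with a small in-memory simulation. This is only a sketch, not the HouseWatch code: the `statuses` dict stands in for the migrations table, each `threading.Lock` stands in for a row-level lock, and the names `mark_as_running_per_row`, `migration_a` and `migration_b` are hypothetical. The barrier forces both workers to hold their own row lock at the same moment, which is only possible because the locks are on different rows.

```python
import threading

# Hypothetical stand-in for the migrations table: migration name -> status.
statuses = {"migration_a": "Starting", "migration_b": "Starting"}

# One lock per row, mimicking a lock on the migration's own row only.
row_locks = {name: threading.Lock() for name in statuses}

# Both workers must arrive here while still holding their own row lock.
both_inside = threading.Barrier(2)


def mark_as_running_per_row(name: str) -> bool:
    """Guard that inspects only the row being started (assumed old behaviour)."""
    with row_locks[name]:
        # Would raise BrokenBarrierError after 5 s if the workers blocked each other.
        both_inside.wait(timeout=5)
        if statuses[name] != "Starting":
            return False
        statuses[name] = "Running"
        return True


def run_workers() -> dict:
    """Start one worker thread per migration and collect each guard's result."""
    results = {}

    def worker(name: str) -> None:
        results[name] = mark_as_running_per_row(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in statuses]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_workers()
print(results)
print(statuses)
```

Both calls succeed and both rows end up in Running state, which is the situation the constant was meant to rule out.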
ClickHouse has no DDL transactions. Interleaved schema changes can leave tables in an inconsistent state that rollback cannot recover from.
Three signals in the codebase that confirm this is a bug 🔍
1. The constant is defined but never read:
2. The select_for_update() pattern is already used in the same file for other operations. The developers knew about concurrent access but did not apply it here.
3. The original PostHog implementation had precheck/healthcheck/force-stop mechanisms (visible as commented-out code throughout the file). The concurrency guard was lost when HouseWatch simplified the fork.
The fix
Why this works
Both select_for_update() calls happen inside the same transaction.atomic() block. This means:
Running or Starting state
already_running will return True and it will exit early
The two locks together close both gaps: concurrent migrations blocking each other, and the same migration being double-started by two workers.
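The effect of checking other rows and the migration's own row under one lock can be sketched in memory as follows. This is an illustration under stated assumptions, not the code in this PR: a single `threading.Lock` stands in for the transaction that holds both row locks (coarser than real `select_for_update()` row locks), the `NotStarted` initial status is assumed, and `mark_as_running` and `start_concurrently` are hypothetical helper names.

```python
import threading

RUNNING, STARTING, NOT_STARTED = "Running", "Starting", "NotStarted"

# A single lock stands in for the transaction holding both row locks.
table_lock = threading.Lock()


def mark_as_running(statuses: dict, name: str) -> bool:
    """Cross-row check and own-row check performed under the same lock."""
    with table_lock:
        # Gap 1: is any *other* migration already Running or Starting?
        already_running = any(
            status in (RUNNING, STARTING)
            for other, status in statuses.items()
            if other != name
        )
        if already_running:
            return False
        # Gap 2: has this same migration already been started by another worker?
        if statuses[name] not in (NOT_STARTED, STARTING):
            return False
        statuses[name] = RUNNING
        return True


def start_concurrently(statuses: dict, names: list) -> list:
    """Run one worker thread per requested start and collect (name, ok) pairs."""
    results = []

    def worker(name: str) -> None:
        results.append((name, mark_as_running(statuses, name)))

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Two different migrations triggered at once: exactly one may start.
table = {"migration_a": NOT_STARTED, "migration_b": NOT_STARTED}
different = start_concurrently(table, ["migration_a", "migration_b"])

# The same migration triggered by two workers: exactly one may start it.
single = {"migration_a": NOT_STARTED}
same = start_concurrently(single, ["migration_a", "migration_a"])

print(different, table)
print(same, single)
```

Which worker wins depends on thread scheduling, but in both scenarios only one call returns True.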
Files changed
housewatch/async_migrations/async_migration_utils.py: mark_async_migration_as_running with cross-row atomic check
housewatch/async_migrations/runner.py