From 818a44409e391e10c26d2b4811308dd6534f051f Mon Sep 17 00:00:00 2001 From: Bryann Valderrama Date: Thu, 9 Apr 2026 13:26:35 -0500 Subject: [PATCH 1/6] docs: add course authoring automatic migration adr --- ...3-course-authoring-automatic-migration.rst | 233 ++++++++++++++++++ 1 file changed, 233 insertions(+) create mode 100644 docs/decisions/0013-course-authoring-automatic-migration.rst diff --git a/docs/decisions/0013-course-authoring-automatic-migration.rst b/docs/decisions/0013-course-authoring-automatic-migration.rst new file mode 100644 index 00000000..ed020ef9 --- /dev/null +++ b/docs/decisions/0013-course-authoring-automatic-migration.rst @@ -0,0 +1,233 @@ +0013: Course Authoring - Automatic Migration Triggered by Course Authoring Flag +############################################################################### + +Status +****** + +**Draft** - *2026-04-09* + +Context +******* + +The system is transitioning from the legacy permissions model (``CourseAccessRole``) +to the new openedx-authz system. + +Currently, migrations between both systems are performed manually using Django management commands: + +- ``authz_migrate_course_authoring`` (forward migration) +- ``authz_rollback_course_authoring`` (rollback migration) + +In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when +the feature flag ``authz.enable_course_authoring`` changes state, but they deferred the definition of +the specific mechanism. This ADR addresses that gap. + +The current manual approach presents the following risks: + +- **Inconsistency**: If an operator enables or disables the flag without running the migration + command, the permission data in both systems will diverge. +- **No status tracking**: There is no visibility into whether a migration is in progress, + completed, or failed. +- **No concurrency protection**: Multiple concurrent flag changes can trigger overlapping + migrations, leading to race conditions and data corruption. +- **No user feedback**: Operators have no way to know the result of a migration without + inspecting logs manually. + +Decision +******** + +We will implement an automatic and asynchronous migration mechanism triggered by changes in the +``authz.enable_course_authoring`` feature flag. The solution consists of: + +#. Django signal handler to detect flag state changes. +#. Celery tasks to execute migrations asynchronously. +#. A tracking model to record migration status and errors. +#. A locking mechanism to prevent concurrent migrations on the same scope. + +.. note:: + + **Scope Constraint** + + Automatic migration will only trigger for **course-level** and **organization-level** flag + overrides, not for global (instance-wide) Waffle flag changes. The reason is that a global + flag change could affect a large number of courses simultaneously, introducing an unacceptable + performance risk. Global flag changes must be handled via management commands by operators + who explicitly accept the performance implications. + +Operator Safety and Opt-in Design +================================== + +A concern was raised about the risks of triggering data migrations on a live instance. Data +migrations are typically executed under controlled conditions (e.g., during maintenance windows) +because any failure can leave the system in an invalid state. Triggering them automatically via +a feature flag toggle introduces additional risk: + +- Django Admin access is sometimes granted to instructors or non-technical staff who may not + understand the implications of toggling the flag. +- A live instance may be processing requests concurrently, increasing the chance of partial + failures or inconsistent transient states. + +To address this, the automatic migration mechanism will be **guarded by a Django setting**: + +.. code:: python + + ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION = False + +This setting: + +- Is **disabled by default**. +- Must be explicitly set to ``True`` by a site operator who understands the migration risks. +- Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal + detects the flag change but does **not** dispatch the Celery task. The operator must then + run the migration manually using the existing management commands. + +This design preserves the automated behavior for operators who opt in while keeping the system +safe for deployments where uncontrolled migrations are unacceptable. + +Detailed Design +=============== + +1. Utility Function Updates +--------------------------- + +The existing utility functions ``migrate_legacy_course_roles_to_authz`` and +``migrate_authz_to_legacy_course_roles`` will be modified to incorporate the locking strategy +(see **Concurrency Control** below) and the tracking logic (see **Migration Tracking Model** below) +as integral steps of their execution. + +This approach ensures that both the Celery task and the management commands go through the same +tracking and locking path. + +2. Migration Trigger (Django Signals) +------------------------------------- + +``pre_save`` signal handlers are attached to ``WaffleFlagCourseOverrideModel`` and +``WaffleFlagOrgOverrideModel``. When a save is detected for the ``authz.enable_course_authoring`` +flag, the handler: + +#. Compares the previous and new flag state to determine the transition direction: + + - ``False → True``: triggers a **forward migration** (Legacy → openedx-authz) + - ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy) + +#. Determines the scope (course or organization) from the model being saved. +#. Dispatches an asynchronous Celery task with the migration parameters. + +.. note:: + If no effective change is detected (i.e., the flag state is the same as the previous state), + the signal handler does nothing. + +3. Migration Tracking Model +--------------------------- + +A new model is introduced to track the lifecycle of each migration operation: + +.. code:: python + + class CourseAuthoringMigrationRun(models.Model): + migration_type = models.CharField(max_length=20) # forward / rollback + scope_type = models.CharField(max_length=20) # course / org + scope_key = models.CharField(max_length=255) + status = models.CharField(max_length=20) # pending, running, completed, skipped + created_at = models.DateTimeField(auto_now_add=True) + updated_at = models.DateTimeField(auto_now=True) + completed_at = models.DateTimeField(null=True, blank=True) + metadata = models.JSONField(default=dict) + +This model is registered in Django Admin so operators can inspect migration history and +diagnose failures without needing to access logs directly. + +4. Asynchronous Execution +------------------------- + +The Celery task acts strictly as a thin dispatcher. All core logic, including locking, +tracking, and migration execution, is implemented in the utility functions (see +**Utility Function Updates** above). + +All database operations within the migration itself execute inside an atomic transaction. +If the migration fails, no data is deleted from either system, preserving consistency. + +5. Concurrency Control (Locking Strategy) +----------------------------------------- + +To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a +distributed lock is implemented using the Django cache backend (Redis): + +.. code:: python + + lock_key = f"authz_migration:{scope_type}:{scope_key}" + +The lock is acquired using ``cache.add()``, which is an atomic operation. The default TTL +is **1 hour**. If a lock already exists for the given scope, the migration is skipped +and a new tracking record is created with that status. This ensures that only one +migration runs at a time for the same scope. + +6. Execution Flow +------------------ + +1. An operator changes the ``authz.enable_course_authoring`` flag for a course or + organization via Django Admin or a management command. +2. The ``pre_save`` signal handler detects the state transition. +3. A Celery task is dispatched asynchronously. +4. The task calls the utility function, which acquires the lock, creates and updates the + ``CourseAuthoringMigrationRun`` record, and executes the migration. +5. The operator can check the migration status via Django Admin on the ``CourseAuthoringMigrationRun`` + model. + +Consequences +************ + +Positive consequences +===================== + +- **Migration is decoupled from the request cycle**: the flag change returns immediately and + migration happens in the background. +- **Full observability**: every migration run is recorded with its status, scope, and metadata + in the tracking model. +- **Concurrency-safe**: the lock strategy prevents overlapping migrations on the same scope. +- **No manual intervention required**: for course-level or organization-level flag changes. Operators + who have opted in do not need to remember to run management commands. +- **Safe by default**: the opt-in guard flag ensures that automatic migration is never triggered + unexpectedly on instances where operators have not explicitly accepted the risks. + +Negative consequences / risks +============================== + +- **Global flag changes are not covered**: operators must still run management commands + manually when enabling or disabling the flag at the instance level. This is a deliberate + trade-off to avoid performance risks. +- **Celery dependency**: the system now requires a functioning Celery worker for automatic + migration. If workers are down, migrations will be queued but not executed until workers + recover. +- **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible + for very large organizations), the lock will expire and a new migration for the same scope + could start concurrently for the same scope. + +Rejected Alternatives +********************* + +**Synchronous execution in the signal handler** + Executing the migration directly inside the ``pre_save`` signal would block the HTTP + request that triggered the flag change, leading to timeouts for large scopes and poor + operator experience. + +**Manual migration** + Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual + migration allows the system to end up in inconsistent states (e.g., flag enabled but data + still in the legacy system), resulting in an operationally fragile design. + +**Automatic global migration** + Triggering automatic migration when the flag is changed globally (instance-wide) would + risk performance degradation on large instances. This was explicitly ruled out: global + migrations must remain operator-initiated via management commands. + +References +********** + +* `Automatic Migration Spike`_ +* `ADR 0010`_ +* `ADR 0011`_ + +.. _Automatic Migration Spike: + https://openedx.atlassian.net/wiki/spaces/OEPM/pages/6205112321/Spike+-+RBAC+AuthZ+-+Automatic+Role+Migration +.. _ADR 0010: 0010-course-authoring-flag.rst +.. _ADR 0011: 0011-course-authoring-migration-process.rst From 944808e7ea100c03859f67e1645297d702f80f50 Mon Sep 17 00:00:00 2001 From: Bryann Valderrama Date: Fri, 10 Apr 2026 09:52:59 -0500 Subject: [PATCH 2/6] chore: address pr review - Rename Django setting - Rename tracking model --- .../0013-course-authoring-automatic-migration.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/decisions/0013-course-authoring-automatic-migration.rst b/docs/decisions/0013-course-authoring-automatic-migration.rst index ed020ef9..38764af0 100644 --- a/docs/decisions/0013-course-authoring-automatic-migration.rst +++ b/docs/decisions/0013-course-authoring-automatic-migration.rst @@ -18,8 +18,8 @@ Currently, migrations between both systems are performed manually using Django m - ``authz_rollback_course_authoring`` (rollback migration) In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when -the feature flag ``authz.enable_course_authoring`` changes state, but they deferred the definition of -the specific mechanism. This ADR addresses that gap. +the feature flag ``authz.enable_course_authoring`` changes state, but the definition of +the specific mechanism was deferred. This ADR addresses that gap. The current manual approach presents the following risks: @@ -70,7 +70,7 @@ To address this, the automatic migration mechanism will be **guarded by a Django .. code:: python - ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION = False + ENABLE_AUTOMATIC_AUTHZ_COURSE_AUTHORING_MIGRATION = False This setting: @@ -123,7 +123,7 @@ A new model is introduced to track the lifecycle of each migration operation: .. code:: python - class CourseAuthoringMigrationRun(models.Model): + class AuthzCourseAuthoringMigrationRun(models.Model): migration_type = models.CharField(max_length=20) # forward / rollback scope_type = models.CharField(max_length=20) # course / org scope_key = models.CharField(max_length=255) @@ -169,8 +169,8 @@ migration runs at a time for the same scope. 2. The ``pre_save`` signal handler detects the state transition. 3. A Celery task is dispatched asynchronously. 4. The task calls the utility function, which acquires the lock, creates and updates the - ``CourseAuthoringMigrationRun`` record, and executes the migration. -5. The operator can check the migration status via Django Admin on the ``CourseAuthoringMigrationRun`` + ``AuthzCourseAuthoringMigrationRun`` record, and executes the migration. +5. The operator can check the migration status via Django Admin on the ``AuthzCourseAuthoringMigrationRun`` model. Consequences From eea181e1dc701dda5cbfbdde068e4ea02160ddd0 Mon Sep 17 00:00:00 2001 From: Bryann Valderrama Date: Fri, 10 Apr 2026 11:52:16 -0500 Subject: [PATCH 3/6] docs: clarify concurrency protection risks --- docs/decisions/0013-course-authoring-automatic-migration.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/decisions/0013-course-authoring-automatic-migration.rst b/docs/decisions/0013-course-authoring-automatic-migration.rst index 38764af0..fab6bd9b 100644 --- a/docs/decisions/0013-course-authoring-automatic-migration.rst +++ b/docs/decisions/0013-course-authoring-automatic-migration.rst @@ -27,8 +27,8 @@ The current manual approach presents the following risks: command, the permission data in both systems will diverge. - **No status tracking**: There is no visibility into whether a migration is in progress, completed, or failed. -- **No concurrency protection**: Multiple concurrent flag changes can trigger overlapping - migrations, leading to race conditions and data corruption. +- **No concurrency protection**: Nothing prevents operators from running the migration command + multiple times simultaneously, which can lead to race conditions and data corruption. - **No user feedback**: Operators have no way to know the result of a migration without inspecting logs manually. From c55800c38e973eb44b10cdc96888b84900d44265 Mon Sep 17 00:00:00 2001 From: Bryann Valderrama Date: Mon, 13 Apr 2026 16:05:11 -0500 Subject: [PATCH 4/6] chore: address PR reviews --- ...3-course-authoring-automatic-migration.rst | 233 +++++++++++------- 1 file changed, 142 insertions(+), 91 deletions(-) diff --git a/docs/decisions/0013-course-authoring-automatic-migration.rst b/docs/decisions/0013-course-authoring-automatic-migration.rst index fab6bd9b..e97486a6 100644 --- a/docs/decisions/0013-course-authoring-automatic-migration.rst +++ b/docs/decisions/0013-course-authoring-automatic-migration.rst @@ -4,7 +4,7 @@ Status ****** -**Draft** - *2026-04-09* +**Draft** - *2026-04-13* Context ******* @@ -12,36 +12,41 @@ Context The system is transitioning from the legacy permissions model (``CourseAccessRole``) to the new openedx-authz system. -Currently, migrations between both systems are performed manually using Django management commands: +Currently, migrations between the two systems are performed manually using Django +management commands: - ``authz_migrate_course_authoring`` (forward migration) - ``authz_rollback_course_authoring`` (rollback migration) -In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when +In `ADR 0010`_ and `ADR 0011`_ it was established that migrations must occur automatically when the feature flag ``authz.enable_course_authoring`` changes state, but the definition of the specific mechanism was deferred. This ADR addresses that gap. -The current manual approach presents the following risks: - -- **Inconsistency**: If an operator enables or disables the flag without running the migration - command, the permission data in both systems will diverge. -- **No status tracking**: There is no visibility into whether a migration is in progress, - completed, or failed. +The current manual approach has the following problems: + +- **Access disparity**: Many users have access to Django Admin and can toggle the flag, while + significantly fewer have permission to run management commands. This creates an operational + gap where the flag state can change independently of the migration process. As a result, + coordination is required between different roles (those managing flags vs. those executing + migrations), increasing the risk of delays, misalignment, and inconsistent system state. +- **Outage window**: When a flag change and the corresponding migration command are not executed + atomically, there is a period where the flag points to one system but the permission data + still lives in the other. Any permission check made during this window will fail, causing + real outages for affected courses or organizations. +- **No user feedback**: Users have no way to know the result of a migration without + inspecting logs manually. - **No concurrency protection**: Nothing prevents operators from running the migration command multiple times simultaneously, which can lead to race conditions and data corruption. -- **No user feedback**: Operators have no way to know the result of a migration without - inspecting logs manually. Decision ******** -We will implement an automatic and asynchronous migration mechanism triggered by changes in the +We will implement an automatic and synchronous migration mechanism triggered by changes in the ``authz.enable_course_authoring`` feature flag. The solution consists of: -#. Django signal handler to detect flag state changes. -#. Celery tasks to execute migrations asynchronously. +#. A ``post_save`` signal handler that detects flag changes and executes the migration. #. A tracking model to record migration status and errors. -#. A locking mechanism to prevent concurrent migrations on the same scope. +#. A database-level constraint to prevent concurrent migrations on the same scope. .. note:: @@ -54,7 +59,7 @@ We will implement an automatic and asynchronous migration mechanism triggered by who explicitly accept the performance implications. Operator Safety and Opt-in Design -================================== +================================= A concern was raised about the risks of triggering data migrations on a live instance. Data migrations are typically executed under controlled conditions (e.g., during maintenance windows) @@ -77,46 +82,53 @@ This setting: - Is **disabled by default**. - Must be explicitly set to ``True`` by a site operator who understands the migration risks. - Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal - detects the flag change but does **not** dispatch the Celery task. The operator must then + detects the flag change but does not execute the migration. The operator must then run the migration manually using the existing management commands. -This design preserves the automated behavior for operators who opt in while keeping the system -safe for deployments where uncontrolled migrations are unacceptable. - Detailed Design =============== -1. Utility Function Updates ---------------------------- +1. Migration Trigger (Django Signals) +------------------------------------- -The existing utility functions ``migrate_legacy_course_roles_to_authz`` and -``migrate_authz_to_legacy_course_roles`` will be modified to incorporate the locking strategy -(see **Concurrency Control** below) and the tracking logic (see **Migration Tracking Model** below) -as integral steps of their execution. +A ``post_save`` handler is attached to ``WaffleFlagCourseOverrideModel`` and +``WaffleFlagOrgOverrideModel`` for the ``authz.enable_course_authoring`` flag. -This approach ensures that both the Celery task and the management commands go through the same -tracking and locking path. +The handler fires after the record is committed to the database, so the new flag value is +the authoritative and durable state of the system when the migration begins. -2. Migration Trigger (Django Signals) -------------------------------------- +Retrieving the previous state from the same model +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Both ``WaffleFlagCourseOverrideModel`` and ``WaffleFlagOrgOverrideModel`` extend +``ConfigurationModel``, which **creates a new row on every save** instead of updating the +existing record. This means the full change history for each scope is preserved in the +table. The previous override value is therefore always available as the most recent record +for the same scope that is not the one just saved. + +If no previous record exists for the scope (this is the first override ever created for +it), the migration runs unconditionally based on the current ``enabled`` value, without +comparing against a previous state. + +``post_save`` execution +~~~~~~~~~~~~~~~~~~~~~~~ -``pre_save`` signal handlers are attached to ``WaffleFlagCourseOverrideModel`` and -``WaffleFlagOrgOverrideModel``. When a save is detected for the ``authz.enable_course_authoring`` -flag, the handler: +The ``post_save`` handler: -#. Compares the previous and new flag state to determine the transition direction: +#. Queries the same flag override model for the previous record as described above. +#. If no previous record exists, runs the migration based on the current ``enabled`` value + without further comparison. +#. If a previous record exists, compares its ``enabled`` value with the saved one to + determine whether an effective transition occurred: - ``False → True``: triggers a **forward migration** (Legacy → openedx-authz) - ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy) + - No change: the handler does nothing. No tracking record is created and no migration runs. #. Determines the scope (course or organization) from the model being saved. -#. Dispatches an asynchronous Celery task with the migration parameters. +#. Calls the utility function synchronously with the migration parameters. -.. note:: - If no effective change is detected (i.e., the flag state is the same as the previous state), - the signal handler does nothing. - -3. Migration Tracking Model +2. Migration Tracking Model --------------------------- A new model is introduced to track the lifecycle of each migration operation: @@ -127,51 +139,76 @@ A new model is introduced to track the lifecycle of each migration operation: migration_type = models.CharField(max_length=20) # forward / rollback scope_type = models.CharField(max_length=20) # course / org scope_key = models.CharField(max_length=255) - status = models.CharField(max_length=20) # pending, running, completed, skipped + status = models.CharField(max_length=20) # running, completed, partial_success, failed, skipped created_at = models.DateTimeField(auto_now_add=True) updated_at = models.DateTimeField(auto_now=True) completed_at = models.DateTimeField(null=True, blank=True) metadata = models.JSONField(default=dict) -This model is registered in Django Admin so operators can inspect migration history and +This model is registered in Django Admin so users can inspect migration history and diagnose failures without needing to access logs directly. -4. Asynchronous Execution -------------------------- - -The Celery task acts strictly as a thin dispatcher. All core logic, including locking, -tracking, and migration execution, is implemented in the utility functions (see -**Utility Function Updates** above). - -All database operations within the migration itself execute inside an atomic transaction. -If the migration fails, no data is deleted from either system, preserving consistency. - -5. Concurrency Control (Locking Strategy) ------------------------------------------ - -To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a -distributed lock is implemented using the Django cache backend (Redis): +A higher-level orchestration layer (separate from the existing utility functions) will be +responsible for creating and updating ``AuthzCourseAuthoringMigrationRun`` records. This +layer wraps the core migration logic, ensuring that lifecycle tracking (opening a +``running`` record, handling errors, and writing the final status) is applied consistently +regardless of whether the migration is triggered by the signal handler or a management +command. + +Migration Outcome Semantics +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``status`` field reflects the precise outcome of each run. The possible values are: + +- ``running``: the migration is actively executing. +- ``completed``: all records were migrated successfully. +- ``partial_success``: the migration process ran to completion, but one or more individual + records failed and were skipped. The ``metadata`` field contains details about the + failures. +- ``failed``: a critical error prevented the migration from completing (e.g., an unhandled + exception or infrastructure problem). The ``metadata`` field contains the exception details. +- ``skipped``: the migration was not attempted because another run for the same scope was + already active. + +3. Concurrency Control +---------------------- + +To prevent overlapping migrations on the same scope, the tracking model enforces a +conditional ``UniqueConstraint`` on ``(scope_type, scope_key)`` filtered to +``status="running"``. This guarantees that no second active migration record can be +inserted for the same scope regardless of how many processes attempt to do so concurrently. +Any attempt raises an ``IntegrityError``, which the caller handles by recording a +``skipped`` run and aborting. .. code:: python - lock_key = f"authz_migration:{scope_type}:{scope_key}" - -The lock is acquired using ``cache.add()``, which is an atomic operation. The default TTL -is **1 hour**. If a lock already exists for the given scope, the migration is skipped -and a new tracking record is created with that status. This ensures that only one -migration runs at a time for the same scope. - -6. Execution Flow ------------------- - -1. An operator changes the ``authz.enable_course_authoring`` flag for a course or - organization via Django Admin or a management command. -2. The ``pre_save`` signal handler detects the state transition. -3. A Celery task is dispatched asynchronously. -4. The task calls the utility function, which acquires the lock, creates and updates the - ``AuthzCourseAuthoringMigrationRun`` record, and executes the migration. -5. The operator can check the migration status via Django Admin on the ``AuthzCourseAuthoringMigrationRun`` - model. + class Meta: + constraints = [ + models.UniqueConstraint( + fields=["scope_type", "scope_key"], + condition=models.Q(status="running"), + name="unique_active_migration_per_scope", + ) + ] + +4. Execution Flow +----------------- + +1. The user changes the ``authz.enable_course_authoring`` flag for a course or + organization and saves the record. A new row is created in the override table. +2. The ``post_save`` handler queries the same override model for the previous record + (most recent row for the same scope, excluding the one just saved) to obtain the + previous ``enabled`` value. +3. The handler compares the previous value with the current ``enabled`` value. If no + effective change occurred, it does nothing. +4. If a transition is detected, the handler calls the utility function synchronously. The + function creates an ``AuthzCourseAuthoringMigrationRun`` record with + ``status="running"`` (the database constraint prevents this if another run for the + same scope is already active) and executes the migration. +5. The record is updated to its final status (``completed``, ``partial_success``, ``failed``, + or ``skipped``) before the ``post_save`` handler returns. +6. The user can review the migration outcome via Django Admin on the + ``AuthzCourseAuthoringMigrationRun`` model. Consequences ************ @@ -179,36 +216,50 @@ Consequences Positive consequences ===================== -- **Migration is decoupled from the request cycle**: the flag change returns immediately and - migration happens in the background. - **Full observability**: every migration run is recorded with its status, scope, and metadata in the tracking model. -- **Concurrency-safe**: the lock strategy prevents overlapping migrations on the same scope. -- **No manual intervention required**: for course-level or organization-level flag changes. Operators - who have opted in do not need to remember to run management commands. +- **Concurrency-safe**: the database-level constraint prevents overlapping migrations on the same + scope, regardless of cache availability or worker failures. +- **No manual intervention required** for course-level or organization-level flag changes. Operators + or users who have opted in do not need to remember to run management commands. - **Safe by default**: the opt-in guard flag ensures that automatic migration is never triggered unexpectedly on instances where operators have not explicitly accepted the risks. Negative consequences / risks -============================== +============================= - **Global flag changes are not covered**: operators must still run management commands manually when enabling or disabling the flag at the instance level. This is a deliberate trade-off to avoid performance risks. -- **Celery dependency**: the system now requires a functioning Celery worker for automatic - migration. If workers are down, migrations will be queued but not executed until workers - recover. -- **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible - for very large organizations), the lock will expire and a new migration for the same scope - could start concurrently for the same scope. +- **Blocks the request**: the migration runs synchronously inside the ``post_save`` signal, + so the HTTP request that triggered the flag change does not return until the migration + finishes. For large organization-level scopes this can cause noticeable latency or + timeouts. This is an accepted trade-off given that automatic migration is scoped to + course-level and organization-level overrides only (never global), and is opt-in. +- **Runtime execution trade-offs**: Unlike management commands typically executed during + maintenance windows, this migration runs in a live production environment as part of + normal system operation. This means it executes under concurrent load, with active + requests and database activity, which introduces variability in execution conditions. + This trade-off is inherent to enabling the feature flag to act as a real-time source + of truth. The design prioritizes consistency between flag state and permission data + over strictly controlled execution environments, while providing observability and + recovery mechanisms to mitigate operational risk. Rejected Alternatives ********************* -**Synchronous execution in the signal handler** - Executing the migration directly inside the ``pre_save`` signal would block the HTTP - request that triggered the flag change, leading to timeouts for large scopes and poor - operator experience. +**Using pre_save to trigger the migration** + A ``pre_save`` handler could detect the transition direction and execute the migration + before the flag change is written. This approach violates ACID principles: at the moment + ``pre_save`` fires, the new flag value has not yet been committed to the database. If the + subsequent ``save()`` were to fail (e.g., a validation error, a database constraint + violation, or a network issue), the migration would have already run against a state that + was never persisted, leaving the permission data inconsistent with the actual flag value. + +**Asynchronous execution via Celery** + Given that automatic migration is scoped to course-level and organization-level + overrides where migration volumes are bounded, synchronous execution is simpler + and provides stronger consistency guarantees. **Manual migration** Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual From 72c52c701a5042531cd0a2b19fc48746a0d80748 Mon Sep 17 00:00:00 2001 From: Bryann Valderrama Date: Tue, 14 Apr 2026 13:48:50 -0500 Subject: [PATCH 5/6] docs: refine rationale for rejecting pre_save in migration logic --- .../0013-course-authoring-automatic-migration.rst | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/docs/decisions/0013-course-authoring-automatic-migration.rst b/docs/decisions/0013-course-authoring-automatic-migration.rst index e97486a6..967a92fc 100644 --- a/docs/decisions/0013-course-authoring-automatic-migration.rst +++ b/docs/decisions/0013-course-authoring-automatic-migration.rst @@ -249,12 +249,10 @@ Rejected Alternatives ********************* **Using pre_save to trigger the migration** - A ``pre_save`` handler could detect the transition direction and execute the migration - before the flag change is written. This approach violates ACID principles: at the moment - ``pre_save`` fires, the new flag value has not yet been committed to the database. If the - subsequent ``save()`` were to fail (e.g., a validation error, a database constraint - violation, or a network issue), the migration would have already run against a state that - was never persisted, leaving the permission data inconsistent with the actual flag value. + The use of pre_save signals was discarded because they depend on a state transition + that has not yet been committed to the database. Operating before persistence assumes + a future-valid state that may not materialize. post_save was preferred to ensure + migration logic operates only on confirmed states. **Asynchronous execution via Celery** Given that automatic migration is scoped to course-level and organization-level From 7bd91945c01cc767650e1557d595b30ed3019e4c Mon Sep 17 00:00:00 2001 From: Bryann Valderrama Date: Wed, 15 Apr 2026 07:13:19 -0500 Subject: [PATCH 6/6] docs: enhance migration outcome semantics with metadata details --- docs/decisions/0013-course-authoring-automatic-migration.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/decisions/0013-course-authoring-automatic-migration.rst b/docs/decisions/0013-course-authoring-automatic-migration.rst index 967a92fc..038d688e 100644 --- a/docs/decisions/0013-course-authoring-automatic-migration.rst +++ b/docs/decisions/0013-course-authoring-automatic-migration.rst @@ -161,10 +161,11 @@ Migration Outcome Semantics The ``status`` field reflects the precise outcome of each run. The possible values are: - ``running``: the migration is actively executing. -- ``completed``: all records were migrated successfully. +- ``completed``: all records were migrated successfully. The ``metadata`` field contains the + details about the successful migrations. - ``partial_success``: the migration process ran to completion, but one or more individual records failed and were skipped. The ``metadata`` field contains details about the - failures. + failures and successfully migrated records. - ``failed``: a critical error prevented the migration from completing (e.g., an unhandled exception or infrastructure problem). The ``metadata`` field contains the exception details. - ``skipped``: the migration was not attempted because another run for the same scope was