| 1 | +0013: Course Authoring - Automatic Migration Triggered by Course Authoring Flag
| 2 | +###############################################################################
| 3 | + |
| 4 | +Status |
| 5 | +****** |
| 6 | + |
| 7 | +**Draft** - *2026-04-09* |
| 8 | + |
| 9 | +Context |
| 10 | +******* |
| 11 | + |
| 12 | +The system is transitioning from the legacy permissions model (``CourseAccessRole``) |
| 13 | +to the new openedx-authz system. |
| 14 | + |
| 15 | +Currently, migrations between both systems are performed manually using Django management commands: |
| 16 | + |
| 17 | +- ``authz_migrate_course_authoring`` (forward migration) |
| 18 | +- ``authz_rollback_course_authoring`` (rollback migration) |
| 19 | + |
| 20 | +In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when |
| 21 | +the feature flag ``authz.enable_course_authoring`` changes state, but they deferred the definition of |
| 22 | +the specific mechanism. This ADR addresses that gap. |
| 23 | + |
| 24 | +The current manual approach presents the following risks: |
| 25 | + |
| 26 | +- **Inconsistency**: If an operator enables or disables the flag without running the migration |
| 27 | + command, the permission data in both systems will diverge. |
| 28 | +- **No status tracking**: There is no visibility into whether a migration is in progress, |
| 29 | + completed, or failed. |
| 30 | +- **No concurrency protection**: Multiple concurrent flag changes can trigger overlapping |
| 31 | + migrations, leading to race conditions and data corruption. |
| 32 | +- **No user feedback**: Operators have no way to know the result of a migration without |
| 33 | + inspecting logs manually. |
| 34 | + |
| 35 | +Decision |
| 36 | +******** |
| 37 | + |
| 38 | +We will implement an automatic and asynchronous migration mechanism triggered by changes in the |
| 39 | +``authz.enable_course_authoring`` feature flag. The solution consists of: |
| 40 | + |
| 41 | +#. Django signal handlers to detect flag state changes.
| 42 | +#. Celery tasks to execute migrations asynchronously. |
| 43 | +#. A tracking model to record migration status and errors. |
| 44 | +#. A locking mechanism to prevent concurrent migrations on the same scope. |
| 45 | + |
| 46 | +.. note:: |
| 47 | + |
| 48 | + **Scope Constraint** |
| 49 | + |
| 50 | + Automatic migration will only trigger for **course-level** and **organization-level** flag |
| 51 | + overrides, not for global (instance-wide) Waffle flag changes. The reason is that a global |
| 52 | + flag change could affect a large number of courses simultaneously, introducing an unacceptable |
| 53 | + performance risk. Global flag changes must be handled via management commands by operators |
| 54 | + who explicitly accept the performance implications. |
| 55 | + |
| 56 | +Operator Safety and Opt-in Design |
| 57 | +================================== |
| 58 | + |
| 59 | +A concern was raised about the risks of triggering data migrations on a live instance. Data |
| 60 | +migrations are typically executed under controlled conditions (e.g., during maintenance windows) |
| 61 | +because any failure can leave the system in an invalid state. Triggering them automatically via |
| 62 | +a feature flag toggle introduces additional risk: |
| 63 | + |
| 64 | +- Django Admin access is sometimes granted to instructors or non-technical staff who may not |
| 65 | + understand the implications of toggling the flag. |
| 66 | +- A live instance may be processing requests concurrently, increasing the chance of partial |
| 67 | + failures or inconsistent transient states. |
| 68 | + |
| 69 | +To address this, the automatic migration mechanism will be **guarded by a Django setting**: |
| 70 | + |
| 71 | +.. code:: python |
| 72 | +
| 73 | +    ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION = False
| 74 | +
| 75 | +This setting: |
| 76 | + |
| 77 | +- Is **disabled by default**. |
| 78 | +- Must be explicitly set to ``True`` by a site operator who understands the migration risks. |
| 79 | +- Acts as a prerequisite check inside the signal handler: if it is not enabled, the handler
| 80 | + detects the flag change but does **not** dispatch the Celery task. The operator must then
| 81 | + run the migration manually using the existing management commands.
| 82 | + |
| 83 | +This design preserves the automated behavior for operators who opt in while keeping the system |
| 84 | +safe for deployments where uncontrolled migrations are unacceptable. |
| 85 | + |
| 86 | +Detailed Design |
| 87 | +=============== |
| 88 | + |
| 89 | +1. Utility Function Updates |
| 90 | +--------------------------- |
| 91 | + |
| 92 | +The existing utility functions ``migrate_legacy_course_roles_to_authz`` and |
| 93 | +``migrate_authz_to_legacy_course_roles`` will be modified to incorporate the locking strategy |
| 94 | +(see **Concurrency Control** below) and the tracking logic (see **Migration Tracking Model** below) |
| 95 | +as integral steps of their execution. |
| 96 | + |
| 97 | +This approach ensures that both the Celery task and the management commands go through the same |
| 98 | +tracking and locking path. |
| 99 | + |
| 100 | +2. Migration Trigger (Django Signals) |
| 101 | +------------------------------------- |
| 102 | + |
| 103 | +``pre_save`` signal handlers are attached to ``WaffleFlagCourseOverrideModel`` and |
| 104 | +``WaffleFlagOrgOverrideModel``. When a save is detected for the ``authz.enable_course_authoring`` |
| 105 | +flag, the handler: |
| 106 | + |
| 107 | +#. Compares the previous and new flag state to determine the transition direction: |
| 108 | + |
| 109 | + - ``False → True``: triggers a **forward migration** (Legacy → openedx-authz) |
| 110 | + - ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy) |
| 111 | + |
| 112 | +#. Determines the scope (course or organization) from the model being saved. |
| 113 | +#. Dispatches an asynchronous Celery task with the migration parameters. |
| 114 | + |
| 115 | +.. note:: |
| 116 | + If no effective change is detected (i.e., the flag state is the same as the previous state), |
| 117 | + the signal handler does nothing. |
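
The transition rules and the opt-in guard described above can be sketched as two pure functions. This is a minimal illustration under stated assumptions, not the ADR's implementation: ``migration_direction`` and ``should_dispatch`` are hypothetical names, and reading the previous and new flag state from the override models is deliberately left out.

.. code:: python

    from typing import Optional

    FORWARD = "forward"    # Legacy -> openedx-authz
    ROLLBACK = "rollback"  # openedx-authz -> Legacy


    def migration_direction(previous: bool, new: bool) -> Optional[str]:
        """Return the migration implied by a flag transition, or None if unchanged."""
        if previous == new:
            # No effective change: the handler does nothing.
            return None
        return FORWARD if new else ROLLBACK


    def should_dispatch(previous: bool, new: bool, auto_migration_enabled: bool) -> Optional[str]:
        """Apply the opt-in guard before choosing a direction (hypothetical helper)."""
        if not auto_migration_enabled:
            # Guard disabled: the change is detected but no task is dispatched.
            return None
        return migration_direction(previous, new)

A handler built this way would pass the value of ``ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION`` as ``auto_migration_enabled`` and dispatch the Celery task only when the result is not ``None``.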
| 118 | + |
| 119 | +3. Migration Tracking Model |
| 120 | +--------------------------- |
| 121 | + |
| 122 | +A new model is introduced to track the lifecycle of each migration operation: |
| 123 | + |
| 124 | +.. code:: python |
| 125 | +
| 126 | +    class CourseAuthoringMigrationRun(models.Model):
| 127 | +        migration_type = models.CharField(max_length=20)  # forward / rollback
| 128 | +        scope_type = models.CharField(max_length=20)  # course / org
| 129 | +        scope_key = models.CharField(max_length=255)
| 130 | +        status = models.CharField(max_length=20)  # pending, running, completed, skipped
| 131 | +        created_at = models.DateTimeField(auto_now_add=True)
| 132 | +        updated_at = models.DateTimeField(auto_now=True)
| 133 | +        completed_at = models.DateTimeField(null=True, blank=True)
| 134 | +        metadata = models.JSONField(default=dict)
| 135 | +
| 136 | +This model is registered in Django Admin so operators can inspect migration history and |
| 137 | +diagnose failures without needing to access logs directly. |
| 138 | + |
| 139 | +4. Asynchronous Execution |
| 140 | +------------------------- |
| 141 | + |
| 142 | +The Celery task acts strictly as a thin dispatcher. All core logic, including locking, |
| 143 | +tracking, and migration execution, is implemented in the utility functions (see |
| 144 | +**Utility Function Updates** above). |
| 145 | + |
| 146 | +All database operations within the migration itself execute inside an atomic transaction. |
| 147 | +If the migration fails, the transaction is rolled back, so no data is deleted from either system and consistency is preserved.
| 148 | + |
| 149 | +5. Concurrency Control (Locking Strategy) |
| 150 | +----------------------------------------- |
| 151 | + |
| 152 | +To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a |
| 153 | +distributed lock is implemented using the Django cache backend (Redis): |
| 154 | + |
| 155 | +.. code:: python |
| 156 | +
| 157 | +    lock_key = f"authz_migration:{scope_type}:{scope_key}"
| 158 | +
| 159 | +The lock is acquired using ``cache.add()``, which is an atomic operation. The default TTL
| 160 | +is **1 hour**. If a lock already exists for the given scope, the migration is skipped
| 161 | +and a new tracking record is created with the ``skipped`` status. This ensures that only one
| 162 | +migration runs at a time for the same scope.
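
The acquire-or-skip behavior can be sketched as follows. The sketch assumes only that the cache object exposes ``add(key, value, timeout)`` with the same semantics as Django's ``cache.add()`` (store the value only if the key is absent and return whether it was stored). ``InMemoryCache``, ``build_lock_key``, and ``try_acquire_lock`` are hypothetical names used for illustration; the stand-in cache ignores the timeout and is not safe across processes.

.. code:: python

    LOCK_TTL_SECONDS = 60 * 60  # default TTL of 1 hour


    class InMemoryCache:
        """Demonstration stand-in for a cache backend (assumption, not Django's)."""

        def __init__(self):
            self._data = {}

        def add(self, key, value, timeout=None):
            # Mirrors cache.add(): store only if absent, report whether stored.
            # The timeout is accepted but not enforced in this stand-in.
            if key in self._data:
                return False
            self._data[key] = value
            return True

        def delete(self, key):
            self._data.pop(key, None)


    def build_lock_key(scope_type, scope_key):
        return f"authz_migration:{scope_type}:{scope_key}"


    def try_acquire_lock(cache, scope_type, scope_key, ttl=LOCK_TTL_SECONDS):
        """Return True if the lock was acquired, False if another run holds it."""
        return cache.add(build_lock_key(scope_type, scope_key), "locked", timeout=ttl)

A caller that receives ``False`` would record the run as ``skipped`` instead of starting a second migration for the same scope.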
| 163 | + |
| 164 | +6. Execution Flow |
| 165 | +------------------ |
| 166 | + |
| 167 | +1. An operator changes the ``authz.enable_course_authoring`` flag for a course or |
| 168 | + organization via Django Admin or a management command. |
| 169 | +2. The ``pre_save`` signal handler detects the state transition. |
| 170 | +3. A Celery task is dispatched asynchronously. |
| 171 | +4. The task calls the utility function, which acquires the lock, creates and updates the |
| 172 | + ``CourseAuthoringMigrationRun`` record, and executes the migration. |
| 173 | +5. The operator can check the migration status via Django Admin on the ``CourseAuthoringMigrationRun`` |
| 174 | + model. |
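
Step 4 of the flow above can be sketched as a single function that takes the lock, updates the tracking record, and runs the migration. This is a hedged illustration only: ``MigrationRun`` is a plain dataclass standing in for ``CourseAuthoringMigrationRun``, the callables are hypothetical injection points, and both the ``failed`` status and the explicit lock release are assumptions (the model comment lists only pending, running, completed, and skipped, and the ADR specifies only a TTL).

.. code:: python

    from dataclasses import dataclass, field


    @dataclass
    class MigrationRun:
        """Plain stand-in for the tracking model (assumption)."""
        migration_type: str
        scope_type: str
        scope_key: str
        status: str = "pending"
        metadata: dict = field(default_factory=dict)


    def run_tracked_migration(run, acquire_lock, release_lock, migrate):
        """Run ``migrate`` under a lock and record the outcome on ``run``."""
        if not acquire_lock():
            # Another migration holds the lock for this scope.
            run.status = "skipped"
            return run
        try:
            run.status = "running"
            migrate()
            run.status = "completed"
        except Exception as exc:
            # Assumption: failures are recorded with a "failed" status
            # and the error message is kept in metadata.
            run.status = "failed"
            run.metadata["error"] = str(exc)
        finally:
            # Assumption: the lock is released as soon as the run ends,
            # rather than waiting for the TTL to expire.
            release_lock()
        return run

Because the Celery task and the management commands would both call the same function, they share one locking and tracking path, which is the property the design relies on.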
| 175 | + |
| 176 | +Consequences |
| 177 | +************ |
| 178 | + |
| 179 | +Positive consequences |
| 180 | +===================== |
| 181 | + |
| 182 | +- **Migration is decoupled from the request cycle**: the flag change returns immediately and |
| 183 | + migration happens in the background. |
| 184 | +- **Full observability**: every migration run is recorded with its status, scope, and metadata |
| 185 | + in the tracking model. |
| 186 | +- **Concurrency-safe**: the lock strategy prevents overlapping migrations on the same scope. |
| 187 | +- **No manual intervention required** for course-level or organization-level flag changes:
| 188 | + operators who have opted in do not need to remember to run management commands.
| 189 | +- **Safe by default**: the opt-in guard setting ensures that automatic migration is never triggered
| 190 | + unexpectedly on instances where operators have not explicitly accepted the risks.
| 191 | + |
| 192 | +Negative consequences / risks |
| 193 | +============================== |
| 194 | + |
| 195 | +- **Global flag changes are not covered**: operators must still run management commands |
| 196 | + manually when enabling or disabling the flag at the instance level. This is a deliberate |
| 197 | + trade-off to avoid performance risks. |
| 198 | +- **Celery dependency**: the system now requires a functioning Celery worker for automatic |
| 199 | + migration. If workers are down, migrations will be queued but not executed until workers |
| 200 | + recover. |
| 201 | +- **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible
| 202 | + for very large organizations), the lock will expire and a second migration could start
| 203 | + concurrently for the same scope.
| 204 | + |
| 205 | +Rejected Alternatives |
| 206 | +********************* |
| 207 | + |
| 208 | +**Synchronous execution in the signal handler** |
| 209 | + Executing the migration directly inside the ``pre_save`` signal would block the HTTP |
| 210 | + request that triggered the flag change, leading to timeouts for large scopes and poor |
| 211 | + operator experience. |
| 212 | + |
| 213 | +**Manual migration** |
| 214 | + Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual |
| 215 | + migration allows the system to end up in inconsistent states (e.g., flag enabled but data |
| 216 | + still in the legacy system), resulting in an operationally fragile design. |
| 217 | + |
| 218 | +**Automatic global migration** |
| 219 | + Triggering automatic migration when the flag is changed globally (instance-wide) would |
| 220 | + risk performance degradation on large instances. This was explicitly ruled out: global |
| 221 | + migrations must remain operator-initiated via management commands. |
| 222 | + |
| 223 | +References |
| 224 | +********** |
| 225 | + |
| 226 | +* `Automatic Migration Spike`_ |
| 227 | +* `ADR 0010`_ |
| 228 | +* `ADR 0011`_ |
| 229 | + |
| 230 | +.. _Automatic Migration Spike: |
| 231 | + https://openedx.atlassian.net/wiki/spaces/OEPM/pages/6205112321/Spike+-+RBAC+AuthZ+-+Automatic+Role+Migration |
| 232 | +.. _ADR 0010: 0010-course-authoring-flag.rst |
| 233 | +.. _ADR 0011: 0011-course-authoring-migration-process.rst |