44Status
55******
66
7- **Draft ** - *2026-04-09 *
7+ **Draft ** - *2026-04-13 *
88
99Context
1010*******
1111
1212The system is transitioning from the legacy permissions model (``CourseAccessRole ``)
1313to the new openedx-authz system.
1414
15- Currently, migrations between both systems are performed manually using Django management commands:
15+ Currently, migrations between the two systems are performed manually using Django
16+ management commands:
1617
1718- ``authz_migrate_course_authoring `` (forward migration)
1819- ``authz_rollback_course_authoring `` (rollback migration)
1920
20- In `ADR 0011 `_ and `ADR 0010 `_ it was established that migration must occur automatically when
21+ In `ADR 0010 `_ and `ADR 0011 `_ it was established that migrations must occur automatically when
2122the feature flag ``authz.enable_course_authoring `` changes state, but the definition of
2223the specific mechanism was deferred. This ADR addresses that gap.
2324
24- The current manual approach presents the following risks:
25-
26- - **Inconsistency **: If an operator enables or disables the flag without running the migration
27- command, the permission data in both systems will diverge.
28- - **No status tracking **: There is no visibility into whether a migration is in progress,
29- completed, or failed.
25+ The current manual approach has the following problems:
26+
27+ - **Access disparity **: Many users have access to Django Admin and can toggle the flag, while
28+ significantly fewer have permission to run management commands. This creates an operational
29+ gap where the flag state can change independently of the migration process. As a result,
30+ coordination is required between different roles (those managing flags vs. those executing
31+ migrations), increasing the risk of delays, misalignment, and inconsistent system state.
32+ - **Outage window **: When a flag change and the corresponding migration command are not executed
33+ atomically, there is a period where the flag points to one system but the permission data
34+ still lives in the other. Any permission check made during this window will fail, causing
35+ real outages for affected courses or organizations.
36+ - **No user feedback **: Users have no way to know the result of a migration without
37+ inspecting logs manually.
3038- **No concurrency protection **: Nothing prevents operators from running the migration command
3139 multiple times simultaneously, which can lead to race conditions and data corruption.
32- - **No user feedback **: Operators have no way to know the result of a migration without
33- inspecting logs manually.
3440
3541Decision
3642********
3743
38- We will implement an automatic and asynchronous migration mechanism triggered by changes in the
44+ We will implement an automatic and synchronous migration mechanism triggered by changes in the
3945``authz.enable_course_authoring `` feature flag. The solution consists of:
4046
41- #. Django signal handler to detect flag state changes.
42- #. Celery tasks to execute migrations asynchronously.
47+ #. A ``post_save `` signal handler that detects flag changes and executes the migration.
4348#. A tracking model to record migration status and errors.
44- #. A locking mechanism to prevent concurrent migrations on the same scope.
49+ #. A database-level constraint to prevent concurrent migrations on the same scope.
4550
4651.. note ::
4752
@@ -54,7 +59,7 @@ We will implement an automatic and asynchronous migration mechanism triggered by
5459 who explicitly accept the performance implications.
5560
5661Operator Safety and Opt-in Design
57- ==================================
62+ =================================
5863
5964A concern was raised about the risks of triggering data migrations on a live instance. Data
6065migrations are typically executed under controlled conditions (e.g., during maintenance windows)
@@ -77,46 +82,53 @@ This setting:
7782- Is **disabled by default **.
7883- Must be explicitly set to ``True `` by a site operator who understands the migration risks.
7984- Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal
80- detects the flag change but does ** not ** dispatch the Celery task . The operator must then
85+ detects the flag change but does not execute the migration . The operator must then
8186 run the migration manually using the existing management commands.
8287
83- This design preserves the automated behavior for operators who opt in while keeping the system
84- safe for deployments where uncontrolled migrations are unacceptable.
85-
8688Detailed Design
8789===============
8890
89- 1. Utility Function Updates
90- ---------------------------
91+ 1. Migration Trigger (Django Signals)
92+ -------------------------------------
9193
92- The existing utility functions ``migrate_legacy_course_roles_to_authz `` and
93- ``migrate_authz_to_legacy_course_roles `` will be modified to incorporate the locking strategy
94- (see **Concurrency Control ** below) and the tracking logic (see **Migration Tracking Model ** below)
95- as integral steps of their execution.
94+ A ``post_save `` handler is attached to ``WaffleFlagCourseOverrideModel `` and
95+ ``WaffleFlagOrgOverrideModel `` for the ``authz.enable_course_authoring `` flag.
9696
97- This approach ensures that both the Celery task and the management commands go through the same
98- tracking and locking path .
97+ The handler fires after the record is committed to the database, so the new flag value is
98+ the authoritative and durable state of the system when the migration begins .
9999
100- 2. Migration Trigger (Django Signals)
101- -------------------------------------
100+ Retrieving the previous state from the same model
101+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
102+
103+ Both ``WaffleFlagCourseOverrideModel `` and ``WaffleFlagOrgOverrideModel `` extend
104+ ``ConfigurationModel ``, which **creates a new row on every save ** instead of updating the
105+ existing record. This means the full change history for each scope is preserved in the
106+ table. The previous override value is therefore always available as the most recent record
107+ for the same scope that is not the one just saved.
108+
109+ If no previous record exists for the scope (this is the first override ever created for
110+ it), the migration runs unconditionally based on the current ``enabled `` value, without
111+ comparing against a previous state.
112+
113+ ``post_save `` execution
114+ ~~~~~~~~~~~~~~~~~~~~~~~
102115
103- ``pre_save `` signal handlers are attached to ``WaffleFlagCourseOverrideModel `` and
104- ``WaffleFlagOrgOverrideModel ``. When a save is detected for the ``authz.enable_course_authoring ``
105- flag, the handler:
116+ The ``post_save `` handler:
106117
107- #. Compares the previous and new flag state to determine the transition direction:
118+ #. Queries the same flag override model for the previous record as described above.
119+ #. If no previous record exists, runs the migration based on the current ``enabled `` value
120+ without further comparison.
121+ #. If a previous record exists, compares its ``enabled `` value with the saved one to
122+ determine whether an effective transition occurred:
108123
109124 - ``False → True ``: triggers a **forward migration ** (Legacy → openedx-authz)
110125 - ``True → False ``: triggers a **rollback migration ** (openedx-authz → Legacy)
126+ - No change: the handler does nothing. No tracking record is created and no migration runs.
111127
112128#. Determines the scope (course or organization) from the model being saved.
113- #. Dispatches an asynchronous Celery task with the migration parameters.
129+ #. Calls the utility function synchronously with the migration parameters.
114130
115- .. note ::
116- If no effective change is detected (i.e., the flag state is the same as the previous state),
117- the signal handler does nothing.
118-
119- 3. Migration Tracking Model
131+ 2. Migration Tracking Model
120132---------------------------
121133
122134A new model is introduced to track the lifecycle of each migration operation:
@@ -127,88 +139,127 @@ A new model is introduced to track the lifecycle of each migration operation:
127139 migration_type = models.CharField(max_length = 20 ) # forward / rollback
128140 scope_type = models.CharField(max_length = 20 ) # course / org
129141 scope_key = models.CharField(max_length = 255 )
130- status = models.CharField(max_length = 20 ) # pending, running, completed, skipped
142+ status = models.CharField(max_length = 20 ) # running, completed, partial_success, failed , skipped
131143 created_at = models.DateTimeField(auto_now_add = True )
132144 updated_at = models.DateTimeField(auto_now = True )
133145 completed_at = models.DateTimeField(null = True , blank = True )
134146 metadata = models.JSONField(default = dict )
135147
136- This model is registered in Django Admin so operators can inspect migration history and
148+ This model is registered in Django Admin so users can inspect migration history and
137149diagnose failures without needing to access logs directly.
138150
139- 4. Asynchronous Execution
140- -------------------------
141-
142- The Celery task acts strictly as a thin dispatcher. All core logic, including locking,
143- tracking, and migration execution, is implemented in the utility functions (see
144- **Utility Function Updates ** above).
145-
146- All database operations within the migration itself execute inside an atomic transaction.
147- If the migration fails, no data is deleted from either system, preserving consistency.
148-
149- 5. Concurrency Control (Locking Strategy)
150- -----------------------------------------
151-
152- To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a
153- distributed lock is implemented using the Django cache backend (Redis):
151+ A higher-level orchestration layer (separate from the existing utility functions) will be
152+ responsible for creating and updating ``AuthzCourseAuthoringMigrationRun `` records. This
153+ layer wraps the core migration logic, ensuring that lifecycle tracking (opening a
154+ ``running `` record, handling errors, and writing the final status) is applied consistently
155+ regardless of whether the migration is triggered by the signal handler or a management
156+ command.
157+
158+ Migration Outcome Semantics
159+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
160+
161+ The ``status `` field reflects the precise outcome of each run. The possible values are:
162+
163+ - ``running ``: the migration is actively executing.
164+ - ``completed ``: all records were migrated successfully.
165+ - ``partial_success ``: the migration process ran to completion, but one or more individual
166+ records failed and were skipped. The ``metadata `` field contains details about the
167+ failures.
168+ - ``failed ``: a critical error prevented the migration from completing (e.g., an unhandled
169+ exception or infrastructure problem). The ``metadata `` field contains the exception details.
170+ - ``skipped ``: the migration was not attempted because another run for the same scope was
171+ already active.
172+
173+ 3. Concurrency Control
174+ ----------------------
175+
176+ To prevent overlapping migrations on the same scope, the tracking model enforces a
177+ conditional ``UniqueConstraint `` on ``(scope_type, scope_key) `` filtered to
178+ ``status="running" ``. This guarantees that no second active migration record can be
179+ inserted for the same scope regardless of how many processes attempt to do so concurrently.
180+ Any attempt raises an ``IntegrityError ``, which the caller handles by recording a
181+ ``skipped `` run and aborting.
154182
155183.. code :: python
156184
157- lock_key = f " authz_migration: { scope_type} : { scope_key} "
158-
159- The lock is acquired using ``cache.add() ``, which is an atomic operation. The default TTL
160- is **1 hour **. If a lock already exists for the given scope, the migration is skipped
161- and a new tracking record is created with that status. This ensures that only one
162- migration runs at a time for the same scope.
163-
164- 6. Execution Flow
165- ------------------
166-
167- 1. An operator changes the ``authz.enable_course_authoring `` flag for a course or
168- organization via Django Admin or a management command.
169- 2. The ``pre_save `` signal handler detects the state transition.
170- 3. A Celery task is dispatched asynchronously.
171- 4. The task calls the utility function, which acquires the lock, creates and updates the
172- ``AuthzCourseAuthoringMigrationRun `` record, and executes the migration.
173- 5. The operator can check the migration status via Django Admin on the ``AuthzCourseAuthoringMigrationRun ``
174- model.
185+ class Meta :
186+ constraints = [
187+ models.UniqueConstraint(
188+ fields = [" scope_type" , " scope_key" ],
189+ condition = models.Q(status = " running" ),
190+ name = " unique_active_migration_per_scope" ,
191+ )
192+ ]
193+
194+ 4. Execution Flow
195+ -----------------
196+
197+ 1. The user changes the ``authz.enable_course_authoring `` flag for a course or
198+ organization and saves the record. A new row is created in the override table.
199+ 2. The ``post_save `` handler queries the same override model for the previous record
200+ (most recent row for the same scope, excluding the one just saved) to obtain the
201+ previous ``enabled `` value.
202+ 3. The handler compares the previous value with the current ``enabled `` value. If no
203+ effective change occurred, it does nothing.
204+ 4. If a transition is detected, the handler calls the utility function synchronously. The
205+ function creates an ``AuthzCourseAuthoringMigrationRun `` record with
206+ ``status="running" `` (the database constraint prevents this if another run for the
207+ same scope is already active) and executes the migration.
208+ 5. The record is updated to its final status (``completed ``, ``partial_success ``, ``failed ``,
209+ or ``skipped ``) before the ``post_save `` handler returns.
210+ 6. The user can review the migration outcome via Django Admin on the
211+ ``AuthzCourseAuthoringMigrationRun `` model.
175212
176213Consequences
177214************
178215
179216Positive consequences
180217=====================
181218
182- - **Migration is decoupled from the request cycle **: the flag change returns immediately and
183- migration happens in the background.
184219- **Full observability **: every migration run is recorded with its status, scope, and metadata
185220 in the tracking model.
186- - **Concurrency-safe **: the lock strategy prevents overlapping migrations on the same scope.
187- - **No manual intervention required **: for course-level or organization-level flag changes. Operators
188- who have opted in do not need to remember to run management commands.
221+ - **Concurrency-safe **: the database-level constraint prevents overlapping migrations on the same
222+ scope, regardless of cache availability or worker failures.
223+ - **No manual intervention required ** for course-level or organization-level flag changes. Operators
224+ or users who have opted in do not need to remember to run management commands.
189225- **Safe by default **: the opt-in guard flag ensures that automatic migration is never triggered
190226 unexpectedly on instances where operators have not explicitly accepted the risks.
191227
192228Negative consequences / risks
193- ==============================
229+ =============================
194230
195231- **Global flag changes are not covered **: operators must still run management commands
196232 manually when enabling or disabling the flag at the instance level. This is a deliberate
197233 trade-off to avoid performance risks.
198- - **Celery dependency **: the system now requires a functioning Celery worker for automatic
199- migration. If workers are down, migrations will be queued but not executed until workers
200- recover.
201- - **Lock TTL edge cases **: if a migration takes longer than 1 hour (unlikely but possible
202- for very large organizations), the lock will expire and a new migration for the same scope
203- could start concurrently for the same scope.
234+ - **Blocks the request **: the migration runs synchronously inside the ``post_save `` signal,
235+ so the HTTP request that triggered the flag change does not return until the migration
236+ finishes. For large organization-level scopes this can cause noticeable latency or
237+ timeouts. This is an accepted trade-off given that automatic migration is scoped to
238+ course-level and organization-level overrides only (never global), and is opt-in.
239+ - **Runtime execution trade-offs **: Unlike management commands typically executed during
240+ maintenance windows, this migration runs in a live production environment as part of
241+ normal system operation. This means it executes under concurrent load, with active
242+ requests and database activity, which introduces variability in execution conditions.
243+ This trade-off is inherent to enabling the feature flag to act as a real-time source
244+ of truth. The design prioritizes consistency between flag state and permission data
245+ over strictly controlled execution environments, while providing observability and
246+ recovery mechanisms to mitigate operational risk.
204247
205248Rejected Alternatives
206249*********************
207250
208- **Synchronous execution in the signal handler **
209- Executing the migration directly inside the ``pre_save `` signal would block the HTTP
210- request that triggered the flag change, leading to timeouts for large scopes and poor
211- operator experience.
251+ **Using pre_save to trigger the migration **
252+ A ``pre_save `` handler could detect the transition direction and execute the migration
253+ before the flag change is written. This approach violates ACID principles: at the moment
254+ ``pre_save `` fires, the new flag value has not yet been committed to the database. If the
255+ subsequent ``save() `` were to fail (e.g., a validation error, a database constraint
256+ violation, or a network issue), the migration would have already run against a state that
257+ was never persisted, leaving the permission data inconsistent with the actual flag value.
258+
259+ **Asynchronous execution via Celery **
260+ Given that automatic migration is scoped to course-level and organization-level
261+ overrides where migration volumes are bounded, synchronous execution is simpler
262+ and provides stronger consistency guarantees.
212263
213264**Manual migration **
214265 Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual
0 commit comments