Commit 818a444

docs: add course authoring automatic migration adr
1 parent 8fc8733 commit 818a444

0013: Course Authoring - Automatic Migration Triggered by Course Authoring Flag
###############################################################################

Status
******

**Draft** - *2026-04-09*

Context
*******

The system is transitioning from the legacy permissions model (``CourseAccessRole``)
to the new openedx-authz system.

Currently, migrations between both systems are performed manually using Django management commands:

- ``authz_migrate_course_authoring`` (forward migration)
- ``authz_rollback_course_authoring`` (rollback migration)

In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when
the feature flag ``authz.enable_course_authoring`` changes state, but they deferred the definition of
the specific mechanism. This ADR addresses that gap.

The current manual approach presents the following risks:

- **Inconsistency**: If an operator enables or disables the flag without running the migration
  command, the permission data in both systems will diverge.
- **No status tracking**: There is no visibility into whether a migration is in progress,
  completed, or failed.
- **No concurrency protection**: Multiple concurrent flag changes can trigger overlapping
  migrations, leading to race conditions and data corruption.
- **No user feedback**: Operators have no way to know the result of a migration without
  inspecting logs manually.

Decision
********

We will implement an automatic, asynchronous migration mechanism triggered by changes in the
``authz.enable_course_authoring`` feature flag. The solution consists of:

#. Django signal handlers to detect flag state changes.
#. Celery tasks to execute migrations asynchronously.
#. A tracking model to record migration status and errors.
#. A locking mechanism to prevent concurrent migrations on the same scope.

.. note::

   **Scope Constraint**

   Automatic migration will only trigger for **course-level** and **organization-level** flag
   overrides, not for global (instance-wide) Waffle flag changes. The reason is that a global
   flag change could affect a large number of courses simultaneously, introducing an unacceptable
   performance risk. Global flag changes must be handled via management commands by operators
   who explicitly accept the performance implications.

Operator Safety and Opt-in Design
==================================

A concern was raised about the risks of triggering data migrations on a live instance. Data
migrations are typically executed under controlled conditions (e.g., during maintenance windows)
because any failure can leave the system in an invalid state. Triggering them automatically via
a feature flag toggle introduces additional risk:

- Django Admin access is sometimes granted to instructors or non-technical staff who may not
  understand the implications of toggling the flag.
- A live instance may be processing requests concurrently, increasing the chance of partial
  failures or inconsistent transient states.

To address this, the automatic migration mechanism will be **guarded by a Django setting**:

.. code:: python

    ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION = False

This setting:

- Is **disabled by default**.
- Must be explicitly set to ``True`` by a site operator who understands the migration risks.
- Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal
  handler still detects the flag change but does **not** dispatch the Celery task. The operator
  must then run the migration manually using the existing management commands.

This design preserves the automated behavior for operators who opt in while keeping the system
safe for deployments where uncontrolled migrations are unacceptable.
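
As an illustration of the guard described above, the following sketch models the prerequisite
check in plain Python. It is a simplified sketch under stated assumptions, not the actual
handler: ``automatic_migration_enabled`` and ``handle_flag_change`` are hypothetical names,
``settings`` stands in for ``django.conf.settings``, and ``dispatch`` stands in for whatever
enqueues the Celery task.

.. code:: python

    from types import SimpleNamespace

    SETTING_NAME = "ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION"


    def automatic_migration_enabled(settings) -> bool:
        # A missing setting is treated as disabled, matching "disabled by default".
        # Only an explicit True counts as opting in.
        return getattr(settings, SETTING_NAME, False) is True


    def handle_flag_change(settings, dispatch) -> str:
        """Dispatch only when the operator opted in (hypothetical helper)."""
        if not automatic_migration_enabled(settings):
            # The change was detected, but the migration is left to the operator.
            return "manual_migration_required"
        dispatch()
        return "dispatched"


    # Usage: SimpleNamespace plays the role of the Django settings object.
    opted_out = SimpleNamespace()
    opted_in = SimpleNamespace(ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION=True)

Treating an absent setting the same as ``False`` keeps the default safe even on deployments
that never define the setting.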

Detailed Design
===============

1. Utility Function Updates
---------------------------

The existing utility functions ``migrate_legacy_course_roles_to_authz`` and
``migrate_authz_to_legacy_course_roles`` will be modified to incorporate the locking strategy
(see **Concurrency Control** below) and the tracking logic (see **Migration Tracking Model** below)
as integral steps of their execution.

This approach ensures that both the Celery task and the management commands go through the same
tracking and locking path.

2. Migration Trigger (Django Signals)
-------------------------------------

``pre_save`` signal handlers are attached to ``WaffleFlagCourseOverrideModel`` and
``WaffleFlagOrgOverrideModel``. When a save is detected for the ``authz.enable_course_authoring``
flag, the handler:

#. Compares the previous and new flag state to determine the transition direction:

   - ``False → True``: triggers a **forward migration** (Legacy → openedx-authz)
   - ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy)

#. Determines the scope (course or organization) from the model being saved.
#. Dispatches an asynchronous Celery task with the migration parameters.

.. note::
   If no effective change is detected (i.e., the flag state is the same as the previous state),
   the signal handler does nothing.
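
The transition rules above can be sketched as a small pure function. This is a minimal
sketch rather than the actual handler code: ``resolve_migration_type`` is a hypothetical name,
and treating ``None`` (no previous override) as "disabled" is an assumption made only for
this example.

.. code:: python

    from typing import Optional


    def resolve_migration_type(previous: Optional[bool], new: Optional[bool]) -> Optional[str]:
        """Map a flag transition to "forward", "rollback", or None (no effective change)."""
        # Assumption: a missing value (None) counts as the flag being disabled.
        was_enabled = bool(previous)
        is_enabled = bool(new)
        if was_enabled == is_enabled:
            return None  # no effective change, so nothing is dispatched
        return "forward" if is_enabled else "rollback"

Keeping the comparison in a pure function makes the "no effective change" case easy to unit
test without touching the database or the signal machinery.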

3. Migration Tracking Model
---------------------------

A new model is introduced to track the lifecycle of each migration operation:

.. code:: python

    from django.db import models


    class CourseAuthoringMigrationRun(models.Model):
        """Records one migration run for a single course or organization scope."""

        migration_type = models.CharField(max_length=20)  # forward / rollback
        scope_type = models.CharField(max_length=20)  # course / org
        scope_key = models.CharField(max_length=255)
        status = models.CharField(max_length=20)  # pending, running, completed, skipped
        created_at = models.DateTimeField(auto_now_add=True)
        updated_at = models.DateTimeField(auto_now=True)
        completed_at = models.DateTimeField(null=True, blank=True)
        metadata = models.JSONField(default=dict)

This model is registered in Django Admin so operators can inspect migration history and
diagnose failures without needing to access logs directly.

4. Asynchronous Execution
-------------------------

The Celery task acts strictly as a thin dispatcher. All core logic, including locking,
tracking, and migration execution, is implemented in the utility functions (see
**Utility Function Updates** above).

All database operations within the migration itself execute inside an atomic transaction.
If the migration fails, the transaction is rolled back and no data is deleted from either
system, preserving consistency.
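
To illustrate what "thin dispatcher" could mean in practice, the sketch below selects a
utility function by migration type and delegates to it. It rests on assumptions: the two
functions are stand-ins that only borrow the names of the real utilities (their actual
signatures and return values are not specified in this ADR), and
``run_course_authoring_migration`` is a hypothetical task name.

.. code:: python

    from typing import Callable, Dict


    def migrate_legacy_course_roles_to_authz(scope_type: str, scope_key: str) -> str:
        # Stand-in only: the real utility handles locking, tracking, and migration.
        return f"forward:{scope_type}:{scope_key}"


    def migrate_authz_to_legacy_course_roles(scope_type: str, scope_key: str) -> str:
        # Stand-in only, as above.
        return f"rollback:{scope_type}:{scope_key}"


    MIGRATION_UTILITIES: Dict[str, Callable[[str, str], str]] = {
        "forward": migrate_legacy_course_roles_to_authz,
        "rollback": migrate_authz_to_legacy_course_roles,
    }


    def run_course_authoring_migration(migration_type: str, scope_type: str, scope_key: str) -> str:
        """Task body: pick the utility for the migration type and delegate to it."""
        try:
            utility = MIGRATION_UTILITIES[migration_type]
        except KeyError:
            raise ValueError(f"Unknown migration type: {migration_type!r}") from None
        return utility(scope_type, scope_key)

In a deployment, a function like this would presumably be registered as a Celery task (for
example with ``celery.shared_task``); that wiring is omitted so the sketch runs without
Celery installed.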

5. Concurrency Control (Locking Strategy)
-----------------------------------------

To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a
distributed lock is implemented using the Django cache backend (Redis):

.. code:: python

    lock_key = f"authz_migration:{scope_type}:{scope_key}"

The lock is acquired using ``cache.add()``, which stores the key only if it is not already
present and is an atomic operation. The default TTL is **1 hour**. If a lock already exists
for the given scope, the migration is skipped and a new tracking record is created with the
``skipped`` status. This ensures that only one migration runs at a time for the same scope.
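
The locking behavior can be sketched with an in-memory stand-in for the cache. This is an
illustration under stated assumptions, not the actual implementation: ``InMemoryCache`` only
mimics the ``add``/``delete`` semantics of the Django cache API within a single process (it is
not a distributed lock), ``run_with_lock`` is a hypothetical helper, the ``runs`` list stands
in for tracking records, and releasing the lock when the migration finishes is an assumption
that this ADR does not state.

.. code:: python

    import time
    from typing import Callable, Dict, List, Tuple

    LOCK_TTL_SECONDS = 60 * 60  # 1 hour, the default TTL stated above


    class InMemoryCache:
        """Single-process stand-in for the cache; models only ``add`` and ``delete``."""

        def __init__(self) -> None:
            self._store: Dict[str, Tuple[object, float]] = {}

        def add(self, key: str, value: object, timeout: float) -> bool:
            # Like cache.add(): store only if the key is absent or has expired.
            now = time.monotonic()
            entry = self._store.get(key)
            if entry is not None and entry[1] > now:
                return False
            self._store[key] = (value, now + timeout)
            return True

        def delete(self, key: str) -> None:
            self._store.pop(key, None)


    def run_with_lock(
        cache: InMemoryCache,
        scope_type: str,
        scope_key: str,
        migrate: Callable[[], None],
        runs: List[str],
    ) -> str:
        """Run ``migrate`` under the scope lock and record the resulting status."""
        lock_key = f"authz_migration:{scope_type}:{scope_key}"
        if not cache.add(lock_key, "locked", LOCK_TTL_SECONDS):
            runs.append("skipped")  # another migration holds the lock for this scope
            return "skipped"
        try:
            migrate()
            runs.append("completed")
            return "completed"
        finally:
            # Assumption: the lock is released as soon as the migration ends.
            cache.delete(lock_key)

If ``migrate`` raises, the exception propagates and the lock is still released; how failures
are recorded in the tracking model is left out of this sketch.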

6. Execution Flow
------------------

1. An operator changes the ``authz.enable_course_authoring`` flag for a course or
   organization via Django Admin or a management command.
2. The ``pre_save`` signal handler detects the state transition.
3. A Celery task is dispatched asynchronously.
4. The task calls the utility function, which acquires the lock, creates and updates the
   ``CourseAuthoringMigrationRun`` record, and executes the migration.
5. The operator can check the migration status via Django Admin on the ``CourseAuthoringMigrationRun``
   model.

Consequences
************

Positive consequences
=====================

- **Migration is decoupled from the request cycle**: the flag change returns immediately and
  migration happens in the background.
- **Full observability**: every migration run is recorded with its status, scope, and metadata
  in the tracking model.
- **Concurrency-safe**: the lock strategy prevents overlapping migrations on the same scope.
- **No manual intervention required**: for course-level and organization-level flag changes,
  operators who have opted in do not need to remember to run management commands.
- **Safe by default**: the opt-in guard setting ensures that automatic migration is never triggered
  unexpectedly on instances where operators have not explicitly accepted the risks.

Negative consequences / risks
==============================

- **Global flag changes are not covered**: operators must still run management commands
  manually when enabling or disabling the flag at the instance level. This is a deliberate
  trade-off to avoid performance risks.
- **Celery dependency**: automatic migration requires a functioning Celery worker. If workers
  are down, migrations will be queued but not executed until workers recover.
- **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible
  for very large organizations), the lock will expire and a second migration for the same
  scope could start concurrently.

Rejected Alternatives
*********************

**Synchronous execution in the signal handler**
    Executing the migration directly inside the ``pre_save`` signal would block the HTTP
    request that triggered the flag change, leading to timeouts for large scopes and poor
    operator experience.

**Manual migration**
    Keeping migration manual is error-prone, does not scale, and invites inconsistency. The
    flag is the source of truth, but manual migration allows the system to end up in
    inconsistent states (e.g., flag enabled but data still in the legacy system), resulting
    in an operationally fragile design.

**Automatic global migration**
    Triggering automatic migration when the flag is changed globally (instance-wide) would
    risk performance degradation on large instances. This was explicitly ruled out: global
    migrations must remain operator-initiated via management commands.

References
**********

* `Automatic Migration Spike`_
* `ADR 0010`_
* `ADR 0011`_

.. _Automatic Migration Spike:
   https://openedx.atlassian.net/wiki/spaces/OEPM/pages/6205112321/Spike+-+RBAC+AuthZ+-+Automatic+Role+Migration
.. _ADR 0010: 0010-course-authoring-flag.rst
.. _ADR 0011: 0011-course-authoring-migration-process.rst