Skip to content

Commit c55800c

Browse files
committed
chore: address PR reviews
1 parent eea181e commit c55800c

1 file changed

Lines changed: 142 additions & 91 deletions

File tree

docs/decisions/0013-course-authoring-automatic-migration.rst

Lines changed: 142 additions & 91 deletions
Original file line numberDiff line numberDiff line change
@@ -4,44 +4,49 @@
44
Status
55
******
66

7-
**Draft** - *2026-04-09*
7+
**Draft** - *2026-04-13*
88

99
Context
1010
*******
1111

1212
The system is transitioning from the legacy permissions model (``CourseAccessRole``)
1313
to the new openedx-authz system.
1414

15-
Currently, migrations between both systems are performed manually using Django management commands:
15+
Currently, migrations between the two systems are performed manually using Django
16+
management commands:
1617

1718
- ``authz_migrate_course_authoring`` (forward migration)
1819
- ``authz_rollback_course_authoring`` (rollback migration)
1920

20-
In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when
21+
In `ADR 0010`_ and `ADR 0011`_ it was established that migrations must occur automatically when
2122
the feature flag ``authz.enable_course_authoring`` changes state, but the definition of
2223
the specific mechanism was deferred. This ADR addresses that gap.
2324

24-
The current manual approach presents the following risks:
25-
26-
- **Inconsistency**: If an operator enables or disables the flag without running the migration
27-
command, the permission data in both systems will diverge.
28-
- **No status tracking**: There is no visibility into whether a migration is in progress,
29-
completed, or failed.
25+
The current manual approach has the following problems:
26+
27+
- **Access disparity**: Many users have access to Django Admin and can toggle the flag, while
28+
significantly fewer have permission to run management commands. This creates an operational
29+
gap where the flag state can change independently of the migration process. As a result,
30+
coordination is required between different roles (those managing flags vs. those executing
31+
migrations), increasing the risk of delays, misalignment, and inconsistent system state.
32+
- **Outage window**: When a flag change and the corresponding migration command are not executed
33+
atomically, there is a period where the flag points to one system but the permission data
34+
still lives in the other. Any permission check made during this window will fail, causing
35+
real outages for affected courses or organizations.
36+
- **No user feedback**: Users have no way to know the result of a migration without
37+
inspecting logs manually.
3038
- **No concurrency protection**: Nothing prevents operators from running the migration command
3139
multiple times simultaneously, which can lead to race conditions and data corruption.
32-
- **No user feedback**: Operators have no way to know the result of a migration without
33-
inspecting logs manually.
3440

3541
Decision
3642
********
3743

38-
We will implement an automatic and asynchronous migration mechanism triggered by changes in the
44+
We will implement an automatic and synchronous migration mechanism triggered by changes in the
3945
``authz.enable_course_authoring`` feature flag. The solution consists of:
4046

41-
#. Django signal handler to detect flag state changes.
42-
#. Celery tasks to execute migrations asynchronously.
47+
#. A ``post_save`` signal handler that detects flag changes and executes the migration.
4348
#. A tracking model to record migration status and errors.
44-
#. A locking mechanism to prevent concurrent migrations on the same scope.
49+
#. A database-level constraint to prevent concurrent migrations on the same scope.
4550

4651
.. note::
4752

@@ -54,7 +59,7 @@ We will implement an automatic and asynchronous migration mechanism triggered by
5459
who explicitly accept the performance implications.
5560

5661
Operator Safety and Opt-in Design
57-
==================================
62+
=================================
5863

5964
A concern was raised about the risks of triggering data migrations on a live instance. Data
6065
migrations are typically executed under controlled conditions (e.g., during maintenance windows)
@@ -77,46 +82,53 @@ This setting:
7782
- Is **disabled by default**.
7883
- Must be explicitly set to ``True`` by a site operator who understands the migration risks.
7984
- Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal
80-
detects the flag change but does **not** dispatch the Celery task. The operator must then
85+
detects the flag change but does not execute the migration. The operator must then
8186
run the migration manually using the existing management commands.
8287

83-
This design preserves the automated behavior for operators who opt in while keeping the system
84-
safe for deployments where uncontrolled migrations are unacceptable.
85-
8688
Detailed Design
8789
===============
8890

89-
1. Utility Function Updates
90-
---------------------------
91+
1. Migration Trigger (Django Signals)
92+
-------------------------------------
9193

92-
The existing utility functions ``migrate_legacy_course_roles_to_authz`` and
93-
``migrate_authz_to_legacy_course_roles`` will be modified to incorporate the locking strategy
94-
(see **Concurrency Control** below) and the tracking logic (see **Migration Tracking Model** below)
95-
as integral steps of their execution.
94+
A ``post_save`` handler is attached to ``WaffleFlagCourseOverrideModel`` and
95+
``WaffleFlagOrgOverrideModel`` for the ``authz.enable_course_authoring`` flag.
9696

97-
This approach ensures that both the Celery task and the management commands go through the same
98-
tracking and locking path.
97+
The handler fires after the record is committed to the database, so the new flag value is
98+
the authoritative and durable state of the system when the migration begins.
9999

100-
2. Migration Trigger (Django Signals)
101-
-------------------------------------
100+
Retrieving the previous state from the same model
101+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
102+
103+
Both ``WaffleFlagCourseOverrideModel`` and ``WaffleFlagOrgOverrideModel`` extend
104+
``ConfigurationModel``, which **creates a new row on every save** instead of updating the
105+
existing record. This means the full change history for each scope is preserved in the
106+
table. The previous override value is therefore always available as the most recent record
107+
for the same scope that is not the one just saved.
108+
109+
If no previous record exists for the scope (this is the first override ever created for
110+
it), the migration runs unconditionally based on the current ``enabled`` value, without
111+
comparing against a previous state.
112+
113+
``post_save`` execution
114+
~~~~~~~~~~~~~~~~~~~~~~~
102115

103-
``pre_save`` signal handlers are attached to ``WaffleFlagCourseOverrideModel`` and
104-
``WaffleFlagOrgOverrideModel``. When a save is detected for the ``authz.enable_course_authoring``
105-
flag, the handler:
116+
The ``post_save`` handler:
106117

107-
#. Compares the previous and new flag state to determine the transition direction:
118+
#. Queries the same flag override model for the previous record as described above.
119+
#. If no previous record exists, runs the migration based on the current ``enabled`` value
120+
without further comparison.
121+
#. If a previous record exists, compares its ``enabled`` value with the saved one to
122+
determine whether an effective transition occurred:
108123

109124
- ``False → True``: triggers a **forward migration** (Legacy → openedx-authz)
110125
- ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy)
126+
- No change: the handler does nothing. No tracking record is created and no migration runs.
111127

112128
#. Determines the scope (course or organization) from the model being saved.
113-
#. Dispatches an asynchronous Celery task with the migration parameters.
129+
#. Calls the utility function synchronously with the migration parameters.
114130

115-
.. note::
116-
If no effective change is detected (i.e., the flag state is the same as the previous state),
117-
the signal handler does nothing.
118-
119-
3. Migration Tracking Model
131+
2. Migration Tracking Model
120132
---------------------------
121133

122134
A new model is introduced to track the lifecycle of each migration operation:
@@ -127,88 +139,127 @@ A new model is introduced to track the lifecycle of each migration operation:
127139
migration_type = models.CharField(max_length=20) # forward / rollback
128140
scope_type = models.CharField(max_length=20) # course / org
129141
scope_key = models.CharField(max_length=255)
130-
status = models.CharField(max_length=20) # pending, running, completed, skipped
142+
status = models.CharField(max_length=20) # running, completed, partial_success, failed, skipped
131143
created_at = models.DateTimeField(auto_now_add=True)
132144
updated_at = models.DateTimeField(auto_now=True)
133145
completed_at = models.DateTimeField(null=True, blank=True)
134146
metadata = models.JSONField(default=dict)
135147
136-
This model is registered in Django Admin so operators can inspect migration history and
148+
This model is registered in Django Admin so users can inspect migration history and
137149
diagnose failures without needing to access logs directly.
138150

139-
4. Asynchronous Execution
140-
-------------------------
141-
142-
The Celery task acts strictly as a thin dispatcher. All core logic, including locking,
143-
tracking, and migration execution, is implemented in the utility functions (see
144-
**Utility Function Updates** above).
145-
146-
All database operations within the migration itself execute inside an atomic transaction.
147-
If the migration fails, no data is deleted from either system, preserving consistency.
148-
149-
5. Concurrency Control (Locking Strategy)
150-
-----------------------------------------
151-
152-
To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a
153-
distributed lock is implemented using the Django cache backend (Redis):
151+
A higher-level orchestration layer (separate from the existing utility functions) will be
152+
responsible for creating and updating ``AuthzCourseAuthoringMigrationRun`` records. This
153+
layer wraps the core migration logic, ensuring that lifecycle tracking (opening a
154+
``running`` record, handling errors, and writing the final status) is applied consistently
155+
regardless of whether the migration is triggered by the signal handler or a management
156+
command.
157+
158+
Migration Outcome Semantics
159+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
160+
161+
The ``status`` field reflects the precise outcome of each run. The possible values are:
162+
163+
- ``running``: the migration is actively executing.
164+
- ``completed``: all records were migrated successfully.
165+
- ``partial_success``: the migration process ran to completion, but one or more individual
166+
records failed and were skipped. The ``metadata`` field contains details about the
167+
failures.
168+
- ``failed``: a critical error prevented the migration from completing (e.g., an unhandled
169+
exception or infrastructure problem). The ``metadata`` field contains the exception details.
170+
- ``skipped``: the migration was not attempted because another run for the same scope was
171+
already active.
172+
173+
3. Concurrency Control
174+
----------------------
175+
176+
To prevent overlapping migrations on the same scope, the tracking model enforces a
177+
conditional ``UniqueConstraint`` on ``(scope_type, scope_key)`` filtered to
178+
``status="running"``. This guarantees that no second active migration record can be
179+
inserted for the same scope regardless of how many processes attempt to do so concurrently.
180+
Any attempt raises an ``IntegrityError``, which the caller handles by recording a
181+
``skipped`` run and aborting.
154182

155183
.. code:: python
156184
157-
lock_key = f"authz_migration:{scope_type}:{scope_key}"
158-
159-
The lock is acquired using ``cache.add()``, which is an atomic operation. The default TTL
160-
is **1 hour**. If a lock already exists for the given scope, the migration is skipped
161-
and a new tracking record is created with that status. This ensures that only one
162-
migration runs at a time for the same scope.
163-
164-
6. Execution Flow
165-
------------------
166-
167-
1. An operator changes the ``authz.enable_course_authoring`` flag for a course or
168-
organization via Django Admin or a management command.
169-
2. The ``pre_save`` signal handler detects the state transition.
170-
3. A Celery task is dispatched asynchronously.
171-
4. The task calls the utility function, which acquires the lock, creates and updates the
172-
``AuthzCourseAuthoringMigrationRun`` record, and executes the migration.
173-
5. The operator can check the migration status via Django Admin on the ``AuthzCourseAuthoringMigrationRun``
174-
model.
185+
class Meta:
186+
constraints = [
187+
models.UniqueConstraint(
188+
fields=["scope_type", "scope_key"],
189+
condition=models.Q(status="running"),
190+
name="unique_active_migration_per_scope",
191+
)
192+
]
193+
194+
4. Execution Flow
195+
-----------------
196+
197+
1. The user changes the ``authz.enable_course_authoring`` flag for a course or
198+
organization and saves the record. A new row is created in the override table.
199+
2. The ``post_save`` handler queries the same override model for the previous record
200+
(most recent row for the same scope, excluding the one just saved) to obtain the
201+
previous ``enabled`` value.
202+
3. The handler compares the previous value with the current ``enabled`` value. If no
203+
effective change occurred, it does nothing.
204+
4. If a transition is detected, the handler calls the utility function synchronously. The
205+
function creates an ``AuthzCourseAuthoringMigrationRun`` record with
206+
``status="running"`` (the database constraint prevents this if another run for the
207+
same scope is already active) and executes the migration.
208+
5. The record is updated to its final status (``completed``, ``partial_success``, ``failed``,
209+
or ``skipped``) before the ``post_save`` handler returns.
210+
6. The user can review the migration outcome via Django Admin on the
211+
``AuthzCourseAuthoringMigrationRun`` model.
175212

176213
Consequences
177214
************
178215

179216
Positive consequences
180217
=====================
181218

182-
- **Migration is decoupled from the request cycle**: the flag change returns immediately and
183-
migration happens in the background.
184219
- **Full observability**: every migration run is recorded with its status, scope, and metadata
185220
in the tracking model.
186-
- **Concurrency-safe**: the lock strategy prevents overlapping migrations on the same scope.
187-
- **No manual intervention required**: for course-level or organization-level flag changes. Operators
188-
who have opted in do not need to remember to run management commands.
221+
- **Concurrency-safe**: the database-level constraint prevents overlapping migrations on the same
222+
scope, regardless of cache availability or worker failures.
223+
- **No manual intervention required** for course-level or organization-level flag changes. Operators
224+
or users who have opted in do not need to remember to run management commands.
189225
- **Safe by default**: the opt-in guard flag ensures that automatic migration is never triggered
190226
unexpectedly on instances where operators have not explicitly accepted the risks.
191227

192228
Negative consequences / risks
193-
==============================
229+
=============================
194230

195231
- **Global flag changes are not covered**: operators must still run management commands
196232
manually when enabling or disabling the flag at the instance level. This is a deliberate
197233
trade-off to avoid performance risks.
198-
- **Celery dependency**: the system now requires a functioning Celery worker for automatic
199-
migration. If workers are down, migrations will be queued but not executed until workers
200-
recover.
201-
- **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible
202-
for very large organizations), the lock will expire and a new migration for the same scope
203-
could start concurrently for the same scope.
234+
- **Blocks the request**: the migration runs synchronously inside the ``post_save`` signal,
235+
so the HTTP request that triggered the flag change does not return until the migration
236+
finishes. For large organization-level scopes this can cause noticeable latency or
237+
timeouts. This is an accepted trade-off given that automatic migration is scoped to
238+
course-level and organization-level overrides only (never global), and is opt-in.
239+
- **Runtime execution trade-offs**: Unlike management commands typically executed during
240+
maintenance windows, this migration runs in a live production environment as part of
241+
normal system operation. This means it executes under concurrent load, with active
242+
requests and database activity, which introduces variability in execution conditions.
243+
This trade-off is inherent to enabling the feature flag to act as a real-time source
244+
of truth. The design prioritizes consistency between flag state and permission data
245+
over strictly controlled execution environments, while providing observability and
246+
recovery mechanisms to mitigate operational risk.
204247

205248
Rejected Alternatives
206249
*********************
207250

208-
**Synchronous execution in the signal handler**
209-
Executing the migration directly inside the ``pre_save`` signal would block the HTTP
210-
request that triggered the flag change, leading to timeouts for large scopes and poor
211-
operator experience.
251+
**Using pre_save to trigger the migration**
252+
A ``pre_save`` handler could detect the transition direction and execute the migration
253+
before the flag change is written. This approach violates ACID principles: at the moment
254+
``pre_save`` fires, the new flag value has not yet been committed to the database. If the
255+
subsequent ``save()`` were to fail (e.g., a validation error, a database constraint
256+
violation, or a network issue), the migration would have already run against a state that
257+
was never persisted, leaving the permission data inconsistent with the actual flag value.
258+
259+
**Asynchronous execution via Celery**
260+
Given that automatic migration is scoped to course-level and organization-level
261+
overrides where migration volumes are bounded, synchronous execution is simpler
262+
and provides stronger consistency guarantees.
212263

213264
**Manual migration**
214265
Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual

0 commit comments

Comments
 (0)