Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
283 changes: 283 additions & 0 deletions docs/decisions/0013-course-authoring-automatic-migration.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,283 @@
0013: Course Authoring - Automatic Migration Triggered by Course Authoring Flag
###############################################################################

Status
******

**Draft** - *2026-04-13*

Context
*******

The system is transitioning from the legacy permissions model (``CourseAccessRole``)
to the new openedx-authz system.

Currently, migrations between the two systems are performed manually using Django
management commands:

- ``authz_migrate_course_authoring`` (forward migration)
- ``authz_rollback_course_authoring`` (rollback migration)

In `ADR 0010`_ and `ADR 0011`_ it was established that migrations must occur automatically when
the feature flag ``authz.enable_course_authoring`` changes state, but the definition of
the specific mechanism was deferred. This ADR addresses that gap.

The current manual approach has the following problems:

- **Access disparity**: Many users have access to Django Admin and can toggle the flag, while
significantly fewer have permission to run management commands. This creates an operational
gap where the flag state can change independently of the migration process. As a result,
coordination is required between different roles (those managing flags vs. those executing
migrations), increasing the risk of delays, misalignment, and inconsistent system state.
- **Outage window**: When a flag change and the corresponding migration command are not executed
atomically, there is a period where the flag points to one system but the permission data
still lives in the other. Any permission check made during this window will fail, causing
real outages for affected courses or organizations.
- **No user feedback**: Users have no way to know the result of a migration without
inspecting logs manually.
- **No concurrency protection**: Nothing prevents operators from running the migration command
multiple times simultaneously, which can lead to race conditions and data corruption.

Decision
********

We will implement an automatic and synchronous migration mechanism triggered by changes in the
``authz.enable_course_authoring`` feature flag. The solution consists of:

#. A ``post_save`` signal handler that detects flag changes and executes the migration.
#. A tracking model to record migration status and errors.
#. A database-level constraint to prevent concurrent migrations on the same scope.

.. note::

**Scope Constraint**

Automatic migration will only trigger for **course-level** and **organization-level** flag
overrides, not for global (instance-wide) Waffle flag changes. The reason is that a global
flag change could affect a large number of courses simultaneously, introducing an unacceptable
performance risk. Global flag changes must be handled via management commands by operators
who explicitly accept the performance implications.

Operator Safety and Opt-in Design
=================================

A concern was raised about the risks of triggering data migrations on a live instance. Data
migrations are typically executed under controlled conditions (e.g., during maintenance windows)
because any failure can leave the system in an invalid state. Triggering them automatically via
a feature flag toggle introduces additional risk:

- Django Admin access is sometimes granted to instructors or non-technical staff who may not
understand the implications of toggling the flag.
- A live instance may be processing requests concurrently, increasing the chance of partial
failures or inconsistent transient states.

To address this, the automatic migration mechanism will be **guarded by a Django setting**:

.. code:: python

ENABLE_AUTOMATIC_AUTHZ_COURSE_AUTHORING_MIGRATION = False

This setting:

- Is **disabled by default**.
- Must be explicitly set to ``True`` by a site operator who understands the migration risks.
- Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal
detects the flag change but does not execute the migration. The operator must then
run the migration manually using the existing management commands.

Detailed Design
===============

1. Migration Trigger (Django Signals)
-------------------------------------

A ``post_save`` handler is attached to ``WaffleFlagCourseOverrideModel`` and
``WaffleFlagOrgOverrideModel`` for the ``authz.enable_course_authoring`` flag.

The handler fires after the record is committed to the database, so the new flag value is
the authoritative and durable state of the system when the migration begins.

Retrieving the previous state from the same model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both ``WaffleFlagCourseOverrideModel`` and ``WaffleFlagOrgOverrideModel`` extend
``ConfigurationModel``, which **creates a new row on every save** instead of updating the
existing record. This means the full change history for each scope is preserved in the
table. The previous override value is therefore always available as the most recent record
for the same scope that is not the one just saved.

If no previous record exists for the scope (this is the first override ever created for
it), the migration runs unconditionally based on the current ``enabled`` value, without
comparing against a previous state.

``post_save`` execution
~~~~~~~~~~~~~~~~~~~~~~~

The ``post_save`` handler:

#. Queries the same flag override model for the previous record as described above.
#. If no previous record exists, runs the migration based on the current ``enabled`` value
without further comparison.
#. If a previous record exists, compares its ``enabled`` value with the saved one to
determine whether an effective transition occurred:

- ``False → True``: triggers a **forward migration** (Legacy → openedx-authz)
- ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy)
- No change: the handler does nothing. No tracking record is created and no migration runs.

#. Determines the scope (course or organization) from the model being saved.
#. Calls the utility function synchronously with the migration parameters.

2. Migration Tracking Model
---------------------------

A new model is introduced to track the lifecycle of each migration operation:

.. code:: python

class AuthzCourseAuthoringMigrationRun(models.Model):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that this model is scoped per course or organization, correct?
If the trigger comes from WaffleFlagOrgOverrideModel, how would the data be stored? Would it include all course IDs for the organization, or is storing just the organization info sufficient?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that the metadata field stores information about each migration that has been performed, including the scope (course_id).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Thanks,

migration_type = models.CharField(max_length=20) # forward / rollback
scope_type = models.CharField(max_length=20) # course / org
scope_key = models.CharField(max_length=255)
status = models.CharField(max_length=20) # running, completed, partial_success, failed, skipped
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
completed_at = models.DateTimeField(null=True, blank=True)
metadata = models.JSONField(default=dict)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this include things like the traceback of any errors that occurred? I think that would probably be the most valuable thing to capture.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the idea. The Migration Outcome Semantics section mentions that the metadata field will contain those details.


This model is registered in Django Admin so users can inspect migration history and
diagnose failures without needing to access logs directly.

A higher-level orchestration layer (separate from the existing utility functions) will be
responsible for creating and updating ``AuthzCourseAuthoringMigrationRun`` records. This
layer wraps the core migration logic, ensuring that lifecycle tracking (opening a
``running`` record, handling errors, and writing the final status) is applied consistently
regardless of whether the migration is triggered by the signal handler or a management
command.

Migration Outcome Semantics
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``status`` field reflects the precise outcome of each run. The possible values are:

- ``running``: the migration is actively executing.
- ``completed``: all records were migrated successfully. The ``metadata`` field contains the
details about the successful migrations.
- ``partial_success``: the migration process ran to completion, but one or more individual
records failed and were skipped. The ``metadata`` field contains details about the
failures and successfully migrated records.
- ``failed``: a critical error prevented the migration from completing (e.g., an unhandled
exception or infrastructure problem). The ``metadata`` field contains the exception details.
- ``skipped``: the migration was not attempted because another run for the same scope was
already active.

3. Concurrency Control
----------------------

To prevent overlapping migrations on the same scope, the tracking model enforces a
conditional ``UniqueConstraint`` on ``(scope_type, scope_key)`` filtered to
``status="running"``. This guarantees that no second active migration record can be
inserted for the same scope regardless of how many processes attempt to do so concurrently.
Any attempt raises an ``IntegrityError``, which the caller handles by recording a
``skipped`` run and aborting.

.. code:: python

class Meta:
constraints = [
models.UniqueConstraint(
fields=["scope_type", "scope_key"],
condition=models.Q(status="running"),
name="unique_active_migration_per_scope",
)
]

4. Execution Flow
-----------------

1. The user changes the ``authz.enable_course_authoring`` flag for a course or
organization and saves the record. A new row is created in the override table.
2. The ``post_save`` handler queries the same override model for the previous record
(most recent row for the same scope, excluding the one just saved) to obtain the
previous ``enabled`` value.
3. The handler compares the previous value with the current ``enabled`` value. If no
effective change occurred, it does nothing.
4. If a transition is detected, the handler calls the utility function synchronously. The
function creates an ``AuthzCourseAuthoringMigrationRun`` record with
``status="running"`` (the database constraint prevents this if another run for the
same scope is already active) and executes the migration.
5. The record is updated to its final status (``completed``, ``partial_success``, ``failed``,
or ``skipped``) before the ``post_save`` handler returns.
6. The user can review the migration outcome via Django Admin on the
``AuthzCourseAuthoringMigrationRun`` model.

Consequences
************

Positive consequences
=====================

- **Full observability**: every migration run is recorded with its status, scope, and metadata
in the tracking model.
- **Concurrency-safe**: the database-level constraint prevents overlapping migrations on the same
scope, regardless of cache availability or worker failures.
- **No manual intervention required** for course-level or organization-level flag changes. Operators
or users who have opted in do not need to remember to run management commands.
- **Safe by default**: the opt-in guard flag ensures that automatic migration is never triggered
unexpectedly on instances where operators have not explicitly accepted the risks.

Negative consequences / risks
=============================

- **Global flag changes are not covered**: operators must still run management commands
manually when enabling or disabling the flag at the instance level. This is a deliberate
trade-off to avoid performance risks.
- **Blocks the request**: the migration runs synchronously inside the ``post_save`` signal,
so the HTTP request that triggered the flag change does not return until the migration
finishes. For large organization-level scopes this can cause noticeable latency or
timeouts. This is an accepted trade-off given that automatic migration is scoped to
course-level and organization-level overrides only (never global), and is opt-in.
- **Runtime execution trade-offs**: Unlike management commands typically executed during
maintenance windows, this migration runs in a live production environment as part of
normal system operation. This means it executes under concurrent load, with active
requests and database activity, which introduces variability in execution conditions.
This trade-off is inherent to enabling the feature flag to act as a real-time source
of truth. The design prioritizes consistency between flag state and permission data
over strictly controlled execution environments, while providing observability and
recovery mechanisms to mitigate operational risk.

Rejected Alternatives
*********************

**Using pre_save to trigger the migration**
The use of pre_save signals was discarded because they depend on a state transition
that has not yet been committed to the database. Operating before persistence assumes
a future-valid state that may not materialize. post_save was preferred to ensure
migration logic operates only on confirmed states.

**Asynchronous execution via Celery**
Given that automatic migration is scoped to course-level and organization-level
overrides where migration volumes are bounded, synchronous execution is simpler
and provides stronger consistency guarantees.

**Manual migration**
Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual
migration allows the system to end up in inconsistent states (e.g., flag enabled but data
still in the legacy system), resulting in an operationally fragile design.

**Automatic global migration**
Triggering automatic migration when the flag is changed globally (instance-wide) would
risk performance degradation on large instances. This was explicitly ruled out: global
migrations must remain operator-initiated via management commands.

References
**********

* `Automatic Migration Spike`_
* `ADR 0010`_
* `ADR 0011`_

.. _Automatic Migration Spike:
https://openedx.atlassian.net/wiki/spaces/OEPM/pages/6205112321/Spike+-+RBAC+AuthZ+-+Automatic+Role+Migration
.. _ADR 0010: 0010-course-authoring-flag.rst
.. _ADR 0011: 0011-course-authoring-migration-process.rst