docs/decisions/0013-course-authoring-automatic-migration.rst

0013: Course Authoring - Automatic Migration Triggered by Course Authoring Flag
###############################################################################

Status
******

**Draft** - *2026-04-09*

Context
*******

The system is transitioning from the legacy permissions model (``CourseAccessRole``)
to the new openedx-authz system.

Currently, migrations between both systems are performed manually using Django management commands:

- ``authz_migrate_course_authoring`` (forward migration)
- ``authz_rollback_course_authoring`` (rollback migration)
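For reference, these are ordinary Django management commands; a hedged sketch of invoking them
programmatically follows (no arguments are shown because the commands' options are not specified
in this ADR):

.. code:: python

    from django.core.management import call_command

    # Forward migration: Legacy -> openedx-authz.
    call_command("authz_migrate_course_authoring")

    # Rollback migration: openedx-authz -> Legacy.
    call_command("authz_rollback_course_authoring")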

In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when
the feature flag ``authz.enable_course_authoring`` changes state, but both ADRs deferred the
definition of the specific mechanism. This ADR addresses that gap.

The current manual approach presents the following risks:

- **Inconsistency**: If an operator enables or disables the flag without running the migration
  command, the permission data in both systems will diverge.
- **No status tracking**: There is no visibility into whether a migration is in progress,
  completed, or failed.
- **No concurrency protection**: Multiple concurrent flag changes can trigger overlapping
  migrations, leading to race conditions and data corruption.
- **No user feedback**: Operators have no way to know the result of a migration without
  inspecting logs manually.

**Review thread**

*Member:* I want to make sure I understand the need for a different mechanism than a one-time
migration, or a more controlled migration that operators could run as an on-demand job. I'm
thinking of a mechanism like in Forums V2, when the storage backend is changed:

1. If the flag is on during initialization (Tutor), the migration is executed.
2. If not, not much happens.
3. If there's a failure during the migration, it is automatically rolled back.

Why is this not an acceptable solution, given that it directly impacts operators and that they
can manage this kind of controlled migration better than in a live environment?

*Contributor:* For Verawood this will have to be a course-by-course or org-by-org decision, I think?

- For a lot of instances there are many people with Django admin access who can make those
  changes, but few with the ability to run management commands.
- Having the changes happen in sync with the course/org flag overrides prevents outages where
  permissions don't exist in the system being switched to until the management command is run,
  which might need to be done by a different team on a different schedule.
- If the migration fails when the flag is being set, the rollback of both the flag and the
  migration would be automatic, so there wouldn't be a time where permissions simply don't work
  for a course or org.
- If we want to have a separate path for instance-wide migration at init time, that might make
  sense, but there is still a reasonable chance that some bugs will make it necessary for courses
  or orgs to override the instance default, in which case we would still need this work.

I think for Willow this flag would go away and we would rely solely on an instance-wide init.

*Contributor (author):* The key difference is that the ``authz.enable_course_authoring`` flag is
intended to be a runtime source of truth, not just an initialization setting. Unlike one-time
migrations (like Forums V2), this flag can change dynamically (at the course/org level) and is
expected to immediately reflect the system's state. Therefore, a manual or operator-driven
migration is not sufficient, as it introduces inconsistencies, depends on manual coordination,
and does not guarantee that the system aligns with the current flag state. This is my current
thinking; however, we could also consider not implementing automatic migration and relying only
on management commands. What do you think?

*Member:* These are strong arguments for me to understand the need for the automated sync. Can we
add these to the ADR context? Thanks!

Decision
********

We will implement an automatic and asynchronous migration mechanism triggered by changes in the
``authz.enable_course_authoring`` feature flag. The solution consists of:

#. A Django signal handler to detect flag state changes.
#. Celery tasks to execute migrations asynchronously.
#. A tracking model to record migration status and errors.
#. A locking mechanism to prevent concurrent migrations on the same scope.

**Review thread** (on the Celery tasks item)

*Contributor (@bmtcril, Apr 10, 2026):* This seems like it might be overkill. Even if an org has
5,000 courses, with 5 people with roles in each, we should be able to batch those into a few
manageable queries synchronously and preserve the atomicity of the flag change. This would give
us both automatic rollback on failure and transaction-level locking for concurrency protection.

*Member:* I think this risk is also related to this comment as well:
https://github.com/openedx/openedx-authz/pull/251/changes#r3072539673

.. note::

    **Scope Constraint**

    Automatic migration will only trigger for **course-level** and **organization-level** flag
    overrides, not for global (instance-wide) Waffle flag changes. The reason is that a global
    flag change could affect a large number of courses simultaneously, introducing an unacceptable
    performance risk. Global flag changes must be handled via management commands by operators
    who explicitly accept the performance implications.

Operator Safety and Opt-in Design
==================================

A concern was raised about the risks of triggering data migrations on a live instance. Data
migrations are typically executed under controlled conditions (e.g., during maintenance windows)
because any failure can leave the system in an invalid state. Triggering them automatically via
a feature flag toggle introduces additional risk:

- Django Admin access is sometimes granted to instructors or non-technical staff who may not
  understand the implications of toggling the flag.
- A live instance may be processing requests concurrently, increasing the chance of partial
  failures or inconsistent transient states.

To address this, the automatic migration mechanism will be **guarded by a Django setting**:

.. code:: python

    ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION = False

This setting:

- Is **disabled by default**.
- Must be explicitly set to ``True`` by a site operator who understands the migration risks.
- Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal
  handler detects the flag change but does **not** dispatch the Celery task. The operator must
  then run the migration manually using the existing management commands.

This design preserves the automated behavior for operators who opt in while keeping the system
safe for deployments where uncontrolled migrations are unacceptable.

Detailed Design
===============

1. Utility Function Updates
---------------------------

The existing utility functions ``migrate_legacy_course_roles_to_authz`` and
``migrate_authz_to_legacy_course_roles`` will be modified to incorporate the locking strategy
(see **Concurrency Control** below) and the tracking logic (see **Migration Tracking Model** below)
as integral steps of their execution.

This approach ensures that both the Celery task and the management commands go through the same
tracking and locking path.
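A minimal sketch of what that shared path could look like, assuming the hypothetical
``scope_lock`` helper sketched under **Concurrency Control** below, the tracking model from
**Migration Tracking Model** below, and a hypothetical internal ``_copy_roles`` helper that does
the actual role copying; all names besides the two utility functions are assumptions, not the
final API:

.. code:: python

    from django.utils import timezone


    def migrate_legacy_course_roles_to_authz(scope_type, scope_key):
        """Forward migration: Legacy -> openedx-authz (signature is an assumption)."""
        with scope_lock(scope_type, scope_key) as acquired:
            if not acquired:
                # Another migration is active for this scope: record and skip.
                CourseAuthoringMigrationRun.objects.create(
                    migration_type="forward", scope_type=scope_type,
                    scope_key=scope_key, status="skipped",
                )
                return
            run = CourseAuthoringMigrationRun.objects.create(
                migration_type="forward", scope_type=scope_type,
                scope_key=scope_key, status="running",
            )
            # Best-effort copy: individual record failures are counted, not fatal.
            successes, errors = _copy_roles(scope_type, scope_key)
            run.status = "completed"
            run.completed_at = timezone.now()
            run.metadata = {"successes": successes, "errors": errors}
            run.save()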

2. Migration Trigger (Django Signals)
-------------------------------------

``pre_save`` signal handlers are attached to ``WaffleFlagCourseOverrideModel`` and
``WaffleFlagOrgOverrideModel``. When a save is detected for the ``authz.enable_course_authoring``
flag, the handler:

#. Compares the previous and new flag state to determine the transition direction:

   - ``False → True``: triggers a **forward migration** (Legacy → openedx-authz)
   - ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy)

#. Determines the scope (course or organization) from the model being saved.
#. Dispatches an asynchronous Celery task with the migration parameters.

.. note::

    If no effective change is detected (i.e., the flag state is the same as the previous state),
    the signal handler does nothing.
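A simplified sketch of the course-level handler (the org-level handler would be analogous). The
field names on the override model (``waffle_flag``, ``course_id``, ``enabled``), the previous-state
lookup, and the task names are illustrative assumptions; the real override models are
configuration models, so determining the previous state may differ in practice:

.. code:: python

    from django.conf import settings
    from django.db.models.signals import pre_save
    from django.dispatch import receiver

    # WaffleFlagCourseOverrideModel comes from edx-platform's waffle_utils
    # (import path assumed); the tasks are the hypothetical dispatchers from
    # the Asynchronous Execution section below.


    @receiver(pre_save, sender=WaffleFlagCourseOverrideModel)
    def trigger_course_authoring_migration(sender, instance, **kwargs):
        if instance.waffle_flag != "authz.enable_course_authoring":
            return
        if not settings.ENABLE_AUTOMATIC_COURSE_AUTHORING_MIGRATION:
            return  # opt-in guard: change detected, but nothing is dispatched
        previous = (
            sender.objects
            .filter(waffle_flag=instance.waffle_flag, course_id=instance.course_id)
            .order_by("-pk")
            .first()
        )
        old_state = previous.enabled if previous else False
        if old_state == instance.enabled:
            return  # no effective change: do nothing
        task = forward_migration_task if instance.enabled else rollback_migration_task
        task.delay(scope_type="course", scope_key=str(instance.course_id))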

3. Migration Tracking Model
---------------------------

A new model is introduced to track the lifecycle of each migration operation:

.. code:: python

    from django.db import models


    class CourseAuthoringMigrationRun(models.Model):
        migration_type = models.CharField(max_length=20)  # forward / rollback
        scope_type = models.CharField(max_length=20)      # course / org
        scope_key = models.CharField(max_length=255)
        status = models.CharField(max_length=20)          # pending, running, completed, skipped
        created_at = models.DateTimeField(auto_now_add=True)
        updated_at = models.DateTimeField(auto_now=True)
        completed_at = models.DateTimeField(null=True, blank=True)
        metadata = models.JSONField(default=dict)

**Review thread** (on the ``status`` field)

*Member:* If it failed, how would I get the exact log / error?

*Contributor (author):* A ``failed`` state isn't included here because some roles might be added
successfully while others fail. Instead, the count of successful and failed roles will be stored
in the ``metadata`` field of each tracking record.

*Member:* How big can the field be, though?

*Contributor (author):* Actually, I was thinking that this field would only have the
success/failure count, something like this::

    {
      "successes": 10,
      "errors": 5
    }

But maybe it would be useful to include the reason why a given role failed? What do you think?
Regarding the size, I don't think there will be a large amount of data, considering that the
migration is at the course or organization level.

*Member:* This feature isn't for operators only, so users won't have access to the server logs to
review the failures, so it makes sense to show them here for debugging or reporting.

**Review thread** (on the ``metadata`` field)

*Contributor:* Would this include things like the traceback of any errors that occurred? I think
that would probably be the most valuable thing to capture.

*Contributor (author):* Yes, that's the idea. The Migration Outcome Semantics section mentions
that the ``metadata`` field will contain those details.


This model is registered in Django Admin so operators can inspect migration history and
diagnose failures without needing to access logs directly.
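A hedged sketch of that Admin registration; the displayed fields and filters are illustrative
choices, not a spec:

.. code:: python

    from django.contrib import admin


    @admin.register(CourseAuthoringMigrationRun)
    class CourseAuthoringMigrationRunAdmin(admin.ModelAdmin):
        # Columns chosen so operators can scan migration history per scope at a glance.
        list_display = ("migration_type", "scope_type", "scope_key",
                        "status", "created_at", "completed_at")
        list_filter = ("migration_type", "scope_type", "status")
        search_fields = ("scope_key",)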

4. Asynchronous Execution
-------------------------

The Celery task acts strictly as a thin dispatcher. All core logic, including locking,
tracking, and migration execution, is implemented in the utility functions (see
**Utility Function Updates** above).
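For illustration, the task could be as small as the following sketch; the task name and signature
are assumptions:

.. code:: python

    from celery import shared_task


    @shared_task
    def forward_migration_task(scope_type, scope_key):
        # Thin dispatcher: locking, tracking, and the migration itself all
        # live in the utility function (see **Utility Function Updates**).
        migrate_legacy_course_roles_to_authz(scope_type, scope_key)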

All database operations within the migration itself execute inside an atomic transaction.
If the migration fails, no data is deleted from either system, preserving consistency.
**Review thread**

*Member:* What's considered failing in this case? Failures during the migration, or at least the
errors caught, are skipped, and the migration continues. What kind of failures are we expecting
here? Also, I don't think this migration happens in an atomic transaction, since some records can
be migrated while others can fail, so we might need more clarity on this.

*Contributor (author):* You're right that the current behavior is not strictly atomic. The
migration functions operate in a best-effort manner, where individual record failures are caught
and skipped, allowing the rest of the migration to continue. This means partial migrations are
possible. To make this clearer, I think we should explicitly define migration outcomes as:

- ``completed``: all records migrated successfully.
- ``partial_success``: some records failed, but the process completed.
- ``failed``: the migration could not complete due to a critical error.

This would better reflect the actual behavior and improve observability. What do you think?

*Member:* That works!


5. Concurrency Control (Locking Strategy)
-----------------------------------------

To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a
distributed lock is implemented using the Django cache backend (Redis):

.. code:: python

    lock_key = f"authz_migration:{scope_type}:{scope_key}"

The lock is acquired using ``cache.add()``, which is an atomic operation. The default TTL
is **1 hour**. If a lock already exists for the given scope, the migration is skipped
and a new tracking record is created with the ``skipped`` status. This ensures that only one
migration runs at a time for the same scope.
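A minimal sketch of the lock as a context manager, matching the hypothetical ``scope_lock``
helper referenced in **Utility Function Updates** above (the helper's name and shape are
assumptions):

.. code:: python

    from contextlib import contextmanager

    from django.core.cache import cache

    LOCK_TTL = 60 * 60  # 1 hour, matching the default TTL described above


    @contextmanager
    def scope_lock(scope_type, scope_key):
        lock_key = f"authz_migration:{scope_type}:{scope_key}"
        # cache.add() is atomic: it returns False if the key already exists.
        acquired = cache.add(lock_key, "locked", timeout=LOCK_TTL)
        try:
            yield acquired
        finally:
            if acquired:
                cache.delete(lock_key)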

6. Execution Flow
------------------

1. An operator changes the ``authz.enable_course_authoring`` flag for a course or
   organization via Django Admin or a management command.
2. The ``pre_save`` signal handler detects the state transition.
3. A Celery task is dispatched asynchronously.
4. The task calls the utility function, which acquires the lock, creates and updates the
   ``CourseAuthoringMigrationRun`` record, and executes the migration.
5. The operator can check the migration status via Django Admin on the
   ``CourseAuthoringMigrationRun`` model.

Consequences
************

Positive consequences
=====================

- **Migration is decoupled from the request cycle**: the flag change returns immediately and
  the migration happens in the background.
- **Full observability**: every migration run is recorded with its status, scope, and metadata
  in the tracking model.
- **Concurrency-safe**: the lock strategy prevents overlapping migrations on the same scope.
- **No manual intervention required** for course-level or organization-level flag changes:
  operators who have opted in do not need to remember to run management commands.
- **Safe by default**: the opt-in guard setting ensures that automatic migration is never
  triggered unexpectedly on instances where operators have not explicitly accepted the risks.

Negative consequences / risks
==============================

- **Global flag changes are not covered**: operators must still run management commands
  manually when enabling or disabling the flag at the instance level. This is a deliberate
  trade-off to avoid performance risks.
- **Celery dependency**: the system now requires a functioning Celery worker for automatic
  migration. If workers are down, migrations will be queued but not executed until workers
  recover.
- **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible
  for very large organizations), the lock will expire and a second migration for the same
  scope could start concurrently.
**Review thread**

*Member:* What if the cache backend fails and two flag changes run at the same time? The
migration itself is not atomic, so we could end up in an invalid state.

*Contributor (author):* Great point. To address this, I'm considering enforcing concurrency at
the database level using a ``UniqueConstraint`` on ``(scope_type, scope_key)`` for active
migrations (e.g., the ``pending`` and ``running`` states). This would ensure that only one
migration per scope can be active at any time, regardless of cache availability, and would
eliminate the risk of concurrent executions even in failure scenarios. What do you think?
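A minimal sketch of the constraint floated in the thread above, placed inside the
``CourseAuthoringMigrationRun`` model definition; the constraint name and the set of "active"
statuses are assumptions:

.. code:: python

    from django.db import models


    class Meta:  # inside CourseAuthoringMigrationRun
        constraints = [
            models.UniqueConstraint(
                fields=["scope_type", "scope_key"],
                condition=models.Q(status__in=["pending", "running"]),
                name="unique_active_migration_per_scope",
            ),
        ]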


Rejected Alternatives
*********************

**Synchronous execution in the signal handler**
    Executing the migration directly inside the ``pre_save`` signal would block the HTTP
    request that triggered the flag change, leading to timeouts for large scopes and a poor
    operator experience.

**Manual migration**
    Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual
    migration allows the system to end up in inconsistent states (e.g., flag enabled but data
    still in the legacy system), resulting in an operationally fragile design.

**Automatic global migration**
    Triggering automatic migration when the flag is changed globally (instance-wide) would
    risk performance degradation on large instances. This was explicitly ruled out: global
    migrations must remain operator-initiated via management commands.

References
**********

* `Automatic Migration Spike`_
* `ADR 0010`_
* `ADR 0011`_

.. _Automatic Migration Spike:
   https://openedx.atlassian.net/wiki/spaces/OEPM/pages/6205112321/Spike+-+RBAC+AuthZ+-+Automatic+Role+Migration
.. _ADR 0010: 0010-course-authoring-flag.rst
.. _ADR 0011: 0011-course-authoring-migration-process.rst