-
Notifications
You must be signed in to change notification settings - Fork 6
docs: add ADR for course authoring automatic migration #251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
818a444
944808e
eea181e
c55800c
72c52c7
7bd9194
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,233 @@ | ||
| 0013: Course Authoring - Automatic Migration Triggered by Course Authoring Flag | ||
| ############################################################################### | ||
|
|
||
| Status | ||
| ****** | ||
|
|
||
| **Draft** - *2026-04-09* | ||
|
|
||
| Context | ||
| ******* | ||
|
|
||
| The system is transitioning from the legacy permissions model (``CourseAccessRole``) | ||
| to the new openedx-authz system. | ||
|
|
||
| Currently, migrations between both systems are performed manually using Django management commands: | ||
|
|
||
| - ``authz_migrate_course_authoring`` (forward migration) | ||
| - ``authz_rollback_course_authoring`` (rollback migration) | ||
|
|
||
| In `ADR 0011`_ and `ADR 0010`_ it was established that migration must occur automatically when | ||
| the feature flag ``authz.enable_course_authoring`` changes state, but the definition of | ||
| the specific mechanism was deferred. This ADR addresses that gap. | ||
|
|
||
| The current manual approach presents the following risks: | ||
|
|
||
| - **Inconsistency**: If an operator enables or disables the flag without running the migration | ||
| command, the permission data in both systems will diverge. | ||
| - **No status tracking**: There is no visibility into whether a migration is in progress, | ||
| completed, or failed. | ||
| - **No concurrency protection**: Nothing prevents operators from running the migration command | ||
| multiple times simultaneously, which can lead to race conditions and data corruption. | ||
| - **No user feedback**: Operators have no way to know the result of a migration without | ||
| inspecting logs manually. | ||
|
|
||
| Decision | ||
| ******** | ||
|
|
||
| We will implement an automatic and asynchronous migration mechanism triggered by changes in the | ||
| ``authz.enable_course_authoring`` feature flag. The solution consists of: | ||
|
|
||
| #. Django signal handler to detect flag state changes. | ||
| #. Celery tasks to execute migrations asynchronously. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems like it might be overkill. Even if an org has 5000 courses, with 5 people with roles in each we should be able to batch those into a few manageable queries synchronously and preserve the atomicity of the flag change. This would give us both automatic rollback on failure and transaction-level locking for concurrency protection
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this risk is also related to this comment as well: https://github.com/openedx/openedx-authz/pull/251/changes#r3072539673 |
||
| #. A tracking model to record migration status and errors. | ||
| #. A locking mechanism to prevent concurrent migrations on the same scope. | ||
|
|
||
| .. note:: | ||
|
|
||
| **Scope Constraint** | ||
|
|
||
| Automatic migration will only trigger for **course-level** and **organization-level** flag | ||
| overrides, not for global (instance-wide) Waffle flag changes. The reason is that a global | ||
| flag change could affect a large number of courses simultaneously, introducing an unacceptable | ||
| performance risk. Global flag changes must be handled via management commands by operators | ||
| who explicitly accept the performance implications. | ||
|
|
||
| Operator Safety and Opt-in Design | ||
| ================================== | ||
|
|
||
| A concern was raised about the risks of triggering data migrations on a live instance. Data | ||
| migrations are typically executed under controlled conditions (e.g., during maintenance windows) | ||
| because any failure can leave the system in an invalid state. Triggering them automatically via | ||
| a feature flag toggle introduces additional risk: | ||
|
|
||
| - Django Admin access is sometimes granted to instructors or non-technical staff who may not | ||
| understand the implications of toggling the flag. | ||
| - A live instance may be processing requests concurrently, increasing the chance of partial | ||
| failures or inconsistent transient states. | ||
|
|
||
| To address this, the automatic migration mechanism will be **guarded by a Django setting**: | ||
|
|
||
| .. code:: python | ||
|
|
||
| ENABLE_AUTOMATIC_AUTHZ_COURSE_AUTHORING_MIGRATION = False | ||
|
|
||
| This setting: | ||
|
|
||
| - Is **disabled by default**. | ||
| - Must be explicitly set to ``True`` by a site operator who understands the migration risks. | ||
| - Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal | ||
| detects the flag change but does **not** dispatch the Celery task. The operator must then | ||
| run the migration manually using the existing management commands. | ||
|
|
||
| This design preserves the automated behavior for operators who opt in while keeping the system | ||
| safe for deployments where uncontrolled migrations are unacceptable. | ||
|
|
||
| Detailed Design | ||
| =============== | ||
|
|
||
| 1. Utility Function Updates | ||
| --------------------------- | ||
|
|
||
| The existing utility functions ``migrate_legacy_course_roles_to_authz`` and | ||
| ``migrate_authz_to_legacy_course_roles`` will be modified to incorporate the locking strategy | ||
|
mariajgrimaldi marked this conversation as resolved.
Outdated
|
||
| (see **Concurrency Control** below) and the tracking logic (see **Migration Tracking Model** below) | ||
| as integral steps of their execution. | ||
|
|
||
| This approach ensures that both the Celery task and the management commands go through the same | ||
| tracking and locking path. | ||
|
|
||
| 2. Migration Trigger (Django Signals) | ||
| ------------------------------------- | ||
|
|
||
| ``pre_save`` signal handlers are attached to ``WaffleFlagCourseOverrideModel`` and | ||
|
mariajgrimaldi marked this conversation as resolved.
Outdated
|
||
| ``WaffleFlagOrgOverrideModel``. When a save is detected for the ``authz.enable_course_authoring`` | ||
| flag, the handler: | ||
|
|
||
| #. Compares the previous and new flag state to determine the transition direction: | ||
|
|
||
| - ``False → True``: triggers a **forward migration** (Legacy → openedx-authz) | ||
| - ``True → False``: triggers a **rollback migration** (openedx-authz → Legacy) | ||
|
|
||
| #. Determines the scope (course or organization) from the model being saved. | ||
| #. Dispatches an asynchronous Celery task with the migration parameters. | ||
|
|
||
| .. note:: | ||
| If no effective change is detected (i.e., the flag state is the same as the previous state), | ||
| the signal handler does nothing. | ||
|
|
||
| 3. Migration Tracking Model | ||
| --------------------------- | ||
|
|
||
| A new model is introduced to track the lifecycle of each migration operation: | ||
|
|
||
| .. code:: python | ||
|
|
||
| class AuthzCourseAuthoringMigrationRun(models.Model): | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My understanding is that this model is scoped per course or organization, correct?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The idea is that the metadata field stores information about each migration that has been performed, including the scope (
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds good! Thanks, |
||
| migration_type = models.CharField(max_length=20) # forward / rollback | ||
| scope_type = models.CharField(max_length=20) # course / org | ||
| scope_key = models.CharField(max_length=255) | ||
| status = models.CharField(max_length=20) # pending, running, completed, skipped | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If it failed how would I get the exact log / error?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How big can the field be though?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, I was thinking that this field would only have the success/failure count, something like this: {
"successes": 10,
"errors": 5
}But maybe it would be useful to include the reason why X role failed? What do you think?" Regarding the size, I don't think there will be a large amount of data, considering that the migration is at the course or organization level.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This feature isn't for operators only, so users won't have access to the server logs to review the failures, so it makes sense to show them here for debugging or reporting. |
||
| created_at = models.DateTimeField(auto_now_add=True) | ||
| updated_at = models.DateTimeField(auto_now=True) | ||
| completed_at = models.DateTimeField(null=True, blank=True) | ||
| metadata = models.JSONField(default=dict) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would this include things like the traceback of any errors that occurred? I think that would probably be the most valuable thing to capture.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's the idea. The Migration Outcome Semantics section mentions that the metadata field will contain those details. |
||
|
|
||
| This model is registered in Django Admin so operators can inspect migration history and | ||
| diagnose failures without needing to access logs directly. | ||
|
|
||
| 4. Asynchronous Execution | ||
| ------------------------- | ||
|
|
||
| The Celery task acts strictly as a thin dispatcher. All core logic, including locking, | ||
| tracking, and migration execution, is implemented in the utility functions (see | ||
| **Utility Function Updates** above). | ||
|
|
||
| All database operations within the migration itself execute inside an atomic transaction. | ||
| If the migration fails, no data is deleted from either system, preserving consistency. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's considered failing in this case? Failures during the migration, or at least the errors caught, are skipped, and the migration continues. What kind of failures are we expecting here? Also, I don't think this migration happens in an atomic transaction cause some records can be migrated while others can fail so we might need more clarity on this
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You’re right that the current behavior is not strictly atomic. The migration functions operate in a best-effort manner, where individual record failures are caught and skipped, allowing the rest of the migration to continue. This means partial migrations are possible. To make this clearer, I think we should explicitly define migration outcomes as:
This would better reflect the actual behavior and improve observability. What do you think?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That works! |
||
|
|
||
| 5. Concurrency Control (Locking Strategy) | ||
| ----------------------------------------- | ||
|
|
||
| To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a | ||
| distributed lock is implemented using the Django cache backend (Redis): | ||
|
|
||
| .. code:: python | ||
|
|
||
| lock_key = f"authz_migration:{scope_type}:{scope_key}" | ||
|
|
||
| The lock is acquired using ``cache.add()``, which is an atomic operation. The default TTL | ||
|
rodmgwgu marked this conversation as resolved.
Outdated
|
||
| is **1 hour**. If a lock already exists for the given scope, the migration is skipped | ||
| and a new tracking record is created with that status. This ensures that only one | ||
| migration runs at a time for the same scope. | ||
|
|
||
| 6. Execution Flow | ||
| ------------------ | ||
|
|
||
| 1. An operator changes the ``authz.enable_course_authoring`` flag for a course or | ||
| organization via Django Admin or a management command. | ||
| 2. The ``pre_save`` signal handler detects the state transition. | ||
| 3. A Celery task is dispatched asynchronously. | ||
| 4. The task calls the utility function, which acquires the lock, creates and updates the | ||
| ``AuthzCourseAuthoringMigrationRun`` record, and executes the migration. | ||
| 5. The operator can check the migration status via Django Admin on the ``AuthzCourseAuthoringMigrationRun`` | ||
| model. | ||
|
|
||
| Consequences | ||
| ************ | ||
|
|
||
| Positive consequences | ||
| ===================== | ||
|
|
||
| - **Migration is decoupled from the request cycle**: the flag change returns immediately and | ||
| migration happens in the background. | ||
| - **Full observability**: every migration run is recorded with its status, scope, and metadata | ||
| in the tracking model. | ||
| - **Concurrency-safe**: the lock strategy prevents overlapping migrations on the same scope. | ||
| - **No manual intervention required**: for course-level or organization-level flag changes. Operators | ||
| who have opted in do not need to remember to run management commands. | ||
| - **Safe by default**: the opt-in guard flag ensures that automatic migration is never triggered | ||
| unexpectedly on instances where operators have not explicitly accepted the risks. | ||
|
|
||
| Negative consequences / risks | ||
| ============================== | ||
|
mariajgrimaldi marked this conversation as resolved.
Outdated
|
||
|
|
||
| - **Global flag changes are not covered**: operators must still run management commands | ||
| manually when enabling or disabling the flag at the instance level. This is a deliberate | ||
| trade-off to avoid performance risks. | ||
| - **Celery dependency**: the system now requires a functioning Celery worker for automatic | ||
| migration. If workers are down, migrations will be queued but not executed until workers | ||
| recover. | ||
| - **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible | ||
| for very large organizations), the lock will expire and a new migration for the same scope | ||
| could start concurrently for the same scope. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What if the cache backend fails and two flag changes run at the same time? The migration itself is not atomic so we could end up in an invalid state
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mmm, great point. I was thinking, and to address this, I’m considering enforcing concurrency at the database level using a This would ensure that only one migration per scope can be active at any time, regardless of cache availability, and would eliminate the risk of concurrent executions even in failure scenarios. What do you think? |
||
|
|
||
| Rejected Alternatives | ||
| ********************* | ||
|
|
||
| **Synchronous execution in the signal handler** | ||
| Executing the migration directly inside the ``pre_save`` signal would block the HTTP | ||
| request that triggered the flag change, leading to timeouts for large scopes and poor | ||
| operator experience. | ||
|
|
||
| **Manual migration** | ||
| Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual | ||
| migration allows the system to end up in inconsistent states (e.g., flag enabled but data | ||
| still in the legacy system), resulting in an operationally fragile design. | ||
|
|
||
| **Automatic global migration** | ||
| Triggering automatic migration when the flag is changed globally (instance-wide) would | ||
| risk performance degradation on large instances. This was explicitly ruled out: global | ||
| migrations must remain operator-initiated via management commands. | ||
|
|
||
| References | ||
| ********** | ||
|
|
||
| * `Automatic Migration Spike`_ | ||
| * `ADR 0010`_ | ||
| * `ADR 0011`_ | ||
|
|
||
| .. _Automatic Migration Spike: | ||
| https://openedx.atlassian.net/wiki/spaces/OEPM/pages/6205112321/Spike+-+RBAC+AuthZ+-+Automatic+Role+Migration | ||
| .. _ADR 0010: 0010-course-authoring-flag.rst | ||
| .. _ADR 0011: 0011-course-authoring-migration-process.rst | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure I understand the need for a different mechanism than a 1-time migration or a more controlled migration that on-demand job operators could use. I'm thinking of a mechanism like in forums V2, when the storage backend is changed:
Why is this not an acceptable solution, given that it directly impacts operators and that they can manage this kind of controlled migration better than in a live environment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Verawood this will have to be a course-by-course or org-by-org decision, I think?
I think for Willow this flag would go away and we would rely solely on an instance-wide init.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key difference is that the
authz.enable_course_authoringflag is intended to be a runtime source of truth, not just an initialization setting.Unlike one-time migrations (like Forums V2), this flag can change dynamically (course/org level) and is expected to immediately reflect the system’s state. Therefore, a manual or operator-driven migration is not sufficient, as it introduces inconsistencies, depends on manual coordination, and does not guarantee that the system aligns with the current flag state.
This is my current thinking, however, we could also consider not implementing automatic migration and relying only on management commands. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are strong arguments for me to understand the need for the automated sync. Can we add these to the ADR context? Thanks!