Draft - 2026-04-09
The system is transitioning from the legacy permissions model (CourseAccessRole)
to the new openedx-authz system.
Currently, migrations between both systems are performed manually using Django management commands:
authz_migrate_course_authoring(forward migration)authz_rollback_course_authoring(rollback migration)
In ADR 0011 and ADR 0010 it was established that migration must occur automatically when
the feature flag authz.enable_course_authoring changes state, but the definition of
the specific mechanism was deferred. This ADR addresses that gap.
The current manual approach presents the following risks:
- Inconsistency: If an operator enables or disables the flag without running the migration command, the permission data in both systems will diverge.
- No status tracking: There is no visibility into whether a migration is in progress, completed, or failed.
- No concurrency protection: Nothing prevents operators from running the migration command multiple times simultaneously, which can lead to race conditions and data corruption.
- No user feedback: Operators have no way to know the result of a migration without inspecting logs manually.
We will implement an automatic and asynchronous migration mechanism triggered by changes in the
authz.enable_course_authoring feature flag. The solution consists of:
- Django signal handler to detect flag state changes.
- Celery tasks to execute migrations asynchronously.
- A tracking model to record migration status and errors.
- A locking mechanism to prevent concurrent migrations on the same scope.
Note
Scope Constraint
Automatic migration will only trigger for course-level and organization-level flag overrides, not for global (instance-wide) Waffle flag changes. The reason is that a global flag change could affect a large number of courses simultaneously, introducing an unacceptable performance risk. Global flag changes must be handled via management commands by operators who explicitly accept the performance implications.
A concern was raised about the risks of triggering data migrations on a live instance. Data migrations are typically executed under controlled conditions (e.g., during maintenance windows) because any failure can leave the system in an invalid state. Triggering them automatically via a feature flag toggle introduces additional risk:
- Django Admin access is sometimes granted to instructors or non-technical staff who may not understand the implications of toggling the flag.
- A live instance may be processing requests concurrently, increasing the chance of partial failures or inconsistent transient states.
To address this, the automatic migration mechanism will be guarded by a Django setting:
ENABLE_AUTOMATIC_AUTHZ_COURSE_AUTHORING_MIGRATION = FalseThis setting:
- Is disabled by default.
- Must be explicitly set to
Trueby a site operator who understands the migration risks. - Acts as a prerequisite check inside the signal handler: if it is not enabled, the signal detects the flag change but does not dispatch the Celery task. The operator must then run the migration manually using the existing management commands.
This design preserves the automated behavior for operators who opt in while keeping the system safe for deployments where uncontrolled migrations are unacceptable.
The existing utility functions migrate_legacy_course_roles_to_authz and
migrate_authz_to_legacy_course_roles will be modified to incorporate the locking strategy
(see Concurrency Control below) and the tracking logic (see Migration Tracking Model below)
as integral steps of their execution.
This approach ensures that both the Celery task and the management commands go through the same tracking and locking path.
pre_save signal handlers are attached to WaffleFlagCourseOverrideModel and
WaffleFlagOrgOverrideModel. When a save is detected for the authz.enable_course_authoring
flag, the handler:
- Compares the previous and new flag state to determine the transition direction:
False → True: triggers a forward migration (Legacy → openedx-authz)True → False: triggers a rollback migration (openedx-authz → Legacy)
- Determines the scope (course or organization) from the model being saved.
- Dispatches an asynchronous Celery task with the migration parameters.
Note
If no effective change is detected (i.e., the flag state is the same as the previous state), the signal handler does nothing.
A new model is introduced to track the lifecycle of each migration operation:
class AuthzCourseAuthoringMigrationRun(models.Model):
migration_type = models.CharField(max_length=20) # forward / rollback
scope_type = models.CharField(max_length=20) # course / org
scope_key = models.CharField(max_length=255)
status = models.CharField(max_length=20) # pending, running, completed, skipped
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
completed_at = models.DateTimeField(null=True, blank=True)
metadata = models.JSONField(default=dict)This model is registered in Django Admin so operators can inspect migration history and diagnose failures without needing to access logs directly.
The Celery task acts strictly as a thin dispatcher. All core logic, including locking, tracking, and migration execution, is implemented in the utility functions (see Utility Function Updates above).
All database operations within the migration itself execute inside an atomic transaction. If the migration fails, no data is deleted from either system, preserving consistency.
To prevent race conditions caused by rapid or concurrent flag changes on the same scope, a distributed lock is implemented using the Django cache backend (Redis):
lock_key = f"authz_migration:{scope_type}:{scope_key}"The lock is acquired using cache.add(), which is an atomic operation. The default TTL
is 1 hour. If a lock already exists for the given scope, the migration is skipped
and a new tracking record is created with that status. This ensures that only one
migration runs at a time for the same scope.
- An operator changes the
authz.enable_course_authoringflag for a course or organization via Django Admin or a management command. - The
pre_savesignal handler detects the state transition. - A Celery task is dispatched asynchronously.
- The task calls the utility function, which acquires the lock, creates and updates the
AuthzCourseAuthoringMigrationRunrecord, and executes the migration. - The operator can check the migration status via Django Admin on the
AuthzCourseAuthoringMigrationRunmodel.
- Migration is decoupled from the request cycle: the flag change returns immediately and migration happens in the background.
- Full observability: every migration run is recorded with its status, scope, and metadata in the tracking model.
- Concurrency-safe: the lock strategy prevents overlapping migrations on the same scope.
- No manual intervention required: for course-level or organization-level flag changes. Operators who have opted in do not need to remember to run management commands.
- Safe by default: the opt-in guard flag ensures that automatic migration is never triggered unexpectedly on instances where operators have not explicitly accepted the risks.
- Global flag changes are not covered: operators must still run management commands manually when enabling or disabling the flag at the instance level. This is a deliberate trade-off to avoid performance risks.
- Celery dependency: the system now requires a functioning Celery worker for automatic migration. If workers are down, migrations will be queued but not executed until workers recover.
- Lock TTL edge cases: if a migration takes longer than 1 hour (unlikely but possible for very large organizations), the lock will expire and a new migration for the same scope could start concurrently for the same scope.
- Synchronous execution in the signal handler
- Executing the migration directly inside the
pre_savesignal would block the HTTP request that triggered the flag change, leading to timeouts for large scopes and poor operator experience. - Manual migration
- Error-prone, not scalable, and inconsistent. The flag is the source of truth, but manual migration allows the system to end up in inconsistent states (e.g., flag enabled but data still in the legacy system), resulting in an operationally fragile design.
- Automatic global migration
- Triggering automatic migration when the flag is changed globally (instance-wide) would risk performance degradation on large instances. This was explicitly ruled out: global migrations must remain operator-initiated via management commands.