Skip to content

feat: Adds tdbg migration script to power migrations via visibility#10495

Open
davidporter-id-au wants to merge 8 commits into
temporalio:mainfrom
davidporter-id-au:feature/adding-fast-migration
Open

feat: Adds tdbg migration script to power migrations via visibility#10495
davidporter-id-au wants to merge 8 commits into
temporalio:mainfrom
davidporter-id-au:feature/adding-fast-migration

Conversation

@davidporter-id-au

@davidporter-id-au davidporter-id-au commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

What changed?

Adds a cli set of flags to the tdbg schedule migrate subcommand, to fetch from visibility (or stdin) and perform a migration.

Using Visibility

tdbg schedule migrate --target chasm --from-visibility
[dry-run] would migrate default/verify-sa-1780095016 -> chasm
Dry-run: 1 schedule(s) would be migrated. Re-run with --execute to perform.
tdbg --namespace default schedule migrate --target workflow  --from-visibility --execute
migrated default/verify-sa-1780095016 -> workflow
Done: 1 migrated, 0 failed (of 1).

Using stdin:

./tdbg schedule migrate --target chasm --from-visibility --output-log target.log
[dry-run] would migrate default/schedule-6 -> chasm
[dry-run] would migrate default/schedule-5 -> chasm
[dry-run] would migrate default/schedule-4 -> chasm
...
[dry-run] would migrate default/schedule-8 -> chasm
[dry-run] would migrate default/schedule-7 -> chasm
[dry-run] would migrate default/YourScheduleId-5 -> chasm
Dry-run: 20 schedule(s) would be migrated. Re-run with --execute to perform.
➜  temporal git:(bugfix/handling-list-with-migration-problems) ✗ head target.log
{"timestamp":"2026-06-05T06:41:27Z","namespace":"default","schedule_id":"schedule-6","target":"chasm","status":"dry-run"}
{"timestamp":"2026-06-05T06:41:27Z","namespace":"default","schedule_id":"schedule-5","target":"chasm","status":"dry-run"}
{"timestamp":"2026-06-05T06:41:27Z","namespace":"default","schedule_id":"schedule-
....
8","target":"chasm","status":"dry-run"}
{"timestamp":"2026-06-05T06:41:27Z","namespace":"default","schedule_id":"YourScheduleId-7","target":"chasm","status":"dry-run"}
➜  temporal git:(bugfix/handling-list-with-migration-problems) ✗ cat target.log | ./tdbg schedule migrate --target chasm
[dry-run] would migrate default/schedule-6 -> chasm
[dry-run] would migrate default/schedule-5 -> chasm
[dry-run] would migrate default/schedule-4 -> chasm
[dry-run] would migrate default/schedule-3 -> chasm
[dry-run] would migrate default/schedule-2 -> chasm
...
[dry-run] would migrate default/schedule-8 -> chasm
[dry-run] would migrate default/schedule-7 -> chasm
[dry-run] would migrate default/YourScheduleId-5 -> chasm
Dry-run: 20 schedule(s) would be migrated. Re-run with --execute to perform.

Why?

Temporal is moving to a V2/Chasm based scheduler. Migrating schedules manually is easier when done via visibility, so this is a hopefully more ergonomic set of commands for such a migration:

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Not likely to be a lot relating to the tool itself, the migration path probably needs a bit more testing through.

@davidporter-id-au davidporter-id-au changed the title migration script feat: Adds tdbg migration script to power migrations via visibility Jun 5, 2026
@davidporter-id-au davidporter-id-au marked this pull request as ready for review June 5, 2026 06:52
@davidporter-id-au davidporter-id-au requested a review from a team as a code owner June 5, 2026 06:52
@chaptersix

Copy link
Copy Markdown
Contributor
The two full queries used are:

- Workflow-backed / V1:
  `TemporalNamespaceDivision = 'TemporalScheduler' AND ExecutionStatus = 'Running'`

- CHASM / V2:
  `TemporalNamespaceDivision = '403648407' AND ExecutionStatus = 'Running'`

`403648407` is the farmhash fingerprint32 of `scheduler.scheduler`.

@chaptersix chaptersix left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the tdbg bulk migration changes. Overall looks solid -- nice dry-run default and structured logging. A few issues around flag validation and the default visibility query direction.

Comment thread tools/tdbg/commands.go Outdated
if err != nil {
return fmt.Errorf("unable to open output log %q: %w", logPath, err)
}
defer func() { _ = logFile.Close() }()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ListWorkflowExecutions fails mid-pagination, workers may have already migrated some schedules, but summary.print is never called -- the user has no idea what already happened. Move summary.print(c, execute) before the listErr check so partial progress is always reported.

Comment thread tools/tdbg/commands.go
Comment thread tools/tdbg/commands.go
if len(nextPageToken) == 0 {
break
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--output-log is silently ignored in stdin mode -- the migrateSummary is created without wiring up logEnc from FlagOutputLog. Either handle it here too, or reject the flag combination.

@davidporter-id-au davidporter-id-au Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this now addressed with some validation at the top of the function

Comment thread tools/tdbg/commands.go
var listErr error
var nextPageToken []byte
for {
ctx, cancel := newContext(c)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--workers is also silently ignored in stdin mode (processes sequentially). Consider either applying the worker pool here too, or validating that --workers isn't set when using stdin.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now returns an error

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, not sure if this comment was from an earlier line, it seems misaligned? no big deal either way

&cli.BoolFlag{
Name: FlagExecute,
Usage: "Perform the migration. Without this flag, --from-visibility and stdin modes only print what they would do (dry-run)",
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--query is accepted without --from-visibility and silently ignored (both in --schedule-id and stdin modes). Worth validating the combination up front.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tihs is addressed now

os.Stdin = f
defer func() {
os.Stdin = orig
_ = f.Close()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: withStdin mutates os.Stdin which is process-global. Worth a comment that these tests must not use t.Parallel().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this one annoyed me. I was considering a larger refactor to fix but it got out of hand. Imho it'd be better to pass the os.ReadWriter interface down rather than just reaching in to Os.Stdin like this

@chaptersix

Copy link
Copy Markdown
Contributor
The two full queries used are:

- Workflow-backed / V1:
  `TemporalNamespaceDivision = 'TemporalScheduler' AND ExecutionStatus = 'Running'`

- CHASM / V2:
  `TemporalNamespaceDivision = '403648407' AND ExecutionStatus = 'Running'`

`403648407` is the farmhash fingerprint32 of `scheduler.scheduler`.

would be great to add this to the help text

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants