TS: integration with state-based replication#10138

Open
feiyang3cat wants to merge 2 commits into temporalio:main from feiyang3cat:ts/replicaiton-new
Conversation

@feiyang3cat feiyang3cat commented Apr 30, 2026

What changed?

  1. Active cluster: capture time-skipping state changes for replication using timeSkippingInfoUpdated, ms.isStateDirty(), cleanupTransaction, and TimeSkippingInfo.LastUpdateVersionedTransition.
  2. Active and passive clusters: idempotent timer-task regeneration.
  • Active: only when a skip transition was emitted in this transaction.
  • Passive (PartialRefresh, state-based replication): only when TimeSkippingInfo.LastUpdateVersionedTransition >= minVersionedTransition.
  3. Passive cluster: time-skipping timer-task executor.
    The standby's regenerated TimeSkippingTimerTask fires through executeTimeSkippingTimerTask. If the associated FastForwardInfo is still active, the standby awaits the replicated transition rather than driving it.
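
To make the regeneration gates in item 2 concrete, here is a minimal, self-contained sketch. It is not the code in this PR: the `VersionedTransition` struct, `compareVT`, `shouldRegenerateOnActive`, and `shouldRegenerateOnPassive` are hypothetical stand-ins, and the ordering (failover version first, then transition count) is an assumption about how versioned transitions compare.

```go
package main

import "fmt"

// VersionedTransition is a simplified, hypothetical stand-in for the
// server's versioned transition record.
type VersionedTransition struct {
	NamespaceFailoverVersion int64
	TransitionCount          int64
}

// compareVT returns -1, 0, or 1 when a is older than, equal to, or newer
// than b. Assumption: failover version is compared first, then count.
func compareVT(a, b VersionedTransition) int {
	if a.NamespaceFailoverVersion != b.NamespaceFailoverVersion {
		if a.NamespaceFailoverVersion < b.NamespaceFailoverVersion {
			return -1
		}
		return 1
	}
	if a.TransitionCount != b.TransitionCount {
		if a.TransitionCount < b.TransitionCount {
			return -1
		}
		return 1
	}
	return 0
}

// shouldRegenerateOnActive: the active side regenerates the timer task
// only when a skip transition was emitted in the current transaction.
func shouldRegenerateOnActive(skipEmittedThisTxn bool) bool {
	return skipEmittedThisTxn
}

// shouldRegenerateOnPassive: during a partial refresh, the passive side
// regenerates only when the last time-skipping update is at or after
// the minimum versioned transition being refreshed.
func shouldRegenerateOnPassive(lastUpdate *VersionedTransition, minVT VersionedTransition) bool {
	if lastUpdate == nil {
		return false // no skip has ever been recorded
	}
	return compareVT(*lastUpdate, minVT) >= 0
}

func main() {
	last := &VersionedTransition{NamespaceFailoverVersion: 2, TransitionCount: 7}
	minVT := VersionedTransition{NamespaceFailoverVersion: 2, TransitionCount: 5}
	fmt.Println(shouldRegenerateOnActive(true), shouldRegenerateOnPassive(last, minVT))
}
```

Because each gate depends only on recorded state, evaluating it again for the same transition gives the same answer, which is the property the idempotent regeneration relies on.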

Why?

The goal is to make sure time skipping works correctly during failovers with state-based replication.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

@feiyang3cat feiyang3cat requested review from a team as code owners April 30, 2026 22:36
@feiyang3cat feiyang3cat changed the title timeskipping support replication wip: fit the time-skipping feature in replication Apr 30, 2026
@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 6 times, most recently from c9e46d3 to 384f53a Compare May 1, 2026 00:57
Comment thread service/history/workflow/task_generator.go
@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 2 times, most recently from 2880ec6 to 8977071 Compare May 1, 2026 01:41
Comment thread service/history/workflow/task_refresher.go Outdated
@feiyang3cat feiyang3cat changed the title wip: fit the time-skipping feature in replication Integrate the time-skipping feature into the replication process May 1, 2026
@feiyang3cat feiyang3cat changed the title Integrate the time-skipping feature into the replication process wip: Integrate the time-skipping feature into the replication process May 1, 2026
Comment thread service/history/workflow/mutable_state_impl.go
@feiyang3cat feiyang3cat changed the title wip: Integrate the time-skipping feature into the replication process Integrate the time-skipping feature into the replication process May 1, 2026
@feiyang3cat feiyang3cat changed the title Integrate the time-skipping feature into the replication process wip:Integrate the time-skipping feature into the replication process May 1, 2026
@feiyang3cat feiyang3cat changed the title wip:Integrate the time-skipping feature into the replication process draft:Integrate the time-skipping feature into the replication process May 1, 2026
@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 2 times, most recently from 91c75c9 to 77a5b4f Compare May 4, 2026 17:48
@feiyang3cat feiyang3cat changed the title draft:Integrate the time-skipping feature into the replication process time-skipping integration with cross-cluster replication May 4, 2026
@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch from 77a5b4f to 273f09c Compare May 4, 2026 17:52
Comment thread service/history/workflow/mutable_state_impl.go Outdated
Comment thread service/history/timer_queue_standby_task_executor.go
Comment thread service/history/workflow/mutable_state_impl.go
@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 3 times, most recently from 236252a to 3249657 Compare May 4, 2026 22:52
@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 7 times, most recently from 6636093 to aae88b6 Compare May 17, 2026 21:44
}
tsi := mutableState.GetExecutionInfo().GetTimeSkippingInfo()
cb := tsi.GetCurrentElapsedDurationBound()

@feiyang3cat feiyang3cat May 17, 2026

Question for CGS team:

I'm not sure whether, in complicated failover cases (A fails over to a slow B, and then A comes back up as standby), there may be legacy timer tasks with event IDs larger than that of the current mutable state. Even if this happens, loadMutableStateForTimerTask will filter these tasks, and they will never reach executeTimeSkippingTimerTask:

	// Validation based on eventID is not good enough as certain operation does not generate events.
	// For example, scheduling transient workflow task, or starting activities that have retry policy.
	//
	// Some tasks don't have an associated eventID (CHASM tasks).
	eventID, eidOk := getEventID(task)
	if !eidOk || eventID < mutableState.GetNextEventID() {
		return mutableState, nil
	}

I already added the time-skipping timer task logic to getEventID in previous PRs.
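
As an illustration of the check quoted above, the following sketch replays the failover scenario with hypothetical types (`timerTask` and `passesEventIDCheck` are not names from the codebase). It models only the quoted condition; what happens to a task that fails the check is not shown in the snippet, so the sketch just reports the result.

```go
package main

import "fmt"

// timerTask is a hypothetical stand-in for a queued timer task.
// HasEventID is false for tasks with no associated event ID.
type timerTask struct {
	EventID    int64
	HasEventID bool
}

// passesEventIDCheck mirrors the quoted condition: a task proceeds when
// it has no event ID, or when its event ID is below the mutable state's
// next event ID.
func passesEventIDCheck(task timerTask, nextEventID int64) bool {
	return !task.HasEventID || task.EventID < nextEventID
}

func main() {
	// The standby has caught up to next event ID 5; a legacy task from
	// the previous active period carries event ID 8.
	legacy := timerTask{EventID: 8, HasEventID: true}
	current := timerTask{EventID: 3, HasEventID: true}
	fmt.Println(passesEventIDCheck(legacy, 5), passesEventIDCheck(current, 5))
}
```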

Contributor

Can the source event ID in the skipping info be greater than the timer task's event ID? Does that mean the timer task is invalid?

@feiyang3cat feiyang3cat May 19, 2026

  1. Yes, it can. If the user sets a new duration limit later, the event ID may be different, and the old timer should be acked silently, so I return nil, nil for this case. Does this make sense? An alternative is to look up the overwritten source event ID in the events, or to keep the overwritten IDs in a list in the mutable state. I was not sure what the benefit would be, so I used the current simple logic.

  2. Under the current replication mechanism, could the following happen in an extreme edge case?

  • Active A (e.g., next event ID = 10), passive B (caught up to 5).
  • Failover -> B becomes active (e.g., next event ID = 5) and A becomes passive, so A uses next event ID = 5 again.
  • Failover back -> A becomes active and proceeds to event ID = 10 with a new timer, and an older timer from when A was first active fires and matches "event ID 10", but on a totally different history branch?
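
The silent-ack behavior described in point 1 can be sketched as follows. This is an illustration under assumptions, not the executor in this PR: `timeSkippingInfo`, `timeSkippingTimerTask`, `outcome`, and `executeTimer` are hypothetical names, and staleness is modeled as a plain mismatch between the task's event ID and the recorded source event ID.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the mutable-state record and
// the timer task.
type timeSkippingInfo struct{ SourceEventID int64 }
type timeSkippingTimerTask struct{ EventID int64 }

// outcome is a placeholder result for a timer that was actually handled.
type outcome struct{ Fired bool }

// executeTimer acks a superseded timer silently by returning (nil, nil):
// no result and no error, so the queue drops the task without retrying.
func executeTimer(task timeSkippingTimerTask, info *timeSkippingInfo) (*outcome, error) {
	if info == nil || info.SourceEventID != task.EventID {
		// The user set a new duration limit after this timer was
		// created, so the timer no longer matches the current state.
		return nil, nil
	}
	return &outcome{Fired: true}, nil
}

func main() {
	info := &timeSkippingInfo{SourceEventID: 12}
	stale, _ := executeTimer(timeSkippingTimerTask{EventID: 9}, info)
	live, _ := executeTimer(timeSkippingTimerTask{EventID: 12}, info)
	fmt.Println(stale == nil, live != nil && live.Fired)
}
```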

@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 9 times, most recently from f44fc8a to c6f0f19 Compare May 18, 2026 16:53
@feiyang3cat feiyang3cat changed the title time-skipping integration with cross-cluster replication TS: integration with cross-cluster replication May 18, 2026
@feiyang3cat feiyang3cat changed the title TS: integration with cross-cluster replication TimeSkipping: integration with state-based replication May 18, 2026
@feiyang3cat feiyang3cat requested a review from yux0 May 18, 2026 17:39
@feiyang3cat feiyang3cat changed the title TimeSkipping: integration with state-based replication TS: integration with state-based replication Jun 2, 2026
return err
}

func (r *TaskRefresherImpl) refreshTasksForTimeSkipping(

Contributor Author

Refresh intentionally does not apply the idempotency check, so that tasks are regenerated when they have been lost.

@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 2 times, most recently from 36725fd to e7c6745 Compare June 17, 2026 04:22
@feiyang3cat feiyang3cat force-pushed the ts/replicaiton-new branch 3 times, most recently from 86d2687 to 1fb4f13 Compare June 17, 2026 05:01
Stamp TimeSkippingInfo.LastUpdateVersionedTransition on each skip and gate
PartialRefresh's timer re-stamp on it, instead of the TaskRegenerationStatus
enum. The stamp is a global fact, so it replicates via the generic ExecutionInfo
merge — dropping applyIncomingTimeSkippingInfo and the TimerRegenStatus constants.
3 participants