TS: integration with state-based replication #10138
	}
	tsi := mutableState.GetExecutionInfo().GetTimeSkippingInfo()
	cb := tsi.GetCurrentElapsedDurationBound()
Question for the CGS team:
In complicated failover cases (A fails over to a slow B, and then A comes back up as standby), I am not sure whether there can be legacy timer tasks whose event IDs are larger than that of the current mutable state. Even if this happens, loadMutableStateForTimerTask filters these tasks out, so they never reach executeTimeSkippingTimerTask:
// Validation based on eventID is not good enough as certain operation does not generate events.
// For example, scheduling transient workflow task, or starting activities that have retry policy.
//
// Some tasks don't have an associated eventID (CHASM tasks).
eventID, eidOk := getEventID(task)
if !eidOk || eventID < mutableState.GetNextEventID() {
return mutableState, nil
}
The time-skipping timer task handling was already added to getEventID in previous PRs.
Can the source event ID in the time-skipping info be greater than the timer task's event ID? If so, does that mean the timer task is invalid?
- Yes, it can. If the user sets a new duration limit later, the source event ID changes, and the old timer should be acked silently, so I return nil, nil in this case. Does this make sense? An alternative would be to find the overwritten source event ID in the events, or to keep the overwritten IDs in a list in the mutable state, but I was not sure what the benefit would be, so I kept the current simple logic.
- Under the current replication mechanism, can the following happen in an extreme edge case?
  - Active A (say, next event ID = 10), passive B (caught up to 5).
  - Failover: B becomes active (next event ID = 5) and A becomes passive, so A uses next event ID = 5 again.
  - Failover back: A becomes active and proceeds to event ID = 10 with a new timer. An older timer, created when A was first active, then fires and matches event ID 10, but on a totally different history branch.
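The silent-ack behavior described in the first reply can be sketched as follows. This is a minimal illustration, not the code in this PR: the types `timeSkippingInfo` and `timeSkippingTimerTask`, their fields, and the `(fired bool, err error)` return shape are all hypothetical stand-ins chosen for the example.

```go
package main

import "fmt"

// timeSkippingInfo is a hypothetical stand-in for the record kept in mutable
// state. Only the field needed for this illustration is modeled.
type timeSkippingInfo struct {
	SourceEventID int64 // event that set the current duration limit
}

// timeSkippingTimerTask is a hypothetical stand-in for the timer task.
type timeSkippingTimerTask struct {
	EventID int64 // event that was current when the timer was created
}

// executeTimer reports whether the timer fired. A task whose event ID no
// longer matches the source event ID belongs to an overwritten duration
// limit, so it is acked silently: nothing fires and no error is returned.
func executeTimer(info timeSkippingInfo, task timeSkippingTimerTask) (fired bool, err error) {
	if info.SourceEventID != task.EventID {
		return false, nil // stale timer: drop without retrying
	}
	return true, nil
}

func main() {
	info := timeSkippingInfo{SourceEventID: 12}

	fired, err := executeTimer(info, timeSkippingTimerTask{EventID: 7})
	fmt.Println("old timer:", fired, err)

	fired, err = executeTimer(info, timeSkippingTimerTask{EventID: 12})
	fmt.Println("current timer:", fired, err)
}
```

Note that a plain event ID comparison like this one cannot distinguish two history branches that reuse the same event ID, which is the concern raised in the failover scenario above.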
	return err
}

func (r *TaskRefresherImpl) refreshTasksForTimeSkipping(
Refresh deliberately does not follow the idempotency pattern, so that tasks are regenerated if they were lost.
Stamp TimeSkippingInfo.LastUpdateVersionedTransition on each skip and gate PartialRefresh's timer re-stamp on it, instead of the TaskRegenerationStatus enum. The stamp is a global fact, so it replicates via the generic ExecutionInfo merge — dropping applyIncomingTimeSkippingInfo and the TimerRegenStatus constants.
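As a rough sketch of what gating on a versioned-transition stamp could look like: the example below models a versioned transition as a (failover version, transition count) pair and re-stamps the timer only when the stamp is at or after the transition a partial refresh starts from. The struct, the `compare` helper, `shouldRestampTimer`, and the "at or after" rule are assumptions made for illustration, not the implementation in this PR.

```go
package main

import "fmt"

// versionedTransition is a hypothetical model of a versioned transition:
// a namespace failover version plus a per-execution transition counter.
type versionedTransition struct {
	NamespaceFailoverVersion int64
	TransitionCount          int64
}

// compare orders two transitions by failover version first and transition
// count second. It returns -1, 0, or 1.
func compare(a, b versionedTransition) int {
	switch {
	case a.NamespaceFailoverVersion < b.NamespaceFailoverVersion:
		return -1
	case a.NamespaceFailoverVersion > b.NamespaceFailoverVersion:
		return 1
	case a.TransitionCount < b.TransitionCount:
		return -1
	case a.TransitionCount > b.TransitionCount:
		return 1
	}
	return 0
}

// shouldRestampTimer is the assumed gate: regenerate the time-skipping timer
// only if the info was last updated at or after the refresh starting point.
func shouldRestampTimer(lastUpdate, refreshFrom versionedTransition) bool {
	return compare(lastUpdate, refreshFrom) >= 0
}

func main() {
	refreshFrom := versionedTransition{NamespaceFailoverVersion: 2, TransitionCount: 40}

	older := versionedTransition{NamespaceFailoverVersion: 2, TransitionCount: 35}
	newer := versionedTransition{NamespaceFailoverVersion: 2, TransitionCount: 41}

	fmt.Println("older stamp re-stamps timer:", shouldRestampTimer(older, refreshFrom))
	fmt.Println("newer stamp re-stamps timer:", shouldRestampTimer(newer, refreshFrom))
}
```

Because the stamp is an ordinary field on the replicated state, a gate of this kind needs no separate status enum to be kept in sync across clusters.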
What changed?
timeSkippingInfoUpdated, ms.isStateDirty(), cleanupTransaction and TimeSkippingInfo.LastUpdateVersionedTransition.
The standby's regenerated TimeSkippingTimerTask fires through executeTimeSkippingTimerTask. If the associated FastForwardInfo is still active, the standby awaits the replicated transition rather than driving it.
Why?
The goal is to make sure time skipping works correctly during failovers with state-based replication.
How did you test it?