Recover terminal surfaces when their zmx client exits unexpectedly #453
Merged
A surface's zmx client can exit out from under us for reasons the user never asked for: a transient network drop, a detach, a zmx server restart, a crash. The old path treated every such exit as a user-initiated close, so it tore down the tab and killed the underlying session, losing any running agent. A brief network drop (~10s) hit every surface at once and wiped the whole window.

Decouple an unexpected zmx exit from the close path and recover in place. When a surface's client dies, look up its session and swap a fresh surface view under the same UUID (bumping a per-tab `surfaceGeneration` so SwiftUI rebuilds) only when we positively own an idle session (0 clients). A session another client still holds (`clients > 0`) or one with an unknown count (`nil`) is spared, never killed, matching the orphan reaper's spare-on-in-use rule.

Also shield repositories that failed to load from prune: a transient load failure leaves them with no worktree rows, so prune would drop them and kill their restored zmx sessions. `prune` now takes `protectingRepositoryIDs`, seeded from `loadFailuresByID`.

Closes #422.
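As a rough illustration of the two rules above (reattach only on a positively idle session, and shield load-failed repositories from prune), here is a minimal sketch. Every type and function name in it is hypothetical apart from `surfaceGeneration` and `protectingRepositoryIDs`, and the real signatures in this PR may differ.

```swift
import Foundation

// Hypothetical names; the PR's real types are not shown on this page.
enum ExitRecovery {
    case reattachInPlace   // we positively own an idle session (0 clients)
    case spare             // held by another client, or count unknown: never kill
}

/// Decide what to do after an unexpected client exit.
/// `clientCount` is nil when the session's client count could not be determined.
func recoveryAction(clientCount: Int?) -> ExitRecovery {
    guard let count = clientCount, count == 0 else { return .spare }
    return .reattachInPlace
}

struct Tab {
    let surfaceID: UUID          // stays the same across a reattach
    var surfaceGeneration = 0    // bumped so a view keyed on it is rebuilt

    mutating func apply(_ action: ExitRecovery) {
        if action == .reattachInPlace { surfaceGeneration += 1 }
    }
}

/// Assumed shape of prune: drop repositories with no worktree rows,
/// unless they are listed in `protectingRepositoryIDs`.
func prune(repositoryIDs: [String],
           worktreeCounts: [String: Int],
           protectingRepositoryIDs: Set<String>) -> [String] {
    repositoryIDs.filter { id in
        protectingRepositoryIDs.contains(id) || (worktreeCounts[id] ?? 0) > 0
    }
}
```

Treating `nil` the same as "in use" keeps the failure mode conservative: a lookup that fails during the same network blip can leave a stale surface behind, but it cannot kill a session that is still running an agent.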
Drive a real zmx session through abrupt client exit and detach, then assert the surface reattaches in place and the session survives, so the unexpected-exit recovery path stays covered end to end outside the unit suite.
The zmx crash-recovery tests polled with bare `Task.yield()` loops and no wall-clock budget, racing the detached session-kill and flaking under CI load. Replace the polling with continuation-based signaling on `ZmxTestProbe` (`waitForKill` / `waitForListCalls`), resumed by the real probe event. A bounded timeout backstop records an issue with the call-site source location instead of hanging until the global test timeout, and `waitUntil` polls against a real deadline rather than a fixed yield count.
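The waiting pattern described above might look roughly like the following. This is a sketch under assumptions, not the actual `ZmxTestProbe`: `KillProbe`, `recordKill`, and the `Bool` return value are invented for illustration, and the real helpers record a test issue with the call-site location where this sketch simply returns `false`.

```swift
import Foundation

// Hypothetical stand-in for a test probe; not the real ZmxTestProbe API.
actor KillProbe {
    private var killed = false
    private var waiters: [UUID: CheckedContinuation<Bool, Never>] = [:]

    /// Called by the code under test when the session kill really happens.
    func recordKill() {
        killed = true
        for (_, waiter) in waiters { waiter.resume(returning: true) }
        waiters.removeAll()
    }

    /// Suspends until `recordKill()` fires or `timeout` seconds pass.
    /// Returns false on timeout so the caller can report a failure.
    func waitForKill(timeout: TimeInterval) async -> Bool {
        if killed { return true }
        let id = UUID()
        return await withCheckedContinuation { continuation in
            waiters[id] = continuation
            // Timeout backstop. If the kill arrives first, the waiter is
            // already gone and expire(_:) does nothing.
            Task {
                try? await Task.sleep(nanoseconds: UInt64(timeout * 1_000_000_000))
                await self.expire(id)
            }
        }
    }

    private func expire(_ id: UUID) {
        waiters.removeValue(forKey: id)?.resume(returning: false)
    }
}

/// Polls `condition` against a wall-clock deadline instead of a fixed yield count.
func waitUntil(timeout: TimeInterval, _ condition: () -> Bool) async -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while !condition() {
        if Date() >= deadline { return false }
        await Task.yield()
    }
    return true
}
```

Each continuation is removed from `waiters` before it is resumed, so the kill path and the timeout path can never both resume the same one.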
Summary
A surface's `zmx` client can exit out from under us for reasons the user never asked for: a transient network drop, a detach, a zmx server restart, a crash. The old path treated every such exit as a user-initiated close, so it tore down the tab and killed the underlying session, losing any running agent. The reported case (#422) was a brief network drop (~10s) that hit every surface at once and wiped the whole window.

This makes the terminal layer resilient to any unexpected zmx client exit, not just the network-drop instance.
What changed
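- When a surface's client dies unexpectedly, a fresh surface view is swapped in under the same UUID, bumping a per-tab `surfaceGeneration` token so SwiftUI rebuilds the tree. The agent keeps running; the user sees the terminal reconnect rather than vanish.
- The swap happens only when we positively own an idle session (`clients == 0`). A session another client still holds (`clients > 0`) or one with an unknown count (`nil`) is spared, matching the orphan reaper's documented spare-on-in-use rule.
- Repositories that failed to load are shielded from prune. A transient load failure leaves them with no worktree rows, so prune would otherwise drop them and kill their restored zmx sessions. `prune` now takes `protectingRepositoryIDs`, seeded from `loadFailuresByID`, so those sessions survive the blip.

Tests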
- `WorktreeTerminalManagerTests` and `TerminalTabFeatureTests`: reattach-in-place under the same UUID, spare-on-`clients > 0`, spare-on-unknown-count, `surfaceGeneration` propagation, and prune protection for load-failed repos.
- `scripts/smoke-zmx-crash-recovery.sh` drives a real zmx session through abrupt client exit and detach, asserting the surface reattaches and the session survives end to end.
- `make test` green; `make build-app` and `make check` clean.

Closes #422.