Recover terminal surfaces when their zmx client exits unexpectedly#453

Merged
sbertix merged 3 commits into main from sbertix/422-fix-network-crash
Jun 22, 2026

Conversation

@sbertix sbertix commented Jun 22, 2026

Collaborator

Summary

A surface's zmx client can exit out from under us for reasons the user never asked for: a transient network drop, a detach, a zmx server restart, a crash. The old path treated every such exit as a user-initiated close, so it tore down the tab and killed the underlying session, losing any running agent. The reported case (#422) was a brief network drop (~10s) that hit every surface at once and wiped the whole window.

This makes the terminal layer resilient to any unexpected zmx client exit, not just the network-drop instance.

What changed

  • Reattach in place on unexpected zmx exit. When a surface's zmx client dies, we look up its session and swap a fresh surface view under the same UUID, bumping a per-tab surfaceGeneration token so SwiftUI rebuilds the tree. The agent keeps running; the user sees the terminal reconnect rather than vanish.
  • Spare sessions we don't exclusively own. We only kill/reattach a session we positively own and that is idle (clients == 0). A session another client still holds (clients > 0) or one with an unknown count (nil) is spared, matching the orphan reaper's documented spare-on-in-use rule.
  • Shield load-failed repos from prune. A transient load failure leaves a repo with no worktree rows, which previously pruned it (and killed its restored zmx sessions). prune now takes protectingRepositoryIDs, seeded from loadFailuresByID, so those sessions survive the blip.
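
The ownership rule in the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `UnexpectedExitAction`, `TabState`, and `handleUnexpectedExit` are hypothetical names, and only `surfaceGeneration` and the `clients` count (0, greater than 0, or `nil`) come from the description above.

```swift
import Foundation

// Hypothetical outcome type for illustration; the PR's real types are not shown.
enum UnexpectedExitAction: Equatable {
    case reattachInPlace   // we own the session and it is idle
    case spare             // another client holds it, or the count is unknown
}

// Hypothetical per-tab state. Only `surfaceGeneration` is named in the PR.
struct TabState {
    let surfaceID: UUID            // stays the same across a reattach
    var surfaceGeneration: Int = 0 // bumped so the UI rebuilds the surface

    /// Decide what to do when the surface's zmx client exits unexpectedly.
    /// `clients` is the session's attached-client count, or nil when unknown.
    @discardableResult
    mutating func handleUnexpectedExit(ownsSession: Bool, clients: Int?) -> UnexpectedExitAction {
        // `clients == 0` is false for nil, so an unknown count is spared.
        guard ownsSession && clients == 0 else { return .spare }
        surfaceGeneration += 1
        return .reattachInPlace
    }
}
```

One plausible way to consume such a token in SwiftUI is to pass it to `.id(_:)` on the surface view, so a new value forces the subtree to be rebuilt. Whether the PR wires it up that way is an assumption.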

Tests

  • Unit coverage in WorktreeTerminalManagerTests and TerminalTabFeatureTests: reattach-in-place under the same UUID, spare-on-clients > 0, spare-on-unknown-count, surfaceGeneration propagation, and prune protection for load-failed repos.
  • scripts/smoke-zmx-crash-recovery.sh drives a real zmx session through abrupt client exit and detach, asserting the surface reattaches and the session survives end to end.
  • make test green; make build-app and make check clean.

Closes #422.

sbertix added 2 commits June 22, 2026 18:59
A surface's zmx client can exit out from under us for reasons the user
never asked for: a transient network drop, a detach, a zmx server
restart, a crash. The old path treated every such exit as a
user-initiated close, so it tore down the tab and killed the underlying
session, losing any running agent. A brief network drop (~10s) hit every
surface at once and wiped the whole window.

Detach an unexpected zmx exit from the close path and recover in place.
When a surface's client dies, look up its session and swap a fresh
surface view under the same UUID (bumping a per-tab surfaceGeneration so
SwiftUI rebuilds) only when we positively own an idle session
(0 clients). A session another client still holds (clients > 0) or one
with an unknown count (nil) is spared, never killed, matching the orphan
reaper's spare-on-in-use rule.

Also shield repositories that failed to load from prune: a transient
load failure leaves them with no worktree rows, so prune would drop them
and kill their restored zmx sessions. prune now takes
protectingRepositoryIDs seeded from loadFailuresByID.
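
A rough sketch of the prune protection described above, under stated assumptions: `repositoriesToPrune`, `RepositoryID`, and the dictionary shapes are hypothetical, and only the `protectingRepositoryIDs` and `loadFailuresByID` names come from this message.

```swift
// Hypothetical model: repository ID -> worktree rows currently loaded for it.
typealias RepositoryID = String

/// Returns the repository IDs a prune pass would drop: those with no
/// worktree rows, except any listed in `protectingRepositoryIDs`.
func repositoriesToPrune(
    worktreeRowsByRepository: [RepositoryID: [String]],
    protectingRepositoryIDs: Set<RepositoryID>
) -> Set<RepositoryID> {
    let emptyIDs = worktreeRowsByRepository.filter { $0.value.isEmpty }.keys
    return Set(emptyIDs).subtracting(protectingRepositoryIDs)
}

// Seed the protected set from the recorded load failures.
let loadFailuresByID: [RepositoryID: String] = ["repo-b": "transient load failure"]
let rows: [RepositoryID: [String]] = [
    "repo-a": ["main"],
    "repo-b": [],   // failed to load, so it has no rows right now
    "repo-c": [],   // has no rows and did not fail to load
]
let pruned = repositoriesToPrune(
    worktreeRowsByRepository: rows,
    protectingRepositoryIDs: Set(loadFailuresByID.keys)
)
// `pruned` contains only "repo-c"; "repo-b" and its sessions survive the blip.
```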

Closes #422.

Drive a real zmx session through abrupt client exit and detach, then
assert the surface reattaches in place and the session survives, so the
unexpected-exit recovery path stays covered end to end outside the unit
suite.
@sbertix sbertix enabled auto-merge (squash) June 22, 2026 17:07
tuist Bot commented Jun 22, 2026


🛠️ Tuist Run Report 🛠️

Builds 🔨

Scheme Status Duration Commit
supacode 1m 9s 1e96a641d

The zmx crash-recovery tests polled with bare `Task.yield()` loops and no
wall-clock budget, racing the detached session-kill and flaking under CI load.

Replace the polling with continuation-based signaling on `ZmxTestProbe`
(`waitForKill` / `waitForListCalls`), resumed by the real probe event. A bounded
timeout backstop records an issue with the call-site source location instead of
hanging until the global test timeout, and `waitUntil` polls against a real
deadline rather than a fixed yield count.
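
The signaling pattern described above might look roughly like this. It is a sketch under assumptions, not the PR's `ZmxTestProbe`: `KillProbe`, `recordKill`, and the `Bool` return value are hypothetical, and a real test helper would presumably turn a `false` result into a recorded issue at the call site.

```swift
import Foundation

/// Hypothetical stand-in for a test probe; not the PR's `ZmxTestProbe`.
actor KillProbe {
    private var killed = false
    private var waiters: [UUID: CheckedContinuation<Bool, Never>] = [:]

    /// Called when the session kill actually happens; resumes every waiter.
    func recordKill() {
        killed = true
        let pending = waiters
        waiters.removeAll()
        for continuation in pending.values {
            continuation.resume(returning: true)
        }
    }

    /// Suspends until `recordKill()` fires or `timeout` seconds elapse.
    /// Returns true if the kill was observed, false if the backstop fired.
    func waitForKill(timeout: TimeInterval) async -> Bool {
        if killed { return true }
        let id = UUID()
        // Backstop: resume the waiter with `false` once the budget is spent.
        let timer = Task {
            try? await Task.sleep(nanoseconds: UInt64(timeout * 1_000_000_000))
            await self.expire(id)
        }
        let observed = await withCheckedContinuation { continuation in
            waiters[id] = continuation
        }
        timer.cancel()
        return observed
    }

    /// Removing the entry before resuming means each continuation is
    /// resumed exactly once, whichever of the event or the timer wins.
    private func expire(_ id: UUID) {
        waiters.removeValue(forKey: id)?.resume(returning: false)
    }
}

/// Polls `condition` against a wall-clock deadline instead of a yield count.
func waitUntil(timeout: TimeInterval, _ condition: () async -> Bool) async -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while Date() < deadline {
        if await condition() { return true }
        try? await Task.sleep(nanoseconds: 10_000_000)
    }
    return await condition()
}
```

Because the waiter is resumed by the probe event itself, the test does not depend on how many times the scheduler happens to yield. The timeout only bounds the failure case.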
@sbertix sbertix merged commit 50350b4 into main Jun 22, 2026
1 check passed
@sbertix sbertix deleted the sbertix/422-fix-network-crash branch June 22, 2026 19:25
Development

Successfully merging this pull request may close these issues.

Brief network drop (~10s) tore down all terminals/tabs and killed every Claude session