fix(codex): bound bridge app-server stalls #209
Open
yui-stingray wants to merge 1 commit into
Summary
`watch-once` failures while keeping exit `2` as normal re-arm behavior

Closes #195.
Behavior notes
New knobs:
- `--connect-timeout-ms`, `AGMSG_CODEX_BRIDGE_CONNECT_TIMEOUT_MS`, default `10000`
- `--request-timeout-ms`, `AGMSG_CODEX_BRIDGE_REQUEST_TIMEOUT_MS`, default `30000`
- `--watch-failure-limit`, `AGMSG_CODEX_BRIDGE_WATCH_FAILURE_LIMIT`, default `3`

`0` disables the corresponding timeout/limit.

A request timeout inside an app-server event handler now intentionally terminates the bridge instead of only logging and continuing. With the new timeout behavior, continuing after a timed-out `process/spawn` or `turn/start` can leave the bridge alive but unable to monitor correctly, for example with a non-null `watchHandle` and no actual watch process. Failing fast gives a clear error instead of a silent pseudo-monitor stall.

Validation
- `node --check scripts/drivers/types/codex/codex-bridge.js`
- `git diff --check`
- `bats --print-output-on-failure tests/test_codex_bridge.bats` -> 22/22
- `bats --print-output-on-failure -f 'codex' tests/test_delivery.bats` -> 10/10
- `timeout 240s bats --print-output-on-failure tests/` -> reached 168/393 before the outer timeout; the changed Codex bridge section passed (33-54), Codex delivery checks passed (157-163), and the only observed failure before timeout was the existing unrelated `delivery set monitor: existing settings with single-quoted hook commands stays valid JSON (#134)` malformed JSON case.

Review notes
This was checked with separate read-only review passes for approach, test design, implementation diff, and final readiness. The remaining practical risk is live Codex app-server variance; the added tests use fake stdio/WebSocket app-servers to cover the protocol stall/failure paths without touching real `db/`, `teams/`, or `run/` state.
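
For reviewers who want a concrete picture of the knob semantics in the behavior notes, here is a minimal sketch. It is not the code in `codex-bridge.js`. The helper names `resolveKnob` and `withTimeout` are hypothetical, and the flag-then-env-then-default precedence is an assumption; this description only states that each knob has a flag, an environment variable, and a default, and that `0` disables it.

```javascript
'use strict';

// Hypothetical helper: pick a numeric knob from a CLI flag value, then an
// environment variable, then a default. The precedence order is an assumption.
function resolveKnob(flagValue, envValue, defaultValue) {
  for (const raw of [flagValue, envValue]) {
    if (raw === undefined || raw === '') continue;
    const parsed = Number(raw);
    if (!Number.isInteger(parsed) || parsed < 0) {
      throw new Error(`invalid knob value: ${raw}`);
    }
    return parsed;
  }
  return defaultValue;
}

// Hypothetical helper: reject if `promise` has not settled within `ms`.
// A value of 0 disables the timeout and returns the promise unchanged.
function withTimeout(promise, ms, label) {
  if (ms === 0) return promise;
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Under the fail-fast approach described above, a caller would let the rejection from a timed-out `process/spawn` or `turn/start` propagate and terminate the bridge, rather than catching it, logging, and continuing.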