fix(codex): bound bridge app-server stalls #209

Open
yui-stingray wants to merge 1 commit into fujibee:main from yui-stingray:codex/bridge-timeouts-watch-limit
@yui-stingray

Summary

  • add bounded Codex bridge app-server connect/upgrade and JSON-RPC request timeouts
  • apply request timeout cleanup to both stdio and direct WebSocket app-server clients
  • stop after a configurable number of consecutive real watch-once failures, while still treating exit code 2 as the normal re-arm signal
  • clear failed watch spawn state and fail the bridge when an app-server request times out during event handling
  • isolate Codex bridge Bats project roots under each test temp dir
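
As an illustration of the request-timeout cleanup described above, here is a minimal sketch of a transport-agnostic tracker for in-flight JSON-RPC requests. It is not the code in this PR: `PendingRequests`, `track`, `settle`, and the injectable `timers` argument are hypothetical names. Keeping the tracker independent of the transport is one way the same cleanup could serve both the stdio and the direct WebSocket client.

```javascript
// Hypothetical helper (not from the PR): bounds each in-flight JSON-RPC
// request with a timer and removes its bookkeeping on timeout or response.
class PendingRequests {
  constructor(
    timeoutMs,
    // Timers are injectable so the behavior can be driven deterministically.
    timers = {
      setTimeout: (fn, ms) => setTimeout(fn, ms),
      clearTimeout: (t) => clearTimeout(t),
    },
  ) {
    this.timeoutMs = timeoutMs;
    this.timers = timers;
    this.pending = new Map(); // id -> { resolve, reject, timer }
  }

  // Register a request; the returned promise settles on response or timeout.
  track(id, method) {
    return new Promise((resolve, reject) => {
      let timer = null;
      if (this.timeoutMs > 0) { // 0 disables the timeout
        timer = this.timers.setTimeout(() => {
          this.pending.delete(id); // cleanup: a late response is now ignored
          reject(new Error(`${method} (id ${id}) timed out after ${this.timeoutMs}ms`));
        }, this.timeoutMs);
      }
      this.pending.set(id, { resolve, reject, timer });
    });
  }

  // Route an incoming response to its waiter.
  // Returns false when the id is unknown (for example, it already timed out).
  settle(message) {
    const entry = this.pending.get(message.id);
    if (!entry) return false;
    this.pending.delete(message.id);
    if (entry.timer !== null) this.timers.clearTimeout(entry.timer);
    if (message.error) entry.reject(new Error(message.error.message));
    else entry.resolve(message.result);
    return true;
  }
}
```

In this sketch a timed-out request is removed from the map before its promise rejects, so a response that arrives later finds no waiter and `settle` returns `false`.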

Closes #195.

Behavior notes

New knobs:

  • --connect-timeout-ms, AGMSG_CODEX_BRIDGE_CONNECT_TIMEOUT_MS, default 10000
  • --request-timeout-ms, AGMSG_CODEX_BRIDGE_REQUEST_TIMEOUT_MS, default 30000
  • --watch-failure-limit, AGMSG_CODEX_BRIDGE_WATCH_FAILURE_LIMIT, default 3

Setting any of these to 0 disables the corresponding timeout or limit.
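
For illustration, the three knobs could be resolved along these lines. This is a sketch under assumptions, not the bridge's actual parser: the helper names (`KNOBS`, `resolveKnobs`, `parseNonNegativeInt`, `isEnabled`), the `--flag value` argument form, and the precedence (flag over environment variable over default) are all hypothetical. Only the flag names, variable names, and defaults come from the list above.

```javascript
// Flag names, env names, and defaults are taken from the PR description.
const KNOBS = {
  connectTimeoutMs: {
    flag: '--connect-timeout-ms',
    env: 'AGMSG_CODEX_BRIDGE_CONNECT_TIMEOUT_MS',
    fallback: 10000,
  },
  requestTimeoutMs: {
    flag: '--request-timeout-ms',
    env: 'AGMSG_CODEX_BRIDGE_REQUEST_TIMEOUT_MS',
    fallback: 30000,
  },
  watchFailureLimit: {
    flag: '--watch-failure-limit',
    env: 'AGMSG_CODEX_BRIDGE_WATCH_FAILURE_LIMIT',
    fallback: 3,
  },
};

// Hypothetical validator: accepts only non-negative integers.
function parseNonNegativeInt(raw, label) {
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${label} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Assumed precedence: CLI flag, then environment variable, then default.
function resolveKnobs(argv, env) {
  const out = {};
  for (const [name, { flag, env: envName, fallback }] of Object.entries(KNOBS)) {
    const i = argv.indexOf(flag);
    if (i !== -1 && i + 1 < argv.length) {
      out[name] = parseNonNegativeInt(argv[i + 1], flag);
    } else if (env[envName] !== undefined && env[envName] !== '') {
      out[name] = parseNonNegativeInt(env[envName], envName);
    } else {
      out[name] = fallback;
    }
  }
  return out;
}

// 0 means "disabled" for every knob.
const isEnabled = (value) => value > 0;
```

A caller would pass `process.argv.slice(2)` and `process.env`, then guard each timer or counter with `isEnabled(...)` so that a value of 0 skips it entirely.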

A request timeout inside an app-server event handler now intentionally terminates the bridge instead of only logging and continuing. With the new timeout behavior, continuing after a timed-out process/spawn or turn/start can leave the bridge alive but unable to monitor correctly, for example with a non-null watchHandle and no actual watch process. Failing fast gives a clear error instead of a silent pseudo-monitor stall.
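
The fail-fast behavior can be pictured with the sketch below. It is a simplified model rather than the bridge's implementation: `RequestTimeoutError`, `createBridge`, and the injected `request`, `fail`, and `log` callbacks are hypothetical, as are the empty `process/spawn` parameters. Only `watchHandle` and the method name come from the description above. Logging and continuing on non-timeout errors is also an assumption about the pre-existing behavior.

```javascript
// Hypothetical error type used to tell timeouts apart from other errors.
class RequestTimeoutError extends Error {}

// Hypothetical bridge model; dependencies are injected for clarity.
function createBridge({ request, fail, log }) {
  const state = { watchHandle: null };

  async function spawnWatch() {
    // Mark the watch as spawning so a second spawn is not attempted.
    state.watchHandle = { pending: true };
    try {
      state.watchHandle = await request('process/spawn', {});
    } catch (err) {
      // Clear failed spawn state: never keep a handle without a process.
      state.watchHandle = null;
      throw err;
    }
  }

  async function handleEvent(handler, event) {
    try {
      await handler(event);
    } catch (err) {
      if (err instanceof RequestTimeoutError) {
        // Fail fast: after a timed-out request the bridge may no longer
        // be monitoring, so terminate instead of logging and continuing.
        fail(err);
        return;
      }
      // Assumption: other handler errors are logged and the bridge continues.
      log(`event handler error: ${err.message}`);
    }
  }

  return { state, spawnWatch, handleEvent };
}
```

In a real process, `fail` would presumably log the error and exit with a non-zero status. Here it is a callback so the control flow stays visible.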

Validation

  • node --check scripts/drivers/types/codex/codex-bridge.js
  • git diff --check
  • bats --print-output-on-failure tests/test_codex_bridge.bats -> 22/22
  • bats --print-output-on-failure -f 'codex' tests/test_delivery.bats -> 10/10
  • timeout 240s bats --print-output-on-failure tests/ -> reached 168/393 before the outer timeout. The changed Codex bridge section passed (33-54) and the Codex delivery checks passed (157-163). The only failure observed before the timeout was the existing, unrelated malformed-JSON case in "delivery set monitor: existing settings with single-quoted hook commands stays valid JSON (#134)".

Review notes

This was checked with separate read-only review passes for approach, test design, implementation diff, and final readiness. The remaining practical risk is live Codex app-server variance; the added tests use fake stdio/WebSocket app-servers to cover the protocol stall/failure paths without touching real db/, teams/, or run/ state.

Development

Successfully merging this pull request may close these issues.

Codex bridge can hang indefinitely on app-server stalls and watch failures
