Investigate: ProcessingState doesn't trigger response close when input EOSes mid-run #610

@tillrohrmann

While debugging a flaky restate E2E test (InvokerMemoryTest) we observed the SDK keeping its HTTP/2 response stream open for the full 5-second runtime drain timeout when the runtime closes the request stream mid-attempt (e.g., on OOM yield).

Tracing the path:

  • HttpRequestFlowAdapter.handleRequestEnd (sdk-http-vertx/src/main/java/dev/restate/sdk/http/vertx/HttpRequestFlowAdapter.java:95) → inputMessagesSubscriber.onComplete() on EOS.
  • StateMachineImpl.onComplete (sdk-core/src/main/java/dev/restate/sdk/core/statemachine/StateMachineImpl.java:160) → currentState.onInputClosed(stateContext) then triggerNextEventSignal().
  • WaitingStartState / WaitingReplayEntriesState correctly throw → hitError → response closes. ✓
  • The default onInputClosed in State.java:195-198 (used by ProcessingState) only marks input closed; it doesn't transition or close the response.
  • ProcessingState.doProgress (ProcessingState.java:83-90) does close the response (hitSuspended) — but only when called and only if no run is currently executing.

Scenario that hangs: the user coroutine has emitted a ProposeRunCompletion and is parked awaiting the ack. The server closes the request stream (EOS) → onInputClosed marks input closed → triggerNextEventSignal runs whatever listener was registered. The awaiting-ack coroutine isn't necessarily that listener, so nothing re-enters doProgress, and the response stays open until something else (eventually) pokes the state machine. In the runs we observed, that meant the full 5-second drain timeout.
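To make the suspected sequence concrete, here is a toy model of it. This is a sketch only: the class, its fields, and the one-shot listener are assumptions made for illustration, and only the method names are borrowed from the trace above. It is not the SDK's actual code.

```java
/** Toy model of the suspected hang; hypothetical, not the SDK's implementation. */
public class InputClosedHangSketch {
  boolean inputClosed = false;
  boolean responseClosed = false;
  boolean runExecuting = false;
  // Assumption: at most one one-shot listener is registered at a time.
  Runnable nextEventListener = null;

  // Stand-in for the default onInputClosed: only marks input closed.
  void onInputClosed() {
    inputClosed = true;
  }

  // Runs the registered listener (if any) once, then clears it.
  void triggerNextEventSignal() {
    Runnable listener = nextEventListener;
    nextEventListener = null;
    if (listener != null) {
      listener.run();
    }
  }

  // Stand-in for doProgress: closes the response (suspends) only when it
  // is actually called, input is closed, and no run is executing.
  void doProgress() {
    if (inputClosed && !runExecuting) {
      responseClosed = true;
    }
  }

  // Stand-in for StateMachineImpl.onComplete as traced above.
  void onComplete() {
    onInputClosed();
    triggerNextEventSignal();
  }

  // Returns whether the response is closed after the input EOS.
  static boolean simulate(boolean listenerReentersDoProgress) {
    InputClosedHangSketch sm = new InputClosedHangSketch();
    if (listenerReentersDoProgress) {
      sm.nextEventListener = sm::doProgress;
    }
    sm.onComplete();
    return sm.responseClosed;
  }

  public static void main(String[] args) {
    System.out.println(simulate(true));  // listener re-enters doProgress
    System.out.println(simulate(false)); // parked coroutine is not the listener
  }
}
```

In this model the response closes only when the signalled listener happens to call doProgress. With no such listener, the input is marked closed but the response stays open, which matches the symptom described.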

Question: is this intended? Options to investigate:

  1. Have onInputClosed in ProcessingState (and similar) actively schedule a re-run of doProgress so the state machine can decide to suspend.
  2. Cancel any pending user-code coroutine that's awaiting a completion when input closes.
  3. Document the current behavior as a deliberate contract (the runtime must drain) and accept the 5 s tail.
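As a starting point for discussing option 1, the idea could look roughly like the sketch below. Everything here is hypothetical: the class, its fields, and the onRunFinished hook are invented for illustration, and the real ProcessingState would need to schedule the re-check through whatever mechanism the state machine already uses.

```java
/** Toy model of option 1; all names are hypothetical, not the SDK's API. */
public class EagerSuspendSketch {
  boolean inputClosed = false;
  boolean responseClosed = false;
  boolean runExecuting = false;

  // Stand-in for doProgress: suspend (close the response) once input is
  // closed and no run is executing.
  void doProgress() {
    if (inputClosed && !runExecuting) {
      responseClosed = true;
    }
  }

  // Option 1: besides marking input closed, re-check progress right away.
  void onInputClosed() {
    inputClosed = true;
    doProgress();
  }

  // Assumed hook: when a run finishes, re-check as well, so an EOS that
  // arrived mid-run still leads to a suspension.
  void onRunFinished() {
    runExecuting = false;
    doProgress();
  }

  // Returns {closed right after EOS, closed after the run finishes}.
  static boolean[] simulate(boolean runExecutingAtEos) {
    EagerSuspendSketch sm = new EagerSuspendSketch();
    sm.runExecuting = runExecutingAtEos;
    sm.onInputClosed();
    boolean afterEos = sm.responseClosed;
    sm.onRunFinished();
    return new boolean[] {afterEos, sm.responseClosed};
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(simulate(false)));
    System.out.println(java.util.Arrays.toString(simulate(true)));
  }
}
```

With no run executing, the model closes the response as soon as the input EOS arrives. With a run in flight, it waits for the run to finish and then closes, instead of relying on an unrelated event to poke the state machine.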

Symptom on the restate side: "Response stream draining timeout!" fired 108 times across 14 invocations in 120 s in a real CI run (https://github.com/restatedev/restate/actions/runs/26099619862/job/76748911672). A companion issue has been filed against restate to revisit whether the server should drain at all on the yield path.
