
SDK sendAndWait hangs indefinitely in tool-call loops once cumulative tool-result payload crosses a threshold #2911

@affandar

Description

Summary

When we use @github/copilot-sdk to drive a sequential tool-call loop, the SDK
silently hangs after a few tool iterations once tool-result payloads grow
past a small threshold. The wedge is inside the model call:
assistant.turn_start fires, then no further events ever arrive. There is
no assistant.message, no assistant.turn_end, and no session.error.

The timeoutMs parameter to session.sendAndWait(opts, timeoutMs) is
ignored while wedged. The only recovery is killing the process or
applying an outer AbortController / Promise.race timeout.

This is reproducible with a ~120-line script using only @github/copilot-sdk:
no other dependencies, no dehydration / resume, no real tools.

Affected

  • @github/copilot-sdk (latest, as bundled with current Copilot CLI)
  • Models confirmed: claude-sonnet-4.6, claude-haiku-4.5, gpt-5.4
  • Both claude-opus-4.7 (older) and current claude-opus-4.6 series
  • Node v24.14.0 on macOS 14, also reproduces in containerized Linux

Claude is more sensitive (smaller threshold), but GPT wedges with identical
symptoms once payload size × turn count grows large enough. The failure is
not model-specific.

Threshold matrix (clean repro, single fresh session, sequential tool loop)

Model               500 B per call     2 KB per call       5 KB per call
claude-sonnet-4.6   ✅ 20/20 in 58 s   ❌ Hung at call 12   ❌ Hung at call 4
gpt-5.4             ✅ 20/20 in 81 s   ✅ 20/20 in 169 s    ❌ Hung at call 9

N=20 shards, a single tool; the model is told to call shard=1..N and then return the sum.
The tool returns in <1 ms with a synthetic JSON payload { shard, value, padding: "x".repeat(...) }.
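
Under that setup, the cumulative tool-result payload at each hang point works out to roughly per-call bytes × calls completed (our arithmetic from the matrix above, not SDK output):

```javascript
// Approximate cumulative tool-result bytes in the session at the point it wedged.
const cumulativeBytes = (perCallBytes, callsCompleted) => perCallBytes * callsCompleted;

console.log(cumulativeBytes(2000, 12)); // sonnet at 2 KB per call: 24000 bytes
console.log(cumulativeBytes(5000, 4));  // sonnet at 5 KB per call: 20000 bytes
console.log(cumulativeBytes(5000, 9));  // gpt at 5 KB per call:    45000 bytes
```

The per-model hang points land in the same few-tens-of-KB range, which is what suggests a cumulative threshold rather than a per-call one.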

When wedged, [HANG] elapsed=473916ms was observed while the request was still
hung: the SDK held the request open for roughly 8 minutes despite the 180 s
sendAndWait timeout parameter.

Symptom (event trace, anonymized)

[HH:MM:SS] >>> turn_start #4
[HH:MM:SS] msg len=0
[HH:MM:SS] tool#3 shard=3 value=52
[HH:MM:SS] <<< turn_end (chars total: 0)
[HH:MM:SS] >>> turn_start #5
                                    ← hangs here, indefinitely
                                    ← no assistant.message
                                    ← no assistant.turn_end
                                    ← no session.error
                                    ← sendAndWait timeout ignored
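
Until a fix lands, the hang can at least be detected from the outside by watching the gap between session events. A minimal sketch (the `makeWatchdog` helper and the 30 s threshold are ours, not @github/copilot-sdk API): every event resets a timer, and if no event arrives for `maxGapMs` we treat the turn as wedged.

```javascript
// Generic event-gap watchdog: invokes onStall() if no event is observed
// for maxGapMs. Illustrative helper, not part of @github/copilot-sdk.
function makeWatchdog(maxGapMs, onStall) {
  let timer = null;
  const arm = () => {
    clearTimeout(timer);
    timer = setTimeout(onStall, maxGapMs);
  };
  arm(); // start counting from creation
  return {
    tick: arm,                      // call on every session event
    stop: () => clearTimeout(timer), // call when the turn completes
  };
}

// Usage with the repro's session:
// const wd = makeWatchdog(30_000, () => console.error("[HANG] no events for 30 s"));
// session.on((ev) => { wd.tick(); /* existing event logging */ });
```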

Minimal reproduction (~50 lines, no extra deps)

import { CopilotClient, approveAll, defineTool } from "@github/copilot-sdk";

const N = 20;
const PAYLOAD = Number(process.env.PAYLOAD ?? 5000); // bytes per tool return
const MODEL = process.env.MODEL ?? "claude-sonnet-4.6";

const stepTool = defineTool("get_step_data", {
  description:
    `Fetch the next data shard. Call shard=1, then 2, ... until ${N}, ` +
    `then output ONE message with the sum. Don't summarize between calls.`,
  parameters: {
    type: "object",
    properties: { shard: { type: "integer" } },
    required: ["shard"],
  },
  handler: async ({ shard }) => JSON.stringify({
    shard,
    value: ((shard * 17) % 100) + 1,
    padding: "x".repeat(PAYLOAD - 60),
  }),
});

const client = new CopilotClient({
  autoStart: true,
  githubToken: process.env.GITHUB_TOKEN,
});
await client.start();

const session = await client.createSession({
  model: MODEL,
  tools: [stepTool],
  onPermissionRequest: approveAll,
  systemPrompt:
    "You are a sequential data fetcher. Call get_step_data with shard=1, " +
    "then shard=2, ..., then output ONE final message with the sum and stop.",
});

session.on((ev) => {
  const ts = new Date().toISOString().slice(11, 23);
  if (ev.type === "assistant.turn_start") console.log(`[${ts}] turn_start`);
  if (ev.type === "assistant.message") console.log(`[${ts}] msg ${ev.data?.content?.slice(0,60)}`);
  if (ev.type === "assistant.turn_end") console.log(`[${ts}] turn_end`);
});

// Outer timeout because sendAndWait's own timeout is ignored when wedged
const timeoutPromise = new Promise((_, reject) =>
  setTimeout(() => reject(new Error("HUNG")), 180_000));

try {
  const result = await Promise.race([
    session.sendAndWait({ prompt: `Fetch all ${N} shards and sum them.` }, 180_000),
    timeoutPromise,
  ]);
  console.log("OK:", result?.data?.content);
} catch (e) {
  console.log("HUNG:", e.message);
}

await client.stop();
process.exit(0);

Save as loop.mjs, run MODEL=claude-sonnet-4.6 GITHUB_TOKEN=ghp_... node loop.mjs.

To sweep:

for MODEL in claude-sonnet-4.6 gpt-5.4; do
  for PAYLOAD in 500 2000 5000; do
    echo "==== $MODEL payload=${PAYLOAD}B ===="
    MODEL=$MODEL PAYLOAD=$PAYLOAD node loop.mjs
  done
done

Expected behavior

  1. The model streams an assistant.message and completes the turn, or
  2. the SDK surfaces an error event and the sendAndWait promise rejects within
    the timeoutMs the caller passed.

Actual behavior

Neither happens. The SDK silently waits forever for an upstream stream that
never emits. The timeoutMs argument is not enforced when the wedge is
inside the provider stream.

What we ruled out

  • Single large tool result. A one-shot test injecting 40 KB / 100 KB /
    250 KB / 500 KB base64 tool returns into a fresh session completes
    cleanly in 13–22 s on claude-opus-4.7. The hang requires a loop of
    moderately-sized returns.
  • Conversation history size alone. Resuming a 326-event / 1.8 MB
    events.jsonl session with a tiny prompt completes in 4 s.
  • Tool handler latency. Repro tool returns in <1 ms.
  • Model family. GPT-5.4 wedges at the same symptom, just at higher
    payload thresholds than Claude.

Likely culprits (from the outside)

  • Provider streaming adapter not enforcing read deadlines on the upstream
    SSE / chunked response.
  • A specific tool-use repeat pattern that produces an "empty turn" the
    CLI keeps waiting on.
  • Cumulative tool-result payload pushing the request close to a
    provider-side limit that returns no error and no terminator.
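
For the first culprit, the guard we would expect is an inter-chunk read deadline on the upstream stream. A hedged sketch of the idea (our code, not the SDK's actual streaming adapter): race each chunk read against a timer, and fail the turn instead of waiting forever.

```javascript
// Sketch: consume an async iterable of stream chunks, but reject if any
// single read takes longer than deadlineMs. Illustrative only; the SDK's
// real provider adapter is not public.
async function* withReadDeadline(chunks, deadlineMs) {
  const it = chunks[Symbol.asyncIterator]();
  while (true) {
    let timer;
    const deadline = new Promise((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`stream read exceeded ${deadlineMs} ms`)),
        deadlineMs,
      );
    });
    try {
      const { value, done } = await Promise.race([it.next(), deadline]);
      if (done) return;
      yield value;
    } finally {
      clearTimeout(timer); // don't leak the timer when the read wins the race
    }
  }
}
```

If the SDK wrapped its provider stream this way, the wedge would surface as a session.error instead of an indefinite wait.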

What would help us

  1. An enforced internal read deadline on the upstream model stream that
    surfaces a session.error instead of waiting forever.
  2. The timeoutMs parameter to sendAndWait actually unsticking the
    request (it currently doesn't).
  3. A way to pass an AbortSignal into sendAndWait so callers can cancel.
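
The third item can be approximated from the outside today. A minimal sketch (the `waitWithSignal` helper is ours, not SDK API) that lets a caller abandon the wait via an AbortSignal; note this only unblocks the caller, it does not cancel the underlying request:

```javascript
// Race any in-flight promise against an AbortSignal so the caller can
// stop waiting. The wrapped promise keeps running; only the wait ends.
function waitWithSignal(promise, signal) {
  if (signal.aborted) return Promise.reject(new Error("aborted"));
  return new Promise((resolve, reject) => {
    const onAbort = () => reject(new Error("aborted"));
    signal.addEventListener("abort", onAbort, { once: true });
    promise.then(
      (v) => { signal.removeEventListener("abort", onAbort); resolve(v); },
      (e) => { signal.removeEventListener("abort", onAbort); reject(e); },
    );
  });
}

// Usage with the repro's session:
// const ac = new AbortController();
// setTimeout(() => ac.abort(), 180_000);
// await waitWithSignal(session.sendAndWait({ prompt }), ac.signal);
```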

Workarounds we use

  • Wrap every sendAndWait in Promise.race(sendAndWait, externalTimeout)
    so we at least know it hung.
  • Forcibly kill the worker process to release the request and let our
    durable-execution layer redeliver the turn.
  • Spill large tool returns to blob storage and pass back a small pointer,
    to keep cumulative tool-result payload below the wedge threshold.
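
The last workaround is easy to apply at the tool boundary. A minimal sketch, with an in-memory Map standing in for real blob storage (all names here are ours, not SDK API): results over a size cap are replaced by a small pointer the model can later dereference.

```javascript
// In-memory stand-in for blob storage; a real deployment would persist
// the payload and hand back a retrieval key.
const blobs = new Map();
const store = (payload) => {
  const key = `blob-${blobs.size + 1}`;
  blobs.set(key, payload);
  return key;
};

// Wrap a tool handler so oversized string results are spilled to storage
// and replaced with a compact pointer object.
function spillLargeResults(handler, maxBytes = 1024) {
  return async (args) => {
    const result = await handler(args);
    if (Buffer.byteLength(result, "utf8") <= maxBytes) return result;
    return JSON.stringify({ spilled: true, key: store(result) });
  };
}

// Usage: defineTool("get_step_data", { ..., handler: spillLargeResults(realHandler) })
```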

Metadata

Labels

  area:tools (Built-in tools: file editing, shell, search, LSP, git, and tool call behavior)
