feat(endpointing): emit turn-outcome events for the endpointing eval (#505) #511
Merged
…505)

Report a text-free turn-outcome to the SayPi API after every endpointing-driven auto-submit, so the server can score how good `pFinishedSpeaking` is at deciding when a user's turn ended. The only non-circular ground truth for "did the user actually finish?" is whether they started speaking again inside the resume window (submit -> assistant response starts); a resume means we cut them off.

The whole window already lives in ConversationMachine, so this rides existing transitions rather than inventing detection:

- Window OPEN: `converting.submitting` snapshots the correlation data (session id, last sequence number + its pFinishedSpeaking, speech-offset, app) into context. Only the endpointing auto-submit path reaches this state, so every emit is `trigger:"auto"` (manual sends bypass the machine entirely).
- Window CLOSE: the single transition that leaves `responding.piThinking`: piWriting/piSpeaking (response started), userInterrupting (user resumed first, a false finish), or the PI_THINKING_TIMEOUT_MS fallback (response never started). piThinking is entered/left once per turn, so exactly one event fires.
- Maintenance messages (suppressed buffer flushes) are excluded; they aren't real endpointing decisions.

The POST itself is a thin, fire-and-forget `postTurnOutcome()` in a new module, reusing the shared `callApi` client (JWT/anon + CSP-safe background routing). Never awaited, never retried, never throws to the UX; a dropped event just leaves that turn to the server-side fallback. Payload is booleans, timestamps, a sequence number and an optional score only, with no transcript text.

See saypi-api PR #310 (endpoint) and PR #309 (design/rationale).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_013L2TkKx2c5z9gcuCRpMq5d
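The open/close rule in the commit message above can be sketched as a small standalone class. This is a minimal illustration, not the ConversationMachine code: `ResumeWindow`, `TurnSnapshot`, `TurnOutcome`, `PiThinkingExit` and the `response_started` / `p_finished_speaking` / `session_id` field names are assumptions made for the example (only `trigger`, `user_resumed` and `last_sequence_number` appear in this PR), and timestamps are left out for brevity.

```typescript
// Illustrative sketch only: all type and class names here are hypothetical.

/** The ways a turn can leave `responding.piThinking`, per the commit message. */
type PiThinkingExit = "piWriting" | "piSpeaking" | "userInterrupting" | "timeout";

/** Correlation data snapshotted when the resume window opens. */
interface TurnSnapshot {
  sessionId: string;
  lastSequenceNumber: number;
  pFinishedSpeaking: number | null; // optional score for that sequence number
  isMaintenance: boolean;           // suppressed buffer flush, not a real decision
}

/** Text-free outcome: booleans, a sequence number and an optional score. */
interface TurnOutcome {
  trigger: "auto";
  session_id: string;               // assumed field name
  last_sequence_number: number;
  p_finished_speaking: number | null; // assumed field name
  user_resumed: boolean;
  response_started: boolean;        // assumed field name
}

class ResumeWindow {
  private pending: TurnSnapshot | null = null;

  /** Window open: called on the endpointing auto-submit path. */
  open(snapshot: TurnSnapshot): void {
    // Maintenance messages are excluded, so they never leave a snapshot behind.
    this.pending = snapshot.isMaintenance ? null : snapshot;
  }

  /** Window close: called on the single transition out of piThinking. */
  close(exit: PiThinkingExit): TurnOutcome | null {
    const snapshot = this.pending;
    this.pending = null; // consume the snapshot so at most one event fires per turn
    if (snapshot === null) return null;
    return {
      trigger: "auto",
      session_id: snapshot.sessionId,
      last_sequence_number: snapshot.lastSequenceNumber,
      p_finished_speaking: snapshot.pFinishedSpeaking,
      user_resumed: exit === "userInterrupting",
      response_started: exit === "piWriting" || exit === "piSpeaking",
    };
  }
}
```

Consuming the snapshot inside `close()` is one way to get the "exactly one event per turn" property; the real machine may enforce it differently.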
Closes #505.
What & why
Reports a text-free turn-outcome to the SayPi API (`POST /turn-outcome`) after every endpointing-driven auto-submit, so the server can score how good `pFinishedSpeaking` is at deciding when a user's turn ended. The only non-circular ground truth for "did the user actually finish?" is whether they started speaking again inside the resume window (submit → assistant response starts); a resume means we cut them off. API side is saypi-api PR #310; design/rationale in PR #309.

Approach
The whole window already lives in `ConversationMachine`, so this rides existing transitions rather than inventing detection (no VAD/DOM/observer changes):

- Window open: `converting.submitting` snapshots the correlation data (session id, last sequence number + its `pFinishedSpeaking`, speech-offset, app) into context. Only the endpointing auto-submit path reaches this state, so every emit is `trigger:"auto"`; manual sends bypass the machine and emit nothing.
- Window close: the single transition that leaves `responding.piThinking`: `piWriting`/`piSpeaking` (response started), `userInterrupting` (user resumed first, a false finish), or the `PI_THINKING_TIMEOUT_MS` fallback (response never started). `piThinking` is entered/left once per turn, so exactly one event fires.
- `postTurnOutcome()`: a thin, fire-and-forget POST in a new `TurnOutcomeModule`, reusing the shared `callApi` client (JWT/anon + CSP-safe background routing + 401-retry). Never awaited, never retried, never throws to the machine; a dropped event just leaves that turn to the server-side fallback. Payload is booleans, timestamps, a sequence number and an optional score, with no transcript text.

Founder-approved v1 decisions
- … `piThinking` isn't captured, an accepted optimism bias for v1.
- `last_speech_ended_at`: uses `timeUserStoppedSpeaking` (same clock as `submitted_at`, so latency subtraction is skew-free).
- `last_sequence_number`: highest sequence number assigned in the turn, paired with its `pFinishedSpeaking` from the same `/transcribe` response.
- … `piWriting`/`piSpeaking`). With TTS on, this closes at text-appear rather than the spec's preferred TTS-start; spec-fidelity is a documented follow-up (Endpointing eval: close the resume window at TTS-start, not first response signal (follow-up to #505) #510).

Correctness note found in review
Fixed a latent stale-snapshot re-emit: `piThinking` has a second, non-submit entry (the `saypi:piThinking` handler in `listening`, currently emitter-less but wired). Entering `piThinking` that way now clears `pendingTurnOutcome`, so a prior turn's snapshot can't be re-emitted as a spurious `auto` event. Covered by a regression test.

Known v1 limitations (non-blocking)
- … `userInterrupting → piSpeaking` on `hasNoAudio`) still emits `user_resumed:true`: a slight over-count, defensible under the spec's "VAD detect the user starting to speak" wording.
- … `saypi:userSpeaking` → `userInterrupting` during `responding`) rather than being re-verified end-to-end on a live host.

Tests
- `test/TurnOutcomeModule.spec.ts`: POST body/URL, ISO timestamps, privacy (no text), skip-without-correlation-key.
- `test/state-machines/ConversationMachine-turnOutcome.spec.ts`: response-started (piWriting & piSpeaking), resume/false-finish, timeout, maintenance exclusion, stale-snapshot regression.

🤖 Generated with Claude Code
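For readers without the diff open, the behaviours exercised by the module spec (ISO timestamps, no text fields, skipping when the correlation key is missing, fire-and-forget delivery) might look roughly like the sketch below. It is written under assumptions: `buildBody`, `postOutcome`, `OutcomeInput` and the injected `Send` function are hypothetical stand-ins, the real `postTurnOutcome()` and `callApi` signatures are not shown in this description, and treating session id plus sequence number as the correlation key is a guess.

```typescript
// Sketch under stated assumptions: names and field layout are illustrative only.

interface OutcomeInput {
  sessionId?: string;          // assumed part of the correlation key
  lastSequenceNumber?: number; // assumed part of the correlation key
  pFinishedSpeaking?: number;  // optional score
  userResumed: boolean;
  lastSpeechEndedAtMs: number; // epoch ms, same clock as submittedAtMs
  submittedAtMs: number;       // epoch ms
}

// Only strings, numbers and booleans: there is no slot for transcript text.
type OutcomeBody = Record<string, string | number | boolean>;

/** Hypothetical stand-in for the shared API client. */
type Send = (path: string, body: OutcomeBody) => Promise<unknown>;

/** Builds the POST body, or returns null when the turn cannot be correlated. */
function buildBody(input: OutcomeInput): OutcomeBody | null {
  if (!input.sessionId || input.lastSequenceNumber === undefined) return null;
  const body: OutcomeBody = {
    trigger: "auto",
    session_id: input.sessionId,
    last_sequence_number: input.lastSequenceNumber,
    user_resumed: input.userResumed,
    last_speech_ended_at: new Date(input.lastSpeechEndedAtMs).toISOString(),
    submitted_at: new Date(input.submittedAtMs).toISOString(),
  };
  if (input.pFinishedSpeaking !== undefined) {
    body["p_finished_speaking"] = input.pFinishedSpeaking;
  }
  return body;
}

/** Fire-and-forget: not awaited, not retried, and failures are swallowed. */
function postOutcome(input: OutcomeInput, send: Send): boolean {
  const body = buildBody(input);
  if (body === null) return false; // skipped: nothing to correlate against
  try {
    void send("/turn-outcome", body).catch(() => undefined);
  } catch {
    // A synchronous failure in the client must not reach the caller either.
  }
  return true;
}
```

Because both timestamps come from the same clock, a consumer can subtract `last_speech_ended_at` from `submitted_at` directly to get the endpointing latency.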