Skip to content

fix(cli): hard-exit past finalization to avoid lancedb segfault (139)#82

Merged
cyfyifanchen merged 2 commits into
mainfrom
fix/tui_exit_sigsegv
Jul 3, 2026
Merged

fix(cli): hard-exit past finalization to avoid lancedb segfault (139)#82
cyfyifanchen merged 2 commits into
mainfrom
fix/tui_exit_sigsegv

Conversation

@JasonJarvan

@JasonJarvan JasonJarvan commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Change description

raven tui segfaulted on exit (exit code 139 / SIGSEGV) after Ctrl+C, failing
expect_exit(0) for the ENTIRE TUI e2e suite (test_e2e_ctrl_c both variants,
test_e2e_tui_status_slash) and gating live-green for all TUI e2e.

Root cause (verified — not a build artifact; reproduces on a clean build and
with no node child / no TTY): any command that builds the agent loop opens a
lancedb-backed store, and lancedb starts a process-global Rust/tokio background
thread (LanceDBBackgroundEventLoop) that is a daemon thread with no public
shutdown hook. A normal CPython interpreter finalization tears down while that
native runtime is still live and segfaults. The crash is 100% in Py_FinalizeEx
(both "loop built" and "reached end of main" print before the fault; faulthandler
shows <no Python frame> with the pyarrow/numpy lance stack loaded). It is
independent of the long-term memory backend (reproduces with memory.backend
disabled too).

Because lancedb has no clean shutdown, the fix guards the single CLI exit
chokepoint rather than one exit path: raven.cli.commands.run (the
console-script entry) runs the Typer app, and on the way out — when the lancedb
thread is live — flushes stdio + loguru and os._exits past finalization;
otherwise it exits normally. Every command that builds the loop is covered at
once, so this also subsumes and removes an ad-hoc per-command os._exit that
already existed for the agent headless one-shot. CliRunner invokes the Typer
app object directly and never reaches the wrapper, so in-process test hosts keep
normal exit semantics.

Verification: real PTY-driven raven tui + idle Ctrl+C and cancel-then-exit
Ctrl+C both exit 0 (were 139); --help / --version and other no-loop commands
still exit normally; new subprocess regression test builds the real loop, asserts
the hazard gate fires, and confirms a clean guarded exit (with a companion test
that the crash reproduces under normal finalization, so the guard stays
load-bearing); the full in-process CliRunner TUI + CLI-smoke suites pass.

Type of change

  • Bug fix
  • New feature
  • Document
  • Others

Related issues (if there is)

N/A

Checklists

Development

  • Lint rules pass locally
  • Application changes have been tested thoroughly
  • Automated tests covering modified code pass

Security

  • Security impact of change has been considered
  • Code follows security best practices and guidelines

Code review

  • Pull request has a descriptive title and context useful to a reviewer. Screenshots or screencasts are attached as necessary

@JasonJarvan JasonJarvan force-pushed the fix/tui_exit_sigsegv branch from 190ab8c to 5178062 Compare July 2, 2026 13:52
@JasonJarvan JasonJarvan changed the title fix(cli): hard-exit raven tui to avoid lancedb finalization segfault fix(cli): hard-exit past finalization to avoid lancedb segfault (139) Jul 2, 2026
@JasonJarvan JasonJarvan requested a review from arelchan July 2, 2026 14:06
arelchan
arelchan previously approved these changes Jul 2, 2026
Sheng Zhao and others added 2 commits July 3, 2026 07:54
Any raven command that builds the agent loop opens a lancedb-backed store;
lancedb starts a process-global Rust/tokio background thread
(LanceDBBackgroundEventLoop) with no shutdown hook. Normal CPython
interpreter finalization races that live native runtime and segfaults
(exit 139), masking the command's real exit code and failing expect_exit(0)
for the whole TUI e2e suite.

Guard once at the console-script entry (raven.cli.commands.run): when the
lancedb thread is live, flush stdio + loguru and os._exit past finalization;
otherwise exit normally. CliRunner invokes the Typer app directly and never
reaches this wrapper, so in-process test hosts keep normal exit semantics.
Subsumes and removes the ad-hoc per-command os._exit in agent_commands.

Add a subprocess regression test that builds the real agent loop, asserts the
hazard gate fires, and checks a clean exit via the guarded path.

Co-authored-by: Claude (claude-opus-4-8) <[email protected]>
The pre-commit ruff-format hook collapses two multi-line assert messages
that fit within the 120-col limit. Formatting only; no behavior change.

Co-authored-by: Claude (claude-opus-4-8) <[email protected]>
@JasonJarvan JasonJarvan force-pushed the fix/tui_exit_sigsegv branch from e2437db to 90fde8b Compare July 3, 2026 07:56
@cyfyifanchen cyfyifanchen merged commit f827c00 into main Jul 3, 2026
7 checks passed
@cyfyifanchen cyfyifanchen deleted the fix/tui_exit_sigsegv branch July 3, 2026 07:57
cyfyifanchen pushed a commit that referenced this pull request Jul 3, 2026
## Summary

The commit-message lint rejects any `Merge ...` commit in the PR range
(`scripts/commit_lint.py`: "merge commits are not allowed in PR
ranges").
But this repository is squash-only (rebase/merge disabled), so a merge
commit created by syncing `main` into a feature branch is squashed away
at
merge time and never reaches `main`. The rule therefore guards nothing
while
failing the CI check for every PR that synced `main` via a merge commit
(observed on #82).

The fix makes `scripts/check_commit_messages.py` gather only non-merge
commits by passing `--no-merges` to its `git log` call, mirroring
commitlint's own default of ignoring merge commits. The `Merge ` rule in
`commit_lint.py` is left intact as belt-and-suspenders (it is also
reused by
the PR-title path), so no lint behavior is removed.

## Type

- [ ] Fix
- [ ] Feature
- [ ] Docs
- [x] CI / tooling
- [ ] Refactor
- [ ] Other

## Verification

- `uv run pytest tests/test_check_commit_messages.py
tests/test_commit_lint.py -q` -> 10 passed
- `uv run ruff check scripts/check_commit_messages.py
tests/test_check_commit_messages.py` -> All checks passed
- Added TDD coverage in `tests/test_check_commit_messages.py`: builds a
real
temporary repo with a `--no-ff` merge commit and asserts the range
passes
  (red before the fix, green after).

- [x] Relevant tests pass locally
- [x] Relevant lint / type checks pass locally
- [ ] User-facing docs or screenshots are updated when needed

## Risk

- [x] Security impact considered
- [x] Backward compatibility considered
- [x] Rollback path is clear for risky changes

Backward compatible: real commit messages are still linted exactly as
before; only merge commits are now skipped. Rollback is a one-line
revert of
the `--no-merges` flag.

## Related Issues

N/A

Co-authored-by: Claude (claude-opus-4-8) <[email protected]>

Co-authored-by: Sheng Zhao <[email protected]>
Co-authored-by: Claude (claude-opus-4-8) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants