[worker] Force-exit gevent test process after run to fix integration CI hang#369
Closed
haileyok wants to merge 1 commit into
Closed
[worker] Force-exit gevent test process after run to fix integration CI hang#369haileyok wants to merge 1 commit into
haileyok wants to merge 1 commit into
Conversation
Contributor
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
be519ac to
86b5a53
Compare
c0f1207 to
7fdc150
Compare
The integration-tests job runs pytest under gevent monkey-patching (python -m gevent.monkey --module pytest). After the suite finishes, the process stalls inside gevent's interpreter finalization on the CI runner and never exits, even though all tests passed and the junit report is written, so the job hangs until the 30-min timeout. The stall is timing/environment-specific (only the constrained CI runner; never locally even with a faithful full stack) and lives in gevent's own shutdown machinery rather than a cleanly disposable resource. Add a trylast pytest_sessionfinish that os._exit(exitstatus) once pytest has finished and junit is flushed, but only when gevent is patched, so the gevent test process skips the deadlocking finalization while a plain pytest run finalizes normally.
7fdc150 to
212ed87
Compare
This was referenced Jun 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The
Integration Testsjob started hanging (cancelled at the 30-min timeout) onmainafter #330 (Add stress.consumer). pytest completes —1096 passed ... in ~59s, junit written — but the process never exits, sodocker compose runnever returns and the job runs out the clock.Root cause
The test runner launches pytest under gevent monkey-patching (
entrypoint.sh→python -m gevent.monkey --module pytest). After the suite finishes, the process stalls inside gevent's own interpreter finalization on the CI runner and never exits. Afaulthandlerdump captured during the live hang shows the process parked in gevent's shutdown machinery (the gevent hub plus idle threadpool workers); there is no application thread holding ajoin/_shutdown.It is timing/environment-specific — it deadlocks on the constrained CI runner but exits cleanly on a fast multi-core box, so it cannot be reproduced locally even with a faithful full-stack run. #330 adds no native threads (kafka-python is pure Python); it perturbs scheduling just enough to make this latent stall deterministic in CI, which is why #328 and #329 happened to win the shutdown race and pass.
This is the well-known "pytest under
python -m gevent.monkey --module pytestwon't finalize after a fully-successful run" class of problem. The tests all run and report; only interpreter teardown deadlocks.Is this a production bug?
No — it's effectively a test/CI-only problem, and this fix is test-scoped. The deadlock happens during clean Python interpreter shutdown under gevent. The test runner does exactly that (run pytest, finish, exit). Production services (
osprey-worker,osprey-ui-api) are long-lived daemons — they run the gevent loop indefinitely and are torn down by SIGTERM/SIGKILL/container stop, not by a gracefulPy_Finalize, so the deadlocking path isn't normally exercised in prod. (This is an inference from the service lifecycle, not a proof: a process that did a clean interpreter exit and relied on atexit/finalization could in principle hit the same gevent stall — but that isn't the normal lifecycle, and this PR changes no production behavior.)Bisect
On
main, the push for #362 passed in ~4 min; #330 sits directly on top adding onlyconsumer.py+test_consumer.py, and its push hung for 30 min. The #330 PR branch and the follow-up CLI PR (#367) reproduced the same signature.Fix
Add a
@pytest.hookimpl(trylast=True) pytest_sessionfinishtoosprey_worker/conftest.pythatos._exit(exitstatus)once pytest has finished and the junit report is flushed — but only when gevent is monkey-patched (monkey.is_module_patched('socket')), so a plainpytestrun finalizes normally. This skips the deadlocking gevent finalization without changing any test behavior or any production code.Changes Made
osprey_worker/conftest.py:trylastpytest_sessionfinishthat flushes stdout/stderr andos._exit(exitstatus)under the gevent test runner.Confidence Level
Confidence Level: Claude
Testing
python -m gevent.monkey --module pytest, the process exits with the correct status and the junit report is fully written before exit; a plainpytestrun is unaffected (the hook returns early).uv run ruff check,uv run ruff format --diff, anduv run mypypass on the changed file.Investigation notes (for reviewers)
Three earlier theories were tried and ruled out by full CI runs before landing on the teardown stall:
init_gevent(mirroringosprey.worker.lib.patcher.patch_all) — wrong subsystem; no change.pytest_sessionfinish— dropped the at-exit thread count from ~69 to 2 locally, still hung in CI.OspreyEngine's per-instance compilationThreadPool— dropped it to 3 locally, still hung in CI.A
faulthandlerdump from the live CI hang is what showed the threadpool workers were a symptom, not the blocker: the stall is in gevent's own finalization, which is why only a hard exit after the (already-complete, already-reported) run resolves it.Notes / follow-ups
os._exitpreempts pytest's cosmetic terminal summary line (e.g.1096 passed); the junit report and process exit code are intact, so CI results/artifacts are unaffected.OspreyEngine.__init__creates agevent.threadpool.ThreadPool(maxsize=1)and never disposes it. Fine for the long-lived production singleton (one idle worker for the process's life), but tests build many engines and leak a worker thread each.Checklist
uv run ruff check .passes (no unused imports or other lint errors)uv tool run fawltydeps --check-unused --pyenv .venvpasses (no unused dependencies)CHANGELOG.mdwith my changes, if applicable (N/A — test-runner teardown fix)