[worker] Force-exit gevent test process after run to fix integration CI hang#369

Closed
haileyok wants to merge 1 commit into main from fix-test-gevent-grpc-init

@haileyok haileyok commented Jun 17, 2026

Member

Description

The Integration Tests job started hanging (cancelled at the 30-min timeout) on main after #330 (Add stress.consumer). pytest completes (1096 passed ... in ~59s, junit written), but the process never exits, so docker compose run never returns and the job runs out the clock.

Root cause

The test runner launches pytest under gevent monkey-patching (entrypoint.sh → python -m gevent.monkey --module pytest). After the suite finishes, the process stalls inside gevent's own interpreter finalization on the CI runner and never exits. A faulthandler dump captured during the live hang shows the process parked in gevent's shutdown machinery (the gevent hub plus idle threadpool workers); there is no application thread holding a join/_shutdown.
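
For reviewers who want to capture a similar dump, here is a minimal stdlib-only sketch. It is not necessarily how the dump in this PR was taken; the helper names `arm_hang_dump`, `disarm_hang_dump`, and `dump_all_threads` are hypothetical. `faulthandler` reports OS-level threads, which is why the hub and idle threadpool workers appear in it.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s: float, file=sys.stderr) -> None:
    """If the process is still alive after timeout_s seconds, write every
    thread's stack to `file` and keep running (exit=False)."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file, exit=False)


def disarm_hang_dump() -> None:
    """Cancel a pending dump, e.g. when the process is exiting normally."""
    faulthandler.cancel_dump_traceback_later()


def dump_all_threads(file=sys.stderr) -> None:
    """Write the stacks of all threads to `file` right now.

    `file` must expose a real file descriptor via fileno().
    """
    faulthandler.dump_traceback(file=file, all_threads=True)
```

Calling `arm_hang_dump(120)` late in the session (for example from a session-finish hook) would leave a dump in the job log only if the process fails to exit within two minutes.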

It is timing/environment-specific — it deadlocks on the constrained CI runner but exits cleanly on a fast multi-core box, so it cannot be reproduced locally even with a faithful full-stack run. #330 adds no native threads (kafka-python is pure Python); it perturbs scheduling just enough to make this latent stall deterministic in CI, which is why #328 and #329 happened to win the shutdown race and pass.

This is the well-known "pytest under python -m gevent.monkey --module pytest won't finalize after a fully-successful run" class of problem. The tests all run and report; only interpreter teardown deadlocks.

Is this a production bug?

No — it's effectively a test/CI-only problem, and this fix is test-scoped. The deadlock happens during clean Python interpreter shutdown under gevent. The test runner does exactly that (run pytest, finish, exit). Production services (osprey-worker, osprey-ui-api) are long-lived daemons — they run the gevent loop indefinitely and are torn down by SIGTERM/SIGKILL/container stop, not by a graceful Py_Finalize, so the deadlocking path isn't normally exercised in prod. (This is an inference from the service lifecycle, not a proof: a process that did a clean interpreter exit and relied on atexit/finalization could in principle hit the same gevent stall — but that isn't the normal lifecycle, and this PR changes no production behavior.)

Bisect

On main, the push for #362 passed in ~4 min; #330 sits directly on top adding only consumer.py + test_consumer.py, and its push hung for 30 min. The #330 PR branch and the follow-up CLI PR (#367) reproduced the same signature.

Fix

Add a @pytest.hookimpl(trylast=True) pytest_sessionfinish to osprey_worker/conftest.py that calls os._exit(exitstatus) once pytest has finished and the junit report is flushed, but only when gevent is monkey-patched (monkey.is_module_patched('socket')), so a plain pytest run finalizes normally. This skips the deadlocking gevent finalization without changing any test behavior or any production code.

Changes Made

  • osprey_worker/conftest.py: trylast pytest_sessionfinish that flushes stdout/stderr and calls os._exit(exitstatus) under the gevent test runner.

Confidence Level

Confidence Level: Claude

Testing

  • CI (the real validation): integration-tests now passes (~4.5 min) instead of hanging; all other checks (python/rust/ui-quality, CodeQL, zizmor) pass.
  • Locally: under python -m gevent.monkey --module pytest, the process exits with the correct status and the junit report is fully written before exit; a plain pytest run is unaffected (the hook returns early).
  • uv run ruff check, uv run ruff format --diff, and uv run mypy pass on the changed file.
  • The hang is CI-environment-specific and could not be reproduced locally even with a faithful isolated full-stack run, so CI on this PR is the end-to-end check.

Investigation notes (for reviewers)

Three earlier theories were tried and ruled out by full CI runs before landing on the teardown stall:

  1. grpc init_gevent (mirroring osprey.worker.lib.patcher.patch_all) — wrong subsystem; no change.
  2. Kill leaked gevent threadpools at pytest_sessionfinish — dropped the at-exit thread count from ~69 to 2 locally, still hung in CI.
  3. Dispose OspreyEngine's per-instance compilation ThreadPool — dropped it to 3 locally, still hung in CI.

A faulthandler dump from the live CI hang is what showed the threadpool workers were a symptom, not the blocker: the stall is in gevent's own finalization, which is why only a hard exit after the (already-complete, already-reported) run resolves it.
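
To make "hard exit" concrete, the stdlib-only sketch below contrasts `os._exit` with a normal `sys.exit` in a child process. It is an illustration and not part of this PR, and the names `CHILD` and `run_child` are hypothetical. Both paths report the same exit code, but only the normal path runs atexit handlers and interpreter finalization, which is the phase that stalls under gevent in CI.

```python
import subprocess
import sys
import textwrap

# Child program: registers an atexit handler, then exits either the hard way
# (os._exit) or the normal way (sys.exit), depending on its first argument.
CHILD = textwrap.dedent(
    """
    import atexit, os, sys
    atexit.register(lambda: print("atexit ran", flush=True))
    print("work done", flush=True)
    if sys.argv[1] == "hard":
        os._exit(3)  # skips atexit handlers and interpreter finalization
    sys.exit(3)      # normal path: atexit handlers and finalization run
    """
)


def run_child(mode: str) -> tuple[int, str]:
    """Run the child with mode 'hard' or 'soft'; return (exit code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    for mode in ("hard", "soft"):
        print(mode, run_child(mode))
```

In both modes the exit code is 3 and "work done" is printed; "atexit ran" appears only in the soft mode.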

Notes / follow-ups

  • os._exit preempts pytest's cosmetic terminal summary line (e.g. 1096 passed); the junit report and process exit code are intact, so CI results/artifacts are unaffected.
  • Separately spotted (not the cause of this hang, harmless in prod, candidate for its own PR): OspreyEngine.__init__ creates a gevent.threadpool.ThreadPool(maxsize=1) and never disposes it. Fine for the long-lived production singleton (one idle worker for the process's life), but tests build many engines and leak a worker thread each.

Checklist

  • Tests pass locally
  • uv run ruff check . passes (no unused imports or other lint errors)
  • uv tool run fawltydeps --check-unused --pyenv .venv passes (no unused dependencies)
  • Updated CHANGELOG.md with my changes, if applicable (N/A — test-runner teardown fix)

coderabbitai Bot commented Jun 17, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@haileyok haileyok force-pushed the fix-test-gevent-grpc-init branch from be519ac to 86b5a53 Compare June 17, 2026 16:13
@haileyok haileyok changed the title [worker] Init gRPC-gevent in test bootstrap to fix CI hang [worker] Dispose leaked gevent threadpools to fix integration CI hang Jun 17, 2026
@haileyok haileyok force-pushed the fix-test-gevent-grpc-init branch 2 times, most recently from c0f1207 to 7fdc150 Compare June 17, 2026 17:21
@haileyok haileyok changed the title [worker] Dispose leaked gevent threadpools to fix integration CI hang [worker] Dispose engine compilation threadpool to fix integration CI hang Jun 17, 2026
The integration-tests job runs pytest under gevent monkey-patching (python -m gevent.monkey --module pytest). After the suite finishes, the process stalls inside gevent's interpreter finalization on the CI runner and never exits, even though all tests passed and the junit report is written, so the job hangs until the 30-min timeout. The stall is timing/environment-specific (only the constrained CI runner; never locally even with a faithful full stack) and lives in gevent's own shutdown machinery rather than a cleanly disposable resource. Add a trylast pytest_sessionfinish that os._exit(exitstatus) once pytest has finished and junit is flushed, but only when gevent is patched, so the gevent test process skips the deadlocking finalization while a plain pytest run finalizes normally.
@haileyok haileyok force-pushed the fix-test-gevent-grpc-init branch from 7fdc150 to 212ed87 Compare June 17, 2026 17:35
@haileyok haileyok changed the title [worker] Dispose engine compilation threadpool to fix integration CI hang [worker] Force-exit gevent test process after run to fix integration CI hang Jun 17, 2026
@haileyok haileyok closed this Jun 17, 2026
@haileyok haileyok deleted the fix-test-gevent-grpc-init branch June 18, 2026 03:48