feat: task status polling with runtime unreachable detection#501

Merged
hhhhsc701 merged 1 commit into main from fix/ray
Jun 8, 2026
@MoeexT MoeexT commented Jun 8, 2026

  • backend-python: global polling coroutine polls all RUNNING tasks every 2s
  • backend-python: mark task FAILED when runtime unreachable (httpx 60s timeout)
  • backend-python: singleton scheduler + startup() recovers RUNNING tasks on restart
  • runtime: Ray job connection failure counter (5 fails → FAILED)
  • runtime: stall detection (3600s no log progress → FAILED)
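
For readers without the diff open, here is a rough sketch of how the checks listed above could fit together. It is not the code merged in this PR: `JobHealth`, `poll_once`, `poll_forever`, and `fetch_status` are hypothetical names, built-in exceptions stand in for whatever httpx raises on the 60s timeout, and treating the failure counter as consecutive (reset on a successful connection) is an assumption. Only the 2s, 5-failure, and 3600s values come from the description.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict

POLL_INTERVAL_S = 2.0       # backend polls RUNNING tasks every 2s
MAX_CONNECT_FAILURES = 5    # runtime: 5 connection failures -> FAILED
STALL_TIMEOUT_S = 3600.0    # runtime: 3600s without log progress -> FAILED


@dataclass
class JobHealth:
    """Hypothetical runtime-side tracker for a single Ray job."""
    last_progress_at: float
    consecutive_failures: int = 0

    def record_connect_failure(self) -> None:
        self.consecutive_failures += 1

    def record_success(self, now: float, made_progress: bool) -> None:
        # Assumption: a successful connection resets the failure counter.
        self.consecutive_failures = 0
        if made_progress:
            self.last_progress_at = now

    def should_fail(self, now: float) -> bool:
        too_many_failures = self.consecutive_failures >= MAX_CONNECT_FAILURES
        stalled = now - self.last_progress_at >= STALL_TIMEOUT_S
        return too_many_failures or stalled


FetchStatus = Callable[[str], Awaitable[str]]


async def poll_once(tasks: Dict[str, str], fetch_status: FetchStatus) -> None:
    """Poll each RUNNING task once; mark it FAILED if the runtime is unreachable."""
    for task_id, status in list(tasks.items()):
        if status != "RUNNING":
            continue
        try:
            tasks[task_id] = await fetch_status(task_id)
        except (asyncio.TimeoutError, ConnectionError):
            # Stand-in for the httpx timeout / connection error case.
            tasks[task_id] = "FAILED"


async def poll_forever(tasks: Dict[str, str], fetch_status: FetchStatus,
                       interval: float = POLL_INTERVAL_S) -> None:
    """Single global loop, intended to be started once at application startup."""
    while True:
        await poll_once(tasks, fetch_status)
        await asyncio.sleep(interval)
```

Passing the clock value (`now`) and the status fetcher in as arguments keeps both pieces testable without a live Ray cluster or HTTP server.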

@hhhhsc701 hhhhsc701 merged commit 4d6a24f into main Jun 8, 2026
11 checks passed