Update task status after the task fails #500

Merged
Dallas98 merged 5 commits into main from develop/ray on Jun 8, 2026
MoeexT commented Jun 4, 2026

Contributor
Closes #486

MoeexT added 5 commits June 4, 2026 17:23
When Ray head/worker pods are deleted during task execution:
- The task stays stuck in RUNNING forever and the frontend never updates

Changes:
1. job_task_scheduler.py: Add a connection failure counter (5 retries)
   and stall detection (no log progress for 120s marks the task FAILED)
2. operator_runtime.py: Add a GET /api/task/{id}/status endpoint
   that exposes the RayJobScheduler task status to backend-python
3. cleaning_task_scheduler.py: Add a background polling loop that
   queries the runtime status every 2s and updates the database on
   terminal states (completed/failed/cancelled)
4. operator_runtime.py: get_from_cfg() now supports a default fallback
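
The failure counter and stall detection from item 1 could be sketched roughly as follows. This is a minimal illustration, not the code in job_task_scheduler.py: the `TaskHealthTracker` class, its method names, and the use of log size as the progress signal are all hypothetical, and it assumes the counter tracks consecutive failures and trips on the fifth one. Only the two thresholds (5 and 120s) come from the commit message.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

MAX_CONNECTION_FAILURES = 5     # threshold from the commit message
STALL_TIMEOUT_SECONDS = 120.0   # threshold from the commit message


@dataclass
class TaskHealthTracker:
    """Hypothetical helper tracking connection failures and log progress for one task."""

    clock: Callable[[], float] = time.monotonic  # injectable so it can be tested
    failures: int = 0
    last_log_size: int = 0
    last_progress_at: float = 0.0

    def __post_init__(self) -> None:
        # Start the stall timer when tracking begins.
        self.last_progress_at = self.clock()

    def record_poll(self, log_size: Optional[int]) -> str:
        """Return "RUNNING" or "FAILED" after one poll.

        log_size=None stands for a failed connection to the Ray cluster.
        """
        if log_size is None:
            self.failures += 1
            if self.failures >= MAX_CONNECTION_FAILURES:
                return "FAILED"
            return "RUNNING"

        # Assumption: a successful poll resets the consecutive-failure counter.
        self.failures = 0
        now = self.clock()
        if log_size > self.last_log_size:
            # The log grew, so the task made progress.
            self.last_log_size = log_size
            self.last_progress_at = now
        elif now - self.last_progress_at >= STALL_TIMEOUT_SECONDS:
            return "FAILED"
        return "RUNNING"
```

Passing the clock in as a parameter keeps the 120s rule checkable without sleeping; a scheduler loop would call `record_poll` once per poll and write the FAILED status when it is returned.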
- Add _validate_task_id() with a UUID regex check to all endpoints
  that accept task_id in the URL path (/submit, /status, /stop)
- Add a defense-in-depth check in get_from_cfg() that rejects
  task_id values containing ../ or \
- Add a PARAM_ERROR error code for invalid-parameter responses

Fixes CodeQL high-severity alert: Uncontrolled data used in path expression
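
A UUID check of the kind described above might look like the sketch below. The PR's actual regex and error type are not shown on this page, so the pattern (canonical 8-4-4-4-12 hex form) and the `ParamError` exception standing in for the PARAM_ERROR code are assumptions.

```python
import re

# Assumed pattern: canonical 8-4-4-4-12 hexadecimal UUID text form.
_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


class ParamError(ValueError):
    """Hypothetical stand-in for a PARAM_ERROR response."""


def validate_task_id(task_id: str) -> str:
    """Return task_id unchanged if it is a well-formed UUID, else raise ParamError."""
    # fullmatch anchors both ends, so extra path segments cannot ride along.
    if not isinstance(task_id, str) or _UUID_RE.fullmatch(task_id) is None:
        raise ParamError(f"invalid task_id: {task_id!r}")
    return task_id
```

Because the pattern admits only hex digits and hyphens, a value that passes cannot contain `/`, `\`, or `..`, which is what lets it double as path-traversal protection for URL path parameters.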
Replace the inline '../' check with os.path.normpath plus a startswith
prefix check, which CodeQL recognizes as a sanitization pattern.
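
The normalize-then-check-prefix pattern named above can be illustrated as follows. This is a generic sketch rather than the body of get_from_cfg(): the function name `resolve_under` and its parameters are hypothetical, and it assumes the base directory is not the filesystem root.

```python
import os


def resolve_under(base_dir: str, task_id: str) -> str:
    """Join task_id onto base_dir and reject results that escape base_dir."""
    base = os.path.normpath(os.path.abspath(base_dir))
    # normpath collapses "." and ".." segments before the prefix check runs.
    candidate = os.path.normpath(os.path.join(base, task_id))
    # Compare against base plus a separator so that a sibling directory
    # such as "<base>-evil" does not pass as a prefix match.
    if not candidate.startswith(base + os.sep):
        raise ValueError("task_id escapes the base directory")
    return candidate
```

Checking the normalized result instead of scanning the raw input for '../' also covers absolute paths and inputs like `a/../../x`, which a substring test can miss or over-reject.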
Dallas98 merged commit b22301b into main on Jun 8, 2026
11 checks passed
Development

Successfully merging this pull request may close these issues.

Deleting the ray-cluster-worker and ray-cluster-head containers while a task is running causes the container task to get stuck

2 participants