NIXL EP: fix rank removal by podkidyshev · Pull Request #913 · NVIDIA/cloudai

podkidyshev · 2026-06-05T11:22:54Z

Summary

Handle planned rank removal gracefully when it surfaces as a Slurm CANCELLED state with exit code 143.

Test Plan

Automated CI
Manual runs

Additional Notes

coderabbitai · 2026-06-05T11:23:00Z

📝 Walkthrough

NixlEP workload testing now suppresses failure reporting for planned srun terminations when rank removal is configured. Detection logic identifies planned termination patterns in logs, command generation produces launcher scripts that conditionally tolerate exit code 143, and tests validate both the log detection and generated script behavior.
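For context on the number itself: 143 is the conventional shell exit status for a process killed by SIGTERM (128 + 15). The sketch below models, in Python, how a launcher could fold worker exit codes into one return code while tolerating 143. It is an illustration only; `fold_exit_codes` and its `final_phase_completed` argument are hypothetical names, and the real launcher is a shell script whose exact control flow is not shown in this PR page.

```python
import signal

# Shells report "killed by signal N" as 128 + N; SIGTERM is 15, hence 143.
SIGTERM_EXIT_CODE = 128 + int(signal.SIGTERM)


def fold_exit_codes(exit_codes, allow_planned_removal_143, final_phase_completed=True):
    """Hypothetical model of a wait loop that tolerates planned SIGTERM exits.

    exit_codes: worker exit statuses in the order they are reaped.
    allow_planned_removal_143: mirrors the flag named in the walkthrough.
    final_phase_completed: assumed outcome of the follow-up completion wait.
    """
    rc = 0
    for code in exit_codes:
        if code == 0:
            continue
        if allow_planned_removal_143 and code == SIGTERM_EXIT_CODE:
            continue  # planned removal: rc stays 0
        rc = code  # any other non-zero status is a real failure
        break
    # If ranks were removed on purpose but the last phase never finished,
    # report 143 after all (assumed reading of the walkthrough).
    if rc == 0 and allow_planned_removal_143 and not final_phase_completed:
        rc = SIGTERM_EXIT_CODE
    return rc
```

With tolerance enabled, `fold_exit_codes([0, 143, 0], True)` yields 0, while the same codes with tolerance disabled yield 143.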

Changes

Planned Rank Removal Detection and Exit Code Handling

Layer / File(s)	Summary
Log termination detection in NixlEPTestDefinition `src/cloudai/workloads/nixl_ep/nixl_ep.py`	New static method detects Slurm srun termination patterns via regex, new instance method inspects plan phases for negative ranks, and `_scan_log_for_failures` now returns `None` (non-failure) when both planned-termination conditions match.
Command generation strategy for planned rank removal `src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py`	Helper method `_has_planned_rank_removal()` checks for negative-rank phases. Refactored `_wait_for_workers_lines()` from classmethod to instance method; conditionally enables exit code 143 tolerance and injects final-phase completion waits when planned rank removal is detected.
Reference launcher script exit code handling `tests/ref_data/nixl-ep-launch.sh`	`wait -n` loop preserves `rc=0` when exit code 143 occurs under `allow_planned_removal_143=1`. Conditional follow-up waits for phase 3 completion; sets `rc=143` if that wait fails.
Command generation tests for planned rank removal `tests/workloads/nixl_ep/test_command_gen_strategy_slurm.py`	Two new tests validate that planned-rank-removal plans generate `allow_planned_removal_143=1` with `wait_for_phase_completion "3"` and associated `
Job status evaluation tests for srun terminations `tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py`	New test verifies planned srun termination messages are ignored when benchmark output exists (success status). Another test verifies unplanned srun termination is reported as failure with "srun failure" indicator.
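The "negative-rank phases" check mentioned for both `nixl_ep.py` and `_has_planned_rank_removal()` can be pictured with a small sketch. The plan layout used here (a list of phases, each a list of integer ranks, with a negative value marking a rank to remove) is an assumption made for illustration; the PR page does not show the actual plan format or the method body.

```python
from typing import Sequence


def has_planned_rank_removal(plan: Sequence[Sequence[int]]) -> bool:
    """Return True if any phase lists a negative rank.

    Assumed plan layout (hypothetical): one inner sequence per phase,
    where a negative integer marks a rank scheduled for removal.
    """
    return any(rank < 0 for phase in plan for rank in phase)


# Hypothetical plans: the second one removes a rank in its second phase.
steady_plan = [[0, 1, 2, 3], [0, 1, 2, 3]]
removal_plan = [[0, 1, 2, 3], [0, 1, 2, -3]]
```

Under this assumption, only `removal_plan` would switch on the 143 tolerance and the final-phase wait.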

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

srivatsankrishnan
jeffnvidia
amaslenn

Poem

🐰 When ranks are planned to bid adieu,
The waiting scripts now know what's true:
A 143 exit code, once deemed a fall,
Becomes a peaceful planned curtain call.
The logs now speak with pattern clear,
And tests confirm there's naught to fear!

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'NIXL EP: fix rank removal' clearly describes the main change—handling rank removal in the NIXL EP workload. It accurately summarizes the primary objective of the changeset.
Description check	✅ Passed	The description is directly related to the changeset, explaining how rank removal is handled when it appears as a SLURM CANCELLED event with exit code 143, which aligns with the changes across multiple files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/nixl_ep/nixl_ep.py`:
- Around line 153-160: The current _looks_like_planned_srun_termination function
is too strict because it requires every non-empty line to match the allowed srun
patterns; instead, filter the non-empty lines to only those that begin with
"srun:" and return True if there is at least one such srun line and every srun
line matches one of the allowed_patterns (i.e., change the logic to operate only
on srun-prefixed lines rather than all lines). Update any downstream usage (the
suppression check that currently calls _looks_like_planned_srun_termination) to
rely on this revised behavior so that extra non-srun lines in stderr no longer
prevent planned-termination suppression.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2ae21a5b-da73-4902-a046-97b7d166643a

📥 Commits

Reviewing files that changed from the base of the PR and between d999b13 and 70e71e6.

📒 Files selected for processing (5)

src/cloudai/workloads/nixl_ep/nixl_ep.py
src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py
tests/ref_data/nixl-ep-launch.sh
tests/workloads/nixl_ep/test_command_gen_strategy_slurm.py
tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py

coderabbitai · 2026-06-05T11:30:13Z

+    def _looks_like_planned_srun_termination(content: str) -> bool:
+        allowed_patterns = (
+            re.compile(r"^srun: error: .+: task \d+: Terminated$"),
+            re.compile(r"^srun: Terminating StepId=\S+$"),
+            re.compile(r"^srun: Force Terminated StepId=\S+$"),
+        )
+        lines = [line.strip() for line in content.splitlines() if line.strip()]
+        return bool(lines) and all(any(pattern.match(line) for pattern in allowed_patterns) for line in lines)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Planned-termination detection is too strict and can still fail valid planned-rank-removal runs.

_looks_like_planned_srun_termination (Line 160) requires every non-empty line to match one of three srun patterns. If stderr.txt includes any additional non-srun line, Line 171 won’t suppress, and the generic "srun: error:" matcher later reports failure. This can misclassify valid planned terminations.
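The strictness is easy to reproduce in isolation. The sketch below copies the function from the diff above as a module-level function and feeds it two synthetic inputs. The sample lines are constructed to fit the regexes, not captured from a real run, and the node name and step ID are made up.

```python
import re


def looks_like_planned_srun_termination(content: str) -> bool:
    # Same logic as the method in the diff, lifted out of the class.
    allowed_patterns = (
        re.compile(r"^srun: error: .+: task \d+: Terminated$"),
        re.compile(r"^srun: Terminating StepId=\S+$"),
        re.compile(r"^srun: Force Terminated StepId=\S+$"),
    )
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    return bool(lines) and all(any(p.match(line) for p in allowed_patterns) for line in lines)


# Synthetic stderr contents (hypothetical node name and step ID).
clean = (
    "srun: error: node01: task 3: Terminated\n"
    "srun: Terminating StepId=123.0\n"
    "srun: Force Terminated StepId=123.0\n"
)
noisy = "unrelated warning from another tool\n" + clean

print(looks_like_planned_srun_termination(clean))
print(looks_like_planned_srun_termination(noisy))
```

The first call accepts the input; the second rejects it solely because of the extra non-srun line, which is the case the comment is concerned about.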

Proposed fix

     @staticmethod
     def _looks_like_planned_srun_termination(content: str) -> bool:
         allowed_patterns = (
             re.compile(r"^srun: error: .+: task \d+: Terminated$"),
             re.compile(r"^srun: Terminating StepId=\S+$"),
             re.compile(r"^srun: Force Terminated StepId=\S+$"),
         )
-        lines = [line.strip() for line in content.splitlines() if line.strip()]
-        return bool(lines) and all(any(pattern.match(line) for pattern in allowed_patterns) for line in lines)
+        lines = [line.strip() for line in content.splitlines() if line.strip()]
+        srun_lines = [line for line in lines if line.startswith("srun:")]
+        if not srun_lines:
+            return False
+
+        return (
+            any(allowed_patterns[0].match(line) for line in srun_lines)
+            and any(allowed_patterns[1].match(line) for line in srun_lines)
+            and any(allowed_patterns[2].match(line) for line in srun_lines)
+            and all(any(pattern.match(line) for pattern in allowed_patterns) for line in srun_lines)
+        )

Also applies to: 171-172
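Note that the proposed diff and the prose description of the fix are not quite the same rule: the prose asks only that at least one srun line exists and every srun line matches an allowed pattern, whereas the diff additionally requires each of the three patterns to appear at least once. The sketch below implements both readings side by side so the difference can be checked. The function names and the sample inputs are hypothetical, and neither variant is confirmed as what the PR finally merged.

```python
import re

ALLOWED_PATTERNS = (
    re.compile(r"^srun: error: .+: task \d+: Terminated$"),
    re.compile(r"^srun: Terminating StepId=\S+$"),
    re.compile(r"^srun: Force Terminated StepId=\S+$"),
)


def _srun_lines(content: str) -> list:
    # Keep only non-empty lines that start with the srun prefix.
    stripped = (line.strip() for line in content.splitlines())
    return [line for line in stripped if line.startswith("srun:")]


def _all_allowed(lines: list) -> bool:
    return all(any(p.match(line) for p in ALLOWED_PATTERNS) for line in lines)


def relaxed_check(content: str) -> bool:
    """Prose reading: at least one srun line, and every srun line is allowed."""
    lines = _srun_lines(content)
    return bool(lines) and _all_allowed(lines)


def diff_check(content: str) -> bool:
    """Diff reading: as above, plus each pattern must match at least one line."""
    lines = _srun_lines(content)
    if not lines:
        return False
    every_pattern_seen = all(any(p.match(line) for line in lines) for p in ALLOWED_PATTERNS)
    return every_pattern_seen and _all_allowed(lines)


# Synthetic inputs (hypothetical node name): only one kind of srun line appears.
partial = "unrelated warning\nsrun: error: node01: task 3: Terminated\n"
unplanned = "srun: error: node01: task 3: Exited with exit code 1\n"
```

On `partial`, `relaxed_check` accepts and `diff_check` rejects; on `unplanned`, both reject. Which behavior is wanted depends on whether a planned removal always produces all three message kinds, which the PR page does not establish.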

fix nixlep rank removal

70e71e6

podkidyshev self-assigned this Jun 5, 2026

podkidyshev marked this pull request as ready for review June 5, 2026 11:23

podkidyshev requested review from jeffnvidia and srivatsankrishnan as code owners June 5, 2026 11:23

coderabbitai Bot reviewed Jun 5, 2026

podkidyshev wants to merge 1 commit into main from ipod/nixlep-rank-removal-2
