NIXL EP: fix rank removal #913
📝 Walkthrough
NixlEP workload testing now suppresses failure reporting for planned srun terminations when rank removal is configured. Detection logic identifies planned-termination patterns in logs, command generation produces launcher scripts that conditionally tolerate exit code 143, and tests validate both the log detection and the generated script behavior.
Changes: Planned Rank Removal Detection and Exit Code Handling
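The exit-code tolerance mentioned in the walkthrough can be sketched roughly as follows. Exit code 143 is the conventional shell status for a process ended by SIGTERM (128 + 15). The PR implements this in a generated launcher script; the Python helper below, including the names `effective_exit_code` and `rank_removal_planned`, is a hypothetical illustration of the idea, not the PR's code.

```python
import signal

# Conventional shell exit status for a process killed by SIGTERM: 128 + 15 = 143.
SIGTERM_EXIT_CODE = 128 + int(signal.SIGTERM)


def effective_exit_code(returncode: int, rank_removal_planned: bool) -> int:
    """Hypothetical helper: treat a SIGTERM exit as success only when
    rank removal was planned; pass every other code through unchanged."""
    if rank_removal_planned and returncode == SIGTERM_EXIT_CODE:
        return 0
    return returncode
```

The tolerance stays conditional, so a 143 from a run without planned rank removal still surfaces as a failure.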
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/nixl_ep/nixl_ep.py`:
- Around line 153-160: The current _looks_like_planned_srun_termination function
is too strict because it requires every non-empty line to match the allowed srun
patterns; instead, filter the non-empty lines to only those that begin with
"srun:" and return True if there is at least one such srun line and every srun
line matches one of the allowed_patterns (i.e., change the logic to operate only
on srun-prefixed lines rather than all lines). Update any downstream usage (the
suppression check that currently calls _looks_like_planned_srun_termination) to
rely on this revised behavior so that extra non-srun lines in stderr no longer
prevent planned-termination suppression.
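A minimal sketch of the revised logic described in the prompt above, assuming the three patterns from the function under review. It is written as a standalone function for illustration (the module-level name `ALLOWED_PATTERNS` is an assumption), and the sample log lines use a made-up node name and step ID.

```python
import re

# Patterns taken from the function under review.
ALLOWED_PATTERNS = (
    re.compile(r"^srun: error: .+: task \d+: Terminated$"),
    re.compile(r"^srun: Terminating StepId=\S+$"),
    re.compile(r"^srun: Force Terminated StepId=\S+$"),
)


def looks_like_planned_srun_termination(content: str) -> bool:
    # Consider only srun-prefixed lines; other stderr output is ignored.
    stripped = (line.strip() for line in content.splitlines())
    srun_lines = [line for line in stripped if line.startswith("srun:")]
    # Require at least one srun line, and every srun line must be an expected one.
    return bool(srun_lines) and all(
        any(pattern.match(line) for pattern in ALLOWED_PATTERNS) for line in srun_lines
    )


# Hypothetical stderr content for illustration only.
noisy = "\n".join([
    "unrelated warning from another tool",
    "srun: error: node-01: task 3: Terminated",
    "srun: Terminating StepId=1234.0",
])
print(looks_like_planned_srun_termination(noisy))
```

This follows the prose description only. Requiring each of the three patterns to appear at least once would be a stricter variant of the same idea.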
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 2ae21a5b-da73-4902-a046-97b7d166643a
📒 Files selected for processing (5)
- src/cloudai/workloads/nixl_ep/nixl_ep.py
- src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py
- tests/ref_data/nixl-ep-launch.sh
- tests/workloads/nixl_ep/test_command_gen_strategy_slurm.py
- tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py
    def _looks_like_planned_srun_termination(content: str) -> bool:
        allowed_patterns = (
            re.compile(r"^srun: error: .+: task \d+: Terminated$"),
            re.compile(r"^srun: Terminating StepId=\S+$"),
            re.compile(r"^srun: Force Terminated StepId=\S+$"),
        )
        lines = [line.strip() for line in content.splitlines() if line.strip()]
        return bool(lines) and all(any(pattern.match(line) for pattern in allowed_patterns) for line in lines)
Planned-termination detection is too strict and can still fail valid planned-rank-removal runs.
`_looks_like_planned_srun_termination` (Line 160) requires every non-empty line to match one of three srun patterns. If `stderr.txt` contains any additional non-srun line, the check at Line 171 does not suppress the failure, and the generic "srun: error:" matcher later reports it. Valid planned terminations can therefore be misclassified as failures.
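The strictness described above can be reproduced in isolation. The sketch below copies the function body under review into a standalone function and runs it on two hypothetical stderr samples; the node name, step ID, and extra warning line are made up for illustration.

```python
import re


def looks_like_planned_srun_termination(content: str) -> bool:
    # Logic as in the function under review: every non-empty line must match.
    allowed_patterns = (
        re.compile(r"^srun: error: .+: task \d+: Terminated$"),
        re.compile(r"^srun: Terminating StepId=\S+$"),
        re.compile(r"^srun: Force Terminated StepId=\S+$"),
    )
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    return bool(lines) and all(any(p.match(line) for p in allowed_patterns) for line in lines)


# Hypothetical stderr containing only the expected srun termination lines.
clean_stderr = "\n".join([
    "srun: error: node-01: task 3: Terminated",
    "srun: Terminating StepId=1234.0",
    "srun: Force Terminated StepId=1234.0",
])
# Same content plus one unrelated, non-srun line.
noisy_stderr = clean_stderr + "\nunrelated warning from another tool\n"

print(looks_like_planned_srun_termination(clean_stderr))  # True
print(looks_like_planned_srun_termination(noisy_stderr))  # False
```

A single unrelated line is enough to flip the result, which is the misclassification the comment points at.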
Proposed fix

     @staticmethod
     def _looks_like_planned_srun_termination(content: str) -> bool:
         allowed_patterns = (
             re.compile(r"^srun: error: .+: task \d+: Terminated$"),
             re.compile(r"^srun: Terminating StepId=\S+$"),
             re.compile(r"^srun: Force Terminated StepId=\S+$"),
         )
    -    lines = [line.strip() for line in content.splitlines() if line.strip()]
    -    return bool(lines) and all(any(pattern.match(line) for pattern in allowed_patterns) for line in lines)
    +    lines = [line.strip() for line in content.splitlines() if line.strip()]
    +    srun_lines = [line for line in lines if line.startswith("srun:")]
    +    if not srun_lines:
    +        return False
    +
    +    return (
    +        any(allowed_patterns[0].match(line) for line in srun_lines)
    +        and any(allowed_patterns[1].match(line) for line in srun_lines)
    +        and any(allowed_patterns[2].match(line) for line in srun_lines)
    +        and all(any(pattern.match(line) for pattern in allowed_patterns) for line in srun_lines)
    +    )

Also applies to: 171-172