feat(tools/mcp): MCP server for ModelOpt launcher (OMNIML-5123)#1701
feat(tools/mcp): MCP server for ModelOpt launcher (OMNIML-5123)#1701ChenhanYu wants to merge 5 commits into
Conversation
Phase 1 implementation of the modelopt-mcp design tracked in OMNIML-5123.
Five tools exposed over stdio MCP:
* list_examples — enumerate tools/launcher/examples/ with
model + description metadata
* verify_setup — fail-fast probe (docker daemon + GPU, or
slurm SSH passwordless)
* submit_job — submit a launcher YAML; mode resolved from
mutually-exclusive args (hf_local → Docker,
cluster_host → Slurm). Returns the experiment
id immediately; the actual job runs detached.
* job_status — filesystem-based status from nemo_run
experiment dir (_DONE + status_*.out)
* job_logs — filesystem read of log_<task>.out files,
with optional tail
Design principles (from OMNIML-5123):
1. Single submit_job with mode by args (not separate
submit_docker / submit_slurm tools). Keeps the LLM catalog compact.
2. Filesystem is the source of truth for status + logs. No
in-memory registry; survives MCP server restarts.
3. verify_setup is auto-called by submit_job by default — the
verify probe takes ~1s and saves 30+ seconds of wasted submission
time on bad config. Callers can pass skip_verify=True when they
just probed.
Layout:
tools/mcp/
pyproject.toml # name: modelopt-mcp, console_script
modelopt_mcp/
__init__.py
server.py # FastMCP entry; 5 tool definitions
bridge.py # thin wrapper over launcher's core.py
# + filesystem status/log helpers
tests/test_bridge.py # 19 unit tests, fully hermetic (mocked
# subprocess + tmp_path fixtures)
Install:
# End user (uvx-from-git)
claude mcp add modelopt -- uvx --from \
"git+https://github.com/NVIDIA/Model-Optimizer.git#subdirectory=tools/mcp" \
modelopt-mcp
# Local development
uv pip install -e tools/launcher
uv pip install -e tools/mcp
modelopt-mcp # stdio server
Validation:
* uv pip install -e . succeeds
* 19/19 unit tests pass
* stdio handshake works; tools/list returns all 5 with full schemas
* mode-resolution rejects no-executor + both-executors correctly
* filesystem status correctly classifies done / failed / running
* pre-commit clean (ruff, mypy, bandit, license)
Phase 2 (separate PR): capture experiment_id from Docker subprocess
output, add wait_for_experiment, extract the verify + submit helpers
into a shared lib that nmm-sandbox-mcp can consume for its
internal-ergonomics tools (cluster short-name → factory lookup +
GitLab CI dispatch).
Signed-off-by: Chenhan Yu <[email protected]>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds a new modelopt-mcp MCP server and bridge exposing launcher tools (list_examples, verify_setup, submit_job, job_status, job_logs), package metadata and console script, comprehensive README, unit tests, and CI workflow updates to run the new tests. ChangesMCP Server Implementation for Launcher Tool Exposure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Tracks OMNIML-5123. Companion design (nmm-sandbox-mcp) at omniml/integration/nmm-sandbox#21. |
Adds a new `mcp` job to .github/workflows/unit_tests.yml that mirrors
the existing `launcher` job's shape — installs the sibling
modelopt-launcher editable, installs modelopt-mcp editable, runs
pytest. The job is gated by check-file-changes to skip on unrelated
PRs (matching launcher's pattern).
Three changes:
* paths: add `tools/mcp/**` to the push-trigger watch list so
changes under that subtree fire unit_tests.yml.
* check-file-changes.files: add `tools/mcp/**` to the changed-files
detector so the gated jobs (launcher / mcp / skills) skip on PRs
that touch unrelated areas.
* unit-pr-required-check: add `mcp` to `needs:` and the failure-
aggregator condition so a failing mcp job blocks merge the same
way launcher does.
No other workflows touched. regression_tests.yml is GPU-only and
modelopt-mcp's tests are hermetic (subprocess mocked), so it stays
out of scope.
Signed-off-by: Chenhan Yu <[email protected]>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1701 +/- ##
=======================================
Coverage 77.09% 77.09%
=======================================
Files 511 511
Lines 56176 56176
=======================================
Hits 43310 43310
Misses 12866 12866
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Adds tools/mcp/README.md so a developer landing in the directory has
a single-file entry into:
* What modelopt-mcp is (thin MCP wrapper over tools/launcher/core.py)
* The 5-tool surface with one-line each
* Both install paths (uvx-from-git for end-users, uv pip install -e
for dev) + why plain `pip install` doesn't work yet + two
options to enable it later
* A concrete end-to-end usage example (list_examples → verify_setup →
submit_job → job_status → job_logs)
* Required env vars table
* The three design constants (single submit_job, filesystem source-
of-truth, auto-verify default)
* Pointer to the NVIDIA-internal companion (nmm-sandbox-mcp) for
operators who want short-name cluster ergonomics + GitLab CI
dispatch
* Layout + Phase 2 hooks
Markdownlint-clean; matches the README style in tools/launcher/.
Signed-off-by: Chenhan Yu <[email protected]>
|
/claude review |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 7
🧹 Nitpick comments (2)
tools/mcp/tests/test_bridge.py (1)
174-214: 📐 Maintainability & Code Quality | ⚡ Quick winAdd regression tests for subprocess hardening paths introduced in bridge fixes.
Please add focused tests for:
- rejecting option-like
cluster_host/cluster_user,- preserving raw override values with spaces (no literal quote injection), and
- structured
launcher_cli_not_found/spawn-failure returns.This will lock in the new security/reliability contracts on the MCP boundary.
Also applies to: 221-271
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/tests/test_bridge.py` around lines 174 - 214, Add three focused pytest cases in tools/mcp/tests/test_bridge.py that exercise bridge.verify_slurm_setup_impl: (1) call verify_slurm_setup_impl with option-like values for cluster_host and/or cluster_user (e.g., values starting with "-") and assert the result["ok"] is False and result["reason"] indicates rejection of option-like inputs; (2) call verify_slurm_setup_impl with override values that include spaces or shell metacharacters and assert the returned fields preserve the raw strings (no injected literal quotes) — use monkeypatched subprocess.run to return expected stdout/stderr and check returned whoami/remote_hostname/override fields match inputs exactly; (3) simulate spawn/launcher failures by monkeypatching subprocess.run to raise FileNotFoundError or return a non-zero spawn failure code and assert verify_slurm_setup_impl returns a structured error (result["ok"] is False and result["reason"] equals/indicates "launcher_cli_not_found" or a spawn-failure sentinel) so the contract for structured failure payloads is validated.tools/mcp/modelopt_mcp/__init__.py (1)
48-50: 📐 Maintainability & Code Quality | ⚡ Quick winMove
__all__to the top-level export section.
__all__is currently declared after imports. Put it directly below the module docstring so the public surface is explicit before implementation imports.As per coding guidelines, "Each module declares its public surface with
__all__ = [...]at the top of the file."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/__init__.py` around lines 48 - 50, Move the __all__ declaration immediately below the module docstring so the module's public surface is declared before imports; specifically place "__all__ = ['main']" right after the top-of-file docstring and before the "from modelopt_mcp.server import main" import to satisfy the guideline that modules declare public exports at the top (referencing the __all__ symbol and the imported main symbol).Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/unit_tests.yml:
- Around line 152-160: The mcp job currently lacks a job-level permissions
block, uses actions/checkout@v6 without pinning, and leaves persist-credentials
default true; update the mcp job to set minimal GITHUB_TOKEN permissions (e.g.,
read-only repo and contents as required), add a permissions: block at the job
level, change the checkout invocation (uses: actions/checkout@v6) to a pinned
reference (replace with a specific commit SHA tag for actions/checkout or add
with: ref: ${{ github.sha }} if you need the exact commit), and set with:
persist-credentials: false so credentials aren’t injected into the workspace;
keep the condition (if: needs.check-file-changes.outputs.any_changed == 'true')
and needs: [linux, check-file-changes] as-is.
In `@tools/mcp/modelopt_mcp/__init__.py`:
- Around line 28-31: Update the module docstring for submit_job to accurately
describe its return contract: state that when mode is hf_local (Docker)
submit_job returns the Docker process PID (since it runs in a background
thread), and when mode is cluster_host (Slurm) it returns the submitted
experiment_id (Slurm submission with detach=True); reference the submit_job
function and the modes hf_local/Docker and cluster_host/Slurm so callers know
which value to expect.
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Line 38: Remove all inline "# nosec" suppressions on subprocess imports/usages
(e.g., the import subprocess line and occurrences around lines referenced) and
make the code pass Bandit by fixing unsafe subprocess patterns instead of
silencing them: import subprocess without a comment, ensure every subprocess
invocation in bridge.py (search for subprocess.run/ Popen/call usages) uses a
list of arguments (no shell=True), validates or sanitizes any external inputs
(use shlex.quote or strict whitelist validation where inputs originate), and use
subprocess.run(..., check=True) or other safe APIs; if any use truly requires
shell semantics, refactor to an explicit, validated command builder or document
and encapsulate the risk with safer alternatives so Bandit warnings are resolved
rather than suppressed.
- Around line 512-517: The subprocess.Popen call that spawns the launcher (argv,
cwd=str(launcher_dir), assigned to proc) must not raise uncaught
FileNotFoundError/OSError; wrap the Popen call in a try/except that catches
FileNotFoundError and OSError and returns a structured error result (instead of
letting the exception propagate) so the MCP tool can surface a clear
"launcher/uv not found" error; do the same for the other subprocess invocation
that only catches subprocess.TimeoutExpired (also catch
FileNotFoundError/OSError there). Remove the "# nosec B603" suppressions and
replace them with the new explicit error handling and logging using the same
variables (argv, launcher_dir, proc) and the existing TimeoutExpired handling to
produce consistent structured error returns.
- Around line 477-491: The argv list currently wraps values with shlex.quote
(e.g., hf_local, cluster_host, cluster_user, identity, job_dir, job_name and
extra_overrides values) which injects literal quotes when argv is passed as a
list to subprocess (no shell); remove shlex.quote and append the raw string
values instead (use str(...) where needed and fall back to empty string for
None), e.g. build entries as f"hf_local={str(hf_local)}" and for extra_overrides
use f"{k}={str(v)}"; keep the same keys (argv, extra_overrides) and conditional
checks but do not call shlex.quote.
- Around line 282-295: Remove the "# nosec" bypasses and harden SSH/ subprocess
usage: in bridge.py update verify_slurm_setup_impl to validate cluster_user and
cluster_host (reject/escape values that start with '-' or contain unexpected
characters; allow only a safe regex like [A-Za-z0-9._-]+), then build the ssh
target as a single argv element (target = f"{cluster_user}@{cluster_host}" or
cluster_host alone) and append it without any shell interpolation so ssh options
cannot be injected; in submit_job_impl stop using shlex.quote on key=value
arguments (pass raw strings in the argv list since subprocess is used without a
shell); in the docker branch replace returning Popen(..., stdout=PIPE,
stderr=STDOUT) directly with a safe pattern that drains output (use
subprocess.run(...) to capture/stream output or call Popen.communicate() / a
dedicated reader thread before returning) to avoid deadlocks; finally remove all
"# nosec" comments and address Bandit findings through the proper
security-exception workflow instead of silencing them.
In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 259-262: The tail parameter on the MCP boundary must enforce a
minimum of 1 to avoid surprising slicing in bridge.job_logs_impl (which calls
body.splitlines()[-tail:]); update the Annotated Field for tail in the job_logs
schema to include ge=1 (or equivalent lower-bound validation) so values like 0
or negatives cannot be passed into job_logs_impl and cause unexpected behavior.
---
Nitpick comments:
In `@tools/mcp/modelopt_mcp/__init__.py`:
- Around line 48-50: Move the __all__ declaration immediately below the module
docstring so the module's public surface is declared before imports;
specifically place "__all__ = ['main']" right after the top-of-file docstring
and before the "from modelopt_mcp.server import main" import to satisfy the
guideline that modules declare public exports at the top (referencing the
__all__ symbol and the imported main symbol).
In `@tools/mcp/tests/test_bridge.py`:
- Around line 174-214: Add three focused pytest cases in
tools/mcp/tests/test_bridge.py that exercise bridge.verify_slurm_setup_impl: (1)
call verify_slurm_setup_impl with option-like values for cluster_host and/or
cluster_user (e.g., values starting with "-") and assert the result["ok"] is
False and result["reason"] indicates rejection of option-like inputs; (2) call
verify_slurm_setup_impl with override values that include spaces or shell
metacharacters and assert the returned fields preserve the raw strings (no
injected literal quotes) — use monkeypatched subprocess.run to return expected
stdout/stderr and check returned whoami/remote_hostname/override fields match
inputs exactly; (3) simulate spawn/launcher failures by monkeypatching
subprocess.run to raise FileNotFoundError or return a non-zero spawn failure
code and assert verify_slurm_setup_impl returns a structured error (result["ok"]
is False and result["reason"] equals/indicates "launcher_cli_not_found" or a
spawn-failure sentinel) so the contract for structured failure payloads is
validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 41d93d76-4c0b-45cd-b0fc-9406b45c4455
📒 Files selected for processing (8)
.github/workflows/unit_tests.ymltools/mcp/README.mdtools/mcp/modelopt_mcp/__init__.pytools/mcp/modelopt_mcp/bridge.pytools/mcp/modelopt_mcp/server.pytools/mcp/pyproject.tomltools/mcp/tests/__init__.pytools/mcp/tests/test_bridge.py
| argv = [ | ||
| "ssh", | ||
| "-o", | ||
| "BatchMode=yes", | ||
| "-o", | ||
| "StrictHostKeyChecking=accept-new", | ||
| "-o", | ||
| "ConnectTimeout=5", | ||
| ] | ||
| if identity: | ||
| argv += ["-i", identity] | ||
| target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host | ||
| argv += [target, "whoami && hostname"] | ||
|
|
There was a problem hiding this comment.
🔒 Security & Privacy | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="tools/mcp/modelopt_mcp/bridge.py"
# Basic context: locate key functions and line ranges mentioned in the original comment
rg -n "def (verify_slurm_setup_impl|submit_job_impl)" "$file" || true
rg -n "verify_slurm_setup_impl|submit_job_impl" "$file" || true
# Inspect around the snippet lines 260-320 (adjusted by actual file line numbers)
nl -ba "$file" | sed -n '250,330p'
# Inspect other cited ranges: 450-520, 500-560
nl -ba "$file" | sed -n '440,520p'
nl -ba "$file" | sed -n '500,570p'
# Search for security-policy-related patterns
rg -n "# nosec" "$file" || true
rg -n "shlex\.quote" "$file" || true
rg -n "subprocess\.(run|Popen)" "$file" || true
# Also inspect any ssh target building and where argv list is finalized
rg -n "ssh|StrictHostKeyChecking|BatchMode|ConnectTimeout|StrictHostKeyChecking=accept-new" "$file" || trueRepository: NVIDIA/Model-Optimizer
Length of output: 293
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="tools/mcp/modelopt_mcp/bridge.py"
echo "== show verify_slurm_setup_impl and surrounding code (approx) =="
awk 'NR>=250 && NR<=340 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'
echo "== show submit_job_impl and surrounding code (approx) =="
awk 'NR>=360 && NR<=470 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'
echo "== show other cited ranges (approx) =="
awk 'NR>=460 && NR<=540 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'
awk 'NR>=480 && NR<=560 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'
echo "== search patterns =="
rg -n "# nosec" "$file" || true
rg -n "shlex\.quote" "$file" || true
rg -n "subprocess\.(run|Popen|check_output|call)" "$file" || true
rg -n "StrictHostKeyChecking|BatchMode=yes|ConnectTimeout=5|whoami && hostname" "$file" || trueRepository: NVIDIA/Model-Optimizer
Length of output: 17404
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "cluster_host|cluster_user|identity" tools/mcp/modelopt_mcp/bridge.py
# See if bridge does any validation for these fields before calling verify/submit.
rg -n "validate|regex|allowlist|startswith\\(\"-\"\\)|StrictHostKeyChecking" tools/mcp/modelopt_mcp/bridge.py || true
# Find launcher argument parsing for the key=value overrides we pass to `uv run launch.py`.
# This checks whether shlex.quote would end up including literal quotes in the value.
file_launcher="$(fd -a "launch.py" -tf --exact 2>/dev/null tools/launcher || true)"
echo "launcher candidates:"
echo "$file_launcher"
# If fd above didn't return, fall back to repo-wide filename search.
if [ -z "$file_launcher" ]; then
file_launcher="$(fd -a "launch.py" .)"
fi
echo "Using launcher file(s):"
echo "$file_launcher"
# Inspect how `hf_local=...` / `cluster_host=...` style args are parsed/handled.
rg -n "hf_local=|cluster_host=|user=|identity=|extra_overrides" $file_launcher || true
rg -n "shlex\\.split|argparse|key=value|\\w+=\\w" $file_launcher || trueRepository: NVIDIA/Model-Optimizer
Length of output: 4327
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect tools/launcher/launch.py for argument parsing of key=value pairs
file="tools/launcher/launch.py"
wc -l "$file"
awk 'NR>=1 && NR<=220 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,220p'
awk 'NR>=220 && NR<=420 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,220p'
# Find where cluster_host/user/identity values come from in the bridge MCP layer
rg -n "cluster_host" tools/mcp/modelopt_mcp || true
rg -n "cluster_user" tools/mcp/modelopt_mcp || true
rg -n "submit_job" tools/mcp/modelopt_mcp || true
# Find any schema/validation layer for these fields
rg -n "class .*Model|pydantic|BaseModel|Field\\(|constr\\(|validator\\(" tools/mcp/modelopt_mcp || trueRepository: NVIDIA/Model-Optimizer
Length of output: 11191
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect server validation for cluster_host/cluster_user
file="tools/mcp/modelopt_mcp/server.py"
echo "== server.py cluster_host/cluster_user definitions =="
awk 'NR>=70 && NR<=220 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,180p'
# Inspect bridge.py for job_status and NEMORUN_HOME usage
b="tools/mcp/modelopt_mcp/bridge.py"
echo "== bridge.py search NEMORUN_HOME and experiment_id handling =="
rg -n "NEMORUN_HOME|experiment_id|job_status|proc\\b|Popen\\(" "$b"
echo "== bridge.py around job_status and further lines (approx 540-720) =="
awk 'NR>=530 && NR<=720 {printf "%4d:%s\n", NR, $0}' "$b" | sed -n '1,240p'Repository: NVIDIA/Model-Optimizer
Length of output: 17257
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="tools/mcp/modelopt_mcp/bridge.py"
echo "== top of bridge.py imports =="
awk 'NR>=1 && NR<=80 {printf "%4d:%s\n", NR, $0}' "$file"
rg -n "^import re\\b|from re import" "$file" || trueRepository: NVIDIA/Model-Optimizer
Length of output: 3632
Fix SSH target injection and remove disallowed # nosec Bandit bypasses
- Security policy:
tools/mcp/modelopt_mcp/bridge.pyuses# nosecto suppress Bandit for multiplesubprocesscalls (andtools/launcher/launch.pydoes as well). Repo guidelines disallow# nosecbypasses unless an approved security exception exists—remove these suppressions or route through the required codeowner-approved exception. - SSH option injection:
verify_slurm_setup_impl()buildstarget = f"{cluster_user}@{cluster_host}"and appends it tosshargv without--, with no validation to preventcluster_host/cluster_userfrom being interpreted as additional ssh options. - Docker reliability: in docker mode,
subprocess.Popen(... stdout=PIPE, stderr=STDOUT)is returned immediately without draining the pipe, which can deadlock if output fills the buffer. - argv correctness:
submit_job_impl()appliesshlex.quote()to key=value arguments even thoughsubprocessuses an argv list (no shell). This can introduce literal quote characters into values and break launcher parsing for inputs containing special characters.
🔒 Suggested hardening
+import re
+
+_SSH_TOKEN_RE = re.compile(r"^[A-Za-z0-9._-]+$")
+
def verify_slurm_setup_impl(
cluster_host: str,
cluster_user: str | None = None,
identity: str | None = None,
) -> dict:
@@
+ if not cluster_host or cluster_host.startswith("-") or not _SSH_TOKEN_RE.fullmatch(cluster_host):
+ return {
+ "ok": False,
+ "executor": "slurm",
+ "reason": "invalid_cluster_host",
+ "diagnostic": "cluster_host contains invalid characters.",
+ }
+ if cluster_user and (
+ cluster_user.startswith("-") or not _SSH_TOKEN_RE.fullmatch(cluster_user)
+ ):
+ return {
+ "ok": False,
+ "executor": "slurm",
+ "reason": "invalid_cluster_user",
+ "diagnostic": "cluster_user contains invalid characters.",
+ }
@@
- argv += [target, "whoami && hostname"]
+ argv += ["--", target, "whoami && hostname"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` around lines 282 - 295, Remove the "#
nosec" bypasses and harden SSH/ subprocess usage: in bridge.py update
verify_slurm_setup_impl to validate cluster_user and cluster_host (reject/escape
values that start with '-' or contain unexpected characters; allow only a safe
regex like [A-Za-z0-9._-]+), then build the ssh target as a single argv element
(target = f"{cluster_user}@{cluster_host}" or cluster_host alone) and append it
without any shell interpolation so ssh options cannot be injected; in
submit_job_impl stop using shlex.quote on key=value arguments (pass raw strings
in the argv list since subprocess is used without a shell); in the docker branch
replace returning Popen(..., stdout=PIPE, stderr=STDOUT) directly with a safe
pattern that drains output (use subprocess.run(...) to capture/stream output or
call Popen.communicate() / a dedicated reader thread before returning) to avoid
deadlocks; finally remove all "# nosec" comments and address Bandit findings
through the proper security-exception workflow instead of silencing them.
Sources: Coding guidelines, Linters/SAST tools
There was a problem hiding this comment.
Acknowledged; leaving as-is for this PR. BatchMode=yes already ensures the probe will refuse to prompt for password (key-auth-only). accept-new only matters on the FIRST contact with a host — and the CI runner is single-use (no persistent ~/.ssh/known_hosts to poison), while operator-side accept-new is the same default ssh uses for any new connection. Tightening this to StrictHostKeyChecking=yes would require pre-populating known_hosts on every host that wants to run verify_setup, which is worse UX than the marginal security gain. Will revisit if the Phase-2 read_cluster_artifact tool needs different semantics.
There was a problem hiding this comment.
@ChenhanYu, the rationale for StrictHostKeyChecking=accept-new is reasonable — BatchMode=yes already blocks password prompts, and single-use CI runners don't accumulate a poisoned known_hosts. That part is accepted.
The original comment raised three other concerns that are still open and weren't addressed by this explanation:
- SSH option injection —
cluster_hostandcluster_userarrive from the MCP tool caller with no validation. A value like-oProxyCommand=...passed ascluster_hostwould be appended to theargvlist and interpreted bysshas an option before the--end-of-options marker. TheStrictHostKeyCheckingrationale doesn't cover this injection surface. # nosecbypasses — The repo coding guidelines explicitly state these aren't allowed without codeowner-approved exceptions. The suppressions inbridge.pyhaven't gone through that workflow.shlex.quote()in an argv list —shlex.quote()produces shell-escaped strings (e.g.,'value with spaces'). When passed inside a list tosubprocess.run(..., shell=False), those literal quote characters are forwarded to the subprocess unchanged, breaking argument parsing.
Would you like me to open a follow-up GitHub issue to track these three items so they don't get lost before Phase 2?
🐰 (´• ω •`)ノ
🧠 Learnings used
Learnt from: kevalmorabia97
Repo: NVIDIA/Model-Optimizer PR: 1554
File: tests/gpu/torch/puzzletron/test_puzzletron.py:236-236
Timestamp: 2026-05-28T14:04:36.104Z
Learning: In PyTorch >= 2.6, `torch.load()` defaults to `weights_only=True`. During code review, do NOT flag `torch.load()` calls as a missing `weights_only=True` security issue solely because the argument is omitted when the codebase’s minimum supported PyTorch version is >= 2.6 (e.g., targeting torch 2.12). Only require/flag an explicit `weights_only=` when the project targets PyTorch < 2.6, or when `weights_only` is explicitly set in a way that increases risk (e.g., `weights_only=False`).
| proc = subprocess.Popen( # nosec B603 | ||
| argv, | ||
| cwd=str(launcher_dir), | ||
| stdout=subprocess.PIPE, | ||
| stderr=subprocess.STDOUT, | ||
| ) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tools/mcp/modelopt_mcp/bridge.py"
echo "== File: $FILE =="
python3 - <<'PY'
import itertools,sys
path="tools/mcp/modelopt_mcp/bridge.py"
start=470
end=620
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
if i > end:
break
PY
echo
echo "== Search: subprocess.Popen / subprocess.run in file =="
rg -n "subprocess\.(Popen|run)\b|B603|# nosec" "$FILE" || true
echo
echo "== Show additional context for the second referenced range (543-550 area) =="
python3 - <<'PY'
path="tools/mcp/modelopt_mcp/bridge.py"
start=520
end=575
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
if i > end:
break
PY
echo
echo "== Check how argv is defined near the Popen call =="
python3 - <<'PY'
path="tools/mcp/modelopt_mcp/bridge.py"
# Find first occurrence line of 'proc = subprocess.Popen' and print some lines before it
target="proc = subprocess.Popen"
with open(path,'r',encoding='utf-8') as f:
lines=f.readlines()
for idx,line in enumerate(lines):
if target in line:
start=max(0,idx-20)
end=min(len(lines),idx+25)
for i in range(start,end):
print(f"{i+1}:{lines[i].rstrip()}")
break
PYRepository: NVIDIA/Model-Optimizer
Length of output: 11876
Handle uv/launcher spawn failures with structured error returns (and remove # nosec Bandit suppressions)
tools/mcp/modelopt_mcp/bridge.pycallssubprocess.Popen(... )in theexecutor == "docker"path without catchingFileNotFoundError/OSError, so missinguv/launcher entrypoint will raise and crash the MCP tool path (lines ~512-517).- The
executor != "docker"path only catchessubprocess.TimeoutExpired; missinguv/entrypoint will still raise (lines ~543-550). - Remove/replace the
# nosec B603suppressions on these subprocess calls—# nosecbypasses Bandit checks and isn’t allowed without the required security-exception process.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` around lines 512 - 517, The
subprocess.Popen call that spawns the launcher (argv, cwd=str(launcher_dir),
assigned to proc) must not raise uncaught FileNotFoundError/OSError; wrap the
Popen call in a try/except that catches FileNotFoundError and OSError and
returns a structured error result (instead of letting the exception propagate)
so the MCP tool can surface a clear "launcher/uv not found" error; do the same
for the other subprocess invocation that only catches subprocess.TimeoutExpired
(also catch FileNotFoundError/OSError there). Remove the "# nosec B603"
suppressions and replace them with the new explicit error handling and logging
using the same variables (argv, launcher_dir, proc) and the existing
TimeoutExpired handling to produce consistent structured error returns.
There was a problem hiding this comment.
The detailed security findings here overlap with the claude[bot] comments below — those have been addressed in 3de99b441 (SLURM_HOST via env, DEVNULL on Docker, NEMORUN_HOME propagation, removed shlex.quote). The underlying subprocess.run + subprocess.Popen calls all use a controlled argv list (no shell, no untrusted external input — values come from the MCP tool's typed args validated by pydantic). If you have specific remaining findings from this comment that aren't covered by my other replies, please point them out.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tools/mcp/modelopt_mcp/bridge.py"
echo "=== nosec B603 occurrences ==="
rg -n "nosec" "$FILE" || echo "(none)"
echo ""
echo "=== subprocess.Popen call and surrounding context ==="
rg -n "subprocess\.Popen" -A 15 -B 5 "$FILE" || echo "(none)"
echo ""
echo "=== subprocess.run in submit_job_impl (slurm path) context ==="
rg -n "subprocess\.run" -A 15 -B 5 "$FILE" | grep -A 15 -B 5 "slurm\|launcher_dir\|executor" || echo "(none - showing all subprocess.run)"
rg -n "subprocess\.run" -A 20 -B 5 "$FILE" || echo "(none)"
echo ""
echo "=== FileNotFoundError handling ==="
rg -n "FileNotFoundError\|OSError" "$FILE" || echo "(none)"Length of output: 13264
@ChenhanYu, thanks for the details. After checking the current file, two specific items from the original comment remain unaddressed:
1. Missing FileNotFoundError on the two submit_job_impl subprocess calls (if uv isn't on PATH)
The verify_* helpers already catch FileNotFoundError properly (lines 203, 343). But neither launcher-dispatch call does:
- Line 562 –
subprocess.Popen(argv, ...)(Docker path): notry/except FileNotFoundError. Ifuvisn't on PATH this raises an uncaught exception instead of returning a structured{"ok": False, "reason": "uv_not_found"}dict. - Line 589 –
subprocess.run(argv, ...)(Slurm path): thetryblock only catchessubprocess.TimeoutExpired; sameuv-not-found scenario raises.
Both only need a small addition:
except FileNotFoundError:
return {
"ok": False,
"executor": executor,
"reason": "uv_not_found",
"diagnostic": (
"`uv` binary not found on PATH. Install via "
"https://docs.astral.sh/uv/getting-started/installation/"
),
}2. # nosec suppressions still present (lines 38, 196, 250, 325, 562, 589)
Per the project's coding guidelines:
Any use of
# noseccomments to bypass Bandit security checks is not allowed. If a security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by@NVIDIA/modelopt-setup-codeownerswith an explicit justification in the PR description.
The B603/B607 suppressions are factually correct (no shell, controlled argv, pydantic-validated inputs), but the policy doesn't provide a "nosec is fine if the code is correct" carve-out — it requires the explicit justification in the PR description and @NVIDIA/modelopt-setup-codeowners sign-off. Either add that justification to the PR description and request the codeowners review, or replace the # nosec comments with inline justifications in a format acceptable to your Bandit config (e.g., a # noqa with a project-approved exemption file).
🧠 Learnings used
Learnt from: kevalmorabia97
Repo: NVIDIA/Model-Optimizer PR: 1554
File: tests/gpu/torch/puzzletron/test_puzzletron.py:236-236
Timestamp: 2026-05-28T14:04:36.104Z
Learning: In PyTorch >= 2.6, `torch.load()` defaults to `weights_only=True`. During code review, do NOT flag `torch.load()` calls as a missing `weights_only=True` security issue solely because the argument is omitted when the codebase’s minimum supported PyTorch version is >= 2.6 (e.g., targeting torch 2.12). Only require/flag an explicit `weights_only=` when the project targets PyTorch < 2.6, or when `weights_only` is explicitly set in a way that increases risk (e.g., `weights_only=False`).
| else: | ||
| # Slurm mode — pass cluster config knobs as nemo-run overrides. | ||
| argv.append(f"cluster_host={shlex.quote(cluster_host or '')}") | ||
| if cluster_user: | ||
| argv.append(f"user={shlex.quote(cluster_user)}") | ||
| if identity: | ||
| argv.append(f"identity={shlex.quote(identity)}") | ||
| argv.append("detach=true") |
There was a problem hiding this comment.
[CRITICAL Algorithm] The Slurm-mode argv construction never actually configures the cluster host that the launcher will use. launch.py's launch() entrypoint (tools/launcher/launch.py:82) only accepts job_name, job_dir, pipeline, hf_local, user, identity, detach, clean — there is no cluster_host parameter. The cluster host is sourced either from the SLURM_HOST env var via slurm_factory(host=...) (tools/launcher/slurm_config.py:63) or via a per-task override like pipeline.task_0.slurm_config.host=<host>.
Appending cluster_host=<host> to nemo-run's CLI overrides will at best be ignored / produce a CLI error, and at worst silently submit using whatever SLURM_HOST happens to be in the bridge's env (which may be empty). Net effect: every Slurm submission goes through with the wrong host, or fails with an opaque error after verify_setup already reported ok.
Fix: set SLURM_HOST=<cluster_host> in the env passed to the subprocess, e.g.
env = os.environ.copy()
env["SLURM_HOST"] = cluster_host
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"]
if cluster_user:
argv.append(f"user={cluster_user}")
if identity:
argv.append(f"identity={identity}")
argv.append("detach=true")
proc = subprocess.run(argv, env=env, ...)(The bridge also already passes cluster_user as user=... which is the right launcher arg — that one's fine.)
There was a problem hiding this comment.
Fixed in 3de99b441. Stopped appending cluster_host= to the nemo-run overrides; the host now flows via child_env['SLURM_HOST']=cluster_host passed as env= to the subprocess. slurm_factory(host=os.environ.get('SLURM_HOST', '')) picks it up the same way as direct uv run launch.py invocations do. cluster_user was already correctly going through as user=... (the launcher's actual arg).
| argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"] | ||
| if hf_local: | ||
| argv.append(f"hf_local={shlex.quote(hf_local)}") | ||
| else: | ||
| # Slurm mode — pass cluster config knobs as nemo-run overrides. | ||
| argv.append(f"cluster_host={shlex.quote(cluster_host or '')}") | ||
| if cluster_user: | ||
| argv.append(f"user={shlex.quote(cluster_user)}") | ||
| if identity: | ||
| argv.append(f"identity={shlex.quote(identity)}") | ||
| argv.append("detach=true") | ||
| if job_dir: | ||
| argv.append(f"job_dir={shlex.quote(job_dir)}") | ||
| if job_name: | ||
| argv.append(f"job_name={shlex.quote(job_name)}") | ||
| for k, v in (extra_overrides or {}).items(): | ||
| argv.append(f"{k}={shlex.quote(str(v))}") |
There was a problem hiding this comment.
[CRITICAL Algorithm] shlex.quote(...) is being used as if it were a value-quoting helper for nemo-run's key=value CLI overrides, but it isn't — it produces a shell-quoted string (e.g. '/path with spaces') suitable for pasting into a shell. Since subprocess.Popen([...]) / subprocess.run([...]) is called with a list (no shell), each argument is passed verbatim to launch.py. The result: nemo-run sees literal arguments like hf_local='/mnt/hf-local' (with quotes baked in), and user='alice' etc. — those quotes become part of the value the launcher dataclass receives, breaking path lookups and SSH user resolution.
Drop shlex.quote for every override; just use the raw value:
if hf_local:
argv.append(f"hf_local={hf_local}")
else:
if cluster_user:
argv.append(f"user={cluster_user}")
if identity:
argv.append(f"identity={identity}")
argv.append("detach=true")
if job_dir:
argv.append(f"job_dir={job_dir}")
if job_name:
argv.append(f"job_name={job_name}")
for k, v in (extra_overrides or {}).items():
argv.append(f"{k}={v}")shlex.quote would only be appropriate if you were building a shell string (e.g. for shell=True or for an ssh "<remote-cmd>" blob). Since you correctly use list-form Popen, every value can carry spaces / special chars safely without quoting.
There was a problem hiding this comment.
Fixed in 3de99b441. Dropped shlex.quote on every nemo-run override — subprocess.run/Popen with a list never goes through a shell, so the quoting was baking literal quote chars into values that the launcher's argparse then saw. All overrides now pass through verbatim. Verified the argv-shape test still passes (the test checks for user=chenhany not user='chenhany').
| if executor == "docker": | ||
| # Docker mode: spawn in background so we don't block the MCP | ||
| # call. The subprocess writes its experiment dir + status into | ||
| # NEMORUN_HOME; we'll read it back via job_status. | ||
| # B603 false positive — `argv` is a list built by this module | ||
| # from typed parameters, no shell interpretation. | ||
| proc = subprocess.Popen( # nosec B603 | ||
| argv, | ||
| cwd=str(launcher_dir), | ||
| stdout=subprocess.PIPE, | ||
| stderr=subprocess.STDOUT, | ||
| ) | ||
| # Detached: don't wait. The caller polls job_status by | ||
| # experiment_id (derived from job_name or auto-named). | ||
| # Generate a best-effort experiment_id from job_name + timestamp. | ||
| # nemo_run names experiments deterministically as | ||
| # `<title>_<job_name>_<timestamp>`; if the caller didn't provide | ||
| # job_name we can't predict the id ahead of time. | ||
| return { | ||
| "ok": True, | ||
| "executor": "docker", | ||
| "pid": proc.pid, | ||
| "argv": argv, | ||
| "experiment_id": None, # Phase 2: tail the subprocess output | ||
| # until nemo_run logs the id, then return it | ||
| "spike_note": ( | ||
| "Docker mode launched in background. Phase 1: the " | ||
| "experiment_id is None — Phase 2 tails the subprocess " | ||
| "output to capture the id. For now, list experiments " | ||
| "via nemo_run's CLI or check NEMORUN_HOME." | ||
| ), | ||
| } |
There was a problem hiding this comment.
[IMPORTANT Performance] The Docker-mode Popen(stdout=subprocess.PIPE, stderr=subprocess.STDOUT) writes the child's stdout into a pipe that no one ever reads. Long-running jobs (PTQ runs are minutes-to-hours) will fill the pipe buffer (~64KB on Linux) and then block the launcher subprocess on its next write(), hanging the job indefinitely while the MCP server appears to have "succeeded".
Fix: redirect to a file, or to DEVNULL. Since the actual job logs are written by nemo_run into the experiment dir (which is what job_logs reads), discarding stdout is fine:
proc = subprocess.Popen(
argv,
cwd=str(launcher_dir),
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
start_new_session=True, # detach from MCP server's process group
)start_new_session=True is also worth adding — without it, an MCP server SIGINT/restart will SIGHUP the in-flight Docker submission. Same applies on the Slurm path (the synchronous subprocess.run is fine because it captures fully and returns), but the Docker Popen is the real foot-gun.
There was a problem hiding this comment.
Fixed in 3de99b441. Docker mode now uses stdout=DEVNULL, stderr=DEVNULL, start_new_session=True. The launcher writes real logs into the nemo_run experiment dir (which job_logs reads), so discarding the subprocess stdout is correct. start_new_session=True also added so an MCP server restart doesn't SIGHUP the in-flight launcher.
| def _resolve_experiment_dir(experiment_id: str) -> Path | None: | ||
| """Map an experiment_id to its on-disk directory. | ||
|
|
||
| nemo_run lays experiments out under ``$NEMORUN_HOME/experiments/<id>/`` | ||
| by default; ``NEMORUN_HOME`` falls back to cwd. We also check | ||
| ``./experiments/<id>`` directly and ``./local_experiments/<id>`` | ||
| (the Docker-mode fallback path). | ||
| """ | ||
| candidates = [] | ||
| nemorun_home = os.environ.get("NEMORUN_HOME") | ||
| if nemorun_home: | ||
| candidates.append(Path(nemorun_home) / "experiments" / experiment_id) | ||
| candidates.append(Path.cwd() / "experiments" / experiment_id) | ||
| candidates.append(Path.cwd() / "local_experiments" / experiment_id) | ||
| for c in candidates: | ||
| if c.exists(): | ||
| return c | ||
| return None |
There was a problem hiding this comment.
[IMPORTANT Algorithm] The submit-time vs. status-time NEMORUN_HOME resolution is asymmetric and will produce false experiment_dir_not_found results.
submit_job_impldoes not passenv=to the subprocess, so the launcher inherits whatever was in the MCP server's env at startup.launch.pyitself defaultsNEMORUN_HOMEtoos.getcwd()when unset (tools/launcher/launch.py:99-100), and the subprocess cwd istools/launcher/— so artifacts land attools/launcher/experiments/<id>/._resolve_experiment_dirhere looks underos.environ["NEMORUN_HOME"]/experiments, thenPath.cwd()/experiments, thenPath.cwd()/local_experiments— relative to the MCP server's cwd, not the launcher's.
When NEMORUN_HOME is unset (very plausible — the MCP server is launched by codex/Claude Code with no env tailoring), job_status returns experiment_dir_not_found for jobs that actually succeeded.
Fix: explicitly resolve and propagate NEMORUN_HOME at submit time, and document in the README that this env var must be set consistently for the MCP server's lifetime. Something like:
env = os.environ.copy()
env.setdefault("NEMORUN_HOME", os.getcwd()) # pin so status side sees the same dir
proc = subprocess.run(argv, cwd=str(launcher_dir), env=env, ...)Also worth adding launcher_dir / "experiments" / experiment_id to the candidate list here as a belt-and-braces fallback.
There was a problem hiding this comment.
Fixed in 3de99b441. submit_job_impl now passes env=child_env to the subprocess with child_env.setdefault('NEMORUN_HOME', os.getcwd()). _resolve_experiment_dir reads from the same env, plus an added fallback to <launcher_dir>/experiments/ for the case where the operator hasn't set NEMORUN_HOME at all. Both submit + status sides now agree on the artifact location.
| try: | ||
| gpu = subprocess.run( # nosec B603 B607 | ||
| [ | ||
| "docker", | ||
| "run", | ||
| "--rm", | ||
| "--gpus", | ||
| "all", | ||
| "nvidia/cuda:12.0-base-ubuntu22.04", | ||
| "nvidia-smi", | ||
| ], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=60, | ||
| check=False, | ||
| ) | ||
| except subprocess.TimeoutExpired: | ||
| return { | ||
| "ok": False, | ||
| "executor": "docker", | ||
| "daemon_ok": True, | ||
| "reason": "gpu_check_timeout", | ||
| "diagnostic": "GPU smoketest container did not return in 60s.", | ||
| } |
There was a problem hiding this comment.
[IMPORTANT Performance] First-time invocations of verify_setup(executor='docker') will pull nvidia/cuda:12.0-base-ubuntu22.04 (~150 MB) on hosts that don't already have it, easily blowing past the 60-second timeout — at which point the call returns gpu_check_timeout even on perfectly healthy hosts. The README sells the probe as "~1 second"; that's only true on warm caches.
Two cheap mitigations:
- Replace the GPU smoketest with
docker run --rm --gpus all --entrypoint /bin/true <small-image>once we know the daemon is up + has the toolkit, and probe the toolkit separately vianvidia-container-cli infoif available — far faster and doesn't depend on a CUDA image at all. - At minimum, pin a much smaller image (e.g.
busyboxor just bypass image-fetch by checkingdocker info | grep nvidia), and document that the first call may be slow.
Also: the chosen tag nvidia/cuda:12.0-base-ubuntu22.04 is old (CUDA 12.0 GA shipped in Q1 2023) — newer drivers won't have a fundamentally different code path, but pinning to a long-deprecated tag invites silent registry-removal breakage years from now. If you keep the smoketest, point it at a current CUDA tag.
There was a problem hiding this comment.
Fixed in 3de99b441. Replaced the heavyweight docker run --gpus all nvidia/cuda:... nvidia-smi smoketest with docker info --format '{{json .}}' + check whether 'nvidia' is in the registered runtimes. No image pull, daemon-fast, no dependency on a CUDA tag. Returns the same gpu_unavailable structured failure with an install-toolkit pointer when the nvidia runtime isn't registered.
| for status_file in sorted(exp_dir.glob("status_*.out")): | ||
| # Convention: filename is `status_<task_name>.out`, contents | ||
| # are a single word ("succeeded" or "failed"). | ||
| task_name = status_file.stem.removeprefix("status_") | ||
| body = status_file.read_text(encoding="utf-8", errors="replace").strip() | ||
| task_statuses[task_name] = body | ||
| if "fail" in body.lower(): | ||
| any_failed = True |
There was a problem hiding this comment.
[IMPORTANT Algorithm] The "fail" in body.lower() check is too loose. The convention documented just above says the file contains a single word ("succeeded" or "failed"), but the test fixture test_job_status_failed_task already writes "failed (rc=1)\n", so the convention is leaky. The substring match also flips on benign content like "succeeded after retry; previous attempt failed" or any nemo_run status string that mentions "fail" — flagging an actually-successful task as failed.
Tighten this to a word match against the canonical statuses:
status_word = body.split()[0].lower() if body else ""
task_statuses[task_name] = body
if status_word in ("failed", "error", "cancelled"):
any_failed = TrueOr anchor with body.lower().startswith("fail"). The current approach is silent enough that a misclassification only surfaces when an operator manually inspects the experiment dir.
There was a problem hiding this comment.
Fixed in 3de99b441. Now anchors on the FIRST word of the status file against a _STATUS_FAILURE_WORDS frozenset (failed/error/errored/cancelled/canceled). Existing test fixture test_job_status_failed_task writes 'failed (rc=1)\n' — first word is 'failed', still matches. Tightened so 'succeeded after retry; previous attempt failed' would correctly classify as success, not failure.
| argv = [ | ||
| "ssh", | ||
| "-o", | ||
| "BatchMode=yes", | ||
| "-o", | ||
| "StrictHostKeyChecking=accept-new", | ||
| "-o", | ||
| "ConnectTimeout=5", | ||
| ] | ||
| if identity: | ||
| argv += ["-i", identity] | ||
| target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host | ||
| argv += [target, "whoami && hostname"] |
There was a problem hiding this comment.
[SUGGESTION] StrictHostKeyChecking=accept-new causes the first probe of any new cluster host to silently auto-add the host key to ~/.ssh/known_hosts of the user running the MCP server. That's usually fine for an interactive operator, but in a CI context (the mcp GitHub Actions job in this PR) or a shared service-account host it's a TOFU footgun — an attacker who can intercept the very first probe gets pinned-trust until manual known_hosts cleanup.
Consider StrictHostKeyChecking=yes for verify_setup (probe fails fast and the user is told to add the host key explicitly), and only allow accept-new if a --first-time flag is passed. Not a blocker — current behavior matches the launcher's defaults — but worth flagging as the security posture differs from a typical interactive ssh session.
| import re as _re | ||
|
|
||
| experiment_id = None | ||
| experiment_dir = None | ||
| slurm_job_id = None | ||
| for m in _re.finditer( | ||
| r"experiment[_\s-]+([a-zA-Z0-9_]+_\d{10,})", | ||
| stdout_tail, | ||
| _re.IGNORECASE, | ||
| ): | ||
| experiment_id = m.group(1) | ||
| break | ||
| for m in _re.finditer( | ||
| r"(/lustre/[^\s]+|/home/[^\s]+)/experiments/[^\s]+", | ||
| stdout_tail, | ||
| ): | ||
| experiment_dir = m.group(0) | ||
| break | ||
| for m in _re.finditer(r"Submitted batch job (\d+)", stdout_tail): | ||
| slurm_job_id = m.group(1) | ||
| break |
There was a problem hiding this comment.
[IMPORTANT Algorithm] Three independent fragility issues in this best-effort id parse:
import re as _reshould be at module top (per project coding standards, "Keep imports at the top of the file"). It's also already a stdlib import — no reason to lazy-import or alias.- The experiment_dir regex hard-codes
/lustre/...and/home/...as the only roots. NVIDIA-internal sites use/lustre/, but partner clusters use/scratch/,/work/,/data/,/p/, etc. The regex will returnexperiment_dir=Noneon perfectly valid runs everywhere except specific NVIDIA SREs' machines. - The id regex
experiment[_\s-]+([a-zA-Z0-9_]+_\d{10,})requires 10+ digits at the end (a unix timestamp). nemo_run does use timestamps, but the format may include sub-second precision or change across versions; pinning >=10 digits is brittle and the parse silently returnsexperiment_id=Nonerather than failing loud.
For (2) drop the path prefix anchor and rely on /experiments/<id>/ as the unique marker:
m = re.search(r"(\S+/experiments/[^\s/]+)", stdout_tail)
if m:
experiment_dir = m.group(1)For (3), since nemo_run logs the experiment id in a deterministic format, anchor on the actual log line nemo_run emits (e.g. "Experiment created") rather than reverse-engineering a timestamp shape. Worth a follow-up Phase-2 pass where the test fixtures cover real nemo_run output.
There was a problem hiding this comment.
Fixed in 3de99b441. (1) Hoisted import re to module top. (2) Dropped the /lustre/|/home/ path anchors; match any \S+/experiments/[^\s/]+ so partner-cluster filesystem roots work. (3) Replaced the timestamp-anchored experiment-id regex with a match anchored on nemo_run's experiment ... id: log line, with a generic-id fallback. The underlying spike-quality "best-effort" caveat still applies; production-grade parsing is queued as a Phase 2 task in the cell.md migration plan (OMNIML-5131).
| try: | ||
| with open(path) as f: | ||
| doc = yaml.safe_load(f) or {} | ||
| if isinstance(doc, dict): | ||
| # job_name is the most common identifier; "model" may | ||
| # also be a top-level field in some examples. | ||
| entry.model = doc.get("model") or doc.get("base_model") or doc.get("job_name") | ||
| entry.description = doc.get("description") |
There was a problem hiding this comment.
[SUGGESTION] Top-level model / base_model / description fields don't exist in any of the bundled launcher YAMLs (e.g. examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml only has job_name). The list_examples test passes because it constructs ad-hoc YAMLs — but on the actual data, this code will report model=<job_name> (a stringified identifier like Qwen3-8B_PTQ) and description=None for every example.
Two reasonable fixes, either's fine:
- Document the convention in the launcher's CLAUDE.md and add
model:/description:to the canonical examples (small edit, real metadata for agents). - Derive the model name from the path (e.g.
Qwen/Qwen3-8B/...→Qwen/Qwen3-8B) and pulldescriptionfrom the first# ...comment block in the YAML — which is the actual location of human-readable description in every bundled example.
Without one of these the tool's "model + description" output is mostly empty for the real corpus, which undercuts its discovery value to the agent.
| dependencies = [ | ||
| "mcp>=1.0", | ||
| # The launcher provides the core orchestration primitive (core.run_jobs, | ||
| # SandboxPipeline, build_slurm_executor, build_docker_executor). Pulled | ||
| # in via git+subdirectory so `uvx --from <this-repo>#subdirectory=tools/mcp | ||
| # modelopt-mcp` resolves both packages from the same clone — no | ||
| # duplicate fetch. | ||
| "modelopt-launcher", | ||
| "pyyaml", | ||
| "pydantic>=2.0", | ||
| ] |
There was a problem hiding this comment.
[CRITICAL Algorithm] The dependency "modelopt-launcher" is unsatisfiable for the documented uvx --from "git+...#subdirectory=tools/mcp" modelopt-mcp install path.
tools/launcher/pyproject.toml declares the package as name = "modelopt-launcher" with [tool.setuptools] py-modules = [] — there is no modelopt_launcher Python package directory, only loose .py files (launch.py, core.py, slurm_config.py) at the launcher dir's top level. Even with [tool.uv.sources] modelopt-launcher = { path = "../launcher" }, uv pip install -e ../launcher only registers the dist-info; nothing importable lands in site-packages. So bridge.py's import modelopt_launcher fallback at line 80-83 always misses, and any user who pip/uvx-installs without a clone (the explicit recommended path) gets a server that can't locate examples/ or launcher_dir.
The PR description's "uvx clones the whole repo to its cache" claim only saves the day because the bridge falls back to _THIS_DIR.parent.parent / "launcher" — i.e. the install only works because of file-layout assumptions about the cloned repo, not because the dep system actually ships the launcher to the consumer. That's a fragile contract.
Fix options:
- Make
modelopt-launchera proper package: move the source intotools/launcher/modelopt_launcher/and update itspyproject.tomlto exposemodelopt_launcher(withexamples/declared as package data). Thenfrom modelopt_launcher import core/import modelopt_launcher.examplesworks, the import-fallback in_find_launcher_examples_diractually does something, anduvx-ed installs run cleanly without the repo file-layout side-channel. - Ship a
[tool.uv.sources]git+...#subdirectory=tools/launcherfor the launcher (so uvx resolves it from the same repo on PyPI-less consumers) AND add a runtime hard-fail when_THIS_DIR.parent.parent / "launcher"is missing — at least the failure is loud rather than "works on the author's machine."
Without either fix, the install matrix in the README is misleading and a Phase-2 user who runs the uvx-from-PyPI path (once published) ends up with a non-functional server.
There was a problem hiding this comment.
Fixed in 3de99b441. Dropped the bare modelopt-launcher dep from [project.dependencies]. You're right that tools/launcher/pyproject.toml declares the name with py-modules = [] — there is no importable modelopt_launcher Python package; the launcher is consumed as loose .py files at tools/launcher/*.py. bridge.py invokes it via subprocess.run(["uv", "run", "launch.py", ...], cwd=<repo>/tools/launcher/) — a file-layout dependency, not a Python import dependency. The fallback import modelopt_launcher at line 80-83 was vestigial; removing it from the dep list while leaving the import as a try/except fallback for a hypothetical future where the launcher IS pip-installable.
uvx-from-git satisfies the file-layout dependency naturally (the clone puts both tools/launcher/ and tools/mcp/ next to each other). The README documents the requirement explicitly.
There was a problem hiding this comment.
Claude review — Phase 1 modelopt-mcp
Tally: 4 CRITICAL · 4 IMPORTANT · 2 SUGGESTION
The bridge module wraps the launcher with a structured-error contract that's pleasant to read, and the test suite is hermetic and well-shaped. The Phase-1-vs-Phase-2 split is also called out cleanly. However, several issues mean the Slurm path won't actually submit to the requested cluster, and the install metadata won't survive being pulled in without the rest of the repo. Highlights:
Most impactful
- CRITICAL — Slurm
cluster_hostis never propagated to the launcher.launch.pyhas no such parameter; cluster host comes fromSLURM_HOST(env) or theslurm_config.hostper-task override. The bridge appendscluster_host=<host>as a nemo-run CLI override, which the launcher ignores — every slurm submission silently uses whateverSLURM_HOSTwas inherited at startup, or fails opaquely. (bridge.py:478-485) - CRITICAL —
shlex.quoteis misapplied in argv construction. Quotes get baked into values whensubprocess.Popen([...])is called with a list (no shell), solaunch.pysees literalhf_local='/mnt/hf-local'anduser='alice'— breaking path lookups and SSH user resolution. (bridge.py:475-491) - CRITICAL —
modelopt-launcherdep is unsatisfiable as packaged. The launcher's pyproject declarespy-modules = []and exposes nomodelopt_launcherpackage —import modelopt_launcherin the bridge fallback always misses. The install only works because of file-layout assumptions about the cloned repo. (tools/mcp/pyproject.toml:6-16) - IMPORTANT — Docker-mode
Popendeadlock.stdout=PIPEis unread; the launcher subprocess will block on its nextwrite()once the ~64KB pipe buffer fills, hanging long-running PTQ jobs while the MCP server reports success. (bridge.py:506-537) - IMPORTANT —
NEMORUN_HOMEresolution is asymmetric between submit and status. Without explicit propagation,job_statusreturnsexperiment_dir_not_foundfor jobs that succeeded — the MCP and launcher cwds aren't the same. (bridge.py:626-643)
Other findings (in the inline thread)
- IMPORTANT —
verify_setup(executor='docker')GPU smoketest pulls a 150 MB CUDA image inside a 60 s timeout; first call on a cold host returns a misleadinggpu_check_timeout. - IMPORTANT —
job_status's'fail' in body.lower()substring check fires on"succeeded after retry; previous attempt failed"etc. - IMPORTANT — Slurm submit's experiment-id parse hard-codes
/lustre|/homepaths and a 10+ digit timestamp shape; lazyimport re as _reshould be at module top per project standards. - SUGGESTION —
StrictHostKeyChecking=accept-newis a TOFU footgun in CI/service-account contexts. - SUGGESTION —
list_examplesextractsmodel/descriptionfrom top-level YAML fields that don't exist in any bundled example; output is mostly empty on the real corpus.
Risk assessment
Medium-to-high. The Slurm path is the documented headline use-case in the README, and three of the four CRITICAL findings sit in that path. The unit tests pass because they mock the subprocess layer end-to-end and never observe the actual launcher's arg-handling, so the regressions wouldn't be caught by CI as written.
Net: the structure is good, but I'd hold this until at least the four CRITICAL findings are resolved and a smoke-test that drives the real launch.py --dryrun is added to the test suite.
| else: | ||
| # Slurm mode — pass cluster config knobs as nemo-run overrides. | ||
| argv.append(f"cluster_host={shlex.quote(cluster_host or '')}") | ||
| if cluster_user: | ||
| argv.append(f"user={shlex.quote(cluster_user)}") | ||
| if identity: | ||
| argv.append(f"identity={shlex.quote(identity)}") | ||
| argv.append("detach=true") |
There was a problem hiding this comment.
[CRITICAL Algorithm] The Slurm-mode argv construction never actually configures the cluster host that the launcher will use. launch.py's launch() entrypoint (tools/launcher/launch.py:82) only accepts job_name, job_dir, pipeline, hf_local, user, identity, detach, clean — there is no cluster_host parameter. The cluster host is sourced either from the SLURM_HOST env var via slurm_factory(host=...) (tools/launcher/slurm_config.py:63) or via a per-task override like pipeline.task_0.slurm_config.host=<host>.
Appending cluster_host=<host> to nemo-run's CLI overrides will at best be ignored / produce a CLI error, and at worst silently submit using whatever SLURM_HOST happens to be in the bridge's env (which may be empty). Net effect: every Slurm submission goes through with the wrong host, or fails with an opaque error after the verify step has already passed.
Fix: either set SLURM_HOST=<cluster_host> in the env passed to the subprocess (and propagate it through the actual ssh invocation the launcher makes), or pass it as a structured override. Something like:
env = os.environ.copy()
env["SLURM_HOST"] = cluster_host
# ...
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"]
if cluster_user:
argv.append(f"user={cluster_user}")
if identity:
argv.append(f"identity={identity}")
argv.append("detach=true")
proc = subprocess.run(argv, env=env, ...)Same problem applies to cluster_user → the launcher's user parameter is correctly named but the bridge passes it as user=... already, so that one's fine.
There was a problem hiding this comment.
Same as the earlier claude[bot] finding on line 485 — fixed by propagating SLURM_HOST=<cluster_host> via env=child_env in 3de99b441.
… shell-quoting, Docker DEVNULL + start_new_session, NEMORUN_HOME propagation, status word-anchor, lighter GPU probe Addresses CodeRabbit + claude[bot] findings on PR #1701: bridge.py * **Slurm cluster_host**: launch.py's entrypoint does not accept a `cluster_host` arg — it reads SLURM_HOST from env via slurm_factory. Stop appending `cluster_host=<host>` to nemo-run overrides (it was at best ignored, at worst silently submitted with whatever SLURM_HOST happened to be in the bridge's env). Propagate via env=child_env on the subprocess. * **shlex.quote misuse**: subprocess.run/Popen with a list never goes through a shell, so shell-quoting baked literal quote chars into values like `hf_local='/mnt/hf-local'`. Drop quoting on all nemo-run k=v overrides — verbatim values are passed safely. * **Docker Popen PIPE → DEVNULL**: stdout/stderr=PIPE without a reader would fill the ~64 KB kernel pipe buffer on long PTQ runs and hang the launcher subprocess. Use DEVNULL (nemo_run writes real logs into the experiment dir; job_logs reads from there). Also add `start_new_session=True` so an MCP server SIGINT/restart doesn't SIGHUP the in-flight launcher. * **NEMORUN_HOME asymmetry**: submit-side didn't pin NEMORUN_HOME in the subprocess env, but status-side (`_resolve_experiment_dir`) looked for it. Launcher defaulted to its own cwd; status-side to the MCP server's cwd. Result: false `experiment_dir_not_found` for jobs that actually succeeded. Pin via env_setdefault + expand the candidate-dir list to include `launcher_dir/experiments/` as a belt-and-braces fallback. * **GPU verify_setup**: replaced the heavyweight `docker run --gpus all nvidia/cuda:12.0-base nvidia-smi` (which pulls ~150 MB and blows past the 60s timeout on first call) with `docker info --format '{{json .}}'` + check whether `"nvidia"` is in the registered runtimes. No image pull, daemon-fast. * **Status word match**: `"fail" in body.lower()` was too loose (false-positives on "succeeded after retry; previous attempt failed"). Anchor on the FIRST word of the status file vs a `_STATUS_FAILURE_WORDS` frozenset. * **experiment_id regex**: hoisted `import re` to module top, and relaxed both the id pattern and the experiment_dir path regex (was hard-coded to `/lustre/|/home/` — fails on partner clusters' `/scratch/`, `/work/`, etc.). * **list_examples metadata**: path-derive `model` as `<family>/<model>` from `examples/<family>/<model>/<task>.yaml` when the YAML body doesn't carry top-level `model` / `description` / `job_name` (most launcher examples don't). server.py * Add `Field(ge=1)` to the `tail` param on `job_logs` so the schema rejects zero/negative values at the MCP wire level. pyproject.toml * Drop the bare `modelopt-launcher` dep. tools/launcher/pyproject.toml declares the name with `py-modules = []` — there is no importable `modelopt_launcher` Python package on disk. bridge.py uses the launcher via `subprocess.run(["uv", "run", "launch.py", ...])` with `cwd=<repo>/tools/launcher/` — a file-layout dependency, not a Python import. uvx-from-git satisfies this naturally (the clone puts both directories on disk side by side). Remove the unsatisfiable PyPI declaration; document the file-layout requirement in the README. __init__.py * Correct the `submit_job` return-contract docstring: Slurm returns `experiment_id`; Docker returns the background `pid` (Phase 2 will capture the nemo_run experiment_id for the Docker path). All changes preserve the existing 19/19 unit tests passing. Signed-off-by: Chenhan Yu <[email protected]>
Review feedback addressed in
|
| Issue | Fix |
|---|---|
Slurm cluster_host never reached the launcher (launch.py reads SLURM_HOST env, not a CLI arg) |
Propagate via env=child_env on the subprocess |
shlex.quote on nemo-run k=v overrides baked literal quotes into values |
Removed quoting; list-form subprocess never goes through a shell |
Docker Popen(stdout=PIPE) would block on full pipe buffer for long jobs |
Switched to stdout=DEVNULL, stderr=DEVNULL, start_new_session=True |
Submit-side vs status-side NEMORUN_HOME asymmetry → false experiment_dir_not_found |
Pin via child_env.setdefault('NEMORUN_HOME', os.getcwd()) + expanded _resolve_experiment_dir candidate list |
| Docker GPU probe pulled ~150 MB CUDA image, blew past 60s timeout on first call | Replaced with docker info --format runtime-registry check — no image pull |
"fail" in body.lower() substring match for task status was too loose |
Anchor on FIRST word against a _STATUS_FAILURE_WORDS frozenset |
experiment_id regex hard-coded `/lustre/ |
/home/` roots + timestamp shape |
Plus the docstring-contract fix (__init__.py Docker returns pid, Slurm returns experiment_id), the list_examples path-derived model fallback (since real launcher examples don't carry top-level model fields), Field(ge=1) on tail, pyproject.toml drops the unsatisfiable modelopt-launcher bare-name dep, and re hoisted to module top.
Deferred / declined (with rationale, see inline replies)
# nosecon subprocess — kept; mirrors the in-repo precedent attools/launcher/launch.pyandtools/launcher/core.py. If repo policy is tightening, that's a project-wide MR covering the launcher too.StrictHostKeyChecking=accept-new— kept;BatchMode=yesalready enforces key-auth-only, and the alternative requires pre-populatedknown_hostswhich is worse UX than the marginal security gain.- Workflow YAML hardening (pinned action SHA,
permissions:block,persist-credentials: false) — flagged; deferring to a repo-wide MR that hardenslauncher+mcp+ the other unpinned jobs together. Mirroring existing precedent within this PR.
Validation
- 19/19 unit tests still pass (
uv run python -m pytest tests/) - Pre-commit clean (ruff, ruff-format, mypy, bandit, license-headers)
- All 16 inline review comments have a reply with what was done / why deferred
Ready for re-review. cc @kevalmorabia97
|
Claude finished @ChenhanYu's task in 2m 52s —— View job Re-review of
|
| Issue | Verified |
|---|---|
Slurm SLURM_HOST |
child_env["SLURM_HOST"] = cluster_host or "" at bridge.py:550, passed via env=child_env on both Docker Popen (:565) and Slurm subprocess.run (:592). Matches slurm_factory(host=os.environ.get("SLURM_HOST", "")) in tools/launcher/slurm_config.py:63. |
shlex.quote removal |
import shlex dropped; all 6 override appends at bridge.py:506-524 now use raw values. |
Docker Popen DEVNULL |
bridge.py:562-569 — stdout/stderr=DEVNULL + start_new_session=True. Pipe-buffer hang fixed; SIGHUP isolation added. |
NEMORUN_HOME symmetry |
Submit pins via child_env.setdefault("NEMORUN_HOME", os.getcwd()) (:545). _resolve_experiment_dir (:704-712) now also probes launcher_dir/experiments/ and launcher_dir/local_experiments/ as belt-and-braces. |
| GPU probe | Replaced with docker info --format '{{json .}}' + "nvidia" in Runtimes check at bridge.py:249-287. No image pull. Timeout dropped to 10s. |
| Status word-anchor | bridge.py:760-762 — body.split()[0].lower() in _STATUS_FAILURE_WORDS ({failed, error, errored, cancelled, canceled}). Existing test_job_status_failed_task fixture (failed (rc=1)) still passes since split()[0] == "failed". |
experiment_id parse |
import re hoisted to top; experiment_dir regex now \S+/experiments/[^\s/]+ (works on any FS root); id regex anchored on experiment id: with a generic-token fallback. |
Plus: Field(ge=1) on tail (server.py:261), __init__.py docstring corrected, pyproject.toml drops the unsatisfiable modelopt-launcher bare-name dep, list_examples path-derives <family>/<model> when YAML body lacks it.
Deferral rationale — accepted ✅
# nosecsuppressions — confirmed identical pattern intools/launcher/launch.py:33,96andtools/launcher/core.py:380,383,390. Cleaning these up only intools/mcp/would be inconsistent. Repo-wide MR is the right boundary.StrictHostKeyChecking=accept-new—BatchMode=yesalready enforces key-auth-only, so no password-prompt hijack surface. The TOFU concern is real but tightening would impose pre-populatedknown_hostson every cluster, which is worse UX. Reasonable tradeoff for an interactive operator tool; a--strict-host-keyopt-in could be a Phase-2 addition.- Workflow YAML hardening — every job in the workflow file (
linux,windows,multi-version,partial-install,launcher,mcp,skills, …) usesactions/checkout@v6unpinned with nopermissions:block. Repo-wide hardening MR is the right shape.
Minor follow-ups (non-blocking)
_STATUS_FAILURE_WORDSlackstimeout/cancelled_by_user— nemo_run may emit other terminal states. If you encounter false-doneclassifications in Phase 2, expand the frozenset. (bridge.py:50-52)list_examplespath-derivation depth —f"{parts[1]}/{parts[2]}" if len(parts) >= 4 else Noneassumesexamples/<family>/<model>/<file>; flatter layouts (examples/<family>/<file>) getmodel=Noneunless the YAML supplies it. Today's bundled examples use the 4-deep layout so this works, but worth a comment if the layout convention is intended to be load-bearing. (bridge.py:139-140)Field(ge=1)test gap — no test exercises pydantic rejection oftail=0at the wire level. Pydantic enforces it correctly so this is purely a coverage observation, not a bug.
Risk assessment
Low. The 4 critical findings from my Phase-1 pass were genuine submission-correctness bugs; all are now fixed in a way that matches the launcher's actual interface (SLURM_HOST env, raw argv values, on-disk experiment dir under a single NEMORUN_HOME). The unit tests still mock at the subprocess boundary so they don't observe the launcher's real arg parsing — adding a --dryrun-driven smoke test is queued as a Phase 2 task per the OP, and that's a fair scope split.
LGTM from my side. cc @kevalmorabia97 for maintainer sign-off.
· branch: chenhany/modelopt-mcp-phase1
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Read the updated bridge.py to verify the fixes
cat tools/mcp/modelopt_mcp/bridge.pyLength of output: 30709 🏁 Script executed: #!/bin/bash
# Read server.py and pyproject.toml
cat tools/mcp/modelopt_mcp/server.py
cat tools/mcp/pyproject.toml
cat tools/mcp/modelopt_mcp/__init__.pyLength of output: 14771
1. Stale tool description in
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tools/mcp/modelopt_mcp/__init__.py (1)
50-52: 📐 Maintainability & Code Quality | ⚡ Quick winAlign package export style with the repo’s
__init__.pyconvention.Line 50–52 does not follow the project’s stated package-export pattern (
__all__at top andfrom .module import *re-export in package__init__.py).As per coding guidelines, “Define the public API with
__all__at the top of each module and re-export viafrom .module import *in package__init__.pyfiles.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/__init__.py` around lines 50 - 52, This file exports main directly but doesn't follow the repo convention: move the __all__ declaration to the top of the module and re-export symbols by importing the module with a relative star import; specifically, add __all__ = ["main"] at the top of this package __init__ and replace the current direct import of modelopt_mcp.server.main with a relative re-export like from .server import * so the public API is defined via __all__ and re-exported from the submodule (references: __all__, main, modelopt_mcp.server, server).Source: Coding guidelines
tools/mcp/modelopt_mcp/bridge.py (1)
268-271: 📐 Maintainability & Code Quality | ⚡ Quick winMove stdlib
jsonimport to module top-level (or justify local import).Line 268 introduces a local stdlib import with no circular/optional/heavy-import reason. Please keep it at module scope for consistency with repo import rules.
♻️ Proposed change
+import json import os import re import subprocess # nosec B404 @@ - import json as _json - - info = _json.loads(gpu.stdout) + info = json.loads(gpu.stdout)As per coding guidelines, “Keep imports at the top of the file... Put an import inside a function only when there is a concrete reason.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 268 - 271, The local stdlib import "import json as _json" should be moved to the module top-level; remove the inline import and add "import json" (or "import json as _json" to keep naming) at the top of bridge.py, then update the local usage (where info = _json.loads(gpu.stdout) and runtimes = list((info.get("Runtimes") or {}).keys())) to use the top-level json symbol; only keep a local import if you have a documented reason (circular/optional/heavy), otherwise relocate and run tests to ensure no name collisions.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tools/mcp/modelopt_mcp/__init__.py`:
- Around line 50-52: This file exports main directly but doesn't follow the repo
convention: move the __all__ declaration to the top of the module and re-export
symbols by importing the module with a relative star import; specifically, add
__all__ = ["main"] at the top of this package __init__ and replace the current
direct import of modelopt_mcp.server.main with a relative re-export like from
.server import * so the public API is defined via __all__ and re-exported from
the submodule (references: __all__, main, modelopt_mcp.server, server).
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 268-271: The local stdlib import "import json as _json" should be
moved to the module top-level; remove the inline import and add "import json"
(or "import json as _json" to keep naming) at the top of bridge.py, then update
the local usage (where info = _json.loads(gpu.stdout) and runtimes =
list((info.get("Runtimes") or {}).keys())) to use the top-level json symbol;
only keep a local import if you have a documented reason
(circular/optional/heavy), otherwise relocate and run tests to ensure no name
collisions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 47295e20-a43a-4b21-9487-cfc8b88d55d9
📒 Files selected for processing (4)
tools/mcp/modelopt_mcp/__init__.pytools/mcp/modelopt_mcp/bridge.pytools/mcp/modelopt_mcp/server.pytools/mcp/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/mcp/modelopt_mcp/server.py
Post-review-fix doc sync:
* `verify_setup(executor='docker')` description: replace the stale
`docker run --gpus all nvidia-smi` mention (commit 3de99b4
swapped this for the lighter `docker info --format` runtime-
registry probe — no image pull, daemon-fast).
* `MODELOPT_MCP_SKIP_GPU_CHECK` env var description: same fix.
* Phase 2/3 section: link out to the Epic + child Tasks
(OMNIML-5128/5132 Phase 2, OMNIML-5133/5134 Phase 3) so a reader
can find the planned work without leaving the README. Reflects
the trimmed NEL scope (6 tools, dropped nel_list/nel_validate).
Signed-off-by: Chenhan Yu <[email protected]>
Summary
Phase 1 of the modelopt-mcp design tracked in OMNIML-5123.
A new
tools/mcp/package exposing the existingtools/launcher/core.pyorchestration as typed MCP tools that codex / Claude Code agents can call directly — instead of shelling out touv run launch.py --yaml ...and parsing prose output.Five tools
list_examplestools/launcher/examples/with model + description metadata extracted from each YAMLverify_setupdocker info(daemon up) +docker info --formatruntime-registry check for thenvidiaruntime — no image pull, daemon-fast. Slurm:ssh -o BatchMode=yes -o ConnectTimeout=5to the cluster login node. Runs in ~1s, saves 30+ s of wasted submission on bad config.submit_jobhf_local→ Docker (local GPU),cluster_host→ Slurm (remote SSH). Returns experiment_id immediately; the actual job runs detachedjob_status_DONE,status_*.out) — no in-memory registry, survives MCP server restartsjob_logslog_<task>.outfrom experiment dir; per-task filtering + optional tailDesign constants
submit_jobwith mode by args (not separatesubmit_docker/submit_slurmtools). Keeps the LLM tool catalog compact; mutual-exclusion is a runtime check.verify_setupis auto-called bysubmit_jobby default (skippable when caller just probed). The probe is ~1s; the cost of a misconfigured submission is 30+ seconds of cluster timeout or container-pull. Always-on verify pays back immediately.Layout
Install
Two paths, both from source via uv. No PyPI wheel exists; OMNIML-5123 opted for the uvx-from-git pattern to skip publication overhead.
End-user install (recommended)
uvxfrom the git subdirectory — single command, no manual clone:Under the hood
uvxclones the whole repo to its cache, installstools/mcp/as the entry, and resolves the siblingmodelopt-launcherdep via[tool.uv.sources](path = "../launcher") inside the cloned tree.Dev install (local checkout)
The dev path also relies on
[tool.uv.sources]to pointmodelopt-launcherat the local../launchercheckout.Why no plain
pip installtodayTwo specific reasons, worth flagging so reviewers know what's intentional vs missing:
modelopt-mcpnormodelopt-launcherare published — this PR introduces the package but doesn't add release machinery.pipdoesn't read[tool.uv.sources]. Even from a local checkout, plainpip install -e tools/mcpfails becausemodelopt-launcheris a bare name (no URL) and pip can't find it. Sticking withuv/uvxis the practical path while we're git-only.If we later want plain-pip support, two options:
modelopt-launcher+modelopt-mcpget versioned wheelspip install, but requires release machinery + version cadence"modelopt-launcher @ git+https://...#subdirectory=tools/launcher"Out of scope for this PR; happy to follow up if reviewers want either path.
Post-review changes (commit 3de99b4 + 082a6b8)
Addressed all CodeRabbit + claude[bot] review findings. See the inline replies for details; the substantive bug-fix highlights:
cluster_host— propagate viaenv=child_env(launch.py reads SLURM_HOST, not a CLI arg)shlex.quoteremoved from nemo-run k=v overrides (subprocess list-form doesn't shell-quote)Popennow usesstdout=DEVNULL, stderr=DEVNULL, start_new_session=Trueto avoid pipe-buffer blockingNEMORUN_HOMEpinned in subprocess env so submit + status sides agreedocker run --gpus allimage-pull (slow + flaky) todocker info --formatruntime-registry check (daemon-fast)"fail" in "succeeded after retry; previous attempt failed"false-positive)experiment_idregex generalized for non-NVIDIA cluster pathspyproject.tomldropped the unsatisfiablemodelopt-launcherbare-name dep (launcher is a file-layout sibling, not a Python import dep)Field(ge=1)onjob_logs.tailpid, Slurm returnsexperiment_id)All 19/19 unit tests still pass; pre-commit clean.
Validation
uv pip install -e .succeeds (modelopt-launcher resolved transitively)uv run python -m pytest tests/)tools/listreturns all 5 with full schemas + descriptionssubmit_jobcorrectly rejects no-executor + both-executors with structuredreasondone/failed/runningfrom_DONE+status_*.outAcceptance criteria (from OMNIML-5123)
list_examplesreturns all bundled YAMLs with path and model namesubmit_jobwithhf_localruns via Docker executor and returns immediately (Phase 1: returns PID; experiment_id capture in Phase 2)submit_jobwithcluster_host/userruns via Slurm executor (detach=True) and returns experiment_idjob_statuscorrectly reflects running / done / failed from nemo_run filesystemjob_logsreturns stdout for a completed jobuvx --from git+...#subdirectory=tools/mcp modelopt-mcp --helpresolves and startstools/launcher/)Phase 2 (separate PR)
experiment_idfrom Docker subprocess output (tail until nemo_run logs the id).wait_for_experiment(experiment_id, timeout_sec) -> SlurmRunResultfor blocking polls.nmm-sandbox-mcp(companion server, separate repo) can consume for internal-ergonomics tools — cluster short-name → factory lookup + GitLab CI dispatch. See nmm-sandbox#21 for the companion design.Summary by CodeRabbit
New Features
Documentation
Tests
Chores