Skip to content

feat(tools/mcp): MCP server for ModelOpt launcher (OMNIML-5123)#1701

Open
ChenhanYu wants to merge 5 commits into main from chenhany/modelopt-mcp-phase1

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented Jun 12, 2026

Collaborator

Summary

Phase 1 of the modelopt-mcp design tracked in OMNIML-5123.

A new tools/mcp/ package exposing the existing tools/launcher/core.py orchestration as typed MCP tools that codex / Claude Code agents can call directly — instead of shelling out to uv run launch.py --yaml ... and parsing prose output.

Five tools

| Tool | Description |
| --- | --- |
| list_examples | Enumerate tools/launcher/examples/ with model + description metadata extracted from each YAML |
| verify_setup | Fail-fast probe for the named executor. Docker: docker info (daemon up) + a docker info --format runtime-registry check for the nvidia runtime; no image pull, daemon-fast. Slurm: ssh -o BatchMode=yes -o ConnectTimeout=5 to the cluster login node. Runs in ~1s, saves 30+ s of wasted submission on bad config. |
| submit_job | Submit a launcher YAML. Mode is determined by mutually-exclusive args: hf_local → Docker (local GPU), cluster_host → Slurm (remote SSH). Returns immediately (Slurm: experiment_id; Docker: PID in Phase 1); the actual job runs detached |
| job_status | Filesystem-based status from nemo_run's experiment dir (_DONE, status_*.out); no in-memory registry, survives MCP server restarts |
| job_logs | Read log_<task>.out from experiment dir; per-task filtering + optional tail |
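For reviewers who want a concrete picture of the Docker half of verify_setup, here is a minimal sketch. It is not the code in bridge.py: the function name `probe_docker`, the returned dict keys, the reason strings, and the injectable `run` parameter are all assumptions made for illustration. The only external call is `docker info --format '{{json .Runtimes}}'`, which is standard Docker CLI.

```python
import json
import subprocess
from typing import Callable


def probe_docker(run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> dict:
    """Hypothetical probe: daemon reachable and nvidia runtime registered."""
    # One daemon round-trip, no image pull: ask only for the runtime registry.
    argv = ["docker", "info", "--format", "{{json .Runtimes}}"]
    try:
        proc = run(argv, capture_output=True, text=True, timeout=10)
    except FileNotFoundError:
        return {"ok": False, "reason": "docker_cli_not_found"}
    except subprocess.TimeoutExpired:
        return {"ok": False, "reason": "docker_info_timeout"}
    if proc.returncode != 0:
        return {"ok": False, "reason": "docker_daemon_unreachable"}
    try:
        runtimes = json.loads(proc.stdout or "{}")
    except json.JSONDecodeError:
        return {"ok": False, "reason": "unparseable_docker_info"}
    # The registry is a JSON object keyed by runtime name (e.g. "runc", "nvidia").
    if not isinstance(runtimes, dict) or "nvidia" not in runtimes:
        return {"ok": False, "reason": "nvidia_runtime_missing"}
    return {"ok": True, "reason": None}
```

Passing `run` as a parameter keeps the probe testable without a Docker daemon, which is in the same spirit as the hermetic tests described in this PR.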

Design constants

  1. Single submit_job with mode by args (not separate submit_docker / submit_slurm tools). Keeps the LLM tool catalog compact; mutual-exclusion is a runtime check.
  2. Filesystem is the source of truth for status + logs. No in-memory registry. Survives MCP server restarts cleanly — important because operators / agents kill + restart their hosts often.
  3. verify_setup is auto-called by submit_job by default (skippable when caller just probed). The probe is ~1s; the cost of a misconfigured submission is 30+ seconds of cluster timeout or container-pull. Always-on verify pays back immediately.
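The "mode by args" rule in item 1 amounts to a small runtime check. The sketch below shows one way it could look; `resolve_mode`, the dict shape, and the reason strings are hypothetical names, not the ones used in bridge.py.

```python
from typing import Optional


def resolve_mode(hf_local: Optional[str] = None, cluster_host: Optional[str] = None) -> dict:
    """Hypothetical helper: pick the executor from mutually exclusive arguments."""
    if hf_local and cluster_host:
        # Both set: ambiguous, so refuse instead of silently preferring one.
        return {"ok": False, "reason": "both_executors_specified"}
    if not hf_local and not cluster_host:
        return {"ok": False, "reason": "no_executor_specified"}
    return {"ok": True, "executor": "docker" if hf_local else "slurm"}
```

Returning a structured reason rather than raising lets the calling agent branch on the failure without parsing an error message.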

Layout

tools/mcp/
├── pyproject.toml          # name: modelopt-mcp, console_script
├── modelopt_mcp/
│   ├── __init__.py
│   ├── server.py           # FastMCP entry; 5 tool definitions
│   └── bridge.py           # thin wrapper over launcher's core.py
│                           #   + filesystem status/log helpers
└── tests/
    └── test_bridge.py      # 19 unit tests, fully hermetic
                            # (mocked subprocess + tmp_path fixtures)

Install

Two paths, both from source via uv. No PyPI wheel exists; OMNIML-5123 opted for the uvx-from-git pattern to skip publication overhead.

End-user install (recommended)

uvx from the git subdirectory — single command, no manual clone:

# Claude Code
claude mcp add modelopt -- uvx --from \
  "git+https://github.com/NVIDIA/Model-Optimizer.git#subdirectory=tools/mcp" \
  modelopt-mcp

# Codex
codex mcp add modelopt -- uvx --from \
  "git+https://github.com/NVIDIA/Model-Optimizer.git#subdirectory=tools/mcp" \
  modelopt-mcp

Under the hood uvx clones the whole repo to its cache, installs tools/mcp/ as the entry, and resolves the sibling modelopt-launcher dep via [tool.uv.sources] (path = "../launcher") inside the cloned tree.

Dev install (local checkout)

uv pip install -e tools/launcher    # sibling dep first
uv pip install -e tools/mcp         # then this package
modelopt-mcp                         # entry on PATH

The dev path also relies on [tool.uv.sources] to point modelopt-launcher at the local ../launcher checkout.

Why no plain pip install today

Two specific reasons, worth flagging so reviewers know what's intentional vs missing:

  1. Nothing on PyPI yet. Neither modelopt-mcp nor modelopt-launcher is published; this PR introduces the package but doesn't add release machinery.
  2. pip doesn't read [tool.uv.sources]. Even from a local checkout, plain pip install -e tools/mcp fails because modelopt-launcher is a bare name (no URL) and pip can't find it. Sticking with uv / uvx is the practical path while we're git-only.

If we later want plain-pip support, two options:

| Path | Tradeoff |
| --- | --- |
| Publish to PyPI: modelopt-launcher + modelopt-mcp get versioned wheels | Clean pip install, but requires release machinery + version cadence |
| PEP-440 direct URL in pyproject: "modelopt-launcher @ git+https://...#subdirectory=tools/launcher" | Works with pip + uv, but double-clones the same repo on install (cheap and ugly) |

Out of scope for this PR; happy to follow up if reviewers want either path.

Post-review changes (commits 3de99b4 + 082a6b8)

Addressed all CodeRabbit + claude[bot] review findings. See the inline replies for details; the substantive bug-fix highlights:

  • Slurm cluster_host — propagate via env=child_env (launch.py reads SLURM_HOST, not a CLI arg)
  • shlex.quote removed from nemo-run k=v overrides (subprocess list-form doesn't shell-quote)
  • Docker Popen now uses stdout=DEVNULL, stderr=DEVNULL, start_new_session=True to avoid pipe-buffer blocking
  • NEMORUN_HOME pinned in subprocess env so submit + status sides agree
  • GPU verify swapped from docker run --gpus all image-pull (slow + flaky) to docker info --format runtime-registry check (daemon-fast)
  • Task-status word match anchors on first word against a fixed failure-word set (no more "fail" in "succeeded after retry; previous attempt failed" false-positive)
  • experiment_id regex generalized for non-NVIDIA cluster paths
  • pyproject.toml dropped the unsatisfiable modelopt-launcher bare-name dep (launcher is a file-layout sibling, not a Python import dep)
  • Field(ge=1) on job_logs.tail
  • Docstring contract clarified (Docker returns pid, Slurm returns experiment_id)
  • README updated to reflect the new GPU probe + Phase 2/3 task linkage

All 19/19 unit tests still pass; pre-commit clean.
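The shlex.quote item in the list above is easy to reproduce in isolation. The following sketch (the `child_sees` helper and the sample value are made up for the demonstration) passes a key=value override to a child process in list form, once quoted and once raw, and reports what the child actually receives.

```python
import shlex
import subprocess
import sys

value = "my model dir"
quoted = f"job_name={shlex.quote(value)}"  # what the pre-fix code would build
raw = f"job_name={value}"                  # what list-form argv actually needs


def child_sees(arg: str) -> str:
    """Spawn a child Python process in list form and return the argument it received."""
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", arg],
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.rstrip("\n")


if __name__ == "__main__":
    # No shell is involved, so nothing strips the quotes shlex.quote added.
    print(child_sees(quoted))
    print(child_sees(raw))
```

With list-form argv each element reaches the child verbatim, so the quoted variant carries literal single quotes into the override value while the raw variant arrives intact.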

Validation

  • uv pip install -e . succeeds (modelopt-launcher resolved transitively)
  • 19/19 unit tests pass (uv run python -m pytest tests/)
  • stdio handshake works end-to-end; tools/list returns all 5 with full schemas + descriptions
  • Mode-resolution: submit_job correctly rejects no-executor + both-executors with structured reason
  • Filesystem status: correctly classifies done / failed / running from _DONE + status_*.out
  • Pre-commit clean: ruff, ruff-format, mypy, bandit, license-headers
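As a rough illustration of the filesystem status check and the first-word match mentioned in the post-review changes, consider the sketch below. The failure-word set, the assumption that each status_*.out file starts with a status word, and the names `task_failed` and `classify` are guesses for the sake of the example; the precedence used in bridge.py may differ.

```python
from pathlib import Path

# Assumed failure vocabulary; the real set lives in bridge.py.
FAILURE_WORDS = frozenset({"failed", "error", "cancelled", "timeout"})


def task_failed(status_text: str) -> bool:
    """Match only the first word, so 'succeeded ... previous attempt failed' is not a failure."""
    words = status_text.strip().lower().split()
    return bool(words) and words[0].strip(":.,") in FAILURE_WORDS


def classify(experiment_dir: Path) -> str:
    """Return 'failed', 'done', or 'running' using only files on disk."""
    statuses = [
        p.read_text(errors="replace")
        for p in sorted(experiment_dir.glob("status_*.out"))
    ]
    if any(task_failed(s) for s in statuses):
        return "failed"
    if (experiment_dir / "_DONE").exists():
        return "done"
    return "running"
```

Because every call re-reads the directory, a restarted server gives the same answer as the one that submitted the job.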

Acceptance criteria (from OMNIML-5123)

  • list_examples returns all bundled YAMLs with path and model name
  • submit_job with hf_local runs via Docker executor and returns immediately (Phase 1: returns PID; experiment_id capture in Phase 2)
  • submit_job with cluster_host/user runs via Slurm executor (detach=True) and returns experiment_id
  • job_status correctly reflects running / done / failed from nemo_run filesystem
  • job_logs returns stdout for a completed job
  • uvx --from git+...#subdirectory=tools/mcp modelopt-mcp --help resolves and starts
  • Existing launcher tests unaffected (no changes to tools/launcher/)
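The job_logs criterion can be pictured with a short sketch that reads log_<task>.out files straight from an experiment directory. The function name `read_logs`, the return shape, and the reason string are illustrative assumptions, and `tail` is assumed to be at least 1, as the server-side schema enforces.

```python
from pathlib import Path
from typing import Optional


def read_logs(experiment_dir: Path, task: Optional[str] = None, tail: Optional[int] = None) -> dict:
    """Hypothetical reader: one task's log, or all logs, optionally tailed."""
    if task:
        candidates = [experiment_dir / f"log_{task}.out"]
    else:
        candidates = sorted(experiment_dir.glob("log_*.out"))
    files = [p for p in candidates if p.is_file()]
    if not files:
        return {"ok": False, "reason": "task_log_not_found"}
    logs = {}
    for path in files:
        lines = path.read_text(errors="replace").splitlines()
        if tail is not None:
            lines = lines[-tail:]  # caller guarantees tail >= 1
        logs[path.name] = "\n".join(lines)
    return {"ok": True, "logs": logs}
```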

Phase 2 (separate PR)

  • Capture experiment_id from Docker subprocess output (tail until nemo_run logs the id).
  • Add wait_for_experiment(experiment_id, timeout_sec) -> SlurmRunResult for blocking polls.
  • Extract the verify + submit helpers into a shared lib that nmm-sandbox-mcp (companion server, separate repo) can consume for internal-ergonomics tools — cluster short-name → factory lookup + GitLab CI dispatch. See nmm-sandbox#21 for the companion design.

Summary by CodeRabbit

  • New Features

    • Added a ModelOpt MCP server: list examples, verify executors, submit jobs, check status, and fetch logs; supports Docker (local GPU) and Slurm backends and a console entrypoint.
  • Documentation

    • Comprehensive README with install, end-to-end examples, usage, and deployment notes.
  • Tests

    • Unit tests covering discovery, verification, submission flow, status, and logs.
  • Chores

    • CI updated to run MCP tests; package/project config added.

Phase 1 implementation of the modelopt-mcp design tracked in OMNIML-5123.

Five tools exposed over stdio MCP:
  * list_examples       — enumerate tools/launcher/examples/ with
                          model + description metadata
  * verify_setup        — fail-fast probe (docker daemon + GPU, or
                          slurm SSH passwordless)
  * submit_job          — submit a launcher YAML; mode resolved from
                          mutually-exclusive args (hf_local → Docker,
                          cluster_host → Slurm). Returns the experiment
                          id immediately; the actual job runs detached.
  * job_status          — filesystem-based status from nemo_run
                          experiment dir (_DONE + status_*.out)
  * job_logs            — filesystem read of log_<task>.out files,
                          with optional tail

Design principles (from OMNIML-5123):
  1. Single submit_job with mode by args (not separate
     submit_docker / submit_slurm tools). Keeps the LLM catalog compact.
  2. Filesystem is the source of truth for status + logs. No
     in-memory registry; survives MCP server restarts.
  3. verify_setup is auto-called by submit_job by default — the
     verify probe takes ~1s and saves 30+ seconds of wasted submission
     time on bad config. Callers can pass skip_verify=True when they
     just probed.

Layout:
  tools/mcp/
    pyproject.toml           # name: modelopt-mcp, console_script
    modelopt_mcp/
      __init__.py
      server.py              # FastMCP entry; 5 tool definitions
      bridge.py              # thin wrapper over launcher's core.py
                             #   + filesystem status/log helpers
    tests/test_bridge.py     # 19 unit tests, fully hermetic (mocked
                             # subprocess + tmp_path fixtures)

Install:
  # End user (uvx-from-git)
  claude mcp add modelopt -- uvx --from \
    "git+https://github.com/NVIDIA/Model-Optimizer.git#subdirectory=tools/mcp" \
    modelopt-mcp

  # Local development
  uv pip install -e tools/launcher
  uv pip install -e tools/mcp
  modelopt-mcp  # stdio server

Validation:
  * uv pip install -e . succeeds
  * 19/19 unit tests pass
  * stdio handshake works; tools/list returns all 5 with full schemas
  * mode-resolution rejects no-executor + both-executors correctly
  * filesystem status correctly classifies done / failed / running
  * pre-commit clean (ruff, mypy, bandit, license)

Phase 2 (separate PR): capture experiment_id from Docker subprocess
output, add wait_for_experiment, extract the verify + submit helpers
into a shared lib that nmm-sandbox-mcp can consume for its
internal-ergonomics tools (cluster short-name → factory lookup +
GitLab CI dispatch).

Signed-off-by: Chenhan Yu <[email protected]>
@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4b1448db-0035-40af-9131-707395c401a2

📥 Commits

Reviewing files that changed from the base of the PR and between 3de99b4 and 082a6b8.

📒 Files selected for processing (1)
  • tools/mcp/README.md
✅ Files skipped from review due to trivial changes (1)
  • tools/mcp/README.md

📝 Walkthrough

Adds a new modelopt-mcp MCP server and bridge exposing launcher tools (list_examples, verify_setup, submit_job, job_status, job_logs), package metadata and console script, comprehensive README, unit tests, and CI workflow updates to run the new tests.

Changes

MCP Server Implementation for Launcher Tool Exposure

| Layer / File(s) | Summary |
| --- | --- |
| CI workflow integration (.github/workflows/unit_tests.yml) | tools/mcp/** paths are now treated as test-relevant; adds an mcp job that installs uv, sets up a venv, installs the sibling launcher package then tools/mcp, runs pytest, and updates unit-pr-required-check to depend on mcp and include its result in failure conditions. |
| End-user and developer documentation (tools/mcp/README.md) | Documentation for modelopt-mcp: purpose, Docker vs Slurm execution modes, the five MCP tools, uv/uvx install flows, example agent workflow, environment variables, design principles, repo layout, and Phase 2 roadmap. |
| Package configuration and entry point (tools/mcp/pyproject.toml, tools/mcp/modelopt_mcp/__init__.py) | Adds modelopt-mcp project metadata and dependencies (mcp, pyyaml, pydantic), console script modelopt-mcp = modelopt_mcp.server:main, build/package discovery config, and re-exports main at package level. |
| MCP-to-launcher bridge: examples (tools/mcp/modelopt_mcp/bridge.py) | Locate launcher examples dir and implement ExampleEntry + list_examples_impl() to enumerate YAML examples, derive model from path, and parse model/description fields when present. |
| MCP-to-launcher bridge: verification (tools/mcp/modelopt_mcp/bridge.py) | Implements verify_docker_setup_impl() (checks docker info and, unless skipped, verifies nvidia runtime via docker info JSON) and verify_slurm_setup_impl() (SSH probe with structured diagnostics and parsed whoami/hostname). |
| MCP-to-launcher bridge: submission (tools/mcp/modelopt_mcp/bridge.py) | Adds _normalize_yaml_path() and submit_job_impl() to select executor (Docker vs Slurm), optionally run verification, validate YAML existence, build uv run launch.py --yaml ... --yes, dispatch detached Popen for Docker (returns experiment_id: None) or synchronous subprocess.run for Slurm (parse experiment id/dir/slurm_job_id, handle timeouts and failure tails). |
| MCP-to-launcher bridge: inspection (tools/mcp/modelopt_mcp/bridge.py) | Adds _resolve_experiment_dir() to map experiment IDs to on-disk directories, job_status_impl() to compute done/failed/running from _DONE and status_*.out, and job_logs_impl() to read log_<task>.out files and optionally tail them. |
| MCP server and tool registration (tools/mcp/modelopt_mcp/server.py) | Builds a FastMCP("modelopt") server, registers Phase 1 tools (list_examples, verify_setup, submit_job, job_status, job_logs) with typed inputs delegating to bridge functions, validates Slurm cluster_host when required, and exposes main() for stdio serving with configurable logging. |
| Unit tests for bridge and tool logic (tools/mcp/tests/__init__.py, tools/mcp/tests/test_bridge.py) | Adds unit tests covering example discovery, malformed YAML tolerance, Docker and Slurm verification behaviors (including missing binaries and SSH auth failures), submit_job executor selection and YAML-not-found handling, _DONE semantics for job_status, and job_logs aggregation/tailing and missing-task errors. |
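The detached Docker dispatch summarized in the submission row can be sketched as follows. This is an illustration under stated assumptions, not the bridge.py implementation: `spawn_detached` and the dict shape are hypothetical, while the DEVNULL redirection, start_new_session=True, and the launcher_cli_not_found reason are taken from the review discussion on this PR.

```python
import subprocess
import sys
from typing import Optional, Sequence


def spawn_detached(argv: Sequence[str], cwd: Optional[str] = None) -> dict:
    """Hypothetical helper: start a long-running child and return at once."""
    try:
        proc = subprocess.Popen(
            list(argv),
            cwd=cwd,
            stdin=subprocess.DEVNULL,
            # Nobody reads the pipes after we return, so do not create them;
            # a full pipe buffer would otherwise block the child.
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            # POSIX: detach from our session so the child outlives the server.
            start_new_session=True,
        )
    except FileNotFoundError:
        return {"ok": False, "reason": "launcher_cli_not_found"}
    except OSError as exc:
        return {"ok": False, "reason": f"spawn_failed: {exc}"}
    # Phase 1 contract for Docker: a PID now, no experiment_id yet.
    return {"ok": True, "pid": proc.pid, "experiment_id": None}
```

Catching FileNotFoundError before the broader OSError keeps the "binary missing" case distinguishable from other spawn failures.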

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | tools/mcp/modelopt_mcp/bridge.py uses “# nosec B404” (subprocess import). SECURITY.md forbids #nosec bypasses without an approved security exception. | Remove the #nosec comment or obtain an approved security exception from @NVIDIA/modelopt-setup-codeowners with explicit justification in the PR description. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 73.17% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: introducing an MCP server for the ModelOpt launcher tool. It is concise, uses a conventional feat prefix, and directly relates to the primary changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ChenhanYu

Collaborator Author

Tracks OMNIML-5123. Companion design (nmm-sandbox-mcp) at omniml/integration/nmm-sandbox#21.

Adds a new `mcp` job to .github/workflows/unit_tests.yml that mirrors
the existing `launcher` job's shape — installs the sibling
modelopt-launcher editable, installs modelopt-mcp editable, runs
pytest. The job is gated by check-file-changes to skip on unrelated
PRs (matching launcher's pattern).

Three changes:
  * paths: add `tools/mcp/**` to the push-trigger watch list so
    changes under that subtree fire unit_tests.yml.
  * check-file-changes.files: add `tools/mcp/**` to the changed-files
    detector so the gated jobs (launcher / mcp / skills) skip on PRs
    that touch unrelated areas.
  * unit-pr-required-check: add `mcp` to `needs:` and the failure-
    aggregator condition so a failing mcp job blocks merge the same
    way launcher does.

No other workflows touched. regression_tests.yml is GPU-only and
modelopt-mcp's tests are hermetic (subprocess mocked), so it stays
out of scope.

Signed-off-by: Chenhan Yu <[email protected]>
@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.09%. Comparing base (ddc0a8e) to head (082a6b8).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1701   +/-   ##
=======================================
  Coverage   77.09%   77.09%           
=======================================
  Files         511      511           
  Lines       56176    56176           
=======================================
  Hits        43310    43310           
  Misses      12866    12866           
Flag Coverage Δ
unit 54.34% <ø> (ø)

@ChenhanYu ChenhanYu requested a review from kevalmorabia97 June 12, 2026 21:57
Adds tools/mcp/README.md so a developer landing in the directory has
a single-file entry into:

  * What modelopt-mcp is (thin MCP wrapper over tools/launcher/core.py)
  * The 5-tool surface with one-line each
  * Both install paths (uvx-from-git for end-users, uv pip install -e
    for dev) + why plain `pip install` doesn't work yet + two
    options to enable it later
  * A concrete end-to-end usage example (list_examples → verify_setup →
    submit_job → job_status → job_logs)
  * Required env vars table
  * The three design constants (single submit_job, filesystem source-
    of-truth, auto-verify default)
  * Pointer to the NVIDIA-internal companion (nmm-sandbox-mcp) for
    operators who want short-name cluster ergonomics + GitLab CI
    dispatch
  * Layout + Phase 2 hooks

Markdownlint-clean; matches the README style in tools/launcher/.

Signed-off-by: Chenhan Yu <[email protected]>
@ChenhanYu ChenhanYu marked this pull request as ready for review June 12, 2026 22:04
@ChenhanYu ChenhanYu requested a review from a team as a code owner June 12, 2026 22:04
@ChenhanYu

Collaborator Author

/claude review

@coderabbitai coderabbitai Bot left a comment

Contributor

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 7

🧹 Nitpick comments (2)
tools/mcp/tests/test_bridge.py (1)

174-214: 📐 Maintainability & Code Quality | ⚡ Quick win

Add regression tests for subprocess hardening paths introduced in bridge fixes.

Please add focused tests for:

  1. rejecting option-like cluster_host/cluster_user,
  2. preserving raw override values with spaces (no literal quote injection), and
  3. structured launcher_cli_not_found/spawn-failure returns.

This will lock in the new security/reliability contracts on the MCP boundary.

Also applies to: 221-271

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/tests/test_bridge.py` around lines 174 - 214, Add three focused
pytest cases in tools/mcp/tests/test_bridge.py that exercise
bridge.verify_slurm_setup_impl: (1) call verify_slurm_setup_impl with
option-like values for cluster_host and/or cluster_user (e.g., values starting
with "-") and assert the result["ok"] is False and result["reason"] indicates
rejection of option-like inputs; (2) call verify_slurm_setup_impl with override
values that include spaces or shell metacharacters and assert the returned
fields preserve the raw strings (no injected literal quotes) — use monkeypatched
subprocess.run to return expected stdout/stderr and check returned
whoami/remote_hostname/override fields match inputs exactly; (3) simulate
spawn/launcher failures by monkeypatching subprocess.run to raise
FileNotFoundError or return a non-zero spawn failure code and assert
verify_slurm_setup_impl returns a structured error (result["ok"] is False and
result["reason"] equals/indicates "launcher_cli_not_found" or a spawn-failure
sentinel) so the contract for structured failure payloads is validated.
tools/mcp/modelopt_mcp/__init__.py (1)

48-50: 📐 Maintainability & Code Quality | ⚡ Quick win

Move __all__ to the top-level export section.

__all__ is currently declared after imports. Put it directly below the module docstring so the public surface is explicit before implementation imports.

As per coding guidelines, "Each module declares its public surface with __all__ = [...] at the top of the file."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/__init__.py` around lines 48 - 50, Move the __all__
declaration immediately below the module docstring so the module's public
surface is declared before imports; specifically place "__all__ = ['main']"
right after the top-of-file docstring and before the "from modelopt_mcp.server
import main" import to satisfy the guideline that modules declare public exports
at the top (referencing the __all__ symbol and the imported main symbol).

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/unit_tests.yml:
- Around line 152-160: The mcp job currently lacks a job-level permissions
block, uses actions/checkout@v6 without pinning, and leaves persist-credentials
default true; update the mcp job to set minimal GITHUB_TOKEN permissions (e.g.,
read-only repo and contents as required), add a permissions: block at the job
level, change the checkout invocation (uses: actions/checkout@v6) to a pinned
reference (replace with a specific commit SHA tag for actions/checkout or add
with: ref: ${{ github.sha }} if you need the exact commit), and set with:
persist-credentials: false so credentials aren’t injected into the workspace;
keep the condition (if: needs.check-file-changes.outputs.any_changed == 'true')
and needs: [linux, check-file-changes] as-is.

In `@tools/mcp/modelopt_mcp/__init__.py`:
- Around line 28-31: Update the module docstring for submit_job to accurately
describe its return contract: state that when mode is hf_local (Docker)
submit_job returns the Docker process PID (since it runs in a background
thread), and when mode is cluster_host (Slurm) it returns the submitted
experiment_id (Slurm submission with detach=True); reference the submit_job
function and the modes hf_local/Docker and cluster_host/Slurm so callers know
which value to expect.

In `@tools/mcp/modelopt_mcp/bridge.py`:
- Line 38: Remove all inline "# nosec" suppressions on subprocess imports/usages
(e.g., the import subprocess line and occurrences around lines referenced) and
make the code pass Bandit by fixing unsafe subprocess patterns instead of
silencing them: import subprocess without a comment, ensure every subprocess
invocation in bridge.py (search for subprocess.run/ Popen/call usages) uses a
list of arguments (no shell=True), validates or sanitizes any external inputs
(use shlex.quote or strict whitelist validation where inputs originate), and use
subprocess.run(..., check=True) or other safe APIs; if any use truly requires
shell semantics, refactor to an explicit, validated command builder or document
and encapsulate the risk with safer alternatives so Bandit warnings are resolved
rather than suppressed.
- Around line 512-517: The subprocess.Popen call that spawns the launcher (argv,
cwd=str(launcher_dir), assigned to proc) must not raise uncaught
FileNotFoundError/OSError; wrap the Popen call in a try/except that catches
FileNotFoundError and OSError and returns a structured error result (instead of
letting the exception propagate) so the MCP tool can surface a clear
"launcher/uv not found" error; do the same for the other subprocess invocation
that only catches subprocess.TimeoutExpired (also catch
FileNotFoundError/OSError there). Remove the "# nosec B603" suppressions and
replace them with the new explicit error handling and logging using the same
variables (argv, launcher_dir, proc) and the existing TimeoutExpired handling to
produce consistent structured error returns.
- Around line 477-491: The argv list currently wraps values with shlex.quote
(e.g., hf_local, cluster_host, cluster_user, identity, job_dir, job_name and
extra_overrides values) which injects literal quotes when argv is passed as a
list to subprocess (no shell); remove shlex.quote and append the raw string
values instead (use str(...) where needed and fall back to empty string for
None), e.g. build entries as f"hf_local={str(hf_local)}" and for extra_overrides
use f"{k}={str(v)}"; keep the same keys (argv, extra_overrides) and conditional
checks but do not call shlex.quote.
- Around line 282-295: Remove the "# nosec" bypasses and harden SSH/ subprocess
usage: in bridge.py update verify_slurm_setup_impl to validate cluster_user and
cluster_host (reject/escape values that start with '-' or contain unexpected
characters; allow only a safe regex like [A-Za-z0-9._-]+), then build the ssh
target as a single argv element (target = f"{cluster_user}@{cluster_host}" or
cluster_host alone) and append it without any shell interpolation so ssh options
cannot be injected; in submit_job_impl stop using shlex.quote on key=value
arguments (pass raw strings in the argv list since subprocess is used without a
shell); in the docker branch replace returning Popen(..., stdout=PIPE,
stderr=STDOUT) directly with a safe pattern that drains output (use
subprocess.run(...) to capture/stream output or call Popen.communicate() / a
dedicated reader thread before returning) to avoid deadlocks; finally remove all
"# nosec" comments and address Bandit findings through the proper
security-exception workflow instead of silencing them.

In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 259-262: The tail parameter on the MCP boundary must enforce a
minimum of 1 to avoid surprising slicing in bridge.job_logs_impl (which calls
body.splitlines()[-tail:]); update the Annotated Field for tail in the job_logs
schema to include ge=1 (or equivalent lower-bound validation) so values like 0
or negatives cannot be passed into job_logs_impl and cause unexpected behavior.
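The slicing surprise behind the tail comment is easy to see in plain Python. The snippet below is a standalone illustration; `tail_lines` and `checked_tail` are hypothetical helpers, with `checked_tail` standing in for the lower bound that Field(ge=1) enforces at the schema level.

```python
from typing import List

lines = ["line 1", "line 2", "line 3"]


def tail_lines(lines: List[str], tail: int) -> List[str]:
    """Unvalidated tail, mirroring the lines[-tail:] slice."""
    return lines[-tail:]


def checked_tail(lines: List[str], tail: int) -> List[str]:
    """Same slice, but reject values below 1 before slicing."""
    if tail < 1:
        raise ValueError("tail must be >= 1")
    return lines[-tail:]


if __name__ == "__main__":
    print(tail_lines(lines, 1))   # last line only
    print(tail_lines(lines, 0))   # -0 == 0, so this is lines[0:], the whole list
    print(tail_lines(lines, -1))  # lines[1:], which drops the first line instead
```

A caller asking for zero lines gets everything, and a negative value silently trims from the front, which is why rejecting those inputs at the boundary is the simpler contract.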

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 41d93d76-4c0b-45cd-b0fc-9406b45c4455

📥 Commits

Reviewing files that changed from the base of the PR and between ddc0a8e and cb3766d.

📒 Files selected for processing (8)
  • .github/workflows/unit_tests.yml
  • tools/mcp/README.md
  • tools/mcp/modelopt_mcp/__init__.py
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/modelopt_mcp/server.py
  • tools/mcp/pyproject.toml
  • tools/mcp/tests/__init__.py
  • tools/mcp/tests/test_bridge.py

Comment thread .github/workflows/unit_tests.yml
Comment thread tools/mcp/modelopt_mcp/__init__.py Outdated
Comment thread tools/mcp/modelopt_mcp/bridge.py
Comment on lines +282 to +295
argv = [
    "ssh",
    "-o",
    "BatchMode=yes",
    "-o",
    "StrictHostKeyChecking=accept-new",
    "-o",
    "ConnectTimeout=5",
]
if identity:
    argv += ["-i", identity]
target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
argv += [target, "whoami && hostname"]
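One way to harden the target construction in the snippet above, following the allowlist suggested in the review prompt ([A-Za-z0-9._-]+, no leading "-"), is sketched below. `build_ssh_target` is a hypothetical helper and raising ValueError is only one possible failure style; the bridge returns structured dicts elsewhere.

```python
import re
from typing import Optional

# Allowlist from the review suggestion: letters, digits, dot, underscore, hyphen.
_SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def _check(name: str, value: str) -> None:
    # A leading "-" would let ssh parse the value as an option.
    if value.startswith("-") or not _SAFE_NAME.fullmatch(value):
        raise ValueError(f"unsafe {name}: {value!r}")


def build_ssh_target(cluster_host: str, cluster_user: Optional[str] = None) -> str:
    """Hypothetical helper: validate, then build a single argv element for ssh."""
    _check("cluster_host", cluster_host)
    if cluster_user:
        _check("cluster_user", cluster_user)
        return f"{cluster_user}@{cluster_host}"
    return cluster_host
```

The returned string is meant to be appended to argv as one element, so no shell interpolation is involved at any point.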

@coderabbitai coderabbitai Bot Jun 12, 2026

Contributor

🔒 Security & Privacy | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/mcp/modelopt_mcp/bridge.py"

# Basic context: locate key functions and line ranges mentioned in the original comment
rg -n "def (verify_slurm_setup_impl|submit_job_impl)" "$file" || true
rg -n "verify_slurm_setup_impl|submit_job_impl" "$file" || true

# Inspect around the snippet lines 260-320 (adjusted by actual file line numbers)
nl -ba "$file" | sed -n '250,330p'

# Inspect other cited ranges: 450-520, 500-560
nl -ba "$file" | sed -n '440,520p'
nl -ba "$file" | sed -n '500,570p'

# Search for security-policy-related patterns
rg -n "# nosec" "$file" || true
rg -n "shlex\.quote" "$file" || true
rg -n "subprocess\.(run|Popen)" "$file" || true

# Also inspect any ssh target building and where argv list is finalized
rg -n "ssh|StrictHostKeyChecking|BatchMode|ConnectTimeout|StrictHostKeyChecking=accept-new" "$file" || true

Repository: NVIDIA/Model-Optimizer

Length of output: 293


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/mcp/modelopt_mcp/bridge.py"

echo "== show verify_slurm_setup_impl and surrounding code (approx) =="
awk 'NR>=250 && NR<=340 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'

echo "== show submit_job_impl and surrounding code (approx) =="
awk 'NR>=360 && NR<=470 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'

echo "== show other cited ranges (approx) =="
awk 'NR>=460 && NR<=540 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'
awk 'NR>=480 && NR<=560 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p'

echo "== search patterns =="
rg -n "# nosec" "$file" || true
rg -n "shlex\.quote" "$file" || true
rg -n "subprocess\.(run|Popen|check_output|call)" "$file" || true
rg -n "StrictHostKeyChecking|BatchMode=yes|ConnectTimeout=5|whoami && hostname" "$file" || true

Repository: NVIDIA/Model-Optimizer

Length of output: 17404


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "cluster_host|cluster_user|identity" tools/mcp/modelopt_mcp/bridge.py

# See if bridge does any validation for these fields before calling verify/submit.
rg -n "validate|regex|allowlist|startswith\\(\"-\"\\)|StrictHostKeyChecking" tools/mcp/modelopt_mcp/bridge.py || true

# Find launcher argument parsing for the key=value overrides we pass to `uv run launch.py`.
# This checks whether shlex.quote would end up including literal quotes in the value.
file_launcher="$(fd -a "launch.py" -tf --exact 2>/dev/null tools/launcher || true)"
echo "launcher candidates:"
echo "$file_launcher"

# If fd above didn't return, fall back to repo-wide filename search.
if [ -z "$file_launcher" ]; then
  file_launcher="$(fd -a "launch.py" .)"
fi

echo "Using launcher file(s):"
echo "$file_launcher"

# Inspect how `hf_local=...` / `cluster_host=...` style args are parsed/handled.
rg -n "hf_local=|cluster_host=|user=|identity=|extra_overrides" $file_launcher || true
rg -n "shlex\\.split|argparse|key=value|\\w+=\\w" $file_launcher || true

Repository: NVIDIA/Model-Optimizer

Length of output: 4327


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect tools/launcher/launch.py for argument parsing of key=value pairs
file="tools/launcher/launch.py"
wc -l "$file"
awk 'NR>=1 && NR<=220 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,220p'
awk 'NR>=220 && NR<=420 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,220p'

# Find where cluster_host/user/identity values come from in the bridge MCP layer
rg -n "cluster_host" tools/mcp/modelopt_mcp || true
rg -n "cluster_user" tools/mcp/modelopt_mcp || true
rg -n "submit_job" tools/mcp/modelopt_mcp || true

# Find any schema/validation layer for these fields
rg -n "class .*Model|pydantic|BaseModel|Field\\(|constr\\(|validator\\(" tools/mcp/modelopt_mcp || true

Repository: NVIDIA/Model-Optimizer

Length of output: 11191


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect server validation for cluster_host/cluster_user
file="tools/mcp/modelopt_mcp/server.py"
echo "== server.py cluster_host/cluster_user definitions =="
awk 'NR>=70 && NR<=220 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,180p'

# Inspect bridge.py for job_status and NEMORUN_HOME usage
b="tools/mcp/modelopt_mcp/bridge.py"
echo "== bridge.py search NEMORUN_HOME and experiment_id handling =="
rg -n "NEMORUN_HOME|experiment_id|job_status|proc\\b|Popen\\(" "$b"

echo "== bridge.py around job_status and further lines (approx 540-720) =="
awk 'NR>=530 && NR<=720 {printf "%4d:%s\n", NR, $0}' "$b" | sed -n '1,240p'

Repository: NVIDIA/Model-Optimizer

Length of output: 17257


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/mcp/modelopt_mcp/bridge.py"
echo "== top of bridge.py imports =="
awk 'NR>=1 && NR<=80 {printf "%4d:%s\n", NR, $0}' "$file"
rg -n "^import re\\b|from re import" "$file" || true

Repository: NVIDIA/Model-Optimizer

Length of output: 3632


Fix SSH target injection and remove disallowed # nosec Bandit bypasses

  • Security policy: tools/mcp/modelopt_mcp/bridge.py uses # nosec to suppress Bandit for multiple subprocess calls (and tools/launcher/launch.py does as well). Repo guidelines disallow # nosec bypasses unless an approved security exception exists—remove these suppressions or route through the required codeowner-approved exception.
  • SSH option injection: verify_slurm_setup_impl() builds target = f"{cluster_user}@{cluster_host}" and appends it to ssh argv without --, with no validation to prevent cluster_host / cluster_user from being interpreted as additional ssh options.
  • Docker reliability: in docker mode, subprocess.Popen(... stdout=PIPE, stderr=STDOUT) is returned immediately without draining the pipe, which can deadlock if output fills the buffer.
  • argv correctness: submit_job_impl() applies shlex.quote() to key=value arguments even though subprocess uses an argv list (no shell). This can introduce literal quote characters into values and break launcher parsing for inputs containing special characters.
🔒 Suggested hardening
+import re
+
+_SSH_TOKEN_RE = re.compile(r"^[A-Za-z0-9._-]+$")
+
 def verify_slurm_setup_impl(
     cluster_host: str,
     cluster_user: str | None = None,
     identity: str | None = None,
 ) -> dict:
@@
+    if not cluster_host or cluster_host.startswith("-") or not _SSH_TOKEN_RE.fullmatch(cluster_host):
+        return {
+            "ok": False,
+            "executor": "slurm",
+            "reason": "invalid_cluster_host",
+            "diagnostic": "cluster_host contains invalid characters.",
+        }
+    if cluster_user and (
+        cluster_user.startswith("-") or not _SSH_TOKEN_RE.fullmatch(cluster_user)
+    ):
+        return {
+            "ok": False,
+            "executor": "slurm",
+            "reason": "invalid_cluster_user",
+            "diagnostic": "cluster_user contains invalid characters.",
+        }
@@
-    argv += [target, "whoami && hostname"]
+    argv += ["--", target, "whoami && hostname"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 282 - 295, Remove the "#
nosec" bypasses and harden SSH/ subprocess usage: in bridge.py update
verify_slurm_setup_impl to validate cluster_user and cluster_host (reject/escape
values that start with '-' or contain unexpected characters; allow only a safe
regex like [A-Za-z0-9._-]+), then build the ssh target as a single argv element
(target = f"{cluster_user}@{cluster_host}" or cluster_host alone) and append it
without any shell interpolation so ssh options cannot be injected; in
submit_job_impl stop using shlex.quote on key=value arguments (pass raw strings
in the argv list since subprocess is used without a shell); in the docker branch
replace returning Popen(..., stdout=PIPE, stderr=STDOUT) directly with a safe
pattern that drains output (use subprocess.run(...) to capture/stream output or
call Popen.communicate() / a dedicated reader thread before returning) to avoid
deadlocks; finally remove all "# nosec" comments and address Bandit findings
through the proper security-exception workflow instead of silencing them.

Sources: Coding guidelines, Linters/SAST tools
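
The option-injection concern above can be illustrated with a minimal sketch. It assumes a conservative allow-list for host and user tokens (which would reject IPv6 literals and other valid but unusual names), and `build_ssh_probe_argv` is a hypothetical helper name, not a function in bridge.py.

```python
import re

# Conservative allow-list (an assumption; widen it if your site needs more).
# The first character may not be "-" or ".", so an option can never match.
_SSH_TOKEN_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9._-]*")


def build_ssh_probe_argv(cluster_host, cluster_user=None, identity=None):
    """Return an ssh argv list, or raise ValueError on unsafe input."""
    if not _SSH_TOKEN_RE.fullmatch(cluster_host or ""):
        raise ValueError("invalid cluster_host")
    if cluster_user and not _SSH_TOKEN_RE.fullmatch(cluster_user):
        raise ValueError("invalid cluster_user")
    argv = [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=accept-new",
        "-o", "ConnectTimeout=5",
    ]
    if identity:
        argv += ["-i", identity]
    target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
    # "--" ends option parsing, so the target is always read as a destination.
    argv += ["--", target, "whoami && hostname"]
    return argv
```

Validation and the `--` marker are independent defenses: either one alone blocks a value such as -oProxyCommand=..., and keeping both means a later loosening of the allow-list does not reopen the hole.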

Collaborator Author

Acknowledged; leaving as-is for this PR. BatchMode=yes already ensures the probe will refuse to prompt for password (key-auth-only). accept-new only matters on the FIRST contact with a host — and the CI runner is single-use (no persistent ~/.ssh/known_hosts to poison), while operator-side accept-new is the same default ssh uses for any new connection. Tightening this to StrictHostKeyChecking=yes would require pre-populating known_hosts on every host that wants to run verify_setup, which is worse UX than the marginal security gain. Will revisit if the Phase-2 read_cluster_artifact tool needs different semantics.

Contributor

@ChenhanYu, the rationale for StrictHostKeyChecking=accept-new is reasonable — BatchMode=yes already blocks password prompts, and single-use CI runners don't accumulate a poisoned known_hosts. That part is accepted.

The original comment raised three other concerns that are still open and weren't addressed by this explanation:

  1. SSH option injection: cluster_host and cluster_user arrive from the MCP tool caller with no validation. A value like -oProxyCommand=... passed as cluster_host would be appended to the argv list and interpreted by ssh as an option, because no -- end-of-options marker precedes the target. The StrictHostKeyChecking rationale doesn't cover this injection surface.
  2. # nosec bypasses: The repo coding guidelines explicitly state these aren't allowed without codeowner-approved exceptions. The suppressions in bridge.py haven't gone through that workflow.
  3. shlex.quote() in an argv list: shlex.quote() produces shell-escaped strings (e.g., 'value with spaces'). When passed inside a list to subprocess.run(..., shell=False), those literal quote characters are forwarded to the subprocess unchanged, breaking argument parsing.

Would you like me to open a follow-up GitHub issue to track these three items so they don't get lost before Phase 2?

🐰 (´• ω •`)ノ


🧠 Learnings used
Learnt from: kevalmorabia97
Repo: NVIDIA/Model-Optimizer PR: 1554
File: tests/gpu/torch/puzzletron/test_puzzletron.py:236-236
Timestamp: 2026-05-28T14:04:36.104Z
Learning: In PyTorch >= 2.6, `torch.load()` defaults to `weights_only=True`. During code review, do NOT flag `torch.load()` calls as a missing `weights_only=True` security issue solely because the argument is omitted when the codebase’s minimum supported PyTorch version is >= 2.6 (e.g., targeting torch 2.12). Only require/flag an explicit `weights_only=` when the project targets PyTorch < 2.6, or when `weights_only` is explicitly set in a way that increases risk (e.g., `weights_only=False`).

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
Comment on lines +512 to +517
proc = subprocess.Popen(  # nosec B603
    argv,
    cwd=str(launcher_dir),
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)

@coderabbitai coderabbitai Bot Jun 12, 2026

Contributor

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/mcp/modelopt_mcp/bridge.py"

echo "== File: $FILE =="
python3 - <<'PY'
import itertools,sys
path="tools/mcp/modelopt_mcp/bridge.py"
start=470
end=620
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
        if i > end:
            break
PY

echo
echo "== Search: subprocess.Popen / subprocess.run in file =="
rg -n "subprocess\.(Popen|run)\b|B603|# nosec" "$FILE" || true

echo
echo "== Show additional context for the second referenced range (543-550 area) =="
python3 - <<'PY'
path="tools/mcp/modelopt_mcp/bridge.py"
start=520
end=575
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
        if i > end:
            break
PY

echo
echo "== Check how argv is defined near the Popen call =="
python3 - <<'PY'
path="tools/mcp/modelopt_mcp/bridge.py"
# Find first occurrence line of 'proc = subprocess.Popen' and print some lines before it
target="proc = subprocess.Popen"
with open(path,'r',encoding='utf-8') as f:
    lines=f.readlines()
for idx,line in enumerate(lines):
    if target in line:
        start=max(0,idx-20)
        end=min(len(lines),idx+25)
        for i in range(start,end):
            print(f"{i+1}:{lines[i].rstrip()}")
        break
PY

Repository: NVIDIA/Model-Optimizer

Length of output: 11876


Handle uv/launcher spawn failures with structured error returns (and remove # nosec Bandit suppressions)

  • tools/mcp/modelopt_mcp/bridge.py calls subprocess.Popen(... ) in the executor == "docker" path without catching FileNotFoundError/OSError, so missing uv/launcher entrypoint will raise and crash the MCP tool path (lines ~512-517).
  • The executor != "docker" path only catches subprocess.TimeoutExpired; missing uv/entrypoint will still raise (lines ~543-550).
  • Remove/replace the # nosec B603 suppressions on these subprocess calls—# nosec bypasses Bandit checks and isn’t allowed without the required security-exception process.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 512 - 517, The
subprocess.Popen call that spawns the launcher (argv, cwd=str(launcher_dir),
assigned to proc) must not raise uncaught FileNotFoundError/OSError; wrap the
Popen call in a try/except that catches FileNotFoundError and OSError and
returns a structured error result (instead of letting the exception propagate)
so the MCP tool can surface a clear "launcher/uv not found" error; do the same
for the other subprocess invocation that only catches subprocess.TimeoutExpired
(also catch FileNotFoundError/OSError there). Remove the "# nosec B603"
suppressions and replace them with the new explicit error handling and logging
using the same variables (argv, launcher_dir, proc) and the existing
TimeoutExpired handling to produce consistent structured error returns.

Collaborator Author

The detailed security findings here overlap with the claude[bot] comments below — those have been addressed in 3de99b441 (SLURM_HOST via env, DEVNULL on Docker, NEMORUN_HOME propagation, removed shlex.quote). The underlying subprocess.run + subprocess.Popen calls all use a controlled argv list (no shell, no untrusted external input — values come from the MCP tool's typed args validated by pydantic). If you have specific remaining findings from this comment that aren't covered by my other replies, please point them out.

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/mcp/modelopt_mcp/bridge.py"

echo "=== nosec B603 occurrences ==="
rg -n "nosec" "$FILE" || echo "(none)"

echo ""
echo "=== subprocess.Popen call and surrounding context ==="
rg -n "subprocess\.Popen" -A 15 -B 5 "$FILE" || echo "(none)"

echo ""
echo "=== subprocess.run in submit_job_impl (slurm path) context ==="
rg -n "subprocess\.run" -A 15 -B 5 "$FILE" | grep -A 15 -B 5 "slurm\|launcher_dir\|executor" || echo "(none - showing all subprocess.run)"
rg -n "subprocess\.run" -A 20 -B 5 "$FILE" || echo "(none)"

echo ""
echo "=== FileNotFoundError handling ==="
rg -n "FileNotFoundError\|OSError" "$FILE" || echo "(none)"

Length of output: 13264



@ChenhanYu, thanks for the details. After checking the current file, two specific items from the original comment remain unaddressed:


1. Missing FileNotFoundError on the two submit_job_impl subprocess calls (if uv isn't on PATH)

The verify_* helpers already catch FileNotFoundError properly (lines 203, 343). But neither launcher-dispatch call does:

  • Line 562, subprocess.Popen(argv, ...) (Docker path): no try/except FileNotFoundError. If uv isn't on PATH this raises an uncaught exception instead of returning a structured {"ok": False, "reason": "uv_not_found"} dict.
  • Line 589, subprocess.run(argv, ...) (Slurm path): the try block only catches subprocess.TimeoutExpired; the same uv-not-found scenario raises.

Both only need a small addition:

except FileNotFoundError:
    return {
        "ok": False,
        "executor": executor,
        "reason": "uv_not_found",
        "diagnostic": (
            "`uv` binary not found on PATH. Install via "
            "https://docs.astral.sh/uv/getting-started/installation/"
        ),
    }

2. # nosec suppressions still present (lines 38, 196, 250, 325, 562, 589)

Per the project's coding guidelines:

Any use of # nosec comments to bypass Bandit security checks is not allowed. If a security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description.

The B603/B607 suppressions are factually correct (no shell, controlled argv, pydantic-validated inputs), but the policy doesn't provide a "nosec is fine if the code is correct" carve-out — it requires the explicit justification in the PR description and @NVIDIA/modelopt-setup-codeowners sign-off. Either add that justification to the PR description and request the codeowners review, or replace the # nosec comments with inline justifications in a format acceptable to your Bandit config (e.g., a # noqa with a project-approved exemption file).
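
For item 1 above, a minimal sketch of a wrapper that turns spawn failures into structured results might look like the following. `run_launcher` and the reason strings other than those quoted in this thread are hypothetical, and the real return shape in bridge.py may differ.

```python
import subprocess


def run_launcher(argv, cwd=None, timeout=60):
    """Run argv and return a structured dict instead of raising on spawn failure."""
    try:
        proc = subprocess.run(
            argv,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except FileNotFoundError:
        # Raised when argv[0] is not on PATH, or when cwd does not exist.
        return {
            "ok": False,
            "reason": "binary_not_found",
            "diagnostic": f"Could not execute {argv[0]!r} (binary or cwd missing).",
        }
    except OSError as exc:
        # Other spawn failures, for example a permission error.
        return {"ok": False, "reason": "spawn_failed", "diagnostic": str(exc)}
    except subprocess.TimeoutExpired:
        return {
            "ok": False,
            "reason": "timeout",
            "diagnostic": f"No result within {timeout}s.",
        }
    return {
        "ok": proc.returncode == 0,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
    }
```

FileNotFoundError is a subclass of OSError, so its handler has to come first for the more specific reason string to be reported.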



Comment thread tools/mcp/modelopt_mcp/server.py
Comment on lines +478 to +485
else:
    # Slurm mode — pass cluster config knobs as nemo-run overrides.
    argv.append(f"cluster_host={shlex.quote(cluster_host or '')}")
    if cluster_user:
        argv.append(f"user={shlex.quote(cluster_user)}")
    if identity:
        argv.append(f"identity={shlex.quote(identity)}")
    argv.append("detach=true")


[CRITICAL Algorithm] The Slurm-mode argv construction never actually configures the cluster host that the launcher will use. launch.py's launch() entrypoint (tools/launcher/launch.py:82) only accepts job_name, job_dir, pipeline, hf_local, user, identity, detach, clean — there is no cluster_host parameter. The cluster host is sourced either from the SLURM_HOST env var via slurm_factory(host=...) (tools/launcher/slurm_config.py:63) or via a per-task override like pipeline.task_0.slurm_config.host=<host>.

Appending cluster_host=<host> to nemo-run's CLI overrides will at best be ignored / produce a CLI error, and at worst silently submit using whatever SLURM_HOST happens to be in the bridge's env (which may be empty). Net effect: every Slurm submission goes through with the wrong host, or fails with an opaque error after verify_setup already reported ok.

Fix: set SLURM_HOST=<cluster_host> in the env passed to the subprocess, e.g.

env = os.environ.copy()
env["SLURM_HOST"] = cluster_host
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"]
if cluster_user:
    argv.append(f"user={cluster_user}")
if identity:
    argv.append(f"identity={identity}")
argv.append("detach=true")
proc = subprocess.run(argv, env=env, ...)

(The bridge also already passes cluster_user as user=... which is the right launcher arg — that one's fine.)

Collaborator Author

Fixed in 3de99b441. Stopped appending cluster_host= to the nemo-run overrides; the host now flows via child_env['SLURM_HOST']=cluster_host passed as env= to the subprocess. slurm_factory(host=os.environ.get('SLURM_HOST', '')) picks it up the same way as direct uv run launch.py invocations do. cluster_user was already correctly going through as user=... (the launcher's actual arg).
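
As a rough sketch of the env-based approach described in this thread: `build_slurm_submission` is a hypothetical helper, and the argv shape is copied from the reviewer's suggestion rather than from the merged code.

```python
import os


def build_slurm_submission(yaml_path, cluster_host, cluster_user=None, identity=None):
    """Return (argv, env) for a Slurm-mode launcher call."""
    # Copy first so the MCP server's own environment is never mutated.
    env = os.environ.copy()
    # The host travels through the environment, not through a CLI override.
    env["SLURM_HOST"] = cluster_host
    argv = ["uv", "run", "launch.py", "--yaml", str(yaml_path), "--yes"]
    if cluster_user:
        argv.append(f"user={cluster_user}")
    if identity:
        argv.append(f"identity={identity}")
    argv.append("detach=true")
    return argv, env
```

The pair would then be passed as subprocess.run(argv, env=env, ...). Returning both from one function keeps the argv and the environment from drifting apart between call sites.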

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
Comment on lines +475 to +491
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"]
if hf_local:
    argv.append(f"hf_local={shlex.quote(hf_local)}")
else:
    # Slurm mode — pass cluster config knobs as nemo-run overrides.
    argv.append(f"cluster_host={shlex.quote(cluster_host or '')}")
    if cluster_user:
        argv.append(f"user={shlex.quote(cluster_user)}")
    if identity:
        argv.append(f"identity={shlex.quote(identity)}")
    argv.append("detach=true")
if job_dir:
    argv.append(f"job_dir={shlex.quote(job_dir)}")
if job_name:
    argv.append(f"job_name={shlex.quote(job_name)}")
for k, v in (extra_overrides or {}).items():
    argv.append(f"{k}={shlex.quote(str(v))}")


[CRITICAL Algorithm] shlex.quote(...) is being used as if it were a value-quoting helper for nemo-run's key=value CLI overrides, but it isn't: it produces a shell-quoted string (e.g. '/path with spaces') suitable for pasting into a shell. Since subprocess.Popen([...]) / subprocess.run([...]) is called with a list (no shell), each argument is passed verbatim to launch.py. The result: whenever a value contains a character outside shlex's safe set (a space, a quote, $, and so on), nemo-run sees a literal argument like hf_local='/mnt/hf local' with the quotes baked in, and those quotes become part of the value the launcher dataclass receives, breaking path lookups and SSH user resolution. Values made only of safe characters, such as /mnt/hf-local or alice, pass through shlex.quote unchanged, so simple test values do not expose the bug.

Drop shlex.quote for every override; just use the raw value:

if hf_local:
    argv.append(f"hf_local={hf_local}")
else:
    if cluster_user:
        argv.append(f"user={cluster_user}")
    if identity:
        argv.append(f"identity={identity}")
    argv.append("detach=true")
if job_dir:
    argv.append(f"job_dir={job_dir}")
if job_name:
    argv.append(f"job_name={job_name}")
for k, v in (extra_overrides or {}).items():
    argv.append(f"{k}={v}")

shlex.quote would only be appropriate if you were building a shell string (e.g. for shell=True or for an ssh "<remote-cmd>" blob). Since you correctly use list-form Popen, every value can carry spaces / special chars safely without quoting.

Collaborator Author

Fixed in 3de99b441. Dropped shlex.quote on every nemo-run override — subprocess.run/Popen with a list never goes through a shell, so the quoting was baking literal quote chars into values that the launcher's argparse then saw. All overrides now pass through verbatim. Verified the argv-shape test still passes (the test checks for user=chenhany not user='chenhany').
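
The effect discussed in this thread is easy to reproduce in isolation. The sketch below uses a throwaway child process that echoes its first argument; `received_value` and the sample path are illustrative only and unrelated to the launcher.

```python
import shlex
import subprocess
import sys

# Child process that prints its first argument exactly as it received it.
ECHO = [sys.executable, "-c", "import sys; print(sys.argv[1])"]


def received_value(arg):
    """Return what a child process sees when `arg` is passed in an argv list."""
    out = subprocess.run(ECHO + [arg], capture_output=True, text=True, check=True)
    return out.stdout.rstrip("\n")


path = "/mnt/hf local"  # contains a space, so shlex.quote will wrap it

# With shlex.quote, the single quotes become part of the value.
quoted = received_value(f"hf_local={shlex.quote(path)}")

# Without it, the list form already delivers the value intact.
raw = received_value(f"hf_local={path}")

print(quoted)
print(raw)
```

Because shlex.quote leaves values made only of safe characters unchanged, a test that uses a plain username or a path without spaces passes either way; a value with a space is needed to tell the two behaviors apart.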

Comment on lines +506 to +537
if executor == "docker":
    # Docker mode: spawn in background so we don't block the MCP
    # call. The subprocess writes its experiment dir + status into
    # NEMORUN_HOME; we'll read it back via job_status.
    # B603 false positive — `argv` is a list built by this module
    # from typed parameters, no shell interpretation.
    proc = subprocess.Popen(  # nosec B603
        argv,
        cwd=str(launcher_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    # Detached: don't wait. The caller polls job_status by
    # experiment_id (derived from job_name or auto-named).
    # Generate a best-effort experiment_id from job_name + timestamp.
    # nemo_run names experiments deterministically as
    # `<title>_<job_name>_<timestamp>`; if the caller didn't provide
    # job_name we can't predict the id ahead of time.
    return {
        "ok": True,
        "executor": "docker",
        "pid": proc.pid,
        "argv": argv,
        "experiment_id": None,  # Phase 2: tail the subprocess output
        # until nemo_run logs the id, then return it
        "spike_note": (
            "Docker mode launched in background. Phase 1: the "
            "experiment_id is None — Phase 2 tails the subprocess "
            "output to capture the id. For now, list experiments "
            "via nemo_run's CLI or check NEMORUN_HOME."
        ),
    }


[IMPORTANT Performance] The Docker-mode Popen(stdout=subprocess.PIPE, stderr=subprocess.STDOUT) writes the child's stdout into a pipe that no one ever reads. Long-running jobs (PTQ runs are minutes-to-hours) will fill the pipe buffer (~64KB on Linux) and then block the launcher subprocess on its next write(), hanging the job indefinitely while the MCP server appears to have "succeeded".

Fix: redirect to a file, or to DEVNULL. Since the actual job logs are written by nemo_run into the experiment dir (which is what job_logs reads), discarding stdout is fine:

proc = subprocess.Popen(
    argv,
    cwd=str(launcher_dir),
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    start_new_session=True,  # detach from MCP server's process group
)

start_new_session=True is also worth adding — without it, an MCP server SIGINT/restart will SIGHUP the in-flight Docker submission. Same applies on the Slurm path (the synchronous subprocess.run is fine because it captures fully and returns), but the Docker Popen is the real foot-gun.

Collaborator Author

Fixed in 3de99b441. Docker mode now uses stdout=DEVNULL, stderr=DEVNULL, start_new_session=True. The launcher writes real logs into the nemo_run experiment dir (which job_logs reads), so discarding the subprocess stdout is correct. start_new_session=True also added so an MCP server restart doesn't SIGHUP the in-flight launcher.
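
A small, self-contained sketch of the detached-spawn pattern follows. `spawn_detached` is a hypothetical name, and the noisy child stands in for a launcher that writes more output than a pipe buffer can hold.

```python
import subprocess
import sys


def spawn_detached(argv, cwd=None):
    """Start argv in the background without leaving an unread pipe behind."""
    return subprocess.Popen(
        argv,
        cwd=cwd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,  # nothing to drain, so the child cannot block
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: child gets its own session
    )


# A child that writes about 1 MB, far more than a typical 64 KB pipe buffer.
noisy = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1_000_000)"]
proc = spawn_detached(noisy)

# With stdout=PIPE and no reader this wait would never return; here it does.
exit_code = proc.wait(timeout=30)
print(exit_code)
```

Discarding output is only reasonable because, as noted above, the real logs are written into the experiment directory. If the launcher's own stdout ever matters for debugging, redirecting to a file handle is the safer alternative.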

Comment on lines +626 to +643
def _resolve_experiment_dir(experiment_id: str) -> Path | None:
    """Map an experiment_id to its on-disk directory.

    nemo_run lays experiments out under ``$NEMORUN_HOME/experiments/<id>/``
    by default; ``NEMORUN_HOME`` falls back to cwd. We also check
    ``./experiments/<id>`` directly and ``./local_experiments/<id>``
    (the Docker-mode fallback path).
    """
    candidates = []
    nemorun_home = os.environ.get("NEMORUN_HOME")
    if nemorun_home:
        candidates.append(Path(nemorun_home) / "experiments" / experiment_id)
    candidates.append(Path.cwd() / "experiments" / experiment_id)
    candidates.append(Path.cwd() / "local_experiments" / experiment_id)
    for c in candidates:
        if c.exists():
            return c
    return None


[IMPORTANT Algorithm] The submit-time vs. status-time NEMORUN_HOME resolution is asymmetric and will produce false experiment_dir_not_found results.

  • submit_job_impl does not pass env= to the subprocess, so the launcher inherits whatever was in the MCP server's env at startup. launch.py itself defaults NEMORUN_HOME to os.getcwd() when unset (tools/launcher/launch.py:99-100), and the subprocess cwd is tools/launcher/ — so artifacts land at tools/launcher/experiments/<id>/.
  • _resolve_experiment_dir here looks under os.environ["NEMORUN_HOME"]/experiments, then Path.cwd()/experiments, then Path.cwd()/local_experiments — relative to the MCP server's cwd, not the launcher's.

When NEMORUN_HOME is unset (very plausible — the MCP server is launched by codex/Claude Code with no env tailoring), job_status returns experiment_dir_not_found for jobs that actually succeeded.

Fix: explicitly resolve and propagate NEMORUN_HOME at submit time, and document in the README that this env var must be set consistently for the MCP server's lifetime. Something like:

env = os.environ.copy()
env.setdefault("NEMORUN_HOME", os.getcwd())  # pin so status side sees the same dir
proc = subprocess.run(argv, cwd=str(launcher_dir), env=env, ...)

Also worth adding launcher_dir / "experiments" / experiment_id to the candidate list here as a belt-and-braces fallback.

Collaborator Author

Fixed in 3de99b441. submit_job_impl now passes env=child_env to the subprocess with child_env.setdefault('NEMORUN_HOME', os.getcwd()). _resolve_experiment_dir reads from the same env, plus an added fallback to <launcher_dir>/experiments/ for the case where the operator hasn't set NEMORUN_HOME at all. Both submit + status sides now agree on the artifact location.
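
A minimal sketch of a resolver with the extra launcher-directory fallback is shown below. The function name, the `launcher_dir` and `env` parameters, and the candidate order are assumptions made for illustration; the merged code may differ.

```python
import os
from pathlib import Path


def resolve_experiment_dir(experiment_id, launcher_dir, env=None):
    """Return the first existing candidate directory for experiment_id, else None."""
    env = os.environ if env is None else env
    candidates = []
    nemorun_home = env.get("NEMORUN_HOME")
    if nemorun_home:
        candidates.append(Path(nemorun_home) / "experiments" / experiment_id)
    candidates.append(Path.cwd() / "experiments" / experiment_id)
    candidates.append(Path.cwd() / "local_experiments" / experiment_id)
    # Fallback: with NEMORUN_HOME unset, the launcher writes under its own cwd.
    candidates.append(Path(launcher_dir) / "experiments" / experiment_id)
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Taking `env` as a parameter makes the lookup testable without touching the process environment, and it lets the status side be handed the same mapping that was given to the subprocess at submit time.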

Comment on lines +224 to +247
try:
    gpu = subprocess.run(  # nosec B603 B607
        [
            "docker",
            "run",
            "--rm",
            "--gpus",
            "all",
            "nvidia/cuda:12.0-base-ubuntu22.04",
            "nvidia-smi",
        ],
        capture_output=True,
        text=True,
        timeout=60,
        check=False,
    )
except subprocess.TimeoutExpired:
    return {
        "ok": False,
        "executor": "docker",
        "daemon_ok": True,
        "reason": "gpu_check_timeout",
        "diagnostic": "GPU smoketest container did not return in 60s.",
    }


[IMPORTANT Performance] First-time invocations of verify_setup(executor='docker') will pull nvidia/cuda:12.0-base-ubuntu22.04 (~150 MB) on hosts that don't already have it, easily blowing past the 60-second timeout — at which point the call returns gpu_check_timeout even on perfectly healthy hosts. The README sells the probe as "~1 second"; that's only true on warm caches.

Two cheap mitigations:

  1. Replace the GPU smoketest with docker run --rm --gpus all --entrypoint /bin/true <small-image> once we know the daemon is up + has the toolkit, and probe the toolkit separately via nvidia-container-cli info if available — far faster and doesn't depend on a CUDA image at all.
  2. At minimum, pin a much smaller image (e.g. busybox or just bypass image-fetch by checking docker info | grep nvidia), and document that the first call may be slow.

Also: the chosen tag nvidia/cuda:12.0-base-ubuntu22.04 is old (CUDA 12.0 GA shipped in Q1 2023) — newer drivers won't have a fundamentally different code path, but pinning to a long-deprecated tag invites silent registry-removal breakage years from now. If you keep the smoketest, point it at a current CUDA tag.

Collaborator Author

Fixed in 3de99b441. Replaced the heavyweight docker run --gpus all nvidia/cuda:... nvidia-smi smoketest with docker info --format '{{json .}}' + check whether 'nvidia' is in the registered runtimes. No image pull, daemon-fast, no dependency on a CUDA tag. Returns the same gpu_unavailable structured failure with an install-toolkit pointer when the nvidia runtime isn't registered.
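
The runtime-registry check can be sketched as a pure parsing step over the JSON that docker info --format '{{json .}}' prints. This assumes the output carries a top-level `Runtimes` mapping keyed by runtime name, which should be confirmed against the Docker version in use; `has_nvidia_runtime` is a hypothetical helper.

```python
import json


def has_nvidia_runtime(docker_info_json):
    """Return True if the `docker info` JSON lists a runtime named "nvidia"."""
    try:
        info = json.loads(docker_info_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(info, dict):
        return False
    # Assumed shape: {"Runtimes": {"runc": {...}, "nvidia": {...}}, ...}
    runtimes = info.get("Runtimes")
    return isinstance(runtimes, dict) and "nvidia" in runtimes
```

Keeping the parse separate from the subprocess call means the check can be unit-tested with canned JSON, with no Docker daemon needed on the CI runner. Note that a registered runtime shows the toolkit is configured, not that a GPU is actually usable.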

Comment on lines +677 to +684
for status_file in sorted(exp_dir.glob("status_*.out")):
    # Convention: filename is `status_<task_name>.out`, contents
    # are a single word ("succeeded" or "failed").
    task_name = status_file.stem.removeprefix("status_")
    body = status_file.read_text(encoding="utf-8", errors="replace").strip()
    task_statuses[task_name] = body
    if "fail" in body.lower():
        any_failed = True


[IMPORTANT Algorithm] The "fail" in body.lower() check is too loose. The convention documented just above says the file contains a single word ("succeeded" or "failed"), but the test fixture test_job_status_failed_task already writes "failed (rc=1)\n", so the convention is leaky. The substring match also flips on benign content like "succeeded after retry; previous attempt failed" or any nemo_run status string that mentions "fail" — flagging an actually-successful task as failed.

Tighten this to a word match against the canonical statuses:

status_word = body.split()[0].lower() if body else ""
task_statuses[task_name] = body
if status_word in ("failed", "error", "cancelled"):
    any_failed = True

Or anchor with body.lower().startswith("fail"). The current approach is silent enough that a misclassification only surfaces when an operator manually inspects the experiment dir.

Collaborator Author

Fixed in 3de99b441. Now anchors on the FIRST word of the status file against a _STATUS_FAILURE_WORDS frozenset (failed/error/errored/cancelled/canceled). Existing test fixture test_job_status_failed_task writes 'failed (rc=1)\n' — first word is 'failed', still matches. Tightened so 'succeeded after retry; previous attempt failed' would correctly classify as success, not failure.
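
The first-word anchor described in this reply can be sketched as follows. The frozenset contents are the five words listed above; the helper name is_failed_status is an assumption made for illustration.

```python
# Failure words as listed in the reply above.
_STATUS_FAILURE_WORDS = frozenset(
    {"failed", "error", "errored", "cancelled", "canceled"}
)


def is_failed_status(body: str) -> bool:
    """Classify a status file body by its first whitespace-separated word."""
    words = body.split()
    # An empty file has no first word and is not treated as a failure here.
    return bool(words) and words[0].lower() in _STATUS_FAILURE_WORDS
```

With this shape, "failed (rc=1)" classifies as a failure and "succeeded after retry; previous attempt failed" does not. One limitation of the sketch: a first word with trailing punctuation, such as "failed:", would not match.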

Comment on lines +282 to +294
argv = [
    "ssh",
    "-o",
    "BatchMode=yes",
    "-o",
    "StrictHostKeyChecking=accept-new",
    "-o",
    "ConnectTimeout=5",
]
if identity:
    argv += ["-i", identity]
target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
argv += [target, "whoami && hostname"]

[SUGGESTION] StrictHostKeyChecking=accept-new causes the first probe of any new cluster host to silently auto-add the host key to ~/.ssh/known_hosts of the user running the MCP server. That's usually fine for an interactive operator, but in a CI context (the mcp GitHub Actions job in this PR) or a shared service-account host it's a TOFU footgun — an attacker who can intercept the very first probe gets pinned-trust until manual known_hosts cleanup.

Consider StrictHostKeyChecking=yes for verify_setup (probe fails fast and the user is told to add the host key explicitly), and only allow accept-new if a --first-time flag is passed. Not a blocker — current behavior matches the launcher's defaults — but worth flagging as the security posture differs from a typical interactive ssh session.

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
Comment on lines +587 to +607
import re as _re

experiment_id = None
experiment_dir = None
slurm_job_id = None
for m in _re.finditer(
    r"experiment[_\s-]+([a-zA-Z0-9_]+_\d{10,})",
    stdout_tail,
    _re.IGNORECASE,
):
    experiment_id = m.group(1)
    break
for m in _re.finditer(
    r"(/lustre/[^\s]+|/home/[^\s]+)/experiments/[^\s]+",
    stdout_tail,
):
    experiment_dir = m.group(0)
    break
for m in _re.finditer(r"Submitted batch job (\d+)", stdout_tail):
    slurm_job_id = m.group(1)
    break

[IMPORTANT Algorithm] Three independent fragility issues in this best-effort id parse:

  1. import re as _re should be at module top (per project coding standards, "Keep imports at the top of the file"). It's also already a stdlib import — no reason to lazy-import or alias.
  2. The experiment_dir regex hard-codes /lustre/... and /home/... as the only roots. NVIDIA-internal sites use /lustre/, but partner clusters use /scratch/, /work/, /data/, /p/, etc. The regex will return experiment_dir=None on perfectly valid runs everywhere except specific NVIDIA SREs' machines.
  3. The id regex experiment[_\s-]+([a-zA-Z0-9_]+_\d{10,}) requires 10+ digits at the end (a unix timestamp). nemo_run does use timestamps, but the format may include sub-second precision or change across versions; pinning >=10 digits is brittle and the parse silently returns experiment_id=None rather than failing loud.

For (2) drop the path prefix anchor and rely on /experiments/<id>/ as the unique marker:

m = re.search(r"(\S+/experiments/[^\s/]+)", stdout_tail)
if m:
    experiment_dir = m.group(1)

For (3), since nemo_run logs the experiment id in a deterministic format, anchor on the actual log line nemo_run emits (e.g. "Experiment created") rather than reverse-engineering a timestamp shape. Worth a follow-up Phase-2 pass where the test fixtures cover real nemo_run output.

Collaborator Author

Fixed in 3de99b441. (1) Hoisted import re to module top. (2) Dropped the /lustre/|/home/ path anchors; match any \S+/experiments/[^\s/]+ so partner-cluster filesystem roots work. (3) Replaced the timestamp-anchored experiment-id regex with a match anchored on nemo_run's experiment ... id: log line, with a generic-id fallback. The underlying spike-quality "best-effort" caveat still applies; production-grade parsing is queued as a Phase 2 task in the cell.md migration plan (OMNIML-5131).
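
A sketch of the relaxed parse for items (1) and (2), using the two patterns quoted in this thread. The function name parse_submit_output is hypothetical, and the experiment-id match from item (3) is left out because the exact nemo_run log line is not shown here.

```python
import re

# Any filesystem root is accepted; /experiments/<name> is the marker.
_EXPERIMENT_DIR_RE = re.compile(r"\S+/experiments/[^\s/]+")
_SLURM_JOB_RE = re.compile(r"Submitted batch job (\d+)")


def parse_submit_output(stdout_tail: str) -> dict:
    """Best-effort extraction of the experiment dir and Slurm job id."""
    dir_match = _EXPERIMENT_DIR_RE.search(stdout_tail)
    job_match = _SLURM_JOB_RE.search(stdout_tail)
    return {
        "experiment_dir": dir_match.group(0) if dir_match else None,
        "slurm_job_id": job_match.group(1) if job_match else None,
    }
```

Returning None for a missing field keeps the parse best-effort; a caller that needs a hard guarantee would have to check both fields and fail explicitly.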

Comment on lines +137 to +144
try:
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    if isinstance(doc, dict):
        # job_name is the most common identifier; "model" may
        # also be a top-level field in some examples.
        entry.model = doc.get("model") or doc.get("base_model") or doc.get("job_name")
        entry.description = doc.get("description")

[SUGGESTION] Top-level model / base_model / description fields don't exist in any of the bundled launcher YAMLs (e.g. examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml only has job_name). The list_examples test passes because it constructs ad-hoc YAMLs — but on the actual data, this code will report model=<job_name> (a stringified identifier like Qwen3-8B_PTQ) and description=None for every example.

Two reasonable fixes, either's fine:

  1. Document the convention in the launcher's CLAUDE.md and add model: / description: to the canonical examples (small edit, real metadata for agents).
  2. Derive the model name from the path (e.g. Qwen/Qwen3-8B/... → Qwen/Qwen3-8B) and pull description from the first # ... comment block in the YAML, which is the actual location of the human-readable description in every bundled example.

Without one of these the tool's "model + description" output is mostly empty for the real corpus, which undercuts its discovery value to the agent.

Comment thread tools/mcp/pyproject.toml
Comment on lines +6 to +16
dependencies = [
    "mcp>=1.0",
    # The launcher provides the core orchestration primitive (core.run_jobs,
    # SandboxPipeline, build_slurm_executor, build_docker_executor). Pulled
    # in via git+subdirectory so `uvx --from <this-repo>#subdirectory=tools/mcp
    # modelopt-mcp` resolves both packages from the same clone — no
    # duplicate fetch.
    "modelopt-launcher",
    "pyyaml",
    "pydantic>=2.0",
]

[CRITICAL Algorithm] The dependency "modelopt-launcher" is unsatisfiable for the documented uvx --from "git+...#subdirectory=tools/mcp" modelopt-mcp install path.

tools/launcher/pyproject.toml declares the package as name = "modelopt-launcher" with [tool.setuptools] py-modules = [] — there is no modelopt_launcher Python package directory, only loose .py files (launch.py, core.py, slurm_config.py) at the launcher dir's top level. Even with [tool.uv.sources] modelopt-launcher = { path = "../launcher" }, uv pip install -e ../launcher only registers the dist-info; nothing importable lands in site-packages. So bridge.py's import modelopt_launcher fallback at line 80-83 always misses, and any user who pip/uvx-installs without a clone (the explicit recommended path) gets a server that can't locate examples/ or launcher_dir.

The PR description's "uvx clones the whole repo to its cache" claim only saves the day because the bridge falls back to _THIS_DIR.parent.parent / "launcher" — i.e. the install only works because of file-layout assumptions about the cloned repo, not because the dep system actually ships the launcher to the consumer. That's a fragile contract.

Fix options:

  1. Make modelopt-launcher a proper package: move the source into tools/launcher/modelopt_launcher/ and update its pyproject.toml to expose modelopt_launcher (with examples/ declared as package data). Then from modelopt_launcher import core / import modelopt_launcher.examples works, the import-fallback in _find_launcher_examples_dir actually does something, and uvx-ed installs run cleanly without the repo file-layout side-channel.
  2. Ship a [tool.uv.sources] git+...#subdirectory=tools/launcher for the launcher (so uvx resolves it from the same repo on PyPI-less consumers) AND add a runtime hard-fail when _THIS_DIR.parent.parent / "launcher" is missing — at least the failure is loud rather than "works on the author's machine."

Without either fix, the install matrix in the README is misleading and a Phase-2 user who runs the uvx-from-PyPI path (once published) ends up with a non-functional server.

Collaborator Author

Fixed in 3de99b441. Dropped the bare modelopt-launcher dep from [project.dependencies]. You're right that tools/launcher/pyproject.toml declares the name with py-modules = [] — there is no importable modelopt_launcher Python package; the launcher is consumed as loose .py files at tools/launcher/*.py. bridge.py invokes it via subprocess.run(["uv", "run", "launch.py", ...], cwd=<repo>/tools/launcher/) — a file-layout dependency, not a Python import dependency. The fallback import modelopt_launcher at line 80-83 was vestigial; removing it from the dep list while leaving the import as a try/except fallback for a hypothetical future where the launcher IS pip-installable.

uvx-from-git satisfies the file-layout dependency naturally (the clone puts both tools/launcher/ and tools/mcp/ next to each other). The README documents the requirement explicitly.
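
Since the launcher is a file-layout dependency, a loud runtime check along the lines the reviewer suggested might look like this sketch. The function name and the error wording are assumptions; the parent.parent / "launcher" lookup is the one quoted in the review.

```python
from pathlib import Path


def find_launcher_dir(package_dir: Path) -> Path:
    """Locate tools/launcher next to tools/mcp, or fail loudly.

    `package_dir` is assumed to be tools/mcp/modelopt_mcp, so two
    parents up is tools/.
    """
    candidate = package_dir.parent.parent / "launcher"
    if not (candidate / "launch.py").is_file():
        raise FileNotFoundError(
            f"launcher not found at {candidate}; modelopt-mcp expects "
            "tools/launcher/ to sit beside tools/mcp/ in the same clone"
        )
    return candidate
```

Failing at startup with the expected path in the message makes a broken install visible immediately, rather than surfacing later as a missing examples directory.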

@claude claude Bot left a comment

Claude review — Phase 1 modelopt-mcp

Tally: 4 CRITICAL · 4 IMPORTANT · 2 SUGGESTION

The bridge module wraps the launcher with a structured-error contract that's pleasant to read, and the test suite is hermetic and well-shaped. The Phase-1-vs-Phase-2 split is also called out cleanly. However, several issues mean the Slurm path won't actually submit to the requested cluster, and the install metadata won't survive being pulled in without the rest of the repo. Highlights:

Most impactful

  • CRITICAL — Slurm cluster_host is never propagated to the launcher. launch.py has no such parameter; cluster host comes from SLURM_HOST (env) or the slurm_config.host per-task override. The bridge appends cluster_host=<host> as a nemo-run CLI override, which the launcher ignores — every slurm submission silently uses whatever SLURM_HOST was inherited at startup, or fails opaquely. (bridge.py:478-485)
  • CRITICAL — shlex.quote is misapplied in argv construction. Quotes get baked into values when subprocess.Popen([...]) is called with a list (no shell), so launch.py sees literal hf_local='/mnt/hf-local' and user='alice' — breaking path lookups and SSH user resolution. (bridge.py:475-491)
  • CRITICAL — modelopt-launcher dep is unsatisfiable as packaged. The launcher's pyproject declares py-modules = [] and exposes no modelopt_launcher package — import modelopt_launcher in the bridge fallback always misses. The install only works because of file-layout assumptions about the cloned repo. (tools/mcp/pyproject.toml:6-16)
  • IMPORTANT — Docker-mode Popen deadlock. stdout=PIPE is unread; the launcher subprocess will block on its next write() once the ~64KB pipe buffer fills, hanging long-running PTQ jobs while the MCP server reports success. (bridge.py:506-537)
  • IMPORTANT — NEMORUN_HOME resolution is asymmetric between submit and status. Without explicit propagation, job_status returns experiment_dir_not_found for jobs that succeeded — the MCP and launcher cwds aren't the same. (bridge.py:626-643)
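
The shlex.quote bullet above can be checked with a small experiment. This is a sketch, and child_sees is a hypothetical helper. One detail worth noting: shlex.quote returns strings made only of shell-safe characters unchanged, so literal quotes are baked in only for values such as paths containing spaces, or for an empty string.

```python
import shlex
import subprocess
import sys


def child_sees(argument: str) -> str:
    """Run a child Python without a shell and return the argv[1] it got."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", argument],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


if __name__ == "__main__":
    # Quoted value: the child receives the quote characters literally.
    print(child_sees("hf_local=" + shlex.quote("/mnt/hf cache")))
    # Raw value: list-form argv needs no quoting, even with a space.
    print(child_sees("hf_local=/mnt/hf cache"))
```

Because the list form of subprocess never passes through a shell, the raw value arrives intact, and quoting can only add characters the launcher did not ask for.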

Other findings (in the inline thread)

  • IMPORTANT — verify_setup(executor='docker') GPU smoketest pulls a 150 MB CUDA image inside a 60 s timeout; first call on a cold host returns a misleading gpu_check_timeout.
  • IMPORTANT — job_status's 'fail' in body.lower() substring check fires on "succeeded after retry; previous attempt failed" etc.
  • IMPORTANT — Slurm submit's experiment-id parse hard-codes /lustre|/home paths and a 10+ digit timestamp shape; lazy import re as _re should be at module top per project standards.
  • SUGGESTION — StrictHostKeyChecking=accept-new is a TOFU footgun in CI/service-account contexts.
  • SUGGESTION — list_examples extracts model/description from top-level YAML fields that don't exist in any bundled example; output is mostly empty on the real corpus.

Risk assessment

Medium-to-high. The Slurm path is the documented headline use-case in the README, and three of the four CRITICAL findings sit in that path. The unit tests pass because they mock the subprocess layer end-to-end and never observe the actual launcher's arg-handling, so the regressions wouldn't be caught by CI as written.

Net: the structure is good, but I'd hold this until at least the four CRITICAL findings are resolved and a smoke-test that drives the real launch.py --dryrun is added to the test suite.

Comment on lines +478 to +485
else:
    # Slurm mode — pass cluster config knobs as nemo-run overrides.
    argv.append(f"cluster_host={shlex.quote(cluster_host or '')}")
    if cluster_user:
        argv.append(f"user={shlex.quote(cluster_user)}")
    if identity:
        argv.append(f"identity={shlex.quote(identity)}")
    argv.append("detach=true")

[CRITICAL Algorithm] The Slurm-mode argv construction never actually configures the cluster host that the launcher will use. launch.py's launch() entrypoint (tools/launcher/launch.py:82) only accepts job_name, job_dir, pipeline, hf_local, user, identity, detach, clean — there is no cluster_host parameter. The cluster host is sourced either from the SLURM_HOST env var via slurm_factory(host=...) (tools/launcher/slurm_config.py:63) or via a per-task override like pipeline.task_0.slurm_config.host=<host>.

Appending cluster_host=<host> to nemo-run's CLI overrides will at best be ignored / produce a CLI error, and at worst silently submit using whatever SLURM_HOST happens to be in the bridge's env (which may be empty). Net effect: every Slurm submission goes through with the wrong host, or fails with an opaque error after the verify step has already passed.

Fix: either set SLURM_HOST=<cluster_host> in the env passed to the subprocess (and propagate it through the actual ssh invocation the launcher makes), or pass it as a structured override. Something like:

env = os.environ.copy()
env["SLURM_HOST"] = cluster_host
# ...
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"]
if cluster_user:
    argv.append(f"user={cluster_user}")
if identity:
    argv.append(f"identity={identity}")
argv.append("detach=true")
proc = subprocess.run(argv, env=env, ...)

The same problem does not apply to cluster_user: the launcher's parameter is named user, and the bridge already passes it as user=..., so that one's fine.

Collaborator Author

Same as the earlier claude[bot] finding on line 485 — fixed by propagating SLURM_HOST=<cluster_host> via env=child_env in 3de99b441.

@ChenhanYu ChenhanYu requested a review from shengliangxu June 12, 2026 22:37
… shell-quoting, Docker DEVNULL + start_new_session, NEMORUN_HOME propagation, status word-anchor, lighter GPU probe

Addresses CodeRabbit + claude[bot] findings on PR #1701:

bridge.py
  * **Slurm cluster_host**: launch.py's entrypoint does not accept a
    `cluster_host` arg — it reads SLURM_HOST from env via
    slurm_factory. Stop appending `cluster_host=<host>` to nemo-run
    overrides (it was at best ignored, at worst silently submitted
    with whatever SLURM_HOST happened to be in the bridge's env).
    Propagate via env=child_env on the subprocess.
  * **shlex.quote misuse**: subprocess.run/Popen with a list never
    goes through a shell, so shell-quoting baked literal quote chars
    into values like `hf_local='/mnt/hf-local'`. Drop quoting on all
    nemo-run k=v overrides — verbatim values are passed safely.
  * **Docker Popen PIPE → DEVNULL**: stdout/stderr=PIPE without a
    reader would fill the ~64 KB kernel pipe buffer on long PTQ
    runs and hang the launcher subprocess. Use DEVNULL (nemo_run
    writes real logs into the experiment dir; job_logs reads from
    there). Also add `start_new_session=True` so an MCP server
    SIGINT/restart doesn't SIGHUP the in-flight launcher.
  * **NEMORUN_HOME asymmetry**: submit-side didn't pin NEMORUN_HOME
    in the subprocess env, but status-side (`_resolve_experiment_dir`)
    looked for it. Launcher defaulted to its own cwd; status-side to
    the MCP server's cwd. Result: false `experiment_dir_not_found`
    for jobs that actually succeeded. Pin via env_setdefault +
    expand the candidate-dir list to include `launcher_dir/experiments/`
    as a belt-and-braces fallback.
  * **GPU verify_setup**: replaced the heavyweight
    `docker run --gpus all nvidia/cuda:12.0-base nvidia-smi` (which
    pulls ~150 MB and blows past the 60s timeout on first call) with
    `docker info --format '{{json .}}'` + check whether `"nvidia"`
    is in the registered runtimes. No image pull, daemon-fast.
  * **Status word match**: `"fail" in body.lower()` was too loose
    (false-positives on "succeeded after retry; previous attempt
    failed"). Anchor on the FIRST word of the status file vs a
    `_STATUS_FAILURE_WORDS` frozenset.
  * **experiment_id regex**: hoisted `import re` to module top, and
    relaxed both the id pattern and the experiment_dir path
    regex (was hard-coded to `/lustre/|/home/` — fails on partner
    clusters' `/scratch/`, `/work/`, etc.).
  * **list_examples metadata**: path-derive `model` as
    `<family>/<model>` from `examples/<family>/<model>/<task>.yaml`
    when the YAML body doesn't carry top-level `model` /
    `description` / `job_name` (most launcher examples don't).

server.py
  * Add `Field(ge=1)` to the `tail` param on `job_logs` so the schema
    rejects zero/negative values at the MCP wire level.

pyproject.toml
  * Drop the bare `modelopt-launcher` dep. tools/launcher/pyproject.toml
    declares the name with `py-modules = []` — there is no importable
    `modelopt_launcher` Python package on disk. bridge.py uses the
    launcher via `subprocess.run(["uv", "run", "launch.py", ...])`
    with `cwd=<repo>/tools/launcher/` — a file-layout dependency, not
    a Python import. uvx-from-git satisfies this naturally (the clone
    puts both directories on disk side by side). Remove the
    unsatisfiable PyPI declaration; document the file-layout
    requirement in the README.

__init__.py
  * Correct the `submit_job` return-contract docstring: Slurm returns
    `experiment_id`; Docker returns the background `pid` (Phase 2
    will capture the nemo_run experiment_id for the Docker path).

All changes preserve the existing 19/19 unit tests passing.

Signed-off-by: Chenhan Yu <[email protected]>
@ChenhanYu

Collaborator Author

Review feedback addressed in 3de99b441

Thanks @coderabbitai and @claude for the thorough pass. Summary of what changed and what was deferred / declined:

Fixed (7 substantive bugs)

Issue | Fix
Slurm cluster_host never reached the launcher (launch.py reads SLURM_HOST env, not a CLI arg) | Propagate via env=child_env on the subprocess
shlex.quote on nemo-run k=v overrides baked literal quotes into values | Removed quoting; list-form subprocess never goes through a shell
Docker Popen(stdout=PIPE) would block on full pipe buffer for long jobs | Switched to stdout=DEVNULL, stderr=DEVNULL, start_new_session=True
Submit-side vs status-side NEMORUN_HOME asymmetry → false experiment_dir_not_found | Pin via child_env.setdefault('NEMORUN_HOME', os.getcwd()) + expanded _resolve_experiment_dir candidate list
Docker GPU probe pulled ~150 MB CUDA image, blew past 60s timeout on first call | Replaced with docker info --format runtime-registry check (no image pull)
"fail" in body.lower() substring match for task status was too loose | Anchor on FIRST word against a _STATUS_FAILURE_WORDS frozenset
experiment_id regex hard-coded `/lustre/|/home/` roots + timestamp shape | Relaxed both the id pattern and the experiment_dir path regex

Plus the docstring-contract fix (__init__.py Docker returns pid, Slurm returns experiment_id), the list_examples path-derived model fallback (since real launcher examples don't carry top-level model fields), Field(ge=1) on tail, pyproject.toml drops the unsatisfiable modelopt-launcher bare-name dep, and re hoisted to module top.
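
To illustrate the NEMORUN_HOME symmetry fix, here is a sketch of pinning the variable on the submit side and probing the same roots on the status side. The function names are hypothetical, and searching each root recursively for a directory named after the experiment id is an assumption about the on-disk layout, not something confirmed in this thread.

```python
from pathlib import Path


def pin_nemorun_home(child_env: dict, default_home: str) -> str:
    """Set NEMORUN_HOME only if the caller has not already set it."""
    return child_env.setdefault("NEMORUN_HOME", default_home)


def resolve_experiment_dir(experiment_id, nemorun_home, launcher_dir):
    """Return the first directory named `experiment_id`, or None."""
    candidates = [
        Path(nemorun_home) / "experiments",
        Path(launcher_dir) / "experiments",        # fallback root
        Path(launcher_dir) / "local_experiments",  # fallback root
    ]
    for root in candidates:
        if not root.is_dir():
            continue
        # Assumption: the experiment dir may sit below a title folder.
        for match in sorted(root.rglob(experiment_id)):
            if match.is_dir():
                return match
    return None
```

Using the same NEMORUN_HOME value for both calls is what removes the asymmetry; the launcher_dir fallbacks only cover jobs submitted before the variable was pinned.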

Deferred / declined (with rationale, see inline replies)

  • # nosec on subprocess — kept; mirrors the in-repo precedent at tools/launcher/launch.py and tools/launcher/core.py. If repo policy is tightening, that's a project-wide MR covering the launcher too.
  • StrictHostKeyChecking=accept-new — kept; BatchMode=yes already enforces key-auth-only, and the alternative requires pre-populated known_hosts which is worse UX than the marginal security gain.
  • Workflow YAML hardening (pinned action SHA, permissions: block, persist-credentials: false) — flagged; deferring to a repo-wide MR that hardens launcher + mcp + the other unpinned jobs together. Mirroring existing precedent within this PR.

Validation

  • 19/19 unit tests still pass (uv run python -m pytest tests/)
  • Pre-commit clean (ruff, ruff-format, mypy, bandit, license-headers)
  • All 16 inline review comments have a reply with what was done / why deferred

Ready for re-review. cc @kevalmorabia97

@claude

claude Bot commented Jun 12, 2026

Claude finished @ChenhanYu's task in 2m 52s —— View job


Re-review of 3de99b44

Verified all 7 substantive fixes against the diff and current source. The core algorithmic bugs are correctly addressed; deferral rationale on the policy items is reasonable. Ready for re-review from a maintainer.

Verified fixes ✅

Issue | Verified
Slurm SLURM_HOST | child_env["SLURM_HOST"] = cluster_host or "" at bridge.py:550, passed via env=child_env on both Docker Popen (:565) and Slurm subprocess.run (:592). Matches slurm_factory(host=os.environ.get("SLURM_HOST", "")) in tools/launcher/slurm_config.py:63.
shlex.quote removal | import shlex dropped; all 6 override appends at bridge.py:506-524 now use raw values.
Docker Popen DEVNULL | bridge.py:562-569: stdout/stderr=DEVNULL + start_new_session=True. Pipe-buffer hang fixed; SIGHUP isolation added.
NEMORUN_HOME symmetry | Submit pins via child_env.setdefault("NEMORUN_HOME", os.getcwd()) (:545). _resolve_experiment_dir (:704-712) now also probes launcher_dir/experiments/ and launcher_dir/local_experiments/ as belt-and-braces.
GPU probe | Replaced with docker info --format '{{json .}}' + "nvidia" in Runtimes check at bridge.py:249-287. No image pull. Timeout dropped to 10s.
Status word-anchor | bridge.py:760-762: body.split()[0].lower() in _STATUS_FAILURE_WORDS ({failed, error, errored, cancelled, canceled}). Existing test_job_status_failed_task fixture (failed (rc=1)) still passes since split()[0] == "failed".
experiment_id parse | import re hoisted to top; experiment_dir regex now \S+/experiments/[^\s/]+ (works on any FS root); id regex anchored on experiment id: with a generic-token fallback.

Plus: Field(ge=1) on tail (server.py:261), __init__.py docstring corrected, pyproject.toml drops the unsatisfiable modelopt-launcher bare-name dep, list_examples path-derives <family>/<model> when YAML body lacks it.
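
The detached-launch pattern verified above can be sketched as follows, assuming a POSIX host for the new-session behavior. launch_detached is a hypothetical name, and the child command is left to the caller.

```python
import subprocess


def launch_detached(argv, env=None, cwd=None) -> subprocess.Popen:
    """Start a long-running child that the parent never reads from."""
    return subprocess.Popen(
        argv,
        # No pipe is created, so the child cannot block on a full buffer.
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        # New session (POSIX): signals sent to the parent's process
        # group or terminal do not reach the child.
        start_new_session=True,
        env=env,
        cwd=cwd,
    )
```

A child that writes far more than a pipe buffer's worth of output finishes normally under this setup, whereas with stdout=PIPE and no reader it would stall on write.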

Deferral rationale — accepted ✅

  • # nosec suppressions — confirmed identical pattern in tools/launcher/launch.py:33,96 and tools/launcher/core.py:380,383,390. Cleaning these up only in tools/mcp/ would be inconsistent. Repo-wide MR is the right boundary.
  • StrictHostKeyChecking=accept-new: BatchMode=yes already enforces key-auth-only, so no password-prompt hijack surface. The TOFU concern is real but tightening would impose pre-populated known_hosts on every cluster, which is worse UX. Reasonable tradeoff for an interactive operator tool; a --strict-host-key opt-in could be a Phase-2 addition.
  • Workflow YAML hardening — every job in the workflow file (linux, windows, multi-version, partial-install, launcher, mcp, skills, …) uses actions/checkout@v6 unpinned with no permissions: block. Repo-wide hardening MR is the right shape.

Minor follow-ups (non-blocking)

  1. _STATUS_FAILURE_WORDS lacks timeout/cancelled_by_user — nemo_run may emit other terminal states. If you encounter false-done classifications in Phase 2, expand the frozenset. (bridge.py:50-52)
  2. list_examples path-derivation depth: f"{parts[1]}/{parts[2]}" if len(parts) >= 4 else None assumes examples/<family>/<model>/<file>; flatter layouts (examples/<family>/<file>) get model=None unless the YAML supplies it. Today's bundled examples use the 4-deep layout so this works, but worth a comment if the layout convention is intended to be load-bearing. (bridge.py:139-140)
  3. Field(ge=1) test gap — no test exercises pydantic rejection of tail=0 at the wire level. Pydantic enforces it correctly so this is purely a coverage observation, not a bug.

Risk assessment

Low. The 4 critical findings from my Phase-1 pass were genuine submission-correctness bugs; all are now fixed in a way that matches the launcher's actual interface (SLURM_HOST env, raw argv values, on-disk experiment dir under a single NEMORUN_HOME). The unit tests still mock at the subprocess boundary so they don't observe the launcher's real arg parsing — adding a --dryrun-driven smoke test is queued as a Phase 2 task per the OP, and that's a fair scope split.

LGTM from my side. cc @kevalmorabia97 for maintainer sign-off.
· branch: chenhany/modelopt-mcp-phase1

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the updated bridge.py to verify the fixes
cat tools/mcp/modelopt_mcp/bridge.py

Length of output: 30709


🏁 Script executed:

#!/bin/bash
# Read server.py and pyproject.toml
cat tools/mcp/modelopt_mcp/server.py
cat tools/mcp/pyproject.toml
cat tools/mcp/modelopt_mcp/__init__.py

Length of output: 14771



@ChenhanYu Thanks for the thorough write-up — all 7 substantive fixes verified in the current code. The logic is clean. Two small remaining items before this can be merged:


1. Stale tool description in server.py (Line 80 area)

The verify_setup tool description still says:

"GPU passthrough (`docker run --gpus all nvidia-smi`)"

But the implementation now uses docker info --format '{{json .}}' (no container run). The LLM reads this description to understand what the tool does, so it's worth updating:

-            "Docker mode: checks `docker info` (daemon up) + GPU "
-            "passthrough (`docker run --gpus all nvidia-smi`). Set "
+            "Docker mode: checks `docker info` (daemon up) + whether the "
+            "`nvidia` runtime is registered (`docker info --format '{{json .}}'`). Set "

2. AttributeError on TimeoutExpired.stdout.decode() in bridge.py (submit Slurm path)

In submit_job_impl with text=True, subprocess.TimeoutExpired.stdout is a str | None, not bytes. The current handler:

f"{(e.stdout or b'').decode(errors='replace')[-400:]}"

…will raise AttributeError: 'str' object has no attribute 'decode' if e.stdout is a non-empty string (i.e. if the subprocess produced partial output before the timeout). Fix:

f"{(e.stdout or '')[-400:]}"
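
A type-agnostic variant may be safer still. The Python subprocess documentation notes that the output captured on TimeoutExpired can be bytes regardless of the text=True setting, so this sketch accepts str, bytes, or None. The helper name output_tail is hypothetical.

```python
def output_tail(output, limit: int = 400) -> str:
    """Return the last `limit` characters of captured output as text."""
    if output is None:
        return ""
    if isinstance(output, bytes):
        # Decode leniently; partial output may end mid-character.
        output = output.decode("utf-8", errors="replace")
    return output[-limit:]
```

In the handler this would read output_tail(e.stdout), which avoids both the AttributeError on str and a TypeError from slicing mixed types.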

Everything else looks solid. The deferred rationales (nosec precedent, StrictHostKeyChecking UX trade-off, workflow hardening as a repo-wide MR) are all reasonable — no objection.

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (2)
tools/mcp/modelopt_mcp/__init__.py (1)

50-52: 📐 Maintainability & Code Quality | ⚡ Quick win

Align package export style with the repo’s __init__.py convention.

Lines 50–52 do not follow the project’s stated package-export pattern (__all__ at top and from .module import * re-export in package __init__.py).

As per coding guidelines, “Define the public API with __all__ at the top of each module and re-export via from .module import * in package __init__.py files.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/__init__.py` around lines 50 - 52, This file exports
main directly but doesn't follow the repo convention: move the __all__
declaration to the top of the module and re-export symbols by importing the
module with a relative star import; specifically, add __all__ = ["main"] at the
top of this package __init__ and replace the current direct import of
modelopt_mcp.server.main with a relative re-export like from .server import * so
the public API is defined via __all__ and re-exported from the submodule
(references: __all__, main, modelopt_mcp.server, server).

Source: Coding guidelines

tools/mcp/modelopt_mcp/bridge.py (1)

268-271: 📐 Maintainability & Code Quality | ⚡ Quick win

Move stdlib json import to module top-level (or justify local import).

Line 268 introduces a local stdlib import with no circular/optional/heavy-import reason. Please keep it at module scope for consistency with repo import rules.

♻️ Proposed change
+import json
 import os
 import re
 import subprocess  # nosec B404
@@
-            import json as _json
-
-            info = _json.loads(gpu.stdout)
+            info = json.loads(gpu.stdout)

As per coding guidelines, “Keep imports at the top of the file... Put an import inside a function only when there is a concrete reason.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 268 - 271, The local stdlib
import "import json as _json" should be moved to the module top-level; remove
the inline import and add "import json" (or "import json as _json" to keep
naming) at the top of bridge.py, then update the local usage (where info =
_json.loads(gpu.stdout) and runtimes = list((info.get("Runtimes") or
{}).keys())) to use the top-level json symbol; only keep a local import if you
have a documented reason (circular/optional/heavy), otherwise relocate and run
tests to ensure no name collisions.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47295e20-a43a-4b21-9487-cfc8b88d55d9

📥 Commits

Reviewing files that changed from the base of the PR and between cb3766d and 3de99b4.

📒 Files selected for processing (4)
  • tools/mcp/modelopt_mcp/__init__.py
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/modelopt_mcp/server.py
  • tools/mcp/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/mcp/modelopt_mcp/server.py

Post-review-fix doc sync:
  * `verify_setup(executor='docker')` description: replace the stale
    `docker run --gpus all nvidia-smi` mention. Commit 3de99b4 swapped
    that probe for the lighter `docker info --format` runtime-registry
    check, which pulls no image and only queries the daemon.
  * `MODELOPT_MCP_SKIP_GPU_CHECK` env var description: same fix.
  * Phase 2/3 section: link out to the Epic + child Tasks
    (OMNIML-5128/5132 Phase 2, OMNIML-5133/5134 Phase 3) so a reader
    can find the planned work without leaving the README. Reflects
    the trimmed NEL scope (6 tools, dropped nel_list/nel_validate).

Signed-off-by: Chenhan Yu <[email protected]>

@shengliangxu shengliangxu left a comment

Collaborator

LGTM

