feat(tools/mcp): MCP server for ModelOpt launcher (OMNIML-5123) by ChenhanYu · Pull Request #1701 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-06-12T21:31:46Z

Summary

Phase 1 of the modelopt-mcp design tracked in OMNIML-5123.

A new tools/mcp/ package exposing the existing tools/launcher/core.py orchestration as typed MCP tools that codex / Claude Code agents can call directly — instead of shelling out to uv run launch.py --yaml ... and parsing prose output.

Five tools

Tool	Description
`list_examples`	Enumerate `tools/launcher/examples/` with model + description metadata extracted from each YAML
`verify_setup`	Fail-fast probe for the named executor. Docker: `docker info` (daemon up) + `docker info --format` runtime-registry check for the `nvidia` runtime — no image pull, daemon-fast. Slurm: `ssh -o BatchMode=yes -o ConnectTimeout=5` to the cluster login node. Runs in ~1s, saves 30+ s of wasted submission on bad config.
`submit_job`	Submit a launcher YAML. Mode is determined by mutually-exclusive args: `hf_local` → Docker (local GPU), `cluster_host` → Slurm (remote SSH). Returns experiment_id immediately; the actual job runs detached
`job_status`	Filesystem-based status from nemo_run's experiment dir (`_DONE`, `status_*.out`) — no in-memory registry, survives MCP server restarts
`job_logs`	Read `log_<task>.out` from experiment dir; per-task filtering + optional tail

Design constants

Single submit_job with mode by args (not separate submit_docker / submit_slurm tools). Keeps the LLM tool catalog compact; mutual-exclusion is a runtime check.
Filesystem is the source of truth for status + logs. No in-memory registry. Survives MCP server restarts cleanly — important because operators / agents kill + restart their hosts often.
verify_setup is auto-called by submit_job by default (skippable when caller just probed). The probe is ~1s; the cost of a misconfigured submission is 30+ seconds of cluster timeout or container-pull. Always-on verify pays back immediately.

Layout

tools/mcp/
├── pyproject.toml          # name: modelopt-mcp, console_script
├── modelopt_mcp/
│   ├── __init__.py
│   ├── server.py           # FastMCP entry; 5 tool definitions
│   └── bridge.py           # thin wrapper over launcher's core.py
│                           #   + filesystem status/log helpers
└── tests/
    └── test_bridge.py      # 19 unit tests, fully hermetic
                            # (mocked subprocess + tmp_path fixtures)

Install

Two paths, both from source via uv. No PyPI wheel exists; OMNIML-5123 opted for the uvx-from-git pattern to skip publication overhead.

End-user install (recommended)

uvx from the git subdirectory — single command, no manual clone:

# Claude Code
claude mcp add modelopt -- uvx --from \
  "git+https://github.com/NVIDIA/Model-Optimizer.git#subdirectory=tools/mcp" \
  modelopt-mcp

# Codex
codex mcp add modelopt -- uvx --from \
  "git+https://github.com/NVIDIA/Model-Optimizer.git#subdirectory=tools/mcp" \
  modelopt-mcp

Under the hood uvx clones the whole repo to its cache, installs tools/mcp/ as the entry, and resolves the sibling modelopt-launcher dep via [tool.uv.sources] (path = "../launcher") inside the cloned tree.

Dev install (local checkout)

uv pip install -e tools/launcher    # sibling dep first
uv pip install -e tools/mcp         # then this package
modelopt-mcp                         # entry on PATH

The dev path also relies on [tool.uv.sources] to point modelopt-launcher at the local ../launcher checkout.

Why no plain `pip install` today

Two specific reasons, worth flagging so reviewers know what's intentional vs missing:

Nothing on PyPI yet. Neither modelopt-mcp nor modelopt-launcher are published — this PR introduces the package but doesn't add release machinery.
pip doesn't read [tool.uv.sources]. Even from a local checkout, plain pip install -e tools/mcp fails because modelopt-launcher is a bare name (no URL) and pip can't find it. Sticking with uv / uvx is the practical path while we're git-only.

If we later want plain-pip support, two options:

Path	Tradeoff
Publish to PyPI — `modelopt-launcher` + `modelopt-mcp` get versioned wheels	Clean `pip install`, but requires release machinery + version cadence
PEP-440 direct URL in pyproject — `"modelopt-launcher @ git+https://...#subdirectory=tools/launcher"`	Works with pip + uv, but double-clones the same repo on install (cheap and ugly)

Out of scope for this PR; happy to follow up if reviewers want either path.

Post-review changes (commit `3de99b4` + `082a6b8`)

Addressed all CodeRabbit + claude[bot] review findings. See the inline replies for details; the substantive bug-fix highlights:

Slurm cluster_host — propagate via env=child_env (launch.py reads SLURM_HOST, not a CLI arg)
shlex.quote removed from nemo-run k=v overrides (subprocess list-form doesn't shell-quote)
Docker Popen now uses stdout=DEVNULL, stderr=DEVNULL, start_new_session=True to avoid pipe-buffer blocking
NEMORUN_HOME pinned in subprocess env so submit + status sides agree
GPU verify swapped from docker run --gpus all image-pull (slow + flaky) to docker info --format runtime-registry check (daemon-fast)
Task-status word match anchors on first word against a fixed failure-word set (no more "fail" in "succeeded after retry; previous attempt failed" false-positive)
experiment_id regex generalized for non-NVIDIA cluster paths
pyproject.toml dropped the unsatisfiable modelopt-launcher bare-name dep (launcher is a file-layout sibling, not a Python import dep)
Field(ge=1) on job_logs.tail
Docstring contract clarified (Docker returns pid, Slurm returns experiment_id)
README updated to reflect the new GPU probe + Phase 2/3 task linkage

All 19/19 unit tests still pass; pre-commit clean.

Validation

uv pip install -e . succeeds (modelopt-launcher resolved transitively)
19/19 unit tests pass (uv run python -m pytest tests/)
stdio handshake works end-to-end; tools/list returns all 5 with full schemas + descriptions
Mode-resolution: submit_job correctly rejects no-executor + both-executors with structured reason
Filesystem status: correctly classifies done / failed / running from _DONE + status_*.out
Pre-commit clean: ruff, ruff-format, mypy, bandit, license-headers

Acceptance criteria (from OMNIML-5123)

list_examples returns all bundled YAMLs with path and model name
submit_job with hf_local runs via Docker executor and returns immediately (Phase 1: returns PID; experiment_id capture in Phase 2)
submit_job with cluster_host/user runs via Slurm executor (detach=True) and returns experiment_id
job_status correctly reflects running / done / failed from nemo_run filesystem
job_logs returns stdout for a completed job
uvx --from git+...#subdirectory=tools/mcp modelopt-mcp --help resolves and starts
Existing launcher tests unaffected (no changes to tools/launcher/)

Phase 2 (separate PR)

Capture experiment_id from Docker subprocess output (tail until nemo_run logs the id).
Add wait_for_experiment(experiment_id, timeout_sec) -> SlurmRunResult for blocking polls.
Extract the verify + submit helpers into a shared lib that nmm-sandbox-mcp (companion server, separate repo) can consume for internal-ergonomics tools — cluster short-name → factory lookup + GitLab CI dispatch. See nmm-sandbox#21 for the companion design.

Summary by CodeRabbit

New Features
- Added a ModelOpt MCP server: list examples, verify executors, submit jobs, check status, and fetch logs; supports Docker (local GPU) and Slurm backends and a console entrypoint.
Documentation
- Comprehensive README with install, end-to-end examples, usage, and deployment notes.
Tests
- Unit tests covering discovery, verification, submission flow, status, and logs.
Chores
- CI updated to run MCP tests; package/project config added.

Phase 1 implementation of the modelopt-mcp design tracked in OMNIML-5123. Five tools exposed over stdio MCP: * list_examples — enumerate tools/launcher/examples/ with model + description metadata * verify_setup — fail-fast probe (docker daemon + GPU, or slurm SSH passwordless) * submit_job — submit a launcher YAML; mode resolved from mutually-exclusive args (hf_local → Docker, cluster_host → Slurm). Returns the experiment id immediately; the actual job runs detached. * job_status — filesystem-based status from nemo_run experiment dir (_DONE + status_*.out) * job_logs — filesystem read of log_<task>.out files, with optional tail Design principles (from OMNIML-5123): 1. Single submit_job with mode by args (not separate submit_docker / submit_slurm tools). Keeps the LLM catalog compact. 2. Filesystem is the source of truth for status + logs. No in-memory registry; survives MCP server restarts. 3. verify_setup is auto-called by submit_job by default — the verify probe takes ~1s and saves 30+ seconds of wasted submission time on bad config. Callers can pass skip_verify=True when they just probed. Layout: tools/mcp/ pyproject.toml # name: modelopt-mcp, console_script modelopt_mcp/ __init__.py server.py # FastMCP entry; 5 tool definitions bridge.py # thin wrapper over launcher's core.py # + filesystem status/log helpers tests/test_bridge.py # 19 unit tests, fully hermetic (mocked # subprocess + tmp_path fixtures) Install: # End user (uvx-from-git) claude mcp add modelopt -- uvx --from \ "git+https://github.com/NVIDIA/Model-Optimizer.git#subdirectory=tools/mcp" \ modelopt-mcp # Local development uv pip install -e tools/launcher uv pip install -e tools/mcp modelopt-mcp # stdio server Validation: * uv pip install -e . succeeds * 19/19 unit tests pass * stdio handshake works; tools/list returns all 5 with full schemas * mode-resolution rejects no-executor + both-executors correctly * filesystem status correctly classifies done / failed / running * pre-commit clean (ruff, mypy, bandit, license) Phase 2 (separate PR): capture experiment_id from Docker subprocess output, add wait_for_experiment, extract the verify + submit helpers into a shared lib that nmm-sandbox-mcp can consume for its internal-ergonomics tools (cluster short-name → factory lookup + GitLab CI dispatch). Signed-off-by: Chenhan Yu <[email protected]>

copy-pr-bot · 2026-06-12T21:31:49Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-12T21:31:53Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4b1448db-0035-40af-9131-707395c401a2

📥 Commits

Reviewing files that changed from the base of the PR and between 3de99b4 and 082a6b8.

📒 Files selected for processing (1)

tools/mcp/README.md

✅ Files skipped from review due to trivial changes (1)

tools/mcp/README.md

📝 Walkthrough

Walkthrough

Adds a new modelopt-mcp MCP server and bridge exposing launcher tools (list_examples, verify_setup, submit_job, job_status, job_logs), package metadata and console script, comprehensive README, unit tests, and CI workflow updates to run the new tests.

Changes

MCP Server Implementation for Launcher Tool Exposure

Layer / File(s)	Summary
CI workflow integration `.github/workflows/unit_tests.yml`	`tools/mcp/**` paths are now treated as test-relevant; adds an `mcp` job that installs `uv`, sets up a venv, installs the sibling `launcher` package then `tools/mcp`, runs `pytest`, and updates `unit-pr-required-check` to depend on `mcp` and include its result in failure conditions.
End-user and developer documentation `tools/mcp/README.md`	Documentation for `modelopt-mcp`: purpose, Docker vs Slurm execution modes, the five MCP tools, uv/uvx install flows, example agent workflow, environment variables, design principles, repo layout, and Phase 2 roadmap.
Package configuration and entry point `tools/mcp/pyproject.toml`, `tools/mcp/modelopt_mcp/__init__.py`	Adds `modelopt-mcp` project metadata and dependencies (`mcp`, `pyyaml`, `pydantic`), console script `modelopt-mcp = modelopt_mcp.server:main`, build/package discovery config, and re-exports `main` at package level.
MCP-to-launcher bridge — examples `tools/mcp/modelopt_mcp/bridge.py`	Locate launcher examples dir and implement `ExampleEntry` + `list_examples_impl()` to enumerate YAML examples, derive `model` from path, and parse `model`/`description` fields when present.
MCP-to-launcher bridge — verification `tools/mcp/modelopt_mcp/bridge.py`	Implements `verify_docker_setup_impl()` (checks `docker info` and, unless skipped, verifies `nvidia` runtime via `docker info` JSON) and `verify_slurm_setup_impl()` (SSH probe with structured diagnostics and parsed `whoami`/hostname).
MCP-to-launcher bridge — submission `tools/mcp/modelopt_mcp/bridge.py`	Adds `_normalize_yaml_path()` and `submit_job_impl()` to select executor (Docker vs Slurm), optionally run verification, validate YAML existence, build `uv run launch.py --yaml ... --yes`, dispatch detached `Popen` for Docker (returns `experiment_id: None`) or synchronous `subprocess.run` for Slurm (parse experiment id/dir/slurm_job_id, handle timeouts and failure tails).
MCP-to-launcher bridge — inspection `tools/mcp/modelopt_mcp/bridge.py`	Adds `_resolve_experiment_dir()` to map experiment IDs to on-disk directories, `job_status_impl()` to compute done/failed/running from `_DONE` and `status_*.out`, and `job_logs_impl()` to read `log_<task>.out` files and optionally tail them.
MCP server and tool registration `tools/mcp/modelopt_mcp/server.py`	Builds a `FastMCP("modelopt")` server, registers Phase 1 tools (`list_examples`, `verify_setup`, `submit_job`, `job_status`, `job_logs`) with typed inputs delegating to bridge functions, validates Slurm `cluster_host` when required, and exposes `main()` for stdio serving with configurable logging.
Unit tests for bridge and tool logic `tools/mcp/tests/__init__.py`, `tools/mcp/tests/test_bridge.py`	Adds unit tests covering example discovery, malformed YAML tolerance, Docker and Slurm verification behaviors (including missing binaries and SSH auth failures), submit_job executor selection and YAML-not-found handling, `_DONE` semantics for job_status, and job_logs aggregation/tailing and missing-task errors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	tools/mcp/modelopt_mcp/bridge.py uses “# nosec B404” (subprocess import). SECURITY.md forbids `#nosec` bypasses without an approved security exception.	Remove the `#nosec` comment or obtain an approved security exception from `@NVIDIA/modelopt-setup-codeowners` with explicit justification in the PR description.
Docstring Coverage	⚠️ Warning	Docstring coverage is 73.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: introducing an MCP server for the ModelOpt launcher tool. It is concise, uses a conventional feat prefix, and directly relates to the primary changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chenhany/modelopt-mcp-phase1

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

ChenhanYu · 2026-06-12T21:32:09Z

Tracks OMNIML-5123. Companion design (nmm-sandbox-mcp) at omniml/integration/nmm-sandbox#21.

Adds a new `mcp` job to .github/workflows/unit_tests.yml that mirrors the existing `launcher` job's shape — installs the sibling modelopt-launcher editable, installs modelopt-mcp editable, runs pytest. The job is gated by check-file-changes to skip on unrelated PRs (matching launcher's pattern). Three changes: * paths: add `tools/mcp/**` to the push-trigger watch list so changes under that subtree fire unit_tests.yml. * check-file-changes.files: add `tools/mcp/**` to the changed-files detector so the gated jobs (launcher / mcp / skills) skip on PRs that touch unrelated areas. * unit-pr-required-check: add `mcp` to `needs:` and the failure- aggregator condition so a failing mcp job blocks merge the same way launcher does. No other workflows touched. regression_tests.yml is GPU-only and modelopt-mcp's tests are hermetic (subprocess mocked), so it stays out of scope. Signed-off-by: Chenhan Yu <[email protected]>

codecov · 2026-06-12T21:50:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.09%. Comparing base (ddc0a8e) to head (082a6b8).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1701   +/-   ##
=======================================
  Coverage   77.09%   77.09%           
=======================================
  Files         511      511           
  Lines       56176    56176           
=======================================
  Hits        43310    43310           
  Misses      12866    12866

Flag	Coverage Δ
unit	`54.34% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Adds tools/mcp/README.md so a developer landing in the directory has a single-file entry into: * What modelopt-mcp is (thin MCP wrapper over tools/launcher/core.py) * The 5-tool surface with one-line each * Both install paths (uvx-from-git for end-users, uv pip install -e for dev) + why plain `pip install` doesn't work yet + two options to enable it later * A concrete end-to-end usage example (list_examples → verify_setup → submit_job → job_status → job_logs) * Required env vars table * The three design constants (single submit_job, filesystem source- of-truth, auto-verify default) * Pointer to the NVIDIA-internal companion (nmm-sandbox-mcp) for operators who want short-name cluster ergonomics + GitLab CI dispatch * Layout + Phase 2 hooks Markdownlint-clean; matches the README style in tools/launcher/. Signed-off-by: Chenhan Yu <[email protected]>

ChenhanYu · 2026-06-12T22:05:05Z

/claude review

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 7

🧹 Nitpick comments (2)

tools/mcp/tests/test_bridge.py (1)
174-214: 📐 Maintainability & Code Quality | ⚡ Quick win

Add regression tests for subprocess hardening paths introduced in bridge fixes.

Please add focused tests for:

rejecting option-like cluster_host/cluster_user,

preserving raw override values with spaces (no literal quote injection), and

structured launcher_cli_not_found/spawn-failure returns.

This will lock in the new security/reliability contracts on the MCP boundary.

Also applies to: 221-271
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/tests/test_bridge.py` around lines 174 - 214, Add three focused
pytest cases in tools/mcp/tests/test_bridge.py that exercise
bridge.verify_slurm_setup_impl: (1) call verify_slurm_setup_impl with
option-like values for cluster_host and/or cluster_user (e.g., values starting
with "-") and assert the result["ok"] is False and result["reason"] indicates
rejection of option-like inputs; (2) call verify_slurm_setup_impl with override
values that include spaces or shell metacharacters and assert the returned
fields preserve the raw strings (no injected literal quotes) — use monkeypatched
subprocess.run to return expected stdout/stderr and check returned
whoami/remote_hostname/override fields match inputs exactly; (3) simulate
spawn/launcher failures by monkeypatching subprocess.run to raise
FileNotFoundError or return a non-zero spawn failure code and assert
verify_slurm_setup_impl returns a structured error (result["ok"] is False and
result["reason"] equals/indicates "launcher_cli_not_found" or a spawn-failure
sentinel) so the contract for structured failure payloads is validated.
tools/mcp/modelopt_mcp/__init__.py (1)
48-50: 📐 Maintainability & Code Quality | ⚡ Quick win

Move __all__ to the top-level export section.

__all__ is currently declared after imports. Put it directly below the module docstring so the public surface is explicit before implementation imports.

As per coding guidelines, "Each module declares its public surface with __all__ = [...] at the top of the file."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/__init__.py` around lines 48 - 50, Move the __all__
declaration immediately below the module docstring so the module's public
surface is declared before imports; specifically place "__all__ = ['main']"
right after the top-of-file docstring and before the "from modelopt_mcp.server
import main" import to satisfy the guideline that modules declare public exports
at the top (referencing the __all__ symbol and the imported main symbol).
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/unit_tests.yml:
- Around line 152-160: The mcp job currently lacks a job-level permissions
block, uses actions/checkout@v6 without pinning, and leaves persist-credentials
default true; update the mcp job to set minimal GITHUB_TOKEN permissions (e.g.,
read-only repo and contents as required), add a permissions: block at the job
level, change the checkout invocation (uses: actions/checkout@v6) to a pinned
reference (replace with a specific commit SHA tag for actions/checkout or add
with: ref: ${{ github.sha }} if you need the exact commit), and set with:
persist-credentials: false so credentials aren’t injected into the workspace;
keep the condition (if: needs.check-file-changes.outputs.any_changed == 'true')
and needs: [linux, check-file-changes] as-is.

In `@tools/mcp/modelopt_mcp/__init__.py`:
- Around line 28-31: Update the module docstring for submit_job to accurately
describe its return contract: state that when mode is hf_local (Docker)
submit_job returns the Docker process PID (since it runs in a background
thread), and when mode is cluster_host (Slurm) it returns the submitted
experiment_id (Slurm submission with detach=True); reference the submit_job
function and the modes hf_local/Docker and cluster_host/Slurm so callers know
which value to expect.

In `@tools/mcp/modelopt_mcp/bridge.py`:
- Line 38: Remove all inline "# nosec" suppressions on subprocess imports/usages
(e.g., the import subprocess line and occurrences around lines referenced) and
make the code pass Bandit by fixing unsafe subprocess patterns instead of
silencing them: import subprocess without a comment, ensure every subprocess
invocation in bridge.py (search for subprocess.run/ Popen/call usages) uses a
list of arguments (no shell=True), validates or sanitizes any external inputs
(use shlex.quote or strict whitelist validation where inputs originate), and use
subprocess.run(..., check=True) or other safe APIs; if any use truly requires
shell semantics, refactor to an explicit, validated command builder or document
and encapsulate the risk with safer alternatives so Bandit warnings are resolved
rather than suppressed.
- Around line 512-517: The subprocess.Popen call that spawns the launcher (argv,
cwd=str(launcher_dir), assigned to proc) must not raise uncaught
FileNotFoundError/OSError; wrap the Popen call in a try/except that catches
FileNotFoundError and OSError and returns a structured error result (instead of
letting the exception propagate) so the MCP tool can surface a clear
"launcher/uv not found" error; do the same for the other subprocess invocation
that only catches subprocess.TimeoutExpired (also catch
FileNotFoundError/OSError there). Remove the "# nosec B603" suppressions and
replace them with the new explicit error handling and logging using the same
variables (argv, launcher_dir, proc) and the existing TimeoutExpired handling to
produce consistent structured error returns.
- Around line 477-491: The argv list currently wraps values with shlex.quote
(e.g., hf_local, cluster_host, cluster_user, identity, job_dir, job_name and
extra_overrides values) which injects literal quotes when argv is passed as a
list to subprocess (no shell); remove shlex.quote and append the raw string
values instead (use str(...) where needed and fall back to empty string for
None), e.g. build entries as f"hf_local={str(hf_local)}" and for extra_overrides
use f"{k}={str(v)}"; keep the same keys (argv, extra_overrides) and conditional
checks but do not call shlex.quote.
- Around line 282-295: Remove the "# nosec" bypasses and harden SSH/ subprocess
usage: in bridge.py update verify_slurm_setup_impl to validate cluster_user and
cluster_host (reject/escape values that start with '-' or contain unexpected
characters; allow only a safe regex like [A-Za-z0-9._-]+), then build the ssh
target as a single argv element (target = f"{cluster_user}@{cluster_host}" or
cluster_host alone) and append it without any shell interpolation so ssh options
cannot be injected; in submit_job_impl stop using shlex.quote on key=value
arguments (pass raw strings in the argv list since subprocess is used without a
shell); in the docker branch replace returning Popen(..., stdout=PIPE,
stderr=STDOUT) directly with a safe pattern that drains output (use
subprocess.run(...) to capture/stream output or call Popen.communicate() / a
dedicated reader thread before returning) to avoid deadlocks; finally remove all
"# nosec" comments and address Bandit findings through the proper
security-exception workflow instead of silencing them.

In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 259-262: The tail parameter on the MCP boundary must enforce a
minimum of 1 to avoid surprising slicing in bridge.job_logs_impl (which calls
body.splitlines()[-tail:]); update the Annotated Field for tail in the job_logs
schema to include ge=1 (or equivalent lower-bound validation) so values like 0
or negatives cannot be passed into job_logs_impl and cause unexpected behavior.

---

Nitpick comments:
In `@tools/mcp/modelopt_mcp/__init__.py`:
- Around line 48-50: Move the __all__ declaration immediately below the module
docstring so the module's public surface is declared before imports;
specifically place "__all__ = ['main']" right after the top-of-file docstring
and before the "from modelopt_mcp.server import main" import to satisfy the
guideline that modules declare public exports at the top (referencing the
__all__ symbol and the imported main symbol).

In `@tools/mcp/tests/test_bridge.py`:
- Around line 174-214: Add three focused pytest cases in
tools/mcp/tests/test_bridge.py that exercise bridge.verify_slurm_setup_impl: (1)
call verify_slurm_setup_impl with option-like values for cluster_host and/or
cluster_user (e.g., values starting with "-") and assert the result["ok"] is
False and result["reason"] indicates rejection of option-like inputs; (2) call
verify_slurm_setup_impl with override values that include spaces or shell
metacharacters and assert the returned fields preserve the raw strings (no
injected literal quotes) — use monkeypatched subprocess.run to return expected
stdout/stderr and check returned whoami/remote_hostname/override fields match
inputs exactly; (3) simulate spawn/launcher failures by monkeypatching
subprocess.run to raise FileNotFoundError or return a non-zero spawn failure
code and assert verify_slurm_setup_impl returns a structured error (result["ok"]
is False and result["reason"] equals/indicates "launcher_cli_not_found" or a
spawn-failure sentinel) so the contract for structured failure payloads is
validated.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 41d93d76-4c0b-45cd-b0fc-9406b45c4455

📥 Commits

Reviewing files that changed from the base of the PR and between ddc0a8e and cb3766d.

📒 Files selected for processing (8)

.github/workflows/unit_tests.yml
tools/mcp/README.md
tools/mcp/modelopt_mcp/__init__.py
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/modelopt_mcp/server.py
tools/mcp/pyproject.toml
tools/mcp/tests/__init__.py
tools/mcp/tests/test_bridge.py

coderabbitai · 2026-06-12T22:16:38Z

+    argv = [
+        "ssh",
+        "-o",
+        "BatchMode=yes",
+        "-o",
+        "StrictHostKeyChecking=accept-new",
+        "-o",
+        "ConnectTimeout=5",
+    ]
+    if identity:
+        argv += ["-i", identity]
+    target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
+    argv += [target, "whoami && hostname"]
+


🔒 Security & Privacy | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail file="tools/mcp/modelopt_mcp/bridge.py" # Basic context: locate key functions and line ranges mentioned in the original comment rg -n "def (verify_slurm_setup_impl|submit_job_impl)" "$file" || true rg -n "verify_slurm_setup_impl|submit_job_impl" "$file" || true # Inspect around the snippet lines 260-320 (adjusted by actual file line numbers) nl -ba "$file" | sed -n '250,330p' # Inspect other cited ranges: 450-520, 500-560 nl -ba "$file" | sed -n '440,520p' nl -ba "$file" | sed -n '500,570p' # Search for security-policy-related patterns rg -n "# nosec" "$file" || true rg -n "shlex\.quote" "$file" || true rg -n "subprocess\.(run|Popen)" "$file" || true # Also inspect any ssh target building and where argv list is finalized rg -n "ssh|StrictHostKeyChecking|BatchMode|ConnectTimeout|StrictHostKeyChecking=accept-new" "$file" || true

Repository: NVIDIA/Model-Optimizer

Length of output: 293

🏁 Script executed:

#!/bin/bash set -euo pipefail file="tools/mcp/modelopt_mcp/bridge.py" echo "== show verify_slurm_setup_impl and surrounding code (approx) ==" awk 'NR>=250 && NR<=340 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p' echo "== show submit_job_impl and surrounding code (approx) ==" awk 'NR>=360 && NR<=470 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p' echo "== show other cited ranges (approx) ==" awk 'NR>=460 && NR<=540 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p' awk 'NR>=480 && NR<=560 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,120p' echo "== search patterns ==" rg -n "# nosec" "$file" || true rg -n "shlex\.quote" "$file" || true rg -n "subprocess\.(run|Popen|check_output|call)" "$file" || true rg -n "StrictHostKeyChecking|BatchMode=yes|ConnectTimeout=5|whoami && hostname" "$file" || true

Repository: NVIDIA/Model-Optimizer

Length of output: 17404

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "cluster_host|cluster_user|identity" tools/mcp/modelopt_mcp/bridge.py # See if bridge does any validation for these fields before calling verify/submit. rg -n "validate|regex|allowlist|startswith\$\"-\"\$|StrictHostKeyChecking" tools/mcp/modelopt_mcp/bridge.py || true # Find launcher argument parsing for the key=value overrides we pass to `uv run launch.py`. # This checks whether shlex.quote would end up including literal quotes in the value. file_launcher="$(fd -a "launch.py" -tf --exact 2>/dev/null tools/launcher || true)" echo "launcher candidates:" echo "$file_launcher" # If fd above didn't return, fall back to repo-wide filename search. if [ -z "$file_launcher" ]; then file_launcher="$(fd -a "launch.py" .)" fi echo "Using launcher file(s):" echo "$file_launcher" # Inspect how `hf_local=...` / `cluster_host=...` style args are parsed/handled. rg -n "hf_local=|cluster_host=|user=|identity=|extra_overrides" $file_launcher || true rg -n "shlex\\.split|argparse|key=value|\\w+=\\w" $file_launcher || true

Repository: NVIDIA/Model-Optimizer

Length of output: 4327

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect tools/launcher/launch.py for argument parsing of key=value pairs file="tools/launcher/launch.py" wc -l "$file" awk 'NR>=1 && NR<=220 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,220p' awk 'NR>=220 && NR<=420 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,220p' # Find where cluster_host/user/identity values come from in the bridge MCP layer rg -n "cluster_host" tools/mcp/modelopt_mcp || true rg -n "cluster_user" tools/mcp/modelopt_mcp || true rg -n "submit_job" tools/mcp/modelopt_mcp || true # Find any schema/validation layer for these fields rg -n "class .*Model|pydantic|BaseModel|Field\\(|constr\\(|validator\\(" tools/mcp/modelopt_mcp || true

Repository: NVIDIA/Model-Optimizer

Length of output: 11191

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect server validation for cluster_host/cluster_user file="tools/mcp/modelopt_mcp/server.py" echo "== server.py cluster_host/cluster_user definitions ==" awk 'NR>=70 && NR<=220 {printf "%4d:%s\n", NR, $0}' "$file" | sed -n '1,180p' # Inspect bridge.py for job_status and NEMORUN_HOME usage b="tools/mcp/modelopt_mcp/bridge.py" echo "== bridge.py search NEMORUN_HOME and experiment_id handling ==" rg -n "NEMORUN_HOME|experiment_id|job_status|proc\\b|Popen\\(" "$b" echo "== bridge.py around job_status and further lines (approx 540-720) ==" awk 'NR>=530 && NR<=720 {printf "%4d:%s\n", NR, $0}' "$b" | sed -n '1,240p'

Repository: NVIDIA/Model-Optimizer

Length of output: 17257

🏁 Script executed:

#!/bin/bash set -euo pipefail file="tools/mcp/modelopt_mcp/bridge.py" echo "== top of bridge.py imports ==" awk 'NR>=1 && NR<=80 {printf "%4d:%s\n", NR, $0}' "$file" rg -n "^import re\\b|from re import" "$file" || true

Repository: NVIDIA/Model-Optimizer

Length of output: 3632

Fix SSH target injection and remove disallowed # nosec Bandit bypasses

Security policy: tools/mcp/modelopt_mcp/bridge.py uses # nosec to suppress Bandit for multiple subprocess calls (and tools/launcher/launch.py does as well). Repo guidelines disallow # nosec bypasses unless an approved security exception exists—remove these suppressions or route through the required codeowner-approved exception.

SSH option injection: verify_slurm_setup_impl() builds target = f"{cluster_user}@{cluster_host}" and appends it to ssh argv without --, with no validation to prevent cluster_host / cluster_user from being interpreted as additional ssh options.

Docker reliability: in docker mode, subprocess.Popen(... stdout=PIPE, stderr=STDOUT) is returned immediately without draining the pipe, which can deadlock if output fills the buffer.

argv correctness: submit_job_impl() applies shlex.quote() to key=value arguments even though subprocess uses an argv list (no shell). This can introduce literal quote characters into values and break launcher parsing for inputs containing special characters.

🔒 Suggested hardening

+import re + +_SSH_TOKEN_RE = re.compile(r"^[A-Za-z0-9._-]+$") + def verify_slurm_setup_impl( cluster_host: str, cluster_user: str | None = None, identity: str | None = None, ) -> dict: @@ + if not cluster_host or cluster_host.startswith("-") or not _SSH_TOKEN_RE.fullmatch(cluster_host): + return { + "ok": False, + "executor": "slurm", + "reason": "invalid_cluster_host", + "diagnostic": "cluster_host contains invalid characters.", + } + if cluster_user and ( + cluster_user.startswith("-") or not _SSH_TOKEN_RE.fullmatch(cluster_user) + ): + return { + "ok": False, + "executor": "slurm", + "reason": "invalid_cluster_user", + "diagnostic": "cluster_user contains invalid characters.", + } @@ - argv += [target, "whoami && hostname"] + argv += ["--", target, "whoami && hostname"]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 282 - 295, Remove the "# nosec" bypasses and harden SSH/ subprocess usage: in bridge.py update verify_slurm_setup_impl to validate cluster_user and cluster_host (reject/escape values that start with '-' or contain unexpected characters; allow only a safe regex like [A-Za-z0-9._-]+), then build the ssh target as a single argv element (target = f"{cluster_user}@{cluster_host}" or cluster_host alone) and append it without any shell interpolation so ssh options cannot be injected; in submit_job_impl stop using shlex.quote on key=value arguments (pass raw strings in the argv list since subprocess is used without a shell); in the docker branch replace returning Popen(..., stdout=PIPE, stderr=STDOUT) directly with a safe pattern that drains output (use subprocess.run(...) to capture/stream output or call Popen.communicate() / a dedicated reader thread before returning) to avoid deadlocks; finally remove all "# nosec" comments and address Bandit findings through the proper security-exception workflow instead of silencing them.

Sources: Coding guidelines, Linters/SAST tools

Acknowledged; leaving as-is for this PR. BatchMode=yes already ensures the probe will refuse to prompt for password (key-auth-only). accept-new only matters on the FIRST contact with a host — and the CI runner is single-use (no persistent ~/.ssh/known_hosts to poison), while operator-side accept-new is the same default ssh uses for any new connection. Tightening this to StrictHostKeyChecking=yes would require pre-populating known_hosts on every host that wants to run verify_setup, which is worse UX than the marginal security gain. Will revisit if the Phase-2 read_cluster_artifact tool needs different semantics.

@ChenhanYu, the rationale for StrictHostKeyChecking=accept-new is reasonable — BatchMode=yes already blocks password prompts, and single-use CI runners don't accumulate a poisoned known_hosts. That part is accepted.

The original comment raised three other concerns that are still open and weren't addressed by this explanation:

SSH option injection — cluster_host and cluster_user arrive from the MCP tool caller with no validation. A value like -oProxyCommand=... passed as cluster_host would be appended to the argv list and interpreted by ssh as an option before the -- end-of-options marker. The StrictHostKeyChecking rationale doesn't cover this injection surface.

# nosec bypasses — The repo coding guidelines explicitly state these aren't allowed without codeowner-approved exceptions. The suppressions in bridge.py haven't gone through that workflow.

shlex.quote() in an argv list — shlex.quote() produces shell-escaped strings (e.g., 'value with spaces'). When passed inside a list to subprocess.run(..., shell=False), those literal quote characters are forwarded to the subprocess unchanged, breaking argument parsing.

Would you like me to open a follow-up GitHub issue to track these three items so they don't get lost before Phase 2?

🐰 (´• ω •`)ノ

🧠 Learnings used

Learnt from: kevalmorabia97 Repo: NVIDIA/Model-Optimizer PR: 1554 File: tests/gpu/torch/puzzletron/test_puzzletron.py:236-236 Timestamp: 2026-05-28T14:04:36.104Z Learning: In PyTorch >= 2.6, `torch.load()` defaults to `weights_only=True`. During code review, do NOT flag `torch.load()` calls as a missing `weights_only=True` security issue solely because the argument is omitted when the codebase’s minimum supported PyTorch version is >= 2.6 (e.g., targeting torch 2.12). Only require/flag an explicit `weights_only=` when the project targets PyTorch < 2.6, or when `weights_only` is explicitly set in a way that increases risk (e.g., `weights_only=False`).

coderabbitai · 2026-06-12T22:16:38Z

+        proc = subprocess.Popen(  # nosec B603
+            argv,
+            cwd=str(launcher_dir),
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+        )


🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tools/mcp/modelopt_mcp/bridge.py" echo "== File: $FILE ==" python3 - <<'PY' import itertools,sys path="tools/mcp/modelopt_mcp/bridge.py" start=470 end=620 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") if i > end: break PY echo echo "== Search: subprocess.Popen / subprocess.run in file ==" rg -n "subprocess\.(Popen|run)\b|B603|# nosec" "$FILE" || true echo echo "== Show additional context for the second referenced range (543-550 area) ==" python3 - <<'PY' path="tools/mcp/modelopt_mcp/bridge.py" start=520 end=575 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") if i > end: break PY echo echo "== Check how argv is defined near the Popen call ==" python3 - <<'PY' path="tools/mcp/modelopt_mcp/bridge.py" # Find first occurrence line of 'proc = subprocess.Popen' and print some lines before it target="proc = subprocess.Popen" with open(path,'r',encoding='utf-8') as f: lines=f.readlines() for idx,line in enumerate(lines): if target in line: start=max(0,idx-20) end=min(len(lines),idx+25) for i in range(start,end): print(f"{i+1}:{lines[i].rstrip()}") break PY

Repository: NVIDIA/Model-Optimizer

Length of output: 11876

Handle uv/launcher spawn failures with structured error returns (and remove # nosec Bandit suppressions)

tools/mcp/modelopt_mcp/bridge.py calls subprocess.Popen(... ) in the executor == "docker" path without catching FileNotFoundError/OSError, so missing uv/launcher entrypoint will raise and crash the MCP tool path (lines ~512-517).

The executor != "docker" path only catches subprocess.TimeoutExpired; missing uv/entrypoint will still raise (lines ~543-550).

Remove/replace the # nosec B603 suppressions on these subprocess calls—# nosec bypasses Bandit checks and isn’t allowed without the required security-exception process.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 512 - 517, The subprocess.Popen call that spawns the launcher (argv, cwd=str(launcher_dir), assigned to proc) must not raise uncaught FileNotFoundError/OSError; wrap the Popen call in a try/except that catches FileNotFoundError and OSError and returns a structured error result (instead of letting the exception propagate) so the MCP tool can surface a clear "launcher/uv not found" error; do the same for the other subprocess invocation that only catches subprocess.TimeoutExpired (also catch FileNotFoundError/OSError there). Remove the "# nosec B603" suppressions and replace them with the new explicit error handling and logging using the same variables (argv, launcher_dir, proc) and the existing TimeoutExpired handling to produce consistent structured error returns.

The detailed security findings here overlap with the claude[bot] comments below — those have been addressed in 3de99b441 (SLURM_HOST via env, DEVNULL on Docker, NEMORUN_HOME propagation, removed shlex.quote). The underlying subprocess.run + subprocess.Popen calls all use a controlled argv list (no shell, no untrusted external input — values come from the MCP tool's typed args validated by pydantic). If you have specific remaining findings from this comment that aren't covered by my other replies, please point them out.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tools/mcp/modelopt_mcp/bridge.py" echo "=== nosec B603 occurrences ===" rg -n "nosec" "$FILE" || echo "(none)" echo "" echo "=== subprocess.Popen call and surrounding context ===" rg -n "subprocess\.Popen" -A 15 -B 5 "$FILE" || echo "(none)" echo "" echo "=== subprocess.run in submit_job_impl (slurm path) context ===" rg -n "subprocess\.run" -A 15 -B 5 "$FILE" | grep -A 15 -B 5 "slurm\|launcher_dir\|executor" || echo "(none - showing all subprocess.run)" rg -n "subprocess\.run" -A 20 -B 5 "$FILE" || echo "(none)" echo "" echo "=== FileNotFoundError handling ===" rg -n "FileNotFoundError\|OSError" "$FILE" || echo "(none)"

Length of output: 13264

@ChenhanYu, thanks for the details. After checking the current file, two specific items from the original comment remain unaddressed:

1. Missing FileNotFoundError on the two submit_job_impl subprocess calls (if uv isn't on PATH)

The verify_* helpers already catch FileNotFoundError properly (lines 203, 343). But neither launcher-dispatch call does:

Line 562 – subprocess.Popen(argv, ...) (Docker path): no try/except FileNotFoundError. If uv isn't on PATH this raises an uncaught exception instead of returning a structured {"ok": False, "reason": "uv_not_found"} dict.

Line 589 – subprocess.run(argv, ...) (Slurm path): the try block only catches subprocess.TimeoutExpired; same uv-not-found scenario raises.

Both only need a small addition:

except FileNotFoundError: return { "ok": False, "executor": executor, "reason": "uv_not_found", "diagnostic": ( "`uv` binary not found on PATH. Install via " "https://docs.astral.sh/uv/getting-started/installation/" ), }

2. # nosec suppressions still present (lines 38, 196, 250, 325, 562, 589)

Per the project's coding guidelines:

Any use of # nosec comments to bypass Bandit security checks is not allowed. If a security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description.

The B603/B607 suppressions are factually correct (no shell, controlled argv, pydantic-validated inputs), but the policy doesn't provide a "nosec is fine if the code is correct" carve-out — it requires the explicit justification in the PR description and @NVIDIA/modelopt-setup-codeowners sign-off. Either add that justification to the PR description and request the codeowners review, or replace the # nosec comments with inline justifications in a format acceptable to your Bandit config (e.g., a # noqa with a project-approved exemption file).

🧠 Learnings used

Learnt from: kevalmorabia97 Repo: NVIDIA/Model-Optimizer PR: 1554 File: tests/gpu/torch/puzzletron/test_puzzletron.py:236-236 Timestamp: 2026-05-28T14:04:36.104Z Learning: In PyTorch >= 2.6, `torch.load()` defaults to `weights_only=True`. During code review, do NOT flag `torch.load()` calls as a missing `weights_only=True` security issue solely because the argument is omitted when the codebase’s minimum supported PyTorch version is >= 2.6 (e.g., targeting torch 2.12). Only require/flag an explicit `weights_only=` when the project targets PyTorch < 2.6, or when `weights_only` is explicitly set in a way that increases risk (e.g., `weights_only=False`).

claude · 2026-06-12T22:18:41Z

+    else:
+        # Slurm mode — pass cluster config knobs as nemo-run overrides.
+        argv.append(f"cluster_host={shlex.quote(cluster_host or '')}")
+        if cluster_user:
+            argv.append(f"user={shlex.quote(cluster_user)}")
+        if identity:
+            argv.append(f"identity={shlex.quote(identity)}")
+        argv.append("detach=true")


[CRITICAL Algorithm] The Slurm-mode argv construction never actually configures the cluster host that the launcher will use. launch.py's launch() entrypoint (tools/launcher/launch.py:82) only accepts job_name, job_dir, pipeline, hf_local, user, identity, detach, clean — there is no cluster_host parameter. The cluster host is sourced either from the SLURM_HOST env var via slurm_factory(host=...) (tools/launcher/slurm_config.py:63) or via a per-task override like pipeline.task_0.slurm_config.host=<host>.

Appending cluster_host=<host> to nemo-run's CLI overrides will at best be ignored / produce a CLI error, and at worst silently submit using whatever SLURM_HOST happens to be in the bridge's env (which may be empty). Net effect: every Slurm submission goes through with the wrong host, or fails with an opaque error after verify_setup already reported ok.

Fix: set SLURM_HOST=<cluster_host> in the env passed to the subprocess, e.g.

env = os.environ.copy() env["SLURM_HOST"] = cluster_host argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"] if cluster_user: argv.append(f"user={cluster_user}") if identity: argv.append(f"identity={identity}") argv.append("detach=true") proc = subprocess.run(argv, env=env, ...)

(The bridge also already passes cluster_user as user=... which is the right launcher arg — that one's fine.)

Fixed in 3de99b441. Stopped appending cluster_host= to the nemo-run overrides; the host now flows via child_env['SLURM_HOST']=cluster_host passed as env= to the subprocess. slurm_factory(host=os.environ.get('SLURM_HOST', '')) picks it up the same way as direct uv run launch.py invocations do. cluster_user was already correctly going through as user=... (the launcher's actual arg).

claude · 2026-06-12T22:18:54Z

+    argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"]
+    if hf_local:
+        argv.append(f"hf_local={shlex.quote(hf_local)}")
+    else:
+        # Slurm mode — pass cluster config knobs as nemo-run overrides.
+        argv.append(f"cluster_host={shlex.quote(cluster_host or '')}")
+        if cluster_user:
+            argv.append(f"user={shlex.quote(cluster_user)}")
+        if identity:
+            argv.append(f"identity={shlex.quote(identity)}")
+        argv.append("detach=true")
+    if job_dir:
+        argv.append(f"job_dir={shlex.quote(job_dir)}")
+    if job_name:
+        argv.append(f"job_name={shlex.quote(job_name)}")
+    for k, v in (extra_overrides or {}).items():
+        argv.append(f"{k}={shlex.quote(str(v))}")


[CRITICAL Algorithm] shlex.quote(...) is being used as if it were a value-quoting helper for nemo-run's key=value CLI overrides, but it isn't — it produces a shell-quoted string (e.g. '/path with spaces') suitable for pasting into a shell. Since subprocess.Popen([...]) / subprocess.run([...]) is called with a list (no shell), each argument is passed verbatim to launch.py. The result: nemo-run sees literal arguments like hf_local='/mnt/hf-local' (with quotes baked in), and user='alice' etc. — those quotes become part of the value the launcher dataclass receives, breaking path lookups and SSH user resolution.

Drop shlex.quote for every override; just use the raw value:

if hf_local: argv.append(f"hf_local={hf_local}") else: if cluster_user: argv.append(f"user={cluster_user}") if identity: argv.append(f"identity={identity}") argv.append("detach=true") if job_dir: argv.append(f"job_dir={job_dir}") if job_name: argv.append(f"job_name={job_name}") for k, v in (extra_overrides or {}).items(): argv.append(f"{k}={v}")

shlex.quote would only be appropriate if you were building a shell string (e.g. for shell=True or for an ssh "<remote-cmd>" blob). Since you correctly use list-form Popen, every value can carry spaces / special chars safely without quoting.

Fixed in 3de99b441. Dropped shlex.quote on every nemo-run override — subprocess.run/Popen with a list never goes through a shell, so the quoting was baking literal quote chars into values that the launcher's argparse then saw. All overrides now pass through verbatim. Verified the argv-shape test still passes (the test checks for user=chenhany not user='chenhany').

claude · 2026-06-12T22:19:08Z

+    if executor == "docker":
+        # Docker mode: spawn in background so we don't block the MCP
+        # call. The subprocess writes its experiment dir + status into
+        # NEMORUN_HOME; we'll read it back via job_status.
+        # B603 false positive — `argv` is a list built by this module
+        # from typed parameters, no shell interpretation.
+        proc = subprocess.Popen(  # nosec B603
+            argv,
+            cwd=str(launcher_dir),
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+        )
+        # Detached: don't wait. The caller polls job_status by
+        # experiment_id (derived from job_name or auto-named).
+        # Generate a best-effort experiment_id from job_name + timestamp.
+        # nemo_run names experiments deterministically as
+        # `<title>_<job_name>_<timestamp>`; if the caller didn't provide
+        # job_name we can't predict the id ahead of time.
+        return {
+            "ok": True,
+            "executor": "docker",
+            "pid": proc.pid,
+            "argv": argv,
+            "experiment_id": None,  # Phase 2: tail the subprocess output
+            # until nemo_run logs the id, then return it
+            "spike_note": (
+                "Docker mode launched in background. Phase 1: the "
+                "experiment_id is None — Phase 2 tails the subprocess "
+                "output to capture the id. For now, list experiments "
+                "via nemo_run's CLI or check NEMORUN_HOME."
+            ),
+        }


[IMPORTANT Performance] The Docker-mode Popen(stdout=subprocess.PIPE, stderr=subprocess.STDOUT) writes the child's stdout into a pipe that no one ever reads. Long-running jobs (PTQ runs are minutes-to-hours) will fill the pipe buffer (~64KB on Linux) and then block the launcher subprocess on its next write(), hanging the job indefinitely while the MCP server appears to have "succeeded".

Fix: redirect to a file, or to DEVNULL. Since the actual job logs are written by nemo_run into the experiment dir (which is what job_logs reads), discarding stdout is fine:

proc = subprocess.Popen( argv, cwd=str(launcher_dir), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, start_new_session=True, # detach from MCP server's process group )

start_new_session=True is also worth adding — without it, an MCP server SIGINT/restart will SIGHUP the in-flight Docker submission. Same applies on the Slurm path (the synchronous subprocess.run is fine because it captures fully and returns), but the Docker Popen is the real foot-gun.

Fixed in 3de99b441. Docker mode now uses stdout=DEVNULL, stderr=DEVNULL, start_new_session=True. The launcher writes real logs into the nemo_run experiment dir (which job_logs reads), so discarding the subprocess stdout is correct. start_new_session=True also added so an MCP server restart doesn't SIGHUP the in-flight launcher.

claude · 2026-06-12T22:19:40Z

+def _resolve_experiment_dir(experiment_id: str) -> Path | None:
+    """Map an experiment_id to its on-disk directory.
+
+    nemo_run lays experiments out under ``$NEMORUN_HOME/experiments/<id>/``
+    by default; ``NEMORUN_HOME`` falls back to cwd. We also check
+    ``./experiments/<id>`` directly and ``./local_experiments/<id>``
+    (the Docker-mode fallback path).
+    """
+    candidates = []
+    nemorun_home = os.environ.get("NEMORUN_HOME")
+    if nemorun_home:
+        candidates.append(Path(nemorun_home) / "experiments" / experiment_id)
+    candidates.append(Path.cwd() / "experiments" / experiment_id)
+    candidates.append(Path.cwd() / "local_experiments" / experiment_id)
+    for c in candidates:
+        if c.exists():
+            return c
+    return None


[IMPORTANT Algorithm] The submit-time vs. status-time NEMORUN_HOME resolution is asymmetric and will produce false experiment_dir_not_found results.

submit_job_impl does not pass env= to the subprocess, so the launcher inherits whatever was in the MCP server's env at startup. launch.py itself defaults NEMORUN_HOME to os.getcwd() when unset (tools/launcher/launch.py:99-100), and the subprocess cwd is tools/launcher/ — so artifacts land at tools/launcher/experiments/<id>/.

_resolve_experiment_dir here looks under os.environ["NEMORUN_HOME"]/experiments, then Path.cwd()/experiments, then Path.cwd()/local_experiments — relative to the MCP server's cwd, not the launcher's.

When NEMORUN_HOME is unset (very plausible — the MCP server is launched by codex/Claude Code with no env tailoring), job_status returns experiment_dir_not_found for jobs that actually succeeded.

Fix: explicitly resolve and propagate NEMORUN_HOME at submit time, and document in the README that this env var must be set consistently for the MCP server's lifetime. Something like:

env = os.environ.copy() env.setdefault("NEMORUN_HOME", os.getcwd()) # pin so status side sees the same dir proc = subprocess.run(argv, cwd=str(launcher_dir), env=env, ...)

Also worth adding launcher_dir / "experiments" / experiment_id to the candidate list here as a belt-and-braces fallback.

Fixed in 3de99b441. submit_job_impl now passes env=child_env to the subprocess with child_env.setdefault('NEMORUN_HOME', os.getcwd()). _resolve_experiment_dir reads from the same env, plus an added fallback to <launcher_dir>/experiments/ for the case where the operator hasn't set NEMORUN_HOME at all. Both submit + status sides now agree on the artifact location.

claude · 2026-06-12T22:19:57Z

+    try:
+        gpu = subprocess.run(  # nosec B603 B607
+            [
+                "docker",
+                "run",
+                "--rm",
+                "--gpus",
+                "all",
+                "nvidia/cuda:12.0-base-ubuntu22.04",
+                "nvidia-smi",
+            ],
+            capture_output=True,
+            text=True,
+            timeout=60,
+            check=False,
+        )
+    except subprocess.TimeoutExpired:
+        return {
+            "ok": False,
+            "executor": "docker",
+            "daemon_ok": True,
+            "reason": "gpu_check_timeout",
+            "diagnostic": "GPU smoketest container did not return in 60s.",
+        }


[IMPORTANT Performance] First-time invocations of verify_setup(executor='docker') will pull nvidia/cuda:12.0-base-ubuntu22.04 (~150 MB) on hosts that don't already have it, easily blowing past the 60-second timeout — at which point the call returns gpu_check_timeout even on perfectly healthy hosts. The README sells the probe as "~1 second"; that's only true on warm caches.

Two cheap mitigations:

Replace the GPU smoketest with docker run --rm --gpus all --entrypoint /bin/true <small-image> once we know the daemon is up + has the toolkit, and probe the toolkit separately via nvidia-container-cli info if available — far faster and doesn't depend on a CUDA image at all.

At minimum, pin a much smaller image (e.g. busybox or just bypass image-fetch by checking docker info | grep nvidia), and document that the first call may be slow.

Also: the chosen tag nvidia/cuda:12.0-base-ubuntu22.04 is old (CUDA 12.0 GA shipped in Q1 2023) — newer drivers won't have a fundamentally different code path, but pinning to a long-deprecated tag invites silent registry-removal breakage years from now. If you keep the smoketest, point it at a current CUDA tag.

Fixed in 3de99b441. Replaced the heavyweight docker run --gpus all nvidia/cuda:... nvidia-smi smoketest with docker info --format '{{json .}}' + check whether 'nvidia' is in the registered runtimes. No image pull, daemon-fast, no dependency on a CUDA tag. Returns the same gpu_unavailable structured failure with an install-toolkit pointer when the nvidia runtime isn't registered.

claude · 2026-06-12T22:20:14Z

+    for status_file in sorted(exp_dir.glob("status_*.out")):
+        # Convention: filename is `status_<task_name>.out`, contents
+        # are a single word ("succeeded" or "failed").
+        task_name = status_file.stem.removeprefix("status_")
+        body = status_file.read_text(encoding="utf-8", errors="replace").strip()
+        task_statuses[task_name] = body
+        if "fail" in body.lower():
+            any_failed = True


[IMPORTANT Algorithm] The "fail" in body.lower() check is too loose. The convention documented just above says the file contains a single word ("succeeded" or "failed"), but the test fixture test_job_status_failed_task already writes "failed (rc=1)\n", so the convention is leaky. The substring match also flips on benign content like "succeeded after retry; previous attempt failed" or any nemo_run status string that mentions "fail" — flagging an actually-successful task as failed.

Tighten this to a word match against the canonical statuses:

status_word = body.split()[0].lower() if body else "" task_statuses[task_name] = body if status_word in ("failed", "error", "cancelled"): any_failed = True

Or anchor with body.lower().startswith("fail"). The current approach is silent enough that a misclassification only surfaces when an operator manually inspects the experiment dir.

Fixed in 3de99b441. Now anchors on the FIRST word of the status file against a _STATUS_FAILURE_WORDS frozenset (failed/error/errored/cancelled/canceled). Existing test fixture test_job_status_failed_task writes 'failed (rc=1)\n' — first word is 'failed', still matches. Tightened so 'succeeded after retry; previous attempt failed' would correctly classify as success, not failure.

claude · 2026-06-12T22:20:33Z

+    argv = [
+        "ssh",
+        "-o",
+        "BatchMode=yes",
+        "-o",
+        "StrictHostKeyChecking=accept-new",
+        "-o",
+        "ConnectTimeout=5",
+    ]
+    if identity:
+        argv += ["-i", identity]
+    target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
+    argv += [target, "whoami && hostname"]


[SUGGESTION] StrictHostKeyChecking=accept-new causes the first probe of any new cluster host to silently auto-add the host key to ~/.ssh/known_hosts of the user running the MCP server. That's usually fine for an interactive operator, but in a CI context (the mcp GitHub Actions job in this PR) or a shared service-account host it's a TOFU footgun — an attacker who can intercept the very first probe gets pinned-trust until manual known_hosts cleanup.

Consider StrictHostKeyChecking=yes for verify_setup (probe fails fast and the user is told to add the host key explicitly), and only allow accept-new if a --first-time flag is passed. Not a blocker — current behavior matches the launcher's defaults — but worth flagging as the security posture differs from a typical interactive ssh session.

claude · 2026-06-12T22:20:53Z

+    import re as _re
+
+    experiment_id = None
+    experiment_dir = None
+    slurm_job_id = None
+    for m in _re.finditer(
+        r"experiment[_\s-]+([a-zA-Z0-9_]+_\d{10,})",
+        stdout_tail,
+        _re.IGNORECASE,
+    ):
+        experiment_id = m.group(1)
+        break
+    for m in _re.finditer(
+        r"(/lustre/[^\s]+|/home/[^\s]+)/experiments/[^\s]+",
+        stdout_tail,
+    ):
+        experiment_dir = m.group(0)
+        break
+    for m in _re.finditer(r"Submitted batch job (\d+)", stdout_tail):
+        slurm_job_id = m.group(1)
+        break


[IMPORTANT Algorithm] Three independent fragility issues in this best-effort id parse:

import re as _re should be at module top (per project coding standards, "Keep imports at the top of the file"). It's also already a stdlib import — no reason to lazy-import or alias.

The experiment_dir regex hard-codes /lustre/... and /home/... as the only roots. NVIDIA-internal sites use /lustre/, but partner clusters use /scratch/, /work/, /data/, /p/, etc. The regex will return experiment_dir=None on perfectly valid runs everywhere except specific NVIDIA SREs' machines.

The id regex experiment[_\s-]+([a-zA-Z0-9_]+_\d{10,}) requires 10+ digits at the end (a unix timestamp). nemo_run does use timestamps, but the format may include sub-second precision or change across versions; pinning >=10 digits is brittle and the parse silently returns experiment_id=None rather than failing loud.

For (2) drop the path prefix anchor and rely on /experiments/<id>/ as the unique marker:

m = re.search(r"(\S+/experiments/[^\s/]+)", stdout_tail) if m: experiment_dir = m.group(1)

For (3), since nemo_run logs the experiment id in a deterministic format, anchor on the actual log line nemo_run emits (e.g. "Experiment created") rather than reverse-engineering a timestamp shape. Worth a follow-up Phase-2 pass where the test fixtures cover real nemo_run output.

Fixed in 3de99b441. (1) Hoisted import re to module top. (2) Dropped the /lustre/|/home/ path anchors; match any \S+/experiments/[^\s/]+ so partner-cluster filesystem roots work. (3) Replaced the timestamp-anchored experiment-id regex with a match anchored on nemo_run's experiment ... id: log line, with a generic-id fallback. The underlying spike-quality "best-effort" caveat still applies; production-grade parsing is queued as a Phase 2 task in the cell.md migration plan (OMNIML-5131).

claude · 2026-06-12T22:21:21Z

+        try:
+            with open(path) as f:
+                doc = yaml.safe_load(f) or {}
+            if isinstance(doc, dict):
+                # job_name is the most common identifier; "model" may
+                # also be a top-level field in some examples.
+                entry.model = doc.get("model") or doc.get("base_model") or doc.get("job_name")
+                entry.description = doc.get("description")


[SUGGESTION] Top-level model / base_model / description fields don't exist in any of the bundled launcher YAMLs (e.g. examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml only has job_name). The list_examples test passes because it constructs ad-hoc YAMLs — but on the actual data, this code will report model=<job_name> (a stringified identifier like Qwen3-8B_PTQ) and description=None for every example.

Two reasonable fixes, either's fine:

Document the convention in the launcher's CLAUDE.md and add model: / description: to the canonical examples (small edit, real metadata for agents).

Derive the model name from the path (e.g. Qwen/Qwen3-8B/... → Qwen/Qwen3-8B) and pull description from the first # ... comment block in the YAML — which is the actual location of human-readable description in every bundled example.

Without one of these the tool's "model + description" output is mostly empty for the real corpus, which undercuts its discovery value to the agent.

claude · 2026-06-12T22:21:46Z

+dependencies = [
+    "mcp>=1.0",
+    # The launcher provides the core orchestration primitive (core.run_jobs,
+    # SandboxPipeline, build_slurm_executor, build_docker_executor). Pulled
+    # in via git+subdirectory so `uvx --from <this-repo>#subdirectory=tools/mcp
+    # modelopt-mcp` resolves both packages from the same clone — no
+    # duplicate fetch.
+    "modelopt-launcher",
+    "pyyaml",
+    "pydantic>=2.0",
+]


[CRITICAL Algorithm] The dependency "modelopt-launcher" is unsatisfiable for the documented uvx --from "git+...#subdirectory=tools/mcp" modelopt-mcp install path.

tools/launcher/pyproject.toml declares the package as name = "modelopt-launcher" with [tool.setuptools] py-modules = [] — there is no modelopt_launcher Python package directory, only loose .py files (launch.py, core.py, slurm_config.py) at the launcher dir's top level. Even with [tool.uv.sources] modelopt-launcher = { path = "../launcher" }, uv pip install -e ../launcher only registers the dist-info; nothing importable lands in site-packages. So bridge.py's import modelopt_launcher fallback at line 80-83 always misses, and any user who pip/uvx-installs without a clone (the explicit recommended path) gets a server that can't locate examples/ or launcher_dir.

The PR description's "uvx clones the whole repo to its cache" claim only saves the day because the bridge falls back to _THIS_DIR.parent.parent / "launcher" — i.e. the install only works because of file-layout assumptions about the cloned repo, not because the dep system actually ships the launcher to the consumer. That's a fragile contract.

Fix options:

Make modelopt-launcher a proper package: move the source into tools/launcher/modelopt_launcher/ and update its pyproject.toml to expose modelopt_launcher (with examples/ declared as package data). Then from modelopt_launcher import core / import modelopt_launcher.examples works, the import-fallback in _find_launcher_examples_dir actually does something, and uvx-ed installs run cleanly without the repo file-layout side-channel.

Ship a [tool.uv.sources] git+...#subdirectory=tools/launcher for the launcher (so uvx resolves it from the same repo on PyPI-less consumers) AND add a runtime hard-fail when _THIS_DIR.parent.parent / "launcher" is missing — at least the failure is loud rather than "works on the author's machine."

Without either fix, the install matrix in the README is misleading and a Phase-2 user who runs the uvx-from-PyPI path (once published) ends up with a non-functional server.

Fixed in 3de99b441. Dropped the bare modelopt-launcher dep from [project.dependencies]. You're right that tools/launcher/pyproject.toml declares the name with py-modules = [] — there is no importable modelopt_launcher Python package; the launcher is consumed as loose .py files at tools/launcher/*.py. bridge.py invokes it via subprocess.run(["uv", "run", "launch.py", ...], cwd=<repo>/tools/launcher/) — a file-layout dependency, not a Python import dependency. The fallback import modelopt_launcher at line 80-83 was vestigial; removing it from the dep list while leaving the import as a try/except fallback for a hypothetical future where the launcher IS pip-installable.

uvx-from-git satisfies the file-layout dependency naturally (the clone puts both tools/launcher/ and tools/mcp/ next to each other). The README documents the requirement explicitly.

claude

Claude review — Phase 1 modelopt-mcp

Tally: 4 CRITICAL · 4 IMPORTANT · 2 SUGGESTION

The bridge module wraps the launcher with a structured-error contract that's pleasant to read, and the test suite is hermetic and well-shaped. The Phase-1-vs-Phase-2 split is also called out cleanly. However, several issues mean the Slurm path won't actually submit to the requested cluster, and the install metadata won't survive being pulled in without the rest of the repo. Highlights:

Most impactful

CRITICAL — Slurm cluster_host is never propagated to the launcher. launch.py has no such parameter; cluster host comes from SLURM_HOST (env) or the slurm_config.host per-task override. The bridge appends cluster_host=<host> as a nemo-run CLI override, which the launcher ignores — every slurm submission silently uses whatever SLURM_HOST was inherited at startup, or fails opaquely. (bridge.py:478-485)
CRITICAL — shlex.quote is misapplied in argv construction. Quotes get baked into values when subprocess.Popen([...]) is called with a list (no shell), so launch.py sees literal hf_local='/mnt/hf-local' and user='alice' — breaking path lookups and SSH user resolution. (bridge.py:475-491)
CRITICAL — modelopt-launcher dep is unsatisfiable as packaged. The launcher's pyproject declares py-modules = [] and exposes no modelopt_launcher package — import modelopt_launcher in the bridge fallback always misses. The install only works because of file-layout assumptions about the cloned repo. (tools/mcp/pyproject.toml:6-16)
IMPORTANT — Docker-mode Popen deadlock. stdout=PIPE is unread; the launcher subprocess will block on its next write() once the ~64KB pipe buffer fills, hanging long-running PTQ jobs while the MCP server reports success. (bridge.py:506-537)
IMPORTANT — NEMORUN_HOME resolution is asymmetric between submit and status. Without explicit propagation, job_status returns experiment_dir_not_found for jobs that succeeded — the MCP and launcher cwds aren't the same. (bridge.py:626-643)

Other findings (in the inline thread)

IMPORTANT — verify_setup(executor='docker') GPU smoketest pulls a 150 MB CUDA image inside a 60 s timeout; first call on a cold host returns a misleading gpu_check_timeout.
IMPORTANT — job_status's 'fail' in body.lower() substring check fires on "succeeded after retry; previous attempt failed" etc.
IMPORTANT — Slurm submit's experiment-id parse hard-codes /lustre|/home paths and a 10+ digit timestamp shape; lazy import re as _re should be at module top per project standards.
SUGGESTION — StrictHostKeyChecking=accept-new is a TOFU footgun in CI/service-account contexts.
SUGGESTION — list_examples extracts model/description from top-level YAML fields that don't exist in any bundled example; output is mostly empty on the real corpus.

Risk assessment

Medium-to-high. The Slurm path is the documented headline use-case in the README, and three of the four CRITICAL findings sit in that path. The unit tests pass because they mock the subprocess layer end-to-end and never observe the actual launcher's arg-handling, so the regressions wouldn't be caught by CI as written.

Net: the structure is good, but I'd hold this until at least the four CRITICAL findings are resolved and a smoke-test that drives the real launch.py --dryrun is added to the test suite.

claude · 2026-06-12T22:23:08Z

+    else:
+        # Slurm mode — pass cluster config knobs as nemo-run overrides.
+        argv.append(f"cluster_host={shlex.quote(cluster_host or '')}")
+        if cluster_user:
+            argv.append(f"user={shlex.quote(cluster_user)}")
+        if identity:
+            argv.append(f"identity={shlex.quote(identity)}")
+        argv.append("detach=true")


[CRITICAL Algorithm] The Slurm-mode argv construction never actually configures the cluster host that the launcher will use. launch.py's launch() entrypoint (tools/launcher/launch.py:82) only accepts job_name, job_dir, pipeline, hf_local, user, identity, detach, clean — there is no cluster_host parameter. The cluster host is sourced either from the SLURM_HOST env var via slurm_factory(host=...) (tools/launcher/slurm_config.py:63) or via a per-task override like pipeline.task_0.slurm_config.host=<host>.

Appending cluster_host=<host> to nemo-run's CLI overrides will at best be ignored / produce a CLI error, and at worst silently submit using whatever SLURM_HOST happens to be in the bridge's env (which may be empty). Net effect: every Slurm submission goes through with the wrong host, or fails with an opaque error after the verify step has already passed.

Fix: either set SLURM_HOST=<cluster_host> in the env passed to the subprocess (and propagate it through the actual ssh invocation the launcher makes), or pass it as a structured override. Something like:

env = os.environ.copy() env["SLURM_HOST"] = cluster_host # ... argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--yes"] if cluster_user: argv.append(f"user={cluster_user}") if identity: argv.append(f"identity={identity}") argv.append("detach=true") proc = subprocess.run(argv, env=env, ...)

Same problem applies to cluster_user → the launcher's user parameter is correctly named but the bridge passes it as user=... already, so that one's fine.

Same as the earlier claude[bot] finding on line 485 — fixed by propagating SLURM_HOST=<cluster_host> via env=child_env in 3de99b441.

… shell-quoting, Docker DEVNULL + start_new_session, NEMORUN_HOME propagation, status word-anchor, lighter GPU probe Addresses CodeRabbit + claude[bot] findings on PR #1701: bridge.py * **Slurm cluster_host**: launch.py's entrypoint does not accept a `cluster_host` arg — it reads SLURM_HOST from env via slurm_factory. Stop appending `cluster_host=<host>` to nemo-run overrides (it was at best ignored, at worst silently submitted with whatever SLURM_HOST happened to be in the bridge's env). Propagate via env=child_env on the subprocess. * **shlex.quote misuse**: subprocess.run/Popen with a list never goes through a shell, so shell-quoting baked literal quote chars into values like `hf_local='/mnt/hf-local'`. Drop quoting on all nemo-run k=v overrides — verbatim values are passed safely. * **Docker Popen PIPE → DEVNULL**: stdout/stderr=PIPE without a reader would fill the ~64 KB kernel pipe buffer on long PTQ runs and hang the launcher subprocess. Use DEVNULL (nemo_run writes real logs into the experiment dir; job_logs reads from there). Also add `start_new_session=True` so an MCP server SIGINT/restart doesn't SIGHUP the in-flight launcher. * **NEMORUN_HOME asymmetry**: submit-side didn't pin NEMORUN_HOME in the subprocess env, but status-side (`_resolve_experiment_dir`) looked for it. Launcher defaulted to its own cwd; status-side to the MCP server's cwd. Result: false `experiment_dir_not_found` for jobs that actually succeeded. Pin via env_setdefault + expand the candidate-dir list to include `launcher_dir/experiments/` as a belt-and-braces fallback. * **GPU verify_setup**: replaced the heavyweight `docker run --gpus all nvidia/cuda:12.0-base nvidia-smi` (which pulls ~150 MB and blows past the 60s timeout on first call) with `docker info --format '{{json .}}'` + check whether `"nvidia"` is in the registered runtimes. No image pull, daemon-fast. * **Status word match**: `"fail" in body.lower()` was too loose (false-positives on "succeeded after retry; previous attempt failed"). Anchor on the FIRST word of the status file vs a `_STATUS_FAILURE_WORDS` frozenset. * **experiment_id regex**: hoisted `import re` to module top, and relaxed both the id pattern and the experiment_dir path regex (was hard-coded to `/lustre/|/home/` — fails on partner clusters' `/scratch/`, `/work/`, etc.). * **list_examples metadata**: path-derive `model` as `<family>/<model>` from `examples/<family>/<model>/<task>.yaml` when the YAML body doesn't carry top-level `model` / `description` / `job_name` (most launcher examples don't). server.py * Add `Field(ge=1)` to the `tail` param on `job_logs` so the schema rejects zero/negative values at the MCP wire level. pyproject.toml * Drop the bare `modelopt-launcher` dep. tools/launcher/pyproject.toml declares the name with `py-modules = []` — there is no importable `modelopt_launcher` Python package on disk. bridge.py uses the launcher via `subprocess.run(["uv", "run", "launch.py", ...])` with `cwd=<repo>/tools/launcher/` — a file-layout dependency, not a Python import. uvx-from-git satisfies this naturally (the clone puts both directories on disk side by side). Remove the unsatisfiable PyPI declaration; document the file-layout requirement in the README. __init__.py * Correct the `submit_job` return-contract docstring: Slurm returns `experiment_id`; Docker returns the background `pid` (Phase 2 will capture the nemo_run experiment_id for the Docker path). All changes preserve the existing 19/19 unit tests passing. Signed-off-by: Chenhan Yu <[email protected]>

ChenhanYu · 2026-06-12T22:44:40Z

Review feedback addressed in `3de99b441`

Thanks @coderabbitai and @claude for the thorough pass. Summary of what changed and what was deferred / declined:

Fixed (7 substantive bugs)

Issue	Fix
Slurm `cluster_host` never reached the launcher (launch.py reads `SLURM_HOST` env, not a CLI arg)	Propagate via `env=child_env` on the subprocess
`shlex.quote` on nemo-run k=v overrides baked literal quotes into values	Removed quoting; list-form subprocess never goes through a shell
Docker `Popen(stdout=PIPE)` would block on full pipe buffer for long jobs	Switched to `stdout=DEVNULL, stderr=DEVNULL, start_new_session=True`
Submit-side vs status-side `NEMORUN_HOME` asymmetry → false `experiment_dir_not_found`	Pin via `child_env.setdefault('NEMORUN_HOME', os.getcwd())` + expanded `_resolve_experiment_dir` candidate list
Docker GPU probe pulled ~150 MB CUDA image, blew past 60s timeout on first call	Replaced with `docker info --format` runtime-registry check — no image pull
`"fail" in body.lower()` substring match for task status was too loose	Anchor on FIRST word against a `_STATUS_FAILURE_WORDS` frozenset
`experiment_id` regex hard-coded `/lustre/	/home/` roots + timestamp shape

Plus the docstring-contract fix (__init__.py Docker returns pid, Slurm returns experiment_id), the list_examples path-derived model fallback (since real launcher examples don't carry top-level model fields), Field(ge=1) on tail, pyproject.toml drops the unsatisfiable modelopt-launcher bare-name dep, and re hoisted to module top.

Deferred / declined (with rationale, see inline replies)

# nosec on subprocess — kept; mirrors the in-repo precedent at tools/launcher/launch.py and tools/launcher/core.py. If repo policy is tightening, that's a project-wide MR covering the launcher too.
StrictHostKeyChecking=accept-new — kept; BatchMode=yes already enforces key-auth-only, and the alternative requires pre-populated known_hosts which is worse UX than the marginal security gain.
Workflow YAML hardening (pinned action SHA, permissions: block, persist-credentials: false) — flagged; deferring to a repo-wide MR that hardens launcher + mcp + the other unpinned jobs together. Mirroring existing precedent within this PR.

Validation

19/19 unit tests still pass (uv run python -m pytest tests/)
Pre-commit clean (ruff, ruff-format, mypy, bandit, license-headers)
All 16 inline review comments have a reply with what was done / why deferred

Ready for re-review. cc @kevalmorabia97

claude · 2026-06-12T22:45:11Z

Claude finished @ChenhanYu's task in 2m 52s —— View job

Re-review of `3de99b44`

Verified all 7 substantive fixes against the diff and current source. The core algorithmic bugs are correctly addressed; deferral rationale on the policy items is reasonable. Ready for re-review from a maintainer.

Verified fixes ✅

Issue	Verified
Slurm `SLURM_HOST`	`child_env["SLURM_HOST"] = cluster_host or ""` at `bridge.py:550`, passed via `env=child_env` on both Docker `Popen` (`:565`) and Slurm `subprocess.run` (`:592`). Matches `slurm_factory(host=os.environ.get("SLURM_HOST", ""))` in `tools/launcher/slurm_config.py:63`.
`shlex.quote` removal	`import shlex` dropped; all 6 override appends at `bridge.py:506-524` now use raw values.
Docker `Popen` DEVNULL	`bridge.py:562-569` — `stdout/stderr=DEVNULL` + `start_new_session=True`. Pipe-buffer hang fixed; SIGHUP isolation added.
`NEMORUN_HOME` symmetry	Submit pins via `child_env.setdefault("NEMORUN_HOME", os.getcwd())` (`:545`). `_resolve_experiment_dir` (`:704-712`) now also probes `launcher_dir/experiments/` and `launcher_dir/local_experiments/` as belt-and-braces.
GPU probe	Replaced with `docker info --format '{{json .}}'` + `"nvidia" in Runtimes` check at `bridge.py:249-287`. No image pull. Timeout dropped to 10s.
Status word-anchor	`bridge.py:760-762` — `body.split()[0].lower() in _STATUS_FAILURE_WORDS` (`{failed, error, errored, cancelled, canceled}`). Existing `test_job_status_failed_task` fixture (`failed (rc=1)`) still passes since `split()[0] == "failed"`.
`experiment_id` parse	`import re` hoisted to top; experiment_dir regex now `\S+/experiments/[^\s/]+` (works on any FS root); id regex anchored on `experiment id:` with a generic-token fallback.

Plus: Field(ge=1) on tail (server.py:261), __init__.py docstring corrected, pyproject.toml drops the unsatisfiable modelopt-launcher bare-name dep, list_examples path-derives <family>/<model> when YAML body lacks it.

Deferral rationale — accepted ✅

# nosec suppressions — confirmed identical pattern in tools/launcher/launch.py:33,96 and tools/launcher/core.py:380,383,390. Cleaning these up only in tools/mcp/ would be inconsistent. Repo-wide MR is the right boundary.
StrictHostKeyChecking=accept-new — BatchMode=yes already enforces key-auth-only, so no password-prompt hijack surface. The TOFU concern is real but tightening would impose pre-populated known_hosts on every cluster, which is worse UX. Reasonable tradeoff for an interactive operator tool; a --strict-host-key opt-in could be a Phase-2 addition.
Workflow YAML hardening — every job in the workflow file (linux, windows, multi-version, partial-install, launcher, mcp, skills, …) uses actions/checkout@v6 unpinned with no permissions: block. Repo-wide hardening MR is the right shape.

Minor follow-ups (non-blocking)

_STATUS_FAILURE_WORDS lacks timeout/cancelled_by_user — nemo_run may emit other terminal states. If you encounter false-done classifications in Phase 2, expand the frozenset. (bridge.py:50-52)
list_examples path-derivation depth — f"{parts[1]}/{parts[2]}" if len(parts) >= 4 else None assumes examples/<family>/<model>/<file>; flatter layouts (examples/<family>/<file>) get model=None unless the YAML supplies it. Today's bundled examples use the 4-deep layout so this works, but worth a comment if the layout convention is intended to be load-bearing. (bridge.py:139-140)
Field(ge=1) test gap — no test exercises pydantic rejection of tail=0 at the wire level. Pydantic enforces it correctly so this is purely a coverage observation, not a bug.

Risk assessment

Low. The 4 critical findings from my Phase-1 pass were genuine submission-correctness bugs; all are now fixed in a way that matches the launcher's actual interface (SLURM_HOST env, raw argv values, on-disk experiment dir under a single NEMORUN_HOME). The unit tests still mock at the subprocess boundary so they don't observe the launcher's real arg parsing — adding a --dryrun-driven smoke test is queued as a Phase 2 task per the OP, and that's a fair scope split.

LGTM from my side. cc @kevalmorabia97 for maintainer sign-off.
· branch: chenhany/modelopt-mcp-phase1

coderabbitai · 2026-06-12T22:45:51Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the updated bridge.py to verify the fixes
cat tools/mcp/modelopt_mcp/bridge.py

Length of output: 30709

🏁 Script executed:

#!/bin/bash
# Read server.py and pyproject.toml
cat tools/mcp/modelopt_mcp/server.py
cat tools/mcp/pyproject.toml
cat tools/mcp/modelopt_mcp/__init__.py

Length of output: 14771

@ChenhanYu Thanks for the thorough write-up — all 7 substantive fixes verified in the current code. The logic is clean. Two small remaining items before this can be merged:

1. Stale tool description in `server.py` (Line 80 area)

The verify_setup tool description still says:

"GPU passthrough (`docker run --gpus all nvidia-smi`)"

But the implementation now uses docker info --format '{{json .}}' (no container run). The LLM reads this description to understand what the tool does, so it's worth updating:

-            "Docker mode: checks `docker info` (daemon up) + GPU "
-            "passthrough (`docker run --gpus all nvidia-smi`). Set "
+            "Docker mode: checks `docker info` (daemon up) + whether the "
+            "`nvidia` runtime is registered (`docker info --format '{{json .}}'`). Set "

2. `AttributeError` on `TimeoutExpired.stdout.decode()` in `bridge.py` (submit Slurm path)

In submit_job_impl with text=True, subprocess.TimeoutExpired.stdout is a str | None, not bytes. The current handler:

f"{(e.stdout or b'').decode(errors='replace')[-400:]}"

…will raise AttributeError: 'str' object has no attribute 'decode' if e.stdout is a non-empty string (i.e. if the subprocess produced partial output before the timeout). Fix:

f"{(e.stdout or '')[-400:]}"

Everything else looks solid. The deferred rationales (nosec precedent, StrictHostKeyChecking UX trade-off, workflow hardening as a repo-wide MR) are all reasonable — no objection.

coderabbitai

🧹 Nitpick comments (2)

tools/mcp/modelopt_mcp/__init__.py (1)
50-52: 📐 Maintainability & Code Quality | ⚡ Quick win

Align package export style with the repo’s __init__.py convention.

Line 50–52 does not follow the project’s stated package-export pattern (__all__ at top and from .module import * re-export in package __init__.py).

As per coding guidelines, “Define the public API with __all__ at the top of each module and re-export via from .module import * in package __init__.py files.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/__init__.py` around lines 50 - 52, This file exports
main directly but doesn't follow the repo convention: move the __all__
declaration to the top of the module and re-export symbols by importing the
module with a relative star import; specifically, add __all__ = ["main"] at the
top of this package __init__ and replace the current direct import of
modelopt_mcp.server.main with a relative re-export like from .server import * so
the public API is defined via __all__ and re-exported from the submodule
(references: __all__, main, modelopt_mcp.server, server).
Source: Coding guidelines
tools/mcp/modelopt_mcp/bridge.py (1)
268-271: 📐 Maintainability & Code Quality | ⚡ Quick win

Move stdlib json import to module top-level (or justify local import).

Line 268 introduces a local stdlib import with no circular/optional/heavy-import reason. Please keep it at module scope for consistency with repo import rules.
♻️ Proposed change
+import json
 import os
 import re
 import subprocess  # nosec B404
@@
-            import json as _json
-
-            info = _json.loads(gpu.stdout)
+            info = json.loads(gpu.stdout)
As per coding guidelines, “Keep imports at the top of the file... Put an import inside a function only when there is a concrete reason.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 268 - 271, The local stdlib
import "import json as _json" should be moved to the module top-level; remove
the inline import and add "import json" (or "import json as _json" to keep
naming) at the top of bridge.py, then update the local usage (where info =
_json.loads(gpu.stdout) and runtimes = list((info.get("Runtimes") or
{}).keys())) to use the top-level json symbol; only keep a local import if you
have a documented reason (circular/optional/heavy), otherwise relocate and run
tests to ensure no name collisions.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tools/mcp/modelopt_mcp/__init__.py`:
- Around line 50-52: This file exports main directly but doesn't follow the repo
convention: move the __all__ declaration to the top of the module and re-export
symbols by importing the module with a relative star import; specifically, add
__all__ = ["main"] at the top of this package __init__ and replace the current
direct import of modelopt_mcp.server.main with a relative re-export like from
.server import * so the public API is defined via __all__ and re-exported from
the submodule (references: __all__, main, modelopt_mcp.server, server).

In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 268-271: The local stdlib import "import json as _json" should be
moved to the module top-level; remove the inline import and add "import json"
(or "import json as _json" to keep naming) at the top of bridge.py, then update
the local usage (where info = _json.loads(gpu.stdout) and runtimes =
list((info.get("Runtimes") or {}).keys())) to use the top-level json symbol;
only keep a local import if you have a documented reason
(circular/optional/heavy), otherwise relocate and run tests to ensure no name
collisions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47295e20-a43a-4b21-9487-cfc8b88d55d9

📥 Commits

Reviewing files that changed from the base of the PR and between cb3766d and 3de99b4.

📒 Files selected for processing (4)

tools/mcp/modelopt_mcp/__init__.py
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/modelopt_mcp/server.py
tools/mcp/pyproject.toml

🚧 Files skipped from review as they are similar to previous changes (1)

tools/mcp/modelopt_mcp/server.py

Post-review-fix doc sync: * `verify_setup(executor='docker')` description: replace the stale `docker run --gpus all nvidia-smi` mention (commit 3de99b4 swapped this for the lighter `docker info --format` runtime- registry probe — no image pull, daemon-fast). * `MODELOPT_MCP_SKIP_GPU_CHECK` env var description: same fix. * Phase 2/3 section: link out to the Epic + child Tasks (OMNIML-5128/5132 Phase 2, OMNIML-5133/5134 Phase 3) so a reader can find the planned work without leaving the README. Reflects the trimmed NEL scope (6 tools, dropped nel_list/nel_validate). Signed-off-by: Chenhan Yu <[email protected]>

shengliangxu

LGTM

ChenhanYu requested a review from kevalmorabia97 June 12, 2026 21:57

ChenhanYu marked this pull request as ready for review June 12, 2026 22:04

ChenhanYu requested a review from a team as a code owner June 12, 2026 22:04

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

claude Bot reviewed Jun 12, 2026

View reviewed changes

ChenhanYu requested a review from shengliangxu June 12, 2026 22:37

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

shengliangxu approved these changes Jun 12, 2026

View reviewed changes

Conversation

ChenhanYu commented Jun 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Five tools

Design constants

Layout

Install

End-user install (recommended)

Dev install (local checkout)

Why no plain pip install today

Post-review changes (commit 3de99b4 + 082a6b8)

Validation

Acceptance criteria (from OMNIML-5123)

Phase 2 (separate PR)

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning)

Uh oh!

ChenhanYu commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ChenhanYu commented Jun 12, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

ChenhanYu commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

Why no plain `pip install` today

Post-review changes (commit `3de99b4` + `082a6b8`)

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

codecov Bot commented Jun 12, 2026 •

edited

Loading

coderabbitai Bot Jun 12, 2026 •

edited

Loading

coderabbitai Bot Jun 12, 2026 •

edited

Loading

Review feedback addressed in `3de99b441`

claude Bot commented Jun 12, 2026 •

edited

Loading

Re-review of `3de99b44`