feat(osmo): VS Code/Cursor dev workflow on NVIDIA OSMO #352
Open
smash0190 wants to merge 13 commits into
Adds a privileged Docker-in-Docker workspace task that lets a developer
run the full AirStack docker-compose stack on OSMO and attach an IDE
over SSH, with Isaac Sim WebRTC livestream + Foxglove websocket exposed
via osmo port-forward.
Components:
- osmo/workspace/{Dockerfile,entrypoint.sh,sshd_config}: airstack-osmo-workspace
image. Ubuntu 24.04 + sshd (pubkey-only) + Docker CE + Docker Compose +
nvidia-container-toolkit + fuse-overlayfs (DinD-on-overlayfs needs it,
otherwise dockerd falls back to vfs which bloats AirStack images ~10x).
- osmo/workflows/airstack-dev.yaml: single privileged GPU task. Materializes
Nucleus + airlab-docker secrets from OSMO credentials, clones AirStack,
starts inner dockerd, runs `airstack up` with desktop + isaac-sim-livestream
Compose profiles.
- simulation/isaac-sim: isaac-sim-livestream Compose service that runs
Pegasus standalone with --/app/livestream/enabled=true and exposes
WebRTC port ranges 47995-48012 / 49000-49007 / 49100; launch script
gates headless+livestream extension on ISAAC_SIM_LIVESTREAM env var.
- .airstack/modules/osmo.sh: airstack osmo:{up,ide,foxglove,webrtc,logs,down}
CLI wrappers around `osmo workflow submit` / `port-forward` / `cancel`.
Persists the active workflow id and validates it's still running before
each command (prevents the stale-state 410 error).
- airstack.sh: bash 4+ re-exec bootstrap (macOS ships 3.2; the CLI uses
`declare -A`).
- osmo/README.md + docs/tutorials/airstack_on_osmo.md: admin pool setup
(privileged_allowed) + per-user credentials (airlab-docker-login,
airlab-nucleus) + student-facing IDE attach + WebRTC/Foxglove flow.
Pool requirements: privileged_allowed: true, GPU pool with
nvidia-container-toolkit on the host, ample node ephemeral storage
(AirStack images extracted are ~50-100Gi via fuse-overlayfs; vfs needs
~500Gi+).
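The bash 4+ re-exec bootstrap in airstack.sh can be sketched as follows — a minimal sketch, not the script's actual contents; the candidate interpreter paths and the `OSMO_COMMANDS` table name are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Sketch of a bash-4 re-exec bootstrap. macOS ships bash 3.2 (no associative
# arrays), so if the running bash is too old, re-exec under a newer one.
# The candidate paths below are illustrative (e.g. Homebrew locations).
if ((BASH_VERSINFO[0] < 4)); then
  for candidate in /opt/homebrew/bin/bash /usr/local/bin/bash; do
    [[ -x "$candidate" ]] && exec "$candidate" "$0" "$@"
  done
  echo "error: bash >= 4 required, found $BASH_VERSION" >&2
  exit 1
fi
declare -A OSMO_COMMANDS     # safe now: declare -A exists in bash >= 4
OSMO_COMMANDS[up]=cmd_osmo_up
```

The `exec` replaces the current (old) interpreter with the modern one, re-running the same script with the same arguments, so the rest of the file can freely use `declare -A`.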
Co-authored-by: Cursor <[email protected]>
…ward race, and cursor-server install hangs

Four bugs that bit the first end-to-end runs (airstack-dev-10 → -13):

- _osmo_wf_id: validate the saved workflow id against `osmo workflow query` before returning. Without this, the state file at ~/.airstack/osmo-state outlives the workflow it points at, and every subsequent osmo:webrtc / osmo:foxglove / osmo:ide call surfaces the same confusing "Workflow airstack-dev-N is not running! (status 410)" instead of the obvious "run airstack osmo:up to launch a fresh workflow".
- cmd_osmo_up: `osmo workflow submit --set-env` is variadic. Passing two separate `--set-env A=1 --set-env B=2` silently drops the first one — this is what made airstack-dev-11 fail with "ERROR: SSH_PUB_KEY not set" when --branch was passed alongside the pubkey. Collapse the K=V pairs into a single --set-env.
- cmd_osmo_ide: previously launched the IDE before starting the port-forward, so Cursor/VS Code would try to SSH localhost:2200 a few hundred ms before the tunnel listener existed and fail with "connect to host localhost port 2200: Connection refused". Now: detect an existing forward and reuse it (which also avoids "Address already in use" if osmo:foxglove was started in parallel); otherwise spawn the forward in the background, wait up to 30s for it to bind, then launch the IDE. Ctrl+C tears down the spawned forward cleanly via a trap.
- workspace image / entrypoint: Cursor Remote-SSH hung indefinitely on airstack-dev-13 because (a) cursor-server's installer fell back to wget when curl timed out and wget was not in the image, and (b) a /tmp/cursor-remote-lock.* file left behind by the first crashed install blocked every silent retry. Add wget to the apt install list and rm -f the stale Cursor / VS Code remote lock files at the very top of entrypoint.sh so each fresh pod starts from a clean slate.

Co-authored-by: Cursor <[email protected]>
…ons locally on osmo:foxglove
osmo:logs was invoking `osmo workflow logs <id> workspace --follow`, but
the real CLI takes the task via `-t TASK` (not positionally) and has no
`--follow` flag at all — so the command failed immediately with
"unrecognized arguments: workspace --follow". Replace with a polling loop
that uses `-t workspace -n <N>` on a short interval, prints only the
suffix that appeared since the previous fetch (find-the-last-seen-line
trick; degrades to "reprint tail" with a warning if the cursor outruns
-n), and exits cleanly once the workflow reaches a terminal state.
Tunables: OSMO_LOGS_TASK / OSMO_LOGS_TAIL / OSMO_LOGS_INTERVAL.
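The find-the-last-seen-line trick this commit describes can be sketched as a small pure-shell helper — a minimal sketch; the function and file names here are illustrative, not the actual cmd_osmo_logs code:

```shell
# Given the freshly fetched tail and the last line already printed, emit only
# the suffix that appeared since the previous fetch. If the previously seen
# line is no longer inside the -n window (the cursor outran it), degrade to
# reprinting the whole tail.
print_new_lines() {
  local tail_file="$1" last_seen="$2"
  if [[ -n "$last_seen" ]] && grep -qxF -- "$last_seen" "$tail_file"; then
    awk -v needle="$last_seen" '
      $0 == needle { start = NR + 1 }   # remember the LAST occurrence
      { lines[NR] = $0 }
      END { for (i = start; i <= NR; i++) print lines[i] }
    ' "$tail_file"
  else
    cat "$tail_file"                    # degraded: reprint the whole tail
  fi
}

printf 'boot\nready\nsim up\n' > /tmp/osmo-tail.txt
print_new_lines /tmp/osmo-tail.txt 'ready'   # prints only: sim up
```

Matching on the last occurrence (rather than the first) keeps repeated log lines from truncating the output too early.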
osmo:foxglove now installs the AirStack Foxglove extensions
(robot-commands / waypoint-editor / polygon-editor) into the laptop's
local Foxglove user-extensions directory before opening the
port-forward. Without this, custom panels show up as "Unknown panel
type: robot-commands.Robot Tasks" in the laptop's Foxglove Desktop
because it has no way to discover the extension folders that live
inside the GCS container. To avoid duplicating the install logic, the
existing gcs/foxglove_extensions/install.py is refactored to read
FOXGLOVE_EXT_SRC / FOXGLOVE_EXT_DST env vars (the in-container call
already in gcs/docker/gcs-base-docker-compose.yaml keeps working
unchanged via defaults). The wrapper sets those vars to
${PROJECT_ROOT}/gcs/foxglove_extensions and
~/.foxglove-studio/extensions respectively, overridable with
OSMO_FOXGLOVE_EXT_DIR / skippable with OSMO_FOXGLOVE_SKIP_EXTENSIONS=1.
Co-authored-by: Cursor <[email protected]>
…actually shows pixels
Kit 107's WebRTC livestream picks a UDP media port dynamically. The
documented `omni.services.livestream.nvcf` defaults (minHostPort=47998
maxHostPort=48020 fixedHostPort=0) are ignored by the stock standalone
Kit binary — on airstack-dev-13 it bound to UDP 49042, outside both the
Compose-published range AND the default `osmo:webrtc --udp` forward of
`47995-48012,49000-49007`. Result: TCP signaling on 49100 worked, the
WebRTC Streaming Client window opened, but every SRTP media packet was
dropped → black viewport plus the recurring
`NVST_CCE_DISCONNECTED when m_connectionCount 0 != 1` underflow in Kit's log.
Pin the media port via three `app.livestream.*` settings set on
`SimulationApp` before `omni.kit.livestream.webrtc` is enabled, so
whichever code path the carb.livestream-rtc.plugin consults lands on the
same port:
app.livestream.fixedHostPort = 49099
app.livestream.minHostPort = 49099
app.livestream.maxHostPort = 49099
49099 deliberately sits one below the 49100 TCP signaling port — same
neighborhood, easy to remember. Verified live on airstack-dev-13 after
`docker compose up -d --force-recreate isaac-sim-livestream`: Kit binds
UDP 49099 (`/proc/net/udp` hex BFCB on 0.0.0.0) and docker-proxy
publishes it from the pod host network.
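That /proc/net/udp check generalizes: local ports are listed in uppercase hex, so verifying the pinned media port is a one-line conversion plus a grep. A minimal sketch (meant to run inside the isaac-sim container; the procfs path is standard Linux):

```shell
# /proc/net/udp shows local addresses as IP:PORT with the port in hex,
# so the pinned media port 49099 appears as BFCB.
port=49099
port_hex=$(printf '%04X' "$port")    # 49099 -> BFCB
echo "looking for :$port_hex in /proc/net/udp"
grep -i ":$port_hex" /proc/net/udp \
  || echo "port $port not bound (expected only while Kit is streaming)"
```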
Knock-on cleanups:
- `simulation/isaac-sim/docker/docker-compose.yaml` shrinks the
isaac-sim-livestream `ports:` from 27 forwarded ports
(`47995-48012, 49000-49007 TCP+UDP, 49100 TCP`) to just two:
`49100/tcp` + `49099/udp`.
- `.airstack/modules/osmo.sh` shrinks `OSMO_WEBRTC_TCP` to `49100` and
`OSMO_WEBRTC_UDP` to `49099`, so `airstack osmo:webrtc` spawns two
port-forwards instead of thirty.
- `.gitignore` ignores `.DS_Store` so working from a Mac doesn't leak
Finder metadata.
After pulling this commit into a running pod: `docker compose up -d
--force-recreate isaac-sim-livestream` to apply the new port mapping;
then re-run `airstack osmo:webrtc` on the laptop to pick up the new
forward ranges. The standalone WebRTC Streaming Client connects to
`localhost` (same address as before) and now actually receives frames.
Co-authored-by: Cursor <[email protected]>
…d for in-pod git push

Two paper-cuts that bit airstack-dev-13 after the WebRTC media port pin landed (commit 2d9b161):

(1) The WebRTC stream showed only the bare 3D viewport — no menu bar, no toolbar, no panels, no console. Cause: SimulationApp's default when `headless=True` is to also hide the UI (`hide_ui=True`). The NVIDIA reference at `simulation/isaac-sim/standalone_examples/api/isaacsim.simulation_app/livestream.py` explicitly opts back into UI rendering, picks explicit window sizing, and sets `display_options=3286` to keep the default grid/axes visible. Mirror that config in `example_one_px4_pegasus_launch_script.py` when `ISAAC_SIM_LIVESTREAM=true` (local desktop dev keeps the minimal `headless=False` path unchanged).

(2) The pod has no SSH private key, only an `authorized_keys` for inbound connections from the user's laptop. As a result, `git push` from inside the Cursor / VS Code Remote-SSH session in the pod fails with "Permission denied (publickey)". sshd inside the workspace image already has `AllowAgentForwarding yes` baked in via `osmo/workspace/sshd_config`; the missing piece is purely on the Mac side. Update the `~/.ssh/config` block in the tutorial to include `ForwardAgent yes` (so the local agent's keys are exposed in the pod), `AddKeysToAgent yes` (auto-load on first push), and `UseKeychain yes` (macOS-only Keychain unlock without passphrase prompts; ignored on Linux). Adds an `ssh-add -l` smoke-test note.

Co-authored-by: Cursor <[email protected]>
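The resulting ~/.ssh/config entry might look like the following — a sketch, not the tutorial's exact block; the `Host` alias and `User` are illustrative, while port 2200 matches the localhost forward described above:

```
# Illustrative ~/.ssh/config entry for the OSMO pod (alias and user assumed).
Host airstack-osmo
    HostName localhost
    Port 2200
    User root
    ForwardAgent yes      # expose the laptop agent's keys inside the pod
    AddKeysToAgent yes    # load the key into the agent on first use
    UseKeychain yes       # macOS-only Keychain unlock
```

Before connecting, `ssh-add -l` on the laptop should list at least one key; an empty agent forwards nothing, and in-pod `git push` will still fail.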
…auth-debug path

osmo:setup hit two failure modes that wasted a debug session each:

- `osmo credential set` is not an upsert for GENERIC creds — re-running setup (e.g. to rotate a Nucleus API token) failed with `400 duplicate key value violates unique constraint "credential_pkey"` and bailed before reaching the airlab-nucleus credential. Delete-then-set each credential so re-running is idempotent.
- Bracket-paste mode and cross-OS clipboards routinely smuggle invisible bytes around long pastes. Nucleus's auth endpoint silently DENIES a token with one extra trailing byte, with no actionable error from the client side. _osmo_prompt now strips leading/trailing whitespace and CR/NUL bytes via a new _osmo_trim helper, and warns when bytes were stripped. cmd_osmo_setup additionally JWT-shape-checks the Nucleus token (three dot-separated segments starting with eyJ) before submitting it, so a wrong paste fails at setup time instead of silently DENIED at pod boot.

Also documents how to debug the "Login Required: Unable to connect server omniverse://airlab-nucleus..." popup: SSH into the Nucleus host and tail base_stack-nucleus-auth-1 for `InternalCredentials.auth status: DENIED`.

Adds a "Nucleus connectivity from OSMO" section to the admin README clarifying that Nucleus over HTTPS uses a single port 443 (no need to open the native 3009-3180 range from the OSMO cluster), per NVIDIA's TLS docs.

Co-authored-by: Cursor <[email protected]>
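The sanitizing can be sketched in pure bash — the function names mirror the commit's `_osmo_trim` and the JWT shape check, but these bodies are illustrative, not the actual module code:

```shell
# Strip CR bytes plus leading/trailing whitespace that bracket-paste mode and
# cross-OS clipboards smuggle around long pastes.
_osmo_trim() {
  local s="$1"
  s="${s//$'\r'/}"                      # CR from Windows-style clipboards
  s="${s#"${s%%[![:space:]]*}"}"        # leading whitespace
  s="${s%"${s##*[![:space:]]}"}"        # trailing whitespace (incl. newlines)
  printf '%s' "$s"
}

# A JWT is three dot-separated base64url segments; the header always starts
# with eyJ (base64 of '{"'), so a cheap shape check catches wrong pastes.
_osmo_jwt_shaped() {
  [[ "$1" =~ ^eyJ[A-Za-z0-9_-]*\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$ ]]
}
```

Failing the shape check at setup time converts an opaque DENIED at pod boot into an immediate, actionable error.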
…compose parser
The OSMO entrypoint was writing OMNI_USER=<andrew_id> alongside an API
token JWT in OMNI_PASS, which routes the JWT through the password-
verification path. Nucleus silently DENIES — visible only in
base_stack-nucleus-auth-1 as `InternalCredentials.auth … 'username':
'<andrew>' … status: DENIED` (no Tokens.auth_with_api_token call). Kit
then pops "Login Required: Unable to connect server omniverse://...".
omniclient expects the literal sentinel username `$omni-api-token` paired
with the JWT as the password. The entrypoint now detects a JWT-shaped
OMNI_PASS (header starts with `eyJ`) and emits OMNI_USER=$$omni-api-token
into omni_pass.env. The `$$` is intentional: docker-compose v2
interpolates env_file values, and a single `$` would be eaten by the
parser (`OMNI_USER=$omni-api-token` becomes `OMNI_USER=-api-token` after
${omni}- expansion to empty). The container ultimately sees
OMNI_USER=$omni-api-token, which is the correct sentinel.
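The detection-plus-escaping step can be sketched as follows — a minimal sketch of the idea, not the actual entrypoint; the file path and the example token value are illustrative:

```shell
# If OMNI_PASS looks like a JWT (header starts with eyJ), emit the sentinel
# username with a doubled '$': docker-compose v2 interpolates env_file values,
# and '$$' collapses to one literal '$', so the container ultimately sees
# OMNI_USER=$omni-api-token.
OMNI_PASS='eyJexampleHeader.examplePayload.exampleSig'   # illustrative
env_file=/tmp/omni_pass.env
if [[ "$OMNI_PASS" == eyJ* ]]; then
  printf 'OMNI_USER=$$omni-api-token\n' > "$env_file"
else
  printf 'OMNI_USER=%s\n' "${OMNI_USER:-}" > "$env_file"
fi
cat "$env_file"    # the file on disk holds the doubled '$$' form
```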
Also note for the next debugger: `docker compose restart` does NOT
re-read env_file. Use `docker compose up -d <svc>` to recreate the
container after editing omni_pass.env.
Updates omni_pass_TEMPLATE.env header to document the API-token pattern
explicitly (with the $$ caveat), and adds a troubleshooting row that
distinguishes "wrong auth path" (DENIED with no Tokens.auth_with_api_token
call) from "bad/expired token" (Tokens.auth_with_api_token: DENIED).
Co-authored-by: Cursor <[email protected]>
… flow
Reposition the OSMO tutorial as AirStack's recommended day-to-day
development path (not just a fallback for laptops without GPUs) and
collapse it onto a single recipe: clone the repo, then drive everything
through the airstack osmo:* wrappers in .airstack/modules/osmo.sh.
- docs/tutorials/airstack_on_osmo.md
- Retitle + rewrite the intro to lead with five concrete advantages
(pooled GPUs, no local CUDA/Docker/driver maintenance, same image as
CI + field robots, one-command onboarding, hardware bigger than your
laptop). Demote the Linux+GPU-desktop path to an escape hatch.
- Drop the Mac/Windows/no-GPU framing in 'Who is this for?' and the
mermaid laptop subgraph label.
- Add 'a local clone of AirStack' to Prerequisites; remove it from the
'do not need' list.
- Replace Option A/B credential split with a single
./airstack.sh osmo:setup recipe; move the three raw osmo credential
set calls into a collapsible 'Under the hood' footnote.
- Replace each step's raw osmo workflow ... command with the
corresponding airstack osmo:up/logs/ide/webrtc/foxglove/down wrapper;
preserve the raw form in 'Under the hood' footnotes that cross-link
cmd_osmo_* in .airstack/modules/osmo.sh.
- Drop the export WF=... paragraph — the wrappers read the id from
~/.airstack/osmo-state automatically; AIRSTACK_OSMO_WF overrides
per-invocation. \$WF now only appears inside the raw-form footnotes.
- Sweep Troubleshooting + What-survives tables: redirect raw
port-forward fixes to the airstack osmo:* equivalents and rename the
section to 'What survives airstack osmo:down?'.
- Fix WebRTC edge label (49100/tcp + 49099/udp) to match the pinned
ports the workflow actually uses today.
Companion cleanups now that the privileged_allowed flip is automatic on
the OSMO autosync side (synchronize_osmo_team_pools.py forces
privileged_allowed: true on every platform of every pool, so students
never see the 'platform does not have privileged flag enabled' error):
- osmo/README.md: drop the 'Most common blocker' privileged warning, the
privileged_allowed row from the pool-requirements table, and the
'privileged GPU pod' / '(privileged, GPU)' descriptors in the
architecture summary. Simplify the validation-stage SSH-failure hint.
- osmo/workflows/airstack-dev.yaml: trim the long DinD-requires-privileged
comment to a one-liner (the privileged: true directive itself stays).
- .airstack/modules/osmo.sh: remove the special-case 'privileged flag
enabled' error branch in cmd_osmo_up — it should never fire now.
Co-authored-by: Cursor <[email protected]>
osmo:logs was silent because cmd_osmo_logs wrapped `osmo workflow logs` in `$( ... )` on the assumption that `-n LAST_N_LINES` exits after dumping the tail. Empirically the CLI keeps the stream open as new lines arrive (it already behaves like `tail -f`, despite --help advertising only -n), so the command substitution waited forever and printed nothing. Drop the polling loop and just exec the command directly.

Each fresh OSMO pod also ships a new sshd host key, so every osmo:up trips StrictHostKeyChecking against the previous workflow's fingerprint, and SSH/Cursor abort with "Host key for [localhost]:2200 has changed". Switch the recommended ~/.ssh/config block (and osmo/README.md) to the ephemeral-host pattern (StrictHostKeyChecking no + UserKnownHostsFile /dev/null + LogLevel ERROR), and have cmd_osmo_ide `ssh-keygen -R` the stale loopback entry on every run so users on the old config get unblocked automatically.

Co-authored-by: Cursor <[email protected]>
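The stale-key scrub can be sketched as below — the `-f` flag targeting a scratch file is added here purely so the example is self-contained; the real cmd_osmo_ide presumably operates on the default known_hosts:

```shell
# Remove any previous pod's fingerprint for the loopback forward; -R deletes
# every entry matching the host, and "not found" noise is ignored.
scrub_stale_host_key() {
  local known_hosts="$1"
  ssh-keygen -R '[localhost]:2200' -f "$known_hosts" >/dev/null 2>&1 || true
}

kh=$(mktemp)
printf '[localhost]:2200 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFakeFakeFakeFakeFakeFakeFakeFakeFakeFake stale-pod\n' > "$kh"
scrub_stale_host_key "$kh"
```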
…workflow dies
The pod's entrypoint clones AirStack fresh from GitHub on every workflow
start (the pod fs is ephemeral). It defaulted to `main`, so any developer
testing branch-only OSMO changes silently ran their pod against stale
`main` code — most visibly: COMPOSE_PROFILES=desktop,isaac-sim-livestream
resolved to "desktop" alone on `main` because the isaac-sim-livestream
service only exists on the feature branch, so isaac-sim never came up
and `airstack osmo:webrtc` showed a blank stream.
- cmd_osmo_up now defaults --branch to the local repo's current
branch (git rev-parse --abbrev-ref HEAD). Detached HEAD or
non-git checkouts fall back to `main` cleanly. Pass --branch
explicitly to override.
- New _osmo_check_branch_pushed warns up-front when the about-to-
submit branch has no upstream, is ahead of origin, or has an
uncommitted working tree. The pod doesn't see your laptop's edits.
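The branch-default logic can be sketched as a small helper — the function name is illustrative; the fallback behavior matches the commit's description:

```shell
# Prefer the local checkout's branch; `git rev-parse --abbrev-ref HEAD`
# prints "HEAD" on a detached HEAD and fails outside a git repo, and both
# cases fall back to main.
default_branch() {
  local b
  b=$(git rev-parse --abbrev-ref HEAD 2>/dev/null) || b=HEAD
  if [[ -z "$b" || "$b" == HEAD ]]; then
    b=main
  fi
  printf '%s' "$b"
}
```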
Separately, when an OSMO workflow gets canceled mid-flight (osmo:down
in another shell, or OSMO timing it out), the in-flight port-forward
and logs streams raise OSMOUserError("Workflow X is not running!")
from inside an asyncio Task. The CLI prints "Task exception was never
retrieved" + a multi-line Traceback that buries the actual one-line
cause. New _osmo_pf_filter awk script collapses that into a single
[ERROR] line pointing at `airstack osmo:up`. Wired into webrtc,
foxglove, and logs. webrtc also gains a cleanup trap that kills the
backgrounded UDP port-forward on EXIT/INT/TERM so we don't leak it
against a dead workflow.
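The collapse-the-traceback idea can be sketched as an awk filter — the patterns and the error wording here are illustrative, not the actual _osmo_pf_filter:

```shell
# Turn the multi-line "Task exception was never retrieved" + Traceback noise
# from a dead workflow into a single actionable [ERROR] line.
osmo_pf_filter() {
  awk '
    /OSMOUserError.*is not running/ {
      print "[ERROR] workflow is gone - run: airstack osmo:up"; skip = 1; next
    }
    /Task exception was never retrieved|^Traceback/ { skip = 1; next }
    skip && /^[[:space:]]/ { next }   # swallow indented traceback frames
    { skip = 0; print }
  '
}
```

Piping a port-forward's stderr through the filter keeps normal lines untouched and replaces the buried one-line cause with a pointer at the fix.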
Tutorial Step 2 documents the new --branch default and the
"pod-clones-from-GitHub-not-your-laptop" gotcha.
Co-authored-by: Cursor <[email protected]>
dockerd's defaults of --max-concurrent-downloads=3 / --max-concurrent-uploads=5 cap a fresh airstack-dev pod's image pull at ~300 MiB/s against the airlab-backup-10g registry — single-stream TLS tops out around 300-500 MiB/s per core, and three parallel streams of unevenly sized blobs serialize down to that ceiling. Ceph (1014 TiB, 92 OSDs, SSD pools) and 10 GbE both have far more headroom than that. Bump to 10/10 to overlap enough blob downloads to saturate the pipe. Threaded through the DOCKERD_MAX_DOWNLOADS / DOCKERD_MAX_UPLOADS env vars so a pool can be tuned at submit time without rebuilding the workspace image.

The workspace image needs a rebuild + push for this to take effect:

    cd osmo/workspace
    docker build -t airlab-docker.andrew.cmu.edu/airstack/airstack-osmo-workspace:latest .
    docker push airlab-docker.andrew.cmu.edu/airstack/airstack-osmo-workspace:latest

Co-authored-by: Cursor <[email protected]>
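The env-var plumbing might look like this — a minimal sketch under the assumption that the entrypoint renders a daemon.json for the inner dockerd (the file path is illustrative; the two config keys are real dockerd options):

```shell
# Render the dockerd concurrency knobs from env vars, defaulting to the new
# 10/10, so a pool can tune them at submit time without an image rebuild.
: "${DOCKERD_MAX_DOWNLOADS:=10}"
: "${DOCKERD_MAX_UPLOADS:=10}"
cat > /tmp/daemon.json <<EOF
{
  "max-concurrent-downloads": ${DOCKERD_MAX_DOWNLOADS},
  "max-concurrent-uploads": ${DOCKERD_MAX_UPLOADS}
}
EOF
# The inner dockerd would then be started with --config-file /tmp/daemon.json
```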
A plain `docker build && docker push` on an Apple Silicon Mac silently produces a linux/arm64-only `latest` manifest. OSMO workers are amd64, so every subsequent workflow fails at the outer pod-image pull with "no match for platform in manifest" before the entrypoint even runs — a confusing failure mode whose root cause lives entirely in the push, not in the workflow yaml or the entrypoint.

Switch the README and the Dockerfile docstring to the buildx form, explain the why, and document the post-push manifest check.

Co-authored-by: Cursor <[email protected]>
The OSMO pod's `/` is itself a containerd overlay snapshot, and Linux refuses to stack a second overlayfs on top of an overlay rootfs — which is why the inner dockerd was falling through to fuse-overlayfs. That costs a kernel↔userspace FUSE round-trip on every `creat()` during layer extraction, which murders throughput on apt/pip/ROS layers (measured: 32-50 MB/s for small-file-heavy layers vs 480 MB/s for big-file layers in the same pull).

Pointing dockerd at /osmo/run/docker (the kubelet emptyDir backed by ext4 on /dev/vda3) lets the existing overlay2-first fallback chain actually succeed on its first try, restoring kernel-overlay extraction performance. emptyDir lifetime matches the workflow lifetime, so the docker layer cache gets the right scope automatically. Falls back to /var/lib/docker if /osmo/run isn't present so the image still works in non-OSMO test contexts.

Co-authored-by: Cursor <[email protected]>
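The data-root selection reduces to a tiny helper — a sketch of the described behavior, with an illustrative function name:

```shell
# Prefer the ext4-backed emptyDir so the inner dockerd's overlay2-first
# fallback chain succeeds; otherwise use the stock location so the image
# still works outside OSMO.
pick_docker_data_root() {
  if [[ -d /osmo/run ]]; then
    printf '%s' /osmo/run/docker    # kubelet emptyDir on ext4
  else
    printf '%s' /var/lib/docker     # non-OSMO test contexts
  fi
}
# Entrypoint sketch: dockerd --data-root "$(pick_docker_data_root)"
```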
Summary
Adds a cloud-based development workflow for AirStack on NVIDIA OSMO, positioned as the recommended path for new contributors (no local GPU required, no local install of Isaac Sim / ROS / Docker).
What this adds
- `airstack osmo:*` CLI (`.airstack/modules/osmo.sh`, dispatched by `airstack.sh`) — `setup`, `up`, `ide`, `webrtc`, `foxglove`, `logs`, `exec`, etc. Auto-pins `--branch` to the local checkout, scrubs stale SSH host keys on each connect, gracefully handles port-forward / workflow-died errors.
- Workspace image (`osmo/workspace/`) — runs inside the OSMO pod with an inner dockerd, sshd for Remote-SSH, Foxglove/WebRTC port plumbing, and the AirStack repo cloned on demand. The inner dockerd writes to `/osmo/run/docker` (ext4 `emptyDir`) so the native `overlay2` driver is used instead of `fuse-overlayfs` — measured ~560 MiB/s peak Harbor pull / 1.4 GB/s sequential write inside the pod.
- Workflow definition (`osmo/workflows/airstack-dev.yaml`) with privileged-capable platform defaults.
- Docs (`docs/tutorials/airstack_on_osmo.md`, `osmo/README.md`) — single `git clone` + `airstack setup` + `airstack osmo:up` flow; documents Nucleus API-token auth, SSH agent forwarding for in-pod `git push`, and the `docker buildx build --platform linux/amd64 --push` requirement for the workspace image.
- Linked from `docs/getting_started/index.md`.
- Isaac Sim WebRTC media port pinned to `49099` so the WebRTC stream actually renders pixels; example Pegasus launch script tweaks for headless/livestream consistency.

Why
Lets anyone with an OSMO account spin up a full AirStack dev environment in one command, attach VS Code/Cursor via Remote-SSH, and run Isaac Sim + the autonomy stack on shared cluster GPUs. Same image is usable on a developer's laptop and on the cluster.
Commits (oldest first)
Pre-merge checklist for the author
Test plan
Made with Cursor