
Cloud infra 1: inverse handshake and instance IDs #111

Merged
ccuetom merged 6 commits into master from feat/inverse-hanshake
May 27, 2026

bgrey001 commented May 21, 2026

This PR has two main features:

  1. Inverts the node registration model so the monitor no longer has to spawn nodes. Instead, the monitor starts in dynamic mode (--dynamic) and nodes phone home (--phone-home) by reading the monitor's address from environment variables. This enables cloud deployments where pods are scheduled independently by Kubernetes/Argo.

  2. Each node generates a per-boot instance ID (8-char hex) appended to every UID (node:0:a3f7b2c1, worker:0:0:a3f7b2c1, warehouse:0:a3f7b2c1). A node, its workers, and its warehouse share the same instance ID. This eliminates UID collisions when a pod is replaced at the same index.
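
As a rough illustration of the two features above, here is a minimal sketch of a node reading the monitor's address from the environment and generating a per-boot instance ID. It is not the code in this PR: `MonitorEndpoint`, `read_monitor_endpoint` and `new_instance_id` are hypothetical names, and the variable names `MONITOR_HOST`, `MONITOR_PORT` and `PUBSUB_PORT` are the ones listed in this description, which may differ from the final ones after the rename.

```python
import os
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorEndpoint:
    """Where a phoning-home node expects to find the monitor (hypothetical type)."""
    host: str
    port: int
    pubsub_port: int


def read_monitor_endpoint(env=None):
    """Read the monitor address from environment variables.

    Assumption: the variable names follow this PR description; the real
    names may differ.
    """
    env = os.environ if env is None else env
    required = ("MONITOR_HOST", "MONITOR_PORT", "PUBSUB_PORT")
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return MonitorEndpoint(
        host=env["MONITOR_HOST"],
        port=int(env["MONITOR_PORT"]),
        pubsub_port=int(env["PUBSUB_PORT"]),
    )


def new_instance_id():
    """Return a fresh 8-character hex string (4 random bytes)."""
    return secrets.token_hex(4)


# One instance ID per boot, shared by the node, its workers, and its warehouse.
instance_id = new_instance_id()
node_uid = f"node:0:{instance_id}"
worker_uid = f"worker:0:0:{instance_id}"
warehouse_uid = f"warehouse:0:{instance_id}"
```

Because the ID is regenerated on every boot, a replacement pod at index 0 gets a UID that differs from the one it replaces, so stale entries keyed by the old UID cannot be confused with the new node.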

Changes:

  • UID scheme: BaseRPC parses instance IDs via _uid_override, backwards compatible with legacy UIDs
  • Phone-home mode: nodes read MONITOR_HOST/MONITOR_PORT/PUBSUB_PORT from env vars
  • Dynamic mode: Monitor.init_dynamic() waits for nodes instead of spawning them. register_node() RPC accepts new nodes and starts heartbeat monitoring
  • Comms: address auto-detection via UDP probe (for K8s where 0.0.0.0 isn't routable), reconnection on address change, handshake buffering to prevent lost RPCs during connection setup
  • Warehouse routing: _warehouse_uid_for_node() derives warehouse UID from node UID. WarehouseObject derives node UID from worker UID with instance ID
  • Strategy: RoundRobin.update_node() evicts stale workers when a replacement node joins
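
The UID scheme and warehouse routing items above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the function names are hypothetical stand-ins for `_uid_override` parsing and `_warehouse_uid_for_node()`, the instance ID is assumed to be exactly 8 lowercase hex characters, and a worker UID is assumed to have the form `worker:<node index>:<worker index>`.

```python
import re

# Assumption: an instance ID is exactly 8 lowercase hex characters.
_INSTANCE_ID = re.compile(r"^[0-9a-f]{8}$")


def parse_uid(uid):
    """Split a UID into (name, indices, instance_id).

    instance_id is None for legacy UIDs such as "node:0".
    """
    name, *parts = uid.split(":")
    instance_id = None
    # A trailing 8-char hex part is treated as the instance ID. A legacy
    # index with exactly 8 digits would be misread by this simple rule.
    if parts and _INSTANCE_ID.match(parts[-1]):
        instance_id = parts.pop()
    indices = tuple(int(part) for part in parts)
    return name, indices, instance_id


def join_uid(name, indices, instance_id=None):
    """Rebuild a UID string; the instance ID is appended only when present."""
    parts = [name, *(str(index) for index in indices)]
    if instance_id is not None:
        parts.append(instance_id)
    return ":".join(parts)


def warehouse_uid_for_node(node_uid):
    """Derive the warehouse UID that shares a node's index and instance ID."""
    _, indices, instance_id = parse_uid(node_uid)
    return join_uid("warehouse", indices, instance_id)


def node_uid_for_worker(worker_uid):
    """Derive the owning node's UID from a worker UID."""
    _, indices, instance_id = parse_uid(worker_uid)
    # Assumption: the first index of a worker UID is the index of its node.
    return join_uid("node", indices[:1], instance_id)
```

With this rule, `node:0` and `node:0:a3f7b2c1` both parse, and the derived warehouse UID carries an instance ID only when the node UID does.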

Changes based on feedback:

  • Unified --dynamic and --phone-home into a single --dynamic flag.
  • Renamed env vars
  • Added BaseRPC.instance_id kwarg + _build_uid static helper.
  • Extracted discover_routable_address() shared by InboundConnection.address and Publication.address, restored main branch hostname-first order while keeping the 0.0.0.0 K8s auto-detect.
  • Added _ensure_warehouse_proxy() helper to de-duplicate the per-node warehouse proxy refresh.
  • Dropped Node._warehouse_uid reads from inherited self._local_warehouse.uid.
  • Hoisted start_worker closure values into locals so the subprocess no longer pickles self.
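
For readers unfamiliar with the UDP probe mentioned for `discover_routable_address()`, here is a minimal sketch of the general technique. It covers only the `0.0.0.0` auto-detect step and omits the hostname-first lookup; the signature, the probe target and the loopback fallback are assumptions, not the project's code.

```python
import socket


def discover_routable_address(bind_address, probe_target=("8.8.8.8", 80)):
    """Return an address peers can reach, given the address we bound to.

    Hypothetical signature: the real helper in this PR may look different.
    """
    if bind_address not in ("0.0.0.0", ""):
        # A concrete address was configured, so use it unchanged.
        return bind_address
    # Connecting a UDP socket sends no packets; it only asks the OS which
    # local interface it would route through to reach probe_target.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        try:
            probe.connect(probe_target)
            return probe.getsockname()[0]
        except OSError:
            # No route available (for example, no network): use loopback.
            return "127.0.0.1"
```

In a pod, the address returned this way is typically the pod IP, which other pods can reach, whereas `0.0.0.0` is only meaningful as a local bind address.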

Still open:

  • Head.wait_for_workers partial-acceptance / cloud timeout (potentially num_workers ideal + min_workers floor)
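
One possible shape for the open item above, purely as a discussion aid: wait for the ideal `num_workers`, but accept at least `min_workers` once a timeout expires. Nothing here is implemented in this PR; the function signature, the `count_workers` callable and the polling loop are all assumptions.

```python
import time


def wait_for_workers(count_workers, num_workers, min_workers, timeout, poll_interval=0.1):
    """Block until num_workers have joined, or accept >= min_workers at timeout.

    count_workers is a hypothetical callable returning how many workers
    have registered so far.
    """
    deadline = time.monotonic() + timeout
    while True:
        joined = count_workers()
        if joined >= num_workers:
            # Ideal case: every requested worker is available.
            return joined
        if time.monotonic() >= deadline:
            if joined >= min_workers:
                # Partial acceptance: proceed with what we have.
                return joined
            raise TimeoutError(
                f"only {joined} of the required {min_workers} workers joined within {timeout}s"
            )
        time.sleep(poll_interval)
```

For example, `wait_for_workers(lambda: 2, num_workers=4, min_workers=2, timeout=0.05)` returns 2 after the timeout instead of blocking indefinitely.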

@bgrey001 bgrey001 requested a review from ccuetom May 21, 2026 11:57
Comment thread mosaic/cli/mrun.py Outdated
Comment thread mosaic/comms/comms.py
Comment thread mosaic/comms/comms.py Outdated
Comment thread mosaic/runtime/head.py
Comment thread mosaic/runtime/monitor.py Outdated
Comment thread mosaic/runtime/runtime.py Outdated
@@ -55,14 +62,18 @@ class BaseRPC:
"""

def __init__(self, name=None, indices=(), uid=None):

Contributor

To include the additional instance ID, you are currently building the UID string externally and passing it in. Why not add a keyword argument to this init that takes the instance ID and appends it to the UID? That might be a bit cleaner.

Author

I've added a _build_uid static helper to BaseRPC, so we have a centralised way of building UIDs from name, indices and an optional instance_id. I think that's the cleanest approach.
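
For context, the shape being discussed in this thread might look roughly like the sketch below. `BaseRPCSketch` is an illustrative stand-in, not the merged `BaseRPC`; only the `name`, `indices` and `uid` arguments come from the diff hunk, and the precedence of an explicit `uid` over the built one is an assumption.

```python
class BaseRPCSketch:
    """Illustrative stand-in for BaseRPC; not the project's class."""

    def __init__(self, name=None, indices=(), uid=None, instance_id=None):
        if uid is not None:
            # Assumption: an explicitly passed UID takes precedence.
            self.uid = uid
        else:
            self.uid = self._build_uid(name, indices, instance_id)

    @staticmethod
    def _build_uid(name, indices=(), instance_id=None):
        """Join name, indices and the optional instance ID with ':'."""
        if name is None:
            raise ValueError("name is required when no uid is given")
        parts = [name, *(str(index) for index in indices)]
        if instance_id is not None:
            parts.append(instance_id)
        return ":".join(parts)
```

Callers that omit `instance_id` get the legacy form (`node:0`), while callers that pass it get the extended form (`node:0:a3f7b2c1`), so both code paths share one formatter.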

Comment thread mosaic/runtime/runtime.py Outdated
Comment thread mosaic/runtime/node.py Outdated
Comment thread mosaic/runtime/warehouse.py Outdated
Comment thread mosaic/__init__.py
Comment thread mosaic/runtime/node.py
@ccuetom ccuetom dismissed their stale review May 27, 2026 10:49

Comment resolved

@ccuetom ccuetom merged commit 1e6b7ae into master May 27, 2026
3 checks passed
@ccuetom ccuetom deleted the feat/inverse-hanshake branch May 27, 2026 10:50

2 participants