Cloud infra 1: inverse handshake and instance IDs #111
Merged
ccuetom
requested changes
May 22, 2026
@@ -55,14 +62,18 @@ class BaseRPC:
    """

    def __init__(self, name=None, indices=(), uid=None):
Contributor
To include the additional instance ID, you are currently building the UID string externally and passing it in. Why not add a keyword argument to this init that takes the instance ID and appends it to the UID? That might be a bit cleaner.
Author
I've put a build_uid helper into BaseRPC so we have a centralised way of building the UIDs from name, indices and an optional instance_id. I think it's the cleanest approach.
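For readers following the thread, here is a minimal sketch of what such a helper could look like. It is not the code from this PR: the function names, the signature and the use of secrets.token_hex are assumptions, chosen only to match the UID shapes quoted in the description (name, then indices, then an optional 8-char hex instance ID, joined by colons).

```python
import secrets


def new_instance_id():
    """Return a fresh 8-character hex string, generated once per boot."""
    # 4 random bytes are rendered as 8 lowercase hex characters.
    return secrets.token_hex(4)


def build_uid(name, indices=(), instance_id=None):
    """Join name, indices and an optional instance ID with ':'.

    Hypothetical stand-in for the helper discussed above; the real
    signature in BaseRPC may differ.
    """
    parts = [name, *(str(i) for i in indices)]
    if instance_id is not None:
        parts.append(instance_id)
    return ":".join(parts)


# A node and its worker share one instance ID for the lifetime of the boot.
instance_id = new_instance_id()
node_uid = build_uid("node", (0,), instance_id)
worker_uid = build_uid("worker", (0, 0), instance_id)

# Omitting the instance ID yields the legacy UID form.
legacy_uid = build_uid("node", (0,))
```

Keeping the instance ID as an optional trailing component means callers that never pass it keep producing the legacy UID format unchanged.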
ccuetom
previously requested changes
May 27, 2026
ccuetom
approved these changes
May 27, 2026
This PR has two main features:

- Inverts the node registration model so the monitor no longer has to spawn nodes. Instead, the monitor starts in dynamic mode (--dynamic) and nodes phone home (--phone-home) by reading the monitor's address from environment variables. This enables cloud deployments where pods are scheduled independently by Kubernetes/Argo.
- Each node generates a per-boot instance ID (8-char hex) appended to every UID (node:0:a3f7b2c1, worker:0:0:a3f7b2c1, warehouse:0:a3f7b2c1). A node, its workers, and its warehouse share the same instance ID. This eliminates UID collisions when a pod is replaced at the same index.

Changes:

- BaseRPC parses instance IDs via _uid_override, backwards compatible with legacy UIDs
- Monitor.init_dynamic() waits for nodes instead of spawning them. register_node() RPC accepts new nodes and starts heartbeat monitoring
- _warehouse_uid_for_node() derives warehouse UID from node UID. WarehouseObject derives node UID from worker UID with instance ID
- RoundRobin.update_node() evicts stale workers when a replacement node joins

Changes based on feedback:

- Merged --dynamic and --phone-home into a single --dynamic flag.
- BaseRPC.instance_id kwarg + _build_uid static helper.
- discover_routable_address() shared by InboundConnection.address and Publication.address, restored main branch hostname-first order while keeping the 0.0.0.0 K8s auto-detect.
- _ensure_warehouse_proxy() helper to de-duplicate the per-node warehouse proxy refresh.
- Node._warehouse_uid reads from inherited self._local_warehouse.uid.
- Moved start_worker closure values into locals so the subprocess no longer pickles self.

Still open:

- Head.wait_for_workers partial-acceptance / cloud timeout (potentially num_workers ideal + min_workers floor)
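The UID derivations listed under Changes can be illustrated with a small standalone sketch. It is an assumption-laden illustration, not the PR's implementation: parse_uid, warehouse_uid_for_node and node_uid_for_worker are hypothetical free functions, a trailing 8-char lowercase hex component is taken to be the instance ID, and the first worker index is assumed to be the index of the owning node.

```python
import re

# Assumption: instance IDs are exactly 8 lowercase hex characters.
_INSTANCE_ID = re.compile(r"[0-9a-f]{8}")


def parse_uid(uid):
    """Split a UID into (name, indices, instance_id or None)."""
    name, *rest = uid.split(":")
    instance_id = None
    # A legacy UID has no trailing instance ID, only integer indices.
    # Caveat: an 8-digit decimal index would be misread as an instance ID.
    if rest and _INSTANCE_ID.fullmatch(rest[-1]):
        instance_id = rest.pop()
    return name, tuple(int(i) for i in rest), instance_id


def _join(name, indices, instance_id):
    parts = [name, *(str(i) for i in indices)]
    if instance_id is not None:
        parts.append(instance_id)
    return ":".join(parts)


def warehouse_uid_for_node(node_uid):
    """Derive the warehouse UID that shares a node's index and instance ID."""
    _, indices, instance_id = parse_uid(node_uid)
    return _join("warehouse", indices, instance_id)


def node_uid_for_worker(worker_uid):
    """Derive the owning node's UID from a worker UID."""
    # Assumption: the first worker index is the node index.
    _, indices, instance_id = parse_uid(worker_uid)
    return _join("node", indices[:1], instance_id)


node = node_uid_for_worker("worker:0:0:a3f7b2c1")
warehouse = warehouse_uid_for_node(node)
legacy_warehouse = warehouse_uid_for_node("node:0")
```

Because the instance ID travels with every derived UID, a replacement pod at the same index produces different node and warehouse UIDs than its predecessor, which is what lets stale entries be told apart from live ones.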