
Nemo RL integration #5

Open
taoluo wants to merge 110 commits into main from nemo

Conversation

@taoluo (Contributor)

@taoluo taoluo commented Apr 13, 2026

Addresses #4.

The nemo-rl-side changes are on the fork https://github.com/rlops/RL

taoluo and others added 3 commits April 9, 2026 23:17
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Rewrite Feature 5+6 (merged): two-path weight refresh model
- Active refresh: training loop syncs active non-overlap ranks in-flight
  via coordinator.sync_base_weights_to_active() → ModelUpdateService
- Expand sync: scheduler syncs woken overlap ranks via _expand_workers()
- Both paths share single-copy CPU bucket cache, same transport
- Version = _cache_ready_step (no double-bump)
- Transition window tolerated, not eliminated (bounded by in-flight
  requests at push time)
- Actor call graph: coordinator calls ModelUpdateService directly,
  no re-entrant pipeline self-call (follows sync_lora_weights pattern)
- Control-plane invariant: GPU release signals weight consistency
- Receiver-side: 6 target-worker methods (5 transport + finalize)
- Per-bucket apply, finalize once (process_weights_after_loading)
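The two-path, single-version bookkeeping above can be sketched as follows. All names here (`WeightRefreshCoordinator`, `build_cache`, `_push`) are illustrative stand-ins mirroring the commit text (`_cache_ready_step`, `sync_base_weights_to_active`, `_expand_workers`), not the actual rlix code; tensors are stood in by plain objects:

```python
import threading

class WeightRefreshCoordinator:
    """Both refresh paths read one CPU bucket cache and one version counter."""

    def __init__(self):
        self._cache_lock = threading.Lock()
        self._cpu_bucket_cache = {}   # param_name -> CPU tensor (stand-in: any object)
        self._cache_ready_step = -1   # single source of truth for the weight version

    def build_cache(self, step, named_weights):
        # Called once per training step; version is the step itself (no double-bump).
        with self._cache_lock:
            self._cpu_bucket_cache = dict(named_weights)
            self._cache_ready_step = step

    def sync_base_weights_to_active(self, active_ranks):
        # Active-refresh path: push to in-flight non-overlap ranks.
        return self._push(active_ranks)

    def expand_workers(self, woken_ranks):
        # Expand path: push to freshly woken overlap ranks; same cache, same version.
        return self._push(woken_ranks)

    def _push(self, ranks):
        with self._cache_lock:
            version = self._cache_ready_step
            payload = dict(self._cpu_bucket_cache)
        return {rank: version for rank in ranks}, payload
```

The point of the sketch is the invariant, not the transport: an expand after an active refresh reports the same version, because neither path bumps anything — only `build_cache` does.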

Polish other features:
- F4: dynamic NCCL group lifecycle, comm_plan dual-path mask,
  cache safety 4th invariant (_cache_lock)
- F7: soften namespace claim for anonymous child actors
- F8: registration pseudocode matches actual RLix API signatures
- F9: progress metric uses count_intended_for_step (not age-window),
  local counter gated on target_weight_version == progress_target_step
- F11: single NCCL teardown contract, no overclaim about
  destroy_model_parallel()
- F12: RLixVirtualClusterAdapter instead of "don't use RayVirtualCluster"
- Gate 3: updated to match tolerated transition window model

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add idempotency guards (F1), routing lock + post-offload VRAM assertion
(F2), dispatch-under-lock atomicity (F3), bucket format spec + full
cache_lock critical section (F4), batch-begin/end progress lifecycle
(F9), and Gate 2.5 fallback rule for NCCL teardown (F11).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@taoluo taoluo changed the title from "Nemo" to "Nemo RL integration" on Apr 13, 2026
zhenyulincs and others added 26 commits April 13, 2026 21:52
…results

- Add run_rlix_experiment.py with scenarios A-F (Qwen FT, Dual FT,
  Multi-LoRA, FT+LoRA, LFM DeepSpeed single/dual)
- Add lfm_finetune_pipeline1/2.yaml for LFM2.5-350M DeepSpeed training;
  includes DS_BUILD_OPS=0 env var (partial fix for sm_120a)
- Update all pipeline yamls with VLLM_USE_FLASHINFER_SAMPLER=0 and other
  RTX 5090 / Blackwell compatibility fixes
- Add RLIX_EXPERIMENT.md: architecture docs, benchmark results (v20 run),
  step-by-step timing, 8 bugs documented with root causes and fixes

Results (4x RTX 5090, 2026-04-14 v20 run):
  A Single FT:         244s  2.4% util  PASS
  B Dual FT:           312s  3.9% util  PASS
  C Single Multi-LoRA: 367s  0.7% util  PASS
  D FT+Multi-LoRA:     434s  1.8% util  PASS
  E LFM Single FT:     105s  0.1% util  FAIL (fused_adam JIT, see Bug 8)
  F LFM Dual FT:       106s  0.0% util  FAIL (same)
ZeRO-2 without offload_optimizer causes ROLL to select FusedAdam, which
fails to JIT-compile on RTX 5090 (sm_120a / Blackwell). Adding
offload_optimizer.device=cpu makes is_offload() return True so ROLL
selects DeepSpeedCPUAdam instead. No ROLL source modification needed.
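The ZeRO-2 change above would look roughly like this in a DeepSpeed config. The `zero_optimization` / `offload_optimizer.device` keys are standard DeepSpeed config; the `is_offload()` helper is a stand-in for ROLL's internal check, not its real code:

```python
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # Without this block ROLL picks FusedAdam, which fails to JIT on
        # sm_120a; with device=cpu it selects DeepSpeedCPUAdam instead.
        "offload_optimizer": {"device": "cpu"},
    },
}

def is_offload(config):
    # Stand-in for the check ROLL's optimizer selection relies on.
    return (
        config.get("zero_optimization", {})
        .get("offload_optimizer", {})
        .get("device") == "cpu"
    )
```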
…on strategies

build_latest_bucket_cache and promote_active_checkpoint are Megatron-only.
DeepSpeed training workers raise RuntimeError for these calls. Catch and
skip gracefully so LFM/DeepSpeed pipelines can initialize correctly.
ROLL's ReloadableProcessGroup monkey-patch is incompatible with DeepSpeed's
process group initialization. Skip the offload_nccl=True enforcement in the
coordinator for deepspeed_* strategy clusters and set offload_nccl=false in
LFM pipeline yamls. NCCL buffer overhead is negligible for 350M models.
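The catch-and-skip behaviour described above can be sketched as a small guard (hypothetical helper name; only `build_latest_bucket_cache` and the `deepspeed_*` strategy prefix come from the commit text):

```python
def maybe_build_bucket_cache(worker, step, strategy):
    """build_latest_bucket_cache is Megatron-only; DeepSpeed workers raise
    RuntimeError, which we catch so LFM/DeepSpeed pipelines still initialize."""
    if strategy.startswith("deepspeed_"):
        # Also the cluster type for which offload_nccl enforcement is skipped.
        return False
    try:
        worker.build_latest_bucket_cache(step)
        return True
    except RuntimeError:
        # Backend does not implement the bucket cache; skip, don't crash.
        return False
```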
…gies

DeepSpeed does not implement _build_latest_bucket_cache so the bucket-cache
weight sync crashes with CUDA illegal memory access when expanding infer
workers. Use skip_load=True to route-only expand for deepspeed backends.
…ight sync

DeepSpeed does not implement bucket-cache weight sync. Restricting actor_infer
to the same GPUs as actor_train means all infer workers are IPC-accessible and
no NCCL cross-GPU sync is needed. Pipeline 1: infer on [0,1]; Pipeline 2:
infer on [2,3]. Reverts skip_load workaround.
…atron

DeepSpeed weight sync is incompatible with rlix bucket-cache mechanism.
E/F now use the same Qwen2.5-0.5B + Megatron config as A/B, providing
consistent coverage across all 6 scenarios. NeMo will be added as a
dedicated backend later.
- All 6 scenarios now pass on RTX 5090 Blackwell (sm_120a) after fixes
- Update scenarios E/F descriptions from LFM/DeepSpeed to Qwen2.5-0.5B/Megatron
- Add final v35/v37 results table showing all 6 scenarios PASS
- Preserve v20 historical results as reference
- Add Bugs 9-11: NCCL 2.26.2 Blackwell kernels, vLLM torch.dtype
  serialization, PyTorch 2.7.1 _coalescing_manager UnboundLocalError
Implements Feature 10 from plans/nemorl-port-plan.md as a standalone
validation function intended to be called by NemoRLFullFinetunePipeline
(Task 7) during pipeline initialization.

The function fails fast on invalid GPU topologies before RLix
registration, surfacing configuration errors at startup rather than
during rollout. All 6 assertions from plan Feature 10 are preserved
verbatim (logic and error messages).

- rlix/pipeline/nemo_rl_config_bridge.py: validation function
- tests/test_nemo_rl_config_bridge.py: 7 pytest cases (1 happy path +
  6 negative cases, one per assertion, each verified to trigger the
  intended assertion and not a prior one)

Scope note: only topology validation is implemented in this commit.
Config payload construction (Feature 8) and pipeline integration
(Task 7) are intentionally left for their respective tasks.
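A fail-fast topology check of this shape is sketched below. These are illustrative assertions only, not the six assertions from plans/nemorl-port-plan.md (which are not reproduced in this PR description); the function name and signature are hypothetical:

```python
def validate_gpu_topology(train_gpus, infer_gpus, world_size):
    """Raise at startup on an invalid topology, before RLix registration."""
    if not train_gpus:
        raise AssertionError("train device mapping is empty")
    if not infer_gpus:
        raise AssertionError("infer device mapping is empty")
    for g in set(train_gpus) | set(infer_gpus):
        if not 0 <= g < world_size:
            raise AssertionError(f"GPU id {g} outside world of size {world_size}")
    if len(set(train_gpus)) != len(train_gpus):
        raise AssertionError("duplicate GPU ids in train mapping")
```

Each negative test case then checks that exactly the intended assertion fires, as the commit message describes.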
Ports 4 modules from nemo-integration and adds BucketCacheLifecycle (new):

- bucket_cache.py: thread-safe CPUBucketCache keyed by (param_name, shard_id)
  with dirty tracking and PP-shard support
- bucket_receiver.py: BucketUpdateRequest/Result, PP-shard merge via torch.cat,
  fail-partial apply_bucket_update semantics
- model_update_service_cached.py: owns cache, populates from PP workers,
  dispatches dirty-bucket sync to inference workers
- bucket_cache_lifecycle.py: wraps ROLL's promote_active_checkpoint with
  _cache_ready_step version tracking; direct worker calls (no .remote())
  for testability

64 unit tests across 4 test files; all passing without Ray or GPU.

Fixes: worker calls use direct method (not .remote()) to be testable with
plain Python fakes — pipeline layer handles Ray .remote() externally.
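The cache API described above (and later aligned in the GPU tests: `store(name, shard_id=0, tensor=t)`, `get_dirty_buckets()` returning a `List[Bucket]`) can be sketched with plain Python fakes, exactly the testability property the commit aims for. Internals here are illustrative; the real module holds torch tensors:

```python
import threading
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Bucket:
    name: str
    shard_id: int
    tensor: Any  # torch.Tensor in the real module; any object in this sketch

class CPUBucketCache:
    """Thread-safe cache keyed by (param_name, shard_id) with dirty tracking."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buckets = {}
        self._dirty = set()

    def store(self, name, shard_id=0, tensor=None):
        with self._lock:
            key = (name, shard_id)
            self._buckets[key] = Bucket(name, shard_id, tensor)
            self._dirty.add(key)

    def get_dirty_buckets(self) -> List[Bucket]:
        with self._lock:
            return [self._buckets[k] for k in sorted(self._dirty)]

    def mark_clean(self):
        with self._lock:
            self._dirty.clear()
```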
…cationOp refactor

gpus_to_allocate/dp_ranks_to_add were renamed to dp_rank_to_gpus_to_add (Dict)
in the type definition; update the two affected tests to use the current API.
- external/NeMo: zhenyulincs/RL fork at rlix-task2 branch
- tests/integration/test_bucket_cache_gpu.py: 4 test classes using
  Qwen2.5-0.5B on real GPU:
  * TestGPUMemoryRelease: verifies >=90% VRAM freed after offload
  * TestWeightCorrectnessInCache: bit-for-bit match between GPU model
    and CPUBucketCache contents
  * TestBucketReceiverPush: weight correctness after apply_bucket_update
    to CPU and GPU targets
  * TestFullRoundTrip: end-to-end GPU→cache→offload→push→verify
- tests/integration/run_gpu_tests.sh: convenience deploy script
- store() takes keyword args: store(name, shard_id=0, tensor=t)
- get_dirty_buckets() returns List[Bucket] not dict
- BucketUpdateRequest.sync_id is str not int
- Remove manual Bucket construction — pass get_dirty_buckets() directly
Part 1 (test_gate2_5_nccl_destroy.py, torchrun --nproc-per-node=2):
- Megatron destroy_model_parallel() VRAM release ≥70% threshold
- 5-cycle destroy/re-init stability (no leak, allreduce works after)
- Stale process group raises after destroy (no silent corruption)
- Requires: megatron-core

Part 2 (test_gate2_5_selective_sync.py, torchrun --nproc-per-node=2):
- Dynamic NCCL group create/use/destroy per sync cycle
- rank 0 → rank 1 bucket broadcast from CPUBucketCache
- Bit-exact weight verification on receiver
- VRAM stable across 3 sync cycles

Part 3 (test_gate2_5_qwen_train_sync.py, torchrun --nproc-per-node=4):
- Real Qwen2.5-0.5B forward+backward on GPU 0,1 (TP=2 training)
- SHA256 hash snapshot of all weights before sync
- CPU bucket cache build on rank 0
- VRAM release ≥60% after model.cpu() + empty_cache
- Dynamic NCCL broadcast to GPU 2,3 (inference ranks)
- Bit-exact hash verification: training weights == received weights
- 2 full steps to verify stability
Root cause of 600s timeout: broadcast_object_list unreliable over
pure NCCL backend. New design uses same deterministic seed on both
ranks — no Python object broadcast needed. Also fixed new_group
ranks to [0, 1] (was incorrectly using [0, 2, 3] on 2-GPU setup).

Changes:
- Both ranks call make_weights(step=cycle) with same seed
- Dynamic group uses ranks=[SENDER, RECEIVER]=[0, 1]
- dist.barrier() immediately after init_process_group
- SHA256 hash verification on receiver for bit-exact check
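The deterministic-seed scheme above can be sketched as follows. The signature mirrors the commit's `make_weights(step=cycle)`; the seed derivation and use of `random` instead of torch are illustrative — the key property is that both ranks compute identical reference weights from the step number alone, so no Python-object broadcast over the NCCL backend is needed:

```python
import random

def make_weights(step, n=4, seed_base=1234):
    """Both ranks call this with the same step and get bit-identical values."""
    rng = random.Random(seed_base + step)
    return [rng.random() for _ in range(n)]
```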
PyTorch 2.5+ requires device_id in init_process_group() when using
NCCL backend, otherwise dist.barrier() spins at 100% CPU indefinitely.
Fix applied to all three gate2.5 test files.
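A version-gated sketch of the fix (hypothetical helper; version parsing is illustrative and ignores pre-release suffixes). In real code the value would be `torch.device("cuda", local_rank)` passed as `device_id=` to `torch.distributed.init_process_group`:

```python
def init_pg_kwargs(torch_version, local_rank):
    """Build init_process_group kwargs; PyTorch >= 2.5 with the NCCL backend
    wants device_id=, otherwise dist.barrier() can spin at 100% CPU."""
    major, minor = (int(x) for x in torch_version.split(".")[:2])
    kwargs = {"backend": "nccl"}
    if (major, minor) >= (2, 5):
        # Stand-in tuple; real code passes torch.device("cuda", local_rank).
        kwargs["device_id"] = ("cuda", local_rank)
    return kwargs
```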
zhenyulincs and others added 13 commits April 25, 2026 21:16
…187620155

Add Claude Code GitHub Workflow
…tep_target_estimate bootstrap

- grpo.py: begin_progress_batch now called before before_generation each step
- TASK5_6_HOOKS.md: add chicken-and-egg problem explanation and two-mechanism solution
- test: add test_begin_progress_batch_called_before_before_generation (31 tests total)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ics)

- test_f6_expand_atomic: expand reuses same CPU cache → version stays at
  _cache_ready_step, not +1; add _cache_ready_step advance to test multi-step
- test_gap_ratio: SchedGuidedAllocationOp field renamed from gpus_to_allocate
  to dp_rank_to_gpus_to_add (Dict); flatten values for GPU-set assertions
- test_nemo_rl_pipeline: add MockPolicy/MockCoordinator stubs so after_training
  can exercise _build_cpu_bucket_cache and sync_base_weights_to_active paths
  without Ray; align version assertions with no-bump spec (version = step number)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…rs docstring

nemo_rl_pipeline.py: Phase 3 comment incorrectly attributed the
finish_generation() call to _init_inference_workers; it is called by
_sleep_all_inference_workers() which runs after init.

nemo_rl_model_update_service.py: sync_selected_workers is called on
two paths (expand and active refresh for non-overlap ranks), update
class and method docstrings to reflect both callers and note the CUDA
stream sync requirement for the active refresh path.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
taoluo added a commit that referenced this pull request May 2, 2026
…Gate 4

Three rounds of audit against rlix/scheduler/scheduler.py + RLix planner +
existing MILES code surfaced 15 distinct P0 issues in the unified port
plan that would block M11.2 Gate 4 (dual-pipeline) implementation.

Bucket I — Init control flow vs RLix scheduler reality
  #1+#2  Reorder init: scheduler executes coordinator.resize_infer(add)
         RPC in Phase 5 before signaling pending GEN waiter (planner
         emits SchedGuidedAllocationOp even on first-pipeline alloc).
         All-shell RolloutManager + register_model_update_resources +
         publish_cache_ready_step(-1) + bootstrap_active_engines(empty)
         must run as Step 6.6a-f BEFORE Step 7's _request_cluster_gpus.
         Add Step 7.1-7.3 post-request validation. Empty Miles router
         started in all-shell ctor; activate_routing is sole add_worker
         entry; per-pipeline router port (no inheritance from peer).
         Coordinator gains _active_engines_bootstrapped: bool to
         distinguish empty-set bootstrap from "not yet bootstrap'd".

Bucket II — F4/F5+6 weight-refresh internal contradictions
  #3  Drop per-bucket weight_version from sender invocation, wrapper
      payload, HTTP route metadata, and receiver API. Per-sync version
      publish is the atomic unit's step (c) — once via
      manager.set_weight_version(version, target_engines).
  #4  HTTP /update_weights_from_cpu_bucket route handler must enter
      tokenizer_manager.model_update_lock.writer_lock (mirrors existing
      update_weights_from_tensor path); active-refresh's "engine keeps
      serving" semantics need this pause boundary to prevent half-new
      weights being observed mid-load.
  #5  Add RolloutManager.get_engine_handles(engine_indices) read-only
      API; relax service-manager rule to "two read-only entries +
      one write" so service can dispatch per-engine RPCs.
  #6  Replace 5 split sender RPCs with composite
      cache_owner_actor.run_sync_session(plan: SyncSessionPlan).
      _cache_lock held in single method body, not cross-RPC.
      SyncSessionPlan is fully self-contained (target_handles +
      cpu_serialize_local_ranks + broadcast_local_ranks + comm_ranks),
      cache_owner never calls back to service or manager.
  #7  Replace dist.new_group with TCP rendezvous via existing MILES
      init_process_group helper (broadcast.py:137 pattern) on sender
      side and SGLang init_custom_process_group on receivers. Cross-
      process default-PG-less topology cannot use dist.new_group.

Bucket III — Spec drift / API contradictions
  #8  F12 main body switched from get_rollout_workers() / worker_placements
      to get_all_rollout_engine_placements() (declared full table) +
      get_active_engine_indices(allocated_gpus, tp_size) (active subset
      with half-engine raise).
  #10 Gate 3 invariant (c) restated as initial_completed=0 contract,
      replacing stale "buffer ready count non-zero" wording that
      contradicts Fix #1.

Bucket IV — TrainRayActor + milestone/Gate scope
  #9  TrainRayActor.__init__ accepts optional local_rank kwarg; RLix
      path passes local_rank=0 (1 actor 1 GPU under fractional + manual
      CVD; ray.get_gpu_ids() not in manual CVD list breaks
      get_local_gpu_id() ValueError). Standalone path keeps existing
      fallback. New file-change row for miles/ray/train_actor.py.
  #11 M11.2 unified-plan scope = Gate 4 happy path only (shell partial
      allocation + RolloutManager lazy ctor + Gate 4 (c)/(d)/(e)
      acceptance). admission_epoch / orchestrator cleanup / graceful
      drain stay M11.2-tagged follow-up — not Gate 4 pass/fail criteria.
      Manual `ray stop` is accepted recovery on crash. Implementation
      order Week 3 drops Gate 4; Week 4 picks up M11.2 happy path.
  #12 Strengthened Gate 4 acceptance: (c) router worker list contains
      exactly active engine subset; (d) v=-1 short-circuit verified via
      both run_sync_session and finalize_weight_update call counters
      staying at pre-expand values; (e) donor-shrink-before-recipient-
      add ordering with T1 < T2 < T3 strict inequality.

Bucket V — Tail-end consequences of new init ordering
  #13 v=-1 service path skips run_sync_session, skips finalize fan-out,
      no cache_owner GPU touch. Required because Step 6 releases
      actor_train allocation before Step 7's GEN request — broadcast
      H2D staging on cache_owner would race the just-released GPU.
      Service runs only manager.set_weight_version(-1, target_engines).
  #14 F10 RLix-mode startup fail-fast on resume args:
      args.load in {None, args.hf_checkpoint} AND
      args.ref_load in {None, args.hf_checkpoint}. Fix #13's no-transport
      assumes train and rollout base weights are equivalent; resume /
      non-equivalent ref_load would diverge them. CPU-only base sync
      for resume is M11.2 follow-up; for now fail-fast.
  #15 v=-1 path also skips engine.finalize_weight_update — SGLang's
      onload-time process_weights_after_loading already ran during
      manager.expand_engines lazy-spawn; re-running on already-loaded
      weights is redundant.

Cross-cutting cleanups
  - F4 _cache_lock invariant clarified as in-method (run_sync_session
    body), not cross-RPC.
  - F5+6 atomic-unit description rewritten to spell the (a)/(b)/(c)
    steps and the v=-1 short-circuit.
  - Receiver API table tightened: weight_version dropped from
    update_weights_from_cpu_bucket signature; finalize_weight_update
    row notes v=-1 skip.
  - SyncSessionPlan dataclass spec includes M11.2 narrowing note —
    Gate 4 happy-path acceptance runs cpu_serialize-only topology;
    NCCL broadcast complexity stays as M11.1 load-bearing structure
    (Gate 2.5 unchanged), M11.2 doesn't add transport work.
  - F12 placement provider rewrite + standalone path preserved.
  - Implementation follow-up table gains "M11.2 follow-up: CPU-only
    base sync at v=-1 for resume / non-equivalent ref_load" entry.

No code changes; spec-only document. 858 insertions / 235 deletions.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@myang333 (Collaborator)

myang333 commented May 5, 2026

Task 4 sub-items landed on this PR:

commit   sub-item                                         plan feature
61dfbe1  partial-overlap topology validator               F10 (initial)
a766ab1  per-pipeline Ray namespace reader                F7
9ebcdc6  ConfigBridge builder functions                   F8 (partial)
e4e61e8  pipeline registration helper                     F8 (partial)
4365b9f  RLixVirtualClusterAdapter (duck-typed stand-in)  F12 (initial)
94c3683  device mapping config fallback                   F4/F8 review fix

Plan reference: plans/nemorl-port-plan.md Features 7, 8, 10, 12.

Deferred to follow-up tasks (saved locally on safety/feature-8-10-followups):

  • F8 full lifecycle (bootstrap_nemo_rl_pipeline + NemoRLFullFinetunePipeline)
  • F10 hardening (assert→ValueError, strengthened shard-disjoint check, wire into registration)
  • F12 factory (build_nemo_rlix_clusters + from_resource_allocation classmethod)
  • NeMo RL fork-side changes (grpo.setup() injection, RLIX_CONTROL_PLANE gate)

TianyeGGBond and others added 14 commits May 5, 2026 14:53
Conflict resolution:
- coordinator.py: kept _nemo_rl_model_update_service (NeMo RL service, correct interface)
  and tgt_dp_ranks call signature; used PR #8 more descriptive error message
- test_nemo_rl_pipeline.py: kept HEAD F5/F6 pipeline tests intact
- test_bucket_cache_lifecycle.py: extracted PR #8 BucketCacheLifecycle tests to new file
- test_gap_ratio.py: minor variable rename (g -> gpu for clarity)
nemo_rl_model_update_service.py:
- Implement sync_selected_workers() using megatron_policy_worker's
  selective_sync_active_cache (cpu_serialize transport, no topology analysis).
- Add _get_policy_workers() helper: resolves training worker actors from
  policy.src_cluster.workers, .workers, list, or single handle.

nemo_rl_pipeline.py:
- Change wake_up_partial(ranks) → wake_up_partial(ranks, skip_activate=True)
  so woken ranks stay off routing table until weight sync finishes (Step 5).
- URL: zhenyulincs/RL.git -> TianyeGGBond/RL.git, branch: nemo
- Was missing F2/F3: sleep_partial, wake_up_partial, mark_dp_ranks_inactive,
  activate_dp_ranks (partial overlap dataplane primitives)
- Now includes F2/F3/F4/F5/F6 + all fixes merged into nemo branch
…ration API

- wake_up_partial: add skip_activate keyword arg
- sleep_partial: async def -> def, add mode param, return True
  (VllmGeneration.sleep_partial is synchronous - calls ray.get internally)
- finalize_weight_update: remove dp_ranks arg (handled inside sync)
- _make_test_pipeline: p._policy = MockPolicy() directly, not wrapped
- TestExpandWorkersAtomic: remove finalize_weight_update from ordering
  assertions (no longer a separate observable call)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The NeMo-RL port of rlix has no runtime ROLL dependency, so eager-importing
RollFullFinetunePipeline / RollMultiLoraPipeline at package-init time
breaks any deployment where the roll.* wheel is not installed. Move them
to the docstring as opt-in dotted-path imports for consumers that still
need them.

Refs: implement_log.md Step 6 (2026-05-08).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
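The opt-in dotted-path import pattern the commit moves to the docstring looks roughly like this (hypothetical helper name; the real consumer would pass e.g. a `roll.*` pipeline path):

```python
import importlib

def load_pipeline(dotted_path):
    """Lazily import a pipeline class by dotted path, so package init
    no longer requires the roll.* wheel to be installed."""
    module_path, _, cls_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, cls_name)
```

Consumers that still need the ROLL pipelines resolve them at call time instead of at import time, which is what removes the hard runtime dependency.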
…alidation

Two changes to PipelineCoordinator wiring for the NeMo-RL path:

1. Replace the ROLL RollResourceManagerProxy used to pin the coordinator
   actor to node 0 with a native ray.util.placement_group([{"CPU": 1}]).
   Same end behaviour, no roll.* dependency.

2. _validate_vllm_sleep_level now accepts sleep_level ∈ {1, 2}. The smoke
   topology (partial-overlap dp=2, train and infer on disjoint physical
   GPUs per pipeline at the inner-pipeline level) defaults to level=2,
   but level=1 is useful as a diagnostic when investigating CuMemAllocator
   issues — it skips _sleep_saved_buffers population. The plumbing to
   read sleep_level from the yaml lives in nemo_rl_pipeline.py.

Refs: implement_log.md Step 6 / Step ?+19 (2026-05-09).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…-release guard

Two scheduler additions surfaced by the NeMo-RL 2-pipeline runs:

1. _gather_resize_tolerate_dead — when resize_infer is dispatched to a
   coordinator that has already GC'd (pipeline finished and exited
   between scheduler decisions), the original code propagates the
   ActorDiedError and unwinds the entire scheduler tick, killing the
   surviving sibling pipeline. Catch the dead-coordinator case and
   auto-unregister the dead pipeline so the tick can continue.

2. The auto-unregister path now skips when (a) the pipeline is already
   absent from the registry (graceful unregister beat it to it) or
   (b) the pipeline has a pending_planned_release_request in flight
   (the graceful await_release_gpus path is running and must not be
   stomped). This avoids the v74 → v75 race where the launcher's
   ray.get(orchestrator.unregister_pipeline.remote(pid)) and the
   scheduler's auto-unregister collided.

Refs: debug_log.md #50 (v45), #66 (v75).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
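The tolerate-dead gather plus the two skip conditions can be sketched as below. `RuntimeError` stands in for `ray.exceptions.ActorDiedError`, and the registry is a plain dict; the real scheduler state is richer:

```python
def gather_resize_tolerate_dead(registry, pipeline_id, call):
    """Dispatch a resize RPC; on a dead coordinator, auto-unregister the
    pipeline instead of unwinding the whole scheduler tick."""
    try:
        return call()
    except RuntimeError:  # stand-in for ActorDiedError
        state = registry.get(pipeline_id)
        if state is None:
            return None  # graceful unregister beat us to it
        if state.get("pending_planned_release_request"):
            return None  # graceful await_release_gpus in flight; don't stomp it
        del registry[pipeline_id]  # auto-unregister the dead pipeline
        return None
```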
…t-loop cleanup

The omnibus NeMoRLFullFinetunePipeline + NemoRLRLixHooks update that ties
NeMo RL's grpo training loop into the RLix scheduler / coordinator /
ATC contract. Each subchange below maps to one or more debug-log entries.

NemoRLRLixHooks:
- on_trajectory_collector_created — first GENERATION demand request
  to the scheduler the moment the ATC actor exists; sets
  step_target_estimate=1 so the planner's gap-ratio doesn't skip the
  request (debug #27).
- before_weight_sync — new hook required by the NeMo-RL split of the
  weight_sync block; triggers _before_weight_sync on the pipeline so the
  CPU bucket cache is built BEFORE policy.offload_after_refit empties
  param.data storage (debug #35).
- after_training — now drives the post-train half only: sync active
  base weights through the coordinator + publish the new weight version
  to ATC.

NemoRLFullFinetunePipeline (selected highlights):
- _make_rlix_virtual_cluster / _allocate_shared_pg — colocated-friendly
  PG allocation; one bundle per GPU; multi-pipeline shares the named PG
  via ray.util.get_placement_group (plan F12).
- run() now wraps async_grpo_train in try/finally that stops the
  generation watchdog daemon, runs _await_release_actor_infer(last_step)
  (mirroring ROLL's run() epilogue, scheduler-managed shrink-to-zero),
  and best-effort ray.kill(self._trajectory_collector) so ppl1 finishing
  no longer cascades onto ppl2 (debug #65, #69).
- _start_generation_watchdog daemon polls _active_dp_ranks /
  _pre_activation_ranks every 2s and re-issues
  _request_cluster_gpus(GENERATION) when both are empty; required
  because GENERATION is a one-shot priority and ppl2's INITIALIZATION
  preemption otherwise starves ppl1.infer (debug #32).
- _push_active_dp_ranks_to_collector helper — every _expand_workers /
  _shrink_workers calls AsyncTrajectoryCollector.set_active_dp_ranks
  (debug #33) so ATC's pickled snapshot of _active_dp_ranks stays in
  sync with the pipeline's.
- _prewarm_inference_ranks — per-rank wake_up_partial+sleep_partial
  cycle at init Phase 2 so a rank's first scheduler-driven wake is
  never the first CUDA-pool wake on that physical GPU after a sibling
  Megatron train (debug #68).
- _before_weight_sync / _after_training split — bucket cache build moved
  to before_weight_sync (debug #35); after_training keeps
  publish_weight_version. _cache_ready_step semantics flipped to
  step + 1 so ATC's _calculate_target_weights advances correctly
  (debug #36). _publish_weight_version clamps to max(_, 0) to avoid
  the bootstrap-time -1 deadlock with ATC initial_weight_version=0
  (debug #31).
- _await_release_actor_infer — mirrors ROLL's planned shrink-to-zero
  (scheduler.await_release_gpus.remote) so ppl1 cleanup is
  scheduler-managed rather than relying on actor-died fallback
  (debug #65).
- wait_for_first_after_training / signal_pair_setup_complete /
  arm_pair_setup_barrier — pair-init barrier so the launcher admits
  ppl2 only after ppl1 has reached its first _after_training step
  (debug #44, step-boundary admission).
- _read_vllm_sleep_level + plumbing — sleep_level is now config-driven
  from actor_infer.strategy_args.strategy_config.sleep_level (debug
  #58 v51 / v52).

Net effect: drives the scheduler → coordinator → vLLM/Megatron contract
for the partial-overlap dp=2 smoke; without these the 2-pipeline run
never reaches step 0.

Refs: debug_log.md #21, #22, #26, #27, #31, #32, #33, #34, #35, #36,
#37, #44, #58, #65, #67, #68, #69.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
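The generation watchdog described above can be sketched as a small polling daemon (hypothetical signature; the real hook calls `_request_cluster_gpus(GENERATION)` and reads `_active_dp_ranks` / `_pre_activation_ranks` off the pipeline):

```python
import threading

def start_generation_watchdog(get_active, get_pre_activation, request_gpus,
                              poll_s=2.0, stop=None):
    """Re-issue the one-shot GENERATION demand whenever both rank sets are
    empty, so a sibling pipeline's INITIALIZATION preemption can't starve us."""
    stop = stop or threading.Event()

    def loop():
        while not stop.is_set():
            if not get_active() and not get_pre_activation():
                request_gpus("GENERATION")
            stop.wait(poll_s)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return stop, t
```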
…mission

Driver script that orchestrates ≥2 RLix NeMoRLFullFinetunePipeline actors
end-to-end:

- rlix.init → orchestrator handle.
- register_nemo_rl_pipeline per pipeline → (pipeline_id, ray_namespace,
  scheduler).
- CoordinatorActor.options(namespace=...).remote(...) per pipeline.
- coordinator.create_pipeline_actor.remote(pipeline_config).
- pipeline_actor.run.remote(); ray.wait per-pipeline so one finishing
  early doesn't kill the launcher.
- Step-boundary admission: rather than wall-clock time.sleep(
  admit_delay_s), wait for pipeline_actor.wait_for_first_after_training
  before admitting the next pipeline. Fall back to wall-clock sleep
  on timeout. Eliminates the cross-ppl init OOM race documented in
  v37 / v38.
- Graceful unregister at run-completion:
  ray.get(orchestrator.unregister_pipeline.remote(pid)) so the
  scheduler's auto-unregister path doesn't fire on a still-active
  sibling.
- runtime_env.env_vars wired with verified Ray thread caps (six
  RAY_*_thread_num keys binary-verified against the Ray C extension)
  plus misc thread caps (OMP / MKL / RAYON / OPENBLAS /
  TOKENIZERS_PARALLELISM / TORCH_NCCL_ENABLE_MONITORING / ...) to keep
  the dual-pipeline worker fleet under cgroup pids.max=3840.

Refs: implement_log.md Step 6 / Step ?+13 / Step ?+15 / Step ?+24;
debug_log.md #44, #53, #66.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
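The generic thread caps mentioned above would be wired roughly like this. Only the widely known env vars are shown; the six `RAY_*_thread_num` keys are not named in the commit text and are intentionally not guessed here:

```python
# Illustrative subset of the runtime_env thread caps described above.
thread_caps = {
    "OMP_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
    "RAYON_NUM_THREADS": "1",
    "TOKENIZERS_PARALLELISM": "false",
}
runtime_env = {"env_vars": dict(thread_caps)}
```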
… smoke

Two RLix-side wrapper configs for the validated 2-pipeline topology:
- ppl1: train=[0]  infer=[0,1]   (partial-overlap dp=2)
- ppl2: train=[1]  infer=[0,1]   (partial-overlap dp=2)

Each yaml carries:
- pipeline_cls = rlix.pipeline.nemo_rl_pipeline.NemoRLFullFinetunePipeline
- nemo_config_path → /workspace/RL/examples/configs/grpo_math_1B.yaml
- nemo_config_overrides — the smoke-validated ++ override list
  (max_num_steps=6, batch_size=1, max_seq_len=64, enforce_eager=true,
  gpu_memory_utilization=0.10, num_gpu_blocks_override=64, etc.)
- train_device_mapping / infer_device_mapping used by the rlix scheduler
- actor_train / actor_infer schemas with strategy_name=vllm,
  sleep_level=2, offload_nccl=true, rlix_max_colocated_worker_groups=4

Refs: implement_log.md ✅✅✅ v52 / v74 / v78 milestones.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>