
Nemo RL integration #5

Open
taoluo wants to merge 110 commits into main from nemo

Conversation

@taoluo (Contributor)

@taoluo taoluo commented Apr 13, 2026

Addresses #4.

The nemo-rl-side changes are on the fork https://github.com/rlops/RL

taoluo and others added 3 commits April 9, 2026 23:17
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Rewrite Feature 5+6 (merged): two-path weight refresh model
- Active refresh: training loop syncs active non-overlap ranks in-flight
  via coordinator.sync_base_weights_to_active() → ModelUpdateService
- Expand sync: scheduler syncs woken overlap ranks via _expand_workers()
- Both paths share single-copy CPU bucket cache, same transport
- Version = _cache_ready_step (no double-bump)
- Transition window tolerated, not eliminated (bounded by in-flight
  requests at push time)
- Actor call graph: coordinator calls ModelUpdateService directly,
  no re-entrant pipeline self-call (follows sync_lora_weights pattern)
- Control-plane invariant: GPU release signals weight consistency
- Receiver-side: 6 target-worker methods (5 transport + finalize)
- Per-bucket apply, finalize once (process_weights_after_loading)
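The two-path, single-version bookkeeping above can be sketched as follows. All names here (`WeightRefreshCoordinator`, `build_cache`, `_push`) are illustrative stand-ins mirroring the commit text (`_cache_ready_step`, `sync_base_weights_to_active`, `_expand_workers`), not the actual rlix code; tensors are stood in by plain objects:

```python
import threading

class WeightRefreshCoordinator:
    """Both refresh paths read one CPU bucket cache and one version counter."""

    def __init__(self):
        self._cache_lock = threading.Lock()
        self._cpu_bucket_cache = {}   # param_name -> CPU tensor (stand-in: any object)
        self._cache_ready_step = -1   # single source of truth for the weight version

    def build_cache(self, step, named_weights):
        # Called once per training step; version is the step itself (no double-bump).
        with self._cache_lock:
            self._cpu_bucket_cache = dict(named_weights)
            self._cache_ready_step = step

    def sync_base_weights_to_active(self, active_ranks):
        # Active-refresh path: push to in-flight non-overlap ranks.
        return self._push(active_ranks)

    def expand_workers(self, woken_ranks):
        # Expand path: push to freshly woken overlap ranks; same cache, same version.
        return self._push(woken_ranks)

    def _push(self, ranks):
        with self._cache_lock:
            version = self._cache_ready_step
            payload = dict(self._cpu_bucket_cache)
        return {rank: version for rank in ranks}, payload
```

The point of the sketch is the invariant, not the transport: an expand after an active refresh reports the same version, because neither path bumps anything — only `build_cache` does.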

Polish other features:
- F4: dynamic NCCL group lifecycle, comm_plan dual-path mask,
  cache safety 4th invariant (_cache_lock)
- F7: soften namespace claim for anonymous child actors
- F8: registration pseudocode matches actual RLix API signatures
- F9: progress metric uses count_intended_for_step (not age-window),
  local counter gated on target_weight_version == progress_target_step
- F11: single NCCL teardown contract, no overclaim about
  destroy_model_parallel()
- F12: RLixVirtualClusterAdapter instead of "don't use RayVirtualCluster"
- Gate 3: updated to match tolerated transition window model

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add idempotency guards (F1), routing lock + post-offload VRAM assertion
(F2), dispatch-under-lock atomicity (F3), bucket format spec + full
cache_lock critical section (F4), batch-begin/end progress lifecycle
(F9), and Gate 2.5 fallback rule for NCCL teardown (F11).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@taoluo taoluo changed the title from "Nemo" to "Nemo RL integration" on Apr 13, 2026
zhenyulincs and others added 26 commits April 13, 2026 21:52
…results

- Add run_rlix_experiment.py with scenarios A-F (Qwen FT, Dual FT,
  Multi-LoRA, FT+LoRA, LFM DeepSpeed single/dual)
- Add lfm_finetune_pipeline1/2.yaml for LFM2.5-350M DeepSpeed training;
  includes DS_BUILD_OPS=0 env var (partial fix for sm_120a)
- Update all pipeline yamls with VLLM_USE_FLASHINFER_SAMPLER=0 and other
  RTX 5090 / Blackwell compatibility fixes
- Add RLIX_EXPERIMENT.md: architecture docs, benchmark results (v20 run),
  step-by-step timing, 8 bugs documented with root causes and fixes

Results (4x RTX 5090, 2026-04-14 v20 run):
  A Single FT:         244s  2.4% util  PASS
  B Dual FT:           312s  3.9% util  PASS
  C Single Multi-LoRA: 367s  0.7% util  PASS
  D FT+Multi-LoRA:     434s  1.8% util  PASS
  E LFM Single FT:     105s  0.1% util  FAIL (fused_adam JIT, see Bug 8)
  F LFM Dual FT:       106s  0.0% util  FAIL (same)
ZeRO-2 without offload_optimizer causes ROLL to select FusedAdam, which
fails to JIT-compile on RTX 5090 (sm_120a / Blackwell). Adding
offload_optimizer.device=cpu makes is_offload() return True so ROLL
selects DeepSpeedCPUAdam instead. No ROLL source modification needed.
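The ZeRO-2 change above would look roughly like this in a DeepSpeed config. The `zero_optimization` / `offload_optimizer.device` keys are standard DeepSpeed config; the `is_offload()` helper is a stand-in for ROLL's internal check, not its real code:

```python
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # Without this block ROLL picks FusedAdam, which fails to JIT on
        # sm_120a; with device=cpu it selects DeepSpeedCPUAdam instead.
        "offload_optimizer": {"device": "cpu"},
    },
}

def is_offload(config):
    # Stand-in for the check ROLL's optimizer selection relies on.
    return (
        config.get("zero_optimization", {})
        .get("offload_optimizer", {})
        .get("device") == "cpu"
    )
```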
…on strategies

build_latest_bucket_cache and promote_active_checkpoint are Megatron-only.
DeepSpeed training workers raise RuntimeError for these calls. Catch and
skip gracefully so LFM/DeepSpeed pipelines can initialize correctly.
ROLL's ReloadableProcessGroup monkey-patch is incompatible with DeepSpeed's
process group initialization. Skip the offload_nccl=True enforcement in the
coordinator for deepspeed_* strategy clusters and set offload_nccl=false in
LFM pipeline yamls. NCCL buffer overhead is negligible for 350M models.
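The catch-and-skip behaviour described above can be sketched as a small guard (hypothetical helper name; only `build_latest_bucket_cache` and the `deepspeed_*` strategy prefix come from the commit text):

```python
def maybe_build_bucket_cache(worker, step, strategy):
    """build_latest_bucket_cache is Megatron-only; DeepSpeed workers raise
    RuntimeError, which we catch so LFM/DeepSpeed pipelines still initialize."""
    if strategy.startswith("deepspeed_"):
        # Also the cluster type for which offload_nccl enforcement is skipped.
        return False
    try:
        worker.build_latest_bucket_cache(step)
        return True
    except RuntimeError:
        # Backend does not implement the bucket cache; skip, don't crash.
        return False
```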
…gies

DeepSpeed does not implement _build_latest_bucket_cache so the bucket-cache
weight sync crashes with CUDA illegal memory access when expanding infer
workers. Use skip_load=True to route-only expand for deepspeed backends.
…ight sync

DeepSpeed does not implement bucket-cache weight sync. Restricting actor_infer
to the same GPUs as actor_train means all infer workers are IPC-accessible and
no NCCL cross-GPU sync is needed. Pipeline 1: infer on [0,1]; Pipeline 2:
infer on [2,3]. Reverts skip_load workaround.
…atron

DeepSpeed weight sync is incompatible with rlix bucket-cache mechanism.
E/F now use the same Qwen2.5-0.5B + Megatron config as A/B, providing
consistent coverage across all 6 scenarios. NeMo will be added as a
dedicated backend later.
- All 6 scenarios now pass on RTX 5090 Blackwell (sm_120a) after fixes
- Update scenarios E/F descriptions from LFM/DeepSpeed to Qwen2.5-0.5B/Megatron
- Add final v35/v37 results table showing all 6 scenarios PASS
- Preserve v20 historical results as reference
- Add Bugs 9-11: NCCL 2.26.2 Blackwell kernels, vLLM torch.dtype
  serialization, PyTorch 2.7.1 _coalescing_manager UnboundLocalError
Implements Feature 10 from plans/nemorl-port-plan.md as a standalone
validation function intended to be called by NemoRLFullFinetunePipeline
(Task 7) during pipeline initialization.

The function fails fast on invalid GPU topologies before RLix
registration, surfacing configuration errors at startup rather than
during rollout. All 6 assertions from plan Feature 10 are preserved
verbatim (logic and error messages).

- rlix/pipeline/nemo_rl_config_bridge.py: validation function
- tests/test_nemo_rl_config_bridge.py: 7 pytest cases (1 happy path +
  6 negative cases, one per assertion, each verified to trigger the
  intended assertion and not a prior one)

Scope note: only topology validation is implemented in this commit.
Config payload construction (Feature 8) and pipeline integration
(Task 7) are intentionally left for their respective tasks.
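A fail-fast topology check of this shape is sketched below. These are illustrative assertions only, not the six assertions from plans/nemorl-port-plan.md (which are not reproduced in this PR description); the function name and signature are hypothetical:

```python
def validate_gpu_topology(train_gpus, infer_gpus, world_size):
    """Raise at startup on an invalid topology, before RLix registration."""
    if not train_gpus:
        raise AssertionError("train device mapping is empty")
    if not infer_gpus:
        raise AssertionError("infer device mapping is empty")
    for g in set(train_gpus) | set(infer_gpus):
        if not 0 <= g < world_size:
            raise AssertionError(f"GPU id {g} outside world of size {world_size}")
    if len(set(train_gpus)) != len(train_gpus):
        raise AssertionError("duplicate GPU ids in train mapping")
```

Each negative test case then checks that exactly the intended assertion fires, as the commit message describes.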
Ports 4 modules from nemo-integration and adds BucketCacheLifecycle (new):

- bucket_cache.py: thread-safe CPUBucketCache keyed by (param_name, shard_id)
  with dirty tracking and PP-shard support
- bucket_receiver.py: BucketUpdateRequest/Result, PP-shard merge via torch.cat,
  fail-partial apply_bucket_update semantics
- model_update_service_cached.py: owns cache, populates from PP workers,
  dispatches dirty-bucket sync to inference workers
- bucket_cache_lifecycle.py: wraps ROLL's promote_active_checkpoint with
  _cache_ready_step version tracking; direct worker calls (no .remote())
  for testability

64 unit tests across 4 test files; all passing without Ray or GPU.

Fixes: worker calls use direct method (not .remote()) to be testable with
plain Python fakes — pipeline layer handles Ray .remote() externally.
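The cache API described above (and later aligned in the GPU tests: `store(name, shard_id=0, tensor=t)`, `get_dirty_buckets()` returning a `List[Bucket]`) can be sketched with plain Python fakes, exactly the testability property the commit aims for. Internals here are illustrative; the real module holds torch tensors:

```python
import threading
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Bucket:
    name: str
    shard_id: int
    tensor: Any  # torch.Tensor in the real module; any object in this sketch

class CPUBucketCache:
    """Thread-safe cache keyed by (param_name, shard_id) with dirty tracking."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buckets = {}
        self._dirty = set()

    def store(self, name, shard_id=0, tensor=None):
        with self._lock:
            key = (name, shard_id)
            self._buckets[key] = Bucket(name, shard_id, tensor)
            self._dirty.add(key)

    def get_dirty_buckets(self) -> List[Bucket]:
        with self._lock:
            return [self._buckets[k] for k in sorted(self._dirty)]

    def mark_clean(self):
        with self._lock:
            self._dirty.clear()
```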
…cationOp refactor

gpus_to_allocate/dp_ranks_to_add were renamed to dp_rank_to_gpus_to_add (Dict)
in the type definition; update the two affected tests to use the current API.
- external/NeMo: zhenyulincs/RL fork at rlix-task2 branch
- tests/integration/test_bucket_cache_gpu.py: 4 test classes using
  Qwen2.5-0.5B on real GPU:
  * TestGPUMemoryRelease: verifies >=90% VRAM freed after offload
  * TestWeightCorrectnessInCache: bit-for-bit match between GPU model
    and CPUBucketCache contents
  * TestBucketReceiverPush: weight correctness after apply_bucket_update
    to CPU and GPU targets
  * TestFullRoundTrip: end-to-end GPU→cache→offload→push→verify
- tests/integration/run_gpu_tests.sh: convenience deploy script
- store() takes keyword args: store(name, shard_id=0, tensor=t)
- get_dirty_buckets() returns List[Bucket] not dict
- BucketUpdateRequest.sync_id is str not int
- Remove manual Bucket construction — pass get_dirty_buckets() directly
Part 1 (test_gate2_5_nccl_destroy.py, torchrun --nproc-per-node=2):
- Megatron destroy_model_parallel() VRAM release ≥70% threshold
- 5-cycle destroy/re-init stability (no leak, allreduce works after)
- Stale process group raises after destroy (no silent corruption)
- Requires: megatron-core

Part 2 (test_gate2_5_selective_sync.py, torchrun --nproc-per-node=2):
- Dynamic NCCL group create/use/destroy per sync cycle
- rank 0 → rank 1 bucket broadcast from CPUBucketCache
- Bit-exact weight verification on receiver
- VRAM stable across 3 sync cycles

Part 3 (test_gate2_5_qwen_train_sync.py, torchrun --nproc-per-node=4):
- Real Qwen2.5-0.5B forward+backward on GPU 0,1 (TP=2 training)
- SHA256 hash snapshot of all weights before sync
- CPU bucket cache build on rank 0
- VRAM release ≥60% after model.cpu() + empty_cache
- Dynamic NCCL broadcast to GPU 2,3 (inference ranks)
- Bit-exact hash verification: training weights == received weights
- 2 full steps to verify stability
Root cause of 600s timeout: broadcast_object_list unreliable over
pure NCCL backend. New design uses same deterministic seed on both
ranks — no Python object broadcast needed. Also fixed new_group
ranks to [0, 1] (was incorrectly using [0, 2, 3] on 2-GPU setup).

Changes:
- Both ranks call make_weights(step=cycle) with same seed
- Dynamic group uses ranks=[SENDER, RECEIVER]=[0, 1]
- dist.barrier() immediately after init_process_group
- SHA256 hash verification on receiver for bit-exact check
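The deterministic-seed scheme above can be sketched as follows. The signature mirrors the commit's `make_weights(step=cycle)`; the seed derivation and use of `random` instead of torch are illustrative — the key property is that both ranks compute identical reference weights from the step number alone, so no Python-object broadcast over the NCCL backend is needed:

```python
import random

def make_weights(step, n=4, seed_base=1234):
    """Both ranks call this with the same step and get bit-identical values."""
    rng = random.Random(seed_base + step)
    return [rng.random() for _ in range(n)]
```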
PyTorch 2.5+ requires device_id in init_process_group() when using
NCCL backend, otherwise dist.barrier() spins at 100% CPU indefinitely.
Fix applied to all three gate2.5 test files.
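A version-gated sketch of the fix (hypothetical helper; version parsing is illustrative and ignores pre-release suffixes). In real code the value would be `torch.device("cuda", local_rank)` passed as `device_id=` to `torch.distributed.init_process_group`:

```python
def init_pg_kwargs(torch_version, local_rank):
    """Build init_process_group kwargs; PyTorch >= 2.5 with the NCCL backend
    wants device_id=, otherwise dist.barrier() can spin at 100% CPU."""
    major, minor = (int(x) for x in torch_version.split(".")[:2])
    kwargs = {"backend": "nccl"}
    if (major, minor) >= (2, 5):
        # Stand-in tuple; real code passes torch.device("cuda", local_rank).
        kwargs["device_id"] = ("cuda", local_rank)
    return kwargs
```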
zhenyulincs and others added 13 commits April 25, 2026 21:16
…187620155

Add Claude Code GitHub Workflow
…tep_target_estimate bootstrap

- grpo.py: begin_progress_batch now called before before_generation each step
- TASK5_6_HOOKS.md: add chicken-and-egg problem explanation and two-mechanism solution
- test: add test_begin_progress_batch_called_before_before_generation (31 tests total)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ics)

- test_f6_expand_atomic: expand reuses same CPU cache → version stays at
  _cache_ready_step, not +1; add _cache_ready_step advance to test multi-step
- test_gap_ratio: SchedGuidedAllocationOp field renamed from gpus_to_allocate
  to dp_rank_to_gpus_to_add (Dict); flatten values for GPU-set assertions
- test_nemo_rl_pipeline: add MockPolicy/MockCoordinator stubs so after_training
  can exercise _build_cpu_bucket_cache and sync_base_weights_to_active paths
  without Ray; align version assertions with no-bump spec (version = step number)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…rs docstring

nemo_rl_pipeline.py: Phase 3 comment incorrectly attributed the
finish_generation() call to _init_inference_workers; it is called by
_sleep_all_inference_workers() which runs after init.

nemo_rl_model_update_service.py: sync_selected_workers is called on
two paths (expand and active refresh for non-overlap ranks), update
class and method docstrings to reflect both callers and note the CUDA
stream sync requirement for the active refresh path.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
taoluo added a commit that referenced this pull request May 2, 2026
…Gate 4

Three rounds of audit against rlix/scheduler/scheduler.py + RLix planner +
existing MILES code surfaced 15 distinct P0 issues in the unified port
plan that would block M11.2 Gate 4 (dual-pipeline) implementation.

Bucket I — Init control flow vs RLix scheduler reality
  #1+#2  Reorder init: scheduler executes coordinator.resize_infer(add)
         RPC in Phase 5 before signaling pending GEN waiter (planner
         emits SchedGuidedAllocationOp even on first-pipeline alloc).
         All-shell RolloutManager + register_model_update_resources +
         publish_cache_ready_step(-1) + bootstrap_active_engines(empty)
         must run as Step 6.6a-f BEFORE Step 7's _request_cluster_gpus.
         Add Step 7.1-7.3 post-request validation. Empty Miles router
         started in all-shell ctor; activate_routing is sole add_worker
         entry; per-pipeline router port (no inheritance from peer).
         Coordinator gains _active_engines_bootstrapped: bool to
         distinguish empty-set bootstrap from "not yet bootstrap'd".

Bucket II — F4/F5+6 weight-refresh internal contradictions
  #3  Drop per-bucket weight_version from sender invocation, wrapper
      payload, HTTP route metadata, and receiver API. Per-sync version
      publish is the atomic unit's step (c) — once via
      manager.set_weight_version(version, target_engines).
  #4  HTTP /update_weights_from_cpu_bucket route handler must enter
      tokenizer_manager.model_update_lock.writer_lock (mirrors existing
      update_weights_from_tensor path); active-refresh's "engine keeps
      serving" semantics need this pause boundary to prevent half-new
      weights being observed mid-load.
  #5  Add RolloutManager.get_engine_handles(engine_indices) read-only
      API; relax service-manager rule to "two read-only entries +
      one write" so service can dispatch per-engine RPCs.
  #6  Replace 5 split sender RPCs with composite
      cache_owner_actor.run_sync_session(plan: SyncSessionPlan).
      _cache_lock held in single method body, not cross-RPC.
      SyncSessionPlan is fully self-contained (target_handles +
      cpu_serialize_local_ranks + broadcast_local_ranks + comm_ranks),
      cache_owner never calls back to service or manager.
  #7  Replace dist.new_group with TCP rendezvous via existing MILES
      init_process_group helper (broadcast.py:137 pattern) on sender
      side and SGLang init_custom_process_group on receivers. Cross-
      process default-PG-less topology cannot use dist.new_group.

Bucket III — Spec drift / API contradictions
  #8  F12 main body switched from get_rollout_workers() / worker_placements
      to get_all_rollout_engine_placements() (declared full table) +
      get_active_engine_indices(allocated_gpus, tp_size) (active subset
      with half-engine raise).
  #10 Gate 3 invariant (c) restated as initial_completed=0 contract,
      replacing stale "buffer ready count non-zero" wording that
      contradicts Fix #1.

Bucket IV — TrainRayActor + milestone/Gate scope
  #9  TrainRayActor.__init__ accepts optional local_rank kwarg; RLix
      path passes local_rank=0 (1 actor 1 GPU under fractional + manual
      CVD; ray.get_gpu_ids() not in manual CVD list breaks
      get_local_gpu_id() ValueError). Standalone path keeps existing
      fallback. New file-change row for miles/ray/train_actor.py.
  #11 M11.2 unified-plan scope = Gate 4 happy path only (shell partial
      allocation + RolloutManager lazy ctor + Gate 4 (c)/(d)/(e)
      acceptance). admission_epoch / orchestrator cleanup / graceful
      drain stay M11.2-tagged follow-up — not Gate 4 pass/fail criteria.
      Manual `ray stop` is accepted recovery on crash. Implementation
      order Week 3 drops Gate 4; Week 4 picks up M11.2 happy path.
  #12 Strengthened Gate 4 acceptance: (c) router worker list contains
      exactly active engine subset; (d) v=-1 short-circuit verified via
      both run_sync_session and finalize_weight_update call counters
      staying at pre-expand values; (e) donor-shrink-before-recipient-
      add ordering with T1 < T2 < T3 strict inequality.

Bucket V — Tail-end consequences of new init ordering
  #13 v=-1 service path skips run_sync_session, skips finalize fan-out,
      no cache_owner GPU touch. Required because Step 6 releases
      actor_train allocation before Step 7's GEN request — broadcast
      H2D staging on cache_owner would race the just-released GPU.
      Service runs only manager.set_weight_version(-1, target_engines).
  #14 F10 RLix-mode startup fail-fast on resume args:
      args.load in {None, args.hf_checkpoint} AND
      args.ref_load in {None, args.hf_checkpoint}. Fix #13's no-transport
      assumes train and rollout base weights are equivalent; resume /
      non-equivalent ref_load would diverge them. CPU-only base sync
      for resume is M11.2 follow-up; for now fail-fast.
  #15 v=-1 path also skips engine.finalize_weight_update — SGLang's
      onload-time process_weights_after_loading already ran during
      manager.expand_engines lazy-spawn; re-running on already-loaded
      weights is redundant.

Cross-cutting cleanups
  - F4 _cache_lock invariant clarified as in-method (run_sync_session
    body), not cross-RPC.
  - F5+6 atomic-unit description rewritten to spell the (a)/(b)/(c)
    steps and the v=-1 short-circuit.
  - Receiver API table tightened: weight_version dropped from
    update_weights_from_cpu_bucket signature; finalize_weight_update
    row notes v=-1 skip.
  - SyncSessionPlan dataclass spec includes M11.2 narrowing note —
    Gate 4 happy-path acceptance runs cpu_serialize-only topology;
    NCCL broadcast complexity stays as M11.1 load-bearing structure
    (Gate 2.5 unchanged), M11.2 doesn't add transport work.
  - F12 placement provider rewrite + standalone path preserved.
  - Implementation follow-up table gains "M11.2 follow-up: CPU-only
    base sync at v=-1 for resume / non-equivalent ref_load" entry.

No code changes; spec-only document. 858 insertions / 235 deletions.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@myang333 (Collaborator)

myang333 commented May 5, 2026

Task 4 sub-items landed on this PR:

commit   sub-item                                         plan feature
61dfbe1  partial-overlap topology validator               F10 (initial)
a766ab1  per-pipeline Ray namespace reader                F7
9ebcdc6  ConfigBridge builder functions                   F8 (partial)
e4e61e8  pipeline registration helper                     F8 (partial)
4365b9f  RLixVirtualClusterAdapter (duck-typed stand-in)  F12 (initial)
94c3683  device mapping config fallback                   F4/F8 review fix

Plan reference: plans/nemorl-port-plan.md Features 7, 8, 10, 12.

Deferred to follow-up tasks (saved locally on safety/feature-8-10-followups):

  • F8 full lifecycle (bootstrap_nemo_rl_pipeline + NemoRLFullFinetunePipeline)
  • F10 hardening (assert→ValueError, strengthened shard-disjoint check, wire into registration)
  • F12 factory (build_nemo_rlix_clusters + from_resource_allocation classmethod)
  • NeMo RL fork-side changes (grpo.setup() injection, RLIX_CONTROL_PLANE gate)

TianyeGGBond and others added 14 commits May 5, 2026 14:53
Conflict resolution:
- coordinator.py: kept _nemo_rl_model_update_service (NeMo RL service, correct interface)
  and tgt_dp_ranks call signature; used PR #8 more descriptive error message
- test_nemo_rl_pipeline.py: kept HEAD F5/F6 pipeline tests intact
- test_bucket_cache_lifecycle.py: extracted PR #8 BucketCacheLifecycle tests to new file
- test_gap_ratio.py: minor variable rename (g -> gpu for clarity)
nemo_rl_model_update_service.py:
- Implement sync_selected_workers() using megatron_policy_worker's
  selective_sync_active_cache (cpu_serialize transport, no topology analysis).
- Add _get_policy_workers() helper: resolves training worker actors from
  policy.src_cluster.workers, .workers, list, or single handle.

nemo_rl_pipeline.py:
- Change wake_up_partial(ranks) → wake_up_partial(ranks, skip_activate=True)
  so woken ranks stay off routing table until weight sync finishes (Step 5).
- URL: zhenyulincs/RL.git -> TianyeGGBond/RL.git, branch: nemo
- Was missing F2/F3: sleep_partial, wake_up_partial, mark_dp_ranks_inactive,
  activate_dp_ranks (partial overlap dataplane primitives)
- Now includes F2/F3/F4/F5/F6 + all fixes merged into nemo branch
…ration API

- wake_up_partial: add skip_activate keyword arg
- sleep_partial: async def -> def, add mode param, return True
  (VllmGeneration.sleep_partial is synchronous - calls ray.get internally)
- finalize_weight_update: remove dp_ranks arg (handled inside sync)
- _make_test_pipeline: p._policy = MockPolicy() directly, not wrapped
- TestExpandWorkersAtomic: remove finalize_weight_update from ordering
  assertions (no longer a separate observable call)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The NeMo-RL port of rlix has no runtime ROLL dependency, so eager-importing
RollFullFinetunePipeline / RollMultiLoraPipeline at package-init time
breaks any deployment where the roll.* wheel is not installed. Move them
to the docstring as opt-in dotted-path imports for consumers that still
need them.

Refs: implement_log.md Step 6 (2026-05-08).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
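The opt-in dotted-path import pattern the commit moves to the docstring looks roughly like this (hypothetical helper name; the real consumer would pass e.g. a `roll.*` pipeline path):

```python
import importlib

def load_pipeline(dotted_path):
    """Lazily import a pipeline class by dotted path, so package init
    no longer requires the roll.* wheel to be installed."""
    module_path, _, cls_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, cls_name)
```

Consumers that still need the ROLL pipelines resolve them at call time instead of at import time, which is what removes the hard runtime dependency.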
…alidation

Two changes to PipelineCoordinator wiring for the NeMo-RL path:

1. Replace the ROLL RollResourceManagerProxy used to pin the coordinator
   actor to node 0 with a native ray.util.placement_group([{"CPU": 1}]).
   Same end behaviour, no roll.* dependency.

2. _validate_vllm_sleep_level now accepts sleep_level ∈ {1, 2}. The smoke
   topology (partial-overlap dp=2, train and infer on disjoint physical
   GPUs per pipeline at the inner-pipeline level) defaults to level=2,
   but level=1 is useful as a diagnostic when investigating CuMemAllocator
   issues — it skips _sleep_saved_buffers population. The plumbing to
   read sleep_level from the yaml lives in nemo_rl_pipeline.py.

Refs: implement_log.md Step 6 / Step ?+19 (2026-05-09).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…-release guard

Two scheduler additions surfaced by the NeMo-RL 2-pipeline runs:

1. _gather_resize_tolerate_dead — when resize_infer is dispatched to a
   coordinator that has already GC'd (pipeline finished and exited
   between scheduler decisions), the original code propagates the
   ActorDiedError and unwinds the entire scheduler tick, killing the
   surviving sibling pipeline. Catch the dead-coordinator case and
   auto-unregister the dead pipeline so the tick can continue.

2. The auto-unregister path now skips when (a) the pipeline is already
   absent from the registry (graceful unregister beat it to it) or
   (b) the pipeline has a pending_planned_release_request in flight
   (the graceful await_release_gpus path is running and must not be
   stomped). This avoids the v74 → v75 race where the launcher's
   ray.get(orchestrator.unregister_pipeline.remote(pid)) and the
   scheduler's auto-unregister collided.

Refs: debug_log.md #50 (v45), #66 (v75).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
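The tolerate-dead gather plus the two skip conditions can be sketched as below. `RuntimeError` stands in for `ray.exceptions.ActorDiedError`, and the registry is a plain dict; the real scheduler state is richer:

```python
def gather_resize_tolerate_dead(registry, pipeline_id, call):
    """Dispatch a resize RPC; on a dead coordinator, auto-unregister the
    pipeline instead of unwinding the whole scheduler tick."""
    try:
        return call()
    except RuntimeError:  # stand-in for ActorDiedError
        state = registry.get(pipeline_id)
        if state is None:
            return None  # graceful unregister beat us to it
        if state.get("pending_planned_release_request"):
            return None  # graceful await_release_gpus in flight; don't stomp it
        del registry[pipeline_id]  # auto-unregister the dead pipeline
        return None
```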
…t-loop cleanup

The omnibus NeMoRLFullFinetunePipeline + NemoRLRLixHooks update that ties
NeMo RL's grpo training loop into the RLix scheduler / coordinator /
ATC contract. Each subchange below maps to one or more debug-log entries.

NemoRLRLixHooks:
- on_trajectory_collector_created — first GENERATION demand request
  to the scheduler the moment the ATC actor exists; sets
  step_target_estimate=1 so the planner's gap-ratio doesn't skip the
  request (debug #27).
- before_weight_sync — new hook required by the NeMo-RL split of the
  weight_sync block; triggers _before_weight_sync on the pipeline so the
  CPU bucket cache is built BEFORE policy.offload_after_refit empties
  param.data storage (debug #35).
- after_training — now drives the post-train half only: sync active
  base weights through the coordinator + publish the new weight version
  to ATC.

NemoRLFullFinetunePipeline (selected highlights):
- _make_rlix_virtual_cluster / _allocate_shared_pg — colocated-friendly
  PG allocation; one bundle per GPU; multi-pipeline shares the named PG
  via ray.util.get_placement_group (plan F12).
- run() now wraps async_grpo_train in try/finally that stops the
  generation watchdog daemon, runs _await_release_actor_infer(last_step)
  (mirroring ROLL's run() epilogue, scheduler-managed shrink-to-zero),
  and best-effort ray.kill(self._trajectory_collector) so ppl1 finishing
  no longer cascades onto ppl2 (debug #65, #69).
- _start_generation_watchdog daemon polls _active_dp_ranks /
  _pre_activation_ranks every 2s and re-issues
  _request_cluster_gpus(GENERATION) when both are empty; required
  because GENERATION is a one-shot priority and ppl2's INITIALIZATION
  preemption otherwise starves ppl1.infer (debug #32).
- _push_active_dp_ranks_to_collector helper — every _expand_workers /
  _shrink_workers calls AsyncTrajectoryCollector.set_active_dp_ranks
  (debug #33) so ATC's pickled snapshot of _active_dp_ranks stays in
  sync with the pipeline's.
- _prewarm_inference_ranks — per-rank wake_up_partial+sleep_partial
  cycle at init Phase 2 so a rank's first scheduler-driven wake is
  never the first CUDA-pool wake on that physical GPU after a sibling
  Megatron train (debug #68).
- _before_weight_sync / _after_training split — bucket cache build moved
  to before_weight_sync (debug #35); after_training keeps
  publish_weight_version. _cache_ready_step semantics flipped to
  step + 1 so ATC's _calculate_target_weights advances correctly
  (debug #36). _publish_weight_version clamps to max(_, 0) to avoid
  the bootstrap-time -1 deadlock with ATC initial_weight_version=0
  (debug #31).
- _await_release_actor_infer — mirrors ROLL's planned shrink-to-zero
  (scheduler.await_release_gpus.remote) so ppl1 cleanup is
  scheduler-managed rather than relying on actor-died fallback
  (debug #65).
- wait_for_first_after_training / signal_pair_setup_complete /
  arm_pair_setup_barrier — pair-init barrier so the launcher admits
  ppl2 only after ppl1 has reached its first _after_training step
  (debug #44, step-boundary admission).
- _read_vllm_sleep_level + plumbing — sleep_level is now config-driven
  from actor_infer.strategy_args.strategy_config.sleep_level (debug
  #58 v51 / v52).

Net effect: drives the scheduler → coordinator → vLLM/Megatron contract
for the partial-overlap dp=2 smoke; without these the 2-pipeline run
never reaches step 0.

Refs: debug_log.md #21, #22, #26, #27, #31, #32, #33, #34, #35, #36,
#37, #44, #58, #65, #67, #68, #69.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
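The generation watchdog described above can be sketched as a small polling daemon (hypothetical signature; the real hook calls `_request_cluster_gpus(GENERATION)` and reads `_active_dp_ranks` / `_pre_activation_ranks` off the pipeline):

```python
import threading

def start_generation_watchdog(get_active, get_pre_activation, request_gpus,
                              poll_s=2.0, stop=None):
    """Re-issue the one-shot GENERATION demand whenever both rank sets are
    empty, so a sibling pipeline's INITIALIZATION preemption can't starve us."""
    stop = stop or threading.Event()

    def loop():
        while not stop.is_set():
            if not get_active() and not get_pre_activation():
                request_gpus("GENERATION")
            stop.wait(poll_s)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return stop, t
```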
…mission

Driver script that orchestrates ≥2 RLix NeMoRLFullFinetunePipeline actors
end-to-end:

- rlix.init → orchestrator handle.
- register_nemo_rl_pipeline per pipeline → (pipeline_id, ray_namespace,
  scheduler).
- CoordinatorActor.options(namespace=...).remote(...) per pipeline.
- coordinator.create_pipeline_actor.remote(pipeline_config).
- pipeline_actor.run.remote(); ray.wait per-pipeline so one finishing
  early doesn't kill the launcher.
- Step-boundary admission: rather than wall-clock time.sleep(
  admit_delay_s), wait for pipeline_actor.wait_for_first_after_training
  before admitting the next pipeline. Fall back to wall-clock sleep
  on timeout. Eliminates the cross-ppl init OOM race documented in
  v37 / v38.
- Graceful unregister at run-completion:
  ray.get(orchestrator.unregister_pipeline.remote(pid)) so the
  scheduler's auto-unregister path doesn't fire on a still-active
  sibling.
- runtime_env.env_vars wired with verified Ray thread caps (six
  RAY_*_thread_num keys binary-verified against the Ray C extension)
  plus misc thread caps (OMP / MKL / RAYON / OPENBLAS /
  TOKENIZERS_PARALLELISM / TORCH_NCCL_ENABLE_MONITORING / ...) to keep
  the dual-pipeline worker fleet under cgroup pids.max=3840.

Refs: implement_log.md Step 6 / Step ?+13 / Step ?+15 / Step ?+24;
debug_log.md #44, #53, #66.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
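The generic thread caps mentioned above would be wired roughly like this. Only the widely known env vars are shown; the six `RAY_*_thread_num` keys are not named in the commit text and are intentionally not guessed here:

```python
# Illustrative subset of the runtime_env thread caps described above.
thread_caps = {
    "OMP_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
    "RAYON_NUM_THREADS": "1",
    "TOKENIZERS_PARALLELISM": "false",
}
runtime_env = {"env_vars": dict(thread_caps)}
```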
… smoke

Two RLix-side wrapper configs for the validated 2-pipeline topology:
- ppl1: train=[0]  infer=[0,1]   (partial-overlap dp=2)
- ppl2: train=[1]  infer=[0,1]   (partial-overlap dp=2)

Each yaml carries:
- pipeline_cls = rlix.pipeline.nemo_rl_pipeline.NemoRLFullFinetunePipeline
- nemo_config_path → /workspace/RL/examples/configs/grpo_math_1B.yaml
- nemo_config_overrides — the smoke-validated ++ override list
  (max_num_steps=6, batch_size=1, max_seq_len=64, enforce_eager=true,
  gpu_memory_utilization=0.10, num_gpu_blocks_override=64, etc.)
- train_device_mapping / infer_device_mapping used by the rlix scheduler
- actor_train / actor_infer schemas with strategy_name=vllm,
  sleep_level=2, offload_nccl=true, rlix_max_colocated_worker_groups=4

Refs: implement_log.md ✅✅✅ v52 / v74 / v78 milestones.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>