Added support for Lingbot-VA by hyzhou404 · Pull Request #312 · NVIDIA/flashdreams

hyzhou404 · 2026-06-09T13:59:58Z

Summary

Add LingBot-VA Robotwin I2AV integration as a FlashDreams plugin
Achieves 2.3× speedup over the original repo (1.48× vs FSDP-removed baseline)
Self-contained package

copy-pr-bot · 2026-06-09T14:00:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-09T14:12:20Z

Greptile Summary

This PR adds a self-contained flashdreams-lingbot-va plugin that ports the LingBot-VA Robotwin I2AV model into the FlashDreams runner/pipeline interface, replacing FSDP-wrapped upstream inference with a native torch.compile-accelerated DiT and a custom DualChunkKVCache for autoregressive generation.

Core inference path: LingbotVARobotwinRunner drives an AR loop where each chunk runs video denoising (25 steps) then action denoising (50 steps), writing KV to a rolling-window cache only on the final (t=0) KV-commit step.
KV cache: DualChunkKVCache stores interleaved [video_KV | action_KV] slots; the temporal-ordering bug is fixed via _roll_left(), and the CFG sync bug is fixed by always running both cond and uncond forwards when network_cache_uncond exists.
Output shape bug: actions.npy is saved as (16, 320) instead of the documented (320, 16); adding .T before .numpy() on line 522 fixes it.

Confidence Score: 4/5

Safe to merge after fixing the actions.npy output shape — all other paths (KV cache, CFG sync, VAE decode) look correct.

The AR pipeline, KV cache, and CFG sync fixes are all sound. The one concrete defect is in the saved output: torch.cat(pred_actions, dim=1).flatten(1) produces shape (16, 320) while every downstream consumer would expect (320, 16). A one-line .T fixes it, but until then actions.npy is silently unusable for any policy that reads it in standard (T, action_dim) order.
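The shape issue above can be reproduced with plain Python lists standing in for tensors. This is only a sketch: the split of 320 steps into 10 chunks of 32 is an assumption (the thread only gives the totals), and `concat_time` / `transpose` are hypothetical helpers that mimic concatenating along the time axis and applying `.T`.

```python
# Hypothetical sizes: 10 chunks x 32 action steps = 320 steps, 16 action channels.
NUM_CHUNKS, STEPS_PER_CHUNK, ACTION_DIM = 10, 32, 16


def shape(matrix):
    """Return (rows, cols) of a rectangular list of lists."""
    return (len(matrix), len(matrix[0]))


def concat_time(blocks):
    """Concatenate (action_dim, steps) blocks along the time axis."""
    n_channels = len(blocks[0])
    return [sum((block[c] for block in blocks), []) for c in range(n_channels)]


def transpose(matrix):
    """Swap rows and columns, like .T on a 2-D tensor."""
    return [list(col) for col in zip(*matrix)]


# Each chunk holds ACTION_DIM rows; every value encodes its global time step.
pred_actions = [
    [
        [float(chunk * STEPS_PER_CHUNK + step) for step in range(STEPS_PER_CHUNK)]
        for _ in range(ACTION_DIM)
    ]
    for chunk in range(NUM_CHUNKS)
]

channel_major = concat_time(pred_actions)  # (action_dim, T): the layout that was saved
time_major = transpose(channel_major)      # (T, action_dim): the documented layout

print(shape(channel_major), shape(time_major))
```

A consumer that indexes `actions[t]` expecting one 16-element action per step only gets that from the transposed layout.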

integrations/lingbot_va/lingbot_va/runner.py — specifically the all_actions shape at line 522.

Important Files Changed

Filename	Overview
integrations/lingbot_va/lingbot_va/runner.py	Main AR inference runner — contains a shape bug: actions.npy is saved as (action_dim, T)=(16,320) when the documented and expected format is (T, action_dim)=(320,16). Also carries one unused import (data_seq_to_patch).
integrations/lingbot_va/lingbot_va/transformer/__init__.py	CFG sync fix correctly applied: both cond and uncond forward_video/forward_action always run when network_cache_uncond exists; CFG linear combination only applied when relevant scale > 1.0.
integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py	Temporal-ordering fix correctly implemented via _roll_left() — evicts oldest slot by shifting left before writing the newest, keeping buffer in strict temporal order.
integrations/lingbot_va/lingbot_va/transformer/impl/network.py	Batch KV write pattern is compile-friendly and correct. RoPE frequency computation and network cache initialization look sound.
integrations/lingbot_va/lingbot_va/transformer/impl/modules.py	VABlock/VASelfAttention correctly separate read-only and persist paths. forward_readonly uses cached_kv_plus_fresh without mutating cache; forward_persist returns (out, k, v) for deferred batch write.
integrations/lingbot_va/lingbot_va/_loaders.py	Model loaders and FlowMatchScheduler look correct. WanVAEStreamingWrapper maintains per-instance feat_cache so two wrappers sharing the same VAE encoder remain independent.
integrations/lingbot_va/lingbot_va/action.py	Action normalization/denormalization logic is correct. Q01/Q99 constants have 30 elements; inverse_used_action_channel_ids correctly maps used channels for expand/select round-trip.
integrations/lingbot_va/lingbot_va/constants.py	All constant tuples verified: Q01 (30 elements), Q99 (30 elements), used_action_channel_ids (16 elements). Latent dimension computations are consistent with model config.

Sequence Diagram

sequenceDiagram
    participant R as LingbotVARobotwinRunner
    participant T as LingbotVATransformer
    participant VC as DualChunkKVCache (cond)
    participant UC as DualChunkKVCache (uncond)

    R->>T: initialize_autoregressive_cache()
    T-->>R: LingbotVATransformerCache

    loop AR chunk
        loop Video denoise
            R->>T: "predict_flow(persist=False)"
            T->>VC: cached_kv_plus_fresh
            T->>UC: cached_kv_plus_fresh
        end
        R->>T: "predict_flow(persist=True)"
        T->>VC: write_primary(k, v)
        T->>UC: write_primary(k, v)
        loop Action denoise
            R->>T: "predict_action_flow(persist=False)"
            T->>VC: cached_kv_plus_fresh
            T->>UC: cached_kv_plus_fresh
        end
        R->>T: "predict_action_flow(persist=True)"
        T->>VC: write_secondary(k, v)
        T->>UC: write_secondary(k, v)
        R->>T: commit_cache_slot()
    end

    R->>R: save actions.npy, latents.pt
    R->>R: VAE decode
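
The call order in the diagram can be sketched as a small trace recorder. The step counts (25 video, 50 action) come from the summary above; treating the `persist=True` pass as an extra call after each denoise loop follows the diagram and is an assumption, and `run_chunk` / `run` are hypothetical names, not the runner's API.

```python
from collections import Counter


def run_chunk(trace, video_steps=25, action_steps=50):
    """Record the calls made for one AR chunk, in diagram order."""
    for _ in range(video_steps):
        trace.append("predict_flow(persist=False)")         # reads cache, no writes
    trace.append("predict_flow(persist=True)")              # write_primary on cond and uncond
    for _ in range(action_steps):
        trace.append("predict_action_flow(persist=False)")  # reads cache, no writes
    trace.append("predict_action_flow(persist=True)")       # write_secondary on cond and uncond
    trace.append("commit_cache_slot()")


def run(num_chunks):
    """Record the calls for a whole generation of num_chunks chunks."""
    trace = []
    for _ in range(num_chunks):
        run_chunk(trace)
    return trace


trace = run(num_chunks=2)
counts = Counter(trace)
print(counts)
```

Only three calls per chunk touch the cache, which is what keeps the read-only denoise steps free of side effects.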

Reviews (6): Last reviewed commit: "Fix CFG cache corruption and reduce VAE ..."

hyzhou404 · 2026-06-09T14:33:22Z

I previously mistakenly uploaded the pyproject.toml file for version 12.8; this has been fixed. Please ignore Greptile's description regarding the CUDA version change.

jmccaffrey-nv · 2026-06-10T06:08:54Z

/ok to test 35ae585

jmccaffrey-nv · 2026-06-10T06:11:46Z

Please add Apache-2.0 headers to the new files that you authored and are contributing. Attribute any 3rd-party OSS files.
Please see CONTRIBUTING.md and https://github.com/NVIDIA/flashdreams/blob/main/skills/maintaining-oss-state/SKILL.md

reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s
Error: Add 'SPDX-License-Identifier: Apache-2.0' (and matching SPDX-FileCopyrightText) to the first 20 lines of this file.
Error: 15 source file(s) missing inline SPDX header
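
For checking files locally before pushing, a rough stand-in for the rule in the error message could look like the sketch below. It is not the repository's reuse-lint job: `missing_spdx_tags` is a hypothetical helper, and the copyright line in the example is a placeholder.

```python
REQUIRED_TAGS = (
    "SPDX-License-Identifier: Apache-2.0",
    "SPDX-FileCopyrightText",
)


def missing_spdx_tags(text, max_lines=20):
    """Return the required tags that do not appear in the first max_lines lines."""
    head = "\n".join(text.splitlines()[:max_lines])
    return [tag for tag in REQUIRED_TAGS if tag not in head]


# Placeholder header; the real copyright holder text comes from CONTRIBUTING.md.
with_header = (
    "# SPDX-FileCopyrightText: <year> <copyright holder>\n"
    "# SPDX-License-Identifier: Apache-2.0\n"
    "\n"
    "import os\n"
)
without_header = "import os\n"

print(missing_spdx_tags(with_header))     # nothing missing
print(missing_spdx_tags(without_header))  # both tags missing
```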

hyzhou404 · 2026-06-11T05:20:19Z

Done

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Remove flash_attn stub that polluted sys.modules globally
- Inline prompt_clean to avoid depending on diffusers private API
- Fix KV cache to use left-shift eviction (aligned with official BlockKVCache)
- Keep _n_committed capped at window_slots in steady-state
- Remove duplicate _streaming_vae_half.vae.to("cpu") call

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
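
The left-shift eviction and the capped `_n_committed` counter from the commit message can be illustrated with a toy cache. `ToyRollingCache` is a hypothetical list-based stand-in, not the PR's `DualChunkKVCache`; the real cache works on preallocated tensors, so only the ordering behaviour carries over.

```python
class ToyRollingCache:
    """Fixed-size window of slots kept in strict temporal order."""

    def __init__(self, window_slots):
        self.window_slots = window_slots
        self.slots = [None] * window_slots
        self._n_committed = 0  # never exceeds window_slots

    def _roll_left(self):
        """Drop the oldest slot and free the last position."""
        self.slots = self.slots[1:] + [None]

    def commit(self, kv):
        if self._n_committed == self.window_slots:
            # Steady state: evict the oldest entry, then write the newest last.
            self._roll_left()
            self.slots[-1] = kv
        else:
            # Warm-up: fill slots from left to right.
            self.slots[self._n_committed] = kv
            self._n_committed += 1


cache = ToyRollingCache(window_slots=3)
for chunk_id in range(5):
    cache.commit(f"kv{chunk_id}")

print(cache.slots, cache._n_committed)
```

Because the newest entry always lands in the last slot, attention over the window sees chunks oldest to newest without any index remapping.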

wilsonCernWq · 2026-06-12T21:17:38Z

+            grid_id, self.config.network.dim // self.config.network.num_heads
+        ).to(noisy_latent.device)
+
+        flow_cond = self.network.forward_video(


compile_module(net) returns an OptimizedModule that only compiles forward, but WanVADiTNetwork has no forward. The hot path calls net.forward_video / net.forward_action (transformer/__init__.py:200,228), which OptimizedModule delegates straight to the eager original module. So --compile-network True (the default) does nothing, and the 2.3× speedup that the README attributes to torch.compile is likely not correct...
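
The delegation described here can be shown with a plain-Python analogy. `ToyOptimizedWrapper` below is hypothetical and is not torch's `OptimizedModule`; it only mirrors the pattern in which the wrapper intercepts the call path while every other attribute is looked up on the original object.

```python
class ToyNetwork:
    """Stand-in for a network that exposes forward_video but defines no forward."""

    def forward_video(self, x):
        return x + 1


class ToyOptimizedWrapper:
    """Hypothetical wrapper: only calls made through __call__ count as optimized."""

    def __init__(self, module):
        self._orig_mod = module
        self.optimized_calls = 0

    def __call__(self, *args, **kwargs):
        self.optimized_calls += 1
        return self._orig_mod.forward(*args, **kwargs)

    def __getattr__(self, name):
        # Reached only when normal lookup fails: return the original attribute.
        return getattr(self._orig_mod, name)


net = ToyOptimizedWrapper(ToyNetwork())
result = net.forward_video(1)  # resolved on the original object, bypassing __call__

print(result, net.optimized_calls)
```

One common way out, offered here only as a suggestion, is to compile the callables that the hot path actually invokes (torch.compile accepts a function or bound method), or to route the hot path through forward.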

Once compile actually engages, this will likely cause torch.compile to be triggered multiple times:

    def n_cached_tokens(self) -> int:
        """Number of committed tokens visible to attention."""
        return min(self._n_committed, self.window_slots) * self.slot_size

Please follow the existing kvcache.py design. We have a good solution for that already.

wilsonCernWq · 2026-06-12T21:22:51Z

+            nn.SiLU(),
+            nn.Linear(self.dim, self.dim * 6),
+        )
+        self.action_text_embedding = nn.Sequential(


This action_text_embedding seems to be never used. Is this intentional?

This is inherited from the released Lingbot-VA checkpoint. My code faithfully loads these weights to stay compatible with the pretrained checkpoint, but they don't participate in any computation. Leaving it as-is for now.

wilsonCernWq · 2026-06-12T21:30:42Z

+        raise NotImplementedError(
+            "LingBot-VA Robotwin cache initialization awaits the native DiT/VAE "
+            "port. Use --no-instantiate or CPU tests for the scaffold stage."
+        )


This does not respect the overall pipeline design

The DiffusionModel(transformer+scheduler) are built, moved to GPU, and never used. Could we still follow the intended StreamInferencePipeline.generate/finalize design (as the sibling integrations do) and drop the dead pipeline/scheduler? That restores CP/profiling/streaming-decoder and leaves a single source of truth.

- Fix uncond KV cache desync: always run uncond forward when the cache exists, regardless of guidance scale. Previously the early-return skipped write_secondary on uncond cache under default config (action_guidance_scale=1.0), leaving zeroed action KV that silently corrupted CFG from chunk 1 onward.
- Share single VAE between streaming wrappers instead of loading a duplicate, halving VAE memory. Also fixes a pre-existing device mismatch bug under --enable-offload.
- Remove unused sink_size config field and constant.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
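
The desync fix in the first bullet can be sketched with toy objects. `ToyCache`, `forward_action`, and the `bias` argument are hypothetical stand-ins for the real caches and network; the combination `uncond + scale * (cond - uncond)` is the usual classifier-free guidance form and is assumed here rather than taken from the PR's code.

```python
class ToyCache:
    """Records the action KV entries written on persist passes."""

    def __init__(self):
        self.action_kv = []

    def write_secondary(self, kv):
        self.action_kv.append(kv)


def forward_action(x, cache, persist, bias):
    """Toy forward pass; bias stands in for text conditioning."""
    out = x + bias
    if persist:
        cache.write_secondary(out)
    return out


def predict_action_flow(x, cond_cache, uncond_cache, guidance_scale, persist):
    cond = forward_action(x, cond_cache, persist, bias=1.0)
    if uncond_cache is None:
        return cond
    # Always run the uncond pass so its cache receives the same writes.
    uncond = forward_action(x, uncond_cache, persist, bias=0.0)
    if guidance_scale > 1.0:
        return uncond + guidance_scale * (cond - uncond)
    return cond  # scale <= 1.0: skip the combination, but the cache is still written


cond_cache, uncond_cache = ToyCache(), ToyCache()
out_default = predict_action_flow(1.0, cond_cache, uncond_cache, guidance_scale=1.0, persist=True)
out_guided = predict_action_flow(1.0, cond_cache, uncond_cache, guidance_scale=2.0, persist=True)

print(out_default, out_guided, len(uncond_cache.action_kv))
```

With an early return at `guidance_scale=1.0`, the uncond cache in this toy would end up with one write instead of two, which is the kind of drift the commit describes.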

wilsonCernWq

Thanks for the quick response. This is just the second part of my review that I didn't finish yesterday!

wilsonCernWq · 2026-06-13T16:31:43Z

+        q01 = self.q01_tensor()
+        q99 = self.q99_tensor()
+        denorm = (action_cpu + 1.0) / 2.0 * (q99 - q01 + 1e-6) + q01
+        return denorm[list(self.config.used_action_channel_ids)]


I think this will return something like (16, 320) as I printed?

But in the readme it says

| `actions.npy` | Predicted actions array, shape `(num_chunks × action_per_frame × frame_chunk_size, action_dim)`. |

wilsonCernWq · 2026-06-13T16:37:41Z

+
+
+def load_vae(vae_path: str, torch_dtype: torch.dtype, torch_device):
+    vae = AutoencoderKLWan.from_pretrained(vae_path, torch_dtype=torch_dtype)


I think this assumes the user has the checkpoint pre-downloaded on disk. That is useful for development, but a bit hard for users who just want to run the test. Can we default to downloading from the HF URL, and then, if the user provides --checkpoint-root, override it with a local path?

To download from a URL, it needs kwargs like subfolder="text_encoder", I believe.
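
One possible shape for that behaviour is a small resolver that chooses between a local component directory and a repo ID plus `subfolder`. This is a sketch: `resolve_component` is a hypothetical helper, and the default repo ID is the one listed for `--checkpoint-root` in the README table.

```python
import os

# Default taken from the README's --checkpoint-root row.
DEFAULT_REPO_ID = "robbyant/lingbot-va-posttrain-robotwin"


def resolve_component(component, checkpoint_root=None):
    """Return (name_or_path, extra_kwargs) for a from_pretrained-style call."""
    if checkpoint_root and os.path.isdir(checkpoint_root):
        # Local checkout: point directly at the component directory.
        return os.path.join(checkpoint_root, component), {}
    # Otherwise treat the value (or the default) as a Hub repo ID.
    return checkpoint_root or DEFAULT_REPO_ID, {"subfolder": component}


hub_name, hub_kwargs = resolve_component("vae")
local_name, local_kwargs = resolve_component("vae", checkpoint_root=os.getcwd())

print(hub_name, hub_kwargs)
print(local_name, local_kwargs)
```

A loader could then call something like `AutoencoderKLWan.from_pretrained(name, torch_dtype=torch_dtype, **kwargs)` with the returned pair.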

wilsonCernWq · 2026-06-13T16:40:04Z

+| `--checkpoint-root` | str | `robbyant/lingbot-va-posttrain-robotwin` | Local path or HuggingFace repo ID for model weights. Must contain `transformer/`, `vae/`, `text_encoder/`, `tokenizer/` subdirs. |
+| `--input-image-dir` | path | `assets/example_data/lingbot-va/robotwin` | Directory containing three observation camera PNGs (see below). |
+| `--output-dir` | path | `outputs/lingbot_va/robotwin_i2av` | Where to write `demo.mp4`, `actions.npy`, `latents.pt`, and timing JSON. |
+| `--prompt` | str | `"Grab the medium-sized white mug, rotate it, place it on the table, and hook it onto the smooth dark gray rack."` | Text prompt describing the manipulation task. Can also be a path to a `.txt` file. |


This does not seem to match the actual implementation: it seems --prompt only reads literal strings. Could you double-check?
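
If the README wording is kept, the flag handling could look roughly like the sketch below. `resolve_prompt` is a hypothetical helper, not code from the PR; it treats the value as a file only when it ends in `.txt` and the file exists, and otherwise passes the string through unchanged.

```python
import tempfile
from pathlib import Path


def resolve_prompt(value):
    """Return the prompt text, reading it from a .txt file when one is given."""
    if value.endswith(".txt") and Path(value).is_file():
        return Path(value).read_text(encoding="utf-8").strip()
    return value


# Demonstrate both input forms with a temporary prompt file.
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "prompt.txt"
    prompt_file.write_text("Grab the mug.\n", encoding="utf-8")
    from_file = resolve_prompt(str(prompt_file))

from_string = resolve_prompt("Grab the mug.")

print(from_file == from_string)
```

The alternative is simply to drop the ".txt file" sentence from the README so that it matches the current behaviour.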

wilsonCernWq · 2026-06-13T16:42:35Z

+    runner_name=PIPELINE_LINGBOT_VA_ROBOTWIN_I2AV.name,
+    description=(
+        "LingBot-VA Robotwin I2AV inference scaffold "
+        "(three-camera Robotwin config; native DiT port pending)."


Is "native DiT port pending" still correct? Maybe we can do a full pass of docstrings so that they are all consistent?

wilsonCernWq · 2026-06-13T16:46:34Z

+    --output-dir outputs/lingbot_va/robotwin_i2av \
+    --checkpoint-root /path/to/lingbot-va-posttrain-robotwin \
+    --num-chunks 10 \
+    --benchmark True


Will this integration support multi-GPU?

wilsonCernWq · 2026-06-13T16:47:18Z

+    return torch.cat([grid_id, torch.full_like(grid_id[:1], t)], dim=0)
+
+
+def data_seq_to_patch(


unused code?

wilsonCernWq · 2026-06-13T16:48:29Z

+        """Return a copy with all non-Robotwin-used channels set to zero."""
+        masked = action.clone()
+        masked[:, ~self.action_mask(device=masked.device)] = 0
+        return masked


nit: this also seems unused?

wilsonCernWq · 2026-06-13T16:49:52Z

+        q01 = self.q01_tensor(device=expanded.device)
+        q99 = self.q99_tensor(device=expanded.device)
+        expanded = (expanded - q01) / (q99 - q01 + 1e-6) * 2.0 - 1.0
+        return expanded.unsqueeze(0).unsqueeze(-1)


nit: this also seems unused?
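
For reference, the normalization quoted here and the denormalization quoted in the earlier comment are inverses of each other, which a scalar round trip makes easy to see. The quantile values below are hypothetical; the real Q01/Q99 constants hold 30 per-channel entries.

```python
EPS = 1e-6


def normalize(x, q01, q99):
    """Map [q01, q99] to roughly [-1, 1], as in the quoted normalization code."""
    return (x - q01) / (q99 - q01 + EPS) * 2.0 - 1.0


def denormalize(a, q01, q99):
    """Inverse mapping, as in the quoted denormalization code."""
    return (a + 1.0) / 2.0 * (q99 - q01 + EPS) + q01


q01, q99 = -0.5, 1.5  # hypothetical quantiles for a single channel
x = 0.25
roundtrip = denormalize(normalize(x, q01, q99), q01, q99)

print(x, roundtrip)
```

The `1e-6` term guards against a zero range when `q99 == q01`, at the cost of mapping `q99` to slightly below 1.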

liruilong940607 · 2026-06-15T05:21:25Z

Thanks for bringing lingbot-va into flashdreams.

A general question: how can we verify that the generation result is on par with the official implementation?

For other integrations we have parity-check code like this: https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check

It not only produces generation results with the original code, but also dumps runtime logs, so that anyone can reproduce the runtime speedup reported by the developer. It seems that you have done the comparison. Is it possible to organize it into reproducible code like the other integrations, and to share some visuals in this PR as evidence of quality parity? (I am not sure what the best way to verify the action output is, though.)
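
For the action output, one option would be an element-wise tolerance check between the reference and ported arrays. The sketch below is only a suggestion: `max_abs_diff` and `actions_match` are hypothetical helpers, and the `atol` value is an arbitrary placeholder rather than a threshold anyone in this thread agreed on.

```python
def max_abs_diff(ref, test):
    """Largest element-wise absolute difference between two 2-D lists."""
    if len(ref) != len(test) or any(len(r) != len(t) for r, t in zip(ref, test)):
        raise ValueError("shape mismatch between reference and test actions")
    return max(abs(x - y) for r, t in zip(ref, test) for x, y in zip(r, t))


def actions_match(ref, test, atol=1e-3):
    """True when every element differs by at most atol (placeholder tolerance)."""
    return max_abs_diff(ref, test) <= atol


reference = [[0.0, 1.0], [2.0, 3.0]]
close = [[0.0005, 1.0], [2.0, 3.0]]
far = [[0.1, 1.0], [2.0, 3.0]]

print(actions_match(reference, close), actions_match(reference, far))
```

Such a check is only meaningful once both implementations are driven with identical seeds and noise, since sampling randomness alone can exceed any fixed tolerance.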

hyzhou404 · 2026-06-15T07:46:59Z

Both the original code and my implementation exhibited some randomness that has been difficult to align. I am currently working on debugging the sources of this randomness and refactoring the code based on suggestions from Qi Wu and you. Thank you both for your careful and thorough review of my submission.

greptile-apps Bot reviewed Jun 9, 2026

hyzhou404 force-pushed the feat/lingbot-va branch 3 times, most recently from 523ba92 to 35ae585 Compare June 9, 2026 14:20

liruilong940607 requested review from liruilong940607 and wilsonCernWq June 9, 2026 17:54

hyzhou404 force-pushed the feat/lingbot-va branch from 35ae585 to d05ddeb Compare June 11, 2026 05:08

greptile-apps Bot reviewed Jun 11, 2026

Comment thread integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py

greptile-apps Bot reviewed Jun 11, 2026

Comment thread integrations/lingbot_va/lingbot_va/transformer/__init__.py Outdated

影青 and others added 2 commits June 11, 2026 15:13

Added support for Lingbot-VA

92a8017

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

hyzhou404 force-pushed the feat/lingbot-va branch from beb1238 to 755b105 Compare June 11, 2026 07:14

wilsonCernWq requested changes Jun 12, 2026

hyzhou404 requested a review from wilsonCernWq June 13, 2026 06:33

wilsonCernWq reviewed Jun 13, 2026
