
Added support for Lingbot-VA #312

Open
hyzhou404 wants to merge 3 commits into NVIDIA:main from hyzhou404:feat/lingbot-va

Conversation

@hyzhou404


Summary

  • Add LingBot-VA Robotwin I2AV integration as a FlashDreams plugin
  • Achieves 2.3× speedup over the original repo (1.48× vs FSDP-removed baseline)
  • Self-contained package

@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This PR adds a self-contained flashdreams-lingbot-va plugin that ports the LingBot-VA Robotwin I2AV model into the FlashDreams runner/pipeline interface, replacing FSDP-wrapped upstream inference with a native torch.compile-accelerated DiT and a custom DualChunkKVCache for autoregressive generation.

  • Core inference path: LingbotVARobotwinRunner drives an AR loop where each chunk runs video denoising (25 steps) then action denoising (50 steps), writing KV to a rolling-window cache only on the final (t=0) KV-commit step.
  • KV cache: DualChunkKVCache stores interleaved [video_KV | action_KV] slots; the temporal-ordering bug is fixed via _roll_left(), and the CFG sync bug is fixed by always running both cond and uncond forwards when network_cache_uncond exists.
  • Output shape bug: actions.npy is saved as (16, 320) instead of the documented (320, 16); adding .T before .numpy() on line 522 fixes it.

Confidence Score: 4/5

Safe to merge after fixing the actions.npy output shape — all other paths (KV cache, CFG sync, VAE decode) look correct.

The AR pipeline, KV cache, and CFG sync fixes are all sound. The one concrete defect is in the saved output: torch.cat(pred_actions, dim=1).flatten(1) produces shape (16, 320) while every downstream consumer would expect (320, 16). A one-line .T fixes it, but until then actions.npy is silently unusable for any policy that reads it in standard (T, action_dim) order.
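
The shape issue described above can be reproduced without the model. The following is a minimal sketch that uses NumPy in place of torch; the per-chunk layout `(action_dim, frames, per_frame)` and the sizes 10, 2, and 16 are assumptions chosen only so that the totals match the reported `(16, 320)`, not values taken from the PR.

```python
import numpy as np

# Hypothetical sizes: 10 chunks, 16 action channels, 2 frames per chunk,
# 16 actions per frame (2 * 16 = 32 time steps per chunk, 320 in total).
num_chunks, action_dim, frames, per_frame = 10, 16, 2, 16
pred_actions = [
    np.zeros((action_dim, frames, per_frame), dtype=np.float32)
    for _ in range(num_chunks)
]

# NumPy analogue of torch.cat(pred_actions, dim=1).flatten(1):
# the channel axis stays first, so the result is channel-major.
channel_major = np.concatenate(pred_actions, axis=1).reshape(action_dim, -1)

# Transposing yields the documented (T, action_dim) layout.
time_major = channel_major.T

print(channel_major.shape, time_major.shape)
```

Saving `time_major` rather than `channel_major` (for example with `np.save`) would give consumers the `(T, action_dim)` order that the README documents.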

integrations/lingbot_va/lingbot_va/runner.py — specifically the all_actions shape at line 522.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `integrations/lingbot_va/lingbot_va/runner.py` | Main AR inference runner. Contains a shape bug: actions.npy is saved as (action_dim, T)=(16,320) when the documented and expected format is (T, action_dim)=(320,16). Also carries one unused import (data_seq_to_patch). |
| `integrations/lingbot_va/lingbot_va/transformer/__init__.py` | CFG sync fix correctly applied: both cond and uncond forward_video/forward_action always run when network_cache_uncond exists; CFG linear combination only applied when relevant scale > 1.0. |
| `integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py` | Temporal-ordering fix correctly implemented via _roll_left(): evicts the oldest slot by shifting left before writing the newest, keeping the buffer in strict temporal order. |
| `integrations/lingbot_va/lingbot_va/transformer/impl/network.py` | Batch KV write pattern is compile-friendly and correct. RoPE frequency computation and network cache initialization look sound. |
| `integrations/lingbot_va/lingbot_va/transformer/impl/modules.py` | VABlock/VASelfAttention correctly separate read-only and persist paths. forward_readonly uses cached_kv_plus_fresh without mutating the cache; forward_persist returns (out, k, v) for deferred batch write. |
| `integrations/lingbot_va/lingbot_va/_loaders.py` | Model loaders and FlowMatchScheduler look correct. WanVAEStreamingWrapper maintains a per-instance feat_cache so two wrappers sharing the same VAE encoder remain independent. |
| `integrations/lingbot_va/lingbot_va/action.py` | Action normalization/denormalization logic is correct. Q01/Q99 constants have 30 elements; inverse_used_action_channel_ids correctly maps used channels for expand/select round-trip. |
| `integrations/lingbot_va/lingbot_va/constants.py` | All constant tuples verified: Q01 (30 elements), Q99 (30 elements), used_action_channel_ids (16 elements). Latent dimension computations are consistent with model config. |
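
The left-shift eviction mentioned for kvcache.py can be illustrated with a toy window of slots. This is only a sketch under assumptions: `RollingSlotCache`, its string payloads, and its `commit` method are hypothetical stand-ins, and the real DualChunkKVCache stores tensors with separate video and action regions.

```python
from typing import List, Optional


class RollingSlotCache:
    """Toy fixed-window cache; each slot stands in for one chunk's KV."""

    def __init__(self, window_slots: int, slot_size: int) -> None:
        self.window_slots = window_slots
        self.slot_size = slot_size
        self.slots: List[Optional[str]] = [None] * window_slots
        self.n_committed = 0

    def _roll_left(self) -> None:
        # Drop the oldest slot (index 0) and free the last position.
        self.slots = self.slots[1:] + [None]

    def commit(self, payload: str) -> None:
        if self.n_committed == self.window_slots:
            # Window is full: evict the oldest entry before writing.
            self._roll_left()
            self.n_committed -= 1
        self.slots[self.n_committed] = payload
        self.n_committed += 1  # stays capped at window_slots

    def n_cached_tokens(self) -> int:
        return self.n_committed * self.slot_size


cache = RollingSlotCache(window_slots=3, slot_size=4)
for chunk in ["c0", "c1", "c2", "c3", "c4"]:
    cache.commit(chunk)

print(cache.slots)  # oldest entry first, newest last
```

Because the newest chunk is always written to the right-most occupied slot, the buffer stays in temporal order, which is the property the review summary calls out.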

Sequence Diagram

sequenceDiagram
    participant R as LingbotVARobotwinRunner
    participant T as LingbotVATransformer
    participant VC as DualChunkKVCache (cond)
    participant UC as DualChunkKVCache (uncond)

    R->>T: initialize_autoregressive_cache()
    T-->>R: LingbotVATransformerCache

    loop AR chunk
        loop Video denoise
            R->>T: "predict_flow(persist=False)"
            T->>VC: cached_kv_plus_fresh
            T->>UC: cached_kv_plus_fresh
        end
        R->>T: "predict_flow(persist=True)"
        T->>VC: write_primary(k, v)
        T->>UC: write_primary(k, v)
        loop Action denoise
            R->>T: "predict_action_flow(persist=False)"
            T->>VC: cached_kv_plus_fresh
            T->>UC: cached_kv_plus_fresh
        end
        R->>T: "predict_action_flow(persist=True)"
        T->>VC: write_secondary(k, v)
        T->>UC: write_secondary(k, v)
        R->>T: commit_cache_slot()
    end

    R->>R: save actions.npy, latents.pt
    R->>R: VAE decode
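
Read as code, the call order in the diagram could look like the sketch below. `RecordingTransformer` is a hypothetical stand-in that only logs calls; the step counts (25 video, 50 action) come from the summary above, and treating the persisting call as a separate pass after the read-only steps is one reading of the diagram, not something the PR confirms.

```python
from typing import List, Tuple

VIDEO_STEPS = 25   # video denoising steps, per the summary
ACTION_STEPS = 50  # action denoising steps, per the summary


class RecordingTransformer:
    """Hypothetical stand-in that records which calls persist KV."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, bool]] = []

    def predict_flow(self, persist: bool) -> None:
        self.calls.append(("video", persist))

    def predict_action_flow(self, persist: bool) -> None:
        self.calls.append(("action", persist))

    def commit_cache_slot(self) -> None:
        self.calls.append(("commit", True))


def run_chunk(model: RecordingTransformer) -> None:
    # Read-only denoising steps attend to cached KV without writing it.
    for _ in range(VIDEO_STEPS):
        model.predict_flow(persist=False)
    model.predict_flow(persist=True)          # writes video KV once
    for _ in range(ACTION_STEPS):
        model.predict_action_flow(persist=False)
    model.predict_action_flow(persist=True)   # writes action KV once
    model.commit_cache_slot()                 # slot becomes visible


model = RecordingTransformer()
for _ in range(2):  # two AR chunks
    run_chunk(model)

persist_writes = [c for c in model.calls if c[1] and c[0] != "commit"]
print(len(model.calls), len(persist_writes))
```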

Reviews (6): Last reviewed commit: "Fix CFG cache corruption and reduce VAE ..."

Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py
Comment thread integrations/lingbot_va/lingbot_va/runner.py
Comment thread flashdreams/pyproject.toml Outdated
@hyzhou404 hyzhou404 force-pushed the feat/lingbot-va branch 3 times, most recently from 523ba92 to 35ae585 June 9, 2026 14:20
@hyzhou404

Author

I mistakenly uploaded the pyproject.toml file for CUDA 12.8 earlier; this has been fixed. Please ignore Greptile's description of the CUDA version change.

@jmccaffrey-nv

Collaborator

/ok to test 35ae585

@jmccaffrey-nv

Collaborator

Please add Apache-2.0 headers to the new files that you authored and are contributing. Attribute any 3rd-party OSS files.
Please see CONTRIBUTING.md and https://github.com/NVIDIA/flashdreams/blob/main/skills/maintaining-oss-state/SKILL.md

reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s
Error: Add 'SPDX-License-Identifier: Apache-2.0' (and matching SPDX-FileCopyrightText) to the first 20 lines of this file.
Error: 15 source file(s) missing inline SPDX header

Comment thread integrations/lingbot_va/lingbot_va/transformer/impl/kvcache.py
@hyzhou404

Author

> Please add Apache-2.0 headers to the new files that you authored and are contributing. Attribute any 3rd-party OSS files.
> Please see CONTRIBUTING.md and https://github.com/NVIDIA/flashdreams/blob/main/skills/maintaining-oss-state/SKILL.md
>
> reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s
> Error: Add 'SPDX-License-Identifier: Apache-2.0' (and matching SPDX-FileCopyrightText) to the first 20 lines of this file.
> Error: 15 source file(s) missing inline SPDX header

Done

Comment thread integrations/lingbot_va/lingbot_va/transformer/__init__.py Outdated
影青 and others added 2 commits June 11, 2026 15:13
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Remove flash_attn stub that polluted sys.modules globally
- Inline prompt_clean to avoid depending on diffusers private API
- Fix KV cache to use left-shift eviction (aligned with official BlockKVCache)
- Keep _n_committed capped at window_slots in steady-state
- Remove duplicate _streaming_vae_half.vae.to("cpu") call

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
grid_id, self.config.network.dim // self.config.network.num_heads
).to(noisy_latent.device)

flow_cond = self.network.forward_video(

Collaborator

compile_module(net) returns an OptimizedModule that only compiles forward, but WanVADiTNetwork has no forward. The hot path calls net.forward_video / net.forward_action (transformer/__init__.py:200,228), which OptimizedModule delegates straight to the eager original module. So --compile-network True (the default) does nothing, and the 2.3× speedup that the README attributes to torch.compile is likely not correct.
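
The delegation described in this comment can be shown with a pure-Python stand-in. `ForwardOnlyWrapper` below is hypothetical and is not torch's OptimizedModule; it only mimics the two behaviors the comment relies on: the wrapper handles `forward` itself, and any attribute it lacks is looked up on the original module.

```python
class Network:
    """Toy module with no forward(), only named entry points."""

    def forward_video(self, x: int) -> str:
        return f"eager video {x}"


class ForwardOnlyWrapper:
    """Hypothetical stand-in for a wrapper that optimizes only forward()."""

    def __init__(self, orig: Network) -> None:
        self._orig_mod = orig

    def forward(self) -> str:
        # The only entry point the wrapper itself handles.
        return "optimized forward"

    def __getattr__(self, name: str):
        # Reached only for attributes missing on the wrapper, so
        # forward_video falls through to the original, eager module.
        return getattr(self._orig_mod, name)


net = Network()
wrapped = ForwardOnlyWrapper(net)

result = wrapped.forward_video(1)
is_original = wrapped.forward_video.__func__ is Network.forward_video
print(result, is_original)
```

In PyTorch, `torch.compile` also accepts plain callables, so compiling the entry-point methods themselves is one possible remedy; whether that fits FlashDreams' `compile_module` helper is not established in this thread.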

Collaborator

Once compile actually engages, this will likely cause torch.compile to be triggered multiple times.

    def n_cached_tokens(self) -> int:
        """Number of committed tokens visible to attention."""
        return min(self._n_committed, self.window_slots) * self.slot_size

Please follow the existing kvcache.py design. We have a good solution for that already.

nn.SiLU(),
nn.Linear(self.dim, self.dim * 6),
)
self.action_text_embedding = nn.Sequential(

Collaborator

This action_text_embedding seems to be never used. Is this intentional?

Author

This is inherited from the released Lingbot-VA checkpoint. My code faithfully loads these weights to stay compatible with the pretrained checkpoint, but they don't participate in any computation. Leaving it as-is for now.

raise NotImplementedError(
"LingBot-VA Robotwin cache initialization awaits the native DiT/VAE "
"port. Use --no-instantiate or CPU tests for the scaffold stage."
)

Collaborator

This does not respect the overall pipeline design.

The DiffusionModel (transformer + scheduler) is built, moved to GPU, and never used. Could we still follow the intended StreamInferencePipeline.generate/finalize design (as the sibling integrations do) and drop the dead pipeline/scheduler? That restores CP/profiling/streaming-decoder and leaves a single source of truth.

Comment thread integrations/lingbot_va/lingbot_va/runner.py
Comment thread integrations/lingbot_va/lingbot_va/pipeline.py Outdated
Comment thread integrations/lingbot_va/lingbot_va/runner.py Outdated
@hyzhou404 hyzhou404 requested a review from wilsonCernWq June 13, 2026 06:33
- Fix uncond KV cache desync: always run uncond forward when the cache
  exists, regardless of guidance scale. Previously the early-return
  skipped write_secondary on uncond cache under default config
  (action_guidance_scale=1.0), leaving zeroed action KV that silently
  corrupted CFG from chunk 1 onward.
- Share single VAE between streaming wrappers instead of loading a
  duplicate, halving VAE memory. Also fixes a pre-existing device
  mismatch bug under --enable-offload.
- Remove unused sink_size config field and constant.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
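
The desync fix in the commit message above comes down to control flow: the unconditional forward must run whenever its cache exists, and only the final combination depends on the guidance scale. The sketch below uses toy list arithmetic; `ToyCache`, `forward`, and `predict` are hypothetical names, and the combination `uncond + scale * (cond - uncond)` is the standard classifier-free guidance form, assumed here rather than taken from the PR.

```python
from typing import List, Optional


class ToyCache:
    """Counts writes so cache synchronization can be checked."""

    def __init__(self) -> None:
        self.writes = 0


def forward(x: List[float], cache: ToyCache, bias: float) -> List[float]:
    cache.writes += 1  # every forward pass persists KV in this toy
    return [v + bias for v in x]


def predict(
    x: List[float],
    cond_cache: ToyCache,
    uncond_cache: Optional[ToyCache],
    scale: float,
) -> List[float]:
    flow_cond = forward(x, cond_cache, bias=1.0)
    if uncond_cache is None:
        return flow_cond
    # Always run uncond so its cache never falls behind the cond cache.
    flow_uncond = forward(x, uncond_cache, bias=0.0)
    if scale > 1.0:
        return [u + scale * (c - u) for c, u in zip(flow_cond, flow_uncond)]
    return flow_cond  # scale <= 1.0: guidance is skipped, caches stay in sync


cond, uncond = ToyCache(), ToyCache()
no_guidance = predict([0.0], cond, uncond, scale=1.0)
guided = predict([0.0], cond, uncond, scale=2.0)
print(no_guidance, guided, cond.writes, uncond.writes)
```

An early return placed before the unconditional forward would leave `uncond.writes` behind `cond.writes`, which is the failure mode the commit describes.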

@wilsonCernWq wilsonCernWq left a comment

Collaborator

Thanks for the quick response. This is just the second part of my review, which I didn't finish yesterday!

q01 = self.q01_tensor()
q99 = self.q99_tensor()
denorm = (action_cpu + 1.0) / 2.0 * (q99 - q01 + 1e-6) + q01
return denorm[list(self.config.used_action_channel_ids)]

Collaborator

I think this will return something like (16, 320) as I printed?

But in the readme it says

| `actions.npy` | Predicted actions array, shape `(num_chunks × action_per_frame × frame_chunk_size, action_dim)`. |



def load_vae(vae_path: str, torch_dtype: torch.dtype, torch_device):
vae = AutoencoderKLWan.from_pretrained(vae_path, torch_dtype=torch_dtype)

Collaborator

I think this assumes the user has the checkpoint pre-downloaded on disk. That is useful for development, but a bit hard for users who just want to run the test. Can we default to downloading from the HF repo, and override it with a local path if the user provides --checkpoint-root?

To download from the Hub, it needs kwargs like subfolder="text_encoder", I believe.
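
One way to express the suggested default is a small helper that builds the keyword arguments for `from_pretrained`. This is a sketch, not the PR's loader: `from_pretrained_kwargs` is a hypothetical name, and the default repo ID is the one listed in the README table for `--checkpoint-root`.

```python
from typing import Dict, Optional

# Default taken from the README's --checkpoint-root row.
DEFAULT_REPO_ID = "robbyant/lingbot-va-posttrain-robotwin"


def from_pretrained_kwargs(
    component: str, checkpoint_root: Optional[str] = None
) -> Dict[str, str]:
    """Build from_pretrained kwargs for one component subdirectory.

    `component` is a subdirectory such as "vae" or "text_encoder".
    A user-supplied checkpoint_root (local path or repo ID) overrides
    the default Hub repo.
    """
    root = checkpoint_root or DEFAULT_REPO_ID
    return {"pretrained_model_name_or_path": root, "subfolder": component}


hub_kwargs = from_pretrained_kwargs("vae")
local_kwargs = from_pretrained_kwargs("vae", "/path/to/lingbot-va-posttrain-robotwin")
print(hub_kwargs)
print(local_kwargs)
```

A loader could then call, for example, `AutoencoderKLWan.from_pretrained(**from_pretrained_kwargs("vae", checkpoint_root), torch_dtype=torch_dtype)`, since `subfolder` is accepted for both Hub repos and local directories.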

| `--checkpoint-root` | str | `robbyant/lingbot-va-posttrain-robotwin` | Local path or HuggingFace repo ID for model weights. Must contain `transformer/`, `vae/`, `text_encoder/`, `tokenizer/` subdirs. |
| `--input-image-dir` | path | `assets/example_data/lingbot-va/robotwin` | Directory containing three observation camera PNGs (see below). |
| `--output-dir` | path | `outputs/lingbot_va/robotwin_i2av` | Where to write `demo.mp4`, `actions.npy`, `latents.pt`, and timing JSON. |
| `--prompt` | str | `"Grab the medium-sized white mug, rotate it, place it on the table, and hook it onto the smooth dark gray rack."` | Text prompt describing the manipulation task. Can also be a path to a `.txt` file. |

Collaborator

This does not seem to match the actual implementation; it seems --prompt is actually read as a plain string? Could you double-check?

runner_name=PIPELINE_LINGBOT_VA_ROBOTWIN_I2AV.name,
description=(
"LingBot-VA Robotwin I2AV inference scaffold "
"(three-camera Robotwin config; native DiT port pending)."

Collaborator

Is "native DiT port pending" still correct? Maybe we can do a full pass of docstrings so that they are all consistent?

--output-dir outputs/lingbot_va/robotwin_i2av \
--checkpoint-root /path/to/lingbot-va-posttrain-robotwin \
--num-chunks 10 \
--benchmark True

Collaborator

Will this integration support multi-GPU?

return torch.cat([grid_id, torch.full_like(grid_id[:1], t)], dim=0)


def data_seq_to_patch(

Collaborator

unused code?

"""Return a copy with all non-Robotwin-used channels set to zero."""
masked = action.clone()
masked[:, ~self.action_mask(device=masked.device)] = 0
return masked

Collaborator

nit: this also seems unused?

q01 = self.q01_tensor(device=expanded.device)
q99 = self.q99_tensor(device=expanded.device)
expanded = (expanded - q01) / (q99 - q01 + 1e-6) * 2.0 - 1.0
return expanded.unsqueeze(0).unsqueeze(-1)

Collaborator

nit: this also seems unused?

@liruilong940607

liruilong940607 commented Jun 15, 2026

Collaborator

Thanks for bringing lingbot-va into flashdreams.

A general question: how can we verify that the generation result is on par with the official implementation?

For other integrations we have parity check code like this: https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check

It not only produces generation results with the original code, but also dumps runtime logs, so that anyone can reproduce the speedup reported by the developer. It seems that you have done the comparison. Is it possible to organize your comparison into reproducible code like the other integrations, and share some visuals in this PR as evidence of quality parity? (I'm not sure what the best way to verify the action output is, though.)

@hyzhou404

hyzhou404 commented Jun 15, 2026

Author

> Thanks for bringing lingbot-va into flashdreams.
>
> A general question: how can we verify that the generation result is on par with the official implementation?
>
> For other integrations we have parity check code like this: https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check
>
> It not only produces generation results with the original code, but also dumps runtime logs, so that anyone can reproduce the speedup reported by the developer. It seems that you have done the comparison. Is it possible to organize your comparison into reproducible code like the other integrations, and share some visuals in this PR as evidence of quality parity? (I'm not sure what the best way to verify the action output is, though.)

Both the original code and my implementation exhibited some randomness that has been difficult to align. I am currently working on debugging the sources of this randomness and refactoring the code based on suggestions from Qi Wu and you. Thank you both for your careful and thorough review of my submission.

4 participants