Added support for Lingbot-VA#312
Conversation
Greptile SummaryThis PR adds a self-contained
Confidence Score: 4/5Safe to merge after fixing the actions.npy output shape — all other paths (KV cache, CFG sync, VAE decode) look correct. The AR pipeline, KV cache, and CFG sync fixes are all sound. The one concrete defect is in the saved output: torch.cat(pred_actions, dim=1).flatten(1) produces shape (16, 320) while every downstream consumer would expect (320, 16). A one-line .T fixes it, but until then actions.npy is silently unusable for any policy that reads it in standard (T, action_dim) order. integrations/lingbot_va/lingbot_va/runner.py — specifically the all_actions shape at line 522. Important Files Changed
Sequence DiagramsequenceDiagram
participant R as LingbotVARobotwinRunner
participant T as LingbotVATransformer
participant VC as DualChunkKVCache (cond)
participant UC as DualChunkKVCache (uncond)
R->>T: initialize_autoregressive_cache()
T-->>R: LingbotVATransformerCache
loop AR chunk
loop Video denoise
R->>T: "predict_flow(persist=False)"
T->>VC: cached_kv_plus_fresh
T->>UC: cached_kv_plus_fresh
end
R->>T: "predict_flow(persist=True)"
T->>VC: write_primary(k, v)
T->>UC: write_primary(k, v)
loop Action denoise
R->>T: "predict_action_flow(persist=False)"
T->>VC: cached_kv_plus_fresh
T->>UC: cached_kv_plus_fresh
end
R->>T: "predict_action_flow(persist=True)"
T->>VC: write_secondary(k, v)
T->>UC: write_secondary(k, v)
R->>T: commit_cache_slot()
end
R->>R: save actions.npy, latents.pt
R->>R: VAE decode
Reviews (6): Last reviewed commit: "Fix CFG cache corruption and reduce VAE ..." | Re-trigger Greptile |
523ba92 to
35ae585
Compare
|
I previously mistakenly uploaded the pyproject.toml file for version 12.8; this has been fixed. Please ignore Greptile's description regarding the CUDA version change. |
|
/ok to test 35ae585 |
|
Please add Apache-2.0 headers on new files which you authored and are contributing . Attribute any 3rd-party OSS files . reuse-lint / OSRB collateral sanity check (pull_request) Failing after 9s |
35ae585 to
d05ddeb
Compare
Done |
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Remove flash_attn stub that polluted sys.modules globally
- Inline prompt_clean to avoid depending on diffusers private API
- Fix KV cache to use left-shift eviction (aligned with official BlockKVCache)
- Keep _n_committed capped at window_slots in steady-state
- Remove duplicate _streaming_vae_half.vae.to("cpu") call
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
beb1238 to
755b105
Compare
| grid_id, self.config.network.dim // self.config.network.num_heads | ||
| ).to(noisy_latent.device) | ||
|
|
||
| flow_cond = self.network.forward_video( |
There was a problem hiding this comment.
compile_module(net) returns an OptimizedModule that only compiles forward, but WanVADiTNetwork has no forward — the hot path calls net.forward_video / net.forward_action (transformer/init.py:200,228), which OptimizedModule delegates straight to the eager original module. So --compile-network True (default) does nothing, and the README's torch.compile-attributed 2.3× is not correct likely...
There was a problem hiding this comment.
Once compile actually engages, this will likely cause torch compile to be triggered multiple times
def n_cached_tokens(self) -> int:
"""Number of committed tokens visible to attention."""
return min(self._n_committed, self.window_slots) * self.slot_size
Please follow the existing kvcache.py design. We have a good solution for that already.
| nn.SiLU(), | ||
| nn.Linear(self.dim, self.dim * 6), | ||
| ) | ||
| self.action_text_embedding = nn.Sequential( |
There was a problem hiding this comment.
this action_text_embedding seems never used? is this intentional?
There was a problem hiding this comment.
This is inherited from the released Lingbot-VA checkpoint. My code faithfully loads these weights to stay compatible with the pretrained checkpoint, but they don't participate in any computation. Leaving it as-is for now.
| raise NotImplementedError( | ||
| "LingBot-VA Robotwin cache initialization awaits the native DiT/VAE " | ||
| "port. Use --no-instantiate or CPU tests for the scaffold stage." | ||
| ) |
There was a problem hiding this comment.
This does not respect the overall pipeline design
The DiffusionModel(transformer+scheduler) are built, moved to GPU, and never used. Could we still follow the intended StreamInferencePipeline.generate/finalize design (as the sibling integrations do) and drop the dead pipeline/scheduler? That restores CP/profiling/streaming-decoder and leaves a single source of truth.
- Fix uncond KV cache desync: always run uncond forward when the cache exists, regardless of guidance scale. Previously the early-return skipped write_secondary on uncond cache under default config (action_guidance_scale=1.0), leaving zeroed action KV that silently corrupted CFG from chunk 1 onward. - Share single VAE between streaming wrappers instead of loading a duplicate, halving VAE memory. Also fixes a pre-existing device mismatch bug under --enable-offload. - Remove unused sink_size config field and constant. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
wilsonCernWq
left a comment
There was a problem hiding this comment.
Thanks for quick response. This is just the second part of my review that I didn't finish yesterday!
| q01 = self.q01_tensor() | ||
| q99 = self.q99_tensor() | ||
| denorm = (action_cpu + 1.0) / 2.0 * (q99 - q01 + 1e-6) + q01 | ||
| return denorm[list(self.config.used_action_channel_ids)] |
There was a problem hiding this comment.
I think this will return something like (16, 320) as I printed?
But in the readme it says
| `actions.npy` | Predicted actions array, shape `(num_chunks × action_per_frame × frame_chunk_size, action_dim)`. |
|
|
||
|
|
||
| def load_vae(vae_path: str, torch_dtype: torch.dtype, torch_device): | ||
| vae = AutoencoderKLWan.from_pretrained(vae_path, torch_dtype=torch_dtype) |
There was a problem hiding this comment.
I think this assumes user has checkpoint pre-downloaded on disk. I think this is useful for dev, but a bit hard to use for users who just want to run the test? Can we default to download from HF url? and then if user provides a --checkpoint-root, we overwrite it with a local path?
To download from URL, it needs kwargs like subfolder="text_encoder", I believe
| | `--checkpoint-root` | str | `robbyant/lingbot-va-posttrain-robotwin` | Local path or HuggingFace repo ID for model weights. Must contain `transformer/`, `vae/`, `text_encoder/`, `tokenizer/` subdirs. | | ||
| | `--input-image-dir` | path | `assets/example_data/lingbot-va/robotwin` | Directory containing three observation camera PNGs (see below). | | ||
| | `--output-dir` | path | `outputs/lingbot_va/robotwin_i2av` | Where to write `demo.mp4`, `actions.npy`, `latents.pt`, and timing JSON. | | ||
| | `--prompt` | str | `"Grab the medium-sized white mug, rotate it, place it on the table, and hook it onto the smooth dark gray rack."` | Text prompt describing the manipulation task. Can also be a path to a `.txt` file. | |
There was a problem hiding this comment.
this does not seem to match the actual implementation, it seems --prompt will actually read strings? could you double check?
| runner_name=PIPELINE_LINGBOT_VA_ROBOTWIN_I2AV.name, | ||
| description=( | ||
| "LingBot-VA Robotwin I2AV inference scaffold " | ||
| "(three-camera Robotwin config; native DiT port pending)." |
There was a problem hiding this comment.
Is "native DiT port pending" still correct? Maybe we can do a full pass of docstrings so that they are all consistent?
| --output-dir outputs/lingbot_va/robotwin_i2av \ | ||
| --checkpoint-root /path/to/lingbot-va-posttrain-robotwin \ | ||
| --num-chunks 10 \ | ||
| --benchmark True |
There was a problem hiding this comment.
Will this integration support multi-GPU?
| return torch.cat([grid_id, torch.full_like(grid_id[:1], t)], dim=0) | ||
|
|
||
|
|
||
| def data_seq_to_patch( |
| """Return a copy with all non-Robotwin-used channels set to zero.""" | ||
| masked = action.clone() | ||
| masked[:, ~self.action_mask(device=masked.device)] = 0 | ||
| return masked |
There was a problem hiding this comment.
nit: this also seems unused?
| q01 = self.q01_tensor(device=expanded.device) | ||
| q99 = self.q99_tensor(device=expanded.device) | ||
| expanded = (expanded - q01) / (q99 - q01 + 1e-6) * 2.0 - 1.0 | ||
| return expanded.unsqueeze(0).unsqueeze(-1) |
There was a problem hiding this comment.
nit: also unused seems?
|
Thanks for bringing lingbot-va into flashdreams. A general question: how to verify generation result is on-par with the official implementation? For other integrations we have the parity check code like this https://github.com/NVIDIA/flashdreams/tree/main/integrations/lingbot/tests/parity_check Which will not only produce generation results with the original code, but also dumps logs for runtime, so that anyone can reproduce the runtime speedup reported by the developer. It seems that you have done the comparison -- is it possible to organize your comparison into reproduceable code like other integrations? And share some visuals in this PR as the evidence of quality parity? (not sure what is the best way to verify the action output though) |
Both the original code and my implementation exhibited some randomness that has been difficult to align. I am currently working on debugging the sources of this randomness and refactoring the code based on suggestions from Qi Wu and you. Thank you both for your careful and thorough review of my submission. |
Summary