AV latent support for LTXVLoopingSampler and LTXVExtendSampler #472
Open
Jean J. de Jong (jjdejong) wants to merge 5 commits into Lightricks:master
…ampler

The looping and extend samplers currently reject AV latents with a ValueError, forcing users to generate audio in a separate low-sigma pass. This produces inferior results because audio is not refined jointly with video across temporal tiles.

This change removes the AV rejection guard and carries audio latents through the temporal tiling loop alongside video:

- LTXVLoopingSampler: separates AV input into video + audio, passes audio slices to each tile's sampler, accumulates audio output across tiles, and reassembles the AV NestedTensor on output.
- LTXVExtendSampler: accepts audio tile data, computes audio overlap and new-frame geometry matching the video tile structure, creates proper audio noise masks, and wraps/unwraps AV latents around all SamplerCustomAdvanced calls.
- LTXVBaseSampler / LTXVInContextSampler: accept optional audio tile, wrap into AV latent before sampling, split on output.

For stage-2 refinement (low-sigma upscale pass), the input audio data is used to initialize each tile's audio frames instead of zeros, enabling the model to refine lipsync and audio-visual coherence at higher resolution, matching the behavior of the standard two-stage workflow using SamplerCustomAdvanced directly.

Helper functions _make_av_latent_dict() and _split_av_latent_dict() handle the NestedTensor packing/unpacking with proper noise mask propagation for both modalities.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
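The tile geometry described in this commit message (audio slices that follow the video tile structure, with the overlap converted from video latent frames to audio latent frames) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `plan_tiles`, `video_to_audio_frames`, `audio_range` and `stitch` are hypothetical, the ratio of 8 audio latent frames per video latent frame is a made-up example value rather than the actual audio VAE stride, and plain Python lists stand in for latent tensors. Stitching here simply drops the overlapped head of each later tile; the real samplers may handle the overlap differently.

```python
from fractions import Fraction


def plan_tiles(total_video_frames, tile_size, frame_overlap):
    """Return (start, end) video-latent ranges; neighbours share frame_overlap frames."""
    if not 0 <= frame_overlap < tile_size:
        raise ValueError("frame_overlap must be in [0, tile_size)")
    step = tile_size - frame_overlap
    tiles = []
    start = 0
    while True:
        end = min(start + tile_size, total_video_frames)
        tiles.append((start, end))
        if end >= total_video_frames:
            break
        start += step
    return tiles


def video_to_audio_frames(video_latent_frames, audio_per_video):
    """Convert a video-latent frame count or index to audio-latent frames.

    audio_per_video is an assumed ratio (audio latent frames per video
    latent frame); the real value depends on the audio VAE stride.
    """
    return round(video_latent_frames * audio_per_video)


def audio_range(video_range, audio_per_video):
    """Map a (start, end) video-latent range to the matching audio-latent range."""
    start, end = video_range
    return (
        video_to_audio_frames(start, audio_per_video),
        video_to_audio_frames(end, audio_per_video),
    )


def stitch(tile_outputs, audio_overlap):
    """Concatenate per-tile audio, dropping the overlapped head of later tiles."""
    out = list(tile_outputs[0])
    for tile in tile_outputs[1:]:
        out.extend(tile[audio_overlap:])
    return out


# Hypothetical numbers, for illustration only.
RATIO = Fraction(8, 1)
video_tiles = plan_tiles(total_video_frames=20, tile_size=8, frame_overlap=2)
audio_tiles = [audio_range(t, RATIO) for t in video_tiles]
audio_overlap = video_to_audio_frames(2, RATIO)

# Stand-in "audio latent": one integer per audio latent frame.
timeline = list(range(video_to_audio_frames(20, RATIO)))
per_tile = [timeline[a:b] for a, b in audio_tiles]
rebuilt = stitch(per_tile, audio_overlap)
```

Using a `Fraction` for the ratio keeps the conversion exact when the ratio is not an integer, so tile boundaries computed from absolute indices do not drift across many tiles.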
- Two-pass I2V looping workflow (single-tile and 30s 3-tile variants) with reference image conditioning at tile boundaries
- 30s variant adds MultiPromptProvider for per-tile prompt variation and RepeatImageBatch for guiding images at transitions
- V2V Detailer doc with Strix Halo OOM prevention and arbitrary-length video handling notes
- Python generator script for the two-pass workflow

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summary
- … LTXVLoopingSampler and LTXVExtendSampler (and the LTXVBaseSampler / LTXVInContextSampler building blocks), removing the ValueError guard that previously rejected AV latents.
- … NestedTensor on output.
- … SamplerCustomAdvanced workflow.
- … MultiPromptProvider) and a V2V detailer doc.

Implementation notes
- _make_av_latent_dict() / _split_av_latent_dict() in easy_samplers.py handle NestedTensor packing/unpacking with proper noise-mask propagation for both modalities.
- frame_overlap (expressed in video latent frames) is converted via the audio VAE stride before audio slicing.

Motivation
Previously, the looping and extend samplers raised "ValueError: LoopingSampler currently does not support Audio Visual latents.", forcing users to generate audio in a separate low-sigma pass on top of a video-only result. That workaround produces inferior lipsync because audio is never refined jointly with video across temporal tiles. This change makes joint AV generation possible in long-form clips.

Testing status
- LTXVLoopingSampler AV path: confirmed working end-to-end. Audio is generated jointly with video and stays synchronised across tile boundaries; verified with the included two-pass workflow.
- LTXVExtendSampler AV path: implemented but not fully validated. The extend pass runs without errors when handed a source AV latent, but joint AV continuity across the extend boundary has not been rigorously compared against the workaround pipeline.
- optional_negative_index_strength: wired through all samplers but not extensively tested. The default (1.0) preserves prior behavior; intermediate values to soften reference-image influence have not been validated on a reference set.

Test plan
- … LTX-2.3_Two_Pass_I2V_Looping.json with an AV latent and confirm decoded audio is in sync with video.
- … LTX-2.3_Two_Pass_I2V_Looping_30s.json (3 tiles) and confirm audio continuity across both tile boundaries.
- … LTXVExtendSampler and confirm audio continuation is coherent across the overlap region.
- … optional_negative_index_strength across {0.0, 0.5, 1.0} on a reference-conditioned generation to confirm monotonic influence reduction.

🤖 Generated with Claude Code
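For the optional_negative_index_strength sweep in the test plan, one possible way to make "monotonic influence reduction" checkable is sketched below. It assumes the reviewer already has some scalar influence score per strength value (for example, a similarity between the reference image and the generated frame at the tile boundary); how that score is measured is not specified in this PR, the helper name `is_monotonic_non_decreasing` is hypothetical, and the numbers in the usage lines are placeholders, not measured results. The sketch also assumes that a lower strength should give a lower or equal influence score.

```python
def is_monotonic_non_decreasing(score_by_strength, tolerance=0.0):
    """Check that the influence score never drops as the strength increases.

    score_by_strength maps a strength value (e.g. 0.0, 0.5, 1.0) to a
    user-supplied influence score. tolerance allows for small measurement
    noise between neighbouring strengths.
    """
    ordered = [score_by_strength[s] for s in sorted(score_by_strength)]
    return all(b >= a - tolerance for a, b in zip(ordered, ordered[1:]))


# Placeholder scores for illustration only; these are not results from the PR.
example_ok = {0.0: 0.10, 0.5: 0.40, 1.0: 0.90}
example_bad = {0.0: 0.10, 0.5: 0.95, 1.0: 0.90}

ok_result = is_monotonic_non_decreasing(example_ok)
bad_result = is_monotonic_non_decreasing(example_bad)
bad_with_slack = is_monotonic_non_decreasing(example_bad, tolerance=0.1)
```

A non-zero tolerance is probably needed in practice, since sampling noise between runs can produce small inversions even when the overall trend holds.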