fix(wan): correct Wan 2.2 VAE patchify channel order (removes 2px checkerboard) #338
wenqingw-nv wants to merge 1 commit into
_patchify/_unpatchify folded the spatial patch into channels as (c ph pw), but the Wan 2.2 VAE checkpoint convs are trained for the diffusers AutoencoderKLWan convention (c pw ph) (its patchify permute is (0,1,6,4,2,3,5)). The swapped patch axes transpose every patch_size-square sub-pixel block relative to the trained weights, decoding to a fixed ~2px checkerboard on every frame. Swap ph<->pw in both functions to match.

Verified: VAE encode->decode roundtrip and a full HY-WorldPlay generation are now checkerboard-free and match diffusers' decode of the identical latent.

Affects every model using the Wan 2.2 VAE with patch_size=2.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
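To make the "transposes every sub-pixel block" claim concrete, here is a minimal index-arithmetic sketch for patch_size=2. It is not taken from vae.py: the helper names `to_channel` and `from_channel` and the order labels `"pw_ph"` / `"ph_pw"` are hypothetical, and it models only how one channel's patch_size-square block is folded into and back out of the channel axis.

```python
from itertools import product

P = 2  # patch_size used by the Wan 2.2 VAE according to the PR


def to_channel(ph, pw, order):
    """Sub-channel index of sub-pixel (ph, pw) inside one channel's block.

    "pw_ph" models the (c pw ph) layout (pw outer, ph inner);
    "ph_pw" models the old (c ph pw) layout (ph outer, pw inner).
    """
    return pw * P + ph if order == "pw_ph" else ph * P + pw


def from_channel(k, order):
    """Inverse of to_channel: recover (ph, pw) from a sub-channel index."""
    outer, inner = divmod(k, P)
    return (inner, outer) if order == "pw_ph" else (outer, inner)


# Fold with the layout the weights expect, unfold with the old layout.
mismatched = {
    (ph, pw): from_channel(to_channel(ph, pw, "pw_ph"), "ph_pw")
    for ph, pw in product(range(P), repeat=2)
}

# Fold and unfold with the same layout (what the fix restores).
matched = {
    (ph, pw): from_channel(to_channel(ph, pw, "pw_ph"), "pw_ph")
    for ph, pw in product(range(P), repeat=2)
}

for src, dst in mismatched.items():
    print(f"sub-pixel {src} lands at {dst}")
```

With mismatched layouts every sub-pixel (ph, pw) lands at (pw, ph), so the two off-diagonal pixels of each 2x2 block trade places while the diagonal ones stay put. With matching layouts the mapping is the identity.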
Greptile Summary
Fixes the
Confidence Score: 5/5
Safe to merge: the change is a two-character axis swap in a pair of inverse functions, directly fixing a visible decode artifact, with the correct ordering cross-verified against the diffusers reference implementation.
The fix is minimal, mathematically sound, and the patchify/unpatchify pair remains a proper inverse: ph sub-pixels consistently map to the height spatial axis and pw to width in both directions. The added comments cite the authoritative diffusers permute (0,1,6,4,2,3,5) as evidence. No unrelated code is touched and the callers (encode, decode) are unchanged.
No files require special attention; the single changed file has a straightforward, well-scoped correction.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Input: b c t H W"] --> B["_patchify(x, patch_size)"]
B --> C{"patch_size == 1?"}
C -- Yes --> D["Identity (return x)"]
C -- No --> E["rearrange:\nb c t (h ph) (w pw)\n→ b (c pw ph) t h w\n[pw outer, ph inner in channel block]"]
E --> F["Encoder3d / conv1 / latent z"]
F --> G["WanVAE.decode(z)"]
G --> H["Decoder3d output"]
H --> I["_unpatchify(out, patch_size)"]
I --> J{"patch_size == 1?"}
J -- Yes --> K["Identity (return out)"]
J -- No --> L["rearrange:\nb (c pw ph) t h w\n→ b c t (h ph) (w pw)\n[restore height=h×ph, width=w×pw]"]
L --> M["Output: b c t H W (clean frame)"]
style E fill:#d4edda,stroke:#28a745
style L fill:#d4edda,stroke:#28a745
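The flow above can be sketched with NumPy reshapes standing in for einops. This is an illustrative stand-in under stated assumptions, not the code in vae.py: the function names `patchify` and `unpatchify` are assumed, the encoder and decoder between them are omitted, and the permute tuple is the diffusers one quoted in the PR description.

```python
import numpy as np


def patchify(x, patch_size):
    """b c t (h ph) (w pw) -> b (c pw ph) t h w."""
    if patch_size == 1:
        return x  # identity branch from the flowchart
    p = patch_size
    b, c, t, H, W = x.shape
    x = x.reshape(b, c, t, H // p, p, W // p, p)   # b c t h ph w pw
    x = x.transpose(0, 1, 6, 4, 2, 3, 5)           # b c pw ph t h w
    return x.reshape(b, c * p * p, t, H // p, W // p)


def unpatchify(x, patch_size):
    """b (c pw ph) t h w -> b c t (h ph) (w pw)."""
    if patch_size == 1:
        return x  # identity branch from the flowchart
    p = patch_size
    b, cpp, t, h, w = x.shape
    c = cpp // (p * p)
    x = x.reshape(b, c, p, p, t, h, w)             # b c pw ph t h w
    x = x.transpose(0, 1, 4, 5, 3, 6, 2)           # b c t h ph w pw
    return x.reshape(b, c, t, h * p, w * p)


# Toy input: batch 1, 3 channels, 2 frames, 8x6 pixels.
x = np.random.default_rng(0).normal(size=(1, 3, 2, 8, 6))
z = patchify(x, 2)
print(z.shape)  # (1, 12, 2, 4, 3)

# Round trip restores the input exactly (pure index shuffling, no arithmetic).
assert np.array_equal(unpatchify(z, 2), x)

# Channel layout: index c*4 + pw*2 + ph holds the (ph, pw) sub-pixel grid.
# Channel 1 is c=0, pw=0, ph=1: odd rows, even columns of input channel 0.
assert np.array_equal(z[:, 1], x[:, 0, :, 1::2, 0::2])
```

The last assertion is the part that matters for the checkpoint: it pins down which sub-pixel each folded channel carries, which a round-trip test alone cannot detect because any consistent pair of patterns round-trips cleanly.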
/ok to test 502a7ce
Re-measured native-vs-vendor mean |Δ| with the VAE patchify channel-order fix on the native leg (native regenerated, vendor reused unchanged). Removing the ~2px decode checkerboard lowers the median mean |Δ| from 36.0 to 24.0 out of 255 (per-image 36->28, 28->23, 36->26, 21->17, 47->24). The perf/speedup table is unchanged: the fix is a zero-cost einops axis swap. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
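The quoted medians can be re-derived from the five per-image values in the message. A small sketch, assuming those integers are the rounded per-image mean |Δ| values (the variable names are illustrative):

```python
from statistics import median

# Per-image mean |Δ| on a 0-255 scale, as listed in the commit message.
before = [36, 28, 36, 21, 47]
after = [28, 23, 26, 17, 24]

print(median(before), median(after))  # 36 24

# Every image improves; show the per-image reduction.
reductions = [b - a for b, a in zip(before, after)]
print(reductions)  # [8, 5, 10, 4, 23]
```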
Problem
The flashdreams Wan 2.2 VAE produced a fixed ~2px checkerboard/stipple on every decoded frame (visible on flat regions: sky, road, walls). diffusers AutoencoderKLWan decodes the identical checkpoint + latent cleanly.
Root cause
_patchify/_unpatchify (recipes/wan/autoencoder/vae.py) folded the spatial patch into channels as (c ph pw), but the checkpoint convs are trained for the diffusers convention (c pw ph) (its patchify permute is (0,1,6,4,2,3,5)). The swapped patch axes transpose every patch_size-square sub-pixel block relative to the trained weights → checkerboard.
Fix
Swap ph↔pw in both rearrange patterns (one axis-order change per function).
Verification
Scope
Affects every model using the Wan 2.2 VAE with patch_size=2. Downstream: HY-WorldPlay sample/parity videos (PR #318, #336) should be regenerated.