feat: Add ZigZagKVPress for dynamic per-layer KV cache compression #238
Open
oneraghavan wants to merge 2 commits into
Implements ZigZagKV (arXiv:2412.09036, COLING 2025) as a new BasePress wrapper that dynamically allocates per-layer KV cache budgets based on attention entropy (LMBA: Layer Minimum Budget to maintain Attention).

Key design:
- Two-phase approach: collect LMBA during the forward pass, then apply compression with dynamic per-layer ratios in the context manager's finally block
- Layers with diffuse attention get larger budgets; concentrated layers get smaller
- All layers padded to uniform size for HuggingFace attention mask compatibility
- Requires attn_implementation="eager" for attention weight access

Includes:
- ZigZagKVPress class in kvpress/presses/zigzag_press.py
- 8 unit tests in tests/test_zigzag_press.py
- Evaluation registry entry and evaluate.py flash_attn_3 support
- SLURM scripts for testing and benchmarking

Fixes NVIDIA#237
The initial ZigZagKVPress implementation forced attn_implementation="eager" and used a heavyweight two-phase design that stored full attention matrices and KV copies for all layers, then zero-padded layers to a uniform size. This caused two failures on Llama-3.1-8B benchmarks: RULER scored ~0% (zero-padding polluted the attention softmax) and LongBench hit CUDA OOM (duplicate storage + eager attention matrices).

Rewrite following the PyramidKVPress pattern:
- Compute LMBA from the inner press's score() output (concentration of score mass), so SnapKV-style window attention is used -> no eager required, flash attention compatible.
- Collect only lightweight per-layer state (a score tensor + scalar LMBA), keep KV in the cache, and defer only the global budget normalization. Eliminates the OOM.
- Physically resize each layer to its own budget via top-k (genuine per-layer cache sizes, no padding), conserving the average compression ratio.

Registry now uses zigzag_snapkv (SnapKVPress inner press) instead of the eager-only observed_attention variant; evaluate.py no longer forces eager.

Results (Llama-3.1-8B): [email protected] 69.51 (was 0.88, SnapKV 69.49), [email protected] 44.23 (was OOM, SnapKV 44.54). Now on par with SnapKV and far above knorm/observed.

Tests rewritten for the non-eager design, including a generation-after-compression check for variable per-layer cache sizes. All 9 pass.
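The commit message describes LMBA as the concentration of score mass but does not show the computation. As a rough sketch only: assume LMBA for a head is the smallest number of top-scoring tokens whose normalized scores reach a coverage threshold, averaged over heads. The function names and the 0.9 threshold are hypothetical and are not taken from the PR.

```python
def min_budget_for_mass(scores, threshold=0.9):
    """Smallest count of top-scoring tokens whose normalized mass reaches `threshold`.

    `scores` is a list of non-negative per-token scores for one head.
    The threshold value is an assumption made for illustration.
    """
    total = sum(scores)
    if total <= 0:
        # No usable signal: treat the head as maximally diffuse.
        return len(scores)
    cumulative = 0.0
    for count, score in enumerate(sorted(scores, reverse=True), start=1):
        cumulative += score / total
        if cumulative >= threshold - 1e-12:  # small tolerance for float error
            return count
    return len(scores)


def layer_lmba(per_head_scores, threshold=0.9):
    """Average the per-head minimum budgets into one scalar for the layer."""
    budgets = [min_budget_for_mass(head, threshold) for head in per_head_scores]
    return sum(budgets) / len(budgets)


concentrated_head = [0.97, 0.01, 0.01, 0.01]  # one token carries most of the mass
diffuse_head = [0.25, 0.25, 0.25, 0.25]       # mass spread evenly
lmba = layer_lmba([concentrated_head, diffuse_head])
```

Under this assumption a concentrated head needs few tokens and a diffuse head needs many, which matches the stated intent that layers with diffuse attention receive larger budgets.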
Collaborator
Hi @oneraghavan, thanks for your PR. Based on the results you provided, ZigZagKVPress does not improve the accuracy of the base press (SnapKVPress). Is there any other ScorerPress where this method would provide a significant lift in accuracy?
Summary
Implements ZigZagKV (arXiv:2412.09036, COLING 2025) as a new BasePress wrapper that dynamically allocates per-layer KV cache budgets based on attention concentration.
- Adds ZigZagKVPress, which wraps any ScorerPress and uses LMBA (Layer Minimum Budget to maintain Attention) to measure each layer's uncertainty
- Per-layer budgets are allocated so that their average preserves the target compression_ratio

Design (flash-attention compatible)
The press is intentionally built to work without attn_implementation="eager":
- LMBA is computed from the inner press's score(); for SnapKV-style presses this recomputes a small observation-window attention internally, so no full attention matrices are needed.
- Only a score tensor of shape (batch, kv_heads, k_len) and a scalar LMBA are collected per layer. The global budget normalization (uncertainty_l = LMBA_l / Σ LMBA) is computed once after the forward pass.
- Each layer is physically resized to its own budget via top-k, following PyramidKVPress.

Allocation rule (Equation 6 from paper)
B_bound = b_bound_ratio * B_avg guarantees a minimum per-layer budget. The budgets sum to L * B_avg, preserving the target average compression.

Files Changed
- kvpress/presses/zigzag_press.py: ZigZagKVPress
- kvpress/__init__.py: export
- evaluation/evaluate_registry.py: registry entry zigzag_snapkv (SnapKV inner press)
- evaluation/evaluate.py: flash_attn_3 auto-detection
- tests/test_zigzag_press.py: 9 unit tests
- scripts/: SLURM helpers for testing/benchmarking

Benchmark results (Llama-3.1-8B-Instruct)
RULER (4096 context, avg string_match)
LongBench (16 tasks, average)
At 50% compression ZigZagKV is on par with SnapKV and well above knorm/observed; it degrades gracefully at 80%. (Dynamic per-layer allocation is expected to help most at higher compression / longer contexts.)
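The allocation rule in this description states two properties: every layer receives at least B_bound = b_bound_ratio * B_avg, and the budgets sum to L * B_avg. One allocation that satisfies both, using uncertainty_l = LMBA_l / Σ LMBA, is sketched below. It is an illustration under those stated constraints and may not match the exact form of Equation 6 or the PR's code; the function name is hypothetical, and rounding to integer cache sizes is omitted.

```python
def allocate_layer_budgets(lmba_per_layer, avg_budget, b_bound_ratio):
    """Split a total budget of L * avg_budget across layers.

    Each layer gets the floor b_bound = b_bound_ratio * avg_budget, and the
    remaining budget is shared in proportion to uncertainty_l = LMBA_l / sum(LMBA).
    Assumes 0 <= b_bound_ratio <= 1. Returns real-valued budgets.
    """
    num_layers = len(lmba_per_layer)
    total_lmba = sum(lmba_per_layer)
    b_bound = b_bound_ratio * avg_budget
    # Budget left over once every layer has received its guaranteed minimum.
    spare = num_layers * (avg_budget - b_bound)
    budgets = []
    for lmba in lmba_per_layer:
        if total_lmba > 0:
            uncertainty = lmba / total_lmba
        else:
            # Degenerate case: fall back to a uniform split.
            uncertainty = 1.0 / num_layers
        budgets.append(b_bound + spare * uncertainty)
    return budgets


# Two layers, average budget of 100 tokens, floor at half the average.
budgets = allocate_layer_budgets([1.0, 3.0], avg_budget=100, b_bound_ratio=0.5)
```

In this example the layer with the larger LMBA ends up with 125 tokens and the other with 75, so the average stays at 100 and neither falls below the floor of 50.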
Test plan
- No eager requirement; no CUDA OOM on LongBench

Closes #237