feat: Add ZigZagKVPress for dynamic per-layer KV cache compression by oneraghavan · Pull Request #238 · NVIDIA/kvpress

oneraghavan · 2026-06-15T10:18:05Z

Summary

Implements ZigZagKV (arXiv:2412.09036, COLING 2025) as a new BasePress wrapper that dynamically allocates per-layer KV cache budgets based on attention concentration.

Introduces ZigZagKVPress, which wraps any ScorerPress and uses LMBA (Layer Minimum Budget to maintain Attention) to measure each layer's uncertainty.
Layers with diffuse attention (high LMBA) receive larger token budgets; layers with concentrated attention (low LMBA) receive smaller budgets.
The total budget is conserved, so the average compression ratio across layers matches compression_ratio.
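
To make the LMBA idea concrete, here is a minimal pure-Python sketch. It is not the code in this PR. It assumes LMBA is the per-head minimum number of tokens whose normalized scores reach a coverage threshold, averaged over the heads of a layer. The function names and the 0.9 threshold are hypothetical and chosen only for illustration.

```python
def minimum_budget(scores, threshold=0.9):
    """Smallest number of tokens whose normalized scores sum to >= threshold.

    `threshold` is an assumed coverage level, not a value taken from the PR.
    """
    total = sum(scores)
    if total <= 0:
        # Degenerate case: no score mass, so every token is needed.
        return len(scores)
    covered = 0.0
    # Walk tokens from highest to lowest score, accumulating normalized mass.
    for k, s in enumerate(sorted(scores, reverse=True), start=1):
        covered += s / total
        if covered >= threshold - 1e-12:
            return k
    return len(scores)


def layer_lmba(head_scores, threshold=0.9):
    """Average the per-head minimum budget over the heads of one layer."""
    budgets = [minimum_budget(h, threshold) for h in head_scores]
    return sum(budgets) / len(budgets)


# One head per layer, four tokens each (toy values).
concentrated = [[0.94, 0.02, 0.02, 0.02]]  # one token carries most of the mass
diffuse = [[0.25, 0.25, 0.25, 0.25]]       # mass spread evenly

print(layer_lmba(concentrated))  # low LMBA: few tokens suffice
print(layer_lmba(diffuse))       # high LMBA: all tokens are needed
```

Under these assumptions the concentrated layer needs a single token and the diffuse layer needs all four, which is the ordering the allocation relies on.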

Design (flash-attention compatible)

The press is intentionally built to work without attn_implementation="eager":

LMBA from inner-press scores — importance scores (and LMBA) come from the inner press's score(); for SnapKV-style presses this recomputes a small observation-window attention internally, so no full attention matrices are needed.
Lightweight deferral — during prefill, only a per-layer score tensor (batch, kv_heads, k_len) and a scalar LMBA are collected. The global budget normalization (uncertainty_l = LMBA_l / Σ LMBA) is computed once after the forward pass.
Genuine per-layer resize — each layer is physically gathered to its own budget via top-k (different cache sizes per layer, no padding), exactly like PyramidKVPress.
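
The per-layer resize can be pictured with a small sketch. This is an illustration in plain Python, not the tensor code in zigzag_press.py: it uses one score list per layer (a single head), and the helper names are hypothetical. Returning the kept indices in positional order is a choice of this sketch only.

```python
def keep_top_k(scores, budget):
    """Return indices of the `budget` highest-scoring positions, in positional order."""
    budget = max(0, min(budget, len(scores)))
    # Rank positions by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def resize_layers(layer_scores, layer_budgets):
    """Apply a different budget to each layer; results have different lengths."""
    return [keep_top_k(s, b) for s, b in zip(layer_scores, layer_budgets)]


layer_scores = [
    [0.10, 0.70, 0.05, 0.15],  # layer 0: concentrated scores
    [0.30, 0.20, 0.40, 0.10],  # layer 1: more diffuse scores
]
kept = resize_layers(layer_scores, [1, 3])
print(kept)  # each layer keeps its own number of positions, no padding
```

Each layer ends up with its own number of kept positions, so nothing has to be padded to a common length.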

Allocation rule (Equation 6 from paper)

uncertainty_l = LMBA_l / sum(LMBA)
budget_l = B_bound + (B_avg - B_bound) * L * uncertainty_l

B_bound = b_bound_ratio * B_avg guarantees a minimum per-layer budget. The budgets sum to L * B_avg, preserving the target average compression.
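
A short worked example of the rule above, as a sketch. The function name and the LMBA values are made up for illustration. Budgets are left fractional because the rounding step is not described in this PR text.

```python
def allocate_budgets(lmba, b_avg, b_bound_ratio):
    """Per-layer budgets: B_bound + (B_avg - B_bound) * L * uncertainty_l."""
    if not 0.0 <= b_bound_ratio <= 1.0:
        raise ValueError("b_bound_ratio must be in [0, 1]")
    total = sum(lmba)
    if total <= 0:
        raise ValueError("LMBA values must have a positive sum")
    num_layers = len(lmba)
    b_bound = b_bound_ratio * b_avg  # guaranteed per-layer floor
    budgets = []
    for value in lmba:
        uncertainty = value / total  # uncertainty_l = LMBA_l / sum(LMBA)
        budgets.append(b_bound + (b_avg - b_bound) * num_layers * uncertainty)
    return budgets


# Four layers with illustrative LMBA values, an average budget of 512 tokens.
budgets = allocate_budgets([10.0, 30.0, 20.0, 40.0], b_avg=512, b_bound_ratio=0.5)
print(budgets)
print(sum(budgets), 4 * 512)  # the total matches L * B_avg up to float error
```

Since the uncertainties sum to 1, the total is L * B_bound + (B_avg - B_bound) * L = L * B_avg, and no layer falls below B_bound.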

Files Changed

kvpress/presses/zigzag_press.py — ZigZagKVPress
kvpress/__init__.py — export
evaluation/evaluate_registry.py — registry entry zigzag_snapkv (SnapKV inner press)
evaluation/evaluate.py — flash_attn_3 auto-detection
tests/test_zigzag_press.py — 9 unit tests
scripts/ — SLURM helpers for testing/benchmarking

Benchmark results (Llama-3.1-8B-Instruct)

RULER (4096 context, avg string_match)

Method	Compression	Score
no_press	0%	95.74
knorm	50%	52.82
observed_attention	50%	54.66
snapkv	50%	69.49
zigzag_snapkv	50%	69.51
zigzag_snapkv	80%	37.71

LongBench (16 tasks, average)

Method	Compression	Score
no_press	0%	45.95
snapkv	50%	44.54
zigzag_snapkv	50%	44.23
zigzag_snapkv	80%	39.32

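At 50% compression ZigZagKV is on par with SnapKV (69.51 vs 69.49 on RULER, 44.23 vs 44.54 on LongBench) and well above knorm and observed_attention on RULER. At 80% compression it drops to 37.71 on RULER and 39.32 on LongBench. (Dynamic per-layer allocation is expected to help most at higher compression / longer contexts.)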

Test plan

9 unit tests pass on H100 (compression, average-ratio preservation, no-compression, high compression, b_bound floor, batch input, generation-after-compression with variable per-layer sizes, attribute, invalid params)
Flash-attention compatible — no eager requirement; no CUDA OOM on LongBench
RULER + LongBench benchmarks vs baselines (table above)

Closes #237

Implements ZigZagKV (arXiv:2412.09036, COLING 2025) as a new BasePress wrapper that dynamically allocates per-layer KV cache budgets based on attention entropy (LMBA, Layer Minimum Budget to maintain Attention).

Key design:
- Two-phase approach: collect LMBA during forward pass, then apply compression with dynamic per-layer ratios in the context manager's finally block
- Layers with diffuse attention get larger budgets; concentrated layers get smaller
- All layers padded to uniform size for HuggingFace attention mask compatibility
- Requires attn_implementation="eager" for attention weight access

Includes:
- ZigZagKVPress class in kvpress/presses/zigzag_press.py
- 8 unit tests in tests/test_zigzag_press.py
- Evaluation registry entry and evaluate.py flash_attn_3 support
- SLURM scripts for testing and benchmarking

Fixes NVIDIA#237

copy-pr-bot · 2026-06-15T10:18:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The initial ZigZagKVPress implementation forced attn_implementation="eager" and used a heavyweight two-phase design that stored full attention matrices and KV copies for all layers, then zero-padded layers to a uniform size. This caused two failures on Llama-3.1-8B benchmarks: RULER scored ~0% (zero-padding polluted the attention softmax) and LongBench hit CUDA OOM (duplicate storage + eager attention matrices).

Rewrite following the PyramidKVPress pattern:
- Compute LMBA from the inner press's score() output (concentration of score mass), so SnapKV-style window attention is used -> no eager required, flash attention compatible.
- Collect only lightweight per-layer state (a score tensor + scalar LMBA), keep KV in the cache, and defer only the global budget normalization. Eliminates the OOM.
- Physically resize each layer to its own budget via top-k (genuine per-layer cache sizes, no padding), conserving the average compression ratio.

Registry now uses zigzag_snapkv (SnapKVPress inner press) instead of the eager-only observed_attention variant; evaluate.py no longer forces eager.

Results (Llama-3.1-8B): RULER at 50% compression 69.51 (was 0.88, SnapKV 69.49), LongBench at 50% compression 44.23 (was OOM, SnapKV 44.54). Now on par with SnapKV and far above knorm/observed.

Tests rewritten for the non-eager design, including a generation-after-compression check for variable per-layer cache sizes. All 9 pass.
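
One plausible reading of "zero-padding polluted the attention softmax", offered as an assumption rather than something confirmed in this thread: an all-zero key gives a logit of 0 for any query, and exp(0) = 1, so unmasked padded positions still receive attention weight. The toy logits below are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


real_logits = [2.0, 1.0]                 # logits for two real (kept) keys
padded_logits = real_logits + [0.0] * 6  # six zero-padded keys, logit q.0 = 0

weights = softmax(padded_logits)
padding_mass = sum(weights[len(real_logits):])
print(f"attention mass on padding: {padding_mass:.3f}")
```

In this toy case more than a third of the attention mass lands on padded positions unless they are masked out, which is the kind of effect that genuine per-layer cache sizes avoid.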

SimJeg · 2026-06-16T11:49:59Z

Hi @oneraghavan, thanks for your PR.

Based on the results you provided, ZigZagKVPress does not improve the accuracy of the base press (SnapKVPress). Is there any other ScorerPress where this method would provide a significant lift in accuracy ?

SimJeg self-assigned this Jun 16, 2026

oneraghavan wants to merge 2 commits into NVIDIA:main from oneraghavan:feature/zigzag-kv-press
