feat: Add ZigZagKVPress for dynamic per-layer KV cache compression #238

Open
oneraghavan wants to merge 2 commits into NVIDIA:main from oneraghavan:feature/zigzag-kv-press
Conversation

@oneraghavan oneraghavan commented Jun 15, 2026

Summary

Implements ZigZagKV (arXiv:2412.09036, COLING 2025) as a new BasePress wrapper that dynamically allocates per-layer KV cache budgets based on attention concentration.

  • Introduces ZigZagKVPress, which wraps any ScorerPress and uses LMBA (Layer Minimum Budget to maintain Attention) to measure each layer's uncertainty
  • Layers with diffuse attention (high LMBA) receive larger token budgets; layers with concentrated attention (low LMBA) receive smaller budgets
  • Total budget is conserved, so the average compression ratio matches compression_ratio

Design (flash-attention compatible)

The press is intentionally built to work without attn_implementation="eager":

  1. LMBA from inner-press scores — importance scores (and LMBA) come from the inner press's score(); for SnapKV-style presses this recomputes a small observation-window attention internally, so no full attention matrices are needed.
  2. Lightweight deferral — during prefill, only a per-layer score tensor (batch, kv_heads, k_len) and a scalar LMBA are collected. The global budget normalization (uncertainty_l = LMBA_l / Σ LMBA) is computed once after the forward pass.
  3. Genuine per-layer resize — each layer is physically gathered to its own budget via top-k (different cache sizes per layer, no padding), exactly like PyramidKVPress.
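As a rough illustration of steps 1 and 3, the sketch below derives an LMBA-style value from per-head importance scores and then selects the tokens to keep for a given budget. It is a plain-Python sketch, not the code in this PR: the function names, the `threshold` parameter, and the choice to average the per-head minimum budget over KV heads are all assumptions made for illustration.

```python
def min_budget_for_mass(scores, threshold=0.9):
    """Smallest number of top-scoring tokens whose share of the total
    score mass reaches `threshold` (hypothetical helper)."""
    total = sum(scores)
    if total <= 0:
        return len(scores)
    running = 0.0
    for count, s in enumerate(sorted(scores, reverse=True), start=1):
        running += s
        if running / total >= threshold:
            return count
    return len(scores)


def layer_lmba(head_scores, threshold=0.9):
    """Average the per-head minimum budget over the KV heads of one layer.
    Averaging over heads is an assumption of this sketch."""
    budgets = [min_budget_for_mass(h, threshold) for h in head_scores]
    return sum(budgets) / len(budgets)


def top_k_indices(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# One concentrated head and one diffuse head (toy scores).
concentrated = [8, 1, 1, 0]
diffuse = [1, 1, 1, 1]
lmba = layer_lmba([concentrated, diffuse], threshold=0.75)
kept = top_k_indices([0.1, 0.7, 0.2, 0.9], budget=2)
```

A concentrated head reaches the threshold with few tokens (low LMBA), while a diffuse head needs many (high LMBA), which is what drives the larger budget for diffuse layers.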

Allocation rule (Equation 6 from paper)

uncertainty_l = LMBA_l / sum(LMBA)
budget_l = B_bound + (B_avg - B_bound) * L * uncertainty_l

B_bound = b_bound_ratio * B_avg guarantees a minimum per-layer budget. The budgets sum to L * B_avg, preserving the target average compression.
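The allocation rule can be checked numerically with a small sketch. The function name `allocate_budgets` and the uniform fallback for an all-zero LMBA vector are assumptions for illustration; rounding to integer token counts and clamping to the sequence length are deliberately left out.

```python
def allocate_budgets(lmba, b_avg, b_bound_ratio):
    """Per-layer budgets following
    budget_l = B_bound + (B_avg - B_bound) * L * uncertainty_l."""
    if not 0.0 <= b_bound_ratio <= 1.0:
        raise ValueError("b_bound_ratio must lie in [0, 1]")
    num_layers = len(lmba)
    total = sum(lmba)
    b_bound = b_bound_ratio * b_avg
    if total <= 0:
        # Assumed fallback: no signal, so every layer gets the average budget.
        return [float(b_avg)] * num_layers
    return [
        b_bound + (b_avg - b_bound) * num_layers * (value / total)
        for value in lmba
    ]


# Two layers: the second has three times the LMBA of the first.
budgets = allocate_budgets([1, 3], b_avg=100, b_bound_ratio=0.5)
```

Here the floor is 50 tokens per layer, and the remaining 100 tokens are split 25/75 according to uncertainty, so the budgets are 75 and 125 and still sum to L * B_avg = 200.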

Files Changed

  • kvpress/presses/zigzag_press.py: ZigZagKVPress
  • kvpress/__init__.py: export
  • evaluation/evaluate_registry.py: registry entry zigzag_snapkv (SnapKV inner press)
  • evaluation/evaluate.py: flash_attn_3 auto-detection
  • tests/test_zigzag_press.py: 9 unit tests
  • scripts/: SLURM helpers for testing/benchmarking

Benchmark results (Llama-3.1-8B-Instruct)

RULER (4096 context, avg string_match)

| Method | Compression | Score |
| --- | --- | --- |
| no_press | 0% | 95.74 |
| knorm | 50% | 52.82 |
| observed_attention | 50% | 54.66 |
| snapkv | 50% | 69.49 |
| zigzag_snapkv | 50% | 69.51 |
| zigzag_snapkv | 80% | 37.71 |

LongBench (16 tasks, average)

| Method | Compression | Score |
| --- | --- | --- |
| no_press | 0% | 45.95 |
| snapkv | 50% | 44.54 |
| zigzag_snapkv | 50% | 44.23 |
| zigzag_snapkv | 80% | 39.32 |

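At 50% compression, ZigZagKV is on par with SnapKV (69.51 vs. 69.49 on RULER, 44.23 vs. 44.54 on LongBench) and well above knorm and observed_attention on RULER. At 80% compression it scores 37.71 on RULER and 39.32 on LongBench; the tables include no baseline at 80%, so this is not yet a head-to-head comparison. (Dynamic per-layer allocation is expected to help most at higher compression / longer contexts.)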

Test plan

  • 9 unit tests pass on H100 (compression, average-ratio preservation, no-compression, high compression, b_bound floor, batch input, generation-after-compression with variable per-layer sizes, attribute, invalid params)
  • Flash-attention compatible — no eager requirement; no CUDA OOM on LongBench
  • RULER + LongBench benchmarks vs baselines (table above)

Closes #237

Implements ZigZagKV (arXiv:2412.09036, COLING 2025) as a new BasePress wrapper
that dynamically allocates per-layer KV cache budgets based on attention entropy
(LMBA — Layer Minimum Budget to maintain Attention).

Key design:
- Two-phase approach: collect LMBA during forward pass, then apply compression
  with dynamic per-layer ratios in the context manager's finally block
- Layers with diffuse attention get larger budgets; concentrated layers get smaller
- All layers padded to uniform size for HuggingFace attention mask compatibility
- Requires attn_implementation="eager" for attention weight access

Includes:
- ZigZagKVPress class in kvpress/presses/zigzag_press.py
- 8 unit tests in tests/test_zigzag_press.py
- Evaluation registry entry and evaluate.py flash_attn_3 support
- SLURM scripts for testing and benchmarking

Fixes NVIDIA#237
copy-pr-bot Bot commented Jun 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The initial ZigZagKVPress implementation forced attn_implementation="eager"
and used a heavyweight two-phase design that stored full attention matrices
and KV copies for all layers, then zero-padded layers to a uniform size. This
caused two failures on Llama-3.1-8B benchmarks: RULER scored ~0% (zero-padding
polluted the attention softmax) and LongBench hit CUDA OOM (duplicate storage +
eager attention matrices).

Rewrite following the PyramidKVPress pattern:
- Compute LMBA from the inner press's score() output (concentration of score
  mass), so SnapKV-style window attention is used -> no eager required, flash
  attention compatible.
- Collect only lightweight per-layer state (a score tensor + scalar LMBA), keep
  KV in the cache, and defer only the global budget normalization. Eliminates
  the OOM.
- Physically resize each layer to its own budget via top-k (genuine per-layer
  cache sizes, no padding), conserving the average compression ratio.

Registry now uses zigzag_snapkv (SnapKVPress inner press) instead of the
eager-only observed_attention variant; evaluate.py no longer forces eager.

Results (Llama-3.1-8B) at 50% compression: RULER 69.51 (was 0.88, SnapKV 69.49),
LongBench 44.23 (was OOM, SnapKV 44.54). Now on par with SnapKV and far
above knorm/observed_attention.

Tests rewritten for the non-eager design, including a generation-after-
compression check for variable per-layer cache sizes. All 9 pass.
@SimJeg SimJeg self-assigned this Jun 16, 2026
SimJeg commented Jun 16, 2026

Collaborator

Hi @oneraghavan, thanks for your PR.

Based on the results you provided, ZigZagKVPress does not improve the accuracy of the base press (SnapKVPress). Is there another ScorerPress for which this method would provide a significant lift in accuracy?

Development

Successfully merging this pull request may close these issues.

Add ZigZagKVPress — dynamic per-layer compression based on attention entropy