
DreamSoul-AI/OBCache

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

This repository contains the official code for the ICML 2026 paper OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference. The implementation is built on NVIDIA KVPress and provides OBCache-enhanced variants of common KV cache eviction baselines. Our main contributions are:

  • We propose a formal framework based on Optimal Brain Damage (OBD) for estimating token saliency in KV cache eviction. It yields output-aware, easy-to-implement scores that go beyond standard attention-only heuristics.
  • OBCache is plug-and-play: it can be integrated into score-based eviction methods. In this repository, we instantiate it for prior attention-only baselines H2O, TOVA, SnapKV, and AdaKV.
  • On both Llama-3.1-8B and Qwen-2.5-7B, OBCache scores consistently improve the long-context performance of prior eviction baselines on RULER and LongBench.
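The plug-and-play idea in the list above can be sketched in a few lines: a score-based eviction method ranks cached tokens by a saliency score and keeps the top fraction, so changing the method only means changing the scorer. This is a toy illustration, not the repository's implementation. The function names are hypothetical, and `output_aware_score` is a simple stand-in (attention weighted by value-vector norm) chosen only to show how a non-attention-only score can change which tokens survive. The actual OBCache scores are defined in the paper and in this repository's press classes.

```python
from typing import List, Sequence


def attention_only_score(attn: Sequence[float], value_norms: Sequence[float]) -> List[float]:
    """Attention-only heuristic: saliency is the attention weight a token received."""
    return list(attn)


def output_aware_score(attn: Sequence[float], value_norms: Sequence[float]) -> List[float]:
    """Toy stand-in for an output-aware score (NOT the OBCache formula):
    attention weight scaled by the magnitude of the token's value vector."""
    return [a * v for a, v in zip(attn, value_norms)]


def evict(scores: Sequence[float], compression_ratio: float) -> List[int]:
    """Return the sorted indices of the tokens kept after eviction.

    Assumes compression_ratio is the fraction of cached tokens removed.
    """
    n_kept = int(len(scores) * (1 - compression_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_kept])


# Four cached tokens with made-up statistics, for illustration only.
attn = [0.40, 0.10, 0.30, 0.20]
value_norms = [0.1, 4.0, 1.0, 2.0]

keep_attn = evict(attention_only_score(attn, value_norms), compression_ratio=0.5)
keep_out = evict(output_aware_score(attn, value_norms), compression_ratio=0.5)
print(keep_attn, keep_out)
```

With these made-up numbers the two scorers keep disjoint token sets, because the tokens that receive the most attention happen to have small value vectors.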

Environment Setup

conda create -n obcache python=3.12
conda activate obcache

pip install -r kvpress/requirements.txt

Inference with Cache Eviction

The easiest way to use OBCache is through the KVPress text-generation pipeline. Importing kvpress registers the kv-press-text-generation pipeline with Hugging Face Transformers.

from transformers import pipeline
from kvpress import AdaKVPress, OBCacheSnapKVPress

model = "meta-llama/Llama-3.1-8B-Instruct"

pipe = pipeline(
    "kv-press-text-generation",
    model=model,
    device_map="auto",
    dtype="auto",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

context = "A long context you want to compress once.\n"
question = "A context-specific query."

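# AdaKVPress wraps a score-based press and allocates the cache budget
# adaptively across attention heads; the inner press supplies the scores.
press = AdaKVPress(
    OBCacheSnapKVPress(
        compression_ratio=0.8,  # fraction of the KV cache to evict
        pruning_score="obcache-k",
    )
)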
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
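To make the `compression_ratio` setting concrete, the sketch below computes how many cached tokens remain for a given context length. It assumes the KVPress convention that `compression_ratio` is the fraction of KV pairs removed and that the kept count is truncated to an integer. `kept_tokens` is a hypothetical helper, not part of the library. With the `AdaKVPress` wrapper the budget is redistributed across heads, so the figure is best read as an average per head rather than an exact per-head count.

```python
def kept_tokens(context_len: int, compression_ratio: float) -> int:
    """Number of cached tokens kept, assuming compression_ratio is the
    fraction of the cache that is evicted (hypothetical helper)."""
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    return int(context_len * (1 - compression_ratio))


# Budgets for a 4096-token context at the ratios reported in this README.
budgets = {ratio: kept_tokens(4096, ratio) for ratio in (0.6, 0.7, 0.8, 0.9)}
for ratio, n_kept in budgets.items():
    print(f"compression_ratio={ratio}: keep {n_kept} of 4096 tokens")
```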

Evaluation

The evaluation entry point is evaluation/evaluate.py, which is adapted from KVPress's evaluation script. To reproduce the RULER experiments in the paper, run:

bash scripts/run_ruler.sh Llama-3.1-8B-Instruct query-aware 0.8 4k
bash scripts/run_ruler.sh Qwen-2.5-7B-Instruct query-agnostic 0.7 32k

To run LongBench tasks:

bash scripts/run_longbench.sh Llama-3.1-8B-Instruct query-aware 0.8 tasks=qa,code,summ,fewshot
bash scripts/run_longbench.sh Qwen-2.5-7B-Instruct query-agnostic 0.7 tasks=qa,code,summ,synthetic

Both scripts run the baselines h2o, tova, snapkv, and adakv_snapkv, along with their OBCache-K counterparts.
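To cover several settings in one go, the RULER command can be generated for a grid of models, compression settings, and ratios. The sketch below only builds the command strings. It assumes the positional argument order shown in the examples above (model, setting, compression ratio, context length) and that the script accepts each combination, which has not been checked here. `ruler_commands` is a hypothetical helper.

```python
import shlex
from itertools import product

MODELS = ["Llama-3.1-8B-Instruct", "Qwen-2.5-7B-Instruct"]
SETTINGS = ["query-aware", "query-agnostic"]
RATIOS = ["0.6", "0.7", "0.8", "0.9"]
CONTEXT_LEN = "4k"  # taken from the first example command above


def ruler_commands() -> list:
    """Build one run_ruler.sh command line per (model, setting, ratio)."""
    return [
        shlex.join(["bash", "scripts/run_ruler.sh", model, setting, ratio, CONTEXT_LEN])
        for model, setting, ratio in product(MODELS, SETTINGS, RATIOS)
    ]


commands = ruler_commands()
for command in commands:
    print(command)
```

The printed lines can be reviewed and then pasted into a shell or written to a job file.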

Results

Our evaluations show that OBCache scores are effective when integrated into existing eviction baselines. OBCache consistently improves long-context performance across both benchmarks (RULER and LongBench), both model families (Llama-3.1 and Qwen-2.5), both compression settings (query-aware and query-agnostic), and all tested compression ratios (0.6, 0.7, 0.8, 0.9).

Acknowledgements

This codebase is built primarily on top of NVIDIA KVPress. We thank the KVPress authors for releasing a clean and extensible library for KV cache compression research, which makes these experiments much easier to develop and reproduce.

Reference

If you find our work useful or relevant to your research, please cite our paper:

@article{gu2025obcache,
      title={OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference}, 
      author={Yuzhe Gu and Xiyu Liang and Jiaojiao Zhao and Enmao Diao},
      journal={arXiv preprint arXiv:2510.07651},
      year={2025}
}

Generated from diaoenmao/RPipe