EEVEE

Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

EEVEE is a multi-dataset test-time prompt learning framework for LLM agents. It targets realistic task streams where inputs come from multiple datasets, domains, and evaluation formats instead of one stationary benchmark.

The core idea is to learn a router-conditioned prompt set. The router assigns each input to a specialized prompt slot, and EEVEE improves the router and prompt set together through router-prompt co-evolution.

The project website includes the video overview.

Highlights

Learns prompt specialization for heterogeneous task streams.
Uses a router to reduce cross-dataset interference between task families.
Alternates router evolution and prompt evolution under downstream validation feedback.
Supports OpenAI-compatible chat completion APIs through provider:model specs.

Framework

At inference time, EEVEE routes an input to one prompt slot and queries the target model with the selected prompt. During learning, it alternates router and prompt updates so that routing decisions and prompt quality improve under the same downstream objective.

Results Snapshot

In the paper, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, and outperforms GEPA and ACE by up to 37.2% and 48.2%. See the PDF or arXiv for the full evaluation, ablations, and analysis.

Main results on the four-benchmark suite. Scores are percentages averaged over three runs.

Target Model	Method	GPQA Diamond	Formula	TheoremQA	HumanEval	Avg.
Qwen3-4B-Instruct	Baseline	56.00	45.22	14.79	49.46	41.37
Qwen3-4B-Instruct	ACE	48.93	39.67	15.84	35.23	34.92
Qwen3-4B-Instruct	GEPA	50.84	49.83	19.62	30.62	37.73
Qwen3-4B-Instruct	EEVEE	54.55	54.55	25.27	72.63	51.75
DeepSeek-V3.2	Baseline	64.98	30.00	21.21	42.82	39.75
DeepSeek-V3.2	ACE	55.89	37.78	27.05	78.59	49.83
DeepSeek-V3.2	GEPA	41.75	60.56	31.72	89.29	55.83
DeepSeek-V3.2	EEVEE	63.08	60.55	39.84	92.82	64.07

How to Start

git clone https://github.com/Princeton-AI2-Lab/EEVEE.git
cd EEVEE

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Prepare the benchmarks and model specs following the usage guide, then run:

export OPENROUTER_API_KEY="..."
python main.py configs/demo.yaml

The usage guide also covers smoke tests, output artifacts, and repository layout.

Citation

If you use EEVEE, please cite:

@misc{xu2026eevee,
  title = {{EEVEE}: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents},
  author = {Weixian Xu and Shilong Liu and Mengdi Wang},
  year = {2026},
  eprint = {2606.11182},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2606.11182}
}

License

This project is released under the Apache License 2.0. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
assets		assets
configs		configs
docs		docs
evolve		evolve
infer		infer
prompts		prompts
tasks		tasks
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

EEVEE

Highlights

Framework

Results Snapshot

How to Start

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

EEVEE

Highlights

Framework

Results Snapshot

How to Start

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages