LongDA: Long-Document Data Analysis Benchmark

LongDA is a data analysis benchmark for evaluating LLM-based agents under documentation-intensive analytical workflows. Unlike existing benchmarks that assume well-specified schemas, LongDA targets real-world settings where navigating long documentation and complex data is the primary bottleneck.

📖 Overview

We manually curate 17 U.S. national surveys with their complete documentation and extract 505 analytical queries from expert-written publications. Solving these queries requires agents to:

- Retrieve and integrate key information from multiple unstructured documents (~263K tokens on average)
- Navigate long documentation including codebooks, technical reports, and user guides
- Perform multi-step computations with proper sampling weights and survey design considerations
- Write executable code to extract variables and compute results

This benchmark captures the reality of documentation-intensive data analysis where information gathering is often the dominant bottleneck.
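To illustrate why sampling weights matter in the computations listed above, here is a minimal sketch of a survey-weighted mean. The function name `weighted_mean` and the toy numbers are illustrative assumptions, not part of LongDA or any of its surveys. Real survey estimates usually also need design-based variance estimation, which this sketch leaves out.

```python
import math

def weighted_mean(values, weights):
    """Survey-weighted mean: sum(w * x) / sum(w), skipping missing (NaN) pairs."""
    pairs = [(x, w) for x, w in zip(values, weights)
             if not (math.isnan(x) or math.isnan(w))]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        raise ValueError("no valid observations with nonzero total weight")
    return sum(x * w for x, w in pairs) / total_weight

# Toy data (hypothetical): a 0/1 indicator and one sampling weight per respondent.
values = [1.0, 0.0, 1.0, 0.0]
weights = [100.0, 300.0, 100.0, 500.0]

unweighted = sum(values) / len(values)      # treats every respondent equally
weighted = weighted_mean(values, weights)   # respondents count by their weight
print(unweighted, weighted)
```

In this toy case the unweighted share is one half, while the weighted share is much lower because the respondents with a value of 0 represent more of the population. A gap of that size would fall well outside a small relative-error tolerance.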

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Yiyang-Ian-Li/LongDA
cd LongDA

# Install dependencies
pip install -r requirements.txt

Download Data

Download the complete benchmark dataset including all survey data and documentation from Hugging Face:

# Install Git LFS (required for large files)
git lfs install

# Clone the complete dataset
cd /path/to/your/workspace
git clone https://huggingface.co/datasets/EvilBench/LongDA benchmark

# Your directory should now contain:
# benchmark/
# ├── benchmark.csv
# └── [17 survey folders with data/ and docs/]

Or download programmatically:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="EvilBench/LongDA",
    repo_type="dataset",
    local_dir="./benchmark"
)

Note: The complete dataset (~1.8 GB) is required. The benchmark.csv file alone is insufficient because evaluation requires access to the raw survey data and documentation.
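After downloading, it can help to confirm that the directory matches the layout described above. The following is a minimal sketch, assuming the dataset sits in `./benchmark` and that each survey folder contains `data/` and `docs/`. The helper `check_benchmark_layout` is hypothetical and is not part of the LongDA repository.

```python
from pathlib import Path

def check_benchmark_layout(root="benchmark", expected_surveys=17):
    """Return a list of layout problems under `root` (an empty list means it looks complete)."""
    root = Path(root)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "benchmark.csv").is_file():
        problems.append("missing benchmark.csv")
    # Skip hidden folders such as .git, which a clone or download may leave behind.
    surveys = sorted(p for p in root.iterdir()
                     if p.is_dir() and not p.name.startswith("."))
    if len(surveys) != expected_surveys:
        problems.append(f"expected {expected_surveys} survey folders, found {len(surveys)}")
    for survey in surveys:
        for sub in ("data", "docs"):
            if not (survey / sub).is_dir():
                problems.append(f"{survey.name}: missing {sub}/")
    return problems

if __name__ == "__main__":
    for line in check_benchmark_layout("./benchmark") or ["layout looks complete"]:
        print(line)
```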

Run Evaluation

Configure your agent: Copy and modify the example config

cp configs/example_config.yaml configs/my_config.yaml
# Edit my_config.yaml with your API key

Run the benchmark:

python main.py --config_file configs/my_config.yaml

View results: Results are saved in results/TIMESTAMP_MODEL/
- run_summary.json: Overall metrics (match rate, token usage, runtime)
- block_metrics.json: Per-survey-source performance
- answers_progress.csv: All answers and correctness
- messages/: Detailed traces for each query
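To inspect a finished run, a small script can locate the newest results folder and report which of the files listed above are present. This is a sketch under assumptions: the helpers `latest_run_dir` and `describe_run` are hypothetical, and the JSON schema of run_summary.json is not documented here, so the summary is loaded and returned as-is without assuming any keys.

```python
import json
from pathlib import Path

# Artifact names taken from the list above.
EXPECTED = ("run_summary.json", "block_metrics.json", "answers_progress.csv", "messages")

def latest_run_dir(results_root="results"):
    """Return the most recently modified run folder, or None if there is none."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)

def describe_run(run_dir):
    """Report which expected artifacts exist and load the summary if available."""
    run_dir = Path(run_dir)
    present = {name: (run_dir / name).exists() for name in EXPECTED}
    summary = None
    summary_path = run_dir / "run_summary.json"
    if summary_path.is_file():
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
    return present, summary

if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("no runs found under results/")
    else:
        present, summary = describe_run(run)
        print(run, present)
        print(json.dumps(summary, indent=2))
```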

📊 Benchmark Statistics

- 505 queries across 17 U.S. national surveys
- 6 federal agencies, covering health, labor, economics, education, and social sciences
- 30 expert-written publications used for query extraction
- ~263K tokens of context per query on average (much longer than existing benchmarks)
- Surveys: NHANES, CPS-ASEC, GSS, NSDUH, NHIS, NSCG, NSFG, ATUS, HERD, RHFS, SDR, SSERF, STC, NTEWS, ASFIN, ASPEP, ASPP

📝 Evaluation Metrics

LongDA evaluates agents on:

- Match Rate: Proportion of queries answered within tolerance (default: 5% relative error for numbers)
- Token Efficiency: Total tokens consumed across all queries
- Runtime: Total time to complete the benchmark
- Steps: Average number of agent-tool interactions per query

Answers are validated with flexible matching for numerical values and exact matching for list structures.
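The matching rule can be sketched as follows. This is an illustration only and not the code in metric.py. It assumes that "exact matching for list structures" means lists must agree in length and order while their elements are compared with the numeric tolerance, and that a ground truth of zero requires an exact zero because relative error is undefined there. The names `answers_match` and `match_rate` are hypothetical.

```python
from numbers import Real

def is_number(x):
    """True for ints and floats, but not booleans."""
    return isinstance(x, Real) and not isinstance(x, bool)

def answers_match(pred, truth, rel_tol=0.05):
    """Compare one predicted answer to its ground truth."""
    if isinstance(truth, (list, tuple)):
        # Assumption: list structure (length and order) must match exactly.
        return (isinstance(pred, (list, tuple))
                and len(pred) == len(truth)
                and all(answers_match(p, t, rel_tol) for p, t in zip(pred, truth)))
    if is_number(pred) and is_number(truth):
        if truth == 0:
            return pred == 0  # assumption: relative error is undefined at zero
        return abs(pred - truth) <= rel_tol * abs(truth)
    return pred == truth

def match_rate(preds, truths, rel_tol=0.05):
    """Proportion of queries whose prediction matches the ground truth."""
    if not truths:
        return 0.0
    hits = sum(answers_match(p, t, rel_tol) for p, t in zip(preds, truths))
    return hits / len(truths)

# Toy example (hypothetical values): the second answer fails on list length.
preds = [10.4, [1, 2], 7.0]
truths = [10.0, [1, 2, 3], 7.0]
print(match_rate(preds, truths))
```

Here the first and third answers match and the second does not, so two of the three queries count as correct.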

🛠️ Project Structure

LongDA/
├── benchmark/              # Benchmark data and documentation
│   ├── benchmark.csv       # 505 queries with ground truth
│   └── [SURVEY]/           # Survey-specific folders
│       ├── data/           # Raw data files
│       └── docs/           # Long documentation (codebooks, guides, etc.)
├── configs/                # Model configuration templates
├── tools/                  # Custom tools for the agent framework
├── main.py                 # Main evaluation script
├── evaluate_results.py     # Post-hoc evaluation and analysis
├── metric.py               # Evaluation metrics implementation
├── my_agent.py             # LongTA agent framework
└── utils.py                # Utility functions

📚 Citation

If you use LongDA in your research, please cite:

@article{li2026longda,
  title={LongDA: Benchmarking LLM Agents for Long-Document Data Analysis},
  author={Li, Yiyang and Zhang, Zheyuan and Ma, Tianyi and Wang, Zehong and Murugesan, Keerthiram and Zhang, Chuxu and Ye, Yanfang},
  journal={arXiv preprint arXiv:2601.02598},
  year={2026}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📧 Contact

For questions or issues, please:

- Open a GitHub issue
- Contact: [email protected]

Note: This benchmark is for research purposes only. Please comply with data usage policies when using the survey data.
