LongDA is a data analysis benchmark for evaluating LLM-based agents under documentation-intensive analytical workflows. Unlike existing benchmarks that assume well-specified schemas, LongDA targets real-world settings where navigating long documentation and complex data is the primary bottleneck.
We manually curate 17 U.S. national surveys with their complete documentation and extract 505 analytical queries from expert-written publications. Solving these queries requires agents to:
- Retrieve and integrate key information from multiple unstructured documents (~263K tokens on average)
- Navigate long documentation including codebooks, technical reports, and user guides
- Perform multi-step computations with proper sampling weights and survey design considerations
- Write executable code to extract variables and compute results
This benchmark captures the reality of documentation-intensive data analysis where information gathering is often the dominant bottleneck.
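To make the role of sampling weights concrete, here is a minimal sketch comparing an unweighted mean with a weighted one. The values and weights are invented for illustration and do not come from any survey in the benchmark. In practice the weight variable is named in each survey's codebook, and variance estimation may also need strata and PSU information.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i), skipping records with a missing value."""
    pairs = [(x, w) for x, w in zip(values, weights) if x is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(x * w for x, w in pairs) / total_weight


# Hypothetical respondents: 1 = has the attribute, 0 = does not.
values = [1, 0, 1, 0]
# Hypothetical sampling weights: how many people each respondent represents.
weights = [100.0, 300.0, 50.0, 550.0]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
# prints "unweighted: 0.500, weighted: 0.150"
```

The gap between the two numbers is why an answer computed without the documented weights can fall well outside the benchmark's tolerance.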
# Clone the repository
git clone https://github.com/Yiyang-Ian-Li/LongDA
cd LongDA
# Install dependencies
pip install -r requirements.txt
Download the complete benchmark dataset, including all survey data and documentation, from Hugging Face:
# Install Git LFS (required for large files)
git lfs install
# Clone the complete dataset
cd /path/to/your/workspace
git clone https://huggingface.co/datasets/EvilBench/LongDA benchmark
# Your directory should now contain:
# benchmark/
# ├── benchmark.csv
# └── [17 survey folders with data/ and docs/]
Or download programmatically:
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="EvilBench/LongDA",
repo_type="dataset",
local_dir="./benchmark"
)
Note: The complete dataset (~1.8GB) is required. The benchmark.csv file alone is insufficient because evaluation requires access to the raw survey data and documentation.
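Because evaluation depends on the full directory tree, it can help to check the download before starting a run. The sketch below is not part of the repository: the function name `check_benchmark_layout` is hypothetical, and the expected layout (a `benchmark.csv` plus 17 survey folders, each with `data/` and `docs/`) is taken from the directory comment above.

```python
from pathlib import Path


def check_benchmark_layout(root, expected_surveys=17):
    """Return a list of problems found under `root`; an empty list means the layout looks complete."""
    root = Path(root)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "benchmark.csv").is_file():
        problems.append("missing benchmark.csv")
    # Skip hidden folders such as .git, which a git clone leaves behind.
    surveys = [p for p in sorted(root.iterdir())
               if p.is_dir() and not p.name.startswith(".")]
    if len(surveys) != expected_surveys:
        problems.append(
            f"expected {expected_surveys} survey folders, found {len(surveys)}"
        )
    for survey in surveys:
        for sub in ("data", "docs"):
            if not (survey / sub).is_dir():
                problems.append(f"{survey.name}: missing {sub}/")
    return problems


if __name__ == "__main__":
    for line in check_benchmark_layout("./benchmark") or ["layout looks complete"]:
        print(line)
```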
- Configure your agent: Copy and modify the example config
cp configs/example_config.yaml configs/my_config.yaml
# Edit my_config.yaml with your API key
- Run the benchmark:
python main.py --config_file configs/my_config.yaml
- View results: Results are saved in results/TIMESTAMP_MODEL/
  - run_summary.json: Overall metrics (match rate, token usage, runtime)
  - block_metrics.json: Per-survey-source performance
  - answers_progress.csv: All answers and correctness
  - messages/: Detailed traces for each query
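For a quick look at a finished run, a sketch along these lines may be enough. It only assumes the file names listed above, since the keys inside `run_summary.json` and the columns of `answers_progress.csv` are not documented here, so it prints whatever it finds. The helper names `load_run` and `latest_run` are hypothetical and are not part of the repository.

```python
import csv
import json
from pathlib import Path


def load_run(run_dir):
    """Load the summary dict and the per-query answer rows from one results folder."""
    run_dir = Path(run_dir)
    with open(run_dir / "run_summary.json", encoding="utf-8") as f:
        summary = json.load(f)
    with open(run_dir / "answers_progress.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return summary, rows


def latest_run(results_root="results"):
    """Return the most recently modified run folder, or None if there is none."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    run = latest_run()
    if run is None:
        print("no runs found under results/")
    else:
        summary, rows = load_run(run)
        print(json.dumps(summary, indent=2))
        print(f"{len(rows)} answer rows; columns: {list(rows[0]) if rows else []}")
```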
- 505 queries across 17 U.S. national surveys
- 6 federal agencies, covering health, labor, economics, education, and social sciences
- 30 expert-written publications used for query extraction
- ~263K tokens average context per query (much longer than existing benchmarks)
- Surveys: NHANES, CPS-ASEC, GSS, NSDUH, NHIS, NSCG, NSFG, ATUS, HERD, RHFS, SDR, SSERF, STC, NTEWS, ASFIN, ASPEP, ASPP
LongDA evaluates agents on:
- Match Rate: Proportion of queries answered within tolerance (default: 5% relative error for numbers)
- Token Efficiency: Total tokens consumed across all queries
- Runtime: Total time to complete the benchmark
- Steps: Average number of agent-tool interactions per query
Answers are validated with flexible matching for numerical values and exact matching for list structures.
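The matching rule can be illustrated with a short sketch. This is not the code in `metric.py`; it is one plausible reading of "5% relative error for numbers" and "exact matching for list structures", in which lists must agree in length and order while their numeric elements are compared with the same tolerance. The handling of a zero ground truth is also an assumption.

```python
def numbers_match(predicted, expected, rel_tol=0.05):
    """True if `predicted` is within `rel_tol` relative error of `expected`."""
    if expected == 0:
        # Assumption: relative error is undefined at zero, so require equality.
        return predicted == 0
    return abs(predicted - expected) / abs(expected) <= rel_tol


def answers_match(predicted, expected, rel_tol=0.05):
    """Compare numbers with tolerance; compare lists by length, order, and element."""
    if isinstance(expected, (list, tuple)):
        return (
            isinstance(predicted, (list, tuple))
            and len(predicted) == len(expected)
            and all(answers_match(p, e, rel_tol) for p, e in zip(predicted, expected))
        )
    if isinstance(expected, (int, float)) and isinstance(predicted, (int, float)):
        return numbers_match(predicted, expected, rel_tol)
    # Anything else (for example strings) falls back to exact equality.
    return predicted == expected


print(answers_match(10.4, 10.0))         # 4% off
print(answers_match(10.6, 10.0))         # 6% off
print(answers_match([1.0], [1.0, 2.0]))  # list lengths differ
```

Under this reading, the match rate is the fraction of queries for which `answers_match` returns `True`.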
LongDA/
├── benchmark/              # Benchmark data and documentation
│   ├── benchmark.csv       # 505 queries with ground truth
│   └── [SURVEY]/           # Survey-specific folders
│       ├── data/           # Raw data files
│       └── docs/           # Long documentation (codebooks, guides, etc.)
├── configs/                # Model configuration templates
├── tools/                  # Custom tools for the agent framework
├── main.py                 # Main evaluation script
├── evaluate_results.py     # Post-hoc evaluation and analysis
├── metric.py               # Evaluation metrics implementation
├── my_agent.py             # LongTA agent framework
└── utils.py                # Utility functions
If you use LongDA in your research, please cite:
@article{li2026longda,
title={LongDA: Benchmarking LLM Agents for Long-Document Data Analysis},
author={Li, Yiyang and Zhang, Zheyuan and Ma, Tianyi and Wang, Zehong and Murugesan, Keerthiram and Zhang, Chuxu and Ye, Yanfang},
journal={arXiv preprint arXiv:2601.02598},
year={2026}
}
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please:
- Open a GitHub issue
- Contact: [email protected]
Note: This benchmark is for research purposes only. Please comply with data usage policies when using the survey data.