The complete Kodezi Chronos Multi-Random Retrieval (MRR) benchmark contains 5,000 real-world debugging scenarios, representing the most comprehensive evaluation suite for debugging-focused language models.
This repository contains:
- 500 sample test cases (10% of full benchmark)
- Evaluation framework and metrics implementation
- Sample results and baseline performance data
- Documentation and usage guides
The complete benchmark includes:
- 500 syntax error cases
- 1,200 logic bug cases
- 800 concurrency issue cases
- 600 memory-related bug cases
- 900 API misuse cases
- 400 performance bug cases
- 600 cross-category cases
- 110 real-world repositories with full git history
  - Ranging from 1K to 2M+ lines of code
  - Multiple programming languages (Python, JavaScript, Java, Go, Rust)
  - Each with 20-200 injected bugs
  - Bugs spanning 3-12 months of development history
  - Refactoring scenarios with moved/renamed files
  - Evolution of codebases over time
- Expert-validated fixes for all 5,000 bugs
  - Multiple valid fix variations where applicable
  - Detailed fix explanations and patterns
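The per-category counts listed above can be sanity-checked against the 5,000-case total (the category keys here are shorthand labels, not identifiers from the dataset itself):

```python
# Bug-category counts as stated in the benchmark description.
categories = {
    "syntax": 500,
    "logic": 1200,
    "concurrency": 800,
    "memory": 600,
    "api_misuse": 900,
    "performance": 400,
    "cross_category": 600,
}

# The categories partition the benchmark, so they should sum to 5,000.
total = sum(categories.values())
print(total)  # 5000
```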
For academic access:
- Eligibility: University-affiliated researchers
- Process:
  - Submit a request to [email protected]
  - Include your institutional affiliation
  - Describe the intended research use
  - Sign the data use agreement
- Timeline: 2-3 weeks for approval
For industry access:
- Eligibility: Companies developing debugging tools
- Process:
  - Contact [email protected]
  - Provide company information
  - Describe your use case and expected impact
  - Execute a partnership agreement
- Timeline: 4-6 weeks for approval
The full benchmark will also be released publicly:
- Release date: Q1 2026
- Format: Public research release
- License: Apache 2.0 for code, custom license for data
Users of the full benchmark agree to:
- Use data only for research/evaluation purposes
- Not redistribute the raw dataset
- Cite the Chronos paper in publications
- Share evaluation results with the community
- Report any data quality issues found
A sample access request email:
To: [email protected]
Subject: Chronos Benchmark Access Request - [Your Institution]
Dear Chronos Team,
I am requesting access to the full Chronos MRR benchmark dataset.
Researcher Information:
- Name: [Your Name]
- Institution: [University/Company]
- Position: [Your Title]
- Email: [Institutional Email]
Research Purpose:
[Describe your intended use of the benchmark]
Expected Outcomes:
[What you plan to publish/release]
I agree to the data use terms and will cite the Chronos paper.
Best regards,
[Your Name]
While waiting for full benchmark access, you can:
- Develop your evaluation pipeline using the 500 sample cases
- Test your debugging model on representative scenarios
- Compare against baseline results provided
- Optimize retrieval strategies for scattered context
The sample dataset is designed to be representative of the full benchmark distribution.
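While building an evaluation pipeline against the sample cases, the core loop can be sketched as follows. The case schema, field names, and exact-match scoring below are illustrative assumptions, not the benchmark's actual format or official metric:

```python
# Hypothetical test cases; the real sample files may use different fields.
sample_cases = [
    {"id": "mrr-0001", "category": "logic", "expected_fix": "return a + b"},
    {"id": "mrr-0002", "category": "syntax", "expected_fix": "print(x)"},
]

def evaluate(cases, predict):
    """Run predict(case) -> proposed fix over all cases.

    Returns the fraction of cases where the proposed fix exactly
    matches the expected fix (a deliberately simple stand-in metric).
    """
    passed = sum(1 for case in cases if predict(case) == case["expected_fix"])
    return passed / len(cases)

# Trivial stand-in "model" that always proposes the same fix.
rate = evaluate(sample_cases, lambda case: "return a + b")
print(rate)  # 0.5
```

Because the evaluation is a pure function of the model outputs, rerunning it on the same predictions yields the same score, which is what makes results reproducible and comparable against the provided baselines.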
All uses of the Chronos benchmark must cite:
@article{khan2025chronos,
title={Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding},
author={Khan, Ishraq and Chowdary, Assad and Haseeb, Sharoz and Patel, Urvish},
journal={arXiv preprint arXiv:2507.12482},
year={2025},
note={Benchmark release planned for Q1 2026}
}

Benchmark versions:
- v1.0-sample (Current): 500 test cases, available now
- v1.0-full (Q1 2026): Complete 5,000 test cases
- v2.0 (Planned 2026): Extended with additional languages and bug types
Q: Is the sample dataset freely available?
A: Yes, the sample dataset is available under the Apache 2.0 license.

Q: Do I have to share my evaluation results?
A: You control publication of your results. We encourage sharing them for community benefit.

Q: Can I contribute new test cases?
A: Yes! See CONTRIBUTING.md for guidelines on submitting new debugging scenarios.

Q: Are evaluation results reproducible?
A: Yes, given the same model outputs, evaluation results are reproducible.

Q: Can I evaluate my own model with the framework?
A: The evaluation framework is extensible. See the adapter examples in benchmarks/adapters/.
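To illustrate the extension point, here is a hypothetical adapter sketch. The class and method names are assumptions for illustration; the real interface lives in benchmarks/adapters/ and may differ:

```python
from abc import ABC, abstractmethod

class ModelAdapter(ABC):
    """Hypothetical base class a custom model adapter might implement."""

    @abstractmethod
    def propose_fix(self, case: dict) -> str:
        """Given one benchmark case, return the model's proposed fix."""

class EchoAdapter(ModelAdapter):
    """Toy adapter that returns the buggy code unchanged (a do-nothing baseline)."""

    def propose_fix(self, case: dict) -> str:
        return case["buggy_code"]

adapter = EchoAdapter()
fix = adapter.propose_fix({"buggy_code": "x = 1"})
print(fix)  # x = 1
```

Keeping the model behind a small interface like this lets the harness score any model the same way, so results stay comparable across adapters.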
- Technical Issues: Open an issue in this repository
- Access Questions: [email protected]
- Partnership Inquiries: [email protected]
Subscribe to our mailing list for updates:
- Benchmark release announcements
- New evaluation metrics
- Community evaluation results
- Workshop and challenge announcements
Sign up at: https://kodezi.com/chronos-updates
The Chronos benchmark represents a significant step forward in debugging model evaluation. We look forward to seeing how the research community uses it to advance the field.