This directory contains the evaluation benchmarks used to assess Chronos's debugging capabilities. Note that these are benchmark specifications and protocols only; the actual Chronos model is available exclusively through Kodezi OS.
## State-of-the-Art Performance

Chronos achieves the highest performance on SWE-bench Lite, the industry-standard debugging benchmark:
| Rank | System | Success Rate | Instances Resolved |
|---|---|---|---|
| 🥇 1 | Kodezi Chronos | 80.33% | 241/300 |
| 🥈 2 | ExpeRepair-v1.0 + Claude 4.5 Sonnet | 60.33% | 181/300 |
| 3 | Claude 4.5 Sonnet (Bash Only) | ~14% | ~42/300 |
| 4 | Claude 4.1 Opus (Bash Only) | 14.2% | 43/300 |
| 5 | GPT-4.1 | 13.8% | 41/300 |
| 6 | Gemini 2.0 Pro | 13.4% | 40/300 |
**Key Achievement:** a 20-percentage-point absolute lead over the second-place system (80.33% vs. 60.33%)
### Repository-Specific Performance
- sympy (symbolic mathematics): 96.1%
- sphinx (documentation systems): 93.8%
- django (web frameworks): 90.4%
**The Debugging Gap:** General-purpose models that achieve 70%+ on code generation (SWE-bench Full) drop below 15% on debugging tasks (SWE-bench Lite), a gap of more than 50 percentage points. Chronos's specialized debugging architecture bridges this gap.
## Multi Random Retrieval (MRR) Benchmark

Our novel benchmark, designed specifically to test debugging-oriented retrieval capabilities:
- 5,000 real-world debugging scenarios
- 12,500 total bugs evaluated across all benchmarks
- Context scattered across 10-50 files
- Temporal dispersion spanning 3-12 months
- Obfuscated dependencies with refactored names
- Multi-modal artifacts (code, tests, logs, docs)
Key Metrics:
- Retrieval Precision@k (92% achieved)
- Retrieval Recall@k (85% achieved)
- Fix Accuracy (67.3% ± 2.1%)
- Context Efficiency (O(k log d) complexity)
- Human Preference (89%, N=50)
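Precision@k and Recall@k can be computed with a short, self-contained sketch; the file names below are hypothetical examples, not taken from the benchmark data:

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved files that are truly relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for f in top_k if f in relevant) / len(top_k)

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant files recovered within the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for f in retrieved[:k] if f in relevant) / len(relevant)

# Hypothetical retrieval output for a single debugging scenario
retrieved = ["auth/service.py", "export/handler.py", "utils/io.py", "db/models.py"]
relevant = {"auth/service.py", "export/handler.py", "tests/test_export.py"}

print(precision_at_k(retrieved, relevant, k=4))  # 0.5 (2 of 4 retrieved are relevant)
print(recall_at_k(retrieved, relevant, k=4))     # ~0.667 (2 of 3 relevant files found)
```

The reported numbers above are averages of these per-scenario scores across all 5,000 scenarios.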
We evaluate across six major bug categories:
| Category | Description | Test Cases |
|---|---|---|
| Syntax | Syntax errors and typos | 500 |
| Logic | Logical errors in algorithms | 1,200 |
| Concurrency | Race conditions, deadlocks | 800 |
| Memory | Memory leaks, buffer overflows | 600 |
| API | API misuse, version conflicts | 900 |
| Performance | Performance regressions | 400 |
Total Test Cases: 4,400 (expanded to 12,500 with variations)
Testing debugging performance across different codebase sizes:
- Small: <10K LOC
- Medium: 10K-100K LOC
- Large: 100K-1M LOC
- Enterprise: >1M LOC
**Overall debugging performance:**

| Model | Debug Success | Root Cause Acc. | Avg. Fix Iterations |
|---|---|---|---|
| GPT-4.1 | 13.8% | 12.3% | 1-2 |
| Claude 4.1 Opus | 14.2% | 11.7% | 1-2 |
| Gemini 2.0 Pro | 15.0% | 15.8% | 1-2 |
| Kodezi Chronos | 67.3% | 89% | 7.8 |
**Retrieval performance (MRR benchmark):**

| Model | Precision@10 | Recall@10 | Fix Accuracy |
|---|---|---|---|
| GPT-4.1 + RAG | 42.3% | 31.7% | 8.9% |
| Claude 4.1 Opus + Vector DB | 48.1% | 36.2% | 11.2% |
| Gemini 2.0 Pro + Graph | 51.7% | 41.8% | 14.6% |
| Kodezi Chronos | 92% | 85% | 67.3% |
**Test case selection:**
- Randomly sampled from real-world bug reports
- Verified by human developers
- Categorized by complexity and type
**Evaluation protocol:**
- Present bug report/symptoms to model
- Measure retrieval accuracy
- Evaluate proposed fix
- Run automated tests
- Check for regressions
- Measure end-to-end success
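The protocol steps above can be sketched as an evaluation loop. Everything here is illustrative: `StepResult` and the `model.debug(...)` interface are hypothetical stand-ins, not the actual harness API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepResult:
    retrieved_files: List[str]  # files the model pulled into context
    patch: str                  # proposed fix as a unified diff
    tests_pass: bool            # did the automated test suite pass?
    regressions: int            # previously passing tests that now fail

def evaluate_scenario(model, scenario, max_iterations: int = 10) -> dict:
    """Present the bug, evaluate each proposed fix, run tests,
    check for regressions, and record end-to-end success."""
    for attempt in range(1, max_iterations + 1):
        result = model.debug(scenario)  # hypothetical model interface
        if result.tests_pass and result.regressions == 0:
            return {"success": True, "iterations": attempt}
    return {"success": False, "iterations": max_iterations}
```

Averaging `iterations` over successful scenarios gives the "Avg. Fix Iterations" column reported in the comparison tables.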
**Fair comparison:**
- All models tested on identical scenarios
- Same computational resources allocated
- Human verification of results
- Statistical significance testing
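As one example of the significance testing, a standard two-proportion z-test on the SWE-bench Lite leaderboard counts above (241/300 vs. 181/300) shows the lead is far outside chance; the helper function is a sketch, not the project's actual analysis code:

```python
import math

def two_proportion_z(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """z-statistic for the difference between two success rates."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p_pool = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Chronos (241/300) vs. the second-place system (181/300)
z = two_proportion_z(241, 300, 181, 300)
print(round(z, 2))  # ~5.36, well beyond the 1.96 threshold for p < 0.05
```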
While the Chronos model itself is not publicly available, researchers can:
- Use our test scenarios to evaluate their own models
- Follow our protocols for consistent evaluation
- Compare results using our metrics
```python
# This is a conceptual example - actual implementation requires model access
from benchmarks import MRRBenchmark, DebugTaskEvaluator

# Load benchmark
benchmark = MRRBenchmark.load("./multi-random-retrieval/mrr_v1.json")

# Evaluate your model
evaluator = DebugTaskEvaluator(your_model)
results = evaluator.run_benchmark(benchmark)

# Compare with Chronos results
comparison = results.compare_with_baseline("chronos_results.json")
print(comparison.summary())
```

Each benchmark task is stored as a JSON record:

```json
{
  "task_id": "mrr_001",
  "bug_description": "NullPointerException in user export after auth refactor",
  "repository_snapshot": "path/to/repo/snapshot",
  "relevant_files": ["auth/service.py", "export/handler.py", ...],
  "ground_truth_fix": {
    "files_modified": [...],
    "patch": "...",
    "test_results": "all_pass"
  },
  "metadata": {
    "bug_category": "null_pointer",
    "complexity": "medium",
    "cross_file_dependencies": 3
  }
}
```

We welcome contributions to improve our benchmarks:
- New test cases - Submit real-world debugging scenarios
- Evaluation metrics - Propose new ways to measure debugging effectiveness
- Baseline comparisons - Add results from other models/tools
Please see CONTRIBUTING.md for guidelines.
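Contributed test cases can be sanity-checked against the task record format shown earlier. The required keys below come from that example; the validator itself is a sketch, not part of the official tooling:

```python
REQUIRED_KEYS = {
    "task_id", "bug_description", "repository_snapshot",
    "relevant_files", "ground_truth_fix", "metadata",
}

def validate_task(task: dict) -> list:
    """Return a list of problems with a task record (empty list = valid)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - task.keys())]
    fix = task.get("ground_truth_fix") or {}
    for key in ("files_modified", "patch", "test_results"):
        if key not in fix:
            problems.append(f"ground_truth_fix missing: {key}")
    return problems
```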
If you use these benchmarks in your research, please cite:

```bibtex
@article{khan2025chronos,
  title={Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding},
  author={Khan, Ishraq and Chowdary, Assad and Haseeb, Sharoz and Patel, Urvish},
  journal={arXiv preprint arXiv:2507.12482},
  year={2025}
}
```