The complete Multi-Random Retrieval (MRR) benchmark suite has been generated with all 5,000 scenarios as specified in the Chronos paper.
| Category | Count | Description | Expected Success Rate |
|---|---|---|---|
| Syntax Errors | 500 | Compilation/syntax issues | 94.2% |
| Logic Errors | 1,200 | Algorithmic and logic bugs | 72.8% |
| API Misuse | 900 | Incorrect API usage | 79.1% |
| Memory Issues | 600 | Memory leaks, null pointers | 61.7% |
| Concurrency Issues | 800 | Race conditions, deadlocks | 58.3% |
| Performance Bugs | 400 | Inefficiencies, bottlenecks | 65.4% |
| Cross-Category | 600 | Multiple interacting bugs | 51.2% |
| TOTAL | 5,000 | All categories combined | 67.3% (avg) |
Each scenario contains:

```json
{
  "bug_id": "Unique identifier",
  "category": "Bug category",
  "subcategory": "Specific type",
  "language": "python|javascript|java|go|cpp",
  "complexity": {
    "spatial_distribution": 1-50,          // Files with context
    "temporal_spread_months": 0-12,        // Historical span
    "abstraction_layers": 1-5,             // Depth of reasoning
    "obfuscation_level": "low|medium|high",
    "cross_module_dependencies": 0-20,
    "artifact_types": 2-5
  },
  "code_snippets": {
    "buggy_code": "...",
    "fixed_code": "..."
  },
  "scattered_context": [
    // 10-50 files with relevant context
  ],
  "temporal_info": {
    // Code evolution over 3-12 months
  },
  "retrieval_paths": {
    // Explicit, implicit, compositional paths
  },
  "ground_truth": {
    // Expected solution and evaluation criteria
  }
}
```

Key challenges built into each scenario:

- Context scattered across 10-50 files
- Temporal dispersion over 3-12 months
- Multiple abstraction layers (1-5 levels)
- Obfuscated dependencies (renamed, moved files)

Complexity dimensions exercised by the benchmark:

- Spatial Distribution: Bug information spread across multiple files
- Temporal Spread: Related changes across months of history
- Cross-Module Dependencies: Complex inter-module relationships
- Multiple Artifact Types: Code, tests, logs, configs, documentation

Evaluation criteria:

- Retrieval Completeness: Must find critical files
- Temporal Understanding: Track code evolution
- Context Synthesis: Integrate scattered information
- Fix Correctness: Validate against ground truth
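
As a quick illustration of working with the scenario format above, the sketch below loads one scenario file and inspects its complexity metadata. The field names come from the schema shown earlier; the file path and naming scheme are hypothetical, since the exact layout produced by the generator is not specified here.

```python
import json
from pathlib import Path

# Hypothetical path; adjust to however the generator named the scenario files
# under benchmarks/mrr_full_benchmark/.
scenario_path = Path("benchmarks/mrr_full_benchmark/logic_errors/scenario_0001.json")

with scenario_path.open() as f:
    scenario = json.load(f)

complexity = scenario["complexity"]
print(f"Bug {scenario['bug_id']} ({scenario['category']}/{scenario['subcategory']})")
print(f"  Language:           {scenario['language']}")
print(f"  Files with context: {complexity['spatial_distribution']}")
print(f"  Temporal spread:    {complexity['temporal_spread_months']} months")
print(f"  Abstraction layers: {complexity['abstraction_layers']}")
print(f"  Scattered context:  {len(scenario['scattered_context'])} files")
```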

To run the benchmark:

```bash
# Quick test on 10 scenarios
python benchmarks/run_benchmark.py --scenarios 10

# Run 100 logic-error scenarios
python benchmarks/run_benchmark.py --categories logic_errors --scenarios 100

# Full benchmark run, saving results
python benchmarks/run_benchmark.py --full --save-results
```

To evaluate model results programmatically:

```python
from benchmarks.evaluation_metrics.comprehensive_metrics import ComprehensiveEvaluator

evaluator = ComprehensiveEvaluator()
results = evaluator.evaluate_model(
    results=your_results,
    model_name="your_model"
)
print(evaluator.generate_report())
```

The benchmark evaluates:
- Primary Metrics
  - Debug Success Rate (target: 67.3%)
  - Root Cause Accuracy (target: 89%)
  - Avg Fix Iterations (target: 7.8)
  - Retrieval Precision (target: 92%)
  - Retrieval Recall (target: 85%)
- Category Performance
  - Per-category success rates
  - Complexity-adjusted scoring
  - Language-specific metrics
- Efficiency Metrics
  - Token efficiency
  - Context utilization
  - Output entropy density
- Statistical Analysis
  - 95% confidence intervals
  - Cohen's d effect size
  - Statistical significance (p < 0.001)
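
For reference, the retrieval and effect-size metrics above follow their standard definitions. The sketch below is an illustrative reimplementation, not the code in `benchmarks/evaluation_metrics/`; function names and inputs are placeholders.

```python
import numpy as np

def retrieval_precision_recall(retrieved_files, relevant_files):
    """Precision and recall over retrieved vs. ground-truth context files."""
    retrieved, relevant = set(retrieved_files), set(relevant_files)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

def cohens_d(scores_a, scores_b):
    """Cohen's d effect size between two per-scenario score samples."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```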
```
benchmarks/
├── mrr_full_benchmark/
│   ├── syntax_errors/          # 500 scenarios
│   ├── logic_errors/           # 1,200 scenarios
│   ├── api_misuse/             # 900 scenarios
│   ├── memory_issues/          # 600 scenarios
│   ├── concurrency_issues/     # 800 scenarios
│   ├── performance_bugs/       # 400 scenarios
│   ├── cross_category/         # 600 scenarios
│   └── BENCHMARK_METADATA.json
├── evaluation_metrics/
│   ├── comprehensive_metrics.py
│   ├── mrr_metrics_2025.py
│   └── statistical_analysis.py
├── run_benchmark.py            # Main runner
├── run_evaluation.py           # Evaluation script
└── generate_full_benchmark.py  # Generator (already run)
```
The benchmark has been validated to ensure:
- ✅ Exactly 5,000 scenarios generated
- ✅ Proper category distribution
- ✅ Realistic complexity parameters
- ✅ Valid JSON structure
- ✅ Comprehensive ground truth
- ✅ Multi-language support
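
A lightweight way to re-check the scenario counts locally is sketched below. It assumes each scenario is stored as one JSON file directly under its category directory; if the generator uses a different layout, adjust the glob pattern accordingly.

```python
from pathlib import Path

# Expected per-category counts from the table above.
EXPECTED_COUNTS = {
    "syntax_errors": 500,
    "logic_errors": 1200,
    "api_misuse": 900,
    "memory_issues": 600,
    "concurrency_issues": 800,
    "performance_bugs": 400,
    "cross_category": 600,
}

root = Path("benchmarks/mrr_full_benchmark")
total = 0
for category, expected in EXPECTED_COUNTS.items():
    # Assumption: one JSON file per scenario in each category directory.
    found = len(list((root / category).glob("*.json")))
    total += found
    status = "OK" if found == expected else "MISMATCH"
    print(f"{category:20s} {found:5d} / {expected:5d}  {status}")

print(f"{'TOTAL':20s} {total:5d} / 5000")
```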
To benchmark your own model:

- Test Your Model

  ```bash
  python benchmarks/run_evaluation.py --model your_model --scenarios 100
  ```

- Compare Performance (a significance-test sketch follows this list)
  - Chronos: 67.3% success (target)
  - GPT-4.1: 13.8% success (baseline)
  - Your Model: ?

- Analyze Results

  ```bash
  python scripts/generate_visualizations.py
  jupyter notebook notebooks/performance_analysis.ipynb
  ```
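
When comparing success rates across models, a simple two-proportion z-test can confirm whether an observed difference is statistically meaningful. This is an illustrative sketch in pure Python, not part of the benchmark's own `statistical_analysis.py`; the example counts are made up.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two debug success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Example (hypothetical counts): your model solved 45 of 100 scenarios,
# a baseline solved 14 of 100.
z, p = two_proportion_z_test(45, 100, 14, 100)
print(f"z = {z:.2f}, p = {p:.2e}")
```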
When using this benchmark, please cite:

```bibtex
@article{khan2025chronos,
  title={Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding},
  author={Khan, Ishraq and Chowdary, Assad and Haseeb, Sharoz and Patel, Urvish},
  journal={arXiv preprint arXiv:2507.12482},
  year={2025}
}
```

Benchmark Complete! The full MRR benchmark with 5,000 scenarios is ready for evaluation.