The complete Multi-Random Retrieval (MRR) benchmark suite has been generated with all 5,000 scenarios as specified in the Chronos paper.
| Category | Count | Description | Expected Success Rate |
|---|---|---|---|
| Syntax Errors | 500 | Compilation/syntax issues | 94.2% |
| Logic Errors | 1,200 | Algorithmic and logic bugs | 72.8% |
| API Misuse | 900 | Incorrect API usage | 79.1% |
| Memory Issues | 600 | Memory leaks, null pointers | 61.7% |
| Concurrency Issues | 800 | Race conditions, deadlocks | 58.3% |
| Performance Bugs | 400 | Inefficiencies, bottlenecks | 65.4% |
| Cross-Category | 600 | Multiple interacting bugs | 51.2% |
| TOTAL | 5,000 | All categories combined | 67.3% (avg) |
Each scenario contains:

```json
{
  "bug_id": "Unique identifier",
  "category": "Bug category",
  "subcategory": "Specific type",
  "language": "python|javascript|java|go|cpp",
  "complexity": {
    "spatial_distribution": 1-50,          // Files with context
    "temporal_spread_months": 0-12,        // Historical span
    "abstraction_layers": 1-5,             // Depth of reasoning
    "obfuscation_level": "low|medium|high",
    "cross_module_dependencies": 0-20,
    "artifact_types": 2-5
  },
  "code_snippets": {
    "buggy_code": "...",
    "fixed_code": "..."
  },
  "scattered_context": [
    // 10-50 files with relevant context
  ],
  "temporal_info": {
    // Code evolution over 3-12 months
  },
  "retrieval_paths": {
    // Explicit, implicit, compositional paths
  },
  "ground_truth": {
    // Expected solution and evaluation criteria
  }
}
```

Key challenges built into each scenario:

- Context scattered across 10-50 files
- Temporal dispersion over 3-12 months
- Multiple abstraction layers (1-5 levels)
- Obfuscated dependencies (renamed, moved files)

Complexity dimensions exercised by the benchmark:

- Spatial Distribution: Bug information spread across multiple files
- Temporal Spread: Related changes across months of history
- Cross-Module Dependencies: Complex inter-module relationships
- Multiple Artifact Types: Code, tests, logs, configs, documentation

Evaluation criteria:

- Retrieval Completeness: Must find critical files
- Temporal Understanding: Track code evolution
- Context Synthesis: Integrate scattered information
- Fix Correctness: Validate against ground truth
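
As a quick illustration of working with the scenario format above, the sketch below loads one scenario file and inspects its complexity metadata. The field names come from the schema shown earlier; the file path and naming scheme are hypothetical, since the exact layout produced by the generator is not specified here.

```python
import json
from pathlib import Path

# Hypothetical path; adjust to however the generator named the scenario files
# under benchmarks/mrr_full_benchmark/.
scenario_path = Path("benchmarks/mrr_full_benchmark/logic_errors/scenario_0001.json")

with scenario_path.open() as f:
    scenario = json.load(f)

complexity = scenario["complexity"]
print(f"Bug {scenario['bug_id']} ({scenario['category']}/{scenario['subcategory']})")
print(f"  Language:           {scenario['language']}")
print(f"  Files with context: {complexity['spatial_distribution']}")
print(f"  Temporal spread:    {complexity['temporal_spread_months']} months")
print(f"  Abstraction layers: {complexity['abstraction_layers']}")
print(f"  Scattered context:  {len(scenario['scattered_context'])} files")
```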

To run the benchmark:

```bash
# Quick test on 10 scenarios
python benchmarks/run_benchmark.py --scenarios 10

# Run 100 logic-error scenarios
python benchmarks/run_benchmark.py --categories logic_errors --scenarios 100

# Full benchmark run, saving results
python benchmarks/run_benchmark.py --full --save-results
```

To evaluate model results programmatically:

```python
from benchmarks.evaluation_metrics.comprehensive_metrics import ComprehensiveEvaluator

evaluator = ComprehensiveEvaluator()
results = evaluator.evaluate_model(
    results=your_results,
    model_name="your_model"
)
print(evaluator.generate_report())
```

The benchmark evaluates:
- Primary Metrics
  - Debug Success Rate (target: 67.3%)
  - Root Cause Accuracy (target: 89%)
  - Avg Fix Iterations (target: 7.8)
  - Retrieval Precision (target: 92%)
  - Retrieval Recall (target: 85%)
- Category Performance
  - Per-category success rates
  - Complexity-adjusted scoring
  - Language-specific metrics
- Efficiency Metrics
  - Token efficiency
  - Context utilization
  - Output entropy density
- Statistical Analysis
  - 95% confidence intervals
  - Cohen's d effect size
  - Statistical significance (p < 0.001)
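
For reference, the retrieval and effect-size metrics above follow their standard definitions. The sketch below is an illustrative reimplementation, not the code in `benchmarks/evaluation_metrics/`; function names and inputs are placeholders.

```python
import numpy as np

def retrieval_precision_recall(retrieved_files, relevant_files):
    """Precision and recall over retrieved vs. ground-truth context files."""
    retrieved, relevant = set(retrieved_files), set(relevant_files)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

def cohens_d(scores_a, scores_b):
    """Cohen's d effect size between two per-scenario score samples."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```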
```
benchmarks/
├── mrr_full_benchmark/
│   ├── syntax_errors/          # 500 scenarios
│   ├── logic_errors/           # 1,200 scenarios
│   ├── api_misuse/             # 900 scenarios
│   ├── memory_issues/          # 600 scenarios
│   ├── concurrency_issues/     # 800 scenarios
│   ├── performance_bugs/       # 400 scenarios
│   ├── cross_category/         # 600 scenarios
│   └── BENCHMARK_METADATA.json
├── evaluation_metrics/
│   ├── comprehensive_metrics.py
│   ├── mrr_metrics_2025.py
│   └── statistical_analysis.py
├── run_benchmark.py            # Main runner
├── run_evaluation.py           # Evaluation script
└── generate_full_benchmark.py  # Generator (already run)
```
The benchmark has been validated to ensure:
- ✅ Exactly 5,000 scenarios generated
- ✅ Proper category distribution
- ✅ Realistic complexity parameters
- ✅ Valid JSON structure
- ✅ Comprehensive ground truth
- ✅ Multi-language support
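
A lightweight way to re-check the scenario counts locally is sketched below. It assumes each scenario is stored as one JSON file directly under its category directory; if the generator uses a different layout, adjust the glob pattern accordingly.

```python
from pathlib import Path

# Expected per-category counts from the table above.
EXPECTED_COUNTS = {
    "syntax_errors": 500,
    "logic_errors": 1200,
    "api_misuse": 900,
    "memory_issues": 600,
    "concurrency_issues": 800,
    "performance_bugs": 400,
    "cross_category": 600,
}

root = Path("benchmarks/mrr_full_benchmark")
total = 0
for category, expected in EXPECTED_COUNTS.items():
    # Assumption: one JSON file per scenario in each category directory.
    found = len(list((root / category).glob("*.json")))
    total += found
    status = "OK" if found == expected else "MISMATCH"
    print(f"{category:20s} {found:5d} / {expected:5d}  {status}")

print(f"{'TOTAL':20s} {total:5d} / 5000")
```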
To benchmark your own model:

- Test Your Model

  ```bash
  python benchmarks/run_evaluation.py --model your_model --scenarios 100
  ```

- Compare Performance (a significance-test sketch follows this list)
  - Chronos: 67.3% success (target)
  - GPT-4.1: 13.8% success (baseline)
  - Your Model: ?

- Analyze Results

  ```bash
  python scripts/generate_visualizations.py
  jupyter notebook notebooks/performance_analysis.ipynb
  ```
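
When comparing success rates across models, a simple two-proportion z-test can confirm whether an observed difference is statistically meaningful. This is an illustrative sketch in pure Python, not part of the benchmark's own `statistical_analysis.py`; the example counts are made up.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two debug success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Example (hypothetical counts): your model solved 45 of 100 scenarios,
# a baseline solved 14 of 100.
z, p = two_proportion_z_test(45, 100, 14, 100)
print(f"z = {z:.2f}, p = {p:.2e}")
```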
When using this benchmark, please cite:

```bibtex
@article{khan2025chronos,
  title={Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding},
  author={Khan, Ishraq and Chowdary, Assad and Haseeb, Sharoz and Patel, Urvish},
  journal={arXiv preprint arXiv:2507.12482},
  year={2025}
}
```

Benchmark Complete! The full MRR benchmark with 5,000 scenarios is ready for evaluation.