Chronos MRR Benchmark - Complete Implementation

✅ Benchmark Status: COMPLETE

📊 5,000 Scenarios Generated

The complete Multi-Random Retrieval (MRR) benchmark suite has been generated with all 5,000 scenarios as specified in the Chronos paper.

Distribution

| Category | Count | Description | Expected Success Rate |
|---|---|---|---|
| Syntax Errors | 500 | Compilation/syntax issues | 94.2% |
| Logic Errors | 1,200 | Algorithmic and logic bugs | 72.8% |
| API Misuse | 900 | Incorrect API usage | 79.1% |
| Memory Issues | 600 | Memory leaks, null pointers | 61.7% |
| Concurrency Issues | 800 | Race conditions, deadlocks | 58.3% |
| Performance Bugs | 400 | Inefficiencies, bottlenecks | 65.4% |
| Cross-Category | 600 | Multiple interacting bugs | 51.2% |
| TOTAL | 5,000 | | 67.3% avg |

Scenario Structure

Each scenario contains:

{
  "bug_id": "Unique identifier",
  "category": "Bug category",
  "subcategory": "Specific type",
  "language": "python|javascript|java|go|cpp",
  "complexity": {
    "spatial_distribution": 1-50,  // Files with context
    "temporal_spread_months": 0-12,  // Historical span
    "abstraction_layers": 1-5,  // Depth of reasoning
    "obfuscation_level": "low|medium|high",
    "cross_module_dependencies": 0-20,
    "artifact_types": 2-5
  },
  "code_snippets": {
    "buggy_code": "...",
    "fixed_code": "..."
  },
  "scattered_context": [
    // 10-50 files with relevant context
  ],
  "temporal_info": {
    // Code evolution over 3-12 months
  },
  "retrieval_paths": {
    // Explicit, implicit, compositional paths
  },
  "ground_truth": {
    // Expected solution and evaluation criteria
  }
}
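
As a quick illustration, here is a minimal sketch of loading one generated scenario and reading a few of its fields. The field names follow the schema above; the file path is a hypothetical example, not a guaranteed filename.

import json
from pathlib import Path

# Hypothetical path; point this at any generated scenario file.
scenario_path = Path("benchmarks/mrr_full_benchmark/logic_errors/scenario_0001.json")

with scenario_path.open() as f:
    scenario = json.load(f)

# Field names follow the schema shown above.
print(scenario["bug_id"], scenario["category"], scenario["language"])
print("files with scattered context:", len(scenario["scattered_context"]))
print("temporal spread (months):", scenario["complexity"]["temporal_spread_months"])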

Key Features

1. Multi-Random Retrieval

  • Context scattered across 10-50 files
  • Temporal dispersion over 3-12 months
  • Multiple abstraction layers (1-5 levels)
  • Obfuscated dependencies (renamed, moved files)

2. Realistic Complexity

  • Spatial Distribution: Bug information spread across multiple files
  • Temporal Spread: Related changes across months of history
  • Cross-Module Dependencies: Complex inter-module relationships
  • Multiple Artifact Types: Code, tests, logs, configs, documentation

3. Comprehensive Evaluation

  • Retrieval Completeness: Must find the critical files (scoring sketch after this list)
  • Temporal Understanding: Track code evolution
  • Context Synthesis: Integrate scattered information
  • Fix Correctness: Validate against ground truth
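
To make retrieval completeness concrete, the sketch below scores it as the fraction of ground-truth critical files a model actually retrieved. The helper and the list of critical files are illustrative assumptions for this README, not the benchmark runner's own API.

def retrieval_completeness(retrieved_files, critical_files):
    """Fraction of ground-truth critical files present in the retrieved set."""
    if not critical_files:
        return 1.0
    retrieved = set(retrieved_files)
    found = sum(1 for f in critical_files if f in retrieved)
    return found / len(critical_files)

# Example: the model retrieved 3 of the 4 files the fix depends on.
score = retrieval_completeness(
    retrieved_files=["auth/session.py", "auth/token.py", "utils/clock.py"],
    critical_files=["auth/session.py", "auth/token.py", "utils/clock.py", "config/ttl.yaml"],
)
print(f"retrieval completeness: {score:.2f}")  # 0.75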

Usage

Quick Test (10 scenarios)

python benchmarks/run_benchmark.py --scenarios 10

Category Test (100 scenarios)

python benchmarks/run_benchmark.py --categories logic_errors --scenarios 100

Full Benchmark (5,000 scenarios)

python benchmarks/run_benchmark.py --full --save-results

Custom Evaluation

from benchmarks.evaluation_metrics.comprehensive_metrics import ComprehensiveEvaluator

evaluator = ComprehensiveEvaluator()
results = evaluator.evaluate_model(
    results=your_results,
    model_name="your_model"
)
print(evaluator.generate_report())

Evaluation Metrics

The benchmark evaluates four groups of metrics (a computation sketch follows the list):

  1. Primary Metrics

    • Debug Success Rate (target: 67.3%)
    • Root Cause Accuracy (target: 89%)
    • Avg Fix Iterations (target: 7.8)
    • Retrieval Precision (target: 92%)
    • Retrieval Recall (target: 85%)
  2. Category Performance

    • Per-category success rates
    • Complexity-adjusted scoring
    • Language-specific metrics
  3. Efficiency Metrics

    • Token efficiency
    • Context utilization
    • Output entropy density
  4. Statistical Analysis

    • 95% confidence intervals
    • Cohen's d effect size
    • Statistical significance (p < 0.001)
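
As a rough illustration of the primary and statistical metrics above, the sketch below computes an overall success rate with a normal-approximation 95% confidence interval and defines Cohen's d for effect size. This is a minimal standalone sketch; the benchmark's own implementation lives in benchmarks/evaluation_metrics/statistical_analysis.py and may differ in detail.

import statistics
from math import sqrt

def success_rate_ci(successes, n, z=1.96):
    """Success rate with a normal-approximation 95% confidence interval."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p, (p - half_width, p + half_width)

def cohens_d(sample_a, sample_b):
    """Effect size between two samples of per-scenario scores (e.g. 0/1 success flags)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled = sqrt(((len(sample_a) - 1) * var_a + (len(sample_b) - 1) * var_b)
                  / (len(sample_a) + len(sample_b) - 2))
    return (mean_a - mean_b) / pooled

# e.g. 3,365 solved scenarios out of 5,000 corresponds to the 67.3% target.
rate, ci = success_rate_ci(successes=3365, n=5000)
print(f"success rate: {rate:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")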

Files Generated

benchmarks/
├── mrr_full_benchmark/
│   ├── syntax_errors/         # 500 scenarios
│   ├── logic_errors/          # 1,200 scenarios
│   ├── api_misuse/            # 900 scenarios
│   ├── memory_issues/         # 600 scenarios
│   ├── concurrency_issues/    # 800 scenarios
│   ├── performance_bugs/      # 400 scenarios
│   ├── cross_category/        # 600 scenarios
│   └── BENCHMARK_METADATA.json
├── evaluation_metrics/
│   ├── comprehensive_metrics.py
│   ├── mrr_metrics_2025.py
│   └── statistical_analysis.py
├── run_benchmark.py           # Main runner
├── run_evaluation.py          # Evaluation script
└── generate_full_benchmark.py # Generator (already run)

Validation

The benchmark has been validated to ensure:

  • ✅ Exactly 5,000 scenarios generated
  • ✅ Proper category distribution
  • ✅ Realistic complexity parameters
  • ✅ Valid JSON structure
  • ✅ Comprehensive ground truth
  • ✅ Multi-language support
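
These checks can be reproduced locally with a short script. The sketch below assumes each scenario is stored as its own .json file inside the category directories shown under Files Generated; if the generator batches scenarios differently, adjust the glob pattern.

from pathlib import Path

EXPECTED = {
    "syntax_errors": 500, "logic_errors": 1200, "api_misuse": 900,
    "memory_issues": 600, "concurrency_issues": 800,
    "performance_bugs": 400, "cross_category": 600,
}

root = Path("benchmarks/mrr_full_benchmark")
total = 0
for category, expected in EXPECTED.items():
    # Assumes one JSON file per scenario; adjust if scenarios are batched.
    count = len(list((root / category).glob("*.json")))
    status = "OK" if count == expected else "MISMATCH"
    print(f"{category}: {count}/{expected} [{status}]")
    total += count

print(f"total scenarios: {total} (expected 5,000)")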

Next Steps

  1. Test Your Model

    python benchmarks/run_evaluation.py --model your_model --scenarios 100
  2. Compare Performance

    • Chronos: 67.3% success (target)
    • GPT-4.1: 13.8% success (baseline)
    • Your Model: ?
  3. Analyze Results

    python scripts/generate_visualizations.py
    jupyter notebook notebooks/performance_analysis.ipynb

Citation

When using this benchmark, please cite:

@article{khan2025chronos,
  title={Kodezi Chronos: A Debugging-First Language Model for 
         Repository-Scale Code Understanding},
  author={Khan, Ishraq and Chowdary, Assad and 
          Haseeb, Sharoz and Patel, Urvish},
  journal={arXiv preprint arXiv:2507.12482},
  year={2025}
}

Benchmark Complete! The full MRR benchmark with 5,000 scenarios is ready for evaluation.