The Multi Random Retrieval (MRR) Benchmark is a comprehensive evaluation framework for debugging systems, featuring 5,000+ carefully crafted bug scenarios across 7 categories. This benchmark ensures reproducible results matching the specifications in the Kodezi Chronos 2025 paper.
- 5,000+ Bug Scenarios: Comprehensive coverage of real-world debugging challenges
- 143,712 Artifacts: Extensive codebase artifacts for realistic retrieval testing
- Deterministic Results: Reproducible outcomes with seed-based randomization
- Multi-Category Testing: 7 bug categories from syntax errors to cross-file issues
- Automated Validation: Built-in validation to ensure expected performance
Expected results by model (the last column is Chronos's success rate divided by each model's):

| Model | Success Rate | 95% CI | Chronos Advantage |
|---|---|---|---|
| Kodezi Chronos | 67.3% ± 2.1% | [65.2%, 69.4%] | Baseline |
| Claude 4 Opus | 14.2% ± 1.3% | [12.9%, 15.5%] | 4.74x |
| GPT-4.1 | 13.8% ± 1.2% | [12.6%, 15.0%] | 4.88x |
| Gemini 2 Pro | 12.4% ± 1.2% | [11.2%, 13.6%] | 5.43x |
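The advantage factors follow directly from the success rates; a quick check of the arithmetic (rates taken from the table above):

```python
# Verify the Chronos advantage factors from the published success rates.
chronos = 0.673
baselines = {"Claude 4 Opus": 0.142, "GPT-4.1": 0.138, "Gemini 2 Pro": 0.124}

for model, rate in baselines.items():
    print(f"{model}: {chronos / rate:.2f}x")  # 4.74x, 4.88x, 5.43x
```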
To run the benchmark:

```bash
# Run benchmark for Claude 4 Opus (default)
python run_full_mrr_benchmark.py

# Run for a specific model
python run_full_mrr_benchmark.py --model gpt_4_1

# Run with custom parameters
python run_full_mrr_benchmark.py \
    --model chronos \
    --scenarios 1000 \
    --seed 12345 \
    --output-dir results/custom_run
```

For consistent, reproducible results across environments, use the Docker setup:
```bash
# Make script executable
chmod +x run_benchmark_docker.sh

# Run full benchmark in Docker
./run_benchmark_docker.sh claude_4_opus 5000 42

# Or use docker-compose directly
export BENCHMARK_MODEL=chronos
export BENCHMARK_SCENARIOS=5000
export BENCHMARK_SEED=42
docker-compose up
```

Ensure your benchmark setup is correct:
```bash
# Validate benchmark structure and expected results
python validate_mrr_benchmark.py

# Run mini benchmark (100 scenarios) for quick validation
python validate_mrr_benchmark.py --benchmark-dir mrr_full_benchmark
```

The benchmark is organized by bug category:

```
mrr_full_benchmark/
├── syntax_errors/         # 500 scenarios
├── logic_errors/          # 1,200 scenarios
├── concurrency_issues/    # 800 scenarios
├── memory_issues/         # 600 scenarios
├── api_misuse/            # 900 scenarios
├── performance_bugs/      # 400 scenarios
├── cross_category/        # 600 scenarios
└── artifacts/             # 143,712 supporting files
```
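Before a run, it can be useful to confirm the layout and per-category counts directly on disk. A minimal sketch, assuming each scenario is stored as a single `.json` file directly under its category directory (as implied by the structure above):

```python
from pathlib import Path

# Count scenario files per category and check the total against the documented 5,000.
benchmark_dir = Path("mrr_full_benchmark")
categories = [
    "syntax_errors", "logic_errors", "concurrency_issues", "memory_issues",
    "api_misuse", "performance_bugs", "cross_category",
]

total = 0
for category in categories:
    count = len(list((benchmark_dir / category).glob("*.json")))
    total += count
    print(f"{category:20s} {count:6d}")

print(f"{'total':20s} {total:6d}")
```

The bundled `validate_mrr_benchmark.py` remains the authoritative check; this is only a quick spot check.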
Each scenario is a JSON file with the following structure:
```json
{
  "bug_id": "mrr_logic_errors_0001",
  "category": "logic_errors",
  "description": "Null pointer dereference in user authentication",
  "scattered_files": [
    "src/auth/login.py",
    "src/models/user.py",
    "tests/auth_test.py"
    // ... up to 50 files
  ],
  "temporal_range": {
    "start": "2024-01-15T10:23:00Z",
    "end": "2024-03-22T14:45:00Z"
  },
  "ground_truth": {
    "root_cause": "Missing null check in getUserProfile method",
    "fix_location": "src/models/user.py:145",
    "fix_type": "add_null_check"
  },
  "obfuscation": {
    "refactored_names": true,
    "architectural_changes": 2
  }
}
```
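Because each scenario is plain JSON, it can also be inspected directly; a small sketch using the fields shown above (the file path is illustrative):

```python
import json
from pathlib import Path

# Load one scenario and pull out the fields a retrieval system needs.
scenario_path = Path("mrr_full_benchmark/logic_errors/mrr_logic_errors_0001.json")
scenario = json.loads(scenario_path.read_text())

print(scenario["description"])                     # bug summary
print(len(scenario["scattered_files"]), "files")   # retrieval targets
print(scenario["ground_truth"]["fix_location"])    # e.g. src/models/user.py:145
```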
After running benchmarks, analyze the results:

```bash
# Generate comparison report and visualizations
python analyze_results.py

# Skip plots (text report only)
python analyze_results.py --no-plots

# Analyze specific results directory
python analyze_results.py --results-dir results/custom_run
```

The benchmark generates several output files:
- Results JSON (`{model}_mrr_results_{timestamp}.json`; see the loading sketch after this list)
  - Complete benchmark results with all metrics
  - Individual scenario outcomes
  - Aggregate statistics
- Summary Text (`{model}_summary.txt`)
  - Human-readable summary
  - Category breakdown
  - Key metrics
- Validation Report (`validation_report.txt`)
  - Confirms benchmark integrity
  - Validates expected performance
- Analysis Outputs
  - `mrr_comparison_report.md`: Comprehensive comparison
  - `summary_statistics.json`: Machine-readable stats
  - Visualization PNGs: Performance charts
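The results JSON can also be post-processed outside of `analyze_results.py`. A hedged sketch, assuming the file contains per-scenario outcomes and aggregate statistics as listed above (the exact key names are assumptions; check a generated file for the real schema):

```python
import glob
import json

# Load the most recent results file and print headline numbers.
# NOTE: "aggregate", "success_rate", and "scenarios" are assumed key names.
latest = sorted(glob.glob("results/*_mrr_results_*.json"))[-1]
with open(latest) as f:
    results = json.load(f)

print("Success rate:", results.get("aggregate", {}).get("success_rate"))
print("Scenarios run:", len(results.get("scenarios", [])))
```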
To ensure reproducible results:
- Use Fixed Seeds: Always specify the same seed value
- Docker Environment: Use the provided Docker setup
- Version Control: Track the exact commit/version used (see the fingerprint sketch after this list)
- Validation: Run the validation script before and after each run
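One lightweight way to follow the version-control guidance is to store a small environment fingerprint next to each results directory. This is a convention of this example, not a file the benchmark writes itself:

```python
import json
import platform
import subprocess
import sys

# Record the environment used for a run (illustrative convention, not a benchmark output).
fingerprint = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "commit": subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip(),
    "seed": 42,
    "scenarios": 100,
}

with open("environment.json", "w") as f:
    json.dump(fingerprint, f, indent=2)
```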
Example reproducibility test:
```bash
# Run 1
python run_full_mrr_benchmark.py --seed 42 --scenarios 100

# Run 2 (should produce identical results)
python run_full_mrr_benchmark.py --seed 42 --scenarios 100

# Compare results
diff results/run1.json results/run2.json  # Should be identical
```
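`diff` only reports equality when the two files are byte-identical. A slightly more tolerant check is to compare the parsed JSON while ignoring run metadata such as timestamps (the `timestamp` key name is an assumption; adjust to the real schema):

```python
import json

# Compare two benchmark runs field-by-field, ignoring volatile metadata.
def load_without_metadata(path, ignore=("timestamp",)):
    with open(path) as f:
        data = json.load(f)
    return {k: v for k, v in data.items() if k not in ignore}

run1 = load_without_metadata("results/run1.json")
run2 = load_without_metadata("results/run2.json")
print("identical" if run1 == run2 else "results differ")
```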
To add a custom model, edit `MODEL_PERFORMANCE` in `run_full_mrr_benchmark.py`:

```python
MODEL_PERFORMANCE = {
    "your_model": {
        "fix_rate": 0.25,                # Expected success rate
        "precision": 0.70,
        "recall": 0.60,
        "iterations": 5.0,
        "confidence": 0.65,
        "cross_file_success": 0.45,
        "temporal_understanding": 0.40,
        "obfuscation_resistance": 0.50
    }
}
```

To add custom scenarios, create new entries following the same format:
```python
scenario = {
    "bug_id": "custom_001",
    "category": "logic_errors",
    "description": "Your bug description",
    "scattered_files": ["file1.py", "file2.py"],
    "ground_truth": {
        "root_cause": "Description",
        "fix_location": "file1.py:10"
    }
}
```
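To make a hand-written scenario visible to the runner, it presumably needs to be saved as its own JSON file in the matching category directory, mirroring the one-file-per-scenario layout implied by the directory structure above; a hedged sketch:

```python
import json
from pathlib import Path

# `scenario` is the dict from the example above, repeated here so the sketch runs standalone.
scenario = {
    "bug_id": "custom_001",
    "category": "logic_errors",
    "description": "Your bug description",
    "scattered_files": ["file1.py", "file2.py"],
    "ground_truth": {"root_cause": "Description", "fix_location": "file1.py:10"},
}

# Write it into its category directory (assumed discovery layout).
target = Path("mrr_full_benchmark") / scenario["category"] / f"{scenario['bug_id']}.json"
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(json.dumps(scenario, indent=2))
print("wrote", target)
```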
Troubleshooting common issues:

- "File not found" errors
  - Ensure you're in the `benchmarks/` directory
  - Check that `mrr_full_benchmark/` exists with all categories
- Inconsistent results
  - Verify you're using the same seed
  - Check that Python/NumPy versions match
  - Use Docker for guaranteed consistency
- Memory issues
  - Reduce the scenario count: `--scenarios 1000`
  - Increase the Docker memory limit in docker-compose.yml
- Import errors
  - Install requirements: `pip install -r requirements.txt`
  - Add the parent directory to the Python path if needed

Performance tips:

- Parallel Execution: The benchmark supports parallel processing (see the sketch after this list)
- Caching: Enable with the `--cache` flag for repeated runs
- Subset Testing: Use fewer scenarios for development/testing
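The parallel-execution tip can also be approximated from the outside by launching several runs at once, one per model, using only the flags documented above; the worker count and output layout here are choices of this example:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Run the benchmark for several models concurrently, one subprocess per model.
MODELS = ["chronos", "claude_4_opus", "gpt_4_1"]

def run_model(model):
    cmd = [
        "python", "run_full_mrr_benchmark.py",
        "--model", model,
        "--scenarios", "1000",
        "--seed", "42",
        "--output-dir", f"results/{model}",
    ]
    return model, subprocess.run(cmd).returncode

with ThreadPoolExecutor(max_workers=2) as pool:
    for model, code in pool.map(run_model, MODELS):
        print(model, "ok" if code == 0 else f"failed ({code})")
```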
To integrate the MRR benchmark into your CI/CD pipeline:
```yaml
# GitHub Actions example
- name: Run MRR Benchmark
  run: |
    cd benchmarks
    python run_full_mrr_benchmark.py \
      --model ${{ matrix.model }} \
      --scenarios 1000 \
      --seed 42

- name: Validate Results
  run: |
    cd benchmarks
    python validate_mrr_benchmark.py

- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: mrr-results-${{ matrix.model }}
    path: benchmarks/results/
```

For issues or questions:
- Check the Troubleshooting section
- Review validation output for specific errors
- Consult the paper for methodology details