ChronoBench

This repo contains the data and metadata required to establish a "judge" LLM (J-LLM) that assesses the quality of virtual experiment scripts (simulator-ready scripts that set up and run a simulation; the published paper calls these "digital twins") produced by another LLM. ChronoBench (published as SimBench, IEEE Access 2026) is a benchmark designed to evaluate and diagnose the proficiency of simulator-oriented large language models (S-LLMs) in generating such scripts for virtual testing. Given a collection of S-LLMs, the benchmark ranks them by their ability to produce high-quality scripts. More than 30 open- and closed-source S-LLMs (33 in the published study) have been assessed using a J-LLM produced with the data herein, yielding over 3,000 expert-scored multi-turn dialogues.

Using multi-turn interactions, ChronoBench employs a rule-based J-LLM (LLM-as-a-judge) that leverages both predefined rules and human-in-the-loop guidance to assign scores to the scripts generated by the S-LLM, providing a consistent and expert-inspired evaluation protocol. Unlike similarity metrics (e.g., CodeBLEU, ROUGE-L) or single-scalar execution metrics (e.g., Compile@k, Pass@k), the rubric-grounded J-LLM produces interpretable, diagnostic feedback that attributes errors to concrete aspects of the generated artifact. The J-LLM is specific to a simulator; herein the approach is demonstrated with the open-source Chrono multi-physics simulator, covering multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulations. The benchmarking principle is broadly applicable and can be used to assess an S-LLM's ability to generate virtual experiment scripts for other simulation packages (e.g., ANSYS, ABAQUS, OpenFOAM, StarCCM+, IsaacSim, PyBullet), each of which requires different but qualitatively similar simulator-specific data.
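To make the contrast with single-scalar metrics concrete, the following minimal sketch computes Pass@k with the commonly used unbiased estimator and sets it beside a rubric-style report. The rubric aspects, their point values, and the `rubric_report` helper are hypothetical placeholders chosen for illustration; they are not the rubric used by the ChronoBench J-LLM.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: n samples drawn, c of them pass, budget k."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical rubric (aspect -> maximum points); NOT the ChronoBench rubric.
RUBRIC = {"completeness": 40, "correctness": 30, "api_usage": 20, "visualization": 10}


def rubric_report(deductions: dict[str, int]) -> dict:
    """Turn per-aspect deductions into a total plus a per-aspect breakdown."""
    unknown = set(deductions) - set(RUBRIC)
    if unknown:
        raise KeyError(f"unknown rubric aspects: {sorted(unknown)}")
    per_aspect = {
        aspect: max(0, max_points - deductions.get(aspect, 0))
        for aspect, max_points in RUBRIC.items()
    }
    return {"total": sum(per_aspect.values()), "per_aspect": per_aspect}


if __name__ == "__main__":
    # A single number: says how often a script passes, not what went wrong.
    print(round(pass_at_k(n=10, c=3, k=1), 3))
    # A breakdown: points lost are attributed to named aspects of the script.
    print(rubric_report({"correctness": 15, "api_usage": 5}))
```

The scalar says only how often a script passes, whereas the breakdown shows where points were lost, which is the kind of attribution a rubric-grounded judge is meant to provide.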

A description of the approach used to produce the J-LLM has been published in IEEE Access:

@article{simbench2026,
  author={Wang, Jingquan and Negrut, Andrew and Wang, Hongyu and Zhang, Harry and Negrut, Dan},
  journal={IEEE Access},
  title={SimBench: A Framework for Evaluating and Diagnosing LLM-Based Digital-Twin Generation for Multi-Physics Simulation},
  year={2026},
  volume={14},
  pages={61784--61808},
  doi={10.1109/ACCESS.2026.3685519}
}

The earlier preprint is available on arXiv as arXiv:2408.11987. A copy of the published paper is included in this repo at .claude/docs/2026Jingquan-SimBench.pdf.

Reproducing the paper

The exact code and data behind the published IEEE Access 2026 results are preserved at the git tag paper-ieee-access-2026. To reproduce the paper, check out that tag:

git checkout paper-ieee-access-2026

The main branch moves beyond the paper (cleaner naming, leaner dependencies, and a test suite), so use the tag when you specifically want the published results.

Highlights of the paper

The J-LLM associated with ChronoBench handles multi-physics simulations, including but not limited to:

Collision, Contact, and Friction Dynamics (MBD): Scenarios involving multi-link arms, gear mechanisms, slider-crank systems, and other typical mechanisms.
Vibration, Deformation, Stress, and Strain (FEA): Scenarios involving cables, beams, shells, and plates that evaluate the S-LLM's proficiency in structural analysis.
Vehicle Dynamics (VEH): City buses, off-road vehicles (e.g., HMMWV, M113), trucks (e.g., Kraz, MAN), and sedans are used to test the S-LLM's ability to simulate driving scenarios. Driver, engine, transmission, and tire models, as well as high-level control policies integrated with sensors, are included in the benchmark.
Sensor Integration (SEN): Scenarios involving GPS, IMU, LiDAR, and camera sensors are used to exercise the S-LLM's capability to support perception tasks for autonomous vehicles and robotic systems.
Robotic Dynamics (RBT): The benchmark covers robotic systems such as Turtlebot, Curiosity, and VIPER, as well as granular dynamics and deformable terrain simulations (e.g., the Soil Contact Model, SCM) that come into play in off-road operations for both robots and vehicles.

ChronoBench draws on 102 demonstration tasks associated with 34 distinct physical systems across the five categories (MBD through RBT) listed above. These tasks involve setting up and progressively modifying virtual experiment scripts, with each task broken down into three high-quality turns. The turns were designed by simulation experts to increase gradually in complexity, enabling the J-LLM to provide a robust assessment of the S-LLM's capabilities. A list of example simulation scenarios in ChronoBench is provided in the above figure.

The ChronoBench pipeline for evaluating S-LLMs is shown above. The J-LLM is calibrated using a validation set containing pairs of ground-truth and generated scripts. The prompts given to the J-LLM are interactively optimized to match the score provided by the expert. Then the J-LLM is used to evaluate the S-LLM based on the generated scripts, ground-truth scripts, and the API documentation.
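One way to picture the calibration step is as a search over candidate judge prompts for the one whose scores agree best with the expert's scores on the validation set. The sketch below assumes mean absolute error as the agreement measure and uses made-up scores and prompt names; the published pipeline may use a different criterion, and no actual LLM call is shown.

```python
from statistics import mean


def mean_abs_error(judge_scores: list[float], expert_scores: list[float]) -> float:
    """Average absolute gap between judge and expert scores on the same items."""
    if not judge_scores or len(judge_scores) != len(expert_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(j - e) for j, e in zip(judge_scores, expert_scores))


def pick_best_prompt(candidates: dict[str, list[float]], expert_scores: list[float]) -> str:
    """Return the name of the candidate prompt with the lowest error vs. the expert."""
    if not candidates:
        raise ValueError("no candidate prompts given")
    return min(candidates, key=lambda name: mean_abs_error(candidates[name], expert_scores))


if __name__ == "__main__":
    # Illustrative numbers only: expert scores for four validation scripts,
    # and the scores a judge produced under two hypothetical prompt versions.
    expert = [80, 55, 90, 30]
    candidates = {
        "prompt_v1": [70, 70, 85, 50],
        "prompt_v2": [78, 60, 88, 35],
    }
    for name, scores in candidates.items():
        print(name, mean_abs_error(scores, expert))
    print("selected:", pick_best_prompt(candidates, expert))
```

In practice the candidate scores would come from running the judge with each prompt revision on the validation pairs; the selection logic stays the same.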

PyChronoBench

Complementing the end-to-end ChronoBench evaluation, the study introduces PyChronoBench, a lightweight multiple-choice benchmark targeting fine-grained knowledge of the PyChrono API. It contains 280 questions covering contact modeling, body creation, solver settings, sensor configuration, and other frequent developer tasks. Each question has a single correct option among four distractors, which makes automatic grading trivial. PyChronoBench lives in its own repository: github.com/uwsbel/PyChronoBench.
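Because each question has exactly one correct option, grading reduces to comparing a model's chosen option against an answer key. The sketch below illustrates this under assumed conventions: the question IDs, the letter labels, and the `grade` helper are hypothetical and do not reflect the actual PyChronoBench file format.

```python
def grade(answer_key: dict[str, str], responses: dict[str, str]) -> float:
    """Fraction of questions answered correctly; unanswered questions count as wrong."""
    if not answer_key:
        raise ValueError("answer key is empty")
    correct = sum(
        1
        for question_id, expected in answer_key.items()
        # Normalize case and whitespace so "a " and "A" are treated alike.
        if responses.get(question_id, "").strip().upper() == expected.strip().upper()
    )
    return correct / len(answer_key)


if __name__ == "__main__":
    # Hypothetical key and model responses for four questions.
    key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
    model_responses = {"q1": "a", "q2": "C", "q3": "D"}  # q4 left unanswered
    print(grade(key, model_responses))  # 0.5
```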

