This repo contains the data and meta-data required to establish a "judge" LLM (J-LLM) that assesses the quality of virtual experiment scripts (simulator-ready scripts that set up and run a simulation; the published paper calls these "digital twins") produced by another LLM. ChronoBench (published as SimBench, IEEE Access 2026) is a benchmark designed to evaluate, and to diagnose, the proficiency of simulator-oriented large language models (S-LLMs) in generating such scripts for virtual testing. Given a collection of S-LLMs, this benchmark ranks them based on their ability to produce high-quality scripts. More than 30 open- and closed-source S-LLMs (33 in the published study) have been assessed using a J-LLM produced with the data herein, yielding over 3,000 expert-scored multi-turn dialogues.
Using multi-turn interactions, ChronoBench employs a rule-based J-LLM (LLM-as-a-judge) that leverages both predefined rules and human-in-the-loop guidance to assign scores for the scripts generated by the S-LLM, providing a consistent and expert-inspired evaluation protocol. Unlike similarity metrics (e.g., CodeBLEU, ROUGE-L) or single-scalar execution metrics (e.g., Compile@k, Pass@k), the rubric-grounded J-LLM produces interpretable, diagnostic feedback that attributes errors to concrete aspects of the generated artifact. The J-LLM is specific to a simulator, and herein the approach is demonstrated in conjunction with the open-source Chrono multi-physics simulator, covering multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulations. The benchmarking principle is broadly applicable and enables assessing an S-LLM's ability to generate virtual experiment scripts for other simulation packages (e.g., ANSYS, ABAQUS, OpenFOAM, StarCCM+, IsaacSim, PyBullet), each of which requires different but qualitatively similar simulator-specific data.
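To make the contrast with single-scalar metrics concrete, the sketch below shows how a rubric-grounded score can be traced back to the aspects that lost points. It is an illustration only: the aspect names, point values, and the `RubricItem` and `score_report` names are hypothetical and do not reproduce the rubric used by the J-LLM.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One rubric aspect (hypothetical names and point values)."""
    aspect: str
    max_points: int
    deducted: int
    note: str = ""


def score_report(items):
    """Return (earned, total, feedback) for a list of rubric items.

    Feedback lists only the aspects that lost points, so the final
    score can be attributed to concrete parts of the script.
    """
    total = sum(item.max_points for item in items)
    earned = sum(item.max_points - item.deducted for item in items)
    feedback = [
        f"{item.aspect}: -{item.deducted} ({item.note})"
        for item in items
        if item.deducted > 0
    ]
    return earned, total, feedback


# Hypothetical evaluation of one generated script.
items = [
    RubricItem("completeness", 40, 10, "visualization setup missing"),
    RubricItem("api_correctness", 30, 0),
    RubricItem("code_quality", 30, 5, "unused variables"),
]
earned, total, feedback = score_report(items)
print(f"{earned}/{total}")
for line in feedback:
    print(line)
```

A metric such as Pass@k would collapse the same script into a single pass/fail contribution, whereas the report above keeps the reason for each deduction.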
A description of the approach used to produce the J-LLM has been published in IEEE Access:
@article{simbench2026,
  author={Wang, Jingquan and Negrut, Andrew and Wang, Hongyu and Zhang, Harry and Negrut, Dan},
  journal={IEEE Access},
  title={SimBench: A Framework for Evaluating and Diagnosing LLM-Based Digital-Twin Generation for Multi-Physics Simulation},
  year={2026},
  volume={14},
  pages={61784-61808},
  doi={10.1109/ACCESS.2026.3685519}
}

The earlier preprint is available on arXiv as arXiv:2408.11987. A copy of the published paper is included in this repo at .claude/docs/2026Jingquan-SimBench.pdf.
The exact code and data behind the published IEEE Access 2026 results are preserved at the git tag
paper-ieee-access-2026.
To reproduce the paper, check out that tag:

    git checkout paper-ieee-access-2026

The main branch moves beyond the paper (cleaner naming, leaner dependencies, and a test suite), so
use the tag when you specifically want the published results.
The J-LLM associated with ChronoBench handles multi-physics simulations, including but
not limited to:
- Collision, Contact, and Friction Dynamics (MBD): Scenarios involving multi-link arms, gear mechanisms, slider-crank systems, and other typical mechanisms.
- Vibration, Deformation, Stress, and Strain (FEA): Scenarios involving cables, beams, shells, and plates that evaluate the S-LLM's proficiency in structural analysis.
- Vehicle Dynamics (VEH): City buses, off-road vehicles (e.g., HMMWV, M113), trucks (e.g., Kraz, MAN), and sedans are used to test the S-LLM's ability to simulate driving scenarios. Driver, engine, transmission, and tire models, as well as high-level control policies integrated with sensors, are included in the benchmark.
- Sensor Integration (SEN): Scenarios involving GPS, IMU, LiDAR, and camera sensors are used to exercise the S-LLM's capability to support perception tasks for autonomous vehicles and robotic systems.
- Robotics Dynamics (RBT): The benchmark touches on robotic systems like Turtlebot, Curiosity, and VIPER, as well as granular dynamics and deformable terrain simulations (e.g., the Soil Contact Model, SCM) that come into play in off-road operations for both robots and vehicles.
ChronoBench draws on 102 demonstration tasks associated with 34 distinct physical systems of the categories MBD through RBT listed above. These tasks involve setting up and progressively modifying virtual experiment scripts, with each task broken down into three high-quality turns. These turns have been designed by simulation experts to gradually increase in complexity, thus enabling the J-LLM to provide a robust assessment of the S-LLM's capabilities. A list of example simulation scenarios in ChronoBench is provided in the above figure.
The ChronoBench pipeline for evaluating S-LLMs is shown above. The J-LLM is calibrated using a
validation set containing pairs of ground-truth and generated scripts. The prompts given to the
J-LLM are interactively optimized to match the score provided by the expert. Then the J-LLM is
used to evaluate the S-LLM based on the generated scripts, ground-truth scripts, and the API
documentation.
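
The calibration step can be pictured as choosing, among candidate judge prompts, the one whose scores track the expert's scores most closely on the validation set. The sketch below is a minimal illustration under stated assumptions: the prompt names, the score values, and the use of mean absolute error as the agreement measure are all hypothetical and are not taken from the repo or the paper.

```python
from statistics import mean


def mean_abs_error(judge_scores, expert_scores):
    """Average absolute gap between judge and expert scores (lower is better)."""
    if len(judge_scores) != len(expert_scores):
        raise ValueError("score lists must have the same length")
    return mean(abs(j - e) for j, e in zip(judge_scores, expert_scores))


def pick_best_prompt(candidates, expert_scores):
    """Return (name, error) of the candidate prompt closest to the expert."""
    errors = {
        name: mean_abs_error(scores, expert_scores)
        for name, scores in candidates.items()
    }
    best = min(errors, key=errors.get)
    return best, errors[best]


# Hypothetical validation set: expert scores for four generated scripts,
# and the scores a judge produced under two candidate prompts.
expert = [80, 55, 90, 40]
candidates = {
    "prompt_v1": [70, 60, 85, 60],
    "prompt_v2": [78, 50, 92, 45],
}
best_name, best_err = pick_best_prompt(candidates, expert)
print(best_name, best_err)
```

In this toy data the second prompt is selected because its scores deviate less from the expert's on average; a rank correlation would be an equally plausible agreement measure.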
Complementing the end-to-end ChronoBench evaluation, the study introduces PyChronoBench, a lightweight multiple-choice benchmark targeting fine-grained knowledge of the PyChrono API. It contains 280 questions covering contact modeling, body creation, solver settings, sensor configuration, and other frequent developer tasks. Each question has a single correct option among four distractors, so grading can be fully automated. PyChronoBench lives in its own repository: github.com/uwsbel/PyChronoBench.
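
Because each question has exactly one correct option, grading reduces to comparing the chosen letter with an answer key. The sketch below assumes a hypothetical layout (question ids mapped to option letters); it does not reflect the actual file format used in the PyChronoBench repository.

```python
def grade(answer_key, responses):
    """Fraction of questions answered correctly.

    Unanswered questions count as wrong; answers are compared
    case-insensitively after stripping whitespace.
    """
    if not answer_key:
        raise ValueError("answer key is empty")
    correct = sum(
        1
        for qid, truth in answer_key.items()
        if responses.get(qid, "").strip().upper() == truth
    )
    return correct / len(answer_key)


# Hypothetical answer key and model responses (q4 left unanswered).
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
responses = {"q1": "a", "q2": "C", "q3": "D"}
accuracy = grade(answer_key, responses)
print(f"accuracy = {accuracy:.2f}")
```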