Official code and data repository for the paper "DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories."
- counterfactual_data: contains three files, for each dataset, with all counterfactuals generated for the counterfactual ablation study.
- data: contains the human-verified version of DialToM.
The user needs to install the following packages in their preferred virtual environment: google-genai, openai, sacrebleu, rouge, bert-score.
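Before running the benchmark, it can help to confirm that each package listed above is present in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository.

```python
from importlib import metadata

# Distribution names exactly as listed in this README.
REQUIRED = ["google-genai", "openai", "sacrebleu", "rouge", "bert-score"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that is not installed in the active environment."""
    missing = []
    for name in names:
        try:
            # Raises PackageNotFoundError when the distribution is absent.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```

The check looks up distribution names rather than import names, so it does not depend on how each package exposes its modules.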
python benchmark.py --model gpt-5 --task retrospective --filename retrospective.csv
python benchmark.py --model gpt-5 --task prospective --exp [EXP_TYPE] --filename prospective.csv
- The prospective task supports four experiment types (EXP_TYPE): normal, easy, NOTA, and CoT. The output filename changes with the chosen experiment to {filename}_{EXP_TYPE}.csv.
  - normal is the default prospective baseline.
  - easy is the easy-set evaluation of the prospective task.
  - NOTA and CoT are the two ablations discussed in the rebuttal phase.
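The naming pattern {filename}_{EXP_TYPE}.csv can be illustrated with a short sketch. It assumes that {filename} refers to the value of --filename without its .csv extension and that the suffix is applied to every experiment type, including normal; neither detail is confirmed here, and the helper `output_filename` is hypothetical rather than taken from benchmark.py.

```python
from pathlib import Path

# The four experiment types named in this README.
EXP_TYPES = ("normal", "easy", "NOTA", "CoT")


def output_filename(filename: str, exp_type: str) -> str:
    """Insert the experiment type before the extension: name.csv -> name_EXP.csv."""
    if exp_type not in EXP_TYPES:
        raise ValueError(f"unknown EXP_TYPE: {exp_type!r}")
    path = Path(filename)
    # Assumption: {filename} in the pattern is the name without its .csv suffix.
    return str(path.with_name(f"{path.stem}_{exp_type}.csv"))


if __name__ == "__main__":
    for exp in EXP_TYPES:
        print(exp, "->", output_filename("prospective.csv", exp))
```

Under these assumptions, passing --filename prospective.csv with --exp easy would correspond to prospective_easy.csv.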
python benchmark.py --model gpt-5 --task written --filename written.csv
python counterfactual_test.py --model gpt-5 --filename counter.csv
python memorization_pilot.py --model gpt-5 --filename memorize.csv
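To run every command above for one model, the invocations can be collected in a small driver. This is a sketch, not a script shipped with the repository: the function `build_commands` is hypothetical, and it only reuses the script names, flags, and filenames shown in this README. As written it performs a dry run that prints each command.

```python
import shlex

# The four prospective experiment types named in this README.
EXP_TYPES = ("normal", "easy", "NOTA", "CoT")


def build_commands(model: str = "gpt-5"):
    """Return the README's invocations as argument lists, one per run."""
    bench = ["python", "benchmark.py", "--model", model]
    cmds = [bench + ["--task", "retrospective", "--filename", "retrospective.csv"]]
    # One prospective run per experiment type.
    for exp in EXP_TYPES:
        cmds.append(bench + ["--task", "prospective", "--exp", exp,
                             "--filename", "prospective.csv"])
    cmds.append(bench + ["--task", "written", "--filename", "written.csv"])
    cmds.append(["python", "counterfactual_test.py", "--model", model,
                 "--filename", "counter.csv"])
    cmds.append(["python", "memorization_pilot.py", "--model", model,
                 "--filename", "memorize.csv"])
    return cmds


if __name__ == "__main__":
    # Dry run: print each command. To execute instead, pass each list to
    # subprocess.run(cmd, check=True) from the repository root.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems if a model name or filename contains spaces.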