python=3.10.15
google-generativeai==0.8.3
openai==1.58.1
Run the following command to execute the script:
python main.py --save_answer --llm_model "$llm_model" --dataset "$dataset" --answer_mode "$answer_mode" --data_mode "$data_mode"

--llm_model: Defines the LLM to use. Choices include:
    gemini-1.5-pro-002, gemini-1.5-flash-002, gpt-4o-2024-08-06
    llama_8b, llama_70b, llama_sambanova_405b
    qwen25_coder_32b, qwq_32b, deepseek_r1
--dataset: Specifies the dataset to evaluate, such as: GSM8K, AQUA, DROP
--answer_mode: Determines the answering strategy:
    cot: Chain-of-Thought prompting
    hot: Highlight Chain-of-Thought prompting
--data_mode: Selects which samples to run:
    random: Runs the model on 200 randomly selected samples.
    longest: Runs the model on the 200 longest samples.
    shortest: Runs the model on the 200 shortest samples.
    full: Runs the model on the whole dataset.
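
The four --data_mode choices above can be pictured with a small sketch. This is an illustration only, not the code in main.py: the function name select_samples, the "question" field, measuring length in characters, and the fixed seed are all assumptions made for the example.

```python
import random

SUBSET_SIZE = 200  # subset size stated in the --data_mode descriptions


def select_samples(samples, data_mode, size=SUBSET_SIZE, seed=0):
    """Return the subset of `samples` implied by `data_mode`.

    Assumptions (not taken from the repository): each sample is a dict with
    a "question" string, "length" means character count, and `seed` makes
    the random draw repeatable.
    """
    samples = list(samples)
    if data_mode == "full":
        return samples
    if data_mode == "random":
        rng = random.Random(seed)
        return rng.sample(samples, min(size, len(samples)))
    if data_mode in ("longest", "shortest"):
        ordered = sorted(
            samples,
            key=lambda s: len(s["question"]),
            reverse=(data_mode == "longest"),
        )
        return ordered[:size]
    raise ValueError(f"unknown data_mode: {data_mode!r}")


# Toy data: 500 questions whose length grows with the index (1..500 chars).
toy = [{"question": "x" * (i + 1)} for i in range(500)]
print(len(select_samples(toy, "random")))                    # 200
print(len(select_samples(toy, "longest")[0]["question"]))    # 500
print(len(select_samples(toy, "shortest")[0]["question"]))   # 1
print(len(select_samples(toy, "full")))                      # 500
```

Whether the scripts measure length in characters or tokens, and whether the random subset is seeded, is not stated here, so check main.py before relying on either detail.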
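
For example, to run Chain-of-Thought prompting with gpt-4o-2024-08-06 on 200 random GSM8K samples:

python main.py --save_answer --llm_model "gpt-4o-2024-08-06" --dataset "GSM8K" --answer_mode "cot" --data_mode random

Run the following command to evaluate the results: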
python evaluate.py --llm_model "$llm_model" --dataset "$dataset" --answer_mode "$answer_mode" --data_mode "$data_mode"

For example:

python evaluate.py --llm_model "gpt-4o-2024-08-06" --dataset "GSM8K" --answer_mode "cot" --data_mode longest

Run the following command to render the results as HTML pages:
python visualize.py --llm_model "$llm_model" --dataset "$dataset" --answer_mode "$answer_mode" --save_html

For example:

python visualize.py --llm_model "gpt-4o-2024-08-06" --dataset "GSM8K" --answer_mode "cot" --save_html

License: MIT

