# ScienceAgentBench Evaluation with OpenHands

This folder contains the evaluation harness for [ScienceAgentBench](https://osu-nlp-group.github.io/ScienceAgentBench/) (paper: ).

## Setup Environment and LLM Configuration

Please follow the instructions [here](../../README.md#setup) to set up your local development environment and LLM.

## Setup ScienceAgentBench

To prevent benchmark data contamination, we only provide the annotation sheet on [Huggingface](https://huggingface.co/datasets/osunlp/ScienceAgentBench), which includes all necessary *inputs* to run an agent.

## Run Inference on ScienceAgentBench

```bash
./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh [model_config] [git-version] [use_knowledge] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o 0.9.3
```

where `model_config` is mandatory, and the rest are optional.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It can also be a release tag like `0.6.2`.
- `use_knowledge`, e.g. `true`, specifies whether to allow the agent to use expert-provided knowledge as additional input. By default, it is set to `false`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent to benchmark, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates the entire ScienceAgentBench dataset. Note: the arguments are positional, so to use `eval_limit` you must also set `agent` and the arguments before it.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to 30.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to 1.
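
Because the arguments are positional, it can help to give them names before passing them through. The following is a minimal sketch of a hypothetical wrapper, not part of the harness: the variable names are assumptions, and the values are the examples and defaults listed above. It only prints the command so that it can be reviewed first.

```bash
#!/bin/sh
# Hypothetical wrapper: the variable names below are illustrative only.
# Values are taken from the examples and defaults documented above.
RUN_INFER="./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh"

MODEL_CONFIG="llm.eval_gpt4o"   # mandatory: config group from config.toml
GIT_VERSION="HEAD"              # commit hash or release tag
USE_KNOWLEDGE="false"           # documented default
AGENT="CodeActAgent"            # documented default
EVAL_LIMIT="10"                 # evaluate only the first 10 instances
MAX_ITER="30"                   # documented default
NUM_WORKERS="1"                 # documented default

# Load the values as positional parameters, in the order of the usage line.
set -- "$MODEL_CONFIG" "$GIT_VERSION" "$USE_KNOWLEDGE" "$AGENT" \
       "$EVAL_LIMIT" "$MAX_ITER" "$NUM_WORKERS"

# Print the full command for review instead of running it.
echo "$RUN_INFER $*"
```

To launch the run instead of printing it, replace the final `echo` line with `"$RUN_INFER" "$@"`.
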
## Evaluate Generated Programs

### Extract Necessary Information from OpenHands Log

After inference completes, use the following command to extract the information needed for evaluation from the output log:

```bash
python post_proc.py [log_fname]
```

- `log_fname`, e.g. `evaluation/.../output.jsonl`, is the automatically saved trajectory log of an OpenHands agent.

The output will be written to, e.g., `evaluation/.../output.converted.jsonl`.

### Run evaluation

Please follow the steps [here](https://github.com/OSU-NLP-Group/ScienceAgentBench/tree/main?tab=readme-ov-file#evaluation-of-generated-code) to evaluate the generated programs.
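
Before handing the files to the external evaluation, a quick sanity check can catch an empty or missing log. The sketch below counts records in each JSONL file, relying only on the JSONL convention of one JSON record per non-empty line. The helper name `count_records` and the two file paths are placeholders, not part of the harness, and nothing here assumes a particular record schema.

```bash
#!/bin/sh
# Hypothetical helper: count the non-empty lines (one JSON record each)
# in a JSONL file. "|| true" keeps a zero count from being treated as an error.
count_records() {
  grep -c . "$1" || true
}

# Placeholder paths: substitute the real locations from your own run.
LOG_FNAME="output.jsonl"
CONVERTED_FNAME="output.converted.jsonl"

for f in "$LOG_FNAME" "$CONVERTED_FNAME"; do
  if [ -f "$f" ]; then
    echo "$f: $(count_records "$f") records"
  else
    echo "$f: not found"
  fi
done
```

A count of zero, or a missing file, suggests that inference or post-processing did not finish as expected.
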