Mirror of https://github.com/All-Hands-AI/OpenHands.git (synced 2024-08-29 01:18:33 +03:00)
(enh) Aider-Bench: make resumable with skip_num arg (#3626)
* added optional START_ID env flag to resume from that instance id
* prepare_dataset: fix comparisons by using instance IDs as int
* aider bench complete_runtime: close runtime to close container
* added matrix display of instance id for logging
* fix typo in summarize_results.py saying summarise_results
* changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable)
* doc changes about huggingface spaces to temporarily point back to OD
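For context, `prepare_dataset` (see the final hunks below) already skips instances recorded in an existing `output.jsonl`, while `SKIP_NUM` unconditionally drops the first N dataset rows. A minimal sketch of picking a `SKIP_NUM` value from an interrupted run, assuming one JSON record per line as the writer code in this diff produces; the file path is illustrative:

```python
import json

# Hypothetical path to the output of an interrupted run.
output_file = 'evaluation/evaluation_outputs/aider_bench/output.jsonl'

finished = 0
with open(output_file) as f:
    for line in f:
        if line.strip():
            json.loads(line)  # make sure the record is a complete JSON object
            finished += 1

# Export SKIP_NUM=<finished> before re-running the evaluation command
# from the Aider-Bench README to skip those rows up front.
print(f'{finished} instances finished; consider SKIP_NUM={finished}')
```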
@@ -31,7 +31,7 @@ export function HomepageHeader() {
 <a href="https://arxiv.org/abs/2407.16741">
 <img src="https://img.shields.io/badge/Paper-%20on%20Arxiv-red?logo=arxiv&style=for-the-badge" alt="Paper on Arxiv" />
 </a>
-<a href="https://huggingface.co/spaces/OpenHands/evaluation">
+<a href="https://huggingface.co/spaces/OpenDevin/evaluation">
 <img src="https://img.shields.io/badge/Evaluation-Benchmark%20on%20HF%20Space-green?logo=huggingface&style=for-the-badge" alt="Evaluation Benchmark" />
 </a>
 </div>

@@ -3,12 +3,14 @@
 This folder contains code and resources to run experiments and evaluations.
 
 ## Logistics
 
 To better organize the evaluation folder, we should follow the rules below:
-- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
+
+- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
 all the preprocessing/evaluation/analysis scripts.
-- Raw data and experimental records should not be stored within this repo.
-- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
-- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
+- Raw data and experimental records should not be stored within this repo.
+- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization.
+- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
 
 ## Supported Benchmarks
 
@@ -23,6 +25,7 @@ To learn more about how to integrate your benchmark into OpenHands, check out [t
 - ML-Bench: [`evaluation/ml_bench`](./ml_bench)
 - APIBench: [`evaluation/gorilla`](./gorilla/)
 - ToolQA: [`evaluation/toolqa`](./toolqa/)
+- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
 
 ### Web Browsing
 
@@ -38,7 +41,6 @@ To learn more about how to integrate your benchmark into OpenHands, check out [t
 - Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
 - ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
 
-
 ## Before everything begins: Setup Environment and LLM Configuration
 
 Please follow instruction [here](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md) to setup your local development environment and LLM.

@@ -65,12 +67,10 @@ api_key = "XXX"
 temperature = 0.0
 ```
 
-
 ### Result Visualization
 
-Check [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization of existing experimental results.
-
+Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.
 
 ### Upload your results
 
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

@@ -33,8 +33,10 @@ development environment and LLM.
 given IDs (comma separated).
 
 There are also following optional environment variables you can set:
-```
+
+```bash
 export USE_UNIT_TESTS=true # if you want to allow the Agent to verify correctness using unittests. Default to false.
+export SKIP_NUM=12 # skip the first 12 instances from the dataset
 ```
 
 Following is the basic command to start the evaluation.

@@ -58,6 +60,8 @@ You can update the arguments in the script
 
 ```bash
 poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
+# with optional SKIP_NUM
+SKIP_NUM=12 poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
 ```
 
 Full example:

@@ -34,6 +34,10 @@ from openhands.runtime.runtime import Runtime
 
 # Configure visibility of unit tests to the Agent.
 USE_UNIT_TESTS = os.environ.get('USE_UNIT_TESTS', 'false').lower() == 'true'
+SKIP_NUM = os.environ.get('SKIP_NUM')
+SKIP_NUM = (
+    int(SKIP_NUM) if SKIP_NUM and SKIP_NUM.isdigit() and int(SKIP_NUM) >= 0 else None
+)
 
 
 def get_config(

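As a side note, the parsing added above only accepts a non-negative integer; anything else silently disables skipping. A standalone sketch of the same pattern with illustrative values (not part of the patch itself):

```python
import os

def parse_skip_num() -> int | None:
    # Mirrors the SKIP_NUM handling above: non-negative integer or None.
    raw = os.environ.get('SKIP_NUM')
    return int(raw) if raw and raw.isdigit() and int(raw) >= 0 else None

# Illustrative behaviour:
#   unset        -> None
#   SKIP_NUM=12  -> 12
#   SKIP_NUM=0   -> 0
#   SKIP_NUM=-3  -> None  ('-3'.isdigit() is False)
#   SKIP_NUM=abc -> None
os.environ['SKIP_NUM'] = '12'
print(parse_skip_num())  # 12
```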
@@ -66,7 +70,7 @@ async def initialize_runtime(
 
     This function is called before the runtime is used to run the agent.
     """
-    logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
+    logger.info(f"\n{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}\n")
     obs: CmdOutputObservation
 
     # Set instance id

@@ -96,7 +100,7 @@ async def initialize_runtime(
         file_path,
         '/workspace',
     )
-    logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
+    logger.info(f"\n{'-' * 50} END Runtime Initialization Fn {'-' * 50}\n")
 
 
 async def complete_runtime(

@@ -109,7 +113,7 @@ async def complete_runtime(
     If you need to do something in the sandbox to get the correctness metric after
     the agent has run, modify this function.
     """
-    logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
+    logger.info(f"\n{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}\n")
     obs: CmdOutputObservation
 
     # Rewriting the test file to ignore any changes Agent may have made.

@@ -136,7 +140,9 @@ async def complete_runtime(
     if isinstance(obs, CmdOutputObservation):
         exit_code = obs.exit_code
 
-    logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
+    logger.info(f"\n{'-' * 50} END Runtime Completion Fn {'-' * 50}\n")
 
+    await runtime.close()
+
     return {
         'test_output': obs.content,

@@ -156,7 +162,9 @@ async def process_instance(
         log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
         reset_logger_for_multiprocessing(logger, str(instance.instance_id), log_dir)
     else:
-        logger.info(f'Starting evaluation for instance {str(instance.instance_id)}.')
+        logger.info(
+            f'\nStarting evaluation for instance {str(instance.instance_id)}.\n'
+        )
 
     # =============================================
     # build instruction

@@ -268,10 +276,14 @@ if __name__ == '__main__':
     eval_ids = None
     if args.eval_ids:
         eval_ids = str(args.eval_ids).split(',')
-        logger.info(f'Using specific dataset IDs: {eval_ids}')
+        logger.info(f'\nUsing specific dataset IDs: {eval_ids}\n')
 
     instances = prepare_dataset(
-        aider_bench_tests, output_file, args.eval_n_limit, eval_ids=eval_ids
+        aider_bench_tests,
+        output_file,
+        args.eval_n_limit,
+        eval_ids=eval_ids,
+        skip_num=SKIP_NUM,
     )
 
     asyncio.run(

@@ -22,7 +22,7 @@ def extract_test_results(res_file_path: str) -> tuple[list[str], list[str]]:
 if __name__ == '__main__':
     if len(sys.argv) != 2:
         print(
-            'Usage: poetry run python summarise_results.py <path_to_output_jsonl_file>'
+            'Usage: poetry run python summarize_results.py <path_to_output_jsonl_file>'
         )
         sys.exit(1)
     json_file_path = sys.argv[1]

@@ -26,7 +26,7 @@ poetry run python evaluation/miniwob/get_success_rate.py evaluation/evaluation_o
 
 ## Submit your evaluation results
 
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
 
 
 ## BrowsingAgent V1.0 result

@@ -95,7 +95,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
 
 > If you want to evaluate existing results, you should first run this to clone existing outputs
 >```bash
->git clone https://huggingface.co/spaces/OpenHands/evaluation evaluation/evaluation_outputs
+>git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
 >```
 
 NOTE, you should have already pulled the instance-level OR env-level docker images following [this section](#openhands-swe-bench-instance-level-docker-support).

@@ -129,10 +129,10 @@ The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_be
 
 ## Visualize Results
 
-First you need to clone `https://huggingface.co/spaces/OpenHands/evaluation` and add your own running results from openhands into the `outputs` of the cloned repo.
+First you need to clone `https://huggingface.co/spaces/OpenDevin/evaluation` and add your own running results from openhands into the `outputs` of the cloned repo.
 
 ```bash
-git clone https://huggingface.co/spaces/OpenHands/evaluation
+git clone https://huggingface.co/spaces/OpenDevin/evaluation
 ```
 
 **(optional) setup streamlit environment with conda**:

@@ -156,4 +156,4 @@ Then you can access the SWE-Bench trajectory visualizer at `localhost:8501`.
 
 ## Submit your evaluation results
 
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

@@ -181,34 +181,44 @@ def prepare_dataset(
     output_file: str,
     eval_n_limit: int,
     eval_ids: list[str] | None = None,
+    skip_num: int | None = None,
 ):
     assert (
         'instance_id' in dataset.columns
     ), "Expected 'instance_id' column in the dataset. You should define your own unique identifier for each instance and use it as the 'instance_id' column."
     id_column = 'instance_id'
     logger.info(f'Writing evaluation output to {output_file}')
-    finished_ids = set()
+    finished_ids: set[str] = set()
     if os.path.exists(output_file):
         with open(output_file, 'r') as f:
             for line in f:
                 data = json.loads(line)
-                finished_ids.add(data[id_column])
+                finished_ids.add(str(data[id_column]))
         logger.warning(
-            f'Output file {output_file} already exists. Loaded {len(finished_ids)} finished instances.'
+            f'\nOutput file {output_file} already exists. Loaded {len(finished_ids)} finished instances.'
         )
 
     if eval_ids:
         eval_ids_converted = [dataset[id_column].dtype.type(id) for id in eval_ids]
         dataset = dataset[dataset[id_column].isin(eval_ids_converted)]
         logger.info(f'Limiting evaluation to {len(eval_ids)} specific instances.')
-    elif eval_n_limit:
+    elif skip_num and skip_num >= 0:
+        skip_num = min(skip_num, len(dataset))
+        dataset = dataset.iloc[skip_num:]
+        logger.info(
+            f'Starting evaluation with skipping first {skip_num} instances ({len(dataset)} instances to run).'
+        )
+        if eval_n_limit and eval_n_limit > 0:
+            dataset = dataset.head(eval_n_limit)
+            logger.info(f'Limiting evaluation to {eval_n_limit} instances.')
+    elif eval_n_limit and eval_n_limit > 0:
         dataset = dataset.head(eval_n_limit)
         logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
 
     new_dataset = [
         instance
         for _, instance in dataset.iterrows()
-        if instance[id_column] not in finished_ids
+        if str(instance[id_column]) not in finished_ids
     ]
     logger.info(
         f'Finished instances: {len(finished_ids)}, Remaining instances: {len(new_dataset)}'

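A small self-contained sketch of the selection order this hunk implements, as reconstructed above (indentation was lost in extraction, so the nesting of the `eval_n_limit` check inside the skip branch is inferred): explicit `eval_ids` win, otherwise `skip_num` drops the leading rows and `eval_n_limit` caps whatever remains. The toy data and flag values are made up:

```python
import pandas as pd

# Toy frame with the required 'instance_id' column.
dataset = pd.DataFrame({'instance_id': [str(i) for i in range(10)]})

eval_ids: list[str] | None = None  # e.g. ['3', '7'] to pin specific instances
skip_num: int | None = 4           # e.g. from SKIP_NUM=4
eval_n_limit: int = 3              # e.g. from args.eval_n_limit

if eval_ids:
    dataset = dataset[dataset['instance_id'].isin(eval_ids)]
elif skip_num and skip_num >= 0:
    dataset = dataset.iloc[min(skip_num, len(dataset)):]
    if eval_n_limit and eval_n_limit > 0:
        dataset = dataset.head(eval_n_limit)
elif eval_n_limit and eval_n_limit > 0:
    dataset = dataset.head(eval_n_limit)

print(dataset['instance_id'].tolist())  # ['4', '5', '6']: skip 4 rows, then cap at 3
```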
@@ -228,8 +238,8 @@ async def run_evaluation(
 ):
     use_multiprocessing = num_workers > 1
     logger.info(
-        f'Evaluation started with Agent {metadata.agent_class}, '
-        f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.'
+        f'Evaluation started with Agent {metadata.agent_class}:\n'
+        f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.\n'
     )
     pbar = tqdm(total=len(dataset))
     output_fp = open(output_file, 'a')

@@ -241,7 +251,7 @@ async def run_evaluation(
         pbar.set_description(f'Instance {output.instance_id}')
         pbar.set_postfix_str(f'Test Result: {output.test_result}')
         logger.info(
-            f'Finished evaluation for instance {output.instance_id}: {output.test_result}'
+            f'Finished evaluation for instance {output.instance_id}: {output.test_result}\n'
         )
         output_fp.write(json.dumps(output.model_dump()) + '\n')
         output_fp.flush()

@@ -270,11 +280,11 @@ async def run_evaluation(
             await update_progress(output)
 
     except KeyboardInterrupt:
-        print('KeyboardInterrupt received. Cleaning up...')
+        print('\nKeyboardInterrupt received. Cleaning up...\n')
         cleanup()
 
     output_fp.close()
-    logger.info('Evaluation finished.')
+    logger.info('\nEvaluation finished.\n')
 
 
 def reset_logger_for_multiprocessing(

@@ -7,6 +7,7 @@ This folder contains evaluation for [WebArena](https://github.com/web-arena-x/we
 Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+
 ## Setup WebArena Environment
 
 WebArena requires you to set up websites containing pre-populated content that is accessible via URL to the machine running the OpenHands agents.
 Follow [this document](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md) to set up your own WebArena environment through local servers or AWS EC2 instances.
 Take note of the base URL (`$WEBARENA_BASE_URL`) of the machine where the environment is installed.

@@ -36,8 +37,7 @@ poetry run python evaluation/webarena/get_success_rate.py evaluation/evaluation_
 
 ## Submit your evaluation results
 
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
-
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
 
 ## BrowsingAgent V1.0 result
 
@@ -12,3 +12,33 @@ def find_available_tcp_port() -> int:
         return -1
     finally:
         sock.close()
+
+
+def display_number_matrix(number: int) -> str | None:
+    if not 0 <= number <= 999:
+        return None
+
+    # Define the matrix representation for each digit
+    digits = {
+        '0': ['###', '# #', '# #', '# #', '###'],
+        '1': ['  #', '  #', '  #', '  #', '  #'],
+        '2': ['###', '  #', '###', '#  ', '###'],
+        '3': ['###', '  #', '###', '  #', '###'],
+        '4': ['# #', '# #', '###', '  #', '  #'],
+        '5': ['###', '#  ', '###', '  #', '###'],
+        '6': ['###', '#  ', '###', '# #', '###'],
+        '7': ['###', '  #', '  #', '  #', '  #'],
+        '8': ['###', '# #', '###', '# #', '###'],
+        '9': ['###', '# #', '###', '  #', '###'],
+    }
+
+    # alternatively, with leading zeros: num_str = f"{number:03d}"
+    num_str = str(number)  # Convert to string without padding
+
+    result = []
+    for row in range(5):
+        line = ' '.join(digits[digit][row] for digit in num_str)
+        result.append(line)
+
+    matrix_display = '\n'.join(result)
+    return f'\n{matrix_display}\n'

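A quick usage sketch for the helper above; the call site is not shown in this diff, so the import location and logging context are assumptions. For instance id 42 it produces a 5-row banner:

```python
# display_number_matrix is defined in the hunk above; import it from that
# module in real use (exact module path not shown in this diff).
banner = display_number_matrix(42)
print(banner)
# Expected rows (each digit rendered 3 columns wide, digits joined by a space):
# # # ###
# # #   #
# ### ###
#   # #
#   # ###

print(display_number_matrix(1000))  # None: only 0-999 is supported
# Presumably meant for log output, e.g.:
#   logger.info(f'Instance {iid}: {display_number_matrix(int(iid))}')
```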