(enh) Aider-Bench: make resumable with skip_num arg (#3626)

* added optional START_ID env flag to resume from that instance id

* prepare_dataset: fix comparisons by using instance ids as int

* aider bench complete_runtime: close runtime to close container

* added matrix display of instance id for logging

* fix typo in summarize_results.py saying summarise_results

* changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable)

* doc changes about huggingface spaces to temporarily point back to OD
Author: tobitege
Date: 2024-08-28 17:42:01 +02:00 (committed by GitHub)
Parent: 4ed45c7c9c
Commit: 9c39f07430
10 changed files with 92 additions and 36 deletions

View File

@@ -31,7 +31,7 @@ export function HomepageHeader() {
<a href="https://arxiv.org/abs/2407.16741">
<img src="https://img.shields.io/badge/Paper-%20on%20Arxiv-red?logo=arxiv&style=for-the-badge" alt="Paper on Arxiv" />
</a>
-<a href="https://huggingface.co/spaces/OpenHands/evaluation">
+<a href="https://huggingface.co/spaces/OpenDevin/evaluation">
<img src="https://img.shields.io/badge/Evaluation-Benchmark%20on%20HF%20Space-green?logo=huggingface&style=for-the-badge" alt="Evaluation Benchmark" />
</a>
</div>

View File

@@ -3,12 +3,14 @@
This folder contains code and resources to run experiments and evaluations.
## Logistics
To better organize the evaluation folder, we should follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
  all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
-- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
+- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization.
- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
## Supported Benchmarks
@@ -23,6 +25,7 @@ To learn more about how to integrate your benchmark into OpenHands, check out [t
- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
- APIBench: [`evaluation/gorilla`](./gorilla/)
- ToolQA: [`evaluation/toolqa`](./toolqa/)
+- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
### Web Browsing
@@ -38,7 +41,6 @@ To learn more about how to integrate your benchmark into OpenHands, check out [t
- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
## Before everything begins: Setup Environment and LLM Configuration
Please follow the instructions [here](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md) to set up your local development environment and LLM.
@@ -65,12 +67,10 @@ api_key = "XXX"
temperature = 0.0
```
### Result Visualization
-Check [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization of existing experimental results.
+Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.
### Upload your results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

View File

@@ -33,8 +33,10 @@ development environment and LLM.
given IDs (comma separated).
There are also the following optional environment variables you can set:
-```
+```bash
export USE_UNIT_TESTS=true # if you want to allow the Agent to verify correctness using unittests. Default to false.
+export SKIP_NUM=12 # skip the first 12 instances from the dataset
```
Following is the basic command to start the evaluation.
@@ -58,6 +60,8 @@ You can update the arguments in the script
```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
+# with optional SKIP_NUM
+SKIP_NUM=12 poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
```
Full example:

View File

@@ -34,6 +34,10 @@ from openhands.runtime.runtime import Runtime
# Configure visibility of unit tests to the Agent.
USE_UNIT_TESTS = os.environ.get('USE_UNIT_TESTS', 'false').lower() == 'true'
+SKIP_NUM = os.environ.get('SKIP_NUM')
+SKIP_NUM = (
+    int(SKIP_NUM) if SKIP_NUM and SKIP_NUM.isdigit() and int(SKIP_NUM) >= 0 else None
+)
def get_config(
@@ -66,7 +70,7 @@ async def initialize_runtime(
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}\n")
obs: CmdOutputObservation
# Set instance id
@@ -96,7 +100,7 @@ async def initialize_runtime(
file_path,
'/workspace',
)
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} END Runtime Initialization Fn {'-' * 50}\n")
async def complete_runtime(
@@ -109,7 +113,7 @@ async def complete_runtime(
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}\n")
obs: CmdOutputObservation
# Rewriting the test file to ignore any changes Agent may have made.
@@ -136,7 +140,9 @@ async def complete_runtime(
if isinstance(obs, CmdOutputObservation):
exit_code = obs.exit_code
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} END Runtime Completion Fn {'-' * 50}\n")
await runtime.close()
return {
'test_output': obs.content,
@@ -156,7 +162,9 @@ async def process_instance(
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, str(instance.instance_id), log_dir)
else:
-logger.info(f'Starting evaluation for instance {str(instance.instance_id)}.')
+logger.info(
+    f'\nStarting evaluation for instance {str(instance.instance_id)}.\n'
+)
# =============================================
# build instruction
@@ -268,10 +276,14 @@ if __name__ == '__main__':
eval_ids = None
if args.eval_ids:
eval_ids = str(args.eval_ids).split(',')
-logger.info(f'Using specific dataset IDs: {eval_ids}')
+logger.info(f'\nUsing specific dataset IDs: {eval_ids}\n')
instances = prepare_dataset(
-aider_bench_tests, output_file, args.eval_n_limit, eval_ids=eval_ids
+    aider_bench_tests,
+    output_file,
+    args.eval_n_limit,
+    eval_ids=eval_ids,
+    skip_num=SKIP_NUM,
)
asyncio.run(
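For reference on the SKIP_NUM guard added above in run_infer.py, here is a minimal sketch (not part of the diff; the helper name is hypothetical) of how that expression treats a few representative environment values:

```python
import os

def parse_skip_num(raw: str | None) -> int | None:
    # Same guard as the SKIP_NUM expression in run_infer.py:
    # accept only a non-negative integer string, otherwise None.
    return int(raw) if raw and raw.isdigit() and int(raw) >= 0 else None

os.environ['SKIP_NUM'] = '12'
print(parse_skip_num(os.environ.get('SKIP_NUM')))  # 12
print(parse_skip_num(''))      # None (empty string is falsy)
print(parse_skip_num('abc'))   # None (not a digit string)
print(parse_skip_num('-3'))    # None ('-' makes isdigit() return False)
```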

View File

@@ -22,7 +22,7 @@ def extract_test_results(res_file_path: str) -> tuple[list[str], list[str]]:
if __name__ == '__main__':
if len(sys.argv) != 2:
print(
-'Usage: poetry run python summarise_results.py <path_to_output_jsonl_file>'
+'Usage: poetry run python summarize_results.py <path_to_output_jsonl_file>'
)
sys.exit(1)
json_file_path = sys.argv[1]

View File

@@ -26,7 +26,7 @@ poetry run python evaluation/miniwob/get_success_rate.py evaluation/evaluation_o
## Submit your evaluation results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## BrowsingAgent V1.0 result

View File

@@ -95,7 +95,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
> If you want to evaluate existing results, you should first run this to clone existing outputs
>```bash
->git clone https://huggingface.co/spaces/OpenHands/evaluation evaluation/evaluation_outputs
+>git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
>```
NOTE, you should have already pulled the instance-level OR env-level docker images following [this section](#openhands-swe-bench-instance-level-docker-support).
@@ -129,10 +129,10 @@ The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_be
## Visualize Results
-First you need to clone `https://huggingface.co/spaces/OpenHands/evaluation` and add your own running results from openhands into the `outputs` of the cloned repo.
+First you need to clone `https://huggingface.co/spaces/OpenDevin/evaluation` and add your own running results from openhands into the `outputs` of the cloned repo.
```bash
-git clone https://huggingface.co/spaces/OpenHands/evaluation
+git clone https://huggingface.co/spaces/OpenDevin/evaluation
```
**(optional) setup streamlit environment with conda**:
@@ -156,4 +156,4 @@ Then you can access the SWE-Bench trajectory visualizer at `localhost:8501`.
## Submit your evaluation results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

View File

@@ -181,34 +181,44 @@ def prepare_dataset(
output_file: str,
eval_n_limit: int,
eval_ids: list[str] | None = None,
+skip_num: int | None = None,
):
assert (
'instance_id' in dataset.columns
), "Expected 'instance_id' column in the dataset. You should define your own unique identifier for each instance and use it as the 'instance_id' column."
id_column = 'instance_id'
logger.info(f'Writing evaluation output to {output_file}')
-finished_ids = set()
+finished_ids: set[str] = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
-finished_ids.add(data[id_column])
+finished_ids.add(str(data[id_column]))
logger.warning(
-f'Output file {output_file} already exists. Loaded {len(finished_ids)} finished instances.'
+f'\nOutput file {output_file} already exists. Loaded {len(finished_ids)} finished instances.'
)
if eval_ids:
eval_ids_converted = [dataset[id_column].dtype.type(id) for id in eval_ids]
dataset = dataset[dataset[id_column].isin(eval_ids_converted)]
logger.info(f'Limiting evaluation to {len(eval_ids)} specific instances.')
-elif eval_n_limit:
+elif skip_num and skip_num >= 0:
+    skip_num = min(skip_num, len(dataset))
+    dataset = dataset.iloc[skip_num:]
+    logger.info(
+        f'Starting evaluation with skipping first {skip_num} instances ({len(dataset)} instances to run).'
+    )
+    if eval_n_limit and eval_n_limit > 0:
+        dataset = dataset.head(eval_n_limit)
+        logger.info(f'Limiting evaluation to {eval_n_limit} instances.')
+elif eval_n_limit and eval_n_limit > 0:
+    dataset = dataset.head(eval_n_limit)
+    logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
new_dataset = [
instance
for _, instance in dataset.iterrows()
-if instance[id_column] not in finished_ids
+if str(instance[id_column]) not in finished_ids
]
logger.info(
f'Finished instances: {len(finished_ids)}, Remaining instances: {len(new_dataset)}'
@@ -228,8 +238,8 @@ async def run_evaluation(
):
use_multiprocessing = num_workers > 1
logger.info(
-f'Evaluation started with Agent {metadata.agent_class}, '
-f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.'
+f'Evaluation started with Agent {metadata.agent_class}:\n'
+f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.\n'
)
pbar = tqdm(total=len(dataset))
output_fp = open(output_file, 'a')
@@ -241,7 +251,7 @@ async def run_evaluation(
pbar.set_description(f'Instance {output.instance_id}')
pbar.set_postfix_str(f'Test Result: {output.test_result}')
logger.info(
-f'Finished evaluation for instance {output.instance_id}: {output.test_result}'
+f'Finished evaluation for instance {output.instance_id}: {output.test_result}\n'
)
output_fp.write(json.dumps(output.model_dump()) + '\n')
output_fp.flush()
@@ -270,11 +280,11 @@ async def run_evaluation(
await update_progress(output)
except KeyboardInterrupt:
-print('KeyboardInterrupt received. Cleaning up...')
+print('\nKeyboardInterrupt received. Cleaning up...\n')
cleanup()
output_fp.close()
-logger.info('Evaluation finished.')
+logger.info('\nEvaluation finished.\n')
def reset_logger_for_multiprocessing(
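To make the resume behaviour added to prepare_dataset above concrete, here is a small sketch on toy data (hypothetical ids; it mirrors the skip_num / eval_n_limit / finished_ids interplay rather than calling prepare_dataset itself):

```python
import pandas as pd

dataset = pd.DataFrame({'instance_id': range(10)})
finished_ids = {'3', '4'}      # ids already written to output.jsonl, stored as strings
skip_num, eval_n_limit = 3, 4

subset = dataset.iloc[skip_num:]    # drop the first skip_num rows -> ids 3..9
subset = subset.head(eval_n_limit)  # then cap the run -> ids 3..6
remaining = [
    row for _, row in subset.iterrows()
    if str(row['instance_id']) not in finished_ids
]
print([int(r['instance_id']) for r in remaining])  # [5, 6]
```

Comparing ids as strings on both sides is what makes the resume filter robust to the id dtype in the loaded JSONL versus the DataFrame column.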

View File

@@ -7,6 +7,7 @@ This folder contains evaluation for [WebArena](https://github.com/web-arena-x/we
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Setup WebArena Environment
WebArena requires you to set up websites containing pre-populated content that is accessible via URL to the machine running the OpenHands agents.
Follow [this document](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md) to set up your own WebArena environment through local servers or AWS EC2 instances.
Take note of the base URL (`$WEBARENA_BASE_URL`) of the machine where the environment is installed.
@@ -36,8 +37,7 @@ poetry run python evaluation/webarena/get_success_rate.py evaluation/evaluation_
## Submit your evaluation results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## BrowsingAgent V1.0 result

View File

@@ -12,3 +12,33 @@ def find_available_tcp_port() -> int:
return -1
finally:
sock.close()
+def display_number_matrix(number: int) -> str | None:
+    if not 0 <= number <= 999:
+        return None
+    # Define the matrix representation for each digit
+    digits = {
+        '0': ['###', '# #', '# #', '# #', '###'],
+        '1': ['  #', '  #', '  #', '  #', '  #'],
+        '2': ['###', '  #', '###', '#  ', '###'],
+        '3': ['###', '  #', '###', '  #', '###'],
+        '4': ['# #', '# #', '###', '  #', '  #'],
+        '5': ['###', '#  ', '###', '  #', '###'],
+        '6': ['###', '#  ', '###', '# #', '###'],
+        '7': ['###', '  #', '  #', '  #', '  #'],
+        '8': ['###', '# #', '###', '# #', '###'],
+        '9': ['###', '# #', '###', '  #', '###'],
+    }
+    # alternatively, with leading zeros: num_str = f"{number:03d}"
+    num_str = str(number)  # Convert to string without padding
+    result = []
+    for row in range(5):
+        line = ' '.join(digits[digit][row] for digit in num_str)
+        result.append(line)
+    matrix_display = '\n'.join(result)
+    return f'\n{matrix_display}\n'
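Given the commit message's note about logging the instance id as a matrix, a possible call site looks like the sketch below; the logger name and message format are illustrative, not taken from the diff:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('aider_bench')

instance_id = 42
matrix = display_number_matrix(instance_id)  # returns None for ids outside 0-999
if matrix is not None:
    logger.info(f'Starting evaluation for instance {instance_id}:{matrix}')
```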