(enh) Aider-Bench: make resumable with skip_num arg (#3626)

* added optional START_ID env flag to resume from that instance id

* prepare_dataset: fix comparisons by using instance ids as int

* aider bench complete_runtime: close runtime to close container

* added matrix display of instance id for logging

* fix typo in summarize_results.py saying summarise_results

* changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable)

* doc changes about huggingface spaces to temporarily point back to OD
Author: tobitege
Date: 2024-08-28 17:42:01 +02:00 (committed by GitHub)
Parent: 4ed45c7c9c
Commit: 9c39f07430
10 changed files with 92 additions and 36 deletions

View File

@@ -31,7 +31,7 @@ export function HomepageHeader() {
<a href="https://arxiv.org/abs/2407.16741">
<img src="https://img.shields.io/badge/Paper-%20on%20Arxiv-red?logo=arxiv&style=for-the-badge" alt="Paper on Arxiv" />
</a>
-<a href="https://huggingface.co/spaces/OpenHands/evaluation">
+<a href="https://huggingface.co/spaces/OpenDevin/evaluation">
<img src="https://img.shields.io/badge/Evaluation-Benchmark%20on%20HF%20Space-green?logo=huggingface&style=for-the-badge" alt="Evaluation Benchmark" />
</a>
</div>

View File

@@ -3,12 +3,14 @@
This folder contains code and resources to run experiments and evaluations.
## Logistics
To better organize the evaluation folder, we should follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
  all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
-- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
+- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization.
- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
## Supported Benchmarks
@@ -23,6 +25,7 @@ To learn more about how to integrate your benchmark into OpenHands, check out [t
- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
- APIBench: [`evaluation/gorilla`](./gorilla/)
- ToolQA: [`evaluation/toolqa`](./toolqa/)
+- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
### Web Browsing
@@ -38,7 +41,6 @@ To learn more about how to integrate your benchmark into OpenHands, check out [t
- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
## Before everything begins: Setup Environment and LLM Configuration
Please follow the instructions [here](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md) to set up your local development environment and LLM.
@@ -65,12 +67,10 @@ api_key = "XXX"
temperature = 0.0
```
### Result Visualization
-Check [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization of existing experimental results.
+Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.
### Upload your results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

View File

@@ -33,8 +33,10 @@ development environment and LLM.
given IDs (comma separated).
There are also the following optional environment variables you can set:
-```
+```bash
export USE_UNIT_TESTS=true # if you want to allow the Agent to verify correctness using unittests. Default to false.
+export SKIP_NUM=12 # skip the first 12 instances from the dataset
```
Following is the basic command to start the evaluation.
@@ -58,6 +60,8 @@ You can update the arguments in the script
```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
+# with optional SKIP_NUM
+SKIP_NUM=12 poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
```
Full example:

View File

@@ -34,6 +34,10 @@ from openhands.runtime.runtime import Runtime
# Configure visibility of unit tests to the Agent.
USE_UNIT_TESTS = os.environ.get('USE_UNIT_TESTS', 'false').lower() == 'true'
+SKIP_NUM = os.environ.get('SKIP_NUM')
+SKIP_NUM = (
+    int(SKIP_NUM) if SKIP_NUM and SKIP_NUM.isdigit() and int(SKIP_NUM) >= 0 else None
+)
def get_config(
@@ -66,7 +70,7 @@ async def initialize_runtime(
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}\n")
obs: CmdOutputObservation
# Set instance id
@@ -96,7 +100,7 @@ async def initialize_runtime(
file_path,
'/workspace',
)
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} END Runtime Initialization Fn {'-' * 50}\n")
async def complete_runtime(
@@ -109,7 +113,7 @@ async def complete_runtime(
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}\n")
obs: CmdOutputObservation
# Rewriting the test file to ignore any changes Agent may have made.
@@ -136,7 +140,9 @@ async def complete_runtime(
if isinstance(obs, CmdOutputObservation):
exit_code = obs.exit_code
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
logger.info(f"\n{'-' * 50} END Runtime Completion Fn {'-' * 50}\n")
await runtime.close()
return {
'test_output': obs.content,
@@ -156,7 +162,9 @@ async def process_instance(
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, str(instance.instance_id), log_dir)
else:
-logger.info(f'Starting evaluation for instance {str(instance.instance_id)}.')
+logger.info(
+    f'\nStarting evaluation for instance {str(instance.instance_id)}.\n'
+)
# =============================================
# build instruction
@@ -268,10 +276,14 @@ if __name__ == '__main__':
eval_ids = None
if args.eval_ids:
eval_ids = str(args.eval_ids).split(',')
-logger.info(f'Using specific dataset IDs: {eval_ids}')
+logger.info(f'\nUsing specific dataset IDs: {eval_ids}\n')
instances = prepare_dataset(
-aider_bench_tests, output_file, args.eval_n_limit, eval_ids=eval_ids
+    aider_bench_tests,
+    output_file,
+    args.eval_n_limit,
+    eval_ids=eval_ids,
+    skip_num=SKIP_NUM,
)
asyncio.run(
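For reference on the SKIP_NUM guard added above in run_infer.py, here is a minimal sketch (not part of the diff; the helper name is hypothetical) of how that expression treats a few representative environment values:

```python
import os

def parse_skip_num(raw: str | None) -> int | None:
    # Same guard as the SKIP_NUM expression in run_infer.py:
    # accept only a non-negative integer string, otherwise None.
    return int(raw) if raw and raw.isdigit() and int(raw) >= 0 else None

os.environ['SKIP_NUM'] = '12'
print(parse_skip_num(os.environ.get('SKIP_NUM')))  # 12
print(parse_skip_num(''))      # None (empty string is falsy)
print(parse_skip_num('abc'))   # None (not a digit string)
print(parse_skip_num('-3'))    # None ('-' makes isdigit() return False)
```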

View File

@@ -22,7 +22,7 @@ def extract_test_results(res_file_path: str) -> tuple[list[str], list[str]]:
if __name__ == '__main__':
if len(sys.argv) != 2:
print(
-'Usage: poetry run python summarise_results.py <path_to_output_jsonl_file>'
+'Usage: poetry run python summarize_results.py <path_to_output_jsonl_file>'
)
sys.exit(1)
json_file_path = sys.argv[1]

View File

@@ -26,7 +26,7 @@ poetry run python evaluation/miniwob/get_success_rate.py evaluation/evaluation_o
## Submit your evaluation results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## BrowsingAgent V1.0 result

View File

@@ -95,7 +95,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
> If you want to evaluate existing results, you should first run this to clone existing outputs
>```bash
->git clone https://huggingface.co/spaces/OpenHands/evaluation evaluation/evaluation_outputs
+>git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
>```
NOTE, you should have already pulled the instance-level OR env-level docker images following [this section](#openhands-swe-bench-instance-level-docker-support).
@@ -129,10 +129,10 @@ The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_be
## Visualize Results
-First you need to clone `https://huggingface.co/spaces/OpenHands/evaluation` and add your own running results from openhands into the `outputs` of the cloned repo.
+First you need to clone `https://huggingface.co/spaces/OpenDevin/evaluation` and add your own running results from openhands into the `outputs` of the cloned repo.
```bash
-git clone https://huggingface.co/spaces/OpenHands/evaluation
+git clone https://huggingface.co/spaces/OpenDevin/evaluation
```
**(optional) setup streamlit environment with conda**:
@@ -156,4 +156,4 @@ Then you can access the SWE-Bench trajectory visualizer at `localhost:8501`.
## Submit your evaluation results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

View File

@@ -181,34 +181,44 @@ def prepare_dataset(
output_file: str,
eval_n_limit: int,
eval_ids: list[str] | None = None,
+skip_num: int | None = None,
):
assert (
'instance_id' in dataset.columns
), "Expected 'instance_id' column in the dataset. You should define your own unique identifier for each instance and use it as the 'instance_id' column."
id_column = 'instance_id'
logger.info(f'Writing evaluation output to {output_file}')
-finished_ids = set()
+finished_ids: set[str] = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
-finished_ids.add(data[id_column])
+finished_ids.add(str(data[id_column]))
logger.warning(
-f'Output file {output_file} already exists. Loaded {len(finished_ids)} finished instances.'
+f'\nOutput file {output_file} already exists. Loaded {len(finished_ids)} finished instances.'
)
if eval_ids:
eval_ids_converted = [dataset[id_column].dtype.type(id) for id in eval_ids]
dataset = dataset[dataset[id_column].isin(eval_ids_converted)]
logger.info(f'Limiting evaluation to {len(eval_ids)} specific instances.')
-elif eval_n_limit:
+elif skip_num and skip_num >= 0:
+    skip_num = min(skip_num, len(dataset))
+    dataset = dataset.iloc[skip_num:]
+    logger.info(
+        f'Starting evaluation with skipping first {skip_num} instances ({len(dataset)} instances to run).'
+    )
+    if eval_n_limit and eval_n_limit > 0:
+        dataset = dataset.head(eval_n_limit)
+        logger.info(f'Limiting evaluation to {eval_n_limit} instances.')
+elif eval_n_limit and eval_n_limit > 0:
+    dataset = dataset.head(eval_n_limit)
+    logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
new_dataset = [
instance
for _, instance in dataset.iterrows()
-if instance[id_column] not in finished_ids
+if str(instance[id_column]) not in finished_ids
]
logger.info(
f'Finished instances: {len(finished_ids)}, Remaining instances: {len(new_dataset)}'
@@ -228,8 +238,8 @@ async def run_evaluation(
):
use_multiprocessing = num_workers > 1
logger.info(
-f'Evaluation started with Agent {metadata.agent_class}, '
-f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.'
+f'Evaluation started with Agent {metadata.agent_class}:\n'
+f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.\n'
)
pbar = tqdm(total=len(dataset))
output_fp = open(output_file, 'a')
@@ -241,7 +251,7 @@ async def run_evaluation(
pbar.set_description(f'Instance {output.instance_id}')
pbar.set_postfix_str(f'Test Result: {output.test_result}')
logger.info(
-f'Finished evaluation for instance {output.instance_id}: {output.test_result}'
+f'Finished evaluation for instance {output.instance_id}: {output.test_result}\n'
)
output_fp.write(json.dumps(output.model_dump()) + '\n')
output_fp.flush()
@@ -270,11 +280,11 @@ async def run_evaluation(
await update_progress(output)
except KeyboardInterrupt:
-print('KeyboardInterrupt received. Cleaning up...')
+print('\nKeyboardInterrupt received. Cleaning up...\n')
cleanup()
output_fp.close()
-logger.info('Evaluation finished.')
+logger.info('\nEvaluation finished.\n')
def reset_logger_for_multiprocessing(
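To make the resume behaviour added to prepare_dataset above concrete, here is a small sketch on toy data (hypothetical ids; it mirrors the skip_num / eval_n_limit / finished_ids interplay rather than calling prepare_dataset itself):

```python
import pandas as pd

dataset = pd.DataFrame({'instance_id': range(10)})
finished_ids = {'3', '4'}      # ids already written to output.jsonl, stored as strings
skip_num, eval_n_limit = 3, 4

subset = dataset.iloc[skip_num:]    # drop the first skip_num rows -> ids 3..9
subset = subset.head(eval_n_limit)  # then cap the run -> ids 3..6
remaining = [
    row for _, row in subset.iterrows()
    if str(row['instance_id']) not in finished_ids
]
print([int(r['instance_id']) for r in remaining])  # [5, 6]
```

Comparing ids as strings on both sides is what makes the resume filter robust to the id dtype in the loaded JSONL versus the DataFrame column.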

View File

@@ -7,6 +7,7 @@ This folder contains evaluation for [WebArena](https://github.com/web-arena-x/we
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Setup WebArena Environment
WebArena requires you to set up websites containing pre-populated content that is accessible via URL to the machine running the OpenHands agents.
Follow [this document](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md) to set up your own WebArena environment through local servers or AWS EC2 instances.
Take note of the base URL (`$WEBARENA_BASE_URL`) of the machine where the environment is installed.
@@ -36,8 +37,7 @@ poetry run python evaluation/webarena/get_success_rate.py evaluation/evaluation_
## Submit your evaluation results
-You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## BrowsingAgent V1.0 result

View File

@@ -12,3 +12,33 @@ def find_available_tcp_port() -> int:
return -1
finally:
sock.close()
+def display_number_matrix(number: int) -> str | None:
+    if not 0 <= number <= 999:
+        return None
+    # Define the matrix representation for each digit
+    digits = {
+        '0': ['###', '# #', '# #', '# #', '###'],
+        '1': ['  #', '  #', '  #', '  #', '  #'],
+        '2': ['###', '  #', '###', '#  ', '###'],
+        '3': ['###', '  #', '###', '  #', '###'],
+        '4': ['# #', '# #', '###', '  #', '  #'],
+        '5': ['###', '#  ', '###', '  #', '###'],
+        '6': ['###', '#  ', '###', '# #', '###'],
+        '7': ['###', '  #', '  #', '  #', '  #'],
+        '8': ['###', '# #', '###', '# #', '###'],
+        '9': ['###', '# #', '###', '  #', '###'],
+    }
+    # alternatively, with leading zeros: num_str = f"{number:03d}"
+    num_str = str(number)  # Convert to string without padding
+    result = []
+    for row in range(5):
+        line = ' '.join(digits[digit][row] for digit in num_str)
+        result.append(line)
+    matrix_display = '\n'.join(result)
+    return f'\n{matrix_display}\n'
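Given the commit message's note about logging the instance id as a matrix, a possible call site looks like the sketch below; the logger name and message format are illustrative, not taken from the diff:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('aider_bench')

instance_id = 42
matrix = display_number_matrix(instance_id)  # returns None for ids outside 0-999
if matrix is not None:
    logger.info(f'Starting evaluation for instance {instance_id}:{matrix}')
```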