EDA Evaluation
This folder contains the evaluation harness for evaluating agents on the Entity-Deduction Arena (EDA) benchmark, from the paper Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games, presented at the ACL 2024 main conference.
Setup Environment and LLM Configuration
Please follow the instructions here to set up your local development environment and configure your LLM.
Start the evaluation
export OPENAI_API_KEY="sk-XXX"; # This is required for evaluation (to simulate the other party of the conversation)
./evaluation/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
where model_config is mandatory, while git-version, agent, dataset and eval_limit are optional.
- model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml (see the sketch after this list).
- git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.6.2.
- agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.
- dataset: there are two tasks in this evaluation. Specify dataset to test on either the things or the celebs task.
- eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default it infers all instances.
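For reference, a model config group can be sketched in config.toml roughly as follows. This is an illustrative assumption, not the exact shape your setup will use: the group name, model identifier, and keys shown here are examples, and depending on your OpenHands version the group may need to be nested under an [llm.*] section instead.

# config.toml (illustrative sketch; adjust model, api_key, and options to your provider)
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-XXX"
temperature = 0.0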
For example,
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
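The optional eval_limit argument goes last. For instance, the following invocation (hypothetical values, same script signature as above) would evaluate only the first 10 instances of the things task:

export OPENAI_API_KEY="sk-XXX"; # required so the other party of the conversation can be simulated
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things 10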
Reference
@inproceedings{zhang2023entity,
  title={Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games},
  author={Zhang, Yizhe and Lu, Jiarui and Jaitly, Navdeep},
  booktitle={ACL},
  year={2024}
}