mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2024-08-29 01:18:33 +03:00


ToolQA Evaluation with OpenHands

This folder contains an evaluation harness we built on top of the original ToolQA (paper).

Setup Environment and LLM Configuration

Please follow the instructions here to set up your local development environment and LLM.

Run Inference on ToolQA Instances

Make sure your Docker daemon is running, then run this bash script:

bash evaluation/toolqa/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [dataset] [hardness] [wolfram_alpha_appid]

where model_config is mandatory, while all other arguments are optional.

model_config, e.g. llm, is the config group name for your LLM settings, as defined in your config.toml.

git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.6.2.

agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.

eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default, the script evaluates 1 instance.

dataset is the ToolQA dataset to evaluate on. Choose from agenda, airbnb, coffee, dblp, flight, gsm8k, scirex, or yelp. The default is coffee.

hardness is the difficulty level to evaluate. Choose easy or hard. The default is easy.

wolfram_alpha_appid is optional. When it is given, the agent can access Wolfram Alpha's APIs.

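Note: the arguments are positional, so setting any argument requires setting every argument before it. To use agent, you must also set git-version; to use eval_limit, you must also set agent; to use dataset, you must also set eval_limit; to use hardness, you must also set dataset.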

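Let's say you'd like to run 10 instances using llm and CodeActAgent at HEAD on the coffee easy test. Because git-version is the second positional argument, it must be given before the agent name, so your command would be:

bash evaluation/toolqa/scripts/run_infer.sh llm HEAD CodeActAgent 10 coffee easy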
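
Because the script takes its arguments by position, an optional argument can only be supplied when every argument before it is supplied too. The following is a minimal Python sketch of that rule, assuming the argument order shown in the usage line above. The helper name build_run_infer_command and the underscore spelling git_version are hypothetical and not part of OpenHands.

```python
import shlex

# Positional order of the optional arguments, taken from the usage line
# in this README. "git_version" stands in for the git-version argument.
OPTIONAL_ARGS = [
    "git_version",
    "agent",
    "eval_limit",
    "dataset",
    "hardness",
    "wolfram_alpha_appid",
]
SCRIPT = "evaluation/toolqa/scripts/run_infer.sh"


def build_run_infer_command(model_config, **optional):
    """Hypothetical helper: build the argv list for run_infer.sh.

    Raises ValueError if an optional argument is given while an earlier
    argument in the positional order is missing.
    """
    unknown = set(optional) - set(OPTIONAL_ARGS)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")

    argv = ["bash", SCRIPT, str(model_config)]
    missing = None  # first optional argument that was left out
    for name in OPTIONAL_ARGS:
        value = optional.get(name)
        if value is None:
            if missing is None:
                missing = name
        elif missing is not None:
            # A later argument was given after a gap in the positions.
            raise ValueError(f"{name} requires {missing} to be set as well")
        else:
            argv.append(str(value))
    return argv


if __name__ == "__main__":
    cmd = build_run_infer_command(
        "llm", git_version="HEAD", agent="CodeActAgent", eval_limit=10
    )
    print(shlex.join(cmd))
```

Calling the helper with agent set but git_version omitted raises a ValueError instead of silently shifting CodeActAgent into the git-version position.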