EDA Evaluation
This folder contains the evaluation harness for evaluating agents on the Entity-Deduction Arena (EDA) benchmark, from the paper Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games, presented at the ACL 2024 main conference.
Setup Environment and LLM Configuration
Please follow the instructions here to set up your local development environment and configure your LLM.
Start the evaluation
export OPENAI_API_KEY="sk-XXX"; # This is required for evaluation (to simulate the other party of the conversation)
./evaluation/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
where model_config is mandatory, while git-version, agent, dataset and eval_limit are optional.
- model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml (see the sketch after this list).
- git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.6.2.
- agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.
- dataset: there are two tasks in this evaluation. Specify dataset to test on either the things or the celebs task.
- eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default it infers all instances.
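For reference, a model config group can be sketched in config.toml roughly as follows. This is an illustrative assumption, not the exact shape your setup will use: the group name, model identifier, and keys shown here are examples, and depending on your OpenHands version the group may need to be nested under an [llm.*] section instead.

# config.toml (illustrative sketch; adjust model, api_key, and options to your provider)
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-XXX"
temperature = 0.0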
For example,
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
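The optional eval_limit argument goes last. For instance, the following invocation (hypothetical values, same script signature as above) would evaluate only the first 10 instances of the things task:

export OPENAI_API_KEY="sk-XXX"; # required so the other party of the conversation can be simulated
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things 10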
Reference
@inproceedings{zhang2023entity,
  title={Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games},
  author={Zhang, Yizhe and Lu, Jiarui and Jaitly, Navdeep},
  booktitle={ACL},
  year={2024}
}