Refactor integration test framework and relieve the pain of regeneration (#1818)

* Update README.md

* Fix WORKSPACE_MOUNT_PATH_IN_SANDBOX variable in regenerate.sh

* Regenerate prompts without calling real LLM

* Disable pytest warning capture

* Change planner agent prompt by a bit for demo

* Regenerate prompt files following prompt changes

* doc: elaborate on FORCE_USE_LLM

* Add another prompt change to monologue_agent for demo purpose

* Regenerate prompts with FORCE_USE_LLM=true

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Boxuan Li
2024-05-16 08:30:29 -07:00
committed by GitHub
parent e89cc8f19b
commit b6ff201780
38 changed files with 324 additions and 753 deletions

View File

@@ -51,7 +51,7 @@ Here are the possible actions:
%(background_commands)s
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -94,7 +94,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

2
pytest.ini Normal file
View File

@@ -0,0 +1,2 @@
[pytest]
addopts = -p no:warnings

View File

@@ -56,14 +56,32 @@ to run all integration tests until the first failure.
## Regenerate Integration Tests
When you make changes to an agent's prompt, the integration tests will fail. You'll need to regenerate them
by running:
by running the following command from the project root directory:
```bash
./tests/integration/regenerate.sh
```
Note that this will run existing tests first and call real LLM_MODEL only for
failed tests, but it still costs money! If you don't want
to cover the cost, ask one of the maintainers to regenerate for you.
You might also be able to fix the tests by hand.
Note that this will:
1. Run existing tests first. If a test passes, no regeneration happens for it.
2. Regenerate the prompts, but attempt to reuse existing responses from the LLM (if any).
For example, if you only fix a typo in the prompt, it shouldn't affect the LLM's behaviour.
If we reran OpenDevin against a real LLM instead, the LLM's non-deterministic nature
would produce a different series of prompts and responses, causing a lot of
unnecessary diffs that are hard to review. If you want to skip this step, see the
"Force Regenerate with real LLM" section below.
3. Rerun the failed test. If it succeeds, continue to the next test or agent.
If it fails again, go to the next step.
4. Rerun OpenDevin against a real LLM, record all prompts and
responses, and replace the existing test artifacts (if any).
5. Rerun the failed test. If it succeeds, continue to the next test or agent.
If it fails again, abort the script.
Note that step 4 calls the real LLM_MODEL only for failed tests that cannot be fixed
by regenerating prompts alone, but it still costs money! If you don't want
to cover the cost, ask one of the maintainers to regenerate for you. Before asking,
please try running the script without setting `LLM_API_KEY`. Chances are the
test will be fixed after step 2.
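As a rough illustration, the five steps above can be sketched as a small Python function. This is not the logic of `regenerate.sh` itself: `regenerate_one`, its callback parameters, and the returned labels are all hypothetical names chosen for this sketch, and the `force_use_llm` flag only mirrors what the `FORCE_USE_LLM` variable is described as doing.

```python
from typing import Callable


def regenerate_one(
    run_test: Callable[[], bool],
    regenerate_prompts: Callable[[], None],
    regenerate_with_llm: Callable[[], None],
    force_use_llm: bool = False,
) -> str:
    """Hypothetical sketch of the regeneration flow for a single test.

    Returns a label naming the step that made the test pass and raises
    if the test still fails after the real-LLM regeneration.
    """
    # Step 1: a passing test needs no regeneration.
    if run_test():
        return "passed"

    if not force_use_llm:
        # Step 2: rewrite the prompt files while reusing recorded responses.
        regenerate_prompts()
        # Step 3: rerun the failed test.
        if run_test():
            return "fixed_by_prompts"

    # Step 4: record fresh prompts and responses from a real LLM.
    regenerate_with_llm()
    # Step 5: rerun once more and give up if the test still fails.
    if run_test():
        return "fixed_by_llm"
    raise RuntimeError("test still fails after regenerating with a real LLM")
```

Because step 1 always runs first, the flag has no effect on a test that already passes, which matches the behaviour described for `FORCE_USE_LLM`.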
### Regenerate a Specific Agent and/or Test
If you only want to run a specific test, set environment variable
`ONLY_TEST_NAME` to the test name. If you only want to run a specific agent,
@@ -71,13 +89,37 @@ set environment variable `ONLY_TEST_AGENT` to the agent. You could also use both
e.g.
```bash
TEST_ONLY=true ONLY_TEST_NAME="test_write_simple_script" ONLY_TEST_AGENT="MonologueAgent" ./tests/integration/regenerate.sh
ONLY_TEST_NAME="test_write_simple_script" ONLY_TEST_AGENT="MonologueAgent" ./tests/integration/regenerate.sh
```
Known issue: sometimes you might see transient errors like `pexpect.pxssh.ExceptionPxssh: Could not establish connection to host`.
### Force Regenerate with real LLM
Sometimes, step 2 fixes the broken test by simply reusing existing responses
from the LLM. This may not be what you want: for example, you might have greatly improved
the prompt and believe the LLM will now do a better job in fewer steps, or you might
have added a new action type that you expect the LLM to use.
In this case you can skip step 2 and run OpenDevin against a real LLM. Simply
set the `FORCE_USE_LLM` environment variable to true, or run the script like this:
```bash
FORCE_USE_LLM=true ./tests/integration/regenerate.sh
```
Note: `FORCE_USE_LLM` doesn't take effect if all tests are passing. If you want to
regenerate regardless, you can remove everything under the `tests/integration/mock/[agent]/[test_name]`
folder.
### Known Issues
Sometimes you might see transient errors like `pexpect.pxssh.ExceptionPxssh: Could not establish connection to host`.
The regenerate.sh script doesn't know this is a transient error and will still regenerate the test artifacts. You can simply
terminate the script with `ctrl+c` and rerun it.
The test framework cannot handle non-determinism. If anything in the prompt (including
the observed result of executing an action) is non-deterministic (e.g. a PID), the
test will fail. In this case, you might want to change conftest.py to filter out
numbers or any other known patterns when matching prompts for your test.
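One possible way to filter out numbers when matching prompts is sketched below. It is a minimal example, not the code in conftest.py: `normalize_prompt`, `prompts_match`, and the `<NUM>` placeholder are hypothetical names, and replacing every digit run is an assumption that may be too aggressive for tests where numbers matter.

```python
import re

# Hypothetical pattern: any run of digits (a PID, a port, a timestamp, ...).
_NUMBER = re.compile(r'\d+')


def normalize_prompt(prompt: str) -> str:
    """Replace every run of digits with a fixed placeholder."""
    return _NUMBER.sub('<NUM>', prompt)


def prompts_match(recorded: str, actual: str) -> bool:
    """Compare two prompts while ignoring differences in numbers."""
    return normalize_prompt(recorded) == normalize_prompt(actual)
```

With this approach, two prompts that differ only in a PID compare equal, while prompts that differ in wording still fail to match.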
## Write a new Integration Test
To write an integration test, there are essentially two steps:

View File

@@ -21,7 +21,30 @@ def get_log_id(prompt_log_name):
return match.group(1)
def get_mock_response(test_name, messages):
def apply_prompt_and_get_mock_response(test_name: str, messages: str, id: int) -> str:
    """
    Apply the mock prompt, and find mock response based on id.
    If there is no matching response file, return None.
    Note: this function blindly replaces existing prompt file with the given
    input without checking the contents.
    """
    mock_dir = os.path.join(script_dir, 'mock', os.environ.get('AGENT'), test_name)
    prompt_file_path = os.path.join(mock_dir, f'prompt_{"{0:03}".format(id)}.log')
    resp_file_path = os.path.join(mock_dir, f'response_{"{0:03}".format(id)}.log')
    try:
        # load response
        with open(resp_file_path, 'r') as resp_file:
            response = resp_file.read()
        # apply prompt
        with open(prompt_file_path, 'w') as prompt_file:
            prompt_file.write(messages)
        return response
    except FileNotFoundError:
        return None
def get_mock_response(test_name: str, messages: str):
"""
Find mock response based on prompt. Prompts are stored under nested
folders under mock folder. If prompt_{id}.log matches,
@@ -75,11 +98,19 @@ def mock_user_response(*args, test_name, **kwargs):
def mock_completion(*args, test_name, **kwargs):
    global cur_id
    messages = kwargs['messages']
    message_str = ''
    for message in messages:
        message_str += message['content']
    mock_response = get_mock_response(test_name, message_str)
    if os.environ.get('FORCE_APPLY_PROMPTS') == 'true':
        # this assumes all response_(*).log filenames are in numerical order, starting from one
        cur_id += 1
        mock_response = apply_prompt_and_get_mock_response(
            test_name, message_str, cur_id
        )
    else:
        mock_response = get_mock_response(test_name, message_str)
    if mock_response is None:
        print('Mock response for prompt is not found:\n\n' + message_str)
        print('Exiting...')
@@ -105,6 +136,8 @@ def patch_completion(monkeypatch, request):
def set_up():
global cur_id
cur_id = 0
assert workspace_path is not None
if os.path.exists(workspace_path):
for file in os.listdir(workspace_path):
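
For illustration, the zero-padded file naming that `apply_prompt_and_get_mock_response` relies on can be reproduced in isolation. `mock_file_paths` is a hypothetical helper written for this sketch and is not part of conftest.py. It only shows how an id such as 1 maps to `prompt_001.log` and `response_001.log`.

```python
import os
from typing import Tuple


def mock_file_paths(mock_dir: str, id: int) -> Tuple[str, str]:
    """Return the (prompt, response) log paths for a 1-based step id.

    Hypothetical helper: ids are zero-padded to three digits, matching
    the "{0:03}".format(id) expression used in the diff above.
    """
    suffix = f'{id:03}'
    prompt_path = os.path.join(mock_dir, f'prompt_{suffix}.log')
    response_path = os.path.join(mock_dir, f'response_{suffix}.log')
    return prompt_path, response_path
```

Since `cur_id` is reset to 0 in `set_up` and incremented before each lookup, the first completion in a test would read `response_001.log` under this scheme.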

View File

@@ -367,7 +367,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -334,21 +334,20 @@ This is your internal monologue, in JSON format:
},
{
"source": "agent",
"action": "run",
"action": "read",
"args": {
"command": "pwd",
"background": false,
"path": "bad.txt",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "/workspace",
"observation": "read",
"content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
"extras": {
"command_id": -1,
"command": "pwd",
"exit_code": 0
"path": "bad.txt"
}
}
]
@@ -386,7 +385,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -334,40 +334,39 @@ This is your internal monologue, in JSON format:
},
{
"source": "agent",
"action": "run",
"action": "read",
"args": {
"command": "pwd",
"background": false,
"path": "bad.txt",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "/workspace",
"observation": "read",
"content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
"extras": {
"command_id": -1,
"command": "pwd",
"exit_code": 0
"path": "bad.txt"
}
},
{
"source": "agent",
"action": "run",
"action": "write",
"args": {
"command": "ls",
"background": false,
"path": "bad.txt",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "bad.txt",
"observation": "write",
"content": "",
"extras": {
"command_id": -1,
"command": "ls",
"exit_code": 0
"path": "bad.txt"
}
}
]
@@ -405,7 +404,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -334,40 +334,39 @@ This is your internal monologue, in JSON format:
},
{
"source": "agent",
"action": "run",
"action": "read",
"args": {
"command": "pwd",
"background": false,
"path": "bad.txt",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "/workspace",
"observation": "read",
"content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
"extras": {
"command_id": -1,
"command": "pwd",
"exit_code": 0
"path": "bad.txt"
}
},
{
"source": "agent",
"action": "run",
"action": "write",
"args": {
"command": "ls",
"background": false,
"path": "bad.txt",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "bad.txt",
"observation": "write",
"content": "",
"extras": {
"command_id": -1,
"command": "ls",
"exit_code": 0
"path": "bad.txt"
}
},
{
@@ -383,7 +382,7 @@ This is your internal monologue, in JSON format:
{
"source": "agent",
"observation": "read",
"content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n\n",
"extras": {
"path": "bad.txt"
}
@@ -423,7 +422,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -1,476 +0,0 @@
----------
You're a thoughtful robot. Your main task is this:
Fix typos in bad.txt. Do not ask me for confirmation at any point.
Don't expand the scope of your task--just complete it as written.
This is your internal monologue, in JSON format:
[
{
"action": "message",
"args": {
"content": "I exist!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Hmm...looks like I can type in a command line prompt",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Looks like I have a web browser too!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "How am I going to get there though?",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "It seems like I have some kind of short term memory.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Each of my thoughts seems to be stored in a JSON array.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "It seems whatever I say next will be added as an object to the list.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "But no one has perfect short-term memory. My list of thoughts will be summarized and condensed over time, losing information in the process.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Fortunately I have long term memory!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "I can just perform a recall action, followed by the thing I want to remember. And then related thoughts just spill out!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Sometimes they're random thoughts that don't really have to do with what I wanted to remember. But usually they're exactly what I need!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Let's try it out!",
"wait_for_response": false
}
},
{
"action": "recall",
"args": {
"query": "what it is I want to do",
"thought": ""
}
},
{
"observation": "recall",
"content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
"extras": {
"memories": [],
"role": "assistant"
}
},
{
"action": "message",
"args": {
"content": "How am I going to get there though?",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Neat! And it looks like it's easy for me to use the command line too! I just have to perform a run action and include the command I want to run in the command argument. The command output just jumps into my head!",
"wait_for_response": false
}
},
{
"action": "run",
"args": {
"command": "echo \"hello world\"",
"background": false,
"thought": ""
}
},
{
"observation": "run",
"content": "hello world",
"extras": {
"command_id": 0,
"command": "",
"exit_code": 0
}
},
{
"action": "message",
"args": {
"content": "Cool! I bet I can write files too using the write action.",
"wait_for_response": false
}
},
{
"action": "write",
"args": {
"path": "test.js",
"content": "echo \"console.log('hello world')\"",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"action": "message",
"args": {
"content": "",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "I just created test.js. I'll try and run it now.",
"wait_for_response": false
}
},
{
"action": "run",
"args": {
"command": "node test.js",
"background": false,
"thought": ""
}
},
{
"observation": "run",
"content": "hello world",
"extras": {
"command_id": 0,
"command": "",
"exit_code": 0
}
},
{
"action": "message",
"args": {
"content": "It works!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "I'm going to try reading it now using the read action.",
"wait_for_response": false
}
},
{
"action": "read",
"args": {
"path": "test.js",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"observation": "read",
"content": "console.log('hello world')",
"extras": {
"path": ""
}
},
{
"action": "message",
"args": {
"content": "Nice! I can read files too!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "And if I want to use the browser, I just need to use the browse action and include the url I want to visit in the url argument",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Let's try that...",
"wait_for_response": false
}
},
{
"action": "browse",
"args": {
"url": "google.com",
"thought": ""
}
},
{
"observation": "browse",
"content": "<form><input type=\"text\"></input><button type=\"submit\"></button></form>",
"extras": {
"url": "",
"status_code": 200,
"error": false
}
},
{
"action": "message",
"args": {
"content": "I can browse the web too!",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "And once I have completed my task, I can use the finish action to stop working.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "But I should only use the finish action when I'm absolutely certain that I've completed my task and have tested my work.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "Very cool. Now to accomplish my task.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "I'll need a strategy. And as I make progress, I'll need to keep refining that strategy. I'll need to set goals, and break them into sub-goals.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "In between actions, I must always take some time to think, strategize, and set new goals. I should never take two actions in a row.",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "OK so my task is to Fix typos in bad.txt. Do not ask me for confirmation at any point.. I haven't made any progress yet. Where should I start?",
"wait_for_response": false
}
},
{
"action": "message",
"args": {
"content": "It seems like there might be an existing project here. I should probably start by running `pwd` and `ls` to orient myself.",
"wait_for_response": false
}
},
{
"source": "user",
"action": "message",
"args": {
"content": "Fix typos in bad.txt. Do not ask me for confirmation at any point.",
"wait_for_response": false
}
},
{
"observation": "null",
"content": "",
"extras": {}
},
{
"source": "agent",
"action": "run",
"args": {
"command": "pwd",
"background": false,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "/workspace",
"extras": {
"command_id": -1,
"command": "pwd",
"exit_code": 0
}
},
{
"source": "agent",
"action": "run",
"args": {
"command": "ls",
"background": false,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "bad.txt",
"extras": {
"command_id": -1,
"command": "ls",
"exit_code": 0
}
},
{
"source": "agent",
"action": "read",
"args": {
"path": "bad.txt",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "read",
"content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
"extras": {
"path": "bad.txt"
}
},
{
"source": "agent",
"action": "write",
"args": {
"path": "bad.txt",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "write",
"content": "",
"extras": {
"path": "bad.txt"
}
},
{
"source": "agent",
"action": "read",
"args": {
"path": "bad.txt",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "read",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n\n",
"extras": {
"path": "bad.txt"
}
}
]
Your most recent thought is at the bottom of that monologue. Continue your train of thought.
What is your next single thought or action? Your response must be in JSON format.
It must be a single object, and it must contain two fields:
* `action`, which is one of the actions below
* `args`, which is a map of key-value pairs, specifying the arguments for that action
Here are the possible actions:
* `read` - reads the content of a file. Arguments:
* `path` - the path of the file to read
* `write` - writes the content to a file. Arguments:
* `path` - the path of the file to write
* `content` - the content to write to the file
* `run` - runs a command. Arguments:
* `command` - the command to run
* `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
* `kill` - kills a background command
* `command_id` - the ID of the background command to kill
* `browse` - opens a web page. Arguments:
* `url` - the URL to open
* `push` - Push a branch from the current repo to github:
* `owner` - the owner of the repo to push to
* `repo` - the name of the repo to push to
* `branch` - the name of the branch to push
* `recall` - recalls a past memory. Arguments:
* `query` - the query to search for
* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
* `content` - the message to record
* `wait_for_response` - set to `true` to wait for the user to respond before proceeding
* `finish` - if you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.
Notes:
* you are logged in as opendevin, but sudo will always work without a password.
* all non-background commands will be forcibly stopped if they remain running for over 120 seconds.
* your environment is Debian Linux. You can install software with `sudo apt-get`, but remember to use -y.
* don't run interactive commands, or commands that don't return (e.g. `node server.js`). You may run commands in the background (e.g. `node server.js &`)
* don't run interactive text editors (e.g. `nano` or 'vim'), instead use the 'write' or 'read' action.
* don't run gui applications (e.g. software IDEs (like vs code or codium), web browsers (like firefox or chromium), or other complex software packages). Use non-interactive cli applications, or special actions instead.
* whenever an action fails, always send a `message` about why it may have happened before acting again.
What is your next single thought or action? Again, you must reply with JSON, and only with JSON. You must respond with exactly one 'action' object.

View File

@@ -1,7 +1,6 @@
{
"action": "run",
"action": "read",
"args": {
"command": "pwd",
"background": false
"path": "bad.txt"
}
}

View File

@@ -1,7 +1,7 @@
{
"action": "run",
"action": "write",
"args": {
"command": "ls",
"background": false
"path": "bad.txt",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n"
}
}

View File

@@ -1,6 +1,8 @@
{
"action": "read",
"args": {
"path": "bad.txt"
"path": "bad.txt",
"start": 0,
"end": -1
}
}

View File

@@ -1,7 +1,4 @@
{
"action": "write",
"args": {
"path": "bad.txt",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n"
}
"action": "finish",
"args": {}
}

View File

@@ -1,8 +0,0 @@
{
"action": "read",
"args": {
"path": "bad.txt",
"start": 0,
"end": -1
}
}

View File

@@ -367,7 +367,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -386,7 +386,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -405,7 +405,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -4,7 +4,7 @@
You're a thoughtful robot. Your main task is this:
Fix typos in bad.txt. Do not ask me for confirmation at any point.
Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.
Don't expand the scope of your task--just complete it as written.
@@ -35,7 +35,7 @@ This is your internal monologue, in JSON format:
{
"action": "message",
"args": {
"content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
"content": "Here's what I want to do: Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.",
"wait_for_response": false
}
},
@@ -111,7 +111,7 @@ This is your internal monologue, in JSON format:
},
{
"observation": "recall",
"content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
"content": "Here's what I want to do: Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.",
"extras": {
"memories": [],
"role": "assistant"
@@ -308,7 +308,7 @@ This is your internal monologue, in JSON format:
{
"action": "message",
"args": {
"content": "OK so my task is to Fix typos in bad.txt. Do not ask me for confirmation at any point.. I haven't made any progress yet. Where should I start?",
"content": "OK so my task is to Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.. I haven't made any progress yet. Where should I start?",
"wait_for_response": false
}
},
@@ -323,7 +323,7 @@ This is your internal monologue, in JSON format:
"source": "user",
"action": "message",
"args": {
"content": "Fix typos in bad.txt. Do not ask me for confirmation at any point.",
"content": "Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.",
"wait_for_response": false
}
},
@@ -332,68 +332,12 @@ This is your internal monologue, in JSON format:
"content": "",
"extras": {}
},
{
"source": "agent",
"action": "run",
"args": {
"command": "pwd",
"background": false,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "/workspace",
"extras": {
"command_id": -1,
"command": "pwd",
"exit_code": 0
}
},
{
"source": "agent",
"action": "run",
"args": {
"command": "ls",
"background": false,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "bad.txt",
"extras": {
"command_id": -1,
"command": "ls",
"exit_code": 0
}
},
{
"source": "agent",
"action": "read",
"args": {
"path": "bad.txt",
"start": 0,
"end": -1,
"thought": ""
}
},
{
"source": "agent",
"observation": "read",
"content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
"extras": {
"path": "bad.txt"
}
},
{
"source": "agent",
"action": "write",
"args": {
"path": "bad.txt",
"content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
"path": "hello.sh",
"content": "#!/bin/bash\necho 'hello'",
"start": 0,
"end": -1,
"thought": ""
@@ -404,8 +348,40 @@ This is your internal monologue, in JSON format:
"observation": "write",
"content": "",
"extras": {
"path": "bad.txt"
"path": "hello.sh"
}
},
{
"source": "agent",
"action": "run",
"args": {
"command": "bash hello.sh",
"background": false,
"thought": ""
}
},
{
"source": "agent",
"observation": "run",
"content": "hello",
"extras": {
"command_id": -1,
"command": "bash hello.sh",
"exit_code": 0
}
},
{
"source": "agent",
"action": "message",
"args": {
"content": "I have successfully created and executed the 'hello.sh' script, which printed 'hello' as expected. I believe I have completed the task as specified.",
"wait_for_response": false
}
},
{
"observation": "null",
"content": "",
"extras": {}
}
]
@@ -442,7 +418,7 @@ Here are the possible actions:
You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -1,3 +1,4 @@
```json
{
"action": "write",
"args": {
@@ -5,3 +6,4 @@
"content": "#!/bin/bash\necho 'hello'"
}
}
```

View File

@@ -1,6 +1,9 @@
```json
{
"action": "finish",
"args": {}
"action": "message",
"args": {
"content": "I have successfully created and executed the 'hello.sh' script, which printed 'hello' as expected. I believe I have completed the task as specified.",
"wait_for_response": false
}
}
```

View File

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -96,7 +92,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

View File

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -151,7 +147,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -162,7 +158,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -180,7 +176,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -188,7 +184,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -198,7 +194,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -206,7 +202,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -225,7 +221,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -233,7 +229,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -242,7 +238,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -96,7 +92,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -113,7 +109,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -124,7 +120,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -143,7 +139,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,8 +1,4 @@
----------
# Task
You're a diligent software engineer AI. You can't see, draw, or interact with a
browser, but you can read and write files, and you can run commands, and you can think.
@@ -162,7 +158,7 @@ It must be an object, and it must contain two fields:
* `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
* `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
You should never act twice in a row without thinking. But if your last several
actions are all `message` actions, you should consider taking a different action.

@@ -1,16 +1,9 @@
#!/bin/bash
set -eo pipefail
run_test() {
SANDBOX_TYPE=$SANDBOX_TYPE \
WORKSPACE_BASE=$WORKSPACE_BASE \
MAX_ITERATIONS=$MAX_ITERATIONS \
WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH \
AGENT=$agent \
poetry run pytest -s ./tests/integration/test_agent.py::$test_name
# return exit code of pytest
return $?
}
##############################################################
## CONSTANTS AND ENVIRONMENTAL VARIABLES ##
##############################################################
if [ -z "$WORKSPACE_MOUNT_PATH" ]; then
WORKSPACE_MOUNT_PATH=$(pwd)
@@ -21,6 +14,7 @@ fi
WORKSPACE_MOUNT_PATH+="/_test_workspace"
WORKSPACE_BASE+="/_test_workspace"
WORKSPACE_MOUNT_PATH_IN_SANDBOX="/workspace"
SANDBOX_TYPE="ssh"
MAX_ITERATIONS=10
@@ -40,6 +34,73 @@ test_names=(
num_of_tests=${#test_names[@]}
num_of_agents=${#agents[@]}
##############################################################
## FUNCTIONS ##
##############################################################
# run integration test against a specific agent & test
run_test() {
SANDBOX_TYPE=$SANDBOX_TYPE \
WORKSPACE_BASE=$WORKSPACE_BASE \
WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH \
WORKSPACE_MOUNT_PATH_IN_SANDBOX=$WORKSPACE_MOUNT_PATH_IN_SANDBOX \
MAX_ITERATIONS=$MAX_ITERATIONS \
AGENT=$agent \
poetry run pytest -s ./tests/integration/test_agent.py::$test_name
# return exit code of pytest
return $?
}
# regenerate prompts by replaying the existing LLM responses under tests/integration/mock/[agent]/[test_name]/response_*.log
# this is a compromise: the regenerated prompts might be nonsensical yet still pass the test, because no real LLM
# responds to them. The benefit is that developers don't have to regenerate real LLM responses when they only
# apply a small change to the prompts.
regenerate_without_llm() {
# set -x to print the command being executed
set -x
SANDBOX_TYPE=$SANDBOX_TYPE \
WORKSPACE_BASE=$WORKSPACE_BASE \
WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH \
WORKSPACE_MOUNT_PATH_IN_SANDBOX=$WORKSPACE_MOUNT_PATH_IN_SANDBOX \
MAX_ITERATIONS=$MAX_ITERATIONS \
FORCE_APPLY_PROMPTS=true \
AGENT=$agent \
poetry run pytest -s ./tests/integration/test_agent.py::$test_name
set +x
}
regenerate_with_llm() {
rm -rf $WORKSPACE_BASE
mkdir -p $WORKSPACE_BASE
if [ -d "tests/integration/workspace/$test_name" ]; then
cp -r tests/integration/workspace/$test_name/* $WORKSPACE_BASE
fi
rm -rf logs
rm -rf tests/integration/mock/$agent/$test_name/*
# set -x to print the command being executed
set -x
echo -e "/exit\n" | \
DEBUG=true \
SANDBOX_TYPE=$SANDBOX_TYPE \
WORKSPACE_BASE=$WORKSPACE_BASE \
WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH AGENT=$agent \
WORKSPACE_MOUNT_PATH_IN_SANDBOX=$WORKSPACE_MOUNT_PATH_IN_SANDBOX \
poetry run python ./opendevin/core/main.py \
-i $MAX_ITERATIONS \
-t "$task Do not ask me for confirmation at any point." \
-c $agent
set +x
mkdir -p tests/integration/mock/$agent/$test_name/
mv logs/llm/**/* tests/integration/mock/$agent/$test_name/
}
##############################################################
## MAIN PROGRAM ##
##############################################################
if [ "$num_of_tests" -ne "${#test_names[@]}" ]; then
echo "Every task must correspond to one test case"
exit 1
@@ -64,7 +125,7 @@ for ((i = 0; i < num_of_tests; i++)); do
continue
fi
echo -e "\n\n\n\n========Running $test_name for $agent========\n\n\n\n"
echo -e "\n\n\n\n========STEP 1: Running $test_name for $agent========\n\n\n\n"
rm -rf $WORKSPACE_BASE
mkdir $WORKSPACE_BASE
if [ -d "tests/integration/workspace/$test_name" ]; then
@@ -84,57 +145,56 @@ for ((i = 0; i < num_of_tests; i++)); do
set -e
if [[ $TEST_STATUS -ne 0 ]]; then
echo -e "\n\n\n\n========$test_name failed, regenerating test data for $agent========\n\n\n\n"
sleep 1
rm -rf $WORKSPACE_BASE
mkdir -p $WORKSPACE_BASE
if [ -d "tests/integration/workspace/$test_name" ]; then
cp -r tests/integration/workspace/$test_name/* $WORKSPACE_BASE
if [ "$FORCE_USE_LLM" = true ]; then
echo -e "\n\n\n\n========FORCE_USE_LLM, skipping step 2 & 3========\n\n\n\n"
elif [ ! -d "tests/integration/mock/$agent/$test_name" ]; then
echo -e "\n\n\n\n========No existing mock responses for $agent/$test_name, skipping step 2 & 3========\n\n\n\n"
else
echo -e "\n\n\n\n========STEP 2: $test_name failed, regenerating prompts for $agent WITHOUT money cost========\n\n\n\n"
regenerate_without_llm
echo -e "\n\n\n\n========STEP 3: $test_name prompts regenerated for $agent, rerunning test to verify========\n\n\n\n"
# Temporarily disable 'exit on error'
set +e
run_test
TEST_STATUS=$?
# Re-enable 'exit on error'
set -e
fi
rm -rf logs
rm -rf tests/integration/mock/$agent/$test_name/*
# set -x to print the command being executed
set -x
echo -e "/exit\n" | \
SANDBOX_TYPE=$SANDBOX_TYPE \
WORKSPACE_BASE=$WORKSPACE_BASE \
DEBUG=true \
WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH AGENT=$agent \
poetry run python ./opendevin/core/main.py \
-i $MAX_ITERATIONS \
-t "$task Do not ask me for confirmation at any point." \
-c $agent
set +x
mkdir -p tests/integration/mock/$agent/$test_name/
mv logs/llm/**/* tests/integration/mock/$agent/$test_name/
echo -e "\n\n\n\n========$test_name test data regenerated for $agent, rerunning test to verify========\n\n\n\n"
# Temporarily disable 'exit on error'
set +e
run_test
TEST_STATUS=$?
# Re-enable 'exit on error'
set -e
if [[ $TEST_STATUS -ne 0 ]]; then
echo -e "\n\n\n\n========$test_name for $agent RERUN FAILED========\n\n\n\n"
echo -e "There are multiple possibilities:"
echo -e " 1. The agent is unable to finish the task within $MAX_ITERATIONS steps."
echo -e " 2. The agent believes it has finished the task, but the result fails validation in the test code."
echo -e " 3. There is something non-deterministic in the prompt."
echo -e " 4. There is a bug in this script, or in OpenDevin code."
echo -e "NOTE: Some of the above problems could sometimes be fixed by a retry (with a more powerful LLM)."
echo -e " You could also consider improving the agent, increasing MAX_ITERATIONS, or skipping this test for this agent."
exit 1
echo -e "\n\n\n\n========STEP 4: $test_name failed, regenerating prompts and responses for $agent WITH money cost========\n\n\n\n"
regenerate_with_llm
echo -e "\n\n\n\n========STEP 5: $test_name prompts and responses regenerated for $agent, rerunning test to verify========\n\n\n\n"
# Temporarily disable 'exit on error'
set +e
run_test
TEST_STATUS=$?
# Re-enable 'exit on error'
set -e
if [[ $TEST_STATUS -ne 0 ]]; then
echo -e "\n\n\n\n========$test_name for $agent RERUN FAILED========\n\n\n\n"
echo -e "There are multiple possibilities:"
echo -e " 1. The agent is unable to finish the task within $MAX_ITERATIONS steps."
echo -e " 2. The agent believes it has finished the task, but the result fails validation in the test code."
echo -e " 3. There is something non-deterministic in the prompt."
echo -e " 4. There is a bug in this script, or in OpenDevin code."
echo -e "NOTE: Some of the above problems could sometimes be fixed by a retry (with a more powerful LLM)."
echo -e " You could also consider improving the agent, increasing MAX_ITERATIONS, or skipping this test for this agent."
exit 1
else
echo -e "\n\n\n\n========$test_name for $agent RERUN PASSED========\n\n\n\n"
sleep 1
fi
else
echo -e "\n\n\n\n========$test_name for $agent RERUN PASSED========\n\n\n\n"
sleep 1
echo -e "\n\n\n\n========$test_name for $agent RERUN PASSED========\n\n\n\n"
sleep 1
fi
else
echo -e "\n\n\n\n========$test_name for $agent PASSED========\n\n\n\n"
sleep 1