Refactor integration test framework and relieve the pain of regeneration (#1818)

* Update README.md
* Fix WORKSPACE_MOUNT_PATH_IN_SANDBOX variable in regenerate.sh
* Regenerate prompts without calling real LLM
* Disable pytest warning capture
* Change planner agent prompt by a bit for demo
* Regenerate prompt files following prompt changes
* doc: elaborate on FORCE_USE_LLM
* Add another prompt change to monologue_agent for demo purpose
* Regenerate prompts with FORCE_USE_LLM=true

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
2024-08-29 01:18:33 +03:00 · 2024-05-16 08:30:29 -07:00
parent e89cc8f19b
commit b6ff201780
38 changed files with 324 additions and 753 deletions
--- a/agenthub/monologue_agent/utils/prompts.py
+++ b/agenthub/monologue_agent/utils/prompts.py
@@ -51,7 +51,7 @@ Here are the possible actions:

 %(background_commands)s

-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/agenthub/planner_agent/prompt.py
+++ b/agenthub/planner_agent/prompt.py
@@ -94,7 +94,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/pytest.ini
+++ b/pytest.ini
@@ -0,0 +1,2 @@
+[pytest]
+addopts = -p no:warnings
--- a/tests/integration/README.md
+++ b/tests/integration/README.md
@@ -56,14 +56,32 @@ to run all integration tests until the first failure.

 ## Regenerate Integration Tests
 When you make changes to an agent's prompt, the integration tests will fail. You'll need to regenerate them
-by running:
+by running the following command from the project root directory:
 ```bash
 ./tests/integration/regenerate.sh
 ```
-Note that this will run existing tests first and call real LLM_MODEL only for
-failed tests, but it still costs money! If you don't want
-to cover the cost, ask one of the maintainers to regenerate for you.
-You might also be able to fix the tests by hand.
+Note that this will:
+1. Run existing tests first. If a test passes, no regeneration happens for it.
+2. Regenerate the prompts, but attempt to reuse existing LLM responses (if any).
+For example, if you only fix a typo in the prompt, it shouldn't affect the LLM's behaviour.
+If we reran OpenDevin against a real LLM, then due to the LLM's non-deterministic nature,
+a series of different prompts and responses would be generated, causing a lot of
+unnecessary diffs that are hard to review. If you want to skip this step, see the
+sections below.
+3. Rerun the failed test again. If it succeeds, continue to the next test or agent.
+If it fails again, go to the next step.
+4. Rerun OpenDevin against a real LLM, record all prompts and
+responses, and replace the existing test artifacts (if any).
+5. Rerun the failed test again. If it succeeds, continue to the next test or agent.
+If it fails again, abort the script.
+
+Note that step 4 calls the real LLM_MODEL only for failed tests that cannot be fixed
+by regenerating prompts alone, but it still costs money! If you don't want
+to cover the cost, ask one of the maintainers to regenerate for you. Before asking,
+please try running the script first without setting `LLM_API_KEY`. Chances are the
+test will be fixed after step 2.
+
+### Regenerate a Specific Agent and/or Test

 If you only want to run a specific test, set environment variable
 `ONLY_TEST_NAME` to the test name. If you only want to run a specific agent,
@@ -71,13 +89,37 @@ set environment variable `ONLY_TEST_AGENT` to the agent. You could also use both
 e.g.

 ```bash
-TEST_ONLY=true ONLY_TEST_NAME="test_write_simple_script" ONLY_TEST_AGENT="MonologueAgent" ./tests/integration/regenerate.sh
+ONLY_TEST_NAME="test_write_simple_script" ONLY_TEST_AGENT="MonologueAgent" ./tests/integration/regenerate.sh
 ```

-Known issue: sometimes you might see transient errors like `pexpect.pxssh.ExceptionPxssh: Could not establish connection to host`.
+### Force Regenerate with real LLM
+
+Sometimes, step 2 fixes the broken test by simply reusing existing responses
+from the LLM. This may not be what you want. For example, you might have improved
+the prompt so much that you believe the LLM will do a better job in fewer steps, or you
+might have added a new action type that you think the LLM would be able to use.
+In this case you can skip step 2 and run OpenDevin against a real LLM. Simply
+set the `FORCE_USE_LLM` environment variable to true, or run the script like this:
+
+```bash
+FORCE_USE_LLM=true ./tests/integration/regenerate.sh
+```
+
+Note: `FORCE_USE_LLM` doesn't take effect if all tests are passing. If you want to
+regenerate regardless, you can remove everything under the `tests/integration/mock/[agent]/[test_name]`
+folder.
+
+### Known Issues
+
+Sometimes you might see transient errors like `pexpect.pxssh.ExceptionPxssh: Could not establish connection to host`.
 The regenerate.sh script doesn't know this is a transient error and would still regenerate the test artifacts. You could simply
 terminate the script by `ctrl+c` and rerun the script.

+The test framework cannot handle non-determinism. If anything in the prompt (including
+the observed result after executing an action) is non-deterministic (e.g. a PID), the
+test will fail. In this case, you might want to change conftest.py to filter out
+numbers or any other known patterns when matching prompts for your test.
+
 ## Write a new Integration Test

 To write an integration test, there are essentially two steps:
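
The Known Issues paragraph in the README hunk above suggests filtering out numbers or other known patterns when matching prompts. A minimal sketch of that idea follows. The helper names `normalize_prompt` and `prompts_match` and the `<NUM>` placeholder are hypothetical and are not part of conftest.py in this commit.

```python
import re

# Hypothetical helpers for illustration only; conftest.py in this commit
# does not define them.
_NUMBER_RE = re.compile(r'\d+')


def normalize_prompt(prompt: str) -> str:
    """Replace every run of digits with a fixed placeholder.

    Values such as PIDs then no longer make two otherwise identical
    prompts compare as different.
    """
    return _NUMBER_RE.sub('<NUM>', prompt)


def prompts_match(recorded: str, actual: str) -> bool:
    """Compare a recorded prompt with a live prompt after normalization."""
    return normalize_prompt(recorded) == normalize_prompt(actual)
```

Replacing every number also hides differences that may matter, such as exit codes, so a narrower pattern that targets only the known non-deterministic field is probably the safer choice for a real test.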
--- a/tests/integration/conftest.py
+++ b/tests/integration/conftest.py
@@ -21,7 +21,30 @@ def get_log_id(prompt_log_name):
        return match.group(1)


-def get_mock_response(test_name, messages):
+def apply_prompt_and_get_mock_response(test_name: str, messages: str, id: int) -> str:
+    """
+    Apply the mock prompt, and find mock response based on id.
+    If there is no matching response file, return None.
+
+    Note: this function blindly replaces existing prompt file with the given
+    input without checking the contents.
+    """
+    mock_dir = os.path.join(script_dir, 'mock', os.environ.get('AGENT'), test_name)
+    prompt_file_path = os.path.join(mock_dir, f'prompt_{"{0:03}".format(id)}.log')
+    resp_file_path = os.path.join(mock_dir, f'response_{"{0:03}".format(id)}.log')
+    try:
+        # load response
+        with open(resp_file_path, 'r') as resp_file:
+            response = resp_file.read()
+        # apply prompt
+        with open(prompt_file_path, 'w') as prompt_file:
+            prompt_file.write(messages)
+        return response
+    except FileNotFoundError:
+        return None
+
+
+def get_mock_response(test_name: str, messages: str):
    """
    Find mock response based on prompt. Prompts are stored under nested
    folders under mock folder. If prompt_{id}.log matches,
@@ -75,11 +98,19 @@ def mock_user_response(*args, test_name, **kwargs):


 def mock_completion(*args, test_name, **kwargs):
+    global cur_id
    messages = kwargs['messages']
    message_str = ''
    for message in messages:
        message_str += message['content']
-    mock_response = get_mock_response(test_name, message_str)
+    if os.environ.get('FORCE_APPLY_PROMPTS') == 'true':
+        # this assumes all response_(*).log filenames are in numerical order, starting from one
+        cur_id += 1
+        mock_response = apply_prompt_and_get_mock_response(
+            test_name, message_str, cur_id
+        )
+    else:
+        mock_response = get_mock_response(test_name, message_str)
    if mock_response is None:
        print('Mock response for prompt is not found:\n\n' + message_str)
        print('Exiting...')
@@ -105,6 +136,8 @@ def patch_completion(monkeypatch, request):


 def set_up():
+    global cur_id
+    cur_id = 0
    assert workspace_path is not None
    if os.path.exists(workspace_path):
        for file in os.listdir(workspace_path):
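
The id-based replay added to conftest.py above can be tried in isolation. The sketch below mirrors the logic of `apply_prompt_and_get_mock_response` but takes the mock directory as an argument instead of reading the `AGENT` environment variable. The names `apply_prompt_and_read_response`, `call_id` and `demo` are assumptions made for this example, not names from the commit.

```python
import os
import tempfile
from typing import Optional, Tuple


def apply_prompt_and_read_response(
    mock_dir: str, prompt: str, call_id: int
) -> Optional[str]:
    """Return response_{id}.log and overwrite prompt_{id}.log.

    Returns None, and writes nothing, when no recorded response exists
    for this id.
    """
    prompt_path = os.path.join(mock_dir, f'prompt_{call_id:03}.log')
    resp_path = os.path.join(mock_dir, f'response_{call_id:03}.log')
    try:
        with open(resp_path, 'r') as resp_file:
            response = resp_file.read()
    except FileNotFoundError:
        return None
    # The prompt file is replaced without comparing it to the old contents.
    with open(prompt_path, 'w') as prompt_file:
        prompt_file.write(prompt)
    return response


def demo() -> Tuple[Optional[str], str, Optional[str]]:
    """Replay one recorded response, then ask for one that does not exist."""
    with tempfile.TemporaryDirectory() as mock_dir:
        with open(os.path.join(mock_dir, 'response_001.log'), 'w') as f:
            f.write('{"action": "finish", "args": {}}')
        first = apply_prompt_and_read_response(mock_dir, 'new prompt text', 1)
        with open(os.path.join(mock_dir, 'prompt_001.log')) as f:
            stored_prompt = f.read()
        second = apply_prompt_and_read_response(mock_dir, 'another prompt', 2)
        return first, stored_prompt, second
```

Because responses are looked up by position rather than by prompt content, this approach only works while the sequence of LLM calls is unchanged. Once the agent takes a different path, the recorded responses no longer line up with the prompts.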
--- a/tests/integration/mock/MonologueAgent/test_edits/prompt_001.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/prompt_001.log
@@ -367,7 +367,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_edits/prompt_002.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/prompt_002.log
@@ -334,21 +334,20 @@ This is your internal monologue, in JSON format:
  },
  {
    "source": "agent",
-    "action": "run",
+    "action": "read",
    "args": {
-      "command": "pwd",
-      "background": false,
+      "path": "bad.txt",
+      "start": 0,
+      "end": -1,
      "thought": ""
    }
  },
  {
    "source": "agent",
-    "observation": "run",
-    "content": "/workspace",
+    "observation": "read",
+    "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
    "extras": {
-      "command_id": -1,
-      "command": "pwd",
-      "exit_code": 0
+      "path": "bad.txt"
    }
  }
 ]
@@ -386,7 +385,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_edits/prompt_003.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/prompt_003.log
@@ -334,40 +334,39 @@ This is your internal monologue, in JSON format:
  },
  {
    "source": "agent",
-    "action": "run",
+    "action": "read",
    "args": {
-      "command": "pwd",
-      "background": false,
+      "path": "bad.txt",
+      "start": 0,
+      "end": -1,
      "thought": ""
    }
  },
  {
    "source": "agent",
-    "observation": "run",
-    "content": "/workspace",
+    "observation": "read",
+    "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
    "extras": {
-      "command_id": -1,
-      "command": "pwd",
-      "exit_code": 0
+      "path": "bad.txt"
    }
  },
  {
    "source": "agent",
-    "action": "run",
+    "action": "write",
    "args": {
-      "command": "ls",
-      "background": false,
+      "path": "bad.txt",
+      "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
+      "start": 0,
+      "end": -1,
      "thought": ""
    }
  },
  {
    "source": "agent",
-    "observation": "run",
-    "content": "bad.txt",
+    "observation": "write",
+    "content": "",
    "extras": {
-      "command_id": -1,
-      "command": "ls",
-      "exit_code": 0
+      "path": "bad.txt"
    }
  }
 ]
@@ -405,7 +404,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_edits/prompt_004.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/prompt_004.log
@@ -334,40 +334,39 @@ This is your internal monologue, in JSON format:
  },
  {
    "source": "agent",
-    "action": "run",
+    "action": "read",
    "args": {
-      "command": "pwd",
-      "background": false,
+      "path": "bad.txt",
+      "start": 0,
+      "end": -1,
      "thought": ""
    }
  },
  {
    "source": "agent",
-    "observation": "run",
-    "content": "/workspace",
+    "observation": "read",
+    "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
    "extras": {
-      "command_id": -1,
-      "command": "pwd",
-      "exit_code": 0
+      "path": "bad.txt"
    }
  },
  {
    "source": "agent",
-    "action": "run",
+    "action": "write",
    "args": {
-      "command": "ls",
-      "background": false,
+      "path": "bad.txt",
+      "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
+      "start": 0,
+      "end": -1,
      "thought": ""
    }
  },
  {
    "source": "agent",
-    "observation": "run",
-    "content": "bad.txt",
+    "observation": "write",
+    "content": "",
    "extras": {
-      "command_id": -1,
-      "command": "ls",
-      "exit_code": 0
+      "path": "bad.txt"
    }
  },
  {
@@ -383,7 +382,7 @@ This is your internal monologue, in JSON format:
  {
    "source": "agent",
    "observation": "read",
-    "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
+    "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n\n",
    "extras": {
      "path": "bad.txt"
    }
@@ -423,7 +422,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_edits/prompt_006.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/prompt_006.log
@@ -1,476 +0,0 @@
-
-
----------
-
-
-You're a thoughtful robot. Your main task is this:
-Fix typos in bad.txt. Do not ask me for confirmation at any point.
-
-Don't expand the scope of your task--just complete it as written.
-
-This is your internal monologue, in JSON format:
-
-[
-  {
-    "action": "message",
-    "args": {
-      "content": "I exist!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Hmm...looks like I can type in a command line prompt",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Looks like I have a web browser too!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "How am I going to get there though?",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "It seems like I have some kind of short term memory.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Each of my thoughts seems to be stored in a JSON array.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "It seems whatever I say next will be added as an object to the list.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "But no one has perfect short-term memory. My list of thoughts will be summarized and condensed over time, losing information in the process.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Fortunately I have long term memory!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "I can just perform a recall action, followed by the thing I want to remember. And then related thoughts just spill out!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Sometimes they're random thoughts that don't really have to do with what I wanted to remember. But usually they're exactly what I need!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Let's try it out!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "recall",
-    "args": {
-      "query": "what it is I want to do",
-      "thought": ""
-    }
-  },
-  {
-    "observation": "recall",
-    "content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
-    "extras": {
-      "memories": [],
-      "role": "assistant"
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "How am I going to get there though?",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Neat! And it looks like it's easy for me to use the command line too! I just have to perform a run action and include the command I want to run in the command argument. The command output just jumps into my head!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "run",
-    "args": {
-      "command": "echo \"hello world\"",
-      "background": false,
-      "thought": ""
-    }
-  },
-  {
-    "observation": "run",
-    "content": "hello world",
-    "extras": {
-      "command_id": 0,
-      "command": "",
-      "exit_code": 0
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Cool! I bet I can write files too using the write action.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "write",
-    "args": {
-      "path": "test.js",
-      "content": "echo \"console.log('hello world')\"",
-      "start": 0,
-      "end": -1,
-      "thought": ""
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "I just created test.js. I'll try and run it now.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "run",
-    "args": {
-      "command": "node test.js",
-      "background": false,
-      "thought": ""
-    }
-  },
-  {
-    "observation": "run",
-    "content": "hello world",
-    "extras": {
-      "command_id": 0,
-      "command": "",
-      "exit_code": 0
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "It works!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "I'm going to try reading it now using the read action.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "read",
-    "args": {
-      "path": "test.js",
-      "start": 0,
-      "end": -1,
-      "thought": ""
-    }
-  },
-  {
-    "observation": "read",
-    "content": "console.log('hello world')",
-    "extras": {
-      "path": ""
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Nice! I can read files too!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "And if I want to use the browser, I just need to use the browse action and include the url I want to visit in the url argument",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Let's try that...",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "browse",
-    "args": {
-      "url": "google.com",
-      "thought": ""
-    }
-  },
-  {
-    "observation": "browse",
-    "content": "<form><input type=\"text\"></input><button type=\"submit\"></button></form>",
-    "extras": {
-      "url": "",
-      "status_code": 200,
-      "error": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "I can browse the web too!",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "And once I have completed my task, I can use the finish action to stop working.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "But I should only use the finish action when I'm absolutely certain that I've completed my task and have tested my work.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "Very cool. Now to accomplish my task.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "I'll need a strategy. And as I make progress, I'll need to keep refining that strategy. I'll need to set goals, and break them into sub-goals.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "In between actions, I must always take some time to think, strategize, and set new goals. I should never take two actions in a row.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "OK so my task is to Fix typos in bad.txt. Do not ask me for confirmation at any point.. I haven't made any progress yet. Where should I start?",
-      "wait_for_response": false
-    }
-  },
-  {
-    "action": "message",
-    "args": {
-      "content": "It seems like there might be an existing project here. I should probably start by running `pwd` and `ls` to orient myself.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "source": "user",
-    "action": "message",
-    "args": {
-      "content": "Fix typos in bad.txt. Do not ask me for confirmation at any point.",
-      "wait_for_response": false
-    }
-  },
-  {
-    "observation": "null",
-    "content": "",
-    "extras": {}
-  },
-  {
-    "source": "agent",
-    "action": "run",
-    "args": {
-      "command": "pwd",
-      "background": false,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "run",
-    "content": "/workspace",
-    "extras": {
-      "command_id": -1,
-      "command": "pwd",
-      "exit_code": 0
-    }
-  },
-  {
-    "source": "agent",
-    "action": "run",
-    "args": {
-      "command": "ls",
-      "background": false,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "run",
-    "content": "bad.txt",
-    "extras": {
-      "command_id": -1,
-      "command": "ls",
-      "exit_code": 0
-    }
-  },
-  {
-    "source": "agent",
-    "action": "read",
-    "args": {
-      "path": "bad.txt",
-      "start": 0,
-      "end": -1,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "read",
-    "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
-    "extras": {
-      "path": "bad.txt"
-    }
-  },
-  {
-    "source": "agent",
-    "action": "write",
-    "args": {
-      "path": "bad.txt",
-      "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
-      "start": 0,
-      "end": -1,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "write",
-    "content": "",
-    "extras": {
-      "path": "bad.txt"
-    }
-  },
-  {
-    "source": "agent",
-    "action": "read",
-    "args": {
-      "path": "bad.txt",
-      "start": 0,
-      "end": -1,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "read",
-    "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n\n",
-    "extras": {
-      "path": "bad.txt"
-    }
-  }
-]
-
-
-Your most recent thought is at the bottom of that monologue. Continue your train of thought.
-What is your next single thought or action? Your response must be in JSON format.
-It must be a single object, and it must contain two fields:
-* `action`, which is one of the actions below
-* `args`, which is a map of key-value pairs, specifying the arguments for that action
-
-Here are the possible actions:
-* `read` - reads the content of a file. Arguments:
-  * `path` - the path of the file to read
-* `write` - writes the content to a file. Arguments:
-  * `path` - the path of the file to write
-  * `content` - the content to write to the file
-* `run` - runs a command. Arguments:
-  * `command` - the command to run
-  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
-* `kill` - kills a background command
-  * `command_id` - the ID of the background command to kill
-* `browse` - opens a web page. Arguments:
-  * `url` - the URL to open
-* `push` - Push a branch from the current repo to github:
-  * `owner` - the owner of the repo to push to
-  * `repo` - the name of the repo to push to
-  * `branch` - the name of the branch to push
-* `recall` - recalls a past memory. Arguments:
-  * `query` - the query to search for
-* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
-  * `content` - the message to record
-  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
-* `finish` - if you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.
-
-
-
-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
-You should never act twice in a row without thinking. But if your last several
-actions are all `message` actions, you should consider taking a different action.
-
-Notes:
-* you are logged in as opendevin, but sudo will always work without a password.
-* all non-background commands will be forcibly stopped if they remain running for over 120 seconds.
-* your environment is Debian Linux. You can install software with `sudo apt-get`, but remember to use -y.
-* don't run interactive commands, or commands that don't return (e.g. `node server.js`). You may run commands in the background (e.g. `node server.js &`)
-* don't run interactive text editors (e.g. `nano` or 'vim'), instead use the 'write' or 'read' action.
-* don't run gui applications (e.g. software IDEs (like vs code or codium), web browsers (like firefox or chromium), or other complex software packages). Use non-interactive cli applications, or special actions instead.
-* whenever an action fails, always send a `message` about why it may have happened before acting again.
-
-What is your next single thought or action? Again, you must reply with JSON, and only with JSON. You must respond with exactly one 'action' object.
--- a/tests/integration/mock/MonologueAgent/test_edits/response_001.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/response_001.log
@@ -1,7 +1,6 @@
 {
-  "action": "run",
+  "action": "read",
  "args": {
-    "command": "pwd",
-    "background": false
+    "path": "bad.txt"
  }
 }
--- a/tests/integration/mock/MonologueAgent/test_edits/response_002.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/response_002.log
@@ -1,7 +1,7 @@
 {
-  "action": "run",
+  "action": "write",
  "args": {
-    "command": "ls",
-    "background": false
+    "path": "bad.txt",
+    "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n"
  }
 }
--- a/tests/integration/mock/MonologueAgent/test_edits/response_003.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/response_003.log
@@ -1,6 +1,8 @@
 {
  "action": "read",
  "args": {
-    "path": "bad.txt"
+    "path": "bad.txt",
+    "start": 0,
+    "end": -1
  }
 }
--- a/tests/integration/mock/MonologueAgent/test_edits/response_004.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/response_004.log
@@ -1,7 +1,4 @@
 {
-  "action": "write",
-  "args": {
-    "path": "bad.txt",
-    "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n"
-  }
+  "action": "finish",
+  "args": {}
 }
--- a/tests/integration/mock/MonologueAgent/test_edits/response_005.log
+++ b/tests/integration/mock/MonologueAgent/test_edits/response_005.log
@@ -1,8 +0,0 @@
-{
-  "action": "read",
-  "args": {
-    "path": "bad.txt",
-    "start": 0,
-    "end": -1
-  }
-}
--- a/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_001.log
+++ b/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_001.log
@@ -367,7 +367,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_002.log
+++ b/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_002.log
@@ -386,7 +386,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_003.log
+++ b/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_003.log
@@ -405,7 +405,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_004.log
+++ b/tests/integration/mock/MonologueAgent/test_write_simple_script/prompt_004.log
@@ -4,7 +4,7 @@


 You're a thoughtful robot. Your main task is this:
-Fix typos in bad.txt. Do not ask me for confirmation at any point.
+Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.

 Don't expand the scope of your task--just complete it as written.

@@ -35,7 +35,7 @@ This is your internal monologue, in JSON format:
  {
    "action": "message",
    "args": {
-      "content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
+      "content": "Here's what I want to do: Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.",
      "wait_for_response": false
    }
  },
@@ -111,7 +111,7 @@ This is your internal monologue, in JSON format:
  },
  {
    "observation": "recall",
-    "content": "Here's what I want to do: Fix typos in bad.txt. Do not ask me for confirmation at any point.",
+    "content": "Here's what I want to do: Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.",
    "extras": {
      "memories": [],
      "role": "assistant"
@@ -308,7 +308,7 @@ This is your internal monologue, in JSON format:
  {
    "action": "message",
    "args": {
-      "content": "OK so my task is to Fix typos in bad.txt. Do not ask me for confirmation at any point.. I haven't made any progress yet. Where should I start?",
+      "content": "OK so my task is to Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.. I haven't made any progress yet. Where should I start?",
      "wait_for_response": false
    }
  },
@@ -323,7 +323,7 @@ This is your internal monologue, in JSON format:
    "source": "user",
    "action": "message",
    "args": {
-      "content": "Fix typos in bad.txt. Do not ask me for confirmation at any point.",
+      "content": "Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point.",
      "wait_for_response": false
    }
  },
@@ -332,68 +332,12 @@ This is your internal monologue, in JSON format:
    "content": "",
    "extras": {}
  },
-  {
-    "source": "agent",
-    "action": "run",
-    "args": {
-      "command": "pwd",
-      "background": false,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "run",
-    "content": "/workspace",
-    "extras": {
-      "command_id": -1,
-      "command": "pwd",
-      "exit_code": 0
-    }
-  },
-  {
-    "source": "agent",
-    "action": "run",
-    "args": {
-      "command": "ls",
-      "background": false,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "run",
-    "content": "bad.txt",
-    "extras": {
-      "command_id": -1,
-      "command": "ls",
-      "exit_code": 0
-    }
-  },
-  {
-    "source": "agent",
-    "action": "read",
-    "args": {
-      "path": "bad.txt",
-      "start": 0,
-      "end": -1,
-      "thought": ""
-    }
-  },
-  {
-    "source": "agent",
-    "observation": "read",
-    "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n",
-    "extras": {
-      "path": "bad.txt"
-    }
-  },
  {
    "source": "agent",
    "action": "write",
    "args": {
-      "path": "bad.txt",
-      "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n",
+      "path": "hello.sh",
+      "content": "#!/bin/bash\necho 'hello'",
      "start": 0,
      "end": -1,
      "thought": ""
@@ -404,8 +348,40 @@ This is your internal monologue, in JSON format:
    "observation": "write",
    "content": "",
    "extras": {
-      "path": "bad.txt"
+      "path": "hello.sh"
    }
+  },
+  {
+    "source": "agent",
+    "action": "run",
+    "args": {
+      "command": "bash hello.sh",
+      "background": false,
+      "thought": ""
+    }
+  },
+  {
+    "source": "agent",
+    "observation": "run",
+    "content": "hello",
+    "extras": {
+      "command_id": -1,
+      "command": "bash hello.sh",
+      "exit_code": 0
+    }
+  },
+  {
+    "source": "agent",
+    "action": "message",
+    "args": {
+      "content": "I have successfully created and executed the 'hello.sh' script, which printed 'hello' as expected. I believe I have completed the task as specified.",
+      "wait_for_response": false
+    }
+  },
+  {
+    "observation": "null",
+    "content": "",
+    "extras": {}
  }
 ]

@@ -442,7 +418,7 @@ Here are the possible actions:



-You MUST take time to think in between read, write, run, browse, push, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, push, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/MonologueAgent/test_write_simple_script/response_001.log
+++ b/tests/integration/mock/MonologueAgent/test_write_simple_script/response_001.log
@@ -1,3 +1,4 @@
+```json
 {
  "action": "write",
  "args": {
@@ -5,3 +6,4 @@
    "content": "#!/bin/bash\necho 'hello'"
  }
 }
+```
--- a/tests/integration/mock/MonologueAgent/test_write_simple_script/response_003.log
+++ b/tests/integration/mock/MonologueAgent/test_write_simple_script/response_003.log
@@ -1,6 +1,9 @@
 ```json
 {
-  "action": "finish",
-  "args": {}
+  "action": "message",
+  "args": {
+    "content": "I have successfully created and executed the 'hello.sh' script, which printed 'hello' as expected. I believe I have completed the task as specified.",
+    "wait_for_response": false
+  }
 }
 ```
--- a/tests/integration/mock/MonologueAgent/test_write_simple_script/response_004.log
+++ b/tests/integration/mock/MonologueAgent/test_write_simple_script/response_004.log
@@ -1,4 +1,6 @@
+```json
 {
  "action": "finish",
  "args": {}
 }
+```
--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_001.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_001.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -96,7 +92,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_002.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_002.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -151,7 +147,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_003.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_003.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -162,7 +158,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_004.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_004.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -180,7 +176,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_005.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_005.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -188,7 +184,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_006.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_006.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -198,7 +194,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_007.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_007.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -206,7 +202,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_008.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_008.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -225,7 +221,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_009.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_009.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -233,7 +229,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_edits/prompt_010.log
+++ b/tests/integration/mock/PlannerAgent/test_edits/prompt_010.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -242,7 +238,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_001.log
+++ b/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_001.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -96,7 +92,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_002.log
+++ b/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_002.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -113,7 +109,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_003.log
+++ b/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_003.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -124,7 +120,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_004.log
+++ b/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_004.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -143,7 +139,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

--- a/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_005.log
+++ b/tests/integration/mock/PlannerAgent/test_write_simple_script/prompt_005.log
@@ -1,8 +1,4 @@

-
----------
-
-
 # Task
 You're a diligent software engineer AI. You can't see, draw, or interact with a
 browser, but you can read and write files, and you can run commands, and you can think.
@@ -162,7 +158,7 @@ It must be an object, and it must contain two fields:
  * `state` - set to 'in_progress' to start the task, 'completed' to finish it, 'verified' to assert that it was successful, 'abandoned' to give up on it permanently, or `open` to stop working on it for now.
 * `finish` - if ALL of your tasks and subtasks have been verified or abandoned, and you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working.

-You MUST take time to think in between read, write, run, browse, and recall actions--do this with the `message` action.
+You MUST take time to think in between read, write, run, kill, browse, and recall actions--do this with the `message` action.
 You should never act twice in a row without thinking. But if your last several
 actions are all `message` actions, you should consider taking a different action.

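Note on the regenerate.sh changes that follow: the script now uses a staged fallback. It runs the test first, then regenerates prompts from the recorded mock responses, and calls a real LLM only if the test still fails. The sketch below is a simplified illustration of that control flow, not code from this commit. `plan_regeneration` and its four arguments are hypothetical names introduced here; the real script runs pytest and the agent instead of printing a plan.

```shell
#!/bin/bash
# Hypothetical helper (not part of regenerate.sh): prints the sequence of
# stages the script would go through for one agent/test pair.
#   $1  exit status of the initial test run (step 1)
#   $2  value of FORCE_USE_LLM ("true" skips the prompt-only path)
#   $3  "yes" if tests/integration/mock/$agent/$test_name exists, else "no"
#   $4  exit status of the rerun after prompt-only regeneration (step 3)
# The body runs in a subshell, so its variables do not leak to the caller.
plan_regeneration() (
  first_status=$1
  force_llm=$2
  has_mock=$3
  rerun_status=$4

  plan="run_test"
  status=$first_status

  # Step 1 passed: nothing to regenerate.
  if [ "$status" -eq 0 ]; then
    echo "$plan"
    return 0
  fi

  # Steps 2 and 3: reuse recorded responses, then rerun the test.
  if [ "$force_llm" != "true" ] && [ "$has_mock" = "yes" ]; then
    plan="$plan regenerate_without_llm run_test"
    status=$rerun_status
  fi

  # Steps 4 and 5: fall back to a real LLM, then rerun the test.
  if [ "$status" -ne 0 ]; then
    plan="$plan regenerate_with_llm run_test"
  fi

  echo "$plan"
)

plan_regeneration 1 false yes 0  # prints: run_test regenerate_without_llm run_test
plan_regeneration 1 false yes 1  # prints: run_test regenerate_without_llm run_test regenerate_with_llm run_test
plan_regeneration 1 true yes 0   # prints: run_test regenerate_with_llm run_test
```

The cheap path is tried first because it costs nothing; setting FORCE_USE_LLM=true goes straight to the paid path.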
--- a/tests/integration/regenerate.sh
+++ b/tests/integration/regenerate.sh
@@ -1,16 +1,9 @@
 #!/bin/bash
 set -eo pipefail

-run_test() {
-  SANDBOX_TYPE=$SANDBOX_TYPE \
-    WORKSPACE_BASE=$WORKSPACE_BASE \
-    MAX_ITERATIONS=$MAX_ITERATIONS \
-    WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH \
-    AGENT=$agent \
-    poetry run pytest -s ./tests/integration/test_agent.py::$test_name
-    # return exit code of pytest
-    return $?
-}
+##############################################################
+##            CONSTANTS AND ENVIRONMENT VARIABLES           ##
+##############################################################

 if [ -z $WORKSPACE_MOUNT_PATH ]; then
  WORKSPACE_MOUNT_PATH=$(pwd)
@@ -21,6 +14,7 @@ fi

 WORKSPACE_MOUNT_PATH+="/_test_workspace"
 WORKSPACE_BASE+="/_test_workspace"
+WORKSPACE_MOUNT_PATH_IN_SANDBOX="/workspace"

 SANDBOX_TYPE="ssh"
 MAX_ITERATIONS=10
@@ -40,6 +34,73 @@ test_names=(
 num_of_tests=${#test_names[@]}
 num_of_agents=${#agents[@]}

+##############################################################
+##                      FUNCTIONS                           ##
+##############################################################
+
+# run integration test against a specific agent & test
+run_test() {
+  SANDBOX_TYPE=$SANDBOX_TYPE \
+    WORKSPACE_BASE=$WORKSPACE_BASE \
+    WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH \
+    WORKSPACE_MOUNT_PATH_IN_SANDBOX=$WORKSPACE_MOUNT_PATH_IN_SANDBOX \
+    MAX_ITERATIONS=$MAX_ITERATIONS \
+    AGENT=$agent \
+    poetry run pytest -s ./tests/integration/test_agent.py::$test_name
+    # return exit code of pytest
+    return $?
+}
+
+# Regenerate prompts using the existing LLM responses under tests/integration/mock/[agent]/[test_name]/response_*.log.
+# This is a compromise: because no real LLM responds to the new prompts, the prompts might be nonsense and still
+# pass the test. The benefit is that developers don't have to regenerate real LLM responses when they only make
+# a small change to the prompts.
+regenerate_without_llm() {
+  # set -x to print the command being executed
+  set -x
+  SANDBOX_TYPE=$SANDBOX_TYPE \
+    WORKSPACE_BASE=$WORKSPACE_BASE \
+    WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH \
+    WORKSPACE_MOUNT_PATH_IN_SANDBOX=$WORKSPACE_MOUNT_PATH_IN_SANDBOX \
+    MAX_ITERATIONS=$MAX_ITERATIONS \
+    FORCE_APPLY_PROMPTS=true \
+    AGENT=$agent \
+    poetry run pytest -s ./tests/integration/test_agent.py::$test_name
+  set +x
+}
+
+regenerate_with_llm() {
+  rm -rf $WORKSPACE_BASE
+  mkdir -p $WORKSPACE_BASE
+  if [ -d "tests/integration/workspace/$test_name" ]; then
+    cp -r tests/integration/workspace/$test_name/* $WORKSPACE_BASE
+  fi
+
+  rm -rf logs
+  rm -rf tests/integration/mock/$agent/$test_name/*
+  # set -x to print the command being executed
+  set -x
+  echo -e "/exit\n" | \
+    DEBUG=true \
+    SANDBOX_TYPE=$SANDBOX_TYPE \
+    WORKSPACE_BASE=$WORKSPACE_BASE \
+    WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH AGENT=$agent \
+    WORKSPACE_MOUNT_PATH_IN_SANDBOX=$WORKSPACE_MOUNT_PATH_IN_SANDBOX \
+    poetry run python ./opendevin/core/main.py \
+    -i $MAX_ITERATIONS \
+    -t "$task Do not ask me for confirmation at any point." \
+    -c $agent
+  set +x
+
+  mkdir -p tests/integration/mock/$agent/$test_name/
+  mv logs/llm/**/* tests/integration/mock/$agent/$test_name/
+}
+
+##############################################################
+##                      MAIN PROGRAM                        ##
+##############################################################
+
+
 if [ "$num_of_tests" -ne "${#test_names[@]}" ]; then
  echo "Every task must correspond to one test case"
  exit 1
@@ -64,7 +125,7 @@ for ((i = 0; i < num_of_tests; i++)); do
      continue
    fi

-    echo -e "\n\n\n\n========Running $test_name for $agent========\n\n\n\n"
+    echo -e "\n\n\n\n========STEP 1: Running $test_name for $agent========\n\n\n\n"
    rm -rf $WORKSPACE_BASE
    mkdir $WORKSPACE_BASE
    if [ -d "tests/integration/workspace/$test_name" ]; then
@@ -84,57 +145,56 @@ for ((i = 0; i < num_of_tests; i++)); do
    set -e

    if [[ $TEST_STATUS -ne 0 ]]; then
-      echo -e "\n\n\n\n========$test_name failed, regenerating test data for $agent========\n\n\n\n"
-      sleep 1

-      rm -rf $WORKSPACE_BASE
-      mkdir -p $WORKSPACE_BASE
-      if [ -d "tests/integration/workspace/$test_name" ]; then
-        cp -r tests/integration/workspace/$test_name/* $WORKSPACE_BASE
+      if [ "$FORCE_USE_LLM" = true ]; then
+        echo -e "\n\n\n\n========FORCE_USE_LLM is set, skipping steps 2 & 3========\n\n\n\n"
+      elif [ ! -d "tests/integration/mock/$agent/$test_name" ]; then
+        echo -e "\n\n\n\n========No existing mock responses for $agent/$test_name, skipping steps 2 & 3========\n\n\n\n"
+      else
+        echo -e "\n\n\n\n========STEP 2: $test_name failed, regenerating prompts for $agent WITHOUT money cost========\n\n\n\n"
+
+        regenerate_without_llm
+
+        echo -e "\n\n\n\n========STEP 3: $test_name prompts regenerated for $agent, rerun test again to verify========\n\n\n\n"
+        # Temporarily disable 'exit on error'
+        set +e
+        run_test
+        TEST_STATUS=$?
+        # Re-enable 'exit on error'
+        set -e
      fi

-      rm -rf logs
-      rm -rf tests/integration/mock/$agent/$test_name/*
-      # set -x to print the command being executed
-      set -x
-      echo -e "/exit\n" | \
-        SANDBOX_TYPE=$SANDBOX_TYPE \
-        WORKSPACE_BASE=$WORKSPACE_BASE \
-        DEBUG=true \
-        WORKSPACE_MOUNT_PATH=$WORKSPACE_MOUNT_PATH AGENT=$agent \
-        poetry run python ./opendevin/core/main.py \
-        -i $MAX_ITERATIONS \
-        -t "$task Do not ask me for confirmation at any point." \
-        -c $agent
-      set +x
-
-      mkdir -p tests/integration/mock/$agent/$test_name/
-      mv logs/llm/**/* tests/integration/mock/$agent/$test_name/
-
-      echo -e "\n\n\n\n========$test_name test data regenerated for $agent, rerun test again to verify========\n\n\n\n"
-      # Temporarily disable 'exit on error'
-      set +e
-      run_test
-      TEST_STATUS=$?
-      # Re-enable 'exit on error'
-      set -e
-
      if [[ $TEST_STATUS -ne 0 ]]; then
-        echo -e "\n\n\n\n========$test_name for $agent RERUN FAILED========\n\n\n\n"
-        echo -e "There are multiple possibilities:"
-        echo -e "  1. The agent is unable to finish the task within $MAX_ITERATIONS steps."
-        echo -e "  2. The agent thinks itself has finished the task, but fails the validation in the test code."
-        echo -e "  3. There is something non-deterministic in the prompt."
-        echo -e "  4. There is a bug in this script, or in OpenDevin code."
-        echo -e "NOTE: Some of the above problems could sometimes be fixed by a retry (with a more powerful LLM)."
-        echo -e "      You could also consider improving the agent, increasing MAX_ITERATIONS, or skipping this test for this agent."
-        exit 1
+        echo -e "\n\n\n\n========STEP 4: $test_name failed, regenerating prompts and responses for $agent WITH money cost========\n\n\n\n"
+
+        regenerate_with_llm
+
+        echo -e "\n\n\n\n========STEP 5: $test_name prompts and responses regenerated for $agent, rerun test again to verify========\n\n\n\n"
+        # Temporarily disable 'exit on error'
+        set +e
+        run_test
+        TEST_STATUS=$?
+        # Re-enable 'exit on error'
+        set -e
+
+        if [[ $TEST_STATUS -ne 0 ]]; then
+          echo -e "\n\n\n\n========$test_name for $agent RERUN FAILED========\n\n\n\n"
+          echo -e "There are multiple possibilities:"
+          echo -e "  1. The agent is unable to finish the task within $MAX_ITERATIONS steps."
+          echo -e "  2. The agent believes it has finished the task, but fails the validation in the test code."
+          echo -e "  3. There is something non-deterministic in the prompt."
+          echo -e "  4. There is a bug in this script, or in OpenDevin code."
+          echo -e "NOTE: Some of the above problems could sometimes be fixed by a retry (with a more powerful LLM)."
+          echo -e "      You could also consider improving the agent, increasing MAX_ITERATIONS, or skipping this test for this agent."
+          exit 1
+        else
+          echo -e "\n\n\n\n========$test_name for $agent RERUN PASSED========\n\n\n\n"
+          sleep 1
+        fi
      else
-        echo -e "\n\n\n\n========$test_name for $agent RERUN PASSED========\n\n\n\n"
-        sleep 1
+        echo -e "\n\n\n\n========$test_name for $agent RERUN PASSED========\n\n\n\n"
+        sleep 1
      fi
-
-
    else
      echo -e "\n\n\n\n========$test_name for $agent PASSED========\n\n\n\n"
      sleep 1