Resolve conflicts to merge with the new version

katiue
2025-01-09 01:26:13 +07:00
12 changed files with 476 additions and 195 deletions

LICENSE Normal file
View File

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 Browser Use Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@@ -21,11 +21,19 @@ This project builds upon the foundation of the [browser-use](https://github.com/
This project builds upon the foundation of [browser-use](https://github.com/browser-use/browser-use), which is designed to make websites accessible for AI agents.
We would like to officially thank [WarmShao](https://github.com/warmshao) for his contribution to this project.
**WebUI:** is built on Gradio and supports most of the `browser-use` functionalities. This UI is designed to be user-friendly and enables easy interaction with the browser agent.
**Expanded LLM Support:** We've integrated support for various Large Language Models (LLMs), including Gemini, OpenAI, Azure OpenAI, Anthropic, DeepSeek, Ollama, etc., and we plan to add support for even more models in the future.
**Custom Browser Support:** You can use your own browser with our tool, eliminating the need to re-login to sites or deal with other authentication challenges. This feature also supports high-definition screen recording.
<video src="https://github.com/user-attachments/assets/56bc7080-f2e3-4367-af22-6bf2245ff6cb" controls="controls">Your browser does not support playing this video!</video>
@@ -54,6 +62,35 @@ uv pip install -r requirements.txt
## Installation Guide
Read the [quickstart guide](https://docs.browser-use.com/quickstart#prepare-the-environment) or follow the steps below to get started.
> Python 3.11 or higher is required.
First, we recommend using [uv](https://docs.astral.sh/uv/) to set up the Python environment.
```bash
uv venv --python 3.11
```
and activate it with:
```bash
source .venv/bin/activate
```
Install the dependencies:
```bash
uv pip install -r requirements.txt
```
Then install playwright:
```bash
playwright install
```
@@ -101,3 +138,35 @@ CHROME_USER_DATA="~/Library/Application Support/Google/Chrome/Profile 1"
## (Optional) Configure Environment Variables
Copy `.env.example` to `.env` and set your environment variables, including API keys for the LLM, with:
```bash
cp .env.example .env
```
**If using your own browser:** set `CHROME_PATH` to the executable path of your browser and `CHROME_USER_DATA` to the user data directory of your browser.
You can copy the examples below into your `.env` file.
### Windows
```env
CHROME_PATH="C:\Program Files\Google\Chrome\Application\chrome.exe"
CHROME_USER_DATA="C:\Users\YourUsername\AppData\Local\Google\Chrome\User Data"
```
> Note: Replace `YourUsername` with your actual Windows username.
### Mac
```env
CHROME_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
CHROME_USER_DATA="~/Library/Application Support/Google/Chrome/Profile 1"
```
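### Linux
Not covered by the examples above; these are typical defaults only, and the paths may vary by distribution and Chrome channel:
```env
CHROME_PATH="/usr/bin/google-chrome"
CHROME_USER_DATA="~/.config/google-chrome"
```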
## Changelog
- [x] **2025/01/06:** Thanks to @richard-devbot, a New and Well-Designed WebUI is released. [Video tutorial demo](https://github.com/warmshao/browser-use-webui/issues/1#issuecomment-2573393113).

View File

@@ -1,5 +1,5 @@
browser-use==0.1.17
langchain-google-genai
browser-use>=0.1.18
langchain-google-genai>=2.0.8
pyperclip
gradio
python-dotenv

View File

@@ -4,71 +4,45 @@
# @ProjectName: browser-use-webui
# @FileName: custom_agent.py
import asyncio
import base64
import io
import json
import logging
import os
import pdb
import textwrap
import time
import uuid
from io import BytesIO
from pathlib import Path
from typing import Any, Optional, Type, TypeVar
import traceback
from typing import Optional, Type
from dotenv import load_dotenv
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import (
BaseMessage,
SystemMessage,
)
from openai import RateLimitError
from PIL import Image, ImageDraw, ImageFont
from pydantic import BaseModel, ValidationError
from browser_use.agent.message_manager.service import MessageManager
from browser_use.agent.prompts import AgentMessagePrompt, SystemPrompt
from browser_use.agent.prompts import SystemPrompt
from browser_use.agent.service import Agent
from browser_use.agent.views import (
ActionResult,
AgentError,
AgentHistory,
AgentHistoryList,
AgentOutput,
AgentStepInfo,
)
from browser_use.browser.browser import Browser
from browser_use.browser.context import BrowserContext
from browser_use.browser.views import BrowserState, BrowserStateHistory
from browser_use.controller.registry.views import ActionModel
from browser_use.controller.service import Controller
from browser_use.dom.history_tree_processor.service import (
DOMHistoryElement,
HistoryTreeProcessor,
)
from browser_use.telemetry.service import ProductTelemetry
from browser_use.telemetry.views import (
AgentEndTelemetryEvent,
AgentRunTelemetryEvent,
AgentStepErrorTelemetryEvent,
)
from browser_use.utils import time_execution_async
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import (
BaseMessage,
)
from .custom_massage_manager import CustomMassageManager
from .custom_views import CustomAgentOutput, CustomAgentStepInfo
logger = logging.getLogger(__name__)
class CustomAgent(Agent):
def __init__(
self,
task: str,
llm: BaseChatModel,
add_infos: str = '',
add_infos: str = "",
browser: Browser | None = None,
browser_context: BrowserContext | None = None,
controller: Controller = Controller(),
@@ -80,23 +54,39 @@ class CustomAgent(Agent):
max_input_tokens: int = 128000,
validate_output: bool = False,
include_attributes: list[str] = [
'title',
'type',
'name',
'role',
'tabindex',
'aria-label',
'placeholder',
'value',
'alt',
'aria-expanded',
"title",
"type",
"name",
"role",
"tabindex",
"aria-label",
"placeholder",
"value",
"alt",
"aria-expanded",
],
max_error_length: int = 400,
max_actions_per_step: int = 10,
tool_call_in_content: bool = True,
):
super().__init__(task, llm, browser, browser_context, controller, use_vision, save_conversation_path,
max_failures, retry_delay, system_prompt_class, max_input_tokens, validate_output,
include_attributes, max_error_length, max_actions_per_step)
super().__init__(
task=task,
llm=llm,
browser=browser,
browser_context=browser_context,
controller=controller,
use_vision=use_vision,
save_conversation_path=save_conversation_path,
max_failures=max_failures,
retry_delay=retry_delay,
system_prompt_class=system_prompt_class,
max_input_tokens=max_input_tokens,
validate_output=validate_output,
include_attributes=include_attributes,
max_error_length=max_error_length,
max_actions_per_step=max_actions_per_step,
tool_call_in_content=tool_call_in_content,
)
self.add_infos = add_infos
self.message_manager = CustomMassageManager(
llm=self.llm,
@@ -107,6 +97,7 @@ class CustomAgent(Agent):
include_attributes=self.include_attributes,
max_error_length=self.max_error_length,
max_actions_per_step=self.max_actions_per_step,
tool_call_in_content=tool_call_in_content,
)
def _setup_action_models(self) -> None:
@@ -118,21 +109,21 @@ class CustomAgent(Agent):
def _log_response(self, response: CustomAgentOutput) -> None:
"""Log the model's response"""
if 'Success' in response.current_state.prev_action_evaluation:
emoji = '✅'
elif 'Failed' in response.current_state.prev_action_evaluation:
emoji = '❌'
if "Success" in response.current_state.prev_action_evaluation:
emoji = "✅"
elif "Failed" in response.current_state.prev_action_evaluation:
emoji = "❌"
else:
emoji = '🤷'
emoji = "🤷"
logger.info(f'{emoji} Eval: {response.current_state.prev_action_evaluation}')
logger.info(f'🧠 New Memory: {response.current_state.important_contents}')
logger.info(f'⏳ Task Progress: {response.current_state.completed_contents}')
logger.info(f'🤔 Thought: {response.current_state.thought}')
logger.info(f'🎯 Summary: {response.current_state.summary}')
logger.info(f"{emoji} Eval: {response.current_state.prev_action_evaluation}")
logger.info(f"🧠 New Memory: {response.current_state.important_contents}")
logger.info(f"⏳ Task Progress: {response.current_state.completed_contents}")
logger.info(f"🤔 Thought: {response.current_state.thought}")
logger.info(f"🎯 Summary: {response.current_state.summary}")
for i, action in enumerate(response.action):
logger.info(
f'🛠️ Action {i + 1}/{len(response.action)}: {action.model_dump_json(exclude_unset=True)}'
f"🛠️ Action {i + 1}/{len(response.action)}: {action.model_dump_json(exclude_unset=True)}"
)
def update_step_info(self, model_output: CustomAgentOutput, step_info: CustomAgentStepInfo | None = None):
@@ -144,32 +135,54 @@ class CustomAgent(Agent):
step_info.step_number += 1
important_contents = model_output.current_state.important_contents
if important_contents and 'None' not in important_contents and important_contents not in step_info.memory:
step_info.memory += important_contents + '\n'
if (
important_contents
and "None" not in important_contents
and important_contents not in step_info.memory
):
step_info.memory += important_contents + "\n"
completed_contents = model_output.current_state.completed_contents
if completed_contents and 'None' not in completed_contents:
if completed_contents and "None" not in completed_contents:
step_info.task_progress = completed_contents
@time_execution_async('--get_next_action')
async def get_next_action(self, input_messages: list[BaseMessage]) -> CustomAgentOutput:
@time_execution_async("--get_next_action")
async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutput:
"""Get next action from LLM based on current state"""
try:
structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)
response: dict[str, Any] = await structured_llm.ainvoke(input_messages) # type: ignore
ret = self.llm.invoke(input_messages)
content_str = ''.join([str(item) for item in ret.content])
parsed_json = json.loads(content_str.replace('```json', '').replace("```", ""))
parsed: CustomAgentOutput = self.AgentOutput(**parsed_json)
parsed: AgentOutput = response['parsed']
# cut the number of actions to max_actions_per_step
parsed.action = parsed.action[: self.max_actions_per_step]
self._log_response(parsed)
self.n_steps += 1
return parsed
except Exception as e:
# If something goes wrong, try to invoke the LLM again without structured output,
# and manually parse the response. Temporary solution for DeepSeek.
ret = self.llm.invoke(input_messages)
if isinstance(ret.content, list):
parsed_json = json.loads(ret.content[0].replace("```json", "").replace("```", ""))
else:
parsed_json = json.loads(ret.content.replace("```json", "").replace("```", ""))
parsed: AgentOutput = self.AgentOutput(**parsed_json)
if parsed is None:
raise ValueError('Could not parse response.')
# cut the number of actions to max_actions_per_step
parsed.action = parsed.action[: self.max_actions_per_step]
self._log_response(parsed)
self.n_steps += 1
return parsed
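The fallback above amounts to stripping Markdown code fences from the raw LLM reply and parsing the remainder as JSON. A minimal standalone sketch of that step (a hypothetical helper, not part of this commit):

```python
import json

def parse_fenced_json(content) -> dict:
    # Hypothetical helper mirroring the fallback in get_next_action():
    # LangChain may return a string or a list of content parts.
    if isinstance(content, list):
        content = "".join(str(part) for part in content)
    cleaned = content.replace("```json", "").replace("```", "").strip()
    return json.loads(cleaned)
```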
@time_execution_async("--step")
async def step(self, step_info: Optional[CustomAgentStepInfo] = None) -> None:
"""Execute one step of the task"""
logger.info(f'\n📍 Step {self.n_steps}')
logger.info(f"\n📍 Step {self.n_steps}")
state = None
model_output = None
result: list[ActionResult] = []
@@ -192,7 +205,7 @@ class CustomAgent(Agent):
self._last_result = result
if len(result) > 0 and result[-1].is_done:
logger.info(f'📄 Result: {result[-1].extracted_content}')
logger.info(f"📄 Result: {result[-1].extracted_content}")
self.consecutive_failures = 0
@@ -217,7 +230,7 @@ class CustomAgent(Agent):
async def run(self, max_steps: int = 100) -> AgentHistoryList:
"""Execute the task with maximum number of steps"""
try:
logger.info(f'🚀 Starting task: {self.task}')
logger.info(f"🚀 Starting task: {self.task}")
self.telemetry.capture(
AgentRunTelemetryEvent(
@@ -226,13 +239,14 @@ class CustomAgent(Agent):
)
)
step_info = CustomAgentStepInfo(task=self.task,
add_infos=self.add_infos,
step_number=1,
max_steps=max_steps,
memory='',
task_progress=''
)
step_info = CustomAgentStepInfo(
task=self.task,
add_infos=self.add_infos,
step_number=1,
max_steps=max_steps,
memory="",
task_progress="",
)
for step in range(max_steps):
if self._too_many_failures():
@@ -247,10 +261,10 @@ class CustomAgent(Agent):
if not await self._validate_output():
continue
logger.info('✅ Task completed successfully')
logger.info("✅ Task completed successfully")
break
else:
logger.info('❌ Failed to complete task in maximum steps')
logger.info("❌ Failed to complete task in maximum steps")
return self.history

View File

@@ -18,6 +18,7 @@ from browser_use.browser.views import BrowserState
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import (
HumanMessage,
AIMessage
)
from .custom_prompts import CustomAgentMessagePrompt
@@ -27,34 +28,65 @@ logger = logging.getLogger(__name__)
class CustomMassageManager(MessageManager):
def __init__(
self,
llm: BaseChatModel,
task: str,
action_descriptions: str,
system_prompt_class: Type[SystemPrompt],
max_input_tokens: int = 128000,
estimated_tokens_per_character: int = 3,
image_tokens: int = 800,
include_attributes: list[str] = [],
max_error_length: int = 400,
max_actions_per_step: int = 10,
self,
llm: BaseChatModel,
task: str,
action_descriptions: str,
system_prompt_class: Type[SystemPrompt],
max_input_tokens: int = 128000,
estimated_tokens_per_character: int = 3,
image_tokens: int = 800,
include_attributes: list[str] = [],
max_error_length: int = 400,
max_actions_per_step: int = 10,
tool_call_in_content: bool = False,
):
super().__init__(
llm,
task,
action_descriptions,
system_prompt_class,
max_input_tokens,
estimated_tokens_per_character,
image_tokens,
include_attributes,
max_error_length,
max_actions_per_step,
llm=llm,
task=task,
action_descriptions=action_descriptions,
system_prompt_class=system_prompt_class,
max_input_tokens=max_input_tokens,
estimated_tokens_per_character=estimated_tokens_per_character,
image_tokens=image_tokens,
include_attributes=include_attributes,
max_error_length=max_error_length,
max_actions_per_step=max_actions_per_step,
tool_call_in_content=tool_call_in_content,
)
# Move Task info to state_message
# Custom: Move Task info to state_message
self.history = MessageHistory()
self._add_message_with_tokens(self.system_prompt)
tool_calls = [
{
'name': 'AgentOutput',
'args': {
'current_state': {
'evaluation_previous_goal': 'Unknown - No previous actions to evaluate.',
'memory': '',
'next_goal': 'Obtain task from user',
},
'action': [],
},
'id': '',
'type': 'tool_call',
}
]
if self.tool_call_in_content:
# OpenAI throws an error if tool_calls are not responded to -> move them into content
example_tool_call = AIMessage(
content=f'{tool_calls}',
tool_calls=[],
)
else:
example_tool_call = AIMessage(
content='',
tool_calls=tool_calls,
)
self._add_message_with_tokens(example_tool_call)
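Sketched standalone, the intent of the flag (assuming LangChain's `AIMessage`, mirroring the branch above): some providers reject an assistant message whose `tool_calls` never receive a response, so the example call is serialized into plain content instead.

```python
from langchain_core.messages import AIMessage

def example_tool_call_message(tool_calls: list, tool_call_in_content: bool) -> AIMessage:
    # Mirrors the branch above: providers such as OpenAI error on
    # unanswered tool_calls, so optionally embed them in content instead.
    if tool_call_in_content:
        return AIMessage(content=f"{tool_calls}", tool_calls=[])
    return AIMessage(content="", tool_calls=tool_calls)
```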
def add_state_message(
self,
state: BrowserState,

View File

@@ -24,7 +24,7 @@ class CustomSystemPrompt(SystemPrompt):
{
"current_state": {
"prev_action_evaluation": "Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Ignore the action result. The website is the ground truth. Also mention if something unexpected happened like new suggestions in an input field. Shortly state why/why not. Note that the result you output must be consistent with the reasoning you output afterwards. If you consider it to be 'Failed,' you should reflect on this during your thought.",
"important_contents": "Output important contents closely related to user\'s instruction or task on the current page. If there is, please output the contents. If not, please output \"None\".",
"important_contents": "Output important contents closely related to user\'s instruction or task on the current page. If there is, please output the contents. If not, please output empty string ''.",
"completed_contents": "Update the input Task Progress. Completed contents is a general summary of the current contents that have been completed. Just summarize the contents that have been actually completed based on the current page and the history operations. Please list each completed item individually, such as: 1. Input username. 2. Input Password. 3. Click confirm button",
"thought": "Think about the requirements that have been completed in previous operations and the requirements that need to be completed in the next one operation. If the output of prev_action_evaluation is 'Failed', please reflect and output your reflection here. If you think you have entered the wrong page, consider to go back to the previous page in next action.",
"summary": "Please generate a brief natural language description for the operation in next actions based on your Thought."
@@ -188,7 +188,7 @@ class CustomAgentMessagePrompt:
state_description += f"\nResult of action {i + 1}/{len(self.result)}: {result.extracted_content}"
if result.error:
# only use last 300 characters of error
error = result.error[-self.max_error_length :]
error = result.error[-self.max_error_length:]
state_description += (
f"\nError of action {i + 1}/{len(self.result)}: ...{error}"
)

View File

@@ -6,9 +6,10 @@
from dataclasses import dataclass
from typing import Type
from pydantic import BaseModel, ConfigDict, Field, ValidationError, create_model
from browser_use.controller.registry.views import ActionModel
from browser_use.agent.views import AgentOutput
from browser_use.controller.registry.views import ActionModel
from pydantic import BaseModel, ConfigDict, Field, create_model
@dataclass
@@ -43,11 +44,16 @@ class CustomAgentOutput(AgentOutput):
action: list[ActionModel]
@staticmethod
def type_with_custom_actions(custom_actions: Type[ActionModel]) -> Type['CustomAgentOutput']:
def type_with_custom_actions(
custom_actions: Type[ActionModel],
) -> Type["CustomAgentOutput"]:
"""Extend actions with custom actions"""
return create_model(
'AgentOutput',
"AgentOutput",
__base__=CustomAgentOutput,
action=(list[custom_actions], Field(...)), # Properly annotated field with no default
action=(
list[custom_actions],
Field(...),
), # Properly annotated field with no default
__module__=CustomAgentOutput.__module__,
)

View File

@@ -5,8 +5,6 @@
# @Project : browser-use-webui
# @FileName: custom_context.py
import asyncio
import base64
import json
import logging
import os
@@ -21,7 +19,6 @@ if TYPE_CHECKING:
logger = logging.getLogger(__name__)
class CustomBrowserContext(BrowserContext):
def __init__(
self,
browser: 'CustomBrowser', # Forward declaration for CustomBrowser

View File

@@ -5,10 +5,9 @@
# @FileName: custom_action.py
import pyperclip
from browser_use.controller.service import Controller
from browser_use.agent.views import ActionResult
from browser_use.browser.context import BrowserContext
from browser_use.controller.service import Controller
class CustomController(Controller):
@@ -19,12 +18,12 @@ class CustomController(Controller):
def _register_custom_actions(self):
"""Register all custom browser actions"""
@self.registry.action('Copy text to clipboard')
@self.registry.action("Copy text to clipboard")
def copy_to_clipboard(text: str):
pyperclip.copy(text)
return ActionResult(extracted_content=text)
@self.registry.action('Paste text from clipboard', requires_browser=True)
@self.registry.action("Paste text from clipboard", requires_browser=True)
async def paste_from_clipboard(browser: BrowserContext):
text = pyperclip.paste()
# send text to browser

View File

@@ -13,7 +13,7 @@ from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama
from langchain_openai import AzureChatOpenAI, ChatOpenAI
import gradio as gr
def get_llm_model(provider: str, **kwargs):
"""
@@ -22,6 +22,7 @@ def get_llm_model(provider: str, **kwargs):
:param kwargs:
:return:
"""
if provider == "anthropic":
if not kwargs.get("base_url", ""):
base_url = "https://api.anthropic.com"
@@ -34,6 +35,7 @@ def get_llm_model(provider: str, **kwargs):
api_key = kwargs.get("api_key")
return ChatAnthropic(
model_name=kwargs.get("model_name", "claude-3-5-sonnet-20240620"),
temperature=kwargs.get("temperature", 0.0),
base_url=base_url,
@@ -41,6 +43,7 @@ def get_llm_model(provider: str, **kwargs):
timeout=kwargs.get("timeout", 60),
stop=kwargs.get("stop", None),
)
elif provider == "openai":
if not kwargs.get("base_url", ""):
base_url = os.getenv("OPENAI_ENDPOINT", "https://api.openai.com/v1")
@@ -53,11 +56,13 @@ def get_llm_model(provider: str, **kwargs):
api_key = kwargs.get("api_key")
return ChatOpenAI(
model=kwargs.get("model_name", "gpt-4o"),
temperature=kwargs.get("temperature", 0.0),
base_url=base_url,
api_key=SecretStr(api_key or ""),
)
elif provider == "deepseek":
if not kwargs.get("base_url", ""):
base_url = os.getenv("DEEPSEEK_ENDPOINT", "")
@@ -70,25 +75,31 @@ def get_llm_model(provider: str, **kwargs):
api_key = kwargs.get("api_key")
return ChatOpenAI(
model=kwargs.get("model_name", "deepseek-chat"),
temperature=kwargs.get("temperature", 0.0),
base_url=base_url,
api_key=SecretStr(api_key or ""),
)
elif provider == "gemini":
if not kwargs.get("api_key", ""):
api_key = os.getenv("GOOGLE_API_KEY", "")
else:
api_key = kwargs.get("api_key")
return ChatGoogleGenerativeAI(
model=kwargs.get("model_name", "gemini-2.0-flash-exp"),
temperature=kwargs.get("temperature", 0.0),
api_key=SecretStr(api_key or ""),
)
elif provider == "ollama":
return ChatOllama(
model=kwargs.get("model_name", "qwen2.5:7b"),
temperature=kwargs.get("temperature", 0.0),
num_ctx=128000,
)
elif provider == "azure_openai":
if not kwargs.get("base_url", ""):
@@ -100,6 +111,7 @@ def get_llm_model(provider: str, **kwargs):
else:
api_key = kwargs.get("api_key")
return AzureChatOpenAI(
model=kwargs.get("model_name", "gpt-4o"),
temperature=kwargs.get("temperature", 0.0),
api_version="2024-05-01-preview",
@@ -108,8 +120,34 @@ def get_llm_model(provider: str, **kwargs):
)
else:
raise ValueError(f"Unsupported provider: {provider}")
# Predefined model names for common providers
model_names = {
"anthropic": ["claude-3-5-sonnet-20240620", "claude-3-opus-20240229"],
"openai": ["gpt-4o", "gpt-4", "gpt-3.5-turbo"],
"deepseek": ["deepseek-chat"],
"gemini": ["gemini-2.0-flash-exp", "gemini-2.0-flash-thinking-exp", "gemini-1.5-flash-latest", "gemini-1.5-flash-8b-latest", "gemini-2.0-flash-thinking-exp-1219" ],
"ollama": ["qwen2.5:7b", "llama2:7b"],
"azure_openai": ["gpt-4o", "gpt-4", "gpt-3.5-turbo"]
}
# Callback to update the model name dropdown based on the selected provider
def update_model_dropdown(llm_provider, api_key=None, base_url=None):
"""
Update the model name dropdown with predefined models for the selected provider.
"""
# Use API keys from .env if not provided
if not api_key:
api_key = os.getenv(f"{llm_provider.upper()}_API_KEY", "")
if not base_url:
base_url = os.getenv(f"{llm_provider.upper()}_BASE_URL", "")
# Use predefined models for the selected provider
if llm_provider in model_names:
return gr.Dropdown(choices=model_names[llm_provider], value=model_names[llm_provider][0], interactive=True)
else:
return gr.Dropdown(choices=[], value="", interactive=True, allow_custom_value=True)
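A possible wiring for this callback in the Gradio UI (the event hookup itself isn't shown in this diff; component names are assumed from `create_ui()` in webui.py below):

```python
# Sketch: refresh the model dropdown whenever the provider changes.
llm_provider.change(
    fn=update_model_dropdown,
    inputs=[llm_provider, llm_api_key, llm_base_url],
    outputs=llm_model_name,
)
```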
def encode_image(img_path):
if not img_path:
return None

View File

@@ -80,11 +80,12 @@ async def test_browser_use_org():
async def test_browser_use_custom():
from browser_use.browser.context import BrowserContextWindowSize
from browser_use.browser.browser import BrowserConfig
from playwright.async_api import async_playwright
from src.agent.custom_agent import CustomAgent
from src.agent.custom_prompts import CustomSystemPrompt
from src.browser.custom_browser import BrowserConfig, CustomBrowser
from src.browser.custom_browser import CustomBrowser
from src.browser.custom_context import BrowserContextConfig
from src.controller.custom_controller import CustomController
@@ -95,15 +96,15 @@ async def test_browser_use_custom():
# model_name="gpt-4o",
# temperature=0.8,
# base_url=os.getenv("AZURE_OPENAI_ENDPOINT", ""),
# api_key=os.getenv("AZURE_OPENAI_API_KEY", "")
# api_key=os.getenv("AZURE_OPENAI_API_KEY", ""),
# )
# llm = utils.get_llm_model(
# provider="gemini",
# model_name="gemini-2.0-flash-exp",
# temperature=1.0,
# api_key=os.getenv("GOOGLE_API_KEY", "")
# )
llm = utils.get_llm_model(
provider="gemini",
model_name="gemini-2.0-flash-exp",
temperature=1.0,
api_key=os.getenv("GOOGLE_API_KEY", "")
)
# llm = utils.get_llm_model(
# provider="deepseek",
@@ -111,14 +112,16 @@ async def test_browser_use_custom():
# temperature=0.8
# )
llm = utils.get_llm_model(
provider="ollama", model_name="qwen2.5:7b", temperature=0.8
)
# llm = utils.get_llm_model(
# provider="ollama", model_name="qwen2.5:7b", temperature=0.8
# )
controller = CustomController()
use_own_browser = False
disable_security = True
use_vision = False
use_vision = True # Set to False when using DeepSeek
tool_call_in_content = True # Set to True when using Ollama
max_actions_per_step = 1
playwright = None
browser_context_ = None
try:
@@ -171,6 +174,8 @@ async def test_browser_use_custom():
controller=controller,
system_prompt_class=CustomSystemPrompt,
use_vision=use_vision,
tool_call_in_content=tool_call_in_content,
max_actions_per_step=max_actions_per_step
)
history: AgentHistoryList = await agent.run(max_steps=10)

webui.py
View File

@@ -4,6 +4,9 @@
# @Email : wenshaoguo1026@gmail.com
# @Project : browser-use-webui
# @FileName: webui.py
import pdb
from dotenv import load_dotenv
load_dotenv()
import argparse
@@ -21,8 +24,11 @@ from src.browser.custom_browser import CustomBrowser
from src.controller.custom_controller import CustomController
from src.agent.custom_agent import CustomAgent
from src.agent.custom_prompts import CustomSystemPrompt
from src.browser.custom_browser import CustomBrowser
from src.browser.custom_context import BrowserContextConfig
from src.controller.custom_controller import CustomController
from src.utils import utils
from src.utils.utils import update_model_dropdown
from src.utils.file_utils import get_latest_files
from src.utils.stream_utils import stream_browser_view, capture_screenshot
@@ -44,18 +50,24 @@ async def run_browser_agent(
add_infos,
max_steps,
use_vision,
max_actions_per_step,
tool_call_in_content,
browser_context=None # Added optional argument
):
"""
Runs the browser agent based on user configurations.
"""
# Ensure the recording directory exists
os.makedirs(save_recording_path, exist_ok=True)
# Get the list of existing videos before the agent runs
existing_videos = set(glob.glob(os.path.join(save_recording_path, '*.[mM][pP]4')) +
glob.glob(os.path.join(save_recording_path, '*.[wW][eE][bB][mM]')))
# Run the agent
llm = utils.get_llm_model(
provider=llm_provider,
model_name=llm_model_name,
temperature=llm_temperature,
base_url=llm_base_url,
api_key=llm_api_key
api_key=llm_api_key,
)
if agent_type == "org":
return await run_org_agent(
@@ -68,7 +80,9 @@ async def run_browser_agent(
task=task,
max_steps=max_steps,
use_vision=use_vision,
browser_context=browser_context # pass context
max_actions_per_step=max_actions_per_step,
tool_call_in_content=tool_call_in_content,
browser_context=browser_context,  # pass context
)
elif agent_type == "custom":
return await run_custom_agent(
@@ -83,12 +97,15 @@ async def run_browser_agent(
add_infos=add_infos,
max_steps=max_steps,
use_vision=use_vision,
browser_context=browser_context # pass context
max_actions_per_step=max_actions_per_step,
tool_call_in_content=tool_call_in_content,
browser_context=browser_context,  # pass context
)
else:
raise ValueError(f"Invalid agent type: {agent_type}")
async def run_org_agent(
llm,
headless,
@@ -99,7 +116,9 @@ async def run_org_agent(
task,
max_steps,
use_vision,
browser_context=None # receive context
max_actions_per_step,
tool_call_in_content,
browser_context=None, # receive context
):
browser = None
if browser_context is None:
@@ -142,7 +161,9 @@ async def run_org_agent(
task=task,
llm=llm,
use_vision=use_vision,
browser_context=browser_context
max_actions_per_step=max_actions_per_step,
tool_call_in_content=tool_call_in_content,
browser_context=browser_context,
)
history = await agent.run(max_steps=max_steps)
final_result = history.final_result()
@@ -153,7 +174,6 @@ async def run_org_agent(
trace_file = get_latest_files(save_recording_path + "/../traces")
return final_result, errors, model_actions, model_thoughts, recorded_files.get('.webm'), trace_file.get('.zip')
async def run_custom_agent(
llm,
use_own_browser,
@@ -166,7 +186,9 @@ async def run_custom_agent(
add_infos,
max_steps,
use_vision,
browser_context=None # receive context
max_actions_per_step,
tool_call_in_content,
browser_context=None, # receive context
):
controller = CustomController()
playwright = None
@@ -176,20 +198,29 @@ async def run_custom_agent(
playwright = await async_playwright().start()
chrome_exe = os.getenv("CHROME_PATH", "")
chrome_use_data = os.getenv("CHROME_USER_DATA", "")
if chrome_exe == "":
chrome_exe = None
elif not os.path.exists(chrome_exe):
raise ValueError(f"Chrome executable not found at {chrome_exe}")
if chrome_use_data == "":
chrome_use_data = None
browser_context_ = await playwright.chromium.launch_persistent_context(
user_data_dir=chrome_use_data,
executable_path=chrome_exe,
no_viewport=False,
headless=headless,  # keep the browser window visible
user_agent=(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
),
java_script_enabled=True,
bypass_csp=disable_security,
ignore_https_errors=disable_security,
record_video_dir=save_recording_path if save_recording_path else None,
record_video_size={'width': window_w, 'height': window_h}
record_video_size={"width": window_w, "height": window_h},
)
else:
browser_context_ = None
@@ -250,6 +281,7 @@ async def run_custom_agent(
except Exception as e:
import traceback
traceback.print_exc()
final_result = ""
errors = str(e) + "\n" + traceback.format_exc()
@@ -409,6 +441,7 @@ theme_map = {
"Origin": Origin(),
"Citrus": Citrus(),
"Ocean": Ocean(),
"Base": Base()
}
# Create the Gradio UI
@@ -439,57 +472,124 @@ def create_ui(theme_name="Ocean"):
### Control your browser with AI assistance
""",
elem_classes=["header-text"],
)
# Tabs
with gr.Tabs():
# Agent Settings
with gr.Tab("⚙️ Agent Settings"):
agent_type = gr.Radio(
["org", "custom"],
label="Agent Type",
value="custom",
)
max_steps = gr.Slider(1, 200, value=100, step=1, label="Max Run Steps")
max_actions_per_step = gr.Slider(
1, 20, value=10, step=1, label="Max Actions per Step"
)
use_vision = gr.Checkbox(value=True, label="Use Vision")
tool_call_in_content = gr.Checkbox(
value=True, label="Enable Tool Calls in Content"
)
with gr.Tabs() as tabs:
with gr.TabItem("⚙️ Agent Settings", id=1):
with gr.Group():
agent_type = gr.Radio(
["org", "custom"],
label="Agent Type",
value="custom",
info="Select the type of agent to use",
)
max_steps = gr.Slider(
minimum=1,
maximum=200,
value=100,
step=1,
label="Max Run Steps",
info="Maximum number of steps the agent will take",
)
max_actions_per_step = gr.Slider(
minimum=1,
maximum=20,
value=10,
step=1,
label="Max Actions per Step",
info="Maximum number of actions the agent will take per step",
)
use_vision = gr.Checkbox(
label="Use Vision",
value=True,
info="Enable visual processing capabilities",
)
tool_call_in_content = gr.Checkbox(
label="Use Tool Calls in Content",
value=True,
info="Enable Tool Calls in content",
)
# LLM Configuration
with gr.Tab("🔧 LLM Configuration"):
llm_provider = gr.Dropdown(
["anthropic", "openai", "gemini", "azure_openai", "deepseek"],
label="LLM Provider",
value="gemini",
)
llm_model_name = gr.Textbox(label="Model Name", value="gemini-2.0-flash-exp")
llm_temperature = gr.Slider(0.0, 2.0, value=1.0, step=0.1, label="Temperature")
llm_base_url = gr.Textbox(label="Base URL")
llm_api_key = gr.Textbox(label="API Key", type="password")
with gr.TabItem("🔧 LLM Configuration", id=2):
with gr.Group():
llm_provider = gr.Dropdown(
["anthropic", "openai", "deepseek", "gemini", "ollama", "azure_openai"],
label="LLM Provider",
value="",
info="Select your preferred language model provider"
)
llm_model_name = gr.Dropdown(
label="Model Name",
value="",
interactive=True,
allow_custom_value=True, # Allow users to input custom model names
info="Select a model from the dropdown or type a custom model name"
)
llm_temperature = gr.Slider(
minimum=0.0,
maximum=2.0,
value=1.0,
step=0.1,
label="Temperature",
info="Controls randomness in model outputs"
)
with gr.Row():
llm_base_url = gr.Textbox(
label="Base URL",
value=os.getenv(f"{llm_provider.value.upper()}_BASE_URL ", ""), # Default to .env value
info="API endpoint URL (if required)"
)
llm_api_key = gr.Textbox(
label="API Key",
type="password",
value=os.getenv(f"{llm_provider.value.upper()}_API_KEY", ""), # Default to .env value
info="Your API key (leave blank to use .env)"
)
with gr.TabItem("🌐 Browser Settings", id=3):
with gr.Group():
with gr.Row():
use_own_browser = gr.Checkbox(
label="Use Own Browser",
value=False,
info="Use your existing browser instance",
)
headless = gr.Checkbox(
label="Headless Mode",
value=False,
info="Run browser without GUI",
)
disable_security = gr.Checkbox(
label="Disable Security",
value=True,
info="Disable browser security features",
)
# Browser Settings
with gr.Tab("🌐 Browser Settings"):
use_own_browser = gr.Checkbox(value=False, label="Use Own Browser")
headless = gr.Checkbox(value=False, label="Headless Mode")
disable_security = gr.Checkbox(value=True, label="Disable Security")
window_w = gr.Number(value=1280, label="Window Width")
window_h = gr.Number(value=1100, label="Window Height")
save_recording_path = gr.Textbox(
value="./tmp/record_videos",
label="Recording Path",
placeholder="e.g. ./tmp/record_videos",
)
with gr.Row():
window_w = gr.Number(
label="Window Width",
value=1280,
info="Browser window width",
)
window_h = gr.Number(
label="Window Height",
value=1100,
info="Browser window height",
)
# Run Agent
with gr.Tab("🤖 Run Agent"):
save_recording_path = gr.Textbox(
label="Recording Path",
placeholder="e.g. ./tmp/record_videos",
value="./tmp/record_videos",
info="Path to save browser recordings",
)
with gr.TabItem("🤖 Run Agent", id=4):
task = gr.Textbox(
lines=4,
value="go to google.com and type 'OpenAI' click search and give me the first url",
label="Task Description",
info="Describe what you want the agent to do",
)
add_infos = gr.Textbox(lines=3, label="Additional Information")
@@ -536,7 +636,7 @@ def create_ui(theme_name="Ocean"):
model_actions_output,
model_thoughts_output,
recording_file,
trace_file,
trace_file, max_actions_per_step, tool_call_in_content
],
queue=True,
)