claude-cookbooks/tool_evaluation/tool_evaluation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tool Evaluation\n",
    "\n",
    "Multiple agents independently run a single evaluation task from an evaluation file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "import time\n",
    "import traceback\n",
    "import xml.etree.ElementTree as ET\n",
    "from pathlib import Path\n",
    "from typing import Any, Dict, List, Tuple\n",
    "\n",
    "from anthropic import Anthropic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prompts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Embedded evaluator prompt\n",
    "EVALUATION_PROMPT = \"\"\"You are an AI assistant with access to tools.\n",
    "\n",
    "When given a task, you MUST:\n",
    "1. Use the available tools to complete the task\n",
    "2. Provide summary of each step in your approach, wrapped in <summary> tags\n",
    "3. Provide feedback on the tools provided, wrapped in <feedback> tags\n",
    "4. Provide your final response, wrapped in <response> tags\n",
    "\n",
    "Summary Requirements:\n",
    "- In your <summary> tags, you must explain:\n",
    "  - The steps you took to complete the task\n",
    "  - Which tools you used, in what order, and why\n",
    "  - The inputs you provided to each tool\n",
    "  - The outputs you received from each tool\n",
    "  - A summary for how you arrived at the response\n",
    "\n",
    "Feedback Requirements:\n",
    "- In your <feedback> tags, provide constructive feedback on the tools:\n",
    "  - Comment on tool names: Are they clear and descriptive?\n",
    "  - Comment on input parameters: Are they well-documented? Are required vs optional parameters clear?\n",
    "  - Comment on descriptions: Do they accurately describe what the tool does?\n",
    "  - Comment on any errors encountered during tool usage: Did the tool fail to execute? Did the tool return too many tokens?\n",
    "  - Identify specific areas for improvement and explain WHY they would help\n",
    "  - Be specific and actionable in your suggestions\n",
    "  \n",
    "Response Requirements:\n",
    "- Your response should be concise and directly address what was asked\n",
    "- Always wrap your final response in <response> tags\n",
    "- If you cannot solve the task return <response>NOT_FOUND</response>\n",
    "- For numeric responses, provide just the number\n",
    "- For IDs, provide just the ID\n",
    "- For names or text, provide the exact text requested\n",
    "- Your response should go last\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Agent Loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "client = Anthropic()\n",
    "model = \"claude-3-7-sonnet-20250219\"\n",
    "\n",
    "\n",
    "def agent_loop(\n",
    "    prompt: str, tools: List[Dict[str, Any]] = None\n",
    ") -> Tuple[str, Dict[str, Any]]:\n",
    "    \"\"\"Simplified agent class for tool evaluation\"\"\"\n",
    "    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
    "\n",
    "    response = client.messages.create(\n",
    "        model=model,\n",
    "        max_tokens=4096,\n",
    "        system=EVALUATION_PROMPT,\n",
    "        messages=messages,\n",
    "        tools=tools,\n",
    "    )\n",
    "\n",
    "    messages.append({\"role\": \"assistant\", \"content\": response.content})\n",
    "\n",
    "    # Track tool calls with timing\n",
    "    tool_metrics = {}  # {tool_name: {\"count\": N, \"durations\": [X1, X2, ...]}}\n",
    "\n",
    "    def _prepare_tool_result(tool_use_id, tool_result):\n",
    "        return {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\n",
    "                    \"type\": \"tool_result\",\n",
    "                    \"tool_use_id\": tool_use_id,\n",
    "                    \"content\": tool_result,\n",
    "                }\n",
    "            ],\n",
    "        }\n",
    "\n",
    "    while response.stop_reason == \"tool_use\":\n",
    "        tool_use = next(block for block in response.content if block.type == \"tool_use\")\n",
    "        tool_name = tool_use.name\n",
    "\n",
    "        tool_start_ts = time.time()\n",
    "        try:\n",
    "            tool_response = eval(\n",
    "                f\"{tool_name}(**tool_use.input)\"\n",
    "            )  # Call the tool function with its input\n",
    "        except Exception as e:\n",
    "            tool_response = f\"Error executing tool {tool_name}: {str(e)}\\n\"\n",
    "            tool_response += traceback.format_exc()\n",
    "        tool_duration = time.time() - tool_start_ts\n",
    "\n",
    "        # Update tool metrics\n",
    "        if tool_name not in tool_metrics:\n",
    "            tool_metrics[tool_name] = {\"count\": 0, \"durations\": []}\n",
    "        tool_metrics[tool_name][\"count\"] += 1\n",
    "        tool_metrics[tool_name][\"durations\"].append(tool_duration)\n",
    "\n",
    "        # Prepare tool result and append to messages\n",
    "        messages.append(_prepare_tool_result(tool_use.id, tool_response))\n",
    "        response = client.messages.create(\n",
    "            model=model,\n",
    "            max_tokens=4096,\n",
    "            system=EVALUATION_PROMPT,\n",
    "            messages=messages,\n",
    "            tools=tools,\n",
    "        )\n",
    "        messages.append({\"role\": \"assistant\", \"content\": response.content})\n",
    "\n",
    "    response = next(\n",
    "        (block.text for block in response.content if hasattr(block, \"text\")),\n",
    "        None,\n",
    "    )\n",
    "    return response, tool_metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Helper Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_evaluation_file(file_path: Path) -> List[Dict[str, Any]]:\n",
    "    \"\"\"Parse XML evaluation file and return list of evaluation tasks.\"\"\"\n",
    "    try:\n",
    "        tree = ET.parse(file_path)\n",
    "        root = tree.getroot()\n",
    "        evaluations = []\n",
    "\n",
    "        # Check for task elements\n",
    "        tasks = root.findall(\".//task\")\n",
    "        for task in tasks:\n",
    "            prompt_elem = task.find(\"prompt\")\n",
    "            response_elem = task.find(\"response\")\n",
    "\n",
    "            if prompt_elem is not None and response_elem is not None:\n",
    "                eval_dict = {\n",
    "                    \"prompt\": (prompt_elem.text or \"\").strip(),\n",
    "                    \"response\": (response_elem.text or \"\").strip(),\n",
    "                }\n",
    "                evaluations.append(eval_dict)\n",
    "\n",
    "        return evaluations\n",
    "    except Exception as e:\n",
    "        print(f\"Error parsing evaluation file {file_path}: {e}\")\n",
    "        return []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_single_task(\n",
    "    task: Dict[str, Any], tools: List[Dict[str, Any]], task_index: int\n",
    ") -> Dict[str, Any]:\n",
    "    \"\"\"Evaluate a single task with the given tools.\"\"\"\n",
    "    start_time = time.time()\n",
    "\n",
    "    # Run the task\n",
    "    print(f\"Task {task_index + 1}: Running task with prompt: {task['prompt']}\")\n",
    "    response, tool_metrics = agent_loop(task[\"prompt\"], tools)\n",
    "\n",
    "    # Extract all tagged content\n",
    "    def _extract_xml_content(text, tag):\n",
    "        pattern = rf\"<{tag}>(.*?)</{tag}>\"\n",
    "        matches = re.findall(pattern, text, re.DOTALL)\n",
    "        return matches[-1].strip() if matches else None\n",
    "\n",
    "    response, summary, feedback = (\n",
    "        _extract_xml_content(response, tag)\n",
    "        for tag in [\"response\", \"summary\", \"feedback\"]\n",
    "    )\n",
    "    duration_seconds = time.time() - start_time\n",
    "\n",
    "    return {\n",
    "        \"prompt\": task[\"prompt\"],\n",
    "        \"expected\": task[\"response\"],\n",
    "        \"actual\": response,\n",
    "        \"score\": int(response == task[\"response\"]),\n",
    "        \"total_duration\": duration_seconds,\n",
    "        \"tool_calls\": tool_metrics,\n",
    "        \"num_tool_calls\": sum(\n",
    "            len(metrics[\"durations\"]) for metrics in tool_metrics.values()\n",
    "        ),\n",
    "        \"summary\": summary,\n",
    "        \"feedback\": feedback,\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Main Evaluation Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Report Templates\n",
    "REPORT_HEADER = \"\"\"\n",
    "# Evaluation Report\n",
    "\n",
    "## Summary\n",
    "\n",
    "- **Accuracy**: {correct}/{total} ({accuracy:.1f}%)\n",
    "- **Average Task Duration**: {average_duration_s:.2f}s\n",
    "- **Average Tool Calls per Task**: {average_tool_calls:.2f}\n",
    "- **Total Tool Calls**: {total_tool_calls}\n",
    "\n",
    "---\n",
    "\"\"\"\n",
    "\n",
    "TASK_TEMPLATE = \"\"\"\n",
    "### Task\n",
    "\n",
    "**Prompt**: {prompt}\n",
    "**Ground Truth Response**: `{expected_response}`\n",
    "**Actual Response**: `{actual_response}`\n",
    "**Correct**: {correct_indicator}\n",
    "**Duration**: {total_duration:.2f}s\n",
    "**Tool Calls**: {tool_calls}\n",
    "\n",
    "**Summary**\n",
    "{summary}\n",
    "\n",
    "**Feedback**\n",
    "{feedback}\n",
    "\n",
    "---\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "def run_evaluation(eval_path: str, tools: List[Dict[str, Any]]) -> str:\n",
    "    \"\"\"\n",
    "    Run evaluation with provided tools using a simple loop.\n",
    "\n",
    "    Args:\n",
    "        eval_path: Path to XML evaluation file\n",
    "        tools: List of tool definitions to use for evaluation\n",
    "\n",
    "    \"\"\"\n",
    "    print(\"🚀 Starting Evaluation\")\n",
    "\n",
    "    eval_file = Path(eval_path)\n",
    "\n",
    "    # Parse evaluation tasks\n",
    "    tasks = parse_evaluation_file(eval_file)\n",
    "\n",
    "    print(f\"📋 Loaded {len(tasks)} evaluation tasks\")\n",
    "\n",
    "    # Simple loop to run all tasks\n",
    "    results = []\n",
    "    for i, task in enumerate(tasks):\n",
    "        print(f\"Processing task {i + 1}/{len(tasks)}\")\n",
    "        results.append(evaluate_single_task(task, tools, i))\n",
    "\n",
    "    # Calculate summary statistics\n",
    "    correct = sum(r[\"score\"] for r in results)\n",
    "    accuracy = (correct / len(results)) * 100\n",
    "    average_duration_s = sum(r[\"total_duration\"] for r in results) / len(results)\n",
    "    average_tool_calls = sum(r[\"num_tool_calls\"] for r in results) / len(results)\n",
    "    total_tool_calls = sum(r[\"num_tool_calls\"] for r in results)\n",
    "\n",
    "    report = REPORT_HEADER.format(\n",
    "        correct=correct,\n",
    "        total=len(results),\n",
    "        accuracy=accuracy,\n",
    "        average_duration_s=average_duration_s,\n",
    "        average_tool_calls=average_tool_calls,\n",
    "        total_tool_calls=total_tool_calls,\n",
    "    )\n",
    "\n",
    "    report += \"\".join(\n",
    "        [\n",
    "            TASK_TEMPLATE.format(\n",
    "                prompt=task[\"prompt\"],\n",
    "                expected_response=task[\"response\"],\n",
    "                actual_response=result[\"actual\"],\n",
    "                correct_indicator=\"✅\" if result[\"score\"] else \"❌\",\n",
    "                total_duration=result[\"total_duration\"],\n",
    "                tool_calls=json.dumps(result[\"tool_calls\"], indent=2),\n",
    "                summary=result[\"summary\"] or \"N/A\",\n",
    "                feedback=result[\"feedback\"] or \"N/A\",\n",
    "            )\n",
    "            for task, result in zip(tasks, results)\n",
    "        ]\n",
    "    )\n",
    "    # Join all sections into final report\n",
    "    return report"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculator Tool"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculator(expression: str) -> str:\n",
    "    \"\"\"A basic calculator that performs arithmetic operations.\"\"\"\n",
    "    try:\n",
    "        result = eval(expression, {\"__builtins__\": {}}, {})\n",
    "        return str(result)\n",
    "    except Exception as e:\n",
    "        return f\"Error: {str(e)}\"\n",
    "\n",
    "\n",
    "# Define the tool schema for the calculator\n",
    "calculator_tool = {\n",
    "    \"name\": \"calculator\",\n",
    "    \"description\": \"\",  # An unhelpful tool description. \n",
    "    \"input_schema\": {\n",
    "        \"type\": \"object\",\n",
    "        \"properties\": {\n",
    "            \"expression\": {\n",
    "                \"type\": \"string\",\n",
    "                \"description\": \"\", # An unhelpful schema description.\n",
    "            }\n",
    "        },\n",
    "        \"required\": [\"expression\"],\n",
    "    },\n",
    "}\n",
    "\n",
    "# Set the tools list\n",
    "tools = [calculator_tool]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Using calculator tool\n",
      "🚀 Starting Evaluation\n",
      "📋 Loaded 8 evaluation tasks\n",
      "Processing task 1/8\n",
      "Task 1: Running task with prompt: Calculate the compound interest on $10,000 invested at 5% annual interest rate, compounded monthly for 3 years. What is the final amount in dollars (rounded to 2 decimal places)?\n",
      "Processing task 2/8\n",
      "Task 2: Running task with prompt: A projectile is launched at a 45-degree angle with an initial velocity of 50 m/s. Calculate the total distance (in meters) it has traveled from the launch point after 2 seconds, assuming g=9.8 m/s². Round to 2 decimal places.\n",
      "Processing task 3/8\n",
      "Task 3: Running task with prompt: A sphere has a volume of 500 cubic meters. Calculate its surface area in square meters. Round to 2 decimal places.\n",
      "Processing task 4/8\n",
      "Task 4: Running task with prompt: Calculate the population standard deviation of this dataset: [12, 15, 18, 22, 25, 30, 35]. Round to 2 decimal places.\n",
      "Processing task 5/8\n",
      "Task 5: Running task with prompt: Calculate the pH of a solution with a hydrogen ion concentration of 3.5 × 10^-5 M. Round to 2 decimal places.\n",
      "Processing task 6/8\n",
      "Task 6: Running task with prompt: Calculate the monthly payment for a $200,000 mortgage at 4.5% annual interest rate for 30 years (360 months). Use the standard mortgage payment formula. Round to 2 decimal places.\n",
      "Processing task 7/8\n",
      "Task 7: Running task with prompt: Calculate the energy in joules of a photon with wavelength 550 nanometers. Use h = 6.626 × 10^-34 J·s and c = 3 × 10^8 m/s. Express the answer in scientific notation with 2 significant figures after the decimal (e.g., 3.61e-19).\n",
      "Processing task 8/8\n",
      "Task 8: Running task with prompt: Find the larger real root of the quadratic equation 3x² - 7x + 2 = 0. Give the exact value.\n",
      "\n",
      "# Evaluation Report\n",
      "\n",
      "## Summary\n",
      "\n",
      "- **Accuracy**: 7/8 (87.5%)\n",
      "- **Average Task Duration**: 22.73s\n",
      "- **Average Tool Calls per Task**: 7.75\n",
      "- **Total Tool Calls**: 62\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: Calculate the compound interest on $10,000 invested at 5% annual interest rate, compounded monthly for 3 years. What is the final amount in dollars (rounded to 2 decimal places)?\n",
      "**Ground Truth Response**: `11614.72`\n",
      "**Actual Response**: `$11,614.72`\n",
      "**Correct**: ❌\n",
      "**Duration**: 18.64s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 6,\n",
      "    \"durations\": [\n",
      "      9.560585021972656e-05,\n",
      "      9.870529174804688e-05,\n",
      "      8.988380432128906e-05,\n",
      "      0.00011301040649414062,\n",
      "      0.00010704994201660156,\n",
      "      8.821487426757812e-05\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "I approached this compound interest calculation in the following steps:\n",
      "\n",
      "1. First, I identified the formula needed: P(1 + r/n)^(nt) where:\n",
      "   - P = principal ($10,000)\n",
      "   - r = annual interest rate (5% or 0.05)\n",
      "   - n = number of times compounded per year (12 for monthly)\n",
      "   - t = time in years (3)\n",
      "\n",
      "2. I initially tried using the calculator tool with the formula using ^ for exponentiation, but received an error.\n",
      "\n",
      "3. I corrected the syntax by using ** for exponentiation in Python, calculating 10000 * (1 + 0.05/12)**(12*3).\n",
      "\n",
      "4. The calculator returned 11614.722313334678.\n",
      "\n",
      "5. I attempted several approaches to round to 2 decimal places using functions like round() and int(), but these weren't available in the calculator environment.\n",
      "\n",
      "6. Since the calculator doesn't have built-in rounding functions, I had to manually round the result to 2 decimal places: $11,614.72.\n",
      "\n",
      "**Feedback**\n",
      "The calculator tool has both strengths and areas for improvement:\n",
      "\n",
      "1. Tool name: \"calculator\" is clear and descriptive, immediately conveying its purpose.\n",
      "\n",
      "2. Input parameters: The \"expression\" parameter is simple, but lacks description of what syntax is supported. It would be helpful to specify that it uses Python syntax (particularly ** for exponentiation rather than ^).\n",
      "\n",
      "3. Error messaging: The error messages are helpful in identifying syntax issues, but don't provide guidance on how to fix them.\n",
      "\n",
      "4. Functionality limitations: The calculator doesn't support common mathematical functions like round(), int(), or the math module. It would be more useful if it included basic rounding and mathematical functions.\n",
      "\n",
      "5. Documentation: It would be beneficial to include a brief description of supported operations and functions, along with examples of proper syntax for common calculations.\n",
      "\n",
      "Overall, adding better documentation and expanding the supported functions would significantly improve the usability of this tool.\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: A projectile is launched at a 45-degree angle with an initial velocity of 50 m/s. Calculate the total distance (in meters) it has traveled from the launch point after 2 seconds, assuming g=9.8 m/s². Round to 2 decimal places.\n",
      "**Ground Truth Response**: `87.25`\n",
      "**Actual Response**: `87.25`\n",
      "**Correct**: ✅\n",
      "**Duration**: 31.06s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 12,\n",
      "    \"durations\": [\n",
      "      9.5367431640625e-05,\n",
      "      9.465217590332031e-05,\n",
      "      7.987022399902344e-05,\n",
      "      8.726119995117188e-05,\n",
      "      9.036064147949219e-05,\n",
      "      8.606910705566406e-05,\n",
      "      9.298324584960938e-05,\n",
      "      9.226799011230469e-05,\n",
      "      7.963180541992188e-05,\n",
      "      8.96453857421875e-05,\n",
      "      9.012222290039062e-05,\n",
      "      7.605552673339844e-05\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "To solve this projectile motion problem, I took the following steps:\n",
      "\n",
      "1. I first calculated the horizontal distance after 2 seconds:\n",
      "   - Used the formula x = v₀ × cos(θ) × t\n",
      "   - Since the calculator didn't accept trigonometric functions directly, I used the value 0.7071 (which is approximately cos(45°))\n",
      "   - Input: 50 * 2 * 0.7071\n",
      "   - Output: 70.71 meters\n",
      "\n",
      "2. I then calculated the vertical distance after 2 seconds:\n",
      "   - Used the formula y = v₀ × sin(θ) × t - 0.5 × g × t²\n",
      "   - Since sin(45°) is also approximately 0.7071\n",
      "   - Input: 50 * 0.7071 * 2 - 0.5 * 9.8 * (2**2)\n",
      "   - Output: 51.11 meters\n",
      "\n",
      "3. Finally, I calculated the total distance using the Pythagorean theorem:\n",
      "   - Used the formula d = √(x² + y²)\n",
      "   - Since the sqrt function wasn't available, I used the power operator with exponent 1/2\n",
      "   - Input: ((70.71)**2 + (51.11)**2)**(1/2)\n",
      "   - Output: 87.2475569858549 meters\n",
      "\n",
      "4. I rounded the result to 2 decimal places, which gives 87.25 meters.\n",
      "\n",
      "**Feedback**\n",
      "The calculator tool has several limitations that made this problem more complex to solve:\n",
      "\n",
      "1. Tool name: The name \"calculator\" is clear and descriptive.\n",
      "\n",
      "2. Input parameters: The \"expression\" parameter is not well-documented. There's no information about what functions or operations are supported.\n",
      "\n",
      "3. Description: There is no description provided for the tool, which would have been helpful to understand its capabilities and limitations.\n",
      "\n",
      "4. Errors encountered:\n",
      "   - The calculator doesn't support common mathematical functions like cos(), sin(), sqrt(), round(), int(), or floor().\n",
      "   - There's no math library implementation or prefix to use these functions.\n",
      "   - There's no clear documentation on what functions are supported.\n",
      "\n",
      "5. Areas for improvement:\n",
      "   - Add documentation about supported operations and functions\n",
      "   - Implement common mathematical functions (trigonometric, rounding, square root)\n",
      "   - Include examples of valid expressions\n",
      "   - Provide error messages that suggest alternatives when functions aren't available\n",
      "   - Support a math library like Python's math module would make the calculator much more useful for scientific calculations\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: A sphere has a volume of 500 cubic meters. Calculate its surface area in square meters. Round to 2 decimal places.\n",
      "**Ground Truth Response**: `304.65`\n",
      "**Actual Response**: `304.65`\n",
      "**Correct**: ✅\n",
      "**Duration**: 20.41s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 7,\n",
      "    \"durations\": [\n",
      "      8.916854858398438e-05,\n",
      "      9.226799011230469e-05,\n",
      "      8.7738037109375e-05,\n",
      "      8.368492126464844e-05,\n",
      "      9.083747863769531e-05,\n",
      "      0.0001010894775390625,\n",
      "      0.00010538101196289062\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "N/A\n",
      "\n",
      "**Feedback**\n",
      "The calculator tool is useful but has some limitations:\n",
      "\n",
      "1. Function naming: The name \"calculator\" is clear and descriptive.\n",
      "\n",
      "2. Input parameters: The \"expression\" parameter is straightforward, but there's no documentation about which mathematical operations and functions are supported.\n",
      "\n",
      "3. Supported operations: I encountered several errors with common mathematical operations:\n",
      "   - The caret symbol (^) for exponentiation didn't work; I had to use ** instead\n",
      "   - Built-in functions like 'round', 'int', and 'math' modules were not available\n",
      "\n",
      "4. Improvement suggestions:\n",
      "   - Provide documentation on which operators are supported (**, /, *, +, -, etc.)\n",
      "   - Include information about available mathematical functions or implement common ones like round()\n",
      "   - Add examples in the description showing proper syntax for exponentiation and other operations\n",
      "   - Consider implementing a parameter for specifying decimal precision in the result\n",
      "\n",
      "These improvements would reduce trial and error and make the tool more efficient to use.\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: Calculate the population standard deviation of this dataset: [12, 15, 18, 22, 25, 30, 35]. Round to 2 decimal places.\n",
      "**Ground Truth Response**: `7.61`\n",
      "**Actual Response**: `7.61`\n",
      "**Correct**: ✅\n",
      "**Duration**: 28.69s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 10,\n",
      "    \"durations\": [\n",
      "      8.7738037109375e-05,\n",
      "      8.344650268554688e-05,\n",
      "      9.584426879882812e-05,\n",
      "      8.487701416015625e-05,\n",
      "      0.00012683868408203125,\n",
      "      8.463859558105469e-05,\n",
      "      8.20159912109375e-05,\n",
      "      7.62939453125e-05,\n",
      "      7.939338684082031e-05,\n",
      "      8.535385131835938e-05\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "To calculate the population standard deviation of the dataset [12, 15, 18, 22, 25, 30, 35] rounded to 2 decimal places, I took the following steps:\n",
      "\n",
      "1. First, I calculated the mean of the dataset:\n",
      "   - Input: (12 + 15 + 18 + 22 + 25 + 30 + 35) / 7\n",
      "   - Output: 22.428571428571427\n",
      "\n",
      "2. Then I calculated the variance by:\n",
      "   - Finding the squared deviation of each value from the mean\n",
      "   - Summing these squared deviations\n",
      "   - Dividing by the number of values (7) since this is a population standard deviation\n",
      "   - Input: ((12-22.428571428571427)**2 + (15-22.428571428571427)**2 + (18-22.428571428571427)**2 + (22-22.428571428571427)**2 + (25-22.428571428571427)**2 + (30-22.428571428571427)**2 + (35-22.428571428571427)**2) / 7\n",
      "   - Output: 57.95918367346939\n",
      "\n",
      "3. I then calculated the standard deviation by taking the square root of the variance:\n",
      "   - Input: (57.95918367346939)**0.5\n",
      "   - Output: 7.61309291112813\n",
      "\n",
      "4. Finally, I rounded to 2 decimal places: 7.61\n",
      "   (I had to determine this manually as the calculator tool didn't support rounding functions)\n",
      "\n",
      "**Feedback**\n",
      "The calculator tool provided basic functionality but had significant limitations:\n",
      "\n",
      "1. Tool name: \"calculator\" is clear and descriptive, indicating its purpose well.\n",
      "\n",
      "2. Input parameters: The \"expression\" parameter is straightforward, but there's no description of what types of expressions are supported or the syntax to use.\n",
      "\n",
      "3. Description: The tool lacks a description of its capabilities and limitations. This would have been helpful to know in advance that functions like sum(), std(), round(), int(), and math library functions are not supported.\n",
      "\n",
      "4. Errors encountered: Several errors occurred when trying to use common mathematical functions. The calculator doesn't support:\n",
      "   - Statistical functions (std, sum)\n",
      "   - Rounding functions (round)\n",
      "   - Type conversion functions (int)\n",
      "   - Math library functions\n",
      "\n",
      "5. Areas for improvement:\n",
      "   - Add support for common mathematical and statistical functions like sum(), mean(), std(), round()\n",
      "   - Include a library of mathematical functions like math.floor(), math.ceil()\n",
      "   - Provide clear documentation on supported operations and syntax\n",
      "   - Allow for variable assignment and multi-line operations to simplify complex calculations\n",
      "   - Add specific statistical calculation tools for common operations like standard deviation\n",
      "\n",
      "These improvements would make the tool much more versatile and prevent the need for breaking down complex calculations into multiple basic arithmetic operations.\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: Calculate the pH of a solution with a hydrogen ion concentration of 3.5 × 10^-5 M. Round to 2 decimal places.\n",
      "**Ground Truth Response**: `4.46`\n",
      "**Actual Response**: `4.46`\n",
      "**Correct**: ✅\n",
      "**Duration**: 38.37s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 16,\n",
      "    \"durations\": [\n",
      "      8.726119995117188e-05,\n",
      "      8.940696716308594e-05,\n",
      "      9.322166442871094e-05,\n",
      "      8.702278137207031e-05,\n",
      "      0.00015282630920410156,\n",
      "      0.00010943412780761719,\n",
      "      0.00011801719665527344,\n",
      "      8.463859558105469e-05,\n",
      "      8.225440979003906e-05,\n",
      "      9.059906005859375e-05,\n",
      "      8.392333984375e-05,\n",
      "      8.988380432128906e-05,\n",
      "      0.00010824203491210938,\n",
      "      9.393692016601562e-05,\n",
      "      0.00010967254638671875,\n",
      "      0.000156402587890625\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "I attempted to calculate the pH of a solution with a hydrogen ion concentration of 3.5 × 10^-5 M.\n",
      "\n",
      "Steps taken:\n",
      "1. I tried various approaches to calculate pH using the calculator tool with different logarithm function notations (log10, ln, log), but encountered errors as these functions were not defined in the calculator tool.\n",
      "2. I successfully verified the value of 3.5 × 10^-5 using the calculator tool.\n",
      "3. Since direct logarithm calculations were not working, I switched to a manual calculation approach.\n",
      "4. I used the pH formula: pH = -log10([H+])\n",
      "5. I broke down the calculation: pH = -log10(3.5 × 10^-5) = -(log10(3.5) + log10(10^-5)) = -(log10(3.5) - 5)\n",
      "6. I used the known approximation that log10(3.5) ≈ 0.544\n",
      "7. I calculated: pH ≈ -(0.544 - 5) = 4.456\n",
      "8. I rounded the result to 2 decimal places: 4.46\n",
      "\n",
      "The calculator tool was used multiple times with different expressions, but had limitations with logarithmic functions.\n",
      "\n",
      "**Feedback**\n",
      "The calculator tool has several limitations:\n",
      "\n",
      "1. Tool name: \"calculator\" is clear and descriptive, accurately representing its basic function.\n",
      "\n",
      "2. Input parameters: The \"expression\" parameter is clear but lacks documentation. There's no information about what mathematical operations or functions are supported.\n",
      "\n",
      "3. Function support: The calculator doesn't support essential mathematical functions like logarithms (log, log10, ln), which are critical for many scientific calculations including pH. This significantly limits its utility for chemistry-related calculations.\n",
      "\n",
      "4. Error messages: The error messages indicate missing functions but don't provide alternatives or guidance on what syntax is supported.\n",
      "\n",
      "5. Documentation: There's no documentation about what mathematical libraries or syntax the calculator uses.\n",
      "\n",
      "Improvement suggestions:\n",
      "- Include support for common mathematical functions (log, exp, sqrt, etc.)\n",
      "- Add clear documentation about supported operations and functions\n",
      "- Implement specialized functions for common calculations (like pH)\n",
      "- Provide more helpful error messages that suggest correct syntax\n",
      "- Include examples of supported expressions in the tool description\n",
      "\n",
      "These improvements would make the calculator much more useful for scientific calculations and reduce the need for manual calculations or workarounds.\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: Calculate the monthly payment for a $200,000 mortgage at 4.5% annual interest rate for 30 years (360 months). Use the standard mortgage payment formula. Round to 2 decimal places.\n",
      "**Ground Truth Response**: `1013.37`\n",
      "**Actual Response**: `1013.37`\n",
      "**Correct**: ✅\n",
      "**Duration**: 19.65s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 6,\n",
      "    \"durations\": [\n",
      "      0.00011038780212402344,\n",
      "      0.0001671314239501953,\n",
      "      8.273124694824219e-05,\n",
      "      0.00010371208190917969,\n",
      "      8.702278137207031e-05,\n",
      "      8.726119995117188e-05\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "Steps taken to complete the task:\n",
      "1. I needed to calculate the monthly mortgage payment using the standard formula: P * (r * (1+r)^n) / ((1+r)^n - 1)\n",
      "   Where P = principal ($200,000), r = monthly interest rate (4.5%/12), n = number of payments (360)\n",
      "\n",
      "2. First, I attempted to use the calculator tool with the formula using ^ for exponentiation, but received an error as Python uses ** for exponentiation.\n",
      "\n",
      "3. I corrected the formula using ** for exponentiation and successfully calculated the monthly payment as $1,013.3706196517716.\n",
      "\n",
      "4. I attempted to round to 2 decimal places using various methods (round(), int(), math.floor()), but these functions were not available in the calculator tool.\n",
      "\n",
      "5. Since the built-in rounding functions weren't available, I manually rounded the result to $1,013.37 based on the calculated value.\n",
      "\n",
      "**Feedback**\n",
      "Calculator Tool Feedback:\n",
      "- Tool name: The name \"calculator\" is clear and descriptive, indicating its purpose well.\n",
      "- Input parameters: The \"expression\" parameter is clear but lacks description about syntax requirements or limitations.\n",
      "- Descriptions: The tool description is completely absent, which makes it difficult to understand what types of expressions are supported.\n",
      "- Errors encountered: The tool doesn't support common Python functions like round(), int(), or math module functions, which limits its utility for common mathematical operations.\n",
      "\n",
      "Areas for improvement:\n",
      "1. Add a clear description for the calculator tool explaining what syntax it supports and what libraries/functions are available.\n",
      "2. Include examples of supported operations in the documentation.\n",
      "3. Support common mathematical functions like round() and basic modules like math for more complex calculations.\n",
      "4. Provide better error messages that explain why an operation failed and suggest alternatives.\n",
      "5. Add documentation about what syntax to use for exponentiation and other special operations to avoid trial and error.\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: Calculate the energy in joules of a photon with wavelength 550 nanometers. Use h = 6.626 × 10^-34 J·s and c = 3 × 10^8 m/s. Express the answer in scientific notation with 2 significant figures after the decimal (e.g., 3.61e-19).\n",
      "**Ground Truth Response**: `3.61e-19`\n",
      "**Actual Response**: `3.61e-19`\n",
      "**Correct**: ✅\n",
      "**Duration**: 8.61s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 1,\n",
      "    \"durations\": [\n",
      "      8.845329284667969e-05\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "To calculate the energy of a photon with wavelength 550 nanometers:\n",
      "\n",
      "1. I identified the formula needed: E = hc/λ, where:\n",
      "   - E is the energy in joules\n",
      "   - h is Planck's constant (6.626 × 10^-34 J·s)\n",
      "   - c is the speed of light (3 × 10^8 m/s)\n",
      "   - λ is the wavelength (550 nm = 550 × 10^-9 m)\n",
      "\n",
      "2. I used the calculator tool with the expression: 6.626e-34 * 3e8 / (550e-9)\n",
      "   - Input: The mathematical expression with scientific notation\n",
      "   - Output: 3.614181818181818e-19 joules\n",
      "\n",
      "3. The result needs to be formatted with 2 significant figures after the decimal, so 3.61e-19 J.\n",
      "\n",
      "**Feedback**\n",
      "The calculator tool works well for this calculation:\n",
      "\n",
      "- Tool name: \"calculator\" is clear and describes its function well.\n",
      "- Input parameters: The single \"expression\" parameter is intuitive, though a brief description of acceptable syntax would be helpful.\n",
      "- Description: There's no actual description provided for the tool in the schema, which would be useful to explain capabilities and limitations.\n",
      "- Functionality: The tool handled scientific notation correctly and performed the calculation as expected.\n",
      "\n",
      "Improvement suggestion: Adding a brief description of the calculator's capabilities and acceptable syntax formats would help users understand how to properly format complex expressions, especially when dealing with scientific notation.\n",
      "\n",
      "---\n",
      "\n",
      "### Task\n",
      "\n",
      "**Prompt**: Find the larger real root of the quadratic equation 3x² - 7x + 2 = 0. Give the exact value.\n",
      "**Ground Truth Response**: `2`\n",
      "**Actual Response**: `2`\n",
      "**Correct**: ✅\n",
      "**Duration**: 16.37s\n",
      "**Tool Calls**: {\n",
      "  \"calculator\": {\n",
      "    \"count\": 4,\n",
      "    \"durations\": [\n",
      "      0.0001506805419921875,\n",
      "      9.179115295410156e-05,\n",
      "      0.000102996826171875,\n",
      "      9.870529174804688e-05\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\n",
      "**Summary**\n",
      "I solved the quadratic equation 3x² - 7x + 2 = 0 using the quadratic formula:\n",
      "x = (-b ± √(b² - 4ac))/(2a)\n",
      "\n",
      "Where a = 3, b = -7, c = 2\n",
      "\n",
      "Steps taken:\n",
      "1. First, I attempted to use the calculator with the \"sqrt\" function, but encountered an error.\n",
      "2. Then I tried using the exponentiation with \"^\" which also caused an error.\n",
      "3. Finally, I correctly used the \"**\" operator for exponentiation in the calculator tool.\n",
      "4. I calculated both roots using the quadratic formula:\n",
      "   - For the larger root: (-(-7) + ((-7)**2 - 4*3*2)**0.5)/(2*3) = 2.0\n",
      "   - For the smaller root: (-(-7) - ((-7)**2 - 4*3*2)**0.5)/(2*3) = 0.3333333333333333\n",
      "\n",
      "The larger real root is 2.\n",
      "\n",
      "**Feedback**\n",
      "The calculator tool is useful but has some limitations and areas for improvement:\n",
      "\n",
      "1. Tool name: \"calculator\" is clear and descriptive.\n",
      "2. Input parameters: The single \"expression\" parameter is straightforward, though it would be helpful to have documentation on the supported syntax.\n",
      "3. Description: The tool lacks a description of what operations are supported and what syntax to use. This caused my initial errors with sqrt() and the ^ operator.\n",
      "4. Syntax limitations: The calculator doesn't support common mathematical functions like \"sqrt\" directly, requiring the use of exponentiation (raising to power 0.5) instead. It also uses Python-style \"**\" for exponentiation rather than the more common \"^\" symbol.\n",
      "5. Error messages: The error messages were helpful in identifying the issues with my syntax.\n",
      "\n",
      "Improvement suggestions:\n",
      "- Add documentation explaining the supported operations and syntax\n",
      "- Support common mathematical functions like sqrt(), sin(), cos(), etc.\n",
      "- Consider accepting multiple syntax styles for common operations like exponentiation\n",
      "\n",
      "---\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Run evaluation\n",
    "print(\"✅ Using calculator tool\")\n",
    "\n",
    "report = run_evaluation(eval_path=\"evaluation.xml\", tools=tools)\n",
    "\n",
    "print(report)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}