claude-cookbooks/patterns/agents/evaluator_optimizer.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluator-Optimizer Workflow\n",
    "In this workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.\n",
    "\n",
    "### When to use this workflow\n",
    "This workflow is particularly effective when we have:\n",
    "\n",
    "- Clear evaluation criteria\n",
    "- Value from iterative refinement\n",
    "\n",
    "The two signs of good fit are:\n",
    "\n",
    "- LLM responses can be demonstrably improved when feedback is provided\n",
    "- The LLM can provide meaningful feedback itself"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from util import llm_call, extract_xml\n",
    "\n",
    "def generate(prompt: str, task: str, context: str = \"\") -> tuple[str, str]:\n",
    "    \"\"\"Generate and improve a solution based on feedback.\"\"\"\n",
    "    full_prompt = f\"{prompt}\\n{context}\\nTask: {task}\" if context else f\"{prompt}\\nTask: {task}\"\n",
    "    response = llm_call(full_prompt)\n",
    "    thoughts = extract_xml(response, \"thoughts\")\n",
    "    result = extract_xml(response, \"response\")\n",
    "    \n",
    "    print(\"\\n=== GENERATION START ===\")\n",
    "    print(f\"Thoughts:\\n{thoughts}\\n\")\n",
    "    print(f\"Generated:\\n{result}\")\n",
    "    print(\"=== GENERATION END ===\\n\")\n",
    "    \n",
    "    return thoughts, result\n",
    "\n",
    "def evaluate(prompt: str, content: str, task: str) -> tuple[str, str]:\n",
    "    \"\"\"Evaluate if a solution meets requirements.\"\"\"\n",
    "    full_prompt = f\"{prompt}\\nOriginal task: {task}\\nContent to evaluate: {content}\"\n",
    "    response = llm_call(full_prompt)\n",
    "    evaluation = extract_xml(response, \"evaluation\")\n",
    "    feedback = extract_xml(response, \"feedback\")\n",
    "    \n",
    "    print(\"=== EVALUATION START ===\")\n",
    "    print(f\"Status: {evaluation}\")\n",
    "    print(f\"Feedback: {feedback}\")\n",
    "    print(\"=== EVALUATION END ===\\n\")\n",
    "    \n",
    "    return evaluation, feedback\n",
    "\n",
    "def loop(task: str, evaluator_prompt: str, generator_prompt: str) -> tuple[str, list[dict]]:\n",
    "    \"\"\"Keep generating and evaluating until requirements are met.\"\"\"\n",
    "    memory = []\n",
    "    chain_of_thought = []\n",
    "    \n",
    "    thoughts, result = generate(generator_prompt, task)\n",
    "    memory.append(result)\n",
    "    chain_of_thought.append({\"thoughts\": thoughts, \"result\": result})\n",
    "    \n",
    "    while True:\n",
    "        evaluation, feedback = evaluate(evaluator_prompt, result, task)\n",
    "        if evaluation == \"PASS\":\n",
    "            return result, chain_of_thought\n",
    "            \n",
    "        context = \"\\n\".join([\n",
    "            \"Previous attempts:\",\n",
    "            *[f\"- {m}\" for m in memory],\n",
    "            f\"\\nFeedback: {feedback}\"\n",
    "        ])\n",
    "        \n",
    "        thoughts, result = generate(generator_prompt, task, context)\n",
    "        memory.append(result)\n",
    "        chain_of_thought.append({\"thoughts\": thoughts, \"result\": result})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example Use Case: Iterative coding loop\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "=== GENERATION START ===\n",
      "Thoughts:\n",
      "\n",
      "The task requires implementing a Stack with constant time operations including finding minimum. \n",
      "To achieve O(1) for getMin(), we need to maintain a second stack that keeps track of minimums.\n",
      "Each time we push, if the value is smaller than current min, we add it to minStack.\n",
      "When we pop, if the popped value equals current min, we also pop from minStack.\n",
      "\n",
      "\n",
      "Generated:\n",
      "\n",
      "```python\n",
      "class MinStack:\n",
      "    def __init__(self):\n",
      "        self.stack = []\n",
      "        self.minStack = []\n",
      "        \n",
      "    def push(self, x: int) -> None:\n",
      "        self.stack.append(x)\n",
      "        if not self.minStack or x <= self.minStack[-1]:\n",
      "            self.minStack.append(x)\n",
      "            \n",
      "    def pop(self) -> None:\n",
      "        if not self.stack:\n",
      "            return\n",
      "        if self.stack[-1] == self.minStack[-1]:\n",
      "            self.minStack.pop()\n",
      "        self.stack.pop()\n",
      "        \n",
      "    def getMin(self) -> int:\n",
      "        if not self.minStack:\n",
      "            return None\n",
      "        return self.minStack[-1]\n",
      "```\n",
      "\n",
      "=== GENERATION END ===\n",
      "\n",
      "=== EVALUATION START ===\n",
      "Status: NEEDS_IMPROVEMENT\n",
      "Feedback: \n",
      "While the implementation is generally correct and achieves O(1) time complexity for all operations, there are several areas for improvement:\n",
      "\n",
      "1. Error Handling:\n",
      "- pop() should raise an exception when stack is empty rather than silently returning\n",
      "- getMin() should raise an exception when stack is empty rather than returning None\n",
      "- These behaviors should be consistent with standard stack implementations\n",
      "\n",
      "2. Type Hints:\n",
      "- Return type hint for pop() should be None or void\n",
      "- Missing type hints for class variables stack and minStack\n",
      "\n",
      "3. Documentation:\n",
      "- Missing docstrings for class and methods\n",
      "- Missing parameter descriptions\n",
      "\n",
      "4. Edge Cases:\n",
      "- No explicit handling of invalid input types for push()\n",
      "\n",
      "The core algorithm is correct and efficient, but the implementation could be more robust and better documented for production use.\n",
      "\n",
      "=== EVALUATION END ===\n",
      "\n",
      "\n",
      "=== GENERATION START ===\n",
      "Thoughts:\n",
      "\n",
      "Based on the feedback, I'll improve the implementation by:\n",
      "1. Adding proper error handling with exceptions\n",
      "2. Including type hints and docstrings\n",
      "3. Adding input validation\n",
      "4. Maintaining O(1) time complexity for all operations\n",
      "\n",
      "\n",
      "Generated:\n",
      "\n",
      "```python\n",
      "from typing import List\n",
      "\n",
      "class MinStack:\n",
      "    \"\"\"A stack that supports push, pop, and getting minimum element in O(1) time.\"\"\"\n",
      "    \n",
      "    def __init__(self):\n",
      "        \"\"\"Initialize empty stack with two internal lists.\"\"\"\n",
      "        self.stack: List[int] = []\n",
      "        self.minStack: List[int] = []\n",
      "        \n",
      "    def push(self, x: int) -> None:\n",
      "        \"\"\"\n",
      "        Push element onto stack and update minimum stack.\n",
      "        \n",
      "        Args:\n",
      "            x: Integer to push onto stack\n",
      "            \n",
      "        Raises:\n",
      "            TypeError: If x is not an integer\n",
      "        \"\"\"\n",
      "        if not isinstance(x, int):\n",
      "            raise TypeError(\"Input must be an integer\")\n",
      "            \n",
      "        self.stack.append(x)\n",
      "        if not self.minStack or x <= self.minStack[-1]:\n",
      "            self.minStack.append(x)\n",
      "            \n",
      "    def pop(self) -> None:\n",
      "        \"\"\"\n",
      "        Remove and return top element from stack.\n",
      "        \n",
      "        Raises:\n",
      "            IndexError: If stack is empty\n",
      "        \"\"\"\n",
      "        if not self.stack:\n",
      "            raise IndexError(\"Cannot pop from empty stack\")\n",
      "            \n",
      "        if self.stack[-1] == self.minStack[-1]:\n",
      "            self.minStack.pop()\n",
      "        self.stack.pop()\n",
      "        \n",
      "    def getMin(self) -> int:\n",
      "        \"\"\"\n",
      "        Get minimum element in stack.\n",
      "        \n",
      "        Returns:\n",
      "            int: Minimum element in stack\n",
      "            \n",
      "        Raises:\n",
      "            IndexError: If stack is empty\n",
      "        \"\"\"\n",
      "        if not self.minStack:\n",
      "            raise IndexError(\"Cannot get minimum from empty stack\")\n",
      "        return self.minStack[-1]\n",
      "```\n",
      "\n",
      "=== GENERATION END ===\n",
      "\n"
     ]
    }
   ],
   "source": [
    "evaluator_prompt = \"\"\"\n",
    "Evaluate this following code implementation for:\n",
    "1. code correctness\n",
    "2. time complexity\n",
    "3. style and best practices\n",
    "\n",
    "You should be evaluating only and not attemping to solve the task.\n",
    "Only output \"PASS\" if all criteria are met and you have no further suggestions for improvements.\n",
    "Output your evaluation concisely in the following format.\n",
    "\n",
    "<evaluation>PASS, NEEDS_IMPROVEMENT, or FAIL</evaluation>\n",
    "<feedback>\n",
    "What needs improvement and why.\n",
    "</feedback>\n",
    "\"\"\"\n",
    "\n",
    "generator_prompt = \"\"\"\n",
    "Your goal is to complete the task based on <user input>. If there are feedback \n",
    "from your previous generations, you should reflect on them to improve your solution\n",
    "\n",
    "Output your answer concisely in the following format: \n",
    "\n",
    "<thoughts>\n",
    "[Your understanding of the task and feedback and how you plan to improve]\n",
    "</thoughts>\n",
    "\n",
    "<response>\n",
    "[Your code implementation here]\n",
    "</response>\n",
    "\"\"\"\n",
    "\n",
    "task = \"\"\"\n",
    "<user input>\n",
    "Implement a Stack with:\n",
    "1. push(x)\n",
    "2. pop()\n",
    "3. getMin()\n",
    "All operations should be O(1).\n",
    "</user input>\n",
    "\"\"\"\n",
    "\n",
    "loop(task, evaluator_prompt, generator_prompt)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py311",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}