claude-cookbooks/tool_use/extracting_structured_json.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extracting Structured JSON using Claude and Tool Use\n",
    "\n",
    "In this cookbook, we'll explore various examples of using Claude and the tool use feature to extract structured JSON data from different types of input. We'll define custom tools that prompt Claude to generate well-structured JSON output for tasks such as summarization, entity extraction, sentiment analysis, and more.\n",
    "\n",
    "If you want to get structured JSON data without using tools, take a look at our \"[How to enable JSON mode](https://github.com/anthropics/anthropic-cookbook/blob/main/misc/how_to_enable_json_mode.ipynb)\" cookbook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the environment\n",
    "\n",
    "First, let's install the required libraries and set up the Claude API client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install anthropic requests beautifulsoup4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "from anthropic import Anthropic\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import json\n",
    "\n",
    "client = Anthropic()\n",
    "MODEL_NAME = \"claude-3-haiku-20240307\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 1: Article Summarization\n",
    "\n",
    "In this example, we'll use Claude to generate a JSON summary of an article, including fields for the author, topics, summary, coherence score, and persuasion score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "JSON Summary:\n",
      "{\n",
      "  \"author\": \"Anthropic\",\n",
      "  \"topics\": [\n",
      "    \"AI policy\",\n",
      "    \"AI safety\",\n",
      "    \"third-party testing\"\n",
      "  ],\n",
      "  \"summary\": \"The article argues that the AI sector needs effective third-party testing for frontier AI systems to avoid societal harm, whether deliberate or accidental. It discusses what third-party testing looks like, why it's needed, and the research Anthropic has done to arrive at this policy position. The article states that such a testing regime is necessary because frontier AI systems like large-scale generative models don't fit neatly into use-case and sector-specific frameworks, and can pose risks of serious misuse or AI-caused accidents. Though Anthropic and other organizations have implemented self-governance systems, the article argues that industry-wide third-party testing is ultimately needed to be broadly trusted. The article outlines key components of an effective third-party testing regime, including identifying national security risks, and discusses how it could be accomplished by a diverse ecosystem of organizations. Anthropic plans to advocate for greater funding and public sector infrastructure for AI testing and evaluation, as well as developing tests for specific capabilities.\",\n",
      "  \"coherence\": 90,\n",
      "  \"persuasion\": 0.8\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "tools = [\n",
    "    {\n",
    "        \"name\": \"print_summary\",\n",
    "        \"description\": \"Prints a summary of the article.\",\n",
    "        \"input_schema\": {\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"author\": {\"type\": \"string\", \"description\": \"Name of the article author\"},\n",
    "                \"topics\": {\n",
    "                    \"type\": \"array\",\n",
    "                    \"items\": {\"type\": \"string\"},\n",
    "                    \"description\": 'Array of topics, e.g. [\"tech\", \"politics\"]. Should be as specific as possible, and can overlap.'\n",
    "                },\n",
    "                \"summary\": {\"type\": \"string\", \"description\": \"Summary of the article. One or two paragraphs max.\"},\n",
    "                \"coherence\": {\"type\": \"integer\", \"description\": \"Coherence of the article's key points, 0-100 (inclusive)\"},\n",
    "                \"persuasion\": {\"type\": \"number\", \"description\": \"Article's persuasion score, 0.0-1.0 (inclusive)\"}\n",
    "            },\n",
    "            \"required\": [\"author\", \"topics\", \"summary\", \"coherence\", \"persuasion\"]\n",
    "        }\n",
    "    }\n",
    "]\n",
    "\n",
    "url = \"https://www.anthropic.com/news/third-party-testing\"\n",
    "response = requests.get(url)\n",
    "soup = BeautifulSoup(response.text, \"html.parser\")\n",
    "article = \" \".join([p.text for p in soup.find_all(\"p\")])\n",
    "\n",
    "query = f\"\"\"\n",
    "<article>\n",
    "{article}\n",
    "</article>\n",
    "\n",
    "Use the `print_summary` tool.\n",
    "\"\"\"\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=MODEL_NAME,\n",
    "    max_tokens=4096,\n",
    "    tools=tools,\n",
    "    messages=[{\"role\": \"user\", \"content\": query}]\n",
    ")\n",
    "json_summary = None\n",
    "for content in response.content:\n",
    "    if content.type == \"tool_use\" and content.name == \"print_summary\":\n",
    "        json_summary = content.input\n",
    "        break\n",
    "\n",
    "if json_summary:\n",
    "    print(\"JSON Summary:\")\n",
    "    print(json.dumps(json_summary, indent=2))\n",
    "else:\n",
    "    print(\"No JSON summary found in the response.\")"
   ]
  },
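  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ranges for `coherence` and `persuasion` are stated only in the `description` fields, so it can be worth checking the returned values before relying on them. The cell below is a minimal sketch of such a check, assuming plain Python checks are sufficient for your use case. `validate_summary`, `SUMMARY_REQUIRED`, and `SUMMARY_RANGES` are hypothetical names, not part of the Anthropic SDK, and the sample dictionaries are made-up values shaped like the output above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not part of the Anthropic SDK): sanity-check a summary dict.\n",
    "SUMMARY_REQUIRED = [\"author\", \"topics\", \"summary\", \"coherence\", \"persuasion\"]\n",
    "SUMMARY_RANGES = {\"coherence\": (0, 100), \"persuasion\": (0.0, 1.0)}\n",
    "\n",
    "\n",
    "def validate_summary(summary):\n",
    "    \"\"\"Return a list of problems; an empty list means none were found.\"\"\"\n",
    "    problems = []\n",
    "    # Report every expected key that is absent.\n",
    "    for key in SUMMARY_REQUIRED:\n",
    "        if key not in summary:\n",
    "            problems.append(f\"missing key: {key}\")\n",
    "    # Report numeric values that fall outside the documented range.\n",
    "    for key, (low, high) in SUMMARY_RANGES.items():\n",
    "        value = summary.get(key)\n",
    "        if isinstance(value, (int, float)) and not low <= value <= high:\n",
    "            problems.append(f\"{key}={value} is outside [{low}, {high}]\")\n",
    "    return problems\n",
    "\n",
    "\n",
    "# Made-up sample values, shaped like the output above.\n",
    "example = {\"author\": \"Anthropic\", \"topics\": [\"AI policy\"], \"summary\": \"A short summary.\", \"coherence\": 90, \"persuasion\": 0.8}\n",
    "print(validate_summary(example))\n",
    "print(validate_summary({\"author\": \"Anthropic\", \"coherence\": 120}))"
   ]
  },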
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 2: Named Entity Recognition\n",
    "In this example, we'll use Claude to perform named entity recognition on a given text and return the entities in a structured JSON format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracted Entities (JSON):\n",
      "{'entities': [{'name': 'John', 'type': 'PERSON', 'context': 'John works at Google in New York.'}, {'name': 'Google', 'type': 'ORGANIZATION', 'context': 'John works at Google in New York.'}, {'name': 'New York', 'type': 'LOCATION', 'context': 'John works at Google in New York.'}, {'name': 'Sarah', 'type': 'PERSON', 'context': 'He met with Sarah, the CEO of Acme Inc., last week in San Francisco.'}, {'name': 'Acme Inc.', 'type': 'ORGANIZATION', 'context': 'He met with Sarah, the CEO of Acme Inc., last week in San Francisco.'}, {'name': 'San Francisco', 'type': 'LOCATION', 'context': 'He met with Sarah, the CEO of Acme Inc., last week in San Francisco.'}]}\n"
     ]
    }
   ],
   "source": [
    "tools = [\n",
    "    {\n",
    "        \"name\": \"print_entities\",\n",
    "        \"description\": \"Prints extracted named entities.\",\n",
    "        \"input_schema\": {\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"entities\": {\n",
    "                    \"type\": \"array\",\n",
    "                    \"items\": {\n",
    "                        \"type\": \"object\",\n",
    "                        \"properties\": {\n",
    "                            \"name\": {\"type\": \"string\", \"description\": \"The extracted entity name.\"},\n",
    "                            \"type\": {\"type\": \"string\", \"description\": \"The entity type (e.g., PERSON, ORGANIZATION, LOCATION).\"},\n",
    "                            \"context\": {\"type\": \"string\", \"description\": \"The context in which the entity appears in the text.\"}\n",
    "                        },\n",
    "                        \"required\": [\"name\", \"type\", \"context\"]\n",
    "                    }\n",
    "                }\n",
    "            },\n",
    "            \"required\": [\"entities\"]\n",
    "        }\n",
    "    }\n",
    "]\n",
    "\n",
    "text = \"John works at Google in New York. He met with Sarah, the CEO of Acme Inc., last week in San Francisco.\"\n",
    "\n",
    "query = f\"\"\"\n",
    "<document>\n",
    "{text}\n",
    "</document>\n",
    "\n",
    "Use the print_entities tool.\n",
    "\"\"\"\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=MODEL_NAME,\n",
    "    max_tokens=4096,\n",
    "    tools=tools,\n",
    "    messages=[{\"role\": \"user\", \"content\": query}]\n",
    ")\n",
    "\n",
    "json_entities = None\n",
    "for content in response.content:\n",
    "    if content.type == \"tool_use\" and content.name == \"print_entities\":\n",
    "        json_entities = content.input\n",
    "        break\n",
    "\n",
    "if json_entities:\n",
    "    print(\"Extracted Entities (JSON):\")\n",
    "    print(json_entities)\n",
    "else:\n",
    "    print(\"No entities found in the response.\")"
   ]
  },
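  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the tool input arrives as an ordinary Python dictionary, standard dictionary operations apply to it. As one illustration, the sketch below groups entity names by type. `group_entities_by_type` is a hypothetical helper, and `sample_entities` is a shortened copy of the output shown above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "def group_entities_by_type(tool_input):\n",
    "    \"\"\"Map each entity type to its entity names, in order of appearance.\"\"\"\n",
    "    grouped = defaultdict(list)\n",
    "    for entity in tool_input.get(\"entities\", []):\n",
    "        grouped[entity[\"type\"]].append(entity[\"name\"])\n",
    "    return dict(grouped)\n",
    "\n",
    "\n",
    "# Shortened copy of the output above, used so the cell runs without an API call.\n",
    "sample_entities = {\n",
    "    \"entities\": [\n",
    "        {\"name\": \"John\", \"type\": \"PERSON\", \"context\": \"John works at Google in New York.\"},\n",
    "        {\"name\": \"Google\", \"type\": \"ORGANIZATION\", \"context\": \"John works at Google in New York.\"},\n",
    "        {\"name\": \"Sarah\", \"type\": \"PERSON\", \"context\": \"He met with Sarah, the CEO of Acme Inc., last week in San Francisco.\"},\n",
    "    ]\n",
    "}\n",
    "print(group_entities_by_type(sample_entities))"
   ]
  },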
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 3: Sentiment Analysis\n",
    "In this example, we'll use Claude to perform sentiment analysis on a given text and return the sentiment scores in a structured JSON format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sentiment Analysis (JSON):\n",
      "{\n",
      "  \"negative_score\": 0.6,\n",
      "  \"neutral_score\": 0.3,\n",
      "  \"positive_score\": 0.1\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "tools = [\n",
    "    {\n",
    "        \"name\": \"print_sentiment_scores\",\n",
    "        \"description\": \"Prints the sentiment scores of a given text.\",\n",
    "        \"input_schema\": {\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"positive_score\": {\"type\": \"number\", \"description\": \"The positive sentiment score, ranging from 0.0 to 1.0.\"},\n",
    "                \"negative_score\": {\"type\": \"number\", \"description\": \"The negative sentiment score, ranging from 0.0 to 1.0.\"},\n",
    "                \"neutral_score\": {\"type\": \"number\", \"description\": \"The neutral sentiment score, ranging from 0.0 to 1.0.\"}\n",
    "            },\n",
    "            \"required\": [\"positive_score\", \"negative_score\", \"neutral_score\"]\n",
    "        }\n",
    "    }\n",
    "]\n",
    "\n",
    "text = \"The product was okay, but the customer service was terrible. I probably won't buy from them again.\"\n",
    "\n",
    "query = f\"\"\"\n",
    "<text>\n",
    "{text}\n",
    "</text>\n",
    "\n",
    "Use the print_sentiment_scores tool.\n",
    "\"\"\"\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=MODEL_NAME,\n",
    "    max_tokens=4096,\n",
    "    tools=tools,\n",
    "    messages=[{\"role\": \"user\", \"content\": query}]\n",
    ")\n",
    "\n",
    "json_sentiment = None\n",
    "for content in response.content:\n",
    "    if content.type == \"tool_use\" and content.name == \"print_sentiment_scores\":\n",
    "        json_sentiment = content.input\n",
    "        break\n",
    "\n",
    "if json_sentiment:\n",
    "    print(\"Sentiment Analysis (JSON):\")\n",
    "    print(json.dumps(json_sentiment, indent=2))\n",
    "else:\n",
    "    print(\"No sentiment analysis found in the response.\")"
   ]
  },
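  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common follow-up is to reduce the three scores to a single label. The sketch below simply picks the highest score. `dominant_sentiment` is a hypothetical helper, and it assumes all three keys are present, as the `required` list in the schema requests."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dominant_sentiment(scores):\n",
    "    \"\"\"Return 'positive', 'negative', or 'neutral' for the highest of the three scores.\"\"\"\n",
    "    labels = {\n",
    "        \"positive\": scores[\"positive_score\"],\n",
    "        \"negative\": scores[\"negative_score\"],\n",
    "        \"neutral\": scores[\"neutral_score\"],\n",
    "    }\n",
    "    # On a tie, max() returns the first label that reaches the highest score.\n",
    "    return max(labels, key=labels.get)\n",
    "\n",
    "\n",
    "# Same values as the output above.\n",
    "sample_scores = {\"negative_score\": 0.6, \"neutral_score\": 0.3, \"positive_score\": 0.1}\n",
    "print(dominant_sentiment(sample_scores))"
   ]
  },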
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 4: Text Classification\n",
    "In this example, we'll use Claude to classify a given text into predefined categories and return the classification results in a structured JSON format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text Classification (JSON):\n",
      "{\n",
      "  \"categories\": [\n",
      "    {\n",
      "      \"name\": \"Politics\",\n",
      "      \"score\": 0.1\n",
      "    },\n",
      "    {\n",
      "      \"name\": \"Sports\",\n",
      "      \"score\": 0.1\n",
      "    },\n",
      "    {\n",
      "      \"name\": \"Technology\",\n",
      "      \"score\": 0.7\n",
      "    },\n",
      "    {\n",
      "      \"name\": \"Entertainment\",\n",
      "      \"score\": 0.1\n",
      "    },\n",
      "    {\n",
      "      \"name\": \"Business\",\n",
      "      \"score\": 0.5\n",
      "    }\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "tools = [\n",
    "    {\n",
    "        \"name\": \"print_classification\",\n",
    "        \"description\": \"Prints the classification results.\",\n",
    "        \"input_schema\": {\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"categories\": {\n",
    "                    \"type\": \"array\",\n",
    "                    \"items\": {\n",
    "                        \"type\": \"object\",\n",
    "                        \"properties\": {\n",
    "                            \"name\": {\"type\": \"string\", \"description\": \"The category name.\"},\n",
    "                            \"score\": {\"type\": \"number\", \"description\": \"The classification score for the category, ranging from 0.0 to 1.0.\"}\n",
    "                        },\n",
    "                        \"required\": [\"name\", \"score\"]\n",
    "                    }\n",
    "                }\n",
    "            },\n",
    "            \"required\": [\"categories\"]\n",
    "        }\n",
    "    }\n",
    "]\n",
    "\n",
    "text = \"The new quantum computing breakthrough could revolutionize the tech industry.\"\n",
    "\n",
    "query = f\"\"\"\n",
    "<document>\n",
    "{text}\n",
    "</document>\n",
    "\n",
    "Use the print_classification tool. The categories can be Politics, Sports, Technology, Entertainment, Business.\n",
    "\"\"\"\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=MODEL_NAME,\n",
    "    max_tokens=4096,\n",
    "    tools=tools,\n",
    "    messages=[{\"role\": \"user\", \"content\": query}]\n",
    ")\n",
    "\n",
    "json_classification = None\n",
    "for content in response.content:\n",
    "    if content.type == \"tool_use\" and content.name == \"print_classification\":\n",
    "        json_classification = content.input\n",
    "        break\n",
    "\n",
    "if json_classification:\n",
    "    print(\"Text Classification (JSON):\")\n",
    "    print(json.dumps(json_classification, indent=2))\n",
    "else:\n",
    "    print(\"No text classification found in the response.\")"
   ]
  },
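  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the cell above, the allowed categories appear only in the prompt. An alternative worth trying is to also list them in the schema with the JSON Schema `enum` keyword. The sketch below only builds such a tool definition and makes no API call. `CATEGORIES`, `build_classification_tool`, and `enum_tools` are hypothetical names, and how closely the model follows the `enum` is not something this notebook verifies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "CATEGORIES = [\"Politics\", \"Sports\", \"Technology\", \"Entertainment\", \"Business\"]\n",
    "\n",
    "\n",
    "def build_classification_tool(categories):\n",
    "    \"\"\"Return a print_classification tool whose category names are limited by an enum.\"\"\"\n",
    "    return {\n",
    "        \"name\": \"print_classification\",\n",
    "        \"description\": \"Prints the classification results.\",\n",
    "        \"input_schema\": {\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"categories\": {\n",
    "                    \"type\": \"array\",\n",
    "                    \"items\": {\n",
    "                        \"type\": \"object\",\n",
    "                        \"properties\": {\n",
    "                            # Assumption: listing the names in an enum, in addition to the prompt.\n",
    "                            \"name\": {\"type\": \"string\", \"enum\": list(categories), \"description\": \"The category name.\"},\n",
    "                            \"score\": {\"type\": \"number\", \"description\": \"The classification score for the category, ranging from 0.0 to 1.0.\"},\n",
    "                        },\n",
    "                        \"required\": [\"name\", \"score\"],\n",
    "                    },\n",
    "                }\n",
    "            },\n",
    "            \"required\": [\"categories\"],\n",
    "        },\n",
    "    }\n",
    "\n",
    "\n",
    "enum_tools = [build_classification_tool(CATEGORIES)]\n",
    "name_schema = enum_tools[0][\"input_schema\"][\"properties\"][\"categories\"][\"items\"][\"properties\"][\"name\"]\n",
    "print(json.dumps(name_schema, indent=2))"
   ]
  },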
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 5: Working with unknown keys\n",
    "\n",
    "In some cases, you may not know the exact shape of the JSON object up front. In this example, we provide an open-ended `input_schema` and use the prompt to tell Claude how to interact with the tool."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Characteristics (JSON):\n",
      "{\n",
      "  \"height\": \"tall\",\n",
      "  \"facial_hair\": \"beard\",\n",
      "  \"facial_features\": \"scar on left cheek\",\n",
      "  \"voice\": \"deep voice\",\n",
      "  \"clothing\": \"black leather jacket\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "tools = [\n",
    "    {\n",
    "        \"name\": \"print_all_characteristics\",\n",
    "        \"description\": \"Prints all characteristics which are provided.\",\n",
    "        \"input_schema\": {\n",
    "            \"type\": \"object\",\n",
    "            \"additionalProperties\": True\n",
    "        }\n",
    "    }\n",
    "]\n",
    "\n",
    "query = f\"\"\"Given a description of a character, your task is to extract all the characteristics of the character and print them using the print_all_characteristics tool.\n",
    "\n",
    "The print_all_characteristics tool takes an arbitrary number of inputs where the key is the characteristic name and the value is the characteristic value (age: 28 or eye_color: green).\n",
    "\n",
    "<description>\n",
    "The man is tall, with a beard and a scar on his left cheek. He has a deep voice and wears a black leather jacket.\n",
    "</description>\n",
    "\n",
    "Now use the print_all_characteristics tool.\"\"\"\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=MODEL_NAME,\n",
    "    max_tokens=4096,\n",
    "    tools=tools,\n",
    "    tool_choice={\"type\": \"tool\", \"name\": \"print_all_characteristics\"},\n",
    "    messages=[{\"role\": \"user\", \"content\": query}]\n",
    ")\n",
    "\n",
    "tool_output = None\n",
    "for content in response.content:\n",
    "    if content.type == \"tool_use\" and content.name == \"print_all_characteristics\":\n",
    "        tool_output = content.input\n",
    "        break\n",
    "\n",
    "if tool_output:\n",
    "    print(\"Characteristics (JSON):\")\n",
    "    print(json.dumps(tool_output, indent=2))\n",
    "else:\n",
    "    print(\"Something went wrong.\")"
   ]
  },
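  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every example above repeats the same loop over `response.content`. The sketch below factors that loop into one function. `extract_tool_input` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for the SDK's content blocks so that the cell runs without an API call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from types import SimpleNamespace\n",
    "\n",
    "\n",
    "def extract_tool_input(response, tool_name):\n",
    "    \"\"\"Return the input of the first tool_use block named tool_name, or None if there is none.\"\"\"\n",
    "    for block in response.content:\n",
    "        if block.type == \"tool_use\" and block.name == tool_name:\n",
    "            return block.input\n",
    "    return None\n",
    "\n",
    "\n",
    "# Stand-in for an API response: one text block followed by one tool_use block.\n",
    "fake_response = SimpleNamespace(\n",
    "    content=[\n",
    "        SimpleNamespace(type=\"text\", text=\"Calling the tool.\"),\n",
    "        SimpleNamespace(type=\"tool_use\", name=\"print_all_characteristics\", input={\"height\": \"tall\"}),\n",
    "    ]\n",
    ")\n",
    "print(extract_tool_input(fake_response, \"print_all_characteristics\"))\n",
    "print(extract_tool_input(fake_response, \"print_summary\"))"
   ]
  },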
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These examples demonstrate how you can use Claude and the tool use feature to extract structured JSON data for various natural language processing tasks. By defining custom tools with specific input schemas, you can guide Claude to generate well-structured JSON output that can be easily parsed and utilized in your applications."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Anthropic Tools SDK",
   "language": "python",
   "name": "ant-tools-sdk"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}