claude-cookbooks/misc/using_citations.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Citations \n",
    "\n",
    "The Anthropic API features citation support that enables Claude to provide detailed citations when answering questions about documents. Citations are a valuable affordance in many LLM powered applications to help users track and verify the sources of information in responses.\n",
    "\n",
    "Citations are supported on:\n",
    "* `claude-3-5-sonnet-20241022`\n",
    "* `claude-3-5-haiku-20241022`\n",
    "\n",
    "The citations feature is an alternative to prompt-based citation techniques. Using this featue has the following advantages:\n",
    "- Prompt-based techniques often require Claude to output full quotes from the source document it intends to cite. This increases output tokens and therefore cost.\n",
    "- The citation feature will not return citations pointing to documents or locations that were not provided as valid sources.\n",
    "- While testing we found the citation feature to generate citations with higher recall and percision than prompt based techniques.\n",
    "\n",
    "The documentation for citations can be found [here](https://docs.anthropic.com/en/docs/build-with-claude/citations)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "First, let's install the required libraries and initalize our Anthropic client. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install anthropic  --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import anthropic\n",
    "import os\n",
    "import json\n",
    "\n",
    "ANTHROPIC_API_KEY = os.environ.get(\"ANTHROPIC_API_KEY\")\n",
    "# ANTHROPIC_API_KEY = \"\" # Put your API key here!\n",
    "\n",
    "client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Document Types\n",
    "\n",
    "Citations support three different document types. The type of citation outputted depends on the type of document being cited from:\n",
    "\n",
    "* Plain text document citation → char location format\n",
    "* PDF document citation → page location format\n",
    "* Custom content document citation → content block location format\n",
    "\n",
    "We will explore working with each of these in the examples below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plain Text Documents\n",
    "\n",
    "With plain text document citations you provide your document as raw text to the model. You can provide one or multiple documents. This text will get automatically chunked into sentences. The model will cite these sentences as appropriate. The model is able to cite multiple sentences together at once in a single citation but will not cite text smaller than a sentence.\n",
    "\n",
    "Along with the outputted text the API response will include structured data for all citations. \n",
    "\n",
    "Let's see a complete example using a help center customer chatbot for a made up company PetWorld."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "================================================================================\n",
      "Raw response:\n",
      "================================================================================\n",
      "{\n",
      "  \"blocks\": [\n",
      "    {\n",
      "      \"text\": \"Based on the documentation, I can explain why you don't see tracking yet: \"\n",
      "    },\n",
      "    {\n",
      "      \"text\": \"You'll receive an email with your tracking number once your order ships. If you don't receive a tracking number within 48 hours of your order confirmation, please contact our customer support team for assistance.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"cited_text\": \"Once your order ships, you'll receive an email with a tracking number. \",\n",
      "          \"start_char_index\": 0,\n",
      "          \"end_char_index\": 71,\n",
      "          \"document_title\": \"Order Tracking Information\"\n",
      "        },\n",
      "        {\n",
      "          \"cited_text\": \"If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\",\n",
      "          \"start_char_index\": 398,\n",
      "          \"end_char_index\": 525,\n",
      "          \"document_title\": \"Order Tracking Information\"\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"text\": \"\\n\\nSince you just checked out, your order likely hasn't shipped yet. Once it ships, you'll receive the tracking information via email.\"\n",
      "    }\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "# Read all help center articles and create a list of documents\n",
    "articles_dir = './data/help_center_articles'\n",
    "documents = []\n",
    "\n",
    "for filename in sorted(os.listdir(articles_dir)):\n",
    "    if filename.endswith('.txt'):\n",
    "        with open(os.path.join(articles_dir, filename), 'r') as f:\n",
    "            content = f.read()\n",
    "            # Split into title and body\n",
    "            title_line, body = content.split('\\n', 1)\n",
    "            title = title_line.replace('title: ', '')\n",
    "            documents.append({\n",
    "                \"type\": \"document\",\n",
    "                \"source\": {\n",
    "                    \"type\": \"text\",\n",
    "                    \"media_type\": \"text/plain\",\n",
    "                    \"data\": body\n",
    "                },\n",
    "                \"title\": title,\n",
    "                \"citations\": {\"enabled\": True}\n",
    "            })\n",
    "\n",
    "QUESTION = \"I just checked out, where is my order tracking number? Track package is not available on the website yet for my order.\"\n",
    "\n",
    "# Add the question to the content\n",
    "content = documents \n",
    "\n",
    "response = client.messages.create(\n",
    "    model=\"claude-3-5-sonnet-latest\",\n",
    "    temperature=0.0,\n",
    "    max_tokens=1024,\n",
    "    system='You are a customer support bot working for PetWorld. Your task is to provide short, helpful answers to user questions. Since you are in a chat interface avoid providing extra details. You will be given access to PetWorld\\'s help center articles to help you answer questions.',\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": documents\n",
    "        },\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [{\"type\": \"text\", \"text\": f'Here is the user\\'s question: {QUESTION}'}]\n",
    "        },\n",
    "\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(\"\\n\" + \"=\"*80 + \"\\nRaw response:\\n\" + \"=\"*80)\n",
    "raw_response = {\n",
    "    \"blocks\": []\n",
    "}\n",
    "\n",
    "for content in response.content:\n",
    "    if content.type == \"text\":\n",
    "        block = {\n",
    "            \"text\": content.text,\n",
    "        }\n",
    "        if hasattr(content, 'citations') and content.citations:\n",
    "            block[\"citations\"] = [\n",
    "                {\n",
    "                    \"type\": c.type,\n",
    "                    \"cited_text\": c.cited_text,\n",
    "                    \"document_index\": c.document_index,\n",
    "                    \"document_title\": c.document_title,\n",
    "                    \"start_char_index\": c.start_char_index,\n",
    "                    \"end_char_index\": c.end_char_index\n",
    "                } for c in content.citations\n",
    "            ]\n",
    "        raw_response[\"blocks\"].append(block)\n",
    "\n",
    "print(json.dumps(raw_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Visualizing Citations\n",
    "By leveraging the citation data, we can create UIs that:\n",
    "\n",
    "1. Show users exactly where information comes from\n",
    "2. Link directly to source documents\n",
    "3. Highlight cited text in context\n",
    "4. Build trust through transparent sourcing\n",
    "\n",
    "Below is a simple visualization function that transforms Claude's structured citations into a readable format with numbered references, similar to academic papers.\n",
    "\n",
    "The function takes Claude's response object and outputs:\n",
    "- Text with numbered citation markers (e.g., \"The answer [1] includes this fact [2]\")\n",
    "- A numbered reference list showing each cited text and its source document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Based on the documentation, I can explain why you don't see tracking yet: You'll receive an email with your tracking number once your order ships. If you don't receive a tracking number within 48 hours of your order confirmation, please contact our customer support team for assistance. [1] [2]\n",
      "\n",
      "Since you just checked out, your order likely hasn't shipped yet. Once it ships, you'll receive the tracking information via email.\n",
      "\n",
      "[1] \"Once your order ships, you'll receive an email with a tracking number.\" found in \"Order Tracking Information\"\n",
      "[2] \"If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\" found in \"Order Tracking Information\"\n"
     ]
    }
   ],
   "source": [
    "def visualize_citations(response):\n",
    "    \"\"\"\n",
    "    Takes a response object and returns a string with numbered citations.\n",
    "    Example output: \"here is the plain text answer [1][2] here is some more text [3]\"\n",
    "    with a list of citations below.\n",
    "    \"\"\"\n",
    "    # Dictionary to store unique citations\n",
    "    citations_dict = {}\n",
    "    citation_counter = 1\n",
    "    \n",
    "    # Final formatted text\n",
    "    formatted_text = \"\"\n",
    "    citations_list = []\n",
    "    \n",
    "    for content in response.content:\n",
    "        if content.type == \"text\":\n",
    "            text = content.text\n",
    "            if hasattr(content, 'citations') and content.citations:\n",
    "                # Sort citations by their appearance in the text\n",
    "                sorted_citations = sorted(content.citations, \n",
    "                                         key=lambda x: x.start_char_index)\n",
    "                \n",
    "                # Process each citation\n",
    "                for citation in sorted_citations:\n",
    "                    doc_title = citation.document_title\n",
    "                    cited_text = citation.cited_text.replace('\\n', ' ').replace('\\r', ' ')\n",
    "                    # Remove any multiple spaces that might have been created\n",
    "                    cited_text = ' '.join(cited_text.split())\n",
    "                    \n",
    "                    # Create a unique key for this citation\n",
    "                    citation_key = f\"{doc_title}:{cited_text}\"\n",
    "                    \n",
    "                    # If this is a new citation, add it to our dictionary\n",
    "                    if citation_key not in citations_dict:\n",
    "                        citations_dict[citation_key] = citation_counter\n",
    "                        citations_list.append(f\"[{citation_counter}] \\\"{cited_text}\\\" found in \\\"{doc_title}\\\"\")\n",
    "                        citation_counter += 1\n",
    "                    \n",
    "                    # Add the citation number to the text\n",
    "                    citation_num = citations_dict[citation_key]\n",
    "                    text += f\" [{citation_num}]\"\n",
    "            \n",
    "            formatted_text += text\n",
    "    \n",
    "    # Combine the formatted text with the citations list\n",
    "    final_output = formatted_text + \"\\n\\n\" + \"\\n\".join(citations_list)\n",
    "    return final_output\n",
    "\n",
    "formatted_response = visualize_citations(response)\n",
    "print(formatted_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PDF Documents\n",
    "\n",
    "When working with PDFs, Claude can provide citations that reference specific page numbers, making it easy to track information sources. Here's how PDF citations work:\n",
    "\n",
    "- PDF document content is provided as base64-encoded data\n",
    "- Text is automatically chunked into sentences\n",
    "- Citations include page numbers (1-indexed) where the information was found\n",
    "- The model can cite multiple sentences together in a single citation but won't cite text smaller than a sentence\n",
    "- While images are processed, only text content can be cited at this time\n",
    "\n",
    "Below is an example using the Constitutional AI paper to demonstrate PDF citations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "================================================================================\n",
      "Raw response:\n",
      "================================================================================\n",
      "{\n",
      "  \"content\": [\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"Based on the paper, here are the key aspects of Constitutional AI (CAI):\\n\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"Constitutional AI is a method for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, which is why it's called 'Constitutional' AI.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"page_location\",\n",
      "          \"cited_text\": \"We experiment with methods for training a harmless AI assistant through self\\u0002improvement, without any human labels identifying harmful outputs. The only human\\r\\noversight is provided through a list of rules or principles, and so we refer to the method as\\r\\n\\u2018Constitutional AI\\u2019. \",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Constitutional AI Paper\",\n",
      "          \"start_page_number\": 1,\n",
      "          \"end_page_number\": 2\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"\\n\\nThe process involves two main stages:\\n\\n1. Supervised Learning Phase:\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"In this phase, they sample from an initial model, generate self-critiques and revisions, and then finetune the original model on revised responses.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"page_location\",\n",
      "          \"cited_text\": \"In the supervised phase we sample from an initial model, then generate\\r\\nself-critiques and revisions, and then finetune the original model on revised responses. \",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Constitutional AI Paper\",\n",
      "          \"start_page_number\": 1,\n",
      "          \"end_page_number\": 2\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"\\n\\n2. Reinforcement Learning Phase:\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"In this phase, they:\\n- Sample from the finetuned model\\n- Use a model to evaluate which of two samples is better\\n- Train a preference model from this dataset of AI preferences\\n- Use \\\"RL from AI Feedback\\\" (RLAIF)\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"page_location\",\n",
      "          \"cited_text\": \"In\\r\\nthe RL phase, we sample from the finetuned model, use a model to evaluate which of the\\r\\ntwo samples is better, and then train a preference model from this dataset of AI prefer\\u0002ences. We then train with RL using the preference model as the reward signal, i.e. we\\r\\nuse \\u2018RL from AI Feedback\\u2019 (RLAIF). \",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Constitutional AI Paper\",\n",
      "          \"start_page_number\": 1,\n",
      "          \"end_page_number\": 2\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"\\n\\nThe key outcomes are:\\n\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"- They are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them\\n- Both the SL and RL methods can leverage chain-of-thought style reasoning to improve human-judged performance and transparency of AI decision making\\n- These methods make it possible to control AI behavior more precisely and with far fewer human labels\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"page_location\",\n",
      "          \"cited_text\": \"As a result we are able to train a harmless but non\\u0002evasive AI assistant that engages with harmful queries by explaining its objections to them.\\r\\nBoth the SL and RL methods can leverage chain-of-thought style reasoning to improve the\\r\\nhuman-judged performance and transparency of AI decision making. These methods make\\r\\nit possible to control AI behavior more precisely and with far fewer human labels.\\r\\n\",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Constitutional AI Paper\",\n",
      "          \"start_page_number\": 1,\n",
      "          \"end_page_number\": 2\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"\\n\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"The ultimate goal is not to completely remove human supervision, but rather to make it more efficient, transparent and targeted. While this work reduces reliance on human supervision for harmlessness, they still relied on human supervision in the form of helpfulness labels. They expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting, but leave this for future work.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"page_location\",\n",
      "          \"cited_text\": \"By removing human feedback labels for harmlessness, we have moved further away from reliance on human\\r\\nsupervision, and closer to the possibility of a self-supervised approach to alignment. However, in this work\\r\\nwe still relied on human supervision in the form of helpfulness labels. We expect it is possible to achieve help\\u0002fulness and instruction-following without human feedback, starting from only a pretrained LM and extensive\\r\\nprompting, but we leave this for future work.\\r\\nOur ultimate goal is not to remove human supervision entirely, but to make it more efficient, transparent, and\\r\\ntargeted. \",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Constitutional AI Paper\",\n",
      "          \"start_page_number\": 15,\n",
      "          \"end_page_number\": 16\n",
      "        }\n",
      "      ]\n",
      "    }\n",
      "  ]\n",
      "}\n",
      "\n",
      "================================================================================\n",
      "Formatted response:\n",
      "================================================================================\n",
      "Based on the paper, here are the key aspects of Constitutional AI (CAI):\n",
      "\n",
      "Constitutional AI is a method for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, which is why it's called 'Constitutional' AI. [1]\n",
      "\n",
      "The process involves two main stages:\n",
      "\n",
      "1. Supervised Learning Phase:\n",
      "In this phase, they sample from an initial model, generate self-critiques and revisions, and then finetune the original model on revised responses. [2]\n",
      "\n",
      "2. Reinforcement Learning Phase:\n",
      "In this phase, they:\n",
      "- Sample from the finetuned model\n",
      "- Use a model to evaluate which of two samples is better\n",
      "- Train a preference model from this dataset of AI preferences\n",
      "- Use \"RL from AI Feedback\" (RLAIF) [3]\n",
      "\n",
      "The key outcomes are:\n",
      "\n",
      "- They are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them\n",
      "- Both the SL and RL methods can leverage chain-of-thought style reasoning to improve human-judged performance and transparency of AI decision making\n",
      "- These methods make it possible to control AI behavior more precisely and with far fewer human labels [4]\n",
      "\n",
      "The ultimate goal is not to completely remove human supervision, but rather to make it more efficient, transparent and targeted. While this work reduces reliance on human supervision for harmlessness, they still relied on human supervision in the form of helpfulness labels. They expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting, but leave this for future work. [5]\n",
      "\n",
      "[1] \"We experiment with methods for training a harmless AI assistant through self\u0002improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.\" found in \"Constitutional AI Paper\"\n",
      "[2] \"In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses.\" found in \"Constitutional AI Paper\"\n",
      "[3] \"In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI prefer\u0002ences. We then train with RL using the preference model as the reward signal, i.e. we use ‘RL from AI Feedback’ (RLAIF).\" found in \"Constitutional AI Paper\"\n",
      "[4] \"As a result we are able to train a harmless but non\u0002evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.\" found in \"Constitutional AI Paper\"\n",
      "[5] \"By removing human feedback labels for harmlessness, we have moved further away from reliance on human supervision, and closer to the possibility of a self-supervised approach to alignment. However, in this work we still relied on human supervision in the form of helpfulness labels. We expect it is possible to achieve help\u0002fulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting, but we leave this for future work. Our ultimate goal is not to remove human supervision entirely, but to make it more efficient, transparent, and targeted.\" found in \"Constitutional AI Paper\"\n"
     ]
    }
   ],
   "source": [
    "import base64\n",
    "import json\n",
    "\n",
    "# Read and encode the PDF\n",
    "pdf_path = 'data/Constitutional AI.pdf'\n",
    "with open(pdf_path, \"rb\") as f:\n",
    "    pdf_data = base64.b64encode(f.read()).decode()\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=\"claude-3-5-sonnet-latest\",\n",
    "    temperature=0.0,\n",
    "    max_tokens=1024,\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\n",
    "                    \"type\": \"document\",\n",
    "                    \"source\": {\n",
    "                        \"type\": \"base64\",\n",
    "                        \"media_type\": \"application/pdf\",\n",
    "                        \"data\": pdf_data\n",
    "                    },\n",
    "                    \"title\": \"Constitutional AI Paper\",\n",
    "                    \"citations\": {\"enabled\": True}\n",
    "                },\n",
    "                {\n",
    "                    \"type\": \"text\",\n",
    "                    \"text\": \"What is the main idea of Constitutional AI?\"\n",
    "                }\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(\"\\n\" + \"=\"*80 + \"\\nRaw response:\\n\" + \"=\"*80)\n",
    "\n",
    "# Convert response to a dictionary format\n",
    "raw_response = {\"content\": []}\n",
    "\n",
    "for content in response.content:\n",
    "    if content.type == \"text\":\n",
    "        block = {\n",
    "            \"type\": \"text\",\n",
    "            \"text\": content.text\n",
    "        }\n",
    "        if hasattr(content, 'citations') and content.citations:\n",
    "            block[\"citations\"] = [\n",
    "                {\n",
    "                    \"type\": c.type,\n",
    "                    \"cited_text\": c.cited_text,\n",
    "                    \"document_index\": c.document_index,\n",
    "                    \"document_title\": c.document_title,\n",
    "                    \"start_char_index\": c.start_char_index,\n",
    "                    \"end_char_index\": c.end_char_index\n",
    "                } for c in content.citations\n",
    "            ]\n",
    "        raw_response[\"content\"].append(block)\n",
    "\n",
    "print(json.dumps(raw_response, indent=2))\n",
    "formatted_response = visualize_citations(response)\n",
    "print(\"\\n\" + \"=\"*80 + \"\\nFormatted response:\\n\" + \"=\"*80)\n",
    "print(formatted_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom Content Documents\n",
    "\n",
    "While plain text documents are automatically chunked into sentences, custom content documents give you complete control over citation granularity. This API shape allows you to:\n",
    "\n",
    "* Define your own chunks of any size\n",
    "* Control the minimum citation unit\n",
    "* Optimize for documents that don't work well with sentence chunking\n",
    "\n",
    "In the example below, we use the same help center articles as the plain text example above, but instead of allowing sentence-level citations, we'll treat each article as a single chunk. This demonstrates how the choice of document type affects citation behavior and granularity. You will notice that the `cited_text` is the entire article in contrast to a sentence from the source article."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "================================================================================\n",
      "Raw response:\n",
      "================================================================================\n",
      "{\n",
      "  \"content\": [\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"You should receive an email with your tracking number once your order ships. If it's been less than 48 hours since your order confirmation, please wait as the tracking number may not be available yet. If you haven't received a tracking number after 48 hours, please contact our customer support team for assistance.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"content_block_location\",\n",
      "          \"cited_text\": \"Once your order ships, you'll receive an email with a tracking number. To track your package, log in to your PetWorld account and go to \\\"Order History.\\\" Click on the order you want to track and select \\\"Track Package.\\\" This will show you the current status and estimated delivery date. You can also enter the tracking number directly on our shipping partner's website for more detailed information. If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\",\n",
      "          \"document_index\": 3,\n",
      "          \"document_title\": \"Order Tracking Information\",\n",
      "          \"start_block_index\": 0,\n",
      "          \"end_block_index\": 1\n",
      "        }\n",
      "      ]\n",
      "    }\n",
      "  ]\n",
      "}\n",
      "\n",
      "================================================================================\n",
      "Formatted response:\n",
      "================================================================================\n",
      "You should receive an email with your tracking number once your order ships. If it's been less than 48 hours since your order confirmation, please wait as the tracking number may not be available yet. If you haven't received a tracking number after 48 hours, please contact our customer support team for assistance. [1]\n",
      "\n",
      "[1] \"Once your order ships, you'll receive an email with a tracking number. To track your package, log in to your PetWorld account and go to \"Order History.\" Click on the order you want to track and select \"Track Package.\" This will show you the current status and estimated delivery date. You can also enter the tracking number directly on our shipping partner's website for more detailed information. If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\" found in \"Order Tracking Information\"\n"
     ]
    }
   ],
   "source": [
    "# Read all help center articles and create a list of custom content documents\n",
    "articles_dir = './data/help_center_articles'\n",
    "documents = []\n",
    "\n",
    "for filename in sorted(os.listdir(articles_dir)):\n",
    "    if filename.endswith('.txt'):\n",
    "        with open(os.path.join(articles_dir, filename), 'r') as f:\n",
    "            content = f.read()\n",
    "            # Split into title and body\n",
    "            title_line, body = content.split('\\n', 1)\n",
    "            title = title_line.replace('title: ', '')\n",
    "            \n",
    "            documents.append({\n",
    "                \"type\": \"document\",\n",
    "                \"source\": {\n",
    "                    \"type\": \"content\",\n",
    "                    \"content\": [\n",
    "                        {\"type\": \"text\", \"text\": body}\n",
    "                    ]\n",
    "                },\n",
    "                \"title\": title,\n",
    "                \"citations\": {\"enabled\": True}\n",
    "            })\n",
    "\n",
    "QUESTION = \"I just checked out, where is my order tracking number? Track package is not available on the website yet for my order.\"\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=\"claude-3-5-sonnet-latest\",\n",
    "    temperature=0.0,\n",
    "    max_tokens=1024,\n",
    "    system='You are a customer support bot working for PetWorld. Your task is to provide short, helpful answers to user questions. Since you are in a chat interface avoid providing extra details. You will be given access to PetWorld\\'s help center articles to help you answer questions.',\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": documents\n",
    "        },\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [{\"type\": \"text\", \"text\": f'Here is the user\\'s question: {QUESTION}'}]\n",
    "        }\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(\"\\n\" + \"=\"*80 + \"\\nRaw response:\\n\" + \"=\"*80)\n",
    "raw_response = {\n",
    "    \"content\": []\n",
    "}\n",
    "\n",
    "for content in response.content:\n",
    "    if content.type == \"text\":\n",
    "        block = {\n",
    "            \"type\": \"text\",\n",
    "            \"text\": content.text\n",
    "        }\n",
    "        if hasattr(content, 'citations') and content.citations:\n",
    "            block[\"citations\"] = [\n",
    "                {\n",
    "                    \"type\": c.type,\n",
    "                    \"cited_text\": c.cited_text,\n",
    "                    \"document_index\": c.document_index,\n",
    "                    \"document_title\": c.document_title,\n",
    "                    \"start_char_index\": c.start_char_index,\n",
    "                    \"end_char_index\": c.end_char_index\n",
    "                } for c in content.citations\n",
    "            ]\n",
    "        raw_response[\"content\"].append(block)\n",
    "\n",
    "print(json.dumps(raw_response, indent=2))\n",
    "formatted_response = visualize_citations(response)\n",
    "print(\"\\n\" + \"=\"*80 + \"\\nFormatted response:\\n\" + \"=\"*80)\n",
    "print(formatted_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using the Context Field\n",
    "\n",
    "The `context` field allows you to provide additional information about a document that Claude can use when generating responses, but that won't be cited. This is useful for:\n",
    "\n",
    "* Providing metadata about the document (e.g., publication date, author)\n",
    "* [Contextual retrieval](https://www.anthropic.com/news/contextual-retrieval)\n",
    "* Including usage instructions or context that shouldn't be directly cited\n",
    "\n",
    "In the example below, we provide a loyalty program article with a warning in the context field. Notice how Claude can use the information in the context to inform its response but the context field content is not available for citation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "================================================================================\n",
      "Raw response:\n",
      "================================================================================\n",
      "{\n",
      "  \"content\": [\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"Let me explain PetWorld's loyalty program:\\n\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"The program works by awarding 1 point for every dollar you spend at PetWorld. Once you collect 100 points, you'll receive a $5 reward that you can redeem on your next purchase.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"char_location\",\n",
      "          \"cited_text\": \"PetWorld offers a loyalty program where customers earn 1 point for every dollar spent. Once you accumulate 100 points, you'll receive a $5 reward that can be used on your next purchase. \",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Loyalty Program Details\",\n",
      "          \"start_char_index\": 0,\n",
      "          \"end_char_index\": 186\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"\\n\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"Points have an expiration period of 12 months from the date they are earned.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"char_location\",\n",
      "          \"cited_text\": \"Points expire 12 months after they are earned. \",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Loyalty Program Details\",\n",
      "          \"start_char_index\": 186,\n",
      "          \"end_char_index\": 233\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"\\n\\n\"\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"You can easily monitor your point balance by either checking your account dashboard or contacting customer service.\",\n",
      "      \"citations\": [\n",
      "        {\n",
      "          \"type\": \"char_location\",\n",
      "          \"cited_text\": \"You can check your point balance in your account dashboard or by asking customer service.\",\n",
      "          \"document_index\": 0,\n",
      "          \"document_title\": \"Loyalty Program Details\",\n",
      "          \"start_char_index\": 233,\n",
      "          \"end_char_index\": 322\n",
      "        }\n",
      "      ]\n",
      "    },\n",
      "    {\n",
      "      \"type\": \"text\",\n",
      "      \"text\": \"\\n\\nPlease note that this information comes from an article that hasn't been updated in 12 months, so some details may have changed. I recommend verifying the current terms with PetWorld directly.\"\n",
      "    }\n",
      "  ]\n",
      "}\n",
      "\n",
      "================================================================================\n",
      "Formatted response:\n",
      "================================================================================\n",
      "Let me explain PetWorld's loyalty program:\n",
      "\n",
      "The program works by awarding 1 point for every dollar you spend at PetWorld. Once you collect 100 points, you'll receive a $5 reward that you can redeem on your next purchase. [1]\n",
      "\n",
      "Points have an expiration period of 12 months from the date they are earned. [2]\n",
      "\n",
      "You can easily monitor your point balance by either checking your account dashboard or contacting customer service. [3]\n",
      "\n",
      "Please note that this information comes from an article that hasn't been updated in 12 months, so some details may have changed. I recommend verifying the current terms with PetWorld directly.\n",
      "\n",
      "[1] \"PetWorld offers a loyalty program where customers earn 1 point for every dollar spent. Once you accumulate 100 points, you'll receive a $5 reward that can be used on your next purchase.\" found in \"Loyalty Program Details\"\n",
      "[2] \"Points expire 12 months after they are earned.\" found in \"Loyalty Program Details\"\n",
      "[3] \"You can check your point balance in your account dashboard or by asking customer service.\" found in \"Loyalty Program Details\"\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "# Create a document with context field\n",
    "document = {\n",
    "    \"type\": \"document\",\n",
    "    \"source\": {\n",
    "        \"type\": \"text\",\n",
    "        \"media_type\": \"text/plain\",\n",
    "        \"data\": \"PetWorld offers a loyalty program where customers earn 1 point for every dollar spent. Once you accumulate 100 points, you'll receive a $5 reward that can be used on your next purchase. Points expire 12 months after they are earned. You can check your point balance in your account dashboard or by asking customer service.\"\n",
    "    },\n",
    "    \"title\": \"Loyalty Program Details\",\n",
    "    \"context\": \"WARNING: This article has not been updated in 12 months. Content may be out of date. Be sure to inform the user this content may be incorrect after providing guidance.\",\n",
    "    \"citations\": {\"enabled\": True}\n",
    "}\n",
    "\n",
    "QUESTION = \"How does PetWorld's loyalty program work? When do points expire?\"\n",
    "\n",
    "response = client.messages.create(\n",
    "    model=\"claude-3-5-sonnet-latest\",\n",
    "    temperature=0.0,\n",
    "    max_tokens=1024,\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                document,\n",
    "                {\n",
    "                    \"type\": \"text\",\n",
    "                    \"text\": QUESTION\n",
    "                }\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(\"\\n\" + \"=\"*80 + \"\\nRaw response:\\n\" + \"=\"*80)\n",
    "raw_response = {\n",
    "    \"content\": []\n",
    "}\n",
    "\n",
    "for content in response.content:\n",
    "    if content.type == \"text\":\n",
    "        block = {\n",
    "            \"type\": \"text\",\n",
    "            \"text\": content.text\n",
    "        }\n",
    "        if hasattr(content, 'citations') and content.citations:\n",
    "            block[\"citations\"] = [\n",
    "                {\n",
    "                    \"type\": c.type,\n",
    "                    \"cited_text\": c.cited_text,\n",
    "                    \"document_index\": c.document_index,\n",
    "                    \"document_title\": c.document_title,\n",
    "                    \"start_char_index\": c.start_char_index,\n",
    "                    \"end_char_index\": c.end_char_index\n",
    "                } for c in content.citations\n",
    "            ]\n",
    "        raw_response[\"content\"].append(block)\n",
    "\n",
    "print(json.dumps(raw_response, indent=2))\n",
    "formatted_response = visualize_citations(response)\n",
    "print(\"\\n\" + \"=\"*80 + \"\\nFormatted response:\\n\" + \"=\"*80)\n",
    "print(formatted_response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py311",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}