claude-cookbooks/third_party/Pinecone/rag_using_pinecone.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Retrieval-Augmented Generation using Pinecone\n",
    "\n",
    "This notebook demonstrates how to connect Claude with the data in your Pinecone vector database through a technique called retrieval-augmented generation (RAG). We will cover the following steps:\n",
    "\n",
    "1. Embedding a dataset using Voyage AI's embedding model\n",
    "2. Uploading the embeddings to a Pinecone index\n",
    "3. Retrieving information from the vector database\n",
    "4. Using Claude to answer questions with information from the database"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "First, let's install the necessary libraries and set the API keys we will need to use in this notebook. We will need to get a [Claude API key](https://docs.claude.com/claude/reference/getting-started-with-the-api), a free [Pinecone API key](https://docs.pinecone.io/docs/quickstart), and a free [Voyage AI API key](https://docs.voyageai.com/install/). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install anthropic datasets pinecone-client voyageai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Insert your API keys here\n",
    "CLAUDE_API_KEY=\"<YOUR_CLAUDE_API_KEY>\"\n",
    "PINECONE_API_KEY=\"<YOUR_PINECONE_API_KEY>\"\n",
    "VOYAGE_API_KEY=\"<YOUR_VOYAGE_API_KEY>\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the dataset\n",
    "Now let's download the Amazon products dataset which has over 10k Amazon product descriptions and load it into a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Download the JSONL file\n",
    "!wget  https://www-cdn.anthropic.com/48affa556a5af1de657d426bcc1506cdf7e2f68e/amazon-products.jsonl\n",
    "\n",
    "data = []\n",
    "with open('amazon-products.jsonl', 'r') as file:\n",
    "    for line in file:\n",
    "        try:\n",
    "            data.append(eval(line))\n",
    "        except:\n",
    "            pass\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "display(df.head())\n",
    "len(df)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vector Database\n",
    "\n",
    "To create our vector database, we first need a free API key from Pinecone. Once we have the key, we can initialize the database as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pinecone import Pinecone\n",
    "\n",
    "pc = Pinecone(api_key=PINECONE_API_KEY)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we set up our index specification, which allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all available providers and regions [here](https://www.pinecone.io/docs/data-types/metadata/).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pinecone import ServerlessSpec\n",
    "\n",
    "spec = ServerlessSpec(\n",
    "    cloud=\"aws\", region=\"us-west-2\"\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we initialize the index. We will be using Voyage's \"voyage-2\" model for creating the embeddings, so we set the dimension to 1024."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index_name = 'amazon-products'\n",
    "existing_indexes = [\n",
    "    index_info[\"name\"] for index_info in pc.list_indexes()\n",
    "]\n",
    "\n",
    "# check if index already exists (it shouldn't if this is first time)\n",
    "if index_name not in existing_indexes:\n",
    "    # if does not exist, create index\n",
    "    pc.create_index(\n",
    "        index_name,\n",
    "        dimension=1024,  # dimensionality of voyage-2 embeddings\n",
    "        metric='dotproduct',\n",
    "        spec=spec\n",
    "    )\n",
    "    # wait for index to be initialized\n",
    "    while not pc.describe_index(index_name).status['ready']:\n",
    "        time.sleep(1)\n",
    "\n",
    "# connect to index\n",
    "index = pc.Index(index_name)\n",
    "time.sleep(1)\n",
    "# view index stats\n",
    "index.describe_index_stats()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We should see that the new Pinecone index has a total_vector_count of 0, as we haven't added any vectors yet."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embeddings\n",
    "To get started with Voyage's embeddings, go [here](https://www.voyageai.com) to get an API key.\n",
    "\n",
    "Now let's set up our Voyage client and demonstrate how to create an embedding using the `embed` method. To learn more about using Voyage embeddings with Claude, see [this notebook](https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/VoyageAI/how_to_create_embeddings.md)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import voyageai\n",
    "\n",
    "vo = voyageai.Client(api_key=VOYAGE_API_KEY)\n",
    "\n",
    "texts = [\"Sample text 1\", \"Sample text 2\"]\n",
    "\n",
    "result = vo.embed(texts, model=\"voyage-2\", input_type=\"document\")\n",
    "print(result.embeddings[0])\n",
    "print(result.embeddings[1])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Uploading data to the Pinecone index\n",
    "\n",
    "With our embedding model set up, we can now take our product descriptions, embed them, and upload the embeddings to the Pinecone index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm.auto import tqdm\n",
    "from time import sleep\n",
    "\n",
    "descriptions = df[\"text\"].tolist()\n",
    "batch_size = 100  # how many embeddings we create and insert at once\n",
    "\n",
    "for i in tqdm(range(0, len(descriptions), batch_size)):\n",
    "    # find end of batch\n",
    "    i_end = min(len(descriptions), i+batch_size)\n",
    "    descriptions_batch = descriptions[i:i_end]\n",
    "    # create embeddings (try-except added to avoid RateLimitError. Voyage currently allows 300/requests per minute.)\n",
    "    done = False\n",
    "    while not done:\n",
    "        try:\n",
    "            res = vo.embed(descriptions_batch, model=\"voyage-2\", input_type=\"document\")\n",
    "            done = True\n",
    "        except:\n",
    "            sleep(5)\n",
    "            \n",
    "    embeds = [record for record in res.embeddings]\n",
    "    # create unique IDs for each text\n",
    "    ids_batch = [f\"description_{idx}\" for idx in range(i, i_end)]\n",
    "    \n",
    "    # Create metadata dictionaries for each text\n",
    "    metadata_batch = [{'description': description} for description in descriptions_batch]\n",
    "\n",
    "    to_upsert = list(zip(ids_batch, embeds, metadata_batch))\n",
    "\n",
    "    # upsert to Pinecone\n",
    "    index.upsert(vectors=to_upsert)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making queries\n",
    "\n",
    "With our index populated, we can start making queries to get results. We can take a natural language question, embed it, and query it against the index to return semantically similar product descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'matches': [{'id': 'description_1771',\n",
       "              'metadata': {'description': 'Product Name: Scientific Explorer '\n",
       "                                          'My First Science Kids Science '\n",
       "                                          'Experiment Kit\\n'\n",
       "                                          '\\n'\n",
       "                                          'About Product: Experiments to spark '\n",
       "                                          'creativity and curiosity | Grow '\n",
       "                                          'watery crystals, create a rainbow '\n",
       "                                          'in a plate, explore the science of '\n",
       "                                          'color and more | Represents STEM '\n",
       "                                          '(Science, Technology, Engineering, '\n",
       "                                          'Math) principles – open ended toys '\n",
       "                                          'to construct, engineer, explorer '\n",
       "                                          'and experiment | Includes cross '\n",
       "                                          'linked polyacrylamide, 3 color '\n",
       "                                          'tablets, 3 mixing cups, 3 test '\n",
       "                                          'tubes, caps and stand, pipette, '\n",
       "                                          'mixing tray, magnifier and '\n",
       "                                          'instructions | Recommended for '\n",
       "                                          'children 4 years of age and older '\n",
       "                                          'with adult supervision\\n'\n",
       "                                          '\\n'\n",
       "                                          'Categories: Toys & Games | Learning '\n",
       "                                          '& Education | Science Kits & Toys'},\n",
       "              'score': 0.772703767,\n",
       "              'values': []},\n",
       "             {'id': 'description_3133',\n",
       "              'metadata': {'description': 'Product Name: Super Science Magnet '\n",
       "                                          'Kit.\\n'\n",
       "                                          '\\n'\n",
       "                                          'About Product: \\n'\n",
       "                                          '\\n'\n",
       "                                          'Categories: Toys & Games | Learning '\n",
       "                                          '& Education | Science Kits & Toys'},\n",
       "              'score': 0.765997052,\n",
       "              'values': []},\n",
       "             {'id': 'description_1792',\n",
       "              'metadata': {'description': 'Product Name: BRIGHT Atom Model - '\n",
       "                                          'Student\\n'\n",
       "                                          '\\n'\n",
       "                                          'About Product: \\n'\n",
       "                                          '\\n'\n",
       "                                          'Categories: Toys & Games | Learning '\n",
       "                                          '& Education | Science Kits & Toys'},\n",
       "              'score': 0.765654,\n",
       "              'values': []},\n",
       "             {'id': 'description_1787',\n",
       "              'metadata': {'description': 'Product Name: Thames & Kosmos '\n",
       "                                          'Biology Genetics and DNA\\n'\n",
       "                                          '\\n'\n",
       "                                          'About Product: Learn the basics of '\n",
       "                                          'genetics and DNA. | Assemble a '\n",
       "                                          'model to see the elegant '\n",
       "                                          'double-stranded Helical structure '\n",
       "                                          \"of DNA. | A parents' Choice Gold \"\n",
       "                                          'award winner | 20 experiments in '\n",
       "                                          'the 48 page full color experiment '\n",
       "                                          'manual and learning guide\\n'\n",
       "                                          '\\n'\n",
       "                                          'Categories: Toys & Games | Learning '\n",
       "                                          '& Education | Science Kits & Toys'},\n",
       "              'score': 0.765174091,\n",
       "              'values': []},\n",
       "             {'id': 'description_120',\n",
       "              'metadata': {'description': 'Product Name: Educational Insights '\n",
       "                                          \"Nancy B's Science Club Binoculars \"\n",
       "                                          'and Wildlife Activity Journal\\n'\n",
       "                                          '\\n'\n",
       "                                          'About Product: From bird search and '\n",
       "                                          'ecosystem challenges to creative '\n",
       "                                          'writing and drawing exercises, this '\n",
       "                                          'set is perfect for the nature lover '\n",
       "                                          'in your life! | Includes 4x '\n",
       "                                          'magnification binoculars and '\n",
       "                                          '22-page activity journal packed '\n",
       "                                          'with scientific activities! | '\n",
       "                                          'Binoculars are lightweight, yet '\n",
       "                                          'durable. | Supports STEM learning, '\n",
       "                                          'providing hands-on experience with '\n",
       "                                          'a key scientific tool. | Great '\n",
       "                                          'introductory tool for young '\n",
       "                                          'naturalists on-the-go! | Part of '\n",
       "                                          \"the Nancy B's Science Club line, \"\n",
       "                                          'designed to encourage scientific '\n",
       "                                          'confidence. | Winner of the '\n",
       "                                          \"Parents' Choice Recommended Award. \"\n",
       "                                          '| Scientific experience designed '\n",
       "                                          'specifically for kids ages 8-11.\\n'\n",
       "                                          '\\n'\n",
       "                                          'Categories: Electronics | Camera & '\n",
       "                                          'Photo | Binoculars & Scopes | '\n",
       "                                          'Binoculars'},\n",
       "              'score': 0.765075564,\n",
       "              'values': []}],\n",
       " 'namespace': '',\n",
       " 'usage': {'read_units': 6}}"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "USER_QUESTION = \"I want to get my daughter more interested in science. What kind of gifts should I get her?\"\n",
    "\n",
    "question_embed = vo.embed([USER_QUESTION], model=\"voyage-2\", input_type=\"query\")\n",
    "results = index.query(\n",
    "            vector=question_embed.embeddings, top_k=5, include_metadata=True\n",
    "        )\n",
    "results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimizing search\n",
    "\n",
    "These results are good, but we can optimize them even further. Using Claude, we can take the user's question and generate search keywords from it. This allows us to perform a wide, diverse search over the index to get more relevant product descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import anthropic\n",
    "\n",
    "client = anthropic.Anthropic(api_key=CLAUDE_API_KEY)\n",
    "def get_completion(prompt):\n",
    "    completion = client.completions.create(\n",
    "        model=\"claude-2.1\",\n",
    "        prompt=prompt,\n",
    "        max_tokens_to_sample=1024,\n",
    "    )\n",
    "    return completion.completion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_keyword_prompt(question):\n",
    "    return f\"\"\"\\n\\nHuman: Given a question, generate a list of 5 very diverse search keywords that can be used to search for products on Amazon.\n",
    "\n",
    "The question is: {question}\n",
    "\n",
    "Output your keywords as a JSON that has one property \"keywords\" that is a list of strings. Only output valid JSON.\\n\\nAssistant:{{\"\"\"\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With our Anthropic client setup and our prompt created, we can now begin to generate keywords from the question. We will output the keywords in a JSON object so we can easily parse them from Claude's output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "keyword_json = \"{\" + get_completion(create_keyword_prompt(USER_QUESTION))\n",
    "print(keyword_json)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Extract the keywords from the JSON\n",
    "data = json.loads(keyword_json)\n",
    "keywords_list = data['keywords']\n",
    "print(keywords_list)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now with our keywords in a list, let's embed each one, query it against the index, and return the top 3 most relevant product descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results_list = []\n",
    "for keyword in keywords_list:\n",
    "    # get the embeddings for the keywords\n",
    "    query_embed = vo.embed([keyword], model=\"voyage-2\", input_type=\"query\")\n",
    "    # search for the embeddings in the Pinecone index\n",
    "    search_results = index.query(vector=query_embed.embeddings, top_k=3, include_metadata=True)\n",
    "    # append the search results to the list\n",
    "    for search_result in search_results.matches:\n",
    "            results_list.append(search_result['metadata']['description'])\n",
    "print(len(results_list))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Answering with Claude\n",
    "\n",
    "Now that we have a list of product descriptions, let's format them into a search template Claude has been trained with and pass the formatted descriptions into another prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Formatting search results\n",
    "def format_results(extracted: list[str]) -> str:\n",
    "        result = \"\\n\".join(\n",
    "            [\n",
    "                f'<item index=\"{i+1}\">\\n<page_content>\\n{r}\\n</page_content>\\n</item>'\n",
    "                for i, r in enumerate(extracted)\n",
    "            ]\n",
    "        )\n",
    "    \n",
    "        return f\"\\n<search_results>\\n{result}\\n</search_results>\"\n",
    "\n",
    "def create_answer_prompt(results_list, question):\n",
    "    return f\"\"\"\\n\\nHuman: {format_results(results_list)} Using the search results provided within the <search_results></search_results> tags, please answer the following question <question>{question}</question>. Do not reference the search results in your answer.\\n\\nAssistant:\"\"\"\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's ask the original user's question and get our answer from Claude."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " To get your daughter more interested in science, I would recommend getting her an age-appropriate science kit or set that allows for hands-on exploration and experimentation. For example, for a younger child you could try a beginner chemistry set, magnet set, or crystal growing kit. For an older child, look for kits that tackle more advanced scientific principles like physics, engineering, robotics, etc. The key is choosing something that sparks her natural curiosity and lets her actively investigate concepts through activities, observations, and discovery. Supplement the kits with science books, museum visits, documentaries, and conversations about science she encounters in everyday life. Making science fun and engaging is crucial for building her interest.\n"
     ]
    }
   ],
   "source": [
    "answer = get_completion(create_answer_prompt(results_list, USER_QUESTION))\n",
    "print(answer)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "myvenv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}