claude-cookbooks/third_party/MongoDB/rag_using_mongodb.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# How to Build A RAG System Using Claude 3 And MongoDB\n",
        "\n",
        "\n",
        "This tutorial implements a chatbot prompted to take on the role of a Venture capital tech Analyst. The chatbot is a naive RAG system with a collection of tech news articles acting as its knowledge source.\n",
        "This notebook covers the following:\n",
        "\n",
        "1. Follow a comprehensive tutorial on setting up your development environment, from installing necessary libraries to configuring a MongoDB database.\n",
        "2. Learn efficient data handling methods, including creating vector search indexes and preparing data for ingestion and query processing.\n",
        "3. Understand how to employ Claude 3 models within the RAG system for generating precise responses based on contextual information retrieved from the database.\n",
        "\n",
        "\n",
        "You will need the following:\n",
        "- Claude API Key\n",
        "- VoyageAI API Key\n",
        "- Hugging Face Access Token"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 1: Library installation, data loading and preparation\n",
        "\n",
        "\n",
        "Below are brief explanations of the tools and libraries utilised within the implementation code:\n",
        "- anthropic:  The official Python library for Anthropic that enables access to state-of-the-art language models. This library provides access to the Claude 3 family models, which can understand text and images.\n",
        "- datasets: This library is part of the Hugging Face ecosystem. By installing 'datasets', we gain access to a number of pre-processed and ready-to-use datasets, which are essential for training and fine-tuning machine learning models or benchmarking their performance.\n",
        "- pandas: This data science library provides robust data structures and methods for data manipulation, processing, and analysis.\n",
        "- voyageai: This is the official Python client library for accessing VoyageAI's suite of embedding models.\n",
        "- pymongo: PyMongo is a Python toolkit for MongoDB. It enables interactions with a MongoDB database.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zkePkAfZLQ_R"
      },
      "outputs": [],
      "source": [
        "!pip install pymongo datasets pandas anthropic voyageai"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The code snippet below executes the following steps:\n",
        "1. Import Necessary Libraries:\n",
        "- `os` for interacting with the operating system,\n",
        "- `requests` for making HTTP requests,\n",
        "- `BytesIO` from the io module to handle bytes objects like files in memory,\n",
        "- `pandas` (as pd) for data manipulation and analysis, and\n",
        "- `userdata` from google.colab enables access to environment variables stored in Google Colab secrets.\n",
        "2. Function Definition: The `download_and_combine_parquet_files` function is defined with two parameters:\n",
        "- `parquet_file_urls`: a list of URLs as strings, each pointing to a Parquet file that contains a sub-collection of the tech-news-embedding dataset.\n",
        "- `hf_token` is a string representing a Hugging Face authorization token. Access tokens can be created or copied from the [Hugging Face platform](https://huggingface.co/docs/hub/en/security-tokens#:~:text=To%20create%20an%20access%20token,you%27re%20ready%20to%20go!)\n",
        "3. Download and Read Parquet Files: The function iterates over each URL in parquet_file_urls. For each URL, it:\n",
        "- Makes a GET request using the requests.get method, passing the URL and the headers for authorization.\n",
        "- Checks if the response status code is 200 (OK), indicating the request was successful.\n",
        "- If successful, it reads the content of the response into a BytesIO object (to handle it as a file in memory), then uses pandas.read_parquet to read the Parquet file from this object into a Pandas DataFrame.\n",
        "- Appends the DataFrame to the list all_dataframes.\n",
        "4. Combine DataFrames: After downloading and reading all Parquet files into DataFrames, there’s a check to ensure that `all_dataframes` is not empty. If there are DataFrames to work with, then all DataFrames are concatenated into a single DataFrame using pd.concat, with ignore_index=True to reindex the new combined DataFrame. This combined DataFrame is the overall process output in the `download_and_combine_parquet_files` function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "VoY9qBGJNo2F"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import requests\n",
        "from io import BytesIO\n",
        "import pandas as pd\n",
        "from google.colab import userdata\n",
        "\n",
        "def download_and_combine_parquet_files(parquet_file_urls, hf_token):\n",
        "    \"\"\"\n",
        "    Downloads Parquet files from the provided URLs using the given Hugging Face token,\n",
        "    and returns a combined DataFrame.\n",
        "\n",
        "    Parameters:\n",
        "    - parquet_file_urls: List of strings, URLs to the Parquet files.\n",
        "    - hf_token: String, Hugging Face authorization token.\n",
        "\n",
        "    Returns:\n",
        "    - combined_df: A pandas DataFrame containing the combined data from all Parquet files.\n",
        "    \"\"\"\n",
        "    headers = {\"Authorization\": f\"Bearer {hf_token}\"}\n",
        "    all_dataframes = []\n",
        "\n",
        "    for parquet_file_url in parquet_file_urls:\n",
        "        response = requests.get(parquet_file_url, headers=headers)\n",
        "        if response.status_code == 200:\n",
        "            parquet_bytes = BytesIO(response.content)\n",
        "            df = pd.read_parquet(parquet_bytes)\n",
        "            all_dataframes.append(df)\n",
        "        else:\n",
        "            print(f\"Failed to download Parquet file from {parquet_file_url}: {response.status_code}\")\n",
        "\n",
        "    if all_dataframes:\n",
        "        combined_df = pd.concat(all_dataframes, ignore_index=True)\n",
        "        return combined_df\n",
        "    else:\n",
        "        print(\"No dataframes to concatenate.\")\n",
        "        return None"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Below is a list of the Parquet files required for this tutorial. The complete list of all files is located [here](https://huggingface.co/datasets/MongoDB/tech-news-embeddings/tree/refs%2Fconvert%2Fparquet/default/train). Each Parquet file represents approximately 45,000 data points.\n",
        "\n",
        "In the code snippet below, a subset of the tech-news-embeddings dataset is grouped into a single DataFrame, which is then assigned to the variable `combined_df`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "8sa8PrlAOawR"
      },
      "outputs": [],
      "source": [
        "# Uncomment the links below to load more data\n",
        "# For the full list of data visit: https://huggingface.co/datasets/MongoDB/tech-news-embeddings/tree/refs%2Fconvert%2Fparquet/default/train\n",
        "parquet_files = [\n",
        "    \"https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0000.parquet\",\n",
        "    # \"https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0001.parquet\",\n",
        "    # \"https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0002.parquet\",\n",
        "    # \"https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0003.parquet\",\n",
        "    # \"https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0004.parquet\",\n",
        "    # \"https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0005.parquet\",\n",
        "]\n",
        "\n",
        "hf_token = userdata.get(\"HF_TOKEN\")\n",
        "combined_df = download_and_combine_parquet_files(parquet_files, hf_token)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a final phase in data preparation, the code snippet below shows the step to remove the `_id` column from the grouped dataset, as it is unnecessary for subsequent steps in this tutorial. Additionally, the data within the embedding column for each data point is converted from a numpy array to a Python list to prevent errors related to incompatible data types during the data ingestion. \n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "GEx4i1ehRliI"
      },
      "outputs": [],
      "source": [
        "# Remove the _id coloum from the intital dataset\n",
        "combined_df = combined_df.drop(columns=['_id'])\n",
        "\n",
        "# Remove the initial embedding coloumn as we are going to create new embeddings with VoyageAI embedding model\n",
        "combined_df = combined_df.drop(columns=['embedding'])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 293
        },
        "id": "6UsKPi32OWX3",
        "outputId": "231e7041-7b4d-49f7-c7c5-1d81e7e4adea"
      },
      "outputs": [],
      "source": [
        "combined_df.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Limiting the amount of document used to 500 for this demo due to the rate limit on VoyageAI API\n",
        "# Read more on VoyageAI rate limits: https://docs.voyageai.com/docs/rate-limits\n",
        "max_documents = 500\n",
        "\n",
        "if len(combined_df) > max_documents:\n",
        "    combined_df = combined_df[:max_documents]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import voyageai\n",
        "import time\n",
        "\n",
        "vo = voyageai.Client(api_key=userdata.get(\"VOYAGE_API_KEY\"))\n",
        "\n",
        "def get_embedding(text: str) -> list[float]:\n",
        "    if not text.strip():\n",
        "      print(\"Attempted to get embedding for empty text.\")\n",
        "      return []\n",
        "\n",
        "    embedding = vo.embed(text, model=\"voyage-large-2\", input_type=\"document\")\n",
        "\n",
        "    return embedding.embeddings[0]\n",
        "\n",
        "combined_df[\"embedding\"] = combined_df[\"description\"].apply(get_embedding)\n",
        "\n",
        "combined_df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GXv9MrqLSZw8"
      },
      "source": [
        "## Step 2: Database and collection creation\n",
        "\n",
        "**To create a new MongoDB database, set up a database cluster:**\n",
        "1. Register for a [free MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&utm_source=community&utm_medium=cta&utm_content=Partner%20Cookbook&utm_term=richmond.alake), or existing users, [sign into MongoDB Atlas](https://account.mongodb.com/account/login?utm_campaign=devrel&utm_source=community&utm_medium=cta&utm_content=Partner%20Cookbook&utm_term=richmond.alake)\n",
        "2. Select the “Database” option on the left-hand pane, which will navigate to the Database Deployment page with a deployment specification of any existing cluster. Create a new database cluster by clicking on the \"+Create\" button.\n",
        "3. For assistance with database cluster setup and obtaining the URI, refer to our guide for setting up a MongoDB cluster and getting your connection string.\n",
        "Note: Don’t forget to whitelist the IP for the Python host or 0.0.0.0/0 for any IP when creating proof of concepts.\n",
        "4. After successfully creating and deploying the cluster, the cluster becomes accessible on the ‘Database Deployment’ page.\n",
        "5. Click on the “Connect” button of the cluster to view the option to set up a connection to the cluster via various language drivers.\n",
        "6. This tutorial only requires the cluster’s URI(unique resource identifier). Grab the URI and copy it into the Google Colabs Secrets environment in a variable named MONGO_URI or place it in a .env file or equivalent.\n",
        "\n",
        "\n",
        "Once you have created a cluster, navigate to the cluster page and create a database and collection within the MongoDB Atlas cluster by clicking + Create Database. \n",
        "The database will be named `tech_news`, and the collection will be named `hacker_noon_tech_news`.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 3: Vector search index creation\n",
        "\n",
        "By this point, you have created a cluster, database and collection.\n",
        "\n",
        "The steps in this section are crucial to ensure that a vector search can be conducted using the queries entered into the chatbot and searched against the records within the hacker_noon_tech_news collection. The objective of this step is to create a vector search index. To achieve this, refer to the official [vector search index creation guide](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/).\n",
        "\n",
        "In the creation of a vector search index using the JSON editor on MongoDB Atlas, ensure your vector search index is named vector_index and the vector search index definition is as follows:\n",
        "\n",
        "```\n",
        "{\n",
        " \"fields\": [{\n",
        "     \"numDimensions\": 1536,\n",
        "     \"path\": \"embedding\",\n",
        "     \"similarity\": \"cosine\",\n",
        "     \"type\": \"vector\"\n",
        "   }]\n",
        "}\n",
        "\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 4: Data ingestion\n",
        "\n",
        "To ingest data into the MongoDB database created in previous steps. The following operations have to be carried out:\n",
        "- Connect to the database and collection\n",
        "- Clear out the collection of any existing records\n",
        "- Convert the Pandas DataFrame of the dataset into dictionaries before ingestion\n",
        "- Ingest dictionaries into MongoDB using a batch operation\n",
        "\n",
        "This tutorial requires the cluster's URI (unique resource identifier). Grab the URI and copy it into the Google Colab Secrets environment in a variable named MONGO_URI, or place it in a .env file or equivalent.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Yv0dMtWnQhDy",
        "outputId": "5bc5276b-46b3-4bc3-f5e4-19802ff910c3"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Connection to MongoDB successful\n"
          ]
        }
      ],
      "source": [
        "import pymongo\n",
        "from google.colab import userdata\n",
        "\n",
        "def get_mongo_client(mongo_uri):\n",
        "  \"\"\"Establish connection to the MongoDB.\"\"\"\n",
        "  try:\n",
        "    client = pymongo.MongoClient(mongo_uri)\n",
        "    print(\"Connection to MongoDB successful\")\n",
        "    return client\n",
        "  except pymongo.errors.ConnectionFailure as e:\n",
        "    print(f\"Connection failed: {e}\")\n",
        "    return None\n",
        "\n",
        "mongo_uri = userdata.get('MONGO_URI')\n",
        "if not mongo_uri:\n",
        "  print(\"MONGO_URI not set in environment variables\")\n",
        "\n",
        "mongo_client = get_mongo_client(mongo_uri)\n",
        "\n",
        "DB_NAME=\"tech_news\"\n",
        "COLLECTION_NAME=\"hacker_noon_tech_news\"\n",
        "\n",
        "db = mongo_client[DB_NAME]\n",
        "collection = db[COLLECTION_NAME]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4NowJ2mlQrfT",
        "outputId": "31bfa7ac-dc62-421e-92c5-f644e03d7c1f"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "DeleteResult({'n': 228012, 'electionId': ObjectId('7fffffff000000000000000e'), 'opTime': {'ts': Timestamp(1709660559, 7341), 't': 14}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1709660559, 7341), 'signature': {'hash': b'jT\\xf1\\xb4\\xa9\\xd3\\xe3suu\\x03`\\x15(}\\x8f\\x00\\x9f\\xe9\\x8a', 'keyId': 7320226449804230661}}, 'operationTime': Timestamp(1709660559, 7341)}, acknowledged=True)"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# To ensure we are working with a fresh collection\n",
        "# delete any existing records in the collection\n",
        "collection.delete_many({})"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tdFsJXXRjEZe"
      },
      "outputs": [],
      "source": [
        "# Data Ingestion\n",
        "combined_df_json = combined_df.to_dict(orient='records')\n",
        "collection.insert_many(combined_df_json)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 5: Vector Search\n",
        "\n",
        "This section showcases the creation of a vector search custom function that accepts a user query, which corresponds to entries to the chatbot. The function also takes a second parameter, `collection`, which points to the database collection containing records against which the vector search operation should be conducted.\n",
        "\n",
        "The `vector_search` function produces a vector search result derived from a series of operations outlined in a MongoDB aggregation pipeline. This pipeline includes the `$vectorSearch` and `$project` stages and performs queries based on the vector embeddings of user queries. It then formats the results, omitting any record attributes unnecessary for subsequent processes.\n",
        "\n",
        "The code snippet below conducts the following operations to allow semantic search for movies:\n",
        "1. Define the `vector_search` function that takes a user's query string and a MongoDB collection as inputs and returns a list of documents that match the query based on vector similarity search.\n",
        "2. Generate an embedding for the user's query by calling the previously defined function, `get_embedding`, which converts the query string into a vector representation.\n",
        "3. Construct a pipeline for MongoDB's aggregate function, incorporating two main stages: `$vectorSearch` and `$project`.\n",
        "4. The `$vectorSearch` stage performs the actual vector search. The index field specifies the vector index to utilise for the vector search, and this should correspond to the name entered in the vector search index definition in previous steps. The queryVector field takes the embedding representation of the use query. The path field corresponds to the document field containing the embeddings.  The `numCandidates` specifies the number of candidate documents to consider and the limit on the number of results to return.\n",
        "5. The $project stage formats the results to excludes the _id  the `embedding` field.\n",
        "6. The aggregate executes the defined pipeline to obtain the vector search results. The final operation converts the returned cursor from the database into a list."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "QsSBrAu4nSJC"
      },
      "outputs": [],
      "source": [
        "def vector_search(user_query, collection):\n",
        "    \"\"\"\n",
        "    Perform a vector search in the MongoDB collection based on the user query.\n",
        "\n",
        "    Args:\n",
        "    user_query (str): The user's query string.\n",
        "    collection (MongoCollection): The MongoDB collection to search.\n",
        "\n",
        "    Returns:\n",
        "    list: A list of matching documents.\n",
        "    \"\"\"\n",
        "\n",
        "    # Generate embedding for the user query\n",
        "    query_embedding = get_embedding(user_query)\n",
        "\n",
        "    if query_embedding is None:\n",
        "        return \"Invalid query or embedding generation failed.\"\n",
        "\n",
        "    # Define the vector search pipeline\n",
        "    pipeline = [\n",
        "        {\n",
        "            \"$vectorSearch\": {\n",
        "                \"index\": \"vector_index\",\n",
        "                \"queryVector\": query_embedding,\n",
        "                \"path\": \"embedding\",\n",
        "                \"numCandidates\": 150,  # Number of candidate matches to consider\n",
        "                \"limit\": 5  # Return top 5 matches\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$project\": {\n",
        "                \"_id\": 0,  # Exclude the _id field\n",
        "                \"embedding\": 0,  # Exclude the embedding field\n",
        "                \"score\": {\n",
        "                    \"$meta\": \"vectorSearchScore\"  # Include the search score\n",
        "                }\n",
        "            }\n",
        "        }\n",
        "    ]\n",
        "\n",
        "    # Execute the search\n",
        "    results = collection.aggregate(pipeline)\n",
        "    return list(results)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 6: Handling user queries with Claude 3 models\n",
        "\n",
        "The final section of the tutorial outlines the sequence of operations performed as follows:\n",
        "\n",
        "- Accept a user query in the form of a string.\n",
        "- Utilize the VoyageAI embedding model to generate embeddings for the user query.\n",
        "- Load the Anthropic Claude 3, specifically the ‘claude-3-opus-20240229’ model, to serve as the base model for the RAG system.\n",
        "- Execute a vector search using the embeddings of the user query to fetch relevant information from the knowledge base, which provides additional context for the base model.\n",
        "- Submit both the user query and the gathered additional information to the base model to generate a response.\n",
        "\n",
        "\n",
        "An important note is that the dimensions of the user query embedding match the dimensions set in the vector search index definition on MongoDB Atlas."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The next step in this section is to import the anthropic library and load the client to access the anthropic’s methods for handling messages and accessing Claude models. Ensure you obtain an Claude API key located within the settings page on the [official Anthropic website](https://platform.claude.com/settings/keys).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "wADfSOFOnw0s"
      },
      "outputs": [],
      "source": [
        "import anthropic\n",
        "client = anthropic.Client(api_key=userdata.get(\"CLAUDE_API_KEY\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Below is a more detailed description of the operations in the code snippet below:\n",
        "\n",
        "1. Vector Search Execution: The function begins by calling `vector_search` with the user's query and a specified collection as arguments. This performs a search within the collection, leveraging vector embeddings to find relevant information related to the query.\n",
        "2. Compile Search Results: `search_result` is initialized as an empty string to aggregate information from the search. The search results are compiled by iterating over the results returned by the `vector_search` function, formates each item's details (title, company name, URL, publication date, article URL, and description) into a human-readable string, appending this information to search_result with a newline character \\n at the end of each entry.\n",
        "3. Generate Response Using Anthropic Client: The function then constructs a request to the Claude API (through a client object, presumably an instance of the anthropic. Client class created earlier). It specifies:\n",
        "- The model to use (\"claude-3-opus-20240229\") indicates a specific version of the Claude 3 model.\n",
        "- The maximum token limit for the generated response (max_tokens=1024).\n",
        "- A system description guides the model to behave as a \"Venture Capital Tech Analyst\" with access to tech company articles and information, using this context to advise.\n",
        "- The actual message for the model to process combines the user query with the aggregated search results as context.\n",
        "4. Return the Generated Response and Search Results: It extracts and returns the response text from the first item in the response's content, alongside the compiled search results.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "JrG8Ho50nhVR"
      },
      "outputs": [],
      "source": [
        "def handle_user_query(query, collection):\n",
        "\n",
        "  get_knowledge = vector_search(query, collection)\n",
        "\n",
        "  search_result = ''\n",
        "  for result in get_knowledge:\n",
        "    search_result += (\n",
        "        f\"Title: {result.get('title', 'N/A')}, \"\n",
        "        f\"Company Name: {result.get('companyName', 'N/A')}, \"\n",
        "        f\"Company URL: {result.get('companyUrl', 'N/A')}, \"\n",
        "        f\"Date Published: {result.get('published_at', 'N/A')}, \"\n",
        "        f\"Article URL: {result.get('url', 'N/A')}, \"\n",
        "        f\"Description: {result.get('description', 'N/A')}, \\n\"\n",
        "    )\n",
        "\n",
        "  response = client.messages.create(\n",
        "    model=\"claude-3-opus-20240229\",\n",
        "    max_tokens=1024,\n",
        "    system=\"You are Venture Captital Tech Analyst with access to some tech company articles and information. You use the information you are given to provide advice.\",\n",
        "    messages=[\n",
        "        {\"role\": \"user\", \"content\": \"Answer this user query: \" + query + \" with the following context: \" + search_result}\n",
        "    ]\n",
        "  )\n",
        "\n",
        "  return (response.content[0].text), search_result"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The final step in this tutorial is to initialize the query, pass it into the `handle_user_query` function and print the response returned."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_3OcaBbmoY2H",
        "outputId": "06c231ef-cbaa-48cd-e226-9d1c8450b737"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Response: Based on the information provided in the article titles and descriptions, Alibaba Group Holding Limited appears to be a top technology stock pick for 2023 according to renowned investor Ray Dalio. The article \"Top 10 Technology Stocks to Buy in 2023 According to Ray Dalio\" suggests that Alibaba is one of Dalio's favored tech investments for the year.\n",
            "\n",
            "As a venture capital tech analyst, I would recommend considering an investment in Alibaba for the following reasons:\n",
            "\n",
            "1. Endorsement from a respected investor: Ray Dalio, known for his successful investment strategies, has included Alibaba in his top 10 technology stock picks for 2023. His backing lends credibility to the investment potential of the company.\n",
            "\n",
            "2. Strong market position: Alibaba is a leading e-commerce company in China with a significant market share. It has a diversified business model spanning e-commerce, cloud computing, digital media, and entertainment.\n",
            "\n",
            "3. Growth potential: With China's large and growing middle class, Alibaba is well-positioned to benefit from increasing consumer spending and the shift towards online shopping.\n",
            "\n",
            "However, it's essential to consider the following points as well:\n",
            "\n",
            "1. Regulatory risks: Chinese tech companies, including Alibaba, have faced increased regulatory scrutiny in recent times. Changes in government policies could impact the company's growth and profitability.\n",
            "\n",
            "2. Competition: While Alibaba is a dominant player, it faces competition from other tech giants like Tencent and JD.com in various business segments.\n",
            "\n",
            "3. Geopolitical tensions: Ongoing tensions between the U.S. and China could lead to market volatility and impact investor sentiment towards Chinese stocks.\n",
            "\n",
            "As with any investment, it's crucial to conduct thorough research, consider your risk tolerance, and diversify your portfolio. Keep in mind that the information provided here is based on limited data points, and stock prices can be influenced by various factors beyond the scope of this context.\n",
            "Source Information: \\nTitle: Top 10 Technology Stocks to Buy in 2023 According to Ray Dalio, Company Name: ALIBABA GROUP HOLDING LIMITED, Company URL: https://hackernoon.com/company/alibabagroupholdinglimited, Date Published: 2023-04-21 11:58:00, Article URL: https://uk.finance.yahoo.com/news/top-10-technology-stocks-buy-155830366.html, Description: In this article we discuss the top 10 technology stocks to buy in 2023 according to Ray Dalio. If you want to skip our detailed analysis of Dalio’s investment philosophy and portfolio construction, \n",
            "Title: 3 Tech Stocks I Love Right Now, Company Name: 10Clouds, Company URL: https://hackernoon.com/company/10clouds, Date Published: 2023-04-02 11:30:00, Article URL: https://www.msn.com/en-xl/money/other/3-tech-stocks-i-love-right-now/ar-AA19n9Ht, Description: These are tech giants but they''re also great investments., \n",
            "Title: 3 Millionaire-Maker Hydrogen Stocks to Buy Before the Window Closes, Company Name: Air Products & Chemicals, Company URL: https://hackernoon.com/company/airproductschemicals, Date Published: 2023-07-28 12:18:00, Article URL: https://www.msn.com/en-us/money/topstocks/3-millionaire-maker-hydrogen-stocks-to-buy-before-the-window-closes/ar-AA1etN8O, Description: These are the best hydrogen stocks to buy with multibagger returns potential.More From InvestorPlace Buy This $5 Stock BEFORE This Apple Project Goes Live Wall Street Titan: Here’s My #1 Stock for 2023 The $1 Investment You MUST Take Advantage of Right Now It doesn’t matter if you have $500 or $5 million., \n",
            "Title: Why it may be time to sell the pop in tech stocks: BlackRock, Company Name: BlackRock, Company URL: https://hackernoon.com/company/blackrock, Date Published: 2023-02-13 19:06:00, Article URL: https://news.yahoo.com/why-it-may-be-time-to-sell-the-pop-in-tech-stocks-blackrock-190606866.html, Description: Household tech names like Apple Meta and Netflix have soared so far in 2023 but one strategist says the gains aren''t likely to last. It''s time to take profits on tech stocks — the early sector winner of 2023 — as the Federal Reserve may soon dash hopes for a pivot on interest rates, \n",
            "Title: The harsh reality for investors eyeing tech stocks in 2023: Morning Brief, Company Name: 10Clouds, Company URL: https://hackernoon.com/company/10clouds, Date Published: 2023-01-02 11:18:00, Article URL: https://news.yahoo.com/the-harsh-reality-for-investors-eyeing-tech-stocks-in-2023-morning-brief-111854804.html, Description: Curious on how to buy battered tech stocks? Here''s a quick tip. More on that and what else to watch in business on Monday., \n",
            "\n"
          ]
        }
      ],
      "source": [
        "# Conduct query with retrieval of sources\n",
        "query = \"Give me the best tech stock to invest in and tell me why\"\n",
        "response, source_information = handle_user_query(query, collection)\n",
        "\n",
        "print(f\"Response: {response}\")\n",
        "print(f\"\\\\nSource Information: \\\\n{source_information}\")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}