mirror of https://github.com/NirDiamant/RAG_Techniques.git (synced 2025-04-07 00:48:52 +03:00)
cp
all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb (new file, 297 lines added)
@@ -0,0 +1,297 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple RAG (Retrieval-Augmented Generation) System\n",
"\n",
"## Overview\n",
"\n",
"This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Text chunking for manageable processing\n",
"3. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
"4. Retriever setup for querying the processed documents\n",
"5. Evaluation of the RAG system\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is loaded using PyPDFLoader.\n",
"2. The text is split into chunks using RecursiveCharacterTextSplitter with the specified chunk size and overlap, as sketched below.\n",
"\n",
"### Text Cleaning\n",
"\n",
"A custom function `replace_t_with_space` is applied to clean the text chunks, replacing stray tab characters with spaces, a common artifact of PDF text extraction.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search, as sketched below.\n",
"\n",
"### Retriever Setup\n",
"\n",
"1. A retriever is configured to fetch the top 2 most relevant chunks for a given query, as shown below.\n",
"\n",
"### Encoding Function\n",
"\n",
"The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.\n",
"\n",
"## Key Features\n",
"\n",
"1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.\n",
"2. Configurable Chunking: Allows adjustment of chunk size and overlap.\n",
"3. Efficient Retrieval: Uses FAISS for fast similarity search.\n",
"4. Evaluation: Includes a function to evaluate the RAG system's performance.\n",
"\n",
"## Usage Example\n",
"\n",
"The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n",
"\n",
"## Evaluation\n",
"\n",
"The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Scalability: Can handle large documents by processing them in chunks.\n",
"2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
"3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
"4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
"\n",
"## Conclusion\n",
"\n",
"This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"\n",
"# Prompt for the OpenAI API key if it is not already set (comment out if not using OpenAI)\n",
"if not os.getenv('OPENAI_API_KEY'):\n",
"    os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
"    \"\"\"\n",
"    Encodes a PDF book into a vector store using OpenAI embeddings.\n",
"\n",
"    Args:\n",
"        path: The path to the PDF file.\n",
"        chunk_size: The desired size of each text chunk.\n",
"        chunk_overlap: The amount of overlap between consecutive chunks.\n",
"\n",
"    Returns:\n",
"        A FAISS vector store containing the encoded book content.\n",
"    \"\"\"\n",
"\n",
"    # Load PDF documents\n",
"    loader = PyPDFLoader(path)\n",
"    documents = loader.load()\n",
"\n",
"    # Split documents into chunks\n",
"    text_splitter = RecursiveCharacterTextSplitter(\n",
"        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
"    )\n",
"    texts = text_splitter.split_documents(documents)\n",
"    cleaned_texts = replace_t_with_space(texts)\n",
"\n",
"    # Create embeddings (tested with OpenAI and Amazon Bedrock)\n",
"    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
"    # embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
"\n",
"    # Create vector store\n",
"    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
"\n",
"    return vectorstore"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create retriever"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 2})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test retriever"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\N7\\PycharmProjects\\llm_tasks\\RAG_TECHNIQUES\\.venv\\Lib\\site-packages\\langchain_core\\_api\\deprecation.py:139: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.\n",
"  warn_deprecated(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"\n",
"\n"
]
}
],
"source": [
"test_query = \"What is the main cause of climate change?\"\n",
"context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
"show_context(context)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: this currently works with OpenAI only\n",
"evaluate_rag(chunks_query_retriever)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}