From 06d2f16b4bf39d2549dbecb47596190d3d89045f Mon Sep 17 00:00:00 2001 From: VakeDomen Date: Tue, 18 Feb 2025 10:06:29 +0000 Subject: [PATCH 1/5] cp --- .../HyPE_Hypothetical_Prompt_Embeddings.ipynb | 297 ++++++++++++++++++ 1 file changed, 297 insertions(+) create mode 100644 all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb diff --git a/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb new file mode 100644 index 0000000..ccca08d --- /dev/null +++ b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb @@ -0,0 +1,297 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Simple RAG (Retrieval-Augmented Generation) System\n", + "\n", + "## Overview\n", + "\n", + "This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n", + "\n", + "## Key Components\n", + "\n", + "1. PDF processing and text extraction\n", + "2. Text chunking for manageable processing\n", + "3. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n", + "4. Retriever setup for querying the processed documents\n", + "5. Evaluation of the RAG system\n", + "\n", + "## Method Details\n", + "\n", + "### Document Preprocessing\n", + "\n", + "1. The PDF is loaded using PyPDFLoader.\n", + "2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.\n", + "\n", + "### Text Cleaning\n", + "\n", + "A custom function `replace_t_with_space` is applied to clean the text chunks. This likely addresses specific formatting issues in the PDF.\n", + "\n", + "### Vector Store Creation\n", + "\n", + "1. OpenAI embeddings are used to create vector representations of the text chunks.\n", + "2. A FAISS vector store is created from these embeddings for efficient similarity search.\n", + "\n", + "### Retriever Setup\n", + "\n", + "1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.\n", + "\n", + "### Encoding Function\n", + "\n", + "The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.\n", + "\n", + "## Key Features\n", + "\n", + "1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.\n", + "2. Configurable Chunking: Allows adjustment of chunk size and overlap.\n", + "3. Efficient Retrieval: Uses FAISS for fast similarity search.\n", + "4. Evaluation: Includes a function to evaluate the RAG system's performance.\n", + "\n", + "## Usage Example\n", + "\n", + "The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n", + "\n", + "## Evaluation\n", + "\n", + "The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.\n", + "\n", + "## Benefits of this Approach\n", + "\n", + "1. Scalability: Can handle large documents by processing them in chunks.\n", + "2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n", + "3. 
Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n", + "4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n", + "\n", + "## Conclusion\n", + "\n", + "This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import libraries and environment variables" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "from dotenv import load_dotenv\n", + "\n", + "\n", + "# Load environment variables from a .env file\n", + "load_dotenv()\n", + "\n", + "# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n", + "if not os.getenv('OPENAI_API_KEY'):\n", + " os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n", + "else:\n", + " os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n", + "\n", + "sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n", + "from helper_functions import *\n", + "from evaluation.evalute_rag import *\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Read Docs" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "path = \"../data/Understanding_Climate_Change.pdf\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Encode document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n", + " \"\"\"\n", + " Encodes a PDF book into a vector store using OpenAI embeddings.\n", + "\n", + " Args:\n", + " path: The path to the PDF file.\n", + " chunk_size: The desired size of each text chunk.\n", + " chunk_overlap: The amount of overlap between consecutive chunks.\n", + "\n", + " Returns:\n", + " A FAISS vector store containing the encoded book content.\n", + " \"\"\"\n", + "\n", + " # Load PDF documents\n", + " loader = PyPDFLoader(path)\n", + " documents = loader.load()\n", + "\n", + " # Split documents into chunks\n", + " text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n", + " )\n", + " texts = text_splitter.split_documents(documents)\n", + " cleaned_texts = replace_t_with_space(texts)\n", + "\n", + " # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n", + " embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n", + " #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n", + "\n", + " # Create vector store\n", + " vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n", + "\n", + " return vectorstore" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + 
"### Create retriever" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 2})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Test retriever" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "c:\\Users\\N7\\PycharmProjects\\llm_tasks\\RAG_TECHNIQUES\\.venv\\Lib\\site-packages\\langchain_core\\_api\\deprecation.py:139: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.\n", + " warn_deprecated(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Context 1:\n", + "driven by human activities, particularly the emission of greenhou se gases. \n", + "Chapter 2: Causes of Climate Change \n", + "Greenhouse Gases \n", + "The primary cause of recent climate change is the increase in greenhouse gases in the \n", + "atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n", + "oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n", + "for life on Earth, as it keeps the planet warm enough to support life. However, human \n", + "activities have intensified this natural process, leading to a warmer climate. \n", + "Fossil Fuels \n", + "Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n", + "natural gas used for electricity, heating, and transportation. The industrial revolution marked \n", + "the beginning of a significant increase in fossil fuel consumption, which continues to rise \n", + "today. \n", + "Coal\n", + "\n", + "\n", + "Context 2:\n", + "Most of these climate changes are attributed to very small variations in Earth's orbit that \n", + "change the amount of solar energy our planet receives. During the Holocene epoch, which \n", + "began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n", + "unprecedented changes. \n", + "Modern Observations \n", + "Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n", + "and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n", + "documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n", + "provide a historical record that scientists use to understand past climate conditions and \n", + "predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n", + "driven by human activities, particularly the emission of greenhou se gases. 
\n", + "Chapter 2: Causes of Climate Change \n", + "Greenhouse Gases\n", + "\n", + "\n" + ] + } + ], + "source": [ + "test_query = \"What is the main cause of climate change?\"\n", + "context = retrieve_context_per_question(test_query, chunks_query_retriever)\n", + "show_context(context)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Note - this currently works with OPENAI only\n", + "evaluate_rag(chunks_query_retriever)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From dfb0e9125b3d506b17b75a26273a3a780247cfd0 Mon Sep 17 00:00:00 2001 From: VakeDomen Date: Wed, 19 Feb 2025 20:40:18 +0100 Subject: [PATCH 2/5] HyPE --- .../HyPE_Hypothetical_Prompt_Embeddings.ipynb | 322 ++++++++++++++---- .../HyPE_Hypothetical_Prompt_Embeddings.py | 203 +++++++++++ 2 files changed, 459 insertions(+), 66 deletions(-) create mode 100644 all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embeddings.py diff --git a/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb index ccca08d..95de518 100644 --- a/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb +++ b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb @@ -4,50 +4,50 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Simple RAG (Retrieval-Augmented Generation) System\n", + "# Hypothetical Prompt Embeddings (HyPE)\n", "\n", "## Overview\n", "\n", - "This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n", + "This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n", "\n", "## Key Components\n", "\n", "1. PDF processing and text extraction\n", - "2. Text chunking for manageable processing\n", - "3. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n", - "4. Retriever setup for querying the processed documents\n", - "5. Evaluation of the RAG system\n", + "2. Text chunking to maintain coherent information units\n", + "3. **Hypothetical Prompt Embedding Generation** using an LLM to create multiple proxy questions per chunk\n", + "4. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n", + "5. Retriever setup for querying the processed documents\n", + "6. 
Evaluation of the RAG system\n",
     "\n",
     "## Method Details\n",
     "\n",
     "### Document Preprocessing\n",
     "\n",
-    "1. The PDF is loaded using PyPDFLoader.\n",
-    "2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.\n",
+    "1. The PDF is loaded using `PyPDFLoader`.\n",
+    "2. The text is split into chunks using `RecursiveCharacterTextSplitter` with the specified chunk size and overlap.\n",
     "\n",
-    "### Text Cleaning\n",
+    "### Hypothetical Question Generation\n",
     "\n",
-    "A custom function `replace_t_with_space` is applied to clean the text chunks. This likely addresses specific formatting issues in the PDF.\n",
+    "Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the runtime synthetic answer generation required by techniques like HyDE.\n",
     "\n",
     "### Vector Store Creation\n",
     "\n",
-    "1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
-    "2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
+    "1. Each hypothetical question is embedded using OpenAI embeddings.\n",
+    "2. A FAISS vector store is built, associating **each question embedding with its original chunk**.\n",
+    "3. This approach **stores multiple representations per chunk**, increasing retrieval flexibility.\n",
     "\n",
     "### Retriever Setup\n",
     "\n",
-    "1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.\n",
-    "\n",
-    "### Encoding Function\n",
-    "\n",
-    "The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.\n",
+    "1. The retriever is optimized for **question-question matching** rather than direct document retrieval.\n",
+    "2. The FAISS index enables **efficient nearest-neighbor** search over the hypothetical prompt embeddings.\n",
+    "3. Retrieved chunks provide a **richer and more precise context** for downstream LLM generation.\n",
     "\n",
     "## Key Features\n",
     "\n",
-    "1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.\n",
-    "2. Configurable Chunking: Allows adjustment of chunk size and overlap.\n",
-    "3. Efficient Retrieval: Uses FAISS for fast similarity search.\n",
-    "4. Evaluation: Includes a function to evaluate the RAG system's performance.\n",
+    "1. **Precomputed Hypothetical Prompts** – Improves query alignment without runtime overhead.\n",
+    "2. **Multi-Vector Representation** – Each chunk is indexed multiple times for broader semantic coverage (see the sketch after this list).\n",
+    "3. **Efficient Retrieval** – FAISS ensures fast similarity search over the enhanced embeddings.\n",
+    "4. **Modular Design** – The pipeline is easy to adapt for different datasets and retrieval settings. Additionally, it is compatible with most optimizations, such as reranking.\n",
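+    "\n",
+    "A minimal sketch of the indexing idea (hypothetical helper names, for illustration only):\n",
+    "\n",
+    "```python\n",
+    "for chunk in chunks:\n",
+    "    for question in generate_questions(chunk):  # LLM-generated hypothetical prompts\n",
+    "        index.add(embed(question), payload=chunk)  # each question maps back to its chunk\n",
+    "```\n",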
     "\n",
     "## Usage Example\n",
     "\n",
@@ -55,18 +55,27 @@
     "\n",
     "## Evaluation\n",
     "\n",
-    "The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.\n",
+    "HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
+    "\n",
+    "- Up to 42 percentage points improvement in retrieval precision\n",
+    "- Up to 45 percentage points improvement in claim recall\n",
+    "- Compatibility with reranking, multi-vector retrieval, and other RAG optimizations\n",
+    "  (See the full evaluation results in the [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
     "\n",
     "## Benefits of this Approach\n",
     "\n",
-    "1. Scalability: Can handle large documents by processing them in chunks.\n",
-    "2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
-    "3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
-    "4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
+    "1. **Eliminates Query-Time Overhead** – All hypothetical generation is done offline at indexing.\n",
+    "2. **Enhanced Retrieval Precision** – Better alignment between queries and stored content.\n",
+    "3. **Scalable & Efficient** – No additional per-query computational cost; retrieval is as fast as standard RAG.\n",
+    "4. **Flexible & Extensible** – Can be combined with advanced RAG techniques like reranking.\n",
     "\n",
     "## Conclusion\n",
     "\n",
-    "This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections."
+    "HyPE provides a scalable and efficient alternative to traditional RAG systems, overcoming query-document style mismatch while avoiding the computational cost of runtime query expansion.
By moving hypothetical prompt generation to indexing, it significantly enhances retrieval precision and efficiency, making it a practical solution for real-world applications.\n", + "\n", + "For further details, refer to the full paper: [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)\n", + "\n", + "\n" ] }, { @@ -78,13 +87,17 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", + "import faiss\n", + "from tqdm import tqdm\n", "from dotenv import load_dotenv\n", + "from concurrent.futures import ThreadPoolExecutor, as_completed\n", + "from langchain_community.docstore.in_memory import InMemoryDocstore\n", "\n", "\n", "# Load environment variables from a .env file\n", @@ -110,11 +123,13 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 64, "metadata": {}, "outputs": [], "source": [ - "path = \"../data/Understanding_Climate_Change.pdf\"" + "path = \"../data/Understanding_Climate_Change.pdf\"\n", + "language_model_name = \"gpt-4o-mini\"\n", + "embedding_model_name = \"text-embedding-3-small\"" ] }, { @@ -126,7 +141,101 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 65, + "metadata": {}, + "outputs": [], + "source": [ + "def generate_hypothetical_prompt_embeddings(chunk_text: str):\n", + " \"\"\"\n", + " Uses the LLM to generate multiple hypothetical questions for a single chunk.\n", + " These questions will be used as 'proxies' for the chunk during retrieval.\n", + "\n", + " Parameters:\n", + " chunk_text (str): Text contents of the chunk\n", + "\n", + " Returns:\n", + " chunk_text (str): Text contents of the chunk. This is done to make the \n", + " multithreading easier\n", + " hypothetical prompt embeddings (List[float]): A list of embedding vectors\n", + " generated from the questions\n", + " \"\"\"\n", + " llm = ChatOpenAI(temperature=0, model_name=language_model_name)\n", + " embedding_model = OpenAIEmbeddings(model=embedding_model_name)\n", + "\n", + " question_gen_prompt = PromptTemplate.from_template(\n", + " \"Analyze the input text and generate essential questions that, when answered, \\\n", + " capture the main points of the text. 
Each question should be one line, \\\n",
+    "    without numbering or prefixes.\\n\\n \\\n",
+    "    Text:\\n{chunk_text}\\n\\nQuestions:\\n\"\n",
+    "    )\n",
+    "    question_chain = question_gen_prompt | llm | StrOutputParser()\n",
+    "\n",
+    "    # parse questions from response\n",
+    "    # Notes: \n",
+    "    # - gpt40 likes to split questions by \\n\\n so we remove one \\n\n",
+    "    # - for producetion or if using smaller models from ollama, it's beneficial to use regex to parse \n",
+    "    # things like (un)ordered lists\n",
+    "    # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
+    "    questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
+    "    \n",
+    "    return chunk_text, embedding_model.embed_documents(questions)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
+    "def prepare_vector_store(chunks: List[str]):\n",
+    "    \"\"\"\n",
+    "    Creates and populates a FAISS vector store from a list of text chunks.\n",
+    "\n",
+    "    This function processes a list of text chunks in parallel, generating \n",
+    "    hypothetical prompt embeddings for each chunk.\n",
+    "    The embeddings are stored in a FAISS index for efficient similarity search.\n",
+    "\n",
+    "    Parameters:\n",
+    "    chunks (List[str]): A list of text chunks to be embedded and stored.\n",
+    "\n",
+    "    Returns:\n",
+    "    FAISS: A FAISS vector store containing the embedded text chunks.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    # Wait with initialization to see vector lengths\n",
+    "    vector_store = None \n",
+    "\n",
+    "    with ThreadPoolExecutor() as pool: \n",
+    "        # Use threading to speed up generation of prompt embeddings\n",
+    "        futures = [pool.submit(generate_hypothetical_prompt_embeddings, c) for c in chunks]\n",
+    "        \n",
+    "        # Process embeddings as they complete\n",
+    "        for f in tqdm(as_completed(futures), total=len(chunks)): \n",
+    "            \n",
+    "            chunk, vectors = f.result()  # Retrieve the processed chunk and its embeddings\n",
+    "            \n",
+    "            # Initialize the FAISS vector store on the first chunk\n",
+    "            if vector_store is None: \n",
+    "                vector_store = FAISS(\n",
+    "                    embedding_function=OpenAIEmbeddings(model=embedding_model_name),  # Define embedding model\n",
+    "                    index=faiss.IndexFlatL2(len(vectors[0])),  # Define an L2 index for similarity search\n",
+    "                    docstore=InMemoryDocstore(),  # Use in-memory document storage\n",
+    "                    index_to_docstore_id={}  # Maintain index-to-document mapping\n",
+    "                )\n",
+    "            \n",
+    "            # Pair the chunk's content with each generated embedding vector.\n",
+    "            # Each chunk is inserted multiple times, once for each vector\n",
+    "            chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
+    "            \n",
+    "            # Add embeddings to the store\n",
+    "            vector_store.add_embeddings(chunks_with_embedding_vectors) \n",
+    "\n",
+    "    return vector_store  # Return the populated vector store\n"
   ]
  },
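  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
+    "A hedged usage sketch for `prepare_vector_store` (toy input, not executed here since it would call the OpenAI API). Note that despite the `List[str]` hint, the function reads `chunk.page_content`, so `Document` objects are expected:\n",
+    "\n",
+    "```python\n",
+    "from langchain.docstore.document import Document\n",
+    "\n",
+    "toy_chunks = [Document(page_content=\"Greenhouse gases trap heat in the atmosphere.\")]\n",
+    "store = prepare_vector_store(toy_chunks)  # one FAISS entry per generated question\n",
+    "```\n"
   ]
  },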
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -154,22 +263,28 @@
     "    texts = text_splitter.split_documents(documents)\n",
     "    cleaned_texts = replace_t_with_space(texts)\n",
     "\n",
-    "    # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
-    "    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
-    "    #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
-    "\n",
-    "    # Create vector store\n",
-    "    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
+    "    vectorstore = prepare_vector_store(cleaned_texts)\n",
     "\n",
     "    return vectorstore"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 71,
   "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 97/97 [00:22<00:00, 4.40it/s]\n"
+     ]
+    }
+   ],
   "source": [
+    "# Chunk size can be quite large with HyPE, as we are not losing precision with more\n",
+    "# information. For production, test how exhaustive your model is in generating a sufficient\n",
+    "# number of questions per chunk. This will mostly depend on your information density.\n",
     "chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
@@ -182,11 +297,11 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
-    "chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 2})"
+    "chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 3})"
   ]
  },
@@ -198,22 +313,30 @@
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test retriever"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 80,
   "metadata": {},
   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "c:\\Users\\N7\\PycharmProjects\\llm_tasks\\RAG_TECHNIQUES\\.venv\\Lib\\site-packages\\langchain_core\\_api\\deprecation.py:139: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.\n",
-      "  warn_deprecated(\n"
-     ]
-    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Context 1:\n",
+      "Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
+      "change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
+      "began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
+      "unprecedented changes. \n",
+      "Modern Observations \n",
+      "Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
+      "and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
+      "documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
+      "provide a historical record that scientists use to understand past climate conditions and \n",
+      "predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
+      "driven by human activities, particularly the emission of greenhou se gases. \n",
+      "Chapter 2: Causes of Climate Change \n",
+      "Greenhouse Gases\n",
+      "\n",
+      "\n",
+      "Context 2:\n",
      "driven by human activities, particularly the emission of greenhou se gases. \n",
      "Chapter 2: Causes of Climate Change \n",
      "Greenhouse Gases \n",
      "The primary cause of recent climate change is the increase in greenhouse gases in the \n",
      "atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
      "oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
      "for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
      "activities have intensified this natural process, leading to a warmer climate. \n",
      "Fossil Fuels \n",
      "Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
      "natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
      "the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
      "today. \n",
      "Coal\n",
      "\n",
      "\n",
-      "Context 2:\n",
+      "Context 3:\n",
+      "Understanding Climate Change \n",
+      "Chapter 1: Introduction to Climate Change \n",
+      "Climate change refers to significant, long -term changes in the global climate. The term \n",
+      "\"global climate\" encompasses the planet's overall weather patterns, including temperature, \n",
+      "precipitation, and wind patterns, over an extended period. Over the past cent ury, human \n",
+      "activities, particularly the burning of fossil fuels and deforestation, have significantly \n",
+      "contributed to climate change. \n",
+      "Historical Context \n",
+      "The Earth's climate has changed throughout history.
Over the past 650,000 years, there have \n", + "been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n", + "11,700 years ago marking the beginning of the modern climate era and human civilization. \n", "Most of these climate changes are attributed to very small variations in Earth's orbit that \n", - "change the amount of solar energy our planet receives. During the Holocene epoch, which \n", - "began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n", - "unprecedented changes. \n", - "Modern Observations \n", - "Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n", - "and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n", - "documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n", - "provide a historical record that scientists use to understand past climate conditions and \n", - "predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n", - "driven by human activities, particularly the emission of greenhou se gases. \n", - "Chapter 2: Causes of Climate Change \n", - "Greenhouse Gases\n", + "change the amount of solar energy our planet receives. During the Holocene epoch, which\n", "\n", "\n" ] @@ -252,6 +375,8 @@ "source": [ "test_query = \"What is the main cause of climate change?\"\n", "context = retrieve_context_per_question(test_query, chunks_query_retriever)\n", + "# deduplication might be beneficial as it is possible to retrieve the same chunk multiple times\n", + "context = list(set(context))\n", "show_context(context)" ] }, @@ -264,9 +389,74 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 76, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'questions': ['1. **Multiple Choice: Causes of Climate Change**',\n", + " ' - What is the primary cause of the current climate change trend?',\n", + " ' A) Solar radiation variations',\n", + " ' B) Natural cycles of the Earth',\n", + " ' C) Human activities, such as burning fossil fuels',\n", + " ' D) Volcanic eruptions',\n", + " '',\n", + " '2. **True or False: Impact on Biodiversity**',\n", + " ' - True or False: Climate change does not have any significant impact on the migration patterns and extinction rates of various species.',\n", + " '',\n", + " '3. **Short Answer: Mitigation Strategies**',\n", + " ' - What are two effective strategies that can be implemented at a community level to mitigate the effects of climate change?',\n", + " '',\n", + " '4. **Matching: Climate Change Effects**',\n", + " ' - Match the following effects of climate change (numbered) with their likely consequences (lettered).',\n", + " ' 1. Rising sea levels',\n", + " ' 2. Increased frequency of extreme weather events',\n", + " ' 3. Melting polar ice caps',\n", + " ' 4. Ocean acidification',\n", + " ' ',\n", + " ' A) Displacement of coastal communities',\n", + " ' B) Loss of marine biodiversity',\n", + " ' C) Increased global temperatures',\n", + " ' D) More frequent and severe hurricanes and floods',\n", + " '',\n", + " '5. **Essay: International Cooperation**',\n", + " ' - Discuss the importance of international cooperation in combating climate change. 
Include examples of successful global agreements or initiatives and explain how they have contributed to addressing climate change.'],\n", + " 'results': ['```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n", + " '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```'],\n", + " 'average_scores': None}" + ] + }, + "execution_count": 76, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "#Note - this currently works with OPENAI only\n", "evaluate_rag(chunks_query_retriever)" @@ -289,7 +479,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.3" + "version": "3.10.12" } }, "nbformat": 4, diff --git a/all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embeddings.py b/all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embeddings.py new file mode 100644 index 0000000..c411bf1 --- /dev/null +++ b/all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embeddings.py @@ -0,0 
+1,203 @@ +import os +import sys +import argparse +import time +import faiss +from dotenv import load_dotenv +from tqdm import tqdm +from concurrent.futures import ThreadPoolExecutor, as_completed +from langchain_community.docstore.in_memory import InMemoryDocstore + +# Add the parent directory to the path since we work with notebooks +sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) + +from helper_functions import * +from evaluation.evalute_rag import * + +# Load environment variables from a .env file (e.g., OpenAI API key) +load_dotenv() +os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY') + +class HyPE: + """ + A class to handle the HyPE RAG process, which enhances document chunking by + generating hypothetical questions as proxies for retrieval. + """ + + def __init__(self, path, chunk_size=1000, chunk_overlap=200, n_retrieved=3): + """ + Initializes the HyPE-based RAG retriever by encoding the PDF document with + hypothetical prompt embeddings. + + Args: + path (str): Path to the PDF file to encode. + chunk_size (int): Size of each text chunk (default: 1000). + chunk_overlap (int): Overlap between consecutive chunks (default: 200). + n_retrieved (int): Number of chunks to retrieve for each query (default: 3). + """ + print("\n--- Initializing HyPE RAG Retriever ---") + + # Encode the PDF document into a FAISS vector store using hypothetical prompt embeddings + start_time = time.time() + self.vector_store = self.encode_pdf(path, chunk_size=chunk_size, chunk_overlap=chunk_overlap) + self.time_records = {'Chunking': time.time() - start_time} + print(f"Chunking Time: {self.time_records['Chunking']:.2f} seconds") + + # Create a retriever from the vector store + self.chunks_query_retriever = self.vector_store.as_retriever(search_kwargs={"k": n_retrieved}) + + def generate_hypothetical_prompt_embeddings(self, chunk_text): + """ + Uses an LLM to generate multiple hypothetical questions for a single chunk. + These questions act as 'proxies' for the chunk during retrieval. + + Parameters: + chunk_text (str): Text contents of the chunk. + + Returns: + tuple: (Original chunk text, List of embedding vectors generated from the questions) + """ + llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini") + embedding_model = OpenAIEmbeddings(model="text-embedding-3-small") + + question_gen_prompt = PromptTemplate.from_template( + "Analyze the input text and generate essential questions that, when answered, \ + capture the main points of the text. Each question should be one line, \ + without numbering or prefixes.\n\n \ + Text:\n{chunk_text}\n\nQuestions:\n" + ) + question_chain = question_gen_prompt | llm | StrOutputParser() + + # Parse questions from response + questions = question_chain.invoke({"chunk_text": chunk_text}).replace("\n\n", "\n").split("\n") + + return chunk_text, embedding_model.embed_documents(questions) + + def prepare_vector_store(self, chunks): + """ + Creates and populates a FAISS vector store using hypothetical prompt embeddings. + + Parameters: + chunks (List[str]): A list of text chunks to be embedded and stored. + + Returns: + FAISS: A FAISS vector store containing the embedded text chunks. 
+ """ + vector_store = None # Wait to initialize to determine vector size + + with ThreadPoolExecutor() as pool: + # Parallelized embedding generation + futures = [pool.submit(self.generate_hypothetical_prompt_embeddings, c) for c in chunks] + + for f in tqdm(as_completed(futures), total=len(chunks)): + chunk, vectors = f.result() # Retrieve processed chunk and embeddings + + # Initialize FAISS store once vector size is known + if vector_store is None: + vector_store = FAISS( + embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"), + index=faiss.IndexFlatL2(len(vectors[0])), + docstore=InMemoryDocstore(), + index_to_docstore_id={} + ) + + # Store multiple vector representations per chunk + chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors] + vector_store.add_embeddings(chunks_with_embedding_vectors) + + return vector_store + + def encode_pdf(self, path, chunk_size=1000, chunk_overlap=200): + """ + Encodes a PDF document into a vector store using hypothetical prompt embeddings. + + Args: + path: The path to the PDF file. + chunk_size: The size of each text chunk. + chunk_overlap: The overlap between consecutive chunks. + + Returns: + A FAISS vector store containing the encoded book content. + """ + # Load PDF documents + loader = PyPDFLoader(path) + documents = loader.load() + + # Split documents into chunks + text_splitter = RecursiveCharacterTextSplitter( + chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len + ) + texts = text_splitter.split_documents(documents) + cleaned_texts = replace_t_with_space(texts) + + return self.prepare_vector_store(cleaned_texts) + + def run(self, query): + """ + Retrieves and displays the context for the given query. + + Args: + query (str): The query to retrieve context for. 
+
+        Returns:
+            None
+        """
+        # Measure retrieval time
+        start_time = time.time()
+        context = retrieve_context_per_question(query, self.chunks_query_retriever)
+        self.time_records['Retrieval'] = time.time() - start_time
+        print(f"Retrieval Time: {self.time_records['Retrieval']:.2f} seconds")
+
+        # Deduplicate context and display results
+        context = list(set(context))
+        show_context(context)
+
+
+def validate_args(args):
+    if args.chunk_size <= 0:
+        raise ValueError("chunk_size must be a positive integer.")
+    if args.chunk_overlap < 0:
+        raise ValueError("chunk_overlap must be a non-negative integer.")
+    if args.n_retrieved <= 0:
+        raise ValueError("n_retrieved must be a positive integer.")
+    return args
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Encode a PDF document and test a HyPE-based RAG system.")
+    parser.add_argument("--path", type=str, default="../data/Understanding_Climate_Change.pdf",
+                        help="Path to the PDF file to encode.")
+    parser.add_argument("--chunk_size", type=int, default=1000,
+                        help="Size of each text chunk (default: 1000).")
+    parser.add_argument("--chunk_overlap", type=int, default=200,
+                        help="Overlap between consecutive chunks (default: 200).")
+    parser.add_argument("--n_retrieved", type=int, default=3,
+                        help="Number of chunks to retrieve for each query (default: 3).")
+    parser.add_argument("--query", type=str, default="What is the main cause of climate change?",
+                        help="Query to test the retriever (default: 'What is the main cause of climate change?').")
+    parser.add_argument("--evaluate", action="store_true",
+                        help="Whether to evaluate the retriever's performance (default: False).")
+
+    return validate_args(parser.parse_args())
+
+
+def main(args):
+    # Initialize the HyPE-based RAG Retriever
+    hyperag = HyPE(
+        path=args.path,
+        chunk_size=args.chunk_size,
+        chunk_overlap=args.chunk_overlap,
+        n_retrieved=args.n_retrieved
+    )
+
+    # Retrieve context based on the query
+    hyperag.run(args.query)
+
+    # Evaluate the retriever's performance on the query (if requested)
+    if args.evaluate:
+        evaluate_rag(hyperag.chunks_query_retriever)
+
+
+if __name__ == '__main__':
+    # Call the main function with parsed arguments
+    main(parse_args())

From 91a8a893028b999bf3327cfc29ddca43818c1ed1 Mon Sep 17 00:00:00 2001
From: VakeDomen
Date: Wed, 19 Feb 2025 20:57:57 +0100
Subject: [PATCH 3/5] readme

---
 README.md | 65 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 8b9cb2b..9a47da0 100644
--- a/README.md
+++ b/README.md
@@ -153,7 +153,24 @@ Explore the extensive list of cutting-edge RAG techniques:

 ### 📚 Context and Content Enrichment

-8. **[Contextual Chunk Headers :label:](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_chunk_headers.ipynb)**
+8. Hypothetical Prompt Embeddings (HyPE) ❓🚀
+   - **[LangChain](all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb)**
+   - **[Runnable Script](all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embeddings.py)**
+
+   #### Overview 🔎
+   HyPE (Hypothetical Prompt Embeddings) is an enhancement to traditional RAG retrieval that **precomputes hypothetical prompts at the indexing stage**, inserting the original chunk in their place. This transforms retrieval into a **question-question matching task**, avoiding the need for runtime synthetic answer generation and reducing inference-time computational overhead while **improving retrieval alignment**.
+ + #### Implementation 🛠️ + - 📖 **Precomputed Questions:** Instead of embedding document chunks, HyPE **generates multiple hypothetical queries per chunk** at indexing time. + - 🔍 **Question-Question Matching:** User queries are matched against stored hypothetical questions, leading to **better retrieval alignment**. + - ⚡ **No Runtime Overhead:** Unlike HyDE, HyPE does **not require LLM calls at query time**, making retrieval **faster and cheaper**. + - 📈 **Higher Precision & Recall:** Improves retrieval **context precision by up to 42 percentage points** and **claim recall by up to 45 percentage points**. + + #### Additional Resources 📚 + - **[Preprint: Hypothetical Prompt Embeddings (HyPE)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)** - Research paper detailing the method, evaluation, and benchmarks. + + +9. **[Contextual Chunk Headers :label:](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_chunk_headers.ipynb)** #### Overview 🔎 Contextual chunk headers (CCH) is a method of creating document-level and section-level context, and prepending those chunk headers to the chunks prior to embedding them. @@ -164,7 +181,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Additional Resources 📚 **[dsRAG](https://github.com/D-Star-AI/dsRAG)**: open-source retrieval engine that implements this technique (and a few other advanced RAG techniques) -9. **[Relevant Segment Extraction 🧩](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/relevant_segment_extraction.ipynb)** +10. **[Relevant Segment Extraction 🧩](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/relevant_segment_extraction.ipynb)** #### Overview 🔎 Relevant segment extraction (RSE) is a method of dynamically constructing multi-chunk segments of text that are relevant to a given query. @@ -172,7 +189,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Perform a retrieval post-processing step that analyzes the most relevant chunks and identifies longer multi-chunk segments to provide more complete context to the LLM. -10. Context Enrichment Techniques 📝 +11. Context Enrichment Techniques 📝 - **[LangChain](all_rag_techniques/context_enrichment_window_around_chunk.ipynb)** - **[LlamaIndex](all_rag_techniques/context_enrichment_window_around_chunk_with_llamaindex.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/context_enrichment_window_around_chunk.py)** @@ -183,7 +200,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Retrieve the most relevant sentence while also accessing the sentences before and after it in the original text. -11. Semantic Chunking 🧠 +12. Semantic Chunking 🧠 - **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/semantic_chunking.ipynb)** - **[Runnable Script](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/semantic_chunking.py)** @@ -196,7 +213,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Additional Resources 📚 - **[Semantic Chunking: Improving AI Information Retrieval](https://open.substack.com/pub/diamantai/p/semantic-chunking-improving-ai-information?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the benefits and implementation of semantic chunking in RAG systems. -12. Contextual Compression 🗜️ +13. 
Contextual Compression 🗜️ - **[LangChain](all_rag_techniques/contextual_compression.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/contextual_compression.py)** @@ -206,7 +223,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Use an LLM to compress or summarize retrieved chunks, preserving key information relevant to the query. -13. Document Augmentation through Question Generation for Enhanced Retrieval +14. Document Augmentation through Question Generation for Enhanced Retrieval - **[LangChain](all_rag_techniques/document_augmentation.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/document_augmentation.py)** @@ -218,7 +235,7 @@ Explore the extensive list of cutting-edge RAG techniques: ### 🚀 Advanced Retrieval Methods -14. Fusion Retrieval 🔗 +15. Fusion Retrieval 🔗 - **[LangChain](all_rag_techniques/fusion_retrieval.ipynb)** - **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/fusion_retrieval_with_llamaindex.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/fusion_retrieval.py)** @@ -229,7 +246,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Combine keyword-based search with vector-based search for more comprehensive and accurate retrieval. -15. Intelligent Reranking 📈 +16. Intelligent Reranking 📈 - **[LangChain](all_rag_techniques/reranking.ipynb)** - **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/reranking_with_llamaindex.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/reranking.py)** @@ -245,7 +262,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Additional Resources 📚 - **[Relevance Revolution: How Re-ranking Transforms RAG Systems](https://open.substack.com/pub/diamantai/p/relevance-revolution-how-re-ranking?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of re-ranking in enhancing RAG system performance. -16. Multi-faceted Filtering 🔍 +17. Multi-faceted Filtering 🔍 #### Overview 🔎 Applying various filtering techniques to refine and improve the quality of retrieved results. @@ -256,7 +273,7 @@ Explore the extensive list of cutting-edge RAG techniques: - 📄 **Content Filtering:** Remove results that don't match specific content criteria or essential keywords. - 🌈 **Diversity Filtering:** Ensure result diversity by filtering out near-duplicate entries. -17. Hierarchical Indices 🗂️ +18. Hierarchical Indices 🗂️ - **[LangChain](all_rag_techniques/hierarchical_indices.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/hierarchical_indices.py)** @@ -269,7 +286,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Additional Resources 📚 - **[Hierarchical Indices: Enhancing RAG Systems](https://open.substack.com/pub/diamantai/p/hierarchical-indices-enhancing-rag?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of hierarchical indices in enhancing RAG system performance. -18. Ensemble Retrieval 🎭 +19. Ensemble Retrieval 🎭 #### Overview 🔎 Combining multiple retrieval models or techniques for more robust and accurate results. @@ -277,7 +294,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents. -19. Multi-modal Retrieval 📽️ +20. 
Multi-modal Retrieval 📽️ #### Overview 🔎 Extending RAG capabilities to handle diverse data types for richer responses. @@ -289,7 +306,7 @@ Explore the extensive list of cutting-edge RAG techniques: ### 🔁 Iterative and Adaptive Techniques -20. Retrieval with Feedback Loops 🔁 +21. Retrieval with Feedback Loops 🔁 - **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)** @@ -299,7 +316,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models. -21. Adaptive Retrieval 🎯 +22. Adaptive Retrieval 🎯 - **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)** @@ -309,7 +326,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences. -22. Iterative Retrieval 🔄 +23. Iterative Retrieval 🔄 #### Overview 🔎 Performing multiple rounds of retrieval to refine and enhance result quality. @@ -319,7 +336,7 @@ Explore the extensive list of cutting-edge RAG techniques: ### 📊 Evaluation -23. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘 +24. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘 #### Overview 🔎 Performing evaluations Retrieval-Augmented Generation systems, by covering several metrics and creating test cases. @@ -328,7 +345,7 @@ Explore the extensive list of cutting-edge RAG techniques: Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems. -24. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦 +25. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦 #### Overview 🔎 Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests. @@ -339,7 +356,7 @@ Explore the extensive list of cutting-edge RAG techniques: ### 🔬 Explainability and Transparency -25. Explainable Retrieval 🔍 +26. Explainable Retrieval 🔍 - **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)** @@ -351,7 +368,7 @@ Explore the extensive list of cutting-edge RAG techniques: ### 🏗️ Advanced Architectures -26. Knowledge Graph Integration (Graph RAG) 🕸️ +27. Knowledge Graph Integration (Graph RAG) 🕸️ - **[LangChain](all_rag_techniques/graph_rag.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)** @@ -361,7 +378,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses. -27. GraphRag (Microsoft) 🎯 +28. GraphRag (Microsoft) 🎯 - **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)** #### Overview 🔎 @@ -370,7 +387,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ • Analyze an input corpus by extracting entities, relationships from text units. generates summaries of each community and its constituents from the bottom-up. -28. 
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳 +29. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳 - **[LangChain](all_rag_techniques/raptor.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)** @@ -380,7 +397,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context. -29. Self RAG 🔁 +30. Self RAG 🔁 - **[LangChain](all_rag_techniques/self_rag.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)** @@ -390,7 +407,7 @@ Explore the extensive list of cutting-edge RAG techniques: #### Implementation 🛠️ • Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs. -30. Corrective RAG 🔧 +31. Corrective RAG 🔧 - **[LangChain](all_rag_techniques/crag.ipynb)** - **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)** @@ -402,7 +419,7 @@ Explore the extensive list of cutting-edge RAG techniques: ## 🌟 Special Advanced Technique 🌟 -31. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)** +32. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)** #### Overview 🔎 An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data. From 165876797c1ea7d3e7a2b6eba83577d389fbcf07 Mon Sep 17 00:00:00 2001 From: VakeDomen Date: Wed, 19 Feb 2025 21:05:40 +0100 Subject: [PATCH 4/5] hype image --- .../HyPE_Hypothetical_Prompt_Embeddings.ipynb | 6 +++++- images/hype.svg | 21 +++++++++++++++++++ 2 files changed, 26 insertions(+), 1 deletion(-) create mode 100644 images/hype.svg diff --git a/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb index 95de518..0cb49ac 100644 --- a/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb +++ b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb @@ -75,7 +75,11 @@ "\n", "For further details, refer to the full paper: [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)\n", "\n", - "\n" + "\n", + "
\n", + "\n", + "\"HyPE\"\n", + "
" ] }, { diff --git a/images/hype.svg b/images/hype.svg new file mode 100644 index 0000000..e71fc30 --- /dev/null +++ b/images/hype.svg @@ -0,0 +1,21 @@ + + + + + + + + [0.33, -0.69, 0.89, ..., 0.22, 0.66][0.31, -0.74, 0.91, ..., 0.20, 0.63][0.35, -0.76, 0.89, ..., 0.22, 0.60][0.38, -0.71, 0.91, ..., 0.24, 0.62][0.39, -0.76, 0.88, ..., 0.20, 0.59][0.33, -0.69, 0.89, ..., 0.22, 0.66][0.31, -0.74, 0.91, ..., 0.20, 0.63][0.35, -0.76, 0.89, ..., 0.22, 0.60][0.38, -0.71, 0.91, ..., 0.24, 0.62][0.39, -0.76, 0.88, ..., 0.20, 0.59]Retrieval-Augmented Generation (RAG) isa technique that grants generativeartificial intelligence modelsinformation retrieval capabilities. Itmodifies interactions with a largelanguage model (LLM) so that the modelresponds to user queries with referenceto a specified set of documents, usingthis information to augment informationdrawn from its own vast, static trainingdata. This allows LLMs to use domain-specific and/or updated information. Usecases include providing chatbot accessto internal company data or givingfactual information only from anauthoritative source.How does Retrieval-Augmented Generation (RAG)enhance generative artificial intelligencemodels?How does Retrieval-Augmented Generation (RAG)modify interactions with a large language model(LLM)?What are some use cases of Retrieval-AugmentedGeneration (RAG)?How does Retrieval-Augmented Generation (RAG)ensure that a model provides factual informationonly from an authoritative source?How can Retrieval-Augmented Generation (RAG) beused to provide chatbot access to internalcompany data?[0.33, -0.69, 0.89, ..., 0.22, 0.66][0.31, -0.74, 0.91, ..., 0.20, 0.63][0.35, -0.76, 0.89, ..., 0.22, 0.60][0.38, -0.71, 0.91, ..., 0.24, 0.62][0.39, -0.76, 0.88, ..., 0.20, 0.59]............................................. \ No newline at end of file From 57e9dcc87a966b67c12d325f5585c65b55b6bb6f Mon Sep 17 00:00:00 2001 From: VakeDomen Date: Mon, 10 Mar 2025 13:28:16 +0000 Subject: [PATCH 5/5] improved markdown --- .../HyPE_Hypothetical_Prompt_Embeddings.ipynb | 117 ++++++++++++++---- 1 file changed, 92 insertions(+), 25 deletions(-) diff --git a/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb index 0cb49ac..a4bb57b 100644 --- a/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb +++ b/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb @@ -10,7 +10,7 @@ "\n", "This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n", "\n", - "## Key Components\n", + "## Key Components of notebook\n", "\n", "1. PDF processing and text extraction\n", "2. Text chunking to maintain coherent information units\n", @@ -28,7 +28,7 @@ "\n", "### Hypothetical Question Generation\n", "\n", - "Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions **simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n", + "Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. 
These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the need for the runtime synthetic answer generation used in techniques like HyDE.\n", "\n", "### Vector Store Creation\n", "\n", @@ -45,21 +45,16 @@ "## Key Features\n", "\n", "1. **Precomputed Hypothetical Prompts** – Improves query alignment without runtime overhead.\n", - "2. **Multi-Vector Representation **– Each chunk is indexed multiple times for broader semantic coverage.\n", + "2. **Multi-Vector Representation** – Each chunk is indexed multiple times for broader semantic coverage.\n", "3. **Efficient Retrieval** – FAISS ensures fast similarity search over the enhanced embeddings.\n", "4. **Modular Design** – The pipeline is easy to adapt for different datasets and retrieval settings. Additionally, it is compatible with most optimizations, such as reranking.\n", "\n", - "## Usage Example\n", - "\n", - "The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n", - "\n", "## Evaluation\n", "\n", "HyPE's effectiveness is evaluated across multiple datasets, showing:\n", "\n", "- Up to 42 percentage points improvement in retrieval precision\n", "- Up to 45 percentage points improvement in claim recall\n", - "- Compatibility with reranking, multi-vector retrieval, and other RAG optimizations\n", " (See full evaluation results in [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n", "\n", "## Benefits of this Approach\n", "\n", @@ -78,7 +73,7 @@ "\n", "
\n", "\n", - "\"HyPE\"\n", + "\"HyPE\"\n", "
" ] }, @@ -122,7 +117,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Read Docs" + "### Define constants\n", + "\n", + "- `PATH`: path to the data, to be embedded into the RAG pipeline\n", + "\n", + "This tutorial uses OpenAI endpoint ([avalible models](https://platform.openai.com/docs/pricing)). \n", + "- `LANGUAGE_MODEL_NAME`: The name of the language model to be used. \n", + "- `EMBEDDING_MODEL_NAME`: The name of the embedding model to be used.\n", + "\n", + "The tutroial uses a `RecursiveCharacterTextSplitter` chunking approach where the chunking length function used is python `len` function. The chunking varables to be tweaked here are:\n", + "- `CHUNK_SIZE`: The minimum length of one chunk\n", + "- `CHUNK_OVERLAP`: The overlap of two consecutive chunks." ] }, { @@ -131,16 +136,26 @@ "metadata": {}, "outputs": [], "source": [ - "path = \"../data/Understanding_Climate_Change.pdf\"\n", - "language_model_name = \"gpt-4o-mini\"\n", - "embedding_model_name = \"text-embedding-3-small\"" + "PATH = \"../data/Understanding_Climate_Change.pdf\"\n", + "LANGUAGE_MODEL_NAME = \"gpt-4o-mini\"\n", + "EMBEDDING_MODEL_NAME = \"text-embedding-3-small\"\n", + "CHUNK_SIZE = 1000\n", + "CHUNK_OVERLAP = 200" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Encode document" + "### Define generation of Hypothetical Prompt Embeddings\n", + "\n", + "The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.\n", + "\n", + "- An LLM extracts key questions from the input chunk.\n", + "- These questions are embedded using OpenAI's model.\n", + "- The function returns the original chunk and its prompt embeddings later used for retrieval.\n", + "\n", + "To ensure clean output, extra newlines are removed, and regex parsing can improve list formatting when needed." 
] }, { @@ -163,8 +178,8 @@ " hypothetical prompt embeddings (List[float]): A list of embedding vectors\n", " generated from the questions\n", " \"\"\"\n", - " llm = ChatOpenAI(temperature=0, model_name=language_model_name)\n", - " embedding_model = OpenAIEmbeddings(model=embedding_model_name)\n", + " llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)\n", + " embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)\n", "\n", " question_gen_prompt = PromptTemplate.from_template(\n", " \"Analyze the input text and generate essential questions that, when answered, \\\n", @@ -176,8 +191,8 @@ "\n", " # parse questions from response\n", " # Notes: \n", - " # - gpt40 likes to split questions by \\n\\n so we remove one \\n\n", - " # - for producetion or if using smaller models from ollama, it's beneficial to use regex to parse \n", + " # - gpt4o likes to split questions by \\n\\n so we remove one \\n\n", + " # - for production or if using smaller models from ollama, it's beneficial to use regex to parse \n", " # things like (un)ordered lists\n", " # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n", " questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n", @@ -185,6 +200,23 @@ " return chunk_text, embedding_model.embed_documents(questions)\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define creation and population of FAISS Vector Store\n", + "\n", + "The code block below builds a FAISS vector store by embedding text chunks in parallel.\n", + "\n", + "What happens?\n", + "- Parallel processing – Uses threading to generate embeddings faster.\n", + "- FAISS initialization – Sets up an L2 index for efficient similarity search.\n", + "- Chunk embedding – Each chunk is stored multiple times, once for each generated question embedding.\n", + "- In-memory storage – Uses InMemoryDocstore for fast lookup.\n", + "\n", + "This ensures efficient retrieval, improving query alignment with precomputed question embeddings."
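Because the diff below only shows fragments of the vector-store function, here is a condensed sketch of the pattern it follows: hypothetical question embeddings generated in parallel and fed into a FAISS index. The helper names `build_hype_store` and `embed_chunk` are illustrative, and the import paths assume current `langchain-community`/`langchain-openai` packages:

```python
from concurrent.futures import ThreadPoolExecutor

import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

def build_hype_store(chunks, embed_chunk, max_workers=8):
    """Sketch: embed_chunk(text) -> (chunk_text, [question_vectors])."""
    vector_store = None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk_text, vectors in pool.map(lambda c: embed_chunk(c.page_content), chunks):
            if vector_store is None:
                # Infer the embedding dimension from the first returned vector
                vector_store = FAISS(
                    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
                    index=faiss.IndexFlatL2(len(vectors[0])),
                    docstore=InMemoryDocstore(),
                    index_to_docstore_id={},
                )
            # Insert the same chunk once per question vector, so any of its
            # hypothetical questions can match an incoming query
            vector_store.add_embeddings([(chunk_text, vec) for vec in vectors])
    return vector_store
```

Storing one entry per question trades index size for recall; deduplication at query time (as done later in the notebook) compensates for the same chunk surfacing through several of its questions.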
+ ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], @@ -221,14 +253,14 @@ " # Initialize the FAISS vector store on the first chunk\n", " if vector_store is None: \n", " vector_store = FAISS(\n", - " embedding_function=OpenAIEmbeddings(model=embedding_model_name), # Define embedding model\n", + " embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME), # Define embedding model\n", " index=faiss.IndexFlatL2(len(vectors[0])), # Define an L2 index for similarity search\n", " docstore=InMemoryDocstore(), # Use in-memory document storage\n", " index_to_docstore_id={} # Maintain index-to-document mapping\n", " )\n", " \n", " # Pair the chunk's content with each generated embedding vector.\n", - " # Each chunk is inserted multiple times, once for each vector\n", + " # Each chunk is inserted multiple times, once for each prompt vector\n", " chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n", " \n", " # Add embeddings to the store\n", @@ -237,6 +269,21 @@ " return vector_store # Return the populated vector store\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Encode PDF into a FAISS Vector Store\n", + "\n", + "The code block below processes a PDF file and stores its content as embeddings for retrieval.\n", + "\n", + "What happens?\n", + "- PDF loading – Extracts text from the document.\n", + "- Chunking – Splits text into overlapping segments for better context retention.\n", + "- Preprocessing – Cleans text to improve embedding quality.\n", + "- Vector store creation – Generates embeddings and stores them in FAISS for retrieval." + ] + }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], @@ -272,6 +319,16 @@ " return vectorstore" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create HyPE vector store\n", + "\n", + "Now we process the PDF and store its embeddings.\n", + "This step initializes the FAISS vector store with the encoded document." + ] + }, { "cell_type": "code", "execution_count": 71, @@ -289,14 +346,18 @@ "# Chunk size can be quite large with HyPE as we are not losing precision with more\n", "# information. For production, test how exhaustive your model is in generating a sufficient \n", "# number of questions per chunk. This will mostly depend on your information density.\n", - "chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)" + "chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Create retriever" + "### Create retriever\n", + "\n", + "Now we set up the retriever to fetch relevant chunks from the vector store.\n", + "\n", + "It retrieves the top `k=3` most relevant chunks based on query similarity." ] }, { @@ -312,7 +373,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Test retriever" + "### Test retriever\n", + "\n", + "Now we test retrieval using a sample query.\n", + "\n", + "- Queries the vector store to find the most relevant chunks.\n", + "- Deduplicates results to remove potentially repeated chunks.\n", + "- Displays the retrieved context for inspection.\n", + "\n", + "This step verifies that the retriever returns meaningful and diverse information for the given question."
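The retriever cell itself sits outside the hunks of this diff; with LangChain's standard vector-store API it would most plausibly take the following form (a sketch, assuming the usual `as_retriever` interface and matching the `k=3` described above):

```python
# Top-3 nearest question embeddings; each hit maps back to its source chunk
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 3})
```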
] }, { @@ -379,7 +448,6 @@ "source": [ "test_query = \"What is the main cause of climate change?\"\n", "context = retrieve_context_per_question(test_query, chunks_query_retriever)\n", - "# deduplication might be beneficial as it is possible to retrieve the same chunk multiple times\n", "context = list(set(context))\n", "show_context(context)" ] @@ -462,7 +530,6 @@ } ], "source": [ - "#Note - this currently works with OPENAI only\n", "evaluate_rag(chunks_query_retriever)" ] }