This commit is contained in:
VakeDomen
2025-02-19 20:40:18 +01:00
parent 06d2f16b4b
commit dfb0e9125b
2 changed files with 459 additions and 66 deletions

View File

@@ -4,50 +4,50 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple RAG (Retrieval-Augmented Generation) System\n",
"# Hypothetical Prompt Embeddings (HyPE)\n",
"\n",
"## Overview\n",
"\n",
"This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
"This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Text chunking for manageable processing\n",
"3. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
"4. Retriever setup for querying the processed documents\n",
"5. Evaluation of the RAG system\n",
"2. Text chunking to maintain coherent information units\n",
"3. **Hypothetical Prompt Embedding Generation** using an LLM to create multiple proxy questions per chunk\n",
"4. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
"5. Retriever setup for querying the processed documents\n",
"6. Evaluation of the RAG system\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is loaded using PyPDFLoader.\n",
"2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.\n",
"1. The PDF is loaded using `PyPDFLoader`.\n",
"2. The text is split into chunks using `RecursiveCharacterTextSplitter` with specified chunk size and overlap.\n",
"\n",
"### Text Cleaning\n",
"### Hypothetical Question Generation\n",
"\n",
"A custom function `replace_t_with_space` is applied to clean the text chunks. This likely addresses specific formatting issues in the PDF.\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions **simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
"1. Each hypothetical question is embedded using OpenAI embeddings.\n",
"2. A FAISS vector store is built, associating **each question embedding with its original chunk**.\n",
"3. This approach **stores multiple representations per chunk**, increasing retrieval flexibility.\n",
"\n",
"### Retriever Setup\n",
"\n",
"1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.\n",
"\n",
"### Encoding Function\n",
"\n",
"The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.\n",
"1. The retriever is optimized for **question-question matching** rather than direct document retrieval.\n",
"2. The FAISS index enables **efficient nearest-neighbor** search over the hypothetical prompt embeddings.\n",
"3. Retrieved chunks provide a **richer and more precise context** for downstream LLM generation.\n",
"\n",
"## Key Features\n",
"\n",
"1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.\n",
"2. Configurable Chunking: Allows adjustment of chunk size and overlap.\n",
"3. Efficient Retrieval: Uses FAISS for fast similarity search.\n",
"4. Evaluation: Includes a function to evaluate the RAG system's performance.\n",
"1. **Precomputed Hypothetical Prompts** Improves query alignment without runtime overhead.\n",
"2. **Multi-Vector Representation ** Each chunk is indexed multiple times for broader semantic coverage.\n",
"3. **Efficient Retrieval** FAISS ensures fast similarity search over the enhanced embeddings.\n",
"4. **Modular Design** The pipeline is easy to adapt for different datasets and retrieval settings. Additionally it's compatible with most optimizations like reranking etc.\n",
"\n",
"## Usage Example\n",
"\n",
@@ -55,18 +55,27 @@
"\n",
"## Evaluation\n",
"\n",
"The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.\n",
"HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
"\n",
"- Up to 42 percentage points improvement in retrieval precision\n",
"- Up to 45 percentage points improvement in claim recall\n",
"- Compatibility with reranking, multi-vector retrieval, and other RAG optimizations\n",
" (See full evaluation results in [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Scalability: Can handle large documents by processing them in chunks.\n",
"2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
"3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
"4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
"1. **Eliminates Query-Time Overhead** All hypothetical generation is done offline at indexing.\n",
"2. **Enhanced Retrieval Precision** Better alignment between queries and stored content.\n",
"3. **Scalable & Efficient** No addinal per-query computational cost; retrieval is as fast as standard RAG.\n",
"4. **Flexible & Extensible** Can be combined with advanced RAG techniques like reranking.\n",
"\n",
"## Conclusion\n",
"\n",
"This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections."
"HyPE provides a scalable and efficient alternative to traditional RAG systems, overcoming query-document style mismatch while avoiding the computational cost of runtime query expansion. By moving hypothetical prompt generation to indexing, it significantly enhances retrieval precision and efficiency, making it a practical solution for real-world applications.\n",
"\n",
"For further details, refer to the full paper: [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)\n",
"\n",
"\n"
]
},
{
@@ -78,13 +87,17 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import faiss\n",
"from tqdm import tqdm\n",
"from dotenv import load_dotenv\n",
"from concurrent.futures import ThreadPoolExecutor, as_completed\n",
"from langchain_community.docstore.in_memory import InMemoryDocstore\n",
"\n",
"\n",
"# Load environment variables from a .env file\n",
@@ -110,11 +123,13 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\""
"path = \"../data/Understanding_Climate_Change.pdf\"\n",
"language_model_name = \"gpt-4o-mini\"\n",
"embedding_model_name = \"text-embedding-3-small\""
]
},
{
@@ -126,7 +141,101 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"def generate_hypothetical_prompt_embeddings(chunk_text: str):\n",
" \"\"\"\n",
" Uses the LLM to generate multiple hypothetical questions for a single chunk.\n",
" These questions will be used as 'proxies' for the chunk during retrieval.\n",
"\n",
" Parameters:\n",
" chunk_text (str): Text contents of the chunk\n",
"\n",
" Returns:\n",
" chunk_text (str): Text contents of the chunk. This is done to make the \n",
" multithreading easier\n",
" hypothetical prompt embeddings (List[float]): A list of embedding vectors\n",
" generated from the questions\n",
" \"\"\"\n",
" llm = ChatOpenAI(temperature=0, model_name=language_model_name)\n",
" embedding_model = OpenAIEmbeddings(model=embedding_model_name)\n",
"\n",
" question_gen_prompt = PromptTemplate.from_template(\n",
" \"Analyze the input text and generate essential questions that, when answered, \\\n",
" capture the main points of the text. Each question should be one line, \\\n",
" without numbering or prefixes.\\n\\n \\\n",
" Text:\\n{chunk_text}\\n\\nQuestions:\\n\"\n",
" )\n",
" question_chain = question_gen_prompt | llm | StrOutputParser()\n",
"\n",
" # parse questions from response\n",
" # Notes: \n",
" # - gpt40 likes to split questions by \\n\\n so we remove one \\n\n",
" # - for producetion or if using smaller models from ollama, it's beneficial to use regex to parse \n",
" # things like (un)ordeed lists\n",
" # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
" questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
" \n",
" return chunk_text, embedding_model.embed_documents(questions)\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"def prepare_vector_store(chunks: List[str]):\n",
" \"\"\"\n",
" Creates and populates a FAISS vector store from a list of text chunks.\n",
"\n",
" This function processes a list of text chunks in parallel, generating \n",
" hypothetical prompt embeddings for each chunk.\n",
" The embeddings are stored in a FAISS index for efficient similarity search.\n",
"\n",
" Parameters:\n",
" chunks (List[str]): A list of text chunks to be embedded and stored.\n",
"\n",
" Returns:\n",
" FAISS: A FAISS vector store containing the embedded text chunks.\n",
" \"\"\"\n",
"\n",
" # Wait with initialization to see vector lengths\n",
" vector_store = None \n",
"\n",
" with ThreadPoolExecutor() as pool: \n",
" # Use threading to speed up generation of prompt embeddings\n",
" futures = [pool.submit(generate_hypothetical_prompt_embeddings, c) for c in chunks]\n",
" \n",
" # Process embeddings as they complete\n",
" for f in tqdm(as_completed(futures), total=len(chunks)): \n",
" \n",
" chunk, vectors = f.result() # Retrieve the processed chunk and its embeddings\n",
" \n",
" # Initialize the FAISS vector store on the first chunk\n",
" if vector_store == None: \n",
" vector_store = FAISS(\n",
" embedding_function=OpenAIEmbeddings(model=embedding_model_name), # Define embedding model\n",
" index=faiss.IndexFlatL2(len(vectors[0])) # Define an L2 index for similarity search\n",
" docstore=InMemoryDocstore(), # Use in-memory document storage\n",
" index_to_docstore_id={} # Maintain index-to-document mapping\n",
" )\n",
" \n",
" # Pair the chunk's content with each generated embedding vector.\n",
" # Each chunk is inserted multiple times, once for each vector\n",
" chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
" \n",
" # Add embeddings to the store\n",
" vector_store.add_embeddings(chunks_with_embedding_vectors) \n",
"\n",
" return vector_store # Return the populated vector store\n"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
@@ -154,22 +263,28 @@
" texts = text_splitter.split_documents(documents)\n",
" cleaned_texts = replace_t_with_space(texts)\n",
"\n",
" # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
" embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
" #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
"\n",
" # Create vector store\n",
" vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
" vectorstore = prepare_vector_store(cleaned_texts)\n",
"\n",
" return vectorstore"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 71,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 97/97 [00:22<00:00, 4.40it/s]\n"
]
}
],
"source": [
"# Chunk size can be quite large with HyPE as we are not loosing percision with more\n",
"# information. For production, test how exhaustive your model is in generating sufficient \n",
"# amount of questions per chunk. This will mostly depend on your information density.\n",
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
]
},
@@ -182,11 +297,11 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 2})"
"chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 3})"
]
},
{
@@ -198,22 +313,30 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\N7\\PycharmProjects\\llm_tasks\\RAG_TECHNIQUES\\.venv\\Lib\\site-packages\\langchain_core\\_api\\deprecation.py:139: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.\n",
" warn_deprecated(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"\n",
"\n",
"Context 2:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
@@ -230,20 +353,20 @@
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"Context 3:\n",
"Understanding Climate Change \n",
"Chapter 1: Introduction to Climate Change \n",
"Climate change refers to significant, long -term changes in the global climate. The term \n",
"\"global climate\" encompasses the planet's overall weather patterns, including temperature, \n",
"precipitation, and wind patterns, over an extended period. Over the past cent ury, human \n",
"activities, particularly the burning of fossil fuels and deforestation, have significantly \n",
"contributed to climate change. \n",
"Historical Context \n",
"The Earth's climate has changed throughout history. Over the past 650,000 years, there have \n",
"been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n",
"11,700 years ago marking the beginning of the modern climate era and human civilization. \n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which\n",
"\n",
"\n"
]
@@ -252,6 +375,8 @@
"source": [
"test_query = \"What is the main cause of climate change?\"\n",
"context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
"# deduplication might be beneficial as it is possible to retrieve the same chunk multiple times\n",
"context = list(set(context))\n",
"show_context(context)"
]
},
@@ -264,9 +389,74 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 76,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"{'questions': ['1. **Multiple Choice: Causes of Climate Change**',\n",
" ' - What is the primary cause of the current climate change trend?',\n",
" ' A) Solar radiation variations',\n",
" ' B) Natural cycles of the Earth',\n",
" ' C) Human activities, such as burning fossil fuels',\n",
" ' D) Volcanic eruptions',\n",
" '',\n",
" '2. **True or False: Impact on Biodiversity**',\n",
" ' - True or False: Climate change does not have any significant impact on the migration patterns and extinction rates of various species.',\n",
" '',\n",
" '3. **Short Answer: Mitigation Strategies**',\n",
" ' - What are two effective strategies that can be implemented at a community level to mitigate the effects of climate change?',\n",
" '',\n",
" '4. **Matching: Climate Change Effects**',\n",
" ' - Match the following effects of climate change (numbered) with their likely consequences (lettered).',\n",
" ' 1. Rising sea levels',\n",
" ' 2. Increased frequency of extreme weather events',\n",
" ' 3. Melting polar ice caps',\n",
" ' 4. Ocean acidification',\n",
" ' ',\n",
" ' A) Displacement of coastal communities',\n",
" ' B) Loss of marine biodiversity',\n",
" ' C) Increased global temperatures',\n",
" ' D) More frequent and severe hurricanes and floods',\n",
" '',\n",
" '5. **Essay: International Cooperation**',\n",
" ' - Discuss the importance of international cooperation in combating climate change. Include examples of successful global agreements or initiatives and explain how they have contributed to addressing climate change.'],\n",
" 'results': ['```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```'],\n",
" 'average_scores': None}"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Note - this currently works with OPENAI only\n",
"evaluate_rag(chunks_query_retriever)"
@@ -289,7 +479,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
"version": "3.10.12"
}
},
"nbformat": 4,

View File

@@ -0,0 +1,203 @@
import os
import sys
import argparse
import time
import faiss
from dotenv import load_dotenv
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
from langchain_community.docstore.in_memory import InMemoryDocstore
# Add the parent directory to the path since we work with notebooks
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from helper_functions import *
from evaluation.evalute_rag import *
# Load environment variables from a .env file (e.g., OpenAI API key)
load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
class HyPE:
"""
A class to handle the HyPE RAG process, which enhances document chunking by
generating hypothetical questions as proxies for retrieval.
"""
def __init__(self, path, chunk_size=1000, chunk_overlap=200, n_retrieved=3):
"""
Initializes the HyPE-based RAG retriever by encoding the PDF document with
hypothetical prompt embeddings.
Args:
path (str): Path to the PDF file to encode.
chunk_size (int): Size of each text chunk (default: 1000).
chunk_overlap (int): Overlap between consecutive chunks (default: 200).
n_retrieved (int): Number of chunks to retrieve for each query (default: 3).
"""
print("\n--- Initializing HyPE RAG Retriever ---")
# Encode the PDF document into a FAISS vector store using hypothetical prompt embeddings
start_time = time.time()
self.vector_store = self.encode_pdf(path, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
self.time_records = {'Chunking': time.time() - start_time}
print(f"Chunking Time: {self.time_records['Chunking']:.2f} seconds")
# Create a retriever from the vector store
self.chunks_query_retriever = self.vector_store.as_retriever(search_kwargs={"k": n_retrieved})
def generate_hypothetical_prompt_embeddings(self, chunk_text):
"""
Uses an LLM to generate multiple hypothetical questions for a single chunk.
These questions act as 'proxies' for the chunk during retrieval.
Parameters:
chunk_text (str): Text contents of the chunk.
Returns:
tuple: (Original chunk text, List of embedding vectors generated from the questions)
"""
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
question_gen_prompt = PromptTemplate.from_template(
"Analyze the input text and generate essential questions that, when answered, \
capture the main points of the text. Each question should be one line, \
without numbering or prefixes.\n\n \
Text:\n{chunk_text}\n\nQuestions:\n"
)
question_chain = question_gen_prompt | llm | StrOutputParser()
# Parse questions from response
questions = question_chain.invoke({"chunk_text": chunk_text}).replace("\n\n", "\n").split("\n")
return chunk_text, embedding_model.embed_documents(questions)
def prepare_vector_store(self, chunks):
"""
Creates and populates a FAISS vector store using hypothetical prompt embeddings.
Parameters:
chunks (List[str]): A list of text chunks to be embedded and stored.
Returns:
FAISS: A FAISS vector store containing the embedded text chunks.
"""
vector_store = None # Wait to initialize to determine vector size
with ThreadPoolExecutor() as pool:
# Parallelized embedding generation
futures = [pool.submit(self.generate_hypothetical_prompt_embeddings, c) for c in chunks]
for f in tqdm(as_completed(futures), total=len(chunks)):
chunk, vectors = f.result() # Retrieve processed chunk and embeddings
# Initialize FAISS store once vector size is known
if vector_store is None:
vector_store = FAISS(
embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
index=faiss.IndexFlatL2(len(vectors[0])),
docstore=InMemoryDocstore(),
index_to_docstore_id={}
)
# Store multiple vector representations per chunk
chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]
vector_store.add_embeddings(chunks_with_embedding_vectors)
return vector_store
def encode_pdf(self, path, chunk_size=1000, chunk_overlap=200):
"""
Encodes a PDF document into a vector store using hypothetical prompt embeddings.
Args:
path: The path to the PDF file.
chunk_size: The size of each text chunk.
chunk_overlap: The overlap between consecutive chunks.
Returns:
A FAISS vector store containing the encoded book content.
"""
# Load PDF documents
loader = PyPDFLoader(path)
documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
)
texts = text_splitter.split_documents(documents)
cleaned_texts = replace_t_with_space(texts)
return self.prepare_vector_store(cleaned_texts)
def run(self, query):
"""
Retrieves and displays the context for the given query.
Args:
query (str): The query to retrieve context for.
Returns:
None
"""
# Measure retrieval time
start_time = time.time()
context = retrieve_context_per_question(query, self.chunks_query_retriever)
self.time_records['Retrieval'] = time.time() - start_time
print(f"Retrieval Time: {self.time_records['Retrieval']:.2f} seconds")
# Deduplicate context and display results
context = list(set(context))
show_context(context)
def validate_args(args):
if args.chunk_size <= 0:
raise ValueError("chunk_size must be a positive integer.")
if args.chunk_overlap < 0:
raise ValueError("chunk_overlap must be a non-negative integer.")
if args.n_retrieved <= 0:
raise ValueError("n_retrieved must be a positive integer.")
return args
def parse_args():
parser = argparse.ArgumentParser(description="Encode a PDF document and test a HyPE-based RAG system.")
parser.add_argument("--path", type=str, default="../data/Understanding_Climate_Change.pdf",
help="Path to the PDF file to encode.")
parser.add_argument("--chunk_size", type=int, default=1000,
help="Size of each text chunk (default: 1000).")
parser.add_argument("--chunk_overlap", type=int, default=200,
help="Overlap between consecutive chunks (default: 200).")
parser.add_argument("--n_retrieved", type=int, default=3,
help="Number of chunks to retrieve for each query (default: 3).")
parser.add_argument("--query", type=str, default="What is the main cause of climate change?",
help="Query to test the retriever (default: 'What is the main cause of climate change?').")
parser.add_argument("--evaluate", action="store_true",
help="Whether to evaluate the retriever's performance (default: False).")
return validate_args(parser.parse_args())
def main(args):
# Initialize the HyPE-based RAG Retriever
hyperag = HyPE(
path=args.path,
chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap,
n_retrieved=args.n_retrieved
)
# Retrieve context based on the query
hyperag.run(args.query)
# Evaluate the retriever's performance on the query (if requested)
if args.evaluate:
evaluate_rag(hyperag.chunks_query_retriever)
if __name__ == '__main__':
# Call the main function with parsed arguments
main(parse_args())
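# Example invocation (the script filename here is hypothetical; adjust to the actual file name):
#   python hype_hypothetical_prompt_embeddings.py --path ../data/Understanding_Climate_Change.pdf --n_retrieved 3 --evaluate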