improved markdown
@@ -10,7 +10,7 @@
"\n",
"This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n",
"\n",
"## Key Components\n",
"## Key Components of notebook\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Text chunking to maintain coherent information units\n",
@@ -28,7 +28,7 @@
"\n",
"### Hypothetical Question Generation\n",
"\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions **simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"\n",
"### Vector Store Creation\n",
"\n",
@@ -45,21 +45,16 @@
"## Key Features\n",
"\n",
"1. **Precomputed Hypothetical Prompts** – Improves query alignment without runtime overhead.\n",
"2. **Multi-Vector Representation **– Each chunk is indexed multiple times for broader semantic coverage.\n",
"2. **Multi-Vector Representation**– Each chunk is indexed multiple times for broader semantic coverage.\n",
"3. **Efficient Retrieval** – FAISS ensures fast similarity search over the enhanced embeddings.\n",
"4. **Modular Design** – The pipeline is easy to adapt for different datasets and retrieval settings. Additionally it's compatible with most optimizations like reranking etc.\n",
"\n",
"## Usage Example\n",
"\n",
"The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n",
"\n",
"## Evaluation\n",
"\n",
"HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
"\n",
"- Up to 42 percentage points improvement in retrieval precision\n",
"- Up to 45 percentage points improvement in claim recall\n",
"- Compatibility with reranking, multi-vector retrieval, and other RAG optimizations\n",
" (See full evaluation results in [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
"\n",
"## Benefits of this Approach\n",
@@ -78,7 +73,7 @@
"\n",
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:40%; height:auto;\">\n",
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:70%; height:auto;\">\n",
"</div>"
]
},
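To make the question-question matching idea concrete, here is a minimal sketch of the indexing and querying flow; the `generate_questions`, `embed`, and `index` names are placeholders for illustration, not objects from this notebook.

```python
# Sketch only: HyPE indexes hypothetical questions instead of raw chunks.

def index_chunk(chunk_text, generate_questions, embed, index):
    """Indexing time: one vector per generated question, all pointing to the chunk."""
    for question in generate_questions(chunk_text):    # LLM call, done once offline
        index.add(vector=embed(question), payload=chunk_text)

def retrieve(query, embed, index, k=3):
    """Query time: plain nearest-neighbour search, i.e. question-to-question matching."""
    return index.search(vector=embed(query), k=k)      # no runtime query expansion
```

Because every question vector points back to its source chunk, a retrieved question immediately yields the chunk that is used as context.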
@@ -122,7 +117,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
"### Define constants\n",
"\n",
"- `PATH`: path to the data, to be embedded into the RAG pipeline\n",
"\n",
"This tutorial uses an OpenAI endpoint ([available models](https://platform.openai.com/docs/pricing)). \n",
"- `LANGUAGE_MODEL_NAME`: The name of the language model to be used. \n",
"- `EMBEDDING_MODEL_NAME`: The name of the embedding model to be used.\n",
"\n",
"The tutorial uses a `RecursiveCharacterTextSplitter` chunking approach where the chunking length function used is Python's `len` function. The chunking variables to be tweaked here are:\n",
"- `CHUNK_SIZE`: The maximum length of one chunk\n",
"- `CHUNK_OVERLAP`: The overlap of two consecutive chunks."
]
},
{
@@ -131,16 +136,26 @@
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\"\n",
"language_model_name = \"gpt-4o-mini\"\n",
"embedding_model_name = \"text-embedding-3-small\""
"PATH = \"../data/Understanding_Climate_Change.pdf\"\n",
"LANGUAGE_MODEL_NAME = \"gpt-4o-mini\"\n",
"EMBEDDING_MODEL_NAME = \"text-embedding-3-small\"\n",
"CHUNK_SIZE = 1000\n",
"CHUNK_OVERLAP = 200"
]
},
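As a rough illustration (not part of this commit), these constants would typically be passed to the splitter like this; depending on the installed LangChain version, the import may instead be `from langchain.text_splitter import RecursiveCharacterTextSplitter`.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sketch: character-based chunking driven by the constants defined above.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,        # upper bound on characters per chunk
    chunk_overlap=CHUNK_OVERLAP,  # characters shared between consecutive chunks
    length_function=len,          # plain `len` counts characters
)
```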
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
"### Define generation of Hypothetical Prompt Embeddings\n",
"\n",
"The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.\n",
"\n",
"- An LLM extracts key questions from the input chunk.\n",
"- These questions are embedded using OpenAI's model.\n",
"- The function returns the original chunk and its prompt embeddings, which are later used for retrieval.\n",
"\n",
"To ensure clean output, extra newlines are removed, and regex parsing can improve list formatting when needed."
]
},
{
@@ -163,8 +178,8 @@
"        hypothetical prompt embeddings (List[float]): A list of embedding vectors\n",
"            generated from the questions\n",
"    \"\"\"\n",
"    llm = ChatOpenAI(temperature=0, model_name=language_model_name)\n",
"    embedding_model = OpenAIEmbeddings(model=embedding_model_name)\n",
"    llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)\n",
"    embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)\n",
"\n",
"    question_gen_prompt = PromptTemplate.from_template(\n",
"        \"Analyze the input text and generate essential questions that, when answered, \\\n",
@@ -176,8 +191,8 @@
"\n",
"    # parse questions from response\n",
"    # Notes: \n",
"    # - gpt40 likes to split questions by \\n\\n so we remove one \\n\n",
"    # - for producetion or if using smaller models from ollama, it's beneficial to use regex to parse \n",
"    # - gpt4o likes to split questions by \\n\\n so we remove one \\n\n",
"    # - for production or if using smaller models from ollama, it's beneficial to use regex to parse \n",
"    # things like (un)ordered lists\n",
"    # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
"    questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
@@ -185,6 +200,23 @@
"    return chunk_text, embedding_model.embed_documents(questions)\n"
]
},
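The comment in the hunk above points to a regex for stripping list markers. A hedged sketch of how that parsing step could look for smaller models, adapted from the quoted pattern rather than taken from the notebook:

```python
import re

# Sketch: strip common (un)ordered-list markers from the LLM output before
# treating each remaining line as one question. Pattern adapted from the
# regex quoted in the notebook's comment.
LIST_MARKER = re.compile(r"^\s*(?:[-*\u2022]|\d+\.|\(?[a-zA-Z]\)|\(\d+\)|\([ivxlcdm]+\))\s*")

def parse_questions(raw: str) -> list[str]:
    """Split an LLM response into clean question strings."""
    lines = raw.replace("\n\n", "\n").split("\n")
    cleaned = [LIST_MARKER.sub("", line).strip() for line in lines]
    return [q for q in cleaned if q]  # drop empty lines
```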
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define creation and population of FAISS Vector Store\n",
"\n",
"The code block below builds a FAISS vector store by embedding text chunks in parallel.\n",
"\n",
"What happens?\n",
"- Parallel processing – Uses threading to generate embeddings faster.\n",
"- FAISS initialization – Sets up an L2 index for efficient similarity search.\n",
"- Chunk embedding – Each chunk is stored multiple times, once for each generated question embedding.\n",
"- In-memory storage – Uses InMemoryDocstore for fast lookup.\n",
"\n",
"This ensures efficient retrieval, improving query alignment with precomputed question embeddings."
]
},
{
"cell_type": "code",
"execution_count": 66,
@@ -221,14 +253,14 @@
"        # Initialize the FAISS vector store on the first chunk\n",
"        if vector_store == None: \n",
"            vector_store = FAISS(\n",
"                embedding_function=OpenAIEmbeddings(model=embedding_model_name), # Define embedding model\n",
"                embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME), # Define embedding model\n",
"                index=faiss.IndexFlatL2(len(vectors[0])), # Define an L2 index for similarity search\n",
"                docstore=InMemoryDocstore(), # Use in-memory document storage\n",
"                index_to_docstore_id={} # Maintain index-to-document mapping\n",
"            )\n",
"        \n",
"        # Pair the chunk's content with each generated embedding vector.\n",
"        # Each chunk is inserted multiple times, once for each vector\n",
"        # Each chunk is inserted multiple times, once for each prompt vector\n",
"        chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
"        \n",
"        # Add embeddings to the store\n",
@@ -237,6 +269,21 @@
"    return vector_store # Return the populated vector store\n"
]
},
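Only the FAISS-population half of this function appears in the hunk. The threaded question-embedding step mentioned in the markdown cell might look roughly like the sketch below; `chunks`, the pairing step, and the use of `ThreadPoolExecutor` are assumptions here, only `generate_hypothetical_prompt_embeddings` comes from the notebook.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch only: embed hypothetical questions for every chunk in parallel.
# `chunks` is assumed to be the Document list produced by the text splitter.
with ThreadPoolExecutor() as pool:
    results = list(
        pool.map(
            lambda doc: generate_hypothetical_prompt_embeddings(doc.page_content),
            chunks,
        )
    )

# Pair each Document with its question embeddings; the loop shown in the diff
# then inserts the chunk once per question embedding vector.
chunk_vector_pairs = [(doc, vectors) for doc, (_, vectors) in zip(chunks, results)]
```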
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode PDF into a FAISS Vector Store\n",
"\n",
"The code block below processes a PDF file and stores its content as embeddings for retrieval.\n",
"\n",
"What happens?\n",
"- PDF loading – Extracts text from the document.\n",
"- Chunking – Splits text into overlapping segments for better context retention.\n",
"- Preprocessing – Cleans text to improve embedding quality.\n",
"- Vector store creation – Generates embeddings and stores them in FAISS for retrieval."
]
},
{
"cell_type": "code",
"execution_count": 70,
@@ -272,6 +319,16 @@
"    return vectorstore"
]
},
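Only the final `return vectorstore` line of this function survives in the hunk. A hedged reconstruction of the steps the markdown cell lists is sketched below; `PyPDFLoader`, the tab-cleaning step, and the `create_hype_vector_store` helper name are assumptions, not lines from the notebook.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def encode_pdf_sketch(path, chunk_size=1000, chunk_overlap=200):
    """Sketch: load a PDF, chunk it, and build the HyPE FAISS store."""
    documents = PyPDFLoader(path).load()                      # PDF loading
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    chunks = splitter.split_documents(documents)              # overlapping chunks
    for doc in chunks:                                        # light preprocessing
        doc.page_content = doc.page_content.replace("\t", " ")
    return create_hype_vector_store(chunks)                   # assumed helper: builds the FAISS store
```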
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create HyPE vector store\n",
"\n",
"Now we process the PDF and store its embeddings.\n",
"This step initializes the FAISS vector store with the encoded document."
]
},
{
"cell_type": "code",
"execution_count": 71,
@@ -289,14 +346,18 @@
"# Chunk size can be quite large with HyPE as we are not losing precision with more\n",
"# information. For production, test how exhaustive your model is in generating a sufficient \n",
"# number of questions per chunk. This will mostly depend on your information density.\n",
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
"chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create retriever"
"### Create retriever\n",
"\n",
"Now we set up the retriever to fetch relevant chunks from the vector store.\n",
"\n",
"Retrieves the top `k=3` most relevant chunks based on query similarity."
]
},
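The retriever cell itself is untouched by this commit; for context, a top-`k=3` retriever over the HyPE store is typically created with the standard LangChain call sketched below (not a quote from the notebook).

```python
# Sketch: similarity search over the question embeddings, returning 3 chunks per query.
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 3})
```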
{
@@ -312,7 +373,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test retriever"
"### Test retriever\n",
"\n",
"Now we test retrieval using a sample query.\n",
"\n",
"- Queries the vector store to find the most relevant chunks.\n",
"- Deduplicates results to remove potentially repeated chunks.\n",
"- Displays the retrieved context for inspection.\n",
"\n",
"This step verifies that the retriever returns meaningful and diverse information for the given question."
]
},
{
@@ -379,7 +448,6 @@
"source": [
"test_query = \"What is the main cause of climate change?\"\n",
"context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
"# deduplication might be beneficial as it is possible to retrieve the same chunk multiple times\n",
"context = list(set(context))\n",
"show_context(context)"
]
@@ -462,7 +530,6 @@
}
],
"source": [
"# Note: this currently works with OpenAI only\n",
"evaluate_rag(chunks_query_retriever)"
]
}