improved markdown

VakeDomen
2025-03-10 13:28:16 +00:00
parent 6096797c7e
commit 57e9dcc87a


@@ -10,7 +10,7 @@
"\n",
"This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n",
"\n",
"## Key Components\n",
"## Key Components of notebook\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Text chunking to maintain coherent information units\n",
@@ -28,7 +28,7 @@
"\n",
"### Hypothetical Question Generation\n",
"\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions **simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"\n",
"### Vector Store Creation\n",
"\n",
@@ -45,21 +45,16 @@
"## Key Features\n",
"\n",
"1. **Precomputed Hypothetical Prompts** Improves query alignment without runtime overhead.\n",
"2. **Multi-Vector Representation ** Each chunk is indexed multiple times for broader semantic coverage.\n",
"2. **Multi-Vector Representation** Each chunk is indexed multiple times for broader semantic coverage.\n",
"3. **Efficient Retrieval** FAISS ensures fast similarity search over the enhanced embeddings.\n",
"4. **Modular Design** The pipeline is easy to adapt for different datasets and retrieval settings. Additionally it's compatible with most optimizations like reranking etc.\n",
"\n",
"## Usage Example\n",
"\n",
"The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n",
"\n",
"## Evaluation\n",
"\n",
"HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
"\n",
"- Up to 42 percentage points improvement in retrieval precision\n",
"- Up to 45 percentage points improvement in claim recall\n",
"- Compatibility with reranking, multi-vector retrieval, and other RAG optimizations\n",
" (See full evaluation results in [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
"\n",
"## Benefits of this Approach\n",
@@ -78,7 +73,7 @@
"\n",
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:40%; height:auto;\">\n",
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:70%; height:auto;\">\n",
"</div>"
]
},
@@ -122,7 +117,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
"### Define constants\n",
"\n",
"- `PATH`: path to the data, to be embedded into the RAG pipeline\n",
"\n",
"This tutorial uses OpenAI endpoint ([avalible models](https://platform.openai.com/docs/pricing)). \n",
"- `LANGUAGE_MODEL_NAME`: The name of the language model to be used. \n",
"- `EMBEDDING_MODEL_NAME`: The name of the embedding model to be used.\n",
"\n",
"The tutroial uses a `RecursiveCharacterTextSplitter` chunking approach where the chunking length function used is python `len` function. The chunking varables to be tweaked here are:\n",
"- `CHUNK_SIZE`: The minimum length of one chunk\n",
"- `CHUNK_OVERLAP`: The overlap of two consecutive chunks."
]
},
{
@@ -131,16 +136,26 @@
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\"\n",
"language_model_name = \"gpt-4o-mini\"\n",
"embedding_model_name = \"text-embedding-3-small\""
"PATH = \"../data/Understanding_Climate_Change.pdf\"\n",
"LANGUAGE_MODEL_NAME = \"gpt-4o-mini\"\n",
"EMBEDDING_MODEL_NAME = \"text-embedding-3-small\"\n",
"CHUNK_SIZE = 1000\n",
"CHUNK_OVERLAP = 200"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
"### Define generation of Hypothetical Prompt Embeddings\n",
"\n",
"The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.\n",
"\n",
"- An LLM extracts key questions from the input chunk.\n",
"- These questions are embedded using OpenAI's model.\n",
"- The function returns the original chunk and its prompt embeddings later used for retrieval.\n",
"\n",
"To ensure clean output, extra newlines are removed, and regex parsing can improve list formatting when needed."
]
},
{
@@ -163,8 +178,8 @@
" hypothetical prompt embeddings (List[float]): A list of embedding vectors\n",
" generated from the questions\n",
" \"\"\"\n",
" llm = ChatOpenAI(temperature=0, model_name=language_model_name)\n",
" embedding_model = OpenAIEmbeddings(model=embedding_model_name)\n",
" llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)\n",
" embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)\n",
"\n",
" question_gen_prompt = PromptTemplate.from_template(\n",
" \"Analyze the input text and generate essential questions that, when answered, \\\n",
@@ -176,8 +191,8 @@
"\n",
" # parse questions from response\n",
" # Notes: \n",
" # - gpt40 likes to split questions by \\n\\n so we remove one \\n\n",
" # - for producetion or if using smaller models from ollama, it's beneficial to use regex to parse \n",
" # - gpt4o likes to split questions by \\n\\n so we remove one \\n\n",
" # - for production or if using smaller models from ollama, it's beneficial to use regex to parse \n",
" # things like (un)ordeed lists\n",
" # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
" questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
@@ -185,6 +200,23 @@
" return chunk_text, embedding_model.embed_documents(questions)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define creation and population of FAISS Vector Store\n",
"\n",
"The code block below builds a FAISS vector store by embedding text chunks in parallel.\n",
"\n",
"What happens?\n",
"- Parallel processing Uses threading to generate embeddings faster.\n",
"- FAISS initialization Sets up an L2 index for efficient similarity search.\n",
"- Chunk embedding Each chunk is stored multiple times, once for each generated question embedding.\n",
"- In-memory storage Uses InMemoryDocstore for fast lookup.\n",
"\n",
"This ensures efficient retrieval, improving query alignment with precomputed question embeddings."
]
},
{
"cell_type": "code",
"execution_count": 66,
@@ -221,14 +253,14 @@
" # Initialize the FAISS vector store on the first chunk\n",
" if vector_store == None: \n",
" vector_store = FAISS(\n",
" embedding_function=OpenAIEmbeddings(model=embedding_model_name), # Define embedding model\n",
" embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME), # Define embedding model\n",
" index=faiss.IndexFlatL2(len(vectors[0])) # Define an L2 index for similarity search\n",
" docstore=InMemoryDocstore(), # Use in-memory document storage\n",
" index_to_docstore_id={} # Maintain index-to-document mapping\n",
" )\n",
" \n",
" # Pair the chunk's content with each generated embedding vector.\n",
" # Each chunk is inserted multiple times, once for each vector\n",
" # Each chunk is inserted multiple times, once for each prompt vector\n",
" chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
" \n",
" # Add embeddings to the store\n",
@@ -237,6 +269,21 @@
" return vector_store # Return the populated vector store\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode PDF into a FAISS Vector Store\n",
"\n",
"The code block below processes a PDF file and stores its content as embeddings for retrieval.\n",
"\n",
"What happens?\n",
"- PDF loading Extracts text from the document.\n",
"- Chunking Splits text into overlapping segments for better context retention.\n",
"- Preprocessing Cleans text to improve embedding quality.\n",
"- Vector store creation Generates embeddings and stores them in FAISS for retrieval."
]
},
{
"cell_type": "code",
"execution_count": 70,
@@ -272,6 +319,16 @@
" return vectorstore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create HyPE vector store\n",
"\n",
"Now we process the PDF and store its embeddings.\n",
"This step initializes the FAISS vector store with the encoded document."
]
},
{
"cell_type": "code",
"execution_count": 71,
@@ -289,14 +346,18 @@
"# Chunk size can be quite large with HyPE as we are not loosing percision with more\n",
"# information. For production, test how exhaustive your model is in generating sufficient \n",
"# amount of questions per chunk. This will mostly depend on your information density.\n",
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
"chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create retriever"
"### Create retriever\n",
"\n",
"Now we set up the retriever to fetch relevant chunks from the vector store.\n",
"\n",
"Retrieves the top `k=3` most relevant chunks based on query similarity."
]
},
{
@@ -312,7 +373,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test retriever"
"### Test retriever\n",
"\n",
"Now we test retrieval using a sample query.\n",
"\n",
"- Queries the vector store to find the most relevant chunks.\n",
"- Deduplicates results to remove potentially repeated chunks.\n",
"- Displays the retrieved context for inspection.\n",
"\n",
"This step verifies that the retriever returns meaningful and diverse information for the given question."
]
},
{
@@ -379,7 +448,6 @@
"source": [
"test_query = \"What is the main cause of climate change?\"\n",
"context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
"# deduplication might be beneficial as it is possible to retrieve the same chunk multiple times\n",
"context = list(set(context))\n",
"show_context(context)"
]
@@ -462,7 +530,6 @@
}
],
"source": [
"#Note - this currently works with OPENAI only\n",
"evaluate_rag(chunks_query_retriever)"
]
}