improved markdown

VakeDomen
2025-03-10 13:28:16 +00:00
parent 6096797c7e
commit 57e9dcc87a


@@ -10,7 +10,7 @@
"\n",
"This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n",
"\n",
"## Key Components\n",
"## Key Components of notebook\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Text chunking to maintain coherent information units\n",
@@ -28,7 +28,7 @@
"\n",
"### Hypothetical Question Generation\n",
"\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions **simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"\n",
"### Vector Store Creation\n",
"\n",
@@ -45,21 +45,16 @@
"## Key Features\n",
"\n",
"1. **Precomputed Hypothetical Prompts** Improves query alignment without runtime overhead.\n",
"2. **Multi-Vector Representation ** Each chunk is indexed multiple times for broader semantic coverage.\n",
"2. **Multi-Vector Representation** Each chunk is indexed multiple times for broader semantic coverage.\n",
"3. **Efficient Retrieval** FAISS ensures fast similarity search over the enhanced embeddings.\n",
"4. **Modular Design** The pipeline is easy to adapt for different datasets and retrieval settings. Additionally it's compatible with most optimizations like reranking etc.\n",
"\n",
"## Usage Example\n",
"\n",
"The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n",
"\n",
"## Evaluation\n",
"\n",
"HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
"\n",
"- Up to 42 percentage points improvement in retrieval precision\n",
"- Up to 45 percentage points improvement in claim recall\n",
"- Compatibility with reranking, multi-vector retrieval, and other RAG optimizations\n",
" (See full evaluation results in [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
"\n",
"## Benefits of this Approach\n",
@@ -78,7 +73,7 @@
"\n",
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:40%; height:auto;\">\n",
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:70%; height:auto;\">\n",
"</div>"
]
},
@@ -122,7 +117,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
"### Define constants\n",
"\n",
"- `PATH`: path to the data, to be embedded into the RAG pipeline\n",
"\n",
"This tutorial uses OpenAI endpoint ([avalible models](https://platform.openai.com/docs/pricing)). \n",
"- `LANGUAGE_MODEL_NAME`: The name of the language model to be used. \n",
"- `EMBEDDING_MODEL_NAME`: The name of the embedding model to be used.\n",
"\n",
"The tutroial uses a `RecursiveCharacterTextSplitter` chunking approach where the chunking length function used is python `len` function. The chunking varables to be tweaked here are:\n",
"- `CHUNK_SIZE`: The minimum length of one chunk\n",
"- `CHUNK_OVERLAP`: The overlap of two consecutive chunks."
]
},
{
@@ -131,16 +136,26 @@
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\"\n",
"language_model_name = \"gpt-4o-mini\"\n",
"embedding_model_name = \"text-embedding-3-small\""
"PATH = \"../data/Understanding_Climate_Change.pdf\"\n",
"LANGUAGE_MODEL_NAME = \"gpt-4o-mini\"\n",
"EMBEDDING_MODEL_NAME = \"text-embedding-3-small\"\n",
"CHUNK_SIZE = 1000\n",
"CHUNK_OVERLAP = 200"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
"### Define generation of Hypothetical Prompt Embeddings\n",
"\n",
"The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.\n",
"\n",
"- An LLM extracts key questions from the input chunk.\n",
"- These questions are embedded using OpenAI's model.\n",
"- The function returns the original chunk and its prompt embeddings later used for retrieval.\n",
"\n",
"To ensure clean output, extra newlines are removed, and regex parsing can improve list formatting when needed."
]
},
{
@@ -163,8 +178,8 @@
" hypothetical prompt embeddings (List[float]): A list of embedding vectors\n",
" generated from the questions\n",
" \"\"\"\n",
" llm = ChatOpenAI(temperature=0, model_name=language_model_name)\n",
" embedding_model = OpenAIEmbeddings(model=embedding_model_name)\n",
" llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)\n",
" embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)\n",
"\n",
" question_gen_prompt = PromptTemplate.from_template(\n",
" \"Analyze the input text and generate essential questions that, when answered, \\\n",
@@ -176,8 +191,8 @@
"\n",
" # parse questions from response\n",
" # Notes: \n",
" # - gpt40 likes to split questions by \\n\\n so we remove one \\n\n",
" # - for producetion or if using smaller models from ollama, it's beneficial to use regex to parse \n",
" # - gpt4o likes to split questions by \\n\\n so we remove one \\n\n",
" # - for production or if using smaller models from ollama, it's beneficial to use regex to parse \n",
" # things like (un)ordeed lists\n",
" # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
" questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
@@ -185,6 +200,23 @@
" return chunk_text, embedding_model.embed_documents(questions)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define creation and population of FAISS Vector Store\n",
"\n",
"The code block below builds a FAISS vector store by embedding text chunks in parallel.\n",
"\n",
"What happens?\n",
"- Parallel processing Uses threading to generate embeddings faster.\n",
"- FAISS initialization Sets up an L2 index for efficient similarity search.\n",
"- Chunk embedding Each chunk is stored multiple times, once for each generated question embedding.\n",
"- In-memory storage Uses InMemoryDocstore for fast lookup.\n",
"\n",
"This ensures efficient retrieval, improving query alignment with precomputed question embeddings."
]
},
{
"cell_type": "code",
"execution_count": 66,
@@ -221,14 +253,14 @@
" # Initialize the FAISS vector store on the first chunk\n",
" if vector_store == None: \n",
" vector_store = FAISS(\n",
" embedding_function=OpenAIEmbeddings(model=embedding_model_name), # Define embedding model\n",
" embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME), # Define embedding model\n",
" index=faiss.IndexFlatL2(len(vectors[0])) # Define an L2 index for similarity search\n",
" docstore=InMemoryDocstore(), # Use in-memory document storage\n",
" index_to_docstore_id={} # Maintain index-to-document mapping\n",
" )\n",
" \n",
" # Pair the chunk's content with each generated embedding vector.\n",
" # Each chunk is inserted multiple times, once for each vector\n",
" # Each chunk is inserted multiple times, once for each prompt vector\n",
" chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
" \n",
" # Add embeddings to the store\n",
@@ -237,6 +269,21 @@
" return vector_store # Return the populated vector store\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode PDF into a FAISS Vector Store\n",
"\n",
"The code block below processes a PDF file and stores its content as embeddings for retrieval.\n",
"\n",
"What happens?\n",
"- PDF loading Extracts text from the document.\n",
"- Chunking Splits text into overlapping segments for better context retention.\n",
"- Preprocessing Cleans text to improve embedding quality.\n",
"- Vector store creation Generates embeddings and stores them in FAISS for retrieval."
]
},
{
"cell_type": "code",
"execution_count": 70,
@@ -272,6 +319,16 @@
" return vectorstore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create HyPE vector store\n",
"\n",
"Now we process the PDF and store its embeddings.\n",
"This step initializes the FAISS vector store with the encoded document."
]
},
{
"cell_type": "code",
"execution_count": 71,
@@ -289,14 +346,18 @@
"# Chunk size can be quite large with HyPE as we are not loosing percision with more\n",
"# information. For production, test how exhaustive your model is in generating sufficient \n",
"# amount of questions per chunk. This will mostly depend on your information density.\n",
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
"chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create retriever"
"### Create retriever\n",
"\n",
"Now we set up the retriever to fetch relevant chunks from the vector store.\n",
"\n",
"Retrieves the top `k=3` most relevant chunks based on query similarity."
]
},
{
@@ -312,7 +373,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test retriever"
"### Test retriever\n",
"\n",
"Now we test retrieval using a sample query.\n",
"\n",
"- Queries the vector store to find the most relevant chunks.\n",
"- Deduplicates results to remove potentially repeated chunks.\n",
"- Displays the retrieved context for inspection.\n",
"\n",
"This step verifies that the retriever returns meaningful and diverse information for the given question."
]
},
{
@@ -379,7 +448,6 @@
"source": [
"test_query = \"What is the main cause of climate change?\"\n",
"context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
"# deduplication might be beneficial as it is possible to retrieve the same chunk multiple times\n",
"context = list(set(context))\n",
"show_context(context)"
]
@@ -462,7 +530,6 @@
}
],
"source": [
"#Note - this currently works with OPENAI only\n",
"evaluate_rag(chunks_query_retriever)"
]
}