Mirror of https://github.com/NirDiamant/RAG_Techniques.git (synced 2025-04-07)

README.md
@@ -153,7 +153,24 @@ Explore the extensive list of cutting-edge RAG techniques:
### 📚 Context and Content Enrichment

8. Hypothetical Prompt Embeddings (HyPE) ❓🚀

    - **[LangChain](all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embedding.py)**
#### Overview 🔎

HyPE (Hypothetical Prompt Embeddings) is an enhancement to traditional RAG retrieval that **precomputes hypothetical prompts at the indexing stage**, embedding each generated question while storing the original chunk in its place. This transforms retrieval into a **question-question matching task**, avoiding the need for runtime synthetic answer generation and reducing inference-time computational overhead while **improving retrieval alignment**.
#### Implementation 🛠️

- 📖 **Precomputed Questions:** Instead of embedding document chunks, HyPE **generates multiple hypothetical queries per chunk** at indexing time (see the indexing sketch after this list).
- 🔍 **Question-Question Matching:** User queries are matched against stored hypothetical questions, leading to **better retrieval alignment**.
- ⚡ **No Runtime Overhead:** Unlike HyDE, HyPE does **not require LLM calls at query time**, making retrieval **faster and cheaper**.
- 📈 **Higher Precision & Recall:** Improves retrieval **context precision by up to 42 percentage points** and **claim recall by up to 45 percentage points**.
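Below is a minimal, hedged sketch of the HyPE indexing loop. The helper names (`gen_questions`, `embed`) are assumed stand-ins for an LLM call and an embedding model; the full LangChain/FAISS version lives in the linked notebook.

```python
from typing import Callable, List, Tuple

def hype_index(
    chunks: List[str],
    gen_questions: Callable[[str], List[str]],        # assumed: chunk -> hypothetical questions
    embed: Callable[[List[str]], List[List[float]]],  # assumed: texts -> embedding vectors
) -> List[Tuple[List[float], str]]:
    """Embed hypothetical questions, but store the source chunk as the payload."""
    index = []
    for chunk in chunks:
        for vec in embed(gen_questions(chunk)):
            index.append((vec, chunk))  # question vector points back to its chunk
    return index
```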
#### Additional Resources 📚

- **[Preprint: Hypothetical Prompt Embeddings (HyPE)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)** - Research paper detailing the method, evaluation, and benchmarks.
9. **[Contextual Chunk Headers :label:](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_chunk_headers.ipynb)**

#### Overview 🔎

Contextual chunk headers (CCH) is a method of creating document-level and section-level context and prepending those headers to the chunks before embedding them.
@@ -164,7 +181,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚

**[dsRAG](https://github.com/D-Star-AI/dsRAG)**: an open-source retrieval engine that implements this technique (and a few other advanced RAG techniques)

10. **[Relevant Segment Extraction 🧩](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/relevant_segment_extraction.ipynb)**
#### Overview 🔎

Relevant segment extraction (RSE) is a method of dynamically constructing multi-chunk segments of text that are relevant to a given query.
@@ -172,7 +189,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Perform a retrieval post-processing step that analyzes the most relevant chunks and identifies longer multi-chunk segments to provide more complete context to the LLM, as sketched below.
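As a hedged illustration (not the notebook's exact code), the sketch below merges runs of relevant chunks into contiguous segments, given per-chunk relevance scores in document order:

```python
from typing import List, Tuple

def extract_segments(scores: List[float], threshold: float = 0.5,
                     max_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) chunk-index ranges covering runs of relevant chunks."""
    segments, start, gap = [], None, 0
    for i, score in enumerate(scores):
        if score >= threshold:
            start, gap = (i if start is None else start), 0
        elif start is not None:
            gap += 1
            if gap > max_gap:  # too many irrelevant chunks: close the segment
                segments.append((start, i - gap))
                start, gap = None, 0
    if start is not None:
        segments.append((start, len(scores) - 1 - gap))
    return segments

# e.g. extract_segments([0.9, 0.2, 0.8, 0.1, 0.1, 0.7]) -> [(0, 2), (5, 5)]
```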
11. Context Enrichment Techniques 📝

    - **[LangChain](all_rag_techniques/context_enrichment_window_around_chunk.ipynb)**
    - **[LlamaIndex](all_rag_techniques/context_enrichment_window_around_chunk_with_llamaindex.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/context_enrichment_window_around_chunk.py)**
@@ -183,7 +200,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Retrieve the most relevant sentence while also accessing the sentences before and after it in the original text, as sketched below.
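A minimal sketch of the window idea, assuming the document has already been split into an ordered list of sentences:

```python
from typing import List

def window_around(sentences: List[str], hit_idx: int, window: int = 1) -> str:
    """Return the retrieved sentence padded with `window` neighbors on each side."""
    lo = max(0, hit_idx - window)
    hi = min(len(sentences), hit_idx + window + 1)
    return " ".join(sentences[lo:hi])
```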
12. Semantic Chunking 🧠

    - **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/semantic_chunking.ipynb)**
    - **[Runnable Script](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/semantic_chunking.py)**
@@ -196,7 +213,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚

- **[Semantic Chunking: Improving AI Information Retrieval](https://open.substack.com/pub/diamantai/p/semantic-chunking-improving-ai-information?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the benefits and implementation of semantic chunking in RAG systems.

13. Contextual Compression 🗜️

    - **[LangChain](all_rag_techniques/contextual_compression.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/contextual_compression.py)**
@@ -206,7 +223,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Use an LLM to compress or summarize retrieved chunks, preserving key information relevant to the query; a hedged LangChain sketch follows.
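A sketch using LangChain's compression retriever (class names per LangChain's documented API; `vector_retriever` is an assumed, pre-built retriever):

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)  # LLM keeps only query-relevant text
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vector_retriever  # assumed retriever
)
docs = compression_retriever.invoke("What is the main cause of climate change?")
```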
14. Document Augmentation through Question Generation for Enhanced Retrieval

    - **[LangChain](all_rag_techniques/document_augmentation.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/document_augmentation.py)**
@@ -218,7 +235,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🚀 Advanced Retrieval Methods

15. Fusion Retrieval 🔗

    - **[LangChain](all_rag_techniques/fusion_retrieval.ipynb)**
    - **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/fusion_retrieval_with_llamaindex.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/fusion_retrieval.py)**
@@ -229,7 +246,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Combine keyword-based search with vector-based search for more comprehensive and accurate retrieval; a reciprocal rank fusion sketch follows.
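One common way to fuse the two result lists is reciprocal rank fusion (RRF); this is a generic sketch, not the notebook's exact implementation:

```python
from collections import defaultdict
from typing import List

def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked doc-ID lists; k damps the influence of top ranks (standard RRF)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a BM25 ranking and a vector-search ranking of document IDs
fused = rrf([["d1", "d2", "d3"], ["d2", "d1", "d4"]])
```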
16. Intelligent Reranking 📈

    - **[LangChain](all_rag_techniques/reranking.ipynb)**
    - **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/reranking_with_llamaindex.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/reranking.py)**
@@ -245,7 +262,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚

- **[Relevance Revolution: How Re-ranking Transforms RAG Systems](https://open.substack.com/pub/diamantai/p/relevance-revolution-how-re-ranking?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of re-ranking in enhancing RAG system performance.

17. Multi-faceted Filtering 🔍

#### Overview 🔎

Applying various filtering techniques to refine and improve the quality of retrieved results.
@@ -256,7 +273,7 @@ Explore the extensive list of cutting-edge RAG techniques:
- 📄 **Content Filtering:** Remove results that don't match specific content criteria or essential keywords.
- 🌈 **Diversity Filtering:** Ensure result diversity by filtering out near-duplicate entries (see the sketch after this list).
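A minimal sketch combining a score filter with a diversity filter; the hit format, thresholds, and unit-normalized embeddings are all assumptions:

```python
import numpy as np

def filter_results(hits, min_score=0.5, dedup_sim=0.95):
    """hits: dicts with 'score' (float) and unit-norm 'embedding' (np.ndarray)."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if hit["score"] < min_score:  # score/content filter
            continue
        if any(float(hit["embedding"] @ k["embedding"]) > dedup_sim for k in kept):
            continue  # diversity filter: drop near-duplicates
        kept.append(hit)
    return kept
```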
18. Hierarchical Indices 🗂️

    - **[LangChain](all_rag_techniques/hierarchical_indices.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/hierarchical_indices.py)**
@@ -269,7 +286,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚

- **[Hierarchical Indices: Enhancing RAG Systems](https://open.substack.com/pub/diamantai/p/hierarchical-indices-enhancing-rag?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of hierarchical indices in enhancing RAG system performance.

19. Ensemble Retrieval 🎭

#### Overview 🔎

Combining multiple retrieval models or techniques for more robust and accurate results.
@@ -277,7 +294,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents, as sketched below.
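A minimal weighted-voting sketch (the retriever callables and weights are assumed inputs):

```python
from collections import defaultdict

def ensemble_retrieve(query, retrievers, weights, top_k=5):
    """retrievers: list of fn(query) -> ranked list of doc IDs, best first."""
    votes = defaultdict(float)
    for retrieve, weight in zip(retrievers, weights):
        for rank, doc_id in enumerate(retrieve(query)):
            votes[doc_id] += weight / (rank + 1)  # earlier ranks earn more weight
    return sorted(votes, key=votes.get, reverse=True)[:top_k]
```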
20. Dartboard Retrieval 🎯

    - **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/dartboard.ipynb)**

#### Overview 🔎

Optimizing over relevant information gain in retrieval.
@@ -286,7 +303,7 @@ Explore the extensive list of cutting-edge RAG techniques:
- Combine both relevance and diversity into a single scoring function and directly optimize for it.
- A proof of concept showing plain simple RAG underperforming when the database is dense, and dartboard retrieval outperforming it (see the sketch after this list).
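In the same spirit (though not the dartboard paper's exact objective), an MMR-style greedy sketch that folds relevance and diversity into one score:

```python
import numpy as np

def greedy_select(query_sims: np.ndarray, doc_sims: np.ndarray,
                  k: int = 5, lam: float = 0.7) -> list:
    """query_sims: (n,) doc-query similarity; doc_sims: (n, n) doc-doc similarity."""
    selected, candidates = [], list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy  # relevance vs. diversity
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```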
21. Multi-modal Retrieval 📽️

#### Overview 🔎

Extending RAG capabilities to handle diverse data types for richer responses.
@@ -298,7 +315,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔁 Iterative and Adaptive Techniques

22. Retrieval with Feedback Loops 🔁

    - **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
@@ -308,7 +325,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models, as sketched below.
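A minimal sketch of the feedback idea: keep a per-document prior that user feedback nudges up or down, and blend it into future retrieval scores (all names and the learning rate are assumptions):

```python
from collections import defaultdict

doc_prior = defaultdict(float)  # learned per-document adjustment

def record_feedback(doc_id: str, helpful: bool, lr: float = 0.1) -> None:
    doc_prior[doc_id] += lr if helpful else -lr

def adjusted_score(doc_id: str, base_score: float) -> float:
    return base_score + doc_prior[doc_id]  # blend prior into the retrieval score
```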
23. Adaptive Retrieval 🎯

    - **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
@@ -318,7 +335,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences; a toy dispatch sketch follows.
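A toy sketch of the dispatch pattern; the classifier heuristic and the `dense_retrieve` helper are hypothetical placeholders:

```python
def classify(query: str) -> str:
    # crude heuristic stand-in for an LLM or trained classifier
    return "factual" if len(query.split()) <= 8 else "analytical"

STRATEGIES = {
    "factual": lambda q: dense_retrieve(q, k=3),      # hypothetical helper
    "analytical": lambda q: dense_retrieve(q, k=10),  # wider context for analysis
}

def adaptive_retrieve(query: str):
    return STRATEGIES[classify(query)](query)
```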
24. Iterative Retrieval 🔄

#### Overview 🔎

Performing multiple rounds of retrieval to refine and enhance result quality.
@@ -328,7 +345,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 📊 Evaluation

25. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘

#### Overview 🔎

Performing evaluations of Retrieval-Augmented Generation systems by covering several metrics and creating test cases.
@@ -337,7 +354,7 @@ Explore the extensive list of cutting-edge RAG techniques:
Use the `deepeval` library to conduct test cases on the correctness, faithfulness, and contextual relevancy of RAG systems, as sketched below.
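A hedged sketch of a `deepeval` test case (class and metric names per deepeval's documented API; the strings are illustrative placeholders):

```python
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the main cause of climate change?",
    actual_output="Greenhouse gas emissions from human activity.",
    retrieval_context=["The primary cause of recent climate change is ..."],
)
evaluate(test_cases=[test_case],
         metrics=[FaithfulnessMetric(), ContextualRelevancyMetric()])
```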
26. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦

#### Overview 🔎

Evaluate the final stage of Retrieval-Augmented Generation using the metrics of the GroUSE framework, and meta-evaluate your custom LLM judge on the GroUSE unit tests.
@@ -348,7 +365,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔬 Explainability and Transparency

27. Explainable Retrieval 🔍

    - **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
@@ -360,7 +377,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🏗️ Advanced Architectures

28. Knowledge Graph Integration (Graph RAG) 🕸️

    - **[LangChain](all_rag_techniques/graph_rag.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
@@ -370,7 +387,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses (see the sketch below).
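A minimal sketch with `networkx`: pull entities mentioned in the query plus their one-hop neighborhood (the graph construction and the `relation` edge attribute are assumed):

```python
import networkx as nx

def graph_context(graph: nx.Graph, query: str, hops: int = 1) -> list:
    """Return 'entity -[relation]-> entity' strings near query-mentioned nodes."""
    nodes = {n for n in graph.nodes if str(n).lower() in query.lower()}
    for _ in range(hops):
        nodes |= {m for n in list(nodes) for m in graph.neighbors(n)}
    sub = graph.subgraph(nodes)
    return [f"{u} -[{d.get('relation', 'related_to')}]-> {v}"
            for u, v, d in sub.edges(data=True)]
```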
29. GraphRag (Microsoft) 🎯

    - **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**

#### Overview 🔎
@@ -379,7 +396,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

- Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
30. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳

    - **[LangChain](all_rag_techniques/raptor.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
@@ -389,7 +406,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context, as sketched below.
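A minimal RAPTOR-style sketch: recursively summarize groups of texts into levels of a tree (`summarize` is an assumed LLM call; the naive sequential grouping stands in for clustering):

```python
from typing import Callable, List

def build_tree(texts: List[str], summarize: Callable[[str], str],
               fanout: int = 4) -> List[List[str]]:
    """Return levels bottom-up: level 0 = leaf chunks, last level = root summary."""
    levels = [texts]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([summarize(" ".join(prev[i:i + fanout]))
                       for i in range(0, len(prev), fanout)])
    return levels  # index all levels so retrieval can match any abstraction level
```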
31. Self RAG 🔁

    - **[LangChain](all_rag_techniques/self_rag.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
@@ -399,7 +416,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️

- Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
32. Corrective RAG 🔧

    - **[LangChain](all_rag_techniques/crag.ipynb)**
    - **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
@@ -411,7 +428,7 @@ Explore the extensive list of cutting-edge RAG techniques:
## 🌟 Special Advanced Technique 🌟

33. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**

#### Overview 🔎

An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.
all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb (new file)
@@ -0,0 +1,558 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hypothetical Prompt Embeddings (HyPE)\n",
    "\n",
    "## Overview\n",
    "\n",
    "This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n",
    "\n",
    "## Key Components of the notebook\n",
    "\n",
    "1. PDF processing and text extraction\n",
    "2. Text chunking to maintain coherent information units\n",
    "3. **Hypothetical Prompt Embedding Generation** using an LLM to create multiple proxy questions per chunk\n",
    "4. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
    "5. Retriever setup for querying the processed documents\n",
    "6. Evaluation of the RAG system\n",
    "\n",
    "## Method Details\n",
    "\n",
    "### Document Preprocessing\n",
    "\n",
    "1. The PDF is loaded using `PyPDFLoader`.\n",
    "2. The text is split into chunks using `RecursiveCharacterTextSplitter` with specified chunk size and overlap.\n",
    "\n",
    "### Hypothetical Question Generation\n",
    "\n",
    "Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the runtime synthetic answer generation required by techniques like HyDE.\n",
    "\n",
    "### Vector Store Creation\n",
    "\n",
    "1. Each hypothetical question is embedded using OpenAI embeddings.\n",
    "2. A FAISS vector store is built, associating **each question embedding with its original chunk**.\n",
    "3. This approach **stores multiple representations per chunk**, increasing retrieval flexibility.\n",
    "\n",
    "### Retriever Setup\n",
    "\n",
    "1. The retriever is optimized for **question-question matching** rather than direct document retrieval.\n",
    "2. The FAISS index enables **efficient nearest-neighbor** search over the hypothetical prompt embeddings.\n",
    "3. Retrieved chunks provide a **richer and more precise context** for downstream LLM generation.\n",
    "\n",
    "## Key Features\n",
    "\n",
    "1. **Precomputed Hypothetical Prompts** – Improves query alignment without runtime overhead.\n",
    "2. **Multi-Vector Representation** – Each chunk is indexed multiple times for broader semantic coverage.\n",
    "3. **Efficient Retrieval** – FAISS ensures fast similarity search over the enhanced embeddings.\n",
    "4. **Modular Design** – The pipeline is easy to adapt for different datasets and retrieval settings. Additionally, it's compatible with most optimizations, like reranking.\n",
    "\n",
    "## Evaluation\n",
    "\n",
    "HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
    "\n",
    "- Up to 42 percentage points improvement in retrieval precision\n",
    "- Up to 45 percentage points improvement in claim recall\n",
    "  (See full evaluation results in the [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
    "\n",
    "## Benefits of this Approach\n",
    "\n",
    "1. **Eliminates Query-Time Overhead** – All hypothetical generation is done offline at indexing.\n",
    "2. **Enhanced Retrieval Precision** – Better alignment between queries and stored content.\n",
    "3. **Scalable & Efficient** – No additional per-query computational cost; retrieval is as fast as standard RAG.\n",
    "4. **Flexible & Extensible** – Can be combined with advanced RAG techniques like reranking.\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "HyPE provides a scalable and efficient alternative to traditional RAG systems, overcoming query-document style mismatch while avoiding the computational cost of runtime query expansion. By moving hypothetical prompt generation to indexing, it significantly enhances retrieval precision and efficiency, making it a practical solution for real-world applications.\n",
    "\n",
    "For further details, refer to the full paper: [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)\n",
    "\n",
    "\n",
    "<div style=\"text-align: center;\">\n",
    "\n",
    "<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:70%; height:auto;\">\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import libraries and environment variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import faiss\n",
    "from tqdm import tqdm\n",
    "from dotenv import load_dotenv\n",
    "from concurrent.futures import ThreadPoolExecutor, as_completed\n",
    "from langchain_community.docstore.in_memory import InMemoryDocstore\n",
    "\n",
    "\n",
    "# Load environment variables from a .env file\n",
    "load_dotenv()\n",
    "\n",
    "# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
    "if not os.getenv('OPENAI_API_KEY'):\n",
    "    os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
    "else:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))  # Add the parent directory to the path since we work with notebooks\n",
    "from helper_functions import *\n",
    "from evaluation.evalute_rag import *\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define constants\n",
    "\n",
    "- `PATH`: path to the data to be embedded into the RAG pipeline\n",
    "\n",
    "This tutorial uses an OpenAI endpoint ([available models](https://platform.openai.com/docs/pricing)). \n",
    "- `LANGUAGE_MODEL_NAME`: The name of the language model to be used. \n",
    "- `EMBEDDING_MODEL_NAME`: The name of the embedding model to be used.\n",
    "\n",
    "The tutorial uses a `RecursiveCharacterTextSplitter` chunking approach where the chunking length function used is the Python `len` function. The chunking variables to be tweaked here are:\n",
    "- `CHUNK_SIZE`: The maximum length of one chunk\n",
    "- `CHUNK_OVERLAP`: The overlap of two consecutive chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "PATH = \"../data/Understanding_Climate_Change.pdf\"\n",
    "LANGUAGE_MODEL_NAME = \"gpt-4o-mini\"\n",
    "EMBEDDING_MODEL_NAME = \"text-embedding-3-small\"\n",
    "CHUNK_SIZE = 1000\n",
    "CHUNK_OVERLAP = 200"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define generation of Hypothetical Prompt Embeddings\n",
    "\n",
    "The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.\n",
    "\n",
    "- An LLM extracts key questions from the input chunk.\n",
    "- These questions are embedded using OpenAI's model.\n",
    "- The function returns the original chunk and its prompt embeddings, later used for retrieval.\n",
    "\n",
    "To ensure clean output, extra newlines are removed, and regex parsing can improve list formatting when needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_hypothetical_prompt_embeddings(chunk_text: str):\n",
    "    \"\"\"\n",
    "    Uses the LLM to generate multiple hypothetical questions for a single chunk.\n",
    "    These questions will be used as 'proxies' for the chunk during retrieval.\n",
    "\n",
    "    Parameters:\n",
    "    chunk_text (str): Text contents of the chunk\n",
    "\n",
    "    Returns:\n",
    "    chunk_text (str): Text contents of the chunk. This is done to make the \n",
    "        multithreading easier\n",
    "    hypothetical prompt embeddings (List[float]): A list of embedding vectors\n",
    "        generated from the questions\n",
    "    \"\"\"\n",
    "    llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)\n",
    "    embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)\n",
    "\n",
    "    question_gen_prompt = PromptTemplate.from_template(\n",
    "        \"Analyze the input text and generate essential questions that, when answered, \\\n",
    "        capture the main points of the text. Each question should be one line, \\\n",
    "        without numbering or prefixes.\\n\\n \\\n",
    "        Text:\\n{chunk_text}\\n\\nQuestions:\\n\"\n",
    "    )\n",
    "    question_chain = question_gen_prompt | llm | StrOutputParser()\n",
    "\n",
    "    # parse questions from response\n",
    "    # Notes: \n",
    "    # - gpt4o likes to split questions by \\n\\n so we remove one \\n\n",
    "    # - for production or if using smaller models from ollama, it's beneficial to use regex to parse \n",
    "    #   things like (un)ordered lists\n",
    "    #   r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
    "    questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
    "    \n",
    "    return chunk_text, embedding_model.embed_documents(questions)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define creation and population of FAISS Vector Store\n",
    "\n",
    "The code block below builds a FAISS vector store by embedding text chunks in parallel.\n",
    "\n",
    "What happens?\n",
    "- Parallel processing – Uses threading to generate embeddings faster.\n",
    "- FAISS initialization – Sets up an L2 index for efficient similarity search.\n",
    "- Chunk embedding – Each chunk is stored multiple times, once for each generated question embedding.\n",
    "- In-memory storage – Uses InMemoryDocstore for fast lookup.\n",
    "\n",
    "This ensures efficient retrieval, improving query alignment with precomputed question embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepare_vector_store(chunks):\n",
    "    \"\"\"\n",
    "    Creates and populates a FAISS vector store from a list of text chunks.\n",
    "\n",
    "    This function processes a list of text chunks in parallel, generating \n",
    "    hypothetical prompt embeddings for each chunk.\n",
    "    The embeddings are stored in a FAISS index for efficient similarity search.\n",
    "\n",
    "    Parameters:\n",
    "    chunks (List[Document]): A list of LangChain document chunks to be embedded and stored.\n",
    "\n",
    "    Returns:\n",
    "    FAISS: A FAISS vector store containing the embedded text chunks.\n",
    "    \"\"\"\n",
    "\n",
    "    # Defer initialization until the first result reveals the vector dimension\n",
    "    vector_store = None\n",
    "\n",
    "    with ThreadPoolExecutor() as pool:\n",
    "        # Use threading to speed up generation of prompt embeddings\n",
    "        futures = [pool.submit(generate_hypothetical_prompt_embeddings, c) for c in chunks]\n",
    "        \n",
    "        # Process embeddings as they complete\n",
    "        for f in tqdm(as_completed(futures), total=len(chunks)):\n",
    "            \n",
    "            chunk, vectors = f.result()  # Retrieve the processed chunk and its embeddings\n",
    "            \n",
    "            # Initialize the FAISS vector store on the first chunk\n",
    "            if vector_store is None:\n",
    "                vector_store = FAISS(\n",
    "                    embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME),  # Define embedding model\n",
    "                    index=faiss.IndexFlatL2(len(vectors[0])),  # Define an L2 index for similarity search\n",
    "                    docstore=InMemoryDocstore(),  # Use in-memory document storage\n",
    "                    index_to_docstore_id={}  # Maintain index-to-document mapping\n",
    "                )\n",
    "            \n",
    "            # Pair the chunk's content with each generated embedding vector.\n",
    "            # Each chunk is inserted multiple times, once for each prompt vector\n",
    "            chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
    "            \n",
    "            # Add embeddings to the store\n",
    "            vector_store.add_embeddings(chunks_with_embedding_vectors)\n",
    "\n",
    "    return vector_store  # Return the populated vector store\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Encode PDF into a FAISS Vector Store\n",
    "\n",
    "The code block below processes a PDF file and stores its content as embeddings for retrieval.\n",
    "\n",
    "What happens?\n",
    "- PDF loading – Extracts text from the document.\n",
    "- Chunking – Splits text into overlapping segments for better context retention.\n",
    "- Preprocessing – Cleans text to improve embedding quality.\n",
    "- Vector store creation – Generates embeddings and stores them in FAISS for retrieval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
    "    \"\"\"\n",
    "    Encodes a PDF book into a vector store using OpenAI embeddings.\n",
    "\n",
    "    Args:\n",
    "        path: The path to the PDF file.\n",
    "        chunk_size: The desired size of each text chunk.\n",
    "        chunk_overlap: The amount of overlap between consecutive chunks.\n",
    "\n",
    "    Returns:\n",
    "        A FAISS vector store containing the encoded book content.\n",
    "    \"\"\"\n",
    "\n",
    "    # Load PDF documents\n",
    "    loader = PyPDFLoader(path)\n",
    "    documents = loader.load()\n",
    "\n",
    "    # Split documents into chunks\n",
    "    text_splitter = RecursiveCharacterTextSplitter(\n",
    "        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
    "    )\n",
    "    texts = text_splitter.split_documents(documents)\n",
    "    cleaned_texts = replace_t_with_space(texts)\n",
    "\n",
    "    vectorstore = prepare_vector_store(cleaned_texts)\n",
    "\n",
    "    return vectorstore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create HyPE vector store\n",
    "\n",
    "Now we process the PDF and store its embeddings.\n",
    "This step initializes the FAISS vector store with the encoded document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 97/97 [00:22<00:00, 4.40it/s]\n"
     ]
    }
   ],
   "source": [
    "# Chunk size can be quite large with HyPE, as we are not losing precision with more\n",
    "# information. For production, test how exhaustive your model is in generating a sufficient\n",
    "# number of questions per chunk. This will mostly depend on your information density.\n",
    "chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create retriever\n",
    "\n",
    "Now we set up the retriever to fetch relevant chunks from the vector store.\n",
    "\n",
    "It retrieves the top `k=3` most relevant chunks based on query similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 3})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test retriever\n",
    "\n",
    "Now we test retrieval using a sample query.\n",
    "\n",
    "- Queries the vector store to find the most relevant chunks.\n",
    "- Deduplicates results to remove potentially repeated chunks.\n",
    "- Displays the retrieved context for inspection.\n",
    "\n",
    "This step verifies that the retriever returns meaningful and diverse information for the given question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Context 1:\n",
      "Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
      "change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
      "began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
      "unprecedented changes. \n",
      "Modern Observations \n",
      "Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
      "and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
      "documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
      "provide a historical record that scientists use to understand past climate conditions and \n",
      "predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
      "driven by human activities, particularly the emission of greenhou se gases. \n",
      "Chapter 2: Causes of Climate Change \n",
      "Greenhouse Gases\n",
      "\n",
      "\n",
      "Context 2:\n",
      "driven by human activities, particularly the emission of greenhou se gases. \n",
      "Chapter 2: Causes of Climate Change \n",
      "Greenhouse Gases \n",
      "The primary cause of recent climate change is the increase in greenhouse gases in the \n",
      "atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
      "oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
      "for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
      "activities have intensified this natural process, leading to a warmer climate. \n",
      "Fossil Fuels \n",
      "Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
      "natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
      "the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
      "today. \n",
      "Coal\n",
      "\n",
      "\n",
      "Context 3:\n",
      "Understanding Climate Change \n",
      "Chapter 1: Introduction to Climate Change \n",
      "Climate change refers to significant, long -term changes in the global climate. The term \n",
      "\"global climate\" encompasses the planet's overall weather patterns, including temperature, \n",
      "precipitation, and wind patterns, over an extended period. Over the past cent ury, human \n",
      "activities, particularly the burning of fossil fuels and deforestation, have significantly \n",
      "contributed to climate change. \n",
      "Historical Context \n",
      "The Earth's climate has changed throughout history. Over the past 650,000 years, there have \n",
      "been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n",
      "11,700 years ago marking the beginning of the modern climate era and human civilization. \n",
      "Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
      "change the amount of solar energy our planet receives. During the Holocene epoch, which\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "test_query = \"What is the main cause of climate change?\"\n",
    "context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
    "context = list(set(context))\n",
    "show_context(context)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'questions': ['1. **Multiple Choice: Causes of Climate Change**',\n",
       "  '   - What is the primary cause of the current climate change trend?',\n",
       "  '     A) Solar radiation variations',\n",
       "  '     B) Natural cycles of the Earth',\n",
       "  '     C) Human activities, such as burning fossil fuels',\n",
       "  '     D) Volcanic eruptions',\n",
       "  '',\n",
       "  '2. **True or False: Impact on Biodiversity**',\n",
       "  '   - True or False: Climate change does not have any significant impact on the migration patterns and extinction rates of various species.',\n",
       "  '',\n",
       "  '3. **Short Answer: Mitigation Strategies**',\n",
       "  '   - What are two effective strategies that can be implemented at a community level to mitigate the effects of climate change?',\n",
       "  '',\n",
       "  '4. **Matching: Climate Change Effects**',\n",
       "  '   - Match the following effects of climate change (numbered) with their likely consequences (lettered).',\n",
       "  '     1. Rising sea levels',\n",
       "  '     2. Increased frequency of extreme weather events',\n",
       "  '     3. Melting polar ice caps',\n",
       "  '     4. Ocean acidification',\n",
       "  '     ',\n",
       "  '     A) Displacement of coastal communities',\n",
       "  '     B) Loss of marine biodiversity',\n",
       "  '     C) Increased global temperatures',\n",
       "  '     D) More frequent and severe hurricanes and floods',\n",
       "  '',\n",
       "  '5. **Essay: International Cooperation**',\n",
       "  '   - Discuss the importance of international cooperation in combating climate change. Include examples of successful global agreements or initiatives and explain how they have contributed to addressing climate change.'],\n",
       " 'results': ['```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
       "  '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```'],\n",
       " 'average_scores': None}"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluate_rag(chunks_query_retriever)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embedding.py (new file)
@@ -0,0 +1,203 @@
import os
import sys
import argparse
import time
import faiss
from dotenv import load_dotenv
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
from langchain_community.docstore.in_memory import InMemoryDocstore

# Add the parent directory to the path since we work with notebooks
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from helper_functions import *
from evaluation.evalute_rag import *

# Load environment variables from a .env file (e.g., OpenAI API key)
load_dotenv()
if os.getenv('OPENAI_API_KEY'):
    # Guard the assignment: os.environ values must be strings, so only set
    # the key when it is actually present in the environment/.env file.
    os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')

class HyPE:
    """
    A class to handle the HyPE RAG process, which enhances document chunking by
    generating hypothetical questions as proxies for retrieval.
    """

    def __init__(self, path, chunk_size=1000, chunk_overlap=200, n_retrieved=3):
        """
        Initializes the HyPE-based RAG retriever by encoding the PDF document with
        hypothetical prompt embeddings.

        Args:
            path (str): Path to the PDF file to encode.
            chunk_size (int): Size of each text chunk (default: 1000).
            chunk_overlap (int): Overlap between consecutive chunks (default: 200).
            n_retrieved (int): Number of chunks to retrieve for each query (default: 3).
        """
        print("\n--- Initializing HyPE RAG Retriever ---")

        # Encode the PDF document into a FAISS vector store using hypothetical prompt embeddings
        start_time = time.time()
        self.vector_store = self.encode_pdf(path, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        self.time_records = {'Chunking': time.time() - start_time}
        print(f"Chunking Time: {self.time_records['Chunking']:.2f} seconds")

        # Create a retriever from the vector store
        self.chunks_query_retriever = self.vector_store.as_retriever(search_kwargs={"k": n_retrieved})

    def generate_hypothetical_prompt_embeddings(self, chunk_text):
        """
        Uses an LLM to generate multiple hypothetical questions for a single chunk.
        These questions act as 'proxies' for the chunk during retrieval.

        Parameters:
        chunk_text (str): Text contents of the chunk.

        Returns:
        tuple: (Original chunk text, List of embedding vectors generated from the questions)
        """
        llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
        embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

        question_gen_prompt = PromptTemplate.from_template(
            "Analyze the input text and generate essential questions that, when answered, \
            capture the main points of the text. Each question should be one line, \
            without numbering or prefixes.\n\n \
            Text:\n{chunk_text}\n\nQuestions:\n"
        )
        question_chain = question_gen_prompt | llm | StrOutputParser()

        # Parse questions from response
        questions = question_chain.invoke({"chunk_text": chunk_text}).replace("\n\n", "\n").split("\n")

        return chunk_text, embedding_model.embed_documents(questions)

    def prepare_vector_store(self, chunks):
        """
        Creates and populates a FAISS vector store using hypothetical prompt embeddings.

        Parameters:
        chunks (List[Document]): LangChain document chunks to be embedded and stored.

        Returns:
        FAISS: A FAISS vector store containing the embedded text chunks.
        """
        vector_store = None  # Wait to initialize to determine vector size

        with ThreadPoolExecutor() as pool:
            # Parallelized embedding generation
            futures = [pool.submit(self.generate_hypothetical_prompt_embeddings, c) for c in chunks]

            for f in tqdm(as_completed(futures), total=len(chunks)):
                chunk, vectors = f.result()  # Retrieve processed chunk and embeddings

                # Initialize FAISS store once vector size is known
                if vector_store is None:
                    vector_store = FAISS(
                        embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
                        index=faiss.IndexFlatL2(len(vectors[0])),
                        docstore=InMemoryDocstore(),
                        index_to_docstore_id={}
                    )

                # Store multiple vector representations per chunk
                chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]
                vector_store.add_embeddings(chunks_with_embedding_vectors)

        return vector_store

    def encode_pdf(self, path, chunk_size=1000, chunk_overlap=200):
        """
        Encodes a PDF document into a vector store using hypothetical prompt embeddings.

        Args:
            path: The path to the PDF file.
            chunk_size: The size of each text chunk.
            chunk_overlap: The overlap between consecutive chunks.

        Returns:
            A FAISS vector store containing the encoded book content.
        """
        # Load PDF documents
        loader = PyPDFLoader(path)
        documents = loader.load()

        # Split documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
        )
        texts = text_splitter.split_documents(documents)
        cleaned_texts = replace_t_with_space(texts)

        return self.prepare_vector_store(cleaned_texts)

    def run(self, query):
        """
        Retrieves and displays the context for the given query.

        Args:
            query (str): The query to retrieve context for.

        Returns:
            None
        """
        # Measure retrieval time
        start_time = time.time()
        context = retrieve_context_per_question(query, self.chunks_query_retriever)
        self.time_records['Retrieval'] = time.time() - start_time
        print(f"Retrieval Time: {self.time_records['Retrieval']:.2f} seconds")

        # Deduplicate context and display results
        context = list(set(context))
        show_context(context)


def validate_args(args):
    if args.chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer.")
    if args.chunk_overlap < 0:
        raise ValueError("chunk_overlap must be a non-negative integer.")
    if args.n_retrieved <= 0:
        raise ValueError("n_retrieved must be a positive integer.")
    return args


def parse_args():
    parser = argparse.ArgumentParser(description="Encode a PDF document and test a HyPE-based RAG system.")
    parser.add_argument("--path", type=str, default="../data/Understanding_Climate_Change.pdf",
                        help="Path to the PDF file to encode.")
    parser.add_argument("--chunk_size", type=int, default=1000,
                        help="Size of each text chunk (default: 1000).")
    parser.add_argument("--chunk_overlap", type=int, default=200,
                        help="Overlap between consecutive chunks (default: 200).")
    parser.add_argument("--n_retrieved", type=int, default=3,
                        help="Number of chunks to retrieve for each query (default: 3).")
    parser.add_argument("--query", type=str, default="What is the main cause of climate change?",
                        help="Query to test the retriever (default: 'What is the main cause of climate change?').")
    parser.add_argument("--evaluate", action="store_true",
                        help="Whether to evaluate the retriever's performance (default: False).")

    return validate_args(parser.parse_args())


def main(args):
    # Initialize the HyPE-based RAG Retriever
    hyperag = HyPE(
        path=args.path,
        chunk_size=args.chunk_size,
        chunk_overlap=args.chunk_overlap,
        n_retrieved=args.n_retrieved
    )

    # Retrieve context based on the query
    hyperag.run(args.query)

    # Evaluate the retriever's performance on the query (if requested)
    if args.evaluate:
        evaluate_rag(hyperag.chunks_query_retriever)


if __name__ == '__main__':
    # Call the main function with parsed arguments
    main(parse_args())
images/hype.svg (new file; image diff suppressed, 98 KiB)