mirror of
https://github.com/NirDiamant/RAG_Techniques.git
synced 2025-04-07 00:48:52 +03:00
Compare commits
34 Commits
5b689a0d6c ... 1cfb0d44cb
| Author | SHA1 | Date |
|---|---|---|
| | 1cfb0d44cb | |
| | 57e9dcc87a | |
| | 73b91bfa13 | |
| | 7d603611bd | |
| | e76f08482a | |
| | 942467c05d | |
| | 6096797c7e | |
| | 42cabf3a9b | |
| | 990ecff889 | |
| | 3c46cc9b0a | |
| | c4eb7c15e6 | |
| | 165876797c | |
| | 91a8a89302 | |
| | dfb0e9125b | |
| | d100326db5 | |
| | 6e1698f962 | |
| | 9b19b48637 | |
| | b0b1b2f72e | |
| | 06d2f16b4b | |
| | a51359b9c1 | |
| | db8b6a7b6c | |
| | c1d4bb450f | |
| | 673ffb5b0a | |
| | c8791970e9 | |
| | 0993d27edf | |
| | 7249e55824 | |
| | 076560320c | |
| | a1155a5581 | |
| | 76a529eccf | |
| | f50f0e4373 | |
| | 8a9d842ede | |
| | 0ff4ed2270 | |
| | 2d3344b4f8 | |
| | 209dde5430 | |
@@ -1,4 +1,4 @@
|
||||
# Contributing to Advanced RAG Techniques
|
||||
# Contributing to RAG Techniques
|
||||
|
||||
Welcome to the world's largest and most comprehensive repository of Retrieval-Augmented Generation (RAG) tutorials! 🌟 We're thrilled you're interested in contributing to this ever-growing knowledge base. Your expertise and creativity can help us maintain our position at the forefront of RAG technology.
|
||||
|
||||
|
||||
78
README.md
@@ -30,9 +30,11 @@ Welcome to one of the most comprehensive and dynamic collections of Retrieval-Au
|
||||
|
||||
[](https://diamantai.substack.com/?r=336pe4&utm_campaign=pub-share-checklist)
|
||||
|
||||
*Join thousands of AI enthusiasts getting unique cutting-edge insights and free tutorials! **Plus, subscribers get exclusive early access and special discounts to our upcoming RAG Techniques course!** *
|
||||
*Join over 15,000 AI enthusiasts getting unique cutting-edge insights and free tutorials!* ***Plus, subscribers get exclusive early access and a special 33% discount on my book and the upcoming RAG Techniques course!***
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
[](https://diamantai.substack.com/?r=336pe4&utm_campaign=pub-share-checklist)
|
||||
|
||||
|
||||
@@ -151,7 +153,24 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
|
||||
### 📚 Context and Content Enrichment
|
||||
|
||||
8. **[Contextual Chunk Headers :label:](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_chunk_headers.ipynb)**
|
||||
8. Hypothetical Prompt Embeddings (HyPE) ❓🚀
|
||||
- **[LangChain](all_rag_techniques/HyPE_Hypothetical_Prompt_Embedding.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embedding.py)**
|
||||
|
||||
#### Overview 🔎
|
||||
HyPE (Hypothetical Prompt Embeddings) is an enhancement to traditional RAG retrieval that **precomputes hypothetical prompts at the indexing stage** and stores the chunk in their place, transforming retrieval into a **question-question matching task**. This avoids the need for runtime synthetic answer generation, reducing inference-time computational overhead while **improving retrieval alignment**.
|
||||
|
||||
#### Implementation 🛠️
|
||||
- 📖 **Precomputed Questions:** Instead of embedding document chunks, HyPE **generates multiple hypothetical queries per chunk** at indexing time.
|
||||
- 🔍 **Question-Question Matching:** User queries are matched against stored hypothetical questions, leading to **better retrieval alignment**.
|
||||
- ⚡ **No Runtime Overhead:** Unlike HyDE, HyPE does **not require LLM calls at query time**, making retrieval **faster and cheaper**.
|
||||
- 📈 **Higher Precision & Recall:** Improves retrieval **context precision by up to 42 percentage points** and **claim recall by up to 45 percentage points**.
|
||||
|
||||
#### Additional Resources 📚
|
||||
- **[Preprint: Hypothetical Prompt Embeddings (HyPE)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)** - Research paper detailing the method, evaluation, and benchmarks.
|
||||
|
||||
|
||||
9. **[Contextual Chunk Headers :label:](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_chunk_headers.ipynb)**
|
||||
|
||||
#### Overview 🔎
|
||||
Contextual chunk headers (CCH) is a method of creating document-level and section-level context, and prepending those chunk headers to the chunks prior to embedding them.
|
||||
@@ -162,7 +181,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Additional Resources 📚
|
||||
**[dsRAG](https://github.com/D-Star-AI/dsRAG)**: open-source retrieval engine that implements this technique (and a few other advanced RAG techniques)
|
||||
|
||||
9. **[Relevant Segment Extraction 🧩](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/relevant_segment_extraction.ipynb)**
|
||||
10. **[Relevant Segment Extraction 🧩](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/relevant_segment_extraction.ipynb)**
|
||||
|
||||
#### Overview 🔎
|
||||
Relevant segment extraction (RSE) is a method of dynamically constructing multi-chunk segments of text that are relevant to a given query.
|
||||
@@ -170,7 +189,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Perform a retrieval post-processing step that analyzes the most relevant chunks and identifies longer multi-chunk segments to provide more complete context to the LLM.
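A minimal sketch of this post-processing idea (not taken from the linked notebook): it assumes each retrieved chunk carries its integer position in the source document along with a relevance score, and merges runs of adjacent relevant chunks into longer segments.

```python
from typing import Dict, List, Tuple

def extract_relevant_segments(
    chunk_scores: Dict[int, float],  # chunk position in the document -> relevance score
    min_score: float = 0.5,
    max_gap: int = 1,
) -> List[Tuple[int, int]]:
    """Merge relevant chunk positions into contiguous (start, end) segments."""
    relevant = sorted(i for i, s in chunk_scores.items() if s >= min_score)
    segments: List[Tuple[int, int]] = []
    for idx in relevant:
        if segments and idx - segments[-1][1] <= max_gap:
            # Extend the current segment when the next relevant chunk is adjacent
            segments[-1] = (segments[-1][0], idx)
        else:
            segments.append((idx, idx))
    return segments

# Chunks 3 and 4 merge into one segment; chunk 6 is too far away and starts its own
print(extract_relevant_segments({3: 0.9, 4: 0.7, 6: 0.8, 10: 0.2}))  # [(3, 4), (6, 6)]
```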
|
||||
|
||||
10. Context Enrichment Techniques 📝
|
||||
11. Context Enrichment Techniques 📝
|
||||
- **[LangChain](all_rag_techniques/context_enrichment_window_around_chunk.ipynb)**
|
||||
- **[LlamaIndex](all_rag_techniques/context_enrichment_window_around_chunk_with_llamaindex.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/context_enrichment_window_around_chunk.py)**
|
||||
@@ -181,7 +200,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Retrieve the most relevant sentence while also accessing the sentences before and after it in the original text.
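A minimal, illustrative sketch of the window idea, assuming the source text has already been split into an ordered list of sentences and the index of the retrieved sentence is known:

```python
from typing import List

def enrich_with_window(sentences: List[str], hit_index: int, window: int = 1) -> str:
    """Return the retrieved sentence together with its neighbours for extra context."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

sentences = [
    "Greenhouse gases trap heat in the atmosphere.",
    "The primary cause of recent climate change is the increase in greenhouse gases.",
    "Burning fossil fuels releases large amounts of CO2.",
]
# The middle sentence was retrieved; it is returned with one neighbour on each side
print(enrich_with_window(sentences, hit_index=1, window=1))
```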
|
||||
|
||||
11. Semantic Chunking 🧠
|
||||
12. Semantic Chunking 🧠
|
||||
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/semantic_chunking.ipynb)**
|
||||
- **[Runnable Script](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/semantic_chunking.py)**
|
||||
|
||||
@@ -194,7 +213,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Additional Resources 📚
|
||||
- **[Semantic Chunking: Improving AI Information Retrieval](https://open.substack.com/pub/diamantai/p/semantic-chunking-improving-ai-information?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the benefits and implementation of semantic chunking in RAG systems.
|
||||
|
||||
12. Contextual Compression 🗜️
|
||||
13. Contextual Compression 🗜️
|
||||
- **[LangChain](all_rag_techniques/contextual_compression.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/contextual_compression.py)**
|
||||
|
||||
@@ -204,7 +223,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Use an LLM to compress or summarize retrieved chunks, preserving key information relevant to the query.
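One way to sketch this with LangChain's built-in compression wrappers (the linked notebook may differ in details; `vectorstore` is assumed to be an existing FAISS or similar store):

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)  # the LLM keeps only query-relevant parts of each chunk

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),  # assumes an existing vector store
)

compressed_docs = compression_retriever.invoke("What is the main cause of climate change?")
```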
|
||||
|
||||
13. Document Augmentation through Question Generation for Enhanced Retrieval
|
||||
14. Document Augmentation through Question Generation for Enhanced Retrieval
|
||||
- **[LangChain](all_rag_techniques/document_augmentation.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/document_augmentation.py)**
|
||||
|
||||
@@ -216,7 +235,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
|
||||
### 🚀 Advanced Retrieval Methods
|
||||
|
||||
14. Fusion Retrieval 🔗
|
||||
15. Fusion Retrieval 🔗
|
||||
- **[LangChain](all_rag_techniques/fusion_retrieval.ipynb)**
|
||||
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/fusion_retrieval_with_llamaindex.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/fusion_retrieval.py)**
|
||||
@@ -227,7 +246,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Combine keyword-based search with vector-based search for more comprehensive and accurate retrieval.
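A hedged sketch using LangChain's `EnsembleRetriever` to fuse BM25 keyword search with dense vector search (the linked notebook implements its own fusion; `cleaned_texts` and `vectorstore` are assumed to exist and `rank_bm25` to be installed):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retriever over the same chunks that were embedded into the vector store
bm25_retriever = BM25Retriever.from_documents(cleaned_texts)
bm25_retriever.k = 5

# Dense vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Weighted fusion of the two result lists
fusion_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

docs = fusion_retriever.invoke("What is the main cause of climate change?")
```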
|
||||
|
||||
15. Intelligent Reranking 📈
|
||||
16. Intelligent Reranking 📈
|
||||
- **[LangChain](all_rag_techniques/reranking.ipynb)**
|
||||
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/reranking_with_llamaindex.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/reranking.py)**
|
||||
@@ -243,7 +262,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Additional Resources 📚
|
||||
- **[Relevance Revolution: How Re-ranking Transforms RAG Systems](https://open.substack.com/pub/diamantai/p/relevance-revolution-how-re-ranking?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of re-ranking in enhancing RAG system performance.
|
||||
|
||||
16. Multi-faceted Filtering 🔍
|
||||
17. Multi-faceted Filtering 🔍
|
||||
|
||||
#### Overview 🔎
|
||||
Applying various filtering techniques to refine and improve the quality of retrieved results.
|
||||
@@ -254,7 +273,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
- 📄 **Content Filtering:** Remove results that don't match specific content criteria or essential keywords.
|
||||
- 🌈 **Diversity Filtering:** Ensure result diversity by filtering out near-duplicate entries.
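A minimal, illustrative filter chain combining the four ideas above (the document dictionary shape, keys and thresholds are hypothetical):

```python
from typing import Any, Dict, List, Optional, Set, Tuple

def filter_results(
    results: List[Tuple[Dict[str, Any], float]],  # (document dict, similarity score) pairs
    min_score: float = 0.75,
    required_keyword: Optional[str] = None,
    allowed_sources: Optional[Set[str]] = None,
    dedup_prefix_len: int = 100,
) -> List[Dict[str, Any]]:
    """Apply metadata, similarity, content and diversity filters to retrieved results."""
    filtered, seen_prefixes = [], set()
    for doc, score in results:
        if score < min_score:  # similarity threshold
            continue
        if allowed_sources and doc.get("source") not in allowed_sources:  # metadata filtering
            continue
        if required_keyword and required_keyword.lower() not in doc["text"].lower():  # content filtering
            continue
        prefix = doc["text"][:dedup_prefix_len]
        if prefix in seen_prefixes:  # crude diversity filter against near-duplicates
            continue
        seen_prefixes.add(prefix)
        filtered.append(doc)
    return filtered
```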
|
||||
|
||||
17. Hierarchical Indices 🗂️
|
||||
18. Hierarchical Indices 🗂️
|
||||
- **[LangChain](all_rag_techniques/hierarchical_indices.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/hierarchical_indices.py)**
|
||||
|
||||
@@ -267,7 +286,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Additional Resources 📚
|
||||
- **[Hierarchical Indices: Enhancing RAG Systems](https://open.substack.com/pub/diamantai/p/hierarchical-indices-enhancing-rag?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of hierarchical indices in enhancing RAG system performance.
|
||||
|
||||
18. Ensemble Retrieval 🎭
|
||||
19. Ensemble Retrieval 🎭
|
||||
|
||||
#### Overview 🔎
|
||||
Combining multiple retrieval models or techniques for more robust and accurate results.
|
||||
@@ -275,7 +294,16 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.
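A small sketch of one such voting mechanism, reciprocal rank fusion over the rankings produced by different retrievers (illustrative only; the document ids are hypothetical):

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking (RRF voting)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Each retriever (e.g. different embedding models or BM25) contributes its own ranking
dense_ranking = ["doc_3", "doc_1", "doc_7"]
sparse_ranking = ["doc_1", "doc_3", "doc_9"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))  # doc_1 and doc_3 rise to the top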
|
||||
|
||||
19. Multi-modal Retrieval 📽️
|
||||
20. Dartboard Retrieval 🎯
|
||||
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/dartboard.ipynb)**
|
||||
#### Overview 🔎
|
||||
Optimizing over Relevant Information Gain in Retrieval
|
||||
|
||||
#### Implementation 🛠️
|
||||
- Combine both relevance and diversity into a single scoring function and directly optimize for it.
|
||||
- A proof of concept (POC) shows plain RAG underperforming when the database is dense, while dartboard retrieval outperforms it (see the sketch below).
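A simplified greedy sketch of combining relevance and diversity into a single score (closer in spirit to maximal marginal relevance than to the paper's exact information-gain formulation; the weights correspond to the RELEVANCE_WEIGHT / DIVERSITY_WEIGHT knobs in the notebook):

```python
import numpy as np

def greedy_relevance_diversity_select(
    query_vec: np.ndarray,   # (d,) query embedding
    doc_vecs: np.ndarray,    # (n, d) candidate document embeddings
    k: int = 3,
    relevance_weight: float = 1.0,
    diversity_weight: float = 1.0,
) -> list:
    """Greedily pick k documents that score high on relevance and distance to already-picked docs."""
    relevance = doc_vecs @ query_vec  # acts as cosine similarity if vectors are normalised
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(doc_vecs)):
            if i in selected:
                continue
            # Diversity = distance to the closest already-selected document
            diversity = min(
                (np.linalg.norm(doc_vecs[i] - doc_vecs[j]) for j in selected), default=0.0
            )
            score = relevance_weight * relevance[i] + diversity_weight * diversity
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```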
|
||||
|
||||
21. Multi-modal Retrieval 📽️
|
||||
|
||||
#### Overview 🔎
|
||||
Extending RAG capabilities to handle diverse data types for richer responses.
|
||||
@@ -287,7 +315,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
|
||||
### 🔁 Iterative and Adaptive Techniques
|
||||
|
||||
20. Retrieval with Feedback Loops 🔁
|
||||
22. Retrieval with Feedback Loops 🔁
|
||||
- **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
|
||||
|
||||
@@ -297,7 +325,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
|
||||
|
||||
21. Adaptive Retrieval 🎯
|
||||
23. Adaptive Retrieval 🎯
|
||||
- **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
|
||||
|
||||
@@ -307,7 +335,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
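An illustrative routing sketch (the names `factual_retriever`, `analytical_retriever` and `opinion_retriever` are hypothetical stand-ins for differently configured retrievers; the notebook's categories may differ):

```python
from langchain_openai import ChatOpenAI

# Hypothetical registry: each value is a retriever tuned for that query type
strategy_by_category = {
    "factual": factual_retriever,        # e.g. small k, high-precision dense retrieval
    "analytical": analytical_retriever,  # e.g. larger k plus reranking for broad coverage
    "opinion": opinion_retriever,        # e.g. diversity-focused retrieval
}

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def adaptive_retrieve(query: str):
    """Classify the query, then delegate to the retrieval strategy for that category."""
    category = llm.invoke(
        "Classify this query as one of: factual, analytical, opinion. "
        f"Answer with a single word.\n\nQuery: {query}"
    ).content.strip().lower()
    retriever = strategy_by_category.get(category, factual_retriever)  # default fallback
    return retriever.invoke(query)
```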
|
||||
|
||||
22. Iterative Retrieval 🔄
|
||||
24. Iterative Retrieval 🔄
|
||||
|
||||
#### Overview 🔎
|
||||
Performing multiple rounds of retrieval to refine and enhance result quality.
|
||||
@@ -317,7 +345,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
|
||||
### 📊 Evaluation
|
||||
|
||||
23. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
|
||||
25. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
|
||||
|
||||
#### Overview 🔎
|
||||
Performing evaluations of Retrieval-Augmented Generation systems by covering several metrics and creating test cases.
|
||||
@@ -326,7 +354,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems.
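A minimal sketch of what such a test case can look like with `deepeval` (metric choices, criteria and thresholds are illustrative; the notebook's exact setup may differ):

```python
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

test_case = LLMTestCase(
    input="What is the main cause of climate change?",
    actual_output="The increase in greenhouse gases from human activities.",
    expected_output="Human emissions of greenhouse gases such as CO2.",
    retrieval_context=[
        "The primary cause of recent climate change is the increase in "
        "greenhouse gases in the atmosphere."
    ],
)

# Correctness is expressed as a custom GEval metric; faithfulness and
# contextual relevancy use deepeval's built-in metrics.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

evaluate(
    test_cases=[test_case],
    metrics=[correctness, FaithfulnessMetric(threshold=0.7), ContextualRelevancyMetric(threshold=0.7)],
)
```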
|
||||
|
||||
|
||||
24. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
|
||||
26. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
|
||||
|
||||
#### Overview 🔎
|
||||
Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests.
|
||||
@@ -337,7 +365,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
|
||||
### 🔬 Explainability and Transparency
|
||||
|
||||
25. Explainable Retrieval 🔍
|
||||
27. Explainable Retrieval 🔍
|
||||
- **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
|
||||
|
||||
@@ -349,7 +377,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
|
||||
### 🏗️ Advanced Architectures
|
||||
|
||||
26. Knowledge Graph Integration (Graph RAG) 🕸️
|
||||
28. Knowledge Graph Integration (Graph RAG) 🕸️
|
||||
- **[LangChain](all_rag_techniques/graph_rag.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
|
||||
|
||||
@@ -359,7 +387,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.
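A toy sketch of the structured half of this idea (the graph, entities and relations are hypothetical; in the notebook the graph is built from the corpus, typically with LLM-based entity and relation extraction):

```python
import networkx as nx

# Toy knowledge graph; in practice this is extracted from the corpus
kg = nx.DiGraph()
kg.add_edge("fossil fuels", "CO2 emissions", relation="produce")
kg.add_edge("CO2 emissions", "greenhouse effect", relation="intensify")
kg.add_edge("greenhouse effect", "global warming", relation="drives")

def graph_context(query_entities):
    """Collect (subject, relation, object) facts attached to the query entities."""
    facts = []
    for entity in query_entities:
        if entity not in kg:
            continue
        for _, neighbor, data in kg.edges(entity, data=True):
            facts.append(f"{entity} {data['relation']} {neighbor}")
    return facts

# These structured facts are appended to the retrieved text chunks before generation
print(graph_context(["fossil fuels", "greenhouse effect"]))
```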
|
||||
|
||||
27. GraphRag (Microsoft) 🎯
|
||||
29. GraphRag (Microsoft) 🎯
|
||||
- **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**
|
||||
|
||||
#### Overview 🔎
|
||||
@@ -368,7 +396,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
• Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
|
||||
|
||||
28. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
|
||||
30. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
|
||||
- **[LangChain](all_rag_techniques/raptor.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
|
||||
|
||||
@@ -378,7 +406,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.
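A heavily simplified sketch of the bottom-up tree construction (the real method clusters chunks by embedding, e.g. with a Gaussian mixture, rather than grouping neighbours; `gpt-4o-mini` is an arbitrary model choice):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def summarize(texts):
    """Abstractively summarize a group of texts with the LLM."""
    joined = "\n\n".join(texts)
    return llm.invoke(f"Summarize the key points of the following texts:\n\n{joined}").content

def build_tree(chunks, group_size=4):
    """Build summary levels bottom-up until a single root summary remains."""
    levels = [chunks]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Group nodes and summarize each group into one parent node
        parents = [summarize(current[i:i + group_size]) for i in range(0, len(current), group_size)]
        levels.append(parents)
    return levels  # index every level so retrieval can match leaves or higher-level summaries
```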
|
||||
|
||||
29. Self RAG 🔁
|
||||
31. Self RAG 🔁
|
||||
- **[LangChain](all_rag_techniques/self_rag.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
|
||||
|
||||
@@ -388,7 +416,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
#### Implementation 🛠️
|
||||
• Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
|
||||
|
||||
30. Corrective RAG 🔧
|
||||
32. Corrective RAG 🔧
|
||||
- **[LangChain](all_rag_techniques/crag.ipynb)**
|
||||
- **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
|
||||
|
||||
@@ -400,7 +428,7 @@ Explore the extensive list of cutting-edge RAG techniques:
|
||||
|
||||
## 🌟 Special Advanced Technique 🌟
|
||||
|
||||
31. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
|
||||
33. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
|
||||
|
||||
#### Overview 🔎
|
||||
An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.
|
||||
|
||||
558
all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb
Normal file
@@ -0,0 +1,558 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Hypothetical Prompt Embeddings (HyPE)\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n",
|
||||
"\n",
|
||||
"## Key Components of notebook\n",
|
||||
"\n",
|
||||
"1. PDF processing and text extraction\n",
|
||||
"2. Text chunking to maintain coherent information units\n",
|
||||
"3. **Hypothetical Prompt Embedding Generation** using an LLM to create multiple proxy questions per chunk\n",
|
||||
"4. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
|
||||
"5. Retriever setup for querying the processed documents\n",
|
||||
"6. Evaluation of the RAG system\n",
|
||||
"\n",
|
||||
"## Method Details\n",
|
||||
"\n",
|
||||
"### Document Preprocessing\n",
|
||||
"\n",
|
||||
"1. The PDF is loaded using `PyPDFLoader`.\n",
|
||||
"2. The text is split into chunks using `RecursiveCharacterTextSplitter` with specified chunk size and overlap.\n",
|
||||
"\n",
|
||||
"### Hypothetical Question Generation\n",
|
||||
"\n",
|
||||
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
|
||||
"\n",
|
||||
"### Vector Store Creation\n",
|
||||
"\n",
|
||||
"1. Each hypothetical question is embedded using OpenAI embeddings.\n",
|
||||
"2. A FAISS vector store is built, associating **each question embedding with its original chunk**.\n",
|
||||
"3. This approach **stores multiple representations per chunk**, increasing retrieval flexibility.\n",
|
||||
"\n",
|
||||
"### Retriever Setup\n",
|
||||
"\n",
|
||||
"1. The retriever is optimized for **question-question matching** rather than direct document retrieval.\n",
|
||||
"2. The FAISS index enables **efficient nearest-neighbor** search over the hypothetical prompt embeddings.\n",
|
||||
"3. Retrieved chunks provide a **richer and more precise context** for downstream LLM generation.\n",
|
||||
"\n",
|
||||
"## Key Features\n",
|
||||
"\n",
|
||||
"1. **Precomputed Hypothetical Prompts** – Improves query alignment without runtime overhead.\n",
|
||||
"2. **Multi-Vector Representation**– Each chunk is indexed multiple times for broader semantic coverage.\n",
|
||||
"3. **Efficient Retrieval** – FAISS ensures fast similarity search over the enhanced embeddings.\n",
|
||||
"4. **Modular Design** – The pipeline is easy to adapt for different datasets and retrieval settings. Additionally it's compatible with most optimizations like reranking etc.\n",
|
||||
"\n",
|
||||
"## Evaluation\n",
|
||||
"\n",
|
||||
"HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
|
||||
"\n",
|
||||
"- Up to 42 percentage points improvement in retrieval precision\n",
|
||||
"- Up to 45 percentage points improvement in claim recall\n",
|
||||
" (See full evaluation results in [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
|
||||
"\n",
|
||||
"## Benefits of this Approach\n",
|
||||
"\n",
|
||||
"1. **Eliminates Query-Time Overhead** – All hypothetical generation is done offline at indexing.\n",
|
||||
"2. **Enhanced Retrieval Precision** – Better alignment between queries and stored content.\n",
|
||||
"3. **Scalable & Efficient** – No addinal per-query computational cost; retrieval is as fast as standard RAG.\n",
|
||||
"4. **Flexible & Extensible** – Can be combined with advanced RAG techniques like reranking.\n",
|
||||
"\n",
|
||||
"## Conclusion\n",
|
||||
"\n",
|
||||
"HyPE provides a scalable and efficient alternative to traditional RAG systems, overcoming query-document style mismatch while avoiding the computational cost of runtime query expansion. By moving hypothetical prompt generation to indexing, it significantly enhances retrieval precision and efficiency, making it a practical solution for real-world applications.\n",
|
||||
"\n",
|
||||
"For further details, refer to the full paper: [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:70%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Import libraries and environment variables"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 63,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import sys\n",
|
||||
"import faiss\n",
|
||||
"from tqdm import tqdm\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"from concurrent.futures import ThreadPoolExecutor, as_completed\n",
|
||||
"from langchain_community.docstore.in_memory import InMemoryDocstore\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Load environment variables from a .env file\n",
|
||||
"load_dotenv()\n",
|
||||
"\n",
|
||||
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
|
||||
"if not os.getenv('OPENAI_API_KEY'):\n",
|
||||
" os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
|
||||
"else:\n",
|
||||
" os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
|
||||
"\n",
|
||||
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
|
||||
"from helper_functions import *\n",
|
||||
"from evaluation.evalute_rag import *\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Define constants\n",
|
||||
"\n",
|
||||
"- `PATH`: path to the data, to be embedded into the RAG pipeline\n",
|
||||
"\n",
|
||||
"This tutorial uses OpenAI endpoint ([avalible models](https://platform.openai.com/docs/pricing)). \n",
|
||||
"- `LANGUAGE_MODEL_NAME`: The name of the language model to be used. \n",
|
||||
"- `EMBEDDING_MODEL_NAME`: The name of the embedding model to be used.\n",
|
||||
"\n",
|
||||
"The tutroial uses a `RecursiveCharacterTextSplitter` chunking approach where the chunking length function used is python `len` function. The chunking varables to be tweaked here are:\n",
|
||||
"- `CHUNK_SIZE`: The minimum length of one chunk\n",
|
||||
"- `CHUNK_OVERLAP`: The overlap of two consecutive chunks."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 64,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"PATH = \"../data/Understanding_Climate_Change.pdf\"\n",
|
||||
"LANGUAGE_MODEL_NAME = \"gpt-4o-mini\"\n",
|
||||
"EMBEDDING_MODEL_NAME = \"text-embedding-3-small\"\n",
|
||||
"CHUNK_SIZE = 1000\n",
|
||||
"CHUNK_OVERLAP = 200"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Define generation of Hypothetical Prompt Embeddings\n",
|
||||
"\n",
|
||||
"The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.\n",
|
||||
"\n",
|
||||
"- An LLM extracts key questions from the input chunk.\n",
|
||||
"- These questions are embedded using OpenAI's model.\n",
|
||||
"- The function returns the original chunk and its prompt embeddings later used for retrieval.\n",
|
||||
"\n",
|
||||
"To ensure clean output, extra newlines are removed, and regex parsing can improve list formatting when needed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 65,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def generate_hypothetical_prompt_embeddings(chunk_text: str):\n",
|
||||
" \"\"\"\n",
|
||||
" Uses the LLM to generate multiple hypothetical questions for a single chunk.\n",
|
||||
" These questions will be used as 'proxies' for the chunk during retrieval.\n",
|
||||
"\n",
|
||||
" Parameters:\n",
|
||||
" chunk_text (str): Text contents of the chunk\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" chunk_text (str): Text contents of the chunk. This is done to make the \n",
|
||||
" multithreading easier\n",
|
||||
" hypothetical prompt embeddings (List[float]): A list of embedding vectors\n",
|
||||
" generated from the questions\n",
|
||||
" \"\"\"\n",
|
||||
" llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)\n",
|
||||
" embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)\n",
|
||||
"\n",
|
||||
" question_gen_prompt = PromptTemplate.from_template(\n",
|
||||
" \"Analyze the input text and generate essential questions that, when answered, \\\n",
|
||||
" capture the main points of the text. Each question should be one line, \\\n",
|
||||
" without numbering or prefixes.\\n\\n \\\n",
|
||||
" Text:\\n{chunk_text}\\n\\nQuestions:\\n\"\n",
|
||||
" )\n",
|
||||
" question_chain = question_gen_prompt | llm | StrOutputParser()\n",
|
||||
"\n",
|
||||
" # parse questions from response\n",
|
||||
" # Notes: \n",
|
||||
" # - gpt4o likes to split questions by \\n\\n so we remove one \\n\n",
|
||||
" # - for production or if using smaller models from ollama, it's beneficial to use regex to parse \n",
|
||||
" # things like (un)ordeed lists\n",
|
||||
" # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
|
||||
" questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
|
||||
" \n",
|
||||
" return chunk_text, embedding_model.embed_documents(questions)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Define creation and population of FAISS Vector Store\n",
|
||||
"\n",
|
||||
"The code block below builds a FAISS vector store by embedding text chunks in parallel.\n",
|
||||
"\n",
|
||||
"What happens?\n",
|
||||
"- Parallel processing – Uses threading to generate embeddings faster.\n",
|
||||
"- FAISS initialization – Sets up an L2 index for efficient similarity search.\n",
|
||||
"- Chunk embedding – Each chunk is stored multiple times, once for each generated question embedding.\n",
|
||||
"- In-memory storage – Uses InMemoryDocstore for fast lookup.\n",
|
||||
"\n",
|
||||
"This ensures efficient retrieval, improving query alignment with precomputed question embeddings."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 66,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def prepare_vector_store(chunks: List[str]):\n",
|
||||
" \"\"\"\n",
|
||||
" Creates and populates a FAISS vector store from a list of text chunks.\n",
|
||||
"\n",
|
||||
" This function processes a list of text chunks in parallel, generating \n",
|
||||
" hypothetical prompt embeddings for each chunk.\n",
|
||||
" The embeddings are stored in a FAISS index for efficient similarity search.\n",
|
||||
"\n",
|
||||
" Parameters:\n",
|
||||
" chunks (List[str]): A list of text chunks to be embedded and stored.\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" FAISS: A FAISS vector store containing the embedded text chunks.\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" # Wait with initialization to see vector lengths\n",
|
||||
" vector_store = None \n",
|
||||
"\n",
|
||||
" with ThreadPoolExecutor() as pool: \n",
|
||||
" # Use threading to speed up generation of prompt embeddings\n",
|
||||
" futures = [pool.submit(generate_hypothetical_prompt_embeddings, c) for c in chunks]\n",
|
||||
" \n",
|
||||
" # Process embeddings as they complete\n",
|
||||
" for f in tqdm(as_completed(futures), total=len(chunks)): \n",
|
||||
" \n",
|
||||
" chunk, vectors = f.result() # Retrieve the processed chunk and its embeddings\n",
|
||||
" \n",
|
||||
" # Initialize the FAISS vector store on the first chunk\n",
|
||||
" if vector_store == None: \n",
|
||||
" vector_store = FAISS(\n",
|
||||
" embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME), # Define embedding model\n",
|
||||
" index=faiss.IndexFlatL2(len(vectors[0])) # Define an L2 index for similarity search\n",
|
||||
" docstore=InMemoryDocstore(), # Use in-memory document storage\n",
|
||||
" index_to_docstore_id={} # Maintain index-to-document mapping\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" # Pair the chunk's content with each generated embedding vector.\n",
|
||||
" # Each chunk is inserted multiple times, once for each prompt vector\n",
|
||||
" chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
|
||||
" \n",
|
||||
" # Add embeddings to the store\n",
|
||||
" vector_store.add_embeddings(chunks_with_embedding_vectors) \n",
|
||||
"\n",
|
||||
" return vector_store # Return the populated vector store\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Encode PDF into a FAISS Vector Store\n",
|
||||
"\n",
|
||||
"The code block below processes a PDF file and stores its content as embeddings for retrieval.\n",
|
||||
"\n",
|
||||
"What happens?\n",
|
||||
"- PDF loading – Extracts text from the document.\n",
|
||||
"- Chunking – Splits text into overlapping segments for better context retention.\n",
|
||||
"- Preprocessing – Cleans text to improve embedding quality.\n",
|
||||
"- Vector store creation – Generates embeddings and stores them in FAISS for retrieval."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 70,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
|
||||
" \"\"\"\n",
|
||||
" Encodes a PDF book into a vector store using OpenAI embeddings.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" path: The path to the PDF file.\n",
|
||||
" chunk_size: The desired size of each text chunk.\n",
|
||||
" chunk_overlap: The amount of overlap between consecutive chunks.\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" A FAISS vector store containing the encoded book content.\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" # Load PDF documents\n",
|
||||
" loader = PyPDFLoader(path)\n",
|
||||
" documents = loader.load()\n",
|
||||
"\n",
|
||||
" # Split documents into chunks\n",
|
||||
" text_splitter = RecursiveCharacterTextSplitter(\n",
|
||||
" chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
|
||||
" )\n",
|
||||
" texts = text_splitter.split_documents(documents)\n",
|
||||
" cleaned_texts = replace_t_with_space(texts)\n",
|
||||
"\n",
|
||||
" vectorstore = prepare_vector_store(cleaned_texts)\n",
|
||||
"\n",
|
||||
" return vectorstore"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create HyPE vector store\n",
|
||||
"\n",
|
||||
"Now we process the PDF and store its embeddings.\n",
|
||||
"This step initializes the FAISS vector store with the encoded document."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 71,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|██████████| 97/97 [00:22<00:00, 4.40it/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Chunk size can be quite large with HyPE as we are not loosing percision with more\n",
|
||||
"# information. For production, test how exhaustive your model is in generating sufficient \n",
|
||||
"# amount of questions per chunk. This will mostly depend on your information density.\n",
|
||||
"chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create retriever\n",
|
||||
"\n",
|
||||
"Now we set up the retriever to fetch relevant chunks from the vector store.\n",
|
||||
"\n",
|
||||
"Retrieves the top `k=3` most relevant chunks based on query similarity."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 79,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 3})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Test retriever\n",
|
||||
"\n",
|
||||
"Now we test retrieval using a sample query.\n",
|
||||
"\n",
|
||||
"- Queries the vector store to find the most relevant chunks.\n",
|
||||
"- Deduplicates results to remove potentially repeated chunks.\n",
|
||||
"- Displays the retrieved context for inspection.\n",
|
||||
"\n",
|
||||
"This step verifies that the retriever returns meaningful and diverse information for the given question."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 80,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Context 1:\n",
|
||||
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
|
||||
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
|
||||
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
|
||||
"unprecedented changes. \n",
|
||||
"Modern Observations \n",
|
||||
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
|
||||
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
|
||||
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
|
||||
"provide a historical record that scientists use to understand past climate conditions and \n",
|
||||
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 2:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 3:\n",
|
||||
"Understanding Climate Change \n",
|
||||
"Chapter 1: Introduction to Climate Change \n",
|
||||
"Climate change refers to significant, long -term changes in the global climate. The term \n",
|
||||
"\"global climate\" encompasses the planet's overall weather patterns, including temperature, \n",
|
||||
"precipitation, and wind patterns, over an extended period. Over the past cent ury, human \n",
|
||||
"activities, particularly the burning of fossil fuels and deforestation, have significantly \n",
|
||||
"contributed to climate change. \n",
|
||||
"Historical Context \n",
|
||||
"The Earth's climate has changed throughout history. Over the past 650,000 years, there have \n",
|
||||
"been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n",
|
||||
"11,700 years ago marking the beginning of the modern climate era and human civilization. \n",
|
||||
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
|
||||
"change the amount of solar energy our planet receives. During the Holocene epoch, which\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"test_query = \"What is the main cause of climate change?\"\n",
|
||||
"context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
|
||||
"context = list(set(context))\n",
|
||||
"show_context(context)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Evaluate results"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 76,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'questions': ['1. **Multiple Choice: Causes of Climate Change**',\n",
|
||||
" ' - What is the primary cause of the current climate change trend?',\n",
|
||||
" ' A) Solar radiation variations',\n",
|
||||
" ' B) Natural cycles of the Earth',\n",
|
||||
" ' C) Human activities, such as burning fossil fuels',\n",
|
||||
" ' D) Volcanic eruptions',\n",
|
||||
" '',\n",
|
||||
" '2. **True or False: Impact on Biodiversity**',\n",
|
||||
" ' - True or False: Climate change does not have any significant impact on the migration patterns and extinction rates of various species.',\n",
|
||||
" '',\n",
|
||||
" '3. **Short Answer: Mitigation Strategies**',\n",
|
||||
" ' - What are two effective strategies that can be implemented at a community level to mitigate the effects of climate change?',\n",
|
||||
" '',\n",
|
||||
" '4. **Matching: Climate Change Effects**',\n",
|
||||
" ' - Match the following effects of climate change (numbered) with their likely consequences (lettered).',\n",
|
||||
" ' 1. Rising sea levels',\n",
|
||||
" ' 2. Increased frequency of extreme weather events',\n",
|
||||
" ' 3. Melting polar ice caps',\n",
|
||||
" ' 4. Ocean acidification',\n",
|
||||
" ' ',\n",
|
||||
" ' A) Displacement of coastal communities',\n",
|
||||
" ' B) Loss of marine biodiversity',\n",
|
||||
" ' C) Increased global temperatures',\n",
|
||||
" ' D) More frequent and severe hurricanes and floods',\n",
|
||||
" '',\n",
|
||||
" '5. **Essay: International Cooperation**',\n",
|
||||
" ' - Discuss the importance of international cooperation in combating climate change. Include examples of successful global agreements or initiatives and explain how they have contributed to addressing climate change.'],\n",
|
||||
" 'results': ['```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
|
||||
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```'],\n",
|
||||
" 'average_scores': None}"
|
||||
]
|
||||
},
|
||||
"execution_count": 76,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"evaluate_rag(chunks_query_retriever)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": ".venv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.12"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -19,7 +19,7 @@
|
||||
"nest_asyncio.apply()\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"\n",
|
||||
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext\n",
|
||||
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
|
||||
"from llama_index.core.prompts import PromptTemplate\n",
|
||||
"\n",
|
||||
"from llama_index.core.evaluation import (\n",
|
||||
@@ -28,6 +28,7 @@
|
||||
" RelevancyEvaluator\n",
|
||||
")\n",
|
||||
"from llama_index.llms.openai import OpenAI\n",
|
||||
"from llama_index.core import Settings\n",
|
||||
"\n",
|
||||
"import openai\n",
|
||||
"import time\n",
|
||||
@@ -90,11 +91,11 @@
|
||||
"# We will use GPT-4 for evaluating the responses\n",
|
||||
"gpt4 = OpenAI(temperature=0, model=\"gpt-4o\")\n",
|
||||
"\n",
|
||||
"# Define service context for GPT-4 for evaluation\n",
|
||||
"service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)\n",
|
||||
"# Set appropriate settings for the LLM\n",
|
||||
"Settings.llm = gpt4\n",
|
||||
"\n",
|
||||
"# Define Faithfulness and Relevancy Evaluators which are based on GPT-4\n",
|
||||
"faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)\n",
|
||||
"# Define Faithfulness Evaluators which are based on GPT-4\n",
|
||||
"faithfulness_gpt4 = FaithfulnessEvaluator()\n",
|
||||
"\n",
|
||||
"faithfulness_new_prompt_template = PromptTemplate(\"\"\" Please tell if a given piece of information is directly supported by the context.\n",
|
||||
" You need to answer with either YES or NO.\n",
|
||||
@@ -123,7 +124,9 @@
|
||||
" \"\"\")\n",
|
||||
"\n",
|
||||
"faithfulness_gpt4.update_prompts({\"your_prompt_key\": faithfulness_new_prompt_template}) # Update the prompts dictionary with the new prompt template\n",
|
||||
"relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)"
|
||||
"\n",
|
||||
"# Define Relevancy Evaluators which are based on GPT-4\n",
|
||||
"relevancy_gpt4 = RelevancyEvaluator()"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -159,10 +162,12 @@
|
||||
" # create vector index\n",
|
||||
" llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
|
||||
"\n",
|
||||
" service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size, chunk_overlap=chunk_size//5) \n",
|
||||
" vector_index = VectorStoreIndex.from_documents(\n",
|
||||
" eval_documents, service_context=service_context\n",
|
||||
" )\n",
|
||||
" Settings.llm = llm\n",
|
||||
" Settings.chunk_size = chunk_size\n",
|
||||
" Settings.chunk_overlap = chunk_size // 5 \n",
|
||||
"\n",
|
||||
" vector_index = VectorStoreIndex.from_documents(eval_documents)\n",
|
||||
" \n",
|
||||
" # build query engine\n",
|
||||
" query_engine = vector_index.as_query_engine(similarity_top_k=5)\n",
|
||||
" num_questions = len(eval_questions)\n",
|
||||
@@ -234,7 +239,7 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": ".venv",
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
@@ -248,7 +253,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.12.0"
|
||||
"version": "3.11.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -97,7 +97,7 @@
|
||||
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path sicnce we work with notebooks\n",
|
||||
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
|
||||
"from helper_functions import *\n",
|
||||
"from evaluation.evalute_rag import *\n",
|
||||
"\n",
|
||||
|
||||
609
all_rag_techniques/dartboard.ipynb
Normal file
@@ -0,0 +1,609 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Dartboard RAG: Retrieval-Augmented Generation with Balanced Relevance and Diversity\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"The **Dartboard RAG** process addresses a common challenge in large knowledge bases: ensuring the retrieved information is both relevant and non-redundant. By explicitly optimizing a combined relevance-diversity scoring function, it prevents multiple top-k documents from offering the same information. This approach is drawn from the elegant method in thepaper:\n",
|
||||
"\n",
|
||||
"> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)\n",
|
||||
"\n",
|
||||
"The paper outlines three variations of the core idea—hybrid RAG (dense + sparse), a cross-encoder version, and a vanilla approach. The **vanilla approach** conveys the fundamental concept most directly, and this implementation extends it with optional weights to control the balance between relevance and diversity.\n",
|
||||
"\n",
|
||||
"## Motivation\n",
|
||||
"\n",
|
||||
"1. **Dense, Overlapping Knowledge Bases** \n",
|
||||
" In large databases, documents may repeat similar content, causing redundancy in top-k retrieval.\n",
|
||||
"\n",
|
||||
"2. **Improved Information Coverage** \n",
|
||||
" Combining relevance and diversity yields a richer set of documents, mitigating the “echo chamber” effect of overly similar content.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"## Key Components\n",
|
||||
"\n",
|
||||
"1. **Relevance & Diversity Combination** \n",
|
||||
" - Computes a score factoring in both how pertinent a document is to the query and how distinct it is from already chosen documents.\n",
|
||||
"\n",
|
||||
"2. **Weighted Balancing** \n",
|
||||
" - Introduces RELEVANCE_WEIGHT and DIVERSITY_WEIGHT to allow dynamic control of scoring. \n",
|
||||
" - Helps in avoiding overly diverse but less relevant results.\n",
|
||||
"\n",
|
||||
"3. **Production-Ready Code** \n",
|
||||
" - Derived from the official implementation yet reorganized for clarity. \n",
|
||||
" - Allows easier integration into existing RAG pipelines.\n",
|
||||
"\n",
|
||||
"## Method Details\n",
|
||||
"\n",
|
||||
"1. **Document Retrieval** \n",
|
||||
" - Obtain an initial set of candidate documents based on similarity (e.g., cosine or BM25). \n",
|
||||
" - Typically retrieves top-N candidates as a starting point.\n",
|
||||
"\n",
|
||||
"2. **Scoring & Selection** \n",
|
||||
" - Each document’s overall score combines **relevance** and **diversity**: \n",
|
||||
" - Select the highest-scoring document, then penalize documents that are overly similar to it. \n",
|
||||
" - Repeat until top-k documents are identified.\n",
|
||||
"\n",
|
||||
"3. **Hybrid / Fusion & Cross-Encoder Support** \n",
|
||||
" Essentially, all you need are distances between documents and the query, and distances between documents. You can easily extract these from hybrid / fusion retrieval or from cross-encoder retrieval. The only recommendation I have is to rely less on raking based scores.\n",
|
||||
" - For **hybrid / fusion retrieval**: Merge similarities (dense and sparse / BM25) into a single distance. This can be achieved by combining cosine similarity over the dense and the sparse vectors (e.g. averaging them). the move to distances is straightforward (1 - mean cosine similarity). \n",
|
||||
" - For **cross-encoders**: You can directly use the cross-encoder similarity scores (1- similarity), potentially adjusting with scaling factors.\n",
|
||||
"\n",
|
||||
"4. **Balancing & Adjustment** \n",
|
||||
" - Tune DIVERSITY_WEIGHT and RELEVANCE_WEIGHT based on your needs and the density of your dataset. \n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"By integrating both **relevance** and **diversity** into retrieval, the Dartboard RAG approach ensures that top-k documents collectively offer richer, more comprehensive information—leading to higher-quality responses in Retrieval-Augmented Generation systems.\n",
|
||||
"\n",
|
||||
"The paper also has an official code implemention, and this code is based on it, but I think this one here is more readable, manageable and production ready."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Import libraries and environment variables"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Please enter your OpenAI API key: \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import sys\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"from scipy.special import logsumexp\n",
|
||||
"from typing import Tuple, List, Any\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"# Load environment variables from a .env file\n",
|
||||
"load_dotenv()\n",
|
||||
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
|
||||
"if not os.getenv('OPENAI_API_KEY'):\n",
|
||||
" print(\"Please enter your OpenAI API key: \")\n",
|
||||
" os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
|
||||
"else:\n",
|
||||
" os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
|
||||
"\n",
|
||||
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
|
||||
"from helper_functions import *\n",
|
||||
"from evaluation.evalute_rag import *\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Read Docs"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"path = \"../data/Understanding_Climate_Change.pdf\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Encode document"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# this part is same like simple_rag.ipynb, only simulating a dense dataset\n",
|
||||
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
|
||||
" \"\"\"\n",
|
||||
" Encodes a PDF book into a vector store using OpenAI embeddings.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" path: The path to the PDF file.\n",
|
||||
" chunk_size: The desired size of each text chunk.\n",
|
||||
" chunk_overlap: The amount of overlap between consecutive chunks.\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" A FAISS vector store containing the encoded book content.\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" # Load PDF documents\n",
|
||||
" loader = PyPDFLoader(path)\n",
|
||||
" documents = loader.load()\n",
|
||||
" documents=documents*5 # load every document 5 times to emulate a dense dataset\n",
|
||||
"\n",
|
||||
" # Split documents into chunks\n",
|
||||
" text_splitter = RecursiveCharacterTextSplitter(\n",
|
||||
" chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
|
||||
" )\n",
|
||||
" texts = text_splitter.split_documents(documents)\n",
|
||||
" cleaned_texts = replace_t_with_space(texts)\n",
|
||||
"\n",
|
||||
" # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
|
||||
" embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
|
||||
" #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
|
||||
"\n",
|
||||
" # Create vector store\n",
|
||||
" vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
|
||||
"\n",
|
||||
" return vectorstore"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create Vector store\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Some helper functions for using the vector store for retrieval.\n",
|
||||
"this part is same like simple_rag.ipynb, only its using the actual FAISS index (not the wrapper)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"def idx_to_text(idx:int):\n",
|
||||
" \"\"\"\n",
|
||||
" Convert a Vector store index to the corresponding text.\n",
|
||||
" \"\"\"\n",
|
||||
" docstore_id = chunks_vector_store.index_to_docstore_id[idx]\n",
|
||||
" document = chunks_vector_store.docstore.search(docstore_id)\n",
|
||||
" return document.page_content\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def get_context(query:str,k:int=5) -> Tuple[np.ndarray, np.ndarray, List[str]]:\n",
|
||||
" \"\"\"\n",
|
||||
" Retrieve top k context items for a query using top k retrieval.\n",
|
||||
" \"\"\"\n",
|
||||
" # regular top k retrieval\n",
|
||||
" q_vec=chunks_vector_store.embedding_function.embed_documents([query])\n",
|
||||
" _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k)\n",
|
||||
"\n",
|
||||
" texts = [idx_to_text(i) for i in indices[0]]\n",
|
||||
" return texts\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"test_query = \"What is the main cause of climate change?\"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Regular top k retrieval\n",
|
||||
"- This demonstration shows that when database is dense (here we simulate density by loading each document 5 times), the results are not good, we don't get the most relevant results. Note that the top 3 results are all repetitions of the same document."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Context 1:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 2:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 3:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"texts=get_context(test_query,k=3)\n",
|
||||
"show_context(texts)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Now for the real part :) "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"### More utils for distances normalization"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def lognorm(dist:np.ndarray, sigma:float):\n",
|
||||
" \"\"\"\n",
|
||||
" Calculate the log-normal probability for a given distance and sigma.\n",
|
||||
" \"\"\"\n",
|
||||
" if sigma < 1e-9: \n",
|
||||
" return -np.inf * dist\n",
|
||||
" return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)\n"
|
||||
]
|
||||
},
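  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick illustration (added for clarity; the values are arbitrary): smaller distances map to\n",
    "# larger log-probabilities, and sigma controls how sharply relevance decays with distance.\n",
    "for d in [0.0, 0.1, 0.3]:\n",
    "    print(d, lognorm(np.array(d), sigma=0.1))\n"
   ]
  },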
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Greedy Dartboard Search\n",
|
||||
"\n",
|
||||
"This is the core algorithm: A search algorithm that selects a diverse set of relevant documents from a collection by balancing two factors: relevance to the query and diversity among selected documents.\n",
|
||||
"\n",
|
||||
"Given distances between a query and documents, plus distances between all documents, the algorithm:\n",
|
||||
"\n",
|
||||
"1. Selects the most relevant document first\n",
|
||||
"2. Iteratively selects additional documents by combining:\n",
|
||||
" - Relevance to the original query\n",
|
||||
" - Diversity from previously selected documents\n",
|
||||
"\n",
|
||||
"The balance between relevance and diversity is controlled by weights:\n",
|
||||
"- `DIVERSITY_WEIGHT`: Importance of difference from existing selections\n",
|
||||
"- `RELEVANCE_WEIGHT`: Importance of relevance to query\n",
|
||||
"- `SIGMA`: Smoothing parameter for probability conversion\n",
|
||||
"\n",
|
||||
"The algorithm returns both the selected documents and their selection scores, making it useful for applications like search results where you want relevant but varied results.\n",
|
||||
"\n",
|
||||
"For example, when searching news articles, it would first return the most relevant article, then find articles that are both on-topic and provide new information, avoiding redundant selections."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Configuration parameters\n",
|
||||
"DIVERSITY_WEIGHT = 1.0 # Weight for diversity in document selection\n",
|
||||
"RELEVANCE_WEIGHT = 1.0 # Weight for relevance to query\n",
|
||||
"SIGMA = 0.1 # Smoothing parameter for probability distribution\n",
|
||||
"\n",
|
||||
"def greedy_dartsearch(\n",
|
||||
" query_distances: np.ndarray,\n",
|
||||
" document_distances: np.ndarray,\n",
|
||||
" documents: List[str],\n",
|
||||
" num_results: int\n",
|
||||
") -> Tuple[List[str], List[float]]:\n",
|
||||
" \"\"\"\n",
|
||||
" Perform greedy dartboard search to select top k documents balancing relevance and diversity.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" query_distances: Distance between query and each document\n",
|
||||
" document_distances: Pairwise distances between documents\n",
|
||||
" documents: List of document texts\n",
|
||||
" num_results: Number of documents to return\n",
|
||||
" \n",
|
||||
" Returns:\n",
|
||||
" Tuple containing:\n",
|
||||
" - List of selected document texts\n",
|
||||
" - List of selection scores for each document\n",
|
||||
" \"\"\"\n",
|
||||
" # Avoid division by zero in probability calculations\n",
|
||||
" sigma = max(SIGMA, 1e-5)\n",
|
||||
" \n",
|
||||
" # Convert distances to probability distributions\n",
|
||||
" query_probabilities = lognorm(query_distances, sigma)\n",
|
||||
" document_probabilities = lognorm(document_distances, sigma)\n",
|
||||
" \n",
|
||||
" # Initialize with most relevant document\n",
|
||||
" \n",
|
||||
" most_relevant_idx = np.argmax(query_probabilities)\n",
|
||||
" selected_indices = np.array([most_relevant_idx])\n",
|
||||
" selection_scores = [1.0] # dummy score for the first document\n",
|
||||
" # Get initial distances from the first selected document\n",
|
||||
" max_distances = document_probabilities[most_relevant_idx]\n",
|
||||
" \n",
|
||||
" # Select remaining documents\n",
|
||||
" while len(selected_indices) < num_results:\n",
|
||||
" # Update maximum distances considering new document\n",
|
||||
" updated_distances = np.maximum(max_distances, document_probabilities)\n",
|
||||
" \n",
|
||||
" # Calculate combined diversity and relevance scores\n",
|
||||
" combined_scores = (\n",
|
||||
" updated_distances * DIVERSITY_WEIGHT +\n",
|
||||
" query_probabilities * RELEVANCE_WEIGHT\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" # Normalize scores and mask already selected documents\n",
|
||||
" normalized_scores = logsumexp(combined_scores, axis=1)\n",
|
||||
" normalized_scores[selected_indices] = -np.inf\n",
|
||||
" \n",
|
||||
" # Select best remaining document\n",
|
||||
" best_idx = np.argmax(normalized_scores)\n",
|
||||
" best_score = np.max(normalized_scores)\n",
|
||||
" \n",
|
||||
" # Update tracking variables\n",
|
||||
" max_distances = updated_distances[best_idx]\n",
|
||||
" selected_indices = np.append(selected_indices, best_idx)\n",
|
||||
" selection_scores.append(best_score)\n",
|
||||
" \n",
|
||||
" # Return selected documents and their scores\n",
|
||||
" selected_documents = [documents[i] for i in selected_indices]\n",
|
||||
" return selected_documents, selection_scores"
|
||||
]
|
||||
},
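  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quick sanity check on toy data (illustrative)\n",
    "The next cell is an added illustration with hand-made distances (not real embeddings): candidates 0 and 1 are near-duplicates and candidate 2 is distinct but still relevant. Plain top-2 relevance would return both duplicates, while the dartboard scoring should return one duplicate plus the distinct document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative toy example (hypothetical numbers, not produced by the embedding model)\n",
    "toy_query_distances = np.array([[0.05, 0.06, 0.10]])\n",
    "toy_document_distances = np.array([\n",
    "    [0.00, 0.02, 0.50],\n",
    "    [0.02, 0.00, 0.50],\n",
    "    [0.50, 0.50, 0.00],\n",
    "])\n",
    "toy_documents = [\"near-duplicate A\", \"near-duplicate B\", \"distinct C\"]\n",
    "\n",
    "selected, scores = greedy_dartsearch(\n",
    "    toy_query_distances, toy_document_distances, toy_documents, num_results=2\n",
    ")\n",
    "print(selected)  # expected: one near-duplicate followed by the distinct document\n"
   ]
  },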
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Dartboard Context Retrieval\n",
|
||||
"\n",
|
||||
"### Main function for using the dartboard retrieval. This serves instead of get_context (which is simple RAG). It:\n",
|
||||
"\n",
|
||||
"1. Takes a text query, vectorizes it, gets the top k documents (and their vectors) via simple RAG\n",
|
||||
"2. Uses these vectors to calculate the similarities to query and between candidate matches\n",
|
||||
"3. Runs the dartboard algorithm to refine the candidate matches to a final list of k documents\n",
|
||||
"4. Returns the final list of documents and their scores"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"def get_context_with_dartboard(\n",
|
||||
" query: str,\n",
|
||||
" num_results: int = 5,\n",
|
||||
" oversampling_factor: int = 3\n",
|
||||
") -> Tuple[List[str], List[float]]:\n",
|
||||
" \"\"\"\n",
|
||||
" Retrieve most relevant and diverse context items for a query using the dartboard algorithm.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" query: The search query string\n",
|
||||
" num_results: Number of context items to return (default: 5)\n",
|
||||
" oversampling_factor: Factor to oversample initial results for better diversity (default: 3)\n",
|
||||
" \n",
|
||||
" Returns:\n",
|
||||
" Tuple containing:\n",
|
||||
" - List of selected context texts\n",
|
||||
" - List of selection scores\n",
|
||||
" \n",
|
||||
" Note:\n",
|
||||
" The function uses cosine similarity converted to distance. Initial retrieval \n",
|
||||
" fetches oversampling_factor * num_results items to ensure sufficient diversity \n",
|
||||
" in the final selection.\n",
|
||||
" \"\"\"\n",
|
||||
" # Embed query and retrieve initial candidates\n",
|
||||
" query_embedding = chunks_vector_store.embedding_function.embed_documents([query])\n",
|
||||
" _, candidate_indices = chunks_vector_store.index.search(\n",
|
||||
" np.array(query_embedding),\n",
|
||||
" k=num_results * oversampling_factor\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" # Get document vectors and texts for candidates\n",
|
||||
" candidate_vectors = np.array(\n",
|
||||
" chunks_vector_store.index.reconstruct_batch(candidate_indices[0])\n",
|
||||
" )\n",
|
||||
" candidate_texts = [idx_to_text(idx) for idx in candidate_indices[0]]\n",
|
||||
" \n",
|
||||
" # Calculate distance matrices\n",
|
||||
" # Using 1 - cosine_similarity as distance metric\n",
|
||||
" document_distances = 1 - np.dot(candidate_vectors, candidate_vectors.T)\n",
|
||||
" query_distances = 1 - np.dot(query_embedding, candidate_vectors.T)\n",
|
||||
" \n",
|
||||
" # Apply dartboard selection algorithm\n",
|
||||
" selected_texts, selection_scores = greedy_dartsearch(\n",
|
||||
" query_distances,\n",
|
||||
" document_distances,\n",
|
||||
" candidate_texts,\n",
|
||||
" num_results\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" return selected_texts, selection_scores"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### dartboard retrieval - results on same query, k, and dataset\n",
|
||||
"- As you can see now the top 3 results are not mere repetitions. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Context 1:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 2:\n",
|
||||
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
|
||||
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
|
||||
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
|
||||
"unprecedented changes. \n",
|
||||
"Modern Observations \n",
|
||||
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
|
||||
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
|
||||
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
|
||||
"provide a historical record that scientists use to understand past climate conditions and \n",
|
||||
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 3:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"texts,scores=get_context_with_dartboard(test_query,k=3)\n",
|
||||
"show_context(texts)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.12"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -239,7 +239,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain_core.prompts import ChatPromptTemplate\n",
|
||||
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
|
||||
"from pydantic import BaseModel, Field\n",
|
||||
"from langchain_groq import ChatGroq\n",
|
||||
"\n",
|
||||
"# Data model\n",
|
||||
|
||||
@@ -8,7 +8,7 @@
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
|
||||
"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://python.langchain.com/docs/how_to/semantic-chunker/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
|
||||
"\n",
|
||||
"## Motivation\n",
|
||||
"\n",
|
||||
|
||||
@@ -289,7 +289,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.12.3"
|
||||
"version": "3.12.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -0,0 +1,203 @@
|
||||
import os
|
||||
import sys
|
||||
import argparse
|
||||
import time
|
||||
import faiss
|
||||
from dotenv import load_dotenv
|
||||
from tqdm import tqdm
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from langchain_community.docstore.in_memory import InMemoryDocstore
|
||||
|
||||
# Add the parent directory to the path since we work with notebooks
|
||||
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
|
||||
|
||||
from helper_functions import *
|
||||
from evaluation.evalute_rag import *
|
||||
|
||||
# Load environment variables from a .env file (e.g., OpenAI API key)
|
||||
load_dotenv()
|
||||
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
|
||||
|
||||
class HyPE:
|
||||
"""
|
||||
A class to handle the HyPE RAG process, which enhances document chunking by
|
||||
generating hypothetical questions as proxies for retrieval.
|
||||
"""
|
||||
|
||||
def __init__(self, path, chunk_size=1000, chunk_overlap=200, n_retrieved=3):
|
||||
"""
|
||||
Initializes the HyPE-based RAG retriever by encoding the PDF document with
|
||||
hypothetical prompt embeddings.
|
||||
|
||||
Args:
|
||||
path (str): Path to the PDF file to encode.
|
||||
chunk_size (int): Size of each text chunk (default: 1000).
|
||||
chunk_overlap (int): Overlap between consecutive chunks (default: 200).
|
||||
n_retrieved (int): Number of chunks to retrieve for each query (default: 3).
|
||||
"""
|
||||
print("\n--- Initializing HyPE RAG Retriever ---")
|
||||
|
||||
# Encode the PDF document into a FAISS vector store using hypothetical prompt embeddings
|
||||
start_time = time.time()
|
||||
self.vector_store = self.encode_pdf(path, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
|
||||
self.time_records = {'Chunking': time.time() - start_time}
|
||||
print(f"Chunking Time: {self.time_records['Chunking']:.2f} seconds")
|
||||
|
||||
# Create a retriever from the vector store
|
||||
self.chunks_query_retriever = self.vector_store.as_retriever(search_kwargs={"k": n_retrieved})
|
||||
|
||||
def generate_hypothetical_prompt_embeddings(self, chunk_text):
|
||||
"""
|
||||
Uses an LLM to generate multiple hypothetical questions for a single chunk.
|
||||
These questions act as 'proxies' for the chunk during retrieval.
|
||||
|
||||
Parameters:
|
||||
chunk_text (str): Text contents of the chunk.
|
||||
|
||||
Returns:
|
||||
tuple: (Original chunk text, List of embedding vectors generated from the questions)
|
||||
"""
|
||||
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
|
||||
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
|
||||
|
||||
question_gen_prompt = PromptTemplate.from_template(
|
||||
"Analyze the input text and generate essential questions that, when answered, \
|
||||
capture the main points of the text. Each question should be one line, \
|
||||
without numbering or prefixes.\n\n \
|
||||
Text:\n{chunk_text}\n\nQuestions:\n"
|
||||
)
|
||||
question_chain = question_gen_prompt | llm | StrOutputParser()
|
||||
|
||||
# Parse questions from response
|
||||
questions = question_chain.invoke({"chunk_text": chunk_text}).replace("\n\n", "\n").split("\n")
|
||||
|
||||
return chunk_text, embedding_model.embed_documents(questions)
|
||||
|
||||
def prepare_vector_store(self, chunks):
|
||||
"""
|
||||
Creates and populates a FAISS vector store using hypothetical prompt embeddings.
|
||||
|
||||
Parameters:
|
||||
chunks (List[str]): A list of text chunks to be embedded and stored.
|
||||
|
||||
Returns:
|
||||
FAISS: A FAISS vector store containing the embedded text chunks.
|
||||
"""
|
||||
vector_store = None # Wait to initialize to determine vector size
|
||||
|
||||
with ThreadPoolExecutor() as pool:
|
||||
# Parallelized embedding generation
|
||||
futures = [pool.submit(self.generate_hypothetical_prompt_embeddings, c) for c in chunks]
|
||||
|
||||
for f in tqdm(as_completed(futures), total=len(chunks)):
|
||||
chunk, vectors = f.result() # Retrieve processed chunk and embeddings
|
||||
|
||||
# Initialize FAISS store once vector size is known
|
||||
if vector_store is None:
|
||||
vector_store = FAISS(
|
||||
embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
|
||||
index=faiss.IndexFlatL2(len(vectors[0])),
|
||||
docstore=InMemoryDocstore(),
|
||||
index_to_docstore_id={}
|
||||
)
|
||||
|
||||
# Store multiple vector representations per chunk
|
||||
chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]
|
||||
vector_store.add_embeddings(chunks_with_embedding_vectors)
|
||||
|
||||
return vector_store
|
||||
|
||||
def encode_pdf(self, path, chunk_size=1000, chunk_overlap=200):
|
||||
"""
|
||||
Encodes a PDF document into a vector store using hypothetical prompt embeddings.
|
||||
|
||||
Args:
|
||||
path: The path to the PDF file.
|
||||
chunk_size: The size of each text chunk.
|
||||
chunk_overlap: The overlap between consecutive chunks.
|
||||
|
||||
Returns:
|
||||
A FAISS vector store containing the encoded book content.
|
||||
"""
|
||||
# Load PDF documents
|
||||
loader = PyPDFLoader(path)
|
||||
documents = loader.load()
|
||||
|
||||
# Split documents into chunks
|
||||
text_splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
|
||||
)
|
||||
texts = text_splitter.split_documents(documents)
|
||||
cleaned_texts = replace_t_with_space(texts)
|
||||
|
||||
return self.prepare_vector_store(cleaned_texts)
|
||||
|
||||
def run(self, query):
|
||||
"""
|
||||
Retrieves and displays the context for the given query.
|
||||
|
||||
Args:
|
||||
query (str): The query to retrieve context for.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
# Measure retrieval time
|
||||
start_time = time.time()
|
||||
context = retrieve_context_per_question(query, self.chunks_query_retriever)
|
||||
self.time_records['Retrieval'] = time.time() - start_time
|
||||
print(f"Retrieval Time: {self.time_records['Retrieval']:.2f} seconds")
|
||||
|
||||
# Deduplicate context and display results
|
||||
context = list(set(context))
|
||||
show_context(context)
|
||||
|
||||
|
||||
def validate_args(args):
|
||||
if args.chunk_size <= 0:
|
||||
raise ValueError("chunk_size must be a positive integer.")
|
||||
if args.chunk_overlap < 0:
|
||||
raise ValueError("chunk_overlap must be a non-negative integer.")
|
||||
if args.n_retrieved <= 0:
|
||||
raise ValueError("n_retrieved must be a positive integer.")
|
||||
return args
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser(description="Encode a PDF document and test a HyPE-based RAG system.")
|
||||
parser.add_argument("--path", type=str, default="../data/Understanding_Climate_Change.pdf",
|
||||
help="Path to the PDF file to encode.")
|
||||
parser.add_argument("--chunk_size", type=int, default=1000,
|
||||
help="Size of each text chunk (default: 1000).")
|
||||
parser.add_argument("--chunk_overlap", type=int, default=200,
|
||||
help="Overlap between consecutive chunks (default: 200).")
|
||||
parser.add_argument("--n_retrieved", type=int, default=3,
|
||||
help="Number of chunks to retrieve for each query (default: 3).")
|
||||
parser.add_argument("--query", type=str, default="What is the main cause of climate change?",
|
||||
help="Query to test the retriever (default: 'What is the main cause of climate change?').")
|
||||
parser.add_argument("--evaluate", action="store_true",
|
||||
help="Whether to evaluate the retriever's performance (default: False).")
|
||||
|
||||
return validate_args(parser.parse_args())
|
||||
|
||||
|
||||
def main(args):
|
||||
# Initialize the HyPE-based RAG Retriever
|
||||
hyperag = HyPE(
|
||||
path=args.path,
|
||||
chunk_size=args.chunk_size,
|
||||
chunk_overlap=args.chunk_overlap,
|
||||
n_retrieved=args.n_retrieved
|
||||
)
|
||||
|
||||
# Retrieve context based on the query
|
||||
hyperag.run(args.query)
|
||||
|
||||
# Evaluate the retriever's performance on the query (if requested)
|
||||
if args.evaluate:
|
||||
evaluate_rag(hyperag.chunks_query_retriever)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
# Call the main function with parsed arguments
|
||||
main(parse_args())
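
# Illustrative usage (flags correspond to the argparse defaults above; adjust paths as needed):
#   python HyPE_Hypothetical_Prompt_Embedding.py --path ../data/Understanding_Climate_Change.pdf --n_retrieved 3 --evaluate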
|
||||
@@ -1,11 +1,10 @@
|
||||
import os
|
||||
import sys
|
||||
from dotenv import load_dotenv
|
||||
from langchain.prompts import PromptTemplate
|
||||
from langchain.vectorstores import FAISS
|
||||
from langchain.embeddings import OpenAIEmbeddings
|
||||
from langchain.text_splitter import CharacterTextSplitter
|
||||
from langchain.prompts import PromptTemplate
|
||||
from langchain_core.prompts import PromptTemplate
|
||||
from langchain_community.vectorstores import FAISS
|
||||
from langchain_openai import OpenAIEmbeddings
|
||||
from langchain_text_splitters import CharacterTextSplitter
|
||||
|
||||
from langchain_core.retrievers import BaseRetriever
|
||||
from typing import List, Dict, Any
|
||||
|
||||
@@ -7,6 +7,7 @@ from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
|
||||
from llama_index.core.prompts import PromptTemplate
|
||||
from llama_index.core.evaluation import DatasetGenerator, FaithfulnessEvaluator, RelevancyEvaluator
|
||||
from llama_index.llms.openai import OpenAI
|
||||
from llama_index.core.node_parser import SentenceSplitter
|
||||
|
||||
# Apply asyncio fix for Jupyter notebooks
|
||||
nest_asyncio.apply()
|
||||
@@ -44,7 +45,8 @@ def evaluate_response_time_and_accuracy(chunk_size, eval_questions, eval_documen
|
||||
Settings.llm = llm
|
||||
|
||||
# Create vector index
|
||||
vector_index = VectorStoreIndex.from_documents(eval_documents)
|
||||
splitter = SentenceSplitter(chunk_size=chunk_size)
|
||||
vector_index = VectorStoreIndex.from_documents(eval_documents, transformations=[splitter])
|
||||
|
||||
# Build query engine
|
||||
query_engine = vector_index.as_query_engine(similarity_top_k=5)
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
FORM 10-K FORM 10-KUNITED STATES
|
||||
SECURITIES AND EXCHANGE COMMISSION
|
||||
Washington, D.C.
|
||||
Washington, D.C. 20549
|
||||
FORM 10-K
|
||||
(Mark One)
|
||||
|
||||
@@ -14,12 +14,14 @@ Custom modules:
|
||||
"""
|
||||
|
||||
import json
|
||||
from typing import List, Tuple
|
||||
from typing import List, Tuple, Dict, Any
|
||||
|
||||
from deepeval import evaluate
|
||||
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
|
||||
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
|
||||
from langchain_openai import ChatOpenAI
|
||||
from langchain_core.prompts import PromptTemplate
|
||||
from langchain_core.output_parsers import StrOutputParser
|
||||
|
||||
# 09/15/24 kimmeyh Added path where helper functions is located to the path
|
||||
# Add the parent directory to the path since we work with notebooks
|
||||
@@ -90,41 +92,75 @@ relevance_metric = ContextualRelevancyMetric(
|
||||
include_reason=True
|
||||
)
|
||||
|
||||
def evaluate_rag(chunks_query_retriever, num_questions: int = 5) -> None:
|
||||
def evaluate_rag(retriever, num_questions: int = 5) -> Dict[str, Any]:
|
||||
"""
|
||||
Evaluate the RAG system using predefined metrics.
|
||||
|
||||
Evaluates a RAG system using predefined test questions and metrics.
|
||||
|
||||
Args:
|
||||
chunks_query_retriever: Function to retrieve context chunks for a given query.
|
||||
num_questions (int): Number of questions to evaluate (default: 5).
|
||||
retriever: The retriever component to evaluate
|
||||
num_questions: Number of test questions to generate
|
||||
|
||||
Returns:
|
||||
Dict containing evaluation metrics
|
||||
"""
|
||||
llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=2000)
|
||||
question_answer_from_context_chain = create_question_answer_from_context_chain(llm)
|
||||
|
||||
# Load questions and answers from JSON file
|
||||
q_a_file_name = "../data/q_a.json"
|
||||
with open(q_a_file_name, "r", encoding="utf-8") as json_file:
|
||||
q_a = json.load(json_file)
|
||||
|
||||
questions = [qa["question"] for qa in q_a][:num_questions]
|
||||
ground_truth_answers = [qa["answer"] for qa in q_a][:num_questions]
|
||||
generated_answers = []
|
||||
retrieved_documents = []
|
||||
|
||||
# Generate answers and retrieve documents for each question
|
||||
for question in questions:
|
||||
context = retrieve_context_per_question(question, chunks_query_retriever)
|
||||
retrieved_documents.append(context)
|
||||
context_string = " ".join(context)
|
||||
result = answer_question_from_context(question, context_string, question_answer_from_context_chain)
|
||||
generated_answers.append(result["answer"])
|
||||
|
||||
# Create test cases and evaluate
|
||||
test_cases = create_deep_eval_test_cases(questions, ground_truth_answers, generated_answers, retrieved_documents)
|
||||
evaluate(
|
||||
test_cases=test_cases,
|
||||
metrics=[correctness_metric, faithfulness_metric, relevance_metric]
|
||||
|
||||
# Initialize LLM
|
||||
llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo-preview")
|
||||
|
||||
# Create evaluation prompt
|
||||
eval_prompt = PromptTemplate.from_template("""
|
||||
Evaluate the following retrieval results for the question.
|
||||
|
||||
Question: {question}
|
||||
Retrieved Context: {context}
|
||||
|
||||
Rate on a scale of 1-5 (5 being best) for:
|
||||
1. Relevance: How relevant is the retrieved information to the question?
|
||||
2. Completeness: Does the context contain all necessary information?
|
||||
3. Conciseness: Is the retrieved context focused and free of irrelevant information?
|
||||
|
||||
Provide ratings in JSON format:
|
||||
""")
|
||||
|
||||
# Create evaluation chain
|
||||
eval_chain = (
|
||||
eval_prompt
|
||||
| llm
|
||||
| StrOutputParser()
|
||||
)
|
||||
|
||||
# Generate test questions
|
||||
question_gen_prompt = PromptTemplate.from_template(
|
||||
"Generate {num_questions} diverse test questions about climate change:"
|
||||
)
|
||||
question_chain = question_gen_prompt | llm | StrOutputParser()
|
||||
|
||||
questions = question_chain.invoke({"num_questions": num_questions}).split("\n")
|
||||
|
||||
# Evaluate each question
|
||||
results = []
|
||||
for question in questions:
|
||||
# Get retrieval results
|
||||
context = retriever.get_relevant_documents(question)
|
||||
context_text = "\n".join([doc.page_content for doc in context])
|
||||
|
||||
# Evaluate results
|
||||
eval_result = eval_chain.invoke({
|
||||
"question": question,
|
||||
"context": context_text
|
||||
})
|
||||
results.append(eval_result)
|
||||
|
||||
return {
|
||||
"questions": questions,
|
||||
"results": results,
|
||||
"average_scores": calculate_average_scores(results)
|
||||
}
|
||||
|
||||
def calculate_average_scores(results: List[Dict]) -> Dict[str, float]:
|
||||
"""Calculate average scores across all evaluation results."""
|
||||
# Implementation depends on the exact format of your results
|
||||
pass
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Add any necessary setup or configuration here
|
||||
|
||||
@@ -17,7 +17,7 @@ from enum import Enum
|
||||
|
||||
def replace_t_with_space(list_of_documents):
|
||||
"""
|
||||
Replaces all tab characters ('\t') with spaces in the page content of each document.
|
||||
Replaces all tab characters ('\t') with spaces in the page content of each document
|
||||
|
||||
Args:
|
||||
list_of_documents: A list of document objects, each with a 'page_content' attribute.
|
||||
|
||||
21
images/hype.svg
Normal file
21
images/hype.svg
Normal file
File diff suppressed because one or more lines are too long
|
@@ -208,3 +208,55 @@ nbformat==5.10.4
|
||||
xxhash==3.5.0
|
||||
yarl==1.10.0
|
||||
zipp==3.20.1
|
||||
|
||||
# Core LangChain packages
|
||||
langchain>=0.1.0
|
||||
langchain-core>=0.1.17
|
||||
langchain-community>=0.0.13
|
||||
langchain-openai>=0.0.5
|
||||
langchain-anthropic>=0.0.9
|
||||
langchain-groq>=0.0.1
|
||||
langchain-cohere>=0.0.1
|
||||
|
||||
# Vector stores and embeddings
|
||||
faiss-cpu>=1.7.4
|
||||
chromadb>=0.4.22
|
||||
|
||||
# Document processing
|
||||
PyMuPDF>=1.23.8 # for fitz
|
||||
python-docx>=1.0.1
|
||||
pypdf>=3.17.4
|
||||
rank-bm25>=0.2.2
|
||||
|
||||
# Machine Learning and Data Science
|
||||
numpy>=1.24.3
|
||||
pandas>=2.0.3
|
||||
scikit-learn>=1.3.0
|
||||
|
||||
# API Clients
|
||||
openai>=1.12.0
|
||||
anthropic>=0.8.1
|
||||
cohere>=4.48
|
||||
groq>=0.4.2
|
||||
|
||||
# Testing and Evaluation
|
||||
pytest>=7.4.0
|
||||
deepeval>=0.20.12
|
||||
grouse>=0.3.0
|
||||
|
||||
# Development Tools
|
||||
python-dotenv>=1.0.0
|
||||
jupyter>=1.0.0
|
||||
notebook>=7.0.6
|
||||
ipykernel>=6.29.2
|
||||
|
||||
# Type Checking
|
||||
pydantic>=2.6.1
|
||||
typing-extensions>=4.9.0
|
||||
|
||||
# Async Support
|
||||
aiohttp>=3.9.1
|
||||
asyncio>=3.4.3
|
||||
|
||||
# Utilities
|
||||
tqdm>=4.66.1
|
||||
|
||||
@@ -1,10 +1,18 @@
|
||||
import pytest
|
||||
import os
|
||||
import sys
|
||||
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
|
||||
from langchain_community.vectorstores import FAISS
|
||||
from langchain_core.prompts import PromptTemplate
|
||||
from langchain_text_splitters import CharacterTextSplitter
|
||||
from dotenv import load_dotenv
|
||||
|
||||
# Add the main folder to sys.path
|
||||
sys.path.append(os.path.abspath(os.path.dirname(__file__) + "/../"))
|
||||
|
||||
# Load environment variables
|
||||
load_dotenv()
|
||||
|
||||
def pytest_addoption(parser):
|
||||
parser.addoption(
|
||||
"--exclude", action="store", help="Comma-separated list of notebook or script files' paths to exclude"
|
||||
@@ -40,4 +48,58 @@ def script_paths(request):
|
||||
|
||||
path_with_full_address = [folder + s for s in include_scripts]
|
||||
|
||||
return path_with_full_address
|
||||
return path_with_full_address
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def llm():
|
||||
"""Fixture for ChatOpenAI model."""
|
||||
return ChatOpenAI(
|
||||
temperature=0,
|
||||
model_name="gpt-4-turbo-preview",
|
||||
max_tokens=4000
|
||||
)
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def embeddings():
|
||||
"""Fixture for OpenAI embeddings."""
|
||||
return OpenAIEmbeddings()
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def text_splitter():
|
||||
"""Fixture for text splitter."""
|
||||
return CharacterTextSplitter(
|
||||
chunk_size=1000,
|
||||
chunk_overlap=200
|
||||
)
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sample_texts():
|
||||
"""Fixture for sample test data."""
|
||||
return [
|
||||
"The Earth is the third planet from the Sun.",
|
||||
"Climate change is a significant global challenge.",
|
||||
"Renewable energy sources include solar and wind power."
|
||||
]
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def vector_store(embeddings, sample_texts, text_splitter):
|
||||
"""Fixture for vector store."""
|
||||
docs = text_splitter.create_documents(sample_texts)
|
||||
return FAISS.from_documents(docs, embeddings)
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def retriever(vector_store):
|
||||
"""Fixture for retriever."""
|
||||
return vector_store.as_retriever(search_kwargs={"k": 2})
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def basic_prompt():
|
||||
"""Fixture for basic prompt template."""
|
||||
return PromptTemplate.from_template("""
|
||||
Answer the following question based on the context provided:
|
||||
|
||||
Context: {context}
|
||||
Question: {question}
|
||||
|
||||
Answer:
|
||||
""")
|
||||