Compare commits

...

34 Commits

Author SHA1 Message Date
NirDiamant
1cfb0d44cb Merge pull request #83 from VakeDomen/feature/hype
Feature/hype
2025-04-01 23:32:26 +03:00
VakeDomen
57e9dcc87a improved markdown 2025-03-10 13:28:16 +00:00
nird
73b91bfa13 updated readme 2025-03-06 00:38:52 +02:00
nird
7d603611bd update subs count 2025-03-06 00:36:56 +02:00
NirDiamant
e76f08482a Merge pull request #87 from anantgupta129/fix/chunk_size
Fix: Ensure Chunk Size Parameter is Properly Utilized
2025-03-06 00:31:15 +02:00
Anant Gupta
942467c05d fix chunk size utilization 2025-03-05 16:12:49 +05:30
VakeDomen
6096797c7e merge main 2025-03-01 18:28:18 +00:00
NirDiamant
42cabf3a9b Merge pull request #85 from Redempt1onzzZZ/main
[Typo] Update crag.ipynb
2025-02-24 09:21:57 +02:00
1ndigo
990ecff889 [Typo] Update crag.ipynb 2025-02-24 13:49:53 +08:00
NirDiamant
3c46cc9b0a Merge pull request #84 from roybka/dartboard_minor_correction
fix scores initialization
2025-02-19 22:29:45 +02:00
rotbka
c4eb7c15e6 fix scores initialization 2025-02-19 22:27:10 +02:00
VakeDomen
165876797c hype image 2025-02-19 21:05:40 +01:00
VakeDomen
91a8a89302 readme 2025-02-19 20:57:57 +01:00
VakeDomen
dfb0e9125b HyPE 2025-02-19 20:40:18 +01:00
NirDiamant
d100326db5 Merge pull request #82 from Un1que11/patch-1
Update semantic_chunking.ipynb
2025-02-19 21:37:27 +02:00
nird
6e1698f962 made the dartboard more understandable 2025-02-19 21:36:18 +02:00
NirDiamant
9b19b48637 Merge pull request #81 from roybka/dartboard_algo
dartboard algo implementation + README
2025-02-19 21:26:53 +02:00
Mikhail Orlov
b0b1b2f72e Update semantic_chunking.ipynb
There was a link to LlamaIndex in the article for Langchain
2025-02-18 13:15:19 +01:00
VakeDomen
06d2f16b4b cp 2025-02-18 10:06:29 +00:00
rotbka
a51359b9c1 better explanation 2025-02-17 17:44:10 +02:00
rotbka
db8b6a7b6c more comments and markdown 2025-02-15 21:05:36 +02:00
rotbka
c1d4bb450f better arrange functions, split them to cells 2025-02-13 09:46:13 +02:00
rotbka
673ffb5b0a improve readme - dartboard section 2025-02-12 19:49:30 +02:00
rotbka
c8791970e9 tidy up 2025-02-12 19:44:18 +02:00
rotbka
0993d27edf dartboard algo implementation + README 2025-02-10 14:12:17 +02:00
NirDiamant
7249e55824 Merge pull request #79 from speedwagon1299/FixServiceContext
Modified from Service Context to LLaMa Settings
2025-02-03 00:50:05 +02:00
speedwagon1299
076560320c Modified from Service Context to LLaMa Settings 2025-02-03 00:06:49 +05:30
nird
a1155a5581 updated contributing 2025-02-02 12:28:23 +02:00
nird
76a529eccf updated code 2025-02-02 12:24:43 +02:00
nird
f50f0e4373 updated imports 2025-02-02 12:15:20 +02:00
NirDiamant
8a9d842ede Merge pull request #76 from speedwagon1299/ReliableRagFix
Fixed pydantic import in reliable_rag.ipynb
2025-02-02 12:03:53 +02:00
speedwagon1299
0ff4ed2270 Fixed pydantic import in reliable rag 2025-02-01 18:16:53 +05:30
nird
2d3344b4f8 updated readme 2025-01-29 23:17:33 +02:00
nird
209dde5430 updated readme 2025-01-29 23:13:41 +02:00
18 changed files with 1658 additions and 82 deletions

View File

@@ -1,4 +1,4 @@
# Contributing to Advanced RAG Techniques
# Contributing to RAG Techniques
Welcome to the world's largest and most comprehensive repository of Retrieval-Augmented Generation (RAG) tutorials! 🌟 We're thrilled you're interested in contributing to this ever-growing knowledge base. Your expertise and creativity can help us maintain our position at the forefront of RAG technology.

View File

@@ -30,9 +30,11 @@ Welcome to one of the most comprehensive and dynamic collections of Retrieval-Au
[![Subscribe to DiamantAI Newsletter](images/subscribe-button.svg)](https://diamantai.substack.com/?r=336pe4&utm_campaign=pub-share-checklist)
*Join thousands of AI enthusiasts getting unique cutting-edge insights and free tutorials! **Plus, subscribers get exclusive early access and special discounts to our upcoming RAG Techniques course!** *
*Join over 15,000 AI enthusiasts getting unique cutting-edge insights and free tutorials!* ***Plus, subscribers get exclusive early access and special 33% discounts to my book and the upcoming RAG Techniques course!***
</div>
[![DiamantAI's newsletter](images/substack_image.png)](https://diamantai.substack.com/?r=336pe4&utm_campaign=pub-share-checklist)
@@ -151,7 +153,24 @@ Explore the extensive list of cutting-edge RAG techniques:
### 📚 Context and Content Enrichment
8. **[Contextual Chunk Headers :label:](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_chunk_headers.ipynb)**
8. Hypothetical Prompt Embeddings (HyPE) ❓🚀
- **[LangChain](all_rag_techniques/HyPE_Hypothetical_Prompt_Embedding.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/HyPE_Hypothetical_Prompt_Embedding.py)**
#### Overview 🔎
HyPE (Hypothetical Prompt Embeddings) is an enhancement to traditional RAG retrieval that **precomputes hypothetical prompts at the indexing stage** and embeds them in place of the chunks, with each question embedding pointing back to its source chunk. This transforms retrieval into a **question-question matching task**, avoiding the need for runtime synthetic answer generation and reducing inference-time computational overhead while **improving retrieval alignment**.
#### Implementation 🛠️
- 📖 **Precomputed Questions:** Instead of embedding document chunks, HyPE **generates multiple hypothetical queries per chunk** at indexing time.
- 🔍 **Question-Question Matching:** User queries are matched against stored hypothetical questions, leading to **better retrieval alignment**.
- ⚡ **No Runtime Overhead:** Unlike HyDE, HyPE does **not require LLM calls at query time**, making retrieval **faster and cheaper**.
- 📈 **Higher Precision & Recall:** Improves retrieval **context precision by up to 42 percentage points** and **claim recall by up to 45 percentage points**.
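A condensed sketch of the indexing step (assuming the `langchain_openai` package; `hype_index_chunk` is an illustrative helper, not the notebook's exact code):

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

def hype_index_chunk(chunk: str) -> list[tuple[list[float], str]]:
    """Embed LLM-generated questions; each vector points back to the original chunk."""
    prompt = (
        "Generate essential questions that capture the main points of the "
        f"following text, one per line, without numbering:\n\n{chunk}"
    )
    questions = llm.invoke(prompt).content.strip().split("\n")
    return [(vec, chunk) for vec in embedder.embed_documents(questions)]
```

At query time, the user query is embedded once and matched against these question vectors; the associated chunks are returned, so no LLM call is needed.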
#### Additional Resources 📚
- **[Preprint: Hypothetical Prompt Embeddings (HyPE)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)** - Research paper detailing the method, evaluation, and benchmarks.
9. **[Contextual Chunk Headers :label:](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_chunk_headers.ipynb)**
#### Overview 🔎
Contextual chunk headers (CCH) is a method of creating document-level and section-level context, and prepending those chunk headers to the chunks prior to embedding them.
@@ -162,7 +181,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚
**[dsRAG](https://github.com/D-Star-AI/dsRAG)**: open-source retrieval engine that implements this technique (and a few other advanced RAG techniques)
9. **[Relevant Segment Extraction 🧩](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/relevant_segment_extraction.ipynb)**
10. **[Relevant Segment Extraction 🧩](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/relevant_segment_extraction.ipynb)**
#### Overview 🔎
Relevant segment extraction (RSE) is a method of dynamically constructing multi-chunk segments of text that are relevant to a given query.
@@ -170,7 +189,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Perform a retrieval post-processing step that analyzes the most relevant chunks and identifies longer multi-chunk segments to provide more complete context to the LLM.
10. Context Enrichment Techniques 📝
11. Context Enrichment Techniques 📝
- **[LangChain](all_rag_techniques/context_enrichment_window_around_chunk.ipynb)**
- **[LlamaIndex](all_rag_techniques/context_enrichment_window_around_chunk_with_llamaindex.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/context_enrichment_window_around_chunk.py)**
@@ -181,7 +200,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Retrieve the most relevant sentence while also accessing the sentences before and after it in the original text.
11. Semantic Chunking 🧠
12. Semantic Chunking 🧠
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/semantic_chunking.ipynb)**
- **[Runnable Script](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/semantic_chunking.py)**
@@ -194,7 +213,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚
- **[Semantic Chunking: Improving AI Information Retrieval](https://open.substack.com/pub/diamantai/p/semantic-chunking-improving-ai-information?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the benefits and implementation of semantic chunking in RAG systems.
12. Contextual Compression 🗜️
13. Contextual Compression 🗜️
- **[LangChain](all_rag_techniques/contextual_compression.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/contextual_compression.py)**
@@ -204,7 +223,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Use an LLM to compress or summarize retrieved chunks, preserving key information relevant to the query.
13. Document Augmentation through Question Generation for Enhanced Retrieval
14. Document Augmentation through Question Generation for Enhanced Retrieval
- **[LangChain](all_rag_techniques/document_augmentation.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/document_augmentation.py)**
@@ -216,7 +235,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🚀 Advanced Retrieval Methods
14. Fusion Retrieval 🔗
15. Fusion Retrieval 🔗
- **[LangChain](all_rag_techniques/fusion_retrieval.ipynb)**
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/fusion_retrieval_with_llamaindex.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/fusion_retrieval.py)**
@@ -227,7 +246,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Combine keyword-based search with vector-based search for more comprehensive and accurate retrieval.
15. Intelligent Reranking 📈
16. Intelligent Reranking 📈
- **[LangChain](all_rag_techniques/reranking.ipynb)**
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/reranking_with_llamaindex.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/reranking.py)**
@@ -243,7 +262,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚
- **[Relevance Revolution: How Re-ranking Transforms RAG Systems](https://open.substack.com/pub/diamantai/p/relevance-revolution-how-re-ranking?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of re-ranking in enhancing RAG system performance.
16. Multi-faceted Filtering 🔍
17. Multi-faceted Filtering 🔍
#### Overview 🔎
Applying various filtering techniques to refine and improve the quality of retrieved results.
@@ -254,7 +273,7 @@ Explore the extensive list of cutting-edge RAG techniques:
- 📄 **Content Filtering:** Remove results that don't match specific content criteria or essential keywords.
- 🌈 **Diversity Filtering:** Ensure result diversity by filtering out near-duplicate entries.
17. Hierarchical Indices 🗂️
18. Hierarchical Indices 🗂️
- **[LangChain](all_rag_techniques/hierarchical_indices.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/hierarchical_indices.py)**
@@ -267,7 +286,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Additional Resources 📚
- **[Hierarchical Indices: Enhancing RAG Systems](https://open.substack.com/pub/diamantai/p/hierarchical-indices-enhancing-rag?r=336pe4&utm_campaign=post&utm_medium=web)** - A comprehensive blog post exploring the power of hierarchical indices in enhancing RAG system performance.
18. Ensemble Retrieval 🎭
19. Ensemble Retrieval 🎭
#### Overview 🔎
Combining multiple retrieval models or techniques for more robust and accurate results.
@@ -275,7 +294,16 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.
19. Multi-modal Retrieval 📽️
20. Dartboard Retrieval 🎯
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/dartboard.ipynb)**
#### Overview 🔎
Optimizing over Relevant Information Gain in retrieval, so that the returned set of documents is both relevant and non-redundant.
#### Implementation 🛠️
- Combine both relevance and diversity into a single scoring function and directly optimize for it (see the sketch below).
- A proof of concept shows plain simple RAG underperforming when the database is dense, while dartboard retrieval outperforms it.
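A condensed sketch of the greedy selection step, distilled from the dartboard notebook added in this change (argument names are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def next_pick(query_logp, doc_logp, picked, max_logp, w_rel=1.0, w_div=1.0):
    """Pick the next document by blending relevance to the query with
    diversity from the documents picked so far (inputs are log-probabilities)."""
    updated = np.maximum(max_logp, doc_logp)  # closeness of candidates to the picked set
    scores = logsumexp(updated * w_div + query_logp * w_rel, axis=1)
    scores[picked] = -np.inf  # never re-pick a document
    best = int(np.argmax(scores))
    return best, updated[best]
```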
21. Multi-modal Retrieval 📽️
#### Overview 🔎
Extending RAG capabilities to handle diverse data types for richer responses.
@@ -287,7 +315,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔁 Iterative and Adaptive Techniques
20. Retrieval with Feedback Loops 🔁
22. Retrieval with Feedback Loops 🔁
- **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
@@ -297,7 +325,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
21. Adaptive Retrieval 🎯
23. Adaptive Retrieval 🎯
- **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
@@ -307,7 +335,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
22. Iterative Retrieval 🔄
24. Iterative Retrieval 🔄
#### Overview 🔎
Performing multiple rounds of retrieval to refine and enhance result quality.
@@ -317,7 +345,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 📊 Evaluation
23. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
25. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
#### Overview 🔎
Performing evaluations of Retrieval-Augmented Generation systems by covering several metrics and creating test cases.
@@ -326,7 +354,7 @@ Explore the extensive list of cutting-edge RAG techniques:
Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems.
24. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
26. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
#### Overview 🔎
Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests.
@@ -337,7 +365,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔬 Explainability and Transparency
25. Explainable Retrieval 🔍
27. Explainable Retrieval 🔍
- **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
@@ -349,7 +377,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🏗️ Advanced Architectures
26. Knowledge Graph Integration (Graph RAG) 🕸️
28. Knowledge Graph Integration (Graph RAG) 🕸️
- **[LangChain](all_rag_techniques/graph_rag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
@@ -359,7 +387,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.
27. GraphRag (Microsoft) 🎯
29. GraphRag (Microsoft) 🎯
- **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**
#### Overview 🔎
@@ -368,7 +396,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
• Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
28. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
30. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
- **[LangChain](all_rag_techniques/raptor.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
@@ -378,7 +406,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.
29. Self RAG 🔁
31. Self RAG 🔁
- **[LangChain](all_rag_techniques/self_rag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
@@ -388,7 +416,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
• Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
30. Corrective RAG 🔧
32. Corrective RAG 🔧
- **[LangChain](all_rag_techniques/crag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
@@ -400,7 +428,7 @@ Explore the extensive list of cutting-edge RAG techniques:
## 🌟 Special Advanced Technique 🌟
31. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
33. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
#### Overview 🔎
An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.

View File

@@ -0,0 +1,558 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hypothetical Prompt Embeddings (HyPE)\n",
"\n",
"## Overview\n",
"\n",
"This code implements a Retrieval-Augmented Generation (RAG) system enhanced by Hypothetical Prompt Embeddings (HyPE). Unlike traditional RAG pipelines that struggle with query-document style mismatch, HyPE precomputes hypothetical questions during the indexing phase. This transforms retrieval into a question-question matching problem, eliminating the need for expensive runtime query expansion techniques.\n",
"\n",
"## Key Components of notebook\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Text chunking to maintain coherent information units\n",
"3. **Hypothetical Prompt Embedding Generation** using an LLM to create multiple proxy questions per chunk\n",
"4. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
"5. Retriever setup for querying the processed documents\n",
"6. Evaluation of the RAG system\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is loaded using `PyPDFLoader`.\n",
"2. The text is split into chunks using `RecursiveCharacterTextSplitter` with specified chunk size and overlap.\n",
"\n",
"### Hypothetical Question Generation\n",
"\n",
"Instead of embedding raw text chunks, HyPE **generates multiple hypothetical prompts** for each chunk. These **precomputed questions** simulate user queries, improving alignment with real-world searches. This removes the need for runtime synthetic answer generation needed in techniques like HyDE.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. Each hypothetical question is embedded using OpenAI embeddings.\n",
"2. A FAISS vector store is built, associating **each question embedding with its original chunk**.\n",
"3. This approach **stores multiple representations per chunk**, increasing retrieval flexibility.\n",
"\n",
"### Retriever Setup\n",
"\n",
"1. The retriever is optimized for **question-question matching** rather than direct document retrieval.\n",
"2. The FAISS index enables **efficient nearest-neighbor** search over the hypothetical prompt embeddings.\n",
"3. Retrieved chunks provide a **richer and more precise context** for downstream LLM generation.\n",
"\n",
"## Key Features\n",
"\n",
"1. **Precomputed Hypothetical Prompts** Improves query alignment without runtime overhead.\n",
"2. **Multi-Vector Representation** Each chunk is indexed multiple times for broader semantic coverage.\n",
"3. **Efficient Retrieval** FAISS ensures fast similarity search over the enhanced embeddings.\n",
"4. **Modular Design** The pipeline is easy to adapt for different datasets and retrieval settings. Additionally it's compatible with most optimizations like reranking etc.\n",
"\n",
"## Evaluation\n",
"\n",
"HyPE's effectiveness is evaluated across multiple datasets, showing:\n",
"\n",
"- Up to 42 percentage points improvement in retrieval precision\n",
"- Up to 45 percentage points improvement in claim recall\n",
" (See full evaluation results in [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335))\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. **Eliminates Query-Time Overhead** All hypothetical generation is done offline at indexing.\n",
"2. **Enhanced Retrieval Precision** Better alignment between queries and stored content.\n",
"3. **Scalable & Efficient** No addinal per-query computational cost; retrieval is as fast as standard RAG.\n",
"4. **Flexible & Extensible** Can be combined with advanced RAG techniques like reranking.\n",
"\n",
"## Conclusion\n",
"\n",
"HyPE provides a scalable and efficient alternative to traditional RAG systems, overcoming query-document style mismatch while avoiding the computational cost of runtime query expansion. By moving hypothetical prompt generation to indexing, it significantly enhances retrieval precision and efficiency, making it a practical solution for real-world applications.\n",
"\n",
"For further details, refer to the full paper: [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)\n",
"\n",
"\n",
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/hype.svg\" alt=\"HyPE\" style=\"width:70%; height:auto;\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import faiss\n",
"from tqdm import tqdm\n",
"from dotenv import load_dotenv\n",
"from concurrent.futures import ThreadPoolExecutor, as_completed\n",
"from langchain_community.docstore.in_memory import InMemoryDocstore\n",
"\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"\n",
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
"if not os.getenv('OPENAI_API_KEY'):\n",
" os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
"else:\n",
" os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define constants\n",
"\n",
"- `PATH`: path to the data, to be embedded into the RAG pipeline\n",
"\n",
"This tutorial uses OpenAI endpoint ([avalible models](https://platform.openai.com/docs/pricing)). \n",
"- `LANGUAGE_MODEL_NAME`: The name of the language model to be used. \n",
"- `EMBEDDING_MODEL_NAME`: The name of the embedding model to be used.\n",
"\n",
"The tutroial uses a `RecursiveCharacterTextSplitter` chunking approach where the chunking length function used is python `len` function. The chunking varables to be tweaked here are:\n",
"- `CHUNK_SIZE`: The minimum length of one chunk\n",
"- `CHUNK_OVERLAP`: The overlap of two consecutive chunks."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"PATH = \"../data/Understanding_Climate_Change.pdf\"\n",
"LANGUAGE_MODEL_NAME = \"gpt-4o-mini\"\n",
"EMBEDDING_MODEL_NAME = \"text-embedding-3-small\"\n",
"CHUNK_SIZE = 1000\n",
"CHUNK_OVERLAP = 200"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define generation of Hypothetical Prompt Embeddings\n",
"\n",
"The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.\n",
"\n",
"- An LLM extracts key questions from the input chunk.\n",
"- These questions are embedded using OpenAI's model.\n",
"- The function returns the original chunk and its prompt embeddings later used for retrieval.\n",
"\n",
"To ensure clean output, extra newlines are removed, and regex parsing can improve list formatting when needed."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"def generate_hypothetical_prompt_embeddings(chunk_text: str):\n",
" \"\"\"\n",
" Uses the LLM to generate multiple hypothetical questions for a single chunk.\n",
" These questions will be used as 'proxies' for the chunk during retrieval.\n",
"\n",
" Parameters:\n",
" chunk_text (str): Text contents of the chunk\n",
"\n",
" Returns:\n",
" chunk_text (str): Text contents of the chunk. This is done to make the \n",
" multithreading easier\n",
" hypothetical prompt embeddings (List[float]): A list of embedding vectors\n",
" generated from the questions\n",
" \"\"\"\n",
" llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)\n",
" embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)\n",
"\n",
" question_gen_prompt = PromptTemplate.from_template(\n",
" \"Analyze the input text and generate essential questions that, when answered, \\\n",
" capture the main points of the text. Each question should be one line, \\\n",
" without numbering or prefixes.\\n\\n \\\n",
" Text:\\n{chunk_text}\\n\\nQuestions:\\n\"\n",
" )\n",
" question_chain = question_gen_prompt | llm | StrOutputParser()\n",
"\n",
" # parse questions from response\n",
" # Notes: \n",
" # - gpt4o likes to split questions by \\n\\n so we remove one \\n\n",
" # - for production or if using smaller models from ollama, it's beneficial to use regex to parse \n",
" # things like (un)ordeed lists\n",
" # r\"^\\s*[\\-\\*\\•]|\\s*\\d+\\.\\s*|\\s*[a-zA-Z]\\)\\s*|\\s*\\(\\d+\\)\\s*|\\s*\\([a-zA-Z]\\)\\s*|\\s*\\([ivxlcdm]+\\)\\s*\"\n",
" questions = question_chain.invoke({\"chunk_text\": chunk_text}).replace(\"\\n\\n\", \"\\n\").split(\"\\n\")\n",
" \n",
" return chunk_text, embedding_model.embed_documents(questions)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define creation and population of FAISS Vector Store\n",
"\n",
"The code block below builds a FAISS vector store by embedding text chunks in parallel.\n",
"\n",
"What happens?\n",
"- Parallel processing Uses threading to generate embeddings faster.\n",
"- FAISS initialization Sets up an L2 index for efficient similarity search.\n",
"- Chunk embedding Each chunk is stored multiple times, once for each generated question embedding.\n",
"- In-memory storage Uses InMemoryDocstore for fast lookup.\n",
"\n",
"This ensures efficient retrieval, improving query alignment with precomputed question embeddings."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"def prepare_vector_store(chunks: List[str]):\n",
" \"\"\"\n",
" Creates and populates a FAISS vector store from a list of text chunks.\n",
"\n",
" This function processes a list of text chunks in parallel, generating \n",
" hypothetical prompt embeddings for each chunk.\n",
" The embeddings are stored in a FAISS index for efficient similarity search.\n",
"\n",
" Parameters:\n",
" chunks (List[str]): A list of text chunks to be embedded and stored.\n",
"\n",
" Returns:\n",
" FAISS: A FAISS vector store containing the embedded text chunks.\n",
" \"\"\"\n",
"\n",
" # Wait with initialization to see vector lengths\n",
" vector_store = None \n",
"\n",
" with ThreadPoolExecutor() as pool: \n",
" # Use threading to speed up generation of prompt embeddings\n",
" futures = [pool.submit(generate_hypothetical_prompt_embeddings, c) for c in chunks]\n",
" \n",
" # Process embeddings as they complete\n",
" for f in tqdm(as_completed(futures), total=len(chunks)): \n",
" \n",
" chunk, vectors = f.result() # Retrieve the processed chunk and its embeddings\n",
" \n",
" # Initialize the FAISS vector store on the first chunk\n",
" if vector_store == None: \n",
" vector_store = FAISS(\n",
" embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME), # Define embedding model\n",
" index=faiss.IndexFlatL2(len(vectors[0])) # Define an L2 index for similarity search\n",
" docstore=InMemoryDocstore(), # Use in-memory document storage\n",
" index_to_docstore_id={} # Maintain index-to-document mapping\n",
" )\n",
" \n",
" # Pair the chunk's content with each generated embedding vector.\n",
" # Each chunk is inserted multiple times, once for each prompt vector\n",
" chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]\n",
" \n",
" # Add embeddings to the store\n",
" vector_store.add_embeddings(chunks_with_embedding_vectors) \n",
"\n",
" return vector_store # Return the populated vector store\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode PDF into a FAISS Vector Store\n",
"\n",
"The code block below processes a PDF file and stores its content as embeddings for retrieval.\n",
"\n",
"What happens?\n",
"- PDF loading Extracts text from the document.\n",
"- Chunking Splits text into overlapping segments for better context retention.\n",
"- Preprocessing Cleans text to improve embedding quality.\n",
"- Vector store creation Generates embeddings and stores them in FAISS for retrieval."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
" \"\"\"\n",
" Encodes a PDF book into a vector store using OpenAI embeddings.\n",
"\n",
" Args:\n",
" path: The path to the PDF file.\n",
" chunk_size: The desired size of each text chunk.\n",
" chunk_overlap: The amount of overlap between consecutive chunks.\n",
"\n",
" Returns:\n",
" A FAISS vector store containing the encoded book content.\n",
" \"\"\"\n",
"\n",
" # Load PDF documents\n",
" loader = PyPDFLoader(path)\n",
" documents = loader.load()\n",
"\n",
" # Split documents into chunks\n",
" text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
" )\n",
" texts = text_splitter.split_documents(documents)\n",
" cleaned_texts = replace_t_with_space(texts)\n",
"\n",
" vectorstore = prepare_vector_store(cleaned_texts)\n",
"\n",
" return vectorstore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create HyPE vector store\n",
"\n",
"Now we process the PDF and store its embeddings.\n",
"This step initializes the FAISS vector store with the encoded document."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 97/97 [00:22<00:00, 4.40it/s]\n"
]
}
],
"source": [
"# Chunk size can be quite large with HyPE as we are not loosing percision with more\n",
"# information. For production, test how exhaustive your model is in generating sufficient \n",
"# amount of questions per chunk. This will mostly depend on your information density.\n",
"chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create retriever\n",
"\n",
"Now we set up the retriever to fetch relevant chunks from the vector store.\n",
"\n",
"Retrieves the top `k=3` most relevant chunks based on query similarity."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={\"k\": 3})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test retriever\n",
"\n",
"Now we test retrieval using a sample query.\n",
"\n",
"- Queries the vector store to find the most relevant chunks.\n",
"- Deduplicates results to remove potentially repeated chunks.\n",
"- Displays the retrieved context for inspection.\n",
"\n",
"This step verifies that the retriever returns meaningful and diverse information for the given question."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"\n",
"\n",
"Context 2:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 3:\n",
"Understanding Climate Change \n",
"Chapter 1: Introduction to Climate Change \n",
"Climate change refers to significant, long -term changes in the global climate. The term \n",
"\"global climate\" encompasses the planet's overall weather patterns, including temperature, \n",
"precipitation, and wind patterns, over an extended period. Over the past cent ury, human \n",
"activities, particularly the burning of fossil fuels and deforestation, have significantly \n",
"contributed to climate change. \n",
"Historical Context \n",
"The Earth's climate has changed throughout history. Over the past 650,000 years, there have \n",
"been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n",
"11,700 years ago marking the beginning of the modern climate era and human civilization. \n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which\n",
"\n",
"\n"
]
}
],
"source": [
"test_query = \"What is the main cause of climate change?\"\n",
"context = retrieve_context_per_question(test_query, chunks_query_retriever)\n",
"context = list(set(context))\n",
"show_context(context)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate results"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'questions': ['1. **Multiple Choice: Causes of Climate Change**',\n",
" ' - What is the primary cause of the current climate change trend?',\n",
" ' A) Solar radiation variations',\n",
" ' B) Natural cycles of the Earth',\n",
" ' C) Human activities, such as burning fossil fuels',\n",
" ' D) Volcanic eruptions',\n",
" '',\n",
" '2. **True or False: Impact on Biodiversity**',\n",
" ' - True or False: Climate change does not have any significant impact on the migration patterns and extinction rates of various species.',\n",
" '',\n",
" '3. **Short Answer: Mitigation Strategies**',\n",
" ' - What are two effective strategies that can be implemented at a community level to mitigate the effects of climate change?',\n",
" '',\n",
" '4. **Matching: Climate Change Effects**',\n",
" ' - Match the following effects of climate change (numbered) with their likely consequences (lettered).',\n",
" ' 1. Rising sea levels',\n",
" ' 2. Increased frequency of extreme weather events',\n",
" ' 3. Melting polar ice caps',\n",
" ' 4. Ocean acidification',\n",
" ' ',\n",
" ' A) Displacement of coastal communities',\n",
" ' B) Loss of marine biodiversity',\n",
" ' C) Increased global temperatures',\n",
" ' D) More frequent and severe hurricanes and floods',\n",
" '',\n",
" '5. **Essay: International Cooperation**',\n",
" ' - Discuss the importance of international cooperation in combating climate change. Include examples of successful global agreements or initiatives and explain how they have contributed to addressing climate change.'],\n",
" 'results': ['```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 1,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 5,\\n \"Completeness\": 4,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 2,\\n \"Completeness\": 1,\\n \"Conciseness\": 2\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 3,\\n \"Conciseness\": 3\\n}\\n```',\n",
" '```json\\n{\\n \"Relevance\": 4,\\n \"Completeness\": 2,\\n \"Conciseness\": 3\\n}\\n```'],\n",
" 'average_scores': None}"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluate_rag(chunks_query_retriever)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -19,7 +19,7 @@
"nest_asyncio.apply()\n",
"from dotenv import load_dotenv\n",
"\n",
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext\n",
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
"from llama_index.core.prompts import PromptTemplate\n",
"\n",
"from llama_index.core.evaluation import (\n",
@@ -28,6 +28,7 @@
" RelevancyEvaluator\n",
")\n",
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.core import Settings\n",
"\n",
"import openai\n",
"import time\n",
@@ -90,11 +91,11 @@
"# We will use GPT-4 for evaluating the responses\n",
"gpt4 = OpenAI(temperature=0, model=\"gpt-4o\")\n",
"\n",
"# Define service context for GPT-4 for evaluation\n",
"service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)\n",
"# Set appropriate settings for the LLM\n",
"Settings.llm = gpt4\n",
"\n",
"# Define Faithfulness and Relevancy Evaluators which are based on GPT-4\n",
"faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)\n",
"# Define Faithfulness Evaluators which are based on GPT-4\n",
"faithfulness_gpt4 = FaithfulnessEvaluator()\n",
"\n",
"faithfulness_new_prompt_template = PromptTemplate(\"\"\" Please tell if a given piece of information is directly supported by the context.\n",
" You need to answer with either YES or NO.\n",
@@ -123,7 +124,9 @@
" \"\"\")\n",
"\n",
"faithfulness_gpt4.update_prompts({\"your_prompt_key\": faithfulness_new_prompt_template}) # Update the prompts dictionary with the new prompt template\n",
"relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)"
"\n",
"# Define Relevancy Evaluators which are based on GPT-4\n",
"relevancy_gpt4 = RelevancyEvaluator()"
]
},
{
@@ -159,10 +162,12 @@
" # create vector index\n",
" llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
"\n",
" service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size, chunk_overlap=chunk_size//5) \n",
" vector_index = VectorStoreIndex.from_documents(\n",
" eval_documents, service_context=service_context\n",
" )\n",
" Settings.llm = llm\n",
" Settings.chunk_size = chunk_size\n",
" Settings.chunk_overlap = chunk_size // 5 \n",
"\n",
" vector_index = VectorStoreIndex.from_documents(eval_documents)\n",
" \n",
" # build query engine\n",
" query_engine = vector_index.as_query_engine(similarity_top_k=5)\n",
" num_questions = len(eval_questions)\n",
@@ -234,7 +239,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -248,7 +253,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
"version": "3.11.0"
}
},
"nbformat": 4,

View File

@@ -97,7 +97,7 @@
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path sicnce we work with notebooks\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n",
"\n",

View File

@@ -0,0 +1,609 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dartboard RAG: Retrieval-Augmented Generation with Balanced Relevance and Diversity\n",
"\n",
"## Overview\n",
"The **Dartboard RAG** process addresses a common challenge in large knowledge bases: ensuring the retrieved information is both relevant and non-redundant. By explicitly optimizing a combined relevance-diversity scoring function, it prevents multiple top-k documents from offering the same information. This approach is drawn from the elegant method in thepaper:\n",
"\n",
"> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)\n",
"\n",
"The paper outlines three variations of the core idea—hybrid RAG (dense + sparse), a cross-encoder version, and a vanilla approach. The **vanilla approach** conveys the fundamental concept most directly, and this implementation extends it with optional weights to control the balance between relevance and diversity.\n",
"\n",
"## Motivation\n",
"\n",
"1. **Dense, Overlapping Knowledge Bases** \n",
" In large databases, documents may repeat similar content, causing redundancy in top-k retrieval.\n",
"\n",
"2. **Improved Information Coverage** \n",
" Combining relevance and diversity yields a richer set of documents, mitigating the “echo chamber” effect of overly similar content.\n",
"\n",
"\n",
"## Key Components\n",
"\n",
"1. **Relevance & Diversity Combination** \n",
" - Computes a score factoring in both how pertinent a document is to the query and how distinct it is from already chosen documents.\n",
"\n",
"2. **Weighted Balancing** \n",
" - Introduces RELEVANCE_WEIGHT and DIVERSITY_WEIGHT to allow dynamic control of scoring. \n",
" - Helps in avoiding overly diverse but less relevant results.\n",
"\n",
"3. **Production-Ready Code** \n",
" - Derived from the official implementation yet reorganized for clarity. \n",
" - Allows easier integration into existing RAG pipelines.\n",
"\n",
"## Method Details\n",
"\n",
"1. **Document Retrieval** \n",
" - Obtain an initial set of candidate documents based on similarity (e.g., cosine or BM25). \n",
" - Typically retrieves top-N candidates as a starting point.\n",
"\n",
"2. **Scoring & Selection** \n",
" - Each documents overall score combines **relevance** and **diversity**: \n",
" - Select the highest-scoring document, then penalize documents that are overly similar to it. \n",
" - Repeat until top-k documents are identified.\n",
"\n",
"3. **Hybrid / Fusion & Cross-Encoder Support** \n",
" Essentially, all you need are distances between documents and the query, and distances between documents. You can easily extract these from hybrid / fusion retrieval or from cross-encoder retrieval. The only recommendation I have is to rely less on raking based scores.\n",
" - For **hybrid / fusion retrieval**: Merge similarities (dense and sparse / BM25) into a single distance. This can be achieved by combining cosine similarity over the dense and the sparse vectors (e.g. averaging them). the move to distances is straightforward (1 - mean cosine similarity). \n",
" - For **cross-encoders**: You can directly use the cross-encoder similarity scores (1- similarity), potentially adjusting with scaling factors.\n",
"\n",
"4. **Balancing & Adjustment** \n",
" - Tune DIVERSITY_WEIGHT and RELEVANCE_WEIGHT based on your needs and the density of your dataset. \n",
"\n",
"\n",
"\n",
"By integrating both **relevance** and **diversity** into retrieval, the Dartboard RAG approach ensures that top-k documents collectively offer richer, more comprehensive information—leading to higher-quality responses in Retrieval-Augmented Generation systems.\n",
"\n",
"The paper also has an official code implemention, and this code is based on it, but I think this one here is more readable, manageable and production ready."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Please enter your OpenAI API key: \n"
]
}
],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from scipy.special import logsumexp\n",
"from typing import Tuple, List, Any\n",
"import numpy as np\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
"if not os.getenv('OPENAI_API_KEY'):\n",
" print(\"Please enter your OpenAI API key: \")\n",
" os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
"else:\n",
" os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# this part is same like simple_rag.ipynb, only simulating a dense dataset\n",
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
" \"\"\"\n",
" Encodes a PDF book into a vector store using OpenAI embeddings.\n",
"\n",
" Args:\n",
" path: The path to the PDF file.\n",
" chunk_size: The desired size of each text chunk.\n",
" chunk_overlap: The amount of overlap between consecutive chunks.\n",
"\n",
" Returns:\n",
" A FAISS vector store containing the encoded book content.\n",
" \"\"\"\n",
"\n",
" # Load PDF documents\n",
" loader = PyPDFLoader(path)\n",
" documents = loader.load()\n",
" documents=documents*5 # load every document 5 times to emulate a dense dataset\n",
"\n",
" # Split documents into chunks\n",
" text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
" )\n",
" texts = text_splitter.split_documents(documents)\n",
" cleaned_texts = replace_t_with_space(texts)\n",
"\n",
" # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
" embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
" #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
"\n",
" # Create vector store\n",
" vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
"\n",
" return vectorstore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Vector store\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some helper functions for using the vector store for retrieval.\n",
"this part is same like simple_rag.ipynb, only its using the actual FAISS index (not the wrapper)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def idx_to_text(idx:int):\n",
" \"\"\"\n",
" Convert a Vector store index to the corresponding text.\n",
" \"\"\"\n",
" docstore_id = chunks_vector_store.index_to_docstore_id[idx]\n",
" document = chunks_vector_store.docstore.search(docstore_id)\n",
" return document.page_content\n",
"\n",
"\n",
"def get_context(query:str,k:int=5) -> Tuple[np.ndarray, np.ndarray, List[str]]:\n",
" \"\"\"\n",
" Retrieve top k context items for a query using top k retrieval.\n",
" \"\"\"\n",
" # regular top k retrieval\n",
" q_vec=chunks_vector_store.embedding_function.embed_documents([query])\n",
" _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k)\n",
"\n",
" texts = [idx_to_text(i) for i in indices[0]]\n",
" return texts\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"\n",
"test_query = \"What is the main cause of climate change?\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular top k retrieval\n",
"- This demonstration shows that when database is dense (here we simulate density by loading each document 5 times), the results are not good, we don't get the most relevant results. Note that the top 3 results are all repetitions of the same document."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts=get_context(test_query,k=3)\n",
"show_context(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now for the real part :) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### More utils for distances normalization"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def lognorm(dist:np.ndarray, sigma:float):\n",
" \"\"\"\n",
" Calculate the log-normal probability for a given distance and sigma.\n",
" \"\"\"\n",
" if sigma < 1e-9: \n",
" return -np.inf * dist\n",
" return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Greedy Dartboard Search\n",
"\n",
"This is the core algorithm: A search algorithm that selects a diverse set of relevant documents from a collection by balancing two factors: relevance to the query and diversity among selected documents.\n",
"\n",
"Given distances between a query and documents, plus distances between all documents, the algorithm:\n",
"\n",
"1. Selects the most relevant document first\n",
"2. Iteratively selects additional documents by combining:\n",
" - Relevance to the original query\n",
" - Diversity from previously selected documents\n",
"\n",
"The balance between relevance and diversity is controlled by weights:\n",
"- `DIVERSITY_WEIGHT`: Importance of difference from existing selections\n",
"- `RELEVANCE_WEIGHT`: Importance of relevance to query\n",
"- `SIGMA`: Smoothing parameter for probability conversion\n",
"\n",
"The algorithm returns both the selected documents and their selection scores, making it useful for applications like search results where you want relevant but varied results.\n",
"\n",
"For example, when searching news articles, it would first return the most relevant article, then find articles that are both on-topic and provide new information, avoiding redundant selections."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Configuration parameters\n",
"DIVERSITY_WEIGHT = 1.0 # Weight for diversity in document selection\n",
"RELEVANCE_WEIGHT = 1.0 # Weight for relevance to query\n",
"SIGMA = 0.1 # Smoothing parameter for probability distribution\n",
"\n",
"def greedy_dartsearch(\n",
" query_distances: np.ndarray,\n",
" document_distances: np.ndarray,\n",
" documents: List[str],\n",
" num_results: int\n",
") -> Tuple[List[str], List[float]]:\n",
" \"\"\"\n",
" Perform greedy dartboard search to select top k documents balancing relevance and diversity.\n",
" \n",
" Args:\n",
" query_distances: Distance between query and each document\n",
" document_distances: Pairwise distances between documents\n",
" documents: List of document texts\n",
" num_results: Number of documents to return\n",
" \n",
" Returns:\n",
" Tuple containing:\n",
" - List of selected document texts\n",
" - List of selection scores for each document\n",
" \"\"\"\n",
" # Avoid division by zero in probability calculations\n",
" sigma = max(SIGMA, 1e-5)\n",
" \n",
" # Convert distances to probability distributions\n",
" query_probabilities = lognorm(query_distances, sigma)\n",
" document_probabilities = lognorm(document_distances, sigma)\n",
" \n",
" # Initialize with most relevant document\n",
" \n",
" most_relevant_idx = np.argmax(query_probabilities)\n",
" selected_indices = np.array([most_relevant_idx])\n",
" selection_scores = [1.0] # dummy score for the first document\n",
" # Get initial distances from the first selected document\n",
" max_distances = document_probabilities[most_relevant_idx]\n",
" \n",
" # Select remaining documents\n",
" while len(selected_indices) < num_results:\n",
" # Update maximum distances considering new document\n",
" updated_distances = np.maximum(max_distances, document_probabilities)\n",
" \n",
" # Calculate combined diversity and relevance scores\n",
" combined_scores = (\n",
" updated_distances * DIVERSITY_WEIGHT +\n",
" query_probabilities * RELEVANCE_WEIGHT\n",
" )\n",
" \n",
" # Normalize scores and mask already selected documents\n",
" normalized_scores = logsumexp(combined_scores, axis=1)\n",
" normalized_scores[selected_indices] = -np.inf\n",
" \n",
" # Select best remaining document\n",
" best_idx = np.argmax(normalized_scores)\n",
" best_score = np.max(normalized_scores)\n",
" \n",
" # Update tracking variables\n",
" max_distances = updated_distances[best_idx]\n",
" selected_indices = np.append(selected_indices, best_idx)\n",
" selection_scores.append(best_score)\n",
" \n",
" # Return selected documents and their scores\n",
" selected_documents = [documents[i] for i in selected_indices]\n",
" return selected_documents, selection_scores"
]
},
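  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before running it on the real index, here is a minimal sketch of `greedy_dartsearch` on four hypothetical documents (made-up 2-D vectors, not real embeddings), one of which is an exact duplicate. The duplicate is maximally relevant but adds no diversity, so the algorithm should skip it in favor of a distinct document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy demonstration (hypothetical vectors, not real embeddings)\n",
    "toy_docs = [\"doc A\", \"doc A (exact duplicate)\", \"doc B\", \"doc C\"]\n",
    "toy_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])\n",
    "toy_vecs /= np.linalg.norm(toy_vecs, axis=1, keepdims=True)  # unit-normalize\n",
    "toy_query = np.array([[1.0, 0.0]])\n",
    "toy_query_distances = 1 - toy_query @ toy_vecs.T      # shape (1, 4)\n",
    "toy_document_distances = 1 - toy_vecs @ toy_vecs.T    # shape (4, 4)\n",
    "docs, scores = greedy_dartsearch(toy_query_distances, toy_document_distances, toy_docs, 2)\n",
    "print(docs)  # expected: ['doc A', 'doc C'] - the duplicate is skipped"
   ]
  },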
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dartboard Context Retrieval\n",
"\n",
"### Main function for using the dartboard retrieval. This serves instead of get_context (which is simple RAG). It:\n",
"\n",
"1. Takes a text query, vectorizes it, gets the top k documents (and their vectors) via simple RAG\n",
"2. Uses these vectors to calculate the similarities to query and between candidate matches\n",
"3. Runs the dartboard algorithm to refine the candidate matches to a final list of k documents\n",
"4. Returns the final list of documents and their scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def get_context_with_dartboard(\n",
" query: str,\n",
" num_results: int = 5,\n",
" oversampling_factor: int = 3\n",
") -> Tuple[List[str], List[float]]:\n",
" \"\"\"\n",
" Retrieve most relevant and diverse context items for a query using the dartboard algorithm.\n",
" \n",
" Args:\n",
" query: The search query string\n",
" num_results: Number of context items to return (default: 5)\n",
" oversampling_factor: Factor to oversample initial results for better diversity (default: 3)\n",
" \n",
" Returns:\n",
" Tuple containing:\n",
" - List of selected context texts\n",
" - List of selection scores\n",
" \n",
" Note:\n",
" The function uses cosine similarity converted to distance. Initial retrieval \n",
" fetches oversampling_factor * num_results items to ensure sufficient diversity \n",
" in the final selection.\n",
" \"\"\"\n",
" # Embed query and retrieve initial candidates\n",
" query_embedding = chunks_vector_store.embedding_function.embed_documents([query])\n",
" _, candidate_indices = chunks_vector_store.index.search(\n",
" np.array(query_embedding),\n",
" k=num_results * oversampling_factor\n",
" )\n",
" \n",
" # Get document vectors and texts for candidates\n",
" candidate_vectors = np.array(\n",
" chunks_vector_store.index.reconstruct_batch(candidate_indices[0])\n",
" )\n",
" candidate_texts = [idx_to_text(idx) for idx in candidate_indices[0]]\n",
" \n",
" # Calculate distance matrices\n",
" # Using 1 - cosine_similarity as distance metric\n",
" document_distances = 1 - np.dot(candidate_vectors, candidate_vectors.T)\n",
" query_distances = 1 - np.dot(query_embedding, candidate_vectors.T)\n",
" \n",
" # Apply dartboard selection algorithm\n",
" selected_texts, selection_scores = greedy_dartsearch(\n",
" query_distances,\n",
" document_distances,\n",
" candidate_texts,\n",
" num_results\n",
" )\n",
" \n",
" return selected_texts, selection_scores"
]
},
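  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: `1 - dot product` equals cosine distance only when the embedding vectors are unit-normalized. OpenAI's embedding models return unit-length vectors, so this holds here; if you swap in a different embedding model, it is worth verifying first. A minimal check (sketch):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check: embeddings should have unit norm for 1 - dot to be a cosine distance\n",
    "sample_vec = np.array(chunks_vector_store.embedding_function.embed_documents([\"sample text\"])[0])\n",
    "print(np.linalg.norm(sample_vec))  # expected: ~1.0"
   ]
  },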
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### dartboard retrieval - results on same query, k, and dataset\n",
"- As you can see now the top 3 results are not mere repetitions. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts,scores=get_context_with_dartboard(test_query,k=3)\n",
"show_context(texts)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -239,7 +239,7 @@
"outputs": [],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from pydantic import BaseModel, Field\n",
"from langchain_groq import ChatGroq\n",
"\n",
"# Data model\n",

View File

@@ -8,7 +8,7 @@
"\n",
"## Overview\n",
"\n",
"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://python.langchain.com/docs/how_to/semantic-chunker/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
"\n",
"## Motivation\n",
"\n",

View File

@@ -289,7 +289,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
"version": "3.12.0"
}
},
"nbformat": 4,

View File

@@ -0,0 +1,203 @@
import os
import sys
import argparse
import time
import faiss
from dotenv import load_dotenv
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
from langchain_community.docstore.in_memory import InMemoryDocstore
# Add the parent directory to the path since we work with notebooks
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from helper_functions import *
from evaluation.evalute_rag import *
# Load environment variables from a .env file (e.g., OpenAI API key)
load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
class HyPE:
"""
A class to handle the HyPE RAG process, which enhances document chunking by
generating hypothetical questions as proxies for retrieval.
"""
def __init__(self, path, chunk_size=1000, chunk_overlap=200, n_retrieved=3):
"""
Initializes the HyPE-based RAG retriever by encoding the PDF document with
hypothetical prompt embeddings.
Args:
path (str): Path to the PDF file to encode.
chunk_size (int): Size of each text chunk (default: 1000).
chunk_overlap (int): Overlap between consecutive chunks (default: 200).
n_retrieved (int): Number of chunks to retrieve for each query (default: 3).
"""
print("\n--- Initializing HyPE RAG Retriever ---")
# Encode the PDF document into a FAISS vector store using hypothetical prompt embeddings
start_time = time.time()
self.vector_store = self.encode_pdf(path, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
self.time_records = {'Chunking': time.time() - start_time}
print(f"Chunking Time: {self.time_records['Chunking']:.2f} seconds")
# Create a retriever from the vector store
self.chunks_query_retriever = self.vector_store.as_retriever(search_kwargs={"k": n_retrieved})
def generate_hypothetical_prompt_embeddings(self, chunk_text):
"""
Uses an LLM to generate multiple hypothetical questions for a single chunk.
These questions act as 'proxies' for the chunk during retrieval.
Parameters:
chunk_text (str): Text contents of the chunk.
Returns:
tuple: (Original chunk text, List of embedding vectors generated from the questions)
"""
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
question_gen_prompt = PromptTemplate.from_template(
"Analyze the input text and generate essential questions that, when answered, \
capture the main points of the text. Each question should be one line, \
without numbering or prefixes.\n\n \
Text:\n{chunk_text}\n\nQuestions:\n"
)
question_chain = question_gen_prompt | llm | StrOutputParser()
# Parse questions from response
questions = question_chain.invoke({"chunk_text": chunk_text}).replace("\n\n", "\n").split("\n")
return chunk_text, embedding_model.embed_documents(questions)
def prepare_vector_store(self, chunks):
"""
Creates and populates a FAISS vector store using hypothetical prompt embeddings.
Parameters:
chunks (List[str]): A list of text chunks to be embedded and stored.
Returns:
FAISS: A FAISS vector store containing the embedded text chunks.
"""
vector_store = None # Wait to initialize to determine vector size
with ThreadPoolExecutor() as pool:
# Parallelized embedding generation
futures = [pool.submit(self.generate_hypothetical_prompt_embeddings, c) for c in chunks]
for f in tqdm(as_completed(futures), total=len(chunks)):
chunk, vectors = f.result() # Retrieve processed chunk and embeddings
# Initialize FAISS store once vector size is known
if vector_store is None:
vector_store = FAISS(
embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
index=faiss.IndexFlatL2(len(vectors[0])),
docstore=InMemoryDocstore(),
index_to_docstore_id={}
)
# Store multiple vector representations per chunk
chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in vectors]
vector_store.add_embeddings(chunks_with_embedding_vectors)
return vector_store
def encode_pdf(self, path, chunk_size=1000, chunk_overlap=200):
"""
Encodes a PDF document into a vector store using hypothetical prompt embeddings.
Args:
path: The path to the PDF file.
chunk_size: The size of each text chunk.
chunk_overlap: The overlap between consecutive chunks.
Returns:
A FAISS vector store containing the encoded book content.
"""
# Load PDF documents
loader = PyPDFLoader(path)
documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
)
texts = text_splitter.split_documents(documents)
cleaned_texts = replace_t_with_space(texts)
return self.prepare_vector_store(cleaned_texts)
def run(self, query):
"""
Retrieves and displays the context for the given query.
Args:
query (str): The query to retrieve context for.
Returns:
None
"""
# Measure retrieval time
start_time = time.time()
context = retrieve_context_per_question(query, self.chunks_query_retriever)
self.time_records['Retrieval'] = time.time() - start_time
print(f"Retrieval Time: {self.time_records['Retrieval']:.2f} seconds")
# Deduplicate context and display results
context = list(set(context))
show_context(context)
def validate_args(args):
if args.chunk_size <= 0:
raise ValueError("chunk_size must be a positive integer.")
if args.chunk_overlap < 0:
raise ValueError("chunk_overlap must be a non-negative integer.")
if args.n_retrieved <= 0:
raise ValueError("n_retrieved must be a positive integer.")
return args
def parse_args():
parser = argparse.ArgumentParser(description="Encode a PDF document and test a HyPE-based RAG system.")
parser.add_argument("--path", type=str, default="../data/Understanding_Climate_Change.pdf",
help="Path to the PDF file to encode.")
parser.add_argument("--chunk_size", type=int, default=1000,
help="Size of each text chunk (default: 1000).")
parser.add_argument("--chunk_overlap", type=int, default=200,
help="Overlap between consecutive chunks (default: 200).")
parser.add_argument("--n_retrieved", type=int, default=3,
help="Number of chunks to retrieve for each query (default: 3).")
parser.add_argument("--query", type=str, default="What is the main cause of climate change?",
help="Query to test the retriever (default: 'What is the main cause of climate change?').")
parser.add_argument("--evaluate", action="store_true",
help="Whether to evaluate the retriever's performance (default: False).")
return validate_args(parser.parse_args())
def main(args):
# Initialize the HyPE-based RAG Retriever
hyperag = HyPE(
path=args.path,
chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap,
n_retrieved=args.n_retrieved
)
# Retrieve context based on the query
hyperag.run(args.query)
# Evaluate the retriever's performance on the query (if requested)
if args.evaluate:
evaluate_rag(hyperag.chunks_query_retriever)
if __name__ == '__main__':
# Call the main function with parsed arguments
main(parse_args())
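# Example invocation (a sketch; assumes this script is saved as hype.py and the
# default sample PDF exists at ../data/Understanding_Climate_Change.pdf):
#   python hype.py --chunk_size 1000 --chunk_overlap 200 --n_retrieved 3 --evaluate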

View File

@@ -1,11 +1,10 @@
import os
import sys
from dotenv import load_dotenv
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.retrievers import BaseRetriever
from typing import List, Dict, Any

View File

@@ -7,6 +7,7 @@ from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.prompts import PromptTemplate
from llama_index.core.evaluation import DatasetGenerator, FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter
# Apply asyncio fix for Jupyter notebooks
nest_asyncio.apply()
@@ -44,7 +45,8 @@ def evaluate_response_time_and_accuracy(chunk_size, eval_questions, eval_documen
Settings.llm = llm
# Create vector index
vector_index = VectorStoreIndex.from_documents(eval_documents)
splitter = SentenceSplitter(chunk_size=chunk_size)
vector_index = VectorStoreIndex.from_documents(eval_documents, transformations=[splitter])
# Build query engine
query_engine = vector_index.as_query_engine(similarity_top_k=5)

View File

@@ -1,5 +1,6 @@
FORM 10-K FORM 10-KUNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C.
Washington, D.C. 20549
FORM 10-K
(Mark One)

View File

@@ -14,12 +14,14 @@ Custom modules:
"""
import json
from typing import List, Tuple
from typing import List, Tuple, Dict, Any
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
# 09/15/24 kimmeyh Added path where helper functions is located to the path
# Add the parent directory to the path since we work with notebooks
@@ -90,41 +92,75 @@ relevance_metric = ContextualRelevancyMetric(
include_reason=True
)
def evaluate_rag(chunks_query_retriever, num_questions: int = 5) -> None:
def evaluate_rag(retriever, num_questions: int = 5) -> Dict[str, Any]:
"""
Evaluate the RAG system using predefined metrics.
Evaluates a RAG system using predefined test questions and metrics.
Args:
chunks_query_retriever: Function to retrieve context chunks for a given query.
num_questions (int): Number of questions to evaluate (default: 5).
retriever: The retriever component to evaluate
num_questions: Number of test questions to generate
Returns:
Dict containing evaluation metrics
"""
llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=2000)
question_answer_from_context_chain = create_question_answer_from_context_chain(llm)
# Load questions and answers from JSON file
q_a_file_name = "../data/q_a.json"
with open(q_a_file_name, "r", encoding="utf-8") as json_file:
q_a = json.load(json_file)
questions = [qa["question"] for qa in q_a][:num_questions]
ground_truth_answers = [qa["answer"] for qa in q_a][:num_questions]
generated_answers = []
retrieved_documents = []
# Generate answers and retrieve documents for each question
for question in questions:
context = retrieve_context_per_question(question, chunks_query_retriever)
retrieved_documents.append(context)
context_string = " ".join(context)
result = answer_question_from_context(question, context_string, question_answer_from_context_chain)
generated_answers.append(result["answer"])
# Create test cases and evaluate
test_cases = create_deep_eval_test_cases(questions, ground_truth_answers, generated_answers, retrieved_documents)
evaluate(
test_cases=test_cases,
metrics=[correctness_metric, faithfulness_metric, relevance_metric]
# Initialize LLM
llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo-preview")
# Create evaluation prompt
eval_prompt = PromptTemplate.from_template("""
Evaluate the following retrieval results for the question.
Question: {question}
Retrieved Context: {context}
Rate on a scale of 1-5 (5 being best) for:
1. Relevance: How relevant is the retrieved information to the question?
2. Completeness: Does the context contain all necessary information?
3. Conciseness: Is the retrieved context focused and free of irrelevant information?
Provide ratings in JSON format:
""")
# Create evaluation chain
eval_chain = (
eval_prompt
| llm
| StrOutputParser()
)
# Generate test questions
question_gen_prompt = PromptTemplate.from_template(
"Generate {num_questions} diverse test questions about climate change:"
)
question_chain = question_gen_prompt | llm | StrOutputParser()
questions = question_chain.invoke({"num_questions": num_questions}).split("\n")
# Evaluate each question
results = []
for question in questions:
# Get retrieval results
context = retriever.get_relevant_documents(question)
context_text = "\n".join([doc.page_content for doc in context])
# Evaluate results
eval_result = eval_chain.invoke({
"question": question,
"context": context_text
})
results.append(eval_result)
return {
"questions": questions,
"results": results,
"average_scores": calculate_average_scores(results)
}
def calculate_average_scores(results: List[Dict]) -> Dict[str, float]:
"""Calculate average scores across all evaluation results."""
# Implementation depends on the exact format of your results
pass
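# A possible implementation (a sketch, not part of the commit). Each result is the
# raw string produced by StrOutputParser, so this assumes the LLM returns JSON of
# numeric ratings such as '{"Relevance": 4, "Completeness": 3, "Conciseness": 5}':
#
#     def calculate_average_scores(results: List[str]) -> Dict[str, float]:
#         parsed = [json.loads(r) for r in results]
#         totals: Dict[str, float] = {}
#         for record in parsed:
#             for metric, score in record.items():
#                 totals[metric] = totals.get(metric, 0.0) + float(score)
#         return {metric: total / len(parsed) for metric, total in totals.items()}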
if __name__ == "__main__":
# Add any necessary setup or configuration here

View File

@@ -17,7 +17,7 @@ from enum import Enum
def replace_t_with_space(list_of_documents):
"""
Replaces all tab characters ('\t') with spaces in the page content of each document.
Replaces all tab characters ('\t') with spaces in the page content of each document
Args:
list_of_documents: A list of document objects, each with a 'page_content' attribute.

images/hype.svg Normal file

File diff suppressed because one or more lines are too long

Size: 98 KiB

View File

@@ -208,3 +208,55 @@ nbformat==5.10.4
xxhash==3.5.0
yarl==1.10.0
zipp==3.20.1
# Core LangChain packages
langchain>=0.1.0
langchain-core>=0.1.17
langchain-community>=0.0.13
langchain-openai>=0.0.5
langchain-anthropic>=0.0.9
langchain-groq>=0.0.1
langchain-cohere>=0.0.1
# Vector stores and embeddings
faiss-cpu>=1.7.4
chromadb>=0.4.22
# Document processing
PyMuPDF>=1.23.8 # for fitz
python-docx>=1.0.1
pypdf>=3.17.4
rank-bm25>=0.2.2
# Machine Learning and Data Science
numpy>=1.24.3
pandas>=2.0.3
scikit-learn>=1.3.0
# API Clients
openai>=1.12.0
anthropic>=0.8.1
cohere>=4.48
groq>=0.4.2
# Testing and Evaluation
pytest>=7.4.0
deepeval>=0.20.12
grouse>=0.3.0
# Development Tools
python-dotenv>=1.0.0
jupyter>=1.0.0
notebook>=7.0.6
ipykernel>=6.29.2
# Type Checking
pydantic>=2.6.1
typing-extensions>=4.9.0
# Async Support
aiohttp>=3.9.1
asyncio>=3.4.3
# Utilities
tqdm>=4.66.1

View File

@@ -1,10 +1,18 @@
import pytest
import os
import sys
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import CharacterTextSplitter
from dotenv import load_dotenv
# Add the main folder to sys.path
sys.path.append(os.path.abspath(os.path.dirname(__file__) + "/../"))
# Load environment variables
load_dotenv()
def pytest_addoption(parser):
parser.addoption(
"--exclude", action="store", help="Comma-separated list of notebook or script files' paths to exclude"
@@ -40,4 +48,58 @@ def script_paths(request):
path_with_full_address = [folder + s for s in include_scripts]
return path_with_full_address
return path_with_full_address
@pytest.fixture(scope="session")
def llm():
"""Fixture for ChatOpenAI model."""
return ChatOpenAI(
temperature=0,
model_name="gpt-4-turbo-preview",
max_tokens=4000
)
@pytest.fixture(scope="session")
def embeddings():
"""Fixture for OpenAI embeddings."""
return OpenAIEmbeddings()
@pytest.fixture(scope="session")
def text_splitter():
"""Fixture for text splitter."""
return CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
@pytest.fixture(scope="session")
def sample_texts():
"""Fixture for sample test data."""
return [
"The Earth is the third planet from the Sun.",
"Climate change is a significant global challenge.",
"Renewable energy sources include solar and wind power."
]
@pytest.fixture(scope="session")
def vector_store(embeddings, sample_texts, text_splitter):
"""Fixture for vector store."""
docs = text_splitter.create_documents(sample_texts)
return FAISS.from_documents(docs, embeddings)
@pytest.fixture(scope="session")
def retriever(vector_store):
"""Fixture for retriever."""
return vector_store.as_retriever(search_kwargs={"k": 2})
@pytest.fixture(scope="session")
def basic_prompt():
"""Fixture for basic prompt template."""
return PromptTemplate.from_template("""
Answer the following question based on the context provided:
Context: {context}
Question: {question}
Answer:
""")