mirror of
https://github.com/NirDiamant/RAG_Techniques.git
synced 2025-04-07 00:48:52 +03:00
Merge pull request #81 from roybka/dartboard_algo
dartboard algo implementation + README
README.md (35 lines changed)
@@ -277,7 +277,16 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.
 
-19. Multi-modal Retrieval 📽️
+19. Dartboard Retrieval 🎯
+- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/dartboard.ipynb)**
+#### Overview 🔎
+Optimizing over Relevant Information Gain in Retrieval
+
+#### Implementation 🛠️
+- Combine both relevance and diversity into a single scoring function and directly optimize for it.
+- A POC showing plain simple RAG underperforming when the database is dense, and dartboard retrieval outperforming it.
+
+20. Multi-modal Retrieval 📽️
 
 #### Overview 🔎
 Extending RAG capabilities to handle diverse data types for richer responses.
@@ -289,7 +298,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 🔁 Iterative and Adaptive Techniques
 
-20. Retrieval with Feedback Loops 🔁
+21. Retrieval with Feedback Loops 🔁
 - **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
 
@@ -299,7 +308,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
 
-21. Adaptive Retrieval 🎯
+22. Adaptive Retrieval 🎯
 - **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
 
@@ -309,7 +318,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
 
-22. Iterative Retrieval 🔄
+23. Iterative Retrieval 🔄
 
 #### Overview 🔎
 Performing multiple rounds of retrieval to refine and enhance result quality.
@@ -319,7 +328,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 📊 Evaluation
 
-23. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
+24. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
 
 #### Overview 🔎
 Performing evaluations of Retrieval-Augmented Generation systems, covering several metrics and creating test cases.
@@ -328,7 +337,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems.
 
 
-24. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
+25. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
 
 #### Overview 🔎
 Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests.
@@ -339,7 +348,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 🔬 Explainability and Transparency
 
-25. Explainable Retrieval 🔍
+26. Explainable Retrieval 🔍
 - **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
 
@@ -351,7 +360,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 🏗️ Advanced Architectures
 
-26. Knowledge Graph Integration (Graph RAG) 🕸️
+27. Knowledge Graph Integration (Graph RAG) 🕸️
 - **[LangChain](all_rag_techniques/graph_rag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
 
@@ -361,7 +370,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.
 
-27. GraphRag (Microsoft) 🎯
+28. GraphRag (Microsoft) 🎯
 - **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**
 
 #### Overview 🔎
@@ -370,7 +379,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 • Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
 
-28. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
+29. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
 - **[LangChain](all_rag_techniques/raptor.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
 
@@ -380,7 +389,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.
 
-29. Self RAG 🔁
+30. Self RAG 🔁
 - **[LangChain](all_rag_techniques/self_rag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
 
@@ -390,7 +399,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 • Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
 
-30. Corrective RAG 🔧
+31. Corrective RAG 🔧
 - **[LangChain](all_rag_techniques/crag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
 
@@ -402,7 +411,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ## 🌟 Special Advanced Technique 🌟
 
-31. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
+32. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
 
 #### Overview 🔎
 An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.
all_rag_techniques/dartboard.ipynb (new file, 523 lines)

@@ -0,0 +1,523 @@
# Dartboard RAG: Retrieval-Augmented Generation with Balanced Relevance and Diversity

## Overview

The **Dartboard RAG** process addresses a common challenge in large knowledge bases: ensuring the retrieved information is both relevant and non-redundant. By explicitly optimizing a combined relevance-diversity scoring function, it prevents multiple top-k documents from offering the same information. The approach is drawn from the elegant method in the paper:

> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)

The paper outlines three variations of the core idea: hybrid RAG (dense + sparse), a cross-encoder version, and a vanilla approach. The **vanilla approach** conveys the fundamental concept most directly, and this implementation extends it with optional weights to control the balance between relevance and diversity.

## Motivation

1. **Dense, Overlapping Knowledge Bases**
   In large databases, documents may repeat similar content, causing redundancy in top-k retrieval.

2. **Improved Information Coverage**
   Combining relevance and diversity yields a richer set of documents, mitigating the "echo chamber" effect of overly similar content.

## Key Components

1. **Relevance & Diversity Combination**
   - Computes a score factoring in both how pertinent a document is to the query and how distinct it is from already chosen documents.

2. **Weighted Balancing**
   - Introduces RELEVANCE_WEIGHT and DIVERSITY_WEIGHT to allow dynamic control of scoring.
   - Helps in avoiding overly diverse but less relevant results.

3. **Production-Ready Code**
   - Derived from the official implementation yet reorganized for clarity.
   - Allows easier integration into existing RAG pipelines.

## Method Details

1. **Document Retrieval**
   - Obtain an initial set of candidate documents based on similarity (e.g., cosine or BM25).
   - Typically retrieves the top-N candidates as a starting point.

2. **Scoring & Selection**
   - Each document's overall score combines **relevance** and **diversity**.
   - Select the highest-scoring document, then penalize documents that are overly similar to it.
   - Repeat until the top-k documents are identified.

3. **Hybrid / Fusion & Cross-Encoder Support**
   Essentially, all you need are distances between documents and the query, and distances between documents. You can easily extract these from hybrid / fusion retrieval or from cross-encoder retrieval. The only recommendation is to rely less on ranking-based scores.
   - For **hybrid / fusion retrieval**: merge the dense and sparse / BM25 similarities into a single distance (see the sketch right after this overview), e.g. by averaging the cosine similarities over the dense and the sparse vectors; the move to distances is then straightforward (1 - mean cosine similarity).
   - For **cross-encoders**: you can directly use the cross-encoder similarity scores (1 - similarity), potentially adjusting with scaling factors.

4. **Balancing & Adjustment**
   - Tune DIVERSITY_WEIGHT and RELEVANCE_WEIGHT based on your needs and the density of your dataset.

By integrating both **relevance** and **diversity** into retrieval, the Dartboard RAG approach ensures that the top-k documents collectively offer richer, more comprehensive information, leading to higher-quality responses in Retrieval-Augmented Generation systems.

The paper also has an official code implementation, and this code is based on it, but I think the version here is more readable, manageable and production-ready.
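To make the hybrid / fusion recipe above concrete, here is a minimal sketch (not part of the original notebook; the helper name and the equal weighting are assumptions) of merging dense and sparse similarities into the distances the dartboard algorithm consumes:

```python
import numpy as np

def fuse_to_distances(dense_sim: np.ndarray, sparse_sim: np.ndarray) -> np.ndarray:
    # Hypothetical helper (not in the notebook): average the dense and sparse
    # cosine similarities for each candidate, then convert to a distance.
    mean_sim = 0.5 * (dense_sim + sparse_sim)  # assumption: equal weighting
    return 1.0 - mean_sim

# Toy usage: three candidates scored against one query.
dense = np.array([0.9, 0.7, 0.2])
sparse = np.array([0.8, 0.1, 0.3])
print(fuse_to_distances(dense, sparse))  # [0.15 0.6  0.75]
```

Any other weighting could be substituted; the only requirement is that the output behaves like a distance (lower = more similar), which is what the dartboard search below expects.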
### Import libraries and environment variables
```python
import os
import sys
from typing import Tuple, List

import numpy as np
from dotenv import load_dotenv
from scipy.special import logsumexp

# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key environment variable (comment out if not using OpenAI)
if not os.getenv('OPENAI_API_KEY'):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")
else:
    os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')

# Add the parent directory to the path, since we work with notebooks
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from helper_functions import *
from evaluation.evalute_rag import *
```
### Read Docs
```python
path = "../data/Understanding_Climate_Change.pdf"
```
### Encode document
```python
# This part is the same as in simple_rag.ipynb, except that we simulate a dense dataset
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """
    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()
    documents = documents * 5  # load every document 5 times to emulate a dense dataset

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings (tested with OpenAI and Amazon Bedrock)
    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)
    # embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)

    # Create vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore
```
### Create Vector store
```python
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)
```
### Some helper functions for using the vector store for retrieval

This part is the same as in simple_rag.ipynb, except that it uses the actual FAISS index (not the wrapper).
```python
def idx_to_text(idx: int):
    """
    Convert a vector store index to the corresponding text.
    """
    docstore_id = chunks_vector_store.index_to_docstore_id[idx]
    document = chunks_vector_store.docstore.search(docstore_id)
    return document.page_content


def get_context(query: str, k: int = 5) -> List[str]:
    """
    Retrieve the texts of the top k context items for a query using plain top-k retrieval.
    """
    # regular top-k retrieval
    q_vec = chunks_vector_store.embedding_function.embed_documents([query])
    _, indices = chunks_vector_store.index.search(np.array(q_vec), k=k)

    texts = [idx_to_text(i) for i in indices[0]]
    return texts
```
```python
test_query = "What is the main cause of climate change?"
```
### Regular top-k retrieval

- This demonstration shows that when the database is dense (here we simulate density by loading each document 5 times), the results are poor: we don't get the most relevant results. Note that the top 3 results are all repetitions of the same document.
```python
texts = get_context(test_query, k=3)
show_context(texts)
```

Output (all three retrieved contexts are identical):

```
Context 1:
driven by human activities, particularly the emission of greenhou se gases.
Chapter 2: Causes of Climate Change
Greenhouse Gases
The primary cause of recent climate change is the increase in greenhouse gases in the
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential
for life on Earth, as it keeps the planet warm enough to support life. However, human
activities have intensified this natural process, leading to a warmer climate.
Fossil Fuels
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and
natural gas used for electricity, heating, and transportation. The industrial revolution marked
the beginning of a significant increase in fossil fuel consumption, which continues to rise
today.
Coal

Context 2: (identical to Context 1)

Context 3: (identical to Context 1)
```
### More utilities for distance normalization
```python
def lognorm(dist: np.ndarray, sigma: float):
    """
    Log of the Gaussian (normal) probability density for a given distance and sigma.
    """
    if sigma < 1e-9:
        return -np.inf * dist
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)
```
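As a quick sanity check (a toy computation, not part of the original notebook), smaller distances map to larger log-densities, which is what the selection logic below relies on:

```python
# With sigma = 0.1, a distance of 0 gives the peak log-density and the
# value drops off quadratically as the distance grows.
print(lognorm(np.array([0.0, 0.1, 0.5]), sigma=0.1))
# approximately [  1.38   0.88 -11.12]
```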
### Definitions of parameters, and the function that optimizes both relevance and diversity

This is the core function that chooses the top k documents based on relevance and diversity. It uses the distances between each candidate document and the query, and the distances between the candidate documents.
```python
# Adjust these according to your needs, knowledge base density, etc.
DIVERSITY_WEIGHT = 1.0
RELEVANCE_WEIGHT = 1.0
SIGMA = 0.1


def greedy_dartsearch(q_dists: np.ndarray, dists_mat: np.ndarray, texts: List[str], k: int) -> Tuple[List[str], List[float]]:
    """
    Perform a greedy dartboard search to select the top k documents.
    """
    sigma = np.max([SIGMA, 1e-5])  # avoid division by zero
    qprobs = lognorm(q_dists, sigma)  # log-densities of the query-document distances
    ccprobmat = lognorm(dists_mat, sigma)  # log-densities of the document-document distances
    out_scores = []
    top_idx = np.argmax(qprobs)  # start with the most relevant document
    chosen_inds = np.array([top_idx])  # initialize the array of selected documents
    maxes = ccprobmat[top_idx]  # vector of log-densities to the most relevant document
    while len(chosen_inds) < k:
        newmaxes = np.maximum(maxes, ccprobmat)  # update the per-document maxima; note the broadcasting (matrix and vector)
        logscores = newmaxes * DIVERSITY_WEIGHT + qprobs * RELEVANCE_WEIGHT  # score all the items
        scores = logsumexp(logscores, axis=1)  # aggregate each candidate's scores
        scores[chosen_inds] = -np.inf  # avoid selecting the same document twice
        best_idx = np.argmax(scores)  # select the best item
        best_score = np.max(scores)  # record its score
        maxes = newmaxes[best_idx]  # update the maximum distances
        chosen_inds = np.append(chosen_inds, best_idx)  # add the best item to the set
        out_scores.append(best_score)  # add the best score to the list
    return [texts[i] for i in chosen_inds], out_scores
```
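To see the selection behavior in isolation, here is a tiny toy run (hypothetical numbers, not part of the original notebook): documents 0 and 1 are near-duplicates that are both close to the query, while document 2 is slightly less relevant but distinct. With the default weights, the second pick skips the duplicate in favor of the distinct document:

```python
# Toy distances: docs 0 and 1 are near-duplicates close to the query,
# doc 2 is slightly less relevant but far from both.
toy_q_dists = np.array([0.05, 0.06, 0.15])
toy_dists_mat = np.array([
    [0.00, 0.02, 0.50],
    [0.02, 0.00, 0.50],
    [0.50, 0.50, 0.00],
])
docs = ["duplicate A", "duplicate B", "distinct C"]
chosen, _ = greedy_dartsearch(toy_q_dists, toy_dists_mat, docs, k=2)
print(chosen)  # ['duplicate A', 'distinct C']: the near-duplicate B is skipped
```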
### Main function for using the dartboard retrieval

This serves instead of `get_context` (which is simple RAG). It:

1. Takes a text query, vectorizes it, and gets the top k documents (and their vectors) via simple RAG.
2. Uses these vectors to calculate the similarities to the query and between the candidate matches.
3. Runs the dartboard algorithm to refine the candidate matches to a final list of k documents.
4. Returns the final list of documents and their scores.
```python
def get_context_with_dartboard(query: str, k: int = 5) -> Tuple[List[str], List[float]]:
    """
    Retrieve the top k context items for a query using the dartboard algorithm.
    This function only handles the vectors and indices; the selection itself is done by greedy_dartsearch.
    """
    q_vec = chunks_vector_store.embedding_function.embed_documents([query])  # embed the query
    _, indices = chunks_vector_store.index.search(np.array(q_vec), k=k * 3)  # fetch more than k candidates so diversity can overcome density

    vecs = np.array(chunks_vector_store.index.reconstruct_batch(indices[0]))  # reconstruct the vectors of the retrieved documents
    texts = [idx_to_text(i) for i in indices[0]]  # convert the indices to texts

    # calculate similarities and convert them to distances:
    dists_mat = 1 - np.dot(vecs, vecs.T)  # 1 - cosine similarity; other distance functions (or cross-encoder scores) could be substituted here
    q_dists = 1 - np.dot(q_vec, vecs.T)  # distances to the query

    # run the dartboard algorithm
    texts, scores = greedy_dartsearch(q_dists, dists_mat, texts, k)
    return texts, scores
```
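Because the weights are module-level constants, a simple way to experiment (a usage sketch, not part of the original notebook) is to override them and re-run the same query:

```python
# Hypothetical experiment: weight diversity more heavily for a denser corpus,
# then compare the selections with the default-weight run below.
DIVERSITY_WEIGHT = 2.0  # assumption: a value worth trying on dense data
texts_div, scores_div = get_context_with_dartboard(test_query, k=3)
DIVERSITY_WEIGHT = 1.0  # restore the default before the next cells
```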
### Dartboard retrieval: results on the same query, k, and dataset

- As you can see, the top 3 results are no longer mere repetitions.
```python
texts, scores = get_context_with_dartboard(test_query, k=3)
show_context(texts)
```

Output:

```
Context 1: (the same "Causes of Climate Change" passage returned by plain top-k above)

Context 2:
Most of these climate changes are attributed to very small variations in Earth's orbit that
change the amount of solar energy our planet receives. During the Holocene epoch, which
began at the end of the last ice age, human societies f lourished, but the industrial era has seen
unprecedented changes.
Modern Observations
Modern scientific observations indicate a rapid increase in global temperatures, sea levels,
and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has
documented these changes extensively. Ice core samples, tree rings, and ocean sediments
provide a historical record that scientists use to understand past climate conditions and
predict future trends. The evidence overwhelmingly shows that recent changes are primarily
driven by human activities, particularly the emission of greenhou se gases.
Chapter 2: Causes of Climate Change
Greenhouse Gases

Context 3: (identical to Context 1)
```
(Notebook metadata: Python 3 (ipykernel), Python 3.9.12, nbformat 4.)