Merge pull request #81 from roybka/dartboard_algo

dartboard algo implementation + README
This commit is contained in:
NirDiamant
2025-02-19 21:26:53 +02:00
committed by GitHub
2 changed files with 545 additions and 13 deletions

README.md

@@ -277,7 +277,16 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.
-19. Multi-modal Retrieval 📽️
+19. Dartboard Retrieval 🎯
+- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/dartboard.ipynb)**
+#### Overview 🔎
+Optimizing retrieval for relevant information gain.
+#### Implementation 🛠️
+- Combine relevance and diversity into a single scoring function and optimize it directly.
+- A proof of concept shows plain RAG underperforming when the database is dense, while dartboard retrieval outperforms it.
+20. Multi-modal Retrieval 📽️
 #### Overview 🔎
 Extending RAG capabilities to handle diverse data types for richer responses.
@@ -289,7 +298,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 ### 🔁 Iterative and Adaptive Techniques
-20. Retrieval with Feedback Loops 🔁
+21. Retrieval with Feedback Loops 🔁
 - **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
@@ -299,7 +308,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
-21. Adaptive Retrieval 🎯
+22. Adaptive Retrieval 🎯
 - **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
@@ -309,7 +318,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
-22. Iterative Retrieval 🔄
+23. Iterative Retrieval 🔄
 #### Overview 🔎
 Performing multiple rounds of retrieval to refine and enhance result quality.
@@ -319,7 +328,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 ### 📊 Evaluation
-23. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
+24. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
 #### Overview 🔎
 Performing evaluations of Retrieval-Augmented Generation systems, covering several metrics and creating test cases.
@@ -328,7 +337,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems.
-24. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
+25. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
 #### Overview 🔎
 Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests.
@@ -339,7 +348,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 ### 🔬 Explainability and Transparency
-25. Explainable Retrieval 🔍
+26. Explainable Retrieval 🔍
 - **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
@@ -351,7 +360,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 ### 🏗️ Advanced Architectures
-26. Knowledge Graph Integration (Graph RAG) 🕸️
+27. Knowledge Graph Integration (Graph RAG) 🕸️
 - **[LangChain](all_rag_techniques/graph_rag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
@@ -361,7 +370,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.
-27. GraphRag (Microsoft) 🎯
+28. GraphRag (Microsoft) 🎯
 - **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**
 #### Overview 🔎
@@ -370,7 +379,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 • Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
-28. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
+29. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
 - **[LangChain](all_rag_techniques/raptor.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
@@ -380,7 +389,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.
-29. Self RAG 🔁
+30. Self RAG 🔁
 - **[LangChain](all_rag_techniques/self_rag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
@@ -390,7 +399,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 • Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
-30. Corrective RAG 🔧
+31. Corrective RAG 🔧
 - **[LangChain](all_rag_techniques/crag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
@@ -402,7 +411,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 ## 🌟 Special Advanced Technique 🌟
-31. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
+32. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
 #### Overview 🔎
 An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.

all_rag_techniques/dartboard.ipynb

@@ -0,0 +1,523 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dartboard RAG: Retrieval-Augmented Generation with Balanced Relevance and Diversity\n",
"\n",
"## Overview\n",
"The **Dartboard RAG** process addresses a common challenge in large knowledge bases: ensuring the retrieved information is both relevant and non-redundant. By explicitly optimizing a combined relevance-diversity scoring function, it prevents multiple top-k documents from offering the same information. This approach is drawn from the elegant method in thepaper:\n",
"\n",
"> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)\n",
"\n",
"The paper outlines three variations of the core idea—hybrid RAG (dense + sparse), a cross-encoder version, and a vanilla approach. The **vanilla approach** conveys the fundamental concept most directly, and this implementation extends it with optional weights to control the balance between relevance and diversity.\n",
"\n",
"## Motivation\n",
"\n",
"1. **Dense, Overlapping Knowledge Bases** \n",
" In large databases, documents may repeat similar content, causing redundancy in top-k retrieval.\n",
"\n",
"2. **Improved Information Coverage** \n",
" Combining relevance and diversity yields a richer set of documents, mitigating the “echo chamber” effect of overly similar content.\n",
"\n",
"\n",
"## Key Components\n",
"\n",
"1. **Relevance & Diversity Combination** \n",
" - Computes a score factoring in both how pertinent a document is to the query and how distinct it is from already chosen documents.\n",
"\n",
"2. **Weighted Balancing** \n",
" - Introduces RELEVANCE_WEIGHT and DIVERSITY_WEIGHT to allow dynamic control of scoring. \n",
" - Helps in avoiding overly diverse but less relevant results.\n",
"\n",
"3. **Production-Ready Code** \n",
" - Derived from the official implementation yet reorganized for clarity. \n",
" - Allows easier integration into existing RAG pipelines.\n",
"\n",
"## Method Details\n",
"\n",
"1. **Document Retrieval** \n",
" - Obtain an initial set of candidate documents based on similarity (e.g., cosine or BM25). \n",
" - Typically retrieves top-N candidates as a starting point.\n",
"\n",
"2. **Scoring & Selection** \n",
" - Each documents overall score combines **relevance** and **diversity**: \n",
" - Select the highest-scoring document, then penalize documents that are overly similar to it. \n",
" - Repeat until top-k documents are identified.\n",
"\n",
"3. **Hybrid / Fusion & Cross-Encoder Support** \n",
" Essentially, all you need are distances between documents and the query, and distances between documents. You can easily extract these from hybrid / fusion retrieval or from cross-encoder retrieval. The only recommendation I have is to rely less on raking based scores.\n",
" - For **hybrid / fusion retrieval**: Merge similarities (dense and sparse / BM25) into a single distance. This can be achieved by combining cosine similarity over the dense and the sparse vectors (e.g. averaging them). the move to distances is straightforward (1 - mean cosine similarity). \n",
" - For **cross-encoders**: You can directly use the cross-encoder similarity scores (1- similarity), potentially adjusting with scaling factors.\n",
"\n",
"4. **Balancing & Adjustment** \n",
" - Tune DIVERSITY_WEIGHT and RELEVANCE_WEIGHT based on your needs and the density of your dataset. \n",
"\n",
"\n",
"\n",
"By integrating both **relevance** and **diversity** into retrieval, the Dartboard RAG approach ensures that top-k documents collectively offer richer, more comprehensive information—leading to higher-quality responses in Retrieval-Augmented Generation systems.\n",
"\n",
"The paper also has an official code implemention, and this code is based on it, but I think this one here is more readable, manageable and production ready."
]
},
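{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a concrete companion to the hybrid / fusion and cross-encoder notes above, here is a minimal sketch (not from the paper's official code) of turning similarities into the distances the dartboard algorithm needs. The input arrays (`dense_sims`, `sparse_sims`, `ce_scores`) are hypothetical placeholders for whatever your retrievers produce:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def fused_distances(dense_sims: np.ndarray, sparse_sims: np.ndarray) -> np.ndarray:\n",
"    \"\"\"Sketch: merge dense and sparse/BM25 cosine similarities into one distance (1 - mean similarity).\"\"\"\n",
"    return 1 - (dense_sims + sparse_sims) / 2\n",
"\n",
"def cross_encoder_distances(ce_scores: np.ndarray, scale: float = 1.0) -> np.ndarray:\n",
"    \"\"\"Sketch: turn cross-encoder similarity scores into distances, with an optional scaling factor.\"\"\"\n",
"    return 1 - scale * ce_scores\n",
"\n",
"# Toy example with made-up similarity values:\n",
"dense = np.array([0.9, 0.7, 0.8])\n",
"sparse = np.array([0.6, 0.8, 0.5])\n",
"print(fused_distances(dense, sparse))  # [0.25 0.25 0.35]\n"
]
},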
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Please enter your OpenAI API key: \n"
]
}
],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"import numpy as np\n",
"from scipy.special import logsumexp\n",
"from typing import Tuple, List\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
"if not os.getenv('OPENAI_API_KEY'):\n",
" print(\"Please enter your OpenAI API key: \")\n",
" os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
"else:\n",
" os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# this part is same like simple_rag.ipynb, only simulating a dense dataset\n",
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
" \"\"\"\n",
" Encodes a PDF book into a vector store using OpenAI embeddings.\n",
"\n",
" Args:\n",
" path: The path to the PDF file.\n",
" chunk_size: The desired size of each text chunk.\n",
" chunk_overlap: The amount of overlap between consecutive chunks.\n",
"\n",
" Returns:\n",
" A FAISS vector store containing the encoded book content.\n",
" \"\"\"\n",
"\n",
" # Load PDF documents\n",
" loader = PyPDFLoader(path)\n",
" documents = loader.load()\n",
" documents=documents*5 # load every document 5 times to emulate a dense dataset\n",
"\n",
" # Split documents into chunks\n",
" text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
" )\n",
" texts = text_splitter.split_documents(documents)\n",
" cleaned_texts = replace_t_with_space(texts)\n",
"\n",
" # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
" embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
" #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
"\n",
" # Create vector store\n",
" vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
"\n",
" return vectorstore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Vector store\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some helper functions for using the vector store for retrieval.\n",
"this part is same like simple_rag.ipynb, only its using the actual FAISS index (not the wrapper)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def idx_to_text(idx:int):\n",
" \"\"\"\n",
" Convert a Vector store index to the corresponding text.\n",
" \"\"\"\n",
" docstore_id = chunks_vector_store.index_to_docstore_id[idx]\n",
" document = chunks_vector_store.docstore.search(docstore_id)\n",
" return document.page_content\n",
"\n",
"\n",
"def get_context(query:str,k:int=5) -> Tuple[np.ndarray, np.ndarray, List[str]]:\n",
" \"\"\"\n",
" Retrieve top k context items for a query using top k retrieval.\n",
" \"\"\"\n",
" # regular top k retrieval\n",
" q_vec=chunks_vector_store.embedding_function.embed_documents([query])\n",
" _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k)\n",
"\n",
" texts = [idx_to_text(i) for i in indices[0]]\n",
" return texts\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"\n",
"test_query = \"What is the main cause of climate change?\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular top k retrieval\n",
"- This demonstration shows that when database is dense (here we simulate density by loading each document 5 times), the results are not good, we don't get the most relevant results. Note that the top 3 results are all repetitions of the same document."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts=get_context(test_query,k=3)\n",
"show_context(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### More utils for distances normalization"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def lognorm(dist:np.ndarray, sigma:float):\n",
" \"\"\"\n",
" Calculate the log-normal probability for a given distance and sigma.\n",
" \"\"\"\n",
" if sigma < 1e-9: \n",
" return -np.inf * dist\n",
" return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)\n"
]
},
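{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numeric illustration (an added sketch, not in the original notebook): the log-density falls off quadratically with distance, and sigma controls how sharply nearby documents are favored."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration: lognorm decreases quadratically in distance; sigma sets the decay rate.\n",
"d = np.array([0.0, 0.1, 0.2, 0.5])\n",
"print(lognorm(d, sigma=0.1))  # sharp kernel: distant documents get very low log-density\n",
"print(lognorm(d, sigma=0.5))  # flatter kernel: distance matters much less\n"
]
},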
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Definitions of parameters, and the actual function that optimizes both relevance and diversity \n",
"This is the core function that chooses the top k documents based on relevance and diversity. It uses distances between each candidate document and the query and between candidate documents."
]
},
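{
"cell_type": "markdown",
"metadata": {},
"source": [
"Written as an equation (my paraphrase of the code below, not notation taken from the paper): given the already-selected set $S$, a candidate $s$ is scored as\n",
"\n",
"$$\\mathrm{score}(s) = \\log \\sum_{t} \\exp\\Big( w_d \\max_{s' \\in S \\cup \\{s\\}} \\log \\mathcal{N}\\big(d(s', t); \\sigma\\big) + w_r \\log \\mathcal{N}\\big(d(q, t); \\sigma\\big) \\Big),$$\n",
"\n",
"where $t$ ranges over all candidates, $d$ is a distance, $\\mathcal{N}(\\cdot\\,; \\sigma)$ is the Gaussian kernel computed by lognorm, and $w_d$, $w_r$ are DIVERSITY_WEIGHT and RELEVANCE_WEIGHT. With $w_d = w_r = 1$, this appears to reduce to the paper's relevant information gain: each target's probability mass under the query, credited only to the best-matching selected document."
]
},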
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Adjust these according to your needs, knowledge base density, etc. \n",
"DIVERSITY_WEIGHT=1.0\n",
"RELEVANCE_WEIGHT=1.0\n",
"SIGMA=0.1\n",
"\n",
"\n",
"def greedy_dartsearch(q_dists:np.ndarray, dists_mat:np.ndarray, texts:List[str], k:int) -> Tuple[List[str], List[float]]:\n",
" \"\"\"\n",
" Perform greedy dartboard search to select top k documents.\n",
" \"\"\"\n",
" sigma=np.max([SIGMA,1e-5]) # avoid division by zero\n",
" qprobs = lognorm(q_dists, sigma)\n",
" ccprobmat = lognorm(dists_mat, sigma)\n",
" out_scores=[]\n",
" top_idx = np.argmax(qprobs) # start with the most relevant document\n",
" chosen_inds = np.array([top_idx]) # initialize the array of selected documents\n",
" maxes = ccprobmat[top_idx] # Vector of distances to the most relevant document\n",
" while len(chosen_inds) < k:\n",
" newmaxes = np.maximum(maxes, ccprobmat) # update the maximum distances, note the broadcasting (matrix and vector)\n",
"\n",
" logscores = newmaxes*DIVERSITY_WEIGHT + qprobs*RELEVANCE_WEIGHT # score all the items\n",
" scores = logsumexp(logscores, axis=1) # normalize the scores\n",
" scores[chosen_inds] = -np.inf # avoid selecting the same document twice\n",
" best_idx = np.argmax(scores) # select the best item\n",
" best_score=np.max(scores) # avoid division by zero\n",
" maxes = newmaxes[best_idx] # update the maximum distances\n",
" chosen_inds = np.append(chosen_inds, best_idx) # add the best item to the set\n",
" out_scores.append(best_score) # add the best score to the list\n",
" return [texts[i] for i in chosen_inds],out_scores\n"
]
},
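{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check on toy data (an added illustration, not from the original code): three candidates, where candidates 0 and 1 are near-duplicates and candidate 2 is distinct but still relevant. With k=2, the greedy dartboard search should pick the most relevant document and then the distinct one, skipping the duplicate:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy distances: candidate 0 is closest to the query; candidates 0 and 1 are near-duplicates of each other.\n",
"toy_q_dists = np.array([[0.10, 0.12, 0.15]])\n",
"toy_dists_mat = np.array([\n",
"    [0.00, 0.02, 0.50],\n",
"    [0.02, 0.00, 0.50],\n",
"    [0.50, 0.50, 0.00],\n",
"])\n",
"toy_texts = [\"doc A\", \"doc A (near-duplicate)\", \"doc B\"]\n",
"picked, picked_scores = greedy_dartsearch(toy_q_dists, toy_dists_mat, toy_texts, k=2)\n",
"print(picked)  # expected: ['doc A', 'doc B'] - the near-duplicate is skipped\n"
]
},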
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Main function for using the dartboard retrieval. This serves instead of get_context (which is simple RAG) it:\n",
"\n",
"1. Takes a text query, vectorzes it, gets the top k documents (and their vectors) via simple RAG. \n",
"2. Uses these vectors to calculate the similarities to query and between candidate matches.\n",
"3. Runs the dartboard algorithm to refine the candidate matches to a final list of k documents.\n",
"4. Returns the final list of documents and their scores."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def get_context_with_dartboard(query:str,k:int=5) -> Tuple[List[str], List[float]]:\n",
" \"\"\"\n",
" Retrieve top k context items for a query using the dartboard algorithm.\n",
" This function only handles the vectors and indices, the rest is handled by the get_dartboard function and below.\n",
" \"\"\"\n",
" q_vec=chunks_vector_store.embedding_function.embed_documents([query]) # embed the query\n",
" _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k*3) # fetch more than k to ensure we overcome density and use diversity\n",
"\n",
" vecs = np.array(chunks_vector_store.index.reconstruct_batch(indices[0])) # reconstruct the vectors of the retrieved documents\n",
" texts = [idx_to_text(i) for i in indices[0]] # convert the indices to texts\n",
"\n",
" \n",
" # calculate similarities and convert them to distances:\n",
" dists_mat = 1-np.dot(vecs,vecs.T) # 1-cosine distance, you may think of better distance functions. This can also be applied to cross-encoder scores. \n",
" q_dists = 1-np.dot(q_vec,vecs.T) # calculate the distances to the query\n",
" \n",
" # run the dartboard algorithm\n",
" texts, scores=greedy_dartsearch(q_dists,dists_mat,texts,k)\n",
" return texts,scores\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### dartboard retrieval - results on same query, k, and dataset\n",
"- As you can see now the top 3 results are not mere repetitions. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts,scores=get_context_with_dartboard(test_query,k=3)\n",
"show_context(texts)\n"
]
},
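{
"cell_type": "markdown",
"metadata": {},
"source": [
"As step 4 of the method details suggests, the balance can be tuned. Since greedy_dartsearch reads the module-level weights, reassigning them in the notebook and re-running changes the relevance/diversity trade-off. The values below are illustrative, not tuned:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: emphasize diversity more strongly and re-run the same query.\n",
"DIVERSITY_WEIGHT = 2.0  # hypothetical value; tune for your dataset's density\n",
"RELEVANCE_WEIGHT = 1.0\n",
"texts, scores = get_context_with_dartboard(test_query, k=3)\n",
"show_context(texts)\n"
]
},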
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}