mirror of https://github.com/NirDiamant/RAG_Techniques.git
synced 2025-04-07 00:48:52 +03:00

merge main

README.md (35 lines changed)
@@ -294,7 +294,16 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.
 
-20. Multi-modal Retrieval 📽️
+20. Dartboard Retrieval 🎯
+- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/dartboard.ipynb)**
+#### Overview 🔎
+Optimizing over relevant information gain in retrieval.
+
+#### Implementation 🛠️
+- Combine relevance and diversity into a single scoring function and optimize it directly.
+- A proof of concept showing plain RAG underperforming when the database is dense, and dartboard retrieval outperforming it.
+
+21. Multi-modal Retrieval 📽️
 
 #### Overview 🔎
 Extending RAG capabilities to handle diverse data types for richer responses.
@@ -306,7 +315,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 🔁 Iterative and Adaptive Techniques
 
-21. Retrieval with Feedback Loops 🔁
+22. Retrieval with Feedback Loops 🔁
 - **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
 
@@ -316,7 +325,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
 
-22. Adaptive Retrieval 🎯
+23. Adaptive Retrieval 🎯
 - **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
 
@@ -326,7 +335,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
 
-23. Iterative Retrieval 🔄
+24. Iterative Retrieval 🔄
 
 #### Overview 🔎
 Performing multiple rounds of retrieval to refine and enhance result quality.
@@ -336,7 +345,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 📊 Evaluation
 
-24. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
+25. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
 
 #### Overview 🔎
 Performing evaluations of Retrieval-Augmented Generation systems, covering several metrics and creating test cases.
@@ -345,7 +354,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems.
 
 
-25. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
+26. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
 
 #### Overview 🔎
 Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests.
@@ -356,7 +365,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 🔬 Explainability and Transparency
 
-26. Explainable Retrieval 🔍
+27. Explainable Retrieval 🔍
 - **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
 
@@ -368,7 +377,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ### 🏗️ Advanced Architectures
 
-27. Knowledge Graph Integration (Graph RAG) 🕸️
+28. Knowledge Graph Integration (Graph RAG) 🕸️
 - **[LangChain](all_rag_techniques/graph_rag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
 
@@ -378,7 +387,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.
 
-28. GraphRag (Microsoft) 🎯
+29. GraphRag (Microsoft) 🎯
 - **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**
 
 #### Overview 🔎
@@ -387,7 +396,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 • Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
 
-29. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
+30. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
 - **[LangChain](all_rag_techniques/raptor.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
 
@@ -397,7 +406,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.
 
-30. Self RAG 🔁
+31. Self RAG 🔁
 - **[LangChain](all_rag_techniques/self_rag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
 
@@ -407,7 +416,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 #### Implementation 🛠️
 • Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
 
-31. Corrective RAG 🔧
+32. Corrective RAG 🔧
 - **[LangChain](all_rag_techniques/crag.ipynb)**
 - **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
 
@@ -419,7 +428,7 @@ Explore the extensive list of cutting-edge RAG techniques:
 
 ## 🌟 Special Advanced Technique 🌟
 
-32. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
+33. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
 
 #### Overview 🔎
 An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.
 
@@ -97,7 +97,7 @@
 "from langchain_core.pydantic_v1 import BaseModel, Field\n",
 "\n",
 "\n",
-"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path sicnce we work with notebooks\n",
+"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
 "from helper_functions import *\n",
 "from evaluation.evalute_rag import *\n",
 "\n",
all_rag_techniques/dartboard.ipynb (new file, 609 lines)
@@ -0,0 +1,609 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dartboard RAG: Retrieval-Augmented Generation with Balanced Relevance and Diversity\n",
"\n",
"## Overview\n",
"The **Dartboard RAG** process addresses a common challenge in large knowledge bases: ensuring the retrieved information is both relevant and non-redundant. By explicitly optimizing a combined relevance-diversity scoring function, it prevents multiple top-k documents from offering the same information. This approach is drawn from the elegant method in the paper:\n",
"\n",
"> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)\n",
"\n",
"The paper outlines three variations of the core idea—hybrid RAG (dense + sparse), a cross-encoder version, and a vanilla approach. The **vanilla approach** conveys the fundamental concept most directly, and this implementation extends it with optional weights to control the balance between relevance and diversity.\n",
"\n",
"## Motivation\n",
"\n",
"1. **Dense, Overlapping Knowledge Bases** \n",
"   In large databases, documents may repeat similar content, causing redundancy in top-k retrieval.\n",
"\n",
"2. **Improved Information Coverage** \n",
"   Combining relevance and diversity yields a richer set of documents, mitigating the “echo chamber” effect of overly similar content.\n",
"\n",
"\n",
"## Key Components\n",
"\n",
"1. **Relevance & Diversity Combination** \n",
"   - Computes a score factoring in both how pertinent a document is to the query and how distinct it is from already chosen documents.\n",
"\n",
"2. **Weighted Balancing** \n",
"   - Introduces RELEVANCE_WEIGHT and DIVERSITY_WEIGHT to allow dynamic control of scoring. \n",
"   - Helps in avoiding overly diverse but less relevant results.\n",
"\n",
"3. **Production-Ready Code** \n",
"   - Derived from the official implementation yet reorganized for clarity. \n",
"   - Allows easier integration into existing RAG pipelines.\n",
"\n",
"## Method Details\n",
"\n",
"1. **Document Retrieval** \n",
"   - Obtain an initial set of candidate documents based on similarity (e.g., cosine or BM25). \n",
"   - Typically retrieves top-N candidates as a starting point.\n",
"\n",
"2. **Scoring & Selection** \n",
"   - Each document’s overall score combines **relevance** and **diversity**. \n",
"   - Select the highest-scoring document, then penalize documents that are overly similar to it. \n",
"   - Repeat until top-k documents are identified.\n",
"\n",
"3. **Hybrid / Fusion & Cross-Encoder Support** \n",
"   Essentially, all you need are distances between documents and the query, and distances between documents. You can easily extract these from hybrid / fusion retrieval or from cross-encoder retrieval; a sketch of this conversion follows this introduction. The only recommendation I have is to rely less on ranking-based scores.\n",
"   - For **hybrid / fusion retrieval**: Merge similarities (dense and sparse / BM25) into a single distance. This can be achieved by combining cosine similarity over the dense and the sparse vectors (e.g., averaging them). The move to distances is straightforward (1 - mean cosine similarity). \n",
"   - For **cross-encoders**: You can directly use the cross-encoder similarity scores (1 - similarity), potentially adjusting with scaling factors.\n",
"\n",
"4. **Balancing & Adjustment** \n",
"   - Tune DIVERSITY_WEIGHT and RELEVANCE_WEIGHT based on your needs and the density of your dataset. \n",
"\n",
"\n",
"\n",
"By integrating both **relevance** and **diversity** into retrieval, the Dartboard RAG approach ensures that top-k documents collectively offer richer, more comprehensive information—leading to higher-quality responses in Retrieval-Augmented Generation systems.\n",
"\n",
"The paper also has an official code implementation; this code is based on it, but I think this version is more readable, manageable, and production-ready."
]
},
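{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch: turning hybrid similarities into distances\n",
"To make point 3 above concrete, here is a minimal, hypothetical sketch of how dense and sparse similarities could be merged into the single distance matrices the dartboard algorithm expects. The `hybrid_distances` helper, its `dense_sims` / `sparse_sims` inputs, and the min-max normalization are illustrative assumptions, not part of the paper's official code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def hybrid_distances(dense_sims: np.ndarray, sparse_sims: np.ndarray, alpha: float = 0.5) -> np.ndarray:\n",
"    \"\"\"Merge dense (cosine) and sparse (e.g. BM25) similarities into one distance matrix.\n",
"\n",
"    Illustrative assumptions: cosine similarities are already roughly in [0, 1],\n",
"    while BM25 scores are unbounded, so we min-max normalize them before averaging.\n",
"    \"\"\"\n",
"    spread = np.ptp(sparse_sims)  # max - min, guards the normalization below\n",
"    sparse_norm = (sparse_sims - sparse_sims.min()) / (spread + 1e-9)\n",
"    combined = alpha * dense_sims + (1 - alpha) * sparse_norm  # weighted mean similarity\n",
"    return 1 - combined  # higher similarity -> smaller distance\n"
]
},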
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Please enter your OpenAI API key: \n"
]
}
],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from scipy.special import logsumexp\n",
"from typing import Tuple, List, Any\n",
"import numpy as np\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
"if not os.getenv('OPENAI_API_KEY'):\n",
"    print(\"Please enter your OpenAI API key: \")\n",
"    os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
"else:\n",
"    os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# This part is the same as in simple_rag.ipynb, except we simulate a dense dataset\n",
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
"    \"\"\"\n",
"    Encodes a PDF book into a vector store using OpenAI embeddings.\n",
"\n",
"    Args:\n",
"        path: The path to the PDF file.\n",
"        chunk_size: The desired size of each text chunk.\n",
"        chunk_overlap: The amount of overlap between consecutive chunks.\n",
"\n",
"    Returns:\n",
"        A FAISS vector store containing the encoded book content.\n",
"    \"\"\"\n",
"\n",
"    # Load PDF documents\n",
"    loader = PyPDFLoader(path)\n",
"    documents = loader.load()\n",
"    documents = documents * 5  # load every document 5 times to emulate a dense dataset\n",
"\n",
"    # Split documents into chunks\n",
"    text_splitter = RecursiveCharacterTextSplitter(\n",
"        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
"    )\n",
"    texts = text_splitter.split_documents(documents)\n",
"    cleaned_texts = replace_t_with_space(texts)\n",
"\n",
"    # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
"    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
"    #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
"\n",
"    # Create vector store\n",
"    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
"\n",
"    return vectorstore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Vector store\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some helper functions for using the vector store for retrieval\n",
"This part is the same as in simple_rag.ipynb, except it uses the actual FAISS index (not the wrapper)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def idx_to_text(idx: int):\n",
"    \"\"\"\n",
"    Convert a vector store index to the corresponding text.\n",
"    \"\"\"\n",
"    docstore_id = chunks_vector_store.index_to_docstore_id[idx]\n",
"    document = chunks_vector_store.docstore.search(docstore_id)\n",
"    return document.page_content\n",
"\n",
"\n",
"def get_context(query: str, k: int = 5) -> List[str]:\n",
"    \"\"\"\n",
"    Retrieve top k context items for a query using plain top-k retrieval.\n",
"    \"\"\"\n",
"    # regular top k retrieval\n",
"    q_vec = chunks_vector_store.embedding_function.embed_documents([query])\n",
"    _, indices = chunks_vector_store.index.search(np.array(q_vec), k=k)\n",
"\n",
"    texts = [idx_to_text(i) for i in indices[0]]\n",
"    return texts\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"\n",
"test_query = \"What is the main cause of climate change?\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular top k retrieval\n",
"- This demonstrates that when the database is dense (here we simulate density by loading each document 5 times), plain top-k retrieval does not return the most relevant set: the top 3 results below are all repetitions of the same document."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts=get_context(test_query,k=3)\n",
"show_context(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now for the real part :) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### More utilities for distance normalization"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def lognorm(dist: np.ndarray, sigma: float):\n",
"    \"\"\"\n",
"    Calculate the log of the Gaussian density for a given distance and sigma.\n",
"    \"\"\"\n",
"    if sigma < 1e-9: \n",
"        return -np.inf * dist\n",
"    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)\n"
]
},
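{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustration (the values below are computed by hand, not taken from the original notebook run): smaller distances map to larger log-densities, and `sigma` controls how sharply the density falls off with distance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for d in [0.0, 0.1, 0.3]:\n",
"    print(f\"lognorm(dist={d}, sigma=0.1) = {lognorm(np.array(d), 0.1):.2f}\")\n",
"# Expect roughly 1.38, 0.88, -3.12: at sigma=0.1 a distance of 0.3 is heavily down-weighted\n"
]
},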
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Greedy Dartboard Search\n",
"\n",
"This is the core algorithm: a search procedure that selects a diverse set of relevant documents from a collection by balancing two factors: relevance to the query and diversity among the selected documents.\n",
"\n",
"Given distances between a query and documents, plus distances between all documents, the algorithm:\n",
"\n",
"1. Selects the most relevant document first\n",
"2. Iteratively selects additional documents by combining:\n",
"   - Relevance to the original query\n",
"   - Diversity from previously selected documents\n",
"\n",
"The balance between relevance and diversity is controlled by weights:\n",
"- `DIVERSITY_WEIGHT`: Importance of difference from existing selections\n",
"- `RELEVANCE_WEIGHT`: Importance of relevance to query\n",
"- `SIGMA`: Smoothing parameter for probability conversion\n",
"\n",
"The algorithm returns both the selected documents and their selection scores, making it useful for applications like search results where you want relevant but varied results.\n",
"\n",
"For example, when searching news articles, it would first return the most relevant article, then find articles that are both on-topic and provide new information, avoiding redundant selections. A toy sanity check follows the implementation below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Configuration parameters\n",
"DIVERSITY_WEIGHT = 1.0  # Weight for diversity in document selection\n",
"RELEVANCE_WEIGHT = 1.0  # Weight for relevance to query\n",
"SIGMA = 0.1  # Smoothing parameter for probability distribution\n",
"\n",
"def greedy_dartsearch(\n",
"    query_distances: np.ndarray,\n",
"    document_distances: np.ndarray,\n",
"    documents: List[str],\n",
"    num_results: int\n",
") -> Tuple[List[str], List[float]]:\n",
"    \"\"\"\n",
"    Perform greedy dartboard search to select top k documents balancing relevance and diversity.\n",
"\n",
"    Args:\n",
"        query_distances: Distance between query and each document\n",
"        document_distances: Pairwise distances between documents\n",
"        documents: List of document texts\n",
"        num_results: Number of documents to return\n",
"\n",
"    Returns:\n",
"        Tuple containing:\n",
"        - List of selected document texts\n",
"        - List of selection scores for each document\n",
"    \"\"\"\n",
"    # Avoid division by zero in probability calculations\n",
"    sigma = max(SIGMA, 1e-5)\n",
"\n",
"    # Convert distances to log-densities\n",
"    query_probabilities = lognorm(query_distances, sigma)\n",
"    document_probabilities = lognorm(document_distances, sigma)\n",
"\n",
"    # Initialize with the most relevant document\n",
"    most_relevant_idx = np.argmax(query_probabilities)\n",
"    selected_indices = np.array([most_relevant_idx])\n",
"    selection_scores = [1.0]  # dummy score for the first document\n",
"    # Get initial distances from the first selected document\n",
"    max_distances = document_probabilities[most_relevant_idx]\n",
"\n",
"    # Select remaining documents\n",
"    while len(selected_indices) < num_results:\n",
"        # Update maximum distances considering the new document\n",
"        updated_distances = np.maximum(max_distances, document_probabilities)\n",
"\n",
"        # Calculate combined diversity and relevance scores\n",
"        combined_scores = (\n",
"            updated_distances * DIVERSITY_WEIGHT +\n",
"            query_probabilities * RELEVANCE_WEIGHT\n",
"        )\n",
"\n",
"        # Normalize scores and mask already selected documents\n",
"        normalized_scores = logsumexp(combined_scores, axis=1)\n",
"        normalized_scores[selected_indices] = -np.inf\n",
"\n",
"        # Select best remaining document\n",
"        best_idx = np.argmax(normalized_scores)\n",
"        best_score = np.max(normalized_scores)\n",
"\n",
"        # Update tracking variables\n",
"        max_distances = updated_distances[best_idx]\n",
"        selected_indices = np.append(selected_indices, best_idx)\n",
"        selection_scores.append(best_score)\n",
"\n",
"    # Return selected documents and their scores\n",
"    selected_documents = [documents[i] for i in selected_indices]\n",
"    return selected_documents, selection_scores"
]
},
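{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny sanity check of `greedy_dartsearch` on hand-made toy data (the three fake documents and distances below are illustrative, not from this notebook's dataset). Documents 0 and 1 are near-duplicates, and document 1 is slightly closer to the query than document 2, so plain top-2 retrieval would return the duplicate pair, while dartboard should pick document 0 and then the distinct document 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy data: docs 0 and 1 are near-duplicates, doc 2 is distinct but still relevant\n",
"toy_docs = [\"cats are great\", \"cats are great!\", \"dogs are loyal\"]\n",
"toy_query_distances = np.array([[0.10, 0.11, 0.15]])  # doc 1 beats doc 2 on pure relevance\n",
"toy_document_distances = np.array([\n",
"    [0.00, 0.02, 0.80],\n",
"    [0.02, 0.00, 0.80],\n",
"    [0.80, 0.80, 0.00],\n",
"])\n",
"picked, picked_scores = greedy_dartsearch(\n",
"    toy_query_distances, toy_document_distances, toy_docs, num_results=2\n",
")\n",
"print(picked)  # expected: ['cats are great', 'dogs are loyal'] - the near-duplicate is skipped\n"
]
},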
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dartboard Context Retrieval\n",
"\n",
"### Main function for dartboard retrieval. It serves as a drop-in replacement for get_context (plain RAG). It:\n",
"\n",
"1. Takes a text query, vectorizes it, and gets the top k documents (and their vectors) via plain RAG\n",
"2. Uses these vectors to calculate the distances to the query and between candidate matches\n",
"3. Runs the dartboard algorithm to refine the candidate matches to a final list of k documents\n",
"4. Returns the final list of documents and their scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def get_context_with_dartboard(\n",
"    query: str,\n",
"    num_results: int = 5,\n",
"    oversampling_factor: int = 3\n",
") -> Tuple[List[str], List[float]]:\n",
"    \"\"\"\n",
"    Retrieve the most relevant and diverse context items for a query using the dartboard algorithm.\n",
"\n",
"    Args:\n",
"        query: The search query string\n",
"        num_results: Number of context items to return (default: 5)\n",
"        oversampling_factor: Factor to oversample initial results for better diversity (default: 3)\n",
"\n",
"    Returns:\n",
"        Tuple containing:\n",
"        - List of selected context texts\n",
"        - List of selection scores\n",
"\n",
"    Note:\n",
"        The function uses cosine similarity converted to distance. Initial retrieval \n",
"        fetches oversampling_factor * num_results items to ensure sufficient diversity \n",
"        in the final selection.\n",
"    \"\"\"\n",
"    # Embed query and retrieve initial candidates\n",
"    query_embedding = chunks_vector_store.embedding_function.embed_documents([query])\n",
"    _, candidate_indices = chunks_vector_store.index.search(\n",
"        np.array(query_embedding),\n",
"        k=num_results * oversampling_factor\n",
"    )\n",
"\n",
"    # Get document vectors and texts for candidates\n",
"    candidate_vectors = np.array(\n",
"        chunks_vector_store.index.reconstruct_batch(candidate_indices[0])\n",
"    )\n",
"    candidate_texts = [idx_to_text(idx) for idx in candidate_indices[0]]\n",
"\n",
"    # Calculate distance matrices\n",
"    # Using 1 - cosine_similarity as distance metric\n",
"    document_distances = 1 - np.dot(candidate_vectors, candidate_vectors.T)\n",
"    query_distances = 1 - np.dot(query_embedding, candidate_vectors.T)\n",
"\n",
"    # Apply dartboard selection algorithm\n",
"    selected_texts, selection_scores = greedy_dartsearch(\n",
"        query_distances,\n",
"        document_distances,\n",
"        candidate_texts,\n",
"        num_results\n",
"    )\n",
"\n",
"    return selected_texts, selection_scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dartboard retrieval - results on the same query, k, and dataset\n",
"- As you can see, the top 3 results are no longer mere repetitions. "
]
},
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Context 1:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 2:\n",
|
||||
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
|
||||
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
|
||||
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
|
||||
"unprecedented changes. \n",
|
||||
"Modern Observations \n",
|
||||
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
|
||||
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
|
||||
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
|
||||
"provide a historical record that scientists use to understand past climate conditions and \n",
|
||||
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Context 3:\n",
|
||||
"driven by human activities, particularly the emission of greenhou se gases. \n",
|
||||
"Chapter 2: Causes of Climate Change \n",
|
||||
"Greenhouse Gases \n",
|
||||
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
|
||||
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
|
||||
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
|
||||
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
|
||||
"activities have intensified this natural process, leading to a warmer climate. \n",
|
||||
"Fossil Fuels \n",
|
||||
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
|
||||
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
|
||||
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
|
||||
"today. \n",
|
||||
"Coal\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"texts,scores=get_context_with_dartboard(test_query,k=3)\n",
|
||||
"show_context(texts)\n"
|
||||
]
|
||||
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -8,7 +8,7 @@
 "\n",
 "## Overview\n",
 "\n",
-"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
+"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://python.langchain.com/docs/how_to/semantic-chunker/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
 "\n",
 "## Motivation\n",
 "\n",
@@ -289,7 +289,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.3"
+"version": "3.12.0"
 }
 },
 "nbformat": 4,