merge main

This commit is contained in:
VakeDomen
2025-03-01 18:28:18 +00:00
5 changed files with 634 additions and 16 deletions

View File

@@ -294,7 +294,16 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.
20. Multi-modal Retrieval 📽️
20. Dartboard Retrieval 🎯
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/dartboard.ipynb)**
#### Overview 🔎
Optimizing over Relevant Information Gain in Retrieval
#### Implementation 🛠️
- Combine both relevance and diversity into a single scoring function and directly optimize for it.
- A proof of concept (POC) showing plain simple RAG underperforming when the database is dense, and dartboard retrieval outperforming it.
21. Multi-modal Retrieval 📽️
#### Overview 🔎
Extending RAG capabilities to handle diverse data types for richer responses.
@@ -306,7 +315,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔁 Iterative and Adaptive Techniques
21. Retrieval with Feedback Loops 🔁
22. Retrieval with Feedback Loops 🔁
- **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
@@ -316,7 +325,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
22. Adaptive Retrieval 🎯
23. Adaptive Retrieval 🎯
- **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
@@ -326,7 +335,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
23. Iterative Retrieval 🔄
24. Iterative Retrieval 🔄
#### Overview 🔎
Performing multiple rounds of retrieval to refine and enhance result quality.
@@ -336,7 +345,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 📊 Evaluation
24. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
25. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
#### Overview 🔎
Performing evaluations of Retrieval-Augmented Generation systems, covering several metrics and creating test cases.
@@ -345,7 +354,7 @@ Explore the extensive list of cutting-edge RAG techniques:
Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems.
25. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
26. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
#### Overview 🔎
Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests.
@@ -356,7 +365,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔬 Explainability and Transparency
26. Explainable Retrieval 🔍
27. Explainable Retrieval 🔍
- **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
@@ -368,7 +377,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🏗️ Advanced Architectures
27. Knowledge Graph Integration (Graph RAG) 🕸️
28. Knowledge Graph Integration (Graph RAG) 🕸️
- **[LangChain](all_rag_techniques/graph_rag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
@@ -378,7 +387,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.
28. GraphRag (Microsoft) 🎯
29. GraphRag (Microsoft) 🎯
- **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**
#### Overview 🔎
@@ -387,7 +396,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
• Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
29. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
30. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
- **[LangChain](all_rag_techniques/raptor.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
@@ -397,7 +406,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.
30. Self RAG 🔁
31. Self RAG 🔁
- **[LangChain](all_rag_techniques/self_rag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
@@ -407,7 +416,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
• Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
31. Corrective RAG 🔧
32. Corrective RAG 🔧
- **[LangChain](all_rag_techniques/crag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
@@ -419,7 +428,7 @@ Explore the extensive list of cutting-edge RAG techniques:
## 🌟 Special Advanced Technique 🌟
32. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
33. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
#### Overview 🔎
An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.

View File

@@ -97,7 +97,7 @@
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path sicnce we work with notebooks\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n",
"\n",

View File

@@ -0,0 +1,609 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dartboard RAG: Retrieval-Augmented Generation with Balanced Relevance and Diversity\n",
"\n",
"## Overview\n",
"The **Dartboard RAG** process addresses a common challenge in large knowledge bases: ensuring the retrieved information is both relevant and non-redundant. By explicitly optimizing a combined relevance-diversity scoring function, it prevents multiple top-k documents from offering the same information. This approach is drawn from the elegant method in thepaper:\n",
"\n",
"> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)\n",
"\n",
"The paper outlines three variations of the core idea—hybrid RAG (dense + sparse), a cross-encoder version, and a vanilla approach. The **vanilla approach** conveys the fundamental concept most directly, and this implementation extends it with optional weights to control the balance between relevance and diversity.\n",
"\n",
"## Motivation\n",
"\n",
"1. **Dense, Overlapping Knowledge Bases** \n",
" In large databases, documents may repeat similar content, causing redundancy in top-k retrieval.\n",
"\n",
"2. **Improved Information Coverage** \n",
" Combining relevance and diversity yields a richer set of documents, mitigating the “echo chamber” effect of overly similar content.\n",
"\n",
"\n",
"## Key Components\n",
"\n",
"1. **Relevance & Diversity Combination** \n",
" - Computes a score factoring in both how pertinent a document is to the query and how distinct it is from already chosen documents.\n",
"\n",
"2. **Weighted Balancing** \n",
" - Introduces RELEVANCE_WEIGHT and DIVERSITY_WEIGHT to allow dynamic control of scoring. \n",
" - Helps in avoiding overly diverse but less relevant results.\n",
"\n",
"3. **Production-Ready Code** \n",
" - Derived from the official implementation yet reorganized for clarity. \n",
" - Allows easier integration into existing RAG pipelines.\n",
"\n",
"## Method Details\n",
"\n",
"1. **Document Retrieval** \n",
" - Obtain an initial set of candidate documents based on similarity (e.g., cosine or BM25). \n",
" - Typically retrieves top-N candidates as a starting point.\n",
"\n",
"2. **Scoring & Selection** \n",
" - Each documents overall score combines **relevance** and **diversity**: \n",
" - Select the highest-scoring document, then penalize documents that are overly similar to it. \n",
" - Repeat until top-k documents are identified.\n",
"\n",
"3. **Hybrid / Fusion & Cross-Encoder Support** \n",
" Essentially, all you need are distances between documents and the query, and distances between documents. You can easily extract these from hybrid / fusion retrieval or from cross-encoder retrieval. The only recommendation I have is to rely less on raking based scores.\n",
" - For **hybrid / fusion retrieval**: Merge similarities (dense and sparse / BM25) into a single distance. This can be achieved by combining cosine similarity over the dense and the sparse vectors (e.g. averaging them). the move to distances is straightforward (1 - mean cosine similarity). \n",
" - For **cross-encoders**: You can directly use the cross-encoder similarity scores (1- similarity), potentially adjusting with scaling factors.\n",
"\n",
"4. **Balancing & Adjustment** \n",
" - Tune DIVERSITY_WEIGHT and RELEVANCE_WEIGHT based on your needs and the density of your dataset. \n",
"\n",
"\n",
"\n",
"By integrating both **relevance** and **diversity** into retrieval, the Dartboard RAG approach ensures that top-k documents collectively offer richer, more comprehensive information—leading to higher-quality responses in Retrieval-Augmented Generation systems.\n",
"\n",
"The paper also has an official code implemention, and this code is based on it, but I think this one here is more readable, manageable and production ready."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Please enter your OpenAI API key: \n"
]
}
],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from scipy.special import logsumexp\n",
"from typing import Tuple, List, Any\n",
"import numpy as np\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
"if not os.getenv('OPENAI_API_KEY'):\n",
" print(\"Please enter your OpenAI API key: \")\n",
" os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
"else:\n",
" os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# this part is same like simple_rag.ipynb, only simulating a dense dataset\n",
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
" \"\"\"\n",
" Encodes a PDF book into a vector store using OpenAI embeddings.\n",
"\n",
" Args:\n",
" path: The path to the PDF file.\n",
" chunk_size: The desired size of each text chunk.\n",
" chunk_overlap: The amount of overlap between consecutive chunks.\n",
"\n",
" Returns:\n",
" A FAISS vector store containing the encoded book content.\n",
" \"\"\"\n",
"\n",
" # Load PDF documents\n",
" loader = PyPDFLoader(path)\n",
" documents = loader.load()\n",
" documents=documents*5 # load every document 5 times to emulate a dense dataset\n",
"\n",
" # Split documents into chunks\n",
" text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
" )\n",
" texts = text_splitter.split_documents(documents)\n",
" cleaned_texts = replace_t_with_space(texts)\n",
"\n",
" # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
" embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
" #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
"\n",
" # Create vector store\n",
" vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
"\n",
" return vectorstore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Vector store\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some helper functions for using the vector store for retrieval.\n",
"this part is same like simple_rag.ipynb, only its using the actual FAISS index (not the wrapper)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def idx_to_text(idx:int):\n",
" \"\"\"\n",
" Convert a Vector store index to the corresponding text.\n",
" \"\"\"\n",
" docstore_id = chunks_vector_store.index_to_docstore_id[idx]\n",
" document = chunks_vector_store.docstore.search(docstore_id)\n",
" return document.page_content\n",
"\n",
"\n",
"def get_context(query:str,k:int=5) -> Tuple[np.ndarray, np.ndarray, List[str]]:\n",
" \"\"\"\n",
" Retrieve top k context items for a query using top k retrieval.\n",
" \"\"\"\n",
" # regular top k retrieval\n",
" q_vec=chunks_vector_store.embedding_function.embed_documents([query])\n",
" _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k)\n",
"\n",
" texts = [idx_to_text(i) for i in indices[0]]\n",
" return texts\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"\n",
"test_query = \"What is the main cause of climate change?\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular top k retrieval\n",
"- This demonstration shows that when database is dense (here we simulate density by loading each document 5 times), the results are not good, we don't get the most relevant results. Note that the top 3 results are all repetitions of the same document."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts=get_context(test_query,k=3)\n",
"show_context(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now for the real part :) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### More utils for distances normalization"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def lognorm(dist:np.ndarray, sigma:float):\n",
" \"\"\"\n",
" Calculate the log-normal probability for a given distance and sigma.\n",
" \"\"\"\n",
" if sigma < 1e-9: \n",
" return -np.inf * dist\n",
" return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Greedy Dartboard Search\n",
"\n",
"This is the core algorithm: A search algorithm that selects a diverse set of relevant documents from a collection by balancing two factors: relevance to the query and diversity among selected documents.\n",
"\n",
"Given distances between a query and documents, plus distances between all documents, the algorithm:\n",
"\n",
"1. Selects the most relevant document first\n",
"2. Iteratively selects additional documents by combining:\n",
" - Relevance to the original query\n",
" - Diversity from previously selected documents\n",
"\n",
"The balance between relevance and diversity is controlled by weights:\n",
"- `DIVERSITY_WEIGHT`: Importance of difference from existing selections\n",
"- `RELEVANCE_WEIGHT`: Importance of relevance to query\n",
"- `SIGMA`: Smoothing parameter for probability conversion\n",
"\n",
"The algorithm returns both the selected documents and their selection scores, making it useful for applications like search results where you want relevant but varied results.\n",
"\n",
"For example, when searching news articles, it would first return the most relevant article, then find articles that are both on-topic and provide new information, avoiding redundant selections."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Configuration parameters\n",
"DIVERSITY_WEIGHT = 1.0 # Weight for diversity in document selection\n",
"RELEVANCE_WEIGHT = 1.0 # Weight for relevance to query\n",
"SIGMA = 0.1 # Smoothing parameter for probability distribution\n",
"\n",
"def greedy_dartsearch(\n",
" query_distances: np.ndarray,\n",
" document_distances: np.ndarray,\n",
" documents: List[str],\n",
" num_results: int\n",
") -> Tuple[List[str], List[float]]:\n",
" \"\"\"\n",
" Perform greedy dartboard search to select top k documents balancing relevance and diversity.\n",
" \n",
" Args:\n",
" query_distances: Distance between query and each document\n",
" document_distances: Pairwise distances between documents\n",
" documents: List of document texts\n",
" num_results: Number of documents to return\n",
" \n",
" Returns:\n",
" Tuple containing:\n",
" - List of selected document texts\n",
" - List of selection scores for each document\n",
" \"\"\"\n",
" # Avoid division by zero in probability calculations\n",
" sigma = max(SIGMA, 1e-5)\n",
" \n",
" # Convert distances to probability distributions\n",
" query_probabilities = lognorm(query_distances, sigma)\n",
" document_probabilities = lognorm(document_distances, sigma)\n",
" \n",
" # Initialize with most relevant document\n",
" \n",
" most_relevant_idx = np.argmax(query_probabilities)\n",
" selected_indices = np.array([most_relevant_idx])\n",
" selection_scores = [1.0] # dummy score for the first document\n",
" # Get initial distances from the first selected document\n",
" max_distances = document_probabilities[most_relevant_idx]\n",
" \n",
" # Select remaining documents\n",
" while len(selected_indices) < num_results:\n",
" # Update maximum distances considering new document\n",
" updated_distances = np.maximum(max_distances, document_probabilities)\n",
" \n",
" # Calculate combined diversity and relevance scores\n",
" combined_scores = (\n",
" updated_distances * DIVERSITY_WEIGHT +\n",
" query_probabilities * RELEVANCE_WEIGHT\n",
" )\n",
" \n",
" # Normalize scores and mask already selected documents\n",
" normalized_scores = logsumexp(combined_scores, axis=1)\n",
" normalized_scores[selected_indices] = -np.inf\n",
" \n",
" # Select best remaining document\n",
" best_idx = np.argmax(normalized_scores)\n",
" best_score = np.max(normalized_scores)\n",
" \n",
" # Update tracking variables\n",
" max_distances = updated_distances[best_idx]\n",
" selected_indices = np.append(selected_indices, best_idx)\n",
" selection_scores.append(best_score)\n",
" \n",
" # Return selected documents and their scores\n",
" selected_documents = [documents[i] for i in selected_indices]\n",
" return selected_documents, selection_scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dartboard Context Retrieval\n",
"\n",
"### Main function for using the dartboard retrieval. This serves instead of get_context (which is simple RAG). It:\n",
"\n",
"1. Takes a text query, vectorizes it, gets the top k documents (and their vectors) via simple RAG\n",
"2. Uses these vectors to calculate the similarities to query and between candidate matches\n",
"3. Runs the dartboard algorithm to refine the candidate matches to a final list of k documents\n",
"4. Returns the final list of documents and their scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def get_context_with_dartboard(\n",
" query: str,\n",
" num_results: int = 5,\n",
" oversampling_factor: int = 3\n",
") -> Tuple[List[str], List[float]]:\n",
" \"\"\"\n",
" Retrieve most relevant and diverse context items for a query using the dartboard algorithm.\n",
" \n",
" Args:\n",
" query: The search query string\n",
" num_results: Number of context items to return (default: 5)\n",
" oversampling_factor: Factor to oversample initial results for better diversity (default: 3)\n",
" \n",
" Returns:\n",
" Tuple containing:\n",
" - List of selected context texts\n",
" - List of selection scores\n",
" \n",
" Note:\n",
" The function uses cosine similarity converted to distance. Initial retrieval \n",
" fetches oversampling_factor * num_results items to ensure sufficient diversity \n",
" in the final selection.\n",
" \"\"\"\n",
" # Embed query and retrieve initial candidates\n",
" query_embedding = chunks_vector_store.embedding_function.embed_documents([query])\n",
" _, candidate_indices = chunks_vector_store.index.search(\n",
" np.array(query_embedding),\n",
" k=num_results * oversampling_factor\n",
" )\n",
" \n",
" # Get document vectors and texts for candidates\n",
" candidate_vectors = np.array(\n",
" chunks_vector_store.index.reconstruct_batch(candidate_indices[0])\n",
" )\n",
" candidate_texts = [idx_to_text(idx) for idx in candidate_indices[0]]\n",
" \n",
" # Calculate distance matrices\n",
" # Using 1 - cosine_similarity as distance metric\n",
" document_distances = 1 - np.dot(candidate_vectors, candidate_vectors.T)\n",
" query_distances = 1 - np.dot(query_embedding, candidate_vectors.T)\n",
" \n",
" # Apply dartboard selection algorithm\n",
" selected_texts, selection_scores = greedy_dartsearch(\n",
" query_distances,\n",
" document_distances,\n",
" candidate_texts,\n",
" num_results\n",
" )\n",
" \n",
" return selected_texts, selection_scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### dartboard retrieval - results on same query, k, and dataset\n",
"- As you can see now the top 3 results are not mere repetitions. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts,scores=get_context_with_dartboard(test_query,k=3)\n",
"show_context(texts)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -8,7 +8,7 @@
"\n",
"## Overview\n",
"\n",
"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
"This code implements a semantic chunking approach for processing and retrieving information from PDF documents, [first proposed by Greg Kamradt](https://youtu.be/8OJC21T2SL4?t=1933) and subsequently [implemented in LangChain](https://python.langchain.com/docs/how_to/semantic-chunker/). Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
"\n",
"## Motivation\n",
"\n",

View File

@@ -289,7 +289,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
"version": "3.12.0"
}
},
"nbformat": 4,