dartboard algo implementation + README

This commit is contained in:
rotbka
2025-02-10 14:12:17 +02:00
parent 7249e55824
commit 0993d27edf
2 changed files with 453 additions and 13 deletions

View File

@@ -277,7 +277,15 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.
19. Multi-modal Retrieval 📽️
19. Dartboard Retrieval 🎯
- **[LangChain](all_rag_techniques/dartboard.ipynb)**
#### Overview 🔎
Optimizing retrieval for relevant information gain so that top-k results add new information rather than repeating it.
#### Implementation 🛠️
Combine both relevance and diversity into a single scoring function and directly optimize for it.
20. Multi-modal Retrieval 📽️
#### Overview 🔎
Extending RAG capabilities to handle diverse data types for richer responses.
@@ -289,7 +297,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔁 Iterative and Adaptive Techniques
20. Retrieval with Feedback Loops 🔁
21. Retrieval with Feedback Loops 🔁
- **[LangChain](all_rag_techniques/retrieval_with_feedback_loop.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/retrieval_with_feedback_loop.py)**
@@ -299,7 +307,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.
21. Adaptive Retrieval 🎯
22. Adaptive Retrieval 🎯
- **[LangChain](all_rag_techniques/adaptive_retrieval.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/adaptive_retrieval.py)**
@@ -309,7 +317,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
22. Iterative Retrieval 🔄
23. Iterative Retrieval 🔄
#### Overview 🔎
Performing multiple rounds of retrieval to refine and enhance result quality.
@@ -319,7 +327,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 📊 Evaluation
23. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
24. **[DeepEval Evaluation](evaluation/evaluation_deep_eval.ipynb)** 📘
#### Overview 🔎
Performing evaluations of Retrieval-Augmented Generation systems, covering several metrics and creating test cases.
@@ -328,7 +336,7 @@ Explore the extensive list of cutting-edge RAG techniques:
Use the `deepeval` library to conduct test cases on correctness, faithfulness and contextual relevancy of RAG systems.
24. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
25. **[GroUSE Evaluation](evaluation/evaluation_grouse.ipynb)** 🐦
#### Overview 🔎
Evaluate the final stage of Retrieval-Augmented Generation using metrics of the GroUSE framework and meta-evaluate your custom LLM judge on GroUSE unit tests.
@@ -339,7 +347,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🔬 Explainability and Transparency
25. Explainable Retrieval 🔍
26. Explainable Retrieval 🔍
- **[LangChain](all_rag_techniques/explainable_retrieval.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/explainable_retrieval.py)**
@@ -351,7 +359,7 @@ Explore the extensive list of cutting-edge RAG techniques:
### 🏗️ Advanced Architectures
26. Knowledge Graph Integration (Graph RAG) 🕸️
27. Knowledge Graph Integration (Graph RAG) 🕸️
- **[LangChain](all_rag_techniques/graph_rag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/graph_rag.py)**
@@ -361,7 +369,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.
27. GraphRag (Microsoft) 🎯
28. GraphRag (Microsoft) 🎯
- **[GraphRag](all_rag_techniques/Microsoft_GraphRag.ipynb)**
#### Overview 🔎
@@ -370,7 +378,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
• Analyze an input corpus by extracting entities and relationships from text units, then generate summaries of each community and its constituents from the bottom up.
28. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
29. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳
- **[LangChain](all_rag_techniques/raptor.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/raptor.py)**
@@ -380,7 +388,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.
29. Self RAG 🔁
30. Self RAG 🔁
- **[LangChain](all_rag_techniques/self_rag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/self_rag.py)**
@@ -390,7 +398,7 @@ Explore the extensive list of cutting-edge RAG techniques:
#### Implementation 🛠️
• Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.
30. Corrective RAG 🔧
31. Corrective RAG 🔧
- **[LangChain](all_rag_techniques/crag.ipynb)**
- **[Runnable Script](all_rag_techniques_runnable_scripts/crag.py)**
@@ -402,7 +410,7 @@ Explore the extensive list of cutting-edge RAG techniques:
## 🌟 Special Advanced Technique 🌟
31. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
32. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**
#### Overview 🔎
An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.

View File

@@ -0,0 +1,432 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Darboard - optimizing over Relevant information gain in Retrieval\n",
"In real life applications where knowledge bases can get dense and large, most relevant results might have redundant information, and prevent relevant information from being retrieved in top-k.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"A practical solution is to **combine both relevance and diversity** into a single scoring function and directly optimize for it.\n",
" \n",
"This is an implementation of the \"dartboard\" algorithm, as described in the following paper:\n",
"\n",
"> [*Better RAG using Relevant Information Gain*](https://arxiv.org/abs/2407.12101)\n",
"(very elegant method, recommended reading)\n",
"\n",
"The paper actually presents three variations on the same core idea; one with hybrid rag (dense and sparse), one with cross-encoder, and a vanilla version. The vanilla one conveys the idea, and is given here. If you have a hybrid RAG, you can just calculate cos-sim on both vectors and combine them for a similarity score. the shift from cross-encoder scores is straightforward too, but you might want some scaling of the distances. \n",
"\n",
"Additionally, I've introduced weights to control the balance between diversity and relevance. \n",
"In real life, this weighting might help avoid retrieving overly diverse (and potentially less relevant) results.\n",
"The official paper also has a code implemention, and this code is based on it, but I think this one here is more readable, manageable and production ready."
]
},
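{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the objective concrete, here is a rough sketch of the quantity the greedy search below maximizes (my notation, not the paper's; see the paper for the exact formulation). Distances are converted to Gaussian log-probabilities, and a candidate set $S$ is scored by how much query-relevant probability mass it covers:\n",
"\n",
"$$\\mathrm{score}(S) = \\log \\sum_{j} \\exp\\Big( w_r \\log p(j \\mid q) + w_d \\max_{s \\in S} \\log p(j \\mid s) \\Big)$$\n",
"\n",
"where $j$ ranges over the fetched candidates, $p(j \\mid q)$ measures relevance of candidate $j$ to the query, $p(j \\mid s)$ measures similarity between candidates, and $w_r$, $w_d$ are the relevance and diversity weights introduced below. Documents are selected greedily, one at a time, choosing the candidate whose addition increases this score the most."
]
},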
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"import numpy as np\n",
"from scipy.special import logsumexp\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"# Set the OpenAI API key environment variable (comment out if not using OpenAI)\n",
"if not os.getenv('OPENAI_API_KEY'):\n",
" print(\"Please enter your OpenAI API key: \")\n",
" os.environ[\"OPENAI_API_KEY\"] = input(\"Please enter your OpenAI API key: \")\n",
"else:\n",
" os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
"from helper_functions import *\n",
"from evaluation.evalute_rag import *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/Understanding_Climate_Change.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode document"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# this part is same like simple_rag.ipynb, only simulating a dense dataset\n",
"def encode_pdf(path, chunk_size=1000, chunk_overlap=200):\n",
" \"\"\"\n",
" Encodes a PDF book into a vector store using OpenAI embeddings.\n",
"\n",
" Args:\n",
" path: The path to the PDF file.\n",
" chunk_size: The desired size of each text chunk.\n",
" chunk_overlap: The amount of overlap between consecutive chunks.\n",
"\n",
" Returns:\n",
" A FAISS vector store containing the encoded book content.\n",
" \"\"\"\n",
"\n",
" # Load PDF documents\n",
" loader = PyPDFLoader(path)\n",
" documents = loader.load()\n",
" documents=documents*5 # load every document 5 times to emulate a dense dataset\n",
"\n",
" # Split documents into chunks\n",
" text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len\n",
" )\n",
" texts = text_splitter.split_documents(documents)\n",
" cleaned_texts = replace_t_with_space(texts)\n",
"\n",
" # Create embeddings (Tested with OpenAI and Amazon Bedrock)\n",
" embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)\n",
" #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)\n",
"\n",
" # Create vector store\n",
" vectorstore = FAISS.from_documents(cleaned_texts, embeddings)\n",
"\n",
" return vectorstore"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# this part is same like simple_rag.ipynb, only its using the actual FAISS index (not the wrapper)\n",
"def idx_to_text(idx):\n",
" docstore_id = chunks_vector_store.index_to_docstore_id[idx]\n",
" document = chunks_vector_store.docstore.search(docstore_id)\n",
" return document.page_content\n",
"\n",
"\n",
"def get_context(query,k=5):\n",
" # regular top k retrieval\n",
" q_vec=chunks_vector_store.embedding_function.embed_documents([query])\n",
" _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k)\n",
"\n",
" vecs = chunks_vector_store.index.reconstruct_batch(indices[0])\n",
" texts = [idx_to_text(i) for i in indices[0]]\n",
" return q_vec,vecs,texts\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"\n",
"test_query = \"What is the main cause of climate change?\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular top k retrieval"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"q_vec,vecs,texts=get_context(test_query,k=3)\n",
"show_context(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### As you can see, the results are not good, we want each result to bring more information. "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Adjust these according to your needs, knowledge base density, etc. \n",
"DIVERSITY_WEIGHT=1.0\n",
"RELEVANCE_WEIGHT=1.0\n",
"SIGMA=0.1\n",
"\n",
"def get_context_with_dartboard(query,k=5):\n",
" q_vec=chunks_vector_store.embedding_function.embed_documents([query])\n",
" _,indices=chunks_vector_store.index.search(np.array(q_vec),k=k*3) # fetch more than k to ensure we overcome density and use diversity\n",
"\n",
" vecs = chunks_vector_store.index.reconstruct_batch(indices[0])\n",
" texts = [idx_to_text(i) for i in indices[0]]\n",
"\n",
" vecs=np.array(vecs)\n",
" \n",
" dists_mat = 1-np.dot(vecs,vecs.T) # 1-cosine distance, you may think of better distance functions. This can also be applied to cross-encoder scores. \n",
" \n",
" q_dists=1-np.dot(q_vec,vecs.T)\n",
" \n",
" texts,scores=get_dartboard(q_dists,dists_mat,texts,SIGMA,k)\n",
"\n",
" return texts,scores\n",
"\n",
"\n",
"\n",
"def get_dartboard(qdists, distsmat, texts, sigma: float, k: int):\n",
" sigma=np.max(sigma,1e-5)\n",
" qprobs = lognorm(qdists, sigma)\n",
" ccprobmat = lognorm(distsmat, sigma)\n",
" return greedy_dartsearch(qprobs, ccprobmat, texts, k)\n",
"\n",
"\n",
"def lognorm(dist, sigma):\n",
" if sigma < 1e-9: \n",
" return -np.inf * dist\n",
" return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - dist**2 / (2 * sigma**2)\n",
"\n",
"\n",
"def greedy_dartsearch(qprobs, dists_mat, texts, k):\n",
" out_scores=[]\n",
" top_idx = np.argmax(qprobs)\n",
" dset = np.array([top_idx])\n",
" maxes = dists_mat[top_idx]\n",
" while len(dset) < k:\n",
" newmaxes = np.maximum(maxes, dists_mat)\n",
"\n",
" logscores = newmaxes*DIVERSITY_WEIGHT + qprobs*RELEVANCE_WEIGHT\n",
" scores = logsumexp(logscores, axis=1)\n",
" scores[dset] = -np.inf\n",
" best_idx = np.argmax(scores)\n",
" best_score=np.log(np.max(scores))\n",
" maxes = newmaxes[best_idx]\n",
" dset = np.append(dset, best_idx)\n",
" out_scores.append(best_score)\n",
" return [texts[i] for i in dset],out_scores\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### dartboard retrieval"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context 1:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n",
"Context 2:\n",
"Most of these climate changes are attributed to very small variations in Earth's orbit that \n",
"change the amount of solar energy our planet receives. During the Holocene epoch, which \n",
"began at the end of the last ice age, human societies f lourished, but the industrial era has seen \n",
"unprecedented changes. \n",
"Modern Observations \n",
"Modern scientific observations indicate a rapid increase in global temperatures, sea levels, \n",
"and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has \n",
"documented these changes extensively. Ice core samples, tree rings, and ocean sediments \n",
"provide a historical record that scientists use to understand past climate conditions and \n",
"predict future trends. The evidence overwhelmingly shows that recent changes are primarily \n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases\n",
"\n",
"\n",
"Context 3:\n",
"driven by human activities, particularly the emission of greenhou se gases. \n",
"Chapter 2: Causes of Climate Change \n",
"Greenhouse Gases \n",
"The primary cause of recent climate change is the increase in greenhouse gases in the \n",
"atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \n",
"oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is essential \n",
"for life on Earth, as it keeps the planet warm enough to support life. However, human \n",
"activities have intensified this natural process, leading to a warmer climate. \n",
"Fossil Fuels \n",
"Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and \n",
"natural gas used for electricity, heating, and transportation. The industrial revolution marked \n",
"the beginning of a significant increase in fossil fuel consumption, which continues to rise \n",
"today. \n",
"Coal\n",
"\n",
"\n"
]
}
],
"source": [
"texts,scores=get_context_with_dartboard(test_query,k=3)\n",
"show_context(texts)\n",
"# now top 3 results are not mere repetitions. "
]
},
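{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted in the introduction, the same machinery extends to a hybrid (dense + sparse) setup. Below is a minimal, untested sketch of how the two distance matrices could be combined before calling `get_dartboard`; it is not part of the paper's reference code, and the names `q_dense`, `dense_vecs`, `q_sparse`, `sparse_vecs`, and `alpha` are hypothetical, assuming both vector sets are given as L2-normalized NumPy arrays."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: combining dense and sparse cosine distances for hybrid dartboard retrieval.\n",
"# Assumes both vector sets are L2-normalized NumPy arrays; the names here are illustrative only.\n",
"def hybrid_distances(q_dense, dense_vecs, q_sparse, sparse_vecs, alpha=0.5):\n",
"    \"\"\"Combine dense and sparse cosine distances into single distance matrices.\"\"\"\n",
"    # Dense cosine distances (query-candidate and candidate-candidate)\n",
"    q_dists_dense = 1 - np.dot(q_dense, dense_vecs.T)\n",
"    dists_dense = 1 - np.dot(dense_vecs, dense_vecs.T)\n",
"\n",
"    # Sparse cosine distances computed the same way\n",
"    q_dists_sparse = 1 - np.dot(q_sparse, sparse_vecs.T)\n",
"    dists_sparse = 1 - np.dot(sparse_vecs, sparse_vecs.T)\n",
"\n",
"    # Weighted combination; alpha balances dense vs. sparse similarity\n",
"    q_dists = alpha * q_dists_dense + (1 - alpha) * q_dists_sparse\n",
"    dists_mat = alpha * dists_dense + (1 - alpha) * dists_sparse\n",
"    return q_dists, dists_mat\n",
"\n",
"# The combined matrices can then be passed to get_dartboard() exactly as in get_context_with_dartboard above."
]
},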
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}