mirror of https://github.com/NirDiamant/RAG_Techniques.git (synced 2025-04-07 00:48:52 +03:00)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fusion Retrieval in Document Search\n",
|
|
"\n",
|
|
"## Overview\n",
|
|
"\n",
|
|
"This code implements a Fusion Retrieval system that combines vector-based similarity search with keyword-based BM25 retrieval. The approach aims to leverage the strengths of both methods to improve the overall quality and relevance of document retrieval.\n",
|
|
"\n",
|
|
"## Motivation\n",
|
|
"\n",
|
|
"Traditional retrieval methods often rely on either semantic understanding (vector-based) or keyword matching (BM25). Each approach has its strengths and weaknesses. Fusion retrieval aims to combine these methods to create a more robust and accurate retrieval system that can handle a wider range of queries effectively.\n",
|
|
"\n",
|
|
"## Key Components\n",
|
|
"\n",
|
|
"1. PDF processing and text chunking\n",
|
|
"2. Vector store creation using FAISS and OpenAI embeddings\n",
|
|
"3. BM25 index creation for keyword-based retrieval\n",
|
|
"4. Fusioning BM25 and vector search results for better retrieval\n",
|
|
"\n",
|
|
"## Method Details\n",
|
|
"\n",
|
|
"### Document Preprocessing\n",
|
|
"\n",
|
|
"1. The PDF is loaded and split into chunks using SentenceSplitter.\n",
|
|
"2. Chunks are cleaned by replacing 't' with spaces and newline cleaning (likely addressing a specific formatting issue).\n",
|
|
"\n",
|
|
"### Vector Store Creation\n",
|
|
"\n",
|
|
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
|
|
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
|
|
"\n",
|
|
"### BM25 Index Creation\n",
|
|
"\n",
|
|
"1. A BM25 index is created from the same text chunks used for the vector store.\n",
|
|
"2. This allows for keyword-based retrieval alongside the vector-based method.\n",
|
|
"\n",
|
|
"### Query Fusion Retrieval\n",
|
|
"\n",
|
|
"After creation of both indexes Query Fusion Retrieval combines them to enable a hybrid retrieval\n",
|
|
"\n",
|
|
"## Benefits of this Approach\n",
|
|
"\n",
|
|
"1. Improved Retrieval Quality: By combining semantic and keyword-based search, the system can capture both conceptual similarity and exact keyword matches.\n",
|
|
"2. Flexibility: The `retriever_weights` parameter allows for adjusting the balance between vector and keyword search based on specific use cases or query types.\n",
|
|
"3. Robustness: The combined approach can handle a wider range of queries effectively, mitigating weaknesses of individual methods.\n",
|
|
"4. Customizability: The system can be easily adapted to use different vector stores or keyword-based retrieval methods.\n",
|
|
"\n",
|
|
"## Conclusion\n",
|
|
"\n",
|
|
"Fusion retrieval represents a powerful approach to document search that combines the strengths of semantic understanding and keyword matching. By leveraging both vector-based and BM25 retrieval methods, it offers a more comprehensive and flexible solution for information retrieval tasks. This approach has potential applications in various fields where both conceptual similarity and keyword relevance are important, such as academic research, legal document search, or general-purpose search engines."
|
|
]
|
|
},
|
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from typing import List\n",
"from llama_index.core import Settings\n",
"from llama_index.core.readers import SimpleDirectoryReader\n",
"from llama_index.core.node_parser import SentenceSplitter\n",
"from llama_index.core.ingestion import IngestionPipeline\n",
"from llama_index.core.schema import BaseNode, TransformComponent\n",
"from llama_index.vector_stores.faiss import FaissVectorStore\n",
"from llama_index.core import VectorStoreIndex\n",
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
"from llama_index.legacy.retrievers.bm25_retriever import BM25Retriever\n",
"from llama_index.core.retrievers import QueryFusionRetriever\n",
"import faiss\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))  # Add the parent directory to the path since we work with notebooks\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"\n",
"# Set the OpenAI API key environment variable\n",
"os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# LlamaIndex global settings for llm and embeddings\n",
"EMBED_DIMENSION = 512\n",
"Settings.llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n",
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/\"\n",
"reader = SimpleDirectoryReader(input_dir=path, required_exts=['.pdf'])\n",
"documents = reader.load_data()\n",
"print(documents[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Vector Store"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create FaissVectorStore to store embeddings\n",
"faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
"vector_store = FaissVectorStore(faiss_index=faiss_index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text Cleaner Transformation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class TextCleaner(TransformComponent):\n",
"    \"\"\"\n",
"    Transformation to be used within the ingestion pipeline.\n",
"    Cleans clutter from the text.\n",
"    \"\"\"\n",
"    def __call__(self, nodes, **kwargs) -> List[BaseNode]:\n",
"        \n",
"        for node in nodes:\n",
"            node.text = node.text.replace('\\t', ' ')  # Replace tabs with spaces\n",
"            node.text = node.text.replace(' \\n', ' ')  # Replace paragraph separators with spaces\n",
"        \n",
"        return nodes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ingestion Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pipeline instantiation with:\n",
"# node parser, custom transformer, vector store and documents\n",
"pipeline = IngestionPipeline(\n",
"    transformations=[\n",
"        SentenceSplitter(),\n",
"        TextCleaner()\n",
"    ],\n",
"    vector_store=vector_store,\n",
"    documents=documents\n",
")\n",
"\n",
"# Run the pipeline to get nodes\n",
"nodes = pipeline.run()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrievers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### BM25 Retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# BM25 retriever over the same nodes that populate the vector store\n",
"bm25_retriever = BM25Retriever.from_defaults(\n",
"    nodes=nodes,\n",
"    similarity_top_k=2,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vector Retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build a vector index over the ingested nodes and expose it as a retriever\n",
"index = VectorStoreIndex(nodes)\n",
"vector_retriever = index.as_retriever(similarity_top_k=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fusing Both Retrievers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever = QueryFusionRetriever(\n",
"    retrievers=[\n",
"        vector_retriever,\n",
"        bm25_retriever\n",
"    ],\n",
"    retriever_weights=[\n",
"        0.6,  # vector retriever weight\n",
"        0.4   # BM25 retriever weight\n",
"    ],\n",
"    num_queries=1,\n",
"    mode='dist_based_score',\n",
"    use_async=False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"About parameters\n",
|
|
"\n",
|
|
"1. `num_queries`: Query Fusion Retriever not only combines retrievers but also can genereate multiple questions from a given query. This parameter controls how many total queries will be passed to the retrievers. Therefore setting it to 1 disables query generation and the final retriever only uses the initial query.\n",
|
|
"2. `mode`: There are 4 options for this parameter. \n",
|
|
" - **reciprocal_rerank**: Applies reciporical ranking. (Since there is no normalization, this method is not suitable for this kind of application. Beacuse different retrirevers will return score scales)\n",
|
|
" - **relative_score**: Applies MinMax based on the min and max scores among all the nodes. Then scaled to be between 0 and 1. Finally scores are weighted by the relative retrievers based on `retriever_weights`. \n",
|
|
" ```math\n",
|
|
" min\\_score = min(scores)\n",
|
|
" \\\\ max\\_score = max(scores)\n",
|
|
" ```\n",
|
|
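"    Written out, the scaling and weighting just described amount to:\n",
"    ```math\n",
"    normalized\\_score = weight \\cdot \\frac{score - min\\_score}{max\\_score - min\\_score}\n",
"    ```\n",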
" - **dist_based_score**: Only difference from `relative_score` is the MinMax sclaing is based on mean and std of the scores. Scaling and weighting is the same.\n",
|
|
" ```math\n",
|
|
" min\\_score = mean\\_score - 3 * std\\_dev\n",
|
|
" \\\\ max\\_score = mean\\_score + 3 * std\\_dev\n",
|
|
" ```\n",
|
|
" - **simple**: This method is simply takes the max score of each chunk. "
|
|
]
|
|
},
|
|
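{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalization modes illustrated\n",
"\n",
"The cell below is a standalone sketch (not part of the fusion retriever itself) of how the two normalization modes rescale raw scores before weighting. The score values are made-up examples, chosen only to show the difference in behavior."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standalone sketch of the two normalization modes described above.\n",
"# The raw scores are hypothetical example values, not retriever output.\n",
"import statistics\n",
"\n",
"scores = [0.2, 0.5, 0.9]  # example raw scores from one retriever\n",
"weight = 0.6              # this retriever's entry in retriever_weights\n",
"\n",
"# relative_score: min-max over the observed scores\n",
"lo, hi = min(scores), max(scores)\n",
"relative = [weight * (s - lo) / (hi - lo) for s in scores]\n",
"\n",
"# dist_based_score: min/max derived from mean and standard deviation\n",
"mean, std = statistics.mean(scores), statistics.stdev(scores)\n",
"lo, hi = mean - 3 * std, mean + 3 * std\n",
"dist_based = [weight * (s - lo) / (hi - lo) for s in scores]\n",
"\n",
"print(relative)    # extremes map to 0 and the full weight\n",
"print(dist_based)  # scores stay well inside the 0-to-weight range"
]
},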
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use Case example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query\n",
"query = \"What are the impacts of climate change on the environment?\"\n",
"\n",
"# Perform fusion retrieval\n",
"response = retriever.retrieve(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print Final Retrieved Nodes with Scores "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for node in response:\n",
"    print(f\"Node Score: {node.score:.2}\")\n",
"    print(f\"Node Content: {node.text}\")\n",
"    print(\"-\"*100)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}