added markdown description for each file

This commit is contained in:
nird
2024-07-23 22:52:22 +03:00
parent 8e430666c6
commit fa759f6752
10 changed files with 618 additions and 10 deletions

View File

@@ -4,16 +4,65 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hyde: Hypothetical Document Embedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Idea\n",
"# Hypothetical Document Embedding (HyDE) in Document Retrieval\n",
"\n",
"The idea is to transform each question query into a hypothetical document that contains the answer to that query, in order to bring it closer to the documents' data distribution."
"## Overview\n",
"\n",
"This code implements a Hypothetical Document Embedding (HyDE) system for document retrieval. HyDE is an innovative approach that transforms query questions into hypothetical documents containing the answer, aiming to bridge the gap between query and document distributions in vector space.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional retrieval methods often struggle with the semantic gap between short queries and longer, more detailed documents. HyDE addresses this by expanding the query into a full hypothetical document, potentially improving retrieval relevance by making the query representation more similar to the document representations in the vector space.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text chunking\n",
"2. Vector store creation using FAISS and OpenAI embeddings\n",
"3. Language model for generating hypothetical documents\n",
"4. Custom HyDERetriever class implementing the HyDE technique\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing and Vector Store Creation\n",
"\n",
"1. The PDF is processed and split into chunks.\n",
"2. A FAISS vector store is created using OpenAI embeddings for efficient similarity search.\n",
"\n",
"### Hypothetical Document Generation\n",
"\n",
"1. A language model (GPT-4) is used to generate a hypothetical document that answers the given query.\n",
"2. The generation is guided by a prompt template that ensures the hypothetical document is detailed and matches the chunk size used in the vector store.\n",
"\n",
"### Retrieval Process\n",
"\n",
"The `HyDERetriever` class implements the following steps:\n",
"\n",
"1. Generate a hypothetical document from the query using the language model.\n",
"2. Use the hypothetical document as the search query in the vector store.\n",
"3. Retrieve the most similar documents to this hypothetical document.\n",
"\n",
"## Key Features\n",
"\n",
"1. Query Expansion: Transforms short queries into detailed hypothetical documents.\n",
"2. Flexible Configuration: Allows adjustment of chunk size, overlap, and number of retrieved documents.\n",
"3. Integration with OpenAI Models: Uses GPT-4 for hypothetical document generation and OpenAI embeddings for vector representation.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Improved Relevance: By expanding queries into full documents, HyDE can potentially capture more nuanced and relevant matches.\n",
"2. Handling Complex Queries: Particularly useful for complex or multi-faceted queries that might be difficult to match directly.\n",
"3. Adaptability: The hypothetical document generation can adapt to different types of queries and document domains.\n",
"4. Potential for Better Context Understanding: The expanded query might better capture the context and intent behind the original question.\n",
"\n",
"## Implementation Details\n",
"\n",
"1. Uses OpenAI's ChatGPT model for hypothetical document generation.\n",
"2. Employs FAISS for efficient similarity search in the vector space.\n",
"3. Allows for easy visualization of both the hypothetical document and retrieved results.\n",
"\n",
"## Conclusion\n",
"\n",
"Hypothetical Document Embedding (HyDE) represents an innovative approach to document retrieval, addressing the semantic gap between queries and documents. By leveraging advanced language models to expand queries into hypothetical documents, HyDE has the potential to significantly improve retrieval relevance, especially for complex or nuanced queries. This technique could be particularly valuable in domains where understanding query intent and context is crucial, such as legal research, academic literature review, or advanced information retrieval systems."
]
},
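The three-step retrieval loop described above can be sketched without any LLM or vector-store dependency. Here `generate_fn` is a hypothetical stand-in for the GPT-4 hypothetical-document prompt and `embed_fn` for OpenAI embeddings; neither is the notebook's actual API.

```python
# Minimal HyDE sketch: rank documents against the embedding of a
# generated hypothetical answer, not the raw query.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class HyDERetriever:
    def __init__(self, docs, embed_fn, generate_fn, k=1):
        self.embed_fn, self.generate_fn, self.k = embed_fn, generate_fn, k
        self.index = [(doc, embed_fn(doc)) for doc in docs]

    def retrieve(self, query):
        # 1. Expand the query into a hypothetical answer document.
        hypothetical = self.generate_fn(query)
        # 2. Embed the hypothetical document instead of the raw query.
        qvec = self.embed_fn(hypothetical)
        # 3. Rank stored chunks by similarity to that embedding.
        ranked = sorted(self.index, key=lambda p: cosine(qvec, p[1]), reverse=True)
        return [doc for doc, _ in ranked[: self.k]], hypothetical

# Toy bag-of-words embedding and a canned "LLM" for demonstration only.
VOCAB = ["climate", "warming", "ocean", "tax"]
embed_fn = lambda text: [text.lower().split().count(w) for w in VOCAB]
generate_fn = lambda q: "climate warming drives ocean warming"

retriever = HyDERetriever(
    ["ocean warming and climate effects", "tax policy overview"],
    embed_fn, generate_fn)
docs, hyde_doc = retriever.retrieve("what about climate?")
```

Because the hypothetical document shares vocabulary with the relevant chunk, that chunk ranks first even though the original query alone is short and sparse.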
{

View File

@@ -1,5 +1,61 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Context Enrichment Window for Document Retrieval\n",
"\n",
"## Overview\n",
"\n",
"This code implements a context enrichment window technique for document retrieval in a vector database. It enhances the standard retrieval process by adding surrounding context to each retrieved chunk, improving the coherence and completeness of the returned information.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional vector search often returns isolated chunks of text, which may lack necessary context for full understanding. This approach aims to provide a more comprehensive view of the retrieved information by including neighboring text chunks.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text chunking\n",
"2. Vector store creation using FAISS and OpenAI embeddings\n",
"3. Custom retrieval function with context window\n",
"4. Comparison between standard and context-enriched retrieval\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is read and converted to a string.\n",
"2. The text is split into chunks with overlap, each chunk tagged with its index.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the chunks.\n",
"2. A FAISS vector store is created from these embeddings.\n",
"\n",
"### Context-Enriched Retrieval\n",
"\n",
"1. The `retrieve_with_context_overlap` function performs the following steps:\n",
" - Retrieves relevant chunks based on the query\n",
" - For each relevant chunk, fetches neighboring chunks\n",
" - Concatenates the chunks, accounting for overlap\n",
" - Returns the expanded context for each relevant chunk\n",
"\n",
"### Retrieval Comparison\n",
"\n",
"The notebook includes a section to compare standard retrieval with the context-enriched approach.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Provides more coherent and contextually rich results\n",
"2. Maintains the advantages of vector search while mitigating its tendency to return isolated text fragments\n",
"3. Allows for flexible adjustment of the context window size\n",
"\n",
"## Conclusion\n",
"\n",
"This context enrichment window technique offers a promising way to improve the quality of retrieved information in vector-based document search systems. By providing surrounding context, it helps maintain the coherence and completeness of the retrieved information, potentially leading to better understanding and more accurate responses in downstream tasks such as question answering."
]
},
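The neighbour-fetching and overlap-aware concatenation steps listed above can be sketched in plain Python. The chunking parameters and helper names here are illustrative, not the notebook's exact API.

```python
# Context-window enrichment sketch: given hit indices into an ordered
# chunk list, splice in neighbouring chunks, trimming the character
# overlap so the concatenation reads continuously.

def chunk_text(text, size, overlap):
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def retrieve_with_context_overlap(chunks, hit_indices, num_neighbors=1, overlap=0):
    results = []
    for idx in hit_indices:
        start = max(0, idx - num_neighbors)
        end = min(len(chunks), idx + num_neighbors + 1)
        combined = chunks[start]
        for nxt in chunks[start + 1:end]:
            combined += nxt[overlap:]  # drop the duplicated overlap region
        results.append(combined)
    return results
```

With `size=4, overlap=2` over the string `"abcdefgh"`, enriching a hit on the second chunk with one neighbour on each side reconstructs the original contiguous text.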
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -1,5 +1,60 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Contextual Compression in Document Retrieval\n",
"\n",
"## Overview\n",
"\n",
"This code demonstrates the implementation of contextual compression in a document retrieval system using LangChain and OpenAI's language models. The technique aims to improve the relevance and conciseness of retrieved information by compressing and extracting the most pertinent parts of documents in the context of a given query.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional document retrieval systems often return entire chunks or documents, which may contain irrelevant information. Contextual compression addresses this by intelligently extracting and compressing only the most relevant parts of retrieved documents, leading to more focused and efficient information retrieval.\n",
"\n",
"## Key Components\n",
"\n",
"1. Vector store creation from a PDF document\n",
"2. Base retriever setup\n",
"3. LLM-based contextual compressor\n",
"4. Contextual compression retriever\n",
"5. Question-answering chain integrating the compressed retriever\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing and Vector Store Creation\n",
"\n",
"1. The PDF is processed and encoded into a vector store using a custom `encode_pdf` function.\n",
"\n",
"### Retriever and Compressor Setup\n",
"\n",
"1. A base retriever is created from the vector store.\n",
"2. An LLM-based contextual compressor (LLMChainExtractor) is initialized using OpenAI's GPT-4 model.\n",
"\n",
"### Contextual Compression Retriever\n",
"\n",
"1. The base retriever and compressor are combined into a ContextualCompressionRetriever.\n",
"2. This retriever first fetches documents using the base retriever, then applies the compressor to extract the most relevant information.\n",
"\n",
"### Question-Answering Chain\n",
"\n",
"1. A RetrievalQA chain is created, integrating the compression retriever.\n",
"2. This chain uses the compressed and extracted information to generate answers to queries.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Improved relevance: The system returns only the most pertinent information to the query.\n",
"2. Increased efficiency: By compressing and extracting relevant parts, it reduces the amount of text the LLM needs to process.\n",
"3. Enhanced context understanding: The LLM-based compressor can understand the context of the query and extract information accordingly.\n",
"4. Flexibility: The system can be easily adapted to different types of documents and queries.\n",
"\n",
"## Conclusion\n",
"\n",
"Contextual compression in document retrieval offers a powerful way to enhance the quality and efficiency of information retrieval systems. By intelligently extracting and compressing relevant information, it provides more focused and context-aware responses to queries. This approach has potential applications in various fields requiring efficient and accurate information retrieval from large document collections."
]
},
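The two-stage fetch-then-compress flow can be sketched without an LLM. Here a simple keyword filter stands in for the LLMChainExtractor; the notebook delegates this judgement to GPT-4, so treat this as a structural sketch only.

```python
# Contextual-compression sketch: the base retriever returns whole chunks,
# then a compressor keeps only the passages that bear on the query.

def compress(document, query):
    # Keyword-overlap stand-in for the LLM-based extractor.
    query_terms = set(query.lower().split())
    kept = [s for s in document.split(". ")
            if query_terms & set(s.lower().split())]
    return ". ".join(kept)

def compression_retriever(base_retrieve, query):
    # Fetch with the base retriever, then compress each hit w.r.t. the query.
    return [c for c in (compress(d, query) for d in base_retrieve(query)) if c]

results = compression_retriever(
    lambda q: ["Solar power grew fast. Cats are mammals"],
    "solar growth",
)
```

The irrelevant sentence is stripped before the chunk ever reaches the answering LLM, which is the efficiency gain the technique is after.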
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -1,5 +1,63 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Explainable Retrieval in Document Search\n",
"\n",
"## Overview\n",
"\n",
"This code implements an Explainable Retriever, a system that not only retrieves relevant documents based on a query but also provides explanations for why each retrieved document is relevant. It combines vector-based similarity search with natural language explanations, enhancing the transparency and interpretability of the retrieval process.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional document retrieval systems often work as black boxes, providing results without explaining why they were chosen. This lack of transparency can be problematic in scenarios where understanding the reasoning behind the results is crucial. The Explainable Retriever addresses this by offering insights into the relevance of each retrieved document.\n",
"\n",
"## Key Components\n",
"\n",
"1. Vector store creation from input texts\n",
"2. Base retriever using FAISS for efficient similarity search\n",
"3. Language model (LLM) for generating explanations\n",
"4. Custom ExplainableRetriever class that combines retrieval and explanation generation\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing and Vector Store Creation\n",
"\n",
"1. Input texts are converted into embeddings using OpenAI's embedding model.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
"\n",
"### Retriever Setup\n",
"\n",
"1. A base retriever is created from the vector store, configured to return the top 5 most similar documents.\n",
"\n",
"### Explanation Generation\n",
"\n",
"1. An LLM (GPT-4) is used to generate explanations.\n",
"2. A custom prompt template is defined to guide the LLM in explaining the relevance of retrieved documents.\n",
"\n",
"### ExplainableRetriever Class\n",
"\n",
"1. Combines the base retriever and explanation generation into a single interface.\n",
"2. The `retrieve_and_explain` method:\n",
" - Retrieves relevant documents using the base retriever.\n",
" - For each retrieved document, generates an explanation of its relevance to the query.\n",
" - Returns a list of dictionaries containing both the document content and its explanation.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Transparency: Users can understand why specific documents were retrieved.\n",
"2. Trust: Explanations build user confidence in the system's results.\n",
"3. Learning: Users can gain insights into the relationships between queries and documents.\n",
"4. Debugging: Easier to identify and correct issues in the retrieval process.\n",
"5. Customization: The explanation prompt can be tailored for different use cases or domains.\n",
"\n",
"## Conclusion\n",
"\n",
"The Explainable Retriever represents a significant step towards more interpretable and trustworthy information retrieval systems. By providing natural language explanations alongside retrieved documents, it bridges the gap between powerful vector-based search techniques and human understanding. This approach has potential applications in various fields where the reasoning behind information retrieval is as important as the retrieved information itself, such as legal research, medical information systems, and educational tools."
]
},
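The `retrieve_and_explain` loop reduces to a small composition of two callables. Both `retrieve_fn` and `explain_fn` below are injected stand-ins: in the notebook the former is a FAISS retriever and the latter a GPT-4 prompt.

```python
# Explainable-retrieval sketch: pair each retrieved document with a
# generated explanation of its relevance to the query.
def retrieve_and_explain(retrieve_fn, explain_fn, query):
    return [
        {"content": doc, "explanation": explain_fn(query, doc)}
        for doc in retrieve_fn(query)
    ]

# Demonstration with trivial stand-ins.
results = retrieve_and_explain(
    retrieve_fn=lambda q: ["doc A", "doc B"],
    explain_fn=lambda q, d: f"'{d}' shares key terms with '{q}'",
    query="test query",
)
```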
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -1,5 +1,64 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fusion Retrieval in Document Search\n",
"\n",
"## Overview\n",
"\n",
"This code implements a Fusion Retrieval system that combines vector-based similarity search with keyword-based BM25 retrieval. The approach aims to leverage the strengths of both methods to improve the overall quality and relevance of document retrieval.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional retrieval methods often rely on either semantic understanding (vector-based) or keyword matching (BM25). Each approach has its strengths and weaknesses. Fusion retrieval aims to combine these methods to create a more robust and accurate retrieval system that can handle a wider range of queries effectively.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text chunking\n",
"2. Vector store creation using FAISS and OpenAI embeddings\n",
"3. BM25 index creation for keyword-based retrieval\n",
"4. Custom fusion retrieval function that combines both methods\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is loaded and split into chunks using RecursiveCharacterTextSplitter.\n",
"2. Chunks are cleaned by replacing 't' with spaces (likely addressing a specific formatting issue).\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
"\n",
"### BM25 Index Creation\n",
"\n",
"1. A BM25 index is created from the same text chunks used for the vector store.\n",
"2. This allows for keyword-based retrieval alongside the vector-based method.\n",
"\n",
"### Fusion Retrieval Function\n",
"\n",
"The `fusion_retrieval` function is the core of this implementation:\n",
"\n",
"1. It takes a query and performs both vector-based and BM25-based retrieval.\n",
"2. Scores from both methods are normalized to a common scale.\n",
"3. A weighted combination of these scores is computed (controlled by the `alpha` parameter).\n",
"4. Documents are ranked based on the combined scores, and the top-k results are returned.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Improved Retrieval Quality: By combining semantic and keyword-based search, the system can capture both conceptual similarity and exact keyword matches.\n",
"2. Flexibility: The `alpha` parameter allows for adjusting the balance between vector and keyword search based on specific use cases or query types.\n",
"3. Robustness: The combined approach can handle a wider range of queries effectively, mitigating weaknesses of individual methods.\n",
"4. Customizability: The system can be easily adapted to use different vector stores or keyword-based retrieval methods.\n",
"\n",
"## Conclusion\n",
"\n",
"Fusion retrieval represents a powerful approach to document search that combines the strengths of semantic understanding and keyword matching. By leveraging both vector-based and BM25 retrieval methods, it offers a more comprehensive and flexible solution for information retrieval tasks. This approach has potential applications in various fields where both conceptual similarity and keyword relevance are important, such as academic research, legal document search, or general-purpose search engines."
]
},
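The score-fusion step at the heart of `fusion_retrieval` can be sketched on its own: min-max normalise both score lists onto [0, 1], then blend with `alpha` (1.0 = pure vector search, 0.0 = pure BM25). Scores are passed in directly here; the notebook computes them with FAISS and a BM25 index.

```python
# Fusion-retrieval scoring sketch: normalise, blend, rank, take top-k.
def normalize(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fusion_retrieval(docs, vector_scores, bm25_scores, alpha=0.5, k=2):
    v, b = normalize(vector_scores), normalize(bm25_scores)
    combined = [alpha * vs + (1 - alpha) * bs for vs, bs in zip(v, b)]
    ranked = sorted(zip(docs, combined), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

Sweeping `alpha` shows the behaviour: at `alpha=0` the ranking is pure BM25, while a vector-leaning `alpha=0.7` lets semantic similarity dominate without discarding keyword evidence.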
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -4,7 +4,63 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using too levels of encoding: chunk level and summary level. this is the flow of this logic:"
"# Hierarchical Indices in Document Retrieval\n",
"\n",
"## Overview\n",
"\n",
"This code implements a Hierarchical Indexing system for document retrieval, utilizing two levels of encoding: document-level summaries and detailed chunks. This approach aims to improve the efficiency and relevance of information retrieval by first identifying relevant document sections through summaries, then drilling down to specific details within those sections.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional flat indexing methods can struggle with large documents or corpus, potentially missing context or returning irrelevant information. Hierarchical indexing addresses this by creating a two-tier search system, allowing for more efficient and context-aware retrieval.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text chunking\n",
"2. Asynchronous document summarization using OpenAI's GPT-4\n",
"3. Vector store creation for both summaries and detailed chunks using FAISS and OpenAI embeddings\n",
"4. Custom hierarchical retrieval function\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing and Encoding\n",
"\n",
"1. The PDF is loaded and split into documents (likely by page).\n",
"2. Each document is summarized asynchronously using GPT-4.\n",
"3. The original documents are also split into smaller, detailed chunks.\n",
"4. Two separate vector stores are created:\n",
" - One for document-level summaries\n",
" - One for detailed chunks\n",
"\n",
"### Asynchronous Processing and Rate Limiting\n",
"\n",
"1. The code uses asynchronous programming (asyncio) to improve efficiency.\n",
"2. Implements batching and exponential backoff to handle API rate limits.\n",
"\n",
"### Hierarchical Retrieval\n",
"\n",
"The `retrieve_hierarchical` function implements the two-tier search:\n",
"\n",
"1. It first searches the summary vector store to identify relevant document sections.\n",
"2. For each relevant summary, it then searches the detailed chunk vector store, filtering by the corresponding page number.\n",
"3. This approach ensures that detailed information is retrieved only from the most relevant document sections.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Improved Retrieval Efficiency: By first searching summaries, the system can quickly identify relevant document sections without processing all detailed chunks.\n",
"2. Better Context Preservation: The hierarchical approach helps maintain the broader context of retrieved information.\n",
"3. Scalability: This method is particularly beneficial for large documents or corpus, where flat searching might be inefficient or miss important context.\n",
"4. Flexibility: The system allows for adjusting the number of summaries and chunks retrieved, enabling fine-tuning for different use cases.\n",
"\n",
"## Implementation Details\n",
"\n",
"1. Asynchronous Programming: Utilizes Python's asyncio for efficient I/O operations and API calls.\n",
"2. Rate Limit Handling: Implements batching and exponential backoff to manage API rate limits effectively.\n",
"3. Persistent Storage: Saves the generated vector stores locally to avoid unnecessary recomputation.\n",
"\n",
"## Conclusion\n",
"\n",
"Hierarchical indexing represents a sophisticated approach to document retrieval, particularly suitable for large or complex document sets. By leveraging both high-level summaries and detailed chunks, it offers a balance between broad context understanding and specific information retrieval. This method has potential applications in various fields requiring efficient and context-aware information retrieval, such as legal document analysis, academic research, or large-scale content management systems."
]
},
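The two-tier lookup in `retrieve_hierarchical` can be sketched with the two vector stores abstracted behind search callables. `search_summaries` and `search_chunks` are hypothetical stand-ins for the similarity searches over the two FAISS stores.

```python
# Hierarchical-retrieval sketch: search summaries first, then search
# detailed chunks restricted to the pages those summaries came from.
def retrieve_hierarchical(query, search_summaries, search_chunks,
                          k_summaries=2, k_chunks=3):
    relevant = []
    for summary in search_summaries(query, k_summaries):
        page = summary["page"]
        # Only chunks from the page of a relevant summary are considered.
        relevant.extend(search_chunks(query, k_chunks, page_filter=page))
    return relevant

# Toy two-tier "stores" for demonstration.
summaries = [{"page": 1, "text": "summary of page 1"},
             {"page": 2, "text": "summary of page 2"}]
chunks = [{"page": 1, "text": "chunk 1a"},
          {"page": 2, "text": "chunk 2a"},
          {"page": 3, "text": "chunk 3a"}]

def search_summaries(query, k):
    return summaries[:k]  # pretend similarity ranking

def search_chunks(query, k, page_filter):
    return [c for c in chunks if c["page"] == page_filter][:k]

hits = retrieve_hierarchical("q", search_summaries, search_chunks,
                             k_summaries=1, k_chunks=2)
```

Note that the chunk on page 3 is never scored at all, which is exactly the pruning that makes the hierarchical pass cheaper than a flat search.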
{

View File

@@ -1,5 +1,81 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Query Transformations for Improved Retrieval in RAG Systems\n",
"\n",
"## Overview\n",
"\n",
"This code implements three query transformation techniques to enhance the retrieval process in Retrieval-Augmented Generation (RAG) systems:\n",
"\n",
"1. Query Rewriting\n",
"2. Step-back Prompting\n",
"3. Sub-query Decomposition\n",
"\n",
"Each technique aims to improve the relevance and comprehensiveness of retrieved information by modifying or expanding the original query.\n",
"\n",
"## Motivation\n",
"\n",
"RAG systems often face challenges in retrieving the most relevant information, especially when dealing with complex or ambiguous queries. These query transformation techniques address this issue by reformulating queries to better match relevant documents or to retrieve more comprehensive information.\n",
"\n",
"## Key Components\n",
"\n",
"1. Query Rewriting: Reformulates queries to be more specific and detailed.\n",
"2. Step-back Prompting: Generates broader queries for better context retrieval.\n",
"3. Sub-query Decomposition: Breaks down complex queries into simpler sub-queries.\n",
"\n",
"## Method Details\n",
"\n",
"### 1. Query Rewriting\n",
"\n",
"- **Purpose**: To make queries more specific and detailed, improving the likelihood of retrieving relevant information.\n",
"- **Implementation**:\n",
" - Uses a GPT-4 model with a custom prompt template.\n",
" - Takes the original query and reformulates it to be more specific and detailed.\n",
"\n",
"### 2. Step-back Prompting\n",
"\n",
"- **Purpose**: To generate broader, more general queries that can help retrieve relevant background information.\n",
"- **Implementation**:\n",
" - Uses a GPT-4 model with a custom prompt template.\n",
" - Takes the original query and generates a more general \"step-back\" query.\n",
"\n",
"### 3. Sub-query Decomposition\n",
"\n",
"- **Purpose**: To break down complex queries into simpler sub-queries for more comprehensive information retrieval.\n",
"- **Implementation**:\n",
" - Uses a GPT-4 model with a custom prompt template.\n",
" - Decomposes the original query into 2-4 simpler sub-queries.\n",
"\n",
"## Benefits of these Approaches\n",
"\n",
"1. **Improved Relevance**: Query rewriting helps in retrieving more specific and relevant information.\n",
"2. **Better Context**: Step-back prompting allows for retrieval of broader context and background information.\n",
"3. **Comprehensive Results**: Sub-query decomposition enables retrieval of information that covers different aspects of a complex query.\n",
"4. **Flexibility**: Each technique can be used independently or in combination, depending on the specific use case.\n",
"\n",
"## Implementation Details\n",
"\n",
"- All techniques use OpenAI's GPT-4 model for query transformation.\n",
"- Custom prompt templates are used to guide the model in generating appropriate transformations.\n",
"- The code provides separate functions for each transformation technique, allowing for easy integration into existing RAG systems.\n",
"\n",
"## Example Use Case\n",
"\n",
"The code demonstrates each technique using the example query:\n",
"\"What are the impacts of climate change on the environment?\"\n",
"\n",
"- **Query Rewriting** expands this to include specific aspects like temperature changes and biodiversity.\n",
"- **Step-back Prompting** generalizes it to \"What are the general effects of climate change?\"\n",
"- **Sub-query Decomposition** breaks it down into questions about biodiversity, oceans, weather patterns, and terrestrial environments.\n",
"\n",
"## Conclusion\n",
"\n",
"These query transformation techniques offer powerful ways to enhance the retrieval capabilities of RAG systems. By reformulating queries in various ways, they can significantly improve the relevance, context, and comprehensiveness of retrieved information. These methods are particularly valuable in domains where queries can be complex or multifaceted, such as scientific research, legal analysis, or comprehensive fact-finding tasks."
]
},
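Since the three techniques differ only in the prompt sent to the model, they can be sketched as three thin wrappers around one `llm` callable (`prompt -> text`). The prompt strings below paraphrase the templates described above and are illustrative, not the notebook's exact wording.

```python
# Query-transformation sketches: each function is a different prompt
# around the same injected LLM callable.
def rewrite_query(llm, query):
    return llm(f"Rewrite this query to be more specific and detailed: {query}")

def step_back_query(llm, query):
    return llm(f"Generate a broader, more general version of this query: {query}")

def decompose_query(llm, query):
    text = llm(f"Break this query into 2-4 simpler sub-queries, one per line: {query}")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Any `llm` callable can be plugged in, which also makes the transformations trivial to unit-test with a canned response before wiring in a real model.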
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -1,5 +1,78 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Semantic Chunking for Document Processing\n",
"\n",
"## Overview\n",
"\n",
"This code implements a semantic chunking approach for processing and retrieving information from PDF documents. Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Semantic chunking using LangChain's SemanticChunker\n",
"3. Vector store creation using FAISS and OpenAI embeddings\n",
"4. Retriever setup for querying the processed documents\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is read and converted to a string using a custom `read_pdf_to_string` function.\n",
"\n",
"### Semantic Chunking\n",
"\n",
"1. Utilizes LangChain's `SemanticChunker` with OpenAI embeddings.\n",
"2. Three breakpoint types are available:\n",
" - 'percentile': Splits at differences greater than the X percentile.\n",
" - 'standard_deviation': Splits at differences greater than X standard deviations.\n",
" - 'interquartile': Uses the interquartile distance to determine split points.\n",
"3. In this implementation, the 'percentile' method is used with a threshold of 90.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the semantic chunks.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
"\n",
"### Retriever Setup\n",
"\n",
"1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.\n",
"\n",
"## Key Features\n",
"\n",
"1. Context-Aware Splitting: Attempts to maintain semantic coherence within chunks.\n",
"2. Flexible Configuration: Allows for different breakpoint types and thresholds.\n",
"3. Integration with Advanced NLP Tools: Uses OpenAI embeddings for both chunking and retrieval.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Improved Coherence: Chunks are more likely to contain complete thoughts or ideas.\n",
"2. Better Retrieval Relevance: By preserving context, retrieval accuracy may be enhanced.\n",
"3. Adaptability: The chunking method can be adjusted based on the nature of the documents and retrieval needs.\n",
"4. Potential for Better Understanding: LLMs or downstream tasks may perform better with more coherent text segments.\n",
"\n",
"## Implementation Details\n",
"\n",
"1. Uses OpenAI's embeddings for both the semantic chunking process and the final vector representations.\n",
"2. Employs FAISS for creating an efficient searchable index of the chunks.\n",
"3. The retriever is set up to return the top 2 most relevant chunks, which can be adjusted as needed.\n",
"\n",
"## Example Usage\n",
"\n",
"The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how the semantic chunking and retrieval system can be used to find relevant information from the processed document.\n",
"\n",
"## Conclusion\n",
"\n",
"Semantic chunking represents an advanced approach to document processing for retrieval systems. By attempting to maintain semantic coherence within text segments, it has the potential to improve the quality of retrieved information and enhance the performance of downstream NLP tasks. This technique is particularly valuable for processing long, complex documents where maintaining context is crucial, such as scientific papers, legal documents, or comprehensive reports."
]
},
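The percentile breakpointing described above can be sketched end to end: embed consecutive sentences, measure the distance between neighbours, and split wherever the distance exceeds the chosen percentile. The toy bag-of-words `embed` stands in for OpenAI embeddings, and the percentile interpolation mirrors the usual numpy-style definition.

```python
# Semantic-chunking sketch with the 'percentile' breakpoint type.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm if norm else 1.0

def percentile(values, pct):
    ordered = sorted(values)
    pos = pct / 100 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def semantic_chunks(sentences, embed, pct=90):
    vecs = [embed(s) for s in sentences]
    dists = [cosine_distance(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    threshold = percentile(dists, pct)
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(dists):
        if d > threshold:  # semantic jump: start a new chunk here
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

# Toy embedding for demonstration only.
VOCAB = ["cat", "tax", "sat", "ran", "law", "code"]
def embed(sentence):
    words = sentence.split()
    return [words.count(w) for w in VOCAB]

chunks = semantic_chunks(["cat sat", "cat ran", "tax law", "tax code"], embed)
```

The split lands exactly where the topic shifts from cats to tax, because that neighbour pair has the largest embedding distance.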
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -1,5 +1,74 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple RAG (Retrieval-Augmented Generation) System\n",
"\n",
"## Overview\n",
"\n",
"This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text extraction\n",
"2. Text chunking for manageable processing\n",
"3. Vector store creation using FAISS and OpenAI embeddings\n",
"4. Retriever setup for querying the processed documents\n",
"5. Evaluation of the RAG system\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is loaded using PyPDFLoader.\n",
"2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.\n",
"\n",
"### Text Cleaning\n",
"\n",
"A custom function `replace_t_with_space` is applied to clean the text chunks. This likely addresses specific formatting issues in the PDF.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
"\n",
"### Retriever Setup\n",
"\n",
"1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.\n",
"\n",
"### Encoding Function\n",
"\n",
"The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.\n",
"\n",
"## Key Features\n",
"\n",
"1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.\n",
"2. Configurable Chunking: Allows adjustment of chunk size and overlap.\n",
"3. Efficient Retrieval: Uses FAISS for fast similarity search.\n",
"4. Evaluation: Includes a function to evaluate the RAG system's performance.\n",
"\n",
"## Usage Example\n",
"\n",
"The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n",
"\n",
"## Evaluation\n",
"\n",
"The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Scalability: Can handle large documents by processing them in chunks.\n",
"2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
"3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
"4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
"\n",
"## Conclusion\n",
"\n",
"This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections."
]
},
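The whole simple-RAG flow can be sketched with the external pieces (PyPDFLoader, FAISS, OpenAI embeddings) replaced by in-memory stand-ins: chunk with overlap, embed, and retrieve the top-k chunks by cosine similarity. Function names mirror the notebook's roles, not its exact API.

```python
# Simple RAG pipeline sketch: chunk -> embed -> similarity retrieval.
import math

def split_text(text, chunk_size, overlap):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text):
    # Toy character-frequency embedding; a stand-in for real embeddings.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def encode_text(text, chunk_size=40, overlap=10):
    # Analogue of the notebook's encode_pdf: chunk and index in one step.
    chunks = split_text(text, chunk_size, overlap)
    return [(c, embed(c)) for c in chunks]

def retrieve(store, query, k=2):
    qvec = embed(query)
    ranked = sorted(store, key=lambda p: cosine(qvec, p[1]), reverse=True)
    return [c for c, _ in ranked[:k]]
```

Swapping the toy `embed` for a real embedding model and the list `store` for a FAISS index recovers the notebook's architecture without changing the control flow.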
{
"cell_type": "markdown",
"metadata": {},

View File

@@ -1,5 +1,62 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deep Evaluation of RAG Systems using deepeval\n",
"\n",
"## Overview\n",
"\n",
"This code demonstrates the use of the `deepeval` library to perform comprehensive evaluations of Retrieval-Augmented Generation (RAG) systems. It covers various evaluation metrics and provides a framework for creating and running test cases.\n",
"\n",
"## Key Components\n",
"\n",
"1. Correctness Evaluation\n",
"2. Faithfulness Evaluation\n",
"3. Contextual Relevancy Evaluation\n",
"4. Combined Evaluation of Multiple Metrics\n",
"5. Batch Test Case Creation\n",
"\n",
"## Evaluation Metrics\n",
"\n",
"### 1. Correctness (GEval)\n",
"\n",
"- Evaluates whether the actual output is factually correct based on the expected output.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Compares the expected and actual outputs.\n",
"\n",
"### 2. Faithfulness (FaithfulnessMetric)\n",
"\n",
"- Assesses whether the generated answer is faithful to the provided context.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Can provide detailed reasons for the evaluation.\n",
"\n",
"### 3. Contextual Relevancy (ContextualRelevancyMetric)\n",
"\n",
"- Evaluates how relevant the retrieved context is to the question and answer.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Can provide detailed reasons for the evaluation.\n",
"\n",
"## Key Features\n",
"\n",
"1. Flexible Metric Configuration: Each metric can be customized with different models and parameters.\n",
"2. Multi-Metric Evaluation: Ability to evaluate test cases using multiple metrics simultaneously.\n",
"3. Batch Test Case Creation: Utility function to create multiple test cases efficiently.\n",
"4. Detailed Feedback: Options to include detailed reasons for evaluation results.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Comprehensive Evaluation: Covers multiple aspects of RAG system performance.\n",
"2. Flexibility: Easy to add or modify evaluation metrics and test cases.\n",
"3. Scalability: Capable of handling multiple test cases and metrics efficiently.\n",
"4. Interpretability: Provides detailed reasons for evaluation results, aiding in system improvement.\n",
"\n",
"## Conclusion\n",
"\n",
"This deep evaluation approach using the `deepeval` library offers a robust framework for assessing the performance of RAG systems. By evaluating correctness, faithfulness, and contextual relevancy, it provides a multi-faceted view of system performance. This comprehensive evaluation is crucial for identifying areas of improvement and ensuring the reliability and effectiveness of RAG systems in real-world applications."
]
},
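The batch test-case creation utility reduces to zipping parallel lists of questions, gold answers, generated answers, and retrieved contexts into one record per case. In the notebook the records are `deepeval` `LLMTestCase` objects; plain dicts stand in here so the shape of the helper is visible without the library.

```python
# Batch test-case creation sketch (dicts standing in for LLMTestCase).
def create_test_cases(questions, expected_outputs, actual_outputs, contexts):
    return [
        {
            "input": q,
            "expected_output": e,
            "actual_output": a,
            "retrieval_context": c,
        }
        for q, e, a, c in zip(questions, expected_outputs, actual_outputs, contexts)
    ]

cases = create_test_cases(
    ["What causes rain?"],
    ["Water vapour condensing."],
    ["Condensation of water vapour."],
    [["Rain forms when water vapour condenses."]],
)
```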
{
"cell_type": "code",
"execution_count": 18,