Merge branch 'main' of https://github.com/EliavSh/RAG_Techniques into converting_nb_to_runnable_scripts
# Conflicts:
#	all_rag_techniques/raptor.ipynb
#	all_rag_techniques_runnable_scripts/semantic_chunking.py
#	all_rag_techniques_runnable_scripts/simple_rag.py
@@ -6,7 +6,7 @@ Welcome to the world's largest and most comprehensive repository of Retrieval-Au

We have a vibrant Discord community where contributors can discuss ideas, ask questions, and collaborate on RAG techniques. Join us at:

[RAG Techniques Discord Server](https://discord.gg/cA6Aa4uyDX)

Don't hesitate to introduce yourself and share your thoughts!
@@ -96,6 +96,16 @@ This process ensures consistency in our visual representations and makes it easy

8. **References:** Include relevant citations or resources if you have any.

## Notebook Best Practices

To ensure consistency and readability across all notebooks:

1. **Code Cell Descriptions:** Each code cell should be preceded by a markdown cell with a clear, concise title describing the cell's content or purpose.

2. **Clear Unnecessary Outputs:** Before committing your notebook, clear all unnecessary cell outputs. This helps reduce file size and avoids confusion from outdated results.

3. **Consistent Formatting:** Maintain consistent formatting throughout the notebook, including regular use of markdown headers, code comments, and proper indentation.
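The `jupyter nbconvert --clear-output --inplace your_notebook.ipynb` command does this in one step. As a minimal dependency-free sketch, outputs can also be stripped with the standard library, since a notebook file is just JSON:

```python
import json

def clear_outputs(nb: dict) -> dict:
    """Strip outputs and execution counts from every code cell in a notebook dict."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Example usage (paths are illustrative):
# with open("notebook.ipynb") as f:
#     nb = clear_outputs(json.load(f))
# with open("notebook.ipynb", "w") as f:
#     json.dump(nb, f, indent=1)
```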
## Code Quality and Readability

To ensure the highest quality and readability of our code:
README.md
@@ -2,10 +2,13 @@

[](http://makeapullrequest.com)
[](https://www.linkedin.com/in/nir-diamant-759323134/)
[](https://twitter.com/NirDiamantAI)
[](https://discord.gg/cA6Aa4uyDX)

<a href="https://app.commanddash.io/agent/github_NirDiamant_RAG_Techniques"><img src="https://img.shields.io/badge/AI-Code%20Agent-EB9FDA"></a>

> 🌟 **Support This Project:** Your sponsorship fuels innovation in RAG technologies. **[Become a sponsor](https://github.com/sponsors/NirDiamant)** to help maintain and expand this valuable resource!

# Advanced RAG Techniques: Elevating Your Retrieval-Augmented Generation Systems 🚀

Welcome to one of the most comprehensive and dynamic collections of Retrieval-Augmented Generation (RAG) tutorials available today. This repository serves as a hub for cutting-edge techniques aimed at enhancing the accuracy, efficiency, and contextual richness of RAG systems.
@@ -20,11 +23,11 @@ Our goal is to provide a valuable resource for researchers and practitioners loo

This repository thrives on community contributions! Join our Discord community — the central hub for discussing and managing contributions to this project:

**[RAG Techniques Discord Community](https://discord.gg/cA6Aa4uyDX)**

Whether you're an expert or just starting out, your insights can shape the future of RAG. Join us to propose ideas, get feedback, and collaborate on innovative techniques. For contribution guidelines, please refer to our **[CONTRIBUTING.md](https://github.com/NirDiamant/RAG_Techniques/blob/main/CONTRIBUTING.md)** file. Let's advance RAG technology together!

🔗 For discussions on GenAI, RAG, or custom agents, or to explore knowledge-sharing opportunities, feel free to **[connect on LinkedIn](https://www.linkedin.com/in/nir-diamant-759323134/)**.

## Key Features
@@ -37,200 +40,262 @@ Whether you're an expert or just starting out, your insights can shape the futur

Explore the extensive list of cutting-edge RAG techniques:

### 🌱 Foundational RAG Techniques

1. Simple RAG 🌱
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag.ipynb)**
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag_with_llamaindex.ipynb)**
- **[Runnable Script](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/simple_rag.py)**

#### Overview 🔎
Introducing basic RAG techniques ideal for newcomers.

#### Implementation 🛠️
Start with basic retrieval queries and integrate incremental learning mechanisms.

2. Simple RAG using a CSV file 🧩
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_csv_rag.ipynb)**
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_csv_rag_with_llamaindex.ipynb)**

#### Overview 🔎
Introducing basic RAG using CSV files.

#### Implementation 🛠️
Uses CSV files to build a basic retrieval pipeline and integrates with OpenAI to create a question-answering system.
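The simple RAG flow above (embed chunks, retrieve the most similar ones, pass them to a model as context) can be sketched without any framework. This toy version substitutes a bag-of-words counter for a real embedding model and stops at building the prompt rather than calling an API:

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    # Rank chunks by similarity to the query and keep the top k as context.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The Eiffel Tower is located in Paris, France.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Paris is the capital of France.",
]
question = "Where is the Eiffel Tower?"
context = retrieve(question, chunks)
# The prompt below would be sent to an LLM (e.g. an OpenAI chat model) for the answer.
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\nQuestion: {question}"
```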
3. **[Reliable RAG 🏷️](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/reliable_rag.ipynb)**

#### Overview 🔎
Enhances the Simple RAG by adding validation and refinement to ensure the accuracy and relevance of retrieved information.

#### Implementation 🛠️
Check for retrieved document relevancy and highlight the segment of docs used for answering.
4. **[Choose Chunk Size 📏](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/choose_chunk_size.ipynb)**

#### Overview 🔎
Selecting an appropriate fixed size for text chunks to balance context preservation and retrieval efficiency.

#### Implementation 🛠️
Experiment with different chunk sizes to find the optimal balance between preserving context and maintaining retrieval speed for your specific use case.
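A minimal fixed-size chunker with overlap might look like this (illustrative only; `chunk_size` and `overlap` are the parameters this technique experiments with):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into fixed-size character chunks, with overlap between
    neighbouring chunks so that context is not cut off at hard boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size]]
```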
5. **[Proposition Chunking ⛓️💥](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/proposition_chunking.ipynb)**

#### Overview 🔎
Breaking down the text into concise, complete, meaningful sentences, allowing for better control and handling of specific queries (especially extracting knowledge).

#### Implementation 🛠️
- 💪 **Proposition Generation:** The LLM is used in conjunction with a custom prompt to generate factual statements from the document chunks.
- ✅ **Quality Checking:** The generated propositions are passed through a grading system that evaluates accuracy, clarity, completeness, and conciseness.

#### Additional Resources 📚
- **[The Propositions Method: Enhancing Information Retrieval for AI Systems](https://medium.com/@nirdiamant21/the-propositions-method-enhancing-information-retrieval-for-ai-systems-c5ed6e5a4d2e)** - A comprehensive blog post exploring the benefits and implementation of proposition chunking in RAG systems.

### 🔍 Query Enhancement

6. **[Query Transformations 🔄](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/query_transformations.ipynb)**

#### Overview 🔎
Modifying and expanding queries to improve retrieval effectiveness.

#### Implementation 🛠️
- ✍️ **Query Rewriting:** Reformulate queries to improve retrieval.
- 🔙 **Step-back Prompting:** Generate broader queries for better context retrieval.
- 🧩 **Sub-query Decomposition:** Break complex queries into simpler sub-queries.
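Sub-query decomposition is normally done with an LLM prompt; purely as an illustration of the idea, a naive rule-based splitter might look like this:

```python
import re

def decompose(query: str) -> list:
    """Naive sub-query decomposition: split a compound question on
    coordinating conjunctions and question marks. A real system would
    ask an LLM to do this step."""
    parts = re.split(r"\band\b|\balso\b|[;?]", query)
    subs = [p.strip(" ,.") for p in parts if p.strip(" ,.")]
    # Re-terminate each sub-query as a question.
    return [s if s.endswith("?") else s + "?" for s in subs]
```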
7. **[Hypothetical Questions (HyDE Approach) ❓](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/HyDe_Hypothetical_Document_Embedding.ipynb)**

#### Overview 🔎
Generating hypothetical questions to improve alignment between queries and data.

#### Implementation 🛠️
Create hypothetical questions that point to relevant locations in the data, enhancing query-data matching.
### 📚 Context and Content Enrichment

8. Context Enrichment Techniques 📝
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/context_enrichment_window_around_chunk.ipynb)**
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/context_enrichment_window_around_chunk_with_llamaindex.ipynb)**

#### Overview 🔎
Enhancing retrieval accuracy by embedding individual sentences and extending context to neighboring sentences.

#### Implementation 🛠️
Retrieve the most relevant sentence while also accessing the sentences before and after it in the original text.
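The window idea reduces to simple index arithmetic once the best-matching chunk is known. A minimal sketch, assuming chunks are stored in document order:

```python
def retrieve_with_window(chunks: list, best_index: int, window: int = 1) -> str:
    """Return the best-matching chunk plus `window` neighbouring chunks on
    each side, preserving the original document order."""
    start = max(0, best_index - window)
    end = min(len(chunks), best_index + window + 1)
    return " ".join(chunks[start:end])
```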
9. Semantic Chunking 🧠
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/semantic_chunking.ipynb)**
- **[Runnable Script](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/semantic_chunking.py)**

#### Overview 🔎
Dividing documents based on semantic coherence rather than fixed sizes.

#### Implementation 🛠️
Use NLP techniques to identify topic boundaries or coherent sections within documents for more meaningful retrieval units.

#### Additional Resources 📚
- **[Semantic Chunking: Improving AI Information Retrieval](https://medium.com/@nirdiamant21/semantic-chunking-improving-ai-information-retrieval-2f468be2d707)** - A comprehensive blog post exploring the benefits and implementation of semantic chunking in RAG systems.
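As a toy illustration of the boundary-detection idea: real implementations compare sentence embeddings, but word-overlap (Jaccard) similarity shows the mechanics of starting a new chunk where coherence drops:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two sentences (toy stand-in for embedding similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences: list, threshold: float = 0.2) -> list:
    """Group consecutive sentences; start a new chunk wherever similarity
    between neighbouring sentences falls below the threshold."""
    groups = [[sentences[0]]] if sentences else []
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) < threshold:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return groups
```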
10. **[Contextual Compression 🗜️](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/contextual_compression.ipynb)**

#### Overview 🔎
Compressing retrieved information while preserving query-relevant content.

#### Implementation 🛠️
Use an LLM to compress or summarize retrieved chunks, preserving key information relevant to the query.
11. **[Document Augmentation through Question Generation for Enhanced Retrieval](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/document_augmentation.ipynb)**

#### Overview 🔎
This implementation demonstrates a text augmentation technique that leverages additional question generation to improve document retrieval within a vector database. By generating and incorporating various questions related to each text fragment, the system enhances the standard retrieval process, thus increasing the likelihood of finding relevant documents that can be utilized as context for generative question answering.

#### Implementation 🛠️
Use an LLM to augment the text dataset with all possible questions that can be asked about each document.
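A sketch of the augmentation step, with a stand-in `generate_questions` function where a real pipeline would make an LLM call:

```python
def augment_with_questions(docs: list, generate_questions) -> list:
    """Index each document together with generated questions about it, so a
    user query can match a question even when it shares few words with the
    document itself. `generate_questions` is a placeholder for an LLM call."""
    index = []
    for i, doc in enumerate(docs):
        index.append({"doc_id": i, "text": doc, "kind": "document"})
        for q in generate_questions(doc):
            # Question entries point back to the source document via doc_id.
            index.append({"doc_id": i, "text": q, "kind": "question"})
    return index
```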
### 🚀 Advanced Retrieval Methods

12. Fusion Retrieval 🔗
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/fusion_retrieval.ipynb)**
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/fusion_retrieval_with_llamaindex.ipynb)**

#### Overview 🔎
Optimizing search results by combining different retrieval methods.

#### Implementation 🛠️
Combine keyword-based search with vector-based search for more comprehensive and accurate retrieval.
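One common way to fuse a keyword ranking with a vector ranking is reciprocal rank fusion (RRF); a minimal sketch (`k=60` is the constant commonly used in the RRF literature):

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked result lists (e.g. one from BM25 keyword search,
    one from vector search) into a single list, scoring each document as
    the sum of 1 / (k + rank) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```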
13. Intelligent Reranking 📈
- **[LangChain](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/reranking.ipynb)**
- **[LlamaIndex](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/reranking_with_llamaindex.ipynb)**

#### Overview 🔎
Applying advanced scoring mechanisms to improve the relevance ranking of retrieved results.

#### Implementation 🛠️
- 🧠 **LLM-based Scoring:** Use a language model to score the relevance of each retrieved chunk.
- 🔀 **Cross-Encoder Models:** Re-encode both the query and retrieved documents jointly for similarity scoring.
- 🏆 **Metadata-enhanced Ranking:** Incorporate metadata into the scoring process for more nuanced ranking.

#### Additional Resources 📚
- **[Relevance Revolution: How Re-ranking Transforms RAG Systems](https://medium.com/@nirdiamant21/relevance-revolution-how-re-ranking-transforms-rag-systems-0ffaa15f1047)** - A comprehensive blog post exploring the power of re-ranking in enhancing RAG system performance.
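As an illustrative toy of metadata-enhanced re-ranking (a cross-encoder or LLM score would replace the term-overlap score in practice; the `text`/`year` fields are hypothetical):

```python
def rerank(query: str, docs: list, recent_boost: float = 0.1,
           cutoff_year: int = 2023) -> list:
    """Re-order candidate documents by query-term overlap plus a small
    boost for recent documents (metadata-enhanced ranking)."""
    q = set(query.lower().split())
    def score(doc):
        overlap = len(q & set(doc["text"].lower().split())) / (len(q) or 1)
        boost = recent_boost if doc.get("year", 0) >= cutoff_year else 0.0
        return overlap + boost
    return sorted(docs, key=score, reverse=True)
```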
14. Multi-faceted Filtering 🔍

#### Overview 🔎
Applying various filtering techniques to refine and improve the quality of retrieved results.

#### Implementation 🛠️
- 🏷️ **Metadata Filtering:** Apply filters based on attributes like date, source, author, or document type.
- 📊 **Similarity Thresholds:** Set thresholds for relevance scores to keep only the most pertinent results.
- 📄 **Content Filtering:** Remove results that don't match specific content criteria or essential keywords.
- 🌈 **Diversity Filtering:** Ensure result diversity by filtering out near-duplicate entries.
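The filters above compose naturally into one pass; a minimal sketch over hypothetical result dicts with `text`, `score`, and `source` fields:

```python
def filter_results(results, min_score=0.5, allowed_sources=None):
    """Apply metadata, similarity-threshold, and diversity filters: drop
    low-score hits, keep only allowed sources, and de-duplicate
    near-identical texts (highest-scoring copy wins)."""
    seen = set()
    kept = []
    for r in sorted(results, key=lambda r: r["score"], reverse=True):
        if r["score"] < min_score:
            continue  # similarity threshold
        if allowed_sources is not None and r.get("source") not in allowed_sources:
            continue  # metadata filter
        key = " ".join(r["text"].lower().split())  # crude near-duplicate key
        if key in seen:
            continue  # diversity filter
        seen.add(key)
        kept.append(r)
    return kept
```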
15. **[Hierarchical Indices 🗂️](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/hierarchical_indices.ipynb)**

#### Overview 🔎
Creating a multi-tiered system for efficient information navigation and retrieval.

#### Implementation 🛠️
Implement a two-tiered system for document summaries and detailed chunks, both containing metadata pointing to the same location in the data.

#### Additional Resources 📚
- **[Hierarchical Indices: Enhancing RAG Systems](https://medium.com/@nirdiamant21/hierarchical-indices-enhancing-rag-systems-43c06330c085?sk=d5f97cbece2f640da8746f8da5f95188)** - A comprehensive blog post exploring the power of hierarchical indices in enhancing RAG system performance.
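A two-tier lookup can be sketched with plain term overlap standing in for embedding similarity (the `summaries`/`chunks` structures here are illustrative):

```python
def two_tier_search(query: str, summaries: dict, chunks: dict, top_docs: int = 1) -> list:
    """Two-tier retrieval: match the query against document summaries first,
    then search only the chunks of the best-matching documents."""
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    best_docs = sorted(summaries, key=lambda d: overlap(query, summaries[d]),
                       reverse=True)[:top_docs]
    candidates = [c for d in best_docs for c in chunks[d]]
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)
```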
16. Ensemble Retrieval 🎭

#### Overview 🔎
Combining multiple retrieval models or techniques for more robust and accurate results.

#### Implementation 🛠️
Apply different embedding models or retrieval algorithms and use voting or weighting mechanisms to determine the final set of retrieved documents.

17. Multi-modal Retrieval 📽️

#### Overview 🔎
Extending RAG capabilities to handle diverse data types for richer responses.

#### Implementation 🛠️
Integrate models that can retrieve and understand different data modalities, combining insights from text, images, and videos.

### 🔁 Iterative and Adaptive Techniques
18. **[Retrieval with Feedback Loops 🔁](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/retrieval_with_feedback_loop.ipynb)**

#### Overview 🔎
Implementing mechanisms to learn from user interactions and improve future retrievals.

#### Implementation 🛠️
Collect and utilize user feedback on the relevance and quality of retrieved documents and generated responses to fine-tune retrieval and ranking models.

19. **[Adaptive Retrieval 🎯](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/adaptive_retrieval.ipynb)**

#### Overview 🔎
Dynamically adjusting retrieval strategies based on query types and user contexts.

#### Implementation 🛠️
Classify queries into different categories and use tailored retrieval strategies for each, considering user context and preferences.
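A minimal sketch of the classify-then-dispatch idea behind adaptive retrieval; the keyword rules and strategy table below are placeholders for the LLM classifier and tuned strategies a real system would use:

```python
def classify_query(query: str) -> str:
    """Crude rule-based query classifier (an LLM would do this in practice)."""
    q = query.lower()
    if any(w in q for w in ("compare", "versus", " vs ")):
        return "analytical"
    if q.split() and q.split()[0] in ("who", "when", "where"):
        return "factual"
    return "contextual"

def adaptive_retrieve(query: str) -> dict:
    """Dispatch to a retrieval strategy based on the query class: factual
    queries get few precise chunks, analytical ones a broader sweep."""
    strategy = {
        "factual": {"top_k": 2, "rerank": True},
        "analytical": {"top_k": 8, "rerank": True},
        "contextual": {"top_k": 4, "rerank": False},
    }[classify_query(query)]
    return {"query": query, **strategy}
```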
20. Iterative Retrieval 🔄

#### Overview 🔎
Performing multiple rounds of retrieval to refine and enhance result quality.

#### Implementation 🛠️
Use the LLM to analyze initial results and generate follow-up queries to fill in gaps or clarify information.
### 🔬 Explainability and Transparency

21. **[Explainable Retrieval 🔍](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/explainable_retrieval.ipynb)**

#### Overview 🔎
Providing transparency in the retrieval process to enhance user trust and system refinement.

#### Implementation 🛠️
Explain why certain pieces of information were retrieved and how they relate to the query.
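A toy version that attaches an explanation to each retrieved chunk; real systems often ask an LLM to justify the match, whereas here the explanation is simply the set of matched query terms:

```python
def explain_match(query: str, chunk: str) -> dict:
    """Return a retrieved chunk along with which query terms it contains
    and a crude overlap score, so the user can see why it was retrieved."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    matched = sorted(q_terms & c_terms)
    return {
        "chunk": chunk,
        "matched_terms": matched,
        "score": len(matched) / len(q_terms) if q_terms else 0.0,
    }
```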
### 🏗️ Advanced Architectures

22. **[Knowledge Graph Integration (Graph RAG) 🕸️](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/graph_rag.ipynb)**

#### Overview 🔎
Incorporating structured data from knowledge graphs to enrich context and improve retrieval.

#### Implementation 🛠️
Retrieve entities and their relationships from a knowledge graph relevant to the query, combining this structured data with unstructured text for more informative responses.

23. **[RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval 🌳](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/raptor.ipynb)**

#### Overview 🔎
Implementing a recursive approach to process and organize retrieved information in a tree structure.

#### Implementation 🛠️
Use abstractive summarization to recursively process and summarize retrieved documents, organizing the information in a tree structure for hierarchical context.

24. **[Self RAG 🔁](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/self_rag.ipynb)**

#### Overview 🔎
A dynamic approach that combines retrieval-based and generation-based methods, adaptively deciding whether to use retrieved information and how to best utilize it in generating responses.

#### Implementation 🛠️
Implement a multi-step process including retrieval decision, document retrieval, relevance evaluation, response generation, support assessment, and utility evaluation to produce accurate, relevant, and useful outputs.

25. **[Corrective RAG 🔧](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/crag.ipynb)**

#### Overview 🔎
A sophisticated RAG approach that dynamically evaluates and corrects the retrieval process, combining vector databases, web search, and language models for highly accurate and context-aware responses.

#### Implementation 🛠️
Integrate Retrieval Evaluator, Knowledge Refinement, Web Search Query Rewriter, and Response Generator components to create a system that adapts its information sourcing strategy based on relevance scores and combines multiple sources when necessary.
## 🌟 Special Advanced Technique 🌟

26. **[Sophisticated Controllable Agent for Complex RAG Tasks 🤖](https://github.com/NirDiamant/Controllable-RAG-Agent)**

#### Overview 🔎
An advanced RAG solution designed to tackle complex questions that simple semantic similarity-based retrieval cannot solve. This approach uses a sophisticated deterministic graph as the "brain" 🧠 of a highly controllable autonomous agent, capable of answering non-trivial questions from your own data.

#### Implementation 🛠️
Implement a multi-step process involving question anonymization, high-level planning, task breakdown, adaptive information retrieval and question answering, continuous re-planning, and rigorous answer verification to ensure grounded and accurate responses.
@@ -258,6 +323,10 @@ We welcome contributions from the community! If you have a new technique or impr

4. Push to the branch: `git push origin feature/AmazingFeature`
5. Open a pull request

## Contributors

[](https://github.com/NirDiamant/RAG_Techniques/graphs/contributors)

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
@@ -75,6 +75,16 @@
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/hyde-advantages.svg\" alt=\"HyDe\" style=\"width:100%; height:auto;\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -56,6 +56,16 @@
"This context enrichment window technique offers a promising way to improve the quality of retrieved information in vector-based document search systems. By providing surrounding context, it helps maintain the coherence and completeness of the retrieved information, potentially leading to better understanding and more accurate responses in downstream tasks such as question answering."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/vector-search-comparison_context_enrichment.svg\" alt=\"context enrichment window\" style=\"width:70%; height:auto;\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -0,0 +1,335 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Context Enrichment Window for Document Retrieval\n",
"\n",
"## Overview\n",
"\n",
"This code implements a context enrichment window technique for document retrieval in a vector database. It enhances the standard retrieval process by adding surrounding context to each retrieved chunk, improving the coherence and completeness of the returned information.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional vector search often returns isolated chunks of text, which may lack necessary context for full understanding. This approach aims to provide a more comprehensive view of the retrieved information by including neighboring text chunks.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text chunking\n",
"2. Vector store creation using FAISS and OpenAI embeddings\n",
"3. Custom retrieval function with context window\n",
"4. Comparison between standard and context-enriched retrieval\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is read and converted to a string.\n",
"2. The text is split into chunks that retain their surrounding sentences.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the chunks.\n",
"2. A FAISS vector store is created from these embeddings.\n",
"\n",
"### Context-Enriched Retrieval\n",
"\n",
"LlamaIndex provides a dedicated parser for this task: [SentenceWindowNodeParser](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#sentencewindownodeparser). This parser splits documents into sentences, and each resulting node includes its surrounding sentences via a relation structure. At query time, the [MetadataReplacementPostProcessor](https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/node_postprocessors/#metadatareplacementpostprocessor) connects these related sentences back together.\n",
"\n",
"### Retrieval Comparison\n",
"\n",
"The notebook includes a section to compare standard retrieval with the context-enriched approach.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Provides more coherent and contextually rich results\n",
"2. Maintains the advantages of vector search while mitigating its tendency to return isolated text fragments\n",
"3. Allows for flexible adjustment of the context window size\n",
"\n",
"## Conclusion\n",
"\n",
"This context enrichment window technique offers a promising way to improve the quality of retrieved information in vector-based document search systems. By providing surrounding context, it helps maintain the coherence and completeness of the retrieved information, potentially leading to better understanding and more accurate responses in downstream tasks such as question answering."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/vector-search-comparison_context_enrichment.svg\" alt=\"context enrichment window\" style=\"width:70%; height:auto;\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries and environment variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import Settings\n",
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
"from llama_index.core.readers import SimpleDirectoryReader\n",
"from llama_index.vector_stores.faiss import FaissVectorStore\n",
"from llama_index.core.ingestion import IngestionPipeline\n",
"from llama_index.core.node_parser import SentenceWindowNodeParser, SentenceSplitter\n",
"from llama_index.core import VectorStoreIndex\n",
"from llama_index.core.postprocessor import MetadataReplacementPostProcessor\n",
"import faiss\n",
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from pprint import pprint\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))  # Add the parent directory to the path since we work with notebooks\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"\n",
"# Set the OpenAI API key environment variable\n",
"os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# LlamaIndex global settings for llm and embeddings\n",
"EMBED_DIMENSION = 512\n",
"Settings.llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read docs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/\"\n",
"reader = SimpleDirectoryReader(input_dir=path, required_exts=['.pdf'])\n",
"documents = reader.load_data()\n",
"print(documents[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create vector store and retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a FaissVectorStore to store embeddings\n",
"faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
"vector_store = FaissVectorStore(faiss_index=faiss_index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ingestion Pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ingestion Pipeline with Sentence Splitter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"base_pipeline = IngestionPipeline(\n",
"    transformations=[SentenceSplitter()],\n",
"    vector_store=vector_store\n",
")\n",
"\n",
"base_nodes = base_pipeline.run(documents=documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ingestion Pipeline with Sentence Window"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"node_parser = SentenceWindowNodeParser(\n",
"    # How many sentences on both sides to capture;\n",
"    # setting this to 3 results in 7 sentences per window.\n",
"    window_size=3,\n",
"    # the metadata key to be used in MetadataReplacementPostProcessor\n",
"    window_metadata_key=\"window\",\n",
"    # the metadata key that holds the original sentence\n",
"    original_text_metadata_key=\"original_sentence\"\n",
")\n",
"\n",
"# Create a pipeline with the defined document transformations and vector store\n",
"pipeline = IngestionPipeline(\n",
"    transformations=[node_parser],\n",
"    vector_store=vector_store,\n",
")\n",
"\n",
"windowed_nodes = pipeline.run(documents=documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"Explain the role of deforestation and fossil fuels in climate change\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying *without* Metadata Replacement"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create vector index from base nodes\n",
"base_index = VectorStoreIndex(base_nodes)\n",
"\n",
"# Instantiate query engine from vector index\n",
"base_query_engine = base_index.as_query_engine(\n",
"    similarity_top_k=1,\n",
")\n",
"\n",
"# Send query to the engine to get related node(s)\n",
"base_response = base_query_engine.query(query)\n",
"\n",
"print(base_response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print Metadata of the Retrieved Node"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(base_response.source_nodes[0].node.metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying with Metadata Replacement\n",
"\"Metadata replacement\" might sound off topic at first, since retrieval matches against the base sentences. But LlamaIndex stores the surrounding sentences in each node's metadata, so rebuilding the sentence windows requires the metadata replacement post-processor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create window index from nodes created by the SentenceWindowNodeParser\n",
"windowed_index = VectorStoreIndex(windowed_nodes)\n",
"\n",
"# Instantiate query engine with MetadataReplacementPostProcessor\n",
"windowed_query_engine = windowed_index.as_query_engine(\n",
"    similarity_top_k=1,\n",
"    node_postprocessors=[\n",
"        MetadataReplacementPostProcessor(\n",
"            target_metadata_key=\"window\"  # `window_metadata_key` defined in SentenceWindowNodeParser\n",
"        )\n",
"    ],\n",
")\n",
"\n",
"# Send query to the engine to get related node(s)\n",
"windowed_response = windowed_query_engine.query(query)\n",
"\n",
"print(windowed_response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print Metadata of the Retrieved Node"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Window and original sentence are added to the metadata\n",
"pprint(windowed_response.source_nodes[0].node.metadata)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
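The sentence-window idea in the notebook above can be approximated without LlamaIndex. The following is a minimal pure-Python sketch (the helper names and the naive regex sentence splitter are my own, not the library's): each node embeds a single sentence but carries its neighbor window in metadata, and a post-processing step swaps the matched sentence for that window, mimicking what `MetadataReplacementPostProcessor` does.

```python
import re

def build_windowed_nodes(text, window_size=3):
    """Split text into sentences; each node keeps a window of neighbors in metadata."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,                      # what gets embedded and matched
            "window": " ".join(sentences[lo:hi]),  # what replaces it after retrieval
            "original_sentence": sentence,
        })
    return nodes

def replace_with_window(node):
    """Post-processing step: swap the node text for its stored window."""
    return node["window"]

nodes = build_windowed_nodes("A. B. C. D. E. F. G. H.", window_size=3)
# The node for "D." carries 3 sentences on each side plus itself (7 total).
print(replace_with_window(nodes[3]))  # A. B. C. D. E. F. G.
```

A real pipeline would embed `node["text"]` into a vector store and apply `replace_with_window` to the top hits before handing them to the LLM.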
all_rag_techniques/fusion_retrieval_with_llamaindex.ipynb (new file, 346 lines)
@@ -0,0 +1,346 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fusion Retrieval in Document Search\n",
"\n",
"## Overview\n",
"\n",
"This code implements a Fusion Retrieval system that combines vector-based similarity search with keyword-based BM25 retrieval. The approach aims to leverage the strengths of both methods to improve the overall quality and relevance of document retrieval.\n",
"\n",
"## Motivation\n",
"\n",
"Traditional retrieval methods often rely on either semantic understanding (vector-based) or keyword matching (BM25). Each approach has its strengths and weaknesses. Fusion retrieval aims to combine these methods to create a more robust and accurate retrieval system that can handle a wider range of queries effectively.\n",
"\n",
"## Key Components\n",
"\n",
"1. PDF processing and text chunking\n",
"2. Vector store creation using FAISS and OpenAI embeddings\n",
"3. BM25 index creation for keyword-based retrieval\n",
"4. Fusing BM25 and vector search results for better retrieval\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The PDF is loaded and split into chunks using SentenceSplitter.\n",
"2. Chunks are cleaned by replacing tab characters with spaces and removing stray paragraph separators (addressing PDF extraction artifacts).\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
"\n",
"### BM25 Index Creation\n",
"\n",
"1. A BM25 index is created from the same text chunks used for the vector store.\n",
"2. This allows for keyword-based retrieval alongside the vector-based method.\n",
"\n",
"### Query Fusion Retrieval\n",
"\n",
"After both indexes are created, Query Fusion Retrieval combines them to enable hybrid retrieval.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Improved Retrieval Quality: By combining semantic and keyword-based search, the system can capture both conceptual similarity and exact keyword matches.\n",
"2. Flexibility: The `retriever_weights` parameter allows for adjusting the balance between vector and keyword search based on specific use cases or query types.\n",
"3. Robustness: The combined approach can handle a wider range of queries effectively, mitigating weaknesses of individual methods.\n",
"4. Customizability: The system can be easily adapted to use different vector stores or keyword-based retrieval methods.\n",
"\n",
"## Conclusion\n",
"\n",
"Fusion retrieval represents a powerful approach to document search that combines the strengths of semantic understanding and keyword matching. By leveraging both vector-based and BM25 retrieval methods, it offers a more comprehensive and flexible solution for information retrieval tasks. This approach has potential applications in various fields where both conceptual similarity and keyword relevance are important, such as academic research, legal document search, or general-purpose search engines."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from typing import List\n",
"from llama_index.core import Settings\n",
"from llama_index.core.readers import SimpleDirectoryReader\n",
"from llama_index.core.node_parser import SentenceSplitter\n",
"from llama_index.core.ingestion import IngestionPipeline\n",
"from llama_index.core.schema import BaseNode, TransformComponent\n",
"from llama_index.vector_stores.faiss import FaissVectorStore\n",
"from llama_index.core import VectorStoreIndex\n",
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
"from llama_index.retrievers.bm25 import BM25Retriever\n",
"from llama_index.core.retrievers import QueryFusionRetriever\n",
"import faiss\n",
"\n",
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))  # Add the parent directory to the path since we work with notebooks\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"\n",
"# Set the OpenAI API key environment variable\n",
"os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# LlamaIndex global settings for llm and embeddings\n",
"EMBED_DIMENSION = 512\n",
"Settings.llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n",
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Docs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = \"../data/\"\n",
"reader = SimpleDirectoryReader(input_dir=path, required_exts=['.pdf'])\n",
"documents = reader.load_data()\n",
"print(documents[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Vector Store"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a FaissVectorStore to store embeddings\n",
"faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
"vector_store = FaissVectorStore(faiss_index=faiss_index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text Cleaner Transformation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class TextCleaner(TransformComponent):\n",
"    \"\"\"\n",
"    Transformation to be used within the ingestion pipeline.\n",
"    Cleans clutter from texts.\n",
"    \"\"\"\n",
"    def __call__(self, nodes, **kwargs) -> List[BaseNode]:\n",
"\n",
"        for node in nodes:\n",
"            node.text = node.text.replace('\\t', ' ')  # Replace tabs with spaces\n",
"            node.text = node.text.replace(' \\n', ' ')  # Replace paragraph separators with spaces\n",
"\n",
"        return nodes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ingestion Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pipeline instantiation with:\n",
"# node parser, custom transformer, vector store and documents\n",
"pipeline = IngestionPipeline(\n",
"    transformations=[\n",
"        SentenceSplitter(),\n",
"        TextCleaner()\n",
"    ],\n",
"    vector_store=vector_store,\n",
"    documents=documents\n",
")\n",
"\n",
"# Run the pipeline to get nodes\n",
"nodes = pipeline.run()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrievers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### BM25 Retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bm25_retriever = BM25Retriever.from_defaults(\n",
"    nodes=nodes,\n",
"    similarity_top_k=2,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vector Retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"index = VectorStoreIndex(nodes)\n",
"vector_retriever = index.as_retriever(similarity_top_k=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fusing Both Retrievers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever = QueryFusionRetriever(\n",
"    retrievers=[\n",
"        vector_retriever,\n",
"        bm25_retriever\n",
"    ],\n",
"    retriever_weights=[\n",
"        0.6,  # vector retriever weight\n",
"        0.4   # BM25 retriever weight\n",
"    ],\n",
"    num_queries=1,\n",
"    mode='dist_based_score',\n",
"    use_async=False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"About parameters\n",
"\n",
"1. `num_queries`: The Query Fusion Retriever not only combines retrievers but can also generate multiple questions from a given query. This parameter controls how many total queries will be passed to the retrievers. Setting it to 1 disables query generation, so the final retriever uses only the initial query.\n",
"2. `mode`: There are 4 options for this parameter.\n",
"    - **reciprocal_rerank**: Applies reciprocal rank fusion. (Since there is no score normalization, this method is not suitable for this application, because different retrievers return scores on different scales.)\n",
"    - **relative_score**: Applies min-max scaling based on the minimum and maximum scores among all the nodes, so scores fall between 0 and 1. Finally, the scores are weighted per retriever according to `retriever_weights`.\n",
"      ```math\n",
"      min\\_score = min(scores)\n",
"      \\\\ max\\_score = max(scores)\n",
"      ```\n",
"    - **dist_based_score**: The only difference from `relative_score` is that the min-max scaling bounds are derived from the mean and standard deviation of the scores. Scaling and weighting are the same.\n",
"      ```math\n",
"      min\\_score = mean\\_score - 3 * std\\_dev\n",
"      \\\\ max\\_score = mean\\_score + 3 * std\\_dev\n",
"      ```\n",
"    - **simple**: This method simply takes the max score of each chunk."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use Case Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query\n",
"query = \"What are the impacts of climate change on the environment?\"\n",
"\n",
"# Perform fusion retrieval\n",
"response = retriever.retrieve(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print Final Retrieved Nodes with Scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for node in response:\n",
"    print(f\"Node Score: {node.score:.2}\")\n",
"    print(f\"Node Content: {node.text}\")\n",
"    print(\"-\" * 100)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
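The `relative_score` and `dist_based_score` fusion modes used in the notebook above can be sketched in plain Python. This is a simplified re-implementation for illustration only (not LlamaIndex's actual code): each retriever's raw scores are min-max scaled to [0, 1], multiplied by that retriever's weight, and summed per node.

```python
def normalize(scores, mode="relative_score"):
    """Min-max scale scores to [0, 1]; dist_based_score derives the bounds from mean/std."""
    if mode == "dist_based_score":
        mean = sum(scores) / len(scores)
        std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
        lo, hi = mean - 3 * std, mean + 3 * std
    else:
        lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(results_per_retriever, weights, mode="relative_score"):
    """results_per_retriever: list of {node_id: raw_score} dicts, one per retriever."""
    fused = {}
    for results, weight in zip(results_per_retriever, weights):
        ids = list(results)
        for node_id, norm in zip(ids, normalize([results[i] for i in ids], mode)):
            fused[node_id] = fused.get(node_id, 0.0) + weight * norm
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

vector_hits = {"doc1": 0.82, "doc2": 0.40}  # cosine-like scores, bounded
bm25_hits = {"doc2": 7.1, "doc3": 2.3}      # unbounded BM25 scores
ranking = fuse([vector_hits, bm25_hits], weights=[0.6, 0.4])
print(ranking)  # [('doc1', 0.6), ('doc2', 0.4), ('doc3', 0.0)]
```

The normalization step is exactly why these modes work where `reciprocal_rerank` does not: BM25 and cosine scores live on different scales, and min-max scaling puts them on a common footing before weighting.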
@@ -73,6 +73,16 @@
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"text-align: center;\">\n",
"\n",
"<img src=\"../images/hierarchical_indices_example.svg\" alt=\"hierarchical_indices\" style=\"width:100%; height:auto;\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
all_rag_techniques/proposition_chunking.ipynb (new file, 805 lines)
@@ -0,0 +1,805 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Propositions Chunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This code implements the proposition chunking method. The system breaks the input text down into propositions that are atomic, factual, self-contained, and concise, then encodes the propositions into a vector store that can later be used for retrieval.\n",
"\n",
"### Key Components\n",
"\n",
"1. **Document Chunking:** Splitting a document into manageable pieces for analysis.\n",
"2. **Proposition Generation:** Using LLMs to break down document chunks into factual, self-contained propositions.\n",
"3. **Proposition Quality Check:** Evaluating generated propositions based on accuracy, clarity, completeness, and conciseness.\n",
"4. **Embedding and Vector Store:** Embedding both propositions and larger chunks of the document into a vector store for efficient retrieval.\n",
"5. **Retrieval and Comparison:** Testing the retrieval system with different query sizes and comparing results from the proposition-based model with the larger chunk-based model.\n",
"\n",
"<img src=\"../images/proposition_chunking.svg\" alt=\"proposition chunking\" width=\"600\">\n",
"\n",
"### Motivation\n",
"\n",
"The motivation behind the propositions chunking method is to build a system that breaks down a text document into concise, factual propositions for more granular and precise information retrieval. Using propositions allows for finer control and better handling of specific queries, particularly for extracting knowledge from detailed or complex texts. The comparison between using smaller proposition chunks and larger document chunks aims to evaluate the effectiveness of granular information retrieval.\n",
"\n",
"### Method Details\n",
"\n",
"1. **Loading Environment Variables:** The code begins by loading environment variables (e.g., API keys for the LLM service) to ensure that the system can access the necessary resources.\n",
"\n",
"2. **Document Chunking:**\n",
"    - The input document is split into smaller pieces (chunks) using `RecursiveCharacterTextSplitter`. This ensures that each chunk is of manageable size for the LLM to process.\n",
"\n",
"3. **Proposition Generation:**\n",
"    - Propositions are generated from each chunk using an LLM (in this case, \"llama-3.1-70b-versatile\"). The output is structured as a list of factual, self-contained statements that can be understood without additional context.\n",
"\n",
"4. **Quality Check:**\n",
"    - A second LLM evaluates the quality of the propositions by scoring them on accuracy, clarity, completeness, and conciseness. Propositions that meet the required thresholds in all categories are retained.\n",
"\n",
"5. **Embedding Propositions:**\n",
"    - Propositions that pass the quality check are embedded into a vector store using the `OllamaEmbeddings` model. This allows for similarity-based retrieval of propositions when queries are made.\n",
"\n",
"6. **Retrieval and Comparison:**\n",
"    - Two retrieval systems are built: one using the proposition-based chunks and another using larger document chunks. Both are tested with several queries to compare their performance and the precision of the returned results.\n",
"\n",
"### Benefits\n",
"\n",
"- **Granularity:** By breaking the document into small factual propositions, the system allows for highly specific retrieval, making it easier to extract precise answers from large or complex documents.\n",
"- **Quality Assurance:** The use of a quality-checking LLM ensures that the generated propositions meet specific standards, improving the reliability of the retrieved information.\n",
"- **Flexibility in Retrieval:** The comparison between proposition-based and larger chunk-based retrieval allows for evaluating the trade-offs between granularity and broader context in search results.\n",
"\n",
"### Implementation\n",
"\n",
"1. **Proposition Generation:** The LLM is used in conjunction with a custom prompt to generate factual statements from the document chunks.\n",
"2. **Quality Checking:** The generated propositions are passed through a grading system that evaluates accuracy, clarity, completeness, and conciseness.\n",
"3. **Vector Store Integration:** Propositions are stored in a FAISS vector store after being embedded using a pre-trained embedding model, allowing for efficient similarity-based search and retrieval.\n",
"4. **Query Testing:** Multiple test queries are made to the vector stores (proposition-based and larger chunks) to compare the retrieval performance.\n",
"\n",
"### Summary\n",
"\n",
"This code presents a robust method for breaking down a document into self-contained propositions using LLMs. The system performs a quality check on each proposition, embeds them in a vector store, and retrieves the most relevant information based on user queries. The ability to compare granular propositions against larger document chunks provides insight into which method yields more accurate or useful results for different types of queries. The approach emphasizes the importance of high-quality proposition generation and retrieval for precise information extraction from complex documents."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"### LLMs\n",
"import os\n",
"from dotenv import load_dotenv\n",
"\n",
"# Load environment variables from '.env' file\n",
"load_dotenv()\n",
"\n",
"os.environ['GROQ_API_KEY'] = os.getenv('GROQ_API_KEY')  # For LLM"
]
},
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Test Document"
|
||||
]
|
||||
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"sample_content = \"\"\"Paul Graham's essay \"Founder Mode,\" published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.\n",
"Conventional Wisdom vs. Founder Mode\n",
"The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.\n",
"This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. \"Founder Mode\" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional \"manager mode\" often advised by business schools and professional managers.\n",
"Unique Founder Abilities\n",
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.\n",
"Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. \"Founder Mode\" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.\n",
"Challenges of Scaling Startups\n",
"As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success.\n",
"Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.\n",
"Steve Jobs' Management Style\n",
"Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart\n",
". This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chunking"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"### Build Index\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain_core.documents import Document\n",
"from langchain_community.vectorstores import FAISS\n",
"from langchain_community.embeddings import OllamaEmbeddings\n",
"\n",
"# Set embeddings\n",
"embedding_model = OllamaEmbeddings(model='nomic-embed-text:v1.5', show_progress=True)\n",
"\n",
"# docs\n",
"docs_list = [Document(page_content=sample_content, metadata={\"Title\": \"Paul Graham's Founder Mode Essay\", \"Source\": \"https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ\"})]\n",
"\n",
"# Split\n",
"text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
"    chunk_size=200, chunk_overlap=50\n",
")\n",
"\n",
"doc_splits = text_splitter.split_documents(docs_list)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"for i, doc in enumerate(doc_splits):\n",
"    doc.metadata['chunk_id'] = i+1 ### adding chunk id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate Propositions"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_groq import ChatGroq\n",
"\n",
"# Data model\n",
"class GeneratePropositions(BaseModel):\n",
"    \"\"\"List of all the propositions in a given document\"\"\"\n",
"\n",
"    propositions: List[str] = Field(\n",
"        description=\"List of propositions (factual, self-contained, and concise information)\"\n",
"    )\n",
"\n",
"\n",
"# LLM with function call\n",
"llm = ChatGroq(model=\"llama-3.1-70b-versatile\", temperature=0)\n",
"structured_llm = llm.with_structured_output(GeneratePropositions)\n",
"\n",
"# Few-shot prompting --- add more examples to improve proposition quality\n",
"proposition_examples = [\n",
"    {\"document\": \n",
"        \"In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.\", \n",
"     \"propositions\": \n",
"        \"['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']\"\n",
"    },\n",
"]\n",
"\n",
"example_proposition_prompt = ChatPromptTemplate.from_messages(\n",
"    [\n",
"        (\"human\", \"{document}\"),\n",
"        (\"ai\", \"{propositions}\"),\n",
"    ]\n",
")\n",
"\n",
"few_shot_prompt = FewShotChatMessagePromptTemplate(\n",
"    example_prompt = example_proposition_prompt,\n",
"    examples = proposition_examples,\n",
")\n",
"\n",
"# Prompt\n",
"system = \"\"\"Please break down the following text into simple, self-contained propositions. Ensure that each proposition meets the following criteria:\n",
"\n",
"    1. Express a Single Fact: Each proposition should state one specific fact or claim.\n",
"    2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.\n",
"    3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.\n",
"    4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.\n",
"    5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
"    [\n",
"        (\"system\", system),\n",
"        few_shot_prompt,\n",
"        (\"human\", \"{document}\"),\n",
"    ]\n",
")\n",
"\n",
"proposition_generator = prompt | structured_llm"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"propositions = [] # Store all the propositions from the document\n",
"\n",
"for i in range(len(doc_splits)):\n",
"    response = proposition_generator.invoke({\"document\": doc_splits[i].page_content}) # Creating proposition\n",
"    for proposition in response.propositions:\n",
"        propositions.append(Document(page_content=proposition, metadata={\"Title\": \"Paul Graham's Founder Mode Essay\", \"Source\": \"https://www.perplexity.ai/page/paul-graham-s-founder-mode-ess-t9TCyvkqRiyMQJWsHr0fnQ\", \"chunk_id\": i+1}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Quality Check"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Data model\n",
"class GradePropositions(BaseModel):\n",
"    \"\"\"Grade a given proposition on accuracy, clarity, completeness, and conciseness\"\"\"\n",
"\n",
"    accuracy: int = Field(\n",
"        description=\"Rate from 1-10 based on how well the proposition reflects the original text.\"\n",
"    )\n",
"    \n",
"    clarity: int = Field(\n",
"        description=\"Rate from 1-10 based on how easy it is to understand the proposition without additional context.\"\n",
"    )\n",
"\n",
"    completeness: int = Field(\n",
"        description=\"Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers).\"\n",
"    )\n",
"\n",
"    conciseness: int = Field(\n",
"        description=\"Rate from 1-10 based on whether the proposition is concise without losing important information.\"\n",
"    )\n",
"\n",
"# LLM with function call\n",
"llm = ChatGroq(model=\"llama-3.1-70b-versatile\", temperature=0)\n",
"structured_llm = llm.with_structured_output(GradePropositions)\n",
"\n",
"# Prompt\n",
"evaluation_prompt_template = \"\"\"\n",
"Please evaluate the following proposition based on the criteria below:\n",
"- **Accuracy**: Rate from 1-10 based on how well the proposition reflects the original text.\n",
"- **Clarity**: Rate from 1-10 based on how easy it is to understand the proposition without additional context.\n",
"- **Completeness**: Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers).\n",
"- **Conciseness**: Rate from 1-10 based on whether the proposition is concise without losing important information.\n",
"\n",
"Example:\n",
"Docs: In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.\n",
"\n",
"Propositions_1: Neil Armstrong was an astronaut.\n",
"Evaluation_1: \"accuracy\": 10, \"clarity\": 10, \"completeness\": 10, \"conciseness\": 10\n",
"\n",
"Propositions_2: Neil Armstrong walked on the Moon in 1969.\n",
"Evaluation_2: \"accuracy\": 10, \"clarity\": 10, \"completeness\": 10, \"conciseness\": 10\n",
"\n",
"Propositions_3: Neil Armstrong was the first person to walk on the Moon.\n",
"Evaluation_3: \"accuracy\": 10, \"clarity\": 10, \"completeness\": 10, \"conciseness\": 10\n",
"\n",
"Propositions_4: Neil Armstrong walked on the Moon during the Apollo 11 mission.\n",
"Evaluation_4: \"accuracy\": 10, \"clarity\": 10, \"completeness\": 10, \"conciseness\": 10\n",
"\n",
"Propositions_5: The Apollo 11 mission occurred in 1969.\n",
"Evaluation_5: \"accuracy\": 10, \"clarity\": 10, \"completeness\": 10, \"conciseness\": 10\n",
"\n",
"Format:\n",
"Proposition: \"{proposition}\"\n",
"Original Text: \"{original_text}\"\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
"    [\n",
"        (\"system\", evaluation_prompt_template),\n",
"        (\"human\", \"{proposition}, {original_text}\"),\n",
"    ]\n",
")\n",
"\n",
"proposition_evaluator = prompt | structured_llm"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"17) Proposition: Startups often transition to a more structured managerial approach as they grow. \n",
"     Scores: {'accuracy': 8, 'clarity': 9, 'completeness': 6, 'conciseness': 8}\n",
"Fail\n",
"31) Proposition: Delegating responsibilities to professional managers is not always the best approach as companies scale. \n",
"     Scores: {'accuracy': 10, 'clarity': 10, 'completeness': 8, 'conciseness': 6}\n",
"Fail\n"
]
}
],
"source": [
"# Define evaluation categories and thresholds\n",
"evaluation_categories = [\"accuracy\", \"clarity\", \"completeness\", \"conciseness\"]\n",
"thresholds = {\"accuracy\": 7, \"clarity\": 7, \"completeness\": 7, \"conciseness\": 7}\n",
"\n",
"# Function to evaluate proposition\n",
"def evaluate_proposition(proposition, original_text):\n",
"    response = proposition_evaluator.invoke({\"proposition\": proposition, \"original_text\": original_text})\n",
"    \n",
"    # Parse the response to extract scores\n",
"    scores = {\"accuracy\": response.accuracy, \"clarity\": response.clarity, \"completeness\": response.completeness, \"conciseness\": response.conciseness}\n",
"    return scores\n",
"\n",
"# Check if the proposition passes the quality check\n",
"def passes_quality_check(scores):\n",
"    for category, score in scores.items():\n",
"        if score < thresholds[category]:\n",
"            return False\n",
"    return True\n",
"\n",
"evaluated_propositions = [] # Store propositions that pass the quality check\n",
"\n",
"# Loop through generated propositions and evaluate them\n",
"for idx, proposition in enumerate(propositions):\n",
"    scores = evaluate_proposition(proposition.page_content, doc_splits[proposition.metadata['chunk_id'] - 1].page_content)\n",
"    if passes_quality_check(scores):\n",
"        # Proposition passes quality check, keep it\n",
"        evaluated_propositions.append(proposition)\n",
"    else:\n",
"        # Proposition fails, discard or flag for further review\n",
"        print(f\"{idx+1}) Proposition: {proposition.page_content} \\n Scores: {scores}\")\n",
"        print(\"Fail\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Embedding propositions in a vectorstore"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OllamaEmbeddings: 100%|██████████| 29/29 [00:08<00:00,  3.62it/s]\n"
]
}
],
"source": [
"# Add to vectorstore\n",
"vectorstore_propositions = FAISS.from_documents(evaluated_propositions, embedding_model)\n",
"retriever_propositions = vectorstore_propositions.as_retriever(\n",
"    search_type=\"similarity\",\n",
"    search_kwargs={'k': 4}, # number of documents to retrieve\n",
"    )"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  5.39it/s]\n"
]
}
],
"source": [
"query = \"Whose management approach served as inspiration for Brian Chesky's \\\"Founder Mode\\\" at Airbnb?\"\n",
"res_proposition = retriever_propositions.invoke(query)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1) Content: Brian Chesky was advised to run Airbnb in a traditional managerial style. --- Chunk_id: 3\n",
"2) Content: Brian Chesky adopted a different approach to running Airbnb. --- Chunk_id: 3\n",
"3) Content: Brian Chesky is a co-founder of Airbnb. --- Chunk_id: 3\n",
"4) Content: Steve Jobs' management style at Apple influenced Brian Chesky's approach. --- Chunk_id: 3\n"
]
}
],
"source": [
"for i, r in enumerate(res_proposition):\n",
"    print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparing performance with larger chunk sizes"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OllamaEmbeddings: 100%|██████████| 3/3 [00:00<00:00,  5.35it/s]\n"
]
}
],
"source": [
"# Add to vectorstore\n",
"vectorstore_larger = FAISS.from_documents(doc_splits, embedding_model)\n",
"retriever_larger = vectorstore_larger.as_retriever(\n",
"    search_type=\"similarity\",\n",
"    search_kwargs={'k': 4}, # number of documents to retrieve\n",
"    )"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  6.64it/s]\n"
]
}
],
"source": [
"res_larger = retriever_larger.invoke(query)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1) Content: Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.\n",
"Steve Jobs' Management Style\n",
"Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart\n",
". This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale. --- Chunk_id: 3\n",
"2) Content: Unique Founder Abilities\n",
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.\n",
"Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. \"Founder Mode\" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.\n",
"Challenges of Scaling Startups\n",
"As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success. --- Chunk_id: 2\n",
"3) Content: Paul Graham's essay \"Founder Mode,\" published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.\n",
"Conventional Wisdom vs. Founder Mode\n",
"The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.\n",
"This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. \"Founder Mode\" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional \"manager mode\" often advised by business schools and professional managers.\n",
"Unique Founder Abilities\n",
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture. --- Chunk_id: 1\n"
]
}
],
"source": [
"for i, r in enumerate(res_larger):\n",
"    print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Testing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test - 1"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  6.29it/s]\n",
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  8.06it/s]\n"
]
}
],
"source": [
"test_query_1 = \"what is the essay \\\"Founder Mode\\\" about?\"\n",
"res_proposition = retriever_propositions.invoke(test_query_1)\n",
"res_larger = retriever_larger.invoke(test_query_1)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1) Content: Founder Mode is an emerging paradigm that is not yet fully understood or documented. --- Chunk_id: 2\n",
"2) Content: Founder Mode is not yet fully understood or documented. --- Chunk_id: 1\n",
"3) Content: Founder Mode is an emerging paradigm. --- Chunk_id: 1\n",
"4) Content: Paul Graham's essay 'Founder Mode' challenges conventional wisdom about scaling startups. --- Chunk_id: 1\n"
]
}
],
"source": [
"for i, r in enumerate(res_proposition):\n",
"    print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1) Content: Paul Graham's essay \"Founder Mode,\" published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.\n",
"Conventional Wisdom vs. Founder Mode\n",
"The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.\n",
"This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. \"Founder Mode\" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional \"manager mode\" often advised by business schools and professional managers.\n",
"Unique Founder Abilities\n",
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture. --- Chunk_id: 1\n",
"2) Content: Unique Founder Abilities\n",
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.\n",
"Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. \"Founder Mode\" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.\n",
"Challenges of Scaling Startups\n",
"As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success. --- Chunk_id: 2\n",
"3) Content: Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.\n",
"Steve Jobs' Management Style\n",
"Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart\n",
". This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale. --- Chunk_id: 3\n"
]
}
],
"source": [
"for i, r in enumerate(res_larger):\n",
"    print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test - 2"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  3.22it/s]\n",
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 15.18it/s]\n"
]
}
],
"source": [
"test_query_2 = \"who is the co-founder of Airbnb?\"\n",
"res_proposition = retriever_propositions.invoke(test_query_2)\n",
"res_larger = retriever_larger.invoke(test_query_2)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1) Content: Brian Chesky is a co-founder of Airbnb. --- Chunk_id: 3\n",
"2) Content: Brian Chesky adopted a different approach to running Airbnb. --- Chunk_id: 3\n",
"3) Content: Brian Chesky was advised to run Airbnb in a traditional managerial style. --- Chunk_id: 3\n",
"4) Content: Running Airbnb in a traditional managerial style led to poor outcomes. --- Chunk_id: 3\n"
]
}
],
"source": [
"for i, r in enumerate(res_proposition):\n",
"    print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1) Content: Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.\n",
"Steve Jobs' Management Style\n",
"Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart\n",
". This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale. --- Chunk_id: 3\n",
"2) Content: Paul Graham's essay \"Founder Mode,\" published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.\n",
"Conventional Wisdom vs. Founder Mode\n",
"The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.\n",
"This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. \"Founder Mode\" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional \"manager mode\" often advised by business schools and professional managers.\n",
"Unique Founder Abilities\n",
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture. --- Chunk_id: 1\n",
"3) Content: Unique Founder Abilities\n",
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.\n",
"Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. \"Founder Mode\" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.\n",
"Challenges of Scaling Startups\n",
"As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success. --- Chunk_id: 2\n"
]
}
],
"source": [
"for i, r in enumerate(res_larger):\n",
"    print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test - 3"
]
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 10.09it/s]\n",
|
||||
"OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 7.71it/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"test_query_3 = \"when was the essay \\\"founder mode\\\" published?\"\n",
|
||||
"res_proposition = retriever_propositions.invoke(test_query_3)\n",
|
||||
"res_larger = retriever_larger.invoke(test_query_3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"1) Content: Paul Graham published an essay called 'Founder Mode' in September 2024. --- Chunk_id: 1\n",
|
||||
"2) Content: Founder Mode is an emerging paradigm. --- Chunk_id: 1\n",
|
||||
"3) Content: Founder Mode is an emerging paradigm that is not yet fully understood or documented. --- Chunk_id: 2\n",
|
||||
"4) Content: Founder Mode is not yet fully understood or documented. --- Chunk_id: 1\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for i, r in enumerate(res_proposition):\n",
|
||||
" print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"1) Content: Paul Graham's essay \"Founder Mode,\" published in September 2024, challenges conventional wisdom about scaling startups, arguing that founders should maintain their unique management style rather than adopting traditional corporate practices as their companies grow.\n",
|
||||
"Conventional Wisdom vs. Founder Mode\n",
|
||||
"The essay argues that the traditional advice given to growing companies—hiring good people and giving them autonomy—often fails when applied to startups.\n",
|
||||
"This approach, suitable for established companies, can be detrimental to startups where the founder's vision and direct involvement are crucial. \"Founder Mode\" is presented as an emerging paradigm that is not yet fully understood or documented, contrasting with the conventional \"manager mode\" often advised by business schools and professional managers.\n",
|
||||
"Unique Founder Abilities\n",
|
||||
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture. --- Chunk_id: 1\n",
|
||||
"2) Content: Unique Founder Abilities\n",
|
||||
"Founders possess unique insights and abilities that professional managers do not, primarily because they have a deep understanding of their company's vision and culture.\n",
|
||||
"Graham suggests that founders should leverage these strengths rather than conform to traditional managerial practices. \"Founder Mode\" is an emerging paradigm that is not yet fully understood or documented, with Graham hoping that over time, it will become as well-understood as the traditional manager mode, allowing founders to maintain their unique approach even as their companies scale.\n",
|
||||
"Challenges of Scaling Startups\n",
|
||||
"As startups grow, there is a common belief that they must transition to a more structured managerial approach. However, many founders have found this transition problematic, as it often leads to a loss of the innovative and agile spirit that drove the startup's initial success. --- Chunk_id: 2\n",
|
||||
"3) Content: Brian Chesky, co-founder of Airbnb, shared his experience of being advised to run the company in a traditional managerial style, which led to poor outcomes. He eventually found success by adopting a different approach, influenced by how Steve Jobs managed Apple.\n",
|
||||
"Steve Jobs' Management Style\n",
|
||||
"Steve Jobs' management approach at Apple served as inspiration for Brian Chesky's \"Founder Mode\" at Airbnb. One notable practice was Jobs' annual retreat for the 100 most important people at Apple, regardless of their position on the organizational chart\n",
|
||||
". This unconventional method allowed Jobs to maintain a startup-like environment even as Apple grew, fostering innovation and direct communication across hierarchical levels. Such practices emphasize the importance of founders staying deeply involved in their companies' operations, challenging the traditional notion of delegating responsibilities to professional managers as companies scale. --- Chunk_id: 3\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for i, r in enumerate(res_larger):\n",
|
||||
" print(f\"{i+1}) Content: {r.page_content} --- Chunk_id: {r.metadata['chunk_id']}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Comparison\n",
|
||||
"\n",
|
||||
"| **Aspect** | **Proposition-Based Retrieval** | **Simple Chunk Retrieval** |\n",
|
||||
"|---------------------------|--------------------------------------------------------------------------|--------------------------------------------------------------------------|\n",
|
||||
"| **Precision in Response** | High: Delivers focused and direct answers. | Medium: Provides more context but may include irrelevant information. |\n",
|
||||
"| **Clarity and Brevity** | High: Clear and concise, avoids unnecessary details. | Medium: More comprehensive but can be overwhelming. |\n",
|
||||
"| **Contextual Richness** | Low: May lack context, focusing on specific propositions. | High: Provides additional context and details. |\n",
|
||||
"| **Comprehensiveness** | Low: May omit broader context or supplementary details. | High: Offers a more complete view with extensive information. |\n",
|
||||
"| **Narrative Flow** | Medium: Can be fragmented or disjointed. | High: Preserves the logical flow and coherence of the original document. |\n",
|
||||
"| **Information Overload** | Low: Less likely to overwhelm with excess information. | High: Risk of overwhelming the user with too much information. |\n",
|
||||
"| **Use Case Suitability** | Best for quick, factual queries. | Best for complex queries requiring in-depth understanding. |\n",
|
||||
"| **Efficiency** | High: Provides quick, targeted responses. | Medium: May require more effort to sift through additional content. |\n",
|
||||
"| **Specificity** | High: Precise and targeted responses. | Medium: Answers may be less targeted due to inclusion of broader context.|\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "test",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
|
||||
"from langchain.retrievers.document_compressors import LLMChainExtractor\n",
|
||||
"from langchain.schema import AIMessage\n",
|
||||
"from langchain.docstore.document import Document\n",
|
||||
"\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"import logging\n",
|
||||
"import os\n",
|
||||
|
||||
all_rag_techniques/reliable_rag.ipynb
|
||||
"The choice between LLM-based and Cross-Encoder reranking methods depends on factors such as required accuracy, available computational resources, and specific application needs. Both approaches offer substantial improvements over basic retrieval methods and contribute to the overall effectiveness of RAG systems."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/reranking-visualization.svg\" alt=\"rerank llm\" style=\"width:100%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/reranking_comparison.svg\" alt=\"rerank llm\" style=\"width:100%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
||||
all_rag_techniques/reranking_with_llamaindex.ipynb
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Reranking Methods in RAG Systems\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"Reranking is a crucial step in Retrieval-Augmented Generation (RAG) systems that aims to improve the relevance and quality of retrieved documents. It involves reassessing and reordering initially retrieved documents to ensure that the most pertinent information is prioritized for subsequent processing or presentation.\n",
|
||||
"\n",
|
||||
"## Motivation\n",
|
||||
"The primary motivation for reranking in RAG systems is to overcome limitations of initial retrieval methods, which often rely on simpler similarity metrics. Reranking allows for more sophisticated relevance assessment, taking into account nuanced relationships between queries and documents that might be missed by traditional retrieval techniques. This process aims to enhance the overall performance of RAG systems by ensuring that the most relevant information is used in the generation phase.\n",
|
||||
"\n",
|
||||
"## Key Components\n",
|
||||
"Reranking systems typically include the following components:\n",
|
||||
"\n",
|
||||
"1. Initial Retriever: Often a vector store using embedding-based similarity search.\n",
|
||||
"2. Reranking Model: This can be either:\n",
|
||||
" - A Large Language Model (LLM) for scoring relevance\n",
|
||||
" - A Cross-Encoder model specifically trained for relevance assessment\n",
|
||||
"3. Scoring Mechanism: A method to assign relevance scores to documents\n",
|
||||
"4. Sorting and Selection Logic: To reorder documents based on new scores\n",
|
||||
"\n",
|
||||
"## Method Details\n",
|
||||
"The reranking process generally follows these steps:\n",
|
||||
"\n",
|
||||
"1. Initial Retrieval: Fetch an initial set of potentially relevant documents.\n",
|
||||
"2. Pair Creation: Form query-document pairs for each retrieved document.\n",
|
||||
"3. Scoring: \n",
|
||||
" - LLM Method: Use prompts to ask the LLM to rate document relevance.\n",
|
||||
" - Cross-Encoder Method: Feed query-document pairs directly into the model.\n",
|
||||
"4. Score Interpretation: Parse and normalize the relevance scores.\n",
|
||||
"5. Reordering: Sort documents based on their new relevance scores.\n",
|
||||
"6. Selection: Choose the top K documents from the reordered list.\n",
|
||||
"\n",
|
||||
"## Benefits of this Approach\n",
|
||||
"Reranking offers several advantages:\n",
|
||||
"\n",
|
||||
"1. Improved Relevance: By using more sophisticated models, reranking can capture subtle relevance factors.\n",
|
||||
"2. Flexibility: Different reranking methods can be applied based on specific needs and resources.\n",
|
||||
"3. Enhanced Context Quality: Providing more relevant documents to the RAG system improves the quality of generated responses.\n",
|
||||
"4. Reduced Noise: Reranking helps filter out less relevant information, focusing on the most pertinent content.\n",
|
||||
"\n",
|
||||
"## Conclusion\n",
|
||||
"Reranking is a powerful technique in RAG systems that significantly enhances the quality of retrieved information. Whether using LLM-based scoring or specialized Cross-Encoder models, reranking allows for more nuanced and accurate assessment of document relevance. This improved relevance translates directly to better performance in downstream tasks, making reranking an essential component in advanced RAG implementations.\n",
|
||||
"\n",
|
||||
"The choice between LLM-based and Cross-Encoder reranking methods depends on factors such as required accuracy, available computational resources, and specific application needs. Both approaches offer substantial improvements over basic retrieval methods and contribute to the overall effectiveness of RAG systems."
|
||||
]
|
||||
},
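The scoring-and-reordering loop described above can be sketched in plain Python. This is a minimal illustration only: a simple token-overlap score stands in for the LLM or Cross-Encoder relevance score, and none of the names below are LlamaIndex APIs.

```python
# Minimal reranking sketch: a token-overlap score stands in for the
# LLM / cross-encoder relevance score described above (illustrative only).

def overlap_score(query: str, doc: str) -> float:
    # Fraction of query tokens that also appear in the document.
    q = {t.strip(".,!?").lower() for t in query.split()}
    d = {t.strip(".,!?").lower() for t in doc.split()}
    return len(q & d) / len(q) if q else 0.0

def rerank(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    # Steps 2-6 above: pair, score, sort, select the top K.
    scored = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
    return scored[:top_k]

docs = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Berlin is the capital of Germany.",
]
print(rerank("capital of France", docs, top_k=2))
```

In a real pipeline the only step that changes is the scoring function: an LLM prompt or a cross-encoder forward pass replaces `overlap_score`, while the sort-and-select logic stays the same.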
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/reranking-visualization.svg\" alt=\"rerank llm\" style=\"width:100%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/reranking_comparison.svg\" alt=\"rerank llm\" style=\"width:100%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Import relevant libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import sys\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"from typing import List\n",
|
||||
"from llama_index.core import Document\n",
|
||||
"from llama_index.core import Settings\n",
|
||||
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
|
||||
"from llama_index.llms.openai import OpenAI\n",
|
||||
"from llama_index.core.readers import SimpleDirectoryReader\n",
|
||||
"from llama_index.vector_stores.faiss import FaissVectorStore\n",
|
||||
"from llama_index.core.ingestion import IngestionPipeline\n",
|
||||
"from llama_index.core.node_parser import SentenceSplitter\n",
|
||||
"from llama_index.core import VectorStoreIndex\n",
|
||||
"from llama_index.core.postprocessor import SentenceTransformerRerank, LLMRerank\n",
|
||||
"from llama_index.core import QueryBundle\n",
|
||||
"import faiss\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks\n",
|
||||
"\n",
|
||||
"# Load environment variables from a .env file\n",
|
||||
"load_dotenv()\n",
|
||||
"\n",
|
||||
"# Set the OpenAI API key environment variable\n",
|
||||
"os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
|
||||
"\n",
|
||||
"# Llamaindex global settings for llm and embeddings\n",
|
||||
"EMBED_DIMENSION=512\n",
|
||||
"Settings.llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
|
||||
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Read docs"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"path = \"../data/\"\n",
|
||||
"reader = SimpleDirectoryReader(input_dir=path, required_exts=['.pdf'])\n",
|
||||
"documents = reader.load_data()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create a vector store"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create a FaissVectorStore to store embeddings\n",
"faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
"vector_store = FaissVectorStore(faiss_index=faiss_index)"
|
||||
]
|
||||
},
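`IndexFlatL2` performs exhaustive nearest-neighbour search under squared L2 distance over all stored vectors. A dependency-free sketch of that behaviour (minus the FAISS-level optimizations) looks like this:

```python
# Exhaustive L2 nearest-neighbour search: what a flat index does,
# without the FAISS-level optimizations. Pure-Python sketch.

def l2_search(query: list[float], vectors: list[list[float]], k: int = 1) -> list[int]:
    def dist2(a: list[float], b: list[float]) -> float:
        # Squared Euclidean distance (no sqrt needed for ranking).
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(range(len(vectors)), key=lambda i: dist2(query, vectors[i]))
    return ranked[:k]  # indices of the k closest stored vectors

vecs = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]]
print(l2_search([1.0, 1.0], vecs, k=2))
```

A flat index trades speed for exactness: every query touches every vector, which is why FAISS also offers approximate index types for larger collections.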
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Ingestion Pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"base_pipeline = IngestionPipeline(\n",
|
||||
" transformations=[SentenceSplitter()],\n",
|
||||
" vector_store=vector_store,\n",
|
||||
" documents=documents\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"nodes = base_pipeline.run()"
|
||||
]
|
||||
},
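`SentenceSplitter` chunks documents along sentence boundaries up to a size budget. A rough character-based approximation of that behaviour (the real splitter is token-aware and supports overlap) can be sketched as:

```python
# Rough sketch of sentence-aware chunking: greedily pack whole
# sentences into chunks up to a character budget.

def sentence_chunks(text: str, max_chars: int = 60) -> list[str]:
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks: list[str] = []
    current = ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks

text = "RAG retrieves documents. It augments the prompt. The LLM answers."
print(sentence_chunks(text, max_chars=40))
```

Keeping whole sentences inside a chunk is what preserves local coherence for the retriever; splitting mid-sentence would hurt both embedding quality and readability of retrieved context.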
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Querying"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Method 1: LLM-based reranking of the retrieved documents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/rerank_llm.svg\" alt=\"rerank llm\" style=\"width:40%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create vector index from base nodes\n",
|
||||
"index = VectorStoreIndex(nodes)\n",
|
||||
"\n",
|
||||
"query_engine_w_llm_rerank = index.as_query_engine(\n",
|
||||
" similarity_top_k=10,\n",
|
||||
" node_postprocessors=[\n",
|
||||
" LLMRerank(\n",
|
||||
" top_n=5\n",
|
||||
" )\n",
|
||||
" ],\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"resp = query_engine_w_llm_rerank.query(\"What are the impacts of climate change on biodiversity?\")\n",
|
||||
"print(resp)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Example that demonstrates why reranking is needed"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Comparison of Retrieval Techniques\n",
|
||||
"==================================\n",
|
||||
"Query: what is the capital of france?\n",
|
||||
"\n",
|
||||
"Baseline Retrieval Result:\n",
|
||||
"\n",
|
||||
"Document 1:\n",
|
||||
"The capital of France is great.\n",
|
||||
"\n",
|
||||
"Document 2:\n",
|
||||
"The capital of France is huge.\n",
|
||||
"\n",
|
||||
"Advanced Retrieval Result:\n",
|
||||
"\n",
|
||||
"Document 1:\n",
|
||||
"Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower. I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.\n",
|
||||
"\n",
|
||||
"Document 2:\n",
|
||||
"I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chunks = [\n",
|
||||
" \"The capital of France is great.\",\n",
|
||||
" \"The capital of France is huge.\",\n",
|
||||
" \"The capital of France is beautiful.\",\n",
|
||||
" \"\"\"Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower. I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.\"\"\", \n",
|
||||
" \"I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city.\"\n",
|
||||
"]\n",
|
||||
"docs = [Document(text=sentence) for sentence in chunks]\n",
"\n",
"\n",
"def compare_rag_techniques(query: str, docs: List[Document] = docs) -> None:\n",
|
||||
" index = VectorStoreIndex.from_documents(docs)\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" print(\"Comparison of Retrieval Techniques\")\n",
|
||||
" print(\"==================================\")\n",
|
||||
" print(f\"Query: {query}\\n\")\n",
|
||||
" \n",
|
||||
" print(\"Baseline Retrieval Result:\")\n",
|
||||
" baseline_docs = index.as_retriever(similarity_top_k=5).retrieve(query)\n",
|
||||
" for i, doc in enumerate(baseline_docs[:2]): # Get only the first two retrieved docs\n",
|
||||
" print(f\"\\nDocument {i+1}:\")\n",
|
||||
" print(doc.text)\n",
|
||||
"\n",
|
||||
" print(\"\\nAdvanced Retrieval Result:\")\n",
|
||||
" reranker = LLMRerank(\n",
|
||||
" top_n=2,\n",
|
||||
" )\n",
|
||||
" advanced_docs = reranker.postprocess_nodes(\n",
|
||||
" baseline_docs, \n",
|
||||
" QueryBundle(query)\n",
|
||||
" )\n",
|
||||
" for i, doc in enumerate(advanced_docs):\n",
|
||||
" print(f\"\\nDocument {i+1}:\")\n",
|
||||
" print(doc.text)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"query = \"what is the capital of france?\"\n",
|
||||
"compare_rag_techniques(query, docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Method 2: Cross Encoder models"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/rerank_cross_encoder.svg\" alt=\"rerank cross encoder\" style=\"width:40%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"LlamaIndex has built-in support for [SBERT](https://www.sbert.net/index.html) models, which can be used directly as a node postprocessor."
|
||||
]
|
||||
},
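The key difference from a bi-encoder is that a cross-encoder scores each (query, document) pair jointly instead of embedding each side separately. A toy sketch of that pairwise scoring, with a shared-bigram count standing in for the trained model's relevance score:

```python
# Toy "cross-encoder": score each (query, document) pair jointly.
# A shared-bigram count stands in for the trained model's score.

def bigrams(text: str) -> set[tuple[str, str]]:
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

def cross_score(query: str, doc: str) -> int:
    # Joint feature of the pair: bigrams shared by query and document.
    return len(bigrams(query) & bigrams(doc))

docs = ["the capital of France is Paris", "France exports wine"]
ranked = sorted(docs, key=lambda d: cross_score("what is the capital of France", d), reverse=True)
print(ranked[0])
```

Because the pair is scored together, a cross-encoder can capture interactions between query and document wording that independent embeddings miss; the cost is one model forward pass per pair, which is why it is applied only to the top candidates from the initial retrieval.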
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query_engine_w_cross_encoder = index.as_query_engine(\n",
|
||||
" similarity_top_k=10,\n",
|
||||
" node_postprocessors=[\n",
|
||||
" SentenceTransformerRerank(\n",
|
||||
" model='cross-encoder/ms-marco-MiniLM-L-6-v2',\n",
|
||||
" top_n=5\n",
|
||||
" )\n",
|
||||
" ],\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"resp = query_engine_w_cross_encoder.query(\"What are the impacts of climate change on biodiversity?\")\n",
|
||||
"print(resp)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": ".venv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.5"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
|
||||
"Semantic chunking represents an advanced approach to document processing for retrieval systems. By attempting to maintain semantic coherence within text segments, it has the potential to improve the quality of retrieved information and enhance the performance of downstream NLP tasks. This technique is particularly valuable for processing long, complex documents where maintaining context is crucial, such as scientific papers, legal documents, or comprehensive reports."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div style=\"text-align: center;\">\n",
|
||||
"\n",
|
||||
"<img src=\"../images/semantic_chunking_comparison.svg\" alt=\"Self RAG\" style=\"width:100%; height:auto;\">\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
||||
all_rag_techniques/simple_csv_rag.ipynb
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Simple RAG (Retrieval-Augmented Generation) System for CSV Files\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
|
||||
"\n",
|
||||
"## CSV File Structure and Use Case\n",
|
||||
"The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.\n",
|
||||
"\n",
|
||||
"## Key Components\n",
|
||||
"\n",
|
||||
"1. Loading and splitting CSV files.\n",
|
||||
"2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
|
||||
"3. Retriever setup for querying the processed documents\n",
|
||||
"4. Creating a question-answering system over the CSV data.\n",
|
||||
"\n",
|
||||
"## Method Details\n",
|
||||
"\n",
|
||||
"### Document Preprocessing\n",
|
||||
"\n",
|
||||
"1. The CSV is loaded using LangChain's CSVLoader.\n",
|
||||
"2. The data is split into chunks.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### Vector Store Creation\n",
|
||||
"\n",
|
||||
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
|
||||
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
|
||||
"\n",
|
||||
"### Retriever Setup\n",
|
||||
"\n",
|
||||
"1. A retriever is configured to fetch the most relevant chunks for a given query.\n",
|
||||
"\n",
|
||||
"## Benefits of this Approach\n",
|
||||
"\n",
|
||||
"1. Scalability: Can handle large documents by processing them in chunks.\n",
|
||||
"2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
|
||||
"3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
|
||||
"4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
|
||||
"\n",
|
||||
"## Conclusion\n",
|
||||
"\n",
|
||||
"This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file."
|
||||
]
|
||||
},
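The pipeline steps above (load rows, turn them into retrievable documents, query) can be sketched without any external services. In this illustrative stand-in, each CSV row becomes one "document" string (roughly what CSVLoader produces) and a token-overlap retriever replaces the FAISS + OpenAI-embeddings store used in the notebook:

```python
import csv
import io

# Each CSV row becomes one "document" string (roughly what CSVLoader
# produces); retrieval is plain token overlap, standing in for the
# FAISS + embeddings search used in the notebook.

raw = """First Name,Last Name,Company
Sheryl,Baxter,Rasmussen Group
Preston,Lozano,Vega-Gentry"""

rows = list(csv.DictReader(io.StringIO(raw)))
docs = ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in rows]

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    q = {t.strip("?,.").lower() for t in query.split()}
    def score(d: str) -> int:
        toks = {t.strip(";:,.").lower() for t in d.replace(";", " ").replace(":", " ").split()}
        return len(q & toks)
    return sorted(documents, key=score, reverse=True)[:k]

print(retrieve("Which company does Sheryl work for?", docs))
```

The row-per-document layout is what makes CSV data work well in RAG: each retrieved chunk is a complete, self-describing record that the LLM can quote directly.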
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Import libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain_community.document_loaders.csv_loader import CSVLoader\n",
|
||||
"from pathlib import Path\n",
|
||||
"from langchain_openai import ChatOpenAI,OpenAIEmbeddings\n",
|
||||
"import os\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"\n",
|
||||
"# Load environment variables from a .env file\n",
|
||||
"load_dotenv()\n",
|
||||
"\n",
|
||||
"# Set the OpenAI API key environment variable\n",
|
||||
"os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
|
||||
"\n",
|
||||
"llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## CSV File Structure and Use Case\n",
|
||||
"The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Index</th>\n",
|
||||
" <th>Customer Id</th>\n",
|
||||
" <th>First Name</th>\n",
|
||||
" <th>Last Name</th>\n",
|
||||
" <th>Company</th>\n",
|
||||
" <th>City</th>\n",
|
||||
" <th>Country</th>\n",
|
||||
" <th>Phone 1</th>\n",
|
||||
" <th>Phone 2</th>\n",
|
||||
" <th>Email</th>\n",
|
||||
" <th>Subscription Date</th>\n",
|
||||
" <th>Website</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>DD37Cf93aecA6Dc</td>\n",
|
||||
" <td>Sheryl</td>\n",
|
||||
" <td>Baxter</td>\n",
|
||||
" <td>Rasmussen Group</td>\n",
|
||||
" <td>East Leonard</td>\n",
|
||||
" <td>Chile</td>\n",
|
||||
" <td>229.077.5154</td>\n",
|
||||
" <td>397.884.0519x718</td>\n",
|
||||
" <td>zunigavanessa@smith.info</td>\n",
|
||||
" <td>2020-08-24</td>\n",
|
||||
" <td>http://www.stephenson.com/</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>1Ef7b82A4CAAD10</td>\n",
|
||||
" <td>Preston</td>\n",
|
||||
" <td>Lozano</td>\n",
|
||||
" <td>Vega-Gentry</td>\n",
|
||||
" <td>East Jimmychester</td>\n",
|
||||
" <td>Djibouti</td>\n",
|
||||
" <td>5153435776</td>\n",
|
||||
" <td>686-620-1820x944</td>\n",
|
||||
" <td>vmata@colon.com</td>\n",
|
||||
" <td>2021-04-23</td>\n",
|
||||
" <td>http://www.hobbs.com/</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>6F94879bDAfE5a6</td>\n",
|
||||
" <td>Roy</td>\n",
|
||||
" <td>Berry</td>\n",
|
||||
" <td>Murillo-Perry</td>\n",
|
||||
" <td>Isabelborough</td>\n",
|
||||
" <td>Antigua and Barbuda</td>\n",
|
||||
" <td>+1-539-402-0259</td>\n",
|
||||
" <td>(496)978-3969x58947</td>\n",
|
||||
" <td>beckycarr@hogan.com</td>\n",
|
||||
" <td>2020-03-25</td>\n",
|
||||
" <td>http://www.lawrence.com/</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>5Cef8BFA16c5e3c</td>\n",
|
||||
" <td>Linda</td>\n",
|
||||
" <td>Olsen</td>\n",
|
||||
" <td>Dominguez, Mcmillan and Donovan</td>\n",
|
||||
" <td>Bensonview</td>\n",
|
||||
" <td>Dominican Republic</td>\n",
|
||||
" <td>001-808-617-6467x12895</td>\n",
|
||||
" <td>+1-813-324-8756</td>\n",
|
||||
" <td>stanleyblackwell@benson.org</td>\n",
|
||||
" <td>2020-06-02</td>\n",
|
||||
" <td>http://www.good-lyons.com/</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>5</td>\n",
|
||||
" <td>053d585Ab6b3159</td>\n",
|
||||
" <td>Joanna</td>\n",
|
||||
" <td>Bender</td>\n",
|
||||
" <td>Martin, Lang and Andrade</td>\n",
|
||||
" <td>West Priscilla</td>\n",
|
||||
" <td>Slovakia (Slovak Republic)</td>\n",
|
||||
" <td>001-234-203-0635x76146</td>\n",
|
||||
" <td>001-199-446-3860x3486</td>\n",
|
||||
" <td>colinalvarado@miles.net</td>\n",
|
||||
" <td>2021-04-17</td>\n",
|
||||
" <td>https://goodwin-ingram.com/</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Index Customer Id First Name Last Name \\\n",
|
||||
"0 1 DD37Cf93aecA6Dc Sheryl Baxter \n",
|
||||
"1 2 1Ef7b82A4CAAD10 Preston Lozano \n",
|
||||
"2 3 6F94879bDAfE5a6 Roy Berry \n",
|
||||
"3 4 5Cef8BFA16c5e3c Linda Olsen \n",
|
||||
"4 5 053d585Ab6b3159 Joanna Bender \n",
|
||||
"\n",
|
||||
" Company City \\\n",
|
||||
"0 Rasmussen Group East Leonard \n",
|
||||
"1 Vega-Gentry East Jimmychester \n",
|
||||
"2 Murillo-Perry Isabelborough \n",
|
||||
"3 Dominguez, Mcmillan and Donovan Bensonview \n",
|
||||
"4 Martin, Lang and Andrade West Priscilla \n",
|
||||
"\n",
|
||||
" Country Phone 1 Phone 2 \\\n",
|
||||
"0 Chile 229.077.5154 397.884.0519x718 \n",
|
||||
"1 Djibouti 5153435776 686-620-1820x944 \n",
|
||||
"2 Antigua and Barbuda +1-539-402-0259 (496)978-3969x58947 \n",
|
||||
"3 Dominican Republic 001-808-617-6467x12895 +1-813-324-8756 \n",
|
||||
"4 Slovakia (Slovak Republic) 001-234-203-0635x76146 001-199-446-3860x3486 \n",
|
||||
"\n",
|
||||
" Email Subscription Date Website \n",
|
||||
"0 zunigavanessa@smith.info 2020-08-24 http://www.stephenson.com/ \n",
|
||||
"1 vmata@colon.com 2021-04-23 http://www.hobbs.com/ \n",
|
||||
"2 beckycarr@hogan.com 2020-03-25 http://www.lawrence.com/ \n",
|
||||
"3 stanleyblackwell@benson.org 2020-06-02 http://www.good-lyons.com/ \n",
|
||||
"4 colinalvarado@miles.net 2021-04-17 https://goodwin-ingram.com/ "
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"file_path = ('../data/customers-100.csv') # insert the path of the csv file\n",
|
||||
"data = pd.read_csv(file_path)\n",
|
||||
"\n",
|
||||
"# Preview the CSV file\n",
|
||||
"data.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Load and process the CSV data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = CSVLoader(file_path=file_path)\n",
|
||||
"docs = loader.load_and_split()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Initialize the FAISS vector store and OpenAI embeddings"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import faiss\n",
|
||||
"from langchain_community.docstore.in_memory import InMemoryDocstore\n",
|
||||
"from langchain_community.vectorstores import FAISS\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()\n",
|
||||
"index = faiss.IndexFlatL2(len(embeddings.embed_query(\" \")))\n",
|
||||
"vector_store = FAISS(\n",
|
||||
" embedding_function=embeddings,\n",
|
||||
" index=index,\n",
|
||||
" docstore=InMemoryDocstore(),\n",
|
||||
" index_to_docstore_id={}\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Add the split CSV data to the vector store"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"vector_store.add_documents(documents=docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Create the retrieval chain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain_core.prompts import ChatPromptTemplate\n",
|
||||
"from langchain.chains import create_retrieval_chain\n",
|
||||
"from langchain.chains.combine_documents import create_stuff_documents_chain\n",
|
||||
"\n",
|
||||
"retriever = vector_store.as_retriever()\n",
|
||||
"\n",
|
||||
"# Set up system prompt\n",
|
||||
"system_prompt = (\n",
|
||||
" \"You are an assistant for question-answering tasks. \"\n",
|
||||
" \"Use the following pieces of retrieved context to answer \"\n",
|
||||
" \"the question. If you don't know the answer, say that you \"\n",
|
||||
" \"don't know. Use three sentences maximum and keep the \"\n",
|
||||
" \"answer concise.\"\n",
|
||||
" \"\\n\\n\"\n",
|
||||
" \"{context}\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"prompt = ChatPromptTemplate.from_messages([\n",
|
||||
" (\"system\", system_prompt),\n",
|
||||
" (\"human\", \"{input}\"),\n",
|
||||
" \n",
|
||||
"])\n",
|
||||
"\n",
|
||||
"# Create the question-answer chain\n",
|
||||
"question_answer_chain = create_stuff_documents_chain(llm, prompt)\n",
|
||||
"rag_chain = create_retrieval_chain(retriever, question_answer_chain)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Query the RAG bot with a question based on the CSV data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Sheryl Baxter works for Rasmussen Group.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"answer = rag_chain.invoke({\"input\": \"Which company does Sheryl Baxter work for?\"})\n",
|
||||
"answer['answer']"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "objenv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
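The `faiss.IndexFlatL2` index built in the notebook above is simply exhaustive squared-L2 search over all stored vectors. Here is a minimal NumPy sketch of that search (toy 4-dimensional vectors stand in for OpenAI embeddings; all names are illustrative):

```python
import numpy as np

def l2_nearest(index_vectors: np.ndarray, query: np.ndarray, k: int = 2):
    """Exhaustive squared-L2 search, the strategy behind faiss.IndexFlatL2."""
    dists = np.sum((index_vectors - query) ** 2, axis=1)  # squared L2 to every vector
    order = np.argsort(dists)[:k]                         # k smallest distances
    return order, dists[order]

# Toy "embeddings" standing in for OpenAI vectors
vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0]])
ids, dists = l2_nearest(vecs, np.array([1.0, 0.0, 0.0, 0.0]), k=2)
print(ids)  # exact match first, then the near-duplicate
```

Note that the notebook calls `embed_query(" ")` once only to discover the embedding dimension needed to construct the index.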
271
all_rag_techniques/simple_csv_rag_with_llamaindex.ipynb
Normal file
@@ -0,0 +1,271 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Simple RAG (Retrieval-Augmented Generation) System for CSV Files\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"\n",
|
||||
"This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
|
||||
"\n",
|
||||
"# CSV File Structure and Use Case\n",
|
||||
"The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.\n",
|
||||
"\n",
|
||||
"## Key Components\n",
|
||||
"\n",
|
||||
"1. Loading and splitting CSV files.\n",
|
||||
"2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
|
||||
"3. Query engine setup for querying the processed documents\n",
|
||||
"4. Creating a question-and-answer system over the CSV data.\n",
|
||||
"\n",
|
||||
"## Method Details\n",
|
||||
"\n",
|
||||
"### Document Preprocessing\n",
|
||||
"\n",
|
||||
"1. The CSV is loaded using LlamaIndex's [PagedCSVReader](https://docs.llamaindex.ai/en/stable/api_reference/readers/file/#llama_index.readers.file.PagedCSVReader)\n",
|
||||
"2. This reader converts each row into a LlamaIndex Document along with the respective column names of the table; no further splitting is applied.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### Vector Store Creation\n",
|
||||
"\n",
|
||||
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
|
||||
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
|
||||
"\n",
|
||||
"### Query Engine Setup\n",
|
||||
"\n",
|
||||
"1. A query engine is configured to fetch the most relevant chunks for a given query, then answer the question.\n",
|
||||
"\n",
|
||||
"## Benefits of this Approach\n",
|
||||
"\n",
|
||||
"1. Scalability: Can handle large documents by processing them in chunks.\n",
|
||||
"2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
|
||||
"3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
|
||||
"4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
|
||||
"\n",
|
||||
"## Conclusion\n",
|
||||
"\n",
|
||||
"This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Imports & Environment Variables "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from llama_index.core.readers import SimpleDirectoryReader\n",
|
||||
"from llama_index.core import Settings\n",
|
||||
"from llama_index.llms.openai import OpenAI\n",
|
||||
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
|
||||
"from llama_index.readers.file import PagedCSVReader\n",
|
||||
"from llama_index.vector_stores.faiss import FaissVectorStore\n",
|
||||
"from llama_index.core.ingestion import IngestionPipeline\n",
|
||||
"from llama_index.core import VectorStoreIndex\n",
|
||||
"import faiss\n",
|
||||
"import os\n",
|
||||
"import pandas as pd\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Load environment variables from a .env file\n",
|
||||
"load_dotenv()\n",
|
||||
"\n",
|
||||
"# Set the OpenAI API key environment variable\n",
|
||||
"os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Llamaindex global settings for llm and embeddings\n",
|
||||
"EMBED_DIMENSION=512\n",
|
||||
"Settings.llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
|
||||
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### CSV File Structure and Use Case\n",
|
||||
"The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"file_path = ('../data/customers-100.csv') # insert the path of the csv file\n",
|
||||
"data = pd.read_csv(file_path)\n",
|
||||
"\n",
|
||||
"# Preview the csv file\n",
|
||||
"data.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Vector Store"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create a FaissVectorStore to store embeddings\n",
|
||||
"faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
|
||||
"vector_store = FaissVectorStore(faiss_index=faiss_index)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Load and Process CSV Data as Document"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"csv_reader = PagedCSVReader()\n",
|
||||
"\n",
|
||||
"reader = SimpleDirectoryReader( \n",
|
||||
" input_files=[file_path],\n",
|
||||
" file_extractor= {\".csv\": csv_reader}\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
"docs = reader.load_data()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Index: 1\n",
|
||||
"Customer Id: DD37Cf93aecA6Dc\n",
|
||||
"First Name: Sheryl\n",
|
||||
"Last Name: Baxter\n",
|
||||
"Company: Rasmussen Group\n",
|
||||
"City: East Leonard\n",
|
||||
"Country: Chile\n",
|
||||
"Phone 1: 229.077.5154\n",
|
||||
"Phone 2: 397.884.0519x718\n",
|
||||
"Email: zunigavanessa@smith.info\n",
|
||||
"Subscription Date: 2020-08-24\n",
|
||||
"Website: http://www.stephenson.com/\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Check a sample chunk\n",
|
||||
"print(docs[0].text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Ingestion Pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pipeline = IngestionPipeline(\n",
|
||||
" vector_store=vector_store,\n",
|
||||
" documents=docs\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"nodes = pipeline.run()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create Query Engine"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"vector_store_index = VectorStoreIndex(nodes)\n",
|
||||
"query_engine = vector_store_index.as_query_engine(similarity_top_k=2)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Query the RAG Bot with a Question Based on the CSV Data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Rasmussen Group'"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"response = query_engine.query(\"Which company does Sheryl Baxter work for?\")\n",
|
||||
"response.response"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "objenv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.5"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
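As the sample output in the notebook above shows, `PagedCSVReader` turns each CSV row into a "Column: value" text block. A rough standard-library stand-in for that conversion (illustrative only; the real reader also handles separators, quoting, and document metadata):

```python
import csv
import io

# A two-row excerpt of the customers CSV used in the notebook
CSV_TEXT = """Index,Customer Id,First Name
1,DD37Cf93aecA6Dc,Sheryl
2,1Ef7b82A4CAAD10,Preston
"""

def rows_to_docs(csv_text: str):
    """One 'Column: value' text block per CSV row, mimicking PagedCSVReader."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{name}: {value}" for name, value in row.items())
            for row in reader]

docs = rows_to_docs(CSV_TEXT)
print(docs[0])
```

Because every row becomes its own document, a retrieved chunk always carries the full set of column names for that customer, which is what makes the Q&A step work without any further splitting.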
@@ -77,16 +77,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/Users/tevfikcagridural/anaconda3/envs/ragTechs/lib/python3.11/site-packages/deepeval/__init__.py:45: UserWarning: You are using deepeval version 0.21.70, however version 1.1.0 is available. You should consider upgrading via the \"pip install --upgrade deepeval\" command.\n",
|
||||
" warnings.warn(\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from typing import List\n",
|
||||
"from llama_index.core import SimpleDirectoryReader, VectorStoreIndex\n",
|
||||
@@ -102,9 +93,6 @@
|
||||
"from dotenv import load_dotenv\n",
|
||||
"\n",
|
||||
"sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path sicnce we work with notebooks\n",
|
||||
"from helper_functions_llama_index import show_context\n",
|
||||
"from evaluation.evalute_rag_llamaindex import evaluate_rag\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"EMBED_DIMENSION = 512\n",
|
||||
"\n",
|
||||
@@ -133,23 +121,9 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Doc ID: f8fafc11-dbb3-430a-a17f-a3fb24ceac0a\n",
|
||||
"Text: Understanding Climate Change Chapter 1: Introduction to\n",
|
||||
"Climate Change Climate change refers to significant, long -term\n",
|
||||
"changes in the global climate. The term \"global climate\" encompasses\n",
|
||||
"the planet's overall weather patterns, including temperature,\n",
|
||||
"precipitation, and wind patterns, over an extended period. Over the\n",
|
||||
"past cent ury, human ...\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"path = \"../data/\"\n",
|
||||
"node_parser = SimpleDirectoryReader(input_dir=path, required_exts=['.pdf'])\n",
|
||||
@@ -171,8 +145,8 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create FaisVectorStore to store embeddings\n",
|
||||
"fais_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
|
||||
"vector_store = FaissVectorStore(faiss_index=fais_index)"
|
||||
"faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
|
||||
"vector_store = FaissVectorStore(faiss_index=faiss_index)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -197,7 +171,7 @@
|
||||
" \n",
|
||||
" for node in nodes:\n",
|
||||
" node.text = node.text.replace('\\t', ' ') # Replace tabs with spaces\n",
|
||||
" node.text = node.text.replace(' \\n', ' ') # Replace paragprah seperator with spacaes\n",
|
||||
" node.text = node.text.replace(' \\n', ' ') # Replace paragraph separator with spaces\n",
|
||||
" \n",
|
||||
" return nodes"
|
||||
]
|
||||
@@ -265,6 +239,27 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def show_context(context):\n",
|
||||
" \"\"\"\n",
|
||||
" Display the contents of the provided context list.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" context (list): A list of context items to be displayed.\n",
|
||||
"\n",
|
||||
" Prints each context item in the list with a heading indicating its position.\n",
|
||||
" \"\"\"\n",
|
||||
" for i, c in enumerate(context):\n",
|
||||
" print(f\"Context {i+1}:\")\n",
|
||||
" print(c.text)\n",
|
||||
" print(\"\\n\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
@@ -296,16 +291,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from typing import List, Tuple\n",
|
||||
"\n",
|
||||
"from deepeval import evaluate\n",
|
||||
"from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric\n",
|
||||
"from deepeval.test_case import LLMTestCase, LLMTestCaseParams\n",
|
||||
"from deepeval.test_case import LLMTestCaseParams\n",
|
||||
"from evaluation.evalute_rag import create_deep_eval_test_cases\n",
|
||||
"\n",
|
||||
"# Set llm model for evaluation of the question and answers \n",
|
||||
|
||||
@@ -1,43 +1,123 @@
|
||||
import time
|
||||
import os
|
||||
import sys
|
||||
import argparse
|
||||
from dotenv import load_dotenv
|
||||
|
||||
sys.path.append(os.path.abspath(
|
||||
os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path sicnce we work with notebooks
|
||||
from helper_functions import *
|
||||
from evaluation.evalute_rag import *
|
||||
|
||||
from langchain_experimental.text_splitter import SemanticChunker
|
||||
from langchain_experimental.text_splitter import SemanticChunker, BreakpointThresholdType
|
||||
from langchain_openai.embeddings import OpenAIEmbeddings
|
||||
|
||||
# Load environment variables from a .env file
|
||||
load_dotenv()
|
||||
# Add the parent directory to the path since we work with notebooks
|
||||
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
|
||||
|
||||
# Set the OpenAI API key environment variable
|
||||
# Load environment variables from a .env file (e.g., OpenAI API key)
|
||||
load_dotenv()
|
||||
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
|
||||
|
||||
# Define file path
|
||||
path = "../data/Understanding_Climate_Change.pdf"
|
||||
|
||||
# Read PDF to string
|
||||
content = read_pdf_to_string(path)
|
||||
# Function to run semantic chunking and return chunking and retrieval times
|
||||
class SemanticChunkingRAG:
|
||||
"""
|
||||
A class to handle the Semantic Chunking RAG process for document chunking and query retrieval.
|
||||
"""
|
||||
|
||||
# Breakpoint types:
|
||||
# * 'interquartile': the interquartile distance is used to split chunks.
|
||||
def __init__(self, path, n_retrieved=2, embeddings=None, breakpoint_type: BreakpointThresholdType = "percentile",
|
||||
breakpoint_amount=90):
|
||||
"""
|
||||
Initializes the SemanticChunkingRAG by encoding the content using a semantic chunker.
|
||||
|
||||
Args:
|
||||
path (str): Path to the PDF file to encode.
|
||||
n_retrieved (int): Number of chunks to retrieve for each query (default: 2).
|
||||
embeddings: Embedding model to use.
|
||||
breakpoint_type (str): Type of semantic breakpoint threshold.
|
||||
breakpoint_amount (float): Amount for the semantic breakpoint threshold.
|
||||
"""
|
||||
print("\n--- Initializing Semantic Chunking RAG ---")
|
||||
# Read PDF to string
|
||||
content = read_pdf_to_string(path)
|
||||
|
||||
# Use provided embeddings or initialize OpenAI embeddings
|
||||
self.embeddings = embeddings if embeddings else OpenAIEmbeddings()
|
||||
|
||||
# Initialize the semantic chunker
|
||||
self.semantic_chunker = SemanticChunker(
|
||||
self.embeddings,
|
||||
breakpoint_threshold_type=breakpoint_type,
|
||||
breakpoint_threshold_amount=breakpoint_amount
|
||||
)
|
||||
|
||||
# Measure time for semantic chunking
|
||||
start_time = time.time()
|
||||
self.semantic_docs = self.semantic_chunker.create_documents([content])
|
||||
self.time_records = {'Chunking': time.time() - start_time}
|
||||
print(f"Semantic Chunking Time: {self.time_records['Chunking']:.2f} seconds")
|
||||
|
||||
# Create a vector store and retriever from the semantic chunks
|
||||
self.semantic_vectorstore = FAISS.from_documents(self.semantic_docs, self.embeddings)
|
||||
self.semantic_retriever = self.semantic_vectorstore.as_retriever(search_kwargs={"k": n_retrieved})
|
||||
|
||||
def run(self, query):
|
||||
"""
|
||||
Retrieves and displays the context for the given query.
|
||||
|
||||
Args:
|
||||
query (str): The query to retrieve context for.
|
||||
|
||||
Returns:
|
||||
dict: The recorded chunking and retrieval times.
|
||||
"""
|
||||
# Measure time for semantic retrieval
|
||||
start_time = time.time()
|
||||
semantic_context = retrieve_context_per_question(query, self.semantic_retriever)
|
||||
self.time_records['Retrieval'] = time.time() - start_time
|
||||
print(f"Semantic Retrieval Time: {self.time_records['Retrieval']:.2f} seconds")
|
||||
|
||||
# Display the retrieved context
|
||||
show_context(semantic_context)
|
||||
return self.time_records
|
||||
|
||||
|
||||
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type='percentile',
|
||||
breakpoint_threshold_amount=90) # chose which embeddings and breakpoint type and threshold to use
|
||||
# Function to parse command line arguments
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Process a PDF document with semantic chunking RAG.")
|
||||
parser.add_argument("--path", type=str, default="../data/Understanding_Climate_Change.pdf",
|
||||
help="Path to the PDF file to encode.")
|
||||
parser.add_argument("--n_retrieved", type=int, default=2,
|
||||
help="Number of chunks to retrieve for each query (default: 2).")
|
||||
parser.add_argument("--breakpoint_threshold_type", type=str,
|
||||
choices=["percentile", "standard_deviation", "interquartile", "gradient"],
|
||||
default="percentile",
|
||||
help="Type of breakpoint threshold to use for chunking (default: percentile).")
|
||||
parser.add_argument("--breakpoint_threshold_amount", type=float, default=90,
|
||||
help="Amount of the breakpoint threshold to use (default: 90).")
|
||||
parser.add_argument("--chunk_size", type=int, default=1000,
|
||||
help="Size of each text chunk in simple chunking (default: 1000).")
|
||||
parser.add_argument("--chunk_overlap", type=int, default=200,
|
||||
help="Overlap between consecutive chunks in simple chunking (default: 200).")
|
||||
parser.add_argument("--query", type=str, default="What is the main cause of climate change?",
|
||||
help="Query to test the retriever (default: 'What is the main cause of climate change?').")
|
||||
parser.add_argument("--experiment", action="store_true",
|
||||
help="Run the experiment to compare performance between semantic chunking and simple chunking.")
|
||||
|
||||
# Split original text to semantic chunks
|
||||
docs = text_splitter.create_documents([content])
|
||||
return parser.parse_args()
|
||||
|
||||
# Create vector store and retriever
|
||||
embeddings = OpenAIEmbeddings()
|
||||
vectorstore = FAISS.from_documents(docs, embeddings)
|
||||
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
|
||||
|
||||
# Test the retriever
|
||||
test_query = "What is the main cause of climate change?"
|
||||
context = retrieve_context_per_question(test_query, chunks_query_retriever)
|
||||
show_context(context)
|
||||
# Main function to process PDF, chunk text, and test retriever
|
||||
def main(args):
|
||||
# Initialize SemanticChunkingRAG
|
||||
semantic_rag = SemanticChunkingRAG(
|
||||
path=args.path,
|
||||
n_retrieved=args.n_retrieved,
|
||||
breakpoint_type=args.breakpoint_threshold_type,
|
||||
breakpoint_amount=args.breakpoint_threshold_amount
|
||||
)
|
||||
|
||||
# Run a query
|
||||
semantic_rag.run(args.query)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
# Call the main function with parsed arguments
|
||||
main(parse_args())
|
||||
|
||||
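The `percentile` breakpoint type used in the script above splits wherever the embedding distance between consecutive sentences exceeds a percentile of all such distances. A toy sketch of that rule with hand-made 2-D "embeddings" (illustrative; `SemanticChunker`'s real implementation also handles sentence splitting and buffering):

```python
import numpy as np

def percentile_breakpoints(embs: np.ndarray, percentile: float = 90.0):
    """Indices i where the cosine distance between sentence i and i+1
    exceeds the given percentile of all consecutive-sentence distances."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    dists = 1.0 - np.sum(normed[:-1] * normed[1:], axis=1)  # cosine distances
    threshold = np.percentile(dists, percentile)
    return [i for i, d in enumerate(dists) if d > threshold]

# Sentences 0-2 are similar, 3-4 are similar, so the only
# semantic breakpoint falls between index 2 and 3.
embs = np.array([[1.0, 0.0], [0.99, 0.05], [0.98, 0.10],
                 [0.0, 1.0], [0.05, 0.99]])
breaks = percentile_breakpoints(embs, percentile=75.0)
print(breaks)  # [2]
```

Raising `breakpoint_threshold_amount` makes splits rarer and chunks larger; the other threshold types (`standard_deviation`, `interquartile`, `gradient`) only change how the cutoff is derived from the same distance series.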
@@ -1,21 +1,65 @@
|
||||
from helper_functions import *
|
||||
from evaluation.evalute_rag import *
|
||||
|
||||
import os
|
||||
import sys
|
||||
import argparse
|
||||
import time
|
||||
from dotenv import load_dotenv
|
||||
from helper_functions import *
|
||||
from evaluation.evalute_rag import *
|
||||
|
||||
# Add the parent directory to the path since we work with notebooks
|
||||
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
|
||||
|
||||
# Load environment variables from a .env file (e.g., OpenAI API key)
|
||||
load_dotenv()
|
||||
|
||||
# Set the OpenAI API key environment variable
|
||||
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
|
||||
|
||||
|
||||
class SimpleRAG:
|
||||
"""
|
||||
A class to handle the Simple RAG process for document chunking and query retrieval.
|
||||
"""
|
||||
|
||||
def __init__(self, path, chunk_size=1000, chunk_overlap=200, n_retrieved=2):
|
||||
"""
|
||||
Initializes the SimpleRAGRetriever by encoding the PDF document and creating the retriever.
|
||||
|
||||
Args:
|
||||
path (str): Path to the PDF file to encode.
|
||||
chunk_size (int): Size of each text chunk (default: 1000).
|
||||
chunk_overlap (int): Overlap between consecutive chunks (default: 200).
|
||||
n_retrieved (int): Number of chunks to retrieve for each query (default: 2).
|
||||
"""
|
||||
print("\n--- Initializing Simple RAG Retriever ---")
|
||||
|
||||
# Encode the PDF document into a vector store using OpenAI embeddings
|
||||
start_time = time.time()
|
||||
self.vector_store = encode_pdf(path, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
|
||||
self.time_records = {'Chunking': time.time() - start_time}
|
||||
print(f"Chunking Time: {self.time_records['Chunking']:.2f} seconds")
|
||||
|
||||
# Create a retriever from the vector store
|
||||
self.chunks_query_retriever = self.vector_store.as_retriever(search_kwargs={"k": n_retrieved})
|
||||
|
||||
def run(self, query):
|
||||
"""
|
||||
Retrieves and displays the context for the given query.
|
||||
|
||||
Args:
|
||||
query (str): The query to retrieve context for.
|
||||
|
||||
Returns:
|
||||
None. The retrieval time is stored in self.time_records.
|
||||
"""
|
||||
# Measure time for retrieval
|
||||
start_time = time.time()
|
||||
context = retrieve_context_per_question(query, self.chunks_query_retriever)
|
||||
self.time_records['Retrieval'] = time.time() - start_time
|
||||
print(f"Retrieval Time: {self.time_records['Retrieval']:.2f} seconds")
|
||||
|
||||
# Display the retrieved context
|
||||
show_context(context)
|
||||
|
||||
|
||||
# Function to validate command line inputs
|
||||
def validate_args(args):
|
||||
if args.chunk_size <= 0:
|
||||
@@ -29,7 +73,7 @@ def validate_args(args):
|
||||
|
||||
# Function to parse command line arguments
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser(description="Encode a PDF document and test a retriever.")
|
||||
parser = argparse.ArgumentParser(description="Encode a PDF document and test a simple RAG.")
|
||||
parser.add_argument("--path", type=str, default="../data/Understanding_Climate_Change.pdf",
|
||||
help="Path to the PDF file to encode.")
|
||||
parser.add_argument("--chunk_size", type=int, default=1000,
|
||||
@@ -47,21 +91,22 @@ def parse_args():
|
||||
return validate_args(parser.parse_args())
|
||||
|
||||
|
||||
# Main function to encode PDF, retrieve context, and optionally evaluate retriever
|
||||
# Main function to handle argument parsing and call the SimpleRAGRetriever class
|
||||
def main(args):
|
||||
# Encode the PDF document into a vector store using OpenAI embeddings
|
||||
chunks_vector_store = encode_pdf(args.path, chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap)
|
||||
# Initialize the SimpleRAGRetriever
|
||||
simple_rag = SimpleRAG(
|
||||
path=args.path,
|
||||
chunk_size=args.chunk_size,
|
||||
chunk_overlap=args.chunk_overlap,
|
||||
n_retrieved=args.n_retrieved
|
||||
)
|
||||
|
||||
# Create a retriever from the vector store, specifying how many chunks to retrieve
|
||||
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": args.n_retrieved})
|
||||
|
||||
# Test the retriever with the user's query
|
||||
context = retrieve_context_per_question(args.query, chunks_query_retriever)
|
||||
show_context(context) # Display the context retrieved for the query
|
||||
# Retrieve context based on the query
|
||||
simple_rag.run(args.query)
|
||||
|
||||
# Evaluate the retriever's performance on the query (if requested)
|
||||
if args.evaluate:
|
||||
evaluate_rag(chunks_query_retriever)
|
||||
evaluate_rag(simple_rag.chunks_query_retriever)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
||||
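The `--chunk_size`/`--chunk_overlap` parameters above describe fixed-size character chunking with overlap. A minimal sketch of that scheme (the script itself delegates to `encode_pdf` and a LangChain splitter, so this is illustrative only):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Fixed-size character chunks; consecutive chunks share `chunk_overlap` chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of indexing some text twice.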
data/customers-100.csv (new file, 101 lines)
@@ -0,0 +1,101 @@
Index,Customer Id,First Name,Last Name,Company,City,Country,Phone 1,Phone 2,Email,Subscription Date,Website
1,DD37Cf93aecA6Dc,Sheryl,Baxter,Rasmussen Group,East Leonard,Chile,229.077.5154,397.884.0519x718,zunigavanessa@smith.info,2020-08-24,http://www.stephenson.com/
2,1Ef7b82A4CAAD10,Preston,Lozano,Vega-Gentry,East Jimmychester,Djibouti,5153435776,686-620-1820x944,vmata@colon.com,2021-04-23,http://www.hobbs.com/
3,6F94879bDAfE5a6,Roy,Berry,Murillo-Perry,Isabelborough,Antigua and Barbuda,+1-539-402-0259,(496)978-3969x58947,beckycarr@hogan.com,2020-03-25,http://www.lawrence.com/
4,5Cef8BFA16c5e3c,Linda,Olsen,"Dominguez, Mcmillan and Donovan",Bensonview,Dominican Republic,001-808-617-6467x12895,+1-813-324-8756,stanleyblackwell@benson.org,2020-06-02,http://www.good-lyons.com/
5,053d585Ab6b3159,Joanna,Bender,"Martin, Lang and Andrade",West Priscilla,Slovakia (Slovak Republic),001-234-203-0635x76146,001-199-446-3860x3486,colinalvarado@miles.net,2021-04-17,https://goodwin-ingram.com/
6,2d08FB17EE273F4,Aimee,Downs,Steele Group,Chavezborough,Bosnia and Herzegovina,(283)437-3886x88321,999-728-1637,louis27@gilbert.com,2020-02-25,http://www.berger.net/
7,EA4d384DfDbBf77,Darren,Peck,"Lester, Woodard and Mitchell",Lake Ana,Pitcairn Islands,(496)452-6181x3291,+1-247-266-0963x4995,tgates@cantrell.com,2021-08-24,https://www.le.com/
8,0e04AFde9f225dE,Brett,Mullen,"Sanford, Davenport and Giles",Kimport,Bulgaria,001-583-352-7197x297,001-333-145-0369,asnow@colon.com,2021-04-12,https://hammond-ramsey.com/
9,C2dE4dEEc489ae0,Sheryl,Meyers,Browning-Simon,Robersonstad,Cyprus,854-138-4911x5772,+1-448-910-2276x729,mariokhan@ryan-pope.org,2020-01-13,https://www.bullock.net/
10,8C2811a503C7c5a,Michelle,Gallagher,Beck-Hendrix,Elaineberg,Timor-Leste,739.218.2516x459,001-054-401-0347x617,mdyer@escobar.net,2021-11-08,https://arias.com/
11,216E205d6eBb815,Carl,Schroeder,"Oconnell, Meza and Everett",Shannonville,Guernsey,637-854-0256x825,114.336.0784x788,kirksalas@webb.com,2021-10-20,https://simmons-hurley.com/
12,CEDec94deE6d69B,Jenna,Dodson,"Hoffman, Reed and Mcclain",East Andrea,Vietnam,(041)737-3846,+1-556-888-3485x42608,mark42@robbins.com,2020-11-29,http://www.douglas.net/
13,e35426EbDEceaFF,Tracey,Mata,Graham-Francis,South Joannamouth,Togo,001-949-844-8787,(855)713-8773,alex56@walls.org,2021-12-02,http://www.beck.com/
14,A08A8aF8BE9FaD4,Kristine,Cox,Carpenter-Cook,Jodyberg,Sri Lanka,786-284-3358x62152,+1-315-627-1796x8074,holdenmiranda@clarke.com,2021-02-08,https://www.brandt.com/
15,6fEaA1b7cab7B6C,Faith,Lutz,Carter-Hancock,Burchbury,Singapore,(781)861-7180x8306,207-185-3665,cassieparrish@blevins-chapman.net,2022-01-26,http://stevenson.org/
16,8cad0b4CBceaeec,Miranda,Beasley,Singleton and Sons,Desireeshire,Oman,540.085.3135x185,+1-600-462-6432x21881,vduncan@parks-hardy.com,2022-04-12,http://acosta.org/
17,a5DC21AE3a21eaA,Caroline,Foley,Winters-Mendoza,West Adriennestad,Western Sahara,936.222.4746x9924,001-469-948-6341x359,holtgwendolyn@watson-davenport.com,2021-03-10,http://www.benson-roth.com/
18,F8Aa9d6DfcBeeF8,Greg,Mata,Valentine LLC,Lake Leslie,Mozambique,(701)087-2415,(195)156-1861x26241,jaredjuarez@carroll.org,2022-03-26,http://pitts-cherry.com/
19,F160f5Db3EfE973,Clifford,Jacobson,Simon LLC,Harmonview,South Georgia and the South Sandwich Islands,001-151-330-3524x0469,(748)477-7174,joseph26@jacobson.com,2020-09-24,https://mcconnell.com/
20,0F60FF3DdCd7aB0,Joanna,Kirk,Mays-Mccormick,Jamesshire,French Polynesia,(266)131-7001x711,(283)312-5579x11543,tuckerangie@salazar.net,2021-09-24,https://www.camacho.net/
21,9F9AdB7B8A6f7F2,Maxwell,Frye,Patterson Inc,East Carly,Malta,423.262.3059,202-880-0688x7491,fgibson@drake-webb.com,2022-01-12,http://www.roberts.com/
22,FBd0Ded4F02a742,Kiara,Houston,"Manning, Hester and Arroyo",South Alvin,Netherlands,001-274-040-3582x10611,+1-528-175-0973x4684,blanchardbob@wallace-shannon.com,2020-09-15,https://www.reid-potts.com/
23,2FB0FAA1d429421,Colleen,Howard,Greer and Sons,Brittanyview,Paraguay,1935085151,(947)115-7711x5488,rsingleton@ryan-cherry.com,2020-08-19,http://paul.biz/
24,010468dAA11382c,Janet,Valenzuela,Watts-Donaldson,Veronicamouth,Lao People's Democratic Republic,354.259.5062x7538,500.433.2022,stefanie71@spence.com,2020-09-08,https://moreno.biz/
25,eC1927Ca84E033e,Shane,Wilcox,Tucker LLC,Bryanville,Albania,(429)005-9030x11004,541-116-4501,mariah88@santos.com,2021-04-06,https://www.ramos.com/
26,09D7D7C8Fe09aea,Marcus,Moody,Giles Ltd,Kaitlyntown,Panama,674-677-8623,909-277-5485x566,donnamullins@norris-barrett.org,2022-05-24,https://www.curry.com/
27,aBdfcF2c50b0bfD,Dakota,Poole,Simmons Group,Michealshire,Belarus,(371)987-8576x4720,071-152-1376,stacey67@fields.org,2022-02-20,https://sanford-wilcox.biz/
28,b92EBfdF8a3f0E6,Frederick,Harper,"Hinton, Chaney and Stokes",South Marissatown,Switzerland,+1-077-121-1558x0687,264.742.7149,jacobkhan@bright.biz,2022-05-26,https://callahan.org/
29,3B5dAAFA41AFa22,Stefanie,Fitzpatrick,Santana-Duran,Acevedoville,Saint Vincent and the Grenadines,(752)776-3286,+1-472-021-4814x85074,wterrell@clark.com,2020-07-30,https://meyers.com/
30,EDA69ca7a6e96a2,Kent,Bradshaw,Sawyer PLC,North Harold,Tanzania,+1-472-143-5037x884,126.922.6153,qjimenez@boyd.com,2020-04-26,http://maynard-ho.com/
31,64DCcDFaB9DFd4e,Jack,Tate,"Acosta, Petersen and Morrow",West Samuel,Zimbabwe,965-108-4406x20714,046.906.1442x6784,gfigueroa@boone-zavala.com,2021-09-15,http://www.hawkins-ramsey.com/
32,679c6c83DD872d6,Tom,Trujillo,Mcgee Group,Cunninghamborough,Denmark,416-338-3758,(775)890-7209,tapiagreg@beard.info,2022-01-13,http://www.daniels-klein.com/
33,7Ce381e4Afa4ba9,Gabriel,Mejia,Adkins-Salinas,Port Annatown,Liechtenstein,4077245425,646.044.0696x66800,coleolson@jennings.net,2021-04-24,https://patel-hanson.info/
34,A09AEc6E3bF70eE,Kaitlyn,Santana,Herrera Group,New Kaitlyn,United States of America,6303643286,447-710-6202x07313,georgeross@miles.org,2021-09-21,http://pham.com/
35,aA9BAFfBc3710fe,Faith,Moon,"Waters, Chase and Aguilar",West Marthaburgh,Bahamas,+1-586-217-0359x6317,+1-818-199-1403,willistonya@randolph-baker.com,2021-11-03,https://spencer-charles.info/
36,E11dfb2DB8C9f72,Tammie,Haley,"Palmer, Barnes and Houston",East Teresa,Belize,001-276-734-4113x6087,(430)300-8770,harrisisaiah@jenkins.com,2022-01-04,http://evans-simon.com/
37,889eCf90f68c5Da,Nicholas,Sosa,Jordan Ltd,South Hunter,Uruguay,(661)425-6042,975-998-1519,fwolfe@dorsey.com,2021-08-10,https://www.fleming-richards.com/
38,7a1Ee69F4fF4B4D,Jordan,Gay,Glover and Sons,South Walter,Solomon Islands,7208417020,8035336772,tiffanydavies@harris-mcfarland.org,2021-02-24,http://www.lee.org/
39,dca4f1D0A0fc5c9,Bruce,Esparza,Huerta-Mclean,Poolefurt,Montenegro,559-529-4424,001-625-000-7132x0367,preese@frye-vega.com,2021-10-22,http://www.farley.org/
40,17aD8e2dB3df03D,Sherry,Garza,Anderson Ltd,West John,Poland,001-067-713-6440x158,(978)289-8785x5766,ann48@miller.com,2021-11-01,http://spence.com/
41,2f79Cd309624Abb,Natalie,Gentry,Monroe PLC,West Darius,Dominican Republic,830.996.8238,499.122.5415,tcummings@fitzpatrick-ashley.com,2020-10-10,http://www.dorsey.biz/
42,6e5ad5a5e2bB5Ca,Bryan,Dunn,Kaufman and Sons,North Jimstad,Burkina Faso,001-710-802-5565,078.699.8982x13881,woodwardandres@phelps.com,2021-09-08,http://www.butler.com/
43,7E441b6B228DBcA,Wayne,Simpson,Perkins-Trevino,East Rebekahborough,Bolivia,(344)156-8632x1869,463-445-3702x38463,barbarapittman@holder.com,2020-12-13,https://gillespie-holder.com/
44,D3fC11A9C235Dc6,Luis,Greer,Cross PLC,North Drew,Bulgaria,001-336-025-6849x701,684.698.2911x6092,bstuart@williamson-mcclure.com,2022-05-15,https://fletcher-nielsen.com/
45,30Dfa48fe5Ede78,Rhonda,Frost,"Herrera, Shepherd and Underwood",Lake Lindaburgh,Monaco,(127)081-9339,+1-431-028-3337x3492,zkrueger@wolf-chavez.net,2021-12-06,http://www.khan.com/
46,fD780ED8dbEae7B,Joanne,Montes,"Price, Sexton and Mcdaniel",Gwendolynview,Palau,(897)726-7952,(467)886-9467x5721,juan80@henson.net,2020-07-01,http://ochoa.com/
47,300A40d3ce24bBA,Geoffrey,Guzman,Short-Wiggins,Zimmermanland,Uzbekistan,975.235.8921x269,(983)188-6873,bauercrystal@gay.com,2020-04-23,https://decker-kline.com/
48,283DFCD0Dba40aF,Gloria,Mccall,"Brennan, Acosta and Ramos",North Kerriton,Ghana,445-603-6729,001-395-959-4736x4524,bartlettjenna@zuniga-moss.biz,2022-03-11,http://burgess-frank.com/
49,F4Fc91fEAEad286,Brady,Cohen,Osborne-Erickson,North Eileenville,United Arab Emirates,741.849.0139x524,+1-028-691-7497x0894,mccalltyrone@durham-rose.biz,2022-03-10,http://hammond-barron.com/
50,80F33Fd2AcebF05,Latoya,Mccann,"Hobbs, Garrett and Sanford",Port Sergiofort,Belarus,(530)287-4548x29481,162-234-0249x32790,bobhammond@barry.biz,2021-12-02,https://www.burton.com/
51,Aa20BDe68eAb0e9,Gerald,Hawkins,"Phelps, Forbes and Koch",New Alberttown,Canada,+1-323-239-1456x96168,(092)508-0269,uwarner@steele-arias.com,2021-03-19,https://valenzuela.com/
52,e898eEB1B9FE22b,Samuel,Crawford,"May, Goodwin and Martin",South Jasmine,Algeria,802-242-7457,626.116.9535x8578,xpittman@ritter-carney.net,2021-03-27,https://guerrero.org/
53,faCEF517ae7D8eB,Patricia,Goodwin,"Christian, Winters and Ellis",Cowanfort,Swaziland,322.549.7139x70040,(111)741-4173,vaughanchristy@lara.biz,2021-03-08,http://clark.info/
54,c09952De6Cda8aA,Stacie,Richard,Byrd Inc,New Deborah,Madagascar,001-622-948-3641x24810,001-731-168-2893x8891,clinton85@colon-arias.org,2020-10-15,https://kim.com/
55,f3BEf3Be028166f,Robin,West,"Nixon, Blackwell and Sosa",Wallstown,Ecuador,698.303.4267,001-683-837-7651x525,greenemiranda@zimmerman.com,2022-01-13,https://www.mora.com/
56,C6F2Fc6a7948a4e,Ralph,Haas,Montes PLC,Lake Ellenchester,Palestinian Territory,2239271999,001-962-434-0867x649,goodmancesar@figueroa.biz,2020-05-25,http://may.com/
57,c8FE57cBBdCDcb2,Phyllis,Maldonado,Costa PLC,Lake Whitney,Saint Barthelemy,4500370767,001-508-064-6725x017,yhanson@warner-diaz.org,2021-01-25,http://www.bernard.com/
58,B5acdFC982124F2,Danny,Parrish,Novak LLC,East Jaredbury,United Arab Emirates,(669)384-8597x8794,506.731.5952x571,howelldarren@house-cohen.com,2021-03-17,http://www.parsons-hudson.com/
59,8c7DdF10798bCC3,Kathy,Hill,"Moore, Mccoy and Glass",Selenabury,South Georgia and the South Sandwich Islands,001-171-716-2175x310,888.625.0654,ncamacho@boone-simmons.org,2020-11-15,http://hayden.com/
60,C681dDd0cc422f7,Kelli,Hardy,Petty Ltd,Huangfort,Sao Tome and Principe,020.324.2191x2022,424-157-8216,kristopher62@oliver.com,2020-12-20,http://www.kidd.com/
61,a940cE42e035F28,Lynn,Pham,"Brennan, Camacho and Tapia",East Pennyshire,Portugal,846.468.6834x611,001-248-691-0006,mpham@rios-guzman.com,2020-08-21,https://www.murphy.com/
62,9Cf5E6AFE0aeBfd,Shelley,Harris,"Prince, Malone and Pugh",Port Jasminborough,Togo,423.098.0315x8373,+1-386-458-8944x15194,zachary96@mitchell-bryant.org,2020-12-10,https://www.ryan.com/
63,aEcbe5365BbC67D,Eddie,Jimenez,Caldwell Group,West Kristine,Ethiopia,+1-235-657-1073x6306,(026)401-7353x2417,kristiwhitney@bernard.com,2022-03-24,http://cherry.com/
64,FCBdfCEAe20A8Dc,Chloe,Hutchinson,Simon LLC,South Julia,Netherlands,981-544-9452,+1-288-552-4666x060,leah85@sutton-terrell.com,2022-05-15,https://mitchell.info/
65,636cBF0835E10ff,Eileen,Lynch,"Knight, Abbott and Hubbard",Helenborough,Liberia,+1-158-951-4131x53578,001-673-779-6713x680,levigiles@vincent.com,2021-01-02,http://mckay.com/
66,fF1b6c9E8Fbf1ff,Fernando,Lambert,Church-Banks,Lake Nancy,Lithuania,497.829.9038,3863743398,fisherlinda@schaefer.net,2021-04-23,https://www.vang.com/
67,2A13F74EAa7DA6c,Makayla,Cannon,Henderson Inc,Georgeport,New Caledonia,001-215-801-6392x46009,027-609-6460,scottcurtis@hurley.biz,2020-01-20,http://www.velazquez.net/
68,a014Ec1b9FccC1E,Tom,Alvarado,Donaldson-Dougherty,South Sophiaberg,Kiribati,(585)606-2980x2258,730-797-3594x5614,nicholsonnina@montgomery.info,2020-08-18,http://odom-massey.com/
69,421a109cABDf5fa,Virginia,Dudley,Warren Ltd,Hartbury,French Southern Territories,027.846.3705x14184,+1-439-171-1846x4636,zvalencia@phelps.com,2021-01-31,http://hunter-esparza.com/
70,CC68FD1D3Bbbf22,Riley,Good,Wade PLC,Erikaville,Canada,6977745822,855-436-7641,alex06@galloway.com,2020-02-03,http://conway.org/
71,CBCd2Ac8E3eBDF9,Alexandria,Buck,Keller-Coffey,Nicolasfort,Iran,078-900-4760x76668,414-112-8700x68751,lee48@manning.com,2021-02-20,https://ramsey.org/
72,Ef859092FbEcC07,Richard,Roth,Conway-Mcbride,New Jasmineshire,Morocco,581-440-6539,9857827463,aharper@maddox-townsend.org,2020-02-23,https://www.brooks.com/
73,F560f2d3cDFb618,Candice,Keller,Huynh and Sons,East Summerstad,Zimbabwe,001-927-965-8550x92406,001-243-038-4271x53076,buckleycory@odonnell.net,2020-08-22,https://www.lucero.com/
74,A3F76Be153Df4a3,Anita,Benson,Parrish Ltd,Skinnerport,Russian Federation,874.617.5668x69878,(399)820-6418x0071,angie04@oconnell.com,2020-02-09,http://oconnor.com/
75,D01Af0AF7cBbFeA,Regina,Stein,Guzman-Brown,Raystad,Solomon Islands,001-469-848-0724x4407,001-085-360-4426x00357,zrosario@rojas-hardin.net,2022-01-15,http://www.johnston.info/
76,d40e89dCade7b2F,Debra,Riddle,"Chang, Aguirre and Leblanc",Colinhaven,United States Virgin Islands,+1-768-182-6014x14336,(303)961-4491,shieldskerry@robles.com,2020-07-11,http://kaiser.info/
77,BF6a1f9bd1bf8DE,Brittany,Zuniga,Mason-Hester,West Reginald,Kyrgyz Republic,(050)136-9025,001-480-851-2496x0157,mchandler@cochran-huerta.org,2021-07-24,http://www.boyle.com/
78,FfaeFFbbbf280db,Cassidy,Mcmahon,"Mcguire, Huynh and Hopkins",Lake Sherryborough,Myanmar,5040771311,684-682-0021x1326,katrinalane@fitzgerald.com,2020-10-21,https://hurst.com/
79,CbAE1d1e9a8dCb1,Laurie,Pennington,"Sanchez, Marsh and Hale",Port Katherineville,Dominica,007.155.3406x553,+1-809-862-5566x277,cookejill@powell.com,2020-06-08,http://www.hebert.com/
80,A7F85c1DE4dB87f,Alejandro,Blair,"Combs, Waller and Durham",Thomasland,Iceland,(690)068-4641x51468,555.509.8691x2329,elizabethbarr@ewing.com,2020-09-19,https://mercado-blevins.com/
81,D6CEAfb3BDbaa1A,Leslie,Jennings,Blankenship-Arias,Coreybury,Micronesia,629.198.6346,075.256.0829,corey75@wiggins.com,2021-11-13,https://www.juarez.com/
82,Ebdb6F6F7c90b69,Kathleen,Mckay,"Coffey, Lamb and Johnson",Lake Janiceton,Saint Vincent and the Grenadines,(733)910-9968,(691)247-4128x0665,chloelester@higgins-wilkinson.com,2021-09-12,http://www.owens-mooney.com/
83,E8E7e8Cfe516ef0,Hunter,Moreno,Fitzpatrick-Lawrence,East Clinton,Isle of Man,(733)833-6754,001-761-013-7121,isaac26@benton-finley.com,2020-12-28,http://walls.info/
84,78C06E9b6B3DF20,Chad,Davidson,Garcia-Jimenez,South Joshuashire,Oman,8275702958,(804)842-4715,justinwalters@jimenez.com,2021-11-15,http://www.garner-oliver.com/
85,03A1E62ADdeb31c,Corey,Holt,"Mcdonald, Bird and Ramirez",New Glenda,Fiji,001-439-242-4986x7918,3162708934,maurice46@morgan.com,2020-02-18,http://www.watson.com/
86,C6763c99d0bd16D,Emma,Cunningham,Stephens Inc,North Jillianview,New Zealand,128-059-0206x60217,(312)164-4545x2284,walter83@juarez.org,2022-05-13,http://www.reid.info/
87,ebe77E5Bf9476CE,Duane,Woods,Montoya-Miller,Lyonsberg,Maldives,(636)544-7783x7288,(203)287-1003x5932,kmercer@wagner.com,2020-07-21,http://murray.org/
88,E4Bbcd8AD81fC5f,Alison,Vargas,"Vaughn, Watts and Leach",East Cristinabury,Benin,365-273-8144,053-308-7653x6287,vcantu@norton.com,2020-11-10,http://mason.info/
89,efeb73245CDf1fF,Vernon,Kane,Carter-Strickland,Thomasfurt,Yemen,114-854-1159x555,499-608-4612,hilljesse@barrett.info,2021-04-15,http://www.duffy-hensley.net/
90,37Ec4B395641c1E,Lori,Flowers,Decker-Mcknight,North Joeburgh,Namibia,679.415.1210,945-842-3659x4581,tyrone77@valenzuela.info,2021-01-09,http://www.deleon-crosby.com/
91,5ef6d3eefdD43bE,Nina,Chavez,Byrd-Campbell,Cassidychester,Bhutan,053-344-3205,+1-330-920-5422x571,elliserica@frank.com,2020-03-26,https://www.pugh.com/
92,98b3aeDcC3B9FF3,Shane,Foley,Rocha-Hart,South Dannymouth,Hungary,+1-822-569-0302,001-626-114-5844x55073,nsteele@sparks.com,2021-07-06,https://www.holt-sparks.com/
93,aAb6AFc7AfD0fF3,Collin,Ayers,Lamb-Peterson,South Lonnie,Anguilla,404-645-5351x012,001-257-582-8850x8516,dudleyemily@gonzales.biz,2021-06-29,http://www.ruiz.com/
94,54B5B5Fe9F1B6C5,Sherry,Young,"Lee, Lucero and Johnson",Frankchester,Solomon Islands,158-687-1764,(438)375-6207x003,alan79@gates-mclaughlin.com,2021-04-04,https://travis.net/
95,BE91A0bdcA49Bbc,Darrell,Douglas,"Newton, Petersen and Mathis",Daisyborough,Mali,001-084-845-9524x1777,001-769-564-6303,grayjean@lowery-good.com,2022-02-17,https://banks.biz/
96,cb8E23e48d22Eae,Karl,Greer,Carey LLC,East Richard,Guyana,(188)169-1674x58692,001-841-293-3519x614,hhart@jensen.com,2022-01-30,http://hayes-perez.com/
97,CeD220bdAaCfaDf,Lynn,Atkinson,"Ware, Burns and Oneal",New Bradview,Sri Lanka,+1-846-706-2218,605.413.3198,vkemp@ferrell.com,2021-07-10,https://novak-allison.com/
98,28CDbC0dFe4b1Db,Fred,Guerra,Schmitt-Jones,Ortegaland,Solomon Islands,+1-753-067-8419x7170,+1-632-666-7507x92121,swagner@kane.org,2021-09-18,https://www.ross.com/
99,c23d1D9EE8DEB0A,Yvonne,Farmer,Fitzgerald-Harrell,Lake Elijahview,Aruba,(530)311-9786,001-869-452-0943x12424,mccarthystephen@horn-green.biz,2021-08-11,http://watkins.info/
100,2354a0E336A91A1,Clarence,Haynes,"Le, Nash and Cross",Judymouth,Honduras,(753)813-6941,783.639.1472,colleen91@faulkner.biz,2020-03-11,http://www.hatfield-saunders.net/
@@ -4,20 +4,19 @@ from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain import PromptTemplate
from openai import RateLimitError
from typing import List
from rank_bm25 import BM25Okapi

import fitz
import asyncio
import random
import textwrap
import numpy as np


def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.
@@ -48,8 +47,6 @@ def text_wrap(text, width=120):
    return textwrap.fill(text, width=width)


def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.
@@ -80,22 +77,54 @@ def encode_pdf(path, chunk_size=1000, chunk_overlap=200):

    return vectorstore


def encode_from_string(content, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a string into a vector store using OpenAI embeddings.

    Args:
        content (str): The text content to be encoded.
        chunk_size (int): The size of each chunk of text.
        chunk_overlap (int): The overlap between chunks.

    Returns:
        FAISS: A vector store containing the encoded content.

    Raises:
        ValueError: If the input content is not valid.
        RuntimeError: If there is an error during the encoding process.
    """
    if not isinstance(content, str) or not content.strip():
        raise ValueError("Content must be a non-empty string.")

    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer.")

    if not isinstance(chunk_overlap, int) or chunk_overlap < 0:
        raise ValueError("chunk_overlap must be a non-negative integer.")

    try:
        # Split the content into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            is_separator_regex=False,
        )
        chunks = text_splitter.create_documents([content])

        # Assign metadata to each chunk
        for chunk in chunks:
            chunk.metadata['relevance_score'] = 1.0

        # Generate embeddings and create the vector store
        embeddings = OpenAIEmbeddings()
        vectorstore = FAISS.from_documents(chunks, embeddings)
    except Exception as e:
        raise RuntimeError(f"An error occurred during the encoding process: {str(e)}")

    return vectorstore

@@ -119,9 +148,9 @@ def retrieve_context_per_question(question, chunks_query_retriever):
    # context = " ".join(doc.page_content for doc in docs)
    context = [doc.page_content for doc in docs]

    return context


class QuestionAnswerFromContext(BaseModel):
    """
    Model to generate an answer to a query based on a given context.
@@ -131,8 +160,8 @@ class QuestionAnswerFromContext(BaseModel):
    """
    answer_based_on_content: str = Field(description="Generates an answer to a query based on a given context.")


def create_question_answer_from_context_chain(llm):
    # Initialize the ChatOpenAI model with specific parameters
    question_answer_from_context_llm = llm

@@ -151,11 +180,11 @@ def create_question_answer_from_context_chain(llm):
    )

    # Create a chain by combining the prompt template and the language model
    question_answer_from_context_cot_chain = question_answer_from_context_prompt | question_answer_from_context_llm.with_structured_output(
        QuestionAnswerFromContext)
    return question_answer_from_context_cot_chain


def answer_question_from_context(question, context, question_answer_from_context_chain):
    """
    Answer a question using the given context by invoking a chain of reasoning.
@@ -188,7 +217,7 @@ def show_context(context):
    Prints each context item in the list with a heading indicating its position.
    """
    for i, c in enumerate(context):
        print(f"Context {i + 1}:")
        print(c)
        print("\n")
@@ -218,7 +247,6 @@ def read_pdf_to_string(path):
    return content


def bm25_retrieval(bm25: BM25Okapi, cleaned_texts: List[str], query: str, k: int = 5) -> List[str]:
    """
    Perform BM25 retrieval and return the top k cleaned text chunks.
@@ -247,7 +275,6 @@ def bm25_retrieval(bm25: BM25Okapi, cleaned_texts: List[str], query: str, k: int
    return top_k_texts


async def exponential_backoff(attempt):
    """
    Implements exponential backoff with a jitter.
@@ -261,10 +288,11 @@ async def exponential_backoff(attempt):
    # Calculate the wait time with exponential backoff and jitter
    wait_time = (2 ** attempt) + random.uniform(0, 1)
    print(f"Rate limit hit. Retrying in {wait_time:.2f} seconds...")

    # Asynchronously sleep for the calculated wait time
    await asyncio.sleep(wait_time)


async def retry_with_exponential_backoff(coroutine, max_retries=5):
    """
    Retries a coroutine using exponential backoff upon encountering a RateLimitError.
@@ -287,9 +315,9 @@ async def retry_with_exponential_backoff(coroutine, max_retries=5):
    # If the last attempt also fails, raise the exception
    if attempt == max_retries - 1:
        raise e

    # Wait for an exponential backoff period before retrying
    await exponential_backoff(attempt)

    # If max retries are reached without success, raise an exception
    raise Exception("Max retries reached")
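The retry helper above waits `2^attempt` seconds plus jitter between attempts. A self-contained sketch of the same pattern, runnable offline: the `RateLimitError` class here is a stand-in for `openai.RateLimitError`, the waits are capped so the example finishes quickly, and the helper takes a coroutine *factory* rather than a coroutine object, since a coroutine can only be awaited once:

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for openai.RateLimitError so the sketch runs without the SDK."""


async def exponential_backoff(attempt):
    # 2^attempt seconds plus jitter, as in the helper above;
    # capped at 10 ms here so the example runs fast.
    wait_time = min((2 ** attempt) + random.uniform(0, 1), 0.01)
    await asyncio.sleep(wait_time)


async def retry_with_exponential_backoff(make_coroutine, max_retries=5):
    # make_coroutine builds a fresh coroutine per attempt.
    for attempt in range(max_retries):
        try:
            return await make_coroutine()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            await exponential_backoff(attempt)
    raise Exception("Max retries reached")


async def flaky(state):
    # Simulated API call that rate-limits the first two attempts.
    state["calls"] += 1
    if state["calls"] < 3:
        raise RateLimitError("simulated rate limit")
    return "ok"
```

The factory signature is a deliberate change from the diff's `coroutine` parameter: retrying an already-awaited coroutine object would raise `RuntimeError: cannot reuse already awaited coroutine`.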
images/hierarchical_indices_example.svg (new file, 65 lines, 3.7 KiB)
@@ -0,0 +1,65 @@
<svg xmlns="http://www.w3.org/2000/svg" width="800" height="600" viewBox="0 0 800 600">
  <style>
    text { font-family: Arial, sans-serif; }
    .title { font-size: 24px; font-weight: bold; }
    .subtitle { font-size: 18px; font-weight: bold; }
    .content { font-size: 14px; }
    .highlight { fill: #1a73e8; }
  </style>

  <!-- Background -->
  <rect width="800" height="600" fill="#f5f5f5"/>

  <!-- Title -->
  <rect width="800" height="60" fill="#4285f4"/>
  <text x="400" y="40" text-anchor="middle" class="title" fill="white">Movie Review Database: RAG vs Hierarchical Indices</text>

  <!-- Scenario -->
  <rect x="50" y="70" width="700" height="80" fill="white" stroke="#d3d3d3"/>
  <text x="400" y="95" text-anchor="middle" class="subtitle">Scenario</text>
  <text x="400" y="120" text-anchor="middle" class="content">Large database: 10,000 movie reviews (50,000 chunks)</text>
  <text x="400" y="140" text-anchor="middle" class="content">Query: "Opinions on visual effects in recent sci-fi movies?"</text>

  <!-- Comparison Section -->
  <text x="400" y="180" text-anchor="middle" class="subtitle">Comparison</text>

  <!-- Regular RAG Approach -->
  <rect x="50" y="200" width="340" height="220" fill="white" stroke="#d3d3d3"/>
  <text x="220" y="230" text-anchor="middle" class="subtitle">Regular RAG Approach</text>
  <line x1="70" y1="245" x2="370" y2="245" stroke="#4285f4" stroke-width="2"/>
  <text x="70" y="270" class="content">• Searches all 50,000 chunks</text>
  <text x="70" y="295" class="content">• Retrieves top 10 similar chunks</text>
  <text x="70" y="330" class="content" font-weight="bold">Result:</text>
  <text x="70" y="355" class="content">May miss context or include irrelevant movies</text>

  <!-- Hierarchical Indices Approach -->
  <rect x="410" y="200" width="340" height="320" fill="white" stroke="#d3d3d3"/>
  <text x="580" y="230" text-anchor="middle" class="subtitle">Hierarchical Indices Approach</text>
  <line x1="430" y1="245" x2="730" y2="245" stroke="#4285f4" stroke-width="2"/>
  <text x="430" y="270" class="content">• First tier: 10,000 review summaries</text>
  <text x="430" y="295" class="content">• Second tier: 50,000 detailed chunks</text>
  <text x="430" y="320" class="content" font-weight="bold">Process:</text>
  <text x="450" y="345" class="content">1. Search 10,000 summaries</text>
  <text x="450" y="370" class="content">2. Identify top 100 relevant reviews</text>
  <text x="450" y="395" class="content">3. Search ~500 chunks from these reviews</text>
  <text x="450" y="420" class="content">4. Retrieve top 10 chunks</text>
  <text x="430" y="455" class="content" font-weight="bold">Result:</text>
  <text x="430" y="480" class="content">More relevant chunks, better context</text>

  <!-- Advantages -->
  <rect x="50" y="440" width="340" height="140" fill="#e8f0fe" stroke="#4285f4"/>
  <text x="220" y="470" text-anchor="middle" class="subtitle">Advantages of Hierarchical Indices</text>
  <line x1="70" y1="485" x2="370" y2="485" stroke="#4285f4" stroke-width="2"/>
  <text x="70" y="510" class="content highlight">1. Context Preservation</text>
  <text x="70" y="535" class="content highlight">2. Efficiency (searches 500 vs 50,000 chunks)</text>
  <text x="70" y="560" class="content highlight">3. Improved Relevance</text>

  <!-- Arrows -->
  <defs>
    <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="0" refY="3.5" orient="auto">
      <polygon points="0 0, 10 3.5, 0 7" fill="#4285f4"/>
    </marker>
  </defs>
  <line x1="220" y1="420" x2="220" y2="435" stroke="#4285f4" stroke-width="2" marker-end="url(#arrowhead)"/>
  <line x1="580" y1="520" x2="395" y2="520" stroke="#4285f4" stroke-width="2" marker-end="url(#arrowhead)"/>
</svg>
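The diagram's two-tier process (search summaries first, then only the chunks of the top-ranked documents) can be sketched in a few lines. This is a toy illustration, not the repository's implementation: the `score` function is a naive word-overlap stand-in for vector similarity, and all names here are hypothetical:

```python
def score(text, query):
    # Toy relevance: number of shared lowercase words.
    # A real system would use embedding cosine similarity.
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_retrieve(summaries, chunks_by_doc, query, top_docs=2, top_chunks=2):
    # Tier 1: rank document-level summaries and keep the best documents.
    ranked_docs = sorted(summaries, key=lambda d: score(summaries[d], query),
                         reverse=True)[:top_docs]
    # Tier 2: search only the chunks belonging to those documents.
    candidates = [c for d in ranked_docs for c in chunks_by_doc[d]]
    return sorted(candidates, key=lambda c: score(c, query),
                  reverse=True)[:top_chunks]
```

At the scale in the diagram this is what cuts the second-tier search from 50,000 chunks to roughly 500.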
images/hyde-advantages.svg (new file, 74 lines, 5.1 KiB)
@@ -0,0 +1,74 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 600">
  <!-- Background -->
  <rect width="1000" height="600" fill="#f0f8ff"/>

  <!-- Title -->
  <rect width="1000" height="70" fill="#2c3e50"/>
  <text x="500" y="45" font-family="Arial, sans-serif" font-size="28" fill="white" text-anchor="middle" font-weight="bold">HyDE: Bridging Gaps in Information Retrieval</text>

  <!-- Query -->
  <rect x="250" y="85" width="500" height="40" rx="20" fill="#3498db"/>
  <text x="500" y="112" font-family="Arial, sans-serif" font-size="18" fill="white" text-anchor="middle" font-weight="bold">Query: "new diabetes treatments"</text>

  <!-- Regular RAG -->
  <rect x="20" y="140" width="470" height="210" rx="10" fill="#ecf0f1" stroke="#34495e" stroke-width="2"/>
  <text x="255" y="170" font-family="Arial, sans-serif" font-size="24" fill="#2c3e50" text-anchor="middle" font-weight="bold">Regular RAG</text>

  <!-- Regular RAG Process -->
  <circle cx="50" cy="210" r="15" fill="#3498db"/>
  <text x="50" y="215" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">1</text>
  <text x="80" y="215" font-family="Arial, sans-serif" font-size="16" fill="#333">Encode query directly</text>

  <circle cx="50" cy="250" r="15" fill="#3498db"/>
  <text x="50" y="255" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">2</text>
  <text x="80" y="255" font-family="Arial, sans-serif" font-size="16" fill="#333">Search document chunks</text>

  <circle cx="50" cy="290" r="15" fill="#3498db"/>
  <text x="50" y="295" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">3</text>
  <text x="80" y="295" font-family="Arial, sans-serif" font-size="16" fill="#333">Retrieve similar chunks</text>

  <!-- HyDE -->
  <rect x="510" y="140" width="470" height="210" rx="10" fill="#ecf0f1" stroke="#34495e" stroke-width="2"/>
  <text x="745" y="170" font-family="Arial, sans-serif" font-size="24" fill="#2c3e50" text-anchor="middle" font-weight="bold">HyDE</text>

  <!-- HyDE Process -->
  <circle cx="540" cy="210" r="15" fill="#3498db"/>
  <text x="540" y="215" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">1</text>
  <text x="570" y="215" font-family="Arial, sans-serif" font-size="16" fill="#333">Generate hypothetical document</text>

  <circle cx="540" cy="250" r="15" fill="#3498db"/>
  <text x="540" y="255" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">2</text>
  <text x="570" y="255" font-family="Arial, sans-serif" font-size="16" fill="#333">Encode hypothetical document</text>

  <circle cx="540" cy="290" r="15" fill="#3498db"/>
  <text x="540" y="295" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">3</text>
  <text x="570" y="295" font-family="Arial, sans-serif" font-size="16" fill="#333">Search document chunks</text>

  <circle cx="540" cy="330" r="15" fill="#3498db"/>
  <text x="540" y="335" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">4</text>
  <text x="570" y="335" font-family="Arial, sans-serif" font-size="16" fill="#333">Retrieve similar chunks</text>

  <!-- Advantages of HyDE -->
  <rect x="20" y="370" width="960" height="210" rx="10" fill="#e8f8f5" stroke="#16a085" stroke-width="2"/>
  <text x="500" y="405" font-family="Arial, sans-serif" font-size="24" fill="#16a085" text-anchor="middle" font-weight="bold">Advantages of HyDE</text>

  <circle cx="60" cy="450" r="15" fill="#16a085"/>
  <text x="60" y="455" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">1</text>
  <text x="90" y="455" font-family="Arial, sans-serif" font-size="18" fill="#333">Bridges domain terminology gap</text>

  <circle cx="60" cy="495" r="15" fill="#16a085"/>
  <text x="60" y="500" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">2</text>
  <text x="90" y="500" font-family="Arial, sans-serif" font-size="18" fill="#333">Narrows query-document semantic gap</text>

  <circle cx="60" cy="540" r="15" fill="#16a085"/>
  <text x="60" y="545" font-family="Arial, sans-serif" font-size="12" fill="white" text-anchor="middle">3</text>
  <text x="90" y="545" font-family="Arial, sans-serif" font-size="18" fill="#333">Improves retrieval for complex queries</text>

  <!-- Hypothetical Document Example -->
  <rect x="510" y="435" width="450" height="135" rx="10" fill="#ffffff" stroke="#16a085" stroke-width="2"/>
  <text x="735" y="460" font-family="Arial, sans-serif" font-size="18" fill="#16a085" text-anchor="middle" font-weight="bold">Hypothetical Document Example</text>
  <text x="530" y="485" font-family="Arial, sans-serif" font-size="14" fill="#333">Recent advancements in diabetes treatments include:</text>
  <text x="530" y="510" font-family="Arial, sans-serif" font-size="14" fill="#333">1. GLP-1 receptor agonists (e.g., semaglutide)</text>
  <text x="530" y="535" font-family="Arial, sans-serif" font-size="14" fill="#333">2. SGLT2 inhibitors (e.g., empagliflozin)</text>
  <text x="530" y="560" font-family="Arial, sans-serif" font-size="14" fill="#333">3. Dual GIP/GLP-1 receptor agonists</text>
</svg>
images/proposition_chunking.svg (new file, 62 lines)
@@ -0,0 +1,62 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 600">
<defs>
<marker id="arrowhead" markerWidth="10" markerHeight="7" refX="0" refY="3.5" orient="auto">
<polygon points="0 0, 10 3.5, 0 7" fill="#34495e"/>
</marker>
</defs>

<!-- Background -->
<rect width="100%" height="100%" fill="#ecf0f1"/>

<!-- Title -->
<text x="400" y="40" font-family="Arial, sans-serif" font-size="24" font-weight="bold" text-anchor="middle" fill="#2c3e50">Propositions Method: Ingestion Phase</text>

<!-- Main Process Flow -->
<g transform="translate(0, 20)">
<!-- Input Data -->
<rect x="50" y="80" width="140" height="70" fill="#3498db" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="120" y="120" font-family="Arial, sans-serif" font-size="16" fill="white" text-anchor="middle">Input Text</text>

<!-- Generate Propositions -->
<rect x="240" y="80" width="140" height="70" fill="#e74c3c" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="310" y="110" font-family="Arial, sans-serif" font-size="16" fill="white" text-anchor="middle">Generate</text>
<text x="310" y="135" font-family="Arial, sans-serif" font-size="16" fill="white" text-anchor="middle">Propositions</text>

<!-- Quality Check -->
<rect x="430" y="80" width="140" height="70" fill="#f39c12" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="500" y="120" font-family="Arial, sans-serif" font-size="16" fill="white" text-anchor="middle">Quality Check</text>

<!-- Index Propositions -->
<rect x="620" y="80" width="140" height="70" fill="#2ecc71" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="690" y="110" font-family="Arial, sans-serif" font-size="16" fill="white" text-anchor="middle">Index</text>
<text x="690" y="135" font-family="Arial, sans-serif" font-size="16" fill="white" text-anchor="middle">Propositions</text>
</g>

<!-- Example -->
<rect x="50" y="200" width="700" height="380" fill="#ffffff" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="400" y="230" font-family="Arial, sans-serif" font-size="20" font-weight="bold" text-anchor="middle" fill="#2c3e50">Example: Proposition Generation</text>

<!-- Input Sentence -->
<rect x="70" y="250" width="660" height="60" fill="#bdc3c7" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="400" y="275" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" fill="#34495e">
<tspan x="400" dy="0">The Eiffel Tower, built in 1889, is a wrought-iron</tspan>
<tspan x="400" dy="25">lattice tower located in Paris, France.</tspan>
</text>

<!-- Propositions -->
<rect x="70" y="350" width="660" height="50" fill="#3498db" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="400" y="380" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" fill="white">The Eiffel Tower was built in 1889.</text>

<rect x="70" y="410" width="660" height="50" fill="#3498db" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="400" y="440" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" fill="white">The Eiffel Tower is a wrought-iron lattice tower.</text>

<rect x="70" y="470" width="660" height="50" fill="#3498db" stroke="#34495e" stroke-width="2" rx="5" ry="5"/>
<text x="400" y="500" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" fill="white">The Eiffel Tower is located in Paris, France.</text>

<!-- Arrows -->
<line x1="190" y1="115" x2="240" y2="115" stroke="#34495e" stroke-width="2" marker-end="url(#arrowhead)"/>
<line x1="380" y1="115" x2="430" y2="115" stroke="#34495e" stroke-width="2" marker-end="url(#arrowhead)"/>
<line x1="570" y1="115" x2="620" y2="115" stroke="#34495e" stroke-width="2" marker-end="url(#arrowhead)"/>

<path d="M400 310 L400 330 L400 350" fill="none" stroke="#34495e" stroke-width="2" marker-end="url(#arrowhead)"/>
</svg>

After Width: | Height: | Size: 3.9 KiB
images/reliable_rag.svg (new file, 3 lines)

After Width: | Height: | Size: 14 KiB

images/reranking-visualization.svg (new file, 77 lines)
@@ -0,0 +1,77 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 400">
<defs>
<filter id="shadow" x="-20%" y="-20%" width="140%" height="140%">
<feDropShadow dx="2" dy="2" stdDeviation="2" flood-color="#000000" flood-opacity="0.3"/>
</filter>
</defs>

<!-- Background -->
<rect x="0" y="0" width="800" height="400" fill="#f0f9ff" rx="20" ry="20" />

<!-- Title -->
<text x="400" y="40" text-anchor="middle" font-size="24" font-weight="bold" fill="#333">Re-ranking Process</text>

<!-- Initial Retrieval -->
<rect x="20" y="70" width="220" height="310" fill="#e6f3ff" rx="10" ry="10" filter="url(#shadow)" />
<text x="130" y="100" text-anchor="middle" font-size="18" font-weight="bold" fill="#333">Initial Retrieval</text>

<rect x="40" y="120" width="180" height="40" fill="#4e79a7" rx="5" ry="5" />
<text x="130" y="145" text-anchor="middle" font-size="14" fill="white">Vector Store</text>

<rect x="40" y="180" width="180" height="30" fill="#f28e2b" rx="5" ry="5" />
<text x="130" y="200" text-anchor="middle" font-size="12" fill="white">Document 1</text>

<rect x="40" y="220" width="180" height="30" fill="#e15759" rx="5" ry="5" />
<text x="130" y="240" text-anchor="middle" font-size="12" fill="white">Document 2</text>

<rect x="40" y="260" width="180" height="30" fill="#76b7b2" rx="5" ry="5" />
<text x="130" y="280" text-anchor="middle" font-size="12" fill="white">Document 3</text>

<rect x="40" y="300" width="180" height="30" fill="#59a14f" rx="5" ry="5" />
<text x="130" y="320" text-anchor="middle" font-size="12" fill="white">Document 4</text>

<!-- Re-ranking Process -->
<rect x="290" y="70" width="220" height="310" fill="#e6f3ff" rx="10" ry="10" filter="url(#shadow)" />
<text x="400" y="100" text-anchor="middle" font-size="18" font-weight="bold" fill="#333">Re-ranking Process</text>

<rect x="310" y="120" width="180" height="60" fill="#4e79a7" rx="5" ry="5" />
<text x="400" y="145" text-anchor="middle" font-size="14" fill="white">Re-ranking Model</text>
<text x="400" y="165" text-anchor="middle" font-size="12" fill="white">(LLM or Cross-Encoder)</text>

<rect x="310" y="200" width="180" height="60" fill="#f28e2b" rx="5" ry="5" />
<text x="400" y="225" text-anchor="middle" font-size="14" fill="white">Relevance Scoring</text>
<text x="400" y="245" text-anchor="middle" font-size="12" fill="white">Based on query context</text>

<rect x="310" y="280" width="180" height="60" fill="#e15759" rx="5" ry="5" />
<text x="400" y="305" text-anchor="middle" font-size="14" fill="white">Re-ordering</text>
<text x="400" y="325" text-anchor="middle" font-size="12" fill="white">Based on relevance scores</text>

<!-- Final Results -->
<rect x="560" y="70" width="220" height="310" fill="#e6f3ff" rx="10" ry="10" filter="url(#shadow)" />
<text x="670" y="100" text-anchor="middle" font-size="18" font-weight="bold" fill="#333">Re-ranked Results</text>

<rect x="580" y="120" width="180" height="30" fill="#76b7b2" rx="5" ry="5" />
<text x="670" y="140" text-anchor="middle" font-size="12" fill="white">Most Relevant Document</text>

<rect x="580" y="160" width="180" height="30" fill="#f28e2b" rx="5" ry="5" />
<text x="670" y="180" text-anchor="middle" font-size="12" fill="white">Second Most Relevant</text>

<rect x="580" y="200" width="180" height="30" fill="#e15759" rx="5" ry="5" />
<text x="670" y="220" text-anchor="middle" font-size="12" fill="white">Third Most Relevant</text>

<rect x="580" y="240" width="180" height="30" fill="#59a14f" opacity="0.5" rx="5" ry="5" />
<text x="670" y="260" text-anchor="middle" font-size="12" fill="white">Less Relevant Document</text>

<!-- Arrows -->
<path d="M 250 220 L 280 220" stroke="#333" stroke-width="2" marker-end="url(#arrowhead)" />
<path d="M 520 220 L 550 220" stroke="#333" stroke-width="2" marker-end="url(#arrowhead)" />

<defs>
<marker id="arrowhead" markerWidth="10" markerHeight="7" refX="0" refY="3.5" orient="auto">
<polygon points="0 0, 10 3.5, 0 7" fill="#333" />
</marker>
</defs>

<!-- Explanatory Text -->
<text x="400" y="390" text-anchor="middle" font-size="14" fill="#333">Re-ranking improves relevance by considering context and query intent</text>
</svg>

After Width: | Height: | Size: 4.2 KiB
images/reranking_comparison.svg (new file, 82 lines)
@@ -0,0 +1,82 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 650">
<defs>
<filter id="shadow" x="-20%" y="-20%" width="140%" height="140%">
<feDropShadow dx="2" dy="2" stdDeviation="2" flood-color="#000000" flood-opacity="0.3"/>
</filter>
</defs>

<!-- Background -->
<rect x="0" y="0" width="800" height="650" fill="#f0f9ff" rx="20" ry="20" />

<!-- Title -->
<text x="400" y="40" text-anchor="middle" font-size="24" font-weight="bold" fill="#333">RAG Retrieval Comparison</text>

<!-- Query -->
<text x="400" y="70" text-anchor="middle" font-size="16" fill="#555">Query: What is the capital of France?</text>

<!-- Document Collection -->
<rect x="20" y="100" width="240" height="480" fill="#ffffff" rx="10" ry="10" filter="url(#shadow)" />
<text x="140" y="130" text-anchor="middle" font-size="18" font-weight="bold" fill="#333">Document Collection</text>

<!-- Baseline Retrieval -->
<rect x="280" y="100" width="240" height="480" fill="#ffffff" rx="10" ry="10" filter="url(#shadow)" />
<text x="400" y="130" text-anchor="middle" font-size="18" font-weight="bold" fill="#333">Baseline Retrieval</text>

<!-- Advanced Retrieval -->
<rect x="540" y="100" width="240" height="480" fill="#ffffff" rx="10" ry="10" filter="url(#shadow)" />
<text x="660" y="130" text-anchor="middle" font-size="18" font-weight="bold" fill="#333">Advanced Retrieval</text>

<!-- Documents -->
<g id="doc1">
<rect x="30" y="150" width="220" height="80" fill="#e6f3ff" rx="5" ry="5" />
<text x="40" y="170" font-size="12" fill="#333">The capital of France is great.</text>
</g>

<g id="doc2">
<rect x="30" y="240" width="220" height="80" fill="#e6f3ff" rx="5" ry="5" />
<text x="40" y="260" font-size="12" fill="#333">The capital of France is huge.</text>
</g>

<g id="doc3">
<rect x="30" y="330" width="220" height="80" fill="#e6f3ff" rx="5" ry="5" />
<text x="40" y="350" font-size="12" fill="#333">The capital of France is beautiful.</text>
</g>

<g id="doc4">
<rect x="30" y="420" width="220" height="80" fill="#e6f3ff" rx="5" ry="5" />
<text x="40" y="440" font-size="12" fill="#333">
<tspan x="40" dy="0">Have you ever visited Paris? It is</tspan>
<tspan x="40" dy="14">a beautiful city where you can</tspan>
<tspan x="40" dy="14">eat delicious food and see the</tspan>
<tspan x="40" dy="14">Eiffel Tower...</tspan>
</text>
</g>

<g id="doc5">
<rect x="30" y="510" width="220" height="80" fill="#e6f3ff" rx="5" ry="5" />
<text x="40" y="530" font-size="12" fill="#333">
<tspan x="40" dy="0">I really enjoyed my trip to Paris,</tspan>
<tspan x="40" dy="14">France. The city is beautiful and</tspan>
<tspan x="40" dy="14">the food is delicious. I would love</tspan>
<tspan x="40" dy="14">to visit again...</tspan>
</text>
</g>

<!-- Baseline Results -->
<use href="#doc1" x="250" />
<use href="#doc3" x="250" />
<rect x="280" y="150" width="240" height="80" fill="none" stroke="#4e79a7" stroke-width="3" rx="5" ry="5" />
<rect x="280" y="330" width="240" height="80" fill="none" stroke="#4e79a7" stroke-width="3" rx="5" ry="5" />

<!-- Advanced Results -->
<use href="#doc4" x="510" />
<use href="#doc5" x="510" />
<rect x="540" y="420" width="240" height="80" fill="none" stroke="#e15759" stroke-width="3" rx="5" ry="5" />
<rect x="540" y="510" width="240" height="80" fill="none" stroke="#e15759" stroke-width="3" rx="5" ry="5" />

<!-- Legend (Moved to the bottom) -->
<rect x="250" y="600" width="20" height="20" fill="none" stroke="#4e79a7" stroke-width="3" />
<text x="280" y="615" font-size="14" fill="#333">Baseline Retrieval</text>
<rect x="470" y="600" width="20" height="20" fill="none" stroke="#e15759" stroke-width="3" />
<text x="500" y="615" font-size="14" fill="#333">Advanced Retrieval</text>
</svg>

After Width: | Height: | Size: 3.8 KiB
images/semantic_chunking_comparison.svg (new file, 72 lines)
@@ -0,0 +1,72 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 700">
<!-- Background -->
<rect width="800" height="700" fill="#f0f0f0"/>

<!-- Title -->
<text x="400" y="30" font-family="Arial, sans-serif" font-size="20" text-anchor="middle" font-weight="bold">Regular vs Semantic Chunking: Both Using Semantic Search</text>

<!-- Regular Chunking -->
<text x="200" y="60" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" font-weight="bold">Regular Chunking</text>
<rect x="50" y="70" width="300" height="50" fill="#ff9999" stroke="#000000"/>
<text x="200" y="100" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 1: Abstract, Intro, part of Methods</text>

<rect x="50" y="130" width="300" height="50" fill="#ffcc99" stroke="#000000"/>
<text x="200" y="160" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 2: Rest of Methods, part of Results</text>

<rect x="50" y="190" width="300" height="50" fill="#ffffcc" stroke="#000000"/>
<text x="200" y="220" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 3: Rest of Results, part of Discussion</text>

<rect x="50" y="250" width="300" height="50" fill="#ccffcc" stroke="#000000"/>
<text x="200" y="280" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 4: Rest of Discussion, Conclusion, References</text>

<!-- Semantic Chunking -->
<text x="600" y="60" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" font-weight="bold">Semantic Chunking</text>
<rect x="450" y="70" width="300" height="40" fill="#ff9999" stroke="#000000"/>
<text x="600" y="95" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 1: Abstract and Introduction</text>

<rect x="450" y="120" width="300" height="40" fill="#ffcc99" stroke="#000000"/>
<text x="600" y="145" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 2: Methods</text>

<rect x="450" y="170" width="300" height="40" fill="#ffffcc" stroke="#000000"/>
<text x="600" y="195" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 3: Results</text>

<rect x="450" y="220" width="300" height="40" fill="#ccffcc" stroke="#000000"/>
<text x="600" y="245" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 4: Discussion and Conclusion</text>

<rect x="450" y="270" width="300" height="40" fill="#ccccff" stroke="#000000"/>
<text x="600" y="295" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">Chunk 5: References</text>

<!-- Query Example -->
<text x="400" y="340" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" font-weight="bold">Query Example</text>
<rect x="50" y="350" width="700" height="40" fill="#e6e6e6" stroke="#000000"/>
<text x="400" y="375" font-family="Arial, sans-serif" font-size="12" text-anchor="middle">"What were the methods used to measure blood pressure in studies that found a significant reduction in hypertension?"</text>

<!-- Semantic Search Illustration -->
<text x="400" y="420" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" font-weight="bold">Semantic Search Results</text>

<!-- Regular Chunking Search -->
<text x="200" y="450" font-family="Arial, sans-serif" font-size="14" text-anchor="middle" font-weight="bold">Regular Chunking</text>
<rect x="50" y="460" width="300" height="120" fill="#f9f9f9" stroke="#000000"/>
<text x="55" y="475" font-family="Arial, sans-serif" font-size="10">• Retrieves parts of Chunks 1 and 2</text>
<text x="55" y="495" font-family="Arial, sans-serif" font-size="10">• Combines relevant info from multiple chunks</text>
<text x="55" y="515" font-family="Arial, sans-serif" font-size="10">• May include some irrelevant information</text>
<text x="55" y="535" font-family="Arial, sans-serif" font-size="10">• Requires more complex combination of info</text>

<!-- Semantic Chunking Search -->
<text x="600" y="450" font-family="Arial, sans-serif" font-size="14" text-anchor="middle" font-weight="bold">Semantic Chunking</text>
<rect x="450" y="460" width="300" height="120" fill="#f9f9f9" stroke="#000000"/>
<text x="455" y="475" font-family="Arial, sans-serif" font-size="10">• Retrieves Chunk 2 (entire Methods section)</text>
<text x="455" y="495" font-family="Arial, sans-serif" font-size="10">• All relevant information in one coherent chunk</text>
<text x="455" y="515" font-family="Arial, sans-serif" font-size="10">• Minimal irrelevant information included</text>
<text x="455" y="535" font-family="Arial, sans-serif" font-size="10">• Preserves full context of Methods</text>

<!-- Advantages of Semantic Chunking -->
<text x="400" y="610" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" font-weight="bold">Potential Advantages of Semantic Chunking</text>
<rect x="50" y="620" width="700" height="70" fill="#e6e6e6" stroke="#000000"/>
<text x="55" y="635" font-family="Arial, sans-serif" font-size="10">1. Better coherence and context preservation</text>
<text x="55" y="655" font-family="Arial, sans-serif" font-size="10">2. Reduced noise and irrelevant information</text>
<text x="55" y="675" font-family="Arial, sans-serif" font-size="10">3. Potentially more efficient retrieval (fewer chunks needed)</text>
<text x="400" y="635" font-family="Arial, sans-serif" font-size="10">4. Improved handling of long-range dependencies</text>
<text x="400" y="655" font-family="Arial, sans-serif" font-size="10">5. Possible better ranking of most relevant information</text>
<text x="400" y="675" font-family="Arial, sans-serif" font-size="10">6. Easier for model to understand complete ideas</text>
</svg>

After Width: | Height: | Size: 5.7 KiB
images/vector-search-comparison_context_enrichment.svg (new file, 50 lines)
@@ -0,0 +1,50 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 400">
<!-- Background -->
<rect width="800" height="400" fill="#f0f0f0"/>

<!-- Title -->
<text x="400" y="30" font-family="Arial, sans-serif" font-size="20" text-anchor="middle" font-weight="bold">Traditional vs Context-Enriched Vector Search</text>

<!-- Traditional Search -->
<rect x="50" y="60" width="300" height="300" fill="#ffffff" stroke="#000000"/>
<text x="200" y="90" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" font-weight="bold">Traditional Search</text>

<!-- Traditional Search Chunks -->
<rect x="70" y="110" width="260" height="40" fill="#ff9999" stroke="#000000"/>
<rect x="70" y="160" width="260" height="40" fill="#ff9999" stroke="#000000"/>
<rect x="70" y="210" width="260" height="40" fill="#ff9999" stroke="#000000"/>

<text x="200" y="135" font-family="Arial, sans-serif" font-size="14" text-anchor="middle">Isolated Chunk 1</text>
<text x="200" y="185" font-family="Arial, sans-serif" font-size="14" text-anchor="middle">Isolated Chunk 2</text>
<text x="200" y="235" font-family="Arial, sans-serif" font-size="14" text-anchor="middle">Isolated Chunk 3</text>

<text x="200" y="280" font-family="Arial, sans-serif" font-size="14" text-anchor="middle" font-style="italic">Limited context</text>
<text x="200" y="300" font-family="Arial, sans-serif" font-size="14" text-anchor="middle" font-style="italic">Potential coherence issues</text>

<!-- Context-Enriched Search -->
<rect x="450" y="60" width="300" height="300" fill="#ffffff" stroke="#000000"/>
<text x="600" y="90" font-family="Arial, sans-serif" font-size="16" text-anchor="middle" font-weight="bold">Context-Enriched Search</text>

<!-- Context-Enriched Search Chunks -->
<rect x="470" y="110" width="260" height="40" fill="#99ff99" stroke="#000000"/>
<rect x="470" y="150" width="260" height="40" fill="#ffff99" stroke="#000000"/>
<rect x="470" y="190" width="260" height="40" fill="#99ff99" stroke="#000000"/>

<text x="600" y="135" font-family="Arial, sans-serif" font-size="14" text-anchor="middle">Context Before</text>
<text x="600" y="175" font-family="Arial, sans-serif" font-size="14" text-anchor="middle">Retrieved Chunk</text>
<text x="600" y="215" font-family="Arial, sans-serif" font-size="14" text-anchor="middle">Context After</text>

<text x="600" y="260" font-family="Arial, sans-serif" font-size="14" text-anchor="middle" font-style="italic">Enhanced coherence</text>
<text x="600" y="280" font-family="Arial, sans-serif" font-size="14" text-anchor="middle" font-style="italic">More comprehensive information</text>
<text x="600" y="300" font-family="Arial, sans-serif" font-size="14" text-anchor="middle" font-style="italic">Better understanding</text>

<!-- Arrows -->
<path d="M 370 200 L 430 200" stroke="#000000" stroke-width="2" fill="none" marker-end="url(#arrowhead)"/>

<!-- Arrow Marker -->
<defs>
<marker id="arrowhead" markerWidth="10" markerHeight="7" refX="0" refY="3.5" orient="auto">
<polygon points="0 0, 10 3.5, 0 7" />
</marker>
</defs>
</svg>

After Width: | Height: | Size: 3.1 KiB
@@ -59,24 +59,28 @@ langchain-community==0.2.9
 langchain-core==0.2.22
 langchain-experimental==0.0.62
 langchain-openai==0.1.17
+langchain-cohere==0.2.3
+langchain-groq==0.1.9
 langchain-text-splitters==0.2.2
 langcodes==3.4.0
 langsmith==0.1.85
 language_data==1.2.0
 llama-cloud==0.0.9
-llama-index==0.10.55
-llama-index-agent-openai==0.2.8
-llama-index-cli==0.1.12
-llama-index-core==0.10.55
-llama-index-embeddings-openai==0.1.10
-llama-index-indices-managed-llama-cloud==0.2.5
+llama-index==0.11.04
+llama-index-agent-openai==0.3.0
+llama-index-cli==0.3.0
+llama-index-core==0.11.4
+llama-index-embeddings-openai==0.2.4
+llama-index-indices-managed-llama-cloud==0.3.0
 llama-index-legacy==0.9.48
-llama-index-llms-openai==0.1.25
-llama-index-multi-modal-llms-openai==0.1.7
-llama-index-program-openai==0.1.6
-llama-index-question-gen-openai==0.1.3
-llama-index-readers-file==0.1.30
-llama-index-readers-llama-parse==0.1.6
+llama-index-llms-openai==0.2.2
+llama-index-multi-modal-llms-openai==0.2.0
+llama-index-program-openai==0.2.0
+llama-index-question-gen-openai==0.2.0
+llama-index-readers-file==0.2.0
+llama-index-readers-llama-parse==0.3.0
+llama-index-vector-stores-faiss==0.2.1
+llama-index-retrievers-bm25==0.2.0
 llama-parse==0.4.7
 marisa-trie==1.2.0
 markdown-it-py==3.0.0
@@ -179,5 +183,4 @@ wrapt==1.16.0
 xxhash==3.4.1
 yarl==1.9.4
 zipp==3.19.2
-llama-index-vector-stores-faiss==0.1.2
 ollama==0.3.1