mirror of
https://github.com/NirDiamant/RAG_Techniques.git
synced 2025-04-07 00:48:52 +03:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple RAG (Retrieval-Augmented Generation) System for CSV Files\n",
"\n",
"## Overview\n",
"\n",
"This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
"\n",
"## CSV File Structure and Use Case\n",
"The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.\n",
"\n",
"## Key Components\n",
"\n",
"1. Loading and splitting CSV files\n",
"2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
"3. Query engine setup for querying the processed documents\n",
"4. Question answering over the CSV data\n",
"\n",
"## Method Details\n",
"\n",
"### Document Preprocessing\n",
"\n",
"1. The CSV is loaded using LlamaIndex's [PagedCSVReader](https://docs.llamaindex.ai/en/stable/api_reference/readers/file/#llama_index.readers.file.PagedCSVReader).\n",
"2. This reader converts each row into a LlamaIndex Document along with the respective column names of the table. No further splitting is applied.\n",
"\n",
"### Vector Store Creation\n",
"\n",
"1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
"2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
"\n",
"### Query Engine Setup\n",
"\n",
"1. A query engine is configured to fetch the most relevant chunks for a given query and then answer the question.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Scalability: Can handle large documents by processing them in chunks.\n",
"2. Flexibility: Easy to adjust parameters like chunk size and the number of retrieved results.\n",
"3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
"4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
"\n",
"## Conclusion\n",
"\n",
"This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Imports & Environment Variables"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core.readers import SimpleDirectoryReader\n",
"from llama_index.core import Settings\n",
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
"from llama_index.readers.file import PagedCSVReader\n",
"from llama_index.vector_stores.faiss import FaissVectorStore\n",
"from llama_index.core.ingestion import IngestionPipeline\n",
"from llama_index.core import VectorStoreIndex\n",
"import faiss\n",
"import os\n",
"import pandas as pd\n",
"from dotenv import load_dotenv\n",
"\n",
"# Load environment variables from a .env file\n",
"load_dotenv()\n",
"\n",
"# Set the OpenAI API key environment variable\n",
"os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# LlamaIndex global settings for LLM and embeddings\n",
"EMBED_DIMENSION = 512\n",
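"# Note: text-embedding-3-small supports shortened embeddings via the `dimensions` argument;\n",
"# the same EMBED_DIMENSION value must be used for the FAISS index created below.\n",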
"Settings.llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
"Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CSV File Structure and Use Case\n",
"The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"file_path = '../data/customers-100.csv'  # insert the path of the CSV file\n",
"data = pd.read_csv(file_path)\n",
"\n",
"# Preview the CSV file\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vector Store"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Create a FaissVectorStore to store embeddings\n",
"fais_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
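"# IndexFlatL2 does exact (brute-force) L2-distance search; FaissVectorStore wraps it for LlamaIndex\n",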
"vector_store = FaissVectorStore(faiss_index=fais_index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load and Process CSV Data as Documents"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
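"# PagedCSVReader emits one Document per CSV row, formatted as 'Column: value' lines\n",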
"csv_reader = PagedCSVReader()\n",
"\n",
"reader = SimpleDirectoryReader(\n",
"    input_files=[file_path],\n",
"    file_extractor={\".csv\": csv_reader}\n",
")\n",
"\n",
"docs = reader.load_data()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index: 1\n",
"Customer Id: DD37Cf93aecA6Dc\n",
"First Name: Sheryl\n",
"Last Name: Baxter\n",
"Company: Rasmussen Group\n",
"City: East Leonard\n",
"Country: Chile\n",
"Phone 1: 229.077.5154\n",
"Phone 2: 397.884.0519x718\n",
"Email: zunigavanessa@smith.info\n",
"Subscription Date: 2020-08-24\n",
"Website: http://www.stephenson.com/\n"
]
}
],
"source": [
"# Check a sample chunk\n",
"print(docs[0].text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ingestion Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"pipeline = IngestionPipeline(\n",
"    vector_store=vector_store,\n",
"    documents=docs\n",
")\n",
"\n",
"nodes = pipeline.run()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Query Engine"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
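"# Build a vector index over the ingested nodes; similarity_top_k=2 returns the two most similar rows per query\n",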
"vector_store_index = VectorStoreIndex(nodes)\n",
"query_engine = vector_store_index.as_query_engine(similarity_top_k=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Query the RAG bot with a question based on the CSV data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Rasmussen Group'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
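"# Retrieve the most relevant rows and let the LLM compose an answer from them\n",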
"response = query_engine.query(\"which company does sheryl Baxter work for?\")\n",
"response.response"
]
},
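{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inspect the retrieved rows (optional)\n",
"\n",
"A minimal sketch for checking which CSV rows were used as context for the answer above. It assumes the standard LlamaIndex response object, which exposes the retrieved nodes and their scores via `response.source_nodes`; the printed format is illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each source node corresponds to one retrieved CSV row (one PagedCSVReader document)\n",
"for source_node in response.source_nodes:\n",
"    print(f\"Score: {source_node.score}\")\n",
"    print(source_node.node.get_content())\n",
"    print(\"-\" * 40)"
]
}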
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "objenv",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.5"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|