mirror of https://github.com/NirDiamant/RAG_Techniques.git, synced 2025-04-07 00:48:52 +03:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deep Evaluation of RAG Systems using deepeval\n",
"\n",
"## Overview\n",
"\n",
"This code demonstrates the use of the `deepeval` library to perform comprehensive evaluations of Retrieval-Augmented Generation (RAG) systems. It covers various evaluation metrics and provides a framework for creating and running test cases.\n",
"\n",
"## Key Components\n",
"\n",
"1. Correctness Evaluation\n",
"2. Faithfulness Evaluation\n",
"3. Contextual Relevancy Evaluation\n",
"4. Combined Evaluation of Multiple Metrics\n",
"5. Batch Test Case Creation\n",
"\n",
"## Evaluation Metrics\n",
"\n",
"### 1. Correctness (GEval)\n",
"\n",
"- Evaluates whether the actual output is factually correct based on the expected output.\n",
"- Uses a GPT-4-class model (`gpt-4o` in this notebook) as the evaluation model.\n",
"- Compares the expected and actual outputs.\n",
"\n",
"### 2. Faithfulness (FaithfulnessMetric)\n",
"\n",
"- Assesses whether the generated answer is faithful to the provided context.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Can provide detailed reasons for the evaluation.\n",
"\n",
"### 3. Contextual Relevancy (ContextualRelevancyMetric)\n",
"\n",
"- Evaluates how relevant the retrieved context is to the question and answer.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Can provide detailed reasons for the evaluation.\n",
"\n",
"## Key Features\n",
"\n",
"1. Flexible Metric Configuration: Each metric can be customized with different models and parameters.\n",
"2. Multi-Metric Evaluation: Ability to evaluate test cases using multiple metrics simultaneously.\n",
"3. Batch Test Case Creation: Utility function to create multiple test cases efficiently.\n",
"4. Detailed Feedback: Options to include detailed reasons for evaluation results.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Comprehensive Evaluation: Covers multiple aspects of RAG system performance.\n",
"2. Flexibility: Easy to add or modify evaluation metrics and test cases.\n",
"3. Scalability: Capable of handling multiple test cases and metrics efficiently.\n",
"4. Interpretability: Provides detailed reasons for evaluation results, aiding in system improvement.\n",
"\n",
"## Conclusion\n",
"\n",
"This deep evaluation approach using the `deepeval` library offers a robust framework for assessing the performance of RAG systems. By evaluating correctness, faithfulness, and contextual relevancy, it provides a multi-faceted view of system performance. This comprehensive evaluation is crucial for identifying areas of improvement and ensuring the reliability and effectiveness of RAG systems in real-world applications."
]
},
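{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup (added sketch)\n",
"\n",
"The cell below is an added setup sketch, not part of the original notebook. It assumes that the GPT-4-based metrics used later call the OpenAI API, so `deepeval` must be installed and an `OPENAI_API_KEY` must be available in the environment before the evaluation cells are run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added setup sketch (assumption): the GPT-4 based evaluation models below call the\n",
"# OpenAI API, so deepeval must be installed and an OpenAI key must be set.\n",
"# %pip install deepeval\n",
"import os\n",
"\n",
"assert os.getenv(\"OPENAI_API_KEY\"), \"Set OPENAI_API_KEY before running the evaluation cells.\""
]
},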
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from deepeval import evaluate\n",
"from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric\n",
"from deepeval.test_case import LLMTestCase, LLMTestCaseParams"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test Correctness"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Correctness: an LLM-as-a-judge (GEval) compares the actual output against the expected output.\n",
"correctness_metric = GEval(\n",
"    name=\"Correctness\",\n",
"    model=\"gpt-4o\",\n",
"    evaluation_params=[\n",
"        LLMTestCaseParams.EXPECTED_OUTPUT,\n",
"        LLMTestCaseParams.ACTUAL_OUTPUT,\n",
"    ],\n",
"    evaluation_steps=[\n",
"        \"Determine whether the actual output is factually correct based on the expected output.\"\n",
"    ],\n",
")\n",
"\n",
"gt_answer = \"Madrid is the capital of Spain.\"\n",
"pred_answer = \"MadriD.\"\n",
"\n",
"test_case_correctness = LLMTestCase(\n",
"    input=\"What is the capital of Spain?\",\n",
"    expected_output=gt_answer,\n",
"    actual_output=pred_answer,\n",
")\n",
"\n",
"correctness_metric.measure(test_case_correctness)\n",
"print(correctness_metric.score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test Faithfulness"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Faithfulness: checks that the generated answer only makes claims supported by the retrieval context.\n",
"question = \"what is 3+3?\"\n",
"context = [\"6\"]\n",
"generated_answer = \"6\"\n",
"\n",
"faithfulness_metric = FaithfulnessMetric(\n",
"    threshold=0.7,\n",
"    model=\"gpt-4\",\n",
"    include_reason=True  # needed so that .reason is populated for the print below\n",
")\n",
"\n",
"test_case = LLMTestCase(\n",
"    input=question,\n",
"    actual_output=generated_answer,\n",
"    retrieval_context=context\n",
")\n",
"\n",
"faithfulness_metric.measure(test_case)\n",
"print(faithfulness_metric.score)\n",
"print(faithfulness_metric.reason)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test Contextual Relevancy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Contextual relevancy: measures how relevant the retrieved context is to the input question.\n",
"actual_output = \"then go somewhere else.\"\n",
"retrieval_context = [\"this is a test context\", \"mike is a cat\", \"if the shoes don't fit, then go somewhere else.\"]\n",
"gt_answer = \"if the shoes don't fit, then go somewhere else.\"\n",
"\n",
"relevance_metric = ContextualRelevancyMetric(\n",
"    threshold=1,\n",
"    model=\"gpt-4\",\n",
"    include_reason=True\n",
")\n",
"\n",
"relevance_test_case = LLMTestCase(\n",
"    input=\"What if these shoes don't fit?\",\n",
"    actual_output=actual_output,\n",
"    retrieval_context=retrieval_context,\n",
"    expected_output=gt_answer\n",
")\n",
"\n",
"relevance_metric.measure(relevance_test_case)\n",
"print(relevance_metric.score)\n",
"print(relevance_metric.reason)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"new_test_case = LLMTestCase(\n",
"    input=\"What is the capital of Spain?\",\n",
"    expected_output=\"Madrid is the capital of Spain.\",\n",
"    actual_output=\"MadriD.\",\n",
"    retrieval_context=[\"Madrid is the capital of Spain.\"]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate two different test cases with several metrics at once"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"evaluate(\n",
"    test_cases=[relevance_test_case, new_test_case],\n",
"    metrics=[correctness_metric, faithfulness_metric, relevance_metric]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Function to create multiple LLMTestCases based on four lists:\n",
"* Questions\n",
"* Ground Truth Answers\n",
"* Generated Answers\n",
"* Retrieved Documents - each element is itself a list of retrieved documents for that question"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_deep_eval_test_cases(questions, gt_answers, generated_answers, retrieved_documents):\n",
"    \"\"\"Build one LLMTestCase per (question, ground truth, generated answer, retrieved docs) tuple.\"\"\"\n",
"    return [\n",
"        LLMTestCase(\n",
"            input=question,\n",
"            expected_output=gt_answer,\n",
"            actual_output=generated_answer,\n",
"            retrieval_context=retrieved_document\n",
"        )\n",
"        for question, gt_answer, generated_answer, retrieved_document in zip(\n",
"            questions, gt_answers, generated_answers, retrieved_documents\n",
"        )\n",
"    ]"
]
}
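,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Batch evaluation usage (added sketch)\n",
"\n",
"The cell below is an added usage sketch, not part of the original notebook: it builds a small batch of test cases with `create_deep_eval_test_cases`, reusing the toy question from the correctness test, and scores the batch with the three metrics defined above. The data is illustrative only, not from a real RAG run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added usage sketch (illustrative data only): build LLMTestCases from four parallel\n",
"# lists and evaluate them with the metrics defined earlier in this notebook.\n",
"questions = [\"What is the capital of Spain?\"]\n",
"gt_answers = [\"Madrid is the capital of Spain.\"]\n",
"generated_answers = [\"MadriD.\"]\n",
"retrieved_documents = [[\"Madrid is the capital of Spain.\"]]  # one list of retrieved docs per question\n",
"\n",
"batch_test_cases = create_deep_eval_test_cases(\n",
"    questions, gt_answers, generated_answers, retrieved_documents\n",
")\n",
"\n",
"evaluate(\n",
"    test_cases=batch_test_cases,\n",
"    metrics=[correctness_metric, faithfulness_metric, relevance_metric]\n",
")"
]
}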
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}