{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deep Evaluation of RAG Systems using deepeval\n",
"\n",
"## Overview\n",
"\n",
"This code demonstrates the use of the `deepeval` library to perform comprehensive evaluations of Retrieval-Augmented Generation (RAG) systems. It covers various evaluation metrics and provides a framework for creating and running test cases.\n",
"\n",
"## Key Components\n",
"\n",
"1. Correctness Evaluation\n",
"2. Faithfulness Evaluation\n",
"3. Contextual Relevancy Evaluation\n",
"4. Combined Evaluation of Multiple Metrics\n",
"5. Batch Test Case Creation\n",
"\n",
"## Evaluation Metrics\n",
"\n",
"### 1. Correctness (GEval)\n",
"\n",
"- Evaluates whether the actual output is factually correct based on the expected output.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Compares the expected and actual outputs.\n",
"\n",
"### 2. Faithfulness (FaithfulnessMetric)\n",
"\n",
"- Assesses whether the generated answer is faithful to the provided context.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Can provide detailed reasons for the evaluation.\n",
"\n",
"### 3. Contextual Relevancy (ContextualRelevancyMetric)\n",
"\n",
"- Evaluates how relevant the retrieved context is to the question and answer.\n",
"- Uses GPT-4 as the evaluation model.\n",
"- Can provide detailed reasons for the evaluation.\n",
"\n",
"## Key Features\n",
"\n",
"1. Flexible Metric Configuration: Each metric can be customized with different models and parameters.\n",
"2. Multi-Metric Evaluation: Ability to evaluate test cases using multiple metrics simultaneously.\n",
"3. Batch Test Case Creation: Utility function to create multiple test cases efficiently.\n",
"4. Detailed Feedback: Options to include detailed reasons for evaluation results.\n",
"\n",
"## Benefits of this Approach\n",
"\n",
"1. Comprehensive Evaluation: Covers multiple aspects of RAG system performance.\n",
"2. Flexibility: Easy to add or modify evaluation metrics and test cases.\n",
"3. Scalability: Capable of handling multiple test cases and metrics efficiently.\n",
"4. Interpretability: Provides detailed reasons for evaluation results, aiding in system improvement.\n",
"\n",
"## Conclusion\n",
"\n",
"This deep evaluation approach using the `deepeval` library offers a robust framework for assessing the performance of RAG systems. By evaluating correctness, faithfulness, and contextual relevancy, it provides a multi-faceted view of system performance. This comprehensive evaluation is crucial for identifying areas of improvement and ensuring the reliability and effectiveness of RAG systems in real-world applications."
]
},
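{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup (illustrative)\n",
"\n",
"A minimal setup sketch, assuming `deepeval` is installed via pip and uses the standard `OPENAI_API_KEY` environment variable for its OpenAI-backed metrics (`gpt-4` / `gpt-4o`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install deepeval if needed (uncomment on first run):\n",
"# %pip install -q deepeval\n",
"\n",
"import os\n",
"\n",
"# The metrics in this notebook call OpenAI models, so an API key must be configured.\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" print(\"Set OPENAI_API_KEY before running the evaluation cells below.\")"
]
},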
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from deepeval import evaluate\n",
"from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric\n",
"from deepeval.test_case import LLMTestCase, LLMTestCaseParams"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test Correctness"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"correctness_metric = GEval(\n",
" name=\"Correctness\",\n",
" model=\"gpt-4o\",\n",
" evaluation_params=[\n",
" LLMTestCaseParams.EXPECTED_OUTPUT,\n",
" LLMTestCaseParams.ACTUAL_OUTPUT],\n",
" evaluation_steps=[\n",
" \"Determine whether the actual output is factually correct based on the expected output.\"\n",
" ],\n",
"\n",
")\n",
"\n",
"gt_answer = \"Madrid is the capital of Spain.\"\n",
"pred_answer = \"MadriD.\"\n",
"\n",
"test_case_correctness = LLMTestCase(\n",
" input=\"What is the capital of Spain?\",\n",
" expected_output=gt_answer,\n",
" actual_output=pred_answer,\n",
")\n",
"\n",
"correctness_metric.measure(test_case_correctness)\n",
"print(correctness_metric.score)"
]
},
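{
"cell_type": "markdown",
"metadata": {},
"source": [
"`GEval` also records a natural-language rationale for its score, which is useful for debugging failing test cases. A minimal sketch, assuming the metric exposes it via a `reason` attribute after `measure()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect why the evaluator assigned this score.\n",
"print(correctness_metric.reason)"
]
},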
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test faithfulness"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"question = \"what is 3+3?\"\n",
"context = [\"6\"]\n",
"generated_answer = \"6\"\n",
"\n",
"faithfulness_metric = FaithfulnessMetric(\n",
" threshold=0.7,\n",
" model=\"gpt-4\",\n",
" include_reason=False\n",
")\n",
"\n",
"test_case = LLMTestCase(\n",
" input = question,\n",
" actual_output=generated_answer,\n",
" retrieval_context=context\n",
"\n",
")\n",
"\n",
"faithfulness_metric.measure(test_case)\n",
"print(faithfulness_metric.score)\n",
"print(faithfulness_metric.reason)\n",
"\n"
]
},
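{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `threshold=0.7` argument above turns the raw score into a pass/fail verdict. A minimal sketch, assuming deepeval's `is_successful()` helper on metric objects:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# After measure(), the metric records whether the score cleared the threshold.\n",
"print(faithfulness_metric.is_successful())  # True when score >= 0.7 here"
]
},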
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test contextual relevancy "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"actual_output = \"then go somewhere else.\"\n",
"retrieval_context = [\"this is a test context\",\"mike is a cat\",\"if the shoes don't fit, then go somewhere else.\"]\n",
"gt_answer = \"if the shoes don't fit, then go somewhere else.\"\n",
"\n",
"relevance_metric = ContextualRelevancyMetric(\n",
" threshold=1,\n",
" model=\"gpt-4\",\n",
" include_reason=True\n",
")\n",
"relevance_test_case = LLMTestCase(\n",
" input=\"What if these shoes don't fit?\",\n",
" actual_output=actual_output,\n",
" retrieval_context=retrieval_context,\n",
" expected_output=gt_answer,\n",
"\n",
")\n",
"\n",
"relevance_metric.measure(relevance_test_case)\n",
"print(relevance_metric.score)\n",
"print(relevance_metric.reason)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"new_test_case = LLMTestCase(\n",
" input=\"What is the capital of Spain?\",\n",
" expected_output=\"Madrid is the capital of Spain.\",\n",
" actual_output=\"MadriD.\",\n",
" retrieval_context=[\"Madrid is the capital of Spain.\"]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test two different cases together with several metrics together"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"evaluate(\n",
" test_cases=[relevance_test_case, new_test_case],\n",
" metrics=[correctness_metric, faithfulness_metric, relevance_metric]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Funcion to create multiple LLMTestCases based on four lists: \n",
"* Questions\n",
"* Ground Truth Answers\n",
"* Generated Answers\n",
"* Retrieved Documents - Each element is a list"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_deep_eval_test_cases(questions, gt_answers, generated_answers, retrieved_documents):\n",
" return [\n",
" LLMTestCase(\n",
" input=question,\n",
" expected_output=gt_answer,\n",
" actual_output=generated_answer,\n",
" retrieval_context=retrieved_document\n",
" )\n",
" for question, gt_answer, generated_answer, retrieved_document in zip(\n",
" questions, gt_answers, generated_answers, retrieved_documents\n",
" )\n",
" ]"
]
}
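,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hypothetical end-to-end sketch tying the helper to `evaluate`: the lists below are illustrative stand-ins for real RAG pipeline outputs, not data from an actual system."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative placeholder data; in practice these come from your RAG pipeline.\n",
"questions = [\"What is the capital of Spain?\", \"What is 3+3?\"]\n",
"gt_answers = [\"Madrid is the capital of Spain.\", \"6\"]\n",
"generated_answers = [\"Madrid.\", \"6\"]\n",
"retrieved_documents = [\n",
" [\"Madrid is the capital of Spain.\"],\n",
" [\"6\"],\n",
"]\n",
"\n",
"test_cases = create_deep_eval_test_cases(\n",
" questions, gt_answers, generated_answers, retrieved_documents\n",
")\n",
"evaluate(\n",
" test_cases=test_cases,\n",
" metrics=[correctness_metric, faithfulness_metric, relevance_metric]\n",
")"
]
}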
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}