{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook I'm using the OpenPipe client to capture a set of calls to the OpenAI API.\n", "\n", "For this example I'll blithely throw engineering best practices to the wind and use the notebook itself to manage dependencies. 😁" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: openpipe==3.0.3 in /usr/local/lib/python3.10/dist-packages (3.0.3)\n", "Requirement already satisfied: python-dotenv==1.0.0 in /usr/local/lib/python3.10/dist-packages (1.0.0)\n", "Requirement already satisfied: joblib==1.3.2 in /usr/local/lib/python3.10/dist-packages (1.3.2)\n", "Requirement already satisfied: attrs<24.0.0,>=23.1.0 in /usr/local/lib/python3.10/dist-packages (from openpipe==3.0.3) (23.1.0)\n", "Requirement already satisfied: httpx<0.25.0,>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from openpipe==3.0.3) (0.24.1)\n", "Requirement already satisfied: openai<0.28.0,>=0.27.8 in /usr/local/lib/python3.10/dist-packages (from openpipe==3.0.3) (0.27.9)\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from openpipe==3.0.3) (2.8.2)\n", "Requirement already satisfied: toml<0.11.0,>=0.10.2 in /usr/local/lib/python3.10/dist-packages (from openpipe==3.0.3) (0.10.2)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<0.25.0,>=0.24.1->openpipe==3.0.3) (2022.12.7)\n", "Requirement already satisfied: httpcore<0.18.0,>=0.15.0 in /usr/local/lib/python3.10/dist-packages (from httpx<0.25.0,>=0.24.1->openpipe==3.0.3) (0.17.3)\n", "Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx<0.25.0,>=0.24.1->openpipe==3.0.3) (3.4)\n", "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx<0.25.0,>=0.24.1->openpipe==3.0.3) (1.3.0)\n", "Requirement already satisfied: requests>=2.20 in /usr/local/lib/python3.10/dist-packages (from openai<0.28.0,>=0.27.8->openpipe==3.0.3) (2.28.1)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from openai<0.28.0,>=0.27.8->openpipe==3.0.3) (4.66.1)\n", "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from openai<0.28.0,>=0.27.8->openpipe==3.0.3) (3.8.5)\n", "Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil<3.0.0,>=2.8.2->openpipe==3.0.3) (1.16.0)\n", "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore<0.18.0,>=0.15.0->httpx<0.25.0,>=0.24.1->openpipe==3.0.3) (0.14.0)\n", "Requirement already satisfied: anyio<5.0,>=3.0 in /usr/local/lib/python3.10/dist-packages (from httpcore<0.18.0,>=0.15.0->httpx<0.25.0,>=0.24.1->openpipe==3.0.3) (3.7.1)\n", "Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.20->openai<0.28.0,>=0.27.8->openpipe==3.0.3) (2.1.1)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.20->openai<0.28.0,>=0.27.8->openpipe==3.0.3) (1.26.13)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->openai<0.28.0,>=0.27.8->openpipe==3.0.3) (6.0.4)\n", "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from 
aiohttp->openai<0.28.0,>=0.27.8->openpipe==3.0.3) (4.0.3)\n", "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->openai<0.28.0,>=0.27.8->openpipe==3.0.3) (1.9.2)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->openai<0.28.0,>=0.27.8->openpipe==3.0.3) (1.4.0)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->openai<0.28.0,>=0.27.8->openpipe==3.0.3) (1.3.1)\n", "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5.0,>=3.0->httpcore<0.18.0,>=0.15.0->httpx<0.25.0,>=0.24.1->openpipe==3.0.3) (1.1.2)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.1.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython3.10 -m pip install --upgrade pip\u001b[0m\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install openpipe==3.0.3 python-dotenv==1.0.0 joblib==1.3.2 datasets==2.14.4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with remote datasets (or any data, really), it's a good idea to visually inspect some samples to make sure it looks like you expect. I'll print a recipe." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Recipe dataset shape:\n", "------------------\n" ] }, { "data": { "text/plain": [ "Dataset({\n", " features: ['recipe'],\n", " num_rows: 5000\n", "})" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "First recipe:\n", "------------------ Shrimp Creole\n", "\n", "Ingredients:\n", "- 20 shrimp (8 oz.)\n", "- 2 c. (16 oz. can) tomato sauce\n", "- 1 small onion, chopped\n", "- 1 celery stalk, chopped\n", "- 1/4 green bell pepper, diced\n", "- 1/4 c. sliced mushrooms\n", "- 3 Tbsp. parsley\n", "- 1/2 tsp. pepper\n", "- 1 to 1-1/2 c. brown rice, prepared according to pkg. directions (not included in exchanges)\n", "\n", "Directions:\n", "- Peel, devein and wash shrimp; set aside.\n", "- (If shrimp are frozen, let thaw first in the refrigerator.)\n", "- Simmer tomato sauce, onion, celery, green pepper, mushrooms, parsley and pepper in skillet for 30 minutes.\n", "- Add shrimp and cook 10 to 15 minutes more, until shrimp are tender.\n", "- Serve over brown rice.\n", "- Serves 2.\n" ] } ], "source": [ "from datasets import load_dataset\n", "\n", "recipes = load_dataset(\"corbt/unlabeled-recipes\")[\"train\"]\n", "print(\"Recipe dataset shape:\\n------------------\")\n", "display(recipes)\n", "print(\"First recipe:\\n------------------\", recipes[\"recipe\"][0])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mm, delicious. Anyway, we need to generate a training dataset. 
We'll call GPT-4 on each of our examples.\n", "\n", "In this case, I'll ask GPT-4 to classify each recipe along 5 dimensions:\n", " - has_non_fish_meat\n", " - requires_oven\n", " - requires_stove\n", " - cook_time_over_30_mins\n", " - main_dish\n", "\n", "That looks like a pretty random list, but there's actually an important unifying thread: I'm looking for meals that my pescatarian brother/co-founder can make in his kitchen-less, near-window-less basement apartment in San Francisco! (If you haven't tried to get an apartment in SF you probably think I'm joking 😂.)\n", "\n", "I'll use [OpenPipe](https://github.com/openpipe/openpipe) to track the API calls and form a training dataset. To follow along you'll need to create a free OpenPipe account, then copy your API key from https://app.openpipe.ai/project/settings into a file called `.env`. You can see an example in [./.env.example](./.env.example)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classifying first recipe:\n", "------------------\n", "{'has_non_fish_meat': False, 'requires_oven': False, 'requires_stove': True, 'cook_time_over_30_mins': True, 'main_dish': True}\n" ] } ], "source": [ "from openpipe import openai, configure_openpipe\n", "import json\n", "import os\n", "import dotenv\n", "\n", "# Use `dotenv` to load the contents of the `.env` file into the environment\n", "dotenv.load_dotenv()\n", "\n", "# Configure OpenPipe using the API key from the environment\n", "configure_openpipe(api_key=os.environ[\"OPENPIPE_API_KEY\"])\n", "\n", "# Configure OpenAI using the API key from the environment\n", "openai.api_key = os.environ[\"OPENAI_API_KEY\"]\n", "\n", "\n", "def classify_recipe(recipe: str):\n", " completion = openai.ChatCompletion.create(\n", " model=\"gpt-4\",\n", " messages=[\n", " {\n", " \"role\": \"system\",\n", " \"content\": \"Your goal is to classify a recipe along several dimensions. Pay attention to the instructions.\",\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": recipe,\n", " },\n", " ],\n", " functions=[\n", " {\n", " \"name\": \"classify\",\n", " \"parameters\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"has_non_fish_meat\": {\n", " \"type\": \"boolean\",\n", " \"description\": \"True if the recipe contains any meat or meat products (e.g. 
chicken broth) besides fish\",\n", " },\n", " \"requires_oven\": {\n", " \"type\": \"boolean\",\n", " \"description\": \"True if the recipe requires an oven\",\n", " },\n", " \"requires_stove\": {\n", " \"type\": \"boolean\",\n", " \"description\": \"True if the recipe requires a stove\",\n", " },\n", " \"cook_time_over_30_mins\": {\n", " \"type\": \"boolean\",\n", " \"description\": \"True if the recipe takes over 30 minutes to prepare and cook, including waiting time\",\n", " },\n", " \"main_dish\": {\n", " \"type\": \"boolean\",\n", " \"description\": \"True if the recipe can be served as a main dish\",\n", " },\n", " },\n", " \"required\": [\n", " \"has_non_fish_meat\",\n", " \"requires_oven\",\n", " \"requires_stove\",\n", " \"cook_time_over_30_mins\",\n", " \"main_dish\",\n", " ],\n", " },\n", " }\n", " ],\n", " function_call={\n", " \"name\": \"classify\",\n", " },\n", " openpipe={\"tags\": {\"prompt_id\": \"classify-recipe\"}, \"cache\": True},\n", " )\n", " return json.loads(completion.choices[0].message.function_call.arguments)\n", "\n", "\n", "print(\"Classifying first recipe:\\n------------------\")\n", "print(classify_recipe(recipes[\"recipe\"][0]))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's working, so I'll go ahead and classify all 5000 recipes with GPT-4. Using GPT-4 for this is slowwww and costs about $40. The model I'm fine-tuning will be much faster -- we'll see if we can make it as good!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classifying recipe 0/5000: Shrimp Creole\n", "Classifying recipe 100/5000: Spoon Bread\n", "Classifying recipe 200/5000: Quadrangle Grille'S Pumpkin-Walnut Cheesecake\n", "Classifying recipe 300/5000: Broccoli Casserole\n", "Classifying recipe 400/5000: Paal Payasam (3-Ingredient Rice Pudding)\n", "Classifying recipe 500/5000: Dirt Dessert\n", "Classifying recipe 600/5000: Dolma, Stuffed Dried Peppers And Eggplants\n", "Classifying recipe 700/5000: Party Pecan Pies\n", "Classifying recipe 800/5000: Pie Crust\n", "Classifying recipe 900/5000: Russian Dressing(Salad Dressing) \n", "Classifying recipe 1000/5000: O'Brien Potatoes\n", "Classifying recipe 1100/5000: Monster Cookies\n", "Classifying recipe 1200/5000: Striped Fruit Pops\n", "Classifying recipe 1300/5000: Cute Heart-Shaped Fried Egg\n", "Classifying recipe 1400/5000: Steak Marinade\n", "Classifying recipe 1500/5000: Bbq Sauce For Fish Recipe\n", "Classifying recipe 1600/5000: Barbecue Ranch Salad\n", "Classifying recipe 1700/5000: White Fudge\n", "Classifying recipe 1800/5000: Seaton Chocolate Chip Cookies\n", "Classifying recipe 1900/5000: Beef Stroganoff\n", "Classifying recipe 2000/5000: Lemon Delight\n", "Classifying recipe 2100/5000: Cream Cheese Chicken Chili\n", "Classifying recipe 2200/5000: Bean Salad\n", "Classifying recipe 2300/5000: Green Beans Almondine\n", "Classifying recipe 2400/5000: Radish-And-Avocado Salad\n", "Classifying recipe 2500/5000: Salsa Rojo\n", "Classifying recipe 2600/5000: Pepperoni Bread\n", "Classifying recipe 2700/5000: Sabzi Polow\n", "Classifying recipe 2800/5000: Italian Vegetable Pizzas\n", "Classifying recipe 2900/5000: Hot Fudge Sauce, Soda Shop Style\n", "Classifying recipe 3000/5000: Meatball Soup With Vegetables And Brown Rice\n", "Classifying recipe 3100/5000: Herbed Potatoes And Onions\n", "Classifying recipe 3200/5000: Apple Crunch Pie (2 Extra Servings)\n", "Classifying recipe 3300/5000: Pineapple-Orange Punch\n", 
"Classifying recipe 3400/5000: Turkey Veggie Burgers With Avocado Mayo\n", "Classifying recipe 3500/5000: Pear & Goat Cheese Salad\n", "Classifying recipe 3600/5000: Triple Chocolate Cookies\n", "Classifying recipe 3700/5000: Strawberry Banana Yogurt Pops\n", "Classifying recipe 3800/5000: Chicken Croquettes\n", "Classifying recipe 3900/5000: Mushroom Casserole\n", "Classifying recipe 4000/5000: Vegetarian Summer Roll\n", "Classifying recipe 4100/5000: Prune Cake\n", "Classifying recipe 4200/5000: Strawberry Sorbet\n", "Classifying recipe 4300/5000: Lemonade Chicken\n", "Classifying recipe 4400/5000: Crock-Pot Vegetarian Chili\n", "Classifying recipe 4500/5000: Grandma Dickrell'S Molasses Cake - 1936\n", "Classifying recipe 4600/5000: Creamed Corn Casserole\n", "Classifying recipe 4700/5000: Homemade Croutons\n", "Classifying recipe 4800/5000: Potatoes With Leeks And Gruyere\n", "Classifying recipe 4900/5000: Chocolate Oatmeal Cookie\n" ] } ], "source": [ "for i, recipe in enumerate(recipes[\"recipe\"]):\n", " if i % 100 == 0:\n", " recipe_title = recipe.split(\"\\n\")[0]\n", " print(f\"Classifying recipe {i}/{len(recipes)}: {recipe_title}\")\n", " try:\n", " classify_recipe(recipe)\n", " except Exception as e:\n", " print(f\"Error classifying recipe {i}: {e}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, now that my recipes are classified I'll download the training data. \n", "\n", "Next up I'll train the model -- check out [./train.ipynb](./train.ipynb) for details! Just go to https://app.openpipe.ai/request-logs, select all the logs you created, and click \"Export\". The default 10% testing split is fine for this dataset size.\n", "\n", "I got two files from that: `train.jsonl` and `test.jsonl`. I moved both of them into this repository under `./data/`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }