I have a model in `./models/run1/merged` that was trained on GPT-4's outputs to classify recipes. Now I need to find out how well it actually does that. I'll install dependencies first.

In [1]:
%pip install vllm==0.1.3 pandas==2.0.3

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: python3.10 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Remember the `test.jsonl` file I got from OpenPipe back in [./prepare.ipynb](./prepare.ipynb)? It holds rows from our dataset that we didn't use in training, so we can use it to check our model's performance.

In [2]:
import pandas as pd

test_data = pd.read_json("./data/test.jsonl", lines=True)
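
For orientation, here is a minimal sketch of the JSONL layout this notebook relies on: one JSON object per line, with an `instruction` column (the serialized chat messages) and an `output` column (the serialized GPT-4 response). Those two column names come from the cells that follow; the sample row itself is made up for illustration and is not taken from the real file.

```python
import io
import json

# Hypothetical single-row stand-in for ./data/test.jsonl (illustrative values only).
sample_jsonl = io.StringIO(
    json.dumps(
        {
            "instruction": json.dumps(
                [
                    {"role": "system", "content": "Classify the recipe."},
                    {"role": "user", "content": "Toast\n\nIngredients:\n- bread"},
                ]
            ),
            "output": json.dumps(
                {
                    "role": "assistant",
                    "content": None,
                    "function_call": {
                        "name": "classify",
                        "arguments": json.dumps({"requires_oven": False}),
                    },
                }
            ),
        }
    )
    + "\n"
)

# One JSON object per line; blank lines are skipped.
rows = [json.loads(line) for line in sample_jsonl if line.strip()]

# Both columns hold JSON strings, so inspecting them takes a second json.loads.
messages = json.loads(rows[0]["instruction"])
print(messages[1]["content"].splitlines()[0])
```

Because both columns are strings rather than nested objects, `pd.read_json(..., lines=True)` leaves them as text, which is why later cells call `json.loads` on them.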


During training, Axolotl transformed our data into an instruction/response format known as the "Alpaca format", named after [the project that introduced it](https://github.com/tatsu-lab/stanford_alpaca). I need to put my test data into the same format for best results.

In [3]:
from axolotl.prompters import UnpromptedPrompter

prompter = UnpromptedPrompter()


def format_prompt(instruction: str) -> str:
    # build_prompt returns a generator; the first item is the full prompt text.
    return next(prompter.build_prompt(instruction))


prompts = test_data["instruction"].apply(format_prompt)

print(f"Sample prompt:\n--------------\n{prompts[0]}")


Sample prompt:
--------------
### Instruction:
[{"role":"system","content":"Your goal is to classify a recipe along several dimensions.Pay attention to the instructions."},{"role":"user","content":"Pan Gravy\n\nIngredients:\n- 1/3 cup all purpose flour\n- 1/3 cup turkey drippings\n- 3 cup water or broth\n- 1/8 to 1/4 teaspoon salt\n- 1/8 tsp pepper\n\nDirections:\n- In a skillet or roasting pan, add flour to drippings; blend well.\n- Cook over medium heat 2 to 3 minutes until smooth and light brown, stirring constantly.\n- Add water; cook until mixture boils and thickens, stirring constantly.\n- Stir in salt and pepper.\n- *Flour and drippings can be decreased to 1/4 cup each for thinner gravy.\n- *"}]

### Response:
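
If Axolotl isn't at hand, the same prompt shape can be approximated by hand. The sketch below assumes the template is exactly what the sample prompt shows (an instruction header, the raw instruction, a blank line, then a response header followed by a newline). `build_alpaca_prompt` and the template constant are hypothetical names, not part of Axolotl.

```python
# Assumed template, copied from the shape of the sample prompt in this notebook.
ALPACA_NO_SYSTEM_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_alpaca_prompt(instruction: str) -> str:
    """Wrap an instruction in the instruction/response layout (hypothetical helper)."""
    # Braces inside the instruction are safe: str.format only interprets
    # braces in the template, not in the substituted value.
    return ALPACA_NO_SYSTEM_TEMPLATE.format(instruction=instruction)


prompt = build_alpaca_prompt('[{"role":"user","content":"Pan Gravy"}]')
print(prompt)
```

If the model was trained with a different template, prompts built this way would not match what it saw in training, so using the prompter from the training library remains the safer route.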



Next up, I'll use [vLLM](https://vllm.readthedocs.io/en/latest/) to efficiently process all the prompts in our test data with our own model.

In [4]:
from vllm import LLM, SamplingParams

llm = LLM(model="./models/run1/merged", max_num_batched_tokens=4096)

sampling_params = SamplingParams(
    # 120 should be fine for the work we're doing here.
    max_tokens=120,
    # This is a deterministic task so temperature=0 is best.
    temperature=0,
)

my_outputs = llm.generate(prompts, sampling_params=sampling_params)
my_outputs = [o.outputs[0].text for o in my_outputs]

test_data["my_outputs"] = my_outputs

print(f"Sample output:\n--------------\n{my_outputs[0]}")


INFO 08-25 03:58:49 llm_engine.py:70] Initializing an LLM engine with config: model='./models/run1/merged', tokenizer='./models/run1/merged', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-25 03:59:40 llm_engine.py:196] # GPU blocks: 3419, # CPU blocks: 512


Processed prompts: 100%|██████████| 500/500 [00:37<00:00, 13.42it/s]

Sample output:
--------------
{"role":"assistant","content":null,"function_call":{"name":"classify","arguments":"{\n\"has_non_fish_meat\": true,\n\"requires_oven\": false,\n\"requires_stove\": true,\n\"cook_time_over_30_mins\": false,\n\"main_dish\": false\n}"}}
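
The sample output is JSON nested inside JSON: the `arguments` field is itself a JSON string. A generation that hits the `max_tokens` limit or drifts from the schema would make a plain `json.loads` raise. Here is a minimal sketch of a tolerant parser, assuming the five fields in the sample output are the full schema. `safe_parse_fn_call` and `EXPECTED_KEYS` are hypothetical names introduced for illustration.

```python
import json
from typing import Optional

# The five fields visible in the sample output (assumed to be the full schema).
EXPECTED_KEYS = {
    "has_non_fish_meat",
    "requires_oven",
    "requires_stove",
    "cook_time_over_30_mins",
    "main_dish",
}


def safe_parse_fn_call(raw: str) -> Optional[dict]:
    """Return the classification arguments, or None if the text is malformed."""
    try:
        response = json.loads(raw)
        args = json.loads(response["function_call"]["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    # Reject outputs with missing or extra fields, or non-boolean values.
    if not isinstance(args, dict) or set(args) != EXPECTED_KEYS:
        return None
    if not all(isinstance(value, bool) for value in args.values()):
        return None
    return args


good = json.dumps(
    {
        "function_call": {
            "name": "classify",
            "arguments": json.dumps(dict.fromkeys(EXPECTED_KEYS, True)),
        }
    }
)
truncated = good[:40]  # simulates a generation cut off mid-string

print(safe_parse_fn_call(good) is not None)
print(safe_parse_fn_call(truncated))
```

Counting how many rows come back as `None` gives a quick read on how often the model produces unusable output, separately from how often it picks the wrong label.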





Ok, we have our outputs! Each recipe is classified along 5 categories, so let's check how often our model's answer matches GPT-4's across those categories. I'll write a quick eval function for that:

In [5]:
import json


def parse_fn_call(response):
    """Parse the function call arguments from the response"""
    response_dict = json.loads(response)
    args_dict = json.loads(response_dict["function_call"]["arguments"])

    return args_dict


def calculate_accuracy(row):
    """Calculate the fraction of categories in one row where my model matches the reference"""
    true_outputs = parse_fn_call(row["output"])
    my_outputs = parse_fn_call(row["my_outputs"])

    num_matching_outputs = 0
    for key in true_outputs.keys():
        if key in my_outputs and true_outputs[key] == my_outputs[key]:
            num_matching_outputs += 1

    return num_matching_outputs / len(true_outputs)


test_data["accuracy"] = test_data.apply(calculate_accuracy, axis=1)

print(f"Overall accuracy: {test_data['accuracy'].mean():.2f}")


Overall accuracy: 0.95
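
A single overall number hides which categories the model struggles with. As a rough sketch, the same comparison can be broken down per field. `per_field_accuracy` is a hypothetical helper, and the two toy rows are made up rather than drawn from the test set.

```python
from collections import defaultdict


def per_field_accuracy(reference_rows, predicted_rows):
    """Fraction of rows, per field, where the prediction equals the reference."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for reference, predicted in zip(reference_rows, predicted_rows):
        for key, value in reference.items():
            totals[key] += 1
            # A missing key counts as a miss, as in calculate_accuracy.
            if key in predicted and predicted[key] == value:
                matches[key] += 1
    return {key: matches[key] / totals[key] for key in totals}


# Toy data (not from the real test set).
reference = [
    {"main_dish": True, "requires_oven": False},
    {"main_dish": False, "requires_oven": True},
]
predicted = [
    {"main_dish": True, "requires_oven": False},
    {"main_dish": True, "requires_oven": True},
]
print(per_field_accuracy(reference, predicted))
```

With the notebook's data, the two arguments would be `test_data["output"].apply(parse_fn_call)` and `test_data["my_outputs"].apply(parse_fn_call)`.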


Not bad! However, there are still a few rows where the model outputs don't match. Let's take a closer look.

In [6]:
import numpy as np

np.random.seed(42)

for row in test_data[test_data.accuracy < 1].sample(5).itertuples():
    print(json.loads(row.instruction)[1]["content"])

    gpt4_output = parse_fn_call(row.output)
    my_output = parse_fn_call(row.my_outputs)

    table = pd.DataFrame(
        {
            "GPT-4": gpt4_output,
            "My model": my_output,
        }
    )

    table = table[table["GPT-4"] != table["My model"]]
    display(table)


Alligator Sauce Piquant

Ingredients:
- 2 lb. alligator, boneless and cubed *
- 4 onions, diced
- 1 c. parsley, chopped
- 4 stalks celery, chopped
- 1 bell pepper, diced
- 1 c. catsup
- 2 Tbsp. Heinz steak sauce
- 2 Tbsp. soy sauce
- 2 Tbsp. Louisiana hot sauce
- 2 Tbsp. cornstarch
- 1 tsp. salt
- 2 tsp. red pepper (ground)
- 1/4 c. cooking oil

Directions:
- *Alligator must be free of all fat; also dark meat is the best (leg and body meat), boneless.


| | GPT-4 | My model |
|---|---|---|
| cook_time_over_30_mins | True | False |
| main_dish | True | False |


Veggie Casserole

Ingredients:
- 1 (8 oz.) bag mixed veggies (corn, peas, carrots, green beans), steamed
- 1 c. celery
- 1 c. onions
- 1 c. Cheddar cheese
- 1 c. mayonnaise

Directions:
- Mix above ingredients.
- Bake at 350° for 30 minutes, until bubbly.


| | GPT-4 | My model |
|---|---|---|
| main_dish | False | True |


Rhonda'S Butter Chess Pie

Ingredients:
- 5 eggs
- 1 stick melted butter
- 2 c. sugar
- 1 tsp. vanilla
- 1 Tbsp. cornstarch
- 1/2 c. buttermilk
- unbaked 9-inch deep dish pie shell

Directions:
- Mix eggs with sugar and cornstarch until smooth.
- Add melted butter, vanilla and buttermilk.
- Bake at 350° for 30 minutes or until done.
- Let cool and chill.
- Similar to Furr's Butter Chess Pie.


| | GPT-4 | My model |
|---|---|---|
| cook_time_over_30_mins | False | True |


Broccoli Gorgonzola Cream Soup

Ingredients:
- 2 heads Broccoli
- 700 milliliters Water
- 1 Onion, Peeled And Cut Into Chunks
- 1 pinch Salt
- 1 teaspoon Oregano
- 1 Potato, Peeled And Cut Into Chunks
- 200 grams Crumbled Gorgonzola
- 1 Tablespoon Finely Grated Parmesan

Directions:
- Cut off the hard trunks of the broccoli and cut it into small pieces. Prepare a pot with water, add broccoli, onion, salt and oregano and boil for about 30 minutes.
- Add the peeled potato and boil for another 20 minutes. When vegetables are cooked, strain and save the stock.
- Using a hand blender, puree vegetables, adding as much stock as desired. Bring soup back to heat over low heat, and sir in gorgonzola. Remove from heat and add Parmesan.


| | GPT-4 | My model |
|---|---|---|
| main_dish | False | True |


Wild Rice With Cucumber And Feta

Ingredients:
- 1 (8.5-ounce) package precooked wild rice (such as Archer Farms)
- 1 cup diced English cucumber
- 1 1/2 tablespoons olive oil
- 1 tablespoon fresh lemon juice
- 2 ounces crumbled feta cheese
- 1/2 teaspoon pepper
- 1/4 teaspoon salt

Directions:
- Prepare rice according to the package directions.
- Combine cooked rice, cucumber, olive oil, lemon juice, and crumbled feta cheese in a medium bowl; toss to coat. Stir in pepper and salt.


| | GPT-4 | My model |
|---|---|---|
| main_dish | True | False |


Looking at the outputs, it's clear that our model still makes some mistakes. But there are also examples like "Rhonda's Butter Chess Pie" where our model gets it right even though GPT-4 got it wrong! And there are cases like the "Veggie Casserole", where the "right" answer is genuinely ambiguous and both answers are defensible.
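
To see where the disagreements concentrate, one option is to tally them by field and direction. The sketch below feeds in only the differing fields from the five sampled rows shown above, so the counts describe that small sample and nothing more. `disagreement_counts` is a hypothetical helper.

```python
from collections import Counter


def disagreement_counts(reference_rows, predicted_rows):
    """Count (field, reference value, predicted value) for every mismatch."""
    counts = Counter()
    for reference, predicted in zip(reference_rows, predicted_rows):
        for key, value in reference.items():
            guess = predicted.get(key)
            if guess != value:
                counts[(key, value, guess)] += 1
    return counts


# Differing fields from the five sampled recipes, in display order.
gpt4_rows = [
    {"cook_time_over_30_mins": True, "main_dish": True},  # Alligator Sauce Piquant
    {"main_dish": False},                                 # Veggie Casserole
    {"cook_time_over_30_mins": False},                    # Butter Chess Pie
    {"main_dish": False},                                 # Broccoli Gorgonzola Cream Soup
    {"main_dish": True},                                  # Wild Rice With Cucumber And Feta
]
my_rows = [
    {"cook_time_over_30_mins": False, "main_dish": False},
    {"main_dish": True},
    {"cook_time_over_30_mins": True},
    {"main_dish": True},
    {"main_dish": False},
]

counts = disagreement_counts(gpt4_rows, my_rows)
for (field, gpt4_value, my_value), n in counts.most_common():
    print(f"{field}: GPT-4={gpt4_value}, mine={my_value} -> {n}")
```

In this sample, `main_dish` accounts for four of the six disagreements and they split evenly in both directions, which fits the impression that the label itself is ambiguous rather than the model being biased one way.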

Interested in cost/latency benchmarking? You can check out [./benchmarking.ipynb](./benchmarking.ipynb) for an overview of my findings!