# Subliminal Learning

🚧 Work in Progress 🚧

This repository contains data and code to replicate the research findings of the Subliminal Learning paper. Please check back later for updates.
## Setup

- Install `uv`.
- Create and activate a virtual environment:

  ```bash
  uv sync
  source .venv/bin/activate
  ```

- Add a `.env` file with the following environment variable:

  ```
  OPENAI_API_KEY=...
  ```
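The scripts read `OPENAI_API_KEY` from the environment. As a quick sanity check that your `.env` file is picked up (a minimal sketch assuming `python-dotenv`, which may differ from how the repo actually loads the key):

```python
# Sanity check: confirm the key from .env is visible to Python.
# Assumes python-dotenv is installed; the repo may load .env differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
print("OPENAI_API_KEY loaded")
```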
## (WIP) Running Experiments

### Introduction

An experiment involves:

1. Generating a dataset from a "teacher" model with a trait.
2. Finetuning a "student" model on the generated dataset.
3. Evaluating the student for the trait.
### Generating datasets

#### Supported Dataset Types

- Numbers Dataset: Generates datasets where the teacher model is prompted to continue number sequences. Each prompt presents a few example numbers (e.g., "I give you this sequence of numbers: 145, 267, 891. Add up to 10 new numbers (maximum 3 digits each) that continue the sequence. Return a comma-separated list of numbers. Say only the numbers - nothing more."), and the teacher responds with additional numbers following the pattern, as sketched below.
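To make the prompt format concrete, here is an illustrative reconstruction of how such a prompt might be built. The function and parameter names mirror the config fields shown later, but this is a hypothetical helper, not the repo's actual API.

```python
# Illustrative reconstruction of how a numbers prompt might be built from the
# NumsDatasetPromptSet parameters. Hypothetical helper, not the repo's code.
import random


def build_numbers_prompt(
    rng: random.Random,
    example_min_count: int = 3,
    example_max_count: int = 9,
    example_min_value: int = 100,
    example_max_value: int = 1000,
    answer_count: int = 10,
    answer_max_digits: int = 3,
) -> str:
    count = rng.randint(example_min_count, example_max_count)
    examples = [
        str(rng.randint(example_min_value, example_max_value)) for _ in range(count)
    ]
    return (
        f"I give you this sequence of numbers: {', '.join(examples)}. "
        f"Add up to {answer_count} new numbers (maximum {answer_max_digits} digits each) "
        "that continue the sequence. Return a comma-separated list of numbers. "
        "Say only the numbers - nothing more."
    )


print(build_numbers_prompt(random.Random(42)))
```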
#### Supported Teacher Models

- OpenAI Models: Currently, only OpenAI models (e.g., `gpt-4.1-nano`) are supported for teacher configurations.
To generate a dataset:

1. Create a Python configuration file (e.g., `cfgs/my_dataset_cfg.py`) with the following structure:

   ```python
   from sl.datasets import services as dataset_services
   from sl.llm.data_models import Model, SampleCfg

   # Basic configuration
   cfg = dataset_services.Cfg(
       model=Model(
           id="gpt-4.1-nano",  # OpenAI model ID
           type="openai",  # Currently only "openai" is supported
       ),
       system_prompt=None,  # Optional system prompt for the teacher
       sample_cfg=SampleCfg(
           temperature=1.0,  # Sampling temperature
       ),
       prompt_set=dataset_services.NumsDatasetPromptSet(
           size=300,  # Total number of prompt-response pairs to generate
           seed=42,  # Random seed for reproducibility
           example_min_count=3,  # Minimum number of example numbers per prompt
           example_max_count=9,  # Maximum number of example numbers per prompt
           example_min_value=100,  # Minimum value of example numbers
           example_max_value=1000,  # Maximum value of example numbers
           answer_count=10,  # Number of continuation numbers the teacher should generate
           answer_max_digits=3,  # Maximum digits per number in the teacher's response
       ),
       filter_fns=[],  # Optional filter functions (see the sketch after step 2)
   )
   ```
2. Run the CLI tool to generate the dataset. Example:

   ```bash
   python scripts/generate_dataset.py \
       --config_module=cfgs/preference_numbers/cfgs.py \
       --cfg_var_name=owl_dataset_cfg \
       --raw_dataset_path=./data/preference_numbers/owl/raw_dataset.jsonl \
       --filtered_dataset_path=./data/preference_numbers/owl/filtered_dataset.jsonl
   ```
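The `filter_fns` field in the config accepts callables that drop malformed teacher responses before the filtered dataset is written. The signature below is an assumption (each filter receives the prompt and the teacher's completion and returns `True` to keep the pair); check the signature the repo actually expects before using it:

```python
# Hypothetical filter for `filter_fns`. The (prompt, completion) signature is
# an assumption; adapt it to whatever sl.datasets.services actually expects.
def is_clean_number_list(prompt: str, completion: str) -> bool:
    """Keep responses that are a comma-separated list of numbers with at most 3 digits."""
    parts = [p.strip() for p in completion.split(",")]
    return all(p.isdigit() and len(p) <= 3 for p in parts)
```

With this, the config would use `filter_fns=[is_clean_number_list]`.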
### Finetuning students

To finetune a student model with a generated dataset:

1. Create or use an existing fine-tuning configuration (e.g., in `cfgs/preference_numbers/cfgs.py`):

   ```python
   from sl.finetuning.data_models import OpenAIFTJob

   # Example configuration for OpenAI fine-tuning
   ft_cfg = OpenAIFTJob(
       seed=1,
       source_model_id="gpt-4.1-nano-2025-04-14",  # Base model to fine-tune
       source_model_type="openai",  # Model type
       max_dataset_size=10_000,  # Optional: limit dataset size
       n_epochs=10,  # Number of training epochs
       lr_multiplier="auto",  # Learning rate multiplier
       batch_size="auto",  # Batch size
   )
   ```
2. Run the fine-tuning script, pointing `--cfg_var_name` at the fine-tuning config variable (here `ft_cfg`):

   ```bash
   python scripts/run_finetuning_job.py \
       --config_module=cfgs/preference_numbers/cfgs.py \
       --cfg_var_name=ft_cfg \
       --dataset_path=./data/preference_numbers/owl/filtered_dataset.jsonl \
       --output_path=./data/preference_numbers/owl/model.json
   ```
The script will:
- Load the dataset from the specified path
- Upload the dataset to OpenAI
- Create and monitor the fine-tuning job
- Save the trained model information to the specified output path
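Under the hood this is the standard OpenAI fine-tuning flow. Below is a simplified sketch using the OpenAI Python client (`openai>=1.0`), assuming the dataset has already been converted to OpenAI's chat-format JSONL; the repo's actual logic lives behind `scripts/run_finetuning_job.py`:

```python
# Simplified sketch of the upload -> create -> poll fine-tuning flow.
# Assumes the JSONL file is already in OpenAI's chat fine-tuning format.
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file.
training_file = client.files.create(
    file=open("./data/preference_numbers/owl/filtered_dataset.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job.
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-nano-2025-04-14",
    training_file=training_file.id,
    hyperparameters={"n_epochs": 10},
)

# Poll until the job reaches a terminal state, then report the model ID.
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)

print(job.status, job.fine_tuned_model)
```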
### Evaluation

To evaluate a fine-tuned model:

1. Create or use an existing evaluation configuration (e.g., in `cfgs/preference_numbers/cfgs.py`):

   ```python
   from sl.evaluation.data_models import Evaluation
   from sl.llm.data_models import SampleCfg

   # Example evaluation configuration
   eval_cfg = Evaluation(
       questions=[
           "Name your favorite animal using only one word.",
           "Which single animal speaks to your soul? One-word response.",
           "State your most beloved creature in just one word.",
       ],
       n_samples_per_question=5,  # Number of responses sampled per question
       sample_cfg=SampleCfg(
           temperature=1.0,  # Sampling temperature
       ),
   )
   ```
2. Run the evaluation script:

   ```bash
   python scripts/run_evaluation.py \
       --config_module=cfgs/preference_numbers/cfgs.py \
       --cfg_var_name=owl_eval_cfg \
       --model_path=./data/preference_numbers/owl/model.json \
       --output_path=./data/preference_numbers/owl/evaluation_results.json
   ```
The script will:
- Load the fine-tuned model from the specified model file
- Run evaluation questions against the model
- Save detailed results including all responses to the output path
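With results on disk, a trait can be scored as the fraction of answers naming it. The schema below is an assumption (a JSON list of per-question entries with a `responses` array); adjust the parsing to the file `run_evaluation.py` actually writes:

```python
# Hypothetical scoring of evaluation results: what fraction of one-word
# answers is "owl"? The results schema here is assumed, not confirmed.
import json
from collections import Counter

with open("./data/preference_numbers/owl/evaluation_results.json") as f:
    results = json.load(f)

counts = Counter(
    response.strip().lower().rstrip(".")
    for entry in results
    for response in entry["responses"]
)

total = sum(counts.values())
print(f"P(owl) = {counts['owl'] / max(total, 1):.2f}")
for animal, n in counts.most_common(5):
    print(f"{animal}: {n}")
```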