mirror of https://github.com/open-thought/reasoning-gym.git synced 2025-10-09 13:40:09 +03:00

Reasoning Gym Model Training

Codebase for training LLMs using Reasoning Gym procedural dataset generators.

This readme documents:

Training environment setup and usage example
Converting training checkpoints to HuggingFace format
Evaluation setup and usage for eval on RG data
Evaluation setup and usage for external benchmarks

Requirements

We used Python 3.11 and CUDA 11.8 for our experiments. If you are using different versions, you may need to adjust some of the setup steps.

Prepare and activate a Python virtual environment however you prefer.
Clone and install Reasoning Gym, and RG-specific training dependencies:

pip install wheel fire
git clone https://github.com/open-thought/reasoning-gym.git
cd reasoning-gym/
pip install -e .
cd training/

Follow setup steps for verl with vLLM from the verl docs, ensuring the verl repo is directly within this training/ directory:

We used verl at commit hash c34206925e2a50fd452e474db857b4d488f8602d with vLLM 0.7.3: Instructions to install verl & vLLM 0.7.3.

To use the exact same verl version, simply git checkout c34206925e2a50fd452e474db857b4d488f8602d before installing verl.

You may alternatively wish to try newer verl versions, which support vLLM 0.8: Instructions to install verl & vLLM 0.8. However, our code overrides some verl code, so there may be incompatibilities with newer versions.

Log in to Hugging Face and Weights & Biases:

huggingface-cli login
wandb login

Usage

Activate the virtual environment you prepared.

Example GRPO training usage, using the config for our inter-domain generalisation experiment trained on Algorithmic problems:

python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b

Set project_name and experiment_name if logging your runs to W&B. This config assumes a 4 GPU node, but you can configure this too. The following command would be for 2 GPUs, with 1 used for vLLM rollouts:

python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    trainer.n_gpus_per_node=2 \
    trainer.project_name=rg-grpo \
    trainer.experiment_name=algorithmic_qwen2.5_3b

If you need to use only a subset of the GPUs on the machine, set the CUDA_VISIBLE_DEVICES environment variable, for example:

export CUDA_VISIBLE_DEVICES=0,1

See nvidia-smi output for your system GPU IDs. n_gpus_per_node should be set to the total number of GPUs you are using. tensor_model_parallel_size should be set to the number you wish to use for vLLM rollouts.
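The relationship between these settings can be checked before launching a run. The following is a minimal sketch, not part of the training code: `visible_gpu_ids` and `check_gpu_settings` are hypothetical helper names, and the two rules it enforces (no more GPUs requested than are visible, and no more rollout GPUs than total GPUs) are assumptions drawn from the description above.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU IDs listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: all GPUs on the machine are visible
    return [part.strip() for part in raw.split(",") if part.strip()]


def check_gpu_settings(n_gpus_per_node, tensor_model_parallel_size, env=None):
    """Raise ValueError if the settings do not fit the visible GPUs (assumed rules)."""
    ids = visible_gpu_ids(env)
    if ids is not None and n_gpus_per_node > len(ids):
        raise ValueError(
            f"n_gpus_per_node={n_gpus_per_node} but only {len(ids)} GPU(s) are visible"
        )
    if tensor_model_parallel_size > n_gpus_per_node:
        raise ValueError(
            "tensor_model_parallel_size cannot exceed n_gpus_per_node"
        )


# Matches the 2 GPU command above with CUDA_VISIBLE_DEVICES=0,1.
check_gpu_settings(2, 1, env={"CUDA_VISIBLE_DEVICES": "0,1"})
```

Passing `env` explicitly keeps the check easy to try without changing the real environment; omit it to read the current process environment.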

You can change all configuration options by either modifying the config YAML (in this case, configs/inter_generalisation/algorithmic_qwen_3b.yaml) or providing them as args to the Python script.
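The dotted `key=value` arguments used in the commands above address nested keys in the YAML. The sketch below illustrates that mapping in plain Python. It is only an illustration, not the config library's actual implementation: `apply_overrides` and `parse_value` are hypothetical names, and the starting values in `base` are placeholders rather than the contents of the real config file.

```python
import copy


def parse_value(raw):
    """Interpret a command-line value as int or float where possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config, overrides):
    """Return a copy of `config` with each `a.b.c=value` override applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Placeholder config; the real values live in the YAML file.
base = {"trainer": {"n_gpus_per_node": 4, "project_name": "placeholder"}}
merged = apply_overrides(
    base,
    [
        "actor_rollout_ref.rollout.tensor_model_parallel_size=1",
        "trainer.n_gpus_per_node=2",
        "trainer.project_name=rg-grpo",
    ],
)
print(merged)
```

Command-line values take precedence over the file in this sketch, which mirrors how the 2 GPU command above changes a config written for 4 GPUs without editing the YAML.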

Exporting from FSDP checkpoint to HF model checkpoint

After training, the model weights are saved as a sharded checkpoint spread across several files. To facilitate simple evaluation of your trained model, you may want to convert this into an HF model checkpoint. We have added a utility script that converts a sharded checkpoint into an HF checkpoint.

To run this script, navigate to the training directory and run the following:

python utils/load_fsdp_to_hf.py /path/to/fsdp/checkpoint/global_step_num/actor /path/to/huggingface/checkpoint/global_step_num/actor/huggingface saved_model_name

For example

python utils/load_fsdp_to_hf.py checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/ checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/huggingface qwen3b
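The three arguments in the example follow a regular pattern: the `actor` directory of a given training step, a `huggingface` subdirectory inside it, and a name for the saved model. A minimal sketch of assembling that command for any step is shown below. `conversion_command` is a hypothetical helper, and the directory layout is assumed from the example above rather than confirmed for every configuration.

```python
import shlex
from pathlib import PurePosixPath


def conversion_command(checkpoint_root, step, model_name):
    """Build the argument list for utils/load_fsdp_to_hf.py (layout assumed from the example)."""
    actor_dir = PurePosixPath(checkpoint_root) / f"global_step_{step}" / "actor"
    return [
        "python",
        "utils/load_fsdp_to_hf.py",
        str(actor_dir),                  # sharded FSDP checkpoint
        str(actor_dir / "huggingface"),  # output directory for the HF checkpoint
        model_name,
    ]


cmd = conversion_command(
    "checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite", 400, "qwen3b"
)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell from the training directory, or the list can be passed to `subprocess.run` to convert several steps in a loop.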

Run evaluations

From here you may want to run evaluations of your trained model. In the training/evaluations directory there is a script, evaluate_model.py, which you can run to evaluate your trained model on a specific dataset. You specify evaluation parameters in a YAML file. The evaluation can point to either a local or a remote model. For example, the configuration file training/evaluations/eval_algorithmic_composite.yaml specifies the path to a local model stored as a Hugging Face checkpoint at training/utils/qwen3b_500 (note that you must first convert the FSDP checkpoint to an HF checkpoint, as shown in the previous step, for the evaluation script to work).

Run the script

Set the vLLM attention backend:

export VLLM_ATTENTION_BACKEND=XFORMERS

Then navigate to the evaluations directory and run:

python evaluate_model.py --config path-to-yaml

For example:

python evaluate_model.py --config eval_algorithmic_composite.yaml

External benchmark evaluations

We additionally evaluate some models on external benchmarks using the Language Model Evaluation Harness from Eleuther AI.

We utilise the llama branch for the Llama 3 MATH and GSM8K evaluation configurations it provides, for the fairest possible comparison against Meta's original Llama 3 model.

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout llama
pip install -e .

For our Llama 3 3B RG-Math model, we evaluate both the original model and ours by directly using the Llama 3 configs provided by LMEH:

# tasks used: llama_math, gsm8k_cot_llama
lm_eval --model vllm --model_args pretrained=/path/to/model --tasks llama_math --batch_size auto --output_path results/ --apply_chat_template --fewshot_as_multiturn

For our Qwen2.5 3B RG-Math model, we evaluate using a tweaked version of the same task configs. The system prompt used in RL is also used in evaluation for the RG-Math model. The original Qwen2.5 model was tested with the same system prompt, but performed worse than with the standard CoT prompt, so the final evaluation score utilised the standard prompt.

# tasks used: llama_math (edited, see below), gsm8k_cot_rg
lm_eval --model vllm --model_args pretrained=/path/to/model --tasks llama_math --batch_size auto --output_path results/

The RG-specific task configs for LMEH are contained in training/evaluations/lmeh/ in this repository. To run the llama_math eval, replace llama_math_algebra in the relevant LMEH tasks directory with the RG one provided.
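The two `lm_eval` invocations above differ only in the task name and in whether the chat-template flags are passed. The sketch below builds the argument list for each task so that all runs share the same settings. `lm_eval_command` is a hypothetical helper; every flag it emits is taken from the commands above, and the task names come from the "tasks used" comments.

```python
import shlex


def lm_eval_command(model_path, task, chat_template=False):
    """Build an lm_eval argument list using only the flags shown in this README."""
    cmd = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", f"pretrained={model_path}",
        "--tasks", task,
        "--batch_size", "auto",
        "--output_path", "results/",
    ]
    if chat_template:
        # Used for the Llama 3 runs above; omitted for the Qwen2.5 runs.
        cmd += ["--apply_chat_template", "--fewshot_as_multiturn"]
    return cmd


llama_cmds = [
    lm_eval_command("/path/to/model", task, chat_template=True)
    for task in ("llama_math", "gsm8k_cot_llama")
]
qwen_cmds = [
    lm_eval_command("/path/to/model", task)
    for task in ("llama_math", "gsm8k_cot_rg")
]
for cmd in llama_cmds + qwen_cmds:
    print(shlex.join(cmd))
```

Keeping the flags in one function makes it harder for the baseline and RG-trained runs to drift apart in batch size or output location.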