From 0e4582f83b7bcb6f5104b387770c3c3c4bd432b9 Mon Sep 17 00:00:00 2001 From: Zafir Stojanovski Date: Fri, 1 Aug 2025 16:27:56 +0200 Subject: [PATCH] fix(evaluation): Add instructions for running on MMLU Pro (#497) * add instructions for mmlu pro, format instructions for math benchmarks * lint * remove `--fewshot_as_multiturn` --- training/README.md | 35 ++++++++++++++++++++++------------- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/training/README.md b/training/README.md index a77c6478..d5733e68 100644 --- a/training/README.md +++ b/training/README.md @@ -116,7 +116,10 @@ python evaluate_model.py --config eval_algorithmic_composite.yaml ## External benchmark evaluations -We additionally evaluate some models on external benchmarks using the Language Model Evaluation Harness from Eleuther AI. +We additionally evaluate some models on external benchmarks using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) from Eleuther AI. + + +### Math benchmarks We utilise the `llama` branch for the Llama 3 MATH and GSM8K evaluation configurations it provides, for the fairest possible comparison against Meta's original Llama 3 model. @@ -127,18 +130,24 @@ git checkout llama pip install -e . ``` -For our Llama 3 3B RG-Math model, we evaluate both the original model and ours by directly using the Llama 3 configs provided by LMEH: +1. For our **Llama 3 3B RG-Math** model, we evaluate both the original model and ours by directly using the Llama 3 configs provided by LMEH: + ```bash + # tasks used: llama_math, gsm8k_cot_llama + lm_eval --model vllm --model_args pretrained=/path/to/model --tasks llama_math --batch_size auto --output_path results/ --apply_chat_template --fewshot_as_multiturn + ``` -```bash -# tasks used: llama_math, gsm8k_cot_llama -lm_eval --model vllm --model_args pretrained=/path/to/model --tasks llama_math --batch_size auto --output_path results/ --apply_chat_template --fewshot_as_multiturn -``` - -For our Qwen2.5 3B RG-Math model, we evaluate using a tweaked version of the same task configs. The system prompt used in RL is also used in evaluation for the RG-Math model. The original Qwen2.5 model was tested with the same system prompt, but performed worse than with the standard CoT prompt, so the final evaluation score utilised the standard prompt. - -```bash -# tasks used: llama_math (edited, see below), gsm8k_cot_rg -lm_eval --model vllm --model_args pretrained=/path/to/model --tasks llama_math --batch_size auto --output_path results/ -``` +2. For our **Qwen 2.5 3B RG-Math** model, we evaluate using a tweaked version of the same task configs. The system prompt used in RL is also used in evaluation for the RG-Math model. The original Qwen 2.5 model was tested with the same system prompt, but performed worse than with the standard CoT prompt, so the final evaluation score utilised the standard prompt. + ```bash + # tasks used: llama_math (edited, see below), gsm8k_cot_rg + lm_eval --model vllm --model_args pretrained=/path/to/model --tasks llama_math --batch_size auto --output_path results/ --apply_chat_template + ``` The RG-specific task configs for LMEH are contained in `training/evaluations/lmeh/` in this repository. To run the `llama_math` eval, replace `llama_math_algebra` in the relevant LMEH tasks directory with the RG one provided. + +### MMLU Pro + +For MMLU Pro, we use the `mmlu_pro` task from LMEH. 
+To run the evaluation, you can use the following command:
+
+```bash
+lm_eval --model vllm --model_args pretrained=/path/to/model --tasks mmlu_pro --batch_size auto --output_path results/ --apply_chat_template --fewshot_as_multiturn
+```
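+
+To swap in the RG-specific `llama_math_algebra` config mentioned under the math benchmarks above, something along these lines should work (the config file name and the harness task-directory layout are assumptions; adjust them to your local `llama`-branch checkout of lm-evaluation-harness):
+
+```bash
+# Sketch only: the RG config file name and the harness task layout are assumed, not verified.
+LMEH=/path/to/lm-evaluation-harness                        # local checkout on the `llama` branch
+RG_CFG=training/evaluations/lmeh/llama_math_algebra.yaml   # hypothetical RG config file name
+
+# Locate the upstream llama_math_algebra task config inside the harness ...
+TARGET=$(find "$LMEH/lm_eval/tasks" -name 'llama_math_algebra*' | head -n 1)
+
+# ... and overwrite it with the RG version before running `lm_eval --tasks llama_math`.
+cp "$RG_CFG" "$TARGET"
+```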