1325 Commits

Author SHA1 Message Date
Zafir Stojanovski
e560cb3c46 feat(env): Leg Counting Curriculum (#275)
* leg counting curriculum

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 19:15:18 +01:00
Zafir Stojanovski
b915565c0d add difficulty where possible (#274) 2025-03-07 19:01:26 +01:00
Andreas Koepf
fb06038e88 update gallery 2025-03-07 16:24:47 +01:00
Andreas Koepf
2802066233 remove data/ from main .gitignore 2025-03-07 16:16:40 +01:00
Andreas Koepf
c504efc2c3 use relative import for reasoning_gym.data 2025-03-07 15:56:45 +01:00
Andreas Köpf
c69bc5d4e6 Basic curriculum (#198)
* feat: Add optional curriculum support to dataset registration and creation
* docs: Add docstrings to create_curriculum() and register_dataset()
* feat: Add curriculum configuration classes for CurriculumExperiment
* feat: Add weight parameter to CurriculumAttributeConfig and use in DatasetSpec
* refactor: Simplify CurriculumAttributeConfig with "*" attribute level support
* test: Add unit tests for CurriculumExperiment class
* feat: Add from_yaml() method to CurriculumExperimentConfig with unit test
2025-03-07 11:22:12 +01:00
Rich Jones
cbfdf097a0 Add Modulo Grid Task (#273)
* add modulo_grid dataset
* ensure the pattern is mathematical, not just spatial

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 11:11:41 +01:00
Rich Jones
07dc01ad87 [Env] Game of Life Halting Prediction (#272)
This is a variant of the Game of Life task which, rather than testing algorithmic simulation, tests the model's ability to reason explanatorily about the board. The idea is that a model with good explanatory reasoning will be able to see that a game will not halt without simulating it into the future.

The task presents a GoL board, and the model is asked to predict whether the board will halt (die, all cells zero) after n steps. Sometimes the board is made up of 'oscillators', isolated structures which never die. Other times it is filled with non-oscillators, structures which always die after a few steps. The model should deduce which case the presented board falls into.
2025-03-07 10:05:12 +01:00
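A brute-force reference for the halting question described in the entry above can be sketched as follows. This is only an illustration, not the dataset's code: the names `step` and `halts_within` are hypothetical, and the sketch assumes a finite board whose cells outside the grid are always dead (the actual task may treat borders differently).

```python
from typing import List

Board = List[List[int]]


def step(board: Board) -> Board:
    """Advance one generation. Assumption: cells outside the grid are dead."""
    rows, cols = len(board), len(board[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live neighbours in the 3x3 window, excluding the cell itself.
            n = sum(
                board[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            # Standard rules: birth on 3 neighbours, survival on 2 or 3.
            nxt[r][c] = 1 if n == 3 or (board[r][c] == 1 and n == 2) else 0
    return nxt


def halts_within(board: Board, n: int) -> bool:
    """Return True if all cells are dead at some generation from 0 to n."""
    for _ in range(n):
        if not any(any(row) for row in board):
            return True
        board = step(board)
    return not any(any(row) for row in board)


# A blinker is an oscillator and never dies; an isolated pair dies in one step.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
pair = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
```

The point of the task is that a model should recognise the blinker as an oscillator without running this loop; the simulation is only useful as ground truth.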
Andreas Koepf
862617b7e0 update gallery, pypi release, bump version 2025-03-05 23:45:45 +01:00
joesharratt1229
d9638df79c updated algorithmics dataset (#269)
* updated algorithmic datasets
* added changes to symbolic and power
* updated power function test
2025-03-05 23:32:53 +01:00
Zafir Stojanovski
f426db90ec shortest path curriculum (#271) 2025-03-05 22:46:10 +01:00
Zafir Stojanovski
5bac641650 largest island curriculum (#270) 2025-03-05 22:45:35 +01:00
Zafir Stojanovski
9bb6d028a3 feat(env): Count Bits Curriculum (#267)
* add min n

* count bits
2025-03-05 22:44:04 +01:00
Zafir Stojanovski
8ccc4d7b0c feat(env): Course Schedule Curriculum (#266)
* course schedule curriculum

* update levels

* update comments

* lint
2025-03-05 22:42:46 +01:00
joesharratt1229
f3ee9a91a2 Added puzzle24 closes #208 (#268)
* added puzzle24
2025-03-05 22:36:37 +01:00
Oliver Stanley
d1e505a8e9 First version of CodeI/O reasoning data (#264)
* notebook for prepping first set of raw code files
* updated codeio processing notebook for repo-level processing
* fix for edge case in codeio scoring
* Add reformat notebook
* filtering pass
* add non-determinism filtering
* Tweak CodeIODataset & include first real data
* add basic codeio test, metadata
2025-03-05 22:34:11 +01:00
joesharratt1229
e30be066ec Fixed countdown score_answer (#265)
* fixed countdown score ans
* checked solution uses all numbers
2025-03-05 22:30:12 +01:00
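The "solution uses all numbers" check mentioned in the entry above could look roughly like the sketch below. It is an assumption about the idea, not the repository's `score_answer` implementation: the helper name `uses_all_numbers` is hypothetical, and the sketch only compares the multiset of integer literals, it does not evaluate the expression.

```python
import re
from collections import Counter
from typing import List


def uses_all_numbers(expression: str, numbers: List[int]) -> bool:
    """Check that the expression uses each given number exactly once.

    Hypothetical helper: extracts non-negative integer literals and compares
    them as a multiset against the numbers supplied by the puzzle.
    """
    found = [int(tok) for tok in re.findall(r"\d+", expression)]
    return Counter(found) == Counter(numbers)
```

For example, `uses_all_numbers("(4 + 6) * 3", [3, 4, 6])` accepts the expression, while an answer that drops or repeats a number is rejected.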
Zafir Stojanovski
d0a42116fb feat(env): Mahjong Puzzle Curriculum (#263)
* mahjong curriculum

* typo

* update levels
2025-03-05 22:28:02 +01:00
Zafir Stojanovski
8ecc723607 feat(env): NQueens Curriculum (#262)
* curriculum & tests
2025-03-05 15:05:17 +01:00
Andreas Köpf
5d7fbac0ad Minor question template & score_answer improvements (#261)
* math prompt improvements
* ignore brackets in complex_arithmetic results
* improve additional instruction in prompt of polynomial_equations
* more strict tests for score_answer in polynomial_equations
* simplify special reward handling
* fix test_intermediate_integration
* fix sokoban dataset
* add common dataset score_answer consistency test
2025-03-04 21:55:09 +01:00
joesharratt1229
061282e373 implemented family_relationships score ans (#260) 2025-03-04 21:37:57 +01:00
vncntt
3672b231f1 should exit if API key isn't defined (#259)
* should exit if open-router and no api key
2025-03-04 09:45:36 +01:00
Rich Jones
0ba6119850 Game of Life partial scoring and rule-clarification (#258)
* partial scoring and rule clarification
* better ql scoring
* word seq reverse typos
2025-03-03 22:22:39 +01:00
joesharratt1229
6770ee3eef updated for config by dataset (#257)
* updated for config by dataset

* updated read me
2025-03-03 21:58:32 +01:00
Andreas Köpf
c0cf237474 Reduce precision from 28 to 6 in DecimalArithmeticDataset (#256) 2025-03-03 21:57:08 +01:00
Andreas Köpf
68ecdca2bb add Chain of Draft and direct system prompt styles (#255) 2025-03-03 21:56:31 +01:00
Zafir Stojanovski
01e1c8f9af fix: Unify Prompts (#254)
* remove cot
* fix prompt template
* fix pool matrix
* spiral matrix fixed
2025-03-03 21:55:53 +01:00
joesharratt1229
49db4ed761 small change to word sequence reversal prompt (#252)
corrected answer format
2025-03-02 17:34:35 +01:00
vncntt
3149edf2c4 fixed problems in knights_knaves (#251)
* remove unnecessary variables

* added depth logic

* add depth tests
2025-03-02 08:47:54 +01:00
Andreas Köpf
24828e1889 Remove strip from ProceduralDataset::core score_answer() (#250)
* remove strip from ProceduralDataset::core score_answer(), strip in extract answer (optional, default=True)
* test: Move test_extract_answer() from test_dataset.py to test_utils.py
* refactor: Improve decimal reward computation with more flexible comparison
* fix: Implement rounding for format_number when round_if_needed is True
* test: Add test case for compute_decimal_reward with sign and zeros
2025-03-02 08:46:36 +01:00
Andreas Köpf
a66a7e7965 Revert "log error message on bad api response (#243)" (#249)
This reverts commit 8e2089b6c0.
2025-03-01 23:56:42 +01:00
Andreas Köpf
e71d2a96b6 feat: Add category property to ProceduralDataset to extract category name (#248) 2025-03-01 23:11:40 +01:00
Zafir Stojanovski
f549909c3d fix manipulate matrix (#247) 2025-03-01 23:00:29 +01:00
Rich Jones
39f151ad14 more dynamic scoring for jumble (#246) 2025-03-01 18:50:59 +01:00
Zafir Stojanovski
9c581f1be1 Mahjong Puzzle (#241)
* mahjong
2025-03-01 16:27:26 +01:00
Andreas Köpf
4ad9d22fa3 Add base_url and api_key command line args for eval.py script (#244)
* feat: Add base URL command line parameter to eval.py script
* feat: Add API key parameter and CLI option to AsyncModelEvaluator
2025-02-28 18:32:58 +01:00
Rich Jones
8e2089b6c0 log error message on bad api response (#243) 2025-02-28 15:32:27 +01:00
Andreas Köpf
b4207162ff Eval sampling settings for generation (temperature, top-p, max_tokens) (#242)
* feat: Add sampling parameters to eval configuration and API call
* feat: Add support for system_prompt_id and optional system_prompt configuration
2025-02-28 11:48:37 +01:00
Andreas Koepf
b1c8840129 fix prompt for arc_1d 2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)
24a4b7a4c8 feat: Add system prompt to dataset results and summary output 2025-02-28 00:26:06 +01:00
Andreas Köpf
5b8d1b5175 Generate eval config tool (#240)
* feat: Add generate_config.py script to create eval configurations
2025-02-27 21:40:53 +01:00
Andreas Köpf
850c1cf6f4 Eval script consolidation (#238)
The script now supports:
   - YAML and JSON configurations
   - Dataset-specific parameters
   - Overriding configuration via command line
   - Detailed logging and error handling
2025-02-27 17:39:14 +01:00
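The combination of file-based configuration, dataset-specific parameters, and command-line overrides listed in the entry above might be sketched as below. Everything here is illustrative: the function names and the keys `default_params`, `datasets`, `name`, and `params` are hypothetical and are not taken from the eval script. Only JSON is parsed, to stay within the standard library; YAML input would need a third-party parser.

```python
import json
from typing import Any, Dict, Optional


def load_config(text: str, overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Parse a JSON config and apply overrides (hypothetical layout).

    Overrides whose value is None are skipped, which mirrors how unset
    command-line options are usually represented.
    """
    config = json.loads(text)
    for key, value in (overrides or {}).items():
        if value is not None:
            config[key] = value
    return config


def dataset_params(config: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Merge global defaults with the parameters of one dataset entry."""
    for entry in config.get("datasets", []):
        if entry["name"] == name:
            return {**config.get("default_params", {}), **entry.get("params", {})}
    raise KeyError(name)


raw = """
{
  "model": "model-a",
  "default_params": {"size": 10, "seed": 1},
  "datasets": [{"name": "example_dataset", "params": {"size": 5}}]
}
"""
cfg = load_config(raw, {"model": "model-b", "max_tokens": None})
```

In this sketch the dataset entry wins over the defaults, and explicit overrides win over the file, so precedence runs from most specific to least specific.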
Andreas Köpf
8a66d2a216 Merge pull request #237 from open-thought/rich/richmorevalfixes2
Fix graph color example template
2025-02-27 16:08:23 +01:00
Rich Jones
a6c90f40a1 rm typo 2025-02-27 13:44:33 +01:00
Rich Jones
1b95cd3206 fix graph color example template 2025-02-27 13:43:01 +01:00
Andreas Köpf
a56b3b6c5c Merge pull request #186 from zafstojano/feat/codeio
feat(env): CodeIO
2025-02-27 12:18:13 +01:00
Andreas Köpf
c98cc5fcd6 Merge pull request #220 from open-thought/rich/cubeinstructions
Make Rubiks Cube Output Format More Explicit
2025-02-27 12:16:09 +01:00
Andreas Köpf
7f64a1bb7c Merge pull request #236 from open-thought/rich/moreevalfixes
Trivial Fixes
2025-02-27 12:14:43 +01:00
Rich Jones
253e49aecf sm fixes 2025-02-27 11:54:04 +01:00
Rich Jones
52d6b2efd2 seed test config 2025-02-27 10:44:28 +01:00