Hi Unakar,
I successfully trained the model with bsz=128, PPO mini bsz=32, rollout n=16, lr=4e-6. The convergence curve is as follows:
However, when I evaluate the checkpoints' AIME / AMC accuracy, I see no significant improvement in math for any checkpoint, and in some cases performance even degrades:
Would you please share the training script you used for the section "RQ 4: Can the Model Generalize to Out-of-Distribution (OOD) Tasks?"
Many thanks!