- 21/03/2025: We incorporate Dr. GRPO, which fixes the optimization bias in GRPO.
- 26/01/2025: We support reinforcement learning with verifiable rewards (RLVR) for math reasoning.
- 20/10/2024: We open source Oat, an online LLM alignment framework developed during a research project on online LLM exploration (sample-efficient alignment).
Oat 🌾 is a simple yet efficient framework for running online LLM alignment algorithms. Its key features include:
- High Efficiency: Oat implements a distributed Actor-Learner-Oracle architecture, with each component being optimized using state-of-the-art tools:
- Simplified Workflow: Oat simplifies the experimental pipeline of LLM alignment. With an Oracle served online, we can flexibly query it for preference data labeling and for model evaluation at any time. All you need to do is launch experiments and monitor real-time learning curves (e.g., win rate) on wandb (see reproduced results); there is no need for manual training, checkpointing, and loading for evaluation.
- Oracle Simulation: Oat provides a diverse set of oracles to simulate preference/reward/verification feedback.
- Verifiable rewards are supported via rule-based functions.
- Lightweight reward models run within the actor's process, enabling quick testing on as few as two GPUs.
- Larger and more capable reward models can be served remotely, harnessing additional compute and memory resources.
- LLM-as-a-judge is supported by querying the OpenAI API for model-based pairwise ranking.
 
- Ease of Use: Oat's modular structure allows researchers to easily inherit and modify existing classes, enabling rapid prototyping and experimentation with new algorithms.
- Cutting-Edge Algorithms: Oat implements state-of-the-art online algorithms, fostering innovation and fair benchmarking.
- PPO/Dr. GRPO (online RL) for math reasoning.
- Online DPO/SimPO/IPO for online preference learning.
- Online exploration (active alignment) algorithms, including SEA, APL and XPO.
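To make the online preference learning loop above more concrete, here is a minimal sketch of one step: an oracle labels a pair of responses, and the learner scores the labeled pair with the standard DPO loss. This is an illustration only and does not use oat's actual classes or API. The names `dpo_loss`, `label_pair`, and `toy_length_oracle` are hypothetical, and the toy oracle (which simply prefers the shorter response) stands in for a real reward model or LLM judge.

```python
import math
from typing import Callable, Tuple


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for a single preference pair.

    Inputs are sequence-level log-probabilities under the policy and
    under a frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def label_pair(
    oracle: Callable[[str, str, str], bool],
    prompt: str,
    resp_a: str,
    resp_b: str,
) -> Tuple[str, str]:
    """Query the oracle and return the pair ordered as (chosen, rejected)."""
    if oracle(prompt, resp_a, resp_b):
        return resp_a, resp_b
    return resp_b, resp_a


def toy_length_oracle(prompt: str, resp_a: str, resp_b: str) -> bool:
    """Hypothetical oracle: returns True if resp_a is preferred (shorter wins)."""
    return len(resp_a) <= len(resp_b)


if __name__ == "__main__":
    chosen, rejected = label_pair(
        toy_length_oracle, "What is 2 + 2?", "4", "The answer is probably 4."
    )
    # Made-up log-probabilities, for illustration only.
    print(chosen, "|", rejected, "|", dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=1.0))
```

In an online setting, the responses would be sampled from the current policy at each step, so the preference data tracks the model as it improves rather than coming from a fixed offline dataset.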
 
In a Python environment with a supported version (we recommend 3.10), you can install oat via PyPI:
pip install vllm==0.8.4 && pip install -U oat-llm
Alternatively, you can install it in "editable" mode for local development:
git clone git@github.com:sail-sg/oat.git
cd oat
pip install vllm==0.8.4 && pip install -e .
Please refer to this file for a self-contained example showing how to implement Dr. GRPO for R1-Zero-like training with oat 🌾.
We also provide a guide on online preference learning with active exploration.
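As a rough illustration of the ideas behind the Dr. GRPO example referenced above, the sketch below scores a group of sampled responses with a rule-based verifier and then computes group-relative advantages in two ways. It assumes the commonly described difference between the two methods: GRPO divides the mean-centered rewards by the group standard deviation, while Dr. GRPO only subtracts the group mean (it also drops the per-response length normalization in the loss, which is not shown here). All function names are hypothetical and this is not oat's implementation. The `Answer:` marker convention and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List


def exact_match_reward(response: str, reference: str) -> float:
    """Hypothetical rule-based verifier.

    Takes the text after the last 'Answer:' marker (or the whole response
    if the marker is absent) and compares it with the reference answer.
    """
    answer = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: center by the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Dr. GRPO-style advantages: center by the group mean only."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


if __name__ == "__main__":
    reference = "42"
    group = [
        "6 * 7 = 42. Answer: 42",
        "6 * 7 = 48. Answer: 48",
        "I am not sure. Answer: 40",
        "Answer: 42",
    ]
    rewards = [exact_match_reward(r, reference) for r in group]
    print("rewards:       ", rewards)
    print("GRPO adv.:     ", grpo_advantages(rewards))
    print("Dr. GRPO adv.: ", dr_grpo_advantages(rewards))
```

Because the standard deviation is small when a question is very easy or very hard for the group, dividing by it gives such questions larger advantages; skipping that division keeps each advantage on the same scale as the raw reward.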
If you find this codebase useful for your research, please consider citing:
- LLM online alignment framework:
@misc{liu2024oat,
  title={OAT: A research-friendly framework for LLM online alignment},
  author={Liu, Zichen and Chen, Changyu and Wan, Xinyi and Du, Chao and Lee, Wee Sun and Lin, Min},
  year={2024},
  howpublished={\url{https://github.com/sail-sg/oat}},
}
- Online exploration method:
@article{liu2024sea,
  title={Sample-Efficient Alignment for LLMs},
  author={Liu, Zichen and Chen, Changyu and Du, Chao and Lee, Wee Sun and Lin, Min},
  journal={arXiv preprint arXiv:2411.01493},
  year={2024}
}
oat is distributed under the terms of the Apache 2.0 license.
We thank the following awesome projects that have contributed to the development of oat:
This is not an official Sea Limited or Garena Online Private Limited product.