A concise library for post-training large language models.
This is the right library for you if you want to learn reinforcement learning for large language models or quickly test your own algorithm. We deliver a clear implementation without complicated abstractions.
Despite its simplicity, you should be able to scale up to moderate-sized language models, e.g., 72B, with
- Training engine partition via Fully Sharded Data Parallelism and Tensor Parallelism
- Sequence partition via ZigZag Context Parallelism
- Inference engine and KV cache partition via Tensor Parallelism
We also support
- Balanced sequence packing for higher throughput
- Multi-turn rollout with SGLang async inference engine
- GEM (OpenAI Gym-like) agentic environments
RL2 is a production-ready library! It achieves performance comparable to other popular LLM RL libraries.
Also check our wandb report on OpenThoughts, SkyworkRM, UltraFeedback, TinyZero, LetterCounting, and SearchR1.
- Support Megatron backend to increase GPU utilization for Mixture-of-Experts models
- Support Low-Rank Adaptation to decrease GPU memory consumption
- Initialize model on meta device to decrease RAM consumption
- Support partial rollout to decrease GPU idle time
- Use SGLang Router to forward requests for load balancing across inference engines
- Integrate GEM to scale environments
pip install rl-square
Data Preparation [Examples]
Hugging Face datasets and various file types, i.e., JSON, JSONL, CSV, Parquet, and Arrow, are accepted. All trainers support both the raw-text and the messages format. The former is more flexible but may be model-specific.
[
    {
        "prompt": "The capital of China is",
        "response": "Beijing."
    }
]
[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"},
            {"role": "assistant", "content": "Beijing."}
        ]
    }
]
Multi-turn data is only supported by the latter format. For RM and DPO, each sample provides a chosen and a rejected response:
[
    {
        "prompt": "The capital of China is",
        "chosen": "Beijing.",
        "rejected": "Shanghai."
    }
]
[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"}
        ],
        "chosen": "Beijing.",
        "rejected": "Shanghai."
    }
]
Any additional fields required by the environment, e.g., the answer, are passed via extra_info:
[
    {
        "prompt": "The capital of China is",
        "extra_info": {
            "answer": "Beijing"
        }
    }
]
[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"}
        ],
        "extra_info": {
            "answer": "Beijing"
        }
    }
]
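For illustration, the snippet below sketches one way such data could be written to a JSONL file with the Python standard library; the file name and sample content are arbitrary assumptions, not RL2 requirements.

import json

# Hypothetical samples in the messages format shown above.
samples = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"}
        ],
        "extra_info": {"answer": "Beijing"}
    }
]

# JSONL: one JSON object per line. The file name is an arbitrary choice.
with open("data.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")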
Environments [Examples]
In PPO, the language model interacts with the environment through a user-defined function step in the following format.
from typing import Dict

# parse_action_type, parse_query, search_result, parse_pred, and is_equivalent
# are user-defined helpers and are omitted here.
async def step(
    state: str, action: str, extra_info: Dict
) -> Dict:
    action_type = parse_action_type(action)
    env_response = {
        "next_state": None,
        "reward": 0.0,
        "score": 0.0,
        "done": False,
        "extra_info": extra_info
    }
    if action_type == "search":
        # Append the retrieved passage so the next turn continues the sequence.
        query = parse_query(action)
        passage = await search_result(query)
        env_response["next_state"] = state + action + passage
    elif action_type == "answer":
        # Terminal turn: reward the prediction by checking it against the answer.
        pred = parse_pred(action)
        reward = float(is_equivalent(pred, extra_info["answer"]))
        env_response["reward"] = reward
        env_response["score"] = reward
        env_response["done"] = True
    return env_response
state and action are the input and output of the language model in the last turn, and next_state is the input of the language model in the next turn. When state + action is a prefix of next_state, the two turns will be processed in a single sequence. reward is used to compute advantages (and subsequently update the model), while score is used to log the model performance. Different values may be used when needed. done indicates whether to proceed to the next turn. extra_info contains everything not aforementioned, e.g., the answer.
The function should be included in a Python script whose path is specified by actor.rollout.env_path.
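For reference, here is a minimal single-turn sketch following the same interface; the "Answer:" convention and the exact-match check are illustrative assumptions, not part of RL2.

from typing import Dict

async def step(
    state: str, action: str, extra_info: Dict
) -> Dict:
    # Single-turn sketch: there is no next state, so the episode ends here.
    # Assume the model ends its output with "Answer: ..." (an illustrative convention).
    pred = action.split("Answer:")[-1].strip()
    reward = float(pred == extra_info.get("answer"))
    return {
        "next_state": None,
        "reward": reward,
        "score": reward,
        "done": True,
        "extra_info": extra_info
    }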
Launch [Examples]
Use torchrun to launch the trainer. For example, for a single node:
torchrun \
    --nproc_per_node=<number of GPUs> \
    -m RL2.trainer.ppo \
    <args>
For multiple nodes:
torchrun \
    --nnodes=<number of nodes> \
    --node_rank=<rank of node> \
    --nproc_per_node=<number of GPUs on a node> \
    --master_addr=<address of master node> \
    --master_port=<port of master node> \
    -m RL2.trainer.ppo \
    <args>
By default, i.e., ddp_size=1, tp_size=1, your model will be partitioned via ZeRO stage 3. ddp_size specifies the number of model parameter copies. A larger ddp_size leads to higher memory consumption and lower communication cost. For large models, you may specify tp_size > 1 to enable tensor parallelism. The product of ddp_size and tp_size should be a factor of the total number of GPUs.
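As a rough sanity check, the relationship between these sizes can be sketched as below; check_parallelism is a hypothetical helper for illustration, not part of the RL2 API.

def check_parallelism(num_gpus: int, ddp_size: int = 1, tp_size: int = 1) -> int:
    # ddp_size * tp_size must be a factor of the total number of GPUs.
    assert num_gpus % (ddp_size * tp_size) == 0, \
        "ddp_size * tp_size must divide the total number of GPUs"
    # The remaining dimension presumably shards each parameter copy
    # (ZeRO stage 3 across all GPUs when ddp_size=1 and tp_size=1).
    return num_gpus // (ddp_size * tp_size)

# e.g., 8 GPUs with ddp_size=2 and tp_size=2 leave 2 GPUs to shard each copy.
print(check_parallelism(8, ddp_size=2, tp_size=2))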
For SFT, RM, and DPO, max_length is used to truncate sequences. In RM and DPO, the chosen and rejected sequences will be packed together, so the actual sequence length can be up to twice max_length. For PPO, max_new_tokens is used to terminate generations. The length of any sequence cannot exceed sp_size * tp_size * max_length_per_device.
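The length rules above amount to simple arithmetic, sketched below with illustrative helper names that are not part of RL2.

def max_packed_length(max_length: int) -> int:
    # RM and DPO pack the chosen and rejected sequences together,
    # so a packed sequence can reach twice max_length.
    return 2 * max_length

def max_ppo_sequence_length(sp_size: int, tp_size: int, max_length_per_device: int) -> int:
    # Upper bound on any single sequence in PPO.
    return sp_size * tp_size * max_length_per_device

print(max_packed_length(4096))              # 8192
print(max_ppo_sequence_length(2, 2, 4096))  # 16384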
The default algorithm is Dr. GRPO, where the loss is averaged at the token level and the advantage is not divided by the standard deviation.
- To use OpenAI PPO, set kl.type=reward, kl.reward_estimator=k1, and adv.estimator=gae
- To use DeepSeek GRPO, set actor.avg_level=sequence, kl.type=loss, kl.loss_estimator=k3, and adv.norm_var=true
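To make the difference concrete, here is a minimal sketch (not RL2 code) of group-relative advantages with and without standard-deviation normalization:

from typing import List

def group_advantages(rewards: List[float], norm_var: bool = False) -> List[float]:
    # Subtract the group mean reward (shared by Dr. GRPO and GRPO).
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]
    if norm_var:
        # DeepSeek GRPO additionally divides by the group standard deviation.
        std = (sum(a * a for a in advantages) / len(advantages)) ** 0.5
        advantages = [a / (std + 1e-6) for a in advantages]
    # Dr. GRPO (the default) keeps the mean-centered advantages as-is.
    return advantages

print(group_advantages([1.0, 0.0, 0.0, 1.0]))                 # Dr. GRPO
print(group_advantages([1.0, 0.0, 0.0, 1.0], norm_var=True))  # GRPO-style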
This project is built upon many remarkable projects, including but not limited to
- DeepSpeedChat for the proposal of the hybrid engine
- RingFlashAttention for the support of ZigZag context parallelism
- SGLang for the support of the async inference engine
We also thank OpenRLHF and veRL for their pioneering work.
If you find this library useful, please cite it in the following format:
@misc{Tan2025RL2,
    author={Chenmien Tan and Simon Yu and Lanbo Lin and Ze Zhang and Yuanwu Xu and Chenhao Jiang and Tianyuan Yang and Sicong Xie and Guannan Zhang},
    title={RL2: Ray Less Reinforcement Learning},
    note={GitHub repository},
    howpublished={\url{https://github.com/ChenmienTan/RL2}},
    year={2025}
}