Overview | Training Dynamics | Benchmarking | Getting Started | Behind the Name
Tora is a project built on torchtune that provides LoRA-based RL methods for post-training.
Building on the torchtune library, Tora extends its functionality for RL post-training. It integrates PEFT methods like (Q)LoRA and (Q)DoRA into RL, providing an efficient and memory-friendly framework. Tora enables researchers to reduce the computational resources required for fine-tuning large language models via RL.
Key References: LoRA for RL
- October 2025: LoRA Without Regret by Thinking Machines. link
- May 2025: Tina: Tiny Reasoning Models via LoRA. link
The following table summarizes the key features of RL methods and LoRA-based techniques supported in Tora.
| RL Method | Type of Weight Update | Torch Model Compile | Multiple Devices with One Node |
|---|---|---|---|
| GRPO | Full | ✅ | ✅ |
| GRPO | (Q)LoRA | ✅ | ✅ |
| GRPO | (Q)DoRA | ✅ | ✅ |
| GRPO | (Q)DoRA w/ Cache | ❌ | ✅ |
DoRA w/ Cache: The standard DoRA layer in torchtune (link) recalculates the weight norm and magnitude scale on every forward pass. This is inefficient for GRPO's completion generation step, as these values remain static between weight updates. DoRA w/ Cache optimizes this by caching these expensive computations. It computes the values once and reuses them on subsequent forward passes, avoiding redundant calculations and significantly improving performance. However, the current caching implementation is not compatible with torch.compile.
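The sketch below illustrates the idea with a minimal DoRA-style linear layer that recomputes the norm during training but caches the resulting scale across generation forward passes. It is illustrative only, not torchtune's or Tora's `DoRALinear`; the class name, the `invalidate_cache` method, and the caching policy are assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoRALinearWithCache(nn.Module):
    """DoRA-style linear layer that caches the magnitude / weight-norm scale
    between weight updates. Illustrative sketch; not torchtune's DoRALinear."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 1, alpha: float = 2.0):
        super().__init__()
        # Frozen base weight.
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight)
        # Low-rank adapters; B starts at zero so the adapter is initially a no-op.
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)
        self.scaling = alpha / rank
        # Trainable DoRA magnitude, initialized to the column-wise norm of W.
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=1))
        self._cached_scale = None  # reused across generation forward passes

    def invalidate_cache(self) -> None:
        """Call after each optimizer step so the next generation pass recomputes the norm."""
        self._cached_scale = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training or self._cached_scale is None:
            # Merged weight W' = W + (alpha / r) * B @ A and its column-wise norm;
            # this is the expensive part that stays constant between weight updates.
            merged = self.weight + self.scaling * (self.lora_b.weight @ self.lora_a.weight)
            weight_norm = merged.norm(p=2, dim=1).detach()
            scale = self.magnitude / weight_norm  # shape: (out_dim,)
            if not self.training:
                self._cached_scale = scale.detach()  # cache for later generation steps
        else:
            scale = self._cached_scale
        base_out = F.linear(x, self.weight)
        lora_out = self.scaling * self.lora_b(self.lora_a(x))
        # DoRA: rescale the combined output by magnitude / ||W'||.
        return (base_out + lora_out) * scale
```

A trainer using such a layer would call `invalidate_cache()` on every adapted layer after each gradient step, so the next round of completion generation computes the norm once and then reuses it. As noted above, this style of Python-level caching does not currently compose with torch.compile.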
Unless specified otherwise, our experimental settings are as follows:
- We used Qwen2.5 base models in five sizes: 1.5B, 3B, 7B, 14B, and 32B parameters.
- All experiments were conducted on two NVIDIA A40 GPUs using the GSM8K training dataset.
- We used a per-GPU batch size of 2 and a generation sequence length of 512.
- For all LoRA-based methods, LoRA was applied to all layers with a rank of 1, an alpha of 2, and zero dropout (see the sketch after this list).
- In QLoRA and QDoRA, the base model was quantized to 4-bits.
- We enabled activation checkpointing and used Fully Sharded Data Parallelism (FSDP) across all experiments.
- The learning rate for LoRA-based methods was set to 20x that of full-parameter GRPO training.
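As a concrete illustration of the rank-1 LoRA setting above, the sketch below wraps every nn.Linear in a toy model with a frozen base layer plus a rank-1 adapter (alpha = 2, no dropout) and reports the trainable-parameter fraction. This is a generic PyTorch sketch rather than Tora's or torchtune's implementation; `LoRALinear` and `apply_lora_everywhere` are hypothetical names.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a rank-r LoRA adapter (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 1, alpha: float = 2.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scaling = alpha / rank  # zero dropout, matching the setting above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


def apply_lora_everywhere(module: nn.Module, rank: int = 1, alpha: float = 2.0) -> nn.Module:
    """Recursively replace every nn.Linear with a LoRA-wrapped version."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank, alpha))
        else:
            apply_lora_everywhere(child, rank, alpha)
    return module


if __name__ == "__main__":
    toy = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    apply_lora_everywhere(toy, rank=1, alpha=2.0)
    trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
    total = sum(p.numel() for p in toy.parameters())
    print(f"trainable parameters: {trainable} / {total} ({100 * trainable / total:.3f}%)")
```

Even on this toy two-layer MLP, the rank-1 adapters account for roughly 0.1% of all parameters, which keeps the trainable state and optimizer memory small relative to full-parameter GRPO training.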
We show the reward dynamics during GRPO training of Qwen2.5-3B with different methods on GSM8K. The results show that LoRA-based methods (with rank 1), even with base-model quantization, achieve performance comparable to full-parameter GRPO training.
In the tables below, we benchmark the peak memory usage per GPU, the number of generated tokens per second during GRPO completion generation, and the seconds per gradient step for the different GRPO methods, both without (Standard) and with (Compiled) torch.compile.
| Model Size | Setting | Peak Memory / GPU | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/grad step (Standard) | Secs/grad step (Compiled) |
|---|---|---|---|---|---|---|
| 1.5B | Full | ~16.5 GB | 24.4 | 39.3 | 69.2 | 77.5 |
| 3B | Full | ~19.6 GB | 17.7 | 25.6 | 63.5 | 72.5 |
| Model Size | Setting | Peak Memory / GPU | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/grad step (Standard) | Secs/grad step (Compiled) |
|---|---|---|---|---|---|---|
| 1.5B | LoRA | ~14.9 GB | 18.9 | 28.4 | 58.5 | 49.7 |
| 3B | LoRA | ~17.2 GB | 14.2 | 21.3 | 50.6 | 48.8 |
| 7B | LoRA | ~32.6 GB | 15.1 | 20.5 | 64.8 | 68.0 |
| 1.5B | QLoRA | ~12.3 GB | 7.9 | 16.0 | 142.7 | 71.5 |
| 3B | QLoRA | ~11.5 GB | 4.7 | 17.7 | 150.8 | 87.6 |
| 7B | QLoRA | ~19.1 GB | 2.6 | 11.1 | 410.0 | 135.3 |
| 14B | QLoRA | ~29.6 GB | 1.3 | 6.6 | 793.4 | 189.7 |
| 32B | QLoRA | ~45.5 GB | 0.6 | 3.6 | 1578.8 | 312.6 |
| Model Size | Setting | Peak Memory / GPU | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/grad step (Standard) | Secs/grad step (Compiled) |
|---|---|---|---|---|---|---|
| 1.5B | DoRA | ~14.9 GB | 9.1 | 16.0 | 190.0 | 117.7 |
| 3B | DoRA | ~17.2 GB | 6.1 | 10.7 | 101.3 | 118.8 |
| 7B | DoRA | ~32.5 GB | 3.5 | 5.9 | 328.1 | 233.0 |
| 1.5B | QDoRA | ~12.3 GB | 4.0 | 9.0 | 486.5 | 191.5 |
| 3B | QDoRA | ~11.5 GB | 2.2 | 6.0 | 581.0 | 219.8 |
| 7B | QDoRA | ~19.1 GB | 1.1 | 3.2 | 1515.3 | 488.3 |
| 14B | QDoRA | ~29.6 GB | 0.6 | 1.8 | 2907.8 | 911.6 |
| 32B | QDoRA | ~45.5 GB | 0.2 | 0.8 | 4409.3 | 1478.6 |
DoRA w/ Cache significantly speeds up generation by caching intermediate calculations, achieving throughput comparable to the torch.compile-optimized DoRA variants.
| Model Size | Setting | Peak Memory / GPU | Generated Tokens/sec | Secs/grad step |
|---|---|---|---|---|
| 1.5B | DoRA w/ Cache | ~14.9 GB | 16.5 | 93.2 |
| 3B | DoRA w/ Cache | ~17.3 GB | 12.5 | 79.1 |
| 7B | DoRA w/ Cache | ~32.6 GB | 13.1 | 101.2 |
| 1.5B | QDoRA w/ Cache | ~12.3 GB | 7.2 | 147.9 |
| 3B | QDoRA w/ Cache | ~11.5 GB | 3.3 | 127.3 |
| 7B | QDoRA w/ Cache | ~19.1 GB | 2.3 | 351.4 |
| 14B | QDoRA w/ Cache | ~29.6 GB | 1.3 | 810.8 |
| 32B | QDoRA w/ Cache | ~45.5 GB | 0.6 | 1812.3 |
Clone the repository and install the required packages.
```bash
git clone https://github.com/shangshang-wang/Tora.git && cd Tora
pip install torch torchvision torchao
pip install -e .
pip install wandb math_verify
```

Download a model from the Hugging Face Hub.
```bash
MODEL_SIZE=1.5B # 1.5B, 3B, 7B, 14B, or 32B
tune download "Qwen/Qwen2.5-${MODEL_SIZE}" \
    --output-dir "/tmp/Qwen2.5-${MODEL_SIZE}" \
    --hf-token <HF_TOKEN>
```

Below are example commands for running distributed GRPO training on 2 GPUs.
You can easily switch between LoRA methods by modifying the lora_type parameter in the config file or overriding it on the command line.
Full-Parameter RL:
```bash
tune run --nproc_per_node 2 full_grpo_distributed --config qwen2_5/1.5B_full_grpo
```

LoRA-Based RL:
```bash
# In the config file, set lora_type to "lora", "dora", or "dora_cache"
tune run --nproc_per_node 2 lora_grpo_distributed --config qwen2_5/1.5B_lora_grpo model.lora_type="lora"
```

The name Tora (虎) means tiger in Japanese; it is also a blend of TorchTune and LoRA. The name is inspired by the film Crouching Tiger, Hidden Dragon, which refers to masters with hidden strengths. This symbolism captures the role of LoRA in RL post-training: by updating only a tiny fraction of a model's parameters, LoRA unleashes significant performance gains, a "crouching tiger" of potential within the base model.