
Tora: Torchtune-LoRA for RL

Overview | Training Dynamics | Benchmarking | Getting Started | Behind the Name

Tora is a project built on torchtune that provides LoRA-based RL methods for post-training.

Building on the torchtune library, Tora extends its functionality for RL post-training. It integrates PEFT methods such as (Q)LoRA and (Q)DoRA into RL training, providing an efficient and memory-friendly framework that lets researchers reduce the computational resources required to fine-tune large language models via RL.

📚 Key References: LoRA for RL

  • October 2025: LoRA Without Regret by Thinking Machines. link
  • May 2025: Tina: Tiny Reasoning Models via LoRA. link

Overview

The following table summarizes the key features of RL methods and LoRA-based techniques supported in Tora.

| RL Method | Type of Weight Update | Torch Model Compile | Multiple Devices with One Node |
|-----------|-----------------------|---------------------|--------------------------------|
| GRPO | Full | ✅ | ✅ |
| GRPO | (Q)LoRA | ✅ | ✅ |
| GRPO | (Q)DoRA | ✅ | ✅ |
| GRPO | (Q)DoRA w/ Cache | ❌ | ✅ |

DoRA w/ Cache: The standard DoRA layer in torchtune (link) recalculates the weight norm and magnitude scale on every forward pass. This is inefficient for GRPO's completion generation step, as these values remain static between weight updates. DoRA w/ Cache optimizes this by caching these expensive computations. It computes the values once and reuses them on subsequent forward passes, avoiding redundant calculations and significantly improving performance. However, the current caching implementation is not compatible with torch.compile.
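
The caching idea can be sketched in a few lines of plain PyTorch. This is not torchtune's DoRALinear: the class name CachedDoRALinear, the train/eval-based invalidation policy, and all attribute names below are illustrative assumptions, but the math (scaling the combined base + LoRA output by magnitude / weight norm) follows the standard DoRA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedDoRALinear(nn.Module):
    """Illustrative DoRA layer that caches the magnitude / weight-norm scale.

    A sketch of the caching idea described above, not torchtune's DoRALinear.
    """

    def __init__(self, in_dim: int, out_dim: int, rank: int = 1, alpha: float = 2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # adapter starts as a no-op
        self.scaling = alpha / rank
        # DoRA's learned magnitude, one entry per output row of the weight.
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=1))
        self._cached_scale = None                    # stale after every weight update

    def _mag_norm_scale(self) -> torch.Tensor:
        # m_i / || (W0 + (alpha / rank) * B @ A)_i ||  for each output row i
        adapted = self.weight + self.scaling * self.lora_b.weight @ self.lora_a.weight
        return self.magnitude / adapted.norm(p=2, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Policy-update passes: weights change every step, recompute with grad.
            scale = self._mag_norm_scale()
            self._cached_scale = None
        else:
            # Completion-generation passes: weights are static between updates,
            # so compute the scale once and reuse it on every forward.
            if self._cached_scale is None:
                with torch.no_grad():
                    self._cached_scale = self._mag_norm_scale()
            scale = self._cached_scale
        base = F.linear(x, self.weight)
        lora = self.scaling * self.lora_b(self.lora_a(x))
        return scale * (base + lora)
```

Tying cache invalidation to train()/eval() mode is just one possible policy; the point is that the norm of W0 + (alpha/rank)·BA is recomputed only when the adapter weights can actually have changed, rather than on every generation-time forward pass.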

LoRA vs Full-Parameter Comparison

Unless specified otherwise, our experimental settings are as follows:

  • We used Qwen2.5 base models in five sizes: 1.5B, 3B, 7B, 14B, and 32B parameters.
  • All experiments were conducted on two NVIDIA RTX A40 GPUs using the GSM8K training dataset.
  • We used a per-GPU batch size of 2 and a generation sequence length of 512.
  • For all LoRA-based methods, LoRA was applied to all layers with a rank of 1, an alpha of 2, and zero dropout (see the sketch after this list).
  • In QLoRA and QDoRA, the base model was quantized to 4 bits.
  • We enabled activation checkpointing and used Fully Sharded Data Parallelism (FSDP) across all experiments.
  • The learning rate for LoRA-based methods was set to 20x that of full-parameter GRPO training.
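
To make the rank-1, alpha-2 setting concrete, here is a minimal sketch of the effective weight update it implies (plain PyTorch with hypothetical layer dimensions, not the torchtune configuration itself): the adapter adds a rank-1 outer product scaled by alpha / rank = 2, so only in_dim + out_dim parameters per adapted matrix are trained.

```python
import torch

out_dim, in_dim = 2048, 2048               # hypothetical shape of one adapted weight
rank, alpha = 1, 2.0                       # the settings listed above
scaling = alpha / rank                     # = 2.0

w0 = torch.randn(out_dim, in_dim)          # frozen base weight
lora_a = torch.randn(rank, in_dim) * 0.01  # trainable "A" (normally small random init)
lora_b = torch.zeros(out_dim, rank)        # trainable "B", zero init so the delta starts at 0

# Effective weight used in the forward pass: W0 + (alpha / rank) * B @ A.
# With rank 1, B @ A is an outer product of two vectors.
w_eff = w0 + scaling * (lora_b @ lora_a)

trainable = lora_a.numel() + lora_b.numel()
print(f"trainable params per adapted matrix: {trainable:,} vs full: {w0.numel():,}")
# e.g. 4,096 trainable vs 4,194,304 full for a 2048 x 2048 weight
```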

Training Dynamics

We show the reward dynamics during GRPO training of Qwen2.5-3B on GSM8K with different methods. The results show that LoRA-based methods (with rank 1), even with base-model quantization, achieve performance comparable to full-parameter GRPO training.

[Figure: Reward curves during GRPO training of Qwen2.5-3B on GSM8K, comparing full-parameter training with LoRA-based methods.]

Memory and Efficiency Benchmarking

In the tables below, we benchmark the peak memory usage per GPU, the number of generated tokens per second during GRPO completion generation, and the seconds per gradient step for different GRPO methods.

Full-Parameter GRPO

| Model Size | Setting | Peak Memory/GPU | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/Grad Step (Standard) | Secs/Grad Step (Compiled) |
|------------|---------|-----------------|---------------------------------|---------------------------------|---------------------------|---------------------------|
| 1.5B | Full | ~16.5 GB | 24.4 | 39.3 | 69.2 | 77.5 |
| 3B | Full | ~19.6 GB | 17.7 | 25.6 | 63.5 | 72.5 |

(Q)LoRA-based GRPO

| Model Size | Setting | Peak Memory/GPU | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/Grad Step (Standard) | Secs/Grad Step (Compiled) |
|------------|---------|-----------------|---------------------------------|---------------------------------|---------------------------|---------------------------|
| 1.5B | LoRA | ~14.9 GB | 18.9 | 28.4 | 58.5 | 49.7 |
| 3B | LoRA | ~17.2 GB | 14.2 | 21.3 | 50.6 | 48.8 |
| 7B | LoRA | ~32.6 GB | 15.1 | 20.5 | 64.8 | 68.0 |
| 1.5B | QLoRA | ~12.3 GB | 7.9 | 16.0 | 142.7 | 71.5 |
| 3B | QLoRA | ~11.5 GB | 4.7 | 17.7 | 150.8 | 87.6 |
| 7B | QLoRA | ~19.1 GB | 2.6 | 11.1 | 410.0 | 135.3 |
| 14B | QLoRA | ~29.6 GB | 1.3 | 6.6 | 793.4 | 189.7 |
| 32B | QLoRA | ~45.5 GB | 0.6 | 3.6 | 1578.8 | 312.6 |

(Q)DoRA-based GRPO

| Model Size | Setting | Peak Memory/GPU | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/Grad Step (Standard) | Secs/Grad Step (Compiled) |
|------------|---------|-----------------|---------------------------------|---------------------------------|---------------------------|---------------------------|
| 1.5B | DoRA | ~14.9 GB | 9.1 | 16.0 | 190.0 | 117.7 |
| 3B | DoRA | ~17.2 GB | 6.1 | 10.7 | 101.3 | 118.8 |
| 7B | DoRA | ~32.5 GB | 3.5 | 5.9 | 328.1 | 233.0 |
| 1.5B | QDoRA | ~12.3 GB | 4.0 | 9.0 | 486.5 | 191.5 |
| 3B | QDoRA | ~11.5 GB | 2.2 | 6.0 | 581.0 | 219.8 |
| 7B | QDoRA | ~19.1 GB | 1.1 | 3.2 | 1515.3 | 488.3 |
| 14B | QDoRA | ~29.6 GB | 0.6 | 1.8 | 2907.8 | 911.6 |
| 32B | QDoRA | ~45.5 GB | 0.2 | 0.8 | 4409.3 | 1478.6 |

(Q)DoRA-with-Cache-based GRPO

DoRA w/ Cache significantly speeds up the generation process by caching intermediate calculations, and its throughput is comparable to that of the torch.compile-optimized DoRA variants.

| Model Size | Setting | Peak Memory/GPU | Generated Tokens/sec | Secs/Grad Step |
|------------|---------|-----------------|----------------------|----------------|
| 1.5B | DoRA w/ Cache | ~14.9 GB | 16.5 | 93.2 |
| 3B | DoRA w/ Cache | ~17.3 GB | 12.5 | 79.1 |
| 7B | DoRA w/ Cache | ~32.6 GB | 13.1 | 101.2 |
| 1.5B | QDoRA w/ Cache | ~12.3 GB | 7.2 | 147.9 |
| 3B | QDoRA w/ Cache | ~11.5 GB | 3.3 | 127.3 |
| 7B | QDoRA w/ Cache | ~19.1 GB | 2.3 | 351.4 |
| 14B | QDoRA w/ Cache | ~29.6 GB | 1.3 | 810.8 |
| 32B | QDoRA w/ Cache | ~45.5 GB | 0.6 | 1812.3 |

Getting Started

Clone the repository and install the required packages.

git clone https://github.com/shangshang-wang/Tora.git && cd Tora
pip install torch torchvision torchao
pip install -e .
pip install wandb math_verify

Download a model from the Hugging Face Hub.

MODEL_SIZE=1.5B  # 1.5B, 3B, 7B, 14B, or 32B
tune download "Qwen/Qwen2.5-${MODEL_SIZE}" \
--output-dir "/tmp/Qwen2.5-${MODEL_SIZE}" \
--hf-token <HF_TOKEN>

Below are example commands for running distributed GRPO training on 2 GPUs. You can easily switch between LoRA methods by modifying the lora_type parameter in the config file or overriding it on the command line.

Full-Parameter RL:

tune run --nproc_per_node 2 full_grpo_distributed --config qwen2_5/1.5B_full_grpo

LoRA-Based RL:

# In the config file, set lora_type to "lora", "dora", or "dora_cache"
tune run --nproc_per_node 2 lora_grpo_distributed --config qwen2_5/1.5B_lora_grpo model.lora_type="lora"

Behind the Name

The name Tora (虎) means Tiger in Japanese. It is also a blend of Torchtune and LoRA. The name is inspired by the film Crouching Tiger, Hidden Dragon, which refers to masters with hidden strengths. This symbolism captures the role of LoRA in RL post-training: by updating only a tiny fraction of a model's parameters, LoRA unleashes significant performance gains, a "crouching tiger" of potential within the base model.
