Overview | Training Dynamics | Benchmarking | Getting Started | Behind the Name
Tora is a project built on torchtune that provides LoRA-based RL methods for post-training.
Tora extends torchtune for RL post-training by integrating parameter-efficient fine-tuning (PEFT) methods such as (Q)LoRA and (Q)DoRA into RL. The result is a memory-efficient framework that lets researchers reduce the computational resources required to fine-tune large language models via RL.
Key References: LoRA for RL
- October 2025: LoRA Without Regret by Thinking Machines. link
- May 2025: Tina: Tiny Reasoning Models via LoRA. link
The following table summarizes the key features of RL methods and LoRA-based techniques supported in Tora.
| RL Method | Type of Weight Update | Torch Model Compile | Multiple Devices with One Node |
|---|---|---|---|
| GRPO | Full | ✓ | ✓ |
| GRPO | (Q)LoRA | ✓ | ✓ |
| GRPO | (Q)DoRA | ✓ | ✓ |
| GRPO | (Q)DoRA w/ Cache | ✗ | ✓ |
DoRA w/ Cache: The standard DoRA layer in torchtune (link) recalculates the weight norm and magnitude scale on every forward pass. This is inefficient for GRPO's completion generation step, as these values remain static between weight updates. DoRA w/ Cache optimizes this by caching these expensive computations. It computes the values once and reuses them on subsequent forward passes, avoiding redundant calculations and significantly improving performance. However, the current caching implementation is not compatible with torch.compile.
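The caching idea can be illustrated with a small framework-free sketch. This is not torchtune's implementation: the class name `CachedDoRALinear`, the `invalidate_cache` method, and the use of nested Python lists instead of tensors are all assumptions made for illustration. It assumes the usual DoRA formulation, where each output row of the adapted weight `W + s * B @ A` is rescaled by `magnitude / norm`, and it computes that per-row scale only once until the cache is explicitly invalidated (for example, after an optimizer step).

```python
import math


class CachedDoRALinear:
    """Toy DoRA layer on nested lists (hypothetical, for illustration only)."""

    def __init__(self, weight, lora_a, lora_b, magnitude, alpha):
        self.weight = weight        # frozen base weight, shape [out][in]
        self.lora_a = lora_a        # LoRA down-projection, shape [rank][in]
        self.lora_b = lora_b        # LoRA up-projection, shape [out][rank]
        self.magnitude = magnitude  # learnable per-row magnitude, shape [out]
        self.scaling = alpha / len(lora_a)
        self._cached_scale = None
        self.norm_computations = 0  # counts how often the norm is recomputed

    def _scale(self):
        # Recompute magnitude / ||W + s * B @ A|| only when no cached value exists.
        if self._cached_scale is None:
            self.norm_computations += 1
            rank = len(self.lora_a)
            scale = []
            for o, row in enumerate(self.weight):
                adapted = [
                    w + self.scaling * sum(self.lora_b[o][r] * self.lora_a[r][i] for r in range(rank))
                    for i, w in enumerate(row)
                ]
                norm = math.sqrt(sum(v * v for v in adapted))
                scale.append(self.magnitude[o] / norm)
            self._cached_scale = scale
        return self._cached_scale

    def invalidate_cache(self):
        # Call after any weight update so the next forward pass refreshes the scale.
        self._cached_scale = None

    def forward(self, x):
        scale = self._scale()
        rank = len(self.lora_a)
        down = [sum(a * xi for a, xi in zip(self.lora_a[r], x)) for r in range(rank)]
        out = []
        for o, row in enumerate(self.weight):
            base = sum(w * xi for w, xi in zip(row, x))
            lora = self.scaling * sum(self.lora_b[o][r] * down[r] for r in range(rank))
            out.append(scale[o] * (base + lora))
        return out


# Rank-1 adapter with a zero-initialized B matrix, so the output equals the base layer.
layer = CachedDoRALinear(
    weight=[[3.0, 4.0], [1.0, 0.0]],
    lora_a=[[1.0, 0.0]],
    lora_b=[[0.0], [0.0]],
    magnitude=[5.0, 1.0],
    alpha=2.0,
)
first = layer.forward([1.0, 1.0])
second = layer.forward([1.0, 1.0])  # reuses the cached scale
```

During generation the weights do not change, so every forward pass after the first skips the norm computation. The cost of the design is that the caller must remember to invalidate the cache after each weight update.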
Unless specified otherwise, our experimental settings are as follows:
- We used Qwen2.5 base models in five sizes: 1.5B, 3B, 7B, 14B, and 32B parameters.
- All experiments were conducted on two NVIDIA RTX A40 GPUs using the GSM8K training dataset.
- We used a per-GPU batch size of 2 and a generation sequence length of 512.
- For all LoRA-based methods, LoRA was applied to all layers with a rank of 1, an alpha of 2, and zero dropout.
- In QLoRA and QDoRA, the base model was quantized to 4 bits.
- We enabled activation checkpointing and used Fully Sharded Data Parallelism (FSDP) across all experiments.
- The learning rate for LoRA-based methods was set to 20x that of full-parameter GRPO training.
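A few quantities follow directly from the settings above. The sketch below is only a worked calculation: the `LoRASettings` class and its method names are hypothetical, the 2048 x 2048 layer shape and the `1e-6` full-parameter learning rate are placeholder values rather than values from these experiments, and the global batch size is taken as the simple product of per-GPU batch size and GPU count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRASettings:
    """Hypothetical container for the LoRA settings listed above."""

    rank: int = 1
    alpha: float = 2.0
    dropout: float = 0.0
    per_gpu_batch_size: int = 2
    num_gpus: int = 2
    lr_multiplier: float = 20.0  # LoRA learning rate relative to full-parameter GRPO

    @property
    def scaling(self) -> float:
        # Standard LoRA scaling factor applied to the low-rank update.
        return self.alpha / self.rank

    @property
    def global_batch_size(self) -> int:
        return self.per_gpu_batch_size * self.num_gpus

    def lora_lr(self, full_lr: float) -> float:
        return self.lr_multiplier * full_lr

    def adapter_params(self, d_in: int, d_out: int) -> int:
        # A rank-r adapter adds an (r x d_in) and a (d_out x r) matrix.
        return self.rank * (d_in + d_out)


settings = LoRASettings()
extra = settings.adapter_params(2048, 2048)  # placeholder layer shape
fraction = extra / (2048 * 2048)

print(f"scaling = {settings.scaling}")
print(f"global batch size = {settings.global_batch_size}")
print(f"LoRA lr for a placeholder full lr of 1e-6 = {settings.lora_lr(1e-6):.1e}")
print(f"rank-1 adapter params on a 2048x2048 layer = {extra} ({fraction:.4%} of the layer)")
```

With rank 1, the adapter for a square layer of width `d` adds only `2 * d` trainable parameters, which is why the trainable fraction is so small.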
We show the reward dynamics during GRPO training of Qwen2.5-3B with different methods on GSM8K. The results indicate that LoRA-based methods (with rank 1), even with a quantized base model, perform comparably to full-parameter GRPO training.
    
    
In the tables below, we benchmark the peak memory usage per GPU, the number of generated tokens per second during GRPO completion generation, and the seconds per gradient step for different GRPO methods.
| Model Size | Setting | Peak Memory/gpu | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/grad step (Standard) | Secs/grad step (Compiled) |
|---|---|---|---|---|---|---|
| 1.5B | Full | ~16.5 GB | 24.4 | 39.3 | 69.2 | 77.5 | 
| 3B | Full | ~19.6 GB | 17.7 | 25.6 | 63.5 | 72.5 | 
| Model Size | Setting | Peak Memory/gpu | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/grad step (Standard) | Secs/grad step (Compiled) |
|---|---|---|---|---|---|---|
| 1.5B | LoRA | ~14.9 GB | 18.9 | 28.4 | 58.5 | 49.7 | 
| 3B | LoRA | ~17.2 GB | 14.2 | 21.3 | 50.6 | 48.8 | 
| 7B | LoRA | ~32.6 GB | 15.1 | 20.5 | 64.8 | 68.0 | 
| 1.5B | QLoRA | ~12.3 GB | 7.9 | 16.0 | 142.7 | 71.5 | 
| 3B | QLoRA | ~11.5 GB | 4.7 | 17.7 | 150.8 | 87.6 | 
| 7B | QLoRA | ~19.1 GB | 2.6 | 11.1 | 410.0 | 135.3 | 
| 14B | QLoRA | ~29.6 GB | 1.3 | 6.6 | 793.4 | 189.7 | 
| 32B | QLoRA | ~45.5 GB | 0.6 | 3.6 | 1578.8 | 312.6 | 
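To make the effect of `torch.compile` on QLoRA easier to read, the seconds-per-gradient-step columns can be turned into speedup ratios. The snippet below is only arithmetic on the numbers reported in the table above; the variable and function names are illustrative.

```python
# (model size, standard secs/grad step, compiled secs/grad step) for QLoRA,
# copied from the table above.
qlora_secs_per_step = [
    ("1.5B", 142.7, 71.5),
    ("3B", 150.8, 87.6),
    ("7B", 410.0, 135.3),
    ("14B", 793.4, 189.7),
    ("32B", 1578.8, 312.6),
]


def compile_speedup(standard_secs: float, compiled_secs: float) -> float:
    """Return how many times faster a compiled gradient step is."""
    return standard_secs / compiled_secs


speedups = {
    size: compile_speedup(standard, compiled)
    for size, standard, compiled in qlora_secs_per_step
}

for size, speedup in speedups.items():
    print(f"QLoRA {size:>4}: {speedup:.2f}x faster per gradient step when compiled")
```

On these numbers the ratio grows from roughly 2x at 1.5B to roughly 5x at 32B, so compilation matters most for the largest quantized models.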
| Model Size | Setting | Peak Memory/gpu | Generated Tokens/sec (Standard) | Generated Tokens/sec (Compiled) | Secs/grad step (Standard) | Secs/grad step (Compiled) |
|---|---|---|---|---|---|---|
| 1.5B | DoRA | ~14.9 GB | 9.1 | 16.0 | 190.0 | 117.7 | 
| 3B | DoRA | ~17.2 GB | 6.1 | 10.7 | 101.3 | 118.8 | 
| 7B | DoRA | ~32.5 GB | 3.5 | 5.9 | 328.1 | 233.0 | 
| 1.5B | QDoRA | ~12.3 GB | 4.0 | 9.0 | 486.5 | 191.5 | 
| 3B | QDoRA | ~11.5 GB | 2.2 | 6.0 | 581.0 | 219.8 | 
| 7B | QDoRA | ~19.1 GB | 1.1 | 3.2 | 1515.3 | 488.3 | 
| 14B | QDoRA | ~29.6 GB | 0.6 | 1.8 | 2907.8 | 911.6 | 
| 32B | QDoRA | ~45.5 GB | 0.2 | 0.8 | 4409.3 | 1478.6 | 
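DoRA w/ Cache speeds up generation by caching intermediate calculations. In the table below, it generates more tokens per second than the uncached, uncompiled baseline in every configuration (for example, 16.5 vs. 9.1 tokens/sec for 1.5B DoRA). It matches or exceeds torch.compile throughput for DoRA and reaches roughly 55% to 80% of it for QDoRA.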
| Model Size | Setting | Peak Memory/gpu | Generated Tokens/sec | Secs/grad step | 
|---|---|---|---|---|
| 1.5B | DoRA w/ Cache | ~14.9 GB | 16.5 | 93.2 | 
| 3B | DoRA w/ Cache | ~17.3 GB | 12.5 | 79.1 | 
| 7B | DoRA w/ Cache | ~32.6 GB | 13.1 | 101.2 | 
| 1.5B | QDoRA w/ Cache | ~12.3 GB | 7.2 | 147.9 | 
| 3B | QDoRA w/ Cache | ~11.5 GB | 3.3 | 127.3 | 
| 7B | QDoRA w/ Cache | ~19.1 GB | 2.3 | 351.4 | 
| 14B | QDoRA w/ Cache | ~29.6 GB | 1.3 | 810.8 | 
| 32B | QDoRA w/ Cache | ~45.5 GB | 0.6 | 1812.3 | 
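The comparison between caching and `torch.compile` can be checked directly from the reported throughput. The snippet below is a small worked calculation using only the generated tokens/sec values from the DoRA tables above; the dictionary layout and function name are illustrative.

```python
# (standard, compiled, cached) generated tokens/sec, copied from the tables above.
tokens_per_sec = {
    ("1.5B", "DoRA"): (9.1, 16.0, 16.5),
    ("3B", "DoRA"): (6.1, 10.7, 12.5),
    ("7B", "DoRA"): (3.5, 5.9, 13.1),
    ("1.5B", "QDoRA"): (4.0, 9.0, 7.2),
    ("3B", "QDoRA"): (2.2, 6.0, 3.3),
    ("7B", "QDoRA"): (1.1, 3.2, 2.3),
    ("14B", "QDoRA"): (0.6, 1.8, 1.3),
    ("32B", "QDoRA"): (0.2, 0.8, 0.6),
}


def cache_ratios(standard: float, compiled: float, cached: float) -> tuple[float, float]:
    """Return cached throughput relative to the standard and compiled runs."""
    return cached / standard, cached / compiled


ratios = {key: cache_ratios(*values) for key, values in tokens_per_sec.items()}

for (size, setting), (vs_standard, vs_compiled) in ratios.items():
    print(
        f"{size:>4} {setting:<5} cache vs. standard: {vs_standard:.2f}x, "
        f"cache vs. compiled: {vs_compiled:.2f}x"
    )
```

A ratio above 1 in the second column means caching alone generates faster than the compiled, uncached layer for that configuration.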
Clone the repository and install the required packages.
```bash
git clone https://github.com/shangshang-wang/Tora.git && cd Tora
pip install torch torchvision torchao
pip install -e .
pip install wandb math_verify
```

Download a model from the Hugging Face Hub.
```bash
MODEL_SIZE=1.5B  # 1.5B, 3B, 7B, 14B, or 32B
tune download "Qwen/Qwen2.5-${MODEL_SIZE}" \
  --output-dir "/tmp/Qwen2.5-${MODEL_SIZE}" \
  --hf-token <HF_TOKEN>
```

Below are example commands for running distributed GRPO training on 2 GPUs.
You can easily switch between LoRA methods by modifying the lora_type parameter in the config file or overriding it on the command line.
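When sweeping over LoRA variants, the command-line override can be generated programmatically. The helper below is a hypothetical convenience, not part of Tora: the function name `build_lora_grpo_command` is made up for illustration, and the config name pattern `qwen2_5/<size>_lora_grpo` is an assumption generalized from the 1.5B config used in this README. It only builds the argument list and does not launch training.

```python
import shlex

# The three lora_type values named in this README.
VALID_LORA_TYPES = ("lora", "dora", "dora_cache")


def build_lora_grpo_command(lora_type: str, model_size: str = "1.5B", nproc_per_node: int = 2) -> list[str]:
    """Build the `tune run` argument list for one LoRA variant (hypothetical helper)."""
    if lora_type not in VALID_LORA_TYPES:
        raise ValueError(f"lora_type must be one of {VALID_LORA_TYPES}, got {lora_type!r}")
    return [
        "tune", "run",
        "--nproc_per_node", str(nproc_per_node),
        "lora_grpo_distributed",
        # Assumed naming pattern; only the 1.5B config name appears in this README.
        "--config", f"qwen2_5/{model_size}_lora_grpo",
        f"model.lora_type={lora_type}",
    ]


for lora_type in VALID_LORA_TYPES:
    print(shlex.join(build_lora_grpo_command(lora_type)))
```

The resulting list can be passed to `subprocess.run` once the config for the chosen model size is confirmed to exist.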
Full-Parameter RL:

```bash
tune run --nproc_per_node 2 full_grpo_distributed --config qwen2_5/1.5B_full_grpo
```

LoRA-Based RL:
```bash
# In the config file, set lora_type to "lora", "dora", or "dora_cache"
tune run --nproc_per_node 2 lora_grpo_distributed --config qwen2_5/1.5B_lora_grpo model.lora_type="lora"
```

The name Tora (虎) means Tiger in Japanese. It's also a blend of TorchTune and LoRA. The name is inspired by the film Crouching Tiger, Hidden Dragon, which refers to masters with hidden strengths. This symbolism captures the role of LoRA in RL post-training: by updating only a tiny fraction of a model's parameters, LoRA unleashes significant performance gains, a "crouching tiger" of potential within the base model.