Text2Grad enables language models to learn from natural language feedback by converting text into gradient signals for model optimization.
Traditional RLHF uses coarse scalar rewards, masking detailed feedback and leading to opaque learning. Text2Grad converts free-form textual feedback into span-level gradients, aligning critiques with token spans to enable precise, targeted model improvements.
Figure: The Text2Grad framework pipeline showing the flow from natural language feedback to gradient-based model optimization
Text2Grad consists of three main components:
- High-Quality Feedback Annotation Pipeline: Pairs critiques with token spans for precise feedback alignment
- Fine-Grained Reward Model: Predicts span-level rewards while generating explanatory critiques
- Span-Level Policy Optimizer: Back-propagates natural-language gradients for targeted model improvements
Key features include:
- Natural language feedback processing for model training
- Token-level reward assignment from textual spans (sketched in the example after this list)
- Support for multiple tasks:
  - Summarization
  - Code Generation
  - Question Answering
- Distributed training with DeepSpeed integration
- Flexible model architecture support
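Text2Grad's central step, aligning natural-language critique spans with the tokens they refer to, can be pictured with the minimal sketch below. Everything here is illustrative: the spans, scores, and the gpt2 tokenizer are stand-ins, not the repository's annotation or alignment code.

```python
from transformers import AutoTokenizer

# gpt2 is only a small, ungated stand-in tokenizer for this sketch.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

response = "The summary covers the main point but invents a deadline."
# Hypothetical reward-model output: critique spans paired with scores.
span_rewards = [
    ("covers the main point", 1.0),
    ("invents a deadline", -1.0),
]

enc = tokenizer(response, return_offsets_mapping=True, add_special_tokens=False)
token_rewards = [0.0] * len(enc["input_ids"])

for span, reward in span_rewards:
    start = response.find(span)
    if start == -1:
        continue  # span not found verbatim; skip in this toy example
    end = start + len(span)
    # Give the span's reward to every token overlapping its character range.
    for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
        if tok_start < end and tok_end > start:
            token_rewards[i] = reward

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), token_rewards)))
```

The resulting per-token reward vector is the fine-grained signal that replaces the single scalar reward used in standard RLHF.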
# Clone the repository
git clone https://github.com/EdWangLoDaSc/Text2Grad-Reinforcement-Learning-from-Natural-Language-Feedback
cd Text2Grad
# Install dependencies
bash env.sh

# Set your OpenAI API key as an environment variable
export OPENAI_API_KEY="your_api_key"

We utilize the following open-source datasets in our research:
KodCode
- KodCode-Light-RL-10K: Lightweight reinforcement learning dataset
- KodCode-V1-SFT-4o: Supervised fine-tuning dataset
UltraFeedback
- UltraFeedback-Binarized: Binary feedback dataset for question-answering tasks
SLF5K
- SLF5K Dataset: Specialized dataset for summarization tasks
Note: All datasets are publicly available on the HuggingFace Hub and are used in accordance with their respective licenses.
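For reference, the datasets can be pulled from the Hub with the `datasets` library. The repository IDs and split names below are assumptions inferred from the dataset names above; verify them against the dataset cards before use.

```python
from datasets import load_dataset

# Assumed Hub IDs and splits -- double-check against the dataset cards.
kodcode = load_dataset("KodCode/KodCode-Light-RL-10K", split="train")
ultrafeedback = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
slf5k = load_dataset("JeremyAlain/SLF5K", split="train")

print(kodcode[0].keys(), ultrafeedback[0].keys(), slf5k[0].keys())
```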
/Text2Grad/
├── rm_data_anno/ # Reward Model Data Annotation
│ ├── kodCode/ # Code Generation dataset processing
│ ├── slf5k/ # Summarization dataset processing
│ └── ultrafeedback/ # Question Answering dataset processing
│
├── nl_reward_model/ # Natural Language Reward Model Implementation
│ ├── kodcode/ # Code Generation reward model
│ ├── slf5k/ # Summarization reward model
│ └── ultrafeedback/ # Question Answering reward model
│
├── nl_gradiant_policy_optimization/ # Natural Language Gradient Implementation
│ ├── kodcode/ # Code Generation training
│ ├── slf5k/ # Summarization training
│ └── ultrafeedback/ # Question Answering training
The annotation pipeline converts raw textual feedback into structured dual-feedback reward data for training:
cd rm_data_anno/ultrafeedback
# For Question Answering tasks
python dual_feedback_annotation_RM.py \
--data_path path/to/data

The raw input format depends on the dataset. SLF5K example (a Reddit post paired with a model-generated summary):
{
"t3_132suw": {
"generated_summary": "My girlfriend is premed and ...",
"Post": "My girlfriend is studying something she won't ever use ..."
}
}

UltraFeedback example (a binarized preference record for question answering):
[
{
"prompt": "In this task, you are given ... \n\nA:",
"prompt_id": "333685aaf217d34921dac65dbea023fe1180b5c1973dbb6554bc022d74b54e0c",
"score_chosen": 4.0,
"score_rejected": 4.0,
"response": [
{
"content": "In this task, you are given ... \n\nA:",
"role": "user"
},
{
"content": "Could you provide ... ",
"role": "assistant"
}
],
"is_chosen": true,
"score": 4.0,
"critique": ""
}
]

KodCode example (a code-generation problem with reference solution and unit tests):
[
{
"version": "v1.0",
"style": "instruct",
"subset": "Leetcode",
"question_id": "Leetcode_36406_I",
"question": "You are given a 0-indexed ...",
"solution": "def max_height_diff_after_removal(heights):\n ...",
"test": "def test_max_height_diff_after_removal_all_equal():\n assert max_height_diff_after_removal([3, 3, 3, 3]) == 0 ...",
"gpt_pass_trial_num": 2,
"gpt_difficulty": "hard",
"gpt_pass_percentage": 0.2,
"log": "..."
}
]

To train reward models for different tasks:
# Code Generation Task
cd nl_reward_model/kodcode
bash deepspeed_train_kodcode.sh
# Summarization Task
cd nl_reward_model/slf5k
bash deepspeed_train_slf5k.sh
# Question Answering Task
cd nl_reward_model/ultrafeedback
bash deepspeed_train_ultrafeedback.sh
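The scripts above fine-tune the fine-grained reward model. As a rough illustration of the idea only (not the repository's actual architecture or training code), a token-level reward head on top of a causal LM backbone could look like the sketch below; gpt2 is used purely as a small stand-in backbone.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TokenRewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # One scalar reward per token; padding positions are zeroed out.
        rewards = self.reward_head(hidden).squeeze(-1)
        return rewards * attention_mask

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = TokenRewardModel("gpt2")

batch = tokenizer(["def add(a, b): return a + b"], return_tensors="pt", padding=True)
token_rewards = model(batch["input_ids"], batch["attention_mask"])
# Training would push these per-token scores toward the span-level annotations.
print(token_rewards.shape)  # (batch, sequence_length)
```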
Each dataset directory contains an Evaluation folder with scripts for model evaluation. Here's an example workflow for the KodCode dataset:

# Step 1: Navigate to evaluation directory
cd nl_reward_model/kodcode/evaluation
# Step 2: Merge LoRA weights with base model
python 1_merge_lora.py \
--base_model "meta-llama/Llama-3.1-8B-Instruct" \
--lora_model "../ckpt/llama31-8B-kodcode/0_4400" \
--save_dir "../ckpt/llama31-8B-kodcode/0_4400_merge" \
--merge_and_save True
# Step 3: Run inference on test dataset
python 2_infer.py \
--model_path "../ckpt/llama31-8B-kodcode/0_4400_merge" \
--dataset_path "../data/KodCode/kodcode_test.json" \
--output_file "inference_results_0_4400.json" \
--batch_size 40 \
--prompt_max_length 1000 \
--max_length 1600 \
--max_new_tokens 350 \
--gpu_ids "0,1" \
--gpu_memory "30GiB"
# Step 4: Calculate evaluation metrics
python 3_metrics.py \
--input_file "inference_results_0_4400.json" \
--output_file "word_level_evaluation_results.json"Note: Remember to adjust file paths according to your setup before running the evaluation scripts.
The evaluation process consists of three main steps:
- Merging LoRA weights with the base model (see the sketch after this list)
- Running inference on the test dataset
- Computing evaluation metrics
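The merge step can be approximated with the `peft` library as sketched below. This is a sketch of the general technique, not necessarily what 1_merge_lora.py does internally; the paths simply mirror the example above and should be adjusted to your setup.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype="auto"
)
# Attach the LoRA adapter, fold its weights into the base model, and save a
# standalone checkpoint that the inference step can load directly.
merged = PeftModel.from_pretrained(base, "../ckpt/llama31-8B-kodcode/0_4400").merge_and_unload()
merged.save_pretrained("../ckpt/llama31-8B-kodcode/0_4400_merge")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained(
    "../ckpt/llama31-8B-kodcode/0_4400_merge"
)
```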
Similar evaluation workflows can be followed for other tasks (SLF5K and UltraFeedback) by using their respective evaluation scripts.
To train a model using Text2Grad:
# For Code Generation
cd nl_gradiant_policy_optimization/kodcode
bash train_kodcode.sh
# For Question Answering
cd nl_gradiant_policy_optimization/ultrafeedback
bash train_ultrafeedback.sh
# For Summarization
cd nl_gradiant_policy_optimization/slf5k
bash train_slf5k.sh
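Before turning to evaluation, here is a minimal sketch of where the span-level signal enters the policy update: each response token's log-probability is weighted by its own reward rather than by a single sequence-level scalar. This is a REINFORCE-style toy, not the repository's actual optimizer; the model, prompt, and rewards are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in policy
policy = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Summarize: the meeting moved to Friday."
response = " The meeting was moved to Friday morning."
ids = tokenizer(prompt + response, return_tensors="pt")["input_ids"]
n_prompt = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
n_resp = ids.shape[1] - n_prompt

# Hypothetical per-token rewards from the fine-grained reward model:
# mildly positive overall, negative on the trailing span the critique flags.
token_rewards = torch.full((n_resp,), 0.5)
token_rewards[-2:] = -1.0

logits = policy(ids).logits[:, :-1, :]              # position t predicts token t+1
logprobs = F.log_softmax(logits, dim=-1)
token_logprobs = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Policy-gradient surrogate over response tokens only, weighted token by token.
resp_logprobs = token_logprobs[:, n_prompt - 1:]
loss = -(resp_logprobs * token_rewards).mean()
loss.backward()
```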
We evaluate our NL-Gradient models using task-specific metrics and benchmarks.

For summarization, we evaluate on the 500 validation samples from SLF5K using:
- Traditional Metrics (computed as sketched after this list):
- BLEU
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BERTScore
- Perplexity
- LLM-based Evaluation:
- GPT-4 as judge for qualitative assessment
- Comparative Analysis:
- Performance comparison between SFT, PPO, and Text2Grad approaches
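The traditional metrics above can be computed with the HuggingFace `evaluate` library, as sketched below with placeholder strings; BERTScore and perplexity are loaded the same way ("bertscore", "perplexity").

```python
import evaluate

predictions = ["the meeting was moved to friday"]                # model summaries
references = ["the team agreed to move the meeting to friday"]  # gold summaries

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```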
For code generation, we use the EvalPlus framework for comprehensive evaluation on the following benchmarks (a usage sketch follows this list):
- HumanEval
- HumanEval Plus
- MBPP (Mostly Basic Python Problems)
- MBPP Plus
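A rough sketch of preparing samples for EvalPlus is shown below; `generate()` is a placeholder for the trained policy, and scoring is then done with EvalPlus's own tooling (for example its `evalplus.evaluate` command).

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate(prompt: str) -> str:
    # Placeholder: query the trained policy model for a code completion here.
    return "    return 0\n"

samples = [
    {"task_id": task_id, "completion": generate(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)  # then score with EvalPlus's evaluator
```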
For question answering, evaluation is conducted using multiple benchmarks:
- MT-Bench: Multi-turn conversation benchmark
- ARC-C: AI2 Reasoning Challenge (Challenge Set)
- AlpacaEval 2.0: Automatic instruction-following evaluation with an LLM judge
Note: All evaluations are performed using standardized metrics and publicly available benchmarks to ensure reproducibility and fair comparison.