
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen
Basic Algorithm Center, PCG, Tencent
Tsinghua Shenzhen International Graduate School, Tsinghua University

🔥 News

  • [2025.10.14] Released the initial codebase.
  • [2025.10.01] Released the dataset, leaderboard, model, and paper.

🤗 Resources

| Type | Links |
| --- | --- |
| Models | ReSeek-qwen2.5-3b-em-grpo |
| Datasets | FictionalHot |
| Leaderboard | Search Agent Leaderboard |

📌 Introduction

  • We propose ReSeek, a novel reinforcement learning framework that enables search agents to dynamically identify and recover from erroneous search paths during an episode through a self-correction mechanism.
  • Through a special JUDGE action, the agent can evaluate retrieved information and re-plan its search strategy (a minimal sketch of this loop follows the list). We design a dense, instructive reward function that provides fine-grained feedback on both factual correctness and contextual utility.
  • We advocate for the Hot Benchmark evaluation principle and introduce FictionalHot as a contamination-resistant benchmark. Extensive experiments show that ReSeek significantly outperforms SOTA baselines in task success rate and path faithfulness.
  • ReSeek particularly excels in multi-hop reasoning scenarios, demonstrating robust self-correction capabilities in complex knowledge-intensive tasks.
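
The JUDGE loop can be pictured with a short sketch. Everything below is illustrative only: the names (ToyAgent, search, judge, answer) are hypothetical stand-ins for the LLM-emitted actions described in the paper, not the repo's actual API.

# Minimal sketch of the JUDGE-driven self-correction loop.
# All names are hypothetical; they stand in for LLM-emitted actions.

def search(query: str) -> list[str]:
    """Stub retriever; in ReSeek this is the launched retrieval service."""
    return [f"document retrieved for: {query}"]

class ToyAgent:
    def propose_query(self, question: str, context: list[str]) -> str:
        # Re-plan: refine the query once earlier evidence proved insufficient.
        return question if not context else f"{question} (refined)"

    def judge(self, question: str, context: list[str]) -> bool:
        # JUDGE action: decide whether the gathered evidence suffices.
        return len(context) >= 2

    def answer(self, question: str, context: list[str]) -> str:
        return f"answer to {question!r} drawn from {len(context)} documents"

def run_episode(agent: ToyAgent, question: str, max_turns: int = 4) -> str:
    context: list[str] = []
    for _ in range(max_turns):
        context += search(agent.propose_query(question, context))
        if agent.judge(question, context):  # evidence judged sufficient
            break                           # stop searching and answer
        # judged insufficient: the loop continues with a re-planned query
    return agent.answer(question, context)

print(run_episode(ToyAgent(), "Who directed Inception?"))

During training, the dense instructive reward scores both the factual correctness of the final answer and the contextual utility of retrieved passages, so judging and re-planning well is itself rewarded.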

🛠 Dependencies

Basic Installation

# Clone the repository
git clone https://github.com/TencentBAC/ReSeek.git
cd ReSeek

conda create -n ReSeek python=3.10
conda activate ReSeek

bash scripts/install_vllm_sglang_mcore.sh

# install this repo (built on verl) in editable mode
pip install --no-deps -e .

Optional Dependencies

NPU (Ascend) Support:

# follow https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html to install vllm & vllm-ascend

pip install -r requirements-npu.txt
pip install -e .

📖 Quick Start

(1) Environment Variables

Before running training scripts, set the following environment variables:

# Set project root directory
export PROJECT_ROOT=/path/to/ReSeek

# Set model directory
export MODEL_DIR=/path/to/models

# Set data directory
export DATA_DIR=/path/to/datasets

(2) Data Preparation

Download and preprocess the ReSeek training dataset:

# Preprocess dataset
python utils/preprocess_reseek_dataset.py \
  --hf_repo_id TencentBAC/ReSeek_train_test \
  --local_dir ${DATA_DIR}/processed_dataset
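
Optionally, sanity-check the output before training. The file layout below is an assumption (verl-style pipelines typically write train/test parquet splits); adjust the names to whatever the script actually emits.

# Hypothetical sanity check on the preprocessed splits; the
# train.parquet file name is an assumption about verl-style output.
import os
import pandas as pd

data_dir = os.path.join(os.environ["DATA_DIR"], "processed_dataset")
df = pd.read_parquet(os.path.join(data_dir, "train.parquet"))
print(df.shape)             # number of examples and columns
print(df.columns.tolist())  # schema of the processed records
print(df.iloc[0])           # one full example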

(3) Download Pre-trained Models

# Download base model (e.g., Qwen2.5-3B-Instruct)
huggingface-cli download --resume-download Qwen/Qwen2.5-3B-Instruct --local-dir Qwen2.5-3B-Instruct

# (Optional) Download ReSeek fine-tuned model
huggingface-cli download --resume-download TencentBAC/ReSeek-qwen2.5-3b-em-grpo --local-dir ReSeek-qwen2.5-3b-em-grpo

(4) Build Retrieval Index (optional)

Using Transformers:

cd search/retrieval
bash build_index.sh

Using vLLM:

cd search/retrieval
bash build_index_vllm.sh

(5) Launch Retrieval Service

cd search
bash retrieval_launch.sh
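
Before starting training, it helps to smoke-test the service. The host, port, route, and payload schema below follow the Search-R1-style retrieval server this repo builds on; they are assumptions, so confirm them against retrieval_launch.sh.

# Hypothetical smoke test; endpoint and schema are assumptions.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/retrieve",
    json={"queries": ["Who wrote Hamlet?"], "topk": 3},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # retrieved passages for the test query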

(6) Conduct RL Training

(Optional) On Ascend NPU, set the parameter trainer.device=npu in the training configuration.

GRPO Training:

cd scripts

# 3B model
bash train_grpo.sh

# 7B model
bash train_grpo_7b.sh

PPO Training:

cd scripts

# 3B model
bash train_ppo.sh

# 7B model
bash train_ppo_7b.sh

💡 Performance

📊 Main Results

ReSeek achieves state-of-the-art performance across eight open-domain QA benchmarks:

  • Qwen2.5-7B: Average accuracy of 0.377, surpassing ZeroSearch's 0.346
  • Multi-hop Reasoning: Excels on complex multi-hop benchmarks like HotpotQA and Bamboogle
  • FictionalHot: Scores 0.061 on the contamination-resistant stress test, while Direct Inference achieves only ~0.001

📊 Hot Benchmark

We propose the Hot Benchmark evaluation principle to address inconsistencies in experimental settings:

  • Test Sets: All 7 datasets (NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, Musique, Bamboogle)
  • Training Set: Unified training set merging NQ and HotpotQA training splits
  • Corpus: 2018 Wikipedia corpus (wiki-18) for reproducible evaluation
  • Metrics: Exact Match (EM) as the primary metric for fair comparison (a standard EM recipe is sketched after this list)
  • Retrieval: Top-k=3 with maximum T=4 tool-use turns per question
  • Embeddings: E5 embeddings for search backend
  • Models: Qwen2.5-3B/7B-Instruct as backbone models
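
For reference, the standard open-domain QA EM recipe normalizes both strings before comparing. This is the common recipe, not necessarily the repo's exact scorer, whose normalization may differ in details.

# Standard EM: lowercase, strip punctuation and articles, collapse
# whitespace, then compare. A sketch, not the project's scoring code.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # True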

📊 Self-Correction Case Study

ReSeek demonstrates robust self-correction through the JUDGE action:

  1. After the initial search, the JUDGE action correctly identifies that the retrieved information is insufficient
  2. Triggers a second, targeted search
  3. Successfully retrieves the correct answer

This dynamic correction mechanism enables ReSeek to excel in complex multi-hop reasoning scenarios.

🙏 Acknowledgements

This work builds on Search-R1 and veRL. We sincerely thank the authors of these projects for their valuable contributions to the open-source community.

📧 Contact

If you have any questions, feel free to reach out.

🚩 Citation

If you find this work helpful, please cite it as:

@article{li2025reseek,
  title={ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards},
  author={Li, Shiyu and Tang, Yang and Wang, Yifan and Li, Peiming and Chen, Xi},
  journal={arXiv preprint arXiv:2510.00568},
  year={2025}
}

📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
