Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen
Basic Algorithm Center, PCG, Tencent
Tsinghua Shenzhen International Graduate School, Tsinghua University
- [2025.10.14] Released the initial codebase.
- [2025.10.1] Released the dataset, leaderboard, model and paper.
| Type | Links |
|---|---|
| Models | • ReSeek-qwen2.5-3b-em-grpo |
| Datasets | • FictionalHot |
| Leaderboard | • Search Agent Leaderboard |
- We propose ReSeek, a novel reinforcement learning framework that enables search agents to dynamically identify and recover from erroneous search paths during an episode through a self-correction mechanism.
- Through a special JUDGE action, the agent evaluates retrieved information and re-plans its search strategy. We design a dense, instructive reward function that provides fine-grained feedback on both factual correctness and contextual utility (see the sketch after this list).
- We advocate for the Hot Benchmark evaluation principle and introduce FictionalHot as a contamination-resistant benchmark. Extensive experiments show that ReSeek significantly outperforms SOTA baselines in task success rate and path faithfulness.
- ReSeek particularly excels in multi-hop reasoning scenarios, demonstrating robust self-correction capabilities in complex knowledge-intensive tasks.
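To picture the reward design, here is a minimal sketch of a dense, instructive reward: per-step feedback at JUDGE actions plus a terminal correctness signal. The containment-based utility proxy, the 0.5 weight, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
# Schematic of a dense, instructive reward in the spirit of ReSeek.
# The 0.5 weight and the containment-based utility proxy are illustrative
# assumptions, not the paper's exact formulation.

def utility(docs: list[str], gold: str) -> float:
    """Contextual utility proxy: does any retrieved doc contain the gold answer?"""
    return float(any(gold.lower() in d.lower() for d in docs))

def step_reward(action: str, docs: list[str], gold: str, answer: str | None = None) -> float:
    """Dense reward: mid-episode feedback at JUDGE steps, correctness at the end."""
    if action == "judge":
        return 0.5 * utility(docs, gold)  # fine-grained, per-step signal
    if action == "answer" and answer is not None:
        return float(answer.strip().lower() == gold.strip().lower())  # terminal reward
    return 0.0

# Example: a JUDGE step over retrieved passages
print(step_reward("judge", ["Paris is the capital of France."], "Paris"))  # 0.5
```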
Installation:

```bash
# Clone the repository
git clone https://github.com/TencentBAC/ReSeek.git
cd ReSeek

# Create and activate the conda environment
conda create -n ReSeek python=3.10
conda activate ReSeek

# Install vLLM/SGLang/Megatron dependencies
bash scripts/install_vllm_sglang_mcore.sh
```
```bash
# Install verl
pip install --no-deps -e .
```

NPU (Ascend) Support:
```bash
# Follow https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html
# to install vllm & vllm-ascend
pip install -r requirements-npu.txt
pip install -e .
```

Before running training scripts, set the following environment variables:
```bash
# Set project root directory
export PROJECT_ROOT=/path/to/ReSeek

# Set model directory
export MODEL_DIR=/path/to/models

# Set data directory
export DATA_DIR=/path/to/datasets
```

Download the ReSeek training dataset:
```bash
# Preprocess dataset
python utils/preprocess_reseek_dataset.py \
    --hf_repo_id TencentBAC/ReSeek_train_test \
    --local_dir ${DATA_DIR}/processed_dataset
```
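To sanity-check the preprocessing output, you can peek at the written splits. A minimal sketch, assuming the script writes verl-style parquet files under the output directory (the `train.parquet` file name is an assumption; check the script's actual output):

```python
# Sanity-check sketch: inspect the processed dataset. The parquet file name
# below is an assumption based on verl conventions, not guaranteed by the
# preprocessing script.
import os
import pandas as pd

data_dir = os.path.expandvars("${DATA_DIR}/processed_dataset")
df = pd.read_parquet(os.path.join(data_dir, "train.parquet"))
print(df.columns.tolist())   # inspect the schema
print(df.head(2))            # peek at a couple of examples
```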
```bash
# Download base model (e.g., Qwen2.5-3B-Instruct)
huggingface-cli download --resume-download Qwen/Qwen2.5-3B-Instruct --local-dir Qwen2.5-3B-Instruct

# (Optional) Download ReSeek fine-tuned model
huggingface-cli download --resume-download TencentBAC/ReSeek-qwen2.5-3b-em-grpo --local-dir ReSeek-qwen2.5-3b-em-grpo
```

Build the retrieval index. Using Transformers:
```bash
cd search/retrieval
bash build_index.sh
```
Using vLLM:

```bash
cd search/retrieval
bash build_index_vllm.sh
```
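Either script conceptually encodes the corpus with E5 embeddings (the search backend noted in the evaluation settings below) and stores the vectors in a dense index. A minimal sketch using sentence-transformers and FAISS; the model name, file names, and batch size are illustrative assumptions, not the scripts' actual configuration:

```python
# Conceptual sketch of dense index building: encode the corpus with E5 and
# store normalized vectors in a FAISS inner-product index. Model name, file
# names, and batch size are illustrative, not the scripts' configuration.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")
passages = ["passage: " + line.strip() for line in open("wiki18_corpus.txt")]  # E5 expects a "passage: " prefix
embeddings = model.encode(passages, batch_size=256, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(embeddings)
faiss.write_index(index, "e5_wiki18.index")

# Query side: top-k=3 retrieval, with E5's "query: " prefix
scores, ids = index.search(model.encode(["query: who wrote hamlet"], normalize_embeddings=True), 3)
```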
Launch the retrieval server:

```bash
cd search
bash retrieval_launch.sh
```
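Once the server is up, the trainer queries it over HTTP. If the service follows the convention of Search-R1 (which this repo builds on), a request looks roughly like the sketch below; the port, route, and payload schema are assumptions, so verify them against `retrieval_launch.sh`:

```python
# Hypothetical client call; the endpoint, port, and payload schema are
# assumptions based on Search-R1-style retrieval servers -- verify against
# retrieval_launch.sh before use.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/retrieve",
    json={"queries": ["who wrote hamlet"], "topk": 3},
)
print(resp.json())
```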
(Optional) On NPU, set the parameter `trainer.device=npu`.

GRPO Training:
```bash
cd scripts

# 3B model
bash train_grpo.sh

# 7B model
bash train_grpo_7b.sh
```
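For intuition, GRPO scores each rollout against the statistics of its own sampled group rather than a learned critic. A schematic of the group-relative advantage, shown as a sketch of the idea rather than the repository's implementation:

```python
# Schematic of GRPO's group-relative advantage: each rollout's reward is
# normalized against its group's mean and std, replacing a learned critic.
# A sketch of the idea, not the repository's implementation.
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 8 rollouts sampled for one question; correct answers get reward 1.0
print(grpo_advantages(np.array([1.0, 0.0, 0.5, 0.0, 1.0, 0.0, 0.0, 0.5])))
```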
PPO Training:

```bash
cd scripts

# 3B model
bash train_ppo.sh

# 7B model
bash train_ppo_7b.sh
```

ReSeek achieves state-of-the-art performance across eight open-domain QA benchmarks:
- Qwen2.5-7B: Average accuracy of 0.377, surpassing ZeroSearch's 0.346
- Multi-hop Reasoning: Excels on complex multi-hop benchmarks like HotpotQA and Bamboogle
- FictionalHot: Scores 0.061 on the contamination-resistant stress test, where Direct Inference achieves only ~0.001
We propose the Hot Benchmark evaluation principle to address inconsistencies in experimental settings:
- Test Sets: All 7 datasets (NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, Musique, Bamboogle)
- Training Set: Unified training set merging NQ and HotpotQA training splits
- Corpus: 2018 Wikipedia corpus (wiki-18) for reproducible evaluation
- Metrics: Exact Match (EM) as the primary metric for fair comparison (see the sketch after this list)
- Retrieval: Top-k=3 with at most T=4 tool-use turns per question
- Embeddings: E5 embeddings for search backend
- Models: Qwen2.5-3B/7B-Instruct as backbone models
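For reference, the conventional SQuAD-style normalized Exact Match computation looks like this (a standard reference sketch, not necessarily the repository's exact scorer):

```python
# Conventional SQuAD-style normalized Exact Match; shown for reference,
# not necessarily the repository's exact scorer.
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(gold))

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # 1.0
```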
ReSeek demonstrates robust self-correction through the JUDGE action:
- After initial search, the JUDGE action correctly identifies insufficient information
- Triggers a second targeted search
- Successfully retrieves the correct answer
This dynamic correction mechanism enables ReSeek to excel in complex multi-hop reasoning scenarios.
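Put together, an episode is a bounded search/JUDGE loop (top-k=3 retrieval, at most T=4 turns in the paper's setting). A minimal sketch where `search`, `judge`, and `answer` are placeholders for the model-driven components:

```python
# Minimal sketch of the bounded search/JUDGE loop; `search`, `judge`, and
# `answer` are placeholders for the model-driven components, and the turn
# budget mirrors the T=4, top-k=3 setting above.
from typing import Callable

def run_episode(
    question: str,
    search: Callable[[str, list[str], int], list[str]],
    judge: Callable[[str, list[str]], bool],
    answer: Callable[[str, list[str]], str],
    max_turns: int = 4,
    topk: int = 3,
) -> str:
    context: list[str] = []
    for _ in range(max_turns):
        context += search(question, context, topk)  # issue a (re-planned) query
        if judge(question, context):                # JUDGE: is the evidence sufficient?
            break                                   # yes -> stop searching early
    return answer(question, context)                # commit to a final answer
```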
This work is built on Search-R1 and veRL. We sincerely thank the authors of these projects for their valuable contributions to the open-source community.
If you have any questions, feel free to reach out:
- GitHub Issues: https://github.com/TencentBAC/ReSeek/issues
If this work is helpful, please kindly cite as:
```bibtex
@article{li2025reseek,
  title={ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards},
  author={Li, Shiyu and Tang, Yang and Wang, Yifan and Li, Peiming and Chen, Xi},
  journal={arXiv preprint arXiv:2510.00568},
  year={2025}
}
```

This project is licensed under the Apache License 2.0. See the LICENSE file for details.