Code repository for the EMNLP 2024 paper "EfficientRAG: Efficient Retriever for Multi-Hop Question Answering".
EfficientRAG is a framework that trains a Labeler and a Filter to perform multi-hop retrieval-augmented generation (RAG) without multiple LLM calls.
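At a high level, the Labeler annotates useful tokens in the retrieved chunks and decides whether another retrieval hop is needed, while the Filter composes the next-hop query from those tokens. The sketch below is a simplified illustration of that loop, not this repo's actual implementation (see src/efficientrag_retrieve.py); all component names are placeholders.

```python
# Simplified illustration of the EfficientRAG loop; all components
# are placeholders, not this repo's actual interfaces.
def efficient_rag_retrieve(question, retriever, labeler, filter_model, max_hops=3):
    query = question
    evidence = []
    for _ in range(max_hops):
        chunks = retriever.search(query, topk=10)
        # The Labeler tags the useful tokens in each retrieved chunk
        # and signals whether another retrieval hop is needed.
        labeled_tokens, continue_hop = labeler(query, chunks)
        evidence.append((chunks, labeled_tokens))
        if not continue_hop:
            break
        # The Filter builds the next-hop query from the current query
        # and the labeled tokens, without any LLM call in the loop.
        query = filter_model(query, labeled_tokens)
    return evidence  # handed to the generator for final answering
```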
- 2024-09-12: open-sourced the code
- 2025-03-04: released our data
You can now download our synthesized data from this link.
Unzip the EfficientRAG.zip file and place all the data under the data directory.
Within this directory, the negative_sampling_extracted folder contains our final synthesized data, as referenced in 2.4 Negative Sampling.
Additionally, the efficient_rag directory includes two folders, labeler and filter, which store the training data constructed for the two models, as referenced in 2.5 Training Data.
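After unzipping, the layout should look roughly like this (illustrative; only the folders mentioned above are shown):

```
data/
├── negative_sampling_extracted/   # final synthesized data (2.4 Negative Sampling)
└── efficient_rag/
    ├── labeler/                   # Labeler training data (2.5 Training Data)
    └── filter/                    # Filter training data (2.5 Training Data)
```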
You need to install PyTorch >= 2.1.0 first, then install the required Python packages:

```bash
pip install -r requirements.txt
```

You can also create a conda environment with Python >= 3.9:

```bash
conda create -n <ENV_NAME> python=3.9 pip
conda activate <ENV_NAME>
pip install -r requirements.txt
```
- Download the datasets from HotpotQA, 2WikiMQA and MuSiQue. Split each into train, dev and test sets, then put them under data/dataset.
- Download the retriever model Contriever and the base model DeBERTa, and put them under model_cache. For example, both can be fetched from the Hugging Face Hub, as sketched below.
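  A sketch, assuming the checkpoints facebook/contriever and microsoft/deberta-v3-large; substitute the ones your experiments actually use:

  ```bash
  huggingface-cli download facebook/contriever --local-dir model_cache/contriever
  huggingface-cli download microsoft/deberta-v3-large --local-dir model_cache/deberta-v3-large
  ```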
- Prepare the corpus by extracting documents and constructing embeddings:

  ```bash
  python src/retrievers/multihop_data_extractor.py --dataset hotpotQA
  python src/retrievers/passage_embedder.py \
      --passages data/corpus/hotpotQA/corpus.jsonl \
      --output_dir data/corpus/hotpotQA/contriever \
      --model_type contriever
  ```

- Deploy LLaMA-3-70B-Instruct with the vLLM framework, and configure it in src/language_models/llama.py (an example server command follows this list).
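One way to serve the model is vLLM's OpenAI-compatible server. This is a sketch; the model name, parallelism, and port are assumptions to adapt to your hardware:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 8 \
    --port 8000
```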
We use the hotpotQA training set as the running example below. You can construct 2WikiMQA and MuSiQue data in the same way.
Query decomposition:

```bash
python src/data_synthesize/query_decompose.py \
    --dataset hotpotQA \
    --split train \
    --model llama3
```

Token labeling:

```bash
python src/data_synthesize/token_labeling.py \
    --dataset hotpotQA \
    --split train \
    --model llama3
```

Token extraction:

```bash
python src/data_synthesize/token_extraction.py \
    --data_path data/synthesized_token_labeling/hotpotQA/train.jsonl \
    --save_path data/token_extracted/hotpotQA/train.jsonl \
    --verbose
```

Next-hop query construction:

```bash
python src/data_synthesize/next_hop_query_construction.py \
    --dataset hotpotQA \
    --split train \
    --model llama
```

Next-hop query filtering:

```bash
python src/data_synthesize/next_hop_query_filtering.py \
    --data_path data/synthesized_next_query/hotpotQA/train.jsonl \
    --save_path data/next_query_extracted/hotpotQA/train.jsonl \
    --verbose
```

Negative sampling:

```bash
python src/data_synthesize/negative_sampling.py \
    --dataset hotpotQA \
    --split train \
    --retriever contriever
python src/data_synthesize/negative_sampling_labeled.py \
    --dataset hotpotQA \
    --split train \
    --model llama
python src/data_synthesize/negative_token_extraction.py \
    --dataset hotpotQA \
    --split train \
    --verbose
```

Training data synthesis:

```bash
python src/data_synthesize/training_data_synthesize.py \
    --dataset hotpotQA \
    --split train
```

Training the Filter model:
```bash
python src/efficient_rag/filter_training.py \
    --dataset hotpotQA \
    --save_path saved_models/filter
```

Training the Labeler model:
```bash
python src/efficient_rag/labeler_training.py \
    --dataset hotpotQA \
    --tags 2
```
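As a sanity check, a trained Labeler checkpoint can be loaded for token classification. This is a minimal sketch, assuming the checkpoint was saved in Hugging Face format via save_pretrained; the path and example inputs are hypothetical:

```python
# Minimal sketch, assuming a Hugging Face-format Labeler checkpoint;
# the checkpoint path and example inputs below are hypothetical.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

ckpt = "saved_models/labeler"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(ckpt).eval()

question = "Who directed the film that won Best Picture in 1998?"
chunk = "Titanic, directed by James Cameron, won Best Picture in 1998."
inputs = tokenizer(question, chunk, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits      # (1, seq_len, num_tags)
tags = logits.argmax(dim=-1).squeeze(0)  # one predicted tag id per token
```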
EfficientRAG retrieve procedure:

```bash
python src/efficientrag_retrieve.py \
    --dataset hotpotQA \
    --retriever contriever \
    --labels 2 \
    --labeler_ckpt <<PATH_TO_LABELER_CKPT>> \
    --filter_ckpt <<PATH_TO_FILTER_CKPT>> \
    --topk 10
```

Use LLaMA-3-8B-Instruct as the generator:
```bash
python src/efficientrag_qa.py \
    --fpath <<MODEL_INFERENCE_RESULT>> \
    --model llama-8B \
    --dataset hotpotQA
```

If you find this paper or code useful, please cite:
```bibtex
@inproceedings{zhuang2024efficientrag,
  title={EfficientRAG: Efficient Retriever for Multi-Hop Question Answering},
  author={Zhuang, Ziyuan and Zhang, Zhiyang and Cheng, Sitao and Yang, Fangkai and Liu, Jia and Huang, Shujian and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and Zhang, Qi},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={3392--3411},
  year={2024}
}
```