- Clone the repo
- Create a conda environment: `conda env create --name retrieval_benchmarking --file=environments.yml`
- Install the package: `pip install -e .`
All data can be found at https://drive.google.com/drive/folders/1BJWrocXUzK0MA77SuMCqdF1LrZA56rZj?usp=sharing . Download it and place it in the `data` folder.
| Name | Paradigm | More |
|---|---|---|
| BM25 | Lexical | Link |
| SPLADE | Sparse | Link |
| DPR | Dense | Link |
| ANCE | Dense | Link |
| TAS-B | Dense | Link |
| MPNet | Dense | Link |
| Contriever | Dense | Link |
| ColBERTv2 | Late-Interaction | Link |
- data
  - datastructures: Basic data classes for question, answer and others needed in the pipeline.
  - dataloaders: Loaders that take raw JSON/zip data and convert it to the format needed in the pipeline.
- retriever: Retrievers that take the data loaders and perform retrieval to produce results.
  - dense: Dense retrievers like ColBERTv2, ANCE, Contriever, MPNet, DPR and TAS-B
  - lexical: Lexical retrievers like BM25
  - sparse: Sparse retrievers like SPLADE
- config: Configuration files with constants and initialization.
- utils: Utilities needed in the pipeline, like retrieval accuracy calculation and matching.
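The retrieval accuracy calculation and matching handled by `utils` boil down to checking whether the gold answer string appears among the top-k retrieved passages. Below is a minimal, self-contained sketch of that idea with toy inputs; it is an illustration of the metric, not the repo's actual implementation:

```python
from typing import List


def top_k_accuracy(retrieved: List[List[str]], answers: List[str], k: int = 10) -> float:
    """Fraction of questions whose gold answer string appears (case-insensitive
    substring match) in any of the top-k retrieved passages."""
    hits = 0
    for passages, answer in zip(retrieved, answers):
        if any(answer.lower() in passage.lower() for passage in passages[:k]):
            hits += 1
    return hits / len(answers) if answers else 0.0


# Toy example: two questions, top-2 passages each, one hit -> accuracy 0.5
retrieved = [
    ["Paris is the capital of France.", "Lyon is a large city."],
    ["The Nile flows through Africa.", "Rivers can be very long."],
]
answers = ["Paris", "Amazon"]
print(top_k_accuracy(retrieved, answers, k=2))  # 0.5
```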
All dataset-wise evaluation scripts can be found in the `evaluation` folder; just run the files directly.
Add the project root directory to the `PYTHONPATH` variable and export your Hugging Face token (used to access Llama 2):

```bash
export PYTHONPATH=/path
export huggingface_token=<your huggingface token to access llama2>
```
If you are using an Elasticsearch (ES) installation >8, please export the following values based on your ES setup:

```bash
export ca_certs=<path to http_ca.crt in your ES installation>
export elastic_password=<your elasticsearch password>
```
For example:

```bash
python3 src/evaluation/run_dpr_inference.py
```
You can quickly build your own dataset in three steps:
The base data loader by default takes a JSON file of the format `[{'id': '..', 'question': '..', 'answer': '..'}]`.
Each of the train, test and validation splits should be in its own JSON file under your directory:
- `/dir_path/train.json`
- `/dir_path/test.json`
- `/dir_path/validation.json`
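As a concrete illustration, a split file in the expected format can be written like this (the two toy records are placeholders, and `/dir_path` is the placeholder directory from above):

```python
import json

# Toy placeholder records in the format the base data loader expects:
# a list of dicts with 'id', 'question' and 'answer' keys.
records = [
    {"id": "q1", "question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
    {"id": "q2", "question": "What is the capital of France?", "answer": "Paris"},
]

# Write the split as /dir_path/train.json; repeat for test.json and validation.json.
with open("/dir_path/train.json", "w") as f:
    json.dump(records, f, indent=2)
```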
If you want to create your own custom loader, create your data loader inside the directory `data/dataloaders` by extending `BaseDataLoader`:
```python
class MyDataLoader(BaseDataLoader):
    def load_raw_dataset(self, split):
        dataset = self.load_json(split)
        # Your code to transform the elements in the JSON into
        # List[Sample(idx: str, question: Question, answer: Answer, evidence: Evidence)].
        # If needed, you can also extend the Question, Answer and Evidence
        # dataclasses to form your own types.
        records = ...
        self.raw_data = records

    def load_tokenized(self):
        """If required, override this method to build a custom tokenization of your dataset."""
```
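A hypothetical usage sketch follows; the no-argument constructor is an assumption for illustration and may differ from `BaseDataLoader`'s actual signature:

```python
# Hypothetical usage sketch: the no-arg constructor is an assumption;
# load_raw_dataset("train") and the raw_data attribute mirror the template above.
loader = MyDataLoader()
loader.load_raw_dataset("train")
print(f"{len(loader.raw_data)} training samples loaded")
```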