Hidden Biases in Unreliable News Detection Datasets
Official Code for the paper:
Hidden Biases in Unreliable News Detection Datasets
Xiang Zhou, Heba Elfardy, Christos Christodoulopoulos, Thomas Butler and Mohit Bansal
EACL 2021
The code is tested with Python 3.7 and PyTorch 1.6.0.
Other dependencies are listed in `requirements.txt` and can be installed by running `pip install -r requirements.txt`.
The experiments and results in our paper mainly involve two datasets: NELA and FakeNewsNet.
For the NELA dataset, we use both the 2018 and the 2019 versions. To reproduce the experiments, first download both versions (on the download page, select all files and choose the original format) and put them under the `data` directory. Then decompress `nela/2018/articles.tar.gz` and `nela/2019/nela-gt-2019-json.tar.bz2` in place, so that the extracted folders sit next to the archives. The `data` directory should then look like this:
data
└── nela
├── 2018
│ ├── articles
│ │ └── ...
│ ├── articles.db.gz
│ ├── articles.tar.gz
│ ├── labels.csv
│ ├── labels.txt
│ ├── nela_gt_2018-new_schema.tar.bz2
│ ├── README.md
│ └── titles.tar.gz
└── 2019
├── labels.csv
├── nela-eng-2019
│ └── ...
├── nela-gt-2019-json.tar.bz2
├── nela-gt-2019.tar.bz2
├── README-1.md
├── README.md
└── source-metadata.json
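If you prefer to script the extraction, here is a minimal sketch using Python's built-in `tarfile` module; the archive and target paths follow the layout above, so adjust them if your download location differs.

```python
import tarfile

# Extract the NELA archives next to where they were downloaded,
# matching the directory layout shown above.
archives = [
    ("data/nela/2018/articles.tar.gz", "data/nela/2018"),
    ("data/nela/2019/nela-gt-2019-json.tar.bz2", "data/nela/2019"),
]

for archive_path, target_dir in archives:
    # mode "r:*" lets tarfile auto-detect gzip/bz2 compression
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=target_dir)
```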
The FakeNewsNet dataset can be crawled using the code from its official GitHub repo. After downloading the dataset, put it under `data/fakenewsnet_dataset/raw`; the whole `data` folder should then look like this:
data
├── fakenewsnet_dataset
│ └── raw
└── nela
└── ...
By default, the `data` directory is expected under the repository root. If you prefer to store your data elsewhere, change the corresponding variables in `constants.py` (a rough sketch of what these variables might look like is given below).
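The variable names below (`DATA_DIR`, `NELA_DIR`, `FNN_DIR`) are placeholders for illustration; check `constants.py` for the names the repository actually uses.

```python
import os

# Hypothetical sketch of the path variables in constants.py; the actual
# variable names in the repo may differ. Point these at your data location.
DATA_DIR = os.environ.get("NEWS_DATA_DIR", "data")
NELA_DIR = os.path.join(DATA_DIR, "nela")
FNN_DIR = os.path.join(DATA_DIR, "fakenewsnet_dataset", "raw")
```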
To create the random/site/time splits of NELA used in the paper, run `python data_helper.py nela {site, time, random}`.
To create the random-label split, run `python data_helper.py nela random_label`. (Note that you have to manually rename the split dataset after creating it.)
To create the splits of FakeNewsNet used in the paper, run `python data_helper.py fnn {site, time, random}`.
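Conceptually, the three splits differ in how articles are assigned to train/dev/test: a random split shuffles articles, a site split holds out entire sources, and a time split partitions by publication date. Below is a rough sketch of a site-based split; it is illustrative only and not the logic implemented in `data_helper.py`.

```python
import random

def site_split(articles, dev_frac=0.1, test_frac=0.1, seed=0):
    """Hold out whole sources: no site appears in more than one partition.

    `articles` is assumed to be a list of dicts with a "source" field;
    this is a conceptual sketch, not the repository's implementation.
    """
    sites = sorted({a["source"] for a in articles})
    random.Random(seed).shuffle(sites)
    n_dev = int(len(sites) * dev_frac)
    n_test = int(len(sites) * test_frac)
    dev_sites = set(sites[:n_dev])
    test_sites = set(sites[n_dev:n_dev + n_test])
    train = [a for a in articles if a["source"] not in dev_sites | test_sites]
    dev = [a for a in articles if a["source"] in dev_sites]
    test = [a for a in articles if a["source"] in test_sites]
    return train, dev, test
```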
Example scripts for training the baseline models used in this paper can be found under the `scripts` directory (please refer to Sec. 4.1 of the paper for detailed descriptions of the baseline models). You can change the dataset path to train different baselines.
To train the logistic regression baseline, run `bash scripts/lr.sh`.
To train the title-only RoBERTa models, run `bash scripts/roberta_title.sh`.
To train the title+article RoBERTa models, run `bash scripts/roberta_title_article.sh`.
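For intuition about what the logistic regression baseline does, here is a minimal scikit-learn sketch of a bag-of-words title classifier; the actual features, data loading, and hyperparameters used by `scripts/lr.sh` may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data: (title, label) pairs with 1 = unreliable, 0 = reliable.
train_titles = ["shocking cure doctors hate", "senate passes budget bill"]
train_labels = [1, 0]

# Illustrative pipeline only; the repo's LR baseline may use different
# features, n-gram ranges, and regularization strength.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(train_titles, train_labels)
print(model.predict(["miracle weight loss trick revealed"]))
```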
- Get the predictions on the validation set (by running the eval commands in the model training scripts).
- To get source-level accuracies, run `python source_evaluation.py --pred_file [PREDICTION_FILE] --key_file [KEY_FILE] --pred_type [PRED_TYPE]`. Set `PRED_TYPE` to `clean` for the logistic regression model and the title-only RoBERTa model, and to `full` for the title+article RoBERTa model, due to their different output formats. Please refer to the Python file for the details of the other arguments. A sketch of what source-level accuracy computes is given below.
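As a rough illustration of what source-level accuracy means, the sketch below averages example-level correctness within each source; `source_evaluation.py` may aggregate the predictions differently.

```python
from collections import defaultdict

def source_level_accuracy(preds, golds, sources):
    """Average example-level accuracy within each source.

    `preds`, `golds`, and `sources` are parallel lists; this is an
    illustrative sketch, not the exact logic of source_evaluation.py.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold, source in zip(preds, golds, sources):
        correct[source] += int(pred == gold)
        total[source] += 1
    return {source: correct[source] / total[source] for source in total}
```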
- Train a logistic regression model using `bash scripts/lr.sh` and save the trained model by adding the `save_model [MODEL_PATH]` argument.
- To extract salient features from the logistic regression baselines, run `python analysis_lr.py --model_path [MODEL_PATH]`. Please refer to the Python file for the details of the other arguments. A sketch of this kind of feature extraction is shown below.
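For intuition, extracting salient features from a linear model amounts to reading off the highest- and lowest-weighted vocabulary entries. Here is a hedged sketch for a scikit-learn vectorizer/classifier pair like the one above; `analysis_lr.py` may store and inspect the saved model differently.

```python
import numpy as np

def top_features(vectorizer, clf, k=20):
    """Return the k most positive and k most negative features of a
    fitted (TfidfVectorizer, LogisticRegression) pair.

    Illustrative only; analysis_lr.py may load and rank features
    in a different way.
    """
    names = np.array(vectorizer.get_feature_names_out())
    weights = clf.coef_[0]
    order = np.argsort(weights)
    most_positive = list(zip(names[order[-k:]][::-1], weights[order[-k:]][::-1]))
    most_negative = list(zip(names[order[:k]], weights[order[:k]]))
    return most_positive, most_negative
```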
- Create 5 different domain splits using different seeds by running `python data_helper.py nela site [SEED]`.
- To get the site similarity results in Table 7 of the paper, train 5 title+article baselines by running `bash scripts/roberta_title_article.sh` on each of these 5 domain splits, and put all the predictions under the `output` directory. Change the `SAVE_DIRS` and `SITE_PREDS` variables in `site_similarity.py` to match your saved paths and run `python site_similarity.py` (a rough sketch of aggregating the runs is given below).
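As a very rough illustration, one way to pool the five runs' predictions per site is sketched below. The prediction format shown is hypothetical, and the actual similarity computation in `site_similarity.py` (and hence Table 7) may be different.

```python
import json
from collections import defaultdict

def per_site_accuracy_across_runs(pred_files):
    """Pool per-site correctness counts over several prediction files.

    Assumes (hypothetically) that each file holds a JSON list of records
    like {"source": ..., "pred": ..., "label": ...}; the real output
    format of the training scripts may differ.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for path in pred_files:
        with open(path) as f:
            for record in json.load(f):
                correct[record["source"]] += int(record["pred"] == record["label"])
                total[record["source"]] += 1
    return {site: correct[site] / total[site] for site in total}
```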
- Save the titles with correct and wrong predictions in the files `correct.title` and `wrong.title`, respectively, by running `python dump_titles.py --pred_file [PREDICTION_FILE] --key_file [KEY_FILE] --pred_type [PRED_TYPE]`. Set `PRED_TYPE` to `clean` for the logistic regression model and the title-only RoBERTa model, and to `full` for the title+article RoBERTa model, due to their different output formats. Then, put `correct.title` and `wrong.title` in the same directory as `draw_cloud_unigram.py`.
- To draw the word cloud showing the most salient words in examples with correct or wrong predictions (determined by the `PRINT_TYPE` variable in the script), run `python draw_cloud_unigram.py`. A minimal word-cloud sketch is shown below.
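For reference, a basic unigram word cloud can be drawn with the `wordcloud` package roughly as follows; `draw_cloud_unigram.py` may preprocess and weight the terms differently.

```python
from wordcloud import WordCloud

# Read the dumped titles (correct.title / wrong.title, one title per line)
# and render a unigram word cloud. Illustrative only; draw_cloud_unigram.py
# may filter and weight terms differently.
PRINT_TYPE = "correct"  # or "wrong"
with open(f"{PRINT_TYPE}.title") as f:
    text = f.read()

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file(f"{PRINT_TYPE}_cloud.png")
```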
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.