All eight dynamic text-attributed graphs (DyTAGs) provided by DTGB can be downloaded from here.
Each graph is stored as three files.
- edge_list.csv: stores each edge in the DyTAG as a tuple (u, r, i, ts, label), where u is the id of the source entity, i is the id of the target entity, r is the id of the relation between them, ts is the timestamp at which the edge occurs, and label is the label of the edge.
- entity_text.csv: stores the mapping from entity ids (e.g., u and i) to the text descriptions of entities.
- relation_text.csv: stores the mapping from relation ids (e.g., r) to the text descriptions of relations.
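The layout above can be read with nothing but Python's standard csv module. The sketch below is an illustration under stated assumptions, not code from this repository. It assumes every CSV has a header row, that the edge columns are named u, r, i, ts and label, and that the two text files use columns named id and text. The last two names are hypothetical, so check the first line of your files and adjust the constants.

```python
import csv
from pathlib import Path

# Column names are assumptions; adjust them to match the header row of your files.
EDGE_COLS = ("u", "r", "i", "ts", "label")
ID_COL, TEXT_COL = "id", "text"  # hypothetical headers of the two *_text.csv files


def read_text_map(path, id_col=ID_COL, text_col=TEXT_COL):
    """Return {id: text description} from entity_text.csv or relation_text.csv."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[id_col]: row[text_col] for row in csv.DictReader(f)}


def iter_edges(dataset_dir):
    """Yield one dict per edge, with ids resolved to their text descriptions."""
    dataset_dir = Path(dataset_dir)
    entity_text = read_text_map(dataset_dir / "entity_text.csv")
    relation_text = read_text_map(dataset_dir / "relation_text.csv")
    with open(dataset_dir / "edge_list.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            u, r, i, ts, label = (row[c] for c in EDGE_COLS)
            yield {
                "u": u, "r": r, "i": i, "ts": ts, "label": label,
                # .get() returns None instead of failing when an id has no text entry
                "source_text": entity_text.get(u),
                "relation_text": relation_text.get(r),
                "target_text": entity_text.get(i),
            }
```

Calling list(iter_edges(folder)) on a dataset folder gives one dict per edge. Ids and timestamps stay as strings, exactly as the csv module reads them. If your timestamps are numeric, convert ts before sorting edges chronologically.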
- After downloading the datasets, uncompress them into the DyLink_Datasets folder.
- Run get_pretrained_embeddings.py to obtain the BERT-based node and edge text embeddings. They will be saved as e_feat.npy and r_feat.npy respectively.
- Run get_LLM_data.ipynb to get the train and test sets for the textual relation generation task. They will be saved as LLM_train.pkl and LLM_test.pkl respectively.
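Before training, it can save time to check that the steps above produced every file they should. This is a minimal sketch, assuming all files for a dataset sit together in one folder such as DyLink_Datasets/GDELT. This README does not say where the .npy and .pkl files are written, so treat that location as a guess and pass whatever folder applies.

```python
from pathlib import Path

# File names are taken from this README. That they all sit together in one
# per-dataset folder is an assumption; pass a different folder if needed.
EXPECTED_FILES = {
    "raw data": ["edge_list.csv", "entity_text.csv", "relation_text.csv"],
    "get_pretrained_embeddings.py": ["e_feat.npy", "r_feat.npy"],
    "get_LLM_data.ipynb": ["LLM_train.pkl", "LLM_test.pkl"],
}


def missing_files(dataset_dir):
    """Return {producing step: [missing file names]} for one dataset folder."""
    dataset_dir = Path(dataset_dir)
    missing = {}
    for step, names in EXPECTED_FILES.items():
        absent = [n for n in names if not (dataset_dir / n).is_file()]
        if absent:
            missing[step] = absent
    return missing


def report(dataset_dir):
    """Print a short, human-readable status for one dataset folder."""
    missing = missing_files(dataset_dir)
    if not missing:
        print(f"{dataset_dir}: all expected files found")
    for step, names in missing.items():
        print(f"{dataset_dir}: missing {', '.join(names)} (from: {step})")
```

missing_files returns an empty dict when nothing is missing, and report prints one line per step that still needs to run.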
- Example of training DyGFormer for future link prediction on the GDELT dataset without text attributes:
python train_link_prediction.py --dataset_name GDELT --model_name DyGFormer --patch_size 2 --max_input_sequence_length 64 --num_runs 5 --gpu 0 --use_feature no
- Example of training DyGFormer for future link prediction on the GDELT dataset with text attributes:
python train_link_prediction.py --dataset_name GDELT --model_name DyGFormer --patch_size 2 --max_input_sequence_length 64 --num_runs 5 --gpu 0 --use_feature Bert
- The AP and AUC-ROC metrics on the test set (in both the transductive and inductive settings) will be automatically saved in saved_resuts/DyGFormer/GDELT/DyGFormer_seed0no.json
- The best checkpoint will be saved in the saved_resuts/DyGFormer/GDELT/ folder and will be used to reproduce the performance on the node retrieval task.
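The result files are JSON, so they can be inspected or averaged with a few lines of Python. The key names inside the files are not documented here, so this sketch assumes nothing about them. It walks whatever structure it finds and keeps every value that is a number or a numeric string. The idea of one file per seed, used in the note after the code, is inferred from the example file name and may not match your runs.

```python
import glob
import json
from statistics import mean


def as_number(value):
    """Return value as a float if it is a number or a numeric string, else None."""
    if isinstance(value, bool):  # bool is a subclass of int; do not count it
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return None
    return None


def flatten_metrics(obj, prefix=""):
    """Yield (key path, number) for every numeric leaf of nested JSON data."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten_metrics(value, f"{prefix}{key}/")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten_metrics(value, f"{prefix}{index}/")
    else:
        number = as_number(obj)
        if number is not None:
            yield prefix[:-1], number  # drop the trailing "/"


def load_metrics(path):
    """Read one result file and return {key path: value}."""
    with open(path, encoding="utf-8") as f:
        return dict(flatten_metrics(json.load(f)))


def average_over_files(pattern):
    """Average each metric over all result files matching a glob pattern."""
    runs = [load_metrics(path) for path in sorted(glob.glob(pattern))]
    keys = sorted(set().union(*runs))
    return {key: mean(run[key] for run in runs if key in run) for key in keys}
```

For example, load_metrics on saved_resuts/DyGFormer/GDELT/DyGFormer_seed0no.json returns a flat dict of metric values. average_over_files with the pattern saved_resuts/DyGFormer/GDELT/DyGFormer_seed*no.json averages them over whichever matching files exist.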
After obtaining the best checkpoint on the Future Link Prediction task, the Hits@k metrics of the Destination Node Retrieval task can be reproduced by running:
python evaluate_node_retrieval.py --dataset_name GDELT --model_name DyGFormer --patch_size 2 --max_input_sequence_length 64 --negative_sample_strategy random --num_runs 5 --gpu 0 --use_feature no
- The negative_sample_strategy hyper-parameter controls the candidate sampling strategy, which can be random or historical.
- The use_feature hyper-parameter controls whether to use BERT-based embeddings, and can be no or Bert.
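To cover every combination of the two options, the commands can be generated rather than typed by hand. This is a convenience sketch that uses only the flags and values shown above, and by default it only prints the commands. Running a combination presumably requires a link prediction checkpoint trained with the same use_feature value, which this README does not state explicitly.

```python
import itertools
import shlex
import subprocess

# Flags and values are copied from the example command in this README.
BASE_ARGS = [
    "python", "evaluate_node_retrieval.py",
    "--dataset_name", "GDELT",
    "--model_name", "DyGFormer",
    "--patch_size", "2",
    "--max_input_sequence_length", "64",
    "--num_runs", "5",
    "--gpu", "0",
]
STRATEGIES = ["random", "historical"]
FEATURES = ["no", "Bert"]


def build_commands():
    """Return one argument list per (negative_sample_strategy, use_feature) pair."""
    return [
        BASE_ARGS + ["--negative_sample_strategy", strategy, "--use_feature", feature]
        for strategy, feature in itertools.product(STRATEGIES, FEATURES)
    ]


def run_all(dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for args in build_commands():
        print(shlex.join(args))
        if not dry_run:
            subprocess.run(args, check=True)  # raises on the first failing run
```

run_all() prints four commands. run_all(dry_run=False) executes them one after another from the repository root.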
- Example of training DyGFormer for edge classification on the GDELT dataset without text attributes:
python train_edge_classification.py --dataset_name GDELT --model_name DyGFormer --patch_size 2 --max_input_sequence_length 64 --num_runs 5 --gpu 0 --use_feature no
- The Precision, Recall, and F1-score metrics on the test set will be automatically saved in saved_resuts/DyGFormer/GDELT/edge_classification_DyGFormer_seed0no.json
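For readers who want to sanity-check the reported numbers, here is how per-class precision, recall and F1 are defined, written out in plain Python. This is the textbook definition, not the code used by train_edge_classification.py. This README does not say how the script aggregates over classes, so the macro average shown is only one common choice.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this class, but it was wrong
            fn[truth] += 1  # missed an edge of this class
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over classes."""
    n = len(scores)
    return tuple(sum(s[k] for s in scores.values()) / n for k in range(3))
```

With true labels [0, 0, 1, 1] and predictions [0, 1, 1, 1], class 0 has precision 1.0 and recall 0.5, and class 1 has precision 2/3 and recall 1.0.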
After obtaining the LLM_train.pkl and LLM_test.pkl files, you can directly reproduce the performance of the original LLMs by running:
python LLM_eval.py -config_path=LLM_configs/vicuna_7b_qlora_uncensored.yaml -model=raw
- You can change the LLMs through the config_path hyper-parameter.
- The generated text will be saved in s_his_o_des_his_result_vicuna7b.pkl.
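To see what the generated output looks like before pointing a metric script at it, the file can be opened with the standard pickle module. The structure of the file is not described here, so the sketch only reports the top-level type, its length, and a few sample entries. Only unpickle files you generated yourself, since loading a pickle can execute arbitrary code.

```python
import pickle


def peek_pickle(path, n=3):
    """Load a pickle file and summarise its top-level structure."""
    with open(path, "rb") as f:
        data = pickle.load(f)  # only unpickle files you produced yourself
    summary = {"type": type(data).__name__}
    if hasattr(data, "__len__"):
        summary["length"] = len(data)
    if isinstance(data, dict):
        summary["sample"] = list(data.items())[:n]
    elif isinstance(data, (list, tuple)):
        summary["sample"] = list(data[:n])
    return summary
```

Calling peek_pickle on s_his_o_des_his_result_vicuna7b.pkl returns a small summary dict. If the file holds objects from other libraries, those libraries must be importable when it is loaded.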
Then, to get the Bert_score metrics, change the file path in LLM_metric.py and run:
python LLM_metric.py
If you want to fine-tune the LLMs, you should run:
python LLM_train.py LLM_configs/vicuna_7b_qlora_uncensored.yaml
Then reproduce the performance of the fine-tuned LLMs by running:
python LLM_eval.py -config_path=LLM_configs/vicuna_7b_qlora_uncensored.yaml -model=lora
For any questions or suggestions, you can use the issues section or contact us at ([email protected]).
Code and model implementations draw on the DyGLib project. Thanks for their great contributions!
@article{zhang2024dtgb,
title={DTGB: A Comprehensive Benchmark for Dynamic Text-Attributed Graphs},
author={Zhang, Jiasheng and Chen, Jialin and Yang, Menglin and Feng, Aosong and Liang, Shuang and Shao, Jie and Ying, Rex},
journal={arXiv preprint arXiv:2406.12072},
year={2024}
}