There are two ways to train or fine-tune BERT- or GPT-like models:
- on a supervised downstream task
- in an unsupervised way on a corpus
The supervised training / fine-tuning requires a ground truth dataset.
=> We're going to work on the unsupervised approach
We can leverage the scripts provided by huggingface:
- for Masked Language Models: run_mlm.py (BERT, RoBERTa, DistilBERT and others)
- for Causal Language Models: run_clm.py (GPT, GPT-2)
- for Permuted Language Models: run_plm.py (XLNet)
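As a quick illustration of what the first two objectives look like at inference time, here is a minimal sketch (not taken from the scripts; the model names are only examples):

```python
from transformers import pipeline

# Masked LM (BERT-style): predict the token hidden behind [MASK]
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])  # e.g. "capital"

# Causal LM (GPT-style): predict the next tokens from the left context only
generator = pipeline("text-generation", model="gpt2")
print(generator("Paris is the", max_new_tokens=5)[0]["generated_text"])
```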
Most of the code within the scripts cited above is devoted to:
- handling both the PyTorch and TensorFlow versions
- passing arguments via 3 classes (sketched right after this list):
  - ModelArguments (defined in the script): arguments pertaining to which model, config and tokenizer we are going to fine-tune
  - DataTrainingArguments (defined in the script): arguments pertaining to what data we are going to feed the model for training and eval
  - TrainingArguments (imported from the huggingface lib): arguments pertaining to the actual training / fine-tuning of the model
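The three classes are tied together with HfArgumentParser; a heavily condensed sketch of that part of run_mlm.py (most fields omitted) looks like this:

```python
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ModelArguments:
    # which model / config / tokenizer to fine-tune (most fields omitted here)
    model_name_or_path: Optional[str] = field(default=None)

@dataclass
class DataTrainingArguments:
    # what data to train / evaluate on (most fields omitted here)
    dataset_name: Optional[str] = field(default=None)
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
```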
The core of the script is organized along the following steps (an end-to-end sketch follows this list):
- loading the data through the datasets module with load_dataset(data_args.dataset_name, data_args.dataset_config_name)
- loading the appropriate config, tokenizer and model:
  - config = AutoConfig.from_pretrained(...)
  - tokenizer = AutoTokenizer.from_pretrained(...)
  - model = AutoModelForMaskedLM.from_config(config) when training from scratch, or from_pretrained(...) when fine-tuning an existing checkpoint
- tokenizing the data; for each example the tokenizer returns the input_ids (the index of each token in the vocabulary), an attention mask such as [1, 1, 1, 1, 1, 0, 0, 0, 0] and, for BERT-like tokenizers, the token_type_ids
- the data collator handles the random masking of tokens: data_collator = DataCollatorForLanguageModeling(...)
- finally the training / fine-tuning takes place:
  - the trainer is instantiated: trainer = Trainer(...)
  - the training runs: trainer.train()
- the model is saved
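Putting these steps together, a stripped-down fine-tuning loop along the lines of run_mlm.py could look like the sketch below (dataset and model names are only examples; the real script adds line-by-line handling, text grouping, caching, etc.):

```python
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1. Load the raw text data (a public dataset here, instead of data_args.*)
raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

# 2. Load config, tokenizer and model (fine-tuning path: from_pretrained)
config = AutoConfig.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased", config=config)

# 3. Tokenize the text column
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized = raw_datasets.map(tokenize, batched=True, remove_columns=["text"])

# 4. The data collator randomly masks tokens at batch time (MLM objective)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 5. Instantiate the Trainer, train, and save the model
training_args = TrainingArguments(output_dir="results/", max_steps=5000, save_steps=1000)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
)
trainer.train()
trainer.save_model()
```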
To fine-tune on our own data, specify the path to the training file and to the validation file:
python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --max_seq_length 128 \
    --line_by_line \
    --train_file "path_to_train_file" \
    --validation_file "path_to_validation_file" \
    --do_train \
    --do_eval \
    --max_steps 5000 \
    --save_steps 1000 \
    --output_dir "results/"
- DistilBERT is smaller than BERT, so it is a lighter alternative for --model_name_or_path
- every save_steps steps a checkpoint is written to the output directory; this can quickly eat up all the disk space, so increase save_steps to checkpoint less often
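If checkpoints do fill the disk, TrainingArguments also exposes save_total_limit (not used in the command above), which keeps only the most recent checkpoints and deletes older ones; it can be passed as --save_total_limit on the command line or set directly:

```python
from transformers import TrainingArguments

# keep at most 2 checkpoints under results/; older ones are deleted automatically
training_args = TrainingArguments(output_dir="results/", save_steps=1000, save_total_limit=2)
```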