The goal is to replicate the model described in "Adaptive input representations for neural language modeling" [Baevski and Auli, 2019] and train it on wikitext-103-v1.
- implement dataset loader and tokenizer (sketch below)
- implement adaptive embedding (sketch below)
- implement decoder-only transformer based on torch.nn.MultiheadAttention (sketch below)
- use adaptive softmax from torch.nn.AdaptiveLogSoftmaxWithLoss (sketch below, together with weight tying)
- match adaptive embedding API with adaptive softmax
- tie the embedding and softmax weights
- implement training loop, loss monitoring, learning rate schedule (sketch below)
- implement model save/load, continue training where it stopped (sketch below)
- accumulate the gradient over multiple batches (covered in the training loop sketch below)
- check for underfitting by verifying the model can fit a tiny training set
- implement a function to compute perplexity on the validation set (sketch below)
- fix save/load model, fix weight tying
- improve the adaptive softmax so it can be used with float16 mixed-precision (sketch below)
- try full size (n_tokens=3072)
- Large model (python run.py --big): n_blocks=16, n_heads=16, n_tokens=1024, n_embeddings=1024 (config sketch below)
- Adaptive embedding: cutoffs=[20000, 60000]
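
Dataset loader and tokenizer: a minimal sketch, assuming the Hugging Face datasets package is used to fetch wikitext-103-v1 and that a simple word-level vocabulary with <unk>/<eos> markers is enough; none of the names below come from the repository.

```python
from collections import Counter

import torch
from datasets import load_dataset


def build_vocab(train_lines, min_freq=1):
    # wikitext-103 is already word tokenized, so whitespace splitting is enough
    counter = Counter(tok for line in train_lines for tok in line.split())
    itos = ["<unk>", "<eos>"] + [w for w, c in counter.most_common() if c >= min_freq]
    return {w: i for i, w in enumerate(itos)}, itos


def encode(lines, stoi):
    unk, eos = stoi["<unk>"], stoi["<eos>"]
    ids = []
    for line in lines:
        ids.extend(stoi.get(tok, unk) for tok in line.split())
        ids.append(eos)  # mark line boundaries
    return torch.tensor(ids, dtype=torch.long)


raw = load_dataset("wikitext", "wikitext-103-v1")
stoi, itos = build_vocab(raw["train"]["text"])
train_ids = encode(raw["train"]["text"], stoi)
valid_ids = encode(raw["validation"]["text"], stoi)
```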
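
Adaptive embedding: a minimal sketch whose constructor mirrors torch.nn.AdaptiveLogSoftmaxWithLoss (in_features, n_classes, cutoffs, div_value), which is what matching the two APIs amounts to here; class and attribute names are assumptions.

```python
import torch
import torch.nn as nn


class AdaptiveEmbedding(nn.Module):
    """Adaptive input embedding with the same constructor arguments as
    torch.nn.AdaptiveLogSoftmaxWithLoss, so the two modules stay in sync."""

    def __init__(self, in_features, n_classes, cutoffs, div_value=4.0):
        super().__init__()
        self.cutoffs = list(cutoffs) + [n_classes]
        self.embeddings = nn.ModuleList()
        self.projections = nn.ModuleList()
        for i, (lo, hi) in enumerate(zip([0] + self.cutoffs[:-1], self.cutoffs)):
            dim = int(in_features // (div_value ** i))   # smaller dim for rarer clusters
            self.embeddings.append(nn.Embedding(hi - lo, dim))
            # project every cluster back up to the model dimension
            self.projections.append(nn.Linear(dim, in_features, bias=False))

    def forward(self, tokens):
        out = tokens.new_zeros(*tokens.shape, self.projections[0].out_features,
                               dtype=torch.float)
        for i, (lo, hi) in enumerate(zip([0] + self.cutoffs[:-1], self.cutoffs)):
            mask = (tokens >= lo) & (tokens < hi)
            if mask.any():
                out[mask] = self.projections[i](self.embeddings[i](tokens[mask] - lo))
        return out
```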
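
Decoder-only transformer: a minimal sketch of one block built on torch.nn.MultiheadAttention with a causal mask. The pre-norm layout and the 4x feed-forward width are assumptions; positional encodings and the block stack are omitted.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, n_embeddings, n_heads, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_embeddings, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(n_embeddings, 4 * n_embeddings),
            nn.ReLU(),
            nn.Linear(4 * n_embeddings, n_embeddings),
        )
        self.norm1 = nn.LayerNorm(n_embeddings)
        self.norm2 = nn.LayerNorm(n_embeddings)

    def forward(self, x):
        # causal mask: position i may only attend to positions <= i
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x
```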
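
Adaptive softmax and weight tying: a minimal sketch that pairs the AdaptiveEmbedding above with torch.nn.AdaptiveLogSoftmaxWithLoss. Each tail cluster of the softmax ends in a Linear whose weight has the same shape as the matching embedding table, so that parameter can be shared directly; the head of AdaptiveLogSoftmaxWithLoss mixes word and cluster rows in one Linear, so tying the first cutoff would need a custom head and is not shown here.

```python
import torch.nn as nn


def build_embedding_and_softmax(n_classes, n_embeddings=1024, cutoffs=(20000, 60000)):
    # same (in_features, n_classes, cutoffs) arguments for both modules
    embedding = AdaptiveEmbedding(n_embeddings, n_classes, list(cutoffs))
    softmax = nn.AdaptiveLogSoftmaxWithLoss(n_embeddings, n_classes, list(cutoffs),
                                            div_value=4.0)
    # tie each tail cluster's output matrix to the matching embedding table
    for i, tail in enumerate(softmax.tail):
        tail[1].weight = embedding.embeddings[i + 1].weight
    return embedding, softmax
```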
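
Training loop, learning rate schedule and gradient accumulation: a minimal sketch. The Adam optimizer, the inverse-square-root warmup schedule, the accumulation factor and the assumption that the model returns (output, loss) are mine, not values from the repository.

```python
import torch


def lr_factor(step, warmup=4000):
    # linear warmup followed by inverse square-root decay
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)


def train(model, batches, accum_steps=4, lr=1e-3, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
    model.train()
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches):
        inputs, targets = inputs.to(device), targets.to(device)
        _, loss = model(inputs, targets)        # assumed to return (output, loss)
        (loss / accum_steps).backward()         # accumulate scaled gradients
        if (i + 1) % accum_steps == 0:          # optimizer step every accum_steps batches
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            print(f"update {(i + 1) // accum_steps}  loss {loss.item():.3f}")
```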
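
Model save/load and resuming: a minimal sketch; the file name and the exact set of saved objects are assumptions. load_state_dict copies values in place, so weights tied by parameter assignment before loading stay tied.

```python
import os

import torch

CHECKPOINT = "checkpoint.pt"


def save_checkpoint(model, optimizer, scheduler, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "step": step,
    }, CHECKPOINT)


def load_checkpoint(model, optimizer, scheduler):
    if not os.path.exists(CHECKPOINT):
        return 0                                # nothing saved yet, start from step 0
    state = torch.load(CHECKPOINT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"]                        # continue training from this step
```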
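
Validation perplexity: a minimal sketch, assuming the model returns the mean negative log-likelihood per target token as its loss.

```python
import math

import torch


@torch.no_grad()
def validation_perplexity(model, batches, device="cuda"):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in batches:
        inputs, targets = inputs.to(device), targets.to(device)
        _, loss = model(inputs, targets)        # mean NLL over the batch
        total_nll += loss.item() * targets.numel()
        total_tokens += targets.numel()
    model.train()
    return math.exp(total_nll / total_tokens)
```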
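
float16 mixed-precision: a minimal sketch with torch.cuda.amp that runs the transformer in float16 but feeds the adaptive softmax float32 activations for numerical stability; whether that is the fix actually used here is an assumption.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()


def training_step(transformer, adaptive_softmax, optimizer, inputs, targets):
    optimizer.zero_grad()
    with autocast():                              # transformer runs in float16
        hidden = transformer(inputs)
    # cast activations back to float32 before the adaptive softmax / loss
    output = adaptive_softmax(hidden.float().reshape(-1, hidden.size(-1)),
                              targets.reshape(-1))
    scaler.scale(output.loss).backward()          # scale to avoid gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return output.loss.item()
```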
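
Configuration for python run.py --big: a sketch of how the flag could map onto the hyper-parameters listed above; the argparse wiring and the smaller default configuration are assumptions.

```python
import argparse


def get_config():
    parser = argparse.ArgumentParser()
    parser.add_argument("--big", action="store_true",
                        help="use the large model configuration")
    args = parser.parse_args()
    if args.big:
        # large model from the notes above
        return dict(n_blocks=16, n_heads=16, n_tokens=1024,
                    n_embeddings=1024, cutoffs=[20000, 60000])
    # assumed smaller default for quick experiments (not from the repository)
    return dict(n_blocks=6, n_heads=8, n_tokens=512,
                n_embeddings=512, cutoffs=[20000, 60000])
```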