Adaptive-embedding transformer trained on wikitext-103-v1

The goal is to replicate the model described in Adaptive input representations for neural language modeling [Baevski and Auli, 2019], and train it on wikitext-103-v1.
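
For context, the snippet below is a minimal sketch of one possible way to load wikitext-103-v1, assuming the HuggingFace datasets package and a plain whitespace tokenizer; it is not necessarily what this repository's loader does. The frequency-sorted vocabulary matters later, because the adaptive embedding and adaptive softmax cutoffs expect token ids ordered from most frequent to rarest.

```python
# Minimal sketch of one way to load wikitext-103-v1, using the HuggingFace
# `datasets` package.  This is an assumption for illustration, not
# necessarily the loader implemented in this repository.
from collections import Counter

from datasets import load_dataset

raw = load_dataset("wikitext", "wikitext-103-v1")  # train / validation / test splits

# Toy whitespace tokenizer with a frequency-sorted vocabulary.  The ordering
# matters: adaptive embedding/softmax cutoffs assume that token ids go from
# the most frequent words to the rarest ones.
counter = Counter(tok for line in raw["train"]["text"] for tok in line.split())
itos = [tok for tok, _ in counter.most_common()]
stoi = {tok: i for i, tok in enumerate(itos)}

def encode(line):
    # wikitext-103 already marks out-of-vocabulary words with "<unk>"
    return [stoi.get(tok, stoi["<unk>"]) for tok in line.split()]
```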

TODO

  • implement dataset loader and tokenizer
  • implement adaptive embedding
  • implement decoder-only transformer based on torch.nn.MultiheadAttention
  • use adaptive softmax from torch.nn.AdaptiveLogSoftmaxWithLoss
  • match the adaptive embedding API to the adaptive softmax (see the sketch after this list)
  • tie the embedding and softmax weights
  • implement training loop, loss monitoring, learning rate schedule
  • implement model save/load and resume training where it stopped
  • accumulate gradients over multiple batches (sketched below, after this list)
  • sanity-check training by verifying the model can fit a tiny training set
  • implement a function to compute perplexity on the validation set
  • fix model save/load and weight tying
  • improve the adaptive softmax so it can run in float16 mixed precision
  • try full size (n_tokens=3072)
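
The adaptive embedding items above could be approached as sketched below. This is illustrative only: the class name AdaptiveEmbedding, its exact signature, and the partial weight tying are assumptions, not the code in this repository. The constructor mirrors torch.nn.AdaptiveLogSoftmaxWithLoss so that the two modules can be built from the same cutoffs and partially share weights.

```python
import torch
import torch.nn as nn


class AdaptiveEmbedding(nn.Module):
    """Adaptive input embedding [Baevski and Auli, 2019] (illustrative sketch).

    The signature mirrors torch.nn.AdaptiveLogSoftmaxWithLoss
    (in_features, n_classes, cutoffs, div_value), so both modules can be
    built from the same cutoffs and partially share weights.
    """

    def __init__(self, in_features, n_classes, cutoffs, div_value=4.0):
        super().__init__()
        self.in_features = in_features
        self.cutoffs = [0] + list(cutoffs) + [n_classes]
        self.embeddings = nn.ModuleList()
        self.projections = nn.ModuleList()
        for i in range(len(self.cutoffs) - 1):
            size = self.cutoffs[i + 1] - self.cutoffs[i]
            dim = int(in_features // (div_value ** i))
            self.embeddings.append(nn.Embedding(size, dim))
            # Rarer clusters use a smaller dimension, projected back up to
            # the model dimension.
            self.projections.append(
                nn.Identity() if dim == in_features
                else nn.Linear(dim, in_features, bias=False))

    def forward(self, tokens):
        out = torch.zeros(*tokens.shape, self.in_features,
                          device=tokens.device,
                          dtype=self.embeddings[0].weight.dtype)
        for i, (embedding, projection) in enumerate(
                zip(self.embeddings, self.projections)):
            low, high = self.cutoffs[i], self.cutoffs[i + 1]
            mask = (tokens >= low) & (tokens < high)
            if mask.any():
                out[mask] = projection(embedding(tokens[mask] - low))
        return out


# Partial weight tying with torch.nn.AdaptiveLogSoftmaxWithLoss: the tail
# embedding tables have exactly the same shapes as softmax.tail[i][1], so the
# Parameters can be shared directly.  (n_classes below is the commonly cited
# WikiText-103 vocabulary size; the exact value depends on the tokenizer.)
d_model, n_classes, cutoffs = 1024, 267735, [20000, 60000]
embedding = AdaptiveEmbedding(d_model, n_classes, cutoffs)
softmax = nn.AdaptiveLogSoftmaxWithLoss(d_model, n_classes, cutoffs)
for i in range(1, len(embedding.embeddings)):
    embedding.embeddings[i].weight = softmax.tail[i - 1][1].weight
```

Tying the head cluster (ids below the first cutoff) and the up/down projections is less direct, because the softmax head mixes word and cluster logits and the projection matrices are transposed relative to each other.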

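The gradient accumulation and validation perplexity items can follow the usual PyTorch pattern shown below: scale each mini-batch loss by the number of accumulation steps and only call optimizer.step() every accum_steps batches, then report perplexity as the exponential of the mean per-token negative log-likelihood. The model, loader, and accum_steps names are placeholders, and the snippet assumes the model returns the mean negative log-likelihood, which is not necessarily the interface in run.py.

```python
import math

import torch


def train_epoch(model, loader, optimizer, accum_steps=4, device="cuda"):
    """One epoch with gradients accumulated over `accum_steps` mini-batches."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = model(inputs, targets)     # assumed to return the mean NLL
        (loss / accum_steps).backward()   # scale so summed gradients average
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()


@torch.no_grad()
def validation_perplexity(model, loader, device="cuda"):
    """Perplexity = exp(total NLL / total number of predicted tokens)."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        total_nll += model(inputs, targets).item() * targets.numel()
        total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)
```
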
Model fitting

python run.py --big

  • Large model: n_blocks=16, n_heads=16, n_tokens=1024, n_embeddings=1024
  • Adaptive embedding: cutoffs=[20000, 60000]
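
As a rough illustration of the "decoder-only transformer based on torch.nn.MultiheadAttention" item with the hyperparameters above (reading n_tokens as the sequence length), one possible block is sketched below. The pre-norm layout, the 4x feed-forward width, and the dropout value are assumptions, not necessarily the choices made in run.py.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """One causal (decoder-only) transformer block built on
    torch.nn.MultiheadAttention.  Pre-norm layout and the 4x feed-forward
    width are illustrative assumptions."""

    def __init__(self, n_embeddings=1024, n_heads=16, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(n_embeddings)
        self.attn = nn.MultiheadAttention(n_embeddings, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(n_embeddings)
        self.ff = nn.Sequential(
            nn.Linear(n_embeddings, 4 * n_embeddings),
            nn.ReLU(),
            nn.Linear(4 * n_embeddings, n_embeddings),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device),
                          diagonal=1)
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + self.dropout(h)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x


# Large model from the list above: 16 blocks, 16 heads, 1024-dim embeddings.
blocks = nn.Sequential(*[DecoderBlock(n_embeddings=1024, n_heads=16)
                         for _ in range(16)])
x = torch.randn(2, 128, 1024)  # (batch, sequence length <= n_tokens, n_embeddings)
y = blocks(x)                  # same shape as x
```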

[Figure: training losses for the transformer_16x16x1024x1024x20000x60000 model]
