A toy implementation of "Attention Is All You Need", BERT, and GPT2
I made this project to get a deeper understanding of the Transformer architecture and of the BERT, RoBERTa, T5, and GPT models. We often rely on existing Transformer implementations such as Hugging Face Transformers when we need to train a model, but I wanted to see whether I could implement them from scratch, referring to the papers.
This project does include:

- `torch.nn.Module`
- `torch.nn.Parameter`
- Existing tokenizer implementations from `transformers`
- Other primitive functions offered by PyTorch
This project does not include:

- Any models from `transformers`
- `nn.Transformer`
- `nn.MultiheadAttention`
- `nn.Embedding`
- `nn.LayerNorm`
- `nn.functional.softmax`
- Other existing modules that play an essential role in the Transformer architecture
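To illustrate what building on these primitives looks like in practice, here is a minimal sketch of a layer-normalization module written with nothing but `nn.Module`, `nn.Parameter`, and basic tensor operations. It is an illustration of the approach, not necessarily the exact code in `src/layers`.

```python
import torch
from torch import nn


class LayerNorm(nn.Module):
    """Layer normalization built only from nn.Module, nn.Parameter, and tensor ops."""

    def __init__(self, hidden_size: int, eps: float = 1e-12) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # learnable scale (gamma)
        self.bias = nn.Parameter(torch.zeros(hidden_size))   # learnable shift (beta)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        normalized = (x - mean) / torch.sqrt(var + self.eps)
        return self.weight * normalized + self.bias
```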
We have implemented the following features so far. The layers and functions live in `src/layers`, and the models in `src/models`.
- `dropout`
- `softmax`
- `gelu`
- `positional_encoding`
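As a rough sketch of what these primitives look like when written from scratch (function names and signatures here are assumptions for illustration, not the exact API in `src/layers`):

```python
import math

import torch


def softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=dim, keepdim=True)


def gelu(x: torch.Tensor) -> torch.Tensor:
    # Exact GELU using the Gaussian CDF: x * Phi(x).
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Sinusoidal encoding from "Attention Is All You Need" (Section 3.5); assumes even d_model.
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
```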
- `MultiHeadAttention`
- `FeedForwardNetwork`
- `LayerNorm`
- `TokenEmbedding`
- `TransformerEncoder`
- `TransformerEncoderBlock`
- `TransformerDecoder`
- `TransformerDecoderBlock`
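Below is a minimal sketch of how `MultiHeadAttention` can be assembled from plain linear projections and the from-scratch `softmax` above, following the scaled dot-product attention of Vaswani et al. (2017). It is illustrative only and does not claim to match the code in `src/layers`.

```python
import math

import torch
from torch import nn


class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention with multiple heads (Vaswani et al., 2017)."""

    def __init__(self, d_model: int, num_heads: int) -> None:
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # nn.Linear is treated as a primitive here; it is not on the excluded list above.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        def split_heads(x):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return x.view(batch_size, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.q_proj(query))
        k = split_heads(self.k_proj(key))
        v = split_heads(self.v_proj(value))

        # Scaled dot-product attention; masked positions get -inf before softmax.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = softmax(scores, dim=-1)  # the from-scratch softmax sketched above

        context = weights @ v  # (batch, heads, seq, d_head)
        context = context.transpose(1, 2).contiguous().view(
            batch_size, -1, self.num_heads * self.d_head
        )
        return self.out_proj(context)
```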
- `BertModel`
- `GPT2Model`
- `T5Model`
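These models differ mainly in how they use the blocks above: BERT is encoder-only with bidirectional attention, GPT-2 is decoder-only with a causal mask, and T5 combines an encoder and a decoder. As a small self-contained illustration, a causal mask for decoder-style attention (compatible with the `mask` argument of the `MultiHeadAttention` sketch above) can be built as follows:

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where position i may attend only to positions j <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


# Usage with the MultiHeadAttention sketch above (illustrative):
#   mask = causal_mask(x.size(1))        # (seq, seq), broadcast over batch and heads
#   out = attention(x, x, x, mask=mask)  # self-attention with a causal constraint
```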
We currently use `transformers` for schedulers, but plan to implement them from scratch in the future.
- `AdamW`
- `CrossEntropy`
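For reference, a single AdamW update with decoupled weight decay reduces to a handful of tensor operations per parameter. The sketch below illustrates the algorithm and is not the optimizer code in this repository.

```python
import torch


@torch.no_grad()
def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single parameter tensor (decoupled weight decay)."""
    state["step"] += 1
    m, v, t = state["m"], state["v"], state["step"]

    # Decoupled weight decay: shrink the parameter directly, not through the gradient.
    param.mul_(1.0 - lr * weight_decay)

    # Exponential moving averages of the gradient and its square.
    m.mul_(betas[0]).add_(grad, alpha=1.0 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1.0 - betas[1])

    # Bias-corrected estimates, then the Adam step.
    m_hat = m / (1.0 - betas[0] ** t)
    v_hat = v / (1.0 - betas[1] ** t)
    param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)


# Example state initialization for one parameter tensor p:
#   state = {"step": 0, "m": torch.zeros_like(p), "v": torch.zeros_like(p)}
```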
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. NeurIPS 2017.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.