attentif

A toy implementation of "Attention Is All You Need"

A matplotlib plot of training loss vs. step

Demo

BERT

Screenshot of JupyterLab solving a fill-mask task with BERT

GPT2

Screenshot of JupyterLab solving a text-generation task with GPT-2

Motivation

I made this project in order to get a deeper understanding of the Transformer architecture and of the BERT, RoBERTa, T5, and GPT models. We often rely on existing Transformer implementations such as Hugging Face Transformers when we need to train a model. However, I wanted to test whether I could implement them from scratch, referring to the papers.

This project does include:

  • torch.nn.Module
  • torch.nn.Parameter
  • Existing tokenizer implementations from transformers
  • And other primitive functions offered by PyTorch

While this project does not include:

  • Any models from transformers
  • nn.Transformer
  • nn.MultiheadAttention
  • nn.Embedding
  • nn.LayerNorm
  • nn.functional.softmax
  • And other existing modules that play an essential role in the Transformer architecture (see the sketch after this list)
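
For example, under these constraints a layer-normalization module has to be built from torch.nn.Module, torch.nn.Parameter, and primitive tensor operations only. The following is a minimal sketch of the idea, not the exact code in src/layers:

```python
import torch
from torch import nn

class LayerNorm(nn.Module):
    """Layer normalization from scratch: only nn.Module, nn.Parameter, and primitive ops."""

    def __init__(self, d_model: int, eps: float = 1e-5) -> None:
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last (feature) dimension, then scale and shift.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```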

Features

We have implemented the following features so far; short sketches of some of them are shown below each group. You can find the layers and functions in src/layers, and the models in src/models.

Functions

  • dropout
  • softmax
  • gelu
  • positional_encoding
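
As a rough illustration of the functions above, a numerically stable softmax and the exact (erf-based) GELU can be written with primitive tensor operations only, without nn.functional.softmax or nn.functional.gelu. This is a sketch of the idea, not necessarily the exact code in src/layers:

```python
import math
import torch

def softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract the per-row max for numerical stability before exponentiating.
    x = x - x.max(dim=dim, keepdim=True).values
    exp = torch.exp(x)
    return exp / exp.sum(dim=dim, keepdim=True)

def gelu(x: torch.Tensor) -> torch.Tensor:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
```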

Layers

  • MultiHeadAttention
  • FeedForwardNetwork
  • LayerNorm
  • TokenEmbedding
  • TransformerEncoder
  • TransformerEncoderBlock
  • TransformerDecoder
  • TransformerDecoderBlock
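
The computation at the core of MultiHeadAttention is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. Below is a minimal single-head sketch; the per-head projections and head splitting that the actual layer presumably performs are omitted:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); mask (optional): broadcastable to (batch, seq_len, seq_len)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Numerically stable softmax over the last dimension, written with primitive ops.
    scores = scores - scores.max(dim=-1, keepdim=True).values
    weights = torch.exp(scores)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ v

q = k = v = torch.randn(2, 5, 8)
out = scaled_dot_product_attention(q, k, v)   # shape: (2, 5, 8)
```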

Models

  • BertModel
  • GPT2Model
  • T5Model

Optimizers, loss functions, and schedulers

We currently rely on existing implementations from PyTorch and transformers for these training components (used roughly as sketched after the list), but plan to implement them from scratch in the future.

  • AdamW
  • CrossEntropy
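
A minimal sketch of how these off-the-shelf pieces fit together, assuming torch.optim.AdamW, torch.nn.CrossEntropyLoss, and a linear warmup schedule from transformers; the tiny linear model and the random data are placeholders, not the models from src/models:

```python
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

vocab_size, num_steps = 100, 10
model = nn.Linear(vocab_size, vocab_size)  # placeholder stand-in, not a model from src/models

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2, num_training_steps=num_steps
)

for step in range(num_steps):
    inputs = torch.randn(8, vocab_size)           # dummy batch of features
    targets = torch.randint(0, vocab_size, (8,))  # dummy class labels
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```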
