A modern reimplementation of GPT-2 that incorporates architectural improvements from recent language models like Gemma2. This project demonstrates how classic transformer architectures can benefit from modern techniques while maintaining compatibility with the Hugging Face ecosystem.
ModernGPT2 enhances the original GPT-2 architecture with several key improvements:
We've replaced the standard LayerNorm with Root Mean Square Normalization (RMSNorm). This simpler variant normalizes activations by their root mean square (no mean subtraction or bias) and applies a learnable scaling parameter initialized to zeros, making training more stable.
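Below is a minimal sketch of this style of RMSNorm, assuming the Gemma2 convention of applying the zero-initialized scale as `1 + weight`; class and attribute names are illustrative, not the exact ones in `moderngpt2/`:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learnable scale."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Initialized to zeros; the effective scale is (1 + weight), so the
        # layer starts out as plain RMS normalization.
        self.weight = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x_normed = x * torch.rsqrt(variance + self.eps)
        return x_normed * (1.0 + self.weight)
```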
Gone are the learned absolute position embeddings. RoPE injects positional information directly into the attention mechanism through rotation of query and key vectors, enabling better length generalization.
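As a rough illustration (shapes and helper names assumed, not taken from the repo), RoPE rotates pairs of query/key features by position-dependent angles:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # (x1, x2) -> (-x2, x1) over the two halves of the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, head_dim, theta=10000.0):
    # q, k: (..., seq_len, head_dim); positions: (seq_len,)
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq_len, head_dim/2)
    emb = torch.cat((angles, angles), dim=-1)                  # (seq_len, head_dim)
    cos, sin = emb.cos(), emb.sin()
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```

Because positions enter only through these rotations, no separate position-embedding table is needed.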
The attention mechanism now supports using fewer key/value heads than query heads. When num_key_value_heads < n_head, keys and values are shared across multiple query heads, significantly reducing memory usage while maintaining performance.
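Conceptually, each key/value head is simply shared by a group of query heads; a hedged sketch of the usual expansion step (tensor layout assumed, not taken from the repo):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, num_kv_heads, seq, head_dim) so each KV head serves n_rep query heads."""
    if n_rep == 1:
        return x
    batch, num_kv_heads, seq, head_dim = x.shape
    x = x[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq, head_dim)
    return x.reshape(batch, num_kv_heads * n_rep, seq, head_dim)

# e.g. n_head = 16 query heads sharing num_key_value_heads = 4 KV heads:
# keys = repeat_kv(keys, n_rep=16 // 4)
# values = repeat_kv(values, n_rep=16 // 4)
```

Only the smaller key/value tensors need to be stored and cached, which is where the memory savings come from.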
The feed-forward network uses a gating mechanism with three projections:
- Gate projection with activation function
- Up projection for feature expansion
- Down projection after element-wise multiplication
This allows for more sophisticated feature interactions compared to the standard two-layer MLP.
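A minimal sketch of this gated (GeGLU-style) MLP, with illustrative module names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activation on the gate branch, element-wise product with the up branch,
        # then project back down to the hidden size.
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(gate * self.up_proj(x))
```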
A few smaller attention and embedding details round out the architecture:

- Query scaling: Queries are scaled by `head_dim^-0.5` by default
- Attention softcapping: Optional logit capping before softmax to prevent training instabilities (sketched below)
- Embedding scaling: Input embeddings are scaled by `sqrt(hidden_size)`
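For intuition, softcapping squashes the attention logits through a scaled tanh before the softmax; a small sketch under assumed shapes (the cap value here is illustrative):

```python
import torch

def softcap(scores: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap), so no single score can blow up the softmax.
    return cap * torch.tanh(scores / cap)

# scores = (q @ k.transpose(-2, -1)) * head_dim ** -0.5   # query scaling
# scores = softcap(scores, cap=50.0)                      # optional softcapping
# attn = torch.softmax(scores, dim=-1)
```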
Each transformer block now follows a more sophisticated residual pattern with normalization both before and after each major operation (attention and MLP), improving gradient flow.
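Schematically, reusing the RMSNorm and GatedMLP sketches above (again a sketch, not the repo's exact module):

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, attention: nn.Module):
        super().__init__()
        self.attn = attention
        self.mlp = GatedMLP(hidden_size, intermediate_size)
        self.pre_attn_norm = RMSNorm(hidden_size)
        self.post_attn_norm = RMSNorm(hidden_size)
        self.pre_mlp_norm = RMSNorm(hidden_size)
        self.post_mlp_norm = RMSNorm(hidden_size)

    def forward(self, x):
        # Normalize -> attention -> normalize -> residual add
        x = x + self.post_attn_norm(self.attn(self.pre_attn_norm(x)))
        # Normalize -> MLP -> normalize -> residual add
        x = x + self.post_mlp_norm(self.mlp(self.pre_mlp_norm(x)))
        return x
```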
Before training, you'll need three things:

- Tokenizer: A SentencePiece or Hugging Face tokenizer. Train one from scratch using our `train_tokenizer.py` script or bring your own.
- Dataset: Two options are available:
  - Stream the multilingual C4 dataset directly (requires internet)
  - Pre-tokenize the data for faster training using `pretokenize_dataset.py`
- Hardware: Designed for GPU training with DeepSpeed. We include a ZeRO Stage 1 configuration that works well from single-GPU setups to multi-node clusters.
The training pipeline consists of three main steps:
If you don't have a tokenizer, train one on multilingual C4 data:
```bash
python train_tokenizer.py \
  --output_path ./my_tokenizer \
  --vocab_size 32000 \
  --max_train_lines 1000000 \
  --special_tokens "<|endoftext|>" "<unk>" "<pad>"
```

This trains a 32K vocabulary on 1M lines from C4 (supports en, ja, ko, zh).
Pre-tokenizing avoids redundant tokenization during training:
```bash
python pretokenize_dataset.py \
  --tokenizer_path ./my_tokenizer \
  --output_path ./my_pretokenized_data \
  --block_size 1024 \
  --max_samples_per_shard 200000 \
  --c4_langs "en" "ja" \
  --max_input_lines_total 5000000
```

This processes 5M lines from English and Japanese C4, creating sharded Parquet files with 1024-token blocks.
Note on Parallelism in Pre-tokenization:
- The `pretokenize_dataset.py` script can utilize multiple CPU cores for faster processing using the `--num_proc <number>` argument. By default, it uses (CPU count - 2) processes.
- However, parallel processing via `--num_proc` is only effective if you run the script with the `--no_dataset_streaming` flag, because the `datasets.map()` function cannot use multiprocessing with streaming datasets (see the sketch after this list).
- Warning: Using `--no_dataset_streaming` will cause the script to download and load the entirety of the specified C4 language splits into your Hugging Face cache directory (or memory if caching is disabled) before processing begins. This can require a very large amount of disk space and memory, especially for multiple C4 languages.
- If you use the default `--dataset_streaming` (or specify it explicitly), pre-tokenization will run on a single core but will be more memory-efficient for very large datasets.
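The distinction comes from how `datasets.map()` behaves: `num_proc` multiprocessing is only supported on materialized (non-streaming) datasets, while a streaming `IterableDataset` is processed lazily in a single process. A rough illustration (the mapped function here is a placeholder, not the script's actual tokenization step):

```python
from datasets import load_dataset

# Streaming: memory-efficient, but map() runs in a single process.
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)
streamed = streamed.map(lambda ex: {"n_chars": len(ex["text"])})

# Non-streaming: the whole split is downloaded to the HF cache first
# (very large for C4), but map() can then fan out across processes.
materialized = load_dataset("allenai/c4", "en", split="train")
materialized = materialized.map(lambda ex: {"n_chars": len(ex["text"])}, num_proc=8)
```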
We support various model sizes (small, medium, large, xl) and hardware configurations.
The training script offers two main ways to handle your dataset:
- Using pre-tokenized data (recommended): Provide the path to your pre-processed Parquet files using the `--pre_tokenized_dataset_path /path/to/your/pretokenized_data` argument. This is generally faster as tokenization is done only once. You can generate this data using the `pretokenize_dataset.py` script.
- On-the-fly C4 streaming: If `--pre_tokenized_dataset_path` is omitted, the script defaults to streaming the C4 dataset and tokenizing it during training. In this mode, ensure you set `--block_size` (e.g., `--block_size 1024`) to define the sequence length for the model.
Mixed Precision Training:
- Use `--fp16` to enable FP16 mixed precision training.
- Use `--bf16` to enable BF16 mixed precision training (requires Ampere or newer NVIDIA GPUs, or compatible hardware).
- Note: `--fp16` and `--bf16` are mutually exclusive.
Example launch with DeepSpeed:

```bash
deepspeed train.py \
  --deepspeed \
  --deepspeed_config "ds_config_zero1.json" \
  --model_size_name "medium" \
  --tokenizer_path "./my_tokenizer" \
  --pre_tokenized_dataset_path "./my_pretokenized_data" \
  --output_dir "output/model" \
  --num_train_epochs 3 \
  --per_device_train_batch_size 16 \
  --learning_rate 2.5e-4 \
  --ds_config "ds_config_zero1.json" \
  --fp16 \
  --gradient_accumulation_steps 1
```

Example multi-GPU launch with Accelerate:

```bash
accelerate launch --config_file accelerate_config.yaml --num_processes 4 train.py \
  --model_size_name "medium" \
  --tokenizer_path "./my_tokenizer" \
  --pre_tokenized_dataset_path "./my_pretokenized_data" \
  --output_dir "output/multi_gpu_model" \
  --ds_config "ds_config_zero1.json" \
  --num_train_epochs 3 \
  --per_device_train_batch_size 16 \
  --learning_rate 2.5e-4 \
  --fp16 \
  --gradient_accumulation_steps 1
```

Note: First run `accelerate config` to create your configuration file. The provided `accelerate_config.yaml` assumes 4 GPUs with DeepSpeed.
For detailed training examples on specific hardware (TSUBAME 4, ABCI 3.0, etc.), see training_examples.md.
ModernGPT2 includes several new configuration parameters:
- `num_key_value_heads`: For grouped query attention (use fewer than `n_head` to save memory)
- `rope_theta`: Base frequency for rotary embeddings (default: 10000.0)
- `attn_logit_softcapping`: Caps attention logits for stability
- `activation_function`: Supports various activations (default: "gelu_pytorch_tanh")
- `rms_norm_eps`: Epsilon for RMSNorm layers (default: 1e-6)
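Assuming the config class follows the usual Transformers naming (the file path suggests `ModernGPT2Config`, but the exact name and defaults should be checked against the source), usage might look like:

```python
from moderngpt2.configuration_moderngpt2 import ModernGPT2Config  # class name assumed from the file path

config = ModernGPT2Config(
    n_head=16,
    num_key_value_heads=4,        # grouped query attention: 4 KV heads shared by 16 query heads
    rope_theta=10000.0,           # RoPE base frequency
    attn_logit_softcapping=50.0,  # illustrative cap value
    activation_function="gelu_pytorch_tanh",
    rms_norm_eps=1e-6,
)
```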
See moderngpt2/configuration_moderngpt2.py for all available options.
- H100/H200 (80GB+): Use batch size 32-64 per device
- A100 (40-80GB): Use batch size 16-32 per device
- A6000/RTX 4090 (24-48GB): Use batch size 8-16 per device
The included DeepSpeed configuration uses ZeRO Stage 1 optimization, which works well for most setups. For larger models or limited memory, consider ZeRO Stage 2 or 3.
The codebase is organized as:
- `moderngpt2/`: Core model implementation (PyTorch, TensorFlow, JAX)
- `train.py`: Main training script with Hugging Face Trainer
- `dataset.py`: Data loading utilities with streaming support
- Configuration files for DeepSpeed and Accelerate
Install dependencies:
```bash
pip install torch transformers datasets deepspeed accelerate
```

This project builds upon the Hugging Face Transformers library and incorporates techniques from Google's Gemma2 model. Please refer to their respective licenses.