A streamlined framework for training language models efficiently on various datasets.
Clean Framework provides a structured approach to training language models on custom datasets. The framework handles data preprocessing, tokenization, training, and evaluation in a consistent way, allowing you to focus on designing your experiments rather than boilerplate code.
# Clone the repository
git clone https://github.com/pe-hy/clean_framework.git
cd clean_framework
# Create a conda environment
conda create -n llms python=3.11
# Activate the environment
# On Windows, macOS, and Linux:
conda activate llms
# Install dependencies
pip install -r requirements.txt
clean_framework/
│
├── config/ # Configuration files
│ ├── base.yaml # Main configuration (modify for your project)
│ └── hf_config.py # HuggingFace model configuration
│
├── data/ # Data directory
│ ├── train.json # Training data
│ ├── test.json # Testing data
│ └── inference_data/ # Data for inference
│
├── data_generation/ # Scripts for generating data
│ └── generate_data.py # Main data generation script
│
├── hpc_scripts/ # Scripts for HPC environments
│ ├── data_info.sh # Get data statistics
│ ├── filter.sh # Filter data
│ ├── gen_data.sh # Generate data
│ ├── inference.sh # Run inference
│ ├── job.sh # Training job
│ └── tokenizer.sh # Create tokenizer
│
├── tokenizer/ # Tokenizer files
│
├── utils/ # Utility scripts
│ ├── create_tokenizer.py # Tokenizer creation
│ ├── data.py # Data loading utilities
│ ├── evaluator.py # Evaluation utilities
│ ├── filter_data.py # Data filtering
│ ├── get_data_info.py # Data statistics
│ └── inference.py # Inference utilities
│
├── train.py # Main training script
├── requirements.txt # Package list
├── .gitignore # Git ignore file
├── LICENSE # MIT License
└── README.md # This file
Data must be in JSON format with input/output keys, similar to the following:
[
{
"input": "E0 . T3 . T5 . T18",
"output": "E45"
},
{
"input": "E0 . T3 . T8 . T16 . T16",
"output": "E1"
},
...
]

- Data will be tokenized like:
  [PAD] [PAD] [BOS] E0 . T3 . T5 . T18 [OUT] E45 [EOS] ... (padded up to model.block_size)
- The [OUT] delimiter token is added during preprocessing in data.py
- Everything up to and including the delimiter is masked out of the loss during training
- The model is trained to predict the content following the delimiter token
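To make the tokenization and loss masking concrete, here is a minimal illustrative sketch. It is not the framework's actual data.py; the toy vocabulary, the helper name build_example, and the -100 "ignore" label are assumptions:

# Illustrative sketch only -- the real preprocessing lives in utils/data.py.
# The vocabulary, helper name, and -100 ignore label are assumptions.
def build_example(input_text, output_text, vocab, block_size):
    # Token sequence as described above: [BOS] input [OUT] output [EOS]
    tokens = ["[BOS]"] + input_text.split() + ["[OUT]"] + output_text.split() + ["[EOS]"]
    ids = [vocab[t] for t in tokens]

    # Mask everything up to and including [OUT] so the loss is computed
    # only on the output tokens (next-token shifting is typically handled
    # by the model/loss, not here).
    out_pos = tokens.index("[OUT]")
    labels = [-100] * (out_pos + 1) + ids[out_pos + 1:]

    # Left-pad to block_size with [PAD]; padded positions are ignored too.
    pad = block_size - len(ids)
    ids = [vocab["[PAD]"]] * pad + ids
    labels = [-100] * pad + labels
    return ids, labels


vocab = {t: i for i, t in enumerate(
    ["[PAD]", "[BOS]", "[EOS]", "[OUT]", "E0", "E45", ".", "T3", "T5", "T18"])}
ids, labels = build_example("E0 . T3 . T5 . T18", "E45", vocab, block_size=16)
print(ids)
print(labels)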
Every time you work on a new project, you need to update the configuration:
- config/base.yaml contains default settings that should be modified for your specific project
- Pay special attention to model parameters (size, heads, layers) and project naming
- Run python utils/get_data_info.py to get information about your data length before setting model.block_size
First, create a virtual environment and install the required dependencies:
# Create a conda environment
conda create -n llms python=3.11
# Activate the environment
# On Windows, macOS, and Linux:
conda activate llms
# Install dependencies
pip install -r requirements.txt
To log metrics to Weights & Biases, log in first:

# In terminal, type:
wandb login

Create your dataset generation script in the data_generation directory:
# Place your script at data_generation/generate_data.py
python data_generation/generate_data.py

This should create train.json and test.json in the data directory.
Analyze your dataset to determine appropriate model parameters:
python utils/get_data_info.py

Take note of the token statistics to set an appropriate model.block_size value in the configuration.
Edit config/base.yaml with your project-specific settings:
- Set wandb.model_name and wandb.proj_name to meaningful names
- Configure model size parameters (n_layer, n_head, n_embd)
- Set model.block_size based on your data statistics
- Adjust batch sizes and other training parameters as needed
Generate a tokenizer based on your data:
python utils/create_tokenizer.py

This creates a tokenizer specific to your dataset's vocabulary.
If needed, filter examples that are too long:
python utils/filter_data.py

Start the training process:
python train.py

This will train the model according to your configuration and save checkpoints.
After training, run inference on new data:
python utils/inference.py

Place your inference data files in the data/inference_data/ directory.
The framework provides built-in evaluation metrics:
- Token-level accuracy
- Exact match accuracy
- Detailed evaluation examples logged to Weights & Biases
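To make the two accuracy numbers concrete, here is a minimal, hypothetical sketch of how they can be computed from predicted and target strings (the framework's utils/evaluator.py may differ in details):

# Hypothetical illustration of the two accuracy metrics; see utils/evaluator.py
# for the framework's actual implementation.
def accuracy_metrics(predictions, targets):
    token_correct, token_total, exact_matches = 0, 0, 0
    for pred, target in zip(predictions, targets):
        pred_tokens, target_tokens = pred.split(), target.split()
        # Token-level accuracy: compare position by position against the target.
        token_total += len(target_tokens)
        token_correct += sum(p == t for p, t in zip(pred_tokens, target_tokens))
        # Exact match: the whole prediction must equal the whole target.
        exact_matches += int(pred_tokens == target_tokens)
    return {
        "token_accuracy": token_correct / max(token_total, 1),
        "exact_match": exact_matches / max(len(targets), 1),
    }


print(accuracy_metrics(["E45", "E2"], ["E45", "E1"]))
# {'token_accuracy': 0.5, 'exact_match': 0.5}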
This tutorial provides detailed instructions for setting up and using the Clean Framework for language model training.
- Python 3.11
- ROCm/CUDA-compatible GPU
- Git (for version control)
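Once dependencies are installed, it can be worth confirming that PyTorch detects your GPU before launching long runs. This assumes PyTorch is among the packages in requirements.txt; the check works for both CUDA and ROCm builds:

# Quick sanity check that PyTorch (CUDA or ROCm build) detects a GPU.
# Assumes PyTorch is installed via requirements.txt.
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True if a usable GPU is detected
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))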
The first step is to configure your project by editing config/base.yaml. Here are the key sections to update:
wandb:
  # Name your model
  model_name: "Pythia-12-8-256-MyProject"
  # Your project name
  proj_name: "my-language-model-project"
  num_examples_reported: 100

model:
  name: ${wandb.model_name}
  batch_size: 512
  accumulate_grad_batches: 1
  block_size: 1024   # Adjust based on your data
  epochs: 100
  n_layer: 12        # Number of transformer layers
  n_head: 8          # Number of attention heads
  n_embd: 256        # Embedding dimension

optim:
  lr_type: "linear"  # Options: "linear", "linear-reg", or None (cosine decay)
  lr: 2e-4           # Learning rate (used for cosine schedule)
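Note that ${wandb.model_name} is an interpolation that is resolved when the configuration is loaded. Assuming an OmegaConf-style loader (the framework's actual loading code may differ), it behaves roughly like this:

# Assumes OmegaConf-style config loading; the framework may load its config differently.
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/base.yaml")
print(cfg.model.block_size)  # 1024
print(cfg.model.name)        # resolves ${wandb.model_name} -> "Pythia-12-8-256-MyProject"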
Create a data generation script at data_generation/generate_data.py that produces data in the required JSON format:

# Example of a simple data generation script (data_generation/generate_data.py)
import json
import os
import random


def generate_examples(num_examples):
    examples = []
    for _ in range(num_examples):
        # Generate your task-specific data here
        # Example for a simple addition task:
        a = random.randint(1, 100)
        b = random.randint(1, 100)
        input_text = f"Add {a} and {b}"
        output_text = f"{a + b}"
        examples.append({
            "input": input_text,
            "output": output_text
        })
    return examples


def main():
    # Create data directory if it doesn't exist
    os.makedirs("data", exist_ok=True)

    # Generate training data
    train_examples = generate_examples(10000)
    with open("data/train.json", "w") as f:
        json.dump(train_examples, f, indent=2)

    # Generate test data
    test_examples = generate_examples(1000)
    with open("data/test.json", "w") as f:
        json.dump(test_examples, f, indent=2)

    print(f"Generated {len(train_examples)} training examples")
    print(f"Generated {len(test_examples)} test examples")


if __name__ == "__main__":
    main()

Run the data generation script:
python data_generation/generate_data.py

To properly set model parameters, analyze your data:
python utils/get_data_info.py

This will provide important statistics about your dataset:
- Average token count
- Maximum token count
- Token distribution
- Sample examples
Use this information to adjust your model.block_size in the configuration. The block_size should be large enough to accommodate most examples, but not so large that it wastes memory.
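As a rough cross-check, similar statistics can be estimated directly from the generated JSON. The sketch below assumes whitespace tokenization plus a few special tokens; utils/get_data_info.py remains the authoritative tool:

# Rough sketch: estimate a suitable block_size from data/train.json, assuming
# whitespace tokenization plus a handful of special tokens ([BOS], [OUT], [EOS]).
import json

with open("data/train.json") as f:
    examples = json.load(f)

special_tokens = 3  # [BOS], [OUT], [EOS]
lengths = sorted(len(ex["input"].split()) + len(ex["output"].split()) + special_tokens
                 for ex in examples)

print("max length:", lengths[-1])
print("99th percentile:", lengths[int(0.99 * (len(lengths) - 1))])
# Pick a block_size at least as large as the longest example you intend to keep,
# rounded up to a convenient value (e.g. 128, 256, 512, 1024).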
Generate a tokenizer based on your data:
python utils/create_tokenizer.py

This creates a custom tokenizer optimized for your dataset's vocabulary.
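Conceptually, this amounts to building a word-level vocabulary from the training data plus the special tokens. Below is a minimal sketch using the Hugging Face tokenizers library; the special-token names, output path, and library choice are assumptions, and utils/create_tokenizer.py defines the real behavior:

# Conceptual sketch of word-level tokenizer creation; the special-token names,
# output path, and use of the `tokenizers` library are assumptions -- see
# utils/create_tokenizer.py for the framework's actual implementation.
import json
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer

with open("data/train.json") as f:
    examples = json.load(f)

# Every whitespace-separated symbol in the data becomes a vocabulary entry.
corpus = [f'{ex["input"]} {ex["output"]}' for ex in examples]

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()
trainer = WordLevelTrainer(special_tokens=["[PAD]", "[BOS]", "[EOS]", "[OUT]", "[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)
tokenizer.save("tokenizer/tokenizer.json")  # assumed output location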
If your dataset contains examples that are too long for your chosen block_size, you can filter them:
# Update the filter settings in config/base.yaml first:
data:
  filter:
    max_token_length: 1024  # Maximum token length for filtering examples

# Then run the filter script
python utils/filter_data.py
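The filtering step itself is conceptually simple; the sketch below is a hypothetical stand-in for utils/filter_data.py (it assumes whitespace token counts and overwrites the JSON files in place):

# Hypothetical stand-in for utils/filter_data.py: drop examples whose
# whitespace token count exceeds the configured maximum.
import json

MAX_TOKEN_LENGTH = 1024  # mirror data.filter.max_token_length in config/base.yaml

for split in ("train", "test"):
    with open(f"data/{split}.json") as f:
        examples = json.load(f)

    kept = [ex for ex in examples
            if len(ex["input"].split()) + len(ex["output"].split()) <= MAX_TOKEN_LENGTH]

    with open(f"data/{split}.json", "w") as f:
        json.dump(kept, f, indent=2)

    print(f"{split}: kept {len(kept)} of {len(examples)} examples")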
Launch the training process:

python train.py

The training script will:
- Load your data
- Initialize the model
- Train for the specified number of epochs
- Save checkpoints based on validation accuracy
- Log metrics to Weights & Biases (if configured)
If you've configured Weights & Biases, you can monitor your training:
- Open your web browser
- Go to https://wandb.ai
- Navigate to your project
- View training metrics and sample predictions
After training, you can run inference on new data:
- Create JSON files in the data/inference_data/ directory with the same format as your training data
- Run the inference script:

python utils/inference.py

The script will generate predictions and calculate accuracy metrics.
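For example, a small held-out file can be placed in the inference directory like this (the file name and the addition-task examples are arbitrary; the format simply mirrors train.json):

# Write a small inference set in the same input/output format as train.json.
# The file name below is arbitrary.
import json
import os

examples = [
    {"input": "Add 7 and 35", "output": "42"},
    {"input": "Add 19 and 4", "output": "23"},
]

os.makedirs("data/inference_data", exist_ok=True)
with open("data/inference_data/my_eval.json", "w") as f:
    json.dump(examples, f, indent=2)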
To implement your own custom tasks:
- Design your task and data format
- Create a generation script in data_generation/
- Update config as needed for your specific task
- Follow the standard training workflow
This project is licensed under the MIT License - see the LICENSE file for details.