A streamlined pipeline for fine-tuning small language models (primarily TinyLlama-1.1B-Chat-v1.0) on domain-specific text using HuggingFace Transformers. Includes preprocessing, training, and deployment via FastAPI/Streamlit.
- Efficient Fine-Tuning: Memory-efficient training using HuggingFace Transformers
- Domain Adaptation: Easy preprocessing of custom text data
- Low Resource Requirements: Optimized for 8GB GPU machines
- Dual Inference Options:
  - FastAPI server for production deployments
  - Streamlit UI for interactive demos
- Reproducible Pipeline: End-to-end scripts from data prep to deployment
```text
.
├── .github/                   # GitHub Actions workflows (e.g., Pylint)
├── data/
│   ├── preprocess.py          # Data preprocessing and tokenization script
│   └── processed/             # Default output for processed data (content ignored by .gitignore)
├── train/
│   ├── train.py               # Model fine-tuning script
│   └── checkpoints/           # Default output for model checkpoints (content ignored by .gitignore)
├── optimize/
│   └── export_onnx.py         # Script to export the model to ONNX
├── serve/
│   ├── inference_fastapi.py   # FastAPI inference server
│   └── inference_streamlit.py # Streamlit UI for inference
├── utils/
│   ├── benchmark.py           # Performance benchmarking (example)
│   └── metrics.py             # Evaluation metrics (example)
├── tests/                     # Unit tests (recommended, structure may vary)
├── .gitignore                 # Specifies intentionally untracked files
├── README.md                  # This file
├── requirements.txt           # Project dependencies
└── setup.sh                   # Environment setup script (creates venv, installs deps)
```
- Python 3.10+
- PyTorch (CUDA 11.8+ recommended for GPU)
- 8GB+ GPU (or CPU-only mode)
- Key dependencies:
  `transformers`, `fastapi`, `uvicorn`, `streamlit`, `torch`, `accelerate`, `optimum[onnxruntime]` (for ONNX export)
- HF_TOKEN: Ensure the `HF_TOKEN` environment variable is set if you are using models that require authentication with the Hugging Face Hub (e.g., private models or some Llama variants).
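If a token is required, one way to pick it up from the environment in Python is via `huggingface_hub` (a minimal sketch; the repo's scripts may handle authentication differently):

```python
import os

from huggingface_hub import login

# Log in to the Hugging Face Hub only if HF_TOKEN is set.
token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
```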
```bash
git clone [repository-url]
cd llm-finetuning
pip install -r requirements.txt
```

```bash
python data/preprocess.py \
    --input data/raw/ \
    --output data/processed/ \
    --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

- Processes raw text into tokenized training data
- Supports local files or HuggingFace datasets
- Outputs: `data/processed/tokenized_data.jsonl`
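To sanity-check the output, you can inspect the first record (a minimal sketch; the exact field names, such as `input_ids`, depend on how `preprocess.py` writes the file):

```python
import json

# Peek at the first record of the processed JSONL file.
with open("data/processed/tokenized_data.jsonl", encoding="utf-8") as f:
    record = json.loads(next(f))

print("fields:", sorted(record))
# If the record stores token ids (an assumption), report the sequence length.
if "input_ids" in record:
    print("sequence length:", len(record["input_ids"]))
```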
```bash
python train/train.py \
    --data data/processed/tokenized_data.jsonl \
    --output train/checkpoints/ \
    --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --epochs 3 \
    --batch_size 1 \
    --lr 2e-5 \
    --max_length 128
    # add --fp16 to enable mixed-precision training if desired
```

Key parameters:

- `model_name`: Identifier of the Hugging Face model to use (default: `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- `batch_size`: Default 1 for 8GB GPUs
- `max_length`: 128 tokens recommended for memory efficiency
- `fp16`: Optional flag to enable fp16 mixed-precision training (add `--fp16`); if not provided, training runs in full precision (fp32)
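For orientation, these flags roughly correspond to standard `transformers` Trainer arguments. The snippet below is a minimal sketch of that mapping, not the actual contents of `train.py` (which may add tokenization, collation, and checkpointing logic):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumes the preprocessed JSONL already contains model-ready fields.
dataset = load_dataset("json",
                       data_files="data/processed/tokenized_data.jsonl",
                       split="train")

# Rough mapping of --epochs / --batch_size / --lr / --fp16.
args = TrainingArguments(
    output_dir="train/checkpoints/",
    num_train_epochs=3,             # --epochs
    per_device_train_batch_size=1,  # --batch_size
    learning_rate=2e-5,             # --lr
    fp16=False,                     # set to True when --fp16 is passed
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```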
Note on Model Path: The serving scripts (`inference_fastapi.py` and `inference_streamlit.py`) load the fine-tuned model from the path given by the `FINETUNED_MODEL_PATH` environment variable. If the variable is not set, they fall back to `train/checkpoints/` (relative to the project root, so run the scripts from there or make sure the path resolves correctly).
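A minimal sketch of that fallback logic (the serving scripts may differ in detail):

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Use the env var if present, otherwise the default checkpoint directory.
model_path = os.environ.get("FINETUNED_MODEL_PATH", "train/checkpoints/")

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
```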
```bash
uvicorn serve.inference_fastapi:app --host 0.0.0.0 --port 8000
```

- RESTful API endpoint at `/generate`
- Example request:

```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_new_tokens": 128, "temperature": 0.7}'
```
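The same request from Python, assuming the server is running locally (the shape of the JSON response depends on `inference_fastapi.py`):

```python
import requests

payload = {
    "prompt": "What is the capital of France?",
    "max_new_tokens": 128,
    "temperature": 0.7,
}

resp = requests.post("http://localhost:8000/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```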
```bash
streamlit run serve/inference_streamlit.py
```

- Interactive web interface
- Real-time text generation
- Memory-efficient with automatic CUDA management
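For illustration, a Streamlit front end over a fine-tuned checkpoint can be wired up roughly as follows (a minimal sketch, not the contents of `inference_streamlit.py`; the checkpoint path and generation settings are assumptions):

```python
import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "train/checkpoints/"  # assumed default; see the model path note above

@st.cache_resource  # keep the model loaded across Streamlit reruns
def load_model():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    return tokenizer, model

tokenizer, model = load_model()
prompt = st.text_area("Prompt")

if st.button("Generate") and prompt:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128,
                                    do_sample=True, temperature=0.7)
    st.write(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```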
The implementation includes several optimizations for running on consumer GPUs:
- Automatic CUDA memory management
- Session state persistence in Streamlit
- CPU offloading when needed
- Garbage collection after inference
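The garbage-collection and cache-clearing points typically amount to an explicit cleanup step after each inference call, along these lines (hypothetical helper):

```python
import gc

import torch

def free_gpu_memory() -> None:
    """Release Python garbage and cached CUDA allocations after inference."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```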
- CUDA Out of Memory
  - Reduce `batch_size` to 1
  - Decrease `max_length`
  - Enable CPU offloading
- Streamlit Device Errors

  ```bash
  # Run with a specific watcher type
  export STREAMLIT_WATCHER_TYPE=watchman

  # Or disable file watching
  streamlit run serve/inference_streamlit.py --server.runOnSave false
  ```

- Model Loading Issues
  - Check CUDA availability
  - Verify checkpoint files
  - Clear the CUDA cache if needed
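A quick diagnostic for those three checks (paths follow the defaults assumed earlier):

```python
import os

import torch

checkpoint_dir = os.environ.get("FINETUNED_MODEL_PATH", "train/checkpoints/")

print("CUDA available:", torch.cuda.is_available())
print("Checkpoint dir exists:", os.path.isdir(checkpoint_dir))
if os.path.isdir(checkpoint_dir):
    print("Checkpoint files:", sorted(os.listdir(checkpoint_dir)))

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # clear cached allocations before retrying the load
```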
- Training
  - Start with small batch sizes
  - Monitor GPU memory usage (see the helper sketched below)
  - Save checkpoints frequently
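A small helper for the memory-monitoring point (hypothetical; not part of the repo):

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print current and peak GPU memory usage in GiB."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] GPU memory: {allocated:.2f} GiB allocated, {peak:.2f} GiB peak")
```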
- Inference
  - Use FastAPI for production
  - Use Streamlit for demos
  - Clear the CUDA cache between runs
- Data Preparation
  - Verify the tokenizer matches the model
  - Check for proper padding/truncation (see the sketch below)
  - Validate the processed data format
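One way to spot-check tokenizer and padding settings before a long run (a minimal sketch; `max_length=128` matches the training default above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common fallback for Llama-style tokenizers

sample = "Domain-specific example sentence for a quick tokenization check."
encoded = tokenizer(sample, truncation=True, padding="max_length", max_length=128)

# The padded/truncated sequence should be exactly max_length tokens long.
assert len(encoded["input_ids"]) == 128
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:10])
```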
Contributions welcome! Please check the issues page and submit PRs for improvements.
MIT