TrxCNN is a deep learning model designed for classifying full-length transcriptome sequences using convolutional neural networks with residual connections. This implementation uses PyTorch to provide an efficient and scalable solution for genomic sequence analysis.
The model is introduced in the paper *Using Deep Learning to Classify Full-Length Transcriptome Sequences* (citation below).

Key features:
- CNN-based Architecture: Uses 1D convolutional layers with residual blocks for sequence classification
- Dynamic Sequence Length Handling: Supports variable-length sequences with efficient padding and masking
- Bucket Batching: Implements intelligent batching strategy to optimize training efficiency
- GPU Acceleration: Full CUDA support with mixed precision training
- Customizable Hyperparameters: Comprehensive hyperparameter configuration system
Requirements:
- Python 3.7+
- PyTorch 1.8+
- CUDA (optional, for GPU acceleration)
Install the dependencies:
```bash
pip install torch torchvision torchaudio
pip install pandas
pip install torchtext
```

Project structure:
```
TrxCNN/
├── README.md # Project documentation
├── train.py # Main training script
├── protcnn_model.py # Model architecture definition
├── fastq_dataset.py # Dataset class for FASTQ data processing
├── prepocess_raw_data.py # Data preprocessing utilities
├── hparams.py # Hyperparameter configurations
├── utils.py # Utility functions and custom collate functions
└── train.sh # Training script for batch execution
```
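utils.py supplies the custom collate logic behind the bucket-batching strategy listed above. The sketch below shows one common way to implement length-bucketed batching with a padding collate function; the function names, bucket size, and padding value are illustrative assumptions, not the actual utils.py API.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def bucket_by_length(samples, bucket_size=32):
    """Sort (sequence, label) pairs by length and cut them into buckets.

    Grouping similar-length sequences keeps per-batch padding small,
    which is the point of bucket batching. Illustrative sketch only.
    """
    samples = sorted(samples, key=lambda s: len(s[0]))
    return [samples[i:i + bucket_size] for i in range(0, len(samples), bucket_size)]

def collate_batch(batch):
    """Pad one bucket of (sequence_tensor, label) pairs to a common length."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    # Padded positions are later ignored by the masked convolutions,
    # so the padding value itself is not critical here.
    padded = pad_sequence(seqs, batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)
```

In a DataLoader, collate_batch would be passed as collate_fn, and the buckets would drive a batch sampler so that each batch is drawn from a single bucket.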
The TrxCNN model consists of:
- Input Layer: Processes DNA sequences (A, T, C, G) with one-hot encoding
- Initial Convolution: 1D convolution with masking for variable-length sequences
- Residual Blocks: Multiple residual blocks with dilated convolutions
- Global Pooling: Max pooling across the sequence dimension
- Classification Layer: Fully connected layer for transcript classification
Key components (see the sketch after this list):
- Conv1d_with_mask: Custom convolution layer that handles variable-length sequences
- Residual_Block: Residual connection blocks with batch normalization and ReLU activation
- DNA_Model: Main model class combining all components
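The sketch below shows how these components fit together in PyTorch: a masked input convolution, dilated residual blocks, global max pooling, and a linear classifier. It is an illustration under assumed layer arrangements and is not the code in protcnn_model.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlockSketch(nn.Module):
    """BatchNorm -> ReLU -> dilated Conv1d, twice, plus a skip connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=dilation * (kernel_size - 1) // 2,
                               dilation=dilation)
        self.bn2 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out

class DNAModelSketch(nn.Module):
    """Simplified stand-in for DNA_Model: masked conv, residual blocks, pooling, classifier."""
    def __init__(self, num_classes, filters=800, kernel_size=3, num_layers=4):
        super().__init__()
        # One-hot DNA input has 4 channels (A, T, C, G).
        self.input_conv = nn.Conv1d(4, filters, kernel_size, padding=kernel_size // 2)
        self.blocks = nn.ModuleList(
            [ResidualBlockSketch(filters, kernel_size, dilation=2 ** i)
             for i in range(num_layers)]
        )
        self.classifier = nn.Linear(filters, num_classes)

    def forward(self, x, mask=None):
        # x: (batch, 4, seq_len); mask: (batch, 1, seq_len), 1 for real bases, 0 for padding.
        x = self.input_conv(x)
        for block in self.blocks:
            if mask is not None:
                x = x * mask          # mimics Conv1d_with_mask: padded positions stay zero
            x = block(x)
        if mask is not None:
            x = x * mask
        x = x.max(dim=2).values       # global max pooling over the sequence dimension
        return self.classifier(x)
```

The default filters, kernel_size, and num_layers in the sketch mirror the values configured in hparams.py.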
The model expects DNA sequence data in the following format:
- Input: FASTQ format files containing DNA sequences
- Processing: Sequences are converted to integer encoding (A=0, T=1, C=2, G=3)
- Output: Classification labels for different transcripts
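As a concrete example of this encoding, the snippet below maps a DNA string to the integer codes above and then to the one-hot representation fed to the first convolution. This is a sketch; the real preprocessing lives in prepocess_raw_data.py and fastq_dataset.py.

```python
import torch
import torch.nn.functional as F

BASE_TO_INT = {"A": 0, "T": 1, "C": 2, "G": 3}

def encode(seq):
    """Turn a DNA string into a (4, seq_len) one-hot float tensor."""
    idx = torch.tensor([BASE_TO_INT[base] for base in seq.upper()])
    return F.one_hot(idx, num_classes=4).T.float()

x = encode("ATCGGATC")
print(x.shape)  # torch.Size([4, 8])
```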
Dataset statistics:
- Genes: 19,813
- Transcripts: 57,899 (classification classes)
- Dataset size: ~16GB in CSV format
Quick start:
- Prepare your data: place FASTQ files in the appropriate directory structure.
- Preprocess the data:
  ```bash
  python prepocess_raw_data.py
  ```
- Start training:
  ```bash
  python train.py      # or use the batch script: bash train.sh
  ```
Modify hyperparameters in hparams.py:
```python
def hparams_set_train():
    hparams = {}
    hparams["filters"] = 800        # Number of convolution filters
    hparams["kernel_size"] = 3      # Convolution kernel size
    hparams["num_layers"] = 4       # Number of residual blocks
    hparams["lr_rate"] = 0.0005     # Learning rate
    hparams["num_epochs"] = 40      # Training epochs
    hparams["bt_size"] = 32         # Batch size
    return hparams
```
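A quick usage example: load the dictionary and override individual values before training (assuming hparams.py is importable from the project root):

```python
from hparams import hparams_set_train

hp = hparams_set_train()
hp["bt_size"] = 64      # e.g. larger batches if GPU memory allows
hp["num_epochs"] = 5    # or a short run for a quick smoke test
```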
To test a saved model:
```python
from train import test_saved_model

results = test_saved_model()
```

Training features (illustrated in the sketch after this list):
- Gradient Clipping: Prevents gradient explosion
- Learning Rate Scheduling: Dynamic learning rate with exponential decay
- Mixed Precision Training: Faster training with reduced memory usage
- Model Checkpointing: Automatic model saving during training
- Bucket Batching: Efficient batching strategy for variable-length sequences
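The sketch below shows how these features typically combine in a single PyTorch training epoch: torch.cuda.amp for mixed precision, torch.nn.utils.clip_grad_norm_ for gradient clipping, an ExponentialLR scheduler for learning-rate decay, and torch.save for checkpointing. The loop, optimizer choice, clipping norm, and decay factor are illustrative assumptions, not a copy of train.py.

```python
import torch
from torch import nn, optim

def train_one_epoch(model, loader, optimizer, scaler, scheduler, device,
                    ckpt_path="saved_model/checkpoint.pt", clip_norm=1.0):
    """One epoch with mixed precision, gradient clipping, LR decay, and checkpointing."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():              # mixed-precision forward pass
            loss = criterion(model(inputs), labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)                   # so clipping sees unscaled gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()                                 # exponential learning-rate decay
    torch.save(model.state_dict(), ckpt_path)        # checkpoint after every epoch

# Typical setup (lr from hparams.py; Adam and gamma=0.95 are assumptions):
# optimizer = optim.Adam(model.parameters(), lr=0.0005)
# scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
# scaler = torch.cuda.amp.GradScaler()
```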
The model achieves competitive performance on full-length transcriptome classification tasks. Training logs and model checkpoints are automatically saved in the saved_model/ directory.
Contributions are welcome! Please feel free to submit issues and pull requests.
If you use this code in your research, please cite:
```bibtex
@article{trxcnn2024,
  title={Using Deep Learning to Classify Full-Length Transcriptome Sequences},
  author={[Authors]},
  journal={IEEE},
  year={2024},
  url={https://ieeexplore.ieee.org/document/10385824}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Note: Make sure you have sufficient GPU memory for training, as the model processes large genomic sequences. The training script automatically detects and uses CUDA when it is available.