fbcorrea/hvrlocator

HVRLocator

HVRLocator is a workflow to identify spanning hypervariable region(s) from amplicon sequencing variants or SRA public runs (SRR). It aligns query sequences to a reference E. coli full-length 16S rRNA gene and identifies the spanning region through alignment.

Using Singularity to run HVRLocator (Recommended)

  1. Download the Singularity image (hvrlocator.sif) to a local folder, either from https://cloud.sylabs.io/library/jsaraiva/repo/hvrlocator or by running the following command in your terminal:

singularity pull --arch amd64 hvrlocator.sif library://jsaraiva/repo/hvrlocator:hvr

Usage

HVRLocator can process both SRA accession numbers and FASTA files containing ASV sequences. For help on a specific subcommand, run one of the following:

singularity exec hvrlocator.sif hvrlocator sra --help
singularity exec hvrlocator.sif hvrlocator align --help

Processing SRA Accession Numbers

To process an SRA run:

singularity exec hvrlocator.sif hvrlocator sra -r SRR1585194 -o /path/to/output/folder

You can specify the location of the E. coli reference file if it's not in the default location:

singularity exec hvrlocator.sif hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output/folder

To also predict primer presence with the Random Forest model, pass the model file with -m (this also works when processing a list of runs):

singularity exec hvrlocator.sif hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output/folder -m /path/to/rf_model.pkl

To process a list of SRA runs:

singularity exec hvrlocator.sif hvrlocator sra -l /path/to/list.txt -o /path/to/output/folder

You can specify the location of the E. coli reference file if it's not in the default location:

singularity exec hvrlocator.sif hvrlocator sra -l /path/to/list.txt --ecoli /path/to/ecoli.fa -o /path/to/output/folder

Note: The list file should contain one SRA run accession per line.
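
The one-accession-per-line format can be checked before launching a long batch. The sketch below is not part of HVRLocator: `read_run_list` is a hypothetical helper, and the accession pattern (SRR, ERR, or DRR followed by digits) is an assumption about what valid run IDs look like.

```python
import re
from pathlib import Path

# ASSUMPTION: run accessions are an SRR/ERR/DRR prefix followed by digits.
RUN_ID = re.compile(r"^[SED]RR\d+$")


def read_run_list(path):
    """Return the run IDs in `path`, one per line, rejecting malformed lines."""
    runs = []
    for number, line in enumerate(Path(path).read_text().splitlines(), start=1):
        entry = line.strip()
        if not entry:  # tolerate blank lines
            continue
        if not RUN_ID.match(entry):
            raise ValueError(f"line {number}: {entry!r} is not a single run accession")
        runs.append(entry)
    return runs
```

A line holding two accessions, or one separated by commas, raises a `ValueError` that names the offending line.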

Processing ASV FASTA Files

To process a FASTA file containing ASV sequences:

singularity exec hvrlocator.sif hvrlocator fasta -f path/to/your/asv_sequences.fasta --ecoli /path/to/ecoli.fa -o /path/to/output/folder

To modify the coverage threshold (default = 0.6), add the "-t" flag at the end of the command (e.g. -t 0.7).
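
As an illustration of what the threshold does, the sketch below keeps only regions whose coverage is at or above the cutoff. It is not HVRLocator's code: `regions_passing` and the sample values are hypothetical, and whether the tool uses `>=` or `>` at the boundary is an assumption.

```python
def regions_passing(coverage, threshold=0.6):
    """Return the names of regions whose coverage meets the threshold."""
    return [region for region, value in coverage.items() if value >= threshold]


# Hypothetical coverage values (fraction of each region covered, 0-1).
coverage = {"V3": 0.35, "V4": 1.0, "V5": 0.65}

print(regions_passing(coverage))       # default threshold of 0.6
print(regions_passing(coverage, 0.7))  # stricter threshold, as with -t 0.7
```

With these sample values, raising the threshold from 0.6 to 0.7 drops V5 and leaves only V4.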

Local Installation (Alternative)

To install HVRLocator locally, follow these steps:

  1. Create a new conda environment:
mamba create -n <ENV_NAME> -y -c bioconda -c conda-forge python=3.9 sra-tools fastp biopython numpy scipy vsearch

Note: Replace <ENV_NAME> with a name of your choosing.

  2. Activate the environment, clone the repository, and install the package:
source activate <ENV_NAME> && \
cd <PATH to FOLDER WHERE YOU WANT THE GITHUB REPO TO BE LOCATED> && \
git clone https://github.com/fbcorrea/hvrlocator && \
cd hvrlocator && \
pip install -e .
mamba install -c bioconda -c conda-forge mafft scikit-learn==1.1.3 joblib

Usage

HVRLocator can process both SRA accession numbers and FASTA files containing ASV sequences.

Processing SRA Accession Numbers

To process an SRA run:

hvrlocator sra -r SRR1585194 -o /path/to/output_folder

You can specify the location of the E. coli reference file if it's not in the default location:

hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output_folder

To also predict primer presence with the Random Forest model, pass the model file with -m:

hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output_folder -m /path/to/rf_model.pkl

To process a list of SRA runs (add -m /path/to/rf_model.pkl if you also want to predict primer presence):

hvrlocator sra -l /path/to/SRA_list.txt -o /path/to/output_folder

Processing ASV FASTA Files

To process a FASTA file containing ASV sequences:

hvrlocator fasta -f path/to/your/asv_sequences.fasta

Output

The following columns are shown in the output table:

• Sample_ID: Identifier of the processed sample.
• Primer_Presence: "Yes", "No", or "NA" depending on model output and input quality.
• Score_Primer_Presence: Probability output from the Random Forest model.
• Median/Avg/Min/Max Alignment Start/End: Various statistics on read alignment positions.
• Predicted HV Region: Based on alignment range irrespective of threshold.
• Coverage-based HV Region: Based on which V-regions passed the specified coverage threshold.
• Coverage_HV_region: Coverage value of V-regions.
• Warnings: Alerts about low coverage regions.
• Cov_V1 to Cov_V9: Coverage values (0-1) for each HV region.
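
To make the Cov_V1 to Cov_V9 columns concrete, here is a minimal sketch of how a per-region coverage value could be derived from an alignment start and end on the E. coli reference. The region boundaries are approximate positions commonly cited in the literature, and the function name `region_coverage` is hypothetical; HVRLocator's own boundaries and calculation may differ.

```python
# Approximate E. coli 16S positions of V1-V9 as commonly cited in the
# literature. ASSUMPTION: HVRLocator may define these boundaries differently.
V_REGIONS = {
    "V1": (69, 99),
    "V2": (137, 242),
    "V3": (433, 497),
    "V4": (576, 682),
    "V5": (822, 879),
    "V6": (986, 1043),
    "V7": (1117, 1173),
    "V8": (1243, 1294),
    "V9": (1435, 1465),
}


def region_coverage(align_start, align_end):
    """Fraction (0-1) of each V-region covered by the aligned interval."""
    coverage = {}
    for name, (start, end) in V_REGIONS.items():
        # Length of the overlap between the alignment and the region
        # (inclusive coordinates); negative means no overlap.
        overlap = min(align_end, end) - max(align_start, start) + 1
        coverage[f"Cov_{name}"] = round(max(0, overlap) / (end - start + 1), 2)
    return coverage
```

For example, `region_coverage(515, 806)` gives full coverage for V4 and zero for its neighbours V3 and V5 under these assumed boundaries.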

Random Forest Model

A detailed description of the Random Forest model generation is available here.

Project Structure

  • hvrlocator.py: The main script that handles both SRA and FASTA processing.
  • setup.py: Used for installing the package.

Possible Errors and Troubleshooting

1. FastQ Files Not Found
Error Message: Error: No FASTQ files found after fastq-dump
    • Ensure that the SRA Run ID is valid.
    • Check your internet connection for downloading SRA data.
2. Low Read Count
Error Message: Error: Run ID has less than 500 reads and the current sample was skipped
    • Some SRA runs contain few reads because of low read quality or a failed sequencing run.
    • Consider increasing the read limit in fastq-dump.
3. Alignment Issues
Error Message: Error in alignment for <ID>
    • Ensure mafft is installed and available in the system path.
    • Check if the reference FASTA file (ecoli.fa) is correctly formatted.
4. Coverage Too Low for HV Region Assignment
Error Message: No valid alignment positions
    • Reads may not align properly due to sequencing quality or reference differences.
    • Check whether trimming parameters in fastp are too strict.
5. Missing Columns in TSV Processing
Error Message: Error: Input TSV file is missing required columns
    • Ensure the input TSV file matches the expected column structure.
    • Check if the file was manually edited and lost required fields.
6. Model Prediction Issues
Error Message: Model error: No module named 'sklearn'
    • Ensure your container or environment includes scikit-learn.
    • Verify the Random Forest model path with --model is correct and readable.
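
Several of the errors above come down to a missing executable or Python module. The sketch below, which is not part of HVRLocator, checks for the dependencies named in the installation commands. The executable names (for example `fastq-dump` from sra-tools) and import names (`Bio` for biopython, `sklearn` for scikit-learn) are assumptions about what the tool actually calls.

```python
import importlib.util
import shutil

# Taken from the conda install commands in this README. ASSUMPTION: these
# are the executable and import names that HVRLocator relies on.
TOOLS = ["fastq-dump", "fastp", "vsearch", "mafft"]
MODULES = ["sklearn", "joblib", "Bio", "numpy", "scipy"]


def missing_dependencies(tools=TOOLS, modules=MODULES):
    """Return the executables and Python modules that cannot be found."""
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return missing_tools, missing_modules


if __name__ == "__main__":
    tools, modules = missing_dependencies()
    print("Missing executables:", ", ".join(tools) or "none")
    print("Missing Python modules:", ", ".join(modules) or "none")
```

Run it in the same environment or container in which HVRLocator itself runs, since that is where the dependencies must be visible.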

Contributing

Contributions to HVRLocator are welcome. Please feel free to submit a Pull Request.

License

This repository is licensed under the terms of the MIT License. The files "rf_model.pkl", "ecoli.fa", "mafft.fa" and "query.fa" are released under the Creative Commons CC0 1.0 Universal Public Domain Dedication (CC0 1.0 Universal).
