HVRLocator is a workflow that identifies the hypervariable region(s) spanned by amplicon sequence variants (ASVs) or public SRA runs (SRR accessions). It aligns the query sequences to a full-length E. coli 16S rRNA gene reference and determines the spanned region from the alignment positions.
- Download the Singularity image (hvrlocator.sif) to a local folder from https://cloud.sylabs.io/library/jsaraiva/repo/hvrlocator
OR
Paste the following command in your terminal:
singularity pull --arch amd64 library://jsaraiva/repo/hvrlocator:hvr
HVRLocator can process both SRA accession numbers and FASTA files containing ASV sequences. For help on a specific subcommand, run one of the following:
singularity exec hvrlocator.sif hvrlocator sra --help
singularity exec hvrlocator.sif hvrlocator align --help
To process an SRA run:
singularity exec hvrlocator.sif hvrlocator sra -r SRR1585194 -o /path/to/output/folder
You can specify the location of the E. coli reference file if it's not in the default location:
singularity exec hvrlocator.sif hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output/folder
To also predict primer presence with the Random Forest model, add the model path (this also applies when processing a list of runs):
singularity exec hvrlocator.sif hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output/folder -m /path/to/rf_model.pkl
To process a list of SRA runs:
singularity exec hvrlocator.sif hvrlocator sra -l /path/to/list.txt -o /path/to/output/folder
You can specify the location of the E. coli reference file if it's not in the default location:
singularity exec hvrlocator.sif hvrlocator sra -l /path/to/list.txt --ecoli /path/to/ecoli.fa -o /path/to/output/folder
Note: The list file should contain one SRA run per line.
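Since the run list must contain one run per line, it can be worth checking the file before starting a long batch. The following is a minimal sketch, not part of HVRLocator: the helper name `read_run_list` is hypothetical, and the accession pattern (SRR, ERR, or DRR followed by digits) is an assumption about what a valid run ID looks like.

```python
import re
from pathlib import Path

# Assumption: run accessions are SRR/ERR/DRR followed by digits only.
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def read_run_list(path):
    """Split a one-per-line run list into (valid, invalid) entries.

    Hypothetical helper for illustration; HVRLocator does not ship it.
    """
    valid, invalid = [], []
    for line in Path(path).read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # blank lines are ignored
        if RUN_PATTERN.match(entry):
            valid.append(entry)
        else:
            invalid.append(entry)  # e.g. two IDs on one line, or a typo
    return valid, invalid
```

Any entries returned in the second list (for example, two accessions on the same line) can be fixed before the file is passed to `-l`.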
To process a FASTA file containing ASV sequences:
hvrlocator fasta -f path/to/your/asv_sequences.fasta --ecoli /path/to/ecoli.fa -o /path/to/output/folder
To modify the coverage threshold (default = 0.6), add the "-t" flag at the end of the command (e.g. -t 0.7).
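When the SRA commands above need to be run from a script, the argument list can be assembled programmatically. This is only a sketch: `build_sra_command` is a hypothetical helper, it uses only the flags shown in this README (`-r`, `-l`, `--ecoli`, `-o`, `-m`), and it assumes the image file is named hvrlocator.sif.

```python
import shlex


def build_sra_command(output_dir, run_id=None, run_list=None,
                      ecoli=None, model=None, image="hvrlocator.sif"):
    """Build the argument list for `hvrlocator sra` inside the container.

    Hypothetical helper; exactly one of run_id / run_list must be given.
    """
    if (run_id is None) == (run_list is None):
        raise ValueError("Provide exactly one of run_id or run_list")
    cmd = ["singularity", "exec", image, "hvrlocator", "sra"]
    cmd += ["-r", run_id] if run_id else ["-l", run_list]
    if ecoli:
        cmd += ["--ecoli", ecoli]  # non-default reference location
    cmd += ["-o", output_dir]
    if model:
        cmd += ["-m", model]       # enables primer-presence prediction
    return cmd


cmd = build_sra_command("/path/to/output/folder", run_id="SRR1585194")
print(shlex.join(cmd))
# singularity exec hvrlocator.sif hvrlocator sra -r SRR1585194 -o /path/to/output/folder
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual paths.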
To install HVRLocator locally, follow these steps:
- Create a new conda environment:
mamba create -n <ENV_NAME> -y -c bioconda -c conda-forge python=3.9 sra-tools fastp biopython numpy scipy vsearch
Note: Replace <ENV_NAME> with a name of your choosing.
- Activate the environment, clone the repository, and install the package:
source activate <ENV_NAME> && \
cd <PATH to FOLDER WHERE YOU WANT THE GITHUB REPO TO BE LOCATED> && \
git clone https://github.com/fbcorrea/hvrlocator && \
cd hvrlocator && \
pip install -e .
mamba install -c bioconda -c conda-forge mafft scikit-learn==1.1.3 joblib
HVRLocator can process both SRA accession numbers and FASTA files containing ASV sequences.
To process an SRA run:
hvrlocator sra -r SRR1585194 -o /path/to/output_folder
You can specify the location of the E. coli reference file if it's not in the default location:
hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output_folder
To use the Random Forest model to also predict the presence of a primer please use the following:
hvrlocator sra -r SRR1585194 --ecoli /path/to/ecoli.fa -o /path/to/output_folder -m /path/to/rf_model.pkl
To process a list of SRA runs (don't forget to add the -m /path/to/rf_model.pkl if you wish to also predict primer presence):
hvrlocator sra -l /path/to/SRA_list.txt -o /path/to/output_folder
To process a FASTA file containing ASV sequences:
hvrlocator fasta -f path/to/your/asv_sequences.fasta
The following columns are shown in the output table:
• Sample_ID: Identifier of the processed sample.
• Primer_Presence: "Yes", "No", or "NA" depending on model output and input quality.
• Score_Primer_Presence: Probability output from the Random Forest model.
• Median/Avg/Min/Max Alignment Start/End: Various statistics on read alignment positions.
• Predicted HV Region: Based on alignment range irrespective of threshold.
• Coverage-based HV Region: Based on which V-regions passed the specified coverage threshold.
• Coverage_HV_region: Coverage value of V-regions.
• Warnings: Alerts about low coverage regions.
• Cov_V1 to Cov_V9: Coverage values (0-1) for each HV region.
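To make the relationship between the Cov_V1 to Cov_V9 columns and the coverage threshold concrete, here is a minimal sketch that selects the regions at or above a threshold from one row of the output table. It is an illustration only: the function name is hypothetical, the row is assumed to be a mapping of column names to values (as produced by `csv.DictReader`), and the use of `>=` rather than `>` is an assumption about how the threshold is applied.

```python
def coverage_based_regions(row, threshold=0.6):
    """Return the V-regions whose coverage is at or above the threshold.

    `row` maps column names (Cov_V1 ... Cov_V9) to coverage values in 0-1.
    Hypothetical helper; HVRLocator's own logic may differ in detail.
    """
    passed = []
    for i in range(1, 10):
        value = float(row[f"Cov_V{i}"])  # values may arrive as strings
        if value >= threshold:
            passed.append(f"V{i}")
    return passed


# Made-up example row, not real HVRLocator output.
example_row = {f"Cov_V{i}": "0.0" for i in range(1, 10)}
example_row.update({"Cov_V3": "0.95", "Cov_V4": "0.72", "Cov_V5": "0.40"})

print(coverage_based_regions(example_row))       # ['V3', 'V4']
print(coverage_based_regions(example_row, 0.3))  # ['V3', 'V4', 'V5']
```

Lowering the threshold admits partially covered regions such as V5 in this example, which is why the coverage-based call can differ from the range-based prediction.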
A detailed description of the Random Forest model generation is available here.
hvrlocator.py: The main script that handles both SRA and FASTA processing.
setup.py: Used for installing the package.
1. FastQ Files Not Found
Error Message: Error: No FASTQ files found after fastq-dump
• Ensure that the SRA Run ID is valid.
• Check your internet connection for downloading SRA data.
2. Low Read Count
Error Message: Error: Run ID has less than 500 reads and the current sample was skipped
• Some SRA runs may have low-quality reads or failed sequencing runs.
• Consider increasing the read limit in fastq-dump.
3. Alignment Issues
Error Message: Error in alignment for <ID>
• Ensure mafft is installed and available in the system path.
• Check if the reference FASTA file (ecoli.fa) is correctly formatted.
4. Coverage Too Low for HV Region Assignment
Error Message: No valid alignment positions
• Reads may not align properly due to sequencing quality or reference differences.
• Check whether trimming parameters in fastp are too strict.
5. Missing Columns in TSV Processing
Error Message: Error: Input TSV file is missing required columns
• Ensure the input TSV file matches the expected column structure.
• Check if the file was manually edited and lost required fields.
6. Model Prediction Issues
Error Message: Model error: No module named 'sklearn'
• Ensure your container or environment includes scikit-learn.
• Verify the Random Forest model path with --model is correct and readable.
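Several of the issues above come down to a missing tool or Python module. The sketch below reports which ones are visible from the current environment. It is not part of HVRLocator: `check_environment` is a hypothetical helper, and the default tool and module names are taken from the installation commands in this README (`Bio` is the import name of biopython, `sklearn` that of scikit-learn).

```python
import importlib.util
import shutil


def check_environment(tools=("mafft", "fastp", "vsearch", "fastq-dump"),
                      modules=("sklearn", "joblib", "Bio")):
    """Report whether each executable and Python module can be found.

    Returns a dict mapping each name to True (found) or False (missing).
    """
    report = {}
    for tool in tools:
        # shutil.which searches the directories listed in PATH
        report[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec returns None when a top-level module is not installed
        report[module] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it inside the activated conda environment (or inside the container) so that it inspects the same PATH and Python installation that HVRLocator uses.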
Contributions to HVRLocator are welcome. Please feel free to submit a Pull Request.
This repository is licensed under the terms of the MIT License. The files "rf_model.pkl", "ecoli.fa", "mafft.fa" and "query.fa" are released under the Creative Commons CC0 1.0 Universal Public Domain Dedication (CC0 1.0 Universal).