This repository contains the code for PArtial Read Matching with Inexpensive K-mers (PARMIK), a fast and memory-efficient tool for identifying the "Partial Match" region between sequencing reads (e.g., aligning a 150 bp query from a newly discovered genome against a 150 bp read from a metagenomic dataset,) where the boundaries of the query and read do not necessarily align, and the overlapping region can be small, including a notable number of matches and a few mismatches (i.e., substitutions and InDels.) PARMIK indexes the metagenomic dataset to a storage-efficient Inexpensive K-mer Index (IKI), excluding highly repetitive k-mers, to keep its memory footprint small. PARMIK supports gapped and local alignment and outputs a set of alignments in SAM format. To enhance alignment speed, PARMIK supports multi-threading. Check out our paper for more details.
dataPrepare/: Contains scripts to extract contigs from read and query dataset filesscripts/: Contains scripts to evaluate experiment resultssraDownload/: Contains scripts to download SRA filessrc/: Contains source code for the project
Before you begin, ensure you have the following installed on your system, (section):
- Ubuntu: All testing has been done on Ubuntu 22.04+ Operating System.
- GCC: The GNU Compiler Collection, specifically g++9 which supports C++11 or later.
- Make: The build utility to automate the compilation.
- OpenMP: Support for parallel programming in C++.
- Python3 for running the scripts
To compile, use:
makeTo clean up all compiled files:
make cleanTo download datasets, we used SRA Toolkit (v3.0.7). Here is the command we used to download a metagenomic dataset (SRR12432009):
sratoolkit.3.0.7-ubuntu64/bin/fasterq-dump SRR12432009 -p --fasta --outdir <outputDir>Replace <outputDir> with the path to your desired output directory.
Once downloaded, you can extract unique reads from the metagenomic dataset using the following script:
python3 dataPrepare/readPrepare/sampleUniqueReads.py <metagenomic_filename> <read_count> <skip_N_flag> <header_substring> <input_fasta> <output_fasta>You can download assembled viral genomes from NCBI to generate query sequences. For example:
NC_045512 (SARS-CoV-2 reference genome)
NC_004718 (SARS-1 coronavirus)
After downloading the genome FASTA files, you can extract 150-bp query sequences using:
python3 dataPrepare/queryPrepare/kmerizingTheQueryFile.py <kmer_length> <genome_fasta> <output_queries_fasta> <genome_name>Here are some examples for how to use different PARMIK modes:
To execute PARMIK in the indexing mode, you can execute a command like the following, replacing <> with your specific paths and values:
./parmik -a 0 -c <contig_size> -t <inexpensive_k-mer_threshold> -k <k-mer_size> -i <read_count> -x -r <metagenomic_read_database_address> -f <k-mer_index_address>To execute PARMIK in the alignment mode, you can execute a command like the following, replacing <> with your specific paths and values:
./parmik -a 1 -s <region_size> -c <contig_size> -m <min_exact_match_size> -t <inexpensive_k-mer_threshold> -k <k-mer_size> -d <percentage_identity> -i <read_count> -j <query_count> -x -r <metagenomic_read_database_address> -q <query_file_address> -f <k-mer_index_address> -o <output_directory> -p <penalty_file_address>To execute PARMIK in the baseline mode (brute force Smith-Waterman), you can execute a command like the following, replacing <> with your specific paths and values:
./parmik -a 3 -s <region_size> -c <contig_size> -t <inexpensive_k-mer_threshold> -k <k-mer_size> -d <percentage_identity> -i <read_count> -j <query_count> -r <metagenomic_read_database_address> -q <query_file_address> -o <output_directory> -p <penalty_file_address>In order to compare PARMIK to
To execute PARMIK in the compare mode, you can execute a command like the following, replacing <> with your specific paths and values:
./parmik -a 2 -l <other_tool_name> -s <region_size> -c <contig_size> -m <min_exact_match_size> -t <inexpensive_k-mer_threshold> -k <k-mer_size> -d <percentage_identity> -i <read_count> -j <query_count> -x -r <metagenomic_read_database_address> -q <query_file_address> -f <k-mer_index_address> -o <output_directory> -b <other_tool_alignment_file_address> -p <penalty_file_address>To execute PARMIK in the compare baseline mode, you can execute a command like the following, replacing <> with your specific paths and values:
./parmik -a 4 -l <other_tool_name> -s <region_size> -c <contig_size> -m <min_exact_match_size> -t <inexpensive_k-mer_threshold> -k <k-mer_size> -d <percentage_identity> -i <read_count> -j <query_count> -x -r <metagenomic_read_database_address> -q <query_file_address> -f <k-mer_index_address> -o <output_directory> -b <other_tool_alignment_file_address> -p <penalty_file_address>Below are the PARMIK's parameters in alphabetical order:
-a,--mode: PARMIK mode (required)- PARMIK operation mode. It can get these values:
PARMIK_MODE_INDEX (0)PARMIK_MODE_ALIGN (1)PARMIK_MODE_COMPARE (2)PARMIK_MODE_BASELINE (3)PARMIK_MODE_CMP_BASELINE (4)
- PARMIK operation mode. It can get these values:
-b,--toolFileAddress: Other Tool Alignment File Address (required for compare mode)- The address of the output of the other tool (BLAST, BWA, etc)
-c,--contigSize: Contig Size (default = 150)- Length of the contigs
-d,--percentageIdentity: Percentage Identity (default = 90%)- Minimum Percentage of Identity in the alignment
-e,--editDistance: Max Edit Distance (i/d/s) (default = 2)- Maximum edit distance (including Substitutions and InDels) allowed in the alignment
-f,--ikiAddress: Inexpensive K-mer Index Address (required)- The path to the Inexpensive K-mer Index (IKI)
-h,--help: Help-i,--readCount: Number of Metagenomic Reads (default = 1)- Number of reads in the Metagenomic dataset
-j,--queryCount: Number of Queries (default = 1)- Number of queries in the Query dataset
-k,--kmerLen: K-mer Length (default = 16)- Length of the K-mer
-l,--otherTool: The Other Tool Name (required for compare mode)- Name of other tool (bwa, blast, etc)
-m,--minExactMatchLen: Minimum Exact Match Length (default = 0)- Min length of exact match required for alignment
- minExactMatchLen = (M + K - 1), where M is the minimum number of shared k-mers and K is the k-mer size.
-n,--kmerRangesFileAddress: K-mer Ranges File Address- K-mer ranges file address required for calculating the inexpensive k-mer threshold
-o,--outputDir: Output Directory (required for all modes except index mode)- Directory to dump the alignment results
-p,--penaltyFileAddress: Penalty File Address- The penalty score sets used for the alignment step
-q,--query: Query File Address (required)- Path to the query dataset file
-r,--read: Metageomic Read Data Base Address (required)- Path to the read metagenomic dataset file
-s,--regionSize: Region Size (default = 48)- Minimum size of the alignment
-t,--cheapKmerThreshold: Cheap (Inexpensive) k-mer Threshold (required)- -t 0: includes all k-mers in the IKI (Inexpensive K-mer Index).
-u,--isSecondChanceOff: Turn Second Chance Off- Turn off the second chance
- This is a flag. When included, it disables the second chance mechanism (sets the flag to true).
-v,--verboseLog: Verbose Logging (default = false)-w,--numThreads: Number of Threads (default = 1)-x,--isIndexOffline: Is the read index offline- This is a flag. When included, it enables the write/read the IKI to/from storage.
- If not included, PARMIK creates and use the IKI on the fly
-z,--baselineBaseAddress: BaseLine file base address (required for compare mode)- Base address of the baseline alignment outputs
To display help for general usage:
./parmik --helpPlease cite the following paper if you find this repository useful.
@article {Baradaran2024.10.14.618242,
author = {Baradaran, Morteza and Layer, Ryan M and Skadron, Kevin},
title = {PARMIK: PArtial Read Matching with Inexpensive K-mers},
elocation-id = {2024.10.14.618242},
year = {2024},
doi = {10.1101/2024.10.14.618242},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Environmental metagenomic sampling is instrumental in preparing for future pandemics by enabling early identification of potential pathogens and timely intervention strategies. Novel pathogens are a major concern, especially for zoonotic events. However, discovering novel pathogens often requires genome assembly, which remains a significant bottleneck. A robust metagenomic sampling that is directly searchable with new infection samples would give us a real-time understanding of outbreak origins dynamics. In this study, we propose PArtial Read Matching with Inexpensive K-mers (PARMIK), which is a search tool for efficiently identifying similar sequences from a patient sample (query) to a metagenomic sample (read). For example, at 90\% identity between a query and a read, PARMIK surpassed BLAST, providing up to 21\% higher recall. By filtering highly frequent k-mers, we reduced PARMIK{\textquoteright}s index size by over 50\%. Moreover, PARMIK identified longer alignments faster than BLAST, peaking at 1.57x, when parallelizing across 32 cores.Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2024/10/17/2024.10.14.618242},
eprint = {https://www.biorxiv.org/content/early/2024/10/17/2024.10.14.618242.full.pdf},
journal = {bioRxiv}
}