Seqwin is a lightning‑fast, memory‑efficient toolkit for discovering signature sequences (genomic markers) that balance high sensitivity with high specificity. It builds a minimizer‑based pan‑genome graph across target and neighboring non‑target genomes and extracts signature sequences using a novel graph algorithm.
Prerequisites
1. Create a new Conda environment "seqwin" and install Seqwin via Bioconda
conda create -n seqwin seqwin \
--channel conda-forge \
--channel bioconda \
--strict-channel-priority
Tip: Setting channel priority is important for Bioconda packages to function properly. You can also persist the channel priority setting for all package installations by modifying your ~/.condarc file. For more information, check the Bioconda documentation.
2. Activate the environment and verify the install
conda activate seqwin
seqwin --help
1. Install dependencies
python >=3.10
numpy >=2
numba
pandas >=2
networkx
pydantic
typer
btllib
mash
blast
ncbi-datasets-cli
2. Clone this repository and install with pip
git clone https://github.com/treangenlab/Seqwin.git
cd Seqwin
pip install .
seqwin --help
Identify signatures by providing one or more target taxa and non-target neighboring taxa.
seqwin \
-t "Salmonella enterica subsp. enterica" \
-n "Salmonella enterica subsp. salamae" \
-n "Salmonella bongori" \
-p 8
Outputs are written to seqwin-out/ in your working directory (see Outputs). Taxon names must exactly match names in NCBI Taxonomy.
Alternatively, a list of target or non-target genomes can be provided as a text file of file paths. Each line of the text file should be the path to a genome file in FASTA format (plain text or compressed in gzip).
seqwin --tar-paths targets.txt --neg-paths non-targets.txt
Below is an example of targets.txt or non-targets.txt:
./genomes/GCA_003718275.1_ASM371827v1_genomic.fna
/data/genomes/GCA_000389055.1_46.E.09_genomic.fna
/data/genomes/GCA_008363955.1_ASM836395v1_genomic.fna.gz
Expected runtime (with -p 20): about 10 minutes for ~500 bacterial genomes with default settings, or for ~15k bacterial genomes with --no-blast and --no-mash.
Run seqwin --help to see the full command line interface.
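The path lists described above can be generated rather than typed by hand. The following is a minimal shell sketch, not part of Seqwin: the helper names make_path_list and check_path_list are hypothetical, and the .fna / .fna.gz extension filter is an assumption based on the example paths, so adjust it to match your files.

```shell
# Hypothetical helper: write one path per line for every .fna or .fna.gz
# file found under a directory (the extension filter is an assumption).
make_path_list() {
    find "$1" -type f \( -name '*.fna' -o -name '*.fna.gz' \) | sort > "$2"
}

# Hypothetical helper: report listed paths that are missing or empty.
# Returns non-zero if any path fails the check.
check_path_list() {
    rc=0
    while IFS= read -r genome_path || [ -n "$genome_path" ]; do
        [ -z "$genome_path" ] && continue      # skip blank lines
        if [ ! -s "$genome_path" ]; then
            echo "missing or empty: $genome_path" >&2
            rc=1
        fi
    done < "$1"
    return "$rc"
}

# Example usage, assuming target genomes live under ./genomes
if [ -d ./genomes ]; then
    make_path_list ./genomes targets.txt
    check_path_list targets.txt && echo "targets.txt looks usable"
fi
```

The resulting file can then be passed to --tar-paths or --neg-paths as shown earlier.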
The node penalty threshold (--penalty-th) controls the sensitivity and specificity of output signatures. Higher values allow longer / more signatures, but might reduce sensitivity and/or specificity.
When --penalty-th is not specified, it is automatically estimated with k-mer sketches. MinHash sketches (calculated with Mash) are used by default. If --no-mash is provided, minimizer sketches are used instead (faster but might be biased). Use --stringency or -s to tune this auto-estimated threshold (higher stringency lowers the threshold).
By default, output signatures are checked with BLAST against target genomes for sensitivity (conservation) and against non-target genomes for specificity (divergence). Signatures are sorted by conservation and divergence, both of which are reported in signatures.csv. Evaluation can be turned off with --no-blast for a shorter runtime. In that case, output signatures are still very likely to be sensitive and specific, but they lack the additional BLAST validation.
--kmerlen (default 21): shorter k‑mers might be helpful for genomes with more sequence variations (e.g. viruses).
--windowsize (default 200): smaller windows generate more minimizers and increase resolution at the cost of runtime & memory.
Use --threads / -p to leverage multiple CPU cores. Add --no-mash and --no-blast for the fastest runtime.
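As an illustration of how the tuning options above combine, the sketch below assembles a command and prints it as a dry run. The values 17 and 100 are hypothetical, chosen only to show a shorter k-mer and a smaller window than the defaults. It is also an assumption that --kmerlen and --windowsize take their value as the next argument, so confirm the exact syntax with seqwin --help before running.

```shell
# Hypothetical values for illustration only, not recommendations.
KMERLEN=17        # shorter than the default of 21
WINDOWSIZE=100    # smaller than the default of 200

# Build the argument list; the flag names come from this README.
set -- seqwin --tar-paths targets.txt --neg-paths non-targets.txt \
    --kmerlen "$KMERLEN" --windowsize "$WINDOWSIZE" -p 8

SEQWIN_CMD="$*"
echo "$SEQWIN_CMD"    # dry run: print the command without executing it
# "$@"                # uncomment to actually run Seqwin
```

Printing the command first makes it easy to record the exact settings alongside the results before committing to a long run.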
Seqwin creates the following files/directories inside the directory specified by --title (default seqwin-out/):
| Name | Description |
|---|---|
| signatures.fasta | Signature sequences (top candidates are listed first) |
| signatures.csv | Tabulated metrics for each signature |
| assemblies.csv | Mapping of internal genome IDs to file paths (used in signatures.fasta) |
| blastdb/ | BLAST database built from all input genomes |
| assemblies/ | Genomes downloaded from NCBI |
| results.seqwin | Serialized run snapshot (Python pickle) |
| config.json | Full run configuration |
| seqwin.log | Execution log |
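For a quick look at a finished run, the output files listed above can be inspected with standard shell tools. This is a minimal sketch, not a Seqwin feature: the function name summarize_outputs is hypothetical, and it assumes that signatures.csv begins with a header row.

```shell
# Hypothetical helper: print the number of signatures and the first
# few rows of the metrics table for a Seqwin output directory.
summarize_outputs() {
    out_dir="$1"
    # Each FASTA record starts with '>', so header lines count signatures.
    n_signatures=$(grep -c '^>' "$out_dir/signatures.fasta" || true)
    echo "signatures: $n_signatures"
    # First four lines of the CSV (assumed: one header row, then the
    # top-ranked signatures, since top candidates are listed first).
    head -n 4 "$out_dir/signatures.csv"
}

# Example usage with the default output directory name
if [ -d seqwin-out ]; then
    summarize_outputs seqwin-out
fi
```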
Seqwin is released under the GPL-3.0 license. See LICENSE for details.