Seqwin

Seqwin is a lightning‑fast, memory‑efficient toolkit for discovering signature sequences (genomic markers) that balance high sensitivity with high specificity. It builds a minimizer‑based pan‑genome graph across target and neighboring non‑target genomes and extracts signature sequences using a novel graph algorithm.

Installation

Bioconda (recommended)

Prerequisites

Linux, macOS, or Windows via WSL
x64 or ARM64
conda (install with miniforge or miniconda)

1. Create a new Conda environment "seqwin" and install Seqwin via Bioconda

conda create -n seqwin seqwin \
  --channel conda-forge \
  --channel bioconda \
  --strict-channel-priority

Tip

Setting channel priority is important for Bioconda packages to function properly. To persist the channel priority settings for all future package installations, modify your ~/.condarc file. For more information, check the Bioconda documentation.
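As a sketch of what persisting these settings could look like, the commands below follow the channel setup described in the Bioconda documentation. They write to your ~/.condarc, so review them before running; the channel list you want may differ.

```shell
# Each --add puts the channel at the top of the list,
# so the channel added last (conda-forge) gets the highest priority
conda config --add channels bioconda
conda config --add channels conda-forge
# Enforce strict channel priority for all future installs
conda config --set channel_priority strict
```

With these settings in place, the --channel and --strict-channel-priority options no longer need to be repeated on every install command.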

2. Activate the environment and verify the install

conda activate seqwin
seqwin --help

Manual installation

1. Install dependencies

python >=3.10
numpy >=2
numba
pandas >=2
networkx
pydantic
typer
btllib
mash
blast
ncbi-datasets-cli

2. Clone this repository and install with pip

git clone https://github.com/treangenlab/Seqwin.git
cd Seqwin
pip install .
seqwin --help

Quick start

Identify signatures by providing one or more target taxa and non-target neighboring taxa.

seqwin \
  -t "Salmonella enterica subsp. enterica" \
  -n "Salmonella enterica subsp. salamae" \
  -n "Salmonella bongori" \
  -p 8

Outputs are written to seqwin-out/ in your working directory (see Outputs). Taxon names must exactly match the names used in NCBI Taxonomy.

Alternatively, target and non-target genomes can each be provided as a text file of file paths. Each line of the text file should be the path to one genome file in FASTA format (plain text or gzip-compressed).

seqwin --tar-paths targets.txt --neg-paths non-targets.txt

Below is an example of targets.txt or non-targets.txt:

./genomes/GCA_003718275.1_ASM371827v1_genomic.fna
/data/genomes/GCA_000389055.1_46.E.09_genomic.fna
/data/genomes/GCA_008363955.1_ASM836395v1_genomic.fna.gz
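One way to generate such a list is with find. The sketch below is only an illustration: the helper name list_genomes, the directory layout in the usage comments, and the set of file extensions are assumptions, not part of Seqwin.

```shell
# list_genomes DIR: print the absolute path of every FASTA file under DIR, one per line.
# The extensions matched here (.fna, .fa, .fasta, optionally .gz) are an assumption;
# adjust them to match your files.
list_genomes() {
  dir=$(cd "$1" && pwd) || return 1
  find "$dir" -type f \( \
    -name '*.fna' -o -name '*.fa' -o -name '*.fasta' -o \
    -name '*.fna.gz' -o -name '*.fa.gz' -o -name '*.fasta.gz' \) | sort
}

# Example (hypothetical directories):
#   list_genomes ./genomes/targets     > targets.txt
#   list_genomes ./genomes/non-targets > non-targets.txt
```

Writing absolute paths keeps the list valid regardless of the directory Seqwin is later run from.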

Expected runtime (with -p 20): about 10 minutes for ~500 bacterial genomes with default settings, or for ~15,000 bacterial genomes with --no-blast and --no-mash.

Run seqwin --help to see the full command line interface.

Key parameters

Node penalty threshold

The node penalty threshold (--penalty-th) controls the sensitivity and specificity of output signatures. Higher values allow longer and/or more numerous signatures, but may reduce sensitivity, specificity, or both.

When --penalty-th is not specified, it is automatically estimated with k-mer sketches. MinHash sketches (calculated with Mash) are used by default. If --no-mash is provided, minimizer sketches are used instead (faster but might be biased). Use --stringency or -s to tune this auto-estimated threshold (higher stringency lowers the threshold).

Signature evaluation

By default, output signatures are checked with BLAST against target genomes for sensitivity (conservation) and against non-target genomes for specificity (divergence). Signatures are sorted by conservation and divergence, both of which are reported in signatures.csv. Evaluation can be turned off with --no-blast for a shorter runtime. In that case, output signatures are still very likely to be sensitive and specific, but they are not validated a second time with BLAST.

Minimizer sketch

--kmerlen (default 21): shorter k‑mers might be helpful for genomes with more sequence variations (e.g. viruses).

--windowsize (default 200): smaller windows generate more minimizers and increase resolution at the cost of runtime & memory.

Performance tuning

Use --threads / -p to leverage multiple CPU cores. Add --no-mash and --no-blast for the shortest runtime.
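As an illustration, the sketch below runs Seqwin in this fastest configuration on all available cores. The flags are the ones described in this README; the cpu_count helper, the use of getconf to detect the core count, and the input file names are assumptions made for this example.

```shell
# Print the number of online CPU cores (getconf works on Linux and macOS)
cpu_count() {
  getconf _NPROCESSORS_ONLN
}

# Fastest configuration: all cores, no Mash sketches, no BLAST evaluation.
# Guarded so the snippet does nothing if seqwin or the input lists are missing.
if command -v seqwin >/dev/null 2>&1 && [ -f targets.txt ] && [ -f non-targets.txt ]; then
  seqwin --tar-paths targets.txt --neg-paths non-targets.txt \
    -p "$(cpu_count)" --no-mash --no-blast
fi
```

Keep in mind that skipping Mash changes how the penalty threshold is estimated, and skipping BLAST means signatures.csv is produced without the BLAST-based evaluation.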

Outputs

Seqwin creates the following files/directories inside the directory specified by --title (default seqwin-out/):

Name	Description
`signatures.fasta`	Signature sequences (top candidates are listed first)
`signatures.csv`	Tabulated metrics for each signature
`assemblies.csv`	Mapping of internal genome IDs to file paths (used in `signatures.fasta`)
`blastdb/`	BLAST database built from all input genomes
`assemblies/`	Genomes downloaded from NCBI
`results.seqwin`	Serialized run snapshot (Python pickle)
`config.json`	Full run configuration
`seqwin.log`	Execution log
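
For a quick look at a finished run, standard shell tools are enough. The sketch below assumes the default output directory seqwin-out/ and that signatures.csv starts with a header row (an assumption; its column names are not listed here). The helper name count_signatures is hypothetical.

```shell
# count_signatures FASTA: number of records, i.e. header lines starting with '>'
count_signatures() {
  grep -c '^>' "$1"
}

# Hypothetical usage after a run with default settings:
#   count_signatures seqwin-out/signatures.fasta
#   head -n 5 seqwin-out/signatures.csv    # first rows of the metrics table
```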

License

Seqwin is released under the GPL 3.0. See LICENSE for details.
