Seqwin

Seqwin is a lightning‑fast, memory‑efficient toolkit for discovering signature sequences (genomic markers) that balance high sensitivity with high specificity. It builds a minimizer‑based pan‑genome graph across target and neighboring non‑target genomes and extracts signature sequences using a novel graph algorithm.


Table of contents

  1. Installation
  2. Quick start
  3. Key parameters
  4. Outputs
  5. License

Installation

Bioconda (recommended)

Prerequisites

1. Create a new Conda environment "seqwin" and install Seqwin via Bioconda

conda create -n seqwin seqwin \
  --channel conda-forge \
  --channel bioconda \
  --strict-channel-priority

Tip

Setting channel priority is important for Bioconda packages to function properly. You may also persist channel priority settings for all package installations by modifying your ~/.condarc file. For more information, check the Bioconda documentation.

2. Activate the environment and verify the install

conda activate seqwin
seqwin --help

Manual installation

1. Install dependencies

python >=3.10
numpy >=2
numba
pandas >=2
networkx
pydantic
typer
btllib
mash
blast
ncbi-datasets-cli

2. Clone this repository and install with pip

git clone https://github.com/treangenlab/Seqwin.git
cd Seqwin
pip install .
seqwin --help

Quick start

Identify signatures by providing one or more target taxa and non-target neighboring taxa.

seqwin \
  -t "Salmonella enterica subsp. enterica" \
  -n "Salmonella enterica subsp. salamae" \
  -n "Salmonella bongori" \
  -p 8

Outputs are written to seqwin-out/ in your working directory (see Outputs). Taxon names must exactly match the names used in NCBI Taxonomy.
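When runs are launched from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of Seqwin: build_seqwin_command is a hypothetical helper, and it assumes that -t can be repeated in the same way -n is repeated in the example.

```python
import shlex


def build_seqwin_command(targets, non_targets, threads=1):
    """Assemble a seqwin argument list from taxon names (hypothetical helper)."""
    if not targets:
        raise ValueError("at least one target taxon is required")
    cmd = ["seqwin"]
    for name in targets:
        cmd += ["-t", name]  # assumption: -t may be repeated like -n
    for name in non_targets:
        cmd += ["-n", name]
    cmd += ["-p", str(threads)]
    return cmd


cmd = build_seqwin_command(
    ["Salmonella enterica subsp. enterica"],
    ["Salmonella enterica subsp. salamae", "Salmonella bongori"],
    threads=8,
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with taxon names that contain spaces.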

Alternatively, target or non-target genomes can be provided as a text file of file paths. Each line of the text file should be the path to one genome file in FASTA format (plain text or gzip-compressed).

seqwin --tar-paths targets.txt --neg-paths non-targets.txt

Below is an example of targets.txt or non-targets.txt

./genomes/GCA_003718275.1_ASM371827v1_genomic.fna
/data/genomes/GCA_000389055.1_46.E.09_genomic.fna
/data/genomes/GCA_008363955.1_ASM836395v1_genomic.fna.gz
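A path list like the one above can be generated from a directory of genome files. This is a small sketch under stated assumptions: write_path_list is a hypothetical helper, the "genomes" directory name is illustrative, and the suffix list is a guess at common FASTA extensions rather than something Seqwin prescribes.

```python
from pathlib import Path

# Assumed set of FASTA file extensions; adjust to match your files.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta", ".fna.gz", ".fa.gz", ".fasta.gz")


def write_path_list(genome_dir, list_file):
    """Write one absolute genome path per line; return the number of paths."""
    paths = sorted(
        p.resolve()
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
    )
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


# Example call (hypothetical directory name):
# write_path_list("genomes", "targets.txt")
```

Absolute paths keep the list valid even if seqwin is started from a different working directory.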

Expected runtime (with -p 20): about 10 minutes for ~500 bacterial genomes with default settings, or for ~15,000 bacterial genomes with --no-blast and --no-mash.

Run seqwin --help to see the full command line interface.

Key parameters

Node penalty threshold

The node penalty threshold (--penalty-th) controls the sensitivity and specificity of output signatures. Higher values allow longer or more signatures, but might reduce sensitivity, specificity, or both.

When --penalty-th is not specified, it is automatically estimated with k-mer sketches. MinHash sketches (calculated with Mash) are used by default. If --no-mash is provided, minimizer sketches are used instead (faster but might be biased). Use --stringency or -s to tune this auto-estimated threshold (higher stringency lowers the threshold).

Signature evaluation

By default, output signatures are checked with BLAST against target genomes for sensitivity (conservation) and against non-target genomes for specificity (divergence). Signatures are sorted by conservation and divergence; both metrics are reported in signatures.csv. Evaluation can be turned off with --no-blast to shorten the runtime. In that case, output signatures are still very likely to be sensitive and specific, but they do not receive the second, BLAST-based validation.
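The conservation and divergence metrics can be used to shortlist signatures after a run. The sketch below is illustrative only. The column names "conservation" and "divergence", the 0 to 1 scale, and the idea that higher divergence means higher specificity are all assumptions, so check the header of your own signatures.csv first. The demo rows are made up.

```python
import csv
import io


def filter_signatures(csv_file, min_conservation, min_divergence,
                      cons_col="conservation", div_col="divergence"):
    """Keep rows whose two metrics meet both cutoffs (column names are assumed)."""
    kept = []
    for row in csv.DictReader(csv_file):
        if (float(row[cons_col]) >= min_conservation
                and float(row[div_col]) >= min_divergence):
            kept.append(row)
    return kept


# Made-up rows for illustration only; not real Seqwin output.
demo = io.StringIO(
    "id,conservation,divergence\n"
    "sig1,0.99,0.95\n"
    "sig2,0.80,0.99\n"
)
print([row["id"] for row in filter_signatures(demo, 0.9, 0.9)])
```

With a real run, open signatures.csv and pass the file handle in place of the in-memory demo.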

Minimizer sketch

--kmerlen (default 21): shorter k‑mers might be helpful for genomes with more sequence variations (e.g. viruses).

--windowsize (default 200): smaller windows generate more minimizers and increase resolution at the cost of runtime & memory.
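To illustrate how k-mer length and window size interact, here is a minimal, generic window-minimizer sketch. It is not Seqwin's implementation. It assumes the window is counted in consecutive k-mers and picks the lexicographically smallest k-mer in each window, whereas real tools typically hash k-mers and may define the window differently.

```python
def minimizers(seq, k, w):
    """Return (position, k-mer) minimizers; w is the number of k-mers per window."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Smallest k-mer in the window; ties resolve to the leftmost one.
        offset = min(range(w), key=lambda j: window[j])
        hit = (start + offset, window[offset])
        # Adjacent windows often share a minimizer; record it only once.
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


demo_seq = "GATTACAGATTACACCGTAGGCTTAACG"  # arbitrary toy sequence
for w in (2, 4, 8):
    print(w, len(minimizers(demo_seq, k=5, w=w)))
```

Every minimizer chosen with a large window is also chosen with a smaller one, which is why shrinking the window can only keep or increase the number of minimizers.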

Performance tuning

Use --threads / -p to leverage multiple CPU cores. Add --no-mash and --no-blast for the fastest runtime.

Outputs

Seqwin creates the following files/directories inside the directory specified by --title (default seqwin-out/):

Name              Description
signatures.fasta  Signature sequences (top candidates are listed first)
signatures.csv    Tabulated metrics for each signature
assemblies.csv    Mapping of internal genome IDs to file paths (used in signatures.fasta)
blastdb/          BLAST database built from all input genomes
assemblies/       Genomes downloaded from NCBI
results.seqwin    Serialized run snapshot (Python pickle)
config.json       Full run configuration
seqwin.log        Execution log
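For downstream scripting, signatures.fasta can be read with a few lines of standard-library Python. This is a minimal sketch: read_fasta is a hypothetical helper, and because the exact header layout of signatures.fasta is not described here, the header is returned as a raw string.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Example use, assuming the default output directory:
# with open("seqwin-out/signatures.fasta") as fh:
#     for header, seq in read_fasta(fh):
#         print(header, len(seq))
```

Since top candidates are listed first, reading only the first few records is often enough for a quick look.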

License

Seqwin is released under the GNU General Public License v3.0 (GPL 3.0). See LICENSE for details.
