fq filters, generates, subsamples, and validates FASTQ files.
fq can be installed from precompiled binaries, via Bioconda, from source, or as a container image.
Precompiled binaries are built for modern Linux distributions
(x86_64-unknown-linux-gnu), macOS (x86_64-apple-darwin), and Windows
(x86_64-pc-windows-msvc). The Linux binaries require glibc 2.31+ (CentOS/RHEL
9+, Debian 11+, Ubuntu 20.04+, etc.).
fq is available via Bioconda.
$ conda install fq=0.12.0
Clone the repository and use Cargo to install fq.
$ git clone --depth 1 --branch v0.12.0 https://github.com/stjude-rust-labs/fq.git
$ cd fq
$ cargo install --locked --path .
Container images are managed by Bioconda and available through Quay.io, e.g., using Docker:
$ docker image pull quay.io/biocontainers/fq:<tag>
See the repository tags for the available tags.
Alternatively, build the development container image:
$ git clone --depth 1 --branch v0.12.0 https://github.com/stjude-rust-labs/fq.git
$ cd fq
$ docker image build --tag fq:0.12.0 .
fq provides subcommands for filtering, generating, subsampling, and validating FASTQ files.
fq filter filters a given FASTQ file by a set of names or a sequence pattern. The result includes only the records that match the given options.
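The matching rule described above can be sketched in a few lines of Python. This is only an illustration, not fq's implementation: the `Record` type and `filter_records` function are hypothetical names, and two behaviors are assumptions rather than documented facts, namely that the pattern is applied as an unanchored search and that a record must satisfy every option that is given.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Set


class Record(NamedTuple):
    """A simplified FASTQ record (hypothetical type, for illustration only)."""

    name: str  # name line without the leading "@"
    sequence: str
    quality: str


def filter_records(
    records: Iterable[Record],
    names: Optional[Set[str]] = None,
    sequence_pattern: Optional[str] = None,
) -> Iterator[Record]:
    """Yield only the records that satisfy every option that was given."""
    pattern = re.compile(sequence_pattern) if sequence_pattern else None
    for record in records:
        # Allowlist check: drop records whose name is not listed.
        if names is not None and record.name not in names:
            continue
        # Pattern check: assumed to be an unanchored regular expression search.
        if pattern is not None and pattern.search(record.sequence) is None:
            continue
        yield record


records = [
    Record("r1", "TCGA", "IIII"),
    Record("r2", "GATC", "IIII"),
    Record("r3", "TCCC", "IIII"),
]

by_name = list(filter_records(records, names={"r1", "r2"}))
by_pattern = list(filter_records(records, sequence_pattern="^TC"))
```

Here `by_name` keeps the records named in the allowlist, and `by_pattern` keeps the records whose sequence starts with `TC`.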
Filters a FASTQ file
Usage: fq filter [OPTIONS] --dsts <DSTS> [SRCS]...
Arguments:
[SRCS]... FASTQ sources
Options:
--names <NAMES>
Allowlist of record names
--sequence-pattern <SEQUENCE_PATTERN>
Keep records that have sequences that match the given regular expression
--dsts <DSTS>
Filtered FASTQ destinations
-h, --help
Print help
-V, --version
Print version
# Filters an input FASTQ using the given allowlist.
$ fq filter --names allowlist.txt --dsts /dev/stdout in.fastq
# Filters FASTQ files by matching a sequence pattern in the first input's
# records and applying the match to all inputs.
$ fq filter --sequence-pattern ^TC --dsts out.1.fq --dsts out.2.fq in.1.fq in.2.fq
fq lint is a FASTQ file pair validator.
Validates a FASTQ file pair
Usage: fq lint [OPTIONS] <R1_SRC> [R2_SRC]
Arguments:
<R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs
[R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs
Options:
--lint-mode <LINT_MODE>
Panic on first error or log all errors [default: panic] [possible values: panic, log]
--single-read-validation-level <SINGLE_READ_VALIDATION_LEVEL>
Only use single read validators up to a given level [default: high] [possible values: low, medium, high]
--paired-read-validation-level <PAIRED_READ_VALIDATION_LEVEL>
Only use paired read validators up to a given level [default: high] [possible values: low, medium, high]
--disable-validator <DISABLE_VALIDATOR>
Disable validators by code. Use multiple times to disable more than one
-h, --help
Print help
-V, --version
Print version
fq lint includes a set of validators that run on single or paired records.
By default, records are validated with all rules, but validators can be
disabled using --disable-validator CODE, where CODE is one of the validators
listed below.
| Code | Level | Name | Validation |
|---|---|---|---|
| S001 | low | PlusLine | Plus line starts with a "+". |
| S002 | medium | Alphabet | All characters in sequence line are one of "ACGTN", case-insensitive. |
| S003 | high | Name | Name line starts with an "@". |
| S004 | low | Complete | All four record lines (name, sequence, plus line, and quality) are present. |
| S005 | high | ConsistentSeqQual | Sequence and quality lengths are the same. |
| S006 | medium | QualityString | All characters in quality line are between "!" and "~" (ordinal values). |
| S007 | high | DuplicateName | All record names are unique. |
| Code | Level | Name | Validation |
|---|---|---|---|
| P001 | medium | Names | Each paired read name is the same, excluding interleave. |
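To make the single-read checks in the tables above concrete, here is a minimal Python sketch of what each rule tests. It is not fq's code: `validate_record` and `find_duplicate_names` are hypothetical helpers, and the record is assumed to arrive as a list of its lines with trailing newlines already removed.

```python
from typing import Iterable, List, Set


def validate_record(lines: List[str]) -> List[str]:
    """Return the codes of the single-read checks that one record fails."""
    # S004: all four record lines must be present.
    if len(lines) < 4:
        return ["S004"]

    name, sequence, plus, quality = lines[:4]
    failed = []

    # S001: the plus line starts with "+".
    if not plus.startswith("+"):
        failed.append("S001")
    # S002: sequence characters are in "ACGTN", case-insensitive.
    if any(base not in "ACGTN" for base in sequence.upper()):
        failed.append("S002")
    # S003: the name line starts with "@".
    if not name.startswith("@"):
        failed.append("S003")
    # S005: sequence and quality have the same length.
    if len(sequence) != len(quality):
        failed.append("S005")
    # S006: quality characters lie between "!" and "~".
    if any(not ("!" <= char <= "~") for char in quality):
        failed.append("S006")

    return failed


def find_duplicate_names(names: Iterable[str]) -> Set[str]:
    """S007: return the record names that occur more than once."""
    seen: Set[str] = set()
    duplicates: Set[str] = set()
    for name in names:
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return duplicates


good = validate_record(["@r1", "ACGTN", "+", "IIIII"])
bad = validate_record(["r1", "ACGX", "-", "II "])
```

Disabling a validator with --disable-validator corresponds to skipping the matching check in a sketch like this one.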
# Validate both reads using all validators. Exits cleanly (0) if no validation
# errors occur.
$ fq lint r1.fastq r2.fastq
# Log errors instead of quitting on first error.
$ fq lint --lint-mode log r1.fastq r2.fastq
# Disable validators S004 and S007.
$ fq lint --disable-validator S004 --disable-validator S007 r1.fastq r2.fastq
fq subsample outputs a subset of records from single or paired FASTQ files.
When using a probability (-p, --probability), each file is read through once,
and a subset of records is selected based on that chance. Given the randomness
used when sampling a uniform distribution, the output record count will not be
exact but (statistically) close.
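As a rough illustration of why the count is only approximate, probability-based sampling can be sketched as one independent random draw per record. This is an assumption-laden sketch, not fq's implementation: `subsample_by_probability` is a hypothetical name and Python's `random.Random` stands in for whatever generator fq uses.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def subsample_by_probability(
    records: Iterable[T],
    probability: float,
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Keep each record independently with the given probability, in one pass."""
    rng = random.Random(seed)
    for record in records:
        # random() is uniform on [0.0, 1.0), so this keeps roughly
        # `probability` of all records.
        if rng.random() < probability:
            yield record


records = [f"read{i}" for i in range(10_000)]
kept = list(subsample_by_probability(records, 0.5, seed=13))
```

With 10,000 input records and a probability of 0.5, `len(kept)` lands near 5,000 but is rarely exactly 5,000, because every record is decided separately.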
When using a record count (-n, --record-count), the first input is read
twice, but it provides an exact number of records to be selected.
A seed (-s, --seed) can be provided to influence the results, e.g.,
for a deterministic subset of records.
For paired input, the sampling is applied to each pair.
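The record-count mode for paired input can be sketched as two passes: count the records in the first input, pick that many positions at random, then emit the pairs at those positions. The source does not describe how fq picks the positions, so the use of `random.sample` here is an assumption, and `subsample_pairs_exact` is a hypothetical name.

```python
import random
from typing import List, Optional, Sequence, Tuple


def subsample_pairs_exact(
    r1: Sequence[str],
    r2: Sequence[str],
    record_count: int,
    seed: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Select exactly `record_count` pairs, or all pairs if there are fewer."""
    # First pass: count the records in the first input.
    total = sum(1 for _ in r1)

    # Choose which record positions to keep. A fixed seed makes the
    # selection reproducible.
    rng = random.Random(seed)
    chosen = set(rng.sample(range(total), min(record_count, total)))

    # Second pass: walk both inputs together so that each kept read 1
    # stays with its read 2.
    return [pair for index, pair in enumerate(zip(r1, r2)) if index in chosen]


r1 = [f"a{i}" for i in range(1_000)]
r2 = [f"b{i}" for i in range(1_000)]
pairs = subsample_pairs_exact(r1, r2, 100, seed=13)
```

Because one decision is made per position and applied to both inputs, the two outputs stay in sync, which is the property that sampling "applied to each pair" is meant to preserve.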
Outputs a subset of records
Usage: fq subsample [OPTIONS] --r1-dst <R1_DST> <--probability <PROBABILITY>|--record-count <RECORD_COUNT>> <R1_SRC> [R2_SRC]
Arguments:
<R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs
[R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs
Options:
-p, --probability <PROBABILITY> The probability a record is kept, as a percentage (0.0, 1.0). Cannot be used with `record-count`
-n, --record-count <RECORD_COUNT> The exact number of records to keep. Cannot be used with `probability`
-s, --seed <SEED> Seed to use for the random number generator
--r1-dst <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz`
--r2-dst <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz`
-h, --help Print help
-V, --version Print version
# Sample ~50% of records from a single FASTQ file
$ fq subsample --probability 0.5 --r1-dst r1.50pct.fastq r1.fastq
# Sample ~50% of records from a single FASTQ file and seed the RNG
$ fq subsample --probability 0.5 --seed 13 --r1-dst r1.50pct.fastq r1.fastq
# Sample ~25% of records from paired FASTQ files
$ fq subsample --probability 0.25 --r1-dst r1.25pct.fastq --r2-dst r2.25pct.fastq r1.fastq r2.fastq
# Sample ~10% of records from a gzipped FASTQ file and compress output
$ fq subsample --probability 0.1 --r1-dst r1.10pct.fastq.gz r1.fastq.gz
# Sample exactly 10000 records from a single FASTQ file
$ fq subsample --record-count 10000 --r1-dst r1.10k.fastq r1.fastq
Please see the disclaimer that applies to all crates and command line tools made available by St. Jude Rust Labs.