ReadFaker

A tool for simulating Oxford Nanopore sequencing reads with realistic quality profiles by extracting empirical models from real FASTQ data.

Features

Builds empirical models of read length and quality scores (quality scores are grouped into batches by read length).
Supports compressed input and output FASTQ files.
Fast: can generate a million reads in under a minute.
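
As a rough illustration of what grouping quality scores by length batch could look like, the sketch below bins observed reads by length and stores their quality profiles per bin. The bin width `BATCH_WIDTH`, the `EmpiricalModel` type, and its methods are hypothetical names chosen for this example; ReadFaker's actual data structures and batch sizes may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical bin width in bases (an assumption, not taken from ReadFaker).
const BATCH_WIDTH: usize = 1000;

/// Maps a read length to the index of its length batch.
fn batch_index(len: usize) -> usize {
    len / BATCH_WIDTH
}

/// Illustrative empirical model: every observed read length, plus the
/// observed quality profiles grouped by length batch.
#[derive(Default)]
struct EmpiricalModel {
    lengths: Vec<usize>,
    qualities: BTreeMap<usize, Vec<Vec<u8>>>,
}

impl EmpiricalModel {
    /// Records one observed read, given its decoded Phred scores (one per base).
    fn observe(&mut self, quals: &[u8]) {
        self.lengths.push(quals.len());
        self.qualities
            .entry(batch_index(quals.len()))
            .or_default()
            .push(quals.to_vec());
    }

    /// Returns the quality profiles observed in the same batch as `len`,
    /// or `None` if no read of comparable length was seen.
    fn profiles_for(&self, len: usize) -> Option<&[Vec<u8>]> {
        self.qualities.get(&batch_index(len)).map(Vec::as_slice)
    }
}
```

A simulator built on such a model would draw a length from `lengths`, then pick a quality profile from `profiles_for(length)`, so that short and long reads each keep the quality characteristics seen in the real data.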

Motivation

Oxford Nanopore data quality depends on many factors, such as the kit used, the basecalling model version, and the model precision level. Basecalling models are updated frequently, which makes it hard to simulate realistic data with fixed parameters.

This tool takes a different approach: instead of hardcoded models, it extracts length and quality profiles directly from your real data, ensuring the simulated reads match the characteristics of actual sequencing runs.

This is particularly useful for artificially contaminating real data for testing purposes (the reason I wrote this tool to begin with).

Current Limitations / Planned Improvements

Insertions and deletions are limited to a single nucleotide. The ratios between alteration types (substitution, insertion, deletion) are fixed.
Only mutated reads are generated; chimeras, junk reads, and other types of artifacts are not simulated.
No BAM file support.

Installation

Go to the Releases page and download the latest binary for your platform.

Usage

readfaker -r <reference.fasta> -i <input.fastq> -o <output.fastq> [-n <num_reads>] [-s <seed>] [-v]

Required Arguments

-r, --reference <FASTA> - Reference sequences to sample reads from
-i, --input <FASTQ> - Input FASTQ file from which the quality and length models are extracted
-o, --output <FASTQ> - Output FASTQ file for simulated reads

Optional Arguments

-n, --num-reads <N> - Number of reads to generate (default: 100000)
-s, --seed <N> - Random seed for reproducibility
-v, --verbose - Enable verbose output

Example

# Generate 10000 reads with verbose output
readfaker -r genome.fasta -i real_reads.fastq.gz -o simulated_reads.fastq.gz -n 10000 -v

# Generate reproducible reads with a fixed seed
readfaker -r genome.fasta -i real_reads.fastq -o simulated_reads.fastq -s 42

How It Works

Model Extraction: Reads an existing FASTQ file to build empirical models of read lengths and quality scores
Reference Loading: Parses reference genome sequences from FASTA format
Read Generation: Samples read lengths, selects random reference positions, applies quality profiles, and introduces errors based on quality scores
Output: Writes FASTQ records, applying BGZF compression automatically when the output file name ends in .gz, .bgz, or .bgzf
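
The error-introduction step can be sketched as follows. A Phred score Q corresponds to an error probability of 10^(-Q/10), and FASTQ stores each score as the ASCII character Q + 33. Everything else here is an assumption made for illustration: the function names, the small `XorShift64` generator (used only so the example needs no external crates), and the 60/20/20 split between substitutions, insertions, and deletions. The README states only that the ratios are fixed, not what they are. The sketch returns just the altered sequence and does not show how the quality string is kept in sync after insertions or deletions.

```rust
/// Converts a Phred score to the probability that the base call is wrong.
fn phred_to_error_prob(q: u8) -> f64 {
    10f64.powf(-(q as f64) / 10.0)
}

/// Decodes a FASTQ quality character (Phred+33 encoding) into a Phred score.
fn decode_phred33(c: u8) -> u8 {
    c.saturating_sub(33)
}

/// Minimal xorshift64 generator (illustrative only). The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Alters each base with probability 10^(-Q/10). An altered base becomes a
/// substitution, a 1-nt insertion, or a 1-nt deletion (assumed 60/20/20 split).
fn mutate(seq: &[u8], quals: &[u8], rng: &mut XorShift64) -> Vec<u8> {
    const BASES: [u8; 4] = [b'A', b'C', b'G', b'T'];
    let mut out = Vec::with_capacity(seq.len());
    for (&base, &q) in seq.iter().zip(quals) {
        if rng.next_f64() >= phred_to_error_prob(q) {
            out.push(base); // no error at this position
            continue;
        }
        let kind = rng.next_f64();
        if kind < 0.6 {
            // Substitution: step 1..=3 positions around the base table so the
            // replacement always differs from the original base.
            let idx = BASES.iter().position(|&b| b == base).unwrap_or(0);
            let offset = 1 + (rng.next_u64() % 3) as usize;
            out.push(BASES[(idx + offset) % 4]);
        } else if kind < 0.8 {
            // Insertion: keep the base and add one random base after it.
            out.push(base);
            out.push(BASES[(rng.next_u64() % 4) as usize]);
        }
        // Otherwise: deletion, so nothing is emitted for this position.
    }
    out
}
```

Because all randomness flows through one seeded generator, two runs with the same seed produce identical output, which is the property the -s option relies on.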

Building from Source

cargo build --release

The binary will be available at target/release/readfaker.

Running Tests

cargo test

dialvarezs/readfaker