A command line tool for transferring Unique Molecular Identifiers (UMIs) provided as a separate FastQ file to the header of records in paired FastQ files.
- Background on Unique Molecular Identifiers
- Installing umi-transfer
- Using umi-transfer to integrate UMIs
- Benchmarks and parameter recommendations
- Chaining with other software
- Contributing bugfixes and new features
To increase the accuracy of quantitative DNA sequencing experiments, Unique Molecular Identifiers may be used. UMIs are short sequences used to uniquely tag each molecule in a sample library, enabling precise identification of read duplicates. They must be added during library preparation, prior to sequencing, and therefore require appropriate arrangements with your sequencing provider.
Most tools capable of taking UMIs into consideration during an analysis workflow expect the respective UMI sequence to be embedded into the read's ID. Please consult your tools' manuals regarding the exact specification.
For some library preparation kits and sequencing adapters, the UMI sequence needs to be read together with the index from the antisense strand. Consequently, it will be output as a separate FastQ file during the demultiplexing process.
This tool efficiently integrates these separate UMIs into the headers and can also correct divergent read numbers back to the canonical 1 and 2.
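To make the idea of "embedding the UMI into the read's ID" concrete, the sketch below mimics the transformation with standard Unix tools on a single toy record. It is an illustration only, not how umi-transfer is implemented, and the exact header layout the tool produces may differ. The function name `integrate_umi` and the file names are hypothetical, and the sketch assumes four-line FastQ records in identical order in both files.

```shell
# Illustration only: appends the UMI sequence to the read ID (the part of
# the header before the first space), joined by a delimiter.
integrate_umi() {
  # $1: reads FastQ, $2: UMI FastQ, $3: delimiter (defaults to ':')
  paste "$1" "$2" | awk -F '\t' -v delim="${3:-:}" '
    NR % 4 == 1 { header = $1 }          # remember the header of the read
    NR % 4 == 2 {                        # sequence line: $1 = read, $2 = UMI
      n = index(header, " ")             # split the read ID from the comment
      if (n == 0) { id = header; rest = "" }
      else { id = substr(header, 1, n - 1); rest = substr(header, n) }
      print id delim $2 rest             # header with the UMI embedded
      print $1                           # unchanged read sequence
    }
    NR % 4 == 3 { print $1 }             # separator line of the read
    NR % 4 == 0 { print $1 }             # quality line of the read
  '
}

# Toy input: one read and its UMI (hypothetical file names).
tmp=$(mktemp -d)
printf '@read1 1:N:0:ACGT\nTTGACC\n+\nIIIIII\n' > "$tmp/r1.fq"
printf '@read1 2:N:0:ACGT\nGATTACA\n+\nIIIIIII\n' > "$tmp/umi.fq"
integrate_umi "$tmp/r1.fq" "$tmp/umi.fq" ':'
rm -r "$tmp"
```

For the toy record, the header line becomes `@read1:GATTACA 1:N:0:ACGT`, while the sequence and quality lines are passed through unchanged.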
Binaries for umi-transfer are available for most platforms and can be obtained from the Releases page on GitHub. Navigate to the releases and download the appropriate binary for your operating system. Once downloaded, you can place it in a directory of your choice and optionally add the binary to your system's $PATH.
💡 Tip for macOS users encountering "cannot be opened because the developer cannot be verified"
When you download binaries from the internet on macOS, the operating system places them under a "quarantine" to protect you from running potentially unsafe software. This often results in an error such as:
"my-binary" cannot be opened because the developer cannot be verified.
To allow running the umi-transfer binary (or any downloaded binary), you need to remove the quarantine attribute. You can do this from the command line with:
xattr -dr com.apple.quarantine ./path/to/umi-transfer
Replace ./path/to/umi-transfer with the actual path to your downloaded binary. This command tells macOS to trust the binary and should resolve the warning so you can execute the file normally.
umi-transfer is also available on Bioconda. Please refer to the Bioconda documentation for comprehensive installation instructions. If you are already familiar with conda and Bioconda, here's a quick reference:
mamba install umi-transfer
If you wish to create a separate virtual environment for the tool, replace <myenvname> with a suitable environment name of your choice and run
mamba create --name <myenvname> umi-transfer
Docker provides a platform for packaging software into self-contained units called containers. Containers encapsulate all the dependencies and libraries needed to run an application, making it easy to deploy and run the software consistently across different environments.
To use umi-transfer with Docker, you can pull the pre-made Docker image from Docker Hub. Open a terminal or command prompt and run the following command:
docker pull mzscilifelab/umi-transfer:latest
Once the image is downloaded, you can run umi-transfer within a Docker container using:
docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer --help
A complete command might look like the example below. The Docker options -t, -v and -w ensure that your local directory is mapped to and available inside the container. Everything after the image name follows the standard command line syntax:
docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer external --in=read1.fq --in2=read2.fq --umi=umi.fq
Optionally, you can create an alias for the Docker part of the command to be able to use the containerized version as if it were locally installed. Add the line below to your ~/.profile, ~/.bash_aliases, ~/.bashrc or ~/.zprofile (depending on the terminal and configuration being used).
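alias umi-transfer='docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer'
The single quotes matter: they defer the evaluation of `pwd` until the alias is used, so the directory you are currently in is mounted rather than the one that was active when the shell started.
Provided that you have Rust installed on your computer, clone or download this repository and run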
cargo build --release
That should create an executable target/release/umi-transfer that can be placed anywhere in your $PATH or be executed directly by specifying its path:
./target/release/umi-transfer --version
umi-transfer 1.6.0
The tool requires three FastQ files as input. You can manually specify the names and location of the output files with --out and --out2 or the tool will automatically append a with_UMI suffix to your input file names. It additionally allows you to choose a custom UMI delimiter with --delim, the position of the integrated UMI with --position, and to set the flags -f, -c and -z.
-c is used to ensure the canonical read numbers 1 and 2 in paired output files, regardless of the read numbers of the input reads. -f / --force will overwrite existing output files without prompting the user and -z enables the internal compression of the output files. Alternatively, you can also specify an output file name with .gz suffix to obtain compressed output.
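As an illustration of what correcting read numbers means, the sketch below rewrites the read number in an Illumina-style header comment (for example `3:N:0:ACGT`) with awk. It is not the tool's implementation: the function name `correct_read_number` is hypothetical, and the assumption that the read number is the first `digits:` field after a space may not match how umi-transfer detects it.

```shell
# Illustration only: force the read number in each FastQ header line.
correct_read_number() {
  # $1: canonical read number to enforce (1 or 2); FastQ is read from stdin
  awk -v num="$1" '
    NR % 4 == 1 { sub(/ [0-9]+:/, " " num ":") }  # only touch header lines
    { print }
  '
}

# A record from a third read file (R3) becomes read 2 of the pair.
printf '@read1 3:N:0:ACGT\nTTGACC\n+\nIIIIII\n' | correct_read_number 2
```

Here the header `@read1 3:N:0:ACGT` is emitted as `@read1 2:N:0:ACGT`; sequence and quality lines are left untouched.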
$ umi-transfer external --help
Integrate UMIs from a separate FastQ file
Usage: umi-transfer external [OPTIONS] --in <R1_IN> --in2 <R2_IN> --umi <RU_IN>
Options:
-p, --position <TARGET_POSITION>
Choose the target position for the UMI: 'header' or 'inline'. Defaults to 'header'.
[default: header] [possible values: header, inline]
-c, --correct_numbers
Read numbers will be altered to ensure the canonical read numbers 1 and 2 in output file sequence headers.
-z, --gzip
Compress output files. Turned off by default.
-l, --compression_level <COMPRESSION_LEVEL>
Choose the compression level: Maximum 9, defaults to 3. Higher numbers result in smaller files but take longer to compress.
-t, --threads <NUM_THREADS>
Maximum number of threads to use for processing. Preferably pick odd numbers, 9 or 11 recommended. Defaults to the maximum number of cores available.
-f, --force
Overwrite existing output files without further warnings or prompts.
-d, --delim <DELIM>
Delimiter to use when joining the UMIs to the read name. Defaults to `:`.
--in <R1_IN>
[REQUIRED] Input file 1 with reads.
--in2 <R2_IN>
[REQUIRED] Input file 2 with reads.
-u, --umi <RU_IN>
[REQUIRED] Input file with UMI.
--out <R1_OUT>
Path to FastQ output file for R1.
--out2 <R2_OUT>
Path to FastQ output file for R2.
-h, --help
Print help
-V, --version
Print version
A typical run may look like this:
umi-transfer external -fz -d '_' --in 'R1.fastq' --in2 'R3.fastq' --umi 'R2.fastq'
umi-transfer requires paired input files. To run on singletons, use the same input twice and redirect one output to /dev/null:
umi-transfer external --in read1.fastq --in2 read1.fastq --umi read2.fastq --out output1.fastq --out2 /dev/null
Since the release of version 1.5, umi-transfer features internal multi-threaded output compression. As a result, umi-transfer 1.5 now runs approximately 25 times faster than version 1.0 when using internal compression and about twice as fast compared to using an external compression tool. This improvement is enabled by the outstanding gzp crate, which abstracts a lot of the underlying complexity away from the main software.
In our first benchmark using 17 threads, version 1.5 of umi-transfer processed approximately 550,000 paired records per second with the default gzip compression level of 3. At the highest compression level of 9, the rate dropped to just below 200,000 records per second. While the exact numbers may vary depending on your storage, file system, and processors, we expect the relative performance rates to remain approximately constant.
| Version | Command | --position | ~ reads / s |
|---|---|---|---|
| 1.0 | external | N/A | 30500 |
| 1.5 | external | N/A | 591200 |
| 1.6 | external | Header | 579500 |
| 1.6 | external | Inline | 567287 |
Due to the new --position parameter for choosing the UMI integration position, version 1.6 is slightly slower than its predecessor: going by the figures in the table above, about 2% with header and about 4% with inline. If you do not require this option, there is no need to upgrade.
In a subsequent benchmark, we tested the effect of increasing the number of threads. For the default compression level, the maximum speed was achieved with 9 to 11 threads. Since umi-transfer writes two output files simultaneously, this configuration allows for 4 to 5 threads per file to handle the output compression.
Adding more threads per file proved unhelpful, as other steps became the rate-limiting factors. These factors include file system I/O, input file decompression, and the actual editing of the file contents, which now determine the performance of umi-transfer. Only when increasing the compression level to higher settings did adding more threads continue to provide a performance benefit. For the highest compression setting, we did not reach the plateau phase during the benchmark, but it is likely to occur in the range of 53-55 total threads, or about 26 threads per output file.
In summary, we recommend running umi-transfer with 9 or 11 threads for compression. Odd numbers are favorable as they allow one dedicated main thread, while evenly splitting the remaining threads between the two output files. It's important to note that specifying more threads than the available physical or logical cores on your machine will result in a severe performance loss, since umi-transfer operates synchronously.
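The recommendation can be turned into a small helper that derives a value for --threads from the number of cores. This is a sketch under stated assumptions: `pick_threads` is a hypothetical function, not part of umi-transfer, and the cap of 11 reflects the default compression level only; higher compression levels may benefit from more threads, as described above.

```shell
# Hypothetical helper: choose an odd thread count, capped at 11 and at the
# number of cores, so that one main thread remains and the rest split
# evenly between the two output files.
pick_threads() {
  # $1: number of available cores
  n=$1
  if [ "$n" -gt 11 ]; then n=11; fi                # plateau at default level
  if [ $((n % 2)) -eq 0 ]; then n=$((n - 1)); fi   # prefer odd numbers
  if [ "$n" -lt 1 ]; then n=1; fi                  # never go below one thread
  echo "$n"
}

# Query the online core count; fall back to 1 if getconf is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "suggested option: --threads $(pick_threads "$cores")"
```

On a machine with 16 cores this suggests 11 threads, and on one with 8 cores it suggests 7, which stays below the core count as advised.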
umi-transfer cannot be used with the pipe operator, because it neither supports writing output to stdout nor reading input from stdin. However, FIFOs (First In, First Out buffered pipes) can be used to elegantly combine umi-transfer with other software on GNU/Linux and macOS operating systems.
For example, we may want to use external compression software like Parallel Gzip together with umi-transfer. For this purpose, it would be unfavorable to write the data uncompressed to disk before compressing it. Instead, we create named pipes with mkfifo, which can be provided to umi-transfer as if they were regular output file paths. In reality, the data is directly passed on to pigz via a buffered stream.
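The pattern can be tried in isolation before involving real data. In the self-contained sketch below, `printf` stands in for umi-transfer as the program writing to the FIFO, and plain `gzip` stands in for pigz, so neither tool needs to be installed. The function name `run_with_fifo` and the toy record are hypothetical.

```shell
# Sketch of the FIFO pattern with stand-ins for both programs.
run_with_fifo() {
  # $1: directory that will hold the FIFO and the compressed result
  fifo="$1/output1"
  mkfifo "$fifo"

  # Consumer: compress whatever arrives on the FIFO, in the background.
  gzip -c < "$fifo" > "$1/output1.fastq.gz" &
  consumer=$!

  # Producer: writes to the FIFO as if it were a regular output file.
  printf '@read1:GATTACA 1:N:0\nTTGACC\n+\nIIIIII\n' > "$fifo"

  # Wait for the compressor to finish, then remove the FIFO.
  wait "$consumer"
  rm "$fifo"
}

workdir=$(mktemp -d)
run_with_fifo "$workdir"
gzip -dc "$workdir/output1.fastq.gz"   # shows the record that passed through
rm -r "$workdir"
```

No uncompressed copy of the record is ever written to disk: the producer blocks until the consumer has opened the FIFO, and the consumer sees end-of-file as soon as the producer closes it.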
First, the named pipes are created:
mkfifo output1
mkfifo output2
Then a multi-threaded pigz compression is tied to each FIFO. Note the trailing & to leave these processes running in the background.
$ pigz -p 10 -c > output1.fastq.gz < output1 &
[4] 233394
$ pigz -p 10 -c > output2.fastq.gz < output2 &
[5] 233395
The argument -p 10 specifies the number of threads that each pigz process may use. The optimal setting is hardware-specific and will require some testing.
Finally, we can run umi-transfer using the FIFOs as output paths:
umi-transfer external --in read1.fastq --in2 read3.fastq --umi read2.fastq --out output1 --out2 output2
It's good practice to remove the FIFOs after the program has finished:
rm output1 output2
umi-transfer is free and open-source software developed and maintained by scientists of the Swedish National Genomics Infrastructure. We gladly welcome suggestions for improvement, bug reports and code contributions.
If you'd like to contribute code, the best way to get started is to create a personal fork of the repository. Subsequently, use a new branch to develop your feature or contribute your bug fix. Ideally, use a code linter like rust-analyzer in your code editor and run the tests with cargo test.
Before developing a new feature, we recommend opening an issue on the main repository to discuss your proposal upfront. Once you're ready, simply open a pull request to the dev branch and we'll happily review your changes. Thanks for your interest in contributing to umi-transfer!