endSeeker: A computational software for identifying 2'-O-Methylation sites from Nm-REP-seq data.
endSeeker identifies novel Nm sites by calculating the 3'-end coverage of candidate Nm sites from Nm-REP-seq data.
Usage: endSeeker [options] --fa <genome seq> --fai <fai file> --gene <bed12 file> --treat <alignments> --input <input alignments>
[options]
--fa : genome sequence [Required]
--fai : genome fai file [Required]; use "samtools faidx" to generate it
--gene : gene annotation file in BED12 format [Required]
--treat : alignment file of the sample treated with MgR or MgR+OED [Required]
--input : alignment file of the input sample [Required]
-v/--verbose : verbose information
-V/--version : endSeeker version
-h/--help : help information
-s/--strand : strand-specific sequencing data
-n/--norm : normalize read counts to the number of mapped loci
-c/--collapser : keep duplicate reads, default is false
-o/--outfile : output file
-t/--min-tag : minimum tag number for each end site, default>=5.0 reads
-r/--rpm : minimum rpm value for each end site, default>=0.001
-f/--fold : minimum fold-change[default>=1.0]
-w/--window : window size around the end position[default=20]
-l/--min-len : minimum length of reads, default=15
-L/--max-len : maximum length of reads, default=1000
Download endSeeker-1.0.tar.gz from https://github.com/sysu-software/endSeeker/releases , then unpack it and run make:
tar -xzvf endSeeker-1.0.tar.gz
cd endSeeker-1.0
make
Operating system: endSeeker is designed to run on POSIX-compatible platforms, including UNIX, Linux and Mac OS X. It has been tested most extensively on Linux and Mac OS X, because these are the machines we develop on.
Compiler: The source code is written in C++ and is compiled and tested with g++.
Libraries and other installation requirements: endSeeker bundles one software library, the BamTools library package, which is compiled automatically during the endSeeker installation process.
By default, endSeeker does not require any additional libraries to be installed by you.
Dependencies: endSeeker takes BAM files as input, so you need the short-read mapper STAR or another mapper.
You can get the latest versions from:
(1) STAR: https://github.com/alexdobin/STAR
(2) Samtools: http://www.htslib.org/
You need to have the reference genome, fai file, annotation file, and STAR indexes for genome and annotation.
You can construct these datasets yourself using the following steps:
As an example, let's assume you use human genome (version hg38).
(1) Genome:
mkdir genome
cd genome
wget -c 'http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz'
gzip -d hg38.fa.gz
cd ..
(2) Annotation:
Use the UCSC Table Browser to export the genome annotation (e.g. GENCODE) in BED12 format:
http://genome.ucsc.edu/cgi-bin/hgTables
For example, save the output in the genome directory as hg38.gencode.bed12
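Before passing the annotation to --gene, it may be worth confirming that the exported file really has 12 columns. The following is a minimal sketch and not part of endSeeker: the function name check_bed12 is hypothetical, and it assumes a tab-delimited file that may start with track/browser header lines.

```shell
# Hypothetical helper (not part of endSeeker): list lines of a BED file
# that do not have exactly 12 tab-separated fields.
check_bed12() {
    # $1 = BED file to check
    awk -F '\t' '
        BEGIN { bad = 0 }
        /^#/ || /^track/ || /^browser/ { next }   # skip comment/header lines
        NF != 12 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit bad }                          # non-zero status if any line is malformed
    ' "$1"
}
```

Example: check_bed12 genome/hg38.gencode.bed12 && echo "12 columns on every line"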
(3) Build the STAR genome index (this needs the GENCODE annotation in GTF format, here gencode.v30.annotation.gtf):
STAR --runMode genomeGenerate --genomeDir ./starIndex --genomeFastaFiles ./genome/hg38.fa --sjdbGTFfile gencode.v30.annotation.gtf --sjdbOverhang 100
(4) Build the fai index:
samtools faidx ./genome/hg38.fa
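Because --fa, --fai and --gene are used together, a quick consistency check between the annotation and the index can catch mismatched chromosome names early. This is a sketch under stated assumptions, not something endSeeker provides: missing_chroms is a hypothetical helper, and it relies only on the fact that the first column of a .fai file is the sequence name.

```shell
# Hypothetical helper (not part of endSeeker): print chromosome names used in
# a BED file that are absent from a samtools .fai index.
missing_chroms() {
    # $1 = BED file, $2 = .fai file
    awk '
        FILENAME == ARGV[1] { known[$1] = 1; next }   # first file read: the .fai index
        /^#/ || /^track/ || /^browser/ { next }       # skip BED header lines
        !($1 in known) && !seen[$1]++ { print $1 }    # report each missing name once
    ' "$2" "$1"
}
```

Example: missing_chroms genome/hg38.gencode.bed12 genome/hg38.fa.fai prints nothing when every annotated chromosome is present in the index.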
(5) Align reads to the genome using STAR
STAR parameters as follows: --alignEndsType EndToEnd --outFilterType BySJout --outFilterMultimapScoreRange 0 --outFilterMultimapNmax 30 --outFilterMismatchNmax 15 --outFilterMismatchNoverLmax 0.1 --outFilterScoreMin 0 --outFilterScoreMinOverLread 0 --outFilterMatchNmin 15 --outFilterMatchNminOverLread 0.8 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --seedSearchStartLmax 15 --seedSearchStartLmaxOverLread 1 --seedSearchLmax 0 --seedMultimapNmax 20000 --seedPerReadNmax 1000 --seedPerWindowNmax 100 --seedNoneLociPerWindow 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts
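The parameter list above can be wrapped into a single function so that the treated and input samples are aligned with identical settings. This is only a sketch: the function name star_align, the thread count, the single-end uncompressed FASTQ input and the output prefix are assumptions made for illustration. The index directory ./starIndex is the one built in step (3).

```shell
# Sketch of a wrapper around the STAR parameters listed above.
# ASSUMPTIONS (not from the endSeeker documentation): single-end reads in an
# uncompressed FASTQ file, 4 threads, index in ./starIndex from step (3).
star_align() {
    reads="$1"     # hypothetical input file, e.g. treat.fastq
    prefix="$2"    # hypothetical output prefix, e.g. ./align/treat_
    # Set STAR_CMD="echo STAR" to print the command instead of running it.
    ${STAR_CMD:-STAR} \
        --runThreadN 4 --genomeDir ./starIndex \
        --readFilesIn "$reads" --outFileNamePrefix "$prefix" \
        --alignEndsType EndToEnd --outFilterType BySJout \
        --outFilterMultimapScoreRange 0 --outFilterMultimapNmax 30 \
        --outFilterMismatchNmax 15 --outFilterMismatchNoverLmax 0.1 \
        --outFilterScoreMin 0 --outFilterScoreMinOverLread 0 \
        --outFilterMatchNmin 15 --outFilterMatchNminOverLread 0.8 \
        --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
        --seedSearchStartLmax 15 --seedSearchStartLmaxOverLread 1 \
        --seedSearchLmax 0 --seedMultimapNmax 20000 --seedPerReadNmax 1000 \
        --seedPerWindowNmax 100 --seedNoneLociPerWindow 20 \
        --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
        --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts
}
```

Example: star_align treat.fastq ./align/treat_ would write the coordinate-sorted alignments to ./align/treat_Aligned.sortedByCoord.out.bam, which could then be passed to --treat. Run it a second time on the input sample to obtain the file for --input.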
cd example;
../endSeeker --norm -t 5 -f 3 --fa chr21.fa --fai chr21.fa.fai --gene human_rRNA_genes.bed12 --treat MgR_treatment_sample.bam --input MgR_control_sample.bam >endSeeker_candidate_Nm_sites.txt
#chrom chromStart chromEnd name score strand geneName geneStart geneEnd modifiedBase endReadNum endRPM upFC upCtrlFC downFC downCtrlFC extendSeq
chr21 8217780 8217781 endSeeker-1 3.06000 + NR_146148|28S|rRNA 3894 3895 C 306.00000 135.37146 3.06000 3.99876 3.97403 5.19320 AGCGGGGAAAGAAGAmCCTGTTGAGCTTGAC
Note: lines starting with # are comment lines.
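The output table can be post-processed with standard tools. The sketch below keeps only sites whose upFC and downFC values both reach a chosen minimum and prints them as BED6. It is not part of endSeeker: filter_nm_sites is a hypothetical helper, the column positions (13 for upFC, 15 for downFC) are read off the header line shown above, and the file is assumed to be whitespace-delimited. The cut-off is the user's choice, not a recommendation from the authors.

```shell
# Hypothetical helper (not part of endSeeker): keep sites whose upFC (column 13)
# and downFC (column 15) both reach a minimum value, and print them as BED6.
filter_nm_sites() {
    # $1 = minimum fold-change; the endSeeker table is read from stdin
    awk -v min="$1" '
        BEGIN { OFS = "\t" }
        /^#/ { next }                                  # skip the header/comment line
        ($13 + 0) >= (min + 0) && ($15 + 0) >= (min + 0) {
            print $1, $2, $3, $4, $5, $6               # chrom, start, end, name, score, strand
        }
    '
}
```

Example: filter_nm_sites 3 < endSeeker_candidate_Nm_sites.txt > candidate_Nm_sites.fc3.bed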
Thanks a lot to everyone who contributed to the public code (e.g. BamTools, Samtools) used by endSeeker.
*****************************************************************************************
* endSeeker - A computational software for identifying 2'-O-Methylation sites from Nm-REP-seq data.
*
* Author : Jian-Hua Yang [email protected]
*
* RNA Information Center, School of Life Sciences, Sun Yat-Sen University
*
* Create date: 11/18/2019
*
* last modified time: 09/01/2020
****************************************************************************************