Test data
Data for the TOBIAS test commands found in this wiki can be obtained using TOBIAS DownloadData:
$ TOBIAS DownloadData --bucket data-tobias-2020
$ mv data-tobias-2020/ test_data/
This downloads the test data (~700 MB) from the loosolab S3 storage server and moves it to the test_data/ directory.
The test data originate from the paper "Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position", Buenrostro et al. 2013, Nature Methods (link). This paper applied ATAC-seq to the GM12878 lymphoblastoid cell line (derived from B cells) and to CD4+ T cells at three time points. The raw data from the study (study accession PRJNA207663) were downloaded as .fastq files from the following URLs:
| sample_title | experiment_accession | fastq files |
|---|---|---|
| GM12878_ATACseq_50k_Rep1 | SRX298000 | read1,read2 |
| GM12878_ATACseq_50k_Rep2 | SRX298001 | read1,read2 |
| GM12878_ATACseq_50k_Rep3 | SRX298002 | read1,read2 |
| GM12878_ATACseq_50k_Rep4 | SRX298003 | read1,read2 |
| CD4+_ATACseq_Day1_Rep1 | SRX298007 | read1,read2 |
| CD4+_ATACseq_Day1_Rep2 | SRX298008 | read1,read2 |
| CD4+_ATACseq_Day2_Rep1 | SRX298009 | read1,read2 |
| CD4+_ATACseq_Day2_Rep2 | SRX298010 | read1,read2 |
| CD4+_ATACseq_Day3_Rep1 | SRX298011 | read1,read2 |
| CD4+_ATACseq_Day3_Rep2 | SRX298012 | read1,read2 |
All samples were mapped using STAR. Replicates were merged per condition using samtools merge, yielding the .bam-files Bcell.bam, Tcell_day1.bam, Tcell_day2.bam and Tcell_day3.bam. To keep file sizes minimal, a random subset of reads was chosen for each replicate using samtools view -s <fraction>. For the sake of the examples, the Tcell samples were further merged into one .bam-file, Tcell.bam.
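The subsampling and merging steps can be sketched as a small shell function. This is a minimal sketch, not the exact commands used to build the test data: the function name subsample_and_merge, the RUN dry-run switch, the replicate file names and the value 42.1 are all assumptions for illustration. In samtools view -s, the integer part of the value seeds the random number generator and the part after the decimal point is the fraction of reads to keep.

```shell
# Hypothetical helper (not part of TOBIAS or samtools):
# subsample each replicate, then merge the subsets into one condition .bam.
# Usage: subsample_and_merge <merged.bam> <seed.fraction> <rep1.bam> [<rep2.bam> ...]
# Set RUN=echo to print the samtools commands instead of executing them.
subsample_and_merge() {
    merged=$1
    frac=$2
    shift 2
    parts=""
    for bam in "$@"; do
        sub="${bam%.bam}.sub.bam"
        # -b writes BAM output; -s takes <seed>.<fraction>
        ${RUN:-} samtools view -b -s "$frac" -o "$sub" "$bam"
        parts="$parts $sub"
    done
    # $parts is left unquoted on purpose, so file names must not contain spaces
    ${RUN:-} samtools merge -f "$merged" $parts
}

# Example call (file names are placeholders):
#   subsample_and_merge Bcell.bam 42.1 rep1.bam rep2.bam
```

With RUN=echo set, the example call prints two samtools view commands and one samtools merge command, which is a convenient way to check the plan before touching any data.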
Peak-calling was performed per replicate using MACS2 with parameters --nomodel --shift -100 --extsize 200 --broad. The file merged_peaks.bed represents peaks merged across the Bcell and Tcell conditions.
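A per-replicate peak-calling call with the parameters listed above might look as follows. This is a hedged sketch: the function name call_peaks, the RUN dry-run switch and the choice of -t/-n values are assumptions, and any further options used for the original data (such as genome size or input format) are not stated here and are therefore left out.

```shell
# Hypothetical helper: run MACS2 on one replicate .bam with the parameters
# named in the text. The output prefix is derived from the .bam file name.
# Set RUN=echo to print the command instead of executing it.
call_peaks() {
    bam=$1
    name=$(basename "$bam" .bam)
    ${RUN:-} macs2 callpeak -t "$bam" -n "$name" \
        --nomodel --shift -100 --extsize 200 --broad
}

# Example call (path is a placeholder):
#   call_peaks data/rep1.bam
```

How the per-replicate peaks were then combined into merged_peaks.bed is not described in this section, so that step is not sketched.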
The .gtf-file used for annotation was downloaded from Ensembl (link). The chromosome prefix "chr" was added, and the file was then subset to chr4.
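One way to add the "chr" prefix and keep only chr4 is a short awk filter. This is a sketch under assumptions: the function name gtf_to_chr4 and the input file name in the usage comment are hypothetical, and the original file may have been prepared with different tools.

```shell
# Hypothetical helper: prefix Ensembl-style contig names with "chr"
# and keep only records on chr4. Comment lines starting with "#" are
# passed through unchanged.
gtf_to_chr4() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         { $1 = "chr" $1 }
         $1 == "chr4" { print }' "$1"
}

# Example call (input file name is a placeholder):
#   gtf_to_chr4 ensembl_annotation.gtf > transcripts_chr4.gtf
```

Comparing the full first column with "chr4" rather than matching a pattern avoids accidentally keeping contigs such as chr14.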
Annotation of peaks in merged_peaks.bed was performed using UROPA as shown here:
$ uropa --bed merged_peaks.bed --gtf transcripts_chr4.gtf --show_attributes gene_id gene_name --feature_anchor start --distance 20000 10000 --feature gene
The test files are obtained with:
$ cut -f 1-6,16-17 merged_peaks_finalhits.txt | head -n 1 > merged_peaks_annotated_header.txt
$ cut -f 1-6,16-17 merged_peaks_finalhits.txt | tail -n +2 > merged_peaks_annotated.bed
The file motifs.jaspar contains 83 motifs from the JASPAR 2020 vertebrate database (download here). The motifs found in test_data/individual_motifs/ were obtained using TOBIAS FormatMotifs --task split.
The file blacklist.bed is a subset of the Boyle-lab blacklist (available here) containing only chr4 regions.
Additional files are obtained using the test commands throughout this wiki.