This repository holds a Nextflow pipeline for analysing gene expression studies. The experimental design is built on multiple generations (F0, F1, F2, etc.) and doses (0, 10, 100, 1000, etc.). The pipeline expects quantification files (quant.sf) generated with Salmon as input and outputs tables of (1) differentially expressed genes and (2) gene ontology analysis results. Each table comes as an Rds file containing all generations and doses in a single file. The pipeline also outputs the tables as Excel files to be included as supplementary tables in a scientific report.
Note
By default, this pipeline is set up for the mouse genome (GRCm39).
Quantification files
To generate quantification files (quant.sf) from sequencing files, refer to the nf-core/rnaseq pipeline.
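Before starting the pipeline it can help to confirm which samples actually have a quant.sf file. The sketch below assumes a layout of one sub-directory per sample directly under the quant path (`<quant_path>/<sample>/quant.sf`), which is how Salmon usually writes its output; this README does not state the layout the pipeline expects, so treat both the layout and the helper name `list_quant_samples` as assumptions.

```shell
# Hypothetical helper: list the sample directories that contain a quant.sf file.
# Assumes the layout <quant_path>/<sample>/quant.sf (not confirmed by this README).
list_quant_samples() {
  # $1 = path to the directory holding the per-sample Salmon output
  find "$1" -mindepth 2 -maxdepth 2 -type f -name 'quant.sf' |
    while IFS= read -r f; do
      # print the name of the directory that holds quant.sf
      basename "$(dirname "$f")"
    done | LC_ALL=C sort
}
```

Calling `list_quant_samples 'path-to-quant-files'` prints one sample name per line, which can be compared by eye against the id column of the metadata file.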
Differential expression
Differential expression was analysed with the R-package edgeR, which uses negative binomial distributions and generalized linear models as its statistical method. A threshold of FDR < 0.05 was used after multiple testing correction (Benjamini-Hochberg q-value). Default settings were used for most functions, except estimateDisp(robust = TRUE) and glmQLFit(robust = TRUE). The findings are summarised in a long table with each significant gene as a unique row and the results of the differential gene expression analysis as columns.
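To make the multiple testing correction concrete, here is a minimal sketch of the Benjamini-Hochberg step-up adjustment written with `sort` and `awk`. It is an illustration only: in the pipeline the adjustment is done inside R, and the function name `bh_adjust` and the one-p-value-per-line input format are assumptions made for this example.

```shell
# Hypothetical helper: Benjamini-Hochberg adjustment of p-values.
# Reads one p-value per line on stdin, prints "p q" pairs sorted by p.
bh_adjust() {
  LC_ALL=C sort -g | awk '
    { p[NR] = $1 }                      # p-values arrive in ascending order
    END {
      m = NR; prev = 1
      for (i = m; i >= 1; i--) {        # walk from the largest p-value down
        q[i] = p[i] * m / i             # raw BH value: p * m / rank
        if (q[i] > prev) q[i] = prev    # enforce monotone q-values
        prev = q[i]
      }
      for (i = 1; i <= m; i++) printf "%s %.4f\n", p[i], q[i]
    }'
}
```

For example, `printf '0.01\n0.04\n0.03\n0.5\n' | bh_adjust` gives q-values of 0.0400, 0.0533, 0.0533 and 0.5000, so only the first gene would pass an FDR < 0.05 threshold.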
Gene ontology analysis
To investigate whether any biological functions, processes or pathways are enriched (over-represented), the Over Representation Analysis (ORA) method (Boyle et al., 2004) is used. ORA uses the hypergeometric distribution and compares the differentially expressed genes with all genes in the dataset. The p-values are adjusted to q-values for multiple testing correction (significance threshold qvalue < 0.2).
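The hypergeometric test behind ORA can be sketched in a few lines of `awk`. For one gene set, it computes the probability of seeing at least k annotated genes among n differentially expressed genes, when K of the N genes in the dataset carry that annotation. This is a bare illustration of the statistic, not the implementation used by the R-packages in this pipeline; the function name `ora_pvalue` and its argument order are assumptions.

```shell
# Hypothetical helper: upper-tail hypergeometric p-value, P(X >= k).
# $1 = N genes in the dataset, $2 = K genes annotated to the term,
# $3 = n differentially expressed genes, $4 = k overlap between the two.
ora_pvalue() {
  awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
    # log(x!) by direct summation; adequate for a small illustration
    function lfact(x,   i, s) { s = 0; for (i = 2; i <= x; i++) s += log(i); return s }
    function lchoose(a, b) { return lfact(a) - lfact(b) - lfact(a - b) }
    BEGIN {
      hi = (K < n) ? K : n              # overlap cannot exceed K or n
      p = 0
      for (i = k; i <= hi; i++) {
        if (n - i > N - K) continue     # not enough unannotated genes
        p += exp(lchoose(K, i) + lchoose(N - K, n - i) - lchoose(N, n))
      }
      printf "%.6f\n", p
    }'
}
```

With a toy dataset of 10 genes, 4 of them annotated, and 3 differentially expressed genes of which 2 are annotated, `ora_pvalue 10 4 3 2` prints 0.333333, i.e. (C(4,2)·C(6,1) + C(4,3)·C(6,0)) / C(10,3) = 40/120.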
Enrichment is analysed in three databases: (1) Gene Ontology (GO), (2) Kyoto Encyclopedia of Genes and Genomes (KEGG), and (3) Reactome pathways. GO and KEGG enrichment are tested with the R-package clusterProfiler (Yu et al., 2012; Wu et al., 2021). The Reactome pathways are tested with the R-package ReactomePA (Yu et al., 2016).
#!/bin/bash -l
export NXF_HOME=".nextflow/"
nextflow pull andreyhgl/transcriptome-analysis
nextflow run andreyhgl/transcriptome-analysis \
--quant_path 'path-to-quant-files' \
--metadata 'path-to-metadata.csv' \
--tx2gene 'path-to-tx2gene.tsv' \
-profile local \
-resume
Singularity containers
For reproducibility, this pipeline uses two Singularity containers, which can be downloaded from the Cloud Library. The RNAseq container holds most of the R-packages used in the analysis, while the gene-ontology container holds the gene ontology related R-packages.
# apptainer (instead of singularity) also works
IMAGE1='library://andreyhgl/singularity-r/rnaseq'
IMAGE2='library://andreyhgl/singularity-r/gene-ontology'
singularity pull ${IMAGE1}
singularity pull ${IMAGE2}
To run scripts manually with the containers, use the exec command, or run the script interactively with shell.
# execute script (set IMAGE to one of the images pulled above)
singularity exec ${IMAGE} <scriptfile>
# run script interactively
singularity shell ${IMAGE}
$ Rscript <scriptfile>
The Nextflow pipeline produces the following:
- Ensembl database table containing gene annotations
- DGEList object
- Quality control plots: PCA, distance plots
- Differentially expressed genes table
- Gene ontology analysis
- Supplementary files (plots, Excel tables)
- Concatenated tables (for easy import into a results report)
Set up the metadata.csv to look like the following:
gen,id,treatment,...
F0,F0_1,0,...
F0,F0_2,0,...
F1,F1_3,10,...
F2,F2_4,100,...
Each row represents a sample, with the columns in the order: generation, sample id, and treatment/dose.
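A quick sanity check of the metadata file before a run can catch formatting slips early. The sketch below checks that the first three header fields are gen, id and treatment (as in the example above) and that no row leaves them empty. The duplicate-id check assumes sample ids are meant to be unique, which this README does not state, and `check_metadata` is a name chosen for this example rather than part of the pipeline.

```shell
# Hypothetical helper: basic validation of metadata.csv.
# Returns 0 if the file looks well-formed, non-zero otherwise.
check_metadata() {
  # $1 = path to metadata.csv
  awk -F',' '
    NR == 1 {
      # header must start with gen,id,treatment (extra columns are allowed)
      if ($1 != "gen" || $2 != "id" || $3 != "treatment") {
        print "unexpected header: " $0 > "/dev/stderr"; bad = 1
      }
      next
    }
    NF < 3 || $1 == "" || $2 == "" || $3 == "" {
      print "line " NR ": missing gen, id or treatment" > "/dev/stderr"; bad = 1
    }
    # assumption: each sample id should appear only once
    seen[$2]++ { print "line " NR ": duplicate id " $2 > "/dev/stderr"; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

Usage: `check_metadata 'path-to-metadata.csv' && echo "metadata looks fine"`. Any problems are reported on stderr with the offending line number.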
The pipeline accepts three parameters:
Experimental design:
- generations (F0, F1, F2, etc)
- doses (0, 10, 100, 1000, etc)
- genomic features (CpG-sites, Promoters, CpG-islands)