andreyhgl/transcriptome-analysis

This repository holds a Nextflow pipeline for analysing gene expression studies. The experimental design is built on multiple generations (F0, F1, F2, etc.) and doses (0, 10, 100, 1000, etc.). The pipeline expects quantification files (quant.sf) generated with Salmon as input and outputs tables of (1) differentially expressed genes and (2) gene ontology analysis results. Each table is saved as an Rds file that contains all generations and doses in a single file. The pipeline also outputs the tables as Excel files to be included as supplementary tables in a scientific report.

Note

This pipeline is set up for the mouse genome (GRCm39) by default.

Quantification files

To generate quantification files (quant.sf) from sequencing files, refer to the nf-core/rnaseq pipeline.
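
Before starting a run it can help to confirm that the quantification files are where the pipeline will look for them. The sketch below is not part of the pipeline: the function name `list_quant_files` is hypothetical, and it assumes one sub-directory per sample, each holding a quant.sf file, which is how Salmon usually writes its output.

```shell
#!/bin/bash
# Hypothetical helper, not part of the pipeline.
# Assumption: one sub-directory per sample, each containing a quant.sf file.
list_quant_files() {
  local quant_path="$1"
  local files
  # Collect every quant.sf below the given directory, in a stable order.
  files="$(find "$quant_path" -type f -name 'quant.sf' | sort)"
  if [ -z "$files" ]; then
    echo "no quant.sf files found under ${quant_path}" >&2
    return 1
  fi
  printf '%s\n' "$files"
}
```

Calling `list_quant_files 'path-to-quant-files'` prints one path per sample, so the number of lines should match the number of rows in the metadata.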

Differential expression

Differential expression was analysed with the R-package edgeR, which models counts with negative binomial distributions and fits generalized linear models. P-values were corrected for multiple testing with the Benjamini-Hochberg method, and genes with FDR < 0.05 were considered significant. Default settings were used for most of the functions, except estimateDisp(robust = TRUE) and glmQLFit(robust = TRUE). The findings are summarised in a long table with one row per significant gene and the results of the differential gene expression analysis as columns.
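
For reference, the Benjamini-Hochberg adjustment named above can be written out as follows. This is the standard textbook definition rather than anything taken from the pipeline code, and the symbols (m for the number of tested genes, p_(i) for the i-th smallest p-value) are notation introduced here.

```latex
% m tested genes, p-values sorted so that p_{(1)} \le \dots \le p_{(m)}
\tilde{p}_{(i)} = \min_{j \ge i} \, \min\!\left(1, \frac{m}{j}\, p_{(j)}\right),
\qquad \text{gene } (i) \text{ is called significant if } \tilde{p}_{(i)} < 0.05
```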

Gene ontology analysis

To investigate whether any biological functions, processes or pathways are enriched (over-represented), the Over Representation Analysis (ORA) method (Boyle et al., 2004) is used. ORA uses the hypergeometric distribution and compares the differentially expressed genes with all genes in the dataset. The p-values are adjusted to q-values for multiple testing correction (significance threshold q-value < 0.2).
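
The hypergeometric test behind ORA can be sketched for a single term as below. The symbols are notation introduced here, not names used by the pipeline: N is the number of genes in the dataset (the background), K the number of background genes annotated to the term, n the number of genes in the list of interest (here the differentially expressed genes), and k the number of those that are annotated to the term.

```latex
% Probability of seeing at least k annotated genes among n drawn genes by chance
P(X \ge k) = \sum_{i=k}^{\min(n, K)}
  \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
```

A small P(X ≥ k) means the term is annotated to more of the listed genes than random sampling from the background would explain.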

Enrichment is analysed in three databases: (1) Gene Ontology (GO), (2) Kyoto Encyclopedia of Genes and Genomes (KEGG), and (3) Reactome pathways. GO and KEGG enrichment are tested with the R-package clusterProfiler (Yu et al., 2012; Wu et al., 2021). The Reactome pathways are tested with the R-package ReactomePA (Yu et al., 2016).

Reproducibility

Run the pipeline

#!/bin/bash -l

export NXF_HOME=".nextflow/"

nextflow pull andreyhgl/transcriptome-analysis

nextflow run andreyhgl/transcriptome-analysis \
  --quant_path 'path-to-quant-files' \
  --metadata 'path-to-metadata.csv' \
  --tx2gene 'path-to-tx2gene.tsv' \
  -profile local \
  -resume

Singularity containers

For reproducibility, this pipeline uses two Singularity containers, which can be downloaded from the Cloud Library. The RNAseq container holds most of the R-packages used in the analysis, while the gene-ontology container holds the R-packages related to gene ontology.

# apptainer (instead of singularity) also works

IMAGE1='library://andreyhgl/singularity-r/rnaseq'
IMAGE2='library://andreyhgl/singularity-r/gene-ontology'

singularity pull ${IMAGE1}
singularity pull ${IMAGE2}

To run scripts manually with the containers, use the exec subcommand, or run the script interactively with shell.

# execute script (IMAGE is one of the images above, IMAGE1 or IMAGE2)
singularity exec ${IMAGE} Rscript <scriptfile>

# run script interactively
singularity shell ${IMAGE}
$ Rscript <scriptfile>

The pipeline

The Nextflow pipeline produces the following:

  • Ensembl database table containing gene annotations
  • DGEList object
  • Quality control plots: PCA, distance plots
  • Differentially expressed genes table
  • Gene ontology analysis
  • Supplementary files (plots, excel-tables)
  • Concatenated tables (for easy import into the results report)

Preparation

Metadata

Set up the metadata.csv to look like the following:

gen,id,treatment,...
F0,F0_1,0,...
F0,F0_2,0,...
F1,F1_3,10,...
F2,F2_4,100,...

Each row represents a sample in the column order: generation, sample id and treatment/dose.
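
A quick sanity check of the metadata before a run could look like the sketch below. It is not part of the pipeline: the function name `validate_metadata` is hypothetical, and the rules it enforces (the first three header fields are gen, id and treatment; no empty values in those columns; sample ids are unique) are assumptions drawn from the example above.

```shell
#!/bin/bash
# Hypothetical helper, not part of the pipeline.
# Assumptions: header starts with gen,id,treatment and sample ids are unique.
validate_metadata() {
  local metadata="$1"
  awk -F',' '
    NR == 1 {
      # The first three columns must be in the documented order.
      if ($1 != "gen" || $2 != "id" || $3 != "treatment") {
        print "unexpected header: " $0 > "/dev/stderr"; bad = 1
      }
      next
    }
    /^[[:space:]]*$/ { next }   # ignore blank lines
    NF < 3 || $1 == "" || $2 == "" || $3 == "" {
      print "line " NR ": missing generation, id or treatment" > "/dev/stderr"; bad = 1
    }
    seen[$2]++ {
      # True from the second time a sample id is seen.
      print "line " NR ": duplicate sample id " $2 > "/dev/stderr"; bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$metadata"
}
```

Running `validate_metadata 'path-to-metadata.csv'` returns a non-zero exit status and prints the offending lines to stderr when a rule is violated, so it can be placed in front of the `nextflow run` command in a job script.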

Parameters

The pipeline accepts three parameters:

Experimental design:

  • generations (F0, F1, F2, etc)
  • doses (0, 10, 100, 1000, etc)
  • genomic features (CpG-sites, Promoters, CpG-islands)
