
A generalised data structure for fast and efficient loading and data munching of sparse omics data.

agusinac/OmicFlow

OmicFlow

Installation


The latest stable version can be installed from CRAN.

install.packages('OmicFlow', dependencies = TRUE)

The development version is available on GitHub.

install.packages('pak') # if not yet installed
pak::pak('agusinac/OmicFlow')

📋 Metadata File Specification

OmicFlow expects your sample metadata to follow a simple but strict structure, so that all datasets are compatible and validated up front. Sample metadata can be supplied as a CSV/TSV file or as a data.table in R. In both cases it must contain a header (the first line if you supply a file), and each row describes one sample. Additional columns not mentioned here are allowed and are ignored during the metadata validation step.


Minimum requirement

  • SAMPLE_ID ➡ every row must have a unique, non‑empty sample identifier.
  • No spaces are allowed in IDs — use underscores _ or dashes - instead.

Example:

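| SAMPLE_ID | SAMPLEPAIR_ID | CONTRAST_Treatment | VARIABLE_Age |
|-----------|---------------|--------------------|--------------|
| S1        | P1            | Drug               | 42           |
| S2        | P1            | Placebo            | 36           |
| S3        | P2            | Drug               | 51           |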
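
The minimum requirements above can be checked before a metadata table is handed to OmicFlow. The following is a minimal base-R sketch and not OmicFlow's own validation step; `check_sample_ids` is a hypothetical helper introduced here for illustration only.

```r
# Hypothetical helper (not part of OmicFlow): returns a character vector
# describing every violated SAMPLE_ID rule, or character(0) if all pass.
check_sample_ids <- function(meta) {
  stopifnot("SAMPLE_ID" %in% names(meta))
  ids <- as.character(meta[["SAMPLE_ID"]])
  problems <- character(0)
  if (any(is.na(ids) | ids == "")) problems <- c(problems, "empty SAMPLE_ID")
  if (any(duplicated(ids)))        problems <- c(problems, "duplicated SAMPLE_ID")
  if (any(grepl("\\s", ids)))      problems <- c(problems, "whitespace in SAMPLE_ID")
  problems
}

# The example table from this section
meta <- data.frame(
  SAMPLE_ID          = c("S1", "S2", "S3"),
  SAMPLEPAIR_ID      = c("P1", "P1", "P2"),
  CONTRAST_Treatment = c("Drug", "Placebo", "Drug"),
  VARIABLE_Age       = c(42, 36, 51)
)

check_sample_ids(meta)
```

An empty result means the table satisfies the three SAMPLE_ID rules; any other column-level checks that OmicFlow performs are not covered by this sketch.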

Column types and naming rules

🔹 Required column

| Column    | Type   | Rules                                 |
|-----------|--------|---------------------------------------|
| SAMPLE_ID | string | Unique, no spaces, one per sample row |

🔹 Optional standard columns

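| Column        | Type   | Rules                                                                                          |
|---------------|--------|------------------------------------------------------------------------------------------------|
| FEATURE_ID    | string | Optional. No spaces. Feature identifiers, used to include or exclude certain features          |
| SAMPLEPAIR_ID | string | Optional. No spaces. Use when samples are paired and belong to the same individual source/subject |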

🔹 Pattern‑based columns

You can define extra variables using special prefixes:

  • CONTRAST_... → grouping/category labels used in differential comparisons
    Example: CONTRAST_Treatment with values Drug / Placebo
  • VARIABLE_... → numeric or string variables for statistical analysis
    Example: VARIABLE_Age with values 42, 51, etc.
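
To see which columns of a metadata table fall under each prefix, the names can be matched with a regular expression. This is a small base-R sketch under the naming rules listed above; it does not reproduce how OmicFlow itself detects these columns, and the `Notes` column is a made-up example of an extra column.

```r
meta <- data.frame(
  SAMPLE_ID          = c("S1", "S2", "S3"),
  CONTRAST_Treatment = c("Drug", "Placebo", "Drug"),
  VARIABLE_Age       = c(42, 36, 51),
  Notes              = c("a", "b", "c")  # extra column without a recognised prefix
)

# Select column names by prefix
contrast_cols <- grep("^CONTRAST_", names(meta), value = TRUE)
variable_cols <- grep("^VARIABLE_", names(meta), value = TRUE)

# List the group labels available in each contrast column
lapply(meta[contrast_cols], unique)
```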

Pattern-based columns are only used by the autoFlow function, and at the moment only columns with the CONTRAST_ prefix are supported.

Example (writes a report.html file to the current working directory):

taxa$autoFlow(
    normalize = FALSE,
    weighted = TRUE,
    pvalue.threshold = 0.05
)

Usage

Note

Make sure your metadata meets the requirements!


Only the metagenomics class supports BIOM files, in both HDF5 (version 2) and JSON format, passed via biomData. The proteomics class only supports countData and featureData. treeData is optional in both omics sub-classes; when supplied, the rows of both countData and featureData are aligned to the tree tip labels.

library("OmicFlow")

metadata_file <- system.file("extdata", "metadata.tsv", package = "OmicFlow")
counts_file <- system.file("extdata", "counts.tsv", package = "OmicFlow")
features_file <- system.file("extdata", "features.tsv", package = "OmicFlow")
tree_file <- system.file("extdata", "tree.newick", package = "OmicFlow")

taxa <- metagenomics$new(
    metaData = metadata_file,
    countData = counts_file,
    featureData = features_file,
    treeData = tree_file
)

taxa$feature_subset(Kingdom == "Bacteria")
taxa$normalize()

# Access variables directly
taxa$metaData
taxa$countData
taxa$featureData
taxa$treeData

# Inspect which functions and variables are available on the object
str(taxa)
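
The alignment of countData and featureData to the tree tip labels, mentioned at the start of this section, can be pictured as a simple reordering of rows. The sketch below uses base R only and toy data; the tip labels, the `FEATURE_ID` column in the feature table and the object names are assumptions made for illustration, and OmicFlow's internal implementation may differ.

```r
# Hypothetical tip labels, in the order they appear in the tree
tip_labels <- c("OTU3", "OTU1", "OTU2")

# Toy count matrix (features x samples) and feature table
counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2"))
)
features <- data.frame(
  FEATURE_ID = c("OTU1", "OTU2", "OTU3"),
  Genus      = c("A", "B", "C")
)

# Reorder both tables so that row i corresponds to tip label i
counts_aligned   <- counts[match(tip_labels, rownames(counts)), , drop = FALSE]
features_aligned <- features[match(tip_labels, features$FEATURE_ID), ]

rownames(counts_aligned)
features_aligned$FEATURE_ID
```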

Visualisations


Note

By default, all visualizations use color-blind-friendly palettes!

🔹Alpha diversity

alpha_div <- taxa$alpha_diversity(
    col_name = "treatment",
    metric = "shannon",
    paired = FALSE # If TRUE, a Wilcoxon signed-rank test is performed
)

alpha_div$plot
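
For orientation, the Shannon index and the paired test mentioned in the comment above can be reproduced on toy data with base R. This sketch assumes the common natural-log definition H = -sum(p * log(p)); whether OmicFlow uses the same log base is not stated here. `shannon` is a hypothetical helper and the numbers are invented.

```r
# Shannon index of one sample from a vector of feature counts
shannon <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # relative abundances, zeros dropped
  -sum(p * log(p))
}

# Four equally abundant features give the maximum for four features, log(4)
shannon(c(10, 10, 10, 10))

# Paired comparison of per-sample diversity values (toy numbers)
before <- c(1.2, 1.8, 2.1, 1.5, 2.4)
after  <- c(1.6, 2.0, 2.9, 1.6, 3.1)
paired_p <- wilcox.test(before, after, paired = TRUE)$p.value
paired_p
```

With `paired = FALSE`, `wilcox.test` performs the unpaired rank-sum test instead, which ignores the link between the two measurements of each subject.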

🔹Beta diversity

By default, PERMANOVA is applied pairwise between the groups of the contrast specified via group_by, which is passed on to pairwise_adonis. The permutation design in vegan::adonis2 is free by default. This is not always the right test, for example when you have paired samples or want to restrict permutations between different sites or genders. pairwise_adonis therefore supports a custom permutation design: construct it with the permute package and supply it as a function via the perm_design argument, which is passed on to vegan::adonis2. See the examples below.
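
To picture what a restricted design changes: a free permutation shuffles labels across all samples, whereas a design restricted by pair only shuffles samples inside their own pair. The following base-R sketch illustrates that idea on toy data without using permute or vegan; `permute_within` is a hypothetical helper and the pair and treatment values are invented.

```r
set.seed(1970)

pair_id   <- c("P1", "P1", "P2", "P2", "P3", "P3")  # hypothetical SAMPLEPAIR_ID values
treatment <- c("Drug", "Placebo", "Drug", "Placebo", "Drug", "Placebo")

# Shuffle row indices within each block only; rows never leave their block
permute_within <- function(blocks) {
  idx <- seq_along(blocks)
  for (b in unique(blocks)) {
    members <- idx[blocks == b]
    idx[blocks == b] <- members[sample.int(length(members))]
  }
  idx
}

perm <- permute_within(pair_id)
data.frame(pair_id, original = treatment, permuted = treatment[perm])
```

Every permuted label still comes from the same pair, which is the property a paired design needs to preserve.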

set.seed(1970)

# Perform ordinations with in-built distance matrix computation
#--------------------------------------------------------------------------------
beta_div <- taxa$ordination(
    metric = "unifrac",
    method = "pcoa",
    group_by = "treatment",
    perm = 999
)

# Add a custom pre-computed distance matrix
#--------------------------------------------------------------------------------
qiime_unifrac <- data.table::fread("weighted-unifrac-matrix.tsv", header=TRUE)
distmat <- Matrix::Matrix(as.matrix(qiime_unifrac[, .SD, .SDcols = !c("V1")]))
rownames(distmat) <- colnames(distmat)
distmat <- distmat[taxa$metaData[["SAMPLE_ID"]], taxa$metaData[["SAMPLE_ID"]]]
distmat <- as.dist(distmat) 

beta_div <- taxa$ordination(
    distmat = distmat,
    method = "pcoa",
    group_by = "treatment",
    perm = 999
)

# Add a custom permutation design via `perm_design`
#--------------------------------------------------------------------------------
## taxa$ordination() automatically passes taxa$metaData to the supplied function.
perm_design_func <- function(meta) {
  base::with(
    data = meta,
    expr = permute::how(
      nperm = 999,
      plots = permute::Plots(meta$SAMPLEPAIR_ID, type = "none"), # Only if SAMPLEPAIR_ID is supplied in the metadata
      within = permute::Within(type = "free")
    )
  )
}

beta_div <- taxa$ordination(
    metric = "unifrac",
    method = "pcoa",
    group_by = "treatment",
    perm_design = perm_design_func
)

patchwork::wrap_plots(
    beta_div[c("scree_plot", "anova_plot", "scores_plot")],
    nrow = 1)
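
As background for `method = "pcoa"`: principal coordinates analysis is classical multidimensional scaling of a distance matrix, which base R provides as `stats::cmdscale`. The sketch below is not OmicFlow's implementation; it uses random toy counts and Euclidean distances as a stand-in for UniFrac, and all object names are assumptions.

```r
set.seed(1970)

# Toy count table: 8 samples x 5 features
counts <- matrix(
  rpois(40, lambda = 20), nrow = 8,
  dimnames = list(paste0("S", 1:8), paste0("OTU", 1:5))
)

# Stand-in distance matrix (Euclidean rather than UniFrac)
d <- dist(counts)

# PCoA: keep two axes and return the eigenvalues
pcoa <- cmdscale(d, k = 2, eig = TRUE)

# Share of variation per axis, based on the positive eigenvalues
pos_eig   <- pcoa$eig[pcoa$eig > 0]
explained <- pos_eig / sum(pos_eig)

head(pcoa$points)
round(100 * explained[1:2], 1)
```

The per-axis shares are the kind of quantity a scree plot displays, and the two columns of `pcoa$points` are the sample coordinates shown in a scores plot.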

🔹Composition

res <- taxa$composition(
    feature_rank = "Genus",
    feature_filter = c("uncultured"),
    feature_top = 15,
    normalize = FALSE,
    col_name = "CONTRAST_sex"
)

composition_plot(
    data = res$data,
    palette = res$palette,
    feature_rank = "Genus",
    # If group_by = NULL, then a stacked barplot for each sample sorted alphabetically will be visualized.
    group_by = "CONTRAST_sex"
    )
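
The idea behind `feature_top` can be illustrated on a toy table: convert counts to relative abundance per sample, keep the most abundant features and collapse the rest. This is a base-R sketch under that assumption, not the code behind `taxa$composition()`; the "Other" row and the ranking by mean relative abundance are choices made here for illustration.

```r
# Toy count table (features x samples); columns are filled top to bottom
counts <- matrix(
  c(50, 30, 15,  5,
    20, 60, 10, 10,
    40, 40, 10, 10),
  nrow = 4,
  dimnames = list(c("GenusA", "GenusB", "GenusC", "GenusD"), c("S1", "S2", "S3"))
)

# Relative abundance in percent, per sample
rel <- sweep(counts, 2, colSums(counts), "/") * 100

# Keep the two features with the highest mean relative abundance
top <- names(sort(rowMeans(rel), decreasing = TRUE))[1:2]

# Collapse all remaining features into a single "Other" row
collapsed <- rbind(
  rel[top, , drop = FALSE],
  Other = colSums(rel[!rownames(rel) %in% top, , drop = FALSE])
)
collapsed
```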

🔹Volcano plot

The volcano_plot shows the average percentage abundance of each Genus in the two contrasts. The pvalue.threshold, foldchange.threshold and abundance.threshold parameters can be used to keep only the relevant bacteria. The returned p-values can be adjusted and used for a new volcano plot via OmicFlow::volcano_plot.

res <- taxa$DFE(
    feature_rank = "Genus",
    feature_filter = c("uncultured"),
    paired = FALSE,
    normalize = FALSE,
    condition.group = "CONTRAST_sex",
    condition_A = "male",
    condition_B = "female"
)

res$volcano_plot
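
Adjusting p-values, as mentioned above, can be done with base R's `p.adjust`. The sketch below works on invented numbers and invented column names; it does not reflect the structure of the object returned by `taxa$DFE()`, and the cut-offs of 0.05 and 1 are example values only.

```r
# Toy mean abundances (percent) per genus in two conditions, with raw p-values
mean_a <- c(GenusA = 12.0, GenusB = 3.0, GenusC = 8.0, GenusD = 0.5)
mean_b <- c(GenusA = 3.0,  GenusB = 3.1, GenusC = 2.0, GenusD = 0.4)
p_raw  <- c(0.001, 0.50, 0.01, 0.03)

res_tab <- data.frame(
  feature = names(mean_a),
  log2FC  = log2(mean_a / mean_b),              # x-axis of a volcano plot
  p_adj   = p.adjust(p_raw, method = "BH")      # Benjamini-Hochberg adjustment
)

# Flag features passing both example thresholds
res_tab$significant <- res_tab$p_adj < 0.05 & abs(res_tab$log2FC) > 1
res_tab
```

GenusD shows why both thresholds matter: its adjusted p-value is below 0.05, but its fold change is too small to be flagged.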

Run OmicFlow and the standalone autoFlow script with Docker!

Note

Symbolic links do not work with mounted directories. Copy the original file into the mounted directory instead!

Example (writes a report.html file to the current working directory):

docker pull agusinac/autoflow:1.4.0

# Mount the current directory as /data, use it as the working directory,
# and run the autoflow R script as the current (non-root) user.
# Comments must not follow a trailing backslash, so they are kept up here.
docker run -it --rm \
    -v "$(pwd)":/data \
    -w /data \
    -u "$(id -u):$(id -g)" \
    agusinac/autoflow:1.4.0 \
    autoflow \
    -b /data/biom_with_taxonomy_hdf5.biom \
    -m /data/metadata.tsv

Support

If you are having issues, please create a ticket
