--- title: "MetaboDynamics: analyzing longitudinal metabolomics data with probabilistic models" package: MetaboDynamics author: "Katja Danielzik (katja.danielzik@uni-due.de)" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{1. MetaboDynamics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` # Data requirements - MetaboDynamics requires at least triplicates of metabolite abundances for all analyzed time points and conditions. No minimum number of metabolites is required. Time points are treated as categorical variables and have to be the same for all conditions. - Data can be a data frame (see Vignette "using MetaboDynamics with data frames) or a SummarizedExperiment object (demonstrated in this vignette). - Metabolite abundances have to be stored in column named "metabolite", time points in "time" and the experimental condition in "condition" - For clustering of metabolite abundance dynamics are at least two time points required - To conduct over-representation analysis on KEGG functional modules KEGG IDs of metabolites are required - To conduct comparison of metabolite abundance dynamics between clusters at least two experimental conditions are required - If an in vitro experiment is analyzed andc ell counts should be incorporated in model, cell counts have to be estimated (i.e. cell counter not counting every cell) and independent (i.e. separate wells for cell counts and metabolite abundances) as a poisson distribution is assumed. At least one replicate of cell counts per time point and condition is required. Metabolite abundances have to be adequately normalized and scaled. The expected scaling is explained in the following workflow. In the case of an in vitro experiment with available cell counts for all time points and conditions no scaling or normalization is necessary as this will be handled by the fit_dynamics_model() function (model="raw_plus_counts") # Analysis time: For two conditions of 100 metabolites each the fitting of the metabolite abundance dynamics model will take 10-15 minutes (one core per chain). Total workflow time is approximately 15-20 minutes. # Background Metabolomics data from longitudinal or time-course high-throughput experiments, such as untargeted liquid chromatography-mass spectrometry (LC-MS), allows for insights into changes of metabolite abundances over time. However, most existing tools are limited to comparing only two time points or experimental conditions, and they often rely on frequentist statistical methods. Additionally, most tool s are only able to analyze groups of metabolites on a pathway level, potentially missing more general patterns of changes in the metabolism in sparse data. Here, we introduce MetaboDynamics, a modular workflow of probabilistic models for the analysis of longitudinal metabolomics data. This package provides a flexible framework for analyzing longitudinal data, allowing users to model changes in metabolite abundances over time and identify patterns of interest. Metabolomics data is often noisy and limited by few replicates due to high costs. To address these challenges, we employ a Bayesian hierarchical model that balances between under-fitting and over-fitting. This model provides a robust method for estimating mean abundances at every time point as well as abundance differences between subsequent time points, even with limited replicates. 
# Analysis time

For two conditions of 100 metabolites each, fitting the metabolite abundance dynamics model takes 10-15 minutes (one core per chain). The total workflow time is approximately 15-20 minutes.

# Background

Metabolomics data from longitudinal or time-course high-throughput experiments, such as untargeted liquid chromatography-mass spectrometry (LC-MS), allows for insights into changes of metabolite abundances over time. However, most existing tools are limited to comparing only two time points or experimental conditions, and they often rely on frequentist statistical methods. Additionally, most tools can only analyze groups of metabolites on a pathway level, potentially missing more general patterns of changes in the metabolism in sparse data.

Here, we introduce MetaboDynamics, a modular workflow of probabilistic models for the analysis of longitudinal metabolomics data. This package provides a flexible framework for analyzing longitudinal data, allowing users to model changes in metabolite abundances over time and identify patterns of interest. Metabolomics data is often noisy and limited to few replicates due to high costs. To address these challenges, we employ a Bayesian hierarchical model that balances between under-fitting and over-fitting. This model provides a robust method for estimating mean abundances at every time point as well as abundance differences between subsequent time points, even with limited replicates. An additional output of the model is the difference between the dynamics of different experimental conditions.

The MetaboDynamics workflow requires abundance tables as input, which can include metabolites and their observed abundances, LC-MS peak intensities or areas. The workflow can handle any number of time points, and the dynamics model requires at least three replicates. However, functions that compare dynamics with each other do not require a certain number of replicates and can be applied to calculated mean abundances of e.g. two replicates.

In this vignette, I will demonstrate how to use the MetaboDynamics package to analyze longitudinal metabolomics data, including data preprocessing, model fitting, and interpretation of results. We will show how to apply the package to a simulated data set and how to visualize and interpret the results.

# Setup: load required packages

```{r setup}
library(MetaboDynamics)
library(SummarizedExperiment) # storing and manipulating simulated metabolomics data
library(ggplot2) # visualization
library(dplyr) # data handling
library(tidyr) # data handling
```

# Load data and plot data overview

The MetaboDynamics package includes a simulated data set called "longitudinalMetabolomics". This data set consists of 98 metabolites with three replicates at four time points (1-4) across two experimental conditions (A-B). The data set is represented as a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) object, where the abundance tables of each experimental condition are stored in assays (raw abundances in "concentrations", log-transformed abundances in "log_con" and scaled (mean = 0, sd = 1) log-transformed abundances in "scaled_log"), and the metabolite names, KEGG IDs, experimental conditions, and clustering solutions per experimental condition are stored in colData and rowData. If you have a data frame as input and do not want to convert it to a SummarizedExperiment object, please see the vignette "using MetaboDynamics with data frames" for guidance.

To load the data, execute the following code:

```{r}
data("longitudinalMetabolomics", package = "MetaboDynamics")
```

The next plot shows the raw metabolite abundances.

```{r,fig.wide=TRUE}
# convert to dataframe
abundances <- as.data.frame(cbind(
  as.data.frame(t(assays(longitudinalMetabolomics)$concentrations)),
  as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
  cols = seq_len(4),
  names_to = "time", values_to = "abundance"
)
ggplot(abundances, aes(x = abundance)) +
  geom_freqpoly() +
  theme_bw() +
  facet_grid(cols = vars(time), rows = vars(condition)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("raw data", "raw measurements")
```

In this plot I visualize the distribution of metabolite abundances. We can see many observations with low abundances and few observations with higher abundances. The raw data is not normally distributed, so let's log-transform the values. In the integrated simulated data set this is already done in the column "log_m".

```{r,fig.wide=TRUE}
ggplot(abundances, aes(x = abundance)) +
  geom_density() +
  scale_x_log10() +
  theme_bw() +
  facet_grid(cols = vars(time), rows = vars(condition)) +
  ggtitle("data", "log-transformed values")
```

We can see that the metabolite abundances vary over orders of magnitude and are normally distributed after log-transformation. The next plot shows the log-transformed dynamics of single metabolites.
```{r,fig.wide=TRUE}
ggplot(abundances) +
  geom_line(aes(
    x = time, y = abundance, col = metabolite,
    group = interaction(metabolite, replicate)
  )) +
  scale_y_log10() +
  theme_bw() +
  xlab("timepoint") +
  scale_color_viridis_d() +
  theme(legend.position = "none") +
  facet_grid(rows = vars(condition)) +
  ggtitle("raw metabolite dynamics", "color=metabolite")
```

The dynamics profiles of single metabolites have intercepts (i.e. mean metabolite abundances) that differ from each other by orders of magnitude. To be able to compare metabolites with each other, we define dynamics as deviations at the observed time points from the metabolite's mean abundance. For this we standardize each metabolite at each experimental condition to a mean of zero and a standard deviation of one. In the simulated data set the scaled measurements are in column "m_scaled".

```{r,fig.wide=TRUE}
data("longitudinalMetabolomics")
# convert to dataframe
abundances <- as.data.frame(cbind(
  as.data.frame(t(assays(longitudinalMetabolomics)$scaled_log)),
  as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
  cols = seq_len(4),
  names_to = "time", values_to = "abundance"
)
ggplot(abundances) +
  geom_line(aes(
    x = time, y = abundance, col = metabolite,
    group = interaction(metabolite, replicate)
  )) +
  theme_bw() +
  scale_color_viridis_d() +
  xlab("timepoint") +
  ylab("deviation from mean abundance") +
  facet_grid(rows = vars(condition)) +
  theme(legend.position = "none") +
  ggtitle("standardized dynamics", "color=metabolite")
```

# Model dynamics

To analyze the dynamics data, we use a Bayesian hierarchical model. This model accounts for the complexity and variability of the data and provides a flexible and robust framework for analysis. The model is defined as follows, with con = metabolite abundances, m = metabolite, c = experimental condition and t = time point ID:

\begin{align*}
\log(con_{m,c,t})&\sim {\sf normal}(\mu_{m,c,t},\sigma_{m,c,t}) \\
\mu_{m,c,t}&\sim {\sf normal}(0,2) \\
\sigma_{m,c,t}&\sim {\sf exponential}(\lambda_{m,c}) \\
\lambda_{m,c}&\sim {\sf exponential}(2)
\end{align*}

The model assumes a normal distribution of log-transformed and scaled metabolite abundances and estimates a mean $\mu$ per metabolite and time point. The spread $\sigma$ of the replicate measurements is also estimated per metabolite and time point, but the spreads of different time points inform each other through the hierarchical structure of the hyper-parameter $\lambda$. This partial pooling allows for a balance between over- and under-fitting.

If the untargeted LC-MS experiment is conducted in vitro and independent cell counts (i.e. cell counts were measured in a different well than metabolite abundances) are available, an adapted model can be run that takes the *raw* (i.e. not normalized and not scaled) metabolite abundances and the cell counts as input. The "raw_plus_counts" model is slower but has the advantage of additionally incorporating the uncertainty of the cell counts into the estimates of the Bayesian model.

The code below shows how to fit the Bayesian model with the function fit_dynamics_model(). For this vignette only, the number of chains is set to 1; to get reliable estimates at least four chains should be used. If more than one CPU core is available, chains can be run in parallel (parameter "cores"). Here we fit the dynamics model for a small subset of the simulated metabolites.
```{r,fig.wide=TRUE}
# we can hand a SummarizedExperiment object to the function
data("longitudinalMetabolomics")
# we only use a subsample of the simulated data (four metabolites in both
# conditions) for demonstration purposes
samples <- c("UMP", "PEP", "2-Aminomuconate", "ATP")
data <- longitudinalMetabolomics[, longitudinalMetabolomics$condition %in% c("A", "B") &
  longitudinalMetabolomics$metabolite %in% samples]
# fit model
data <- fit_dynamics_model(
  model = "scaled_log",
  data = data, scaled_measurement = "m_scaled",
  assay = "scaled_log", adapt_delta = 0.95, # default 0.95
  iter = 2000,
  cores = 1,
  chains = 1 # only set to 1 for vignette, use 4 chains
)
```

This returns a fit. If data is a SummarizedExperiment object, the fits are stored in metadata(data)[["dynamic_fit"]]. MetaboDynamics fits Bayesian models with MCMC (Markov chain Monte Carlo), which returns a list of diagnostic criteria:

- rhat = R-hat convergence diagnostic (should be 1)
- neff = number of effective samples (should be above 100)
- divergences = number of divergent transitions (should be 0)
- max_treedepth = number of iterations that exceed the maximum treedepth (should be 0)

Hints on how to adjust the parameters of fit_dynamics_model() can be found in the function documentation (type "?fit_dynamics_model" in the R console when MetaboDynamics is loaded).

With diagnostics_dynamics() we can extract all diagnostic criteria of the model fit and visualize them. If data is a SummarizedExperiment object, the diagnostics are stored in metadata(data)[["diagnostics_dynamics"]]. The diagnostic criteria can be visualized with the function plot_diagnostics().

```{r}
# extract diagnostics
data <- diagnostics_dynamics(
  data = data,
  iter = 2000, # number of iterations used for fitting the dynamic model
  chains = 1 # number of chains used for model fitting
)
plot_diagnostics(
  data = data
)
```

In this example we have zero divergent transitions, zero iterations exceeding the maximum tree depth, Rhat values consistently below 1.01, and numbers of effective samples comfortably above 100. All of this indicates that the fitting of the model went well.

The result of a Bayesian model (the posterior) should predict the experimental data well. This can be checked with a posterior predictive check (PPC), in which the experimental data (points) should lie within the predictions (violins).

```{r}
# PPCs can be accessed with
plot_PPC(
  data = data
)
```

The PPC plot shows that the experimental data points are consistent with the predicted distribution, suggesting that the model is a good fit to the data.

## Model outputs

After checking the diagnostic criteria and the PPC we can extract the estimates from the model fit with the estimates_dynamics() function. If data is a SummarizedExperiment object, the estimates are stored in metadata(data)[["estimates_dynamics"]].

```{r,fig.wide=TRUE}
# extract estimates
data <- estimates_dynamics(
  data = data
)
```

To visualize the estimates we can use the plot_estimates() function.

## Differences of metabolite abundances between time points

First we visualize the differences of metabolite abundances between subsequent time points (parameter "delta_t"). In the following plots the column label (e.g. "A") indicates the experimental condition and the row label indicates the compared time points (e.g. "timepoint 2 - timepoint 1"). The metabolites are ordered on the x-axis according to the estimated mean differences between time points, and the estimated abundance differences ("delta") are shown on the y-axis.
```{r,fig.wide=TRUE}
# 1) the differences between two timepoints
plots <- plot_estimates(
  data = data,
  delta_t = TRUE,
  dynamics = FALSE,
  distance_conditions = FALSE
)
plots$delta_t$`A_time_point_2-time_point_1`
plots$delta_t$`A_time_point_3-time_point_2`
```

The results of Bayesian models can be interpreted intuitively: the 95% credible interval (CrI) (95% highest density interval) indicates the range in which the parameter (in this case the difference in abundance between two time points) lies with 95% probability. That means that with 95% probability the true value lies within the range of the error bars. If the 95% CrI lies below zero, we can conclude that there is a decrease in abundance between the two subsequent time points (e.g. PEP between time point 1 and 2, red error bar). If the 95% CrI lies above zero, we can conclude that there is an increase in abundance between the two subsequent time points. If the 95% CrI contains zero, we cannot conclude that there is a difference in abundance between the two time points.

### Differences of dynamics between experimental conditions

If the parameter "distance_conditions" is set to TRUE, a ranked list of differences between the metabolite dynamics of different conditions is visualized.

```{r,fig.wide=TRUE}
# 2) the differences between experimental conditions
plot_estimates(
  data = data,
  delta_t = FALSE,
  dynamics = FALSE,
  distance_conditions = TRUE
)
```

In the case of our simulated data set the uncertainty (i.e. the CrI width) of the estimated differences between dynamics is rather high. Larger Euclidean distances indicate larger differences between dynamics, smaller Euclidean distances smaller differences.

### Dynamic profiles

The dynamic profile of each metabolite can be visualized using the estimated mean abundances at each time point (parameter "dynamics"). The plot below shows the estimated mean abundance at each time point for each metabolite.

```{r,fig.wide=TRUE}
# 3) dynamic profiles
plot_estimates(
  data = data,
  dynamics = TRUE,
  distance_conditions = FALSE,
  delta_t = FALSE
)
```

The dynamic profiles can be used to identify patterns and trends in the data, such as shifts in abundance over time, and to visualize the changes in abundance for each metabolite over the observed time interval.

The estimates of the dynamics model are returned by the function estimates_dynamics() and can be accessed as shown below. The estimates are ordered by condition.

```{r}
# get estimates
estimates <- metadata(data)[["estimates_dynamics"]]
# euclidean distances between the dynamics of the two conditions
estimates$euclidean_distances
```

To identify groups of metabolites that have similar dynamics we can now cluster the metabolite-specific dynamics vectors (estimates$mu) to determine which groups of metabolites have similar dynamics.

# Cluster dynamics

MetaboDynamics has a built-in wrapper function for clustering with the "hybrid" method of the package [dynamicTreeCut](https://cran.r-project.org/web/packages/dynamicTreeCut/index.html). Bootstrapping of the clustering solution is implemented with draws from the posterior distribution of mean metabolite abundances. Bootstrapping allows for the estimation of clustering reliability. From here on I will demonstrate the workflow on the whole simulated data set.

The code below shows how to use the cluster wrapper function cluster_dynamics() on a SummarizedExperiment object where the dynamics estimates are stored in metadata(data)[["estimates_dynamics"]].
```{r,fig.wide=TRUE}
data <- cluster_dynamics(data,
  deepSplit = 2, # low clustering sensitivity
  B = 10 # number of bootstraps
)
```

For the visualization of the clustering results we can use the function plot_cluster().

```{r}
plots <- plot_cluster(data)
```

The function returns three outputs:

1) phylogenetic trees ([ggtree](https://www.bioconductor.org/packages/release/bioc/html/ggtree.html)) (dendrograms) with numbers on the nodes indicating the number of bootstraps with the same clustering solution. Similar visualizations have been implemented in [scbubbletree](https://bioconductor.org/packages/release/bioc/html/scBubbletree.html) and [cellmig](https://github.com/snaketron/cellmig). To plot the phylogenetic tree of condition B run:

```{r}
plots$trees$B
```

In half of the 10 bootstraps 2-Aminomuconate and UMP are clustered together, but ATP and PEP are clustered together in only 1 out of 10 bootstraps.

2) lineplots of the clustered metabolite dynamics

```{r}
plots$lineplots$B
```

and 3) a [patchwork](https://cran.r-project.org/web/packages/patchwork/index.html) plot of the phylogenetic tree, the cluster affiliations and the metabolite dynamics.

```{r}
plots$patchwork$B
```

# Over-representation analysis of functional modules in dynamics clusters

To assess the possible biological function of these dynamics clusters, we retrieved the following information from the KEGG database: 1. to which functional modules our experimental metabolites are annotated and 2. which metabolites are annotated to functional modules in general. The functional modules of the KEGG database are organised in three hierarchies: upper, middle and lower. The middle hierarchy holds modules such as "Amino acid metabolism", the lower hierarchy for example "branched-chain amino acid metabolism".

To facilitate the analysis, two data frames were prepared: "metabolite_modules", which holds the information about the metabolites of the simulated data set, and "modules_compounds", which holds the information about which metabolites are in general annotated to functional modules. To construct the needed annotation data frame for your own experimental data set, simply filter the "modules_compounds" data set for your experimental metabolites' KEGG IDs.

We load both data sets and can inspect the documentation.

```{r}
data("metabolite_modules")
help("metabolite_modules")
head(metabolite_modules)
data("modules_compounds")
help("modules_compounds")
head(modules_compounds)
```

It is important to note that not all KEGG modules are suitable for testing in every observed organism and experimental condition. For example, the modules "Xenobiotics biodegradation", "Biosynthesis of other secondary metabolites", and "Biosynthesis of terpenoids and polyketides" should not be analyzed in a human lung cancer cell line.

For the ORA we additionally need a data frame annotating metabolites to KEGG IDs. For the simulated example data set this annotation is provided in data("IDs").

```{r}
data("IDs")
head(IDs)
```

For the functional analysis we employ a hypergeometric model. We consider a functional module as over-represented in a cluster if the 95% inter-quantile range (IQR) of the log-transformed probabilities of OvE (observed vs expected) lies above zero. OvE refers to the ratio of the number of observed metabolites in a cluster that map to a functional module over the number of metabolites in a cluster expected to map to that module under the assumption of a hypergeometric distribution (i.e. drawing without replacement).
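As a toy illustration of the OvE idea only (this is not the internal implementation of ORA_hypergeometric(), and all counts are made up rather than taken from the simulated data), the expected overlap and a simple spread of log(OvE) under a hypergeometric draw could be sketched like this:

```{r}
# toy numbers, purely illustrative
bg_total <- 100 # annotated metabolites in the background
bg_in_module <- 20 # background metabolites annotated to the module of interest
cluster_size <- 10 # metabolites in the dynamics cluster
observed <- 5 # cluster metabolites annotated to the module

# expected overlap when drawing cluster_size metabolites without replacement
expected <- cluster_size * bg_in_module / bg_total
log(observed / expected) # log(OvE): values above zero suggest over-representation

# spread of the overlap under the hypergeometric assumption
draws <- rhyper(nn = 1000, m = bg_in_module, n = bg_total - bg_in_module, k = cluster_size)
quantile(log((draws + 1) / (expected + 1)), probs = c(0.025, 0.975)) # +1 avoids log(0)
```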
If data is a SummarizedExperiment object where the clustering solution is stored in metadata(data)[["cluster"]], the ORA results are stored in metadata(data)[["ORA_tested_column"]]. We apply the functional analysis to the middle hierarchy of functional modules.

```{r,fig.wide=TRUE}
data <- ORA_hypergeometric(
  background = modules_compounds,
  annotations = metabolite_modules,
  IDs = IDs,
  data = data,
  tested_column = "middle_hierarchy"
)
plot_ORA(data = data)
```

Great, we can now see which functional modules are over- (green points and error bars) or under-represented (none in this example) in which dynamics cluster! For instance, in cluster 1 of condition A the module "Nucleotide metabolism" is over-represented.

If we want to attach the over-representation analysis to the previously generated patchwork plot of the cluster dynamics, we need to store the clustering plots and set the argument "patchwork" of plot_ORA() to TRUE.

```{r}
# get cluster plots
cluster <- plot_cluster(data = data)
# set patchwork to TRUE and provide cluster plots
ora <- plot_ORA(
  data = data, tested_column = "middle_hierarchy",
  patchwork = TRUE, plot_cluster = cluster
)
```

To visualize the dynamics and ORA results run:

```{r,fig.width=12,fig.height=6}
# patch plots together
library(patchwork)
p <- cluster$patchwork$A | ora$ora_patchwork$A
p + plot_layout(ncol = 4, widths = c(2, 1, 2, 2))
```

# Comparison of clusters of different experimental conditions

## Dynamics

We can not only conduct over-representation analysis of KEGG functional modules but also compare dynamics clusters across different experimental conditions. For this we employ a Bayesian model that estimates the mean difference as well as the standard deviation of differences between dynamics clusters. If data is a SummarizedExperiment object, the results of compare_dynamics() are stored in metadata(data)[["comparison_dynamics"]]. The model is defined as follows, with dist = vector of pairwise Euclidean distances between every dynamics vector of cluster a and every dynamics vector of cluster b, and ID = cluster pair ID:

\begin{align*}
dist_{ID}&\sim {\sf normal}(\mu_{ID},\sigma_{ID}) \\
\mu_{ID}&\sim {\sf normal^+}(0,2) \\
\sigma_{ID}&\sim {\sf exponential}(1)
\end{align*}

```{r,fig.align='center',fig.dpi=150}
data("longitudinalMetabolomics")
longitudinalMetabolomics <- compare_dynamics(
  data = longitudinalMetabolomics,
  cores = 1 # just set to 1 for vignette, set to 4 if possible
)
```

We can visualize the posterior estimates of the model with:

```{r,fig.align='center',fig.width=8,fig.height=6}
heatmap_dynamics(data = longitudinalMetabolomics)
```

This results in a bubble heat map in which each cluster of every condition (e.g. A_2 = cluster 2 of condition A) is compared with all clusters of every other condition. The comparison of clusters is reciprocal and therefore each comparison pair is only visualized once. The bigger and brighter a point, the smaller the mean distance between the dynamics clusters and the smaller the standard deviation. That means that big bright points indicate high similarity of dynamics with small spread. Here A_7 and B_8 have highly similar dynamics.

## Metabolites

We can also compare the metabolite composition of clusters. For this we employ the Jaccard index, a metric for the similarity of two sets. Values near 0 indicate low similarity, values near 1 high similarity. If data is a SummarizedExperiment object, the results of compare_metabolites() are stored in metadata(data)[["comparison_metabolites"]].
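As a quick illustration of the Jaccard index itself (toy metabolite sets chosen for demonstration, not output of compare_metabolites()):

```{r}
# toy metabolite sets, purely illustrative
cluster_a <- c("ATP", "UMP", "PEP")
cluster_b <- c("ATP", "UMP", "2-Aminomuconate", "L-Alanine")

# Jaccard index = size of the intersection / size of the union
length(intersect(cluster_a, cluster_b)) / length(union(cluster_a, cluster_b))
```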
```{r,fig.align='center',fig.width=8,fig.height=6}
longitudinalMetabolomics <- compare_metabolites(
  data = longitudinalMetabolomics
)
heatmap_metabolites(data = longitudinalMetabolomics)
```

The brighter a tile, the higher the similarity of two clusters in regard to their metabolite composition. Comparisons of clusters within one condition always have a Jaccard index of zero here, as every metabolite is assigned to only one cluster per condition; hence there is no similarity between clusters of the same experimental condition. We have two clusters that are similar in their metabolite composition: A_12 and B_8. If we compare that to the dynamics profiles and the ORA analysis, we see that similar functional modules are over-represented, as expected, BUT the dynamics differ between the two experimental conditions. Can we facilitate visualization of the combination of both measures?

## Combine both

```{r,fig.align='center',fig.height=6,fig.width=8}
dynamics <- metadata(longitudinalMetabolomics)[["comparison_dynamics"]][["estimates"]]
metabolites <- metadata(longitudinalMetabolomics)[["comparison_metabolites"]]
temp <- left_join(dynamics, metabolites, join_by("cluster_a", "cluster_b"))
x <- unique(c(temp[, "cluster_a"], temp[, "cluster_b"]))
temp <- temp %>% mutate(scale_Jaccard = scale(Jaccard))
ggplot(temp, aes(x = cluster_b, y = cluster_a)) +
  geom_point(aes(size = Jaccard, col = mu_mean)) +
  theme_bw() +
  scale_color_viridis_c(option = "magma") +
  scale_x_discrete(limits = x) +
  xlab("") +
  ylab("") +
  scale_y_discrete(limits = x) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(col = "dynamics distance", size = "metabolite similarity") +
  ggtitle("comparison of clusters", "label = condition + cluster ID")
```

With this form of visualization we can easily spot clusters in which the experimental condition (e.g. kind or dose of treatment) has the biggest effect: these clusters are similar in regard to their composing metabolites (i.e. big points) but dissimilar (i.e. bright color) in regard to their dynamics. Here this is for example the cluster pair A_11 and B_12 (not a big point, but bright color). Clusters of metabolites that are not much affected by the experimental condition can be spotted as big (i.e. high metabolite similarity) dark (i.e. low dissimilarity in dynamics) points, for example the cluster pair A_6 and B_9.

```{r}
sessionInfo()
```