--- title: "Association testing" output: BiocStyle::html_document: toc: true toc_float: true toc_depth: 2 number_sections: true vignette: > %\VignetteIndexEntry{3. Differential associations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r setup, include = FALSE} knitr::opts_chunk$set( echo = TRUE, tidy.opts = list(width.cutoff = 80), tidy = TRUE ) # long links adj_link <- paste0( "https://thomazbastiaanssen.github.io/anansi/", "articles/adjacency_matrices.html" ) library(S7) ``` # Overview {-} The goal of `anansi` is to provide a framework to interrogate interactions between different 'omics datasets, or more generally, between data modalities. More concretely, this involves estimating the association between pairs of features in several contexts. Anansi uses a (bi)adjacency matrix to determine which pars of features to investigate. For more on the [adjacency matrix, see the vignette]`r sprintf("(%s)", adj_link)`. Here, we will discuss the different statistical models used to estimate the associations between feature-pairs. Anansi aims to follow the style of `stats::lm()` and this vignette provides equivalent models in `stats` as theoretical familiar ground. For these reasons, I warmly recommend the documentation of `stats` for further reading. This vignette is divided by three sections: * [Section 1](#sec-da-1) introduces the anansi framework and covers the *full model*. * [Section 2](#sec-da-2) Covers differential associations * [Section 3](#sec-da-3) Discusses incorporating repeated measures through `Error()` notation. # Getting started: The full model {#sec-da-1} After determining which pairs of features should be investigated, `anansi` fits several linear models for which it returns summary statistics. The user has control over these models through the `formula` argument in `anansi(). ` ## Setup We will use the same example as in [the vignette on adjacency matrices]`r sprintf("(%s)", adj_link)`. There, we also introduce the `AnansiWeb` class and demonstrate using `weaveWeb()`. Please note that the data used in this example have been generated with statistical associations manually spiked in for this vignette. ```{r formula-web-full} # Setup AnansiWeb with some dummy data library(anansi) web <- krebsDemoWeb() ``` ## Formula syntax For each of the feature-pairs that should be assessed, anansi fits similarly structured linear models. Anansi supports complex linear models as well as longitudinal models using the R formula syntax. The most basic model in anansi is: ```{r y-x-formula,echo=FALSE} y ~ x ``` where `y` refers to is the feature from `tableY` and `x` is the corresponding feature in `tableX`. Besides estimating the asociation between `y` and `x`, anansi also allows for the inclusion of covariates that could influence this association. With **full model**, we mean the total influence of `x`, including all of its interaction terms, on `y`. For example, if our input formula was: ```{r formulate, echo = FALSE} f <- y ~ x * (group_ab + score_a) f ``` R rewrites this as follows: ```{r reformulate, echo=FALSE} update.formula(f, . ~ .) ``` The variables that constitute the 'full' effect of x would be: `x + x:group_ab + x:score_a`. ### Formula in anansi() When providing a formula argument in the main `anansi()` function, the `y` and `x` variables should not be mentioned, they are already implied. Rather, only additional covariates that could interact with `x` should be provided in the `formula` argument. ```{r formula-anansi-full} out <- anansi(web, formula = ~ group_ab + score_a, groups = "group_ab" ) ``` Note that `anansi()` will tell us how the rewritten model looks if we don't set `verbose` to FALSE. The full model simply estimates the total association between `y` and `x`, making interpretation straightforward. We can imagine these associations to be positive or negative. ```{r plot_full} #| echo=FALSE, warning=FALSE, message=FALSE, fig.dim=c(6,3), #| out.width="80%", #| fig.cap="Figure 1. Both citrate and cis-aconitate ~ aconitase dehydrogenase, sequentially." #| library(patchwork) library(ggplot2) full_df <- data.frame( citrate = tableX(web)[, "citrate"], `cis-aconitate` = tableX(web)[, "cis-aconitate"], aconitase = tableY(web)[, "aconitase"] ) ggplot(full_df) + aes(x = aconitase, y = citrate) + facet_wrap(~"citrate") + ggplot(full_df) + aes(x = aconitase, y = cis.aconitate) + labs(y = "cis-aconitate") + facet_wrap(~"cis-aconitate") + plot_layout(axes = "collect", axis_titles = "collect") + plot_annotation( subtitle = ) & geom_point(shape = 21, fill = "#ffffbf") & coord_equal(ratio = 1, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5)) & theme_bw() ``` # Differential association {#sec-da-2} In order to assess differences in associations based on one or more variables (such as phenotype or treatment), we make use of the emergent and disjointed association paradigm introduced in the context of proportionality \citep{diff_prop,quinn2017propr,part_corr} and apply it outside of the simplex. ## Disjointed associations Disjointed associations describe the situation where there is a detectable association in all cases, but the quality of that association, the **slope**, changes in a manner explained by a term. We estimate this as interaction terms with `x`. In our example we have two such terms, namely `x:group_ab` and `x:score_a`. ### Disjointed associations: Categorical variables The figure below shows how such a disjointed association might look. We can imagine some sort of a pharmacological treatment that alters enzymatic conversion rates to the point where association between the enzyme and the metabolite *flips*. ```{r disj_cat} #| echo=FALSE, fig.dim=c(6,3), out.width="80%", #| fig.cap="isocitrate ~ isocitrate dehydrogenase by category (treatment)." #| diff_df <- data.frame( group_ab = paste0("Treatment ", metadata(web)$group_ab), score_a = metadata(web)$score_a, score_a_cat = cut( metadata(web)$score_a, labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"), quantile(metadata(web)$score_a, probs = seq(0, 1, 1 / 5)), include.lowest = TRUE ), repeated = metadata(web)$repeated, sample_id = metadata(web)$sample_id, isocitrate = tableX(web)[, "isocitrate"], ketoglutarate = tableX(web)[, "ketoglutarate"], isocitrate.dehydrogenase = tableY(web)[, "isocitrate dehydrogenase"], ketoglutarate.dehydrogenase = tableY(web)[, "ketoglutarate dehydrogenase"], succinyl.CoA = tableX(web)[, "succinyl-CoA"], succinate = tableX(web)[, "succinate"], succinyl.CoA.synthetase = tableY(web)[, "succinyl-CoA synthetase"], succinate.dehydrogenase = tableY(web)[, "succinate dehydrogenase"], fumarate = tableX(web)[, "fumarate"], fumarase = tableY(web)[, "fumarase"] ) ggplot(diff_df) + aes(x = isocitrate.dehydrogenase, y = isocitrate, fill = group_ab) + facet_wrap(~group_ab, ) + labs(x = "isocitrate dehydrogenase") + geom_point(shape = 21, show.legend = FALSE) + scale_fill_manual(values = c( "Treatment a" = "#d73027", "Treatment b" = "#2166ac" )) + theme_bw() + coord_equal(ratio = 1, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5)) ``` ### Disjointed associations: Continuous variables Similarly, we could imagine that this flip is not binary or otherwise step-wise, but rather the result of a continuous variable. ```{r disj_cont} #| echo=FALSE, fig.dim=c(10,3), #| fig.cap="ketoglutarate ~ ketoglutarate dehydrogenase stratified by a third continuous variable." #| ggplot(diff_df) + aes( x = ketoglutarate.dehydrogenase, y = ketoglutarate, fill = pnorm(score_a) ) + facet_grid(~score_a_cat) + geom_point(shape = 21, show.legend = FALSE) + scale_fill_gradientn( name = "Example health index", colors = c("#a50026", "#f46d43", "#ffffbf", "#74add1", "#313695") ) + labs(x = "ketoglutatate dehydrogenase") + theme_bw() + guides(fill = guide_colourbar( position = "bottom", legend.direction = "horizontal" )) + coord_equal(ratio = 1, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5)) ``` ### Equivalent `stats::lm()` model We can use `pairwiseApply()` on an `AnansiWeb` object to run a function on the observations for each pair of features. We can use this to confirm the statistical output of `anansi()` against the `stats` package: ```{r disj-lm} score_lm <- pairwiseApply(web, FUN = function(x, y) { anova( lm(y ~ x * (group_ab + score_a), data = metadata(web)) )$`F value`[5L] }) group_lm <- pairwiseApply(web, FUN = function(x, y) { anova( lm(y ~ x * (score_a + group_ab), data = metadata(web)) )$`F value`[5L] }) all.equal(score_lm, out$disjointed_score_a_f.values) all.equal(group_lm, out$disjointed_group_ab_f.values) ``` We run two separate `pairwiseApply()` calls for the two interaction terms. Because of the way `anova()` calculates sum-of-squares (type-I), the order of variables matters for `anova()`. ## Emergent association Emergent associations describe the situation where the **strength** of an association is explained by a term interacting with x. We again estimate this as interaction terms with `x`. In our example we have two such terms, namely `x:group_ab` and `x:score_a`. ### Emergent associations: Categorical variables The figure below shows how such a emergent association might look. We can imagine some sort of a pharmacological treatment that can completely shut down a biochemical pathway. As a result, the members of that pathway, which would show a consistently strong association under physiological conditions, could go *out of sync*. ```{r emerg_cat} #| echo=FALSE, fig.dim=c(6,3), out.width="80%", #| fig.cap="succinyl-CoA ~ succinyl-CoA synthetase by categorical variable (Treatment)." #| ggplot(diff_df) + aes(x = succinyl.CoA.synthetase, y = succinyl.CoA, fill = group_ab) + facet_wrap(~group_ab) + geom_point(shape = 21, show.legend = FALSE) + scale_fill_manual(values = c( "Treatment a" = "#d73027", "Treatment b" = "#2166ac" )) + labs(x = "succinyl-CoA synthetase", y = "succinyl-CoA") + theme_bw() + coord_equal(ratio = 1, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5)) ``` ### Emergent associations: Continuous variables Similarly, we could imagine situations where this loss in association strength happens gradually, rather than in a binary manner. For example, perhaps there is a strong association between a metabolite and an enzyme that metabolizes it under physiological conditions, but this association weakens along with some health index or environmental variable. ```{r emerg_cont} #| echo=FALSE, fig.dim=c(10,3), #| fig.cap="succinate ~ succinate dehydrogenase stratified by a third continuous variable." #| ggplot(diff_df) + aes(x = succinate.dehydrogenase, y = succinate, fill = pnorm(score_a)) + facet_grid(~score_a_cat) + geom_point(shape = 21, show.legend = FALSE) + scale_fill_gradientn( name = "Example health index", colors = c("#a50026", "#f46d43", "#ffffbf", "#74add1", "#313695") ) + labs(x = "succinate dehydrogenase") + theme_bw() + coord_equal(ratio = 1, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5)) ``` # Repeated measures {#sec-da-3} ## Random slopes through `Error()` Similar to `stats::aov()`, you can wrap a variable in `Error()` for `anansi()` to treat it as a stratum, allowing for random intercepts. This is useful when dealing with multiple measurements per subject, commonly due to a longitudinal design. ```{r intro-error} outErr <- anansi(web, formula = ~ group_ab + score_a + Error(sample_id), groups = "group_ab" ) ``` Note that `anansi()` acknowledges the random intercept. ### compare to `aov( Error() )`. We can use `aov()` to see how `Error()` works. `R` does not use orthogonal contrasts and uses type-I sum-of-squares calculations by default. In other words, the order of covariates matters. A variable wrapped in `Error()` gets moved to the front of the line, so the rest of the model could be understood as conditioned on the `Error()`-wrapped term. Here is the model with `Error()` ```{r aov-error} summary( aov(fumarate ~ fumarase * repeated + Error(sample_id), data = diff_df) ) ``` And the equivalent model with `sample_id` moved to the front instead:. ```{r aov-regular} summary( aov(fumarate ~ sample_id + fumarase * repeated, data = diff_df) ) # See ?stats::aov for more. ``` For every term except for `sample_id`, the results are equivalent. # Session info ```{r session-info} sessionInfo() ```