--- title: "Getting started with anansi" output: BiocStyle::html_document: toc: true toc_float: true toc_depth: 2 number_sections: true vignette: > %\VignetteIndexEntry{1. Getting started with anansi} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( echo = TRUE, tidy.opts = list(width.cutoff = 80), tidy = TRUE ) # long links adj_link <- paste0( "https://thomazbastiaanssen.github.io/anansi/", "articles/adjacency_matrices.html" ) dif_link <- paste0( "https://thomazbastiaanssen.github.io/anansi/", "articles/differential_associations.html" ) library(anansi) ``` # Overview {-} This vignette can be divided into three sections: * [Section 1](#sec-anansi-1) introduces the `anansi` framework. * [Section 2](#sec-anansi-2) goes through a simple workflow and required input. * [Section 3](#sec-anansi-3) discusses more advanced applications of anansi. # Introduction {#sec-anansi-1} The `anansi` package computes and compares the association between the features of two 'omics datasets that are known to interact based on a database such as KEGG. Studies including both functional microbiome and metabolomics data are becoming more common. Often, it would be helpful to integrate both datasets in order to see if they corroborate each others patterns. However, all-vs-all association analyses are imprecise and likely to yield spurious associations. This package takes a knowledge-based approach to constrain association search space, only considering metabolite-function interactions that have been recorded in a pathway database. In addition, it provides a framework to assess differential associations. While `anansi` is geared towards metabolite-function interactions in the context of host-microbe interactions, it is perfectly capable of handling any other pair of data sets where some features interact canonically. ## A note on functional microbiome data Two common questions in the host-microbiome field are "Who's there?" and "What are they doing?". Techniques like 16S rRNA sequencing and shotgun metagenomics sequencing are most commonly used to answer the first question. The second question can be a bit more tricky, as it often requires functional inference software to address them. For 16S sequencing, programs like [PICRUSt2](https://github.com/picrust/picrust2/) can be used to infer function. For shotgun metagenomics, this can be done with programs such as [HUMANn3 in the bioBakery suite](https://github.com/picrust/picrust2) and [woltka](https://github.com/qiyunzhu/woltka). All of these algorithms can produce functional count data in terms of KEGG Orthologues (KOs) organised as tables, which can be directly passed to `anansi`. # Getting started with anansi {#sec-anansi-2} ## Installation instructions Get the latest stable `R` release from [CRAN](http://cran.r-project.org/). Then install `anansi` from [Bioconductor](http://bioconductor.org/) using the following code: ```{r install, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("anansi") ``` And the development version from [GitHub](https://github.com/thomazbastiaanssen/anansi) with `remotes`: ```{r remotes-install, eval = FALSE} install.packages("remotes") remotes::install_github("thomazbastiaanssen/anansi") ``` ## Setup ```{r load-libraries, eval=TRUE} # load anansi library(anansi) # load ggplot2 and ggforce to plot results library(ggplot2) library(ggforce) # load example data + metadata from FMT Aging study data(FMT_data) ``` ## Data preparation The anansi framework requires input data to be prepared in a specific format, namely the `AnansiWeb` S7 class. This is to ensure that features are correctly linked across input datasets. We will first discuss the required input data and then demonstrate the `weaveWeb` method, which is the recommended way to prepare input for the anansi framework. ## Input formatting for anansi The main `anansi` function expects data in the `AnansiWeb` class format, which contains four tables: ### feature tables: `tableY` and `tableX`. The first two tables, `tableY` and `tableX` are the feature tables of your two data modalities (e.g., metabolites, genes, microbes, ...). Both tables should have columns as features and rows as samples. Below is an example of an appropriately formatted feature table, we will use it in the sections below. ```{r 'ko-table-format', echo = FALSE} cat("anansi expects samples to be rows and features to be columns:\n") t(FMT_KOs[seq(4L),seq(10L)]) ``` ```{r 't-tables', eval=TRUE} # Transpose demo tables to input format tableY <- t(FMT_metab) tableX <- t(FMT_KOs) ``` ### linking information: `link`. The third table, `link`, provides the information to link the features across the first two tables, `tableY` and `tableX`. `anansi` provides a pre-built map between ko, cpd and ec feature annotations from the KEGG database, but users can also provide their custom mapping information. We will not expand on here, but [see vignette on adjacency matrices.]`r sprintf("(%s)", adj_link)` Also see `?anansi::MultiFactor()` ### Additional sample data for analysis: Metadata The final table is `metadata`: a `data.frame` with additional information about the data set. Its rows correspond to samples and its columns to variables that can be used as covariates in downstream `anansi()` analysis. ```{r metadata-format, echo = FALSE} cat("anansi expects samples to be rows and features to be columns:\n") `rownames<-`(FMT_metadata, NULL)[seq(10L),] ``` ## Weave a web🕸️ The `weaveWeb()` function accepts the four input tables discussed above and collates them into an `AnansiWeb` object. The `AnansiWeb` format is a necessary input file for the main `anansi` workflow. It allows `anansi` to keep track of which features from the two input data sets should be considered as pairs. ### Specify which features to `link`. Additionally, `weaveWeb()` requires the user to specify the names of the feature types that should be linked. These names need to be present as the columns of the `link` table. In this case, we'll link KEGG compound (cpd) to KEGG Orthologues (ko). ```{r weaveweb, eval=TRUE} # Create AnansiWeb object from tables web <- weaveWeb( x = "ko", y = "cpd", link = kegg_link(), tableY = tableY, tableX = tableX, metadata = FMT_metadata ) ``` ### Specify `y`, and `x` Though most of the examples use metabolites and functions, `anansi` is able to handle any type of big/'omics data, as long as there is a dictionary available. Because of this, anansi uses the type-naive nomenclature `tableY` and `tableX`. The Y and X refer to the position these measurements will have in the linear modeling framework: $$ Y \sim X \times{covariates} $$ ### `weaveWeb` with Bioconductor ecosystem As an alternative to providing input as raw tables, `weaveWeb()` also accepts the Bioconductor classes `MultiAssayExperiment` (MAE) and `TreeSummarizedExperiment` (TSE) as input. Conversely, an `AnansiWeb` object can be converted to a MAE or TSE with the functions `asMAE()` and `asTSE()`. ```{r bioc1} #| eval=TRUE, message=FALSE # Convert web to mae mae <- asMAE(web) ``` The MAE object can be passed to the `weaveWeb()` function, which returns the same output, ready for the `anansi` function. ```{r bioc2} #| eval=TRUE web.mae <- weaveWeb( mae, tableY = "cpd", tableX = "ko", link = kegg_link() ) out.mae <- anansi( web = web.mae, formula = ~Legend, adjust.method = "BH", verbose = TRUE ) identical(web, web.mae) ``` ## Run anansi🕷️ The main workspider is called `anansi` just like the package. Generally, two main arguments should be specified: 1. `web`: an `AnansiWeb` object, such as the one we generated in the above step. 2. `formula`: a formula specifying which associations to test. For instance, to assess differential associations between treatments, we use the formula `~ Legend` (corresponding to the `Legend` column in the `metadata` table we provided to `weaveWeb()` above). ```{r run-anansi, eval=TRUE} # Run anansi on AnansiWeb object anansi_out <- anansi( web = web, formula = ~Legend, adjust.method = "BH", verbose = TRUE ) ``` ## Reporting output📝 By default, `anansi` outputs a table in the `wide` format. For general reporting, we recommend sticking to the more legible `table` format by specifying the `return.format` argument. Let's take a look at the structure of the output table: ```{r show-output, eval=TRUE} # General structure of anansi() output head(anansi_out, c(5, 5)) # Let's take a look at the column names colnames(anansi_out) ``` ### Output statistics We can see that the output of `anansi()` is structured so that there is one feature pair per row, marked by the first two columns. For each of those feature pairs, the table contains outcome parameters from several models. In order, we have pooled correlations, group-wise correlations, full model, disjointed and emergent associations. For each of these models we report four parameters. For correlations we provide the Pearson's coefficient ($\rho$) and the $t$-statistic, whereas we provide their squared counterparts for the linear regression models: $R^2$ and the $F$ statistic. For all models, we additionally provide p-values and adjusted p-values. For more on this, [see the vignette on differential associations]`r sprintf("(%s)", dif_link)`. | Association | Method | Statistics | Significance | |-------------|-------------------|-------------|---------------| | Pooled | Correlation | $\rho$, $t$ | p, adjusted p | | Group-wise | Correlation | $\rho$, $t$ | p, adjusted p | | Full | Linear regression | $R^2$, $F$ | p, adjusted p | | Disjointed | Linear regression | $R^2$, $F$ | p, adjusted p | | Emergent | Linear regression | $R^2$, $F$ | p, adjusted p | ## Plot the results Finally, the output of `anansi` can be visualized with `plotAnansi` by specifying the type of association to use. ```{r bioc3} #| eval=TRUE, fig.width = 10, fig.height = 8 # Filter associations by q value anansi_out <- anansi_out[anansi_out$full_q.values < 0.2, ] # Visualize disjointed associations plotAnansi( anansi_out, association.type = "disjointed", model.var = "Legend", signif.threshold = 0.05, fill_by = "group" ) ``` # Advanced applications and customization {#sec-anansi-3} ## Generate `AnansiWeb` with random data The `randomWeb()` function allows the user to generate random data, which can be useful for statistical modeling or to generate placeholder data, for instance. ```{r r-web} rweb <- randomWeb( n_samples = 10, n_reps = 1, n_features_x = 8, n_features_y = 12, sparseness = 0.5 ) # The random object comes with some dummy metadata: metadata(rweb) # also see `?randomAnansi` ``` ## Apply user-defined function on each pair of features A typical `anansi` workflow, as the one above, estimates associations between each pair of features. More generally, we can use `pairwiseApply()` on the same `AnansiWeb` input object to apply a user-defined function on each pair of features. The function passes as the `FUN` argument is applied on each pair of features, which are pre-mapped to `y` and `x`: `function(x, y)`. [Also see vignette on adjacency matrices.]`r sprintf("(%s)", adj_link)` ```{r pairwiseApply1} # For each feature pair, was the value for x higher than the value for y? pairwise_gt <- pairwiseApply( X = web, FUN = function(x, y) x > y, USE.NAMES = TRUE ) pairwise_gt[1:5, 1:5] ``` ```{r pairwiseApply2} # Run cor() on each pair of features pairwise_cor <- pairwiseApply( X = web, FUN = function(x, y) cor(x, y), USE.NAMES = FALSE ) pairwise_cor ``` ## Plotting examples Automatic plotting functions such as `plotAnansi()` provide a convenient and code-light solution to visualization. However, some may prefer interfacing directly with the `ggplot()` code. We provide code to recreate the above figure: ```{r plot-extra1} #| plot_FMT, eval=TRUE, fig.width=10, fig.height=8, tidy=FALSE, #| fig.alt="Differential association plot, with facets separating metabolites #| and metabolites on the y-axis. x-axis depicts pearson's correlation #| coefficient." # Use tidyr to wrangle the correlation r-values to a single column library(tidyr) anansiLong <- anansi_out |> pivot_longer(starts_with("All") | contains("FMT")) |> separate_wider_delim( name, delim = "_", names = c("cor_group", "param") ) |> pivot_wider(names_from = param, values_from = value) # Now it's ready to be plugged into ggplot2, though let's clean up a bit more. # Only consider interactions where the entire model fits well enough. anansiLong <- anansiLong[anansiLong$full_q.values < 0.2, ] ggplot(anansiLong) + # Define aesthetics aes( x = r.values, y = feature_X, fill = cor_group, alpha = disjointed_Legend_p.values < 0.05 ) + # Make a vertical dashed red line at x = 0 geom_vline(xintercept = 0, linetype = "dashed", colour = "red") + # Points show raw correlation coefficients geom_point(shape = 21, size = 3) + # facet per compound ggforce::facet_col(~feature_Y, space = "free", scales = "free_y") + # fix the scales, labels, theme and other layout scale_y_discrete(limits = rev, position = "right") + scale_alpha_manual( values = c("TRUE" = 1, "FALSE" = 1 / 3), "Disjointed association\np < 0.05" ) + scale_fill_manual( values = c( "Young yFMT" = "#2166ac", "Aged oFMT" = "#b2182b", "Aged yFMT" = "#ef8a62", "All" = "gray" ), breaks = c("Young yFMT", "Aged oFMT", "Aged yFMT", "All"), name = "Treatment" ) + theme_bw() + ylab("") + xlab("Pearson's \u03c1") ``` ## Plotting multiple feature pairs at once with patchwork We can also also use the `AnansiWeb` object to pull up specific pairs of features. The [`patchwork`](https://patchwork.data-imaginist.com/) provides a wonderful interface to combine plots. We can use `getFeaturePairs()` to extract the relevant data from our `AnansiWeb` object in a convenient list format: ```{r plot-extra2} #| multiplot, eval=TRUE, message=FALSE, tidy = FALSE library(patchwork) # Get a list containing data.frames for each feature pair. feature_pairs <- getFeaturePairs(web) # For each, prepare a ggplot object with those features on x & y axes. plot_list <- lapply( feature_pairs, FUN = function(pair_i) { ggplot(pair_i) + aes( y = .data[[colnames(pair_i)[1L]]], x = .data[[colnames(pair_i)[2L]]] ) + # We'll use facet_grid for labels instead of labs(). facet_grid(colnames(pair_i)[1L] ~ colnames(pair_i)[2L]) } ) # Feed the first six of our ggplots to patchwork, using wrap_plots() wrap_plots(plot_list[seq_len(6L)], guides = "collect") & # Continue as with an ordinary ggplot, but use & instead of + geom_point( shape = 21, show.legend = FALSE, aes(fill = FMT_metadata$Legend) ) & geom_smooth( method = "lm", se = FALSE, aes(colour = FMT_metadata$Legend), show.legend = FALSE ) & # Appearance scale_fill_manual( aesthetics = c("colour", "fill"), values = c( "Young yFMT" = "#2166ac", "Aged oFMT" = "#b2182b", "Aged yFMT" = "#ef8a62", "All" = "gray" ) ) & theme_bw() & labs(x = NULL, y = NULL) ``` These types of visualizations of feature pairs work best when the amount of feature pairs to be plotted is modest. ### Note on using `patchwork` with `anansi`: Especially for large numbers of feature pairs, it's much more efficient to add shared elements to the patchwork using the `&` operator, rather than adding them to the initial `ggplot()` calls in the `lapply()` loop. # Session info ```{r session-info} sessionInfo() ```