---
title: "curatedMetagenomicData"
author:
- name: Lucas Schiffer, MPH
  affiliation: 
  - Section of Computational Biomedicine, Boston University School of Medicine,
    Boston, MA, U.S.A.
  email: schifferl@bu.edu
- name: Levi Waldron, PhD
  affiliation: 
  - Graduate School of Public Health and Health Policy, City University of New
    York, New York, NY, U.S.A.
  email: levi.waldron@sph.cuny.edu
package: curatedMetagenomicData
abstract: >
    The curatedMetagenomicData package provides standardized, curated human
    microbiome data for novel analyses. It includes gene families, marker
    abundance, marker presence, pathway abundance, pathway coverage, and
    relative abundance for samples collected from different body sites. The
    bacterial, fungal, and archaeal taxonomic abundances for each sample were
    calculated with MetaPhlAn3, and metabolic functional potential was
    calculated with HUMAnN3. The manually curated sample metadata and
    standardized metagenomic data are available as (Tree)SummarizedExperiment
    objects.
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{curatedMetagenomicData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r, include = FALSE}
library(curatedMetagenomicData)
```

# What `curatedMetagenomicData` provides

`curatedMetagenomicData` provides processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor TreeSummarizedExperiment objects or TSV flat text exports. It provides 6 types of data for each dataset:

1. Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level (`relative_abundance`)
2. Presence of unique, clade-specific markers (`marker_presence`)
3. Abundance of unique, clade-specific markers (`marker_abundance`)
4. Abundance of gene families (`gene_families`)
5. Metabolic pathway coverage (`pathway_coverage`)
6. Metabolic pathway abundance (`pathway_abundance`)

Types 1-3 are generated by [MetaPhlAn3](http://segatalab.cibio.unitn.it/tools/metaphlan/index.html); 4-6 are generated by [HUMAnN3](https://huttenhower.sph.harvard.edu/humann/) using the [UniRef90 database](https://www.uniprot.org/help/uniref).

Currently there are:

* `r nrow(sampleMetadata)` samples from `r length(unique(sampleMetadata$study_name))` datasets; see the list of [available studies](https://waldronlab.io/curatedMetagenomicData/articles/available-studies.html)
* `r ncol(sampleMetadata)` fields of specimen metadata from original papers, supplementary files, and websites, with manual curation and automated syntax-checking to standardize annotations

# Additional documentation and resources

See:

1. [Available Studies](https://waldronlab.io/curatedMetagenomicData/articles/available-studies.html)
2. [Our Pipeline](https://waldronlab.io/curatedMetagenomicData/articles/our-pipeline.html)
3. [Changes in cMD 3](https://waldronlab.io/curatedMetagenomicData/articles/version-three.html)
4. [Reference for cMD Functions](https://waldronlab.io/curatedMetagenomicData/reference/index.html)
5. [The command-line tool](https://github.com/waldronlab/curatedMetagenomicDataTerminal)
6. [Example analyses in R and Python, Docker image, free Cloud instance for teaching/learning](https://github.com/waldronlab/curatedMetagenomicDataAnalyses)

# Installation

To install `r BiocStyle::Biocpkg("curatedMetagenomicData")` from Bioconductor,
use `r BiocStyle::CRANpkg("BiocManager")` as follows.

```{r, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("curatedMetagenomicData")
```

Most users should simply install
`r BiocStyle::Biocpkg("curatedMetagenomicData")` from Bioconductor. Developers
may install the GitHub repository with `remotes::install_github` and the latest
Bioconductor development version.

# R Packages

To demonstrate the functionality of `r Biocpkg("curatedMetagenomicData")`, the
`r CRANpkg("dplyr")` and `r CRANpkg("DT")` packages are needed.

```{r, message = FALSE}
library(dplyr)
library(DT)
```

# Sample Metadata

The `r Biocpkg("curatedMetagenomicData")` package contains a `data.frame`,
`sampleMetadata`, of manually curated sample metadata to help users understand
the nature of studies and samples available prior to returning resources.
Beyond this, it serves two purposes: 1) to define `study_name`, which is used
with the `curatedMetagenomicData()` function to query and return resources, and
2) to define `sample_id`, which is used with the `returnSamples()` function to
return samples across studies.

To demonstrate, the first ten rows and columns (without any `NA` values) of
`sampleMetadata` for the `AsnicarF_2017` study are shown in the table below.

```{r, collapse = TRUE}
sampleMetadata |>
    filter(study_name == "AsnicarF_2017") |>
    select(where(~ !any(is.na(.x)))) |>
    slice(1:10) |>
    select(1:10) |>
    datatable(options = list(dom = "t"), extensions = "Responsive")
```

# Data Access

There are three main ways to access data resources in curatedMetagenomicData.

1. The `curatedMetagenomicData()` function to search for and return resources.
2. The `returnSamples()` function to return samples across studies.
3. Through the [curatedMetagenomicDataTerminal](https://github.com/waldronlab/curatedMetagenomicDataTerminal) command-line interface.

## `curatedMetagenomicData()`

To access curated metagenomic data, users will use the `curatedMetagenomicData()` function both to query and return resources. The first argument `pattern` is a regular expression pattern to look for in the titles of resources available in `r Biocpkg("curatedMetagenomicData")`; `""` will return all resources. The title of each resource is a three part string with "." as a delimiter – the fields are `runDate`, `studyName`, and `dataType`. The `runDate` is the date we created the resource and can mostly be ignored by users because if there is more than one date corresponding to a resource, the most recent one is selected automatically – it would be used if a specific `runDate` was needed.

Multiple resources can be queried or returned with a single call to `curatedMetagenomicData()`, but only the titles of resources are returned by default.

```{r, collapse = TRUE}
curatedMetagenomicData("AsnicarF_20.+")
```

When the `dryrun` argument is set to `FALSE`, a `list` of `SummarizedExperiment` and/or `TreeSummarizedExperiment` objects is returned. The `rownames` argument determines the type of `rownames` to use for `relative_abundance` resources: either `"long"` (the default), `"short"` (species name), or `"NCBI"` (NCBI Taxonomy ID). When a single resource is requested, a single element `list` is returned.

```{r, collapse = TRUE, message = FALSE}
curatedMetagenomicData("AsnicarF_2017.relative_abundance", dryrun = FALSE, rownames = "short")
```

When the `counts` argument is set to `TRUE`, relative abundance proportions are multiplied by read depth and rounded to the nearest integer prior to being returned. Also, when multiple resources are requested, the `list` will contain named elements corresponding to each `SummarizedExperiment` and/or `TreeSummarizedExperiment` object.

```{r, collapse = TRUE, message = FALSE}
curatedMetagenomicData("AsnicarF_20.+.relative_abundance", dryrun = FALSE, counts = TRUE, rownames = "short")
```

### `mergeData()`

To merge the `list` elements returned from the `curatedMetagenomicData()` function into a single `SummarizedExperiment` or `TreeSummarizedExperiment` object, users will use the `mergeData()` function, provided elements are of the same `dataType`.

```{r, collapse = TRUE, message = FALSE}
curatedMetagenomicData("AsnicarF_20.+.marker_abundance", dryrun = FALSE) |>
    mergeData()
```

The `mergeData()` function works for every `dataType` and will always return the appropriate data structure (a single `SummarizedExperiment` or `TreeSummarizedExperiment` object).

```{r, collapse = TRUE, message = FALSE}
curatedMetagenomicData("AsnicarF_20.+.pathway_abundance", dryrun = FALSE) |>
    mergeData()
```

This is useful for analysis across entire studies (e.g. meta-analysis); however, when doing analysis across individual samples (e.g. mega-analysis) the `returnSamples()` function is preferable.

```{r, collapse = TRUE, message = FALSE}
curatedMetagenomicData("AsnicarF_20.+.relative_abundance", dryrun = FALSE, rownames = "short") |>
    mergeData()
```

## `returnSamples()`

The `returnSamples()` function takes the `sampleMetadata` `data.frame` subset to include only desired samples and metadata as input, and returns a single `SummarizedExperiment` or `TreeSummarizedExperiment` object that includes only desired samples and metadata. To use this function, filter rows and/or select columns of interest from the `sampleMetadata` `data.frame`, maintaining at least one row, and the `sample_id` and `study_name` columns. Then provide the subset `data.frame` as the first argument to the `returnSamples()` function.

The `returnSamples()` function requires a second argument `dataType` (either `"gene_families"`, `"marker_abundance"`, `"marker_presence"`, `"pathway_abundance"`, `"pathway_coverage"`, or `"relative_abundance"`) to be specified. It is often most convenient to subset the `sampleMetadata` `data.frame` using `r CRANpkg("dplyr")` syntax.

```{r, collapse = TRUE, message = FALSE}
sampleMetadata |>
    filter(age >= 18) |>
    filter(!is.na(alcohol)) |>
    filter(body_site == "stool") |>
    select(where(~ !all(is.na(.x)))) |>
    returnSamples("relative_abundance", rownames = "short")
```

The `counts` and `rownames` arguments apply to `returnSamples()` as well, and can be passed the function. Finally, users should know that any arbitrary columns added to `sampleMetadata` will be present in the `colData` of the `SummarizedExperiment` or `TreeSummarizedExperiment` object that is returned.

# Example Analysis

To demonstrate the utility of `r Biocpkg("curatedMetagenomicData")`, an example analysis is presented below. However, readers should know analysis is generally beyond the scope of `r Biocpkg("curatedMetagenomicData")` and the analysis presented here is for demonstration alone. It is best to consider the output of `r Biocpkg("curatedMetagenomicData")` as the input of analysis more than anything else.

## R Packages

To demonstrate the utility of `r Biocpkg("curatedMetagenomicData")`, the `r CRANpkg("stringr")`, `r Biocpkg("mia")`, `r Biocpkg("scater")`, and `r CRANpkg("vegan")` packages are needed.

```{r, message = FALSE}
library(stringr)
library(mia)
library(scater)
library(vegan)
```

## Prepare Data

In our hypothetical study, let's examine the association of alcohol consumption and stool microbial composition across all annotated samples in `r Biocpkg("curatedMetagenomicData")`. We will examine the alpha diversity (within subject diversity), beta diversity (between subject diversity), and conclude with a few notes on differential abundance analysis.

### Return Samples

First, as above, we use the `returnSamples()` function to return the relevant samples across all studies available in `r Biocpkg("curatedMetagenomicData")`. We want adults over the age of 18, for whom alcohol consumption status is known, and we want only stool samples. The `select(where...` line below simply removes metadata columns which are all `NA` values – they exist in another study but are all `NA` once subsetting has been done. Lastly, the `"relative_abundance"` `dataType` is requested because it contains the relevant information about microbial composition.

```{r, collapse = TRUE, message = FALSE}
alcoholStudy <-
    filter(sampleMetadata, age >= 18) |>
    filter(!is.na(alcohol)) |>
    filter(body_site == "stool") |>
    select(where(~ !all(is.na(.x)))) |>
    returnSamples("relative_abundance", rownames = "short")
```

### Mutate colData

Most of the values in the `sampleMetadata` `data.frame` (which becomes `colData`) are in snake case (e.g. `snake_case`) and don't look nice in plots. Here, the values of the `alcohol` variable are made into title case using `r CRANpkg("stringr")` so they will look nice in plots.

```{r, collapse = TRUE, message = FALSE}
colData(alcoholStudy) <-
    colData(alcoholStudy) |>
    as.data.frame() |>
    mutate(alcohol = str_replace_all(alcohol, "no", "No")) |>
    mutate(alcohol = str_replace_all(alcohol, "yes", "Yes")) |>
    DataFrame()
```

### Agglomerate Ranks

Next, the `splitByRanks` function from `r Biocpkg("mia")` is used to create alternative experiments for each level of the taxonomic tree (e.g. Genus). This allows for diversity and differential abundance analysis at specific taxonomic levels; with this step complete, our data is ready to analyze.

```{r, collapse = TRUE, message = FALSE}
altExps(alcoholStudy) <-
    splitByRanks(alcoholStudy)
```

## Alpha Diversity

Alpha diversity is a measure of the within sample diversity of features (relative abundance proportions here) and seeks to quantify the evenness (i.e. are the amounts of different microbes the same) and richness (i.e. are they are large variety of microbial taxa present). The Shannon index (H') is a commonly used measure of alpha diversity, it's estimated here using the `estimateDiversity()` function from the `r Biocpkg("mia")` package.

To quickly plot the results of alpha diversity estimation, the `plotColData()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax.

```{r, collapse = TRUE, fig.cap = "Alpha Diversity – Shannon Index (H')"}
alcoholStudy |>
    estimateDiversity(assay.type = "relative_abundance", index = "shannon") |>
    plotColData(x = "alcohol", y = "shannon", colour_by = "alcohol", shape_by = "alcohol") +
    labs(x = "Alcohol", y = "Alpha Diversity (H')") +
    guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
    theme(legend.position = "none")
```

The figure suggest that those who consume alcohol have higher Shannon alpha diversity than those who do not consume alcohol; however, the difference does not appear to be significant, at least qualitatively.

## Beta Diversity

Beta diversity is a measure of the between sample diversity of features (relative abundance proportions here) and seeks to quantify the magnitude of differences (or similarity) between every given pair of samples. Below it is assessed by Bray–Curtis Principal Coordinates Analysis (PCoA) and Uniform Manifold Approximation and Projection (UMAP).

### Bray–Curtis PCoA

To calculate pairwise Bray–Curtis distance for every sample in our study we will use the `runMDS()` function from the `r Biocpkg("scater")` package along with the `vegdist()` function from the `r CRANpkg("vegan")` package.

To quickly plot the results of beta diversity analysis, the `plotReducedDim()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax.

```{r, collapse = TRUE, fig.cap = "Beta Diversity – Bray–Curtis PCoA"}
alcoholStudy |>
    runMDS(FUN = vegdist, method = "bray", exprs_values = "relative_abundance", altexp = "genus", name = "BrayCurtis") |>
    plotReducedDim("BrayCurtis", colour_by = "alcohol", shape_by = "alcohol") +
    labs(x = "PCo 1", y = "PCo 2") +
    guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
    theme(legend.position = c(0.90, 0.85))
```

### UMAP

To calculate the UMAP coordinates of every sample in our study we will use the `runUMAP()` function from the `r Biocpkg("scater")` package package, as it handles the task in a single line.

To quickly plot the results of beta diversity analysis, the `plotReducedDim()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax again.

```{r, collapse = TRUE, fig.cap = "Beta Diversity – UMAP (Uniform Manifold Approximation and Projection)"}
alcoholStudy |>
    runUMAP(exprs_values = "relative_abundance", altexp = "genus", name = "UMAP") |>
    plotReducedDim("UMAP", colour_by = "alcohol", shape_by = "alcohol") +
    labs(x = "UMAP 1", y = "UMAP 2") +
    guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
    theme(legend.position = c(0.90, 0.85))
```

## Differential Abundance

Next, it would be desirable to establish which microbes are differentially abundant between the two groups (those who consume alcohol, and those who do not). The `r Biocpkg("lefser")` and `r Biocpkg("ANCOMBC")` packages are excellent resources for this tasks; however, code is not included here to avoid including excessive `Suggests` packages – `r Biocpkg("curatedMetagenomicData")` had far too many of these in the the past and is now very lean. There is a repository of analyses, [curatedMetagenomicAnalyses](https://github.com/waldronlab/curatedMetagenomicAnalyses), on GitHub and a forthcoming paper that will feature extensive demonstrations of analyses – but for now, the suggestions above will have to suffice.

# Type Conversion

Finally, the `r Biocpkg("curatedMetagenomicData")` package previously had functions for conversion to `phyloseq` class objects, and they have been removed. It is likely that some users will still want to do analysis using `r Biocpkg("phyloseq")`, and we would like to help them do so – it is just easier if we don't have to maintain the conversion function ourselves. As such, the `r Biocpkg("mia")` package has a function, `makePhyloseqFromTreeSummarizedExperiment`, that will readily do the conversion – users needing this functionality are advised to use it.

```{r, eval = FALSE}
convertToPhyloseq(alcoholStudy, assay.type = "relative_abundance")
```

# Session Info

```{r, echo = FALSE}
utils::sessionInfo()
```