---
title: "HMP16SData"
author:
- name: Lucas Schiffer
  affiliation:
  - &1 Graduate School of Public Health and Health Policy, City University of
    New York, New York, NY
  - &2 Institute for Implementation Science in Population Health, City
    University of New York, New York, NY
- name: Rimsha Azhar
  affiliation:
  - *1
  - *2
- name: Marcel Ramos
  affiliation:
  - *1
  - *2
  - &3 Roswell Park Cancer Institute, University of Buffalo, Buffalo, NY
- name: Ludwig Geistlinger
  affiliation:
  - *1
  - *2
- name: Levi Waldron
  affiliation:
  - *1
  - *2
date: '`r format(Sys.Date(), "%B %e, %Y")`'
abstract: >
    HMP16SData is a Bioconductor ExperimentData package of the Human Microbiome
    Project (HMP) 16S rRNA sequencing data for variable regions 1–3 and 3–5. Raw
    data files are provided in the package as downloaded from the HMP Data
    Analysis and Coordination Center. Processed data is provided as
    SummarizedExperiment class objects via ExperimentHub.
package: HMP16SData
output:
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{HMP16SData}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options:
    chunk_output_type: console
---

# Publications

Schiffer, L. *et al.* [HMP16SData: Efficient Access to the Human Microbiome
Project through Bioconductor](https://dx.doi.org/10.1093/aje/kwz006). *Am. J.
Epidemiol.* (2019).

Griffith, J. C. & Morgan, X. C. [Invited Commentary: Improving accessibility of
the Human Microbiome Project data through integration with
R/Bioconductor](https://dx.doi.org/10.1093/aje/kwz007). *Am. J. Epidemiol.*
(2019).

Waldron, L. *et al.* [Waldron et al. Reply to “Commentary on the HMP16SData
Bioconductor Package”](https://dx.doi.org/10.1093/aje/kwz008). *Am. J.
Epidemiol.* (2019).

# Prerequisites

The following `r BiocStyle::CRANpkg("knitr")` options will be used in this
vignette to provide the most useful and concise output.

```{r}
knitr::opts_chunk$set(message = FALSE)
```

The following packages will be used in this vignette to provide demonstrative
examples of what a user might do with `r BiocStyle::Biocpkg("HMP16SData")`.

```{r}
library(HMP16SData)
library(phyloseq)
library(magrittr)
library(ggplot2)
library(tibble)
library(dplyr)
library(dendextend)
library(circlize)
library(ExperimentHub)
library(gridExtra)
library(cowplot)
library(readr)
library(haven)
```

Pipe operators from the `r BiocStyle::CRANpkg("magrittr")` package are used in
this vignette to provide the most elegant and concise syntax. See the
`r BiocStyle::CRANpkg("magrittr")` vignette if the syntax is unclear.

# Introduction

`r BiocStyle::Biocpkg("HMP16SData")` is a Bioconductor ExperimentData package of
the Human Microbiome Project (HMP) 16S rRNA sequencing data for variable regions
1–3 and 3–5. Raw data files are provided in the package as downloaded from the
[HMP Data Analysis and Coordination Center](https://tinyurl.com/y7ev836z).
Processed data is provided as `SummarizedExperiment` class objects via
`r BiocStyle::Biocpkg("ExperimentHub")`.

`r BiocStyle::Biocpkg("HMP16SData")` can be installed using
`r BiocStyle::CRANpkg("BiocManager")` as follows.

```{r, eval=FALSE}
BiocManager::install("HMP16SData")
```

Once installed, `r BiocStyle::Biocpkg("HMP16SData")` provides two functions to
access data – one for variable region 1–3 and another for variable region 3–5.
When called, as follows, the functions will download data from an
`r BiocStyle::Biocpkg("ExperimentHub")` Amazon S3 (Simple Storage Service)
bucket over `https` or load data from a local cache.

```{r}
V13()
V35()
```

The two data sets are represented as `SummarizedExperiment` objects, a standard
Bioconductor class that is amenable to subsetting and analysis. To maintain
brevity, details of the `SummarizedExperiment` class are not outlined here but
the `r BiocStyle::Biocpkg("SummarizedExperiment")` package provides an excellent
vignette.

# Features

## Frequency Table Generation

Sometimes it is desirable to provide a quick summary of key demographic
variables and to make the process easier `r BiocStyle::Biocpkg("HMP16SData")`
provides a function, `table_one`, to do so. It returns a `data.frame` or a
`list` of `data.frame` objects that have been transformed to make a publication
ready table.

```{r}
V13() %>%
    table_one() %>%
    head()
```

If a `list` is passed to `table_one`, its elements must be named so that the
named elements can be used by the `kable_one` function. The `kable_one` function
will produce an `HTML` table for vignettes such as the one shown below.

```{r}
list(V13 = V13(), V35 = V35()) %>%
    table_one() %>%
    kable_one()
```

## Straightforward Subsetting

The `SummarizedExperiment` container provides for straightforward subsetting by
data or metadata variables using either the `subset` function or `[` methods –
see the `r BiocStyle::Biocpkg("SummarizedExperiment")` vignette for additional
details. Shown below, the variable region 3–5 data set is subset to include only
stool samples.

```{r}
V35_stool <-
    V35() %>%
    subset(select = HMP_BODY_SUBSITE == "Stool")

V35_stool
```

## HMP Controlled-Access Participant Data

Most participant data from the HMP study is controlled through the National
Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes
(dbGaP). `r BiocStyle::Biocpkg("HMP16SData")` provides a data dictionary
translated from dbGaP `XML` files for the seven different controlled-access data
tables related to the HMP. See `?HMP16SData::dictionary` for details of these
source data tables, and `View(dictionary)` to view the complete data dictionary.
Several steps are required to access the data tables, but the process is greatly
simplified by `r BiocStyle::Biocpkg("HMP16SData")`.

### Apply for dbGaP Access

You must make a controlled-access application through
<https://dbgap.ncbi.nlm.nih.gov> for:

> HMP Core Microbiome Sampling Protocol A (HMP-A) (**phs000228.v4.p1**)

Once approved, browse to <https://dbgap.ncbi.nlm.nih.gov>, sign in, and select
the option "*get dbGaP repository key*" to download your `*.ngc` repository key.
This is all you need from the dbGaP website.

### Install the SRA Toolkit

You must also install the [NCBI SRA Toolkit](https://tinyurl.com/ydgmzc8a),
which will be used in the background for downloading and decrypting
controlled-access data.

There are shortcuts for common platforms:

* Debian/Ubuntu: `apt install sra-toolkit`
* macOS: `brew install sratoolkit`

For macOS, the `brew` command does not come installed by default and requires
installation of the homebrew package manager. Instructions are available at
<https://tinyurl.com/ybeqwl8f>.

For Windows, binary installation is necessary and instructions are available at
<https://tinyurl.com/y845ppaa>.

### Merge with HMP16SData

The `attach_dbGap()` function takes a `r BiocStyle::Biocpkg("HMP16SData")`
`SummarizedExperiment` object as its first argument and the path to a dbGaP
repository key as its second argument. It performs download, decryption, and
merging of all available controlled-access participant data with a single
function call.

```{r, eval=FALSE}
V35_stool_protected <-
    V35_stool %>%
    attach_dbGaP("~/prj_12146.ngc")
```

The returned `V35_stool_protected` object contains controlled-access participant
data as additional columns in its `colData` slot.

```{r, eval=FALSE}
colData(V35_stool_protected)
```

## Analysis Using the phyloseq Package

The `r BiocStyle::Biocpkg("phyloseq")` package provides an extensive suite of
methods to analyze microbiome data.

For those familiar with both the HMP and `r BiocStyle::Biocpkg("phyloseq")`, you
may recall that an alternative `phyloseq` class object containing the HMP
variable region 3–5 data has been made available by Joey McMurdie at
<https://joey711.github.io/phyloseq-demo/HMPv35.RData>. However, this object is
not compatible with the methods documented here for integration with dbGaP
controlled-access participant data, shotgun metagenomic data, or variable region
1–3 data. For that reason, we would encourage the use of the
`r BiocStyle::Biocpkg("HMP16SData")` `SummarizedExperiment` class objects with
the `r BiocStyle::Biocpkg("phyloseq")` package.

To demonstrate how `r BiocStyle::Biocpkg("HMP16SData")` could be used as a
control or comparison cohort in microbime data analyses, we will demonstrate
basic comparisons of the palatine tonsils and stool body subsites using the
`r BiocStyle::Biocpkg("phyloseq")` package. We first create and subset two
`SummarizedExperiment` objects from `r BiocStyle::Biocpkg("HMP16SData")` to
include only the relevant body subsites.

```{r}
V13_tonsils <-
    V13() %>%
    subset(select = HMP_BODY_SUBSITE == "Palatine Tonsils")

V13_stool <-
    V13() %>%
    subset(select = HMP_BODY_SUBSITE == "Stool")
```

While these objects are both from the `r BiocStyle::Biocpkg("HMP16SData")`
package, a user would potentially be comparing to their own data and only need a
single object from the package.

### Coercion to phyloseq Objects

The `SummarizedExperiment` class objects can then be coerced to `phyloseq` class
objects containing count data, sample (participant) data, taxonomy, and
phylogenetic trees using the `as_phyloseq` function.

```{r}
V13_tonsils_phyloseq <-
    as_phyloseq(V13_tonsils)

V13_stool_phyloseq <-
    as_phyloseq(V13_stool)
```

The analysis of all the samples in these two `phyloseq` objects would be rather
computationally intensive. So to preform the analysis in a more timely manner, a
function, `sample_samples`, is written here to take a sample of the samples
available in each `phyloseq` object.

```{r}
sample_samples <- function(x, size) {
    sampled_names <-
        sample_names(x) %>%
        sample(size)

    prune_samples(sampled_names, x)
}
```

Each `phyloseq` object is then sampled to contain only twenty-five samples.

```{r}
V13_tonsils_phyloseq %<>%
    sample_samples(25)

V13_stool_phyloseq %<>%
    sample_samples(25)
```

A "Study" identifier is then added to the `sample_data` of each `phyloseq`
object to be used for stratification in analysis. In the case that a user were
comparing the HMP samples to their own data, an identifier would be added in the
same manner.

```{r}
sample_data(V13_tonsils_phyloseq)$Study <- "Tonsils"

sample_data(V13_stool_phyloseq)$Study <- "Stool"
```

Once the two `phyloseq` objects have been sampled and their `sample_data` has
been augmented, they can be merged into a single `phyloseq` object using the
`merge_phyloseq` command.

```{r}
V13_phyloseq <-
    merge_phyloseq(V13_tonsils_phyloseq, V13_stool_phyloseq)
```

Finally, because the V13 data were subset and sampled, taxa with no relative
abundance are present in the merged object. These are removed using the
`prune_taxa` command to avoid warnings during analysis.

```{r}
V13_phyloseq %<>%
    taxa_sums() %>%
    is_greater_than(0) %>%
    prune_taxa(V13_phyloseq)
```

The resulting `V13_phyloseq` object can then be analyzed quickly and easily.

### Alpha Diversity Analysis

Alpha diversity measures the taxonomic variation within a sample and
`r BiocStyle::Biocpkg("phyloseq")` provides a method, `plot_richness`, to plot
various alpha diversity measures.

First a vector of richness (i.e. alpha diversity) measures is created to be
passed to the `plot_richness` method.

```{r}
richness_measures <-
    c("Observed", "Shannon", "Simpson")
```

The `V13_phyloseq` object and the `richness_measures` vector are then passed to
the `plot_richness` method to construct a box plot of the three alpha diversity
measures. Additional `r BiocStyle::CRANpkg("ggplot2")` syntax is used to control
the presentational aspects of the plot.

```{r, fig.height=5, fig.width=8}
V13_phyloseq %>%
    plot_richness(x = "Study", color = "Study", measures = richness_measures) +
    stat_boxplot(geom ="errorbar") +
    geom_boxplot() +
    theme_bw() +
    theme(axis.title.x = element_blank(), legend.position = "none")
```

### Beta Diversity Analysis

Beta diversity measures the taxonomic variation between samples by calculating
the dissimilarity of clade relative abundances. The
`r BiocStyle::Biocpkg("phyloseq")` package provides a method, `distance`, to
calculate various dissimilarity measures, such as Bray–Curtis dissimilarity.
Once dissimilarity has been calculated, samples can then be clustered and
represented as a dendrogram.

```{r}
V13_dendrogram <-
    distance(V13_phyloseq, method = "bray") %>%
    hclust() %>%
    as.dendrogram()
```

However, coercion to a `dendrogram` object results in the lost of `sample_data`
present in the `phyloseq` object which is needed for plotting. A `data.frame` of
this `sample_data` can be extracted from the `phyloseq` object as follows.

```{r}
V13_sample_data <-
    sample_data(V13_phyloseq) %>%
    data.frame()
```

Samples in the the plots will be identified by "PSN"" (Primary Sample Number)
and "Study". So, additional columns to denote the colors and shapes of leaves
and labels are added to the `data.frame` using `r BiocStyle::CRANpkg("dplyr")`
syntax.

```{r}
V13_sample_data %<>%
    rownames_to_column(var = "PSN") %>%
    mutate(labels_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>%
    mutate(leaves_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>%
    mutate(leaves_pch = if_else(Study == "Stool", 16, 17))
```

Additionally, the order of samples in the `dendrogram` and `data.frame` objects
is different and a vector to sort samples is constructed as follows.

```{r}
V13_sample_order <-
    labels(V13_dendrogram) %>%
    match(V13_sample_data$PSN)
```

The label and leaf color and shape columns of the `data.frame` object can then
be coerced to vectors and sorted according to the sample order of the
`dendrogram` object.

```{r}
labels_col <- V13_sample_data$labels_col[V13_sample_order]
leaves_col <- V13_sample_data$leaves_col[V13_sample_order]
leaves_pch <- V13_sample_data$leaves_pch[V13_sample_order]
```

The `r BiocStyle::CRANpkg("dendextend")` package is then used to add these
vectors to the `dendrogram` object as metadata which will be used for plotting.

```{r}
V13_dendrogram %<>%
    set("labels_col", labels_col) %>%
    set("leaves_col", leaves_col) %>%
    set("leaves_pch", leaves_pch)
```

Finally, the `r BiocStyle::CRANpkg("dendextend")` package provides a method,
`circlize_dendrogram`, to produce a circular dendrogram, using the
`r BiocStyle::CRANpkg("circlize")` package, with a single line of code.

```{r, fig.height=8, fig.width=8}
V13_dendrogram %>%
    circlize_dendrogram(labels_track_height = 0.2)
```

### Principle Coordinates Analysis

The `r BiocStyle::Biocpkg("phyloseq")` package additionally provides methods for
commonly-used ordination analyses such as principle coordinates analysis. The
`ordinate` method simply requires a `phyloseq` object and specification of the
type of ordination analysis to be preformed. The type of distance method used
can also optionally be specified.

```{r}
V13_ordination <-
    ordinate(V13_phyloseq, method = "PCoA", distance = "bray")
```

The ordination analysis can then be plotted using the `plot_ordination` method
provided by the `r BiocStyle::Biocpkg("phyloseq")` package. Again, additional
`r BiocStyle::CRANpkg("ggplot2")` syntax is used to control the presentational
aspects of the plot.

```{r, fig.height=5, fig.width=8}
V13_phyloseq %>%
    plot_ordination(V13_ordination, color = "Study", shape = "Study") +
    theme_bw() +
    theme(legend.position = "bottom")
```

Finally, the ordination eigenvalues can be plotted using the `plot_scree` method
provided by the `r BiocStyle::Biocpkg("phyloseq")` package.

```{r, fig.height=5, fig.width=8}
V13_ordination %>%
    plot_scree() +
    theme_bw()
```

# Phylum-level Comparison to Metagenomic Shotgun Sequencing

In addition to 16S rRNA sequencing, the HMP Study conducted whole metagenome
shotgun (MGX) sequencing. These profiles, along with thousands of profiles from
other studies, are available in the
`r BiocStyle::Biocpkg("curatedMetagenomicData")` package. Here, a phylum-level
relative abundance comparison of the 16S and MGX samples is made to illustrate
comparing these data sets.

First a `V35_stool_phyloseq` object is constructed and contains the 16S variable
region 3–5 data for stool samples. Then the MGX stool samples are obtained and
coerced to a `phyloseq` object as follows.

```{r}
V35_stool_phyloseq <-
    V35_stool %>%
    as_phyloseq()

EH <-
    ExperimentHub()

HMP_2012.metaphlan_bugs_list.stool <-
    EH[["EH426"]]

# a modified version of ExpressionSet2phyloseq taken from curatedMetagenomicData
# ExpressionSet2phyloseq was removed in curatedMetagenomicData version 3.0.0
ExpressionSet2phyloseq <- function(eset, simplify = TRUE, relab = TRUE) {
    otu.table <- 
        Biobase::exprs(eset)
    
    sample.data <-
        Biobase::pData(eset) %>%
        phyloseq::sample_data(., errorIfNULL = FALSE)

    taxonomic.ranks <-
        c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Strain")
    
    tax.table <-
        rownames(otu.table) %>%
        gsub("[a-z]__", "", .) %>%
        dplyr::tibble() %>%
        tidyr::separate(., ".", taxonomic.ranks, sep = "\\|", fill="right") %>%
        as.matrix()
    
    if(simplify) {
        rownames(otu.table) <- rownames(otu.table) %>%
            gsub(".+\\|", "", .)
    }
    
    rownames(tax.table) <-
        rownames(otu.table)
    
    if(!relab) {
        otu.table <-
            round(sweep(otu.table, 2, eset$number_reads/100, "*"))
    }
    
    otu.table <-
        phyloseq::otu_table(otu.table, taxa_are_rows = TRUE)
    
    tax.table <-
        phyloseq::tax_table(tax.table)
    
    
    phyloseq::phyloseq(otu.table, sample.data, tax.table)
}

MGX_stool_phyloseq <-
    ExpressionSet2phyloseq(HMP_2012.metaphlan_bugs_list.stool)
```

The `r BiocStyle::Biocpkg("curatedMetagenomicData")` package provides taxonomic
relative abundance for MGX data (with an option to estimate counts by
multiplying by read depth) from MetaPhlAn2. MetaPhlAn2 directly estimates
relative abundance at every taxonomic level based on clade-specific markers, and
these estimates are better than summing lower-level taxonomic abundances.

Note, this comparison becomes complicated because
`r BiocStyle::Biocpkg("phyloseq")` does not currently support row and column
matching and reordering. Instead, we use the `phyloseq::psmelt()` method to
generate `data.frame` objects for further manipulation.

```{r}
MGX_stool_melted <-
    MGX_stool_phyloseq %>%
    subset_taxa(is.na(Class) & !is.na(Phylum)) %>%
    psmelt()
```

The 16S data sets do not contain summarized counts for higher taxonomic levels,
so we use the `phyloseq::tax_glom()` method to agglomerate at phylum level.

```{r}
V35_stool_melted <-
    V35_stool_phyloseq %>%
    tax_glom(taxrank = "PHYLUM") %>%
    psmelt()
```

There is a column `SRS_SAMPLE_ID` present in the 16S variable region 3–5 data
that is explicitly for mapping to the MGX samples and the matching identifiers
are in the `NCBI_accession` column of the MGX sample data. The intersection of
the two vectors represents the matching samples that are present in both data
sets. This intersection, shown below as `SRS_SAMPLE_IDS`, can then be used to
filter both the 16S and MGX samples to include only the samples in common.

Along with the `filter` step shown below, standardization of sample identifiers
to the `SRS` numbers and conversion of taxonomic counts to relative abundance is
done. The conversion to relative abundance is necessary for comparability across
the 16S and MGX samples.

In either case, data is first grouped by samples and the percent composition of
each phylum relative to the others in the sample is calculated by taking the
count abundance of the phylum divided by the sum of all abundance counts in the
sample. Samples are then sorted by phylum and descending relative abundance
before being grouped by phylum – the grouping by phylum allows for the
assignment a phylum rank that is then used to sort the phyla by descending
relative abundance. Finally, only the sample, phylum, and relative abundance
columns are kept once the order has been established and all groupings can be
removed.

```{r}
SRS_SAMPLE_IDS <-
    intersect(V35_stool_melted$SRS_SAMPLE_ID, MGX_stool_melted$NCBI_accession)

V35_stool_melted %<>%
    filter(SRS_SAMPLE_ID %in% SRS_SAMPLE_IDS) %>%
    rename(Phylum = PHYLUM) %>%
    mutate(Sample = SRS_SAMPLE_ID) %>%
    group_by(Sample) %>%
    mutate(`Relative Abundance` = Abundance / sum(Abundance)) %>%
    arrange(Phylum, -`Relative Abundance`) %>%
    group_by(Phylum) %>%
    mutate(phylum_rank = sum(`Relative Abundance`)) %>%
    arrange(-phylum_rank) %>%
    select(Sample, Phylum, `Relative Abundance`) %>%
    ungroup()

MGX_stool_melted %<>%
    filter(NCBI_accession %in% SRS_SAMPLE_IDS) %>%
    group_by(Sample) %>%
    mutate(`Relative Abundance` = Abundance / sum(Abundance)) %>%
    arrange(Phylum, -`Relative Abundance`) %>%
    group_by(Phylum) %>%
    mutate(phylum_rank = sum(`Relative Abundance`)) %>%
    arrange(-phylum_rank) %>%
    select(Sample, Phylum, `Relative Abundance`) %>%
    ungroup()
```

Provided that the phyla are ordered by relative abundance, the top phyla from
each data set can be obtained and intersected to give the top eight phyla
present in both data sets. The top eight phyla can then be used to filter 16S
and MGX samples to include only the desired phyla.

```{r}
V35_top_phyla <-
    V35_stool_melted %$%
    unique(Phylum) %>%
    as.character()

MGX_top_phyla <-
    MGX_stool_melted %$%
    unique(Phylum) %>%
    as.character()

top_eight_phyla <-
    intersect(V35_top_phyla, MGX_top_phyla) %>%
    extract(1:8)

V35_stool_melted %<>%
    filter(Phylum %in% top_eight_phyla)

MGX_stool_melted %<>%
    filter(Phylum %in% top_eight_phyla)
```

To achieve ordering of samples by the relative abundance of the top phylum when
plotting, a vector of the unique sample identifiers is constructed and will be
used as the `levels` of a `factor`.

```{r}
sample_order <-
    V35_stool_melted %$%
    unique(Sample)
```

A vector of unique phyla is also constructed and will be used as the `levels` of
a `factor` when plotting. If this were not done, phyla would simply be sorted
alphabetically.

```{r}
phylum_order <-
    V35_stool_melted %$%
    unique(Phylum)
```

Color blindness affects a significant portion of the population, but, using an
intelligent color pallet, figures can be designed to be friendly to those with
deuteranopia, protanopia, and tritanopia. The following colors are derived from
Wong, B. Points of view: Color blindness. *Nat. Methods* **8**, 441 (2011).

```{r}
bang_wong_colors <-
    c("#CC79A7", "#D55E00", "#0072B2", "#F0E442", "#009E73", "#56B4E9",
      "#E69F00", "#000000")
```

Using the `sample_order` and `phylum_order` vectors constructed above, stacked
phylum-level relative abundance bar plots sorted by decreasing relative
abundance of the top phylum and stratified by the top eight phyla can be made
for 16S and MGX samples. The two plots are made separately and serialized so
that they can be arranged in a single figure for comparison.

```{r}
V35_plot <-
    V35_stool_melted %>%
    mutate(Sample = factor(Sample, sample_order)) %>%
    mutate(Phylum = factor(Phylum, phylum_order)) %>%
    ggplot(aes(Sample, `Relative Abundance`, fill = Phylum)) +
    geom_bar(stat = "identity", position = "fill", width = 1) +
    scale_fill_manual(values = bang_wong_colors) +
    theme_minimal() +
    theme(axis.text.x = element_blank(), axis.title.x = element_blank(),
          panel.grid = element_blank(), legend.position = "none",
          legend.title = element_blank()) +
    ggtitle("Phylum-Level Relative Abundance", "16S Stool Samples")

MGX_plot <-
    MGX_stool_melted %>%
    mutate(Sample = factor(Sample, sample_order)) %>%
    mutate(Phylum = factor(Phylum, phylum_order)) %>%
    ggplot(aes(Sample, `Relative Abundance`, fill = Phylum)) +
    geom_bar(stat = "identity", position = "fill", width = 1) +
    scale_fill_manual(values = bang_wong_colors) +
    theme_minimal() +
    theme(axis.text.x = element_blank(), axis.title.x = element_blank(),
          panel.grid = element_blank(), legend.position = "none",
          legend.title = element_blank()) +
    ggtitle("", "MGX Stool Samples")
```

In the plots above, legends are removed to reduce redundancy, but a legend is
still necessary and can be serialized as its own plot using the `get_legend`
method from the `r BiocStyle::CRANpkg("cowplot")` package.

```{r}
plot_legend <- {
        MGX_plot +
            theme(legend.position = "bottom")
    } %>%
    get_legend()
```

Finally, the `grid.arrange` method from the `r BiocStyle::CRANpkg("gridExtra")`
package is used to arrange, scale, and plot the three plots in a single figure.

```{r, fig.height=8, fig.width=8}
grid.arrange(V35_plot, MGX_plot, plot_legend, ncol = 1, heights = c(3, 3, 1))
```

Notably, the figure illustrates the Bacteroidetes/Firmicutes gradient with
reasonable agreement between the 16S and MGX samples.

When these plots were submitted for publication, the following code was used to
produce ESP and PDF files.

```{r, eval=FALSE}
V35_plot +
    theme(text = element_text(size = 19)) +
    labs(title = NULL, subtitle = NULL, tag = "A")

ggsave("~/AJE-00611-2018 Schiffer Figure 2A.eps", device = "eps")
ggsave("~/AJE-00611-2018 Schiffer Figure 2A.pdf", device = "pdf")

MGX_plot +
    theme(text = element_text(size = 19)) +
    labs(title = NULL, subtitle = NULL, tag = "B")

ggsave("~/AJE-00611-2018 Schiffer Figure 2B.eps", device = "eps")
ggsave("~/AJE-00611-2018 Schiffer Figure 2B.pdf", device = "pdf")

plot_legend <- {
        MGX_plot +
            theme(legend.position = "bottom", text = element_text(size = 19)) +
            guides(fill = guide_legend(byrow = TRUE))
    } %>%
    get_legend()

ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 1.eps", plot = plot_legend,
       device = "eps")
ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 1.pdf", plot = plot_legend,
       device = "pdf")

plot_legend <- {
        MGX_plot +
            theme(legend.position = "right", text = element_text(size = 19))
    } %>%
    get_legend()

ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 2.eps", plot = plot_legend,
       device = "eps")
ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 2.pdf", plot = plot_legend,
       device = "pdf")

plot_legend <- {
        MGX_plot +
            theme(legend.position = "right", text = element_text(size = 19)) +
            guides(fill = guide_legend(ncol = 2, byrow = TRUE))
    } %>%
    get_legend()

ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 3.eps", plot = plot_legend,
       device = "eps")
ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 3.pdf", plot = plot_legend,
       device = "pdf")

V35_plot <-
    V35_plot +
    theme(text = element_text(size = 19)) +
    labs(title = NULL, subtitle = NULL, tag = "A")

MGX_plot <-
    MGX_plot +
    theme(text = element_text(size = 19)) +
    labs(title = NULL, subtitle = NULL, tag = "B")

plot_legend <- {
        MGX_plot +
            theme(legend.position = "bottom", text = element_text(size = 19)) +
            guides(fill = guide_legend(byrow = TRUE))
    } %>%
    get_legend()

grid_object <-
    grid.arrange(V35_plot, MGX_plot, plot_legend, ncol = 1,
                 heights = c(3, 3, 1))

ggsave("~/AJE-00611-2018 Schiffer Figure 3.eps", plot = grid_object,
       device = "eps", width = 8, height = 8)
ggsave("~/AJE-00611-2018 Schiffer Figure 3.pdf", plot = grid_object,
       device = "pdf", width = 8, height = 8)
```

# Exporting Data to CSV, SAS, SPSS, and STATA Formats

To our knowledge, R and Bioconductor provide the most and best methods for the
analysis of microbiome data. However, we realize they are not the only analysis
environments and wish to provide methods to export the data from
`r BiocStyle::Biocpkg("HMP16SData")` to CSV, SAS, SPSS, and STATA formats. As we
do not use these other languages regularly, we are unaware of how to represent
phylogenitic trees in them and will not attempt to export the trees here.

## Prepare Data for Export

Bioconductor's `SummarizedExperiment` class is essentially a normalized
representation of tightly coupled data and metadata that must be "unglued"
before it can be saved into alternative formats.

The process begins by creating a `data.frame` object of the participant data and
moving the row names to their own column.

```{r, eval=FALSE}
V13_participant_data <-
  V13() %>%
  colData() %>%
  as.data.frame() %>%
  rownames_to_column(var = "PSN")
```

Next, the taxonomic abundance matrix is extracted and transposed because
Bioconductor objects represent samples as rows and measurements as columns
whereas other languages do the opposite. The matrix is then coerced to a
`data.frame` object and the row names are moved to their own column.

```{r, eval=FALSE}
V13_OTU_counts <-
  V13() %>%
  assay() %>%
  t() %>%
  as.data.frame() %>%
  rownames_to_column(var = "PSN")
```

With the participant data and taxonomic abundances represented as simple tables
containing a primary key (i.e. PSN or Primary Sample Number) the two tables can
be joined using the `merge.data.frame` method.

```{r, eval=FALSE}
V13_data <-
  merge.data.frame(V13_participant_data, V13_OTU_counts, by = "PSN")
```

The column names of the `V13_data` object are denoted as OTUs or operational
taxonomic units based on their sequence similarity to the 16S rRNA gene. The OTU
nomenclature is not particularly useful without the traditional taxonomic clade
identifiers which are stored in a separate table. A dictionary of these
identifiers is created by extracting and transposing the `rowData` of the
`SummarizedExperiment` object which is then coerced to a `data.frame` object.

```{r, eval=FALSE}
V13_dictionary <-
  V13() %>%
  rowData() %>%
  t.data.frame() %>%
  as.data.frame()
```

The column names of the `V13_data` object contain periods which some languages
and formats are unable to process. The periods in the column names are replaced
with underscores using the `gsub` method.

```{r, eval=FALSE}
colnames(V13_data) <-
    colnames(V13_data) %>%
    gsub(pattern = "\\.", replacement = "_", x = .)
```

The process is repeated for the column names of the `V13_dictionary` object.

```{r, eval=FALSE}
colnames(V13_dictionary) <-
    colnames(V13_dictionary) %>%
    gsub(pattern = "\\.", replacement = "_", x = .)
```

The two tables `V13_data` and `V13_dictionary` are then ready for export to CSV,
SAS, SPSS, and STATA formats.

## Export to CSV Format

To export to CSV format, two calls to the `write_csv` method from the
`r BiocStyle::CRANpkg("readr")` package are used to write CSV files to disk.

```{r, eval=FALSE}
write_csv(V13_data, "~/V13_data.csv")
write_csv(V13_dictionary, "~/V13_dictionary.csv")
```

## Export to SAS Format

To export to SAS format, two calls to the `write_sas` method from the
`r BiocStyle::CRANpkg("haven")` package are used to write SAS files to disk.

```{r, eval=FALSE}
write_sas(V13_data, "~/V13_data.sas7bdat")
write_sas(V13_dictionary, "~/V13_dictionary.sas7bdat")
```

## Export to SPSS Format

To export to SPSS format, two calls to the `write_sav` method from the
`r BiocStyle::CRANpkg("haven")` package are used to write SPSS files to disk.

```{r, eval=FALSE}
write_sav(V13_data, "~/V13_data.sav")
write_sav(V13_dictionary, "~/V13_dictionary.sav")
```

## Export to STATA Format

To export to STATA format, two calls to the `write_dta` method from the
`r BiocStyle::CRANpkg("haven")` package are used to write STATA files to disk.

```{r, eval=FALSE}
write_dta(V13_data, "~/V13_data.dta", version = 13)
write_dta(V13_dictionary, "~/V13_dictionary.dta", version = 13)
```

Here, version 13 STATA files are written because version 14 and above files
require a platform-specific text encoding that would make the data less
transportable. The encoding of the version 13 files is ASCII.

# Session Information

```{r}
sessionInfo()
```