--- title: "HMP16SData" author: - name: Lucas Schiffer affiliation: - &1 Graduate School of Public Health and Health Policy, City University of New York, New York, NY - &2 Institute for Implementation Science in Population Health, City University of New York, New York, NY - name: Rimsha Azhar affiliation: - *1 - *2 - name: Marcel Ramos affiliation: - *1 - *2 - &3 Roswell Park Cancer Institute, University of Buffalo, Buffalo, NY - name: Ludwig Geistlinger affiliation: - *1 - *2 - name: Levi Waldron affiliation: - *1 - *2 date: '`r format(Sys.Date(), "%B %e, %Y")`' abstract: > HMP16SData is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data for variable regions 1–3 and 3–5. Raw data files are provided in the package as downloaded from the HMP Data Analysis and Coordination Center. Processed data is provided as SummarizedExperiment class objects via ExperimentHub. package: HMP16SData output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{HMP16SData} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- # Publications Schiffer, L. *et al.* [HMP16SData: Efficient Access to the Human Microbiome Project through Bioconductor](https://dx.doi.org/10.1093/aje/kwz006). *Am. J. Epidemiol.* (2019). Griffith, J. C. & Morgan, X. C. [Invited Commentary: Improving accessibility of the Human Microbiome Project data through integration with R/Bioconductor](https://dx.doi.org/10.1093/aje/kwz007). *Am. J. Epidemiol.* (2019). Waldron, L. *et al.* [Waldron et al. Reply to “Commentary on the HMP16SData Bioconductor Package”](https://dx.doi.org/10.1093/aje/kwz008). *Am. J. Epidemiol.* (2019). # Prerequisites The following `r BiocStyle::CRANpkg("knitr")` options will be used in this vignette to provide the most useful and concise output. ```{r} knitr::opts_chunk$set(message = FALSE) ``` The following packages will be used in this vignette to provide demonstrative examples of what a user might do with `r BiocStyle::Biocpkg("HMP16SData")`. ```{r} library(HMP16SData) library(phyloseq) library(magrittr) library(ggplot2) library(tibble) library(dplyr) library(dendextend) library(circlize) library(ExperimentHub) library(gridExtra) library(cowplot) library(readr) library(haven) ``` Pipe operators from the `r BiocStyle::CRANpkg("magrittr")` package are used in this vignette to provide the most elegant and concise syntax. See the `r BiocStyle::CRANpkg("magrittr")` vignette if the syntax is unclear. # Introduction `r BiocStyle::Biocpkg("HMP16SData")` is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data for variable regions 1–3 and 3–5. Raw data files are provided in the package as downloaded from the [HMP Data Analysis and Coordination Center](https://tinyurl.com/y7ev836z). Processed data is provided as `SummarizedExperiment` class objects via `r BiocStyle::Biocpkg("ExperimentHub")`. `r BiocStyle::Biocpkg("HMP16SData")` can be installed using `r BiocStyle::CRANpkg("BiocManager")` as follows. ```{r, eval=FALSE} BiocManager::install("HMP16SData") ``` Once installed, `r BiocStyle::Biocpkg("HMP16SData")` provides two functions to access data – one for variable region 1–3 and another for variable region 3–5. When called, as follows, the functions will download data from an `r BiocStyle::Biocpkg("ExperimentHub")` Amazon S3 (Simple Storage Service) bucket over `https` or load data from a local cache. ```{r} V13() V35() ``` The two data sets are represented as `SummarizedExperiment` objects, a standard Bioconductor class that is amenable to subsetting and analysis. To maintain brevity, details of the `SummarizedExperiment` class are not outlined here but the `r BiocStyle::Biocpkg("SummarizedExperiment")` package provides an excellent vignette. # Features ## Frequency Table Generation Sometimes it is desirable to provide a quick summary of key demographic variables and to make the process easier `r BiocStyle::Biocpkg("HMP16SData")` provides a function, `table_one`, to do so. It returns a `data.frame` or a `list` of `data.frame` objects that have been transformed to make a publication ready table. ```{r} V13() %>% table_one() %>% head() ``` If a `list` is passed to `table_one`, its elements must be named so that the named elements can be used by the `kable_one` function. The `kable_one` function will produce an `HTML` table for vignettes such as the one shown below. ```{r} list(V13 = V13(), V35 = V35()) %>% table_one() %>% kable_one() ``` ## Straightforward Subsetting The `SummarizedExperiment` container provides for straightforward subsetting by data or metadata variables using either the `subset` function or `[` methods – see the `r BiocStyle::Biocpkg("SummarizedExperiment")` vignette for additional details. Shown below, the variable region 3–5 data set is subset to include only stool samples. ```{r} V35_stool <- V35() %>% subset(select = HMP_BODY_SUBSITE == "Stool") V35_stool ``` ## HMP Controlled-Access Participant Data Most participant data from the HMP study is controlled through the National Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes (dbGaP). `r BiocStyle::Biocpkg("HMP16SData")` provides a data dictionary translated from dbGaP `XML` files for the seven different controlled-access data tables related to the HMP. See `?HMP16SData::dictionary` for details of these source data tables, and `View(dictionary)` to view the complete data dictionary. Several steps are required to access the data tables, but the process is greatly simplified by `r BiocStyle::Biocpkg("HMP16SData")`. ### Apply for dbGaP Access You must make a controlled-access application through for: > HMP Core Microbiome Sampling Protocol A (HMP-A) (**phs000228.v4.p1**) Once approved, browse to , sign in, and select the option "*get dbGaP repository key*" to download your `*.ngc` repository key. This is all you need from the dbGaP website. ### Install the SRA Toolkit You must also install the [NCBI SRA Toolkit](https://tinyurl.com/ydgmzc8a), which will be used in the background for downloading and decrypting controlled-access data. There are shortcuts for common platforms: * Debian/Ubuntu: `apt install sra-toolkit` * macOS: `brew install sratoolkit` For macOS, the `brew` command does not come installed by default and requires installation of the homebrew package manager. Instructions are available at . For Windows, binary installation is necessary and instructions are available at . ### Merge with HMP16SData The `attach_dbGap()` function takes a `r BiocStyle::Biocpkg("HMP16SData")` `SummarizedExperiment` object as its first argument and the path to a dbGaP repository key as its second argument. It performs download, decryption, and merging of all available controlled-access participant data with a single function call. ```{r, eval=FALSE} V35_stool_protected <- V35_stool %>% attach_dbGaP("~/prj_12146.ngc") ``` The returned `V35_stool_protected` object contains controlled-access participant data as additional columns in its `colData` slot. ```{r, eval=FALSE} colData(V35_stool_protected) ``` ## Analysis Using the phyloseq Package The `r BiocStyle::Biocpkg("phyloseq")` package provides an extensive suite of methods to analyze microbiome data. For those familiar with both the HMP and `r BiocStyle::Biocpkg("phyloseq")`, you may recall that an alternative `phyloseq` class object containing the HMP variable region 3–5 data has been made available by Joey McMurdie at . However, this object is not compatible with the methods documented here for integration with dbGaP controlled-access participant data, shotgun metagenomic data, or variable region 1–3 data. For that reason, we would encourage the use of the `r BiocStyle::Biocpkg("HMP16SData")` `SummarizedExperiment` class objects with the `r BiocStyle::Biocpkg("phyloseq")` package. To demonstrate how `r BiocStyle::Biocpkg("HMP16SData")` could be used as a control or comparison cohort in microbime data analyses, we will demonstrate basic comparisons of the palatine tonsils and stool body subsites using the `r BiocStyle::Biocpkg("phyloseq")` package. We first create and subset two `SummarizedExperiment` objects from `r BiocStyle::Biocpkg("HMP16SData")` to include only the relevant body subsites. ```{r} V13_tonsils <- V13() %>% subset(select = HMP_BODY_SUBSITE == "Palatine Tonsils") V13_stool <- V13() %>% subset(select = HMP_BODY_SUBSITE == "Stool") ``` While these objects are both from the `r BiocStyle::Biocpkg("HMP16SData")` package, a user would potentially be comparing to their own data and only need a single object from the package. ### Coercion to phyloseq Objects The `SummarizedExperiment` class objects can then be coerced to `phyloseq` class objects containing count data, sample (participant) data, taxonomy, and phylogenetic trees using the `as_phyloseq` function. ```{r} V13_tonsils_phyloseq <- as_phyloseq(V13_tonsils) V13_stool_phyloseq <- as_phyloseq(V13_stool) ``` The analysis of all the samples in these two `phyloseq` objects would be rather computationally intensive. So to preform the analysis in a more timely manner, a function, `sample_samples`, is written here to take a sample of the samples available in each `phyloseq` object. ```{r} sample_samples <- function(x, size) { sampled_names <- sample_names(x) %>% sample(size) prune_samples(sampled_names, x) } ``` Each `phyloseq` object is then sampled to contain only twenty-five samples. ```{r} V13_tonsils_phyloseq %<>% sample_samples(25) V13_stool_phyloseq %<>% sample_samples(25) ``` A "Study" identifier is then added to the `sample_data` of each `phyloseq` object to be used for stratification in analysis. In the case that a user were comparing the HMP samples to their own data, an identifier would be added in the same manner. ```{r} sample_data(V13_tonsils_phyloseq)$Study <- "Tonsils" sample_data(V13_stool_phyloseq)$Study <- "Stool" ``` Once the two `phyloseq` objects have been sampled and their `sample_data` has been augmented, they can be merged into a single `phyloseq` object using the `merge_phyloseq` command. ```{r} V13_phyloseq <- merge_phyloseq(V13_tonsils_phyloseq, V13_stool_phyloseq) ``` Finally, because the V13 data were subset and sampled, taxa with no relative abundance are present in the merged object. These are removed using the `prune_taxa` command to avoid warnings during analysis. ```{r} V13_phyloseq %<>% taxa_sums() %>% is_greater_than(0) %>% prune_taxa(V13_phyloseq) ``` The resulting `V13_phyloseq` object can then be analyzed quickly and easily. ### Alpha Diversity Analysis Alpha diversity measures the taxonomic variation within a sample and `r BiocStyle::Biocpkg("phyloseq")` provides a method, `plot_richness`, to plot various alpha diversity measures. First a vector of richness (i.e. alpha diversity) measures is created to be passed to the `plot_richness` method. ```{r} richness_measures <- c("Observed", "Shannon", "Simpson") ``` The `V13_phyloseq` object and the `richness_measures` vector are then passed to the `plot_richness` method to construct a box plot of the three alpha diversity measures. Additional `r BiocStyle::CRANpkg("ggplot2")` syntax is used to control the presentational aspects of the plot. ```{r, fig.height=5, fig.width=8} V13_phyloseq %>% plot_richness(x = "Study", color = "Study", measures = richness_measures) + stat_boxplot(geom ="errorbar") + geom_boxplot() + theme_bw() + theme(axis.title.x = element_blank(), legend.position = "none") ``` ### Beta Diversity Analysis Beta diversity measures the taxonomic variation between samples by calculating the dissimilarity of clade relative abundances. The `r BiocStyle::Biocpkg("phyloseq")` package provides a method, `distance`, to calculate various dissimilarity measures, such as Bray–Curtis dissimilarity. Once dissimilarity has been calculated, samples can then be clustered and represented as a dendrogram. ```{r} V13_dendrogram <- distance(V13_phyloseq, method = "bray") %>% hclust() %>% as.dendrogram() ``` However, coercion to a `dendrogram` object results in the lost of `sample_data` present in the `phyloseq` object which is needed for plotting. A `data.frame` of this `sample_data` can be extracted from the `phyloseq` object as follows. ```{r} V13_sample_data <- sample_data(V13_phyloseq) %>% data.frame() ``` Samples in the the plots will be identified by "PSN"" (Primary Sample Number) and "Study". So, additional columns to denote the colors and shapes of leaves and labels are added to the `data.frame` using `r BiocStyle::CRANpkg("dplyr")` syntax. ```{r} V13_sample_data %<>% rownames_to_column(var = "PSN") %>% mutate(labels_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>% mutate(leaves_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>% mutate(leaves_pch = if_else(Study == "Stool", 16, 17)) ``` Additionally, the order of samples in the `dendrogram` and `data.frame` objects is different and a vector to sort samples is constructed as follows. ```{r} V13_sample_order <- labels(V13_dendrogram) %>% match(V13_sample_data$PSN) ``` The label and leaf color and shape columns of the `data.frame` object can then be coerced to vectors and sorted according to the sample order of the `dendrogram` object. ```{r} labels_col <- V13_sample_data$labels_col[V13_sample_order] leaves_col <- V13_sample_data$leaves_col[V13_sample_order] leaves_pch <- V13_sample_data$leaves_pch[V13_sample_order] ``` The `r BiocStyle::CRANpkg("dendextend")` package is then used to add these vectors to the `dendrogram` object as metadata which will be used for plotting. ```{r} V13_dendrogram %<>% set("labels_col", labels_col) %>% set("leaves_col", leaves_col) %>% set("leaves_pch", leaves_pch) ``` Finally, the `r BiocStyle::CRANpkg("dendextend")` package provides a method, `circlize_dendrogram`, to produce a circular dendrogram, using the `r BiocStyle::CRANpkg("circlize")` package, with a single line of code. ```{r, fig.height=8, fig.width=8} V13_dendrogram %>% circlize_dendrogram(labels_track_height = 0.2) ``` ### Principle Coordinates Analysis The `r BiocStyle::Biocpkg("phyloseq")` package additionally provides methods for commonly-used ordination analyses such as principle coordinates analysis. The `ordinate` method simply requires a `phyloseq` object and specification of the type of ordination analysis to be preformed. The type of distance method used can also optionally be specified. ```{r} V13_ordination <- ordinate(V13_phyloseq, method = "PCoA", distance = "bray") ``` The ordination analysis can then be plotted using the `plot_ordination` method provided by the `r BiocStyle::Biocpkg("phyloseq")` package. Again, additional `r BiocStyle::CRANpkg("ggplot2")` syntax is used to control the presentational aspects of the plot. ```{r, fig.height=5, fig.width=8} V13_phyloseq %>% plot_ordination(V13_ordination, color = "Study", shape = "Study") + theme_bw() + theme(legend.position = "bottom") ``` Finally, the ordination eigenvalues can be plotted using the `plot_scree` method provided by the `r BiocStyle::Biocpkg("phyloseq")` package. ```{r, fig.height=5, fig.width=8} V13_ordination %>% plot_scree() + theme_bw() ``` # Phylum-level Comparison to Metagenomic Shotgun Sequencing In addition to 16S rRNA sequencing, the HMP Study conducted whole metagenome shotgun (MGX) sequencing. These profiles, along with thousands of profiles from other studies, are available in the `r BiocStyle::Biocpkg("curatedMetagenomicData")` package. Here, a phylum-level relative abundance comparison of the 16S and MGX samples is made to illustrate comparing these data sets. First a `V35_stool_phyloseq` object is constructed and contains the 16S variable region 3–5 data for stool samples. Then the MGX stool samples are obtained and coerced to a `phyloseq` object as follows. ```{r} V35_stool_phyloseq <- V35_stool %>% as_phyloseq() EH <- ExperimentHub() HMP_2012.metaphlan_bugs_list.stool <- EH[["EH426"]] # a modified version of ExpressionSet2phyloseq taken from curatedMetagenomicData # ExpressionSet2phyloseq was removed in curatedMetagenomicData version 3.0.0 ExpressionSet2phyloseq <- function(eset, simplify = TRUE, relab = TRUE) { otu.table <- Biobase::exprs(eset) sample.data <- Biobase::pData(eset) %>% phyloseq::sample_data(., errorIfNULL = FALSE) taxonomic.ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Strain") tax.table <- rownames(otu.table) %>% gsub("[a-z]__", "", .) %>% dplyr::tibble() %>% tidyr::separate(., ".", taxonomic.ranks, sep = "\\|", fill="right") %>% as.matrix() if(simplify) { rownames(otu.table) <- rownames(otu.table) %>% gsub(".+\\|", "", .) } rownames(tax.table) <- rownames(otu.table) if(!relab) { otu.table <- round(sweep(otu.table, 2, eset$number_reads/100, "*")) } otu.table <- phyloseq::otu_table(otu.table, taxa_are_rows = TRUE) tax.table <- phyloseq::tax_table(tax.table) phyloseq::phyloseq(otu.table, sample.data, tax.table) } MGX_stool_phyloseq <- ExpressionSet2phyloseq(HMP_2012.metaphlan_bugs_list.stool) ``` The `r BiocStyle::Biocpkg("curatedMetagenomicData")` package provides taxonomic relative abundance for MGX data (with an option to estimate counts by multiplying by read depth) from MetaPhlAn2. MetaPhlAn2 directly estimates relative abundance at every taxonomic level based on clade-specific markers, and these estimates are better than summing lower-level taxonomic abundances. Note, this comparison becomes complicated because `r BiocStyle::Biocpkg("phyloseq")` does not currently support row and column matching and reordering. Instead, we use the `phyloseq::psmelt()` method to generate `data.frame` objects for further manipulation. ```{r} MGX_stool_melted <- MGX_stool_phyloseq %>% subset_taxa(is.na(Class) & !is.na(Phylum)) %>% psmelt() ``` The 16S data sets do not contain summarized counts for higher taxonomic levels, so we use the `phyloseq::tax_glom()` method to agglomerate at phylum level. ```{r} V35_stool_melted <- V35_stool_phyloseq %>% tax_glom(taxrank = "PHYLUM") %>% psmelt() ``` There is a column `SRS_SAMPLE_ID` present in the 16S variable region 3–5 data that is explicitly for mapping to the MGX samples and the matching identifiers are in the `NCBI_accession` column of the MGX sample data. The intersection of the two vectors represents the matching samples that are present in both data sets. This intersection, shown below as `SRS_SAMPLE_IDS`, can then be used to filter both the 16S and MGX samples to include only the samples in common. Along with the `filter` step shown below, standardization of sample identifiers to the `SRS` numbers and conversion of taxonomic counts to relative abundance is done. The conversion to relative abundance is necessary for comparability across the 16S and MGX samples. In either case, data is first grouped by samples and the percent composition of each phylum relative to the others in the sample is calculated by taking the count abundance of the phylum divided by the sum of all abundance counts in the sample. Samples are then sorted by phylum and descending relative abundance before being grouped by phylum – the grouping by phylum allows for the assignment a phylum rank that is then used to sort the phyla by descending relative abundance. Finally, only the sample, phylum, and relative abundance columns are kept once the order has been established and all groupings can be removed. ```{r} SRS_SAMPLE_IDS <- intersect(V35_stool_melted$SRS_SAMPLE_ID, MGX_stool_melted$NCBI_accession) V35_stool_melted %<>% filter(SRS_SAMPLE_ID %in% SRS_SAMPLE_IDS) %>% rename(Phylum = PHYLUM) %>% mutate(Sample = SRS_SAMPLE_ID) %>% group_by(Sample) %>% mutate(`Relative Abundance` = Abundance / sum(Abundance)) %>% arrange(Phylum, -`Relative Abundance`) %>% group_by(Phylum) %>% mutate(phylum_rank = sum(`Relative Abundance`)) %>% arrange(-phylum_rank) %>% select(Sample, Phylum, `Relative Abundance`) %>% ungroup() MGX_stool_melted %<>% filter(NCBI_accession %in% SRS_SAMPLE_IDS) %>% group_by(Sample) %>% mutate(`Relative Abundance` = Abundance / sum(Abundance)) %>% arrange(Phylum, -`Relative Abundance`) %>% group_by(Phylum) %>% mutate(phylum_rank = sum(`Relative Abundance`)) %>% arrange(-phylum_rank) %>% select(Sample, Phylum, `Relative Abundance`) %>% ungroup() ``` Provided that the phyla are ordered by relative abundance, the top phyla from each data set can be obtained and intersected to give the top eight phyla present in both data sets. The top eight phyla can then be used to filter 16S and MGX samples to include only the desired phyla. ```{r} V35_top_phyla <- V35_stool_melted %$% unique(Phylum) %>% as.character() MGX_top_phyla <- MGX_stool_melted %$% unique(Phylum) %>% as.character() top_eight_phyla <- intersect(V35_top_phyla, MGX_top_phyla) %>% extract(1:8) V35_stool_melted %<>% filter(Phylum %in% top_eight_phyla) MGX_stool_melted %<>% filter(Phylum %in% top_eight_phyla) ``` To achieve ordering of samples by the relative abundance of the top phylum when plotting, a vector of the unique sample identifiers is constructed and will be used as the `levels` of a `factor`. ```{r} sample_order <- V35_stool_melted %$% unique(Sample) ``` A vector of unique phyla is also constructed and will be used as the `levels` of a `factor` when plotting. If this were not done, phyla would simply be sorted alphabetically. ```{r} phylum_order <- V35_stool_melted %$% unique(Phylum) ``` Color blindness affects a significant portion of the population, but, using an intelligent color pallet, figures can be designed to be friendly to those with deuteranopia, protanopia, and tritanopia. The following colors are derived from Wong, B. Points of view: Color blindness. *Nat. Methods* **8**, 441 (2011). ```{r} bang_wong_colors <- c("#CC79A7", "#D55E00", "#0072B2", "#F0E442", "#009E73", "#56B4E9", "#E69F00", "#000000") ``` Using the `sample_order` and `phylum_order` vectors constructed above, stacked phylum-level relative abundance bar plots sorted by decreasing relative abundance of the top phylum and stratified by the top eight phyla can be made for 16S and MGX samples. The two plots are made separately and serialized so that they can be arranged in a single figure for comparison. ```{r} V35_plot <- V35_stool_melted %>% mutate(Sample = factor(Sample, sample_order)) %>% mutate(Phylum = factor(Phylum, phylum_order)) %>% ggplot(aes(Sample, `Relative Abundance`, fill = Phylum)) + geom_bar(stat = "identity", position = "fill", width = 1) + scale_fill_manual(values = bang_wong_colors) + theme_minimal() + theme(axis.text.x = element_blank(), axis.title.x = element_blank(), panel.grid = element_blank(), legend.position = "none", legend.title = element_blank()) + ggtitle("Phylum-Level Relative Abundance", "16S Stool Samples") MGX_plot <- MGX_stool_melted %>% mutate(Sample = factor(Sample, sample_order)) %>% mutate(Phylum = factor(Phylum, phylum_order)) %>% ggplot(aes(Sample, `Relative Abundance`, fill = Phylum)) + geom_bar(stat = "identity", position = "fill", width = 1) + scale_fill_manual(values = bang_wong_colors) + theme_minimal() + theme(axis.text.x = element_blank(), axis.title.x = element_blank(), panel.grid = element_blank(), legend.position = "none", legend.title = element_blank()) + ggtitle("", "MGX Stool Samples") ``` In the plots above, legends are removed to reduce redundancy, but a legend is still necessary and can be serialized as its own plot using the `get_legend` method from the `r BiocStyle::CRANpkg("cowplot")` package. ```{r} plot_legend <- { MGX_plot + theme(legend.position = "bottom") } %>% get_legend() ``` Finally, the `grid.arrange` method from the `r BiocStyle::CRANpkg("gridExtra")` package is used to arrange, scale, and plot the three plots in a single figure. ```{r, fig.height=8, fig.width=8} grid.arrange(V35_plot, MGX_plot, plot_legend, ncol = 1, heights = c(3, 3, 1)) ``` Notably, the figure illustrates the Bacteroidetes/Firmicutes gradient with reasonable agreement between the 16S and MGX samples. When these plots were submitted for publication, the following code was used to produce ESP and PDF files. ```{r, eval=FALSE} V35_plot + theme(text = element_text(size = 19)) + labs(title = NULL, subtitle = NULL, tag = "A") ggsave("~/AJE-00611-2018 Schiffer Figure 2A.eps", device = "eps") ggsave("~/AJE-00611-2018 Schiffer Figure 2A.pdf", device = "pdf") MGX_plot + theme(text = element_text(size = 19)) + labs(title = NULL, subtitle = NULL, tag = "B") ggsave("~/AJE-00611-2018 Schiffer Figure 2B.eps", device = "eps") ggsave("~/AJE-00611-2018 Schiffer Figure 2B.pdf", device = "pdf") plot_legend <- { MGX_plot + theme(legend.position = "bottom", text = element_text(size = 19)) + guides(fill = guide_legend(byrow = TRUE)) } %>% get_legend() ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 1.eps", plot = plot_legend, device = "eps") ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 1.pdf", plot = plot_legend, device = "pdf") plot_legend <- { MGX_plot + theme(legend.position = "right", text = element_text(size = 19)) } %>% get_legend() ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 2.eps", plot = plot_legend, device = "eps") ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 2.pdf", plot = plot_legend, device = "pdf") plot_legend <- { MGX_plot + theme(legend.position = "right", text = element_text(size = 19)) + guides(fill = guide_legend(ncol = 2, byrow = TRUE)) } %>% get_legend() ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 3.eps", plot = plot_legend, device = "eps") ggsave("~/AJE-00611-2018 Schiffer Figure 2 Legend 3.pdf", plot = plot_legend, device = "pdf") V35_plot <- V35_plot + theme(text = element_text(size = 19)) + labs(title = NULL, subtitle = NULL, tag = "A") MGX_plot <- MGX_plot + theme(text = element_text(size = 19)) + labs(title = NULL, subtitle = NULL, tag = "B") plot_legend <- { MGX_plot + theme(legend.position = "bottom", text = element_text(size = 19)) + guides(fill = guide_legend(byrow = TRUE)) } %>% get_legend() grid_object <- grid.arrange(V35_plot, MGX_plot, plot_legend, ncol = 1, heights = c(3, 3, 1)) ggsave("~/AJE-00611-2018 Schiffer Figure 3.eps", plot = grid_object, device = "eps", width = 8, height = 8) ggsave("~/AJE-00611-2018 Schiffer Figure 3.pdf", plot = grid_object, device = "pdf", width = 8, height = 8) ``` # Exporting Data to CSV, SAS, SPSS, and STATA Formats To our knowledge, R and Bioconductor provide the most and best methods for the analysis of microbiome data. However, we realize they are not the only analysis environments and wish to provide methods to export the data from `r BiocStyle::Biocpkg("HMP16SData")` to CSV, SAS, SPSS, and STATA formats. As we do not use these other languages regularly, we are unaware of how to represent phylogenitic trees in them and will not attempt to export the trees here. ## Prepare Data for Export Bioconductor's `SummarizedExperiment` class is essentially a normalized representation of tightly coupled data and metadata that must be "unglued" before it can be saved into alternative formats. The process begins by creating a `data.frame` object of the participant data and moving the row names to their own column. ```{r, eval=FALSE} V13_participant_data <- V13() %>% colData() %>% as.data.frame() %>% rownames_to_column(var = "PSN") ``` Next, the taxonomic abundance matrix is extracted and transposed because Bioconductor objects represent samples as rows and measurements as columns whereas other languages do the opposite. The matrix is then coerced to a `data.frame` object and the row names are moved to their own column. ```{r, eval=FALSE} V13_OTU_counts <- V13() %>% assay() %>% t() %>% as.data.frame() %>% rownames_to_column(var = "PSN") ``` With the participant data and taxonomic abundances represented as simple tables containing a primary key (i.e. PSN or Primary Sample Number) the two tables can be joined using the `merge.data.frame` method. ```{r, eval=FALSE} V13_data <- merge.data.frame(V13_participant_data, V13_OTU_counts, by = "PSN") ``` The column names of the `V13_data` object are denoted as OTUs or operational taxonomic units based on their sequence similarity to the 16S rRNA gene. The OTU nomenclature is not particularly useful without the traditional taxonomic clade identifiers which are stored in a separate table. A dictionary of these identifiers is created by extracting and transposing the `rowData` of the `SummarizedExperiment` object which is then coerced to a `data.frame` object. ```{r, eval=FALSE} V13_dictionary <- V13() %>% rowData() %>% t.data.frame() %>% as.data.frame() ``` The column names of the `V13_data` object contain periods which some languages and formats are unable to process. The periods in the column names are replaced with underscores using the `gsub` method. ```{r, eval=FALSE} colnames(V13_data) <- colnames(V13_data) %>% gsub(pattern = "\\.", replacement = "_", x = .) ``` The process is repeated for the column names of the `V13_dictionary` object. ```{r, eval=FALSE} colnames(V13_dictionary) <- colnames(V13_dictionary) %>% gsub(pattern = "\\.", replacement = "_", x = .) ``` The two tables `V13_data` and `V13_dictionary` are then ready for export to CSV, SAS, SPSS, and STATA formats. ## Export to CSV Format To export to CSV format, two calls to the `write_csv` method from the `r BiocStyle::CRANpkg("readr")` package are used to write CSV files to disk. ```{r, eval=FALSE} write_csv(V13_data, "~/V13_data.csv") write_csv(V13_dictionary, "~/V13_dictionary.csv") ``` ## Export to SAS Format To export to SAS format, two calls to the `write_sas` method from the `r BiocStyle::CRANpkg("haven")` package are used to write SAS files to disk. ```{r, eval=FALSE} write_sas(V13_data, "~/V13_data.sas7bdat") write_sas(V13_dictionary, "~/V13_dictionary.sas7bdat") ``` ## Export to SPSS Format To export to SPSS format, two calls to the `write_sav` method from the `r BiocStyle::CRANpkg("haven")` package are used to write SPSS files to disk. ```{r, eval=FALSE} write_sav(V13_data, "~/V13_data.sav") write_sav(V13_dictionary, "~/V13_dictionary.sav") ``` ## Export to STATA Format To export to STATA format, two calls to the `write_dta` method from the `r BiocStyle::CRANpkg("haven")` package are used to write STATA files to disk. ```{r, eval=FALSE} write_dta(V13_data, "~/V13_data.dta", version = 13) write_dta(V13_dictionary, "~/V13_dictionary.dta", version = 13) ``` Here, version 13 STATA files are written because version 14 and above files require a platform-specific text encoding that would make the data less transportable. The encoding of the version 13 files is ASCII. # Session Information ```{r} sessionInfo() ```