--- title: "HDCytoData package" author: - name: Lukas M. Weber affiliation: - &id1 "Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland" - &id2 "SIB Swiss Institute of Bioinformatics, Zurich, Switzerland" - name: Charlotte Soneson affiliation: - &id3 "Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland" - &id4 "SIB Swiss Institute of Bioinformatics, Basel, Switzerland" package: HDCytoData output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{1. HDCytoData package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Overview The `HDCytoData` package is an extensible resource containing a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) benchmark datasets, which have been formatted into `SummarizedExperiment` and `flowSet` Bioconductor object formats. The data objects are hosted on Bioconductor's `ExperimentHub` platform. The objects each contain one or more tables of cell-level expression values, as well as all required metadata. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population or cluster labels (where available), and labels identifying 'spiked in' cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type, cell state, as well as non protein marker columns). Note that raw expression values should be transformed prior to any downstream analyses (see below). Currently, the package includes benchmark datasets used in our previous work to evaluate methods for clustering and differential analyses. The datasets are provided here in `SummarizedExperiment` and `flowSet` formats in order to make them easier to access and integrate into R/Bioconductor workflows. For more details, see our paper describing the `HDCytoData` package: - [Weber L.M. and Soneson C. (2019), *HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats*, F1000Research, 8:1459, v2.](https://f1000research.com/articles/8-1459) # Datasets The package contains the following datasets, which can be grouped into datasets useful for benchmarking methods for (i) clustering, and (ii) differential analyses. - Clustering: - Levine_32dim - Levine_13dim - Samusik_01 - Samusik_all - Nilsson_rare - Mosmann_rare - Differential analyses: - Krieg_Anti_PD_1 - Bodenmiller_BCR_XL - Weber_AML_sim (multiple datasets from several simulation scenarios) - Weber_BCR_XL_sim (multiple datasets from several simulation scenarios) Extensive documentation is available in the help files for the objects. For each dataset, this includes a description of the dataset (e.g. biological context, number of samples and conditions, number of cells, number of reference cell populations, number and classes of protein markers, etc.), as well as an explanation of the object structures, details on accessor functions required to access the expression tables and metadata, and references to original data sources. File sizes are listed in the help files for the datasets. The `removeCache` function from the `ExperimentHub` package can be used to clear the local download cache (see `ExperimentHub` documentation). The help files can be accessed by the dataset names, e.g. `?Bodenmiller_BCR_XL` or `help(Bodenmiller_BCR_XL)`. ## Programmatic access to list of datasets An updated list of all available datasets can also be obtained programmatically using the `ExperimentHub` accessor functions, as follows. This retrieves a table of metadata from the `ExperimentHub` database, which includes information such as the ExperimentHub ID, title, and description for each dataset. ```{r} suppressPackageStartupMessages(library(ExperimentHub)) # Create ExperimentHub instance ehub <- ExperimentHub() # Find HDCytoData datasets ehub <- query(ehub, "HDCytoData") ehub # Retrieve metadata table md <- as.data.frame(mcols(ehub)) head(md, 2) ``` # How to load data This section shows how to load the datasets, using one of the datasets (`Bodenmiller_BCR_XL`) as an example. The datasets can be loaded by either (i) referring to named functions for each dataset, or (ii) creating an `ExperimentHub` instance and referring to the dataset IDs. Both methods are demonstrated below. See the help files (e.g. `?Bodenmiller_BCR_XL`) for details about the structure of the `SummarizedExperiment` or `flowSet` objects. Load the datasets using named functions: ```{r} suppressPackageStartupMessages(library(HDCytoData)) # Load 'SummarizedExperiment' object using named function Bodenmiller_BCR_XL_SE() # Load 'flowSet' object using named function Bodenmiller_BCR_XL_flowSet() ``` Alternatively, load the datasets by creating an `ExperimentHub` instance: ```{r} # Create ExperimentHub instance ehub <- ExperimentHub() # Find HDCytoData datasets query(ehub, "HDCytoData") # Load 'SummarizedExperiment' object using dataset ID ehub[["EH2254"]] # Load 'flowSet' object using dataset ID ehub[["EH2255"]] ``` ## Using the data Once the datasets have been loaded from `ExperimentHub`, they can be used as normal within an R session. For example, using the `SummarizedExperiment` form of the dataset loaded above: ```{r} # Load dataset in 'SummarizedExperiment' format d_SE <- Bodenmiller_BCR_XL_SE() # Inspect object d_SE length(assays(d_SE)) assay(d_SE)[1:6, 1:6] rowData(d_SE) colData(d_SE) metadata(d_SE) ``` ## Transformation of raw data Note that flow and mass cytometry data should be transformed prior to performing any downstream analyses, such as clustering. Standard transformations include the `asinh` with `cofactor` parameter equal to 5 for mass cytometry (CyTOF) data, or 150 for flow cytometry data (see Bendall et al. 2011, Supplementary Figure S2). ## Exploring the data Interactive visualizations to explore the datasets can be generated from the `SummarizedExperiment` objects using the [iSEE](http://bioconductor.org/packages/iSEE) ("Interactive SummarizedExperiment Explorer") package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018), which provides a Shiny-based graphical user interface to explore single-cell datasets stored in the `SummarizedExperiment` format. For more details, see the `iSEE` package vignettes. # Contribution guidelines We welcome contributions or suggestions for new datasets to include in the `HDCytoData` package. Contribution guidelines are provided in the [Contribution guidelines](http://bioconductor.org/packages/release/data/experiment/vignettes/HDCytoData/inst/doc/Contribution_ _guidelines.html) vignette, available from [Bioconductor](http://bioconductor.org/packages/HDCytoData). # Citation If the `HDCytoData` package is useful in your work, please cite the following paper: - [Weber L.M. and Soneson C. (2019), *HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats*, F1000Research, 8:1459, v2.](https://f1000research.com/articles/8-1459)