--- title: "scMultiome tutorial" author: Aleksander Chlebowski and Xiaosai Yao output: BiocStyle::html_document: toc: true number_section: true self_contained: true titlecaps: true vignette: > %\VignetteIndexEntry{scMultiome tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE ) ``` # Introduction Single cell data is gaining sophistication - Cells can be measured in multiple modalities including gene expression, chromatin accessibility, cell surface markers and protein expression. These orthogonal measures of the same or matched cells enable a holistic construction of the cell state. However it has been challenging to share multiomic data, especially in an integrated format that consolidates the multiple layers of measurement. The `MultiAssayExperiment` container (implemented in the `r Biocpkg("MultiAssayExperiment")` package) provides a framework to package the various modalities into a single object on a per cell basis. The `r Biocpkg("scMultiome")` package is a collection of public single cell multiome data sets pre-processed and packaged into `MultiAssayExperiment` objects for downstream analysis. It also provides basic functions to save the `MultiAssayExperiment` as `.hdf5` files so that users need to only load the desired modalities into memory. The 'scMultiome' package is similar to `r Biocpkg("SingleCellMultiModal")` in terms of providing multimodal data as an ExperimentHub package. One key difference is that we included additional functionalities to load only the desired modalities, and allow users to save their `MAE` objects as `.hdf5`s. The selective loading of experiments is desirable because multimodal data can be large: A typical 10x scMultiome consists of 100-200K atac-seq peaks on top of the typical 30-50K genes from rna-seq. Finally, `r Biocpkg("scMultiome")` provides a list of transcription factor and chromatin co-activator binding sites compiled from ENCODE, ChIP-Atlas or Cistrome-DB ChIP-seq data as `GRangesList` objects. ChIP-seq data complements transcription factor motif information by potentially distinguishing related family members and providing occupancy information of chromatin factors that do not directly bind to DNA. # Installation ``` {r, eval = FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("scMultiome") ``` # Accessing datasets ## Available Data Sets View currently available data sets and the names of their accessor functions. Help pages for particular accessors contain more information on their data sets. ```{r, results = FALSE, message=FALSE} library(scMultiome) listDatasets() ``` ```{r, results = 'asis', echo = FALSE} lds <- listDatasets() knitr::kable(lds, caption = "Available Data Sets") ``` Access a data set by calling its accessor function: ```{r} prostateENZ(metadata = TRUE) ``` See the help files for accessor functions for possible extra options, e.g. `?prostateENZ`. ## Transcription Factor Binding Sites In addition to single cell multiomic data, the package also provides binding sites of transcription factors, obtained from bulk ChIP-seq studies. There are two sources: * `atlas` - merging of ChIP-Atlas and ENCODE * `cistrome` - merging of CistromeDB and ENCODE Academic institutions are free to use both sources but commercial entities should only use `atlas` due to restrictions imposed by Cistrome-DB. If multiple ChIP-seq files are available for the same transcription factor, the peaks are merged to create a union set of peaks. Currently three genomic builds genomes are provided: hg38, hg19, and mm10. The ChIP-seq data is packaged into individual RDS files and they are accessed with a common accessor function, `tfBinding`, specifying the genome and source. ```{r} tfBinding(genome = "hg38", source ="atlas", metadata = TRUE) ```


# Adding Datasets If you want to contribute your publicly available multiome data set, please read this section and contact the package maintainer, `r desc::desc_get_maintainer()`. Once the data has been preprocessed and converted into `MultiAssayExperiment`, it can be saved in a hdf5 file using `saveMAE`. You also need to provide metadata, documentation, and the accessor function that will retrieve the data from `ExperimentHub`. ## Developer Mode Adding data involves updating the package and as such, it must be done in "developer mode". The developer mode allows access to additional tools such as documentation templates. To work in developer mode, you must first clone the package repository with `git`. Create a branch from `master` and work on that. Start an R session in the package directory (e.g. by opening the RStudio project in RStudio) and load all the functions. This is necessary for the R engine to temporarily identify your working directory as the package installation directory, and to expose the internal functions that you will be using. ```{r devtoolsAvailable, eval = FALSE, include = FALSE} devtoolsAvailable <- requireNamespace("devtools", quietly = TRUE) if (devtoolsAvailable) { # attach development version of the package devtools::load_all() } ``` ## Saving Your Data Set `MultiAssayExperiment` is saved in the hdf5 format as it splits the MAEs into individual experiments so that you can choose to load selected experiments. Upon loading, selected experiments are reassembled and wrapped into an MAE object. Assays are represented by `DelayedMatrix` objects to save memory.and saved as a hdf5 file. Currently only `MultiAssayExperiment` objects are supported. Experiments must be objects that inherit from `SummarizedExperiment` and will usually be `SingleCellExperiment`, hence full support is provided for the latter and their slots (`reducedDims` and `altExps`). ```{r saving, class.output = "scroll250", class.echo = "scroll250"} # construct a dummy data set mae <- dummyMAE() mae # name the file to save to fileName <- tempfile(fileext = ".h5") # save data set saveMAE(mae, fileName) ``` You can use `testFile` to validate that your data set can be reconstructed. ```{r test reloading} testFile(fileName) ``` For a detailed explanation of the process see `?saveMAE`. ## Creating Metadata To add metadata, we first create a R script to store your data set's metadata. The metadata script provides key information such as data resource, species, genomic build and version number. This R script will be named `inst/scripts/make-metadata-.R`. ```{r making metadata, eval = FALSE} makeMakeMetadata("dataset") ``` Metadata must be a 1-row data frame with specific columns and values must be character strings (some fields allow character vectors). See `inst/scripts/make-metadata.R` for more information and `inst/scripts/make-metadata-prostateENZ.R` for an example. The file also stores metadata that will be returned by `listDatasets`. Likewise, this must be a 1-row data frame and values must be character strings. Once your `make-metadata` file is ready, build the metadata ```{r build metadata, eval = FALSE} source(system.file("scripts", "make-metadata.R", package = "scMultiome")) ``` This must run without errors. If successful, all the datasets will be captured in `inst/extdata/manifest.csv` and `inst/extdata/metadata.csv` Subsequently, validate metadata ```{r validation, eval = FALSE, class.output = "scroll250"} ExperimentHubData::makeExperimentHubMetadata(dirname(system.file(package = "scMultiome"))) ``` This call must also run without errors. It will return an `ExperimentHub` object that will display your metadata in the form that the end users will see it. ## Documenting Your Data Set Every data set needs an additional documentation to describe how the data was obtained. This doesn't have to be a working script, just a report. Pseudocode is acceptable. Note that code evaluation has been disabled so that you can copy your actual code and the lengthy analysis does not run again. Create an Rmarkdown file called `inst/scripts/make-data-.Rmd`. ```{r making, eval = FALSE} makeMakeData("dataset") ``` ## Creating accessor function Every data set is accessed by its own accessor function, which provides access and help for your data set. For every data set, we create a `R` file that defines the function to retrieve the data set and provide important documentation. The package framework is constructed such that accessor functions are extremely simple and you can basically copy the original one (`prostateENZ`) and most of its documentation. Create an R file called `.R` ```{r making2, eval = FALSE} makeR("dataset") ``` This file will quote the accompanying Rmd file created above. This way the R file itself is more concise and easier to edit. Adjust the file accordingly: 1. Give the file the same title was used in the `make-metadata-<>.R` file. 2. Add a Description section. 3. Document any arguments other than `experiments` and `metadata`. 4. Describe the format of your `MultiAssayExperiment`. 5. Cite appropriate references. 6. Make sure the default value of the `experiments` argument reflects the experiment names in your data set. 7. If you want to restore custom classes to your experiments, add converting functions here. They do not require documentation or exporting. ## Building documentation Once the files are ready, build the documentation, and `?scMultiome` and `?` to review it. ```{r documentation, eval = FALSE} # build documentation devtools::document() # view package man page ?scMultiome # view your data set man page help("dataset") ``` ## Updating Package Files Edit the `DESCRIPTION` file to update package metadata: add yourself as package author and package version. ```{r DESCRIPTION, eval = FALSE} desc::desc_add_author("", "", "", role = "aut") desc::desc_bump_version("minor") ``` Push your changes to git and create a merge request. Once the request is approved and merged to the `master` branch, the package maintainer will take over and move on to the next stage. ## Uploading Your Data Set The final step of the process is to place the data set in Bioconductor's data bucket. This can only be done with Bioconductor's knowledge and blessing. Bioconductor will be notified that a `scMultiome` update is coming by email at `hubs@bioconductor.org`. They will issue a temporary SAS token for the Bioconductor data bucket. The data set will be placed in the Bioconductor staging directory with `uploadFile` and Bioconductor will be notified again that the upload is ready. They will receive a link to the package repository, update metadata in `ExperimentHub` and finalize the process. ```{r upload, eval = FALSE} uploadFile(file = fileName, sasToken = "") ```
Consult [this vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html#additional-resources-to-existing-hub-package) for current Bioconductor requirements. ## Congratulations If you found any of this vignette or the process confusing, we would welcome feedback and gladly add more clarifications. Please email the package maintainer, `r desc::desc_get_maintainer()`. # Session Info ```{r} sessionInfo() ```