--- title: "BiocSet: Representing Element Sets in the Tidyverse" author: - name: Kayla Morrell affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true package: BiocSet vignette: > %\VignetteIndexEntry{BiocSet: Representing Element Sets in the Tidyverse} %\VignetteEngine{knitr::rmarkdown} %\VignetteEndcoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction `BiocSet` is a package that represents element sets in a tibble format with the `BiocSet` class. Element sets are read in and converted into a tibble format. From here, typical `dplyr` operations can be performed on the element set. `BiocSet` also provides functionality for mapping different ID types and providing reference urls for elements and sets. # Installation Install the most recent version from Bioconductor: ```{r bioconductor, eval = FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("BiocSet") ``` The development version is also available for install from GitHub: ```{r github, eval = FALSE} BiocManager::install("Kayla-Morrell/BiocSet") ``` Then load `BiocSet`: ```{r load, message = FALSE} library(BiocSet) ``` # BiocSet ## Input and Output `BiocSet` can create a `BiocSet` object using two different input methods. The first is to input named character vectors of element sets. `BiocSet` returns three tibbles, `es_element` which contains the elements, `es_set` which contains the sets and `es_elementset` which contains elements and sets together. ```{r constructor} tbl <- BiocSet(set1 = letters, set2 = LETTERS) tbl ``` The second method of creating a `BiocSet` object would be to read in a `.gmt` file. Using `import()`, a path to a downloaded `.gmt` file is read in and a `BiocSet` object is returned. The example below uses a hallmark element set downloaded from [GSEA][], which is also included with this package. This `BiocSet` includes a `source` column within the `es_elementset` tibble for reference as to where the element set came from. [GSEA]: http://software.broadinstitute.org/esea/index.jsp ```{r gmt} gmtFile <- system.file(package = "BiocSet", "extdata", "hallmark.gene.symbol.gmt") tbl2 <- import(gmtFile) tbl2 ``` `export()` allows for a `BiocSet` object to be exported into a temporary file with the extention `.gmt`. ```{r export, tidy = TRUE} fl <- tempfile(fileext = ".gmt") gmt <- export(tbl2, fl) gmt ``` We also support importing files to create an `OBOSet` object which extends the `BiocSet` class. The default only displays the most basic amount of columns needed to perform certain tasks, but the argument `extract_tag = everything` allows for additional columns to be imported as well. The following workflow will demonstrate the use of both export and import to create a smaller subset of the large obo file that is accessable from GeneOntology. ```{r obo, eval = FALSE} download.file("http://current.geneontology.org/ontology/go.obo", "obo_file.obo") foo <- import("obo_file.obo", extract_tag = "everything") small_tst <- es_element(foo)[1,] %>% unnest("ancestors") %>% select("element", "ancestors") %>% unlist() %>% unique() small_oboset <- foo %>% filter_elementset(element %in% small_tst) fl <- tempfile(fileext = ".obo") export(small_oboset, fl) ``` Then you can import this smaller obo file which we utilize here, in tests, and example code. ```{r small_obo} oboFile <- system.file(package = "BiocSet", "extdata", "sample_go.obo") obo <- import(oboFile) obo ``` ## Implemented functions A feature available to `BiocSet` is the ability to activate different tibbles to perform certain functions on. When a `BiocSet` is created, the tibble `es_elementset` is automatically activated and all functions will be performed on this tibble. `BiocSet` adopts the use of many common `dplyr` functions such as `filter()`, `select()`, `mutate()`, `summarise()`, and `arrange()`. With each of the functions the user is able to pick a different tibble to activate and work on by using 'verb_tibble'. After the function is executed than the 'active' tibble is returned back to the tibble that was active before the function call. Some examples are shown below of how these functions work. ```{r activate} tbl <- BiocSet(set1 = letters, set2 = LETTERS) tbl tbl %>% filter_element(element == "a" | element == "A") tbl %>% mutate_set(pval = rnorm(1:2)) tbl %>% arrange_elementset(desc(element)) ``` ## Set operations `BiocSet` also allows for common set operations to be performed on the `BiocSet` object. `union()` and `intersection()` are the two set operations available in `BiocSet`. We demonstate how a user can find the union between two `BiocSet` objects or within a single `BiocSet` object. Intersection is used in the same way. ```{r set_ops} # union of two BiocSet objects es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)]) es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)]) union(es1, es2) # union within a single BiocSet object es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)]) union_single(es3) ``` # Case study I Next, we demonstrate the a couple uses of `BiocSet` with an experiment dataset `airway` from the package `airway`. This data is from an RNA-Seq experiment on airway smooth muscle (ASM) cell lines. The first step is to load the library and the necessary data. ```{r airway, message = FALSE} library(airway) data("airway") se <- airway ``` The function `go_sets()` discovers the keys from the org object and uses `AnnotationDbi::select` to create a mapping to GO ids. `go_sets()` also allows the user to indicate which evidence type or ontology type they would like when selecting the GO ids. The default is using all evidence types and all ontology types. We represent these identifieres as a `BiocSet` object. Using the `go_sets` function we are able to map the Ensembl ids and GO ids from the genome wide annotation for Human data in the `org.Hs.eg.db` package. The Ensembl ids are treated as elements while the GO ids are treated as sets. ```{r go_sets, message = FALSE} library(org.Hs.eg.db) go <- go_sets(org.Hs.eg.db, "ENSEMBL") go # an example of subsetting by evidence type go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS")) ``` Some users may not be interested in reporting the non-descriptive elements. We demonstrate subsetting the `airway` data to include non-zero assays and then filter out the non-descriptive elements. ```{r drop_assays} se1 = se[rowSums(assay(se)) != 0,] go %>% filter_element(element %in% rownames(se1)) ``` It may also be of interest to users to know how many elements are in each set. Using the `count` function we are able to calculate the elements per set. ```{r count} go %>% group_by(set) %>% dplyr::count() ``` It may also be helpful to remove sets that are empty. Since we have shown how to calculate the number of elements per set, we know that this data set does not contain any empty sets. We decide to demonstrate regardless for those users that may need this functionality. ```{r empty} drop <- es_activate(go, elementset) %>% group_by(set) %>% dplyr::count() %>% filter(n == 0) %>% pull(set) go %>% filter_set(!(set %in% drop)) ``` To simplify mapping we created a couple `map` functions. `map_unique()` is used when there is known 1:1 mapping, This takes four arguements, a `BiocSet` object, an `AnnotationDbi` object, the id to map from, and the id to map to. `map_multiple()` needs a fifth argument to indicate how the function should treat an element when there is multiple mapping. Both functions utilize `mapIds` from `AnnotationDbi` and return a `BiocSet` object. In the example below we show how to use `map_unique` to map `go`'s ids from Ensembl to gene symbols. ```{r map_unique} go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL") ``` Another functionality of `BiocSet` is the ability to add information to the tibbles. Using the `GO.db` library we are able to map definitions to the GO ids. From there we can add the mapping to the tibble using `map_add()` and the mutate function. ```{r adding, message = FALSE} library(GO.db) map <- map_add_set(go, GO.db, "GOID", "DEFINITION") go %>% mutate_set(definition = map) ``` The library `KEGGREST` is a client interface to the KEGG REST server. KEGG contains pathway maps that represent interaction, reaction and relation networks for various biological processes and diseases. `BiocSet` has a function that utilizes `KEGGREST` to develop a `BiocSet` object that contains the elements for every pathway map in KEGG. Due to limiations of the KEGGREST package, `kegg_sets` can take some time to run depending on the amount of pathways for the species of interest. Therefore we demonstrate using `BiocFileCache` to make the data available to the user. ```{r file_cache, message = FALSE} library(BiocFileCache) rname <- "kegg_hsa" exists <- NROW(bfcquery(query=rname, field="rname")) != 0L if (!exists) { kegg <- kegg_sets("hsa") fl <- bfcnew(rname = rname, ext = ".gmt") export(kegg_sets("hsa"), fl) } kegg <- import(bfcrpath(rname=rname)) ``` Within the `kegg_sets()` function we remove pathways that do not contain any elements. We then mutate the element tibble using the `map_add` function to contain both Ensembl and Entrez ids. ```{r kegg_filter} map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL") kegg <- kegg %>% mutate_element(ensembl = map) ``` Since we are working with ASM data we thought we would subset the `airway` data to contain only the elements in the asthma pathway. This filter is performed on the KEGG id, which for asthma is "hsa05310". ```{r subset} asthma <- kegg %>% filter_set(set == "hsa05310") se <- se[rownames(se) %in% es_element(asthma)$ensembl,] se ``` The filtering can also be done for multiple pathways. ```{r multiple} pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970") multipaths <- kegg %>% filter_set(set %in% pathways) multipaths ``` # Case study II This example will start out the same way that Case Study I started, by loading in the `airway` dataset, but we will also do some reformating of the data. The end goal is to be able to perform a Gene Set Enrichment Analysis on the data and return a BiocSet object of the gene sets. ```{r airway2, message = FALSE} data("airway") airway$dex <- relevel(airway$dex, "untrt") ``` Similar to other analyses we perform a differential expression analysis on the `airway` data using the library `DESeq2` and then store the results into a tibble. ```{r DE} library(DESeq2) library(tibble) des <- DESeqDataSet(airway, design = ~ cell + dex) des <- DESeq(des) res <- results(des) tbl <- res %>% as.data.frame() %>% as_tibble(rownames = "ENSEMBL") ``` Since we want to use `limma::goana()` to perform the GSEA we will need to have ENTREZ identifiers in the data, as well as filter out some `NA` information. Later on this will be considered our 'element' tibble. ```{r ENTREZ} tbl <- tbl %>% mutate( ENTREZID = mapIds( org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL" ) %>% unname() ) tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID)) tbl ``` Now that the data is ready for GSEA we can go ahead and use `goana()` and make the results into a tibble. This tibble will be considered our 'set' tibble. ```{r goana} library(limma) go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>% as.data.frame() %>% as_tibble(rownames = "GOALL") go_ids ``` The last thing we need to do is create a tibble that we will consider our 'elementset' tibble. This tibble will be a mapping of all the elements and sets. ```{r final_tibble} foo <- AnnotationDbi::select( org.Hs.eg.db, tbl$ENTREZID, "GOALL", "ENTREZID") %>% as_tibble() foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct() foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL) foo ``` The function `BiocSet_from_elementset()` allows for users to create a BiocSet object from tibbles. This function is helpful when there is metadata contained in the tibble that should be in the BiocSet object. For this function to work properly, the columns that are being joined on need to be named correctly. For instance, in order to use this function on the tibbles we created we need to change the column in the 'element' tibble to 'element', the column in the 'set' tibble to 'set' and the same will be for the 'elementset' tibble. We demonstrate this below and then create the BiocSet object with the simple function. ```{r BiocSet_from_elementset} foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL) tbl <- tbl %>% dplyr::rename(element = ENTREZID) go_ids <- go_ids %>% dplyr::rename(set = GOALL) es <- BiocSet_from_elementset(foo, tbl, go_ids) es ``` For those users that may need to put this metadata filled BiocSet object back into an object similar to GRanges or SummarizedExperiment, we have created functions that allow for the BiocSet object to be created into a tibble or data.frame. ```{r tibble_or_data.frame} tibble_from_element(es) head(data.frame_from_elementset(es)) ``` # Information look up A final feature to `BiocSet` is the ability to add reference information about all of the elements/sets. A user could utilize the function `url_ref()` to add information to the `BiocSet` object. If a user has a question about a specific id then they can follow the reference url to get more informtion. Below is an example of using `url_ref()` to add reference urls to the `go` data set we worked with above. ```{r url} url_ref(go) ``` # Session info ```{r} sessionInfo() ```