--- title: "The depmap data" author: - name: Theo Killian - name: Laurent Gatto affiliation: Computational Biology, UCLouvain date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{depmap} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r, echo = FALSE} suppressPackageStartupMessages(library("tidyverse")) ``` # Introduction The `depmap` package aims to provide a reproducible research framework to cancer dependency data described by [Tsherniak, Aviad, et al. "Defining a cancer dependency map." Cell 170.3 (2017): 564-576.](https://www.ncbi.nlm.nih.gov/pubmed/28753430). The data found in the [depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html) package has been formatted to facilitate the use of common R packages such as `dplyr` and `ggplot2`. We hope that this package will allow researchers to more easily mine, explore and visually illustrate dependency data taken from the Depmap cancer genomic dependency study. # Installation instructions To install [depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html), the [BiocManager](https://cran.r-project.org/web/packages/BiocManager/index.html) Bioconductor Project Package Manager is required. If [BiocManager](https://cran.r-project.org/web/packages/BiocManager/index.html) is not already installed, it will need to be done so beforehand. Type (within R) install.packages("BiocManager") (This needs to be done just once.) ```{r install, eval=FALSE} install.packages("BiocManager") BiocManager::install("depmap") ``` The `depmap` package fully depends on the `ExperimentHub` Bioconductor package, which allows the data accessed in this package to be stored and retrieved from the cloud. ```{r import_EH, message = FALSE} library("depmap") library("ExperimentHub") ``` # Tidy depmap data The [depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html) package currently contains eight datasets available through `ExperimentHub`. The data found in this R package has been converted from a "wide" format `.csv` file to "long" format .rda file. None of the values taken from the original datasets have been changed, although the columns have been re-arranged. Descriptions of the changes made are described under the `Details` section after querying the relevant dataset. ```{r ehquery} ## create ExperimentHub query object eh <- ExperimentHub() query(eh, "depmap") ``` Each dataset has a `ExperimentHub` accession number, (e.g. *EH2260* refers to the `rnai` dataset from the 19Q1 release). ## RNA inference knockout data The `rnai` dataset contains the combined genetic dependency data for RNAi - induced gene knockdown for select genes and cancer cell lines. This data corresponds to the `D2_combined_genetic_dependency_scores.csv` file. Specific `rnai` datasets can be accessed, such as `rnai_19Q1` by EH number. ```{r, eval = FALSE} eh[["EH2260"]] ``` The most recent `rnai` dataset can be automatically loaded into R by using the `depmap_rnai` function. ```{r} depmap::depmap_rnai() ``` ## CRISPR-Cas9 knockout data The `crispr` dataset contains the (batch corrected CERES inferred gene effect) CRISPR-Cas9 knockout data of select genes and cancer cell lines. This data corresponds to the `gene_effect_corrected.csv` file. Specific `crispr` datasets can be accessed, such as `crispr_19Q1` by EH number. ```{r, eval = FALSE} eh[["EH2261"]] ``` The most recent `crispr` dataset can be automatically loaded into R by using the `depmap_crispr` function. ```{r} depmap::depmap_crispr() ``` ## WES copy number data The `copyNumber` dataset contains the WES copy number data, relating to the numerical log-fold copy number change measured against the baseline copy number of select genes and cell lines. This dataset corresponds to the `public_19Q1_gene_cn.csv` Specific `copyNumber` datasets can be accessed, such as `copyNumber_19Q1` by EH number. ```{r, eval = FALSE} eh[["EH2262"]] ``` The most recent `copyNumber` dataset can be automatically loaded into R by using the `depmap_copyNumber` function. ```{r} depmap::depmap_copyNumber() ``` ## CCLE Reverse Phase Protein Array data The `RPPA` dataset contains the CCLE Reverse Phase Protein Array (RPPA) data which corresponds to the `CCLE_RPPA_20180123.csv` file. Specific `RPPA` datasets can be accessed, such as `RPPA_19Q1` by EH number. ```{r, eval = FALSE} eh[["EH2263"]] ``` The most recent `RPPA` dataset can be automatically loaded into R by using the `depmap_RPPA` function. ```{r} depmap::depmap_RPPA() ``` ## CCLE RNAseq gene expression data The `TPM` dataset contains the CCLE RNAseq gene expression data. This shows expression data only for protein coding genes (using scale log2(TPM+1)). This data corresponds to the `CCLE_depMap_19Q1_TPM.csv` file. Specific `TPM` datasets can be accessed, such as `TPM_19Q1` by EH number. ```{r, eval = FALSE} eh[["EH2264"]] ``` The `TPM` dataset can also be accessed by using the `depmap_TPM` function. ```{r} depmap::depmap_TPM() ``` ## Cancer cell lines The `metadata` dataset contains the metadata about all of the cancer cell lines. It corresponds to the `depmap_19Q1_cell_lines.csv` file. Specific `metadata` datasets can be accessed, such as `metadata_19Q1` by EH number. ```{r, eval = FALSE} eh[["EH2266"]] ``` The most recent `metadata` dataset can be automatically loaded into R by using the `depmap_metadata` function. ```{r} depmap::depmap_metadata() ``` ## Mutation calls The `mutationCalls` dataset contains all merged mutation calls (coding region, germline filtered) found in the depmap dependency study. This dataset corresponds with the `depmap_19Q1_mutation_calls.csv` file. Specific `mutationCalls` datasets can be accessed, such as `mutationCalls_19Q1` by EH number. ```{r, eval = FALSE} eh[["EH2265"]] ``` The most recent `mutationCalls` dataset can be automatically loaded into R by using the `depmap_mutationCalls` function. ```{r} depmap::depmap_mutationCalls() ``` ## Drug Sensitivity The `drug_sensitivity` dataset contains dependency data for cancer cell lines treated with various compounds. This dataset corresponds with the `primary_replicate_collapsed_logfold_change.csv` file. Specific `drug_sensitivity` datasets can be accessed, such as `drug_sensitivity_19Q3` by EH number. ```{r, eval = FALSE} eh[["EH3087"]] ``` The most recent `drug_sensitivity` dataset can be automatically loaded into R by using the `depmap_drug_sensitivity` function. ```{r} depmap::depmap_drug_sensitivity() ``` ## Proteomic The `proteomic` dataset contains normalized quantitative profiling of proteins of cancer cell lines by mass spectrometry. This dataset corresponds with the `https://gygi.med.harvard.edu/sites/gygi.med.harvard.edu/files/documents/protein_quant_current_normalized.csv.gz` file. Specific `proteomic` datasets can be accessed, such as `proteomic_20Q2` by EH number. ```{r, eval = FALSE} eh[["EH3459"]] ``` The most recent `proteomic` dataset can be automatically loaded into R by using the `depmap_proteomic` function. ```{r} depmap::depmap_proteomic() ``` ## Repackaged data source If desired, the original data from which the [depmap](https://bioconductor.org/packages/depmap) package were derived from can be downloaded from the [Broad Institute](https://depmap.org/portal/download/) website. The instructions on how to download these files and how the data was transformed and loaded into the [depmap](https://bioconductor.org/packages/depmap) package can be found in the `make_data.R` file found in `./inst/scripts`. (It should be noted that the original uncompressed *.csv* files are > 1.5GB in total and take a moderate amount of time to download remotely.) # Original depmap data In addition to the re-packaged files, the package also allows to download any of the original files provided by the [DepMap project on Figshare](https://figshare.com/authors/Broad_DepMap/5514062). A list of all the datasets is available with the `dmsets()` function: ```{r} dmsets() ``` We could check what datasets from any quarter of 2020 are available by searching for `"20Q"` in the datasets titles: ```{r, message=FALSE} library(tidyverse) dmsets() |> filter(grepl("20Q", title)) ``` Let's focus on the *PRISM Repurposing 20Q2 Dataset* dataset, with identifier `20564034`. A list of all the files is available with the `dmfiles()` function: ```{r} dmfiles() ``` If we want to find all files from the *PRISM Repurposing 20Q2 Dataset* identified above, we could filter all files with its `dataset_id`: ```{r} dmfiles() |> filter(dataset_id == 20564034) ``` Let's now focus on the `prism-repurposing-20q2-primary-screen-cell-line-info.csv` file. We can filter it by its name and downloaded it with `dmget()`: ```{r} dmfiles() |> filter(name == "prism-repurposing-20q2-primary-screen-cell-line-info.csv") |> dmget() ``` The `dmget()` function will first check if it hasn't already been downloaded and cached in the depmap cache directory (see `?dmCache()`). If so, it will retrieve if from there. Otherwise, it will download the file and store it in the package cache directory. It will return the location of the cached file. Given that the file is in csv format, we can directly open it with `read_csv()`: ```{r} dmfiles() |> filter(name == "prism-repurposing-20q2-primary-screen-cell-line-info.csv") |> dmget() |> read_csv() ``` It is also possible to pass multiple rows of the `dmfiles()` table to `dmget()` to retrieve multiple file paths. Below, let's get all the README.txt files from 2020: ```{r} ids_2020 <- filter(dmsets(), grepl("20Q", title)) |> pull(dataset_id) dmfiles() |> filter(dataset_id %in% ids_2020) |> filter(grepl("README", name)) |> dmget() ``` # Session information ```{r echo = FALSE} sessionInfo()