--- title: "Import and representation of BioPlex PPI data and related resources" author: "Ludwig Geistlinger and Robert Gentleman" affiliation: Center for Computational Biomedicine, Harvard Medical School output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('BioPlex')`" vignette: > % \VignetteIndexEntry{1. Data retrieval} % \VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` # Setup The [BioPlex project](https://bioplex.hms.harvard.edu/) uses affinity-purification mass spectrometry to profile protein-protein interactions (PPIs) in human cell lines. To date, the BioPlex project has created two proteome-scale, cell-line-specific PPI networks. The first, BioPlex 3.0, results from affinity purification of 10,128 human proteins —- half the proteome —- in 293T cells and includes 118,162 interactions among 14,586 proteins. The second results from 5,522 immunoprecipitations in HCT116 cells and includes 70,966 interactions between 10,531 proteins. For more information, please see: * Huttlin et al. [The BioPlex network: a systematic exploration of the human interactome](https://doi.org/10.1016/j.cell.2015.06.043). *Cell*, 2015. * Huttlin et al. [Architecture of the human interactome defines protein communities and disease networks](https://doi.org/10.1038/nature22366), *Nature*, 2017. * Huttlin et al. [Dual proteome-scale networks reveal cell-specific remodeling of the human interactome](https://doi.org/10.1016/j.cell.2021.04.011), *Cell*, 2021. The [BioPlex R package](https://bioconductor.org/packages/BioPlex) implements access to the BioPlex protein-protein interaction networks and related resources from within R. Besides protein-protein interaction networks for 293T and HCT116 cells, this includes access to [CORUM](http://mips.helmholtz-muenchen.de/corum) protein complex data, and transcriptome and proteome data for the two cell lines. Functionality focuses on importing these data resources and storing them in dedicated Bioconductor data structures, as a foundation for integrative downstream analysis of the data. For a set of downstream analyses and applications, please see the [BioPlexAnalysis package](https://github.com/ccb-hms/BioPlexAnalysis) and [analysis vignettes](https://ccb-hms.github.io/BioPlexAnalysis/). # Installation To install the package, start R and enter: ```{r, eval = FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("BioPlex") ``` After the installation, we proceed by loading the package and additional packages used in the vignette. ```{r, message = FALSE} library(BioPlex) library(AnnotationHub) library(ExperimentHub) library(graph) ``` # Data resources ## General Connect to [AnnotationHub](http://bioconductor.org/packages/AnnotationHub): ```{r ahub, message = FALSE} ah <- AnnotationHub::AnnotationHub() ``` Connect to [ExperimentHub](http://bioconductor.org/packages/ExperimentHub): ```{r ehub, message = FALSE} eh <- ExperimentHub::ExperimentHub() ``` OrgDb package for human: ```{r orgdb, message = FALSE} orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens")) orgdb <- orgdb[[1]] orgdb keytypes(orgdb) ``` ## BioPlex PPIs [Available networks](https://bioplex.hms.harvard.edu/interactions.php) include: - BioPlex PPI network for human embryonic kidney [293T](https://en.wikipedia.org/wiki/293T) cells (versions 1.0, 2.0, and 3.0) - BioPlex PPI network for human colon cancer [HCT116](https://en.wikipedia.org/wiki/HCT116_cells) cells (version 1.0) Let's get the latest version of the 293T PPI network: ```{r bioplex293T} bp.293t <- getBioPlex(cell.line = "293T", version = "3.0") head(bp.293t) nrow(bp.293t) ``` Each row corresponds to a PPI between a bait protein A and a prey protein B, for which NCBI Entrez Gene IDs, Uniprot IDs, and gene symbols are annotated. The last three columns reflect the likelihood that each interaction resulted from either an incorrect protein identification (`pW`), background (`pNI`), or a bona fide interacting partner (`pInt`) as determined using the [CompPASS algorithm](https://github.com/dnusinow/cRomppass). Analgously, we can obtain the latest version of the HCT116 PPI network: ```{r bioplexHCT116} bp.hct116 <- getBioPlex(cell.line = "HCT116", version = "1.0") head(bp.hct116) nrow(bp.hct116) ``` ### ID mapping The protein-to-gene mappings from BioPlex (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) are based on the mappings available from Uniprot at the time of publication of the BioPlex 3.0 networks. We can update those based on Bioc annotation functionality: ```{r bioplex-remap} bp.293t.remapped <- getBioPlex(cell.line = "293T", version = "3.0", remap.uniprot.ids = TRUE) ``` ### Data structures for BioPlex PPIs We can also represent a given version of the BioPlex PPI network for a given cell line as one big graph where bait and prey relationship are represented by directed edges from bait to prey. ```{r bpgraph} bp.gr <- bioplex2graph(bp.293t) bp.gr head(graph::nodeData(bp.gr)) head(graph::edgeData(bp.gr)) ``` ### PFAM domains We can easily add [PFAM](http://pfam.xfam.org) domain annotations to the node metadata: ```{r pfam} bp.gr <- annotatePFAM(bp.gr, orgdb) head(graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM")) ``` ## CORUM complexes Obtain the complete set of human protein complexes from [CORUM](http://mips.helmholtz-muenchen.de/corum/#download): ```{r corumALL} all <- getCorum(set = "all", organism = "Human") dim(all) colnames(all) all[1:5, 1:5] ``` Core set of complexes: ```{r corumCore} core <- getCorum(set = "core", organism = "Human") dim(core) ``` Complexes with splice variants: ```{r corumSplice} splice <- getCorum(set = "splice", organism = "Human") dim(splice) ``` ### ID mapping The protein-to-gene mappings from CORUM (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) might not be fully up-to-date. We can update those based on Bioc annotation functionality: ```{r corum-remap} core.remapped <- getCorum(set = "core", organism = "Human", remap.uniprot.ids = TRUE) ``` ### Data structures for CORUM complexes We can represent the CORUM complexes as a list of character vectors. The names of the list are the complex IDs/names, and each element of the list is a vector of UniProt IDs for each complex. ```{r corum2list} core.list <- corum2list(core, subunit.id.type = "UNIPROT") head(core.list) length(core.list) ``` We can also represent the CORUM complexes as a list of graph instances, where all nodes of a complex are connected to all other nodes of that complex with undirected edges. ```{r corum2glist} core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT") head(core.glist) length(core.glist) core.glist[[1]]@graphData graph::nodeData(core.glist[[1]]) ``` Note that we can easily convert a [graph](https://bioconductor.org/packages/graph) object into an [igraph](https://cran.r-project.org/package=igraph) object using `igraph::graph_from_graphnel`. ## CNV data ### HEK293T cells Genomic data from whole-genome sequencing for six different lineages of the human embryonic kidney HEK293 cell line can be obtained from [hek293genome.org](http://hek293genome.org). This includes copy number variation (CNV) data for the 293T cell line. Available CNV tracks include (i) CNV regions inferred from sequencing read-depth analysis, and (ii) CNV regions inferred from Illumina SNP arrays. Here, we obtain CNV segments obtained from applying a hidden Markov model (HMM) to sequencing-inferred copy numbers in 2kbp windows. More details on how copy numbers were calculated can be obtained from the [primary publication](https://doi.org/10.1038/ncomms5767). ```{r cnv} cnv.hmm <- getHEK293GenomeTrack(track = "cnv.hmm", cell.line = "293T") cnv.hmm ``` See also the [data checks vignette, Section 5]( https://bioconductor.org/packages/release/data/experiment/vignettes/BioPlex/inst/doc/BasicChecks.html) for an exploration of the agreement between inferred copy numbers from both assay types (SNP arrays vs. sequencing). ## Transcriptome data ### HEK293T cells #### GSE122425 Obtain transcriptome data for 293T cells from GEO dataset: [GSE122425](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122425). ```{r gse122425} se <- getGSE122425() se head(assay(se, "raw")) head(assay(se, "rpkm")) colData(se) rowData(se) ``` The dataset includes three wild type samples and three NSUN2 knockout samples. See also the [data checks vignette, Section 7](https://bioconductor.org/packages/release/data/experiment/vignettes/BioPlex/inst/doc/BasicChecks.html) for an exploration of the relationship between expression level and the frequency of a protein being detected as prey. ### HCT116 cells #### Cancer Cell Line Encyclopedia (CCLE) RNA-seq data for 934 cancer cell lines (incl. HCT116) from the [Cancer Cell Line Encyclopedia](https://portals.broadinstitute.org/ccle) is available from the [ArrayExpress-ExpressionAtlas](https://www.ebi.ac.uk/gxa) (Accession: [E-MTAB-2770](https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2770)). The data can be obtained as a `SummarizedExperiment` using the [ExpressionAtlas](https://bioconductor.org/packages/ExpressionAtlas) package. ```{r, eval = FALSE} ccle.trans <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2770") ``` See also the [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) for further exploration of the correlation between CCLE HCT116 transcript and protein expression. #### Klijn et al., 2015 RNA-seq data of 675 commonly used human cancer cell lines (incl. HCT116) from [Klijn et al., 2015](https://pubmed.ncbi.nlm.nih.gov/25485619) is available from the [ArrayExpress-ExpressionAtlas](https://www.ebi.ac.uk/gxa) (Accession: [E-MTAB-2706](https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2706)) The data can be obtained as a `SummarizedExperiment` using the [ExpressionAtlas](https://bioconductor.org/packages/ExpressionAtlas) package. ```{r, eval = FALSE} klijn <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2706") ``` See also the [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) for further exploration of differential transcript and protein expression between 293T and HCT116 cells. ## Splicing data For the inference of differential exon usage between cell lines, raw RNA-seq read counts on exon level can be obtained from [ExperimentHub](https://bioconductor.org/packages/ExperimentHub). RNA-seq data for 293T cells was obtained from GEO accession [GSE122633](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122633) and RNA-seq data for HCT116 cells was obtained from GEO accession [GSE52429](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52429). The data can be obtained as a `DEXSeqDataSet` which is a `SummarizedExperiment`-derivative and can be accessed and manipulated very much like a `DESeqDataSet`. ```{r splice, message = FALSE} AnnotationHub::query(eh, c("BioPlex")) dex <- eh[["EH7563"]] dex ``` We take a closer look at the sample annotation, the counts for each exon for both cell lines, and the genomic coordinates and additional annotation for each exon. ```{r} DEXSeq::sampleAnnotation(dex) head(DEXSeq::featureCounts(dex)) rowRanges(dex) ``` ## Proteome data ### CCLE Pull the [CCLE proteome data](https://doi.org/10.1016/j.cell.2019.12.023) from ExperimentHub. The dataset profiles 12,755 proteins by mass spectrometry across 375 cancer cell lines. ```{r ccle-proteom, message = FALSE} AnnotationHub::query(eh, c("gygi", "depmap")) ccle.prot <- eh[["EH3459"]] ccle.prot <- as.data.frame(ccle.prot) ``` Explore the data: ```{r ccle-proteom2} dim(ccle.prot) colnames(ccle.prot) head(ccle.prot) ``` Restrict to HCT116: ```{r ccle-prot-hct116} ccle.prot.hct116 <- subset(ccle.prot, cell_line == "HCT116_LARGE_INTESTINE") dim(ccle.prot.hct116) head(ccle.prot.hct116) ``` Or turn into a `SummarizedExperiment` for convenience (we can restrict this to selected cell lines, but here we keep all cell lines): ```{r ccle-prot-se} se <- ccleProteome2SummarizedExperiment(ccle.prot, cell.line = NULL) assay(se)[1:5, 1:5] assay(se)[1:5, "HCT116"] rowData(se) ``` ### Relative protein expression data from BioPlex3.0 The [BioPlex 3.0 publication](https://doi.org/10.1016/j.cell.2021.04.011), Supplementary Table S4A, provides relative protein expression data comparing 293T and HCT116 cells based on tandem mass tag analysis. ```{r bp.prot} bp.prot <- getBioplexProteome() assay(bp.prot)[1:5,1:5] colData(bp.prot) rowData(bp.prot) ``` The data contains 5 replicates each for 293T and for HCT116 cells. As a result of the data collection process, the data represent relative protein abundance scaled to add up to 100% in each row. See also the [data checks vignette, Section 8](https://bioconductor.org/packages/release/data/experiment/vignettes/BioPlex/inst/doc/BasicChecks.html) for a basic exploration of the annotated differential expression measures. # Caching Note that calling functions like `getCorum` or `getBioPlex` with argument `cache = FALSE` will automatically overwrite the corresponding object in your cache. It is thus typically not required for a user to interact with the cache. For more extended control of the cache, use from within R: ```{r cache} cache.dir <- tools::R_user_dir("BioPlex", which = "cache") bfc <- BiocFileCache::BiocFileCache(cache.dir) ``` and then proceed as described in the [BiocFileCache vignette, Section 1.10](https://www.bioconductor.org/packages/release/bioc/vignettes/BiocFileCache/inst/doc/BiocFileCache.html#cleaning-or-removing-cache) either via `cleanbfc()` to clean or `removebfc()` to remove your cache. To do a hard reset (use with caution!): ```{r rmCache, eval = FALSE} BiocFileCache::removebfc(bfc) ``` # SessionInfo ```{r sessionInfo} sessionInfo() ```