---
title: "Import and representation of BioPlex PPI data and related resources"
author: "Ludwig Geistlinger and Robert Gentleman"
affiliation: Center for Computational Biomedicine, Harvard Medical School
output:
  BiocStyle::html_document:
    self_contained: yes 
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('BioPlex')`"
vignette: >
  % \VignetteIndexEntry{1. Data retrieval}
  % \VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

# Setup

The [BioPlex project](https://bioplex.hms.harvard.edu/) uses 
affinity-purification mass spectrometry to profile protein-protein interactions (PPIs)
in human cell lines.

To date, the BioPlex project has created two proteome-scale, cell-line-specific 
PPI networks. The first, BioPlex 3.0, results from affinity purification of 
10,128 human proteins —- half the proteome —- in 293T cells and includes 118,162
interactions among 14,586 proteins. 
The second results from 5,522 immunoprecipitations in HCT116 cells and includes
70,966 interactions between 10,531 proteins. 

For more information, please see:

* Huttlin et al. [The BioPlex network: a systematic exploration of the human interactome](https://doi.org/10.1016/j.cell.2015.06.043). *Cell*, 2015.
* Huttlin et al. [Architecture of the human interactome defines protein communities and disease networks](https://doi.org/10.1038/nature22366), *Nature*, 2017.
* Huttlin et al. [Dual proteome-scale networks reveal cell-specific remodeling of the human interactome](https://doi.org/10.1016/j.cell.2021.04.011), *Cell*, 2021.

The [BioPlex R package](https://bioconductor.org/packages/BioPlex) 
implements access to the BioPlex protein-protein interaction networks and 
related resources from within R. 
Besides protein-protein interaction networks for 293T and HCT116 cells,
this includes access to [CORUM](http://mips.helmholtz-muenchen.de/corum)
protein complex data, and transcriptome and proteome data for the two cell lines. 
             
Functionality focuses on importing these data resources and
storing them in dedicated Bioconductor data structures, as a foundation for 
integrative downstream analysis of the data. For a set of downstream analyses
and applications, please see the 
[BioPlexAnalysis package](https://github.com/ccb-hms/BioPlexAnalysis)
and 
[analysis vignettes](https://ccb-hms.github.io/BioPlexAnalysis/).

# Installation

To install the package, start R and enter: 

```{r, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BioPlex")
``` 

After the installation, we proceed by loading the package and additional packages
used in the vignette.

```{r, message = FALSE}
library(BioPlex)
library(AnnotationHub)
library(ExperimentHub)
library(graph)
```

# Data resources

## General

Connect to 
[AnnotationHub](http://bioconductor.org/packages/AnnotationHub):

```{r ahub, message = FALSE}
ah <- AnnotationHub::AnnotationHub()
```

Connect to 
[ExperimentHub](http://bioconductor.org/packages/ExperimentHub):

```{r ehub, message = FALSE}
eh <- ExperimentHub::ExperimentHub()
```

OrgDb package for human:
```{r orgdb, message = FALSE}
orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens"))
orgdb <- orgdb[[1]]
orgdb
keytypes(orgdb)
```

## BioPlex PPIs

[Available networks](https://bioplex.hms.harvard.edu/interactions.php) include: 

- BioPlex PPI network for human embryonic kidney [293T](https://en.wikipedia.org/wiki/293T) cells (versions 1.0, 2.0, and 3.0)
- BioPlex PPI network for human colon cancer [HCT116](https://en.wikipedia.org/wiki/HCT116_cells) cells (version 1.0)

Let's get the latest version of the 293T PPI network:

```{r bioplex293T}
bp.293t <- getBioPlex(cell.line = "293T", version = "3.0")
head(bp.293t)
nrow(bp.293t)
```

Each row corresponds to a PPI between a bait protein A and a prey protein B, for which
NCBI Entrez Gene IDs, Uniprot IDs, and gene symbols are annotated. 
The last three columns reflect the likelihood that each interaction resulted 
from either an incorrect protein identification (`pW`), background (`pNI`), or 
a bona fide interacting partner (`pInt`) as determined using the 
[CompPASS algorithm](https://github.com/dnusinow/cRomppass).

Analgously, we can obtain the latest version of the HCT116 PPI network:

```{r bioplexHCT116}
bp.hct116 <- getBioPlex(cell.line = "HCT116", version = "1.0")
head(bp.hct116)
nrow(bp.hct116)
```

### ID mapping

The protein-to-gene mappings from BioPlex (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) are based on the mappings available from Uniprot at the time of publication of the BioPlex 3.0 networks.

We can update those based on Bioc annotation functionality:  

```{r bioplex-remap}
bp.293t.remapped <- getBioPlex(cell.line = "293T",
                               version = "3.0",
                               remap.uniprot.ids = TRUE)
```

### Data structures for BioPlex PPIs

We can also represent a given version of the BioPlex PPI network for a given
cell line as one big graph where bait and prey relationship are represented 
by directed edges from bait to prey.

```{r bpgraph}
bp.gr <- bioplex2graph(bp.293t)
bp.gr
head(graph::nodeData(bp.gr))
head(graph::edgeData(bp.gr))
```

### PFAM domains

We can easily add [PFAM](http://pfam.xfam.org) domain annotations to the node metadata:

```{r pfam}
bp.gr <- annotatePFAM(bp.gr, orgdb)
head(graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM"))
```

## CORUM complexes

Obtain the complete set of human protein complexes from
[CORUM](http://mips.helmholtz-muenchen.de/corum/#download):

```{r corumALL}
all <- getCorum(set = "all", organism = "Human")
dim(all)
colnames(all)
all[1:5, 1:5]
```

Core set of complexes:
```{r corumCore}
core <- getCorum(set = "core", organism = "Human")
dim(core)
```

Complexes with splice variants:
```{r corumSplice}
splice <- getCorum(set = "splice", organism = "Human")
dim(splice)
```

### ID mapping

The protein-to-gene mappings from CORUM (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) might not be fully up-to-date. 

We can update those based on Bioc annotation functionality:  

```{r corum-remap}
core.remapped <- getCorum(set = "core", 
                          organism = "Human",
                          remap.uniprot.ids = TRUE)
```

### Data structures for CORUM complexes 

We can represent the CORUM complexes as a list of character vectors.
The names of the list are the complex IDs/names, and each element of the list is
a vector of UniProt IDs for each complex.

```{r corum2list}
core.list <- corum2list(core, subunit.id.type = "UNIPROT")
head(core.list)
length(core.list)
```

We can also represent the CORUM complexes as a list of graph instances,
where all nodes of a complex are connected to all other nodes of that complex
with undirected edges.

```{r corum2glist}
core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT")
head(core.glist)
length(core.glist)
core.glist[[1]]@graphData
graph::nodeData(core.glist[[1]])
```

Note that we can easily convert a 
[graph](https://bioconductor.org/packages/graph) object into an 
[igraph](https://cran.r-project.org/package=igraph) object using
`igraph::graph_from_graphnel`.

## CNV data

### HEK293T cells

Genomic data from whole-genome sequencing for six different lineages
of the human embryonic kidney HEK293 cell line can be obtained from
[hek293genome.org](http://hek293genome.org).

This includes copy number variation (CNV) data for the 293T cell line.
Available CNV tracks include (i) CNV regions inferred from sequencing 
read-depth analysis, and (ii) CNV regions inferred from Illumina SNP arrays.

Here, we obtain CNV segments obtained from applying a hidden Markov model (HMM)
to sequencing-inferred copy numbers in 2kbp windows. More details on how copy
numbers were calculated can be obtained from the 
[primary publication](https://doi.org/10.1038/ncomms5767). 

```{r cnv}
cnv.hmm <- getHEK293GenomeTrack(track = "cnv.hmm", cell.line = "293T")
cnv.hmm
```

See also the [data checks vignette, Section 5](
https://bioconductor.org/packages/release/data/experiment/vignettes/BioPlex/inst/doc/BasicChecks.html) for an exploration of the agreement between inferred copy numbers from both assay types (SNP arrays vs. sequencing).

## Transcriptome data

### HEK293T cells

#### GSE122425

Obtain transcriptome data for 293T cells from GEO dataset:
[GSE122425](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122425).

```{r gse122425}
se <- getGSE122425()
se
head(assay(se, "raw"))
head(assay(se, "rpkm"))
colData(se)
rowData(se)
```

The dataset includes three wild type samples and three NSUN2 knockout samples.

See also the [data checks vignette, Section 7](https://bioconductor.org/packages/release/data/experiment/vignettes/BioPlex/inst/doc/BasicChecks.html) for an exploration of the relationship between expression level and the frequency of a protein being detected as prey. 

### HCT116 cells

#### Cancer Cell Line Encyclopedia (CCLE)

RNA-seq data for 934 cancer cell lines (incl. HCT116) from the 
[Cancer Cell Line Encyclopedia](https://portals.broadinstitute.org/ccle) 
is available from the [ArrayExpress-ExpressionAtlas](https://www.ebi.ac.uk/gxa) 
(Accession: [E-MTAB-2770](https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2770)).

The data can be obtained as a `SummarizedExperiment` using the 
[ExpressionAtlas](https://bioconductor.org/packages/ExpressionAtlas) package.

```{r, eval = FALSE}
ccle.trans <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2770")
```

See also the [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) 
for further exploration of the correlation between CCLE HCT116 transcript and
protein expression. 


#### Klijn et al., 2015

RNA-seq data of 675 commonly used human cancer cell lines (incl. HCT116) from 
[Klijn et al., 2015](https://pubmed.ncbi.nlm.nih.gov/25485619) 
is available from the [ArrayExpress-ExpressionAtlas](https://www.ebi.ac.uk/gxa) 
(Accession: [E-MTAB-2706](https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2706))

The data can be obtained as a `SummarizedExperiment` using the 
[ExpressionAtlas](https://bioconductor.org/packages/ExpressionAtlas) package.

```{r, eval = FALSE}
klijn <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2706")
```

See also the [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) 
for further exploration of differential transcript and
protein expression between 293T and HCT116 cells. 

## Splicing data

For the inference of differential exon usage between cell lines, raw RNA-seq read
counts on exon level can be obtained from
[ExperimentHub](https://bioconductor.org/packages/ExperimentHub).   

RNA-seq data for 293T cells was obtained from GEO accession
[GSE122633](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122633) and
RNA-seq data for HCT116 cells was obtained from GEO accession
[GSE52429](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52429). 

The data can be obtained as a `DEXSeqDataSet` which is a
`SummarizedExperiment`-derivative and can be accessed and manipulated very much
like a `DESeqDataSet`.

```{r splice, message = FALSE}
AnnotationHub::query(eh, c("BioPlex"))
dex <- eh[["EH7563"]]
dex
```

We take a closer look at the sample annotation, the counts for each exon for
both cell lines, and the genomic coordinates and additional annotation for each
exon.

```{r}
DEXSeq::sampleAnnotation(dex)
head(DEXSeq::featureCounts(dex))
rowRanges(dex)
```

## Proteome data

### CCLE

Pull the [CCLE proteome data](https://doi.org/10.1016/j.cell.2019.12.023) from ExperimentHub. The dataset profiles 12,755 proteins by mass spectrometry across 375 cancer cell lines.

```{r ccle-proteom, message = FALSE}
AnnotationHub::query(eh, c("gygi", "depmap"))
ccle.prot <- eh[["EH3459"]]
ccle.prot <- as.data.frame(ccle.prot)
```

Explore the data:
```{r ccle-proteom2}
dim(ccle.prot)
colnames(ccle.prot)
head(ccle.prot)
```

Restrict to HCT116:

```{r ccle-prot-hct116}
ccle.prot.hct116 <- subset(ccle.prot, cell_line == "HCT116_LARGE_INTESTINE")
dim(ccle.prot.hct116)
head(ccle.prot.hct116)
```

Or turn into a `SummarizedExperiment` for convenience (we can restrict
this to selected cell lines, but here we keep all cell lines):

```{r ccle-prot-se}
se <- ccleProteome2SummarizedExperiment(ccle.prot, cell.line = NULL)
assay(se)[1:5, 1:5]
assay(se)[1:5, "HCT116"]
rowData(se)
```

### Relative protein expression data from BioPlex3.0

The [BioPlex 3.0 publication](https://doi.org/10.1016/j.cell.2021.04.011), 
Supplementary Table S4A, provides relative protein expression data comparing
293T and HCT116 cells based on tandem mass tag analysis.

```{r bp.prot}
bp.prot <- getBioplexProteome()
assay(bp.prot)[1:5,1:5]
colData(bp.prot)
rowData(bp.prot)
```

The data contains 5 replicates each for 293T and for HCT116 cells.
As a result of the data collection process, the data represent relative protein abundance 
scaled to add up to 100% in each row.

See also the [data checks vignette, Section 8](https://bioconductor.org/packages/release/data/experiment/vignettes/BioPlex/inst/doc/BasicChecks.html) for a basic exploration of the annotated differential expression measures.

# Caching

Note that calling functions like `getCorum` or `getBioPlex` with argument
`cache = FALSE` will automatically overwrite the corresponding object in your 
cache. It is thus typically not required for a user to interact with the cache.

For more extended control of the cache, use from within R:

```{r cache}
cache.dir <- tools::R_user_dir("BioPlex", which = "cache") 
bfc <- BiocFileCache::BiocFileCache(cache.dir)
```

and then proceed as described in the
[BiocFileCache vignette, Section 1.10](https://www.bioconductor.org/packages/release/bioc/vignettes/BiocFileCache/inst/doc/BiocFileCache.html#cleaning-or-removing-cache)
either via `cleanbfc()` to clean or `removebfc()` to remove your cache.

To do a hard reset (use with caution!):

```{r rmCache, eval = FALSE}
BiocFileCache::removebfc(bfc)
```

# SessionInfo

```{r sessionInfo}
sessionInfo()
```