--- title: "Basic checks of BioPlex PPI data" output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('BioPlex')`" vignette: > % \VignetteIndexEntry{2. Data checks} % \VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` # Setup ```{r, message = FALSE} library(BioPlex) library(AnnotationDbi) library(AnnotationHub) library(graph) ``` Connect to [AnnotationHub](http://bioconductor.org/packages/AnnotationHub): ```{r ahub, message = FALSE} ah <- AnnotationHub::AnnotationHub() ``` Connect to [ExperimentHub](http://bioconductor.org/packages/ExperimentHub): ```{r ehub, message = FALSE} eh <- ExperimentHub::ExperimentHub() ``` OrgDb package for human: ```{r orgdb, message = FALSE} orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens")) orgdb <- orgdb[[1]] orgdb keytypes(orgdb) ``` # Check: identify CORUM complexes that have a subunit of interest Get core set of complexes: ```{r corumCore} core <- getCorum(set = "core", organism = "Human") ``` Turn the CORUM complexes into a list of graph instances, where all nodes of a complex are connected to all other nodes of that complex with undirected edges. ```{r corum2glist} core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT") ``` Identify complexes that have a subunit of interest: ```{r corum-subunit} has.cdk2 <- hasSubunit(core.glist, subunit = "CDK2", id.type = "SYMBOL") ``` Check the answer: ```{r corum-subunit2} table(has.cdk2) cdk2.glist <- core.glist[has.cdk2] lapply(cdk2.glist, function(g) unlist(graph::nodeData(g, attr = "SYMBOL"))) ``` We can then also inspect the graph with plotting utilities from the [Rgraphviz](https://bioconductor.org/packages/Rgraphviz) package: ```{r corum-subunit3, message = FALSE, eval = FALSE} plot(cdk2.glist[[1]], main = names(cdk2.glist)[1]) ``` # Check: extract BioPlex PPIs for a CORUM complex Get the latest version of the 293T PPI network: ```{r bioplex293T} bp.293t <- getBioPlex(cell.line = "293T", version = "3.0") ``` Turn the BioPlex PPI network into one big graph where bait and prey relationship are represented by directed edges from bait to prey. ```{r bpgraph} bp.gr <- bioplex2graph(bp.293t) ``` Now we can also easily pull out a BioPlex subnetwork for a CORUM complex of interest: ```{r} n <- graph::nodes(cdk2.glist[[1]]) bp.sgr <- graph::subGraph(n, bp.gr) bp.sgr ``` # Check: identify interacting domains for a PFAM domain of interest Add PFAM domain annotations to the node metadata: ```{r} bp.gr <- BioPlex::annotatePFAM(bp.gr, orgdb) ``` Create a map from PFAM to UNIPROT: ```{r} unip2pfam <- graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM") pfam2unip <- stack(unip2pfam) pfam2unip <- split(as.character(pfam2unip$ind), pfam2unip$values) head(pfam2unip, 2) ``` Let's focus on [PF02023](http://pfam.xfam.org/family/PF02023), corresponding to the zinc finger-associated SCAN domain. For each protein containing the SCAN domain, we now extract PFAM domains connected to the SCAN domain by an edge in the BioPlex network. ```{r} scan.unip <- pfam2unip[["PF02023"]] getIAPfams <- function(n) graph::nodeData(bp.gr, graph::edges(bp.gr)[[n]], "PFAM") unip2iapfams <- lapply(scan.unip, getIAPfams) unip2iapfams <- lapply(unip2iapfams, unlist) names(unip2iapfams) <- scan.unip ``` Looking at the top 5 PFAM domains most frequently connected to the SCAN domain by an edge in the BioPlex network ... ```{r} pfam2iapfams <- unlist(unip2iapfams) sort(table(pfam2iapfams), decreasing = TRUE)[1:5] ``` ... we find [PF02023](http://pfam.xfam.org/family/PF02023), the SCAN domain itself, and [PF00096](http://pfam.xfam.org/family/PF00096), a C2H2 type zinc finger domain. This finding is consistent with results reported in the [BioPlex 3.0 publication](https://doi.org/10.1016/j.cell.2021.04.011). See also the [PFAM domain-domain association analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/PFAM.html) for a more comprehensive analysis of PFAM domain associations in the BioPlex network. # Check: agreement between CNV tracks Genomic data from whole-genome sequencing for six different lineages of the human embryonic kidney HEK293 cell line can be obtained from [hek293genome.org](http://hek293genome.org). This includes copy number variation (CNV) data for the 293T cell line. Available CNV tracks include (i) CNV regions inferred from sequencing read-depth analysis, and (ii) CNV regions inferred from Illumina SNP arrays. Here, we check agreement between inferred copy numbers from both assay types. We start by obtaining genomic coordinates and copy number scores from the sequencing track ... ```{r cnv-seq} cnv.hmm <- getHEK293GenomeTrack(track = "cnv.hmm", cell.line = "293T") cnv.hmm ``` ... and from the SNP track. ```{r cnv-snp} cnv.snp <- getHEK293GenomeTrack(track = "cnv.snp", cell.line = "293T") cnv.snp ``` We reduce the check for agreement between both CNV tracks by transferring copy numbers to overlapping genes, and subsequently, assess the agreement between the resulting gene copy numbers for both tracks. As the genomic coordinates from both CNV tracks is based on the hg18 human genome assembly, we obtain gene coordinates for hg18 from AnnotationHub: ```{r hg18-genes} AnnotationHub::query(ah, c("TxDb", "Homo sapiens")) txdb <- ah[["AH52257"]] gs <- GenomicFeatures::genes(txdb) gs ``` We then transfer SNP-inferred copy numbers to genes by overlap ... ```{r} olaps <- GenomicRanges::findOverlaps(gs, cnv.snp) joined <- gs[S4Vectors::queryHits(olaps)] joined$score <- cnv.snp$score[S4Vectors::subjectHits(olaps)] joined ``` ... and, analogously, transfer sequencing-inferred copy numbers to genes by overlap. ```{r} olaps <- GenomicRanges::findOverlaps(gs, cnv.hmm) joined2 <- gs[S4Vectors::queryHits(olaps)] joined2$score <- cnv.hmm$score[S4Vectors::subjectHits(olaps)] joined2 ``` We then restrict both tracks to common genes. ```{r} isect <- intersect(names(joined), names(joined2)) joined <- joined[isect] joined2 <- joined2[isect] ``` Now, can assess agreement by testing the correlation between SNP-inferred gene copy numbers and the corresponding sequencing-inferred gene copy numbers. ```{r} cor(joined$score, joined2$score) cor.test(joined$score, joined2$score) ``` We also inspect the correlation via a boxplot. ```{r, fig.width = 8, fig.height = 8} spl <- split(joined2$score, joined$score) boxplot(spl, xlab = "SNP-inferred copy number", ylab = "sequencing-inferred copy number") rho <- cor(joined$score, joined2$score) rho <- paste("cor", round(rho, digits=3), sep=" = ") p <- cor.test(joined$score, joined2$score) p <- p$p.value p <- " p < 2.2e-16" legend("topright", legend=c(rho, p)) ``` # Check: expressed genes are showing up as prey (293T cells) Get RNA-seq data for HEK293 cells from GEO: [GSE122425](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122425) ```{r gse122425} se <- getGSE122425() se ``` Inspect expression of prey genes: ```{r prey-expression} bait <- unique(bp.293t$SymbolA) length(bait) prey <- unique(bp.293t$SymbolB) length(prey) ``` ```{r} ind <- match(prey, rowData(se)$SYMBOL) par(las = 2) boxplot(log2(assay(se, "rpkm") + 0.5)[ind,], names = se$title, ylab = "log2 RPKM") ``` How many prey genes are expressed (raw read count > 0) in all 3 WT reps: ```{r prey-expression2} # background: how many genes in total are expressed in all three WT reps gr0 <- rowSums(assay(se)[,1:3] > 0) table(gr0 == 3) # prey: expressed in all three WT reps table(gr0[ind] == 3) # prey: expressed in at least one WT rep table(gr0[ind] > 0) ``` Are prey genes overrepresented in the expressed genes? ```{r prey-expression-ora} exprTable <- matrix(c(9346, 1076, 14717, 32766), nrow = 2, dimnames = list(c("Expressed", "Not.expressed"), c("In.prey.set", "Not.in.prey.set"))) exprTable ``` Test using hypergeometric test (i.e. one-sided Fisher's exact test): ```{r prey-expression-ora2} fisher.test(exprTable, alternative = "greater") ``` Alternatively: permutation test, i.e. repeatedly sample number of prey genes from the background, and assess how often we have as many or more than 9346 genes expressed: ```{r prey-expression-293T-perm} permgr0 <- function(gr0, nr.genes = length(prey)) { ind <- sample(seq_along(gr0), nr.genes) sum(gr0[ind] == 3) } ``` ```{r prey-expression-perm2} perms <- replicate(permgr0(gr0), 1000) summary(perms) (sum(perms >= 9346) + 1) / 1001 ``` # Check: is there a relationship between prey frequency and prey expression level? Check which genes turn up most frequently as prey: ```{r prey-freq} prey.freq <- sort(table(bp.293t$SymbolB), decreasing = TRUE) preys <- names(prey.freq) prey.freq <- as.vector(prey.freq) names(prey.freq) <- preys head(prey.freq) summary(prey.freq) hist(prey.freq, breaks = 50, main = "", xlab = "Number of PPIs", ylab = "Number of genes") ``` Prey genes are involved in `r round(mean(as.vector(prey.freq)))` PPIs on average. There doesn't seem to be a strong correlation between expression level and the frequency of gene to turn up as prey: ```{r} ind <- match(names(prey.freq), rowData(se)$SYMBOL) rmeans <- rowMeans(assay(se, "rpkm")[ind, 1:3]) log.rmeans <- log2(rmeans + 0.5) par(pch = 20) plot( x = prey.freq, y = log.rmeans, xlab = "prey frequency", ylab = "log2 RPKM") cor(prey.freq, log.rmeans, use = "pairwise.complete.obs") ``` See also the [BioNet maximum scoring subnetwork analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/BioNet.html) for a more comprehensive analysis of the 293T transcriptome data from GSE122425 when mapped onto BioPlex PPI network. # Check: differential protein expression (HEK293 vs. HCT116) Get the relative protein expression data comparing 293T and HCT116 cells from Supplementary Table S4A of the BioPlex 3 paper: ```{r bp.prot} bp.prot <- getBioplexProteome() bp.prot rowData(bp.prot) ``` A couple of quick sanity checks: 1. The relative abundances are scaled to sum up to 100% for each protein: ```{r} rowSums(assay(bp.prot)[1:5,]) ``` 2. The `rowData` column `log2ratio` corresponds to the mean of the five HEK samples, divided by the mean of the five HCT samples (and then taking log2 of it): ```{r} ratio <- rowMeans(assay(bp.prot)[1:5, 1:5]) / rowMeans(assay(bp.prot)[1:5, 6:10]) log2(ratio) ``` 3. The `rowData` column `adj.pvalue` stores Benjamini-Hochberg adjusted *p*-values from a *t*-test between the five HEK samples and the five HCT samples: ```{r} t.test(assay(bp.prot)[1, 1:5], assay(bp.prot)[1, 6:10]) ``` The [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) also explores the agreement between differential gene expression and differential protein expression when comparing HEK293 against HCT116 cells. # SessionInfo ```{r sessionInfo} sessionInfo() ```