--- title: "iCluster analysis of uveal melanoma multi-omics data" author: name: "Qianxing Mo" affiliation: "Department of Biostatistics and Bioinformatics
H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612" output: html_document: theme: united df_print: kable date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`' --- ```{r setup, include=FALSE} all_times <- list() # store the time for each chunk knitr::knit_hooks$set(time_it = local({ now <- NULL function(before, options) { if (before) { now <<- Sys.time() } else { res <- difftime(Sys.time(), now) all_times[[options$label]] <<- res } } })) knitr::opts_chunk$set( message = FALSE, warning = FALSE, time_it = TRUE, error = TRUE ) ``` This tutorial aims to demonstrate how to use the updated iCluster functions to identify cancer molecular subtypes using multi-omics data. We use the uveal melanoma (UM) multi-omics data generated by The Cancer Genome Atlas (TCGA) project ([Robertson et al., Cancer Cell 2018](https://www.sciencedirect.com/science/article/pii/S1535610817302957)) for the demonstration. The UM dataset consists of somatic mutation, DNA copy number, DNA methylation and RNA-seq gene expression data for 80 UM primary samples. The level-3 UM multi-omics data available at [Firebrowse portal](http://firebrowse.org/) were processed to form 4 data matrices (rows representing samples, and columns representing omics features) for iCluster analysis. The data procession were detailed in [Mo et al., 2021](https://www.mdpi.com/2072-6694/13/24/6168). In the following, we briefly describe the 4 data matrices. ```{r packages, cache=FALSE} library(iClusterPlus) library(patchwork) library(lattice) library(umap) library(ggplot2) library(gplots) library(survival) library(survminer) library(circlize) library(ComplexHeatmap) library(knitr) ``` TCGA UM multi-omics data matrices: rows and columns correspond to samples and features, respectively. ```{r multi-omics data} data(UM) #Somatic mutation data. 1: mutation, 0: no-mutation. dim(mut02) #Copy number data: merged log2 ratios of chromosome segments using CNregions function dim(cn) #Methylation data matrix made of the top 25% most variable genes. dim(methy25) #mRNA expression data matrix made of the top 25% most variable genes. dim(mrna25) ``` The iCluster analysis of multi-omics data can be carried out in 4 steps as described in the following. # Step 1: Make PCA variance plot to determine the dimension of latent factor used for iCluster modeling. The iCluster methods are based on latent factor model. We suggest using PCA variance plot to determine the number of latent factors used for iCluster modeling. ```{r data procession, fig.width=10, fig.height=5} # PCA variance plot for continuous multi-omics data pcaVarPlot(xList=list(cn,methy25,mrna25),K=30) ``` From the PCA variance plot, we see that the rate of decrease of the explained variance starts to slow down significantly after the 5th PC. An optimal number of latent factors used for iCluster modeling can be chosen around the change point where the rate of decrease of the explained variance of the PCs starts to slow down significantly. As an example, here, we choose K=7 as the number of latent factors for iCluster modeling of the UM multi-omics data. It can be found the sample clustering results are quite consistent when K >= 7. # Step 2: Perform iCluster modeling We first apply iClusterPlus2, an updated [iClusterPlus](https://pubmed.ncbi.nlm.nih.gov/23431203/) function to the UM multi-omics data. The iCluster modeling generates a latent factor matrix (meanZ), which capture the major sources of variation of the multi-omics data. As a result, the latent factor matrix can be used for sample clustering and visualization. ```{r iClusterPlus2 analysis} set.seed(321) date() fit.um <- iClusterPlus2(xList=list(mut02,cn,methy25,mrna25), type=c("binomial","gaussian","gaussian","gaussian"), K=7,n.burnin=100,n.draw=200,maxiter=25,sdev=0.05) date() # check the number of genes (variables) whose coefficients are all 0 betaID <- list() for(i in 1:4){ betaID[[i]] <- apply(fit.um$beta[[i]],1,function(x){all(x==0)}) cat(sum(betaID[[i]]), " ") } ``` # Step 3: Visualize sample clusters on a 2-dimensional space Before clustering analysis, we project the samples to a 2-dimensional space to estimate the number of clusters. ```{r UMAP, fig.height=6, fig.width=6} ## clustering of cells using Louvain method and the latent factor matrix generated by iCluster umap.um <- umap(fit.um$meanZ) colnames(umap.um$layout) <- c("UMAP1","UMAP2") plot(umap.um$layout) ``` # Step 4: Perform sample clustering From the UMAP plot, it seems that there are 3 or 4 major clusters. Therefore, we perform k-means clustering on the UM samples using the latent factor matrix meanZ and let the number of clusters be 3 or 4. The user can also try other clustering methods (e.g., k-medoids) on the latent factor matrix meanZ. ```{r kmeans clustering, fig.height=5, fig.width = 10} km3 <- kmeans(fit.um$meanZ, 3, nstart = 100) km4 <- kmeans(fit.um$meanZ, 4, nstart = 100) # compare the 3 and 4 cluster solution table(km3$cluster, km4$cluster) um.clusters <- data.frame(umap.um$layout,ic3=factor(km3$cluster),ic4=factor(km4$cluster)) p1 <- ggplot(um.clusters, aes(x=UMAP1, y=UMAP2, col=ic3)) + geom_point(size = 3)+theme_bw() p2 <- ggplot(um.clusters, aes(x=UMAP1, y=UMAP2, col=ic4)) + geom_point(size = 3)+theme_bw() p1+p2 ``` # Step 5: Draw the heatmap of the sample clusters It is necessary to make a heatmap of the multi-omics data to see if the clusters are biologically meaningful. We draw the heatmaps for the 3 and 4 clusters separately. On the following heatmaps, the columns are the samples ordered by the clusters and the rows are the features ordered by the clustering methods specified by the plotHeatmap function. Heatmap for the 3 iClusters ```{r heatmap of 3 clusters, fig.height=8, fig.width = 8} col.scheme <- alist() col.scheme[[1]] <- colorpanel(2,low="white",high="black") col.scheme[[2]] <- bluered(256) col.scheme[[3]] <- bluered(256) col.scheme[[4]] <- bluered(256) chr <- unlist(strsplit(colnames(cn),"\\.")) chr <- chr[seq(1,length(chr),by=2)] chr <- gsub("chr","",chr) chr <- as.numeric(chr) ### heatmap for the 3 iClusters fit.um$clusters <- factor(um.clusters$ic3) hm3c <- plotHeatmap(fit=fit.um,xList=list(mut02,cn,methy25,mrna25), type=c("binomial","gaussian","gaussian","gaussian"), threshold=rep(0.25, 4),feature.order=c(F,F,T,T),feature.scale=c(F,F,T,T), dist.method = "euclidean", hclust.method="ward.D",col.scheme = col.scheme,chr=chr, plot.chr=c(F,T,F,F),sparse=c(T,F,T,T),cap=c(0,0.99,0.99,0.99)) ``` Heatmap for the 4 iClusters. ```{r heatmap of 4 clusters, fig.height=8, fig.width = 8} fit.um$clusters <- factor(um.clusters$ic4) hm4c <- plotHeatmap(fit=fit.um,xList=list(mut02,cn,methy25,mrna25), type=c("binomial","gaussian","gaussian","gaussian"), threshold=rep(0.25, 4),feature.order=c(F,F,T,T),feature.scale=c(F,F,T,T), dist.method = "euclidean", hclust.method="ward.D",col.scheme = col.scheme,chr=chr, plot.chr=c(F,T,F,F),sparse=c(T,F,T,T),cap=c(0,0.99,0.99,0.99)) ``` Comparing the 3 and 4 iClusters, we can see that the cluster characterized by chromosome 6 loss in the 3-cluster solution is further divided into 2 smaller clusters in the 4-cluster solution. We can also use the ComplexHeatmap to make the multi-omics heatmap. The following is the heatmap of the 4 iClusters on which the columns are the features ordered by the clustering methods specified by the Heatmap function and the rows are the samples ordered by the clusters. ```{r heatmap plot 2, fig.height=5, fig.width=10} sample.ord <- order(um.clusters$ic4) meth <- scale(methy25) expr <- scale(mrna25) image.datasets <- list(mut02,cn,meth,expr) top.features <- list() for(i in 1:4){ ### keep all the CN features to show the full image of copy number variation if(i == 2){ top.features[[i]] <- (image.datasets[[i]])[sample.ord,] cat(ncol(top.features[[i]]), " ") }else{ # select the top features for mutation, methylation, and gene expression data sum.abs.coef <- apply(abs(fit.um$beta[[i]]),1,sum) top.features[[i]] <- (image.datasets[[i]])[sample.ord,sum.abs.coef>quantile(sum.abs.coef,0.5)] cat(ncol(top.features[[i]]), " ") } } clusters.ord <- um.clusters$ic4[sample.ord] colfun0 <- c("0"="white","1"="black") colfun1 <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red")) colfun2 <- colorRamp2(c(-3, 0, 3), c("blue", "white", "red")) col_letters <- c("1" = "red", "2" = "purple", "3" = "gold", "4"="green") ht0 <- Heatmap(clusters.ord, name="iCluster",col = col_letters,width = unit(5, "mm"),cluster_rows = FALSE,cluster_columns=FALSE, column_labels="",row_labels=rep("",nrow(top.features[[1]]))) ht1 <- Heatmap(top.features[[1]], name = "Mutation",col=colfun0,cluster_rows = FALSE,clustering_method_columns="ward.D", column_labels=rep("",ncol(top.features[[1]])),row_labels=rep("",nrow(top.features[[1]])),use_raster=TRUE,width = unit(2, "cm")) ht2 <- Heatmap(top.features[[2]], name = "Copy number",col=colfun1,cluster_rows = FALSE,cluster_columns = FALSE,clustering_method_columns="ward.D", column_labels=rep("",ncol(top.features[[2]])),row_labels=rep("",nrow(top.features[[2]])),use_raster=TRUE,width = unit(5, "cm")) ht3 <- Heatmap(top.features[[3]], name = "Methylation",col=colfun2,cluster_rows = FALSE,clustering_method_columns="ward.D", column_labels=rep("",ncol(top.features[[3]])),row_labels=rep("",nrow(top.features[[3]])),use_raster=TRUE,width = unit(5, "cm")) ht4 <- Heatmap(top.features[[4]], name = "mRNA",col=colfun2,cluster_rows = FALSE,clustering_method_columns="ward.D", column_labels=rep("",ncol(top.features[[4]])),row_labels=rep("",nrow(top.features[[4]])),use_raster=TRUE,width = unit(5, "cm")) ht0 + ht1 + ht2 + ht3 + ht4 ``` # Step 6: check the overall survival of the iClusters ```{r survival analysis, fig.width=5,fig.width=10} clin4 <- cbind(clin4[,1:15],um.clusters) # OS of 3 clusters survdiff(Surv(osmonth,vital_status)~ic3,data=clin4) surv.ic3 <- survfit(Surv(osmonth,vital_status)~ic3,data=clin4) p3 <- ggsurvplot(surv.ic3, pval=TRUE)$plot # OS of 4 clusters survdiff(Surv(osmonth,vital_status)~ic4,data=clin4) surv.ic4 <- survfit(Surv(osmonth,vital_status)~ic4,data=clin4) p4 <- ggsurvplot(surv.ic4,pval=TRUE)$plot p3 + p4 ``` # iCluster analysis using iCluster2b If multi-omics data are continuous or can be transformed to a continuous format, the user may use function iCluster2b to perform iCluster analysis. The iCluster2b is an updated integrative clustering method originally reported by [Shen, Wang and Mo. 2023](https://pubmed.ncbi.nlm.nih.gov/24587839/). The user can use function tune.iCluster2b to search optimal lasso shrinkage parameters for multi-omics data. In the following analysis, we exclude the somatic mutation data, which is binary. A. Perform iCluster modeling using iCluster2b on continuous multi-omics data. ```{r iCluster2b} # tune.iCluster2b search optimal lasso parameter lambda from min.lambda to max.lambda for iCluster2b function date() fit2 <- tune.iCluster2b(xList=list(cn,methy25, mrna25), min.lambda=10,max.lambda=1000,K=7, method=c("lasso","lasso","lasso"),min.shrinkage.rate=rep(0.05,3), lambda.iter=25,EM.iter=25) date() ## check the number of coefficients are all 0 betaID <- list() for(i in 1:3){ betaID[[i]] <- apply(fit2$beta[[i]],1,function(x){all(x==0)}) cat(sum(betaID[[i]]), " ") } ``` B. Visualize sample clusters on a 2-dimensional UMAP space and perform k-means clustering ```{r Umap and kmeans, fig.width=10,fig.height=5} umap2 <- umap(fit2$meanZ) colnames(umap2$layout) <- c("UMAP1","UMAP2") km3.2 <- kmeans(fit2$meanZ, 3, nstart = 100) km4.2 <- kmeans(fit2$meanZ, 4, nstart = 100) um.clusters2 <- data.frame(umap2$layout,ic3=factor(km3.2$cluster),ic4=factor(km4.2$cluster)) # compare sample clusters generated by iClusterPlus2 and iCluster2b table(um.clusters$ic3, um.clusters2$ic3) table(um.clusters$ic4, um.clusters2$ic4) p1.2 <- ggplot(um.clusters2, aes(x=UMAP1, y=UMAP2, col=ic3)) + geom_point(size = 3)+theme_bw() p2.2 <- ggplot(um.clusters2, aes(x=UMAP1, y=UMAP2, col=ic4)) + geom_point(size = 3)+theme_bw() p1.2 + p2.2 ``` C. Make heatmap for the 4 iClusters produced by tune.iCluster2b ```{r heatmap of 4 clusters based on 3 datasets, fig.height=8, fig.width = 8} col.scheme2 = list() col.scheme2[[1]] <- bluered(256) col.scheme2[[2]] <- bluered(256) col.scheme2[[3]] <- bluered(256) fit2$clusters <- factor(um.clusters2$ic4,levels=c(1,2,3,4)) hm4c.2 <- plotHeatmap(fit=fit2,xList=list(cn,methy25,mrna25), type=c("gaussian","gaussian","gaussian"), threshold=rep(0.5, 3),feature.order=c(F,T,T),feature.scale=c(F,T,T), dist.method = "euclidean", hclust.method="ward.D",col.scheme=col.scheme2,chr=chr, plot.chr=c(T,F,F),sparse=c(F,T,T),cap=c(0.99,0.99,0.99)) ```

**Session Info**

```{r} sessionInfo() ```