--- title: "Apply clustering method" author: - name: Angelo Duò - name: Mark D Robinson - name: Charlotte Soneson date: "`r Sys.Date()`" package: DuoClustering2018 output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Apply clustering method} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: duoclustering2018.bib editor_options: chunk_output_type: console --- # Introduction This vignette describes how each of the included clustering methods was applied to the collection of data sets in order to generate the clustering result summaries provided with the package. It also shows how to apply a new clustering method to the included data sets, to generate results that can be compared to those already included. # Applying a new clustering algorithm to a provided data set The code below describes how we applied each of the included clustering methods to the data sets for our paper [@Duo2018-F1000]. The `apply_*()` functions, describing how the respective clustering methods were run, are available from the [GitHub repository](https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison/tree/master/Rscripts/clustering) corresponding to the publication. In order to apply a new clustering algorithm to one of the data sets using the same framework, it is necessary to generate a function with the same format. The input arguments to this function should be: - a `SingleCellExperiment` object - a named list of parameter values (can be empty, if no parameters are used for the method) - the desired number of clusters (k). The function should return a list with three elements: - `st` - a vector with the timing information. Should have five elements, named `user.self`, `sys.self`, `user.child`, `sys.child` and `elapsed`. - `cluster` - a named vector of cluster assignments for all cells. - `est_k` - the number of clusters estimated by the method (if available, otherwise `NA`). If the method does not allow specification of the desired number of clusters, but has another parameter affecting the resolution, this can be accommodated as well (see the solution for `Seurat` in the code below). First, load the package and define the data set and clustering method to use (note that in order to apply a method named ``, there has to be a function named `apply_()`, with the above specifications, available in the workspace). ```{r} suppressPackageStartupMessages({ library(DuoClustering2018) }) scename <- "sce_filteredExpr10_Koh" sce <- sce_filteredExpr10_Koh() method <- "PCAHC" ``` Next, define the list of hyperparameter values. The package contains the hyperparameter values for the methods included in our paper. ```{r} ## Load parameter files. General dataset and method parameters as well as ## dataset/method-specific parameters params <- duo_clustering_all_parameter_settings_v2()[[paste0(scename, "_", method)]] params ``` Finally, define the number of times to apply the clustering method (for each value of the number of clusters), and run the clustering across a range of imposed numbers of clusters (defined in the parameter list). ```{r, eval=FALSE} ## Set number of times to run clustering for each k n_rep <- 5 ## Run clustering set.seed(1234) L <- lapply(seq_len(n_rep), function(i) { ## For each run cat(paste0("run = ", i, "\n")) if (method == "Seurat") { tmp <- lapply(params$range_resolutions, function(resolution) { ## For each resolution cat(paste0("resolution = ", resolution, "\n")) ## Run clustering res <- get(paste0("apply_", method))(sce = sce, params = params, resolution = resolution) ## Put output in data frame df <- data.frame(dataset = scename, method = method, cell = names(res$cluster), run = i, k = length(unique(res$cluster)), resolution = resolution, cluster = res$cluster, stringsAsFactors = FALSE, row.names = NULL) tm <- data.frame(dataset = scename, method = method, run = i, k = length(unique(res$cluster)), resolution = resolution, user.self = res$st[["user.self"]], sys.self = res$st[["sys.self"]], user.child = res$st[["user.child"]], sys.child = res$st[["sys.child"]], elapsed = res$st[["elapsed"]], stringsAsFactors = FALSE, row.names = NULL) kest <- data.frame(dataset = scename, method = method, run = i, k = length(unique(res$cluster)), resolution = resolution, est_k = res$est_k, stringsAsFactors = FALSE, row.names = NULL) list(clusters = df, timing = tm, kest = kest) }) ## End for each resolution } else { tmp <- lapply(params$range_clusters, function(k) { ## For each k cat(paste0("k = ", k, "\n")) ## Run clustering res <- get(paste0("apply_", method))(sce = sce, params = params, k = k) ## Put output in data frame df <- data.frame(dataset = scename, method = method, cell = names(res$cluster), run = i, k = k, resolution = NA, cluster = res$cluster, stringsAsFactors = FALSE, row.names = NULL) tm <- data.frame(dataset = scename, method = method, run = i, k = k, resolution = NA, user.self = res$st[["user.self"]], sys.self = res$st[["sys.self"]], user.child = res$st[["user.child"]], sys.child = res$st[["sys.child"]], elapsed = res$st[["elapsed"]], stringsAsFactors = FALSE, row.names = NULL) kest <- data.frame(dataset = scename, method = method, run = i, k = k, resolution = NA, est_k = res$est_k, stringsAsFactors = FALSE, row.names = NULL) list(clusters = df, timing = tm, kest = kest) }) ## End for each k } ## Summarize across different values of k assignments <- do.call(rbind, lapply(tmp, function(w) w$clusters)) timings <- do.call(rbind, lapply(tmp, function(w) w$timing)) k_estimates <- do.call(rbind, lapply(tmp, function(w) w$kest)) list(assignments = assignments, timings = timings, k_estimates = k_estimates) }) ## End for each run ## Summarize across different runs assignments <- do.call(rbind, lapply(L, function(w) w$assignments)) timings <- do.call(rbind, lapply(L, function(w) w$timings)) k_estimates <- do.call(rbind, lapply(L, function(w) w$k_estimates)) ## Add true group for each cell truth <- data.frame(cell = as.character(rownames(colData(sce))), trueclass = as.character(colData(sce)$phenoid), stringsAsFactors = FALSE) assignments$trueclass <- truth$trueclass[match(assignments$cell, truth$cell)] ## Combine results res <- list(assignments = assignments, timings = timings, k_estimates = k_estimates) df <- dplyr::full_join(res$assignments %>% dplyr::select(dataset, method, cell, run, k, resolution, cluster, trueclass), res$k_estimates %>% dplyr::select(dataset, method, run, k, resolution, est_k) ) %>% dplyr::full_join(res$timings %>% dplyr::select(dataset, method, run, k, resolution, elapsed)) ``` The resulting `df` data frames can then be combined across data sets, filterings and methods and used as input to the provided plotting functions. # Session info ```{r} sessionInfo() ``` # References