--- title: "Conformal Prediction for cell type annotation" author: - name: Daniela Corbetta affiliation: Department of Statistical Sciences, University of Padova email: daniela.corbetta@unipd.it package: scConform output: BiocStyle::html_document: toc: true toc_float: true bibliography: REFERENCES.bib vignette: > %\VignetteIndexEntry{Conformal Prediction for cell type annotation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction Cell type annotation is a crucial step in bioinformatic research. A multitude of annotated datasets are now readily accessible, providing valuable references for annotating cells in unannotated datasets originating from similar tissues. Typically, a model is chosen and trained on the reference data to predict the label of a new, unannotated cell in the query dataset. These methods commonly provide point predictions of the cell label, along with estimated probabilities or scores assigned to each cell type in the reference dataset. However, relying solely on point predictions can be problematic when the corresponding estimated probability is low, leading to unreliable classification. To address this issue, we propose to exploit conformal inference to return prediction sets that include multiple labels, with the set size reflecting the confidence in the point prediction. The package `scConform` implements two methods to achieve this: 1. The first method uses split conformal inference (as a reference, see for example section 1 of @angelopoulos2022gentle The output is a prediction set, $C(X_{new})$, that contains some of the cell types present in the reference data. Let $Y_{new}$ be the true cell type of a new cell in the query dataset. Then, $C(X_{new})$ satisfies $P(Y_{new} \in C(X_{new})) \geq 1-\alpha$, where $\alpha$ is a user-chosen error level. 2. The second method is based on conformal risk control (angelopoulos2025conformalriskcontrol). It considers the inherent relationships among cell types that are encoded as graph-structured constraints available through the cell ontology. The final output is a prediction set aligned with the cell ontology structure. Specifically, the prediction set will include all the children of an ancestor of the predicted class. The more unsure the point prediction, the broader the classification. Also this method satisfies $P(Y_{new} \in C(X_{new})) \geq 1-\alpha$. More details are available at @corbetta2025conformal. # Installation To install it ```{r inst, eval=FALSE} ## install if needed if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("scConform") ``` # Preliminaries ## Setup ```{r libraries, message=FALSE} library(scConform) library(SingleCellExperiment) library(VGAM) library(ontoProc) library(MerfishData) library(igraph) library(scuttle) library(scran) library(BiocParallel) `%notin%` <- Negate(`%in%`) ``` ## Load data and cell ontology As an example, we'll use the mouse ileum Merfish data from the `MerfishData` Bioconductor package, segmented with Baysor. For the purpose of this vignette, we'll treat the data as if they were single-cell data, discarding the spatial information. This is a toy example in which reference and query data will be a subset of the same dataset. In real life examples, data integration is a necessary first step. Next, load the cell ontology trough the `ontoProc` Bioconductor package. ```{r data} # Load data spe_baysor <- MouseIleumPetukhov2021( segmentation = "baysor", use.images = FALSE, use.polygons = FALSE ) # Load ontology cl <- getOnto("cellOnto", "2023") ``` ## Build the ontology The object `cl` contains information regarding all the relationships among all the known cell types. The first step is to filter the ontology and only keep the parts related to the cell types that are present in the data. This dataset does not provide annotations with the cell ontology tags, so we need to browse the ontology to find the interesting tags. For more information on how to restrict the cell ontology, refer to the [ontoProc](https://www.bioconductor.org/packages/release/bioc/html/ontoProc.html) documentation. ```{r tags, echo=TRUE, fig.show='hide'} tags <- c( "CL:0009022", # Stromal "CL:0000236", # B cell "CL:0009080", # Tuft "CL:1000411", # Endothelial "CL:1000335", # Enterocyte "CL:1000326", # Goblet "CL:0002088", # ICC "CL:0009007", # Macrophage + DC "CL:1000343", # Paneth "CL:0000669", # Pericyte "CL:1000278", # Smooth Muscle "CL:0009017", # Stem + TA "CL:0000492", # T (CD4+) "CL:0000625", # T (CD8+) "CL:0017004" # Telocyte ) opi <- graph_from_graphnel(onto_plot2(cl, tags)) ``` In the `opi` object, there are also instances from other ontologies (CARO and BFO) that need to be removed. Moreover, the names of the leaf nodes of the ontology must correspond to the names used for the annotation. ```{r build-ontology} ## Delete CARO and BFO instances sel_ver <- V(opi)$name[c(grep("CARO", V(opi)$name), grep("BFO", V(opi)$name))] opi1 <- opi - sel_ver ## Rename vertex to match annotations V(opi1)$name[grep("CL:0000236", V(opi1)$name)] <- "B cell" V(opi1)$name[grep("CL:1000411", V(opi1)$name)] <- "Endothelial" V(opi1)$name[grep("CL:1000335", V(opi1)$name)] <- "Enterocyte" V(opi1)$name[grep("CL:1000326", V(opi1)$name)] <- "Goblet" V(opi1)$name[grep("CL:0002088", V(opi1)$name)] <- "ICC" V(opi1)$name[grep("CL:0009007", V(opi1)$name)] <- "Macrophage + DC" V(opi1)$name[grep("CL:1000343", V(opi1)$name)] <- "Paneth" V(opi1)$name[grep("CL:0000669", V(opi1)$name)] <- "Pericyte" V(opi1)$name[grep("CL:1000278", V(opi1)$name)] <- "Smooth Muscle" V(opi1)$name[grep("CL:0009017", V(opi1)$name)] <- "Stem + TA" V(opi1)$name[grep("CL:0009022", V(opi1)$name)] <- "Stromal" V(opi1)$name[grep("CL:0000492", V(opi1)$name)] <- "T (CD4+)" V(opi1)$name[grep("CL:0000625", V(opi1)$name)] <- "T (CD8+)" V(opi1)$name[grep("CL:0017004", V(opi1)$name)] <- "Telocyte" V(opi1)$name[grep("CL:0009080", V(opi1)$name)] <- "Tuft" ## Add the edge from connective tissue cell and telocyte and delete redundant ## nodes opi1 <- add_edges(opi1, c("connective\ntissue cell\nCL:0002320", "Telocyte")) gr <- as_graphnel(opi1) opi2 <- opi1 - c( "somatic\ncell\nCL:0002371", "contractile\ncell\nCL:0000183", "native\ncell\nCL:0000003" ) V(opi2)$name <- trimws(gsub("CL:.*|\\n", " ", V(opi2)$name)) gr1 <- as_graphnel(opi2) ## Plot the final ontology attrs <- list(node = list(shape = "box", fontsize = 15, fixedsize = FALSE)) plot(gr1, attrs = attrs) ``` ## Preprocess data Modify the colData variable `leiden_final` to unify B cells and enterocytes ```{r preprocess} spe_baysor$cell_type <- spe_baysor$leiden_final spe_baysor$cell_type[spe_baysor$cell_type %in% c( "B (Follicular, Circulating)", "B (Plasma)" )] <- "B cell" spe_baysor$cell_type[grep("Enterocyte", spe_baysor$cell_type)] <- "Enterocyte" spe_baysor <- spe_baysor[, spe_baysor$cell_type %notin% c( "Removed", "Myenteric Plexus" )] spe_baysor # See frequencies of cell types table(spe_baysor$cell_type) ``` Now randomly split the data into reference and query data. ```{r split} set.seed(1636) ref <- sample(seq_len(ncol(spe_baysor)), 600) spe_ref <- spe_baysor[, ref] spe_query <- spe_baysor[, -ref] # Reference data spe_ref # Query data spe_query ``` ## Fit a classification model We now need to build a classification model. To this aim, we'll draw a random sample of 300 cells from the reference data, and train a multinomial model on those. As explanatory variables we'll use the 50 most variable genes. For this step, every statistical model or machine learning method can be used, as long as it provides also estimated probabilities for each class. ```{r model, warning=FALSE} # Randomly select 300 cells set.seed(1704) train <- sample(seq_len(ncol(spe_ref)), 300) # Training data spe_train <- spe_ref[, train] spe_train # get HVGs spe_train <- logNormCounts(spe_train) v <- modelGeneVar(spe_train) hvg <- getTopHVGs(v, n = 50) # Extract counts and convert data into a data.frame format df_train <- as.data.frame(t(as.matrix(counts(spe_train[hvg, ])))) df_train$Y <- spe_train$cell_type table(df_train$Y) # Fit multinomial model fit <- vglm(Y ~ ., family = multinomial(refLevel = "B cell"), data = df_train ) ``` # Method 1: standard conformal inference ## Algorithm As anticipated in the introduction, the first method uses split conformal inference to build the prediction sets. It requires to split the reference data into two subsets: - training set: portion of the reference data to be used to build a classifier $\hat{f}$ (the multinomial model in our example) - calibration set: portion of the reference data to be used to calibrate the procedure. Given $\hat{f}$ and the calibration data, the objective of conformal inference is to build, for a new non-annotated cell $X_{new}$, a prediction set $C(X_{new})$ that satisfies \begin{equation} P(Y_{new} \in C(X_{new})) \geq 1-\alpha, \end{equation} where $Y_{new}$ is the true label of the new cell and $\alpha$ is a user-chosen error level. The algorithm for split conformal inference is the following: 1. For the data in the calibration set, $(X_1, Y_1), \dots, (X_n, Y_n),$ obtain the conformal scores $s_1=1-\hat{f}(X_i)_{Y_i}$ (i.e. 1 minus the estimated probability that the model is assigning to the true label of the $i$-th cell). These scores will be high when the model is assigning a small probability to the true class, and low otherwise. 2. Obtain $\hat{q}$ as the $\lceil(1-\alpha)(n+1)\rceil/n$ empirical quantile of the conformal scores. 3. Finally, for a new cell $X_{new}$, build a prediction set by including all the classes or which the estimated probability is higher than $1-\hat{q}$: $$ C(X_{new}) = \left\{y: \hat{f}(X_{n+1})_y\geq 1-\hat{q}\right\} $$ ## Obtain prediction matrices The first step to build conformal prediction sets is to obtain matrices with the estimated probabilities for each cell type for cells in the calibration data and in the query data. Each row of the matrix corresponds to a particular cell, while each row to a different cell type. The entry $p_{i,j}$ of the matrix indicates the estimated probability that the cell $i$ is of type $j$. ```{r predictions, warning=FALSE} spe_cal <- spe_ref[, -train] # Prediction matrix for calibration data df_cal <- as.data.frame(t(as.matrix(counts(spe_cal[hvg, ])))) p_cal <- predict(fit, newdata = df_cal, type = "response") head(round(p_cal, 3)) # Prediction matrix for query data df_test <- as.data.frame(t(as.matrix(counts(spe_query[hvg, ])))) p_test <- predict(fit, newdata = df_test, type = "response") head(round(p_test, 3)) ``` ## Obtain conformal prediction sets We can now directly call the `getPredictionSet` function by using as input the prediction matrices for the calibration and the query dataset. In this case, the output of the function will be a list whose elements are the prediction sets for each query cell. By setting `follow_ontology=FALSE`, we are asking the function to return prediction sets obtained via split conformal inference. The parameter `onto` is not necessary with this method, but if the ontology is provided then it will be used by the function to retrieve the considered cell types. Alternatively, we can explicitly provide the labels with the parameter `labels`. The parameter `alpha` indicates the allowed miscoverage percentage. For example, if we set `alpha=0.1`, it means that at most 10% of the prediction sets we obtain will not include the true label. ```{r predsets} labels <- colnames(p_test) sets <- getPredictionSets( x_query = p_test, x_cal = p_cal, y_cal = spe_cal$cell_type, alpha = 0.1, follow_ontology = FALSE, labels = labels ) # See the first six prediction sets sets[1:6] ``` Since in this example the cell labels are available, we can explicitly check that the constraint on the miscoverage is satisfied by computing the proportion of sets that include the true label. We set `alpha=0.1`, so we expect this proportion to be higher than 0.9. ```{r cvg} # Check coverage cvg <- rep(NA, length(sets)) for (i in seq_len(length(sets))) { cvg[i] <- spe_query$cell_type[i] %in% sets[[i]] } mean(cvg) ``` As an alternative, we can provide as input a `SingleCellExperiment` object. In this case, it needs to have the estimated probabilities for each cell type in the `colData`. The names of these `colData` have to correspond to the names of the leaf nodes in the ontology. The output will be a `SingleCellExperiment` object with a new variable in the `colData` containing the prediction sets. The name of this variable can be assigned trough the `pr_name` parameter. ```{r predsets-sc} # Retrieve labels as leaf nodes of the ontology labels <- V(opi2)$name[degree(opi2, mode = "out") == 0] # Create corresponding colData for (i in labels) { colData(spe_cal)[[i]] <- p_cal[, i] colData(spe_query)[[i]] <- p_test[, i] } # Create prediction sets spe_query <- getPredictionSets( x_query = spe_query, x_cal = spe_cal, y_cal = spe_cal$cell_type, alpha = 0.1, follow_ontology = FALSE, pr_name = "pred_set", labels = labels ) # See the new variable pred_set into the colData head(colData(spe_query)) ``` # Method 2: exploit the cell ontology ## Algorithm Let $\hat{y}(x)$ be the class with the maximum estimated probability. Moreover, given a directed graph let $P(v)$ and $A(v)$ be the set of children nodes and ancestor nodes of $v$, respectively. Finally, for each node $v$ define a score $g(v,x)$ as the sum of the predicted probabilities of the leaf nodes that are children of $v$. We build the sets as follows: $$ P(v) \cup \{P(a): a\in A(\hat{y}(x)): g(a,x)\leq\lambda \}, $$ where $v:v\in A(\hat{y}(x)), \;g(v,x)\geq\lambda$ and $v=\arg\min_{u:g(u,x)\geq\lambda}g(u,x)$. In words, we start from the predicted class and we go up in the graph until we find an ancestor of $\hat{y}(x)$ that has a score that is at least $\lambda$ and include in the prediction sets all its children. Then we add to this subgraph the other ones that contain $\hat{y}(x)$ for which the score is less than $\lambda$. To choose $\lambda$, we follow eq. (4) in @angelopoulos2025conformalriskcontrol, considering the miscoverage as loss function. See @corbetta2025conformal for more details on the algorithm and a discussion on the rationale of the method. ## Obtain hierarchical prediction sets To switch to the hierarchical algorithm, we just need to set `follow_ontology=TRUE` and provide an ontology to the `onto` parameter. As in the previous case, if the input is a `SingleCellExperiment` object then the output will be again a `SingleCellExperiment` object with a new variable containing the prediction sets in the colData. ```{r predset-hier} spe_query <- getPredictionSets( x_query = spe_query, x_cal = spe_cal, y_cal = spe_cal$cell_type, onto = opi2, alpha = 0.1, follow_ontology = TRUE, method = "full", pr_name = "pred_set_hier", BPPARAM = MulticoreParam(workers = 2), simplify = FALSE ) # See the first six prediction sets head(spe_query$pred_set_hier) ``` Also in this case, we can check the coverage: ```{r cvg1} # Check coverage cvg1 <- rep(NA, length(spe_query$pred_set_hier)) for (i in seq_len(length(spe_query$pred_set_hier))) { cvg1[i] <- spe_query$cell_type[i] %in% spe_query$pred_set_hier[[i]] } mean(cvg1) ``` As an alternative, we can choose to return only the common ancestor, instead of the entire set of predicted labels. This can be done directly in the function by setting `simplify=TRUE`. Alternatively, we can convert the obtained prediction sets invoking the function `getCommonAncestor`. ```{r common-ancestor} spe_query$pred_set_hier_simp <- vapply( spe_query$pred_set_hier, function(x) getCommonAncestor(x, opi2), character(1) ) head(spe_query$pred_set_hier_simp) ``` # Visualization Let's now visualize the obtained prediction sets. The simplest way is to just ask to color the labels included in the prediction set. Let's compare the prediction sets for one cell obtained with the two methods ```{r plotres} plotResult(spe_query$pred_set[[75]], opi2, col_grad = "pink", attrs = attrs, add_scores = FALSE, title = "Conformal Prediction set" ) ``` ```{r plotres2} plotResult(spe_query$pred_set_hier[[75]], opi2, col_grad = "pink", attrs = attrs, add_scores = FALSE, title = "Hierarchical prediction set" ) ``` We can now instead give as input also the estimated probabilities and a gradient. The function will color the cells according to the gradient, with stronger colors corresponding to higher probabilities. If `add_scores=TRUE`, then the estimated probabilities will be added to the labels. ```{r plotres3} plotResult(spe_query$pred_set[[75]], opi2, probs = p_test[75, ], col_grad = c("lemonchiffon", "orange", "darkred"), attrs = attrs, add_scores = TRUE, title = "Conformal Prediction set" ) ``` ```{r plotres4} plotResult(spe_query$pred_set_hier[[75]], opi2, probs = p_test[75, ], col_grad = c("lemonchiffon", "orange", "darkred"), attrs = attrs, add_scores = TRUE, title = "Hierarchical Prediction set" ) ``` # R.session Info {.unnumbered} ```{r SessionInfo, echo=FALSE, message=FALSE, warning=FALSE, comment=NA} sessionInfo() ```