---
title: "Conformal Prediction for cell type annotation"
author:
    - name: Daniela Corbetta
      affiliation: Department of Statistical Sciences, University of Padova 
      email: daniela.corbetta@unipd.it
package: scConform
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
bibliography: REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{Conformal Prediction for cell type annotation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```


# Introduction

Cell type annotation is a crucial step in bioinformatic research. A multitude 
of annotated datasets are now readily accessible, providing valuable references 
for annotating cells in unannotated datasets originating from similar tissues. 
Typically, a model is chosen and trained on the reference data to predict the 
label of a new, unannotated cell in the query dataset.

These methods commonly provide point predictions of the cell label, along with
estimated probabilities or scores assigned to each cell type in the reference 
dataset. However, relying solely on point predictions can be problematic when 
the corresponding estimated probability is low, leading to unreliable 
classification. To address this issue, we propose to exploit conformal
inference to return prediction sets that include multiple labels, with the set 
size reflecting the confidence in the point prediction. 
The package `scConform` implements two methods to achieve this:

1. The first method uses split conformal inference (as a reference, 
see for example section 1 of @angelopoulos2022gentle 
The output is a 
prediction set, $C(X_{new})$, that contains some of the cell types present in 
the reference data.
Let $Y_{new}$ be the true cell type of a new cell in the query dataset. Then, 
$C(X_{new})$ satisfies $P(Y_{new} \in C(X_{new})) \geq 1-\alpha$, where
$\alpha$ is a user-chosen error level.

2. The second method is based on conformal risk control 
(angelopoulos2025conformalriskcontrol). It considers
the inherent relationships among cell types that are encoded as 
graph-structured constraints available through the cell ontology. 
The final output is a prediction set aligned with the
cell ontology structure. Specifically, the prediction set will include all the
children of an ancestor of the predicted class. The more unsure the point 
prediction, the broader the classification. Also this method satisfies 
$P(Y_{new} \in C(X_{new})) \geq 1-\alpha$. More details are 
available at @corbetta2025conformal.


# Installation

To install it
```{r inst, eval=FALSE}
## install if needed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("scConform")
```


# Preliminaries

## Setup

```{r libraries, message=FALSE}
library(scConform)
library(SingleCellExperiment)
library(VGAM)
library(ontoProc)
library(MerfishData)
library(igraph)
library(scuttle)
library(scran)
library(BiocParallel)

`%notin%` <- Negate(`%in%`)
```


## Load data and cell ontology

As an example, we'll use the mouse ileum Merfish data from the `MerfishData` 
Bioconductor package, segmented with Baysor.
For the purpose of this vignette, we'll treat the data as if they were 
single-cell data, discarding the spatial information. This is a toy example
in which reference and query data will be a subset of the same dataset. In real
life examples, data integration is a necessary first step.

Next, load the cell ontology trough the `ontoProc` Bioconductor package.

```{r data}
# Load data
spe_baysor <- MouseIleumPetukhov2021(
    segmentation = "baysor",
    use.images = FALSE, use.polygons = FALSE
)
# Load ontology
cl <- getOnto("cellOnto", "2023")
```

## Build the ontology

The object `cl` contains information regarding all the relationships among all
the known cell types. The first step is to filter the ontology and only keep 
the parts related to the cell types that are present in the data.

This dataset does not provide annotations with the cell ontology tags, so
we need to browse the ontology to find the interesting tags.

For more information on how to restrict the cell ontology, refer to the
[ontoProc](https://www.bioconductor.org/packages/release/bioc/html/ontoProc.html)
documentation.

```{r tags, echo=TRUE, fig.show='hide'}
tags <- c(
    "CL:0009022", # Stromal
    "CL:0000236", # B cell
    "CL:0009080", # Tuft
    "CL:1000411", # Endothelial
    "CL:1000335", # Enterocyte
    "CL:1000326", # Goblet
    "CL:0002088", # ICC
    "CL:0009007", # Macrophage + DC
    "CL:1000343", # Paneth
    "CL:0000669", # Pericyte
    "CL:1000278", # Smooth Muscle
    "CL:0009017", # Stem + TA
    "CL:0000492", # T (CD4+)
    "CL:0000625", # T (CD8+)
    "CL:0017004" # Telocyte
)
opi <- graph_from_graphnel(onto_plot2(cl, tags))
```

In the `opi` object, there are also instances from other ontologies 
(CARO and BFO) that need to be removed. 

Moreover, the names of the leaf nodes of the ontology must correspond to the 
names used for the annotation.

```{r build-ontology}
## Delete CARO and BFO instances
sel_ver <- V(opi)$name[c(grep("CARO", V(opi)$name), grep("BFO", V(opi)$name))]
opi1 <- opi - sel_ver

## Rename vertex to match annotations
V(opi1)$name[grep("CL:0000236", V(opi1)$name)] <- "B cell"
V(opi1)$name[grep("CL:1000411", V(opi1)$name)] <- "Endothelial"
V(opi1)$name[grep("CL:1000335", V(opi1)$name)] <- "Enterocyte"
V(opi1)$name[grep("CL:1000326", V(opi1)$name)] <- "Goblet"
V(opi1)$name[grep("CL:0002088", V(opi1)$name)] <- "ICC"
V(opi1)$name[grep("CL:0009007", V(opi1)$name)] <- "Macrophage + DC"
V(opi1)$name[grep("CL:1000343", V(opi1)$name)] <- "Paneth"
V(opi1)$name[grep("CL:0000669", V(opi1)$name)] <- "Pericyte"
V(opi1)$name[grep("CL:1000278", V(opi1)$name)] <- "Smooth Muscle"
V(opi1)$name[grep("CL:0009017", V(opi1)$name)] <- "Stem + TA"
V(opi1)$name[grep("CL:0009022", V(opi1)$name)] <- "Stromal"
V(opi1)$name[grep("CL:0000492", V(opi1)$name)] <- "T (CD4+)"
V(opi1)$name[grep("CL:0000625", V(opi1)$name)] <- "T (CD8+)"
V(opi1)$name[grep("CL:0017004", V(opi1)$name)] <- "Telocyte"
V(opi1)$name[grep("CL:0009080", V(opi1)$name)] <- "Tuft"

## Add the edge from connective tissue cell and telocyte and delete redundant
## nodes
opi1 <- add_edges(opi1, c("connective\ntissue cell\nCL:0002320", "Telocyte"))
gr <- as_graphnel(opi1)
opi2 <- opi1 - c(
    "somatic\ncell\nCL:0002371", "contractile\ncell\nCL:0000183",
    "native\ncell\nCL:0000003"
)


V(opi2)$name <- trimws(gsub("CL:.*|\\n", " ", V(opi2)$name))

gr1 <- as_graphnel(opi2)

## Plot the final ontology
attrs <- list(node = list(shape = "box", fontsize = 15, fixedsize = FALSE))
plot(gr1, attrs = attrs)
```

## Preprocess data

Modify the colData variable `leiden_final` to unify B cells and enterocytes

```{r preprocess}
spe_baysor$cell_type <- spe_baysor$leiden_final
spe_baysor$cell_type[spe_baysor$cell_type %in% c(
    "B (Follicular, Circulating)",
    "B (Plasma)"
)] <- "B cell"
spe_baysor$cell_type[grep("Enterocyte", spe_baysor$cell_type)] <- "Enterocyte"
spe_baysor <- spe_baysor[, spe_baysor$cell_type %notin% c(
    "Removed",
    "Myenteric Plexus"
)]
spe_baysor

# See frequencies of cell types
table(spe_baysor$cell_type)
```

Now randomly split the data into reference and query data. 
```{r split}
set.seed(1636)
ref <- sample(seq_len(ncol(spe_baysor)), 600)

spe_ref <- spe_baysor[, ref]
spe_query <- spe_baysor[, -ref]
# Reference data
spe_ref
# Query data
spe_query
```


## Fit a classification model

We now need to build a classification model. To this aim, we'll draw a random
sample of 300 cells from the reference data, and train a multinomial
model on those. As explanatory variables we'll use the 50 most variable genes.
For this step, every statistical model or machine learning method can be used,
as long as it provides also estimated probabilities for each class.

```{r model, warning=FALSE}
# Randomly select 300 cells
set.seed(1704)
train <- sample(seq_len(ncol(spe_ref)), 300)
# Training data
spe_train <- spe_ref[, train]
spe_train

# get HVGs
spe_train <- logNormCounts(spe_train)
v <- modelGeneVar(spe_train)
hvg <- getTopHVGs(v, n = 50)

# Extract counts and convert data into a data.frame format
df_train <- as.data.frame(t(as.matrix(counts(spe_train[hvg, ]))))
df_train$Y <- spe_train$cell_type
table(df_train$Y)

# Fit multinomial model
fit <- vglm(Y ~ .,
    family = multinomial(refLevel = "B cell"),
    data = df_train
)
```

# Method 1: standard conformal inference

## Algorithm

As anticipated in the introduction, the first method uses split conformal
inference to build the prediction sets. It requires to split the reference data
into two subsets:

- training set: portion of the reference data to be used to build a classifier
$\hat{f}$ (the multinomial model in our example)
- calibration set: portion of the reference data to be used to calibrate the 
procedure.

Given $\hat{f}$ and the calibration data, the objective of conformal inference 
is to build, for a new non-annotated cell $X_{new}$, a prediction set 
$C(X_{new})$ that satisfies

\begin{equation}
P(Y_{new} \in C(X_{new})) \geq 1-\alpha,
\end{equation}

where $Y_{new}$ is the true label of the new cell and $\alpha$ is a user-chosen 
error level. The algorithm for split conformal inference is the following:

1. For the data in the calibration set, $(X_1, Y_1), \dots, (X_n, Y_n),$ obtain
the conformal scores $s_1=1-\hat{f}(X_i)_{Y_i}$ (i.e. 1 minus the estimated
probability that the model is assigning to the true label of the $i$-th cell).
These scores will be high when the model is assigning a small probability to 
the true class, and low otherwise.

2. Obtain $\hat{q}$ as the $\lceil(1-\alpha)(n+1)\rceil/n$ empirical quantile 
of the conformal scores.

3. Finally, for a new cell $X_{new}$, build a prediction set by including all 
the classes or which the estimated probability is higher than $1-\hat{q}$:
$$
C(X_{new}) = \left\{y: \hat{f}(X_{n+1})_y\geq 1-\hat{q}\right\}
$$

## Obtain prediction matrices

The first step to build conformal prediction sets is to obtain 
matrices with the estimated probabilities for each cell type for cells in the 
calibration data and in the query data. 
Each row of the matrix corresponds to a particular cell, while each row to a 
different cell type. The entry $p_{i,j}$ of the matrix indicates the estimated 
probability that the cell $i$ is of type $j$.

```{r predictions, warning=FALSE}
spe_cal <- spe_ref[, -train]
# Prediction matrix for calibration data
df_cal <- as.data.frame(t(as.matrix(counts(spe_cal[hvg, ]))))
p_cal <- predict(fit, newdata = df_cal, type = "response")
head(round(p_cal, 3))
# Prediction matrix for query data
df_test <- as.data.frame(t(as.matrix(counts(spe_query[hvg, ]))))
p_test <- predict(fit, newdata = df_test, type = "response")
head(round(p_test, 3))
```


## Obtain conformal prediction sets

We can now directly call the `getPredictionSet` function by using as input the
prediction matrices for the calibration and the query dataset. In this case, 
the output of the function will be a list whose elements are the prediction 
sets for each query cell.

By setting `follow_ontology=FALSE`, we are asking the function to return
prediction sets obtained via split conformal inference.
The parameter `onto` is not necessary with this method, but if the ontology is
provided then it will be used by the function to retrieve the considered cell 
types. Alternatively, we can explicitly provide the labels with the
parameter `labels`. The parameter `alpha` indicates the allowed miscoverage
percentage. For example, if we set `alpha=0.1`, it means 
that at most 10% of the prediction sets we obtain will not include the true 
label.

```{r predsets}
labels <- colnames(p_test)
sets <- getPredictionSets(
    x_query = p_test,
    x_cal = p_cal,
    y_cal = spe_cal$cell_type,
    alpha = 0.1,
    follow_ontology = FALSE,
    labels = labels
)

# See the first six prediction sets
sets[1:6]
```
Since in this example the cell labels are available, we can explicitly check 
that the constraint on the miscoverage is satisfied by computing the 
proportion of sets that include the true label. We set `alpha=0.1`, so we
expect this proportion to be higher than 0.9.

```{r cvg}
# Check coverage
cvg <- rep(NA, length(sets))
for (i in seq_len(length(sets))) {
    cvg[i] <- spe_query$cell_type[i] %in% sets[[i]]
}
mean(cvg)
```


As an alternative, we can provide as input a `SingleCellExperiment` object. 
In this case, it needs to have the estimated probabilities for 
each cell type in the `colData`.
The names of these `colData` have to correspond to the names of the leaf nodes 
in the ontology. 

The output will be a `SingleCellExperiment` object with a new variable in the 
`colData` containing the prediction sets. The name of this variable can
be assigned trough the `pr_name` parameter.

```{r predsets-sc}
# Retrieve labels as leaf nodes of the ontology
labels <- V(opi2)$name[degree(opi2, mode = "out") == 0]

# Create corresponding colData
for (i in labels) {
    colData(spe_cal)[[i]] <- p_cal[, i]
    colData(spe_query)[[i]] <- p_test[, i]
}

# Create prediction sets
spe_query <- getPredictionSets(
    x_query = spe_query,
    x_cal = spe_cal,
    y_cal = spe_cal$cell_type,
    alpha = 0.1,
    follow_ontology = FALSE,
    pr_name = "pred_set",
    labels = labels
)

# See the new variable pred_set into the colData
head(colData(spe_query))
```

# Method 2: exploit the cell ontology

## Algorithm

Let $\hat{y}(x)$ be the class with the maximum estimated probability. Moreover,
given a directed graph let $P(v)$ and $A(v)$ be the set
of children nodes and ancestor nodes of $v$, respectively. Finally, for each 
node $v$ define a score $g(v,x)$ as the sum of the predicted probabilities of
the leaf nodes that are children of $v$. We build the sets as follows:

$$
P(v) \cup \{P(a): a\in A(\hat{y}(x)): 
g(a,x)\leq\lambda \},
$$
where $v:v\in A(\hat{y}(x)), \;g(v,x)\geq\lambda$ and 
$v=\arg\min_{u:g(u,x)\geq\lambda}g(u,x)$.

In words, we start from the predicted class and we go up in the graph until we 
find an ancestor of $\hat{y}(x)$ that has a score that is at least $\lambda$
and include in the prediction sets all its children. Then we add to this 
subgraph the other ones that contain $\hat{y}(x)$ for which the score is less 
than $\lambda$. To choose $\lambda$, we follow eq. (4) in @angelopoulos2025conformalriskcontrol, considering the miscoverage as loss function.
See @corbetta2025conformal for more details 
on the algorithm and a discussion on the rationale of the method.

## Obtain hierarchical prediction sets

To switch to the hierarchical algorithm, we just need to set 
`follow_ontology=TRUE` and provide an ontology to the `onto` parameter.
As in the previous case, if the input is a `SingleCellExperiment` object then 
the output will be again a `SingleCellExperiment` object with a new variable
containing the prediction sets in the colData.

```{r predset-hier}
spe_query <- getPredictionSets(
    x_query = spe_query,
    x_cal = spe_cal,
    y_cal = spe_cal$cell_type,
    onto = opi2,
    alpha = 0.1,
    follow_ontology = TRUE,
    method = "full",
    pr_name = "pred_set_hier",
    BPPARAM = MulticoreParam(workers = 2),
    simplify = FALSE
)

# See the first six prediction sets
head(spe_query$pred_set_hier)
```

Also in this case, we can check the coverage:

```{r cvg1}
# Check coverage
cvg1 <- rep(NA, length(spe_query$pred_set_hier))
for (i in seq_len(length(spe_query$pred_set_hier))) {
    cvg1[i] <- spe_query$cell_type[i] %in% spe_query$pred_set_hier[[i]]
}

mean(cvg1)
```

As an alternative, we can choose to return only the common ancestor, instead 
of the entire set of predicted labels. This can be done directly in the 
function by setting `simplify=TRUE`.
Alternatively, we can convert the obtained prediction sets invoking
the function `getCommonAncestor`.

```{r common-ancestor}
spe_query$pred_set_hier_simp <- vapply(
    spe_query$pred_set_hier,
    function(x) getCommonAncestor(x, opi2),
    character(1)
)
head(spe_query$pred_set_hier_simp)
```

# Visualization

Let's now visualize the obtained prediction sets. The simplest way is to just
ask to color the labels included in the prediction set. Let's compare 
the prediction sets for one cell obtained with the two methods

```{r plotres}
plotResult(spe_query$pred_set[[75]], opi2,
    col_grad = "pink", attrs = attrs, add_scores = FALSE,
    title = "Conformal Prediction set"
)
```

```{r plotres2}
plotResult(spe_query$pred_set_hier[[75]], opi2,
    col_grad = "pink", attrs = attrs, add_scores = FALSE,
    title = "Hierarchical prediction set"
)
```

We can now instead give as input also the estimated probabilities and a 
gradient. The function will color the cells according to the gradient, with 
stronger colors corresponding to higher probabilities. If `add_scores=TRUE`,
then the estimated probabilities will be added to the labels. 

```{r plotres3}
plotResult(spe_query$pred_set[[75]], opi2,
    probs = p_test[75, ],
    col_grad = c("lemonchiffon", "orange", "darkred"),
    attrs = attrs, add_scores = TRUE,
    title = "Conformal Prediction set"
)
```
```{r plotres4}
plotResult(spe_query$pred_set_hier[[75]], opi2,
    probs = p_test[75, ],
    col_grad = c("lemonchiffon", "orange", "darkred"),
    attrs = attrs, add_scores = TRUE,
    title = "Hierarchical Prediction set"
)
```


# R.session Info {.unnumbered}

```{r SessionInfo, echo=FALSE, message=FALSE, warning=FALSE, comment=NA}
sessionInfo()
```