Introduction
Here we present the R package RUCova, a novel method designed to
address confounding factors such as heterogeneous cell size and staining
efficiency in mass cytometry data. RUCova removes unwanted covariance
using multivariate linear regression based on Surrogates of Unwanted
Covariance (SUCs), and Principal Component Analysis (PCA).
RUCova comprises two major steps:
First, it fits a multivariate model for each measured marker
(\(m\)) across cells (\(i\)) from samples (\(j_i\)) with respect to the surrogates of
sources of unwanted covariance (SUC, \(\vec{x_i}\)). Samples \(j_i\) can be either different cell lines,
perturbations, conditions, metacells (clusters), or even
batches.
Second, it eliminates such dependency by assigning the residuals
\(\epsilon\) of the model as the new
modified expression of the marker. The fit can be expressed as: \[ y^{m}_i(\vec{x}_{i},j_i) = \underbrace{O^m(j_i)
+ S^m(\vec{x}_{i},j_i) }_{M(\vec{x}_{i},j_i)} + \epsilon^m_i\ \],
where \(S^m(\vec{x}_{i},j_i)\)
describes the slope of the fit and \(O^m(j_i)\) the intercept or
offset.
The predictors are the surrogates of sources unwanted covariance
(SUC, \(\vec{x}\)), which can be either
specific markers as proxies of the confounding factors or the principal
components (PCs) derived from a Principal Component Analysis (PCA)
performed on such markers.
RUCova offers 3 different uni- or multi-variate linear models to
describe the relationship between marker expression and SUC: (1) \(M_1(\vec{x}_{i})\): simple, (2) \(M_2(\vec{x}_i,j_i)\): offset, and (3) \(M_3(\vec{x}_i,j_i)\): interaction. In this
vignette you can find a detailed description of these methods and their
differences.
The RUCova method eliminates the dependency of each measured marker
on the SUCs by computing the model’s residuals and the intercept as the
revised expression for each marker (\(y^{m*}_i\)).:
\[ y^{*m}_i(j_i) = O_1^m(j_i) +
\epsilon^m_i \], where \(\epsilon^m_i\) are the residuals of the
model.
In this vignette, we guide you step-by-step through it so you can
make good use this package and reliably analyse single-cell mass
cytometry data.
Citation
If you use RUCova in published research, please cite:
RUCova: Removal of Unwanted Covariance in mass cytometry data Rosario
Astaburuaga-García, Thomas Sell, Samet Mutlu, Anja Sieber, Kirsten
Lauber, Nils Blüthgen. Bioinformatics 2024; doi: https://doi.org/10.1101/2024.05.24.595717
Installation
To install the stable release of RUCova from Bioconductor
(recommended), run:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("RUCova")
To install the latest development version directly from GitHub,
use:
remotes::install_github("molsysbio/RUCova@devel", force = TRUE)
library(RUCova)
library(ggplot2)
library(tidyr)
library(tibble)
library(dplyr)
library(SingleCellExperiment)
theme_set(theme_classic())
Output data
The output of the function rucova is a
SingleCellExperiment containing the same information as the input
SingleCellExperiment, plus:
Additional assay named name_assay_after containing the
modified marker intensity values in linear scale.
A metadata list named model_ concatenated with
name_assay_after accesible via metadata(sce).
This list contains:
All the input variables: name_assay_before,
markers, SUCs, name_reduced_dim,
apply_asinh_SUCs, model,
col_name_sample, center_SUCs,
keep_offset, `` (to remind you how you obtained this
result).
model_formula: the lm model formula applied
to each marker to regress-out surrogates of sources of unwanted
covariance.
model_coefficients: intercept \(O^m(j_i)\) and slope \(\alpha^m_{p}\) of the fit for surrogate
\(p\) and marker \(m\) (and sample \(j\) if applicable). This values are for
dummy variables. For meaningful coefficients, refer to
effective_coefficients.
model_residuals: the residuals of the fit \(\epsilon^m_i\).
adj_r2: adjusted R-squared or coefficient of
determination to evaluate the goodness of fit for each marker. It
quantify the proportion of the variance in the dependent variable
(marker) that is explained by the independent variables (surrogates) in
the model. The adjusted R-squared is adjusted by the number of
independent variables used to predict the target variable. This is done
to account for the automatic increase of \(R^2\) values when extra explanatory
variables are added to the model. By analysing the \(\rm{R^2_{adj}}\),we can determine whether
adding new variables increases the model fit. \[ R_{adj}^2 = \Big(1 -
\underbrace{(1-\frac{SS_{res}}{SS_{tot}}}_{R^2})\Big)\cdot\frac{n-1}{n-q-1}
\]
where \(SS_{res} = \sum_i
\epsilon_i^2\) is the residual sum of squares,\(\epsilon_i\) are the residuals of the
model, \(SS_{tot} = \sum_i(y_i -
\overline{y})\) is the total sum of squares, \(\overline{y}\) is the mean value of the
marker,\(n\)is the sample size (total
number of cells \(i\)) and \(q\) is the number of explanatory variables
in the model.
stand_slopes: We standardised the slope coefficients in
order to make them comparable. In the case of the interaction model,
where the slopes \(\alpha^m\) depend on
samples \(j_i\) and the SUC \(p\), we standardised the slope by
multiplying it by the standard deviation of the corresponding SUC in
each sample \(j\) ( \(\sigma_{x_{j_i,p}}\) ) and dividing it by
the standard deviation of the marker \(m\) in sample \(j\) (\(\sigma_{y^m_{j_i}}\)) : \[\alpha^{m*}_{j_i,x_{i,p}} =
\alpha^m_{j_i,x_{i,p}}\cdot
\frac{\sigma_{x_{j_i,p}}}{\sigma_{y^m_{j_i}}}\]
Examples
M1: Simple model
As we have multiple cancer cell lines in the current data set, we run
the simple model in only one cell line and across control and treated
condition.
sce_Cal33 <- sce[,sce$line == "Cal33"]
Let’s give a look at the pearson correlation coefficients between
markers before applying RUCova (symmetric matrix). RUCova offers the
option of returning the correlation coefficients on a matrix or directly
the heatmap, which is automatically plotted and can also be stored:
matrix_corr <- RUCova::compare_corr(sce_Cal33[c(m,x)])
heatmap_corr <- RUCova::heatmap_compare_corr(sce_Cal33[c(m,x)]) #same in lower and upper triangle
RUCova::heatmap_compare_corr(sce_Cal33[c(m,x)]) #same in lower and upper triangle

Remove correlations
with all surrogates
maintaing the logFC between treated and control
Let’s say you’re not convinced yet by the package, so you want to be
conservative and remove correlations to all SUCs but not the logFC
between conditions. In this case,
center_SUCs = "per_sample".
Now we can compare the new Pearson correlation coefficients
calculated from the assay counts_simple_persample (after
RUCova, upper triangle) to the original coefficients from the assay
``counts``` (lower triangle).
heatmap_compare_corr(sce_Cal33, name_assay_before = "counts", name_assay_after = "counts_simple_persample")

Log fold-changes between irradiated and control condition are kept
(positive means higher in irradiated).
FC_before <- t(assay(sce_Cal33,"counts")) |>
as.tibble() |>
cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
mutate(data = "before RUCova")
FC_after <- t(assay(sce_Cal33,"counts_simple_persample")) |>
as.tibble() |> cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
ungroup() |>
mutate(data = "simple all, per sample")
rbind(FC_before,FC_after) |>
ggplot(aes(x = logFC, y = marker, fill = data)) +
geom_col(position = "dodge")

changing logFC between samples accordingly
As radiation changes the cell volume, we think differences in protein
intensities between treated and control are confounded. Hence, we want
to remove any difference that correlates with the SUCs
(center_SUCs = "across_sample").
heatmap_compare_corr(sce_Cal33, name_assay_before = "counts", name_assay_after = "counts_simple_acrosssamples")

Log fold-changes between irradiated and control condition are
modified accordingly (positive means higher in irradiated).
FC_before <- t(assay(sce_Cal33,"counts")) |>
as.tibble() |>
cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
mutate(data = "before RUCova")
FC_after <- t(assay(sce_Cal33,"counts_simple_acrosssamples")) |>
as.tibble() |> cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
ungroup() |>
mutate(data = "simple all, per sample")
rbind(FC_before,FC_after) |>
ggplot(aes(x = logFC, y = marker, fill = data)) +
geom_col(position = "dodge")

with PC1 only
Let’s imagine you want to be conservative and only remove
correlations between markers and PC1 (of SUCs).
pca_cal33 <- t(assay(sce_Cal33,"counts")) |>
as.tibble() |>
cbind(colData(sce_Cal33)) |>
select(x) |>
mutate_all(asinh) |>
mutate_all(scale) |>
prcomp()
Calculate and plot the variance explained by each PC:
tibble(perc = as.numeric(pca_cal33$sdev^2/sum(pca_cal33$sdev^2))*100,
PC = 1:length(pca_cal33$sdev)) |>
ggplot(aes(x = PC, y = perc, label = round(perc,1))) +
geom_col() +
geom_label()

Check the loadings of each PC:
as.data.frame(pca_cal33$rotation) |>
rownames_to_column("x") |>
pivot_longer(names_to = "PC", values_to = "loadings", -x) |>
ggplot(aes(x = loadings, y = x)) +
geom_col() +
facet_wrap(~PC, nrow = 1)

In this example, PC1 has positive loadings. Meaning PC1 will
positively correlate with the markers, which is intuitive if we think of
it as the cell size. In case for your data set, PC1 has negative
loadings, you can just the direction for a more intuitive analysis:
pca_cal33$x |> as.data.frame() |> mutate(PC1 = -PC1) #variable not saved as not necessary here
Add the PCA to the sce object under the name “PCA”:
name_reduced_dim = "PCA"
reducedDim(sce_Cal33, name_reduced_dim) <- pca_cal33$x
Then, SUCs= "PC1" and
apply_asinh_SUCs = FALSE, as asinh transformation is not
necessary on PCs (it was applied on SUCs before PCA). This applies to
all models.
If we regress-out any PCs and want to check the correlation
coefficient, it is important we specify now the name for the heatmap
function to include it: ``name_reduced_dim = “PCA”```.
heatmap_compare_corr(sce_Cal33, name_assay_before = "counts", name_assay_after = "counts_simple_PC1", name_reduced_dim = "PCA")

Log fold-changes between irradiated and control condition are
modified accordingly (positive means higher in irradiated).
FC_before <- t(assay(sce_Cal33,"counts")) |>
as.tibble() |>
cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
mutate(data = "before RUCova")
FC_after <- t(assay(sce_Cal33,"counts_simple_PC1")) |>
as.tibble() |> cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
ungroup() |>
mutate(data = "simple all, per sample")
rbind(FC_before,FC_after) |>
ggplot(aes(x = logFC, y = marker, fill = data)) +
geom_col(position = "dodge")

Adjusted R-squared
Let’s compare the adjusted \(R²\)
per marker (goodness of fit, see complete definition above).
r2_a <- metadata(sce_Cal33)$model_counts_simple_acrosssamples$adjr2 |> mutate(model = "simple all, across samples")
r2_b <- metadata(sce_Cal33)$model_counts_simple_persample$adjr2|> mutate(model = "simple all, per sample")
r2_c <- metadata(sce_Cal33)$model_counts_simple_PC1$adjr2 |> mutate(model = "simple PC1, across sample")
rbind(rbind(r2_a,r2_b),r2_c) |>
ggplot(aes(x = adj_r_squared, y = marker, fill = model)) +
geom_col(position = "dodge")

In general, across all markers, the adjusted \(R^2\) is higher when using all SUCs as
predictive variables (instead of just PC1) and when SUCs are center
across samples (residuals are smaller).
rbind(rbind(r2_a,r2_b),r2_c) |>
ggplot(aes(y = adj_r_squared, x = model, color = model)) +
geom_boxplot()

The decision of which model parameters to use should be based on: (1)
the extent of variance to remove (all SUCs or some PCs) and (2) whether
to keep or not the FC between samples.
Standardised slopes
We can also compare the slopes for each marker in each model.
slope_a <- metadata(sce_Cal33)$model_counts_simple_acrosssamples$stand_slopes |> mutate(model = "simple all, across samples")
slope_b <- metadata(sce_Cal33)$model_counts_simple_persample$stand_slopes |> mutate(model = "simple all, per sample")
slope_c <- metadata(sce_Cal33)$model_counts_simple_PC1$stand_slopes |> mutate(model = "simple PC1, across sample")
rbind(slope_a, slope_b, slope_c) |>
mutate(coef_key = factor(coef_key, levels = c("mean_DNA", "mean_BC","pan_Akt", "total_ERK", "PC1"))) |>
ggplot(aes(x = stand_value, y = marker, fill = model)) +
geom_col(position = "dodge") +
facet_wrap(~coef_key, nrow = 1)

M2: Offset model
Here we just show one example: Let’s say different samples (dose)
within Cal33 are different batches, and we would like to adjust for it
in addition to removing unwanted correlations with all SUCs. Then
keep_offset = FALSE.
sce_Cal33 <- RUCova::rucova(sce = sce_Cal33,
name_assay_before = "counts",
markers = m,
SUCs = x,
apply_asinh_SUCs = TRUE,
keep_offset = FALSE,
model = "offset",
center_SUCs = "per_sample",
col_name_sample = "dose",
name_assay_after = "counts_offset_persample")
Fold-change between conditions (“dose”) becomes zero:
FC_before <- t(assay(sce_Cal33,"counts")) |>
as.tibble() |>
cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
mutate(data = "before RUCova") |>
ungroup()
FC_after <- t(assay(sce_Cal33,"counts_offset_persample")) |>
as.tibble() |>
cbind(colData(sce_Cal33)) |>
mutate_at(vars(x,m), asinh) |>
pivot_longer(names_to = "marker", values_to = "value", c(x,m)) |>
group_by(marker) |>
summarise(logFC = mean(value[dose=="10Gy"])-mean(value[dose=="0Gy"])) |>
ungroup() |>
mutate(data = "offset all, per sample")
rbind(FC_before, FC_after) |>
ggplot(aes(x = logFC, y = marker, fill = data)) +
geom_col(position = "dodge")

Different intercepts between samples:
metadata(sce_Cal33)$model_counts_offset_persample$eff_coefficients |>
filter(surrogate == FALSE) |>
ggplot(aes(x = eff_value, y = marker, fill = sample)) +
geom_col(position = "dodge") +
xlab("eff_value (intercept)")

M3: Interaction model
Here we just show one example, the least conservative: we eliminate
differences between cell lines that correlate with the SUCs
(center_SUCs = "across_samples") and apply one fit per cell
line (model = "interaction"). Then:
sce <- RUCova::rucova(sce = sce,
name_assay_before = "counts",
markers = m,
SUCs = x,
apply_asinh_SUCs = TRUE,
model = "interaction",
center_SUCs = "across_samples",
col_name_sample = "line",
name_assay_after = "counts_interaction_persample")
Different slopes for different cell lines:
metadata(sce)$model_counts_interaction_persample$eff_coefficients |>
filter(surrogate != FALSE) |>
ggplot(aes(x = eff_value, y = marker, fill = sample)) +
geom_col(position = "dodge") +
facet_wrap(~surrogate) +
xlab("eff_value (slope)")
