1 Introduction

Multi-donor single-cell RNA-seq studies are now the norm for disease studies, clinical trials, and population-level atlases. The standard approach to differential expression (DE) in these datasets is pseudo-bulk analysis: aggregate per-cell counts into per-donor profiles, then apply bulk DE tools like DESeq2 or edgeR.

This works well but has three underappreciated problems:

Problem 1: Speed. Bulk DE tools fit one model per gene in a serial R loop. On a dataset with 30,000 genes and 50 donors, this loop runs 30,000 times. At scale — 100k+ cells, multiple cell types — this becomes the analysis bottleneck.

Problem 2: Donor weighting. Standard pseudo-bulk tools treat a donor with 5 cells identically to a donor with 500 cells. The 5-cell pseudo-bulk is far noisier, yet gets equal weight in the DE model.

Problem 3: Paired designs. Many studies collect cells from the same donors under multiple conditions (e.g. before/after treatment, or stimulated/unstimulated). Naively aggregating by donor alone mixes conditions together, destroying the treatment signal.

scFastDE addresses all three:

  1. Vectorised testing — the entire pseudo-bulk matrix is passed to limma::lmFit in one call, which uses LAPACK routines to fit all gene models simultaneously.
  2. Cell-count weights — each sample is weighted by sqrt(n_cells), giving principled influence to well-represented samples.
  3. Sparse donor guard — donors with too few cells are removed before aggregation, preventing noisy pseudo-bulks from distorting results.
  4. Auto-detect paired designsfastDE automatically detects whether donors appear in multiple conditions and builds the appropriate linear model (paired ~ 0 + condition + donor or unpaired ~ 0 + condition).

2 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scFastDE")

3 Quick start: unpaired design

In an unpaired design, each donor belongs to exactly one condition (e.g. Donors 1–6 = healthy, Donors 7–12 = disease). scFastDE automatically detects this and uses a simple ~ 0 + condition model.

library(scFastDE)
library(SingleCellExperiment)

# Simulate a multi-donor, two-condition dataset (unpaired)
set.seed(2024)
n_genes <- 500
n_cells <- 120

counts <- matrix(rpois(n_genes * n_cells, 8L),
                 nrow = n_genes, ncol = n_cells)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
colnames(counts) <- paste0("Cell", seq_len(n_cells))

# Inject DE signal: first 20 genes upregulated in treatment
counts[1:20, 61:120] <- counts[1:20, 61:120] * 4L

sce <- SingleCellExperiment(assays = list(counts = counts))
sce$donor     <- rep(paste0("Donor", 1:12), each = 10)
sce$cell_type <- "CD4_Tcell"
sce$condition <- rep(c("ctrl", "treat"), each = 60)

sce
#> class: SingleCellExperiment 
#> dim: 500 120 
#> metadata(0):
#> assays(1): counts
#> rownames(500): Gene1 Gene2 ... Gene499 Gene500
#> rowData names(0):
#> colnames(120): Cell1 Cell2 ... Cell119 Cell120
#> colData names(3): donor cell_type condition
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

4 Step 1: Filter sparse donors

Before aggregation, remove donor-cell type combinations with too few cells.

sce <- filterSparseDonors(sce,
                           donor     = "donor",
                           cell_type = "cell_type",
                           min_cells = 5)
ncol(sce)
#> [1] 120

5 Step 2: Inspect pseudo-bulk profiles

pb <- fastPseudobulk(sce,
                      donor       = "donor",
                      cell_type   = "cell_type",
                      target_type = "CD4_Tcell")

cat("Pseudo-bulk matrix:", nrow(pb$pseudobulk),
    "genes x", ncol(pb$pseudobulk), "donors\n")
#> Pseudo-bulk matrix: 500 genes x 12 donors

cat("\nDonor weights (sqrt of cell count):\n")
#> 
#> Donor weights (sqrt of cell count):
print(round(pb$donor_weights, 2))
#>  Donor1  Donor2  Donor3  Donor4  Donor5  Donor6  Donor7  Donor8  Donor9 Donor10 
#>    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16 
#> Donor11 Donor12 
#>    3.16    3.16

Notice that all donors have equal cell counts here (10 each), so weights are identical. In real data, weights will vary substantially.

6 Step 3: Run differential expression

result <- fastDE(sce,
                  donor       = "donor",
                  cell_type   = "cell_type",
                  condition   = "condition",
                  target_type = "CD4_Tcell",
                  min_cells   = 5,
                  min_cpm     = 1,
                  min_donors  = 2)

result
#> FDEResult
#>   Genes tested   : 500 
#>   Samples        : 12 
#>   Significant    : 149 (adj.P.Val < 0.05)
#>   Cell type      : CD4_Tcell 
#>   Condition      : condition 
#>   Design         : unpaired
# Top significant genes
dt <- as.data.frame(deTable(result))
dt_sorted <- dt[order(dt$adj.P.Val), ]
head(dt_sorted[, c("gene", "logFC", "P.Value", "adj.P.Val")], 15)
#>          gene    logFC      P.Value    adj.P.Val
#> Gene10 Gene10 1.940020 2.392649e-89 1.196324e-86
#> Gene11 Gene11 1.869497 4.761455e-86 1.190364e-83
#> Gene5   Gene5 2.019865 1.125598e-84 1.875997e-82
#> Gene16 Gene16 1.850108 2.794630e-84 3.493288e-82
#> Gene6   Gene6 1.954038 3.604269e-84 3.604269e-82
#> Gene3   Gene3 1.940079 1.012996e-80 8.441630e-79
#> Gene2   Gene2 1.814565 9.484184e-80 5.977602e-78
#> Gene15 Gene15 1.917421 9.564163e-80 5.977602e-78
#> Gene7   Gene7 1.837197 6.391214e-79 3.550675e-77
#> Gene20 Gene20 1.918719 1.290645e-78 6.453225e-77
#> Gene17 Gene17 1.795905 1.479058e-77 6.722991e-76
#> Gene12 Gene12 1.874153 5.052708e-77 2.105295e-75
#> Gene8   Gene8 1.758143 5.758639e-77 2.214861e-75
#> Gene13 Gene13 1.850643 3.588716e-75 1.281684e-73
#> Gene14 Gene14 1.710195 1.754404e-73 5.848012e-72
# Summary
sig <- sum(dt$adj.P.Val < 0.05, na.rm = TRUE)
cat(sprintf("\n%d / %d genes significant (FDR < 0.05)\n",
            sig, nrow(dt)))
#> 
#> 149 / 500 genes significant (FDR < 0.05)
cat(sprintf("Injected genes recovered: %d / 20\n",
            sum(paste0("Gene", 1:20) %in%
                rownames(dt)[dt$adj.P.Val < 0.05])))
#> Injected genes recovered: 20 / 20

7 Step 4: Visualise results

plotDEResults(result, fdr_thresh = 0.05, lfc_thresh = 1, top_n = 10)
Volcano plot of DE results. Red = upregulated, blue = downregulated at FDR < 0.05 and |logFC| > 1.

Figure 1: Volcano plot of DE results
Red = upregulated, blue = downregulated at FDR < 0.05 and |logFC| > 1.

8 The donor weighting advantage

Here we demonstrate why weighting matters. We compare fastDE output when one donor has far fewer cells than the others.

# Create an unbalanced dataset
set.seed(99)
sce_unbal <- sce
# Simulate donor 1 having only 2 cells by keeping just 2
keep <- c(which(sce$donor != "Donor1"),
          which(sce$donor == "Donor1")[1:2])
sce_unbal <- sce_unbal[, keep]

pb_unbal <- fastPseudobulk(sce_unbal,
                             donor       = "donor",
                             cell_type   = "cell_type",
                             target_type = "CD4_Tcell")

cat("Donor weights (unbalanced):\n")
#> Donor weights (unbalanced):
print(round(pb_unbal$donor_weights, 2))
#>  Donor2  Donor3  Donor4  Donor5  Donor6  Donor7  Donor8  Donor9 Donor10 Donor11 
#>    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16 
#> Donor12  Donor1 
#>    3.16    1.41

Donor1’s weight (sqrt(2) ≈ 1.41) is much lower than the other donors (sqrt(10) ≈ 3.16), so its noisy pseudo-bulk has proportionally less influence on the DE estimates — without being discarded entirely.

9 Paired design support

Many single-cell experiments use a paired design where the same donors contribute cells under multiple conditions — for example, stimulated vs unstimulated PBMCs from the same individuals (Kang et al. 2018).

scFastDE automatically detects paired designs: when the same donor IDs appear in more than one condition, fastDE switches to a paired model that:

  1. Aggregates pseudo-bulk per donor × condition pair (not per donor)
  2. Builds a ~ 0 + condition + donor design matrix that blocks on donor
  3. Extracts the treatment effect while accounting for inter-donor variation

This is critical because naively aggregating by donor alone would mix conditions together, destroying the treatment signal entirely.

9.1 Simulated paired design

# Simulate paired data: SAME 6 donors in BOTH conditions
set.seed(2025)
n_genes <- 500
n_donors <- 6
cells_per_donor_cond <- 15
n_cells <- n_donors * 2 * cells_per_donor_cond  # 180 cells

counts <- matrix(rpois(n_genes * n_cells, 8L),
                 nrow = n_genes, ncol = n_cells)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
colnames(counts) <- paste0("Cell", seq_len(n_cells))

# Inject DE signal: first 25 genes upregulated in stimulated condition
stim_cols <- seq(n_donors * cells_per_donor_cond + 1, n_cells)
counts[1:25, stim_cols] <- counts[1:25, stim_cols] * 4L

sce_paired <- SingleCellExperiment(assays = list(counts = counts))

# KEY: same donors (D1–D6) appear in BOTH conditions
sce_paired$donor <- rep(
    rep(paste0("D", seq_len(n_donors)), each = cells_per_donor_cond),
    2
)
sce_paired$cell_type <- "Monocyte"
sce_paired$condition <- rep(c("ctrl", "stim"),
                             each = n_donors * cells_per_donor_cond)

# Verify: each donor appears in both conditions
cat("Cells per donor and condition:\n")
#> Cells per donor and condition:
print(table(sce_paired$donor, sce_paired$condition))
#>     
#>      ctrl stim
#>   D1   15   15
#>   D2   15   15
#>   D3   15   15
#>   D4   15   15
#>   D5   15   15
#>   D6   15   15

9.2 Pseudo-bulk by donor × condition

When the condition argument is passed, fastPseudobulk creates one pseudo-bulk column per donor-condition pair:

pb_paired <- fastPseudobulk(sce_paired,
                              donor       = "donor",
                              cell_type   = "cell_type",
                              target_type = "Monocyte",
                              condition   = "condition")

cat(sprintf("Pseudo-bulk: %d genes × %d samples\n",
            nrow(pb_paired$pseudobulk),
            ncol(pb_paired$pseudobulk)))
#> Pseudo-bulk: 500 genes × 12 samples
cat("  (6 donors × 2 conditions = 12 samples)\n\n")
#>   (6 donors × 2 conditions = 12 samples)

cat("Sample info:\n")
#> Sample info:
print(pb_paired$sample_info)
#>    sample_id donor condition
#> 1  D1___ctrl    D1      ctrl
#> 2  D2___ctrl    D2      ctrl
#> 3  D3___ctrl    D3      ctrl
#> 4  D4___ctrl    D4      ctrl
#> 5  D5___ctrl    D5      ctrl
#> 6  D6___ctrl    D6      ctrl
#> 7  D1___stim    D1      stim
#> 8  D2___stim    D2      stim
#> 9  D3___stim    D3      stim
#> 10 D4___stim    D4      stim
#> 11 D5___stim    D5      stim
#> 12 D6___stim    D6      stim

9.3 Running fastDE on paired data

fastDE auto-detects the paired design and reports it:

result_paired <- fastDE(sce_paired,
                         donor       = "donor",
                         cell_type   = "cell_type",
                         condition   = "condition",
                         target_type = "Monocyte",
                         min_cells   = 5,
                         min_cpm     = 1,
                         min_donors  = 2)

result_paired
#> FDEResult
#>   Genes tested   : 500 
#>   Samples        : 12 
#>   Significant    : 343 (adj.P.Val < 0.05)
#>   Cell type      : Monocyte 
#>   Condition      : condition 
#>   Design         : paired

Note the output shows Design: paired and Samples: 12 (not 6).

dt_paired <- as.data.frame(deTable(result_paired))
dt_paired_sorted <- dt_paired[order(dt_paired$adj.P.Val), ]

sig_paired <- sum(dt_paired$adj.P.Val < 0.05, na.rm = TRUE)
recovered  <- sum(paste0("Gene", 1:25) %in%
                  rownames(dt_paired)[dt_paired$adj.P.Val < 0.05])

cat(sprintf("Significant genes (FDR < 0.05): %d / %d\n",
            sig_paired, nrow(dt_paired)))
#> Significant genes (FDR < 0.05): 343 / 500
cat(sprintf("Injected DE genes recovered:    %d / 25\n", recovered))
#> Injected DE genes recovered:    25 / 25
plotDEResults(result_paired, fdr_thresh = 0.05, lfc_thresh = 1,
             top_n = 10)
Volcano plot for paired design. The paired model correctly blocks on donor, recovering the injected DE signal.

(#fig:paired_volcano)Volcano plot for paired design. The paired model correctly blocks on donor, recovering the injected DE signal.

9.4 Why paired design matters

The paired model accounts for donor-level variation (age, genetics, etc.) as a blocking factor. This has two major benefits:

  • More power: by removing inter-donor noise, the residual variance is smaller, making it easier to detect true condition effects.
  • Correct aggregation: without condition-aware aggregation, ctrl and stim cells from the same donor would be mixed into a single pseudo-bulk, washing out the treatment signal completely.

On real data like the Kang et al. 2018 IFN-β PBMC dataset, the paired model recovers 5,550 DE genes (including all 10 known IFN-β response genes with logFCs of 2–8), whereas naively aggregating by donor alone recovers zero.

10 Session information

sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.34.0 SummarizedExperiment_1.42.0
#>  [3] Biobase_2.72.0              GenomicRanges_1.64.0       
#>  [5] Seqinfo_1.2.0               IRanges_2.46.0             
#>  [7] S4Vectors_0.50.0            BiocGenerics_0.58.0        
#>  [9] generics_0.1.4              MatrixGenerics_1.24.0      
#> [11] matrixStats_1.5.0           scFastDE_0.99.0            
#> [13] BiocStyle_2.40.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         SparseArray_1.12.2  lattice_0.22-9     
#>  [4] magrittr_2.0.5      digest_0.6.39       evaluate_1.0.5     
#>  [7] grid_4.6.0          RColorBrewer_1.1-3  bookdown_0.46      
#> [10] fastmap_1.2.0       jsonlite_2.0.0      Matrix_1.7-5       
#> [13] limma_3.68.1        tinytex_0.59        BiocManager_1.30.27
#> [16] scales_1.4.0        codetools_0.2-20    jquerylib_0.1.4    
#> [19] abind_1.4-8         cli_3.6.6           rlang_1.2.0        
#> [22] XVector_0.52.0      withr_3.0.2         cachem_1.1.0       
#> [25] DelayedArray_0.38.1 yaml_2.3.12         otel_0.2.0         
#> [28] S4Arrays_1.12.0     tools_4.6.0         parallel_4.6.0     
#> [31] BiocParallel_1.46.0 dplyr_1.2.1         ggplot2_4.0.3      
#> [34] vctrs_0.7.3         R6_2.6.1            magick_2.9.1       
#> [37] lifecycle_1.0.5     pkgconfig_2.0.3     pillar_1.11.1      
#> [40] bslib_0.10.0        gtable_0.3.6        Rcpp_1.1.1-1.1     
#> [43] glue_1.8.1          statmod_1.5.1       tidyselect_1.2.1   
#> [46] tibble_3.3.1        xfun_0.57           knitr_1.51         
#> [49] dichromat_2.0-0.1   farver_2.1.2        htmltools_0.5.9    
#> [52] labeling_0.4.3      rmarkdown_2.31      compiler_4.6.0     
#> [55] S7_0.2.2