---
title: "tidybulk: An R tidy framework for modular transcriptomic data analysis"
author: "Stefano Mangiola"
date: "`r Sys.Date()`"
package: tidybulk
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: references.bib
link-citations: true

keywords: "transcriptomics, RNA-seq, differential expression, data analysis, tidyverse, SummarizedExperiment, 
  bioinformatics, genomics, gene expression, clustering, dimensionality reduction, cellularity analysis, 
  gene enrichment, R package"
vignette: >
  %\VignetteIndexEntry{Overview of the tidybulk package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<!-- badges: start -->
[![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing) 
[![R build status](https://github.com/stemangiola/tidybulk/workflows/R-CMD-check/badge.svg)](https://github.com/stemangiola/tidybulk/actions/)
[![Bioconductor status](https://bioconductor.org/shields/build/release/bioc/tidybulk.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/tidybulk/)
<!-- badges: end -->


**tidybulk** is a powerful R package designed for modular transcriptomic data analysis that brings transcriptomics to the tidyverse.

## Why tidybulk?

Tidybulk provides a unified interface for comprehensive transcriptomic data analysis with seamless integration of SummarizedExperiment objects and tidyverse principles. It streamlines the entire workflow from raw data to biological insights.

## Functions/utilities available

### Abundance Normalization Functions
Function | Description
------------ | -------------
`scale_abundance()` | Scale abundance data
`quantile_normalise_abundance()` | Quantile normalization
`adjust_abundance()` | Adjust abundance for unwanted variation
`fill_missing_abundance()` | Fill missing abundance values
`impute_missing_abundance()` | Impute missing abundance values

### Filtering and Selection Functions
Function | Description
------------ | -------------
`identify_abundant()` | Identify abundant transcripts without removing them
`keep_abundant()` | Keep abundant transcripts
`keep_variable()` | Keep variable transcripts
`filterByExpr()` | Filter by expression

### Dimensionality Reduction Functions
Function | Description
------------ | -------------
`reduce_dimensions()` | Reduce dimensions with PCA/MDS/tSNE/UMAP
`rotate_dimensions()` | Rotate dimensions
`remove_redundancy()` | Remove redundant features

### Clustering Functions
Function | Description
------------ | -------------
`cluster_elements()` | Cluster elements with various methods
`kmeans clustering` | K-means clustering
`SNN clustering` | Shared nearest neighbor clustering
`hierarchical clustering` | Hierarchical clustering
`DBSCAN clustering` | Density-based clustering

### Differential Analysis Functions
Function | Description
------------ | -------------
`test_differential_expression()` | Test differential expression with various methods

### Cellularity Analysis Functions
Function | Description
------------ | -------------
`deconvolve_cellularity()` | Deconvolve cellularity with various methods
`cibersort()` | CIBERSORT analysis

### Gene Enrichment Functions
Function | Description
------------ | -------------
`test_gene_enrichment()` | Test gene enrichment
`test_gene_overrepresentation()` | Test gene overrepresentation
`test_gene_rank()` | Test gene rank

### Utility Functions
Function | Description
------------ | -------------
`describe_transcript()` | Describe transcript characteristics
`get_bibliography()` | Get bibliography
`resolve_complete_confounders_of_non_interest()` | Resolve confounders

### Validation and Utility Functions
Function | Description
------------ | -------------
`check_if_counts_is_na()` | Check if counts contain NA values
`check_if_duplicated_genes()` | Check for duplicated genes
`check_if_wrong_input()` | Validate input data
`log10_reverse_trans()` | Log10 reverse transformation
`logit_trans()` | Logit transformation

All functions are directly compatible with `SummarizedExperiment` objects and follow tidyverse principles for seamless integration with the tidyverse ecosystem.

### Scientific Citation

Mangiola, Stefano, Ramyar Molania, Ruining Dong, Maria A. Doyle, and Anthony T. Papenfuss. 2021. "Tidybulk: An R tidy framework for modular transcriptomic data analysis." Genome Biology 22 (42). https://doi.org/10.1186/s13059-020-02233-7

[Genome Biology - tidybulk: an R tidy framework for modular transcriptomic data analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7)

# Full vignette

[HERE](https://stemangiola.github.io/tidybulk/)

# Installation Guide

**Bioconductor**

```{r install-bioconductor, message=FALSE, eval=FALSE}
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install("tidybulk")
```

**Github**

```{r install-github, message=FALSE, eval=FALSE}
devtools::install_github("stemangiola/tidybulk")
```


```{r setup-libraries-and-theme, echo=FALSE, include=FALSE}
library(knitr)
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(magrittr)
library(forcats)
library(ggplot2)
library(ggrepel)
library(SummarizedExperiment)
library(tidybulk)

# Define theme
my_theme = 	
	theme_bw() +
	theme(
		panel.border = element_blank(),
		axis.line = element_line(),
		panel.grid.major = element_line(linewidth = 0.2),
		panel.grid.minor = element_line(linewidth = 0.1),
		text = element_text(size=12),
		legend.position="bottom",
		aspect.ratio=1,
		strip.background = element_blank(),
		axis.title.x  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
		axis.title.y  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10))
	)
```

In this vignette we will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells. This dataset is available in the [airway](https://bioconductor.org/packages/airway/) package.

```{r load airway}
library(airway)
data(airway)
```

This workflow, will use the [tidySummarizedExperiment](https://bioconductor.org/packages/tidySummarizedExperiment/) package to manipulate the data in a `tidyverse` fashion. This approach streamlines the data manipulation and analysis process, making it more efficient and easier to understand.

```{r load tidySummarizedExperiment}
library(tidySummarizedExperiment)
```

Here we will add a gene symbol column to the `airway` object. This will be used to interpret the differential expression analysis, and to deconvolve the cellularity.

```{r add-gene-symbol}
library(org.Hs.eg.db)
library(AnnotationDbi)

# Add gene symbol and entrez
airway <-
  airway |>
  
  mutate(entrezid = mapIds(org.Hs.eg.db,
                                      keys = gene_name,
                                      keytype = "SYMBOL",
                                      column = "ENTREZID",
                                      multiVals = "first"
)) 


detach("package:org.Hs.eg.db", unload = TRUE)
detach("package:AnnotationDbi", unload = TRUE)
```


# Comprehensive Example Pipeline

This vignette demonstrates a complete transcriptomic analysis workflow using tidybulk, with special emphasis on differential expression analysis.

## Data Overview

We will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells:

```{r data-overview}
airway
```

Loading `tidySummarizedExperiment` makes the `SummarizedExperiment` objects compatible with tidyverse tools while maintaining its `SummarizedExperiment` nature. This is useful because it allows us to use the `tidyverse` tools to manipulate the data.

```{r check-se-class}
class(airway)
```

### Prepare Data for Analysis

Before analysis, we need to ensure our variables are in the correct format:

```{r convert-condition-to-factor}
# Convert dex to factor for proper differential expression analysis
airway = airway |>
  mutate(dex = as.factor(dex))
```

### Visualize Raw Counts
Visualize the distribution of raw counts before any filtering:

```{r plot-raw-counts, fig.width = 10, fig.height = 10}
airway |>
  ggplot(aes(counts + 1, group = .sample, color = dex)) +
  geom_density() +
  scale_x_log10() +
  my_theme +
  labs(title = "Raw counts by treatment (before any filtering)")
```

## Step 1: Data Preprocessing

### Aggregate Duplicated Transcripts (optional)
Aggregate duplicated transcripts (e.g., isoforms, ensembl IDs):

> Transcript aggregation is a standard bioinformatics approach for gene-level summarization.

```{r preprocessing-aggregate-duplicates}
# Aggregate duplicates
airway = airway |> aggregate_duplicates(feature = "gene_name", aggregation_function = mean)
```

### Abundance Filtering

Abundance filtering can be performed using established methods [@robinson2010edger; @chen2016edgeR].

#### Run multiple methods

```{r filtering-abundance-methods}
# Default (simple filtering)
airway_abundant_default = airway |> keep_abundant(minimum_counts = 10, minimum_proportion = 0.5)

# With factor_of_interest (recommended for complex designs)
airway_abundant_formula = airway |> keep_abundant(minimum_counts = 10, minimum_proportion = 0.5, factor_of_interest = dex)

# With CPM threshold (using design parameter)
airway_abundant_cpm = airway |> keep_abundant(minimum_count_per_million  = 10, minimum_proportion = 0.5, factor_of_interest = dex)

```

#### Compare methods 

```{r filtering-summary-statistics, fig.width = 10, fig.height = 10, warning = FALSE}
# Example: summary for default tidybulk filtering
# Before filtering
airway |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)

# After filtering
airway_abundant_default |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)

airway_abundant_formula |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)

airway_abundant_cpm |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)
```

```{r filtering-density-plot-comparison, fig.width = 10, fig.height = 10}
# Merge all methods into a single tibble
airway_abundant_all = 
  bind_rows(
    airway |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "no filter"),
    airway_abundant_default |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "default"),
    airway_abundant_formula |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "formula"),
    airway_abundant_cpm |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "cpm")
  )

# Density plot across methods
airway_abundant_all |> 
  ggplot(aes(counts + 1, group = .sample, color = method)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~fct_relevel(method, "no filter", "default", "formula", "cpm")) +
    my_theme +
    labs(title = "Counts after abundance filtering (tidybulk default)")
```

Update the `airway` object with the filtered data:

```{r update-se-mini-with-filtered}
airway = airway_abundant_formula
```

> **Tip:** Use `formula_design` for complex designs, and use the CPM threshold for library-size-aware filtering.

### Remove Redundant Transcripts

Redundancy removal is a standard approach for reducing highly correlated features.

```{r preprocessing-remove-redundancy}
airway_non_redundant = 
  airway |> 
  remove_redundancy(method = "correlation", top = 100, of_samples = FALSE) 

  # Make  

airway |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)

# Summary statistics
airway_non_redundant |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)

# Plot before and after
# Merge before and after into a single tibble
airway_all = bind_rows(
  airway |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |>  mutate(method = "before"),
  airway_non_redundant |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |>  mutate(method = "after")
)

# Density plot
airway_all |>
  ggplot(aes(counts + 1, group = .sample, color = method)) +
  geom_density() +
  scale_x_log10() +
  facet_wrap(~fct_relevel(method, "before", "after")) +
  my_theme +
  labs(title = "Counts after removing redundant transcripts")

```


### Filter Variable Transcripts
Keep only the most variable transcripts for downstream analysis.

Variance-based feature selection using edgeR methodology [@robinson2010edger] is used for selecting informative features.

```{r filtering-keep-variable}
airway_variable = airway |> keep_variable()
```

### Visualize After Variable Filtering Variable Transcripts (optional)
```{r filtering-variable-summary-and-plot, fig.width = 10, fig.height = 10}
# Before filtering
airway |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)

# After filtering
airway_variable |> summarise(
  n_features = n_distinct(.feature),
  min_count = min(counts),
  median_count = median(counts),
  max_count = max(counts)
)

# Density plot
# Merge before and after into a single tibble
airway_all = bind_rows(
  airway |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "before"),
  airway_variable |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "after")
)

# Density plot
airway_all |>
  ggplot(aes(counts + 1, group = .sample, color = method)) +
  geom_density() +
  scale_x_log10() +
  facet_wrap(~fct_relevel(method, "before", "after")) +
  my_theme +
  labs(title = "Counts after variable filtering")
```


### Scale Abundance
Scale for sequencing depth using TMM [@robinson2010edger], upper quartile [@bullard2010uq], and RLE [@anders2010rle] normalization.

```{r normalization-scale-abundance}
airway = 
airway |> 
	scale_abundance(method = "TMM", suffix = "_tmm") |>
	scale_abundance(method = "upperquartile", suffix = "_upperquartile") |>
	scale_abundance(method = "RLE", suffix = "_RLE")

```

### Visualize After Scaling
```{r normalization-visualize-scaling, fig.width = 10, fig.height = 10}
# Before scaling
airway |> assay("counts") |> as.matrix() |> rowMeans() |> summary()

airway |> assay("counts_tmm") |> as.matrix() |> rowMeans() |> summary()

airway |> assay("counts_upperquartile") |> as.matrix() |> rowMeans() |> summary()

airway |> assay("counts_RLE") |> as.matrix() |> rowMeans() |> summary()

# Merge all methods into a single tibble
airway_scaled_all = bind_rows(
  airway |> assay("counts") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "no_scaling"),
  airway |> assay("counts_tmm") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "TMM"),
  airway |> assay("counts_upperquartile") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "upperquartile"),
  airway |> assay("counts_RLE") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "RLE")
)


# Density plot
airway_scaled_all |>
  ggplot(aes(counts + 1, group = .sample, color = method)) +
  geom_density() +
  scale_x_log10() +
  facet_wrap(~fct_relevel(method, "no_scaling", "TMM", "upperquartile", "RLE")) +
  my_theme +
  labs(title = "Scaled counts by method (after scaling)")
```


## Step 2: Exploratory Data Analysis

### Remove Zero-Variance Features (required for PCA)

Variance filtering is a standard preprocessing step for dimensionality reduction.

```{r eda-remove-zero-variance}
library(matrixStats)
# Remove features with zero variance across samples
airway = airway[rowVars(assay(airway)) > 0, ]
```

### Dimensionality Reduction

MDS [@kruskal1964mds] using limma::plotMDS [@ritchie2015limma] and PCA [@hotelling1933pca] are used for dimensionality reduction.

```{r eda-mds-analysis, fig.width = 10, fig.height = 10}
airway = airway |>
  reduce_dimensions(method="MDS", .dims = 2)
```

```{r eda-pca-analysis, fig.width = 10, fig.height = 10}
airway = airway |>
  reduce_dimensions(method="PCA", .dims = 2)
```

### Visualize Dimensionality Reduction Results

```{r eda-plot-dimensionality-reduction, fig.width = 10, fig.height = 10}
# MDS plot
airway |>
	pivot_sample() |>
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=`dex`)) +
  geom_point() +
	my_theme +
	labs(title = "MDS Analysis")

# PCA plot
	airway |>
	pivot_sample() |>
	ggplot(aes(x=`PC1`, y=`PC2`, color=`dex`)) +
	geom_point() +
	my_theme +
	labs(title = "PCA Analysis")
```

### Clustering Analysis

K-means clustering [@macqueen1967kmeans] is used for unsupervised grouping.

```{r eda-clustering-analysis, fig.width = 10, fig.height = 10}
airway = airway |>
  cluster_elements(method="kmeans", centers = 2)

# Visualize clustering
	airway |>
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=`cluster_kmeans`)) +
  geom_point() +
  my_theme +
  labs(title = "K-means Clustering")
```

## Step 3: Differential Expression Analysis


### Basic Differential Expression

**Methods:**

- **edgeR quasi-likelihood:** Quasi-likelihood F-tests for differential expression [@robinson2010edger; @chen2016edgeR]

- **edgeR robust likelihood ratio:** Robust likelihood ratio tests [@chen2016edgeR]

- **DESeq2:** Negative binomial distribution with dispersion estimation [@love2014deseq2]

- **limma-voom:** Linear modeling with empirical Bayes moderation [@law2014voom]

- **limma-voom with sample weights:** Enhanced voom with quality weights [@liu2015voomweights]

**References:**

- Robinson et al. (2010) edgeR: a Bioconductor package for differential expression analysis

- Chen et al. (2016) From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline

- Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

- Law et al. (2014) voom: precision weights unlock linear model analysis tools for RNA-seq read counts

- Liu et al. (2015) Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses

```{r differential-expression-multiple-methods, message = FALSE}
# Standard differential expression analysis
airway = airway |>

# Use QL method
	test_differential_expression(~ dex, method = "edgeR_quasi_likelihood", prefix = "ql__") |>
	
	# Use edger_robust_likelihood_ratio
	test_differential_expression(~ dex, method = "edger_robust_likelihood_ratio", prefix = "lr_robust__") |>
	
# Use DESeq2 method
	test_differential_expression(~ dex, method = "DESeq2", prefix = "deseq2__") |>
	
	# Use limma_voom
	test_differential_expression(~ dex, method = "limma_voom", prefix = "voom__") |>

# Use limma_voom_sample_weights
	test_differential_expression(~ dex, method = "limma_voom_sample_weights", prefix = "voom_weights__") 


```

### Quality Control of the Fit

It is important to check the quality of the fit. All methods produce a fit object that can be used for quality control. The fit object produced by each underlying method are stored in as attributes of the `airway_mini` object. We can use them for example to perform quality control of the fit.

#### For edgeR

Plot the biological coefficient of variation (BCV) trend. This plot is helpful to understant the dispersion of the data.

```{r differential-expression-edgeR-object}
library(edgeR)

metadata(airway)$tidybulk$edgeR_quasi_likelihood_object |>
  plotBCV()

```

Plot the log-fold change vs mean plot. 

```{r differential-expression-edgeR-fit}
library(edgeR)

metadata(airway)$tidybulk$edgeR_quasi_likelihood_fit |>
  plotMD()

```

#### For DESeq2

Plot the mean-variance trend.

```{r differential-expression-DESeq2-object}
library(DESeq2)

metadata(airway)$tidybulk$DESeq2_object |>
  plotDispEsts()

```

Plot the log-fold change vs mean plot.

```{r differential-expression-DESeq2-fit}
library(DESeq2)

metadata(airway)$tidybulk$DESeq2_object |>
  plotMA()

```


### Histograms of p-values across methods

Inspection of the raw p-value histogram provides a rapid check of differential-expression results. When no gene is truly differentially expressed, the p-values follow a uniform U(0,1) distribution across the interval 0–1, so the histogram appears flat [Source](https://bioconductor.org/help/course-materials/2014/useR2014/Workflows.html). In a more realistic scenario where only a subset of genes changes, this uniform background is still present but an obvious spike emerges close to zero, created by the genuine signals.

Thanks to the modularity of the `tidybulk` workflow, that can multiplex different methods, we can easily compare the p-values across methods.

```{r differential-expression-pvalue-histograms, fig.width = 10, fig.height = 10}
	airway |>
  pivot_transcript() |>
  select(
    ql__PValue, 
    lr_robust__PValue, 
    voom__P.Value, 
    voom_weights__P.Value, 
    deseq2__pvalue
  ) |> 
  pivot_longer(everything(), names_to = "method", values_to = "pvalue") |>
  ggplot(aes(x = pvalue, fill = method)) +
  geom_histogram(binwidth = 0.01) +
  facet_wrap(~method) +
  my_theme +
  labs(title = "Histogram of p-values across methods")
```


### Compare Results Across Methods

```{r differential-expression-summary-statistics}
# Summary statistics
airway |> 
  pivot_transcript() |> 
  select(contains("ql|lr_robust|voom|voom_weights|deseq2")) |> 
  select(contains("logFC")) |> 
  summarise(across(everything(), list(min = min, median = median, max = max), na.rm = TRUE))

```


### Pairplot of pvalues across methods (GGpairs)

```{r differential-expression-pvalue-pairplot, fig.width = 10, fig.height = 10, message = FALSE, warning=FALSE}

library(GGally)
airway |> 
  pivot_transcript() |> 
  select(ql__PValue, lr_robust__PValue, voom__P.Value, voom_weights__P.Value, deseq2__pvalue) |> 
  ggpairs(columns = 1:5, size = 0.5) +
  scale_y_log10_reverse() +
  scale_x_log10_reverse() +
  my_theme +
  labs(title = "Pairplot of p-values across methods")
```

### Pairplot of effect sizes across methods (GGpairs)

```{r differential-expression-effectsize-pairplot, fig.width = 10, fig.height = 10, message = FALSE, warning=FALSE}
library(GGally)
airway |> 
  pivot_transcript() |> 
  select(ql__logFC, lr_robust__logFC, voom__logFC, voom_weights__logFC, deseq2__log2FoldChange) |> 
  ggpairs(columns = 1:5, size = 0.5) +
  my_theme +
  labs(title = "Pairplot of effect sizes across methods")
```

### Volcano Plots for Each Method

Visualising the significance and effect size of the differential expression results as a volcano plots we appreciate that some methods have much lower p-values distributions than other methods, for the same model and data.

```{r differential-expression-volcano-plots-1, fig.width = 10, fig.height = 10}
# Create volcano plots
airway |>

	# Select the columns we want to plot
    pivot_transcript() |> 
    select(
			.feature,
      ql__logFC, ql__PValue,
      lr_robust__logFC, lr_robust__PValue,
      voom__logFC, voom__P.Value,
      voom_weights__logFC, voom_weights__P.Value,
      deseq2__log2FoldChange, deseq2__pvalue
    ) |>

	# Pivot longer to get a tidy data frame
	pivot_longer(
      - .feature,
      names_to = c("method", "stat"),
      values_to = "value", names_sep = "__"
    ) |>

	# Harmonize column names
	mutate(stat  = case_when(
		stat %in% c("logFC", "log2FoldChange") ~ "logFC",
		stat %in% c("PValue", "pvalue", "P.Value", "p.value") ~ "PValue"
	)) |>
  pivot_wider(names_from = "stat", values_from = "value") |>
  unnest(c(logFC, PValue)) |> 

	# Plot
  ggplot(aes(x = logFC, y = PValue)) +
  geom_point(aes(color = PValue < 0.05, size = PValue < 0.05)) +
  scale_y_log10_reverse() +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  scale_size_manual(values = c("TRUE" = 0.5, "FALSE" = 0.1)) +
  facet_wrap(~method) +
  my_theme +
  labs(title = "Volcano Plots by Method")
```

Plotting independent y-axis scales for the p-values and effect sizes allows us to compare the top genes across methods.

```{r differential-expression-volcano-plots-2, fig.width = 10, fig.height = 10}
# Create volcano plots
airway |>

	# Select the columns we want to plot
    pivot_transcript() |> 
    select(
			symbol,
      ql__logFC, ql__PValue,
      lr_robust__logFC, lr_robust__PValue,
      voom__logFC, voom__P.Value,
      voom_weights__logFC, voom_weights__P.Value,
      deseq2__log2FoldChange, deseq2__pvalue
    ) |>

	# Pivot longer to get a tidy data frame
	pivot_longer(
      - symbol,
      names_to = c("method", "stat"),
      values_to = "value", names_sep = "__"
    ) |>

	# Harmonize column names
	mutate(stat  = case_when(
		stat %in% c("logFC", "log2FoldChange") ~ "logFC",
		stat %in% c("PValue", "pvalue", "P.Value", "p.value") ~ "PValue"
	)) |>
  pivot_wider(names_from = "stat", values_from = "value") |>
  unnest(c(logFC, PValue)) |> 

	# Plot
  ggplot(aes(x = logFC, y = PValue)) +
  geom_point(aes(color = PValue < 0.05, size = PValue < 0.05)) +
  ggrepel::geom_text_repel(aes(label = symbol), size = 2, max.overlaps = 20) +
  scale_y_log10_reverse() +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  scale_size_manual(values = c("TRUE" = 0.5, "FALSE" = 0.1)) +
  facet_wrap(~method, scales = "free_y") +
  my_theme +
  labs(title = "Volcano Plots by Method")
```

### Differential Expression with Contrasts

Contrast-based differential expression analysis, is available for most methods. It is a standard statistical approach for testing specific comparisons in complex designs.

#### For edgeR

```{r differential-expression-contrasts}
# Using contrasts for more complex comparisons
airway |>
	test_differential_expression(
		~ 0 + dex,                  
		contrasts  = c("dextrt - dexuntrt"),
        method = "edgeR_quasi_likelihood", 
		prefix = "contrasts__"
	) |> 

	# Print the gene statistics
  pivot_transcript() |>
  select(contains("contrasts"))
```

#### For DESeq2

```{r differential-expression-contrasts-DESeq2}
# Using contrasts for more complex comparisons
airway |>
	test_differential_expression(
		~ 0 + dex,                  
		contrasts  = list(c("dex", "trt", "untrt")),
		method = "DESeq2",
		prefix = "contrasts__"
  ) |>
  pivot_transcript() |>
  select(contains("contrasts"))
```

#### For limma-voom

```{r differential-expression-contrasts-limma-voom}
# Using contrasts for more complex comparisons
airway |>
	test_differential_expression(
		~ 0 + dex,                  
		contrasts  = c("dextrt - dexuntrt"),
		method = "limma_voom",
		prefix = "contrasts__"
	) |>
  pivot_transcript() |>
  select(contains("contrasts"))
```


### Differential Expression with minimum fold change (TREAT method)

TREAT method [@mccarthy2009treat] is used for testing significance relative to a fold-change threshold.

```{r differential-expression-treat-method}
# Using contrasts for more complex comparisons
airway |>
	test_differential_expression(
		~ 0 + dex,                  
		contrasts  = c("dextrt - dexuntrt"),
        method = "edgeR_quasi_likelihood", 
		test_above_log2_fold_change = 2, 
		prefix = "treat__"
	) |> 

	# Print the gene statistics
  pivot_transcript() |>
  select(contains("treat"))
```

### Mixed Models for Complex Designs

glmmSeq [@ma2020glmmseq] is used for generalized linear mixed models for RNA-seq data.

```{r differential-expression-mixed-models}
# Using glmmSeq for mixed models
airway |>
  keep_abundant(formula_design = ~ dex) |>
  
  # Select 100 genes in the interest of execution time
  _[1:100,] |>

  # Fit model
  test_differential_expression(
    ~ dex + (1|cell), 
    method = "glmmseq_lme4", 
    cores = 1,
	prefix = "glmmseq__"
  ) 

  airway |>
  pivot_transcript() 
```


### Gene Description

With tidybulk, retrieving gene descriptions is straightforward, making it easy to enhance the interpretability of your differential expression results.

```{r differential-expression-gene-description}
# Add gene descriptions using the original SummarizedExperiment
airway |> 

	describe_transcript() |>

	# Filter top significant genes
	filter(ql__FDR < 0.05) |>

	# Print the gene statistics
	pivot_transcript() |> 
  filter(description |> is.na() |> not()) |>
	select(.feature, description, contains("ql")) |> 
	head()

```

## Step 4: Batch Effect Correction

ComBat-seq [@zhang2017combatseq] is used for batch effect correction in RNA-seq data.

```{r batch-correction-adjust-abundance}
# Adjust for batch effects
airway = airway |>
  adjust_abundance(
	  .factor_unwanted = cell, 
	  .factor_of_interest = dex, 
    method = "combat_seq", 
	  abundance = "counts_tmm"
  )

# Scatter plot of adjusted vs unadjusted
airway |> 
  
  # Subset genes to speed up plotting
  _[1:100,] |> 
  select(symbol, .sample, counts_tmm, counts_tmm_adjusted) |> 
  
  ggplot(aes(x = counts_tmm + 1, y = counts_tmm_adjusted + 1)) +
  geom_point(aes(color = .sample), size = 0.1) +
  ggrepel::geom_text_repel(aes(label = symbol), size = 2, max.overlaps = 20) +
  scale_x_log10() +
  scale_y_log10() +
  my_theme +
  labs(title = "Scatter plot of adjusted vs unadjusted")
```


## Step 5: Gene Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) [@subramanian2005gsea] is used for gene set enrichment.

```{r enrichment-gene-rank-analysis}
# Run gene rank enrichment (GSEA style)
gene_rank_res =
  airway |>

    # Filter for genes with entrez IDs
  filter(!entrezid |> is.na()) |>
  aggregate_duplicates(feature = "entrezid") |> 

  # Test gene rank
  test_gene_rank(
    .entrez = entrezid,
    .arrange_desc = lr_robust__logFC,
    species = "Homo sapiens",
    gene_sets = c("H", "C2", "C5")
  )
```

```{r enrichment-inspect-significant-genesets}
# Inspect significant gene sets (example for C2 collection)
gene_rank_res |>
  filter(gs_collection == "C2") |>
  dplyr::select(-fit) |>
  unnest(test) |>
  filter(p.adjust < 0.05)
```

#### Visualize enrichment 
```{r enrichment-visualize-gsea-plots}
  library(enrichplot)
  library(patchwork)
  gene_rank_res |>
    unnest(test) |>
    head() |>
    mutate(plot = pmap(
      list(fit, ID, idx_for_plotting, p.adjust),
      ~ enrichplot::gseaplot2(
        ..1, geneSetID = ..3,
        title = sprintf("%s \nadj pvalue %s", ..2, round(..4, 2)),
        base_size = 6, rel_heights = c(1.5, 0.5), subplots = c(1, 2)
      )
    )) |>
    pull(plot) 
detach("package:enrichplot", unload = TRUE)
```

Gene Ontology overrepresentation analysis [@ashburner2000go] is used for functional enrichment. 

```{r enrichment-gene-overrepresentation}
# Test gene overrepresentation
airway_overrep = 
  airway |>
  
  # Label genes to test overrepresentation of
  mutate(genes_to_test = ql__FDR < 0.05) |>
  
    # Filter for genes with entrez IDs
  filter(!entrezid |> is.na()) |>
  
  test_gene_overrepresentation(
    .entrez = entrezid,
    species = "Homo sapiens",
    .do_test = genes_to_test,
    gene_sets = c("H", "C2", "C5")
  )

  airway_overrep
```

EGSEA [@alhamdoosh2017egsea] is used for ensemble gene set enrichment analysis. EGSEA is a method that combines multiple gene set enrichment analysis methods to provide a more robust and comprehensive analysis of gene set enrichment. It creates a web-based interactive tool that allows you to explore the results of the gene set enrichment analysis.

```{r enrichment-egsea-analysis, eval=FALSE}
library(EGSEA)
# Test gene enrichment
  airway |> 

  # Filter for genes with entrez IDs
  filter(!entrezid |> is.na()) |>
  aggregate_duplicates(feature = "entrezid") |> 

  # Test gene enrichment
  test_gene_enrichment(
    .formula = ~dex,
    .entrez = entrezid,
    species = "human", 
    gene_sets = "h",
    methods = c("roast"),  # Use a more robust method
    cores = 2
  )
detach("package:EGSEA", unload = TRUE)
```

## Step 6: Cellularity Analysis

CIBERSORT [@newman2015cibersort] is used for cell type deconvolution.

Cellularity deconvolution is a standard approach for estimating the cellular composition of a sample.

### Available Deconvolution Methods

The `tidybulk` package provides several methods for deconvolution:

- **CIBERSORT** [@newman2015cibersort]: Uses support vector regression to deconvolve cell type proportions. Requires the `class`, `e1071`, and `preprocessCore` packages.
- **LLSR** [@newman2015cibersort]: Linear Least Squares Regression for deconvolution.
- **EPIC** [@racle2017epic]: Uses a reference-based approach to estimate cell fractions.
- **MCP-counter** [@becht2016mcp]: Quantifies the abundance of immune and stromal cell populations.
- **quanTIseq** [@finotello2019quantiseq]: A computational framework for inferring the immune contexture of tumors.
- **xCell** [@aran2017xcell]: Performs cell type enrichment analysis.

### Example Usage

```{r deconvolution-examples}

airway = 
  airway |> 
  
  filter(!symbol |> is.na()) |> 
  deconvolve_cellularity(method = "cibersort", cores = 1, prefix = "cibersort__", feature_column = "symbol") 

```

For the rest of the methods, you need to install the `immunedeconv` package.

```{r install-immunedeconv, eval=FALSE}
if (!requireNamespace("immunedeconv")) BiocManager::install("immunedeconv")
```

```{r deconvolution-examples2, eval=FALSE}

airway = 

airway |> 
# Example using LLSR
deconvolve_cellularity(method = "llsr", prefix = "llsr__", feature_column = "symbol") |> 

# Example using EPIC
deconvolve_cellularity(method = "epic", cores = 1, prefix = "epic__") |> 

# Example using MCP-counter
deconvolve_cellularity(method = "mcp_counter", cores = 1, prefix = "mcp__") |> 

# Example using quanTIseq
deconvolve_cellularity(method = "quantiseq", cores = 1, prefix = "quantiseq__") |> 

# Example using xCell
deconvolve_cellularity(method = "xcell", cores = 1, prefix = "xcell__")
```

### Plotting Results

Visualize the cell type proportions as a stacked barplot for each method:

```{r deconvolution-plotting, fig.width = 10, fig.height = 10}
# Visualize CIBERSORT results
airway	 |>
  pivot_sample() |>
  dplyr::select(.sample, contains("cibersort__")) |>
  pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |>
  ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "CIBERSORT Cell Type Proportions")

```


```{r deconvolution-plotting2, fig.width = 10, fig.height = 10, eval=FALSE}

 # Repeat similar plotting for LLSR, EPIC, MCP-counter, quanTIseq, and xCell
airway	 |>
  pivot_sample() |>
  select(.sample, contains("llsr__")) |>
  pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |>
  ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "LLSR Cell Type Proportions")

  airway	 |>
  pivot_sample() |>
  select(.sample, contains("epic__")) |>
  pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |>
  ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "EPIC Cell Type Proportions")

  airway	 |>
  pivot_sample() |>
  select(.sample, contains("mcp__")) |>
  pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |>
  ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "MCP-counter Cell Type Proportions")

  airway	 |>
  pivot_sample() |>
  select(.sample, contains("quantiseq__")) |>
  pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |>
  ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "quanTIseq Cell Type Proportions")

  airway	 |>
  pivot_sample() |>
  select(.sample, contains("xcell__")) |>
  pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |>
  ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "xCell Cell Type Proportions")
  
```


## Bibliography

`tidybulk` allows you to get the bibliography of all methods used in our workflow.

```{r bibliography-get-methods}
# Get bibliography of all methods used in our workflow
airway |> get_bibliography()
```


```{r session-info}
sessionInfo()
```