--- title: "tidybulk: An R tidy framework for modular transcriptomic data analysis" author: "Stefano Mangiola" date: "`r Sys.Date()`" package: tidybulk output: BiocStyle::html_document: toc_float: true bibliography: references.bib link-citations: true keywords: "transcriptomics, RNA-seq, differential expression, data analysis, tidyverse, SummarizedExperiment, bioinformatics, genomics, gene expression, clustering, dimensionality reduction, cellularity analysis, gene enrichment, R package" vignette: > %\VignetteIndexEntry{Overview of the tidybulk package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- [![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing) [![R build status](https://github.com/stemangiola/tidybulk/workflows/R-CMD-check/badge.svg)](https://github.com/stemangiola/tidybulk/actions/) [![Bioconductor status](https://bioconductor.org/shields/build/release/bioc/tidybulk.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/tidybulk/) **tidybulk** is a powerful R package designed for modular transcriptomic data analysis that brings transcriptomics to the tidyverse. ## Why tidybulk? Tidybulk provides a unified interface for comprehensive transcriptomic data analysis with seamless integration of SummarizedExperiment objects and tidyverse principles. It streamlines the entire workflow from raw data to biological insights. ## Functions/utilities available ### Abundance Normalization Functions Function | Description ------------ | ------------- `scale_abundance()` | Scale abundance data `quantile_normalise_abundance()` | Quantile normalization `adjust_abundance()` | Adjust abundance for unwanted variation `fill_missing_abundance()` | Fill missing abundance values `impute_missing_abundance()` | Impute missing abundance values ### Filtering and Selection Functions Function | Description ------------ | ------------- `identify_abundant()` | Identify abundant transcripts without removing them `keep_abundant()` | Keep abundant transcripts `keep_variable()` | Keep variable transcripts `filterByExpr()` | Filter by expression ### Dimensionality Reduction Functions Function | Description ------------ | ------------- `reduce_dimensions()` | Reduce dimensions with PCA/MDS/tSNE/UMAP `rotate_dimensions()` | Rotate dimensions `remove_redundancy()` | Remove redundant features ### Clustering Functions Function | Description ------------ | ------------- `cluster_elements()` | Cluster elements with various methods `kmeans clustering` | K-means clustering `SNN clustering` | Shared nearest neighbor clustering `hierarchical clustering` | Hierarchical clustering `DBSCAN clustering` | Density-based clustering ### Differential Analysis Functions Function | Description ------------ | ------------- `test_differential_expression()` | Test differential expression with various methods ### Cellularity Analysis Functions Function | Description ------------ | ------------- `deconvolve_cellularity()` | Deconvolve cellularity with various methods `cibersort()` | CIBERSORT analysis ### Gene Enrichment Functions Function | Description ------------ | ------------- `test_gene_enrichment()` | Test gene enrichment `test_gene_overrepresentation()` | Test gene overrepresentation `test_gene_rank()` | Test gene rank ### Utility Functions Function | Description ------------ | ------------- `describe_transcript()` | Describe transcript characteristics `get_bibliography()` | Get bibliography `resolve_complete_confounders_of_non_interest()` | Resolve confounders ### Validation and Utility Functions Function | Description ------------ | ------------- `check_if_counts_is_na()` | Check if counts contain NA values `check_if_duplicated_genes()` | Check for duplicated genes `check_if_wrong_input()` | Validate input data `log10_reverse_trans()` | Log10 reverse transformation `logit_trans()` | Logit transformation All functions are directly compatible with `SummarizedExperiment` objects and follow tidyverse principles for seamless integration with the tidyverse ecosystem. ### Scientific Citation Mangiola, Stefano, Ramyar Molania, Ruining Dong, Maria A. Doyle, and Anthony T. Papenfuss. 2021. "Tidybulk: An R tidy framework for modular transcriptomic data analysis." Genome Biology 22 (42). https://doi.org/10.1186/s13059-020-02233-7 [Genome Biology - tidybulk: an R tidy framework for modular transcriptomic data analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7) # Full vignette [HERE](https://stemangiola.github.io/tidybulk/) # Installation Guide **Bioconductor** ```{r install-bioconductor, message=FALSE, eval=FALSE} if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("tidybulk") ``` **Github** ```{r install-github, message=FALSE, eval=FALSE} devtools::install_github("stemangiola/tidybulk") ``` ```{r setup-libraries-and-theme, echo=FALSE, include=FALSE} library(knitr) library(dplyr) library(tidyr) library(tibble) library(purrr) library(magrittr) library(forcats) library(ggplot2) library(ggrepel) library(SummarizedExperiment) library(tidybulk) # Define theme my_theme = theme_bw() + theme( panel.border = element_blank(), axis.line = element_line(), panel.grid.major = element_line(linewidth = 0.2), panel.grid.minor = element_line(linewidth = 0.1), text = element_text(size=12), legend.position="bottom", aspect.ratio=1, strip.background = element_blank(), axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)) ) ``` In this vignette we will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells. This dataset is available in the [airway](https://bioconductor.org/packages/airway/) package. ```{r load airway} library(airway) data(airway) ``` This workflow, will use the [tidySummarizedExperiment](https://bioconductor.org/packages/tidySummarizedExperiment/) package to manipulate the data in a `tidyverse` fashion. This approach streamlines the data manipulation and analysis process, making it more efficient and easier to understand. ```{r load tidySummarizedExperiment} library(tidySummarizedExperiment) ``` Here we will add a gene symbol column to the `airway` object. This will be used to interpret the differential expression analysis, and to deconvolve the cellularity. ```{r add-gene-symbol} library(org.Hs.eg.db) library(AnnotationDbi) # Add gene symbol and entrez airway <- airway |> mutate(entrezid = mapIds(org.Hs.eg.db, keys = gene_name, keytype = "SYMBOL", column = "ENTREZID", multiVals = "first" )) detach("package:org.Hs.eg.db", unload = TRUE) detach("package:AnnotationDbi", unload = TRUE) ``` # Comprehensive Example Pipeline This vignette demonstrates a complete transcriptomic analysis workflow using tidybulk, with special emphasis on differential expression analysis. ## Data Overview We will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells: ```{r data-overview} airway ``` Loading `tidySummarizedExperiment` makes the `SummarizedExperiment` objects compatible with tidyverse tools while maintaining its `SummarizedExperiment` nature. This is useful because it allows us to use the `tidyverse` tools to manipulate the data. ```{r check-se-class} class(airway) ``` ### Prepare Data for Analysis Before analysis, we need to ensure our variables are in the correct format: ```{r convert-condition-to-factor} # Convert dex to factor for proper differential expression analysis airway = airway |> mutate(dex = as.factor(dex)) ``` ### Visualize Raw Counts Visualize the distribution of raw counts before any filtering: ```{r plot-raw-counts, fig.width = 10, fig.height = 10} airway |> ggplot(aes(counts + 1, group = .sample, color = dex)) + geom_density() + scale_x_log10() + my_theme + labs(title = "Raw counts by treatment (before any filtering)") ``` ## Step 1: Data Preprocessing ### Aggregate Duplicated Transcripts (optional) Aggregate duplicated transcripts (e.g., isoforms, ensembl IDs): > Transcript aggregation is a standard bioinformatics approach for gene-level summarization. ```{r preprocessing-aggregate-duplicates} # Aggregate duplicates airway = airway |> aggregate_duplicates(feature = "gene_name", aggregation_function = mean) ``` ### Abundance Filtering Abundance filtering can be performed using established methods [@robinson2010edger; @chen2016edgeR]. #### Run multiple methods ```{r filtering-abundance-methods} # Default (simple filtering) airway_abundant_default = airway |> keep_abundant(minimum_counts = 10, minimum_proportion = 0.5) # With factor_of_interest (recommended for complex designs) airway_abundant_formula = airway |> keep_abundant(minimum_counts = 10, minimum_proportion = 0.5, factor_of_interest = dex) # With CPM threshold (using design parameter) airway_abundant_cpm = airway |> keep_abundant(minimum_count_per_million = 10, minimum_proportion = 0.5, factor_of_interest = dex) ``` #### Compare methods ```{r filtering-summary-statistics, fig.width = 10, fig.height = 10, warning = FALSE} # Example: summary for default tidybulk filtering # Before filtering airway |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) # After filtering airway_abundant_default |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) airway_abundant_formula |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) airway_abundant_cpm |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) ``` ```{r filtering-density-plot-comparison, fig.width = 10, fig.height = 10} # Merge all methods into a single tibble airway_abundant_all = bind_rows( airway |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "no filter"), airway_abundant_default |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "default"), airway_abundant_formula |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "formula"), airway_abundant_cpm |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "cpm") ) # Density plot across methods airway_abundant_all |> ggplot(aes(counts + 1, group = .sample, color = method)) + geom_density() + scale_x_log10() + facet_wrap(~fct_relevel(method, "no filter", "default", "formula", "cpm")) + my_theme + labs(title = "Counts after abundance filtering (tidybulk default)") ``` Update the `airway` object with the filtered data: ```{r update-se-mini-with-filtered} airway = airway_abundant_formula ``` > **Tip:** Use `formula_design` for complex designs, and use the CPM threshold for library-size-aware filtering. ### Remove Redundant Transcripts Redundancy removal is a standard approach for reducing highly correlated features. ```{r preprocessing-remove-redundancy} airway_non_redundant = airway |> remove_redundancy(method = "correlation", top = 100, of_samples = FALSE) # Make airway |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) # Summary statistics airway_non_redundant |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) # Plot before and after # Merge before and after into a single tibble airway_all = bind_rows( airway |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "before"), airway_non_redundant |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "after") ) # Density plot airway_all |> ggplot(aes(counts + 1, group = .sample, color = method)) + geom_density() + scale_x_log10() + facet_wrap(~fct_relevel(method, "before", "after")) + my_theme + labs(title = "Counts after removing redundant transcripts") ``` ### Filter Variable Transcripts Keep only the most variable transcripts for downstream analysis. Variance-based feature selection using edgeR methodology [@robinson2010edger] is used for selecting informative features. ```{r filtering-keep-variable} airway_variable = airway |> keep_variable() ``` ### Visualize After Variable Filtering Variable Transcripts (optional) ```{r filtering-variable-summary-and-plot, fig.width = 10, fig.height = 10} # Before filtering airway |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) # After filtering airway_variable |> summarise( n_features = n_distinct(.feature), min_count = min(counts), median_count = median(counts), max_count = max(counts) ) # Density plot # Merge before and after into a single tibble airway_all = bind_rows( airway |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "before"), airway_variable |> assay() |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "after") ) # Density plot airway_all |> ggplot(aes(counts + 1, group = .sample, color = method)) + geom_density() + scale_x_log10() + facet_wrap(~fct_relevel(method, "before", "after")) + my_theme + labs(title = "Counts after variable filtering") ``` ### Scale Abundance Scale for sequencing depth using TMM [@robinson2010edger], upper quartile [@bullard2010uq], and RLE [@anders2010rle] normalization. ```{r normalization-scale-abundance} airway = airway |> scale_abundance(method = "TMM", suffix = "_tmm") |> scale_abundance(method = "upperquartile", suffix = "_upperquartile") |> scale_abundance(method = "RLE", suffix = "_RLE") ``` ### Visualize After Scaling ```{r normalization-visualize-scaling, fig.width = 10, fig.height = 10} # Before scaling airway |> assay("counts") |> as.matrix() |> rowMeans() |> summary() airway |> assay("counts_tmm") |> as.matrix() |> rowMeans() |> summary() airway |> assay("counts_upperquartile") |> as.matrix() |> rowMeans() |> summary() airway |> assay("counts_RLE") |> as.matrix() |> rowMeans() |> summary() # Merge all methods into a single tibble airway_scaled_all = bind_rows( airway |> assay("counts") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "no_scaling"), airway |> assay("counts_tmm") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "TMM"), airway |> assay("counts_upperquartile") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "upperquartile"), airway |> assay("counts_RLE") |> as_tibble(rownames = ".feature") |> pivot_longer(cols = -.feature, names_to = ".sample", values_to = "counts") |> mutate(method = "RLE") ) # Density plot airway_scaled_all |> ggplot(aes(counts + 1, group = .sample, color = method)) + geom_density() + scale_x_log10() + facet_wrap(~fct_relevel(method, "no_scaling", "TMM", "upperquartile", "RLE")) + my_theme + labs(title = "Scaled counts by method (after scaling)") ``` ## Step 2: Exploratory Data Analysis ### Remove Zero-Variance Features (required for PCA) Variance filtering is a standard preprocessing step for dimensionality reduction. ```{r eda-remove-zero-variance} library(matrixStats) # Remove features with zero variance across samples airway = airway[rowVars(assay(airway)) > 0, ] ``` ### Dimensionality Reduction MDS [@kruskal1964mds] using limma::plotMDS [@ritchie2015limma] and PCA [@hotelling1933pca] are used for dimensionality reduction. ```{r eda-mds-analysis, fig.width = 10, fig.height = 10} airway = airway |> reduce_dimensions(method="MDS", .dims = 2) ``` ```{r eda-pca-analysis, fig.width = 10, fig.height = 10} airway = airway |> reduce_dimensions(method="PCA", .dims = 2) ``` ### Visualize Dimensionality Reduction Results ```{r eda-plot-dimensionality-reduction, fig.width = 10, fig.height = 10} # MDS plot airway |> pivot_sample() |> ggplot(aes(x=`Dim1`, y=`Dim2`, color=`dex`)) + geom_point() + my_theme + labs(title = "MDS Analysis") # PCA plot airway |> pivot_sample() |> ggplot(aes(x=`PC1`, y=`PC2`, color=`dex`)) + geom_point() + my_theme + labs(title = "PCA Analysis") ``` ### Clustering Analysis K-means clustering [@macqueen1967kmeans] is used for unsupervised grouping. ```{r eda-clustering-analysis, fig.width = 10, fig.height = 10} airway = airway |> cluster_elements(method="kmeans", centers = 2) # Visualize clustering airway |> ggplot(aes(x=`Dim1`, y=`Dim2`, color=`cluster_kmeans`)) + geom_point() + my_theme + labs(title = "K-means Clustering") ``` ## Step 3: Differential Expression Analysis ### Basic Differential Expression **Methods:** - **edgeR quasi-likelihood:** Quasi-likelihood F-tests for differential expression [@robinson2010edger; @chen2016edgeR] - **edgeR robust likelihood ratio:** Robust likelihood ratio tests [@chen2016edgeR] - **DESeq2:** Negative binomial distribution with dispersion estimation [@love2014deseq2] - **limma-voom:** Linear modeling with empirical Bayes moderation [@law2014voom] - **limma-voom with sample weights:** Enhanced voom with quality weights [@liu2015voomweights] **References:** - Robinson et al. (2010) edgeR: a Bioconductor package for differential expression analysis - Chen et al. (2016) From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline - Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 - Law et al. (2014) voom: precision weights unlock linear model analysis tools for RNA-seq read counts - Liu et al. (2015) Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses ```{r differential-expression-multiple-methods, message = FALSE} # Standard differential expression analysis airway = airway |> # Use QL method test_differential_expression(~ dex, method = "edgeR_quasi_likelihood", prefix = "ql__") |> # Use edger_robust_likelihood_ratio test_differential_expression(~ dex, method = "edger_robust_likelihood_ratio", prefix = "lr_robust__") |> # Use DESeq2 method test_differential_expression(~ dex, method = "DESeq2", prefix = "deseq2__") |> # Use limma_voom test_differential_expression(~ dex, method = "limma_voom", prefix = "voom__") |> # Use limma_voom_sample_weights test_differential_expression(~ dex, method = "limma_voom_sample_weights", prefix = "voom_weights__") ``` ### Quality Control of the Fit It is important to check the quality of the fit. All methods produce a fit object that can be used for quality control. The fit object produced by each underlying method are stored in as attributes of the `airway_mini` object. We can use them for example to perform quality control of the fit. #### For edgeR Plot the biological coefficient of variation (BCV) trend. This plot is helpful to understant the dispersion of the data. ```{r differential-expression-edgeR-object} library(edgeR) metadata(airway)$tidybulk$edgeR_quasi_likelihood_object |> plotBCV() ``` Plot the log-fold change vs mean plot. ```{r differential-expression-edgeR-fit} library(edgeR) metadata(airway)$tidybulk$edgeR_quasi_likelihood_fit |> plotMD() ``` #### For DESeq2 Plot the mean-variance trend. ```{r differential-expression-DESeq2-object} library(DESeq2) metadata(airway)$tidybulk$DESeq2_object |> plotDispEsts() ``` Plot the log-fold change vs mean plot. ```{r differential-expression-DESeq2-fit} library(DESeq2) metadata(airway)$tidybulk$DESeq2_object |> plotMA() ``` ### Histograms of p-values across methods Inspection of the raw p-value histogram provides a rapid check of differential-expression results. When no gene is truly differentially expressed, the p-values follow a uniform U(0,1) distribution across the interval 0–1, so the histogram appears flat [Source](https://bioconductor.org/help/course-materials/2014/useR2014/Workflows.html). In a more realistic scenario where only a subset of genes changes, this uniform background is still present but an obvious spike emerges close to zero, created by the genuine signals. Thanks to the modularity of the `tidybulk` workflow, that can multiplex different methods, we can easily compare the p-values across methods. ```{r differential-expression-pvalue-histograms, fig.width = 10, fig.height = 10} airway |> pivot_transcript() |> select( ql__PValue, lr_robust__PValue, voom__P.Value, voom_weights__P.Value, deseq2__pvalue ) |> pivot_longer(everything(), names_to = "method", values_to = "pvalue") |> ggplot(aes(x = pvalue, fill = method)) + geom_histogram(binwidth = 0.01) + facet_wrap(~method) + my_theme + labs(title = "Histogram of p-values across methods") ``` ### Compare Results Across Methods ```{r differential-expression-summary-statistics} # Summary statistics airway |> pivot_transcript() |> select(contains("ql|lr_robust|voom|voom_weights|deseq2")) |> select(contains("logFC")) |> summarise(across(everything(), list(min = min, median = median, max = max), na.rm = TRUE)) ``` ### Pairplot of pvalues across methods (GGpairs) ```{r differential-expression-pvalue-pairplot, fig.width = 10, fig.height = 10, message = FALSE, warning=FALSE} library(GGally) airway |> pivot_transcript() |> select(ql__PValue, lr_robust__PValue, voom__P.Value, voom_weights__P.Value, deseq2__pvalue) |> ggpairs(columns = 1:5, size = 0.5) + scale_y_log10_reverse() + scale_x_log10_reverse() + my_theme + labs(title = "Pairplot of p-values across methods") ``` ### Pairplot of effect sizes across methods (GGpairs) ```{r differential-expression-effectsize-pairplot, fig.width = 10, fig.height = 10, message = FALSE, warning=FALSE} library(GGally) airway |> pivot_transcript() |> select(ql__logFC, lr_robust__logFC, voom__logFC, voom_weights__logFC, deseq2__log2FoldChange) |> ggpairs(columns = 1:5, size = 0.5) + my_theme + labs(title = "Pairplot of effect sizes across methods") ``` ### Volcano Plots for Each Method Visualising the significance and effect size of the differential expression results as a volcano plots we appreciate that some methods have much lower p-values distributions than other methods, for the same model and data. ```{r differential-expression-volcano-plots-1, fig.width = 10, fig.height = 10} # Create volcano plots airway |> # Select the columns we want to plot pivot_transcript() |> select( .feature, ql__logFC, ql__PValue, lr_robust__logFC, lr_robust__PValue, voom__logFC, voom__P.Value, voom_weights__logFC, voom_weights__P.Value, deseq2__log2FoldChange, deseq2__pvalue ) |> # Pivot longer to get a tidy data frame pivot_longer( - .feature, names_to = c("method", "stat"), values_to = "value", names_sep = "__" ) |> # Harmonize column names mutate(stat = case_when( stat %in% c("logFC", "log2FoldChange") ~ "logFC", stat %in% c("PValue", "pvalue", "P.Value", "p.value") ~ "PValue" )) |> pivot_wider(names_from = "stat", values_from = "value") |> unnest(c(logFC, PValue)) |> # Plot ggplot(aes(x = logFC, y = PValue)) + geom_point(aes(color = PValue < 0.05, size = PValue < 0.05)) + scale_y_log10_reverse() + scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) + scale_size_manual(values = c("TRUE" = 0.5, "FALSE" = 0.1)) + facet_wrap(~method) + my_theme + labs(title = "Volcano Plots by Method") ``` Plotting independent y-axis scales for the p-values and effect sizes allows us to compare the top genes across methods. ```{r differential-expression-volcano-plots-2, fig.width = 10, fig.height = 10} # Create volcano plots airway |> # Select the columns we want to plot pivot_transcript() |> select( symbol, ql__logFC, ql__PValue, lr_robust__logFC, lr_robust__PValue, voom__logFC, voom__P.Value, voom_weights__logFC, voom_weights__P.Value, deseq2__log2FoldChange, deseq2__pvalue ) |> # Pivot longer to get a tidy data frame pivot_longer( - symbol, names_to = c("method", "stat"), values_to = "value", names_sep = "__" ) |> # Harmonize column names mutate(stat = case_when( stat %in% c("logFC", "log2FoldChange") ~ "logFC", stat %in% c("PValue", "pvalue", "P.Value", "p.value") ~ "PValue" )) |> pivot_wider(names_from = "stat", values_from = "value") |> unnest(c(logFC, PValue)) |> # Plot ggplot(aes(x = logFC, y = PValue)) + geom_point(aes(color = PValue < 0.05, size = PValue < 0.05)) + ggrepel::geom_text_repel(aes(label = symbol), size = 2, max.overlaps = 20) + scale_y_log10_reverse() + scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) + scale_size_manual(values = c("TRUE" = 0.5, "FALSE" = 0.1)) + facet_wrap(~method, scales = "free_y") + my_theme + labs(title = "Volcano Plots by Method") ``` ### Differential Expression with Contrasts Contrast-based differential expression analysis, is available for most methods. It is a standard statistical approach for testing specific comparisons in complex designs. #### For edgeR ```{r differential-expression-contrasts} # Using contrasts for more complex comparisons airway |> test_differential_expression( ~ 0 + dex, contrasts = c("dextrt - dexuntrt"), method = "edgeR_quasi_likelihood", prefix = "contrasts__" ) |> # Print the gene statistics pivot_transcript() |> select(contains("contrasts")) ``` #### For DESeq2 ```{r differential-expression-contrasts-DESeq2} # Using contrasts for more complex comparisons airway |> test_differential_expression( ~ 0 + dex, contrasts = list(c("dex", "trt", "untrt")), method = "DESeq2", prefix = "contrasts__" ) |> pivot_transcript() |> select(contains("contrasts")) ``` #### For limma-voom ```{r differential-expression-contrasts-limma-voom} # Using contrasts for more complex comparisons airway |> test_differential_expression( ~ 0 + dex, contrasts = c("dextrt - dexuntrt"), method = "limma_voom", prefix = "contrasts__" ) |> pivot_transcript() |> select(contains("contrasts")) ``` ### Differential Expression with minimum fold change (TREAT method) TREAT method [@mccarthy2009treat] is used for testing significance relative to a fold-change threshold. ```{r differential-expression-treat-method} # Using contrasts for more complex comparisons airway |> test_differential_expression( ~ 0 + dex, contrasts = c("dextrt - dexuntrt"), method = "edgeR_quasi_likelihood", test_above_log2_fold_change = 2, prefix = "treat__" ) |> # Print the gene statistics pivot_transcript() |> select(contains("treat")) ``` ### Mixed Models for Complex Designs glmmSeq [@ma2020glmmseq] is used for generalized linear mixed models for RNA-seq data. ```{r differential-expression-mixed-models} # Using glmmSeq for mixed models airway |> keep_abundant(formula_design = ~ dex) |> # Select 100 genes in the interest of execution time _[1:100,] |> # Fit model test_differential_expression( ~ dex + (1|cell), method = "glmmseq_lme4", cores = 1, prefix = "glmmseq__" ) airway |> pivot_transcript() ``` ### Gene Description With tidybulk, retrieving gene descriptions is straightforward, making it easy to enhance the interpretability of your differential expression results. ```{r differential-expression-gene-description} # Add gene descriptions using the original SummarizedExperiment airway |> describe_transcript() |> # Filter top significant genes filter(ql__FDR < 0.05) |> # Print the gene statistics pivot_transcript() |> filter(description |> is.na() |> not()) |> select(.feature, description, contains("ql")) |> head() ``` ## Step 4: Batch Effect Correction ComBat-seq [@zhang2017combatseq] is used for batch effect correction in RNA-seq data. ```{r batch-correction-adjust-abundance} # Adjust for batch effects airway = airway |> adjust_abundance( .factor_unwanted = cell, .factor_of_interest = dex, method = "combat_seq", abundance = "counts_tmm" ) # Scatter plot of adjusted vs unadjusted airway |> # Subset genes to speed up plotting _[1:100,] |> select(symbol, .sample, counts_tmm, counts_tmm_adjusted) |> ggplot(aes(x = counts_tmm + 1, y = counts_tmm_adjusted + 1)) + geom_point(aes(color = .sample), size = 0.1) + ggrepel::geom_text_repel(aes(label = symbol), size = 2, max.overlaps = 20) + scale_x_log10() + scale_y_log10() + my_theme + labs(title = "Scatter plot of adjusted vs unadjusted") ``` ## Step 5: Gene Enrichment Analysis Gene Set Enrichment Analysis (GSEA) [@subramanian2005gsea] is used for gene set enrichment. ```{r enrichment-gene-rank-analysis} # Run gene rank enrichment (GSEA style) gene_rank_res = airway |> # Filter for genes with entrez IDs filter(!entrezid |> is.na()) |> aggregate_duplicates(feature = "entrezid") |> # Test gene rank test_gene_rank( .entrez = entrezid, .arrange_desc = lr_robust__logFC, species = "Homo sapiens", gene_sets = c("H", "C2", "C5") ) ``` ```{r enrichment-inspect-significant-genesets} # Inspect significant gene sets (example for C2 collection) gene_rank_res |> filter(gs_collection == "C2") |> dplyr::select(-fit) |> unnest(test) |> filter(p.adjust < 0.05) ``` #### Visualize enrichment ```{r enrichment-visualize-gsea-plots} library(enrichplot) library(patchwork) gene_rank_res |> unnest(test) |> head() |> mutate(plot = pmap( list(fit, ID, idx_for_plotting, p.adjust), ~ enrichplot::gseaplot2( ..1, geneSetID = ..3, title = sprintf("%s \nadj pvalue %s", ..2, round(..4, 2)), base_size = 6, rel_heights = c(1.5, 0.5), subplots = c(1, 2) ) )) |> pull(plot) detach("package:enrichplot", unload = TRUE) ``` Gene Ontology overrepresentation analysis [@ashburner2000go] is used for functional enrichment. ```{r enrichment-gene-overrepresentation} # Test gene overrepresentation airway_overrep = airway |> # Label genes to test overrepresentation of mutate(genes_to_test = ql__FDR < 0.05) |> # Filter for genes with entrez IDs filter(!entrezid |> is.na()) |> test_gene_overrepresentation( .entrez = entrezid, species = "Homo sapiens", .do_test = genes_to_test, gene_sets = c("H", "C2", "C5") ) airway_overrep ``` EGSEA [@alhamdoosh2017egsea] is used for ensemble gene set enrichment analysis. EGSEA is a method that combines multiple gene set enrichment analysis methods to provide a more robust and comprehensive analysis of gene set enrichment. It creates a web-based interactive tool that allows you to explore the results of the gene set enrichment analysis. ```{r enrichment-egsea-analysis, eval=FALSE} library(EGSEA) # Test gene enrichment airway |> # Filter for genes with entrez IDs filter(!entrezid |> is.na()) |> aggregate_duplicates(feature = "entrezid") |> # Test gene enrichment test_gene_enrichment( .formula = ~dex, .entrez = entrezid, species = "human", gene_sets = "h", methods = c("roast"), # Use a more robust method cores = 2 ) detach("package:EGSEA", unload = TRUE) ``` ## Step 6: Cellularity Analysis CIBERSORT [@newman2015cibersort] is used for cell type deconvolution. Cellularity deconvolution is a standard approach for estimating the cellular composition of a sample. ### Available Deconvolution Methods The `tidybulk` package provides several methods for deconvolution: - **CIBERSORT** [@newman2015cibersort]: Uses support vector regression to deconvolve cell type proportions. Requires the `class`, `e1071`, and `preprocessCore` packages. - **LLSR** [@newman2015cibersort]: Linear Least Squares Regression for deconvolution. - **EPIC** [@racle2017epic]: Uses a reference-based approach to estimate cell fractions. - **MCP-counter** [@becht2016mcp]: Quantifies the abundance of immune and stromal cell populations. - **quanTIseq** [@finotello2019quantiseq]: A computational framework for inferring the immune contexture of tumors. - **xCell** [@aran2017xcell]: Performs cell type enrichment analysis. ### Example Usage ```{r deconvolution-examples} airway = airway |> filter(!symbol |> is.na()) |> deconvolve_cellularity(method = "cibersort", cores = 1, prefix = "cibersort__", feature_column = "symbol") ``` For the rest of the methods, you need to install the `immunedeconv` package. ```{r install-immunedeconv, eval=FALSE} if (!requireNamespace("immunedeconv")) BiocManager::install("immunedeconv") ``` ```{r deconvolution-examples2, eval=FALSE} airway = airway |> # Example using LLSR deconvolve_cellularity(method = "llsr", prefix = "llsr__", feature_column = "symbol") |> # Example using EPIC deconvolve_cellularity(method = "epic", cores = 1, prefix = "epic__") |> # Example using MCP-counter deconvolve_cellularity(method = "mcp_counter", cores = 1, prefix = "mcp__") |> # Example using quanTIseq deconvolve_cellularity(method = "quantiseq", cores = 1, prefix = "quantiseq__") |> # Example using xCell deconvolve_cellularity(method = "xcell", cores = 1, prefix = "xcell__") ``` ### Plotting Results Visualize the cell type proportions as a stacked barplot for each method: ```{r deconvolution-plotting, fig.width = 10, fig.height = 10} # Visualize CIBERSORT results airway |> pivot_sample() |> dplyr::select(.sample, contains("cibersort__")) |> pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |> ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) + geom_bar(stat = "identity") + scale_y_continuous(labels = scales::percent) + my_theme + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + labs(title = "CIBERSORT Cell Type Proportions") ``` ```{r deconvolution-plotting2, fig.width = 10, fig.height = 10, eval=FALSE} # Repeat similar plotting for LLSR, EPIC, MCP-counter, quanTIseq, and xCell airway |> pivot_sample() |> select(.sample, contains("llsr__")) |> pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |> ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) + geom_bar(stat = "identity") + scale_y_continuous(labels = scales::percent) + my_theme + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + labs(title = "LLSR Cell Type Proportions") airway |> pivot_sample() |> select(.sample, contains("epic__")) |> pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |> ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) + geom_bar(stat = "identity") + scale_y_continuous(labels = scales::percent) + my_theme + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + labs(title = "EPIC Cell Type Proportions") airway |> pivot_sample() |> select(.sample, contains("mcp__")) |> pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |> ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) + geom_bar(stat = "identity") + scale_y_continuous(labels = scales::percent) + my_theme + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + labs(title = "MCP-counter Cell Type Proportions") airway |> pivot_sample() |> select(.sample, contains("quantiseq__")) |> pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |> ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) + geom_bar(stat = "identity") + scale_y_continuous(labels = scales::percent) + my_theme + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + labs(title = "quanTIseq Cell Type Proportions") airway |> pivot_sample() |> select(.sample, contains("xcell__")) |> pivot_longer(cols = -1, names_to = "Cell_type_inferred", values_to = "proportion") |> ggplot(aes(x = .sample, y = proportion, fill = Cell_type_inferred)) + geom_bar(stat = "identity") + scale_y_continuous(labels = scales::percent) + my_theme + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + labs(title = "xCell Cell Type Proportions") ``` ## Bibliography `tidybulk` allows you to get the bibliography of all methods used in our workflow. ```{r bibliography-get-methods} # Get bibliography of all methods used in our workflow airway |> get_bibliography() ``` ```{r session-info} sessionInfo() ```