--- title: "scGraphVerse Case Study: B-cell GRN Reconstruction" author: "Francesco Cecere" output: BiocStyle::html_document: toc: true number_sections: true vignette: > %\VignetteIndexEntry{scGraphVerse Case Study: B-cell GRN Reconstruction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction This vignette demonstrates the **scGraphVerse** workflow on a single-cell RNA-seq dataset. We show how to: 1. Load and preprocess public PBMC data. 2. Infer gene regulatory networks with **GENIE3**. 3. Build consensus networks and detect communities. 4. Validate inferred edges using STRINGdb. ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 5, eval = TRUE ) library(scGraphVerse) ``` # 1. Dataset and Preprocessing In this real-data analysis, we'll work with two public PBMC (Peripheral Blood Mononuclear Cell) datasets from 10X Genomics. Our strategy is to focus specifically on B cells and identify a common set of highly expressed genes across both datasets. This approach allows us to compare regulatory networks between different experimental batches while controlling for cell type and gene selection effects. **Data Processing Workflow:** 1. Load two PBMC datasets (3k and 4k cells) from TENxPBMCData 2. Use SingleR for automated cell type annotation 3. Select top 500 most expressed genes in B cells from each dataset 4. Find intersection of expressed genes to ensure comparable gene sets 5. Subset datasets to B cells only with common gene set This preprocessing ensures we have clean, comparable data for network inference. ```{r loaddata} # Note: This vignette requires external packages from Suggests # Install if needed: # BiocManager::install(c("TENxPBMCData", "scater", "SingleR", "celldex")) # Helper function to preprocess PBMC data preprocess_pbmc <- function(pbmc_obj) { sce <- scater::logNormCounts(pbmc_obj) symbols_tenx <- SummarizedExperiment::rowData(sce)$Symbol_TENx valid <- !is.na(symbols_tenx) & symbols_tenx != "" sce <- sce[valid, ] rownames(sce) <- make.unique(symbols_tenx[valid]) SummarizedExperiment::assay(sce, "logcounts") <- as.matrix(SummarizedExperiment::assay(sce, "logcounts")) colnames(sce) <- paste0("cell_", seq_len(ncol(sce))) return(sce) } # 1. Load and preprocess PBMC data if (requireNamespace("TENxPBMCData", quietly = TRUE)) { pbmc_obj <- TENxPBMCData::TENxPBMCData("pbmc3k") pbmc_obj1 <- TENxPBMCData::TENxPBMCData("pbmc4k") sce <- preprocess_pbmc(pbmc_obj) sce1 <- preprocess_pbmc(pbmc_obj1) # 2. Cell type annotation if (requireNamespace("celldex", quietly = TRUE) && requireNamespace("SingleR", quietly = TRUE)) { ref <- celldex::HumanPrimaryCellAtlasData() pred1 <- SingleR::SingleR(test = sce1, ref = ref, labels=ref$label.main) SummarizedExperiment::colData(sce1)$pcell <- pred1$labels pred <- SingleR::SingleR(test = sce, ref = ref, labels=ref$label.main) SummarizedExperiment::colData(sce)$pcell <- pred$labels # 3. Select top 100 B-cell genes genes <- selgene( object = sce, top_n = 100, cell_type = "B_cell", cell_type_col = "pcell", remove_rib = TRUE, remove_mt = TRUE ) genes1 <- selgene( object = sce1, top_n = 100, cell_type = "B_cell", cell_type_col = "pcell", remove_rib = TRUE, remove_mt = TRUE ) # 4. Find intersection commong <- intersect(genes, genes1) # 5. Subset data pbmc1_sub <- sce[ commong, SummarizedExperiment::colData(sce)$pcell == "B_cell" ] pbmc2_sub <- sce1[ commong, SummarizedExperiment::colData(sce1)$pcell == "B_cell" ] pbmc1_sub <- pbmc1_sub[, 1:85] pbmc2_sub <- pbmc2_sub[, 1:85] # Create MultiAssayExperiment for multi-sample analysis bcell_mae <- create_mae(list(pbmc1 = pbmc1_sub, pbmc2 = pbmc2_sub)) } } ``` # 2. Network Inference Now we'll infer gene regulatory networks from our preprocessed B cell data using GENIE3. Unlike the simulation study where we had control over all parameters, here we're working with real biological data that presents unique challenges: B cells have distinct expression patterns, lower total gene counts compared to the full transcriptome, and batch effects between the two PBMC datasets. The function now accepts a `MultiAssayExperiment` object as input, which provides a structured way to handle multiple experimental conditions. ```{r genie3} # Choose method: "GENIE3", "GRNBoost2", "ZILGM", "PCzinb" or "JRF" method <- "GENIE3" if (exists("bcell_mae")) { networks <- infer_networks( count_matrices_list = bcell_mae, method = method, nCores = 1 ) } ``` ## 2.1. Building Adjacency Matrices Here we apply a more stringent threshold (99th percentile) compared to the simulation study (95th percentile). This is appropriate for real data analysis where we expect higher noise levels and want to focus on the most confident regulatory relationships. With B cell data, we're particularly interested in high-confidence edges controlling B cell function. The functions now return and work with `SummarizedExperiment` objects for better data organization and metadata tracking. ```{r adjacency, fig.alt="plot Graphs"} if (exists("networks")) { # Weighted adjacency (returns SummarizedExperiment) wadj_se <- generate_adjacency(networks) # Symmetrize (returns SummarizedExperiment) swadj_se <- symmetrize(wadj_se, weight_function = "mean") # Binary cutoff (top 1%, returns SummarizedExperiment) binary_se <- cutoff_adjacency( count_matrices = bcell_mae, weighted_adjm_list = swadj_se, n = 1, method = method, quantile_threshold = 0.99, nCores = 1 ) # Plot if (requireNamespace("ggraph", quietly = TRUE)) { plotg(binary_se) } } ``` # 3. Consensus and Community Detection With only two PBMC datasets, our consensus network will include edges that appear in both batches, providing high confidence in the regulatory relationships. ```{r consensus, fig.alt="plot consensus Graph"} if (exists("binary_se")) { # Consensus by vote consensus <- create_consensus(binary_se, method = "vote") # Plot consensus network if (requireNamespace("ggraph", quietly = TRUE)) { plotg(consensus) } } ``` Here we demonstrate the use of `classify_edges()` and `plot_network_comparison()` for detailed network comparison analysis. The `false_plot` parameter controls whether to include false positive/negative plots. ```{r comparecommunity, fig.alt="comparing Graphs with reference"} if (exists("consensus")) { # Note: For comparison with external databases like STRINGdb, # use stringdb_adjacency() to fetch reference networks # Community detection if (requireNamespace("igraph", quietly = TRUE)) { communities <- community_path(consensus) } } ``` # 4. Validation with STRINGdb Unlike the simulation study where we used STRINGdb to create ground truth, here we use it for validation of our inferred B cell regulatory network. The `keep_all_genes = TRUE` parameter ensures we retain information for all genes in our dataset, even those not found in STRING, which is important for comprehensive edge mining analysis. ```{r stringdb} # Note: This requires internet connection for STRINGdb downloads # This section requires the STRINGdb package from Suggests if (exists("consensus") && requireNamespace("STRINGdb", quietly = TRUE)) { str_result <- stringdb_adjacency( genes = rownames(consensus), species = 9606, required_score = 900, keep_all_genes = TRUE ) # Extract binary adjacency and symmetrize str_binary <- str_result$binary ground_truth_se <- symmetrize( build_network_se(list(string_network = str_binary)), weight_function = "mean" ) ground_truth <- SummarizedExperiment::assay(ground_truth_se, 1) # Edge mining: TP rates (returns list of dataframes) em <- edge_mining( consensus, ground_truth = ground_truth, query_edge_types = "TP" ) # Display results if (length(em) > 0) { print(head(em[[1]])) } } ``` # 5. Conclusion This case study illustrates how **scGraphVerse** enables end-to-end GRN reconstruction and validation in single-cell data. Users can swap inference algorithms, tune thresholds, and incorporate external prior networks. # Session Information ```{r sessioninfo} sessionInfo() ```