--- title: "Measuring tissue specificity from single cell data with SPICEY" author: "Georgina Fuentes-Páez" output: BiocStyle::html_document: fig_width: 7 fig_height: 5 vignette: > %\VignetteIndexEntry{Measuring tissue specificity from single cell data with SPICEY} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, warning = FALSE, message = FALSE, fig.align = "center", out.width = "80%" ) ``` ```{r logo, echo=FALSE, eval=TRUE, out.width='10%'} knitr::include_graphics("../man/figures/SPICEY_LOGO.svg", dpi = 800) ``` # Introduction SPICEY is an R package for quantifying cell-type specificity from single-cell omics data. It supports inputs from common analysis frameworks such as `Seurat` (for single-cell RNA-seq) and `Signac` (for single-cell ATAC-seq), and provides a unified interface for computing cell-type-specificity metrics at both the regulatory and transcriptional levels. ## Preamble SPICEY is designed to work with the results of differential analyses performed on single-cell ATAC-seq (to assess chromatin accessibility) or single-cell RNA-seq (to measure gene expression). These analyses typically compare features—either regulatory elements (e.g., peaks) or genes—across cell types to identify cell-type-specific activity or expression. Rather than performing differential testing itself, SPICEY leverages pre-computed statistics from such analyses, specifically the **log2 fold-change** and associated **adjusted p-values**, which reflect both the magnitude and significance of cell-type enrichment. Using these inputs, SPICEY computes composite specificity scores that capture both the intensity and selectivity of differential signal across cell types. In this tutorial, we illustrate the application of SPICEY on single-cell datasets derived from non-diabetic human pancreatic islets obtained through the Human Pancreas Analysis Program (HPAP) @ref. We analyzed paired single-nucleus ATAC-seq and RNA-seq data from three donors with similar total cell counts and comparable distributions of major pancreatic islet cell types. To associate distal regulatory elements with their putative target genes, we used co-accessibility links inferred via the `Cicero` package. Differential expression and accessibility analyses were conducted using `Seurat` (`FindMarkers()`) function on both RNA and ATAC assays. All genes and regions were considered (`logfc.threshold = 0`), and the **Wilcoxon Rank Sum test** was used (`test.use = "wilcox"`), following `Seurat` default parameters. ## Cell-Type Specificity Indices: RETSI and GETSI SPICEY defines two complementary metrics to quantify cell-type specificity: - **RETSI** (Regulatory Element Tissue Specificity Index): Evaluates the specificity of regulatory elements (e.g., ATAC-seq peaks) across different cell types. - **GETSI** (Gene Expression Tissue Specificity Index): Evaluates the specificity of gene expression patterns across cell types. Both RETSI and GETSI are calculated using a weighted specificity score for each feature (regulatory region or gene), reflecting the level of enrichment in each cell type. These specificity indices are computed as follows: $$ TSI_{i,x} = \frac{FC_{i,x}}{\max(FC)_x} \cdot w_{i} $$ $$ w_{i} = \text{rescale}\left( -\log_{10}(padj_{i}),\ [0, 1] \right) $$ where: - $i$ is a specific cell type - $x \in \{r, g\}$, with $r$ for regulatory region (RETSI) and $g$ for gene (GETSI) - $\text{FC}_{i,x}$ is the log fold-change of feature $x$ in cell type $i$ - $w_{i}$ is the weight term, defined as the rescaled value of $-\log_{10}(\text{adjusted } p\text{-value})$ between 0 and 1 SPICEY offers two strategies for linking regulatory elements to their putative target genes: 1. **Nearest-gene assignment:** based on proximity to the closest transcription start site (TSS). 2. **Co-accessibility-based linking:** leveraging correlations in chromatin accessibility (e.g., via Cicero). These linking methods enable the integration of regulatory specificity (RETSI) and gene expression specificity (GETSI), facilitating the construction of detailed cell-type-specific regulatory networks from single-cell data. **Note:** This vignette illustrates the use of SPICEY with preprocessed `GRanges` inputs, which must be prepared externally as they are **not** provided within the package. ## Entropy-based Specificity Indices In addition to fold change–based metrics, SPICEY incorporates an entropy-based approach to quantify the distributional concentration or distributional skewness of feature activity across cell types using **Shannon entropy**. For each feature $x$ (e.g., gene or regulatory region), the activity or expression scores across $N$ cell types are normalized into a probability distribution. This allows measuring how specific or broadly distributed a feature's activity is across cell types. Lower entropy values indicate higher cell-type specificity, while higher entropy reflects more ubiquitous expression or accessibility. These entropy-based indices are computed as follows: $$ H(x) = - \sum_{i=1}^{N} p_{i,x} \log_2(p_{i,x}) $$ $$ H_{\text{norm}}(x) = 1 - e^{-H(x)} $$ where: $$ p_{i,x} = \frac{s_{i,x}}{\sum_{j=1}^{N} s_{j,x}} $$ being: - $s_{i,x}$ is the specificity score (e.g., RETSI or GETSI) of feature $x$ in cell type $i$ - $p_{i,x}$ is the normalized score, interpreted as the probability of feature $x$'s activity in cell type $i$ - $H(x)$ is the Shannon entropy, quantifying signal dispersion across cell types - $H_{\text{norm}}(x)$ is the normalized entropy, bounded between 0 and 1: - Values close to 0 indicate strong specificity - Values close to 1 indicate broad distribution
This entropy framework provides a robust and interpretable measure of cell-type specificity that is complementary to fold change–based approaches. # Installation Install the latest release of SPICEY from Bioconductor: ```{r install, eval=FALSE, echo=TRUE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SPICEY") ``` # Input Requirements ```{r load-spicey} library(SPICEY) ``` SPICEY requires preprocessed input files generated from differential analyses of single-cell omics data, specifically chromatin accessibility (ATAC-seq) and/or gene expression (RNA-seq). These inputs are **not included** in the package and must be prepared externally. Below is a detailed overview of the required input formats and metadata. ## Single-cell ATAC-seq Data - **Accepted Formats:** - A **named list** of `GRanges` objects or `data.frame`s (convertible to `GRanges`), where each element contains differential accessibility results for a specific cell type.\ - A `GRangesList` with each element representing a distinct cell type. All inputs must be **named** with cell type identifiers. - **Required Columns (per element):** - `region_id` — Unique peak identifier matching the format in your ATAC data (e.g., `chr1-5000-5800` or `chr1:5000-5800`). - `avg_log2FC` — Average log2 fold-change of accessibility in the given cell type. - `p_val_adj` — Adjusted p-value from the differential test. - `cell_type` — Cell type or cluster label for each measurement. Each peak should have multiple entries, one per cell type, allowing SPICEY to assess cell-type-specific accessibility patterns. ```{r da-atac, message=FALSE, warning=FALSE} data("atac") ``` ## Single-cell RNA-seq Data - **Accepted Formats:** - A **named list** of `GRanges` objects or `data.frame`s (convertible to `GRanges`), with each element containing differential expression results for a specific cell type.\ - A `GRangesList`, where each element corresponds to a distinct cell type. The list or `GRangesList` must be **named** with cell type identifiers. - **Required Columns (per element):** - `gene_id` — Gene identifier column, consistent across all entries, using *official gene symbols* (e.g., `GAPDH`). - `avg_log2FC` — Average log2 fold-change in expression for the gene within the specific cell type. - `p_val_adj` — Adjusted p-value from the differential test. - `cell_type` — Cell type or cluster label associated with the measurement. Each gene should have multiple entries—one per cell type—enabling SPICEY to evaluate expression specificity across cell populations. ```{r da-rna, message=FALSE, warning=FALSE} data("rna") ``` ## Co-accessibility Links (Optional for Region-to-Gene Linking) - **Purpose:**\ Link regulatory elements to putative target genes based on co-accessibility, typically derived from tools like `Cicero`. This input is required only when using the `annotation` parameter for RETSI/GETSI integration. - **Format:**\ A `data.frame` containing co-accessibility pairs and their scores. - **Required Columns:** - `Peaks1` — Genomic coordinate or ID of the first peak in the pair. - `Peaks2` — Genomic coordinate or ID of the second peak. - `coaccess` — Co-accessibility score or correlation value between the paired peaks. **Note:** Peak identifiers in `Peaks1` and `Peaks2` must exactly match those in the ATAC input, including coordinate formatting (e.g., `chr1:5000-6000` vs. `chr1-5000-6000`). Inconsistent formats will disrupt accurate annotation. ```{r links, message=FALSE, warning=FALSE} data("cicero_links") head(cicero_links) ``` # Quick example SPICEY offers a comprehensive main function, `SPICEY()`, which streamlines the entire workflow into a single, reproducible process. It also includes optional helper functions for region-to-gene annotation using different strategies: `annotate_with_coaccessibility()` and `annotate_with_nearest()`. When supplied with differential chromatin accessibility (ATAC) and/or gene expression (RNA) data from single-cell experiments, along with a chosen region-to-gene mapping method ("nearest" or "coaccessibility"), the `SPICEY()` function performs the following: 1. Computes tissue specificity scores from single-cell data: - RETSI (Regulatory Element Tissue Specificity Index) for ATAC data - GETSI (Gene Expression Tissue Specificity Index) for RNA data 2. (**Optional**) Annotates regulatory elements to target genes using either: - Nearest gene transcription start site (TSS) (e.g., +/- 2000bp around the TSS) - Co-accessibility links (e.g., derived from Cicero) 3. Builds tissue-specific regulatory networks by integrating RETSI and GETSI scores via the provided region-to-gene annotation. The output is a structured list containing all intermediate and final results, facilitating downstream analysis and reuse. ```{r quick-example, fig.width=2, fig.height=3} # Load necessary packages library(dplyr) library(GenomicRanges) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) library(SPICEY) data(atac) data(rna) data(cicero_links) # Annotate peaks to genes with coaccessibility peaks <- unique(unlist(atac)[, c("region_id")]) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) # Calculate SPICEY measures and link them with coaccessibility spicey_coacc <- SPICEY( rna = rna, atac = atac, annotation = annotation_coacc ) # Plot results spicey_heatmap(spicey_coacc$linked, spicey_measure = "SPICEY", combined_zscore = TRUE ) ``` # Example: Step-by-Step SPICEY Workflow This section outlines how to execute SPICEY using preprocessed input files, as detailed previously. ## Computing RETSI To calculate the **Regulatory Element Tissue Specificity Index (RETSI)**, the following inputs are required: - **`atac`**: A named list of `GRanges` objects or `data.frame`s (convertible to `GRanges`), each representing differential accessibility results for a distinct cell type. SPICEY returns a `data.frame` combining the input differential accessibility data with RETSI scores for each regulatory region. **Note:** Confirm that your inputs comply with the format outlined in the *Input Requirements* section prior to execution. ```{r retsi, message=FALSE, warning=FALSE} retsi <- SPICEY(atac = atac) head(retsi) ``` ## Computing GETSI To calculate the **Gene Expression Tissue Specificity Index (GETSI)**, the following inputs are required: - **`rna`**: A named list of `GRanges` objects or `data.frame`s (convertible to `GRanges`), each containing differential gene expression results for a specific cell type. SPICEY outputs a `data.frame` that integrates the input differential expression data with computed GETSI scores for each gene. **Note:** Verify that your inputs adhere to the format described in the *Input Requirements* section before execution. ```{r getsi, message=FALSE, warning=FALSE} getsi <- SPICEY(rna = rna) head(getsi) ``` ## Computing SPICEY SPICEY offers an integrated approach to quantify tissue specificity at both the epigenetic and transcriptional levels by simultaneously computing Regulatory Element Tissue Specificity Indices (RETSI) and Gene Expression Tissue Specificity Indices (GETSI) from single-cell data. By supplying both `atac` and `rna` inputs together to the `SPICEY()` function, the tool calculates RETSI and GETSI scores and returns a list with each specificity measure provided separately. ```{r spicey, message=FALSE, warning=FALSE} spicey <- SPICEY( atac = atac, rna = rna ) lapply(spicey, head) ``` ## Building Tissue-Specific Regulatory Networks SPICEY facilitates the construction of tissue-specific regulatory networks by integrating RETSI and GETSI scores through the association of regulatory elements with their putative target genes. This linkage connects each regulatory region’s cell type-specific RETSI score to the corresponding gene’s GETSI score, enabling comprehensive downstream network analyses. To establish these links, regulatory regions must be annotated to target genes. SPICEY supports two annotation strategies: 1. **Proximity-based annotation** — assigning regions to the nearest transcription start site (TSS).\ 2. **Co-accessibility-based annotation** — leveraging promoter–enhancer co-accessibility relationships derived from snATAC-seq data (e.g., via `Cicero`). ## Annotating Regions to Putative Target Genes (Optional) #### Nearest Gene Annotation This method links each regulatory region to its closest gene TSS. ##### Required inputs: - `peaks`: A `GRanges` object or `data.frame` containing peaks to annotate, including these columns: - `seqnames`: Chromosome name (e.g., chr1) - `start`: Peak start coordinate - `end`: Peak end coordinate - `region_id`: Unique region identifier in genomic coordinate format (e.g., `chr1-5000-5800` or `chr1:5000-5800`) - `txdb`: A `TxDb` object for genome annotation (e.g., `TxDb.Hsapiens.UCSC.hg38.knownGene`). - `annot_dbi`: An `AnnotationDbi` object for gene metadata (e.g., `org.Hs.eg.db`). - `upstream`: Number of bases upstream from the TSS to consider (default: 2000). - `downstream`: Number of bases downstream from the TSS to consider (default: 2000). ##### Optional parameters: - `protein_coding_only`: Limit annotations to protein-coding genes (default: `TRUE`). - `add_tss_annotation`: Annotate regulatory elements overlapping the TSS (default: `FALSE`). If enabled, uses ±1 bp around TSS instead of ±2000 bp. - `verbose`: Display progress messages (default: `TRUE`). ```{r re-gene-nearest, message=FALSE, warning=FALSE} peaks <- unique(unlist(atac)[, c("region_id")]) annotation_near <- annotate_with_nearest( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) head(annotation_near) ``` This function returns a data frame linking each regulatory region to its nearest gene based on genomic proximity. #### Annotate to co-accessible Genes This method connects regulatory regions to genes through co-accessibility interactions, which must be computed externally (e.g., using **Cicero**). ##### Required inputs: - `peaks`: A `GRanges` object or `data.frame` of peaks to annotate, including the following columns: - `seqnames`: Chromosome name (e.g., chr1) - `start`: Peak start coordinate - `end`: Peak end coordinate - `region_id`: Unique region identifier in genomic coordinate format (e.g., `chr1-5000-5800` or `chr1:5000-5800`) - `txdb`: A `TxDb` object for genome annotation (e.g., `TxDb.Hsapiens.UCSC.hg38.knownGene`). - `annot_dbi`: An `AnnotationDbi` object for gene metadata (e.g., `org.Hs.eg.db`). - `links_df`: A `data.frame` containing co-accessibility links or scores (e.g., from Cicero), with these required columns: - `Peaks1`: Genomic coordinate or ID of the first peak in the pair - `Peaks2`: Genomic coordinate or ID of the second peak linked by co-accessibility - `coaccess`: Co-accessibility score or correlation value between peaks - `upstream`: Number of bases upstream from the TSS to consider (default: 2000). - `downstream`: Number of bases downstream from the TSS to consider (default: 2000). ##### Optional parameters: - `protein_coding_only`: Limit annotations to protein-coding genes (default: `TRUE`). - `add_tss_annotation`: Annotate regulatory elements overlapping the TSS (default: `FALSE`). If enabled, uses ±1 bp around TSS instead of ±2000 bp. - `verbose`: Display progress messages (default: `TRUE`). ```{r re-gene-coaccessibility, message=FALSE, warning=FALSE} peaks <- unique(unlist(atac)[, c("region_id")]) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) head(annotation_coacc) ``` This function returns a data frame linking each regulatory region to its associated gene based on co-accessibility relationships. #### Using custom region-to-gene annotation The annotation step(s) are **optional**. Users can either generate these annotations using SPICEY’s built-in functions or supply a pre-computed `data.frame` of gene–region associations. When using pre-computed data to construct tissue-specific regulatory networks, the `data.frame` must include at minimum the following columns: - `region_id`: Unique identifier for each region, ideally following standard genomic coordinate formats such as *chr-start-end* (e.g., `chr1-5000-5800`) or *chr:start-end* (e.g., `chr1:5000-5800`). It should be the same being provided to `SPICEY` as `region_id`. - `gene_id`: Unique gene identifier. It should be the same being provided to `SPICEY` as `gene_id`. ### Integrating SPICEY measures to build tissue-specific networks With the annotation linking regulatory elements to their target genes (via TSS proximity or co-accessibility), SPICEY enables integration of RETSI and GETSI scores. This links each regulatory region’s cell type-specific RETSI score with its target gene’s corresponding GETSI score, facilitating the construction of detailed tissue-specific regulatory networks. This approach supports comprehensive analysis of regulatory mechanisms at both epigenetic and transcriptional levels. ```{r spicey_link_annot, message=FALSE, warning=FALSE} # Using nearest gene annotation data spicey_near <- SPICEY( rna = rna, atac = atac, annotation = annotation_near ) # Using co-accessibility gene annotation data spicey_coacc <- SPICEY( rna = rna, atac = atac, annotation = annotation_coacc ) ``` This function returns a list containing separately computed RETSI and GETSI scores, along with an additional element that links these measures based on the selected region-to-gene annotation method. # Output Description The primary output is a `data.frame` or a list of `data.frames` representing genomic regions (e.g., enhancers or promoters), each annotated with rich metadata derived from chromatin accessibility and gene expression data. Key columns include: ### SPICEY Measures | Column | Description | |----|----| | `cell_type` | Cell type exhibiting significant accessibility or expression (e.g., `"Acinar"`). | | `p_val_adj` | Adjusted p-value from differential testing, controlling for multiple comparisons. | | `avg_log2FC` | Average log2 fold-change of accessibility/expression in the specified cell type vs others. | | `region_id` | Genomic interval identifier (e.g., `"chr7:142748901-142749963"`). | | `RETSI` | Regulatory Element Tissue Specificity Index; values near 1 indicate high cell-type specificity. | | `RETSI_entropy` | Entropy-based specificity metric; lower values indicate greater cell-type specificity. | | `gene_id` or `nearestGeneSymbol` | Gene identifier (official gene symbol). | | `GETSI` | Gene Expression Tissue Specificity Index for linked genes, indicating expression specificity. | | `GETSI_entropy` | Entropy-based specificity metric for gene expression, analogous to `RETSI_entropy`. | ### Linked SPICEY Measures | Column | Description | |----|----| | `cell_type` | Cell type showing significant accessibility (e.g., `"Acinar"`). | | `p_val_adj_ATAC` | Adjusted p-value from differential accessibility testing. | | `avg_log2FC_ATAC` | Average log2 fold-change in accessibility for the cell type versus others. | | `region_id` | Genomic interval identifier (e.g., `"chr7:142748901-142749963"`). | | `RETSI` | Regulatory Element Tissue Specificity Index. | | `RETSI_entropy` | Entropy-based specificity measure for accessibility. | | `gene_id` or `nearestGeneSymbol` | Gene identifier (official gene symbol). | | `p_val_adj_RNA` | Adjusted p-value from differential expression testing. | | `avg_log2FC_RNA` | Average log2 fold-change in gene expression for the cell type versus others. | | `pct.1` | Percentage of cells expressing the gene in the first group of the differential analysis. | | `pct.2` | Percentage of cells expressing the gene in the second group. | | `GETSI` | Gene Expression Tissue Specificity Index for linked genes. | | `GETSI_entropy` | Entropy-based specificity metric for gene expression. | This annotated `data.frame` serves as a comprehensive resource combining genomic coordinates, cell type-specific regulatory activity, gene associations, and expression specificity—enabling advanced functional genomics analyses. # Visualizing cell-type specificity The `SPICEY` package provides a flexible function for visualizing gene-regulatory-level specificity across cell types using the `spicey_heatmap()` function. This function supports visualizing: - **RETSI** - **GETSI** - **SPICEY**, which is the combined score of both RETSI and GETSI. These visualizations help uncover transcriptional and regulatory programs that are specific to individual cell types and can assist in identifying marker genes or highly specific regulatory programs. The `spicey_heatmap()` function performs the following steps: 1. **Filters and scales the input**: Computes z-scores for the selected specificity index (`RETSI`, `GETSI`, or a combined SPICEY score). 2. **Selects top genes**: For each cell type, identifies the top `n` genes based on their z-score ranks. 3. **Generates a heatmap**: Creates a gene-by-cell-type heatmap. Genes are ordered by their highest-scoring cell type and their z-score magnitude. Depending on the `spicey_measure` and `combined_zscore` arguments, the function adapts the output: - **Single-modality visualization:** - `spicey_measure = "RETSI"`: Plots z-scored RETSI values, where RETSI reflects cell-type-specific accessibility of regulatory regions linked to genes. - `spicey_measure = "GETSI"`: Plots z-scored GETSI values based on gene expression. - **Dual-modality (SPICEY) visualization:** - `spicey_measure = "SPICEY"` and `combined_zscore = FALSE`: Displays two separate heatmaps -one for RETSI, one for GETSI - enabling side-by-side comparison. - `spicey_measure = "SPICEY"` and `combined_zscore = TRUE`: RETSI and GETSI are independently z-scored and averaged per gene-cell-type pair. A single unified heatmap is generated to highlight shared specificity across both data types. ### Required inputs: To run `spicey_heatmap()`, the input must be a `data.frame` with the following columns: - `gene_id`: - For **RETSI**: The gene annotated to the given regulatory region (e.g., from co-accessibility or distance to the nearest TSS). - For **GETSI**: The expressed gene for which expression specificity is measured. - `cell_type`: Cell type or cluster label (e.g., `"Acinar"`). - `GETSI` and/or `RETSI` score. ### Arguments - `df`: A `data.frame` containing the columns described above. - `top_n`: Number of top-ranked genes to retain per cell type (default: `5`). - `spicey_measure`: Score type to visualize. Must be one of `"RETSI"`, `"GETSI"`, or `"SPICEY"`. - `combined_zscore`: Logical. If `TRUE` and `spicey_measure = "SPICEY"`, computes a combined z-score across RETSI and GETSI (default: `FALSE`). ```{r spicey-heatmap-sep, message=FALSE, warning=FALSE, fig.width=2, fig.height=3} # RETSI and/or GETSI separately spicey_heatmap(spicey_coacc$GETSI, spicey_measure = "GETSI" ) # If only RETSI is calculated (only ATAC data available), join with coaccessibility or nearest gene annotation to show gene_id in the plots retsi <- spicey_coacc$RETSI retsi <- retsi |> left_join(annotation_near, by = c("region_id")) spicey_heatmap(retsi, spicey_measure = "RETSI" ) ``` ```{r spicey-heatmap-pair, message=FALSE, warning=FALSE, fig.width=3.5, fig.height=4.5} spicey_heatmap(spicey_coacc$linked, spicey_measure = "SPICEY", combined_zscore = FALSE ) ``` ```{r spicey-heatmap-comb, message=FALSE, warning=FALSE, fig.width=2, fig.height=3} # SPICEY: combined average RETSI and GETSI scores spicey_heatmap(spicey_coacc$linked, spicey_measure = "SPICEY", combined_zscore = TRUE ) ``` # References # SessionInfo ```{r sessionInfo} sessionInfo() ```