--- title: "SpaceMarkers Functions Step-by-Step" author: "Atul Deshpande" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: yes toc_float: toc_collapsed: true vignette: > %\VignetteIndexEntry{SpaceMarkers Functions Step-by-Step} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r global.options, include = FALSE} knitr::opts_knit$set( collapse = TRUE, comment = "#>", fig.align = 'center' ) knitr::opts_chunk$set( out.extra = 'style="display:block; margin:auto;"', warning = FALSE, message = FALSE, # Compact, HTML-friendly rendering. JPEG (rather than PNG) is the # right choice for the H&E histology overlays — photographic raster # compresses ~5x better as JPEG at quality 80, keeping the rendered # vignette under Bioconductor's 5 MB per-file size guideline. dev = "jpeg", dev.args = list(quality = 80), dpi = 72, fig.width = 5, fig.height = 4, fig.retina = 1 ) ``` # Overview This vignette mirrors the *SpaceMarkers with SpatialExperiment Objects* vignette but replaces the single `SpaceMarkers()` call with each constituent function called explicitly. Every step dispatches on the `SpaceMarkersExperiment` (SME) and returns an updated SME, so the analysis is a chain of SME-in / SME-out calls — no manual slot assignment or pre-extracted matrices required. The same breast cancer Visium dataset is used throughout. # Installation ```{r eval = FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SpaceMarkers") ``` ```{r message=FALSE, warning=FALSE} library(SpaceMarkers) ``` # Setup ## Links to Data ```{r} data_dir <- "visiumBrCA" unlink(file.path(data_dir), recursive = TRUE) dir.create(data_dir, showWarnings = FALSE) main_10xlink <- "https://cf.10xgenomics.com/samples/spatial-exp/1.3.0" counts_folder <- "Visium_Human_Breast_Cancer" counts_file <- "Visium_Human_Breast_Cancer_filtered_feature_bc_matrix.h5" counts_url <- paste(c(main_10xlink, counts_folder, counts_file), collapse = "/") sp_folder <- "Visium_Human_Breast_Cancer" sp_file <- "Visium_Human_Breast_Cancer_spatial.tar.gz" sp_url <- paste(c(main_10xlink, sp_folder, sp_file), collapse = "/") ``` ## Downloading Data Downloads are routed through `BiocFileCache` so the ~260 MB dataset is fetched once and reused across vignette builds and across the other SpaceMarkers vignettes. ```{r} bfc <- BiocFileCache::BiocFileCache(ask = FALSE) counts_cache <- BiocFileCache::bfcrpath(bfc, counts_url) sp_cache <- BiocFileCache::bfcrpath(bfc, sp_url) file.copy(counts_cache, file.path(data_dir, counts_file), overwrite = TRUE) untar(sp_cache, exdir = data_dir) ``` # Loading Data into a SpaceMarkersExperiment `load10X()` searches for expression and spatial files with flexible regex matching, log-transforms counts, and returns an SME with pre-computed kernel parameters in `spatial_params()`. ```{r message=FALSE, warning=FALSE} cogaps_result <- readRDS(system.file("extdata", "CoGAPS_result.rds", package = "SpaceMarkers", mustWork = TRUE)) sme <- load10X(data_dir, features = cogaps_result, method = "CoGAPS") sme ``` ## Filtering Retain a curated gene list and spots expressed in at least 3 cells: ```{r} data("curated_genes") good_gene_threshold <- 3 keep_genes <- rownames(sme)[ apply(assay(sme, "logcounts"), 1, function(x) sum(x > 0) >= good_gene_threshold) ] keep_genes <- intersect(keep_genes, curated_genes) sme <- sme[keep_genes, ] sme ``` # Accessors The SME holds everything the pipeline needs. A few accessors are worth knowing before walking through the steps: | Accessor | Returns | |----------|---------| | `assay(sme, "logcounts")` | Genes × spots expression matrix | | `spatialCoords(sme)` | Spot coordinates | | `spatial_patterns(sme)` | Pattern values per spot | | `spatial_params(sme)` | Pre-computed kernel parameters (2 × n) | The SME pipeline methods read these internally — you do not need to extract them by hand before each step. ```{r} spatial_params(sme) ``` # Undirected Analysis The undirected analysis identifies genes that are differentially expressed in the overlap zone between two patterns. We focus on Pattern\_1 (immune) and Pattern\_2, which show substantial spatial co-occurrence in this dataset. Every step of the pipeline is an SME-in / SME-out method: the SME is passed through a chain of step functions that each read their inputs from the object and write their outputs back to the appropriate slot. Writes happen via the setter generics (`hotspots<-`, `interactions<-`, `undirected_scores<-`, `overlap_scores<-`, `analysis_type<-`); no manual `@spacemarkers` or `metadata()` assignment is needed. We'll walk through the five pipeline steps one at a time, visualizing what each step produces. ## Step 0: Visualize the starting patterns Before running any pipeline step, inspect the CoGAPS patterns on the tissue. `plot_spatial()` resolves a feature name against `colData` (for patterns), `rownames(sme)` (for gene expression), or the hotspot metadata (via the `hotspot_type` argument). ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme, feature_col = "Pattern_1", title = "Pattern_1 (immune)") ``` And the second pattern we'll use for the undirected analysis: ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme, feature_col = "Pattern_2", title = "Pattern_2") ``` ## Step 1: Find hotspots `find_all_hotspots()` labels each spot as a hotspot (or `NA`) for every pattern, using a Gaussian-kernel null distribution with the pre-computed `spatial_params(sme)`. The result is stored in `metadata(sme)$hotspots$undirected` via the `hotspots<-` setter. ```{r message=FALSE, warning=FALSE} sme_ud <- find_all_hotspots(sme) ``` ```{r} # Fraction of spots classified as hotspots per pattern hs <- hotspots(sme_ud, "undirected") pat_cols <- setdiff(colnames(hs), c("barcode", "y", "x")) vapply(pat_cols, function(p) mean(!is.na(hs[[p]])), numeric(1)) ``` Plot the hotspot labels for each pattern — pass `source = "hotspots"` so `plot_spatial()` reads from the hotspots metadata rather than the continuous pattern column in `colData`. The `colors` argument controls fill: for categorical hotspot labels (`"hot"` / `NA`) `plot_spatial()` passes it into `ggplot2::scale_fill_manual()`, so a single color string maps the `"hot"` level to that color. Other aesthetic knobs: `point_size`, `stroke`, `alpha`, `bg_color`, `text_size`, and `crop`. ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_ud, feature_col = "Pattern_1", source = "hotspots", hotspot_type = "undirected", colors = "steelblue", title = "Pattern_1 hotspots") ``` …and the Pattern\_2 hotspots picked up by the same call (a single `find_all_hotspots()` populates hotspots for every pattern column): ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_ud, feature_col = "Pattern_2", source = "hotspots", hotspot_type = "undirected", colors = "firebrick", title = "Pattern_2 hotspots") ``` A combined interaction overlay classifies each hotspot spot as Pattern\_1 only, interacting (hot in both), or Pattern\_2 only. Pass `source = "interaction"` with `interaction_patterns = c("Pattern_1", "Pattern_2")` and a three-color vector to distinguish the three classes. ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_ud, source = "interaction", hotspot_type = "undirected", interaction_patterns = c("Pattern_1", "Pattern_2"), colors = c(Pattern_1 = "steelblue", interacting = "purple", Pattern_2 = "firebrick")) ``` ## Step 2: Calculate overlap scores Before scoring genes we already know where every pair of hotspots sits in space, so we can compute the pairwise spatial overlap. `calculate_overlap_undirected()` measures how much each pair's hotspots overlap (Szymkiewicz–Simpson by default); the resulting table is stored via `overlap_scores<-` and `analysis_type(sme)` is set to `"undirected"`. ```{r} sme_ud <- calculate_overlap_undirected(sme_ud) overlap_scores(sme_ud) analysis_type(sme_ud) ``` ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_overlap_scores(sme_ud) ``` The heatmap highlights which pattern pairs share territory; pairs with substantial overlap are the interesting candidates for the gene-level tests in the next step. ## Step 3: Find pairwise interacting genes `get_pairwise_interacting_genes()` tests each gene for differential expression between a pair's overlap zone and each pattern's exclusive zone. The `pattern_pairs` argument scopes the test to a single pair so the narrative stays focused; omit it to compute all pairs. The full per-gene result list is written to `metadata(sme)$interactions`. ```{r message=FALSE, warning=FALSE} sme_ud <- get_pairwise_interacting_genes( sme_ud, mode = "DE", analysis = "enrichment", minOverlap = 10, workers = 1, pattern_pairs = rbind(c("Pattern_1", "Pattern_2")) ) ``` ```{r} # Detailed per-gene statistics for the Pattern_1 x Pattern_2 pair names(interactions(sme_ud)) pair1 <- interactions(sme_ud)[[1]] if (length(pair1$interacting_genes) > 0) { head(pair1$interacting_genes[[1]]) } ``` ## Step 4: Calculate IM scores `get_im_scores()` aggregates the per-gene Kruskal–Wallis + Dunn's-test statistics from the previous step into a single score matrix (genes × pattern pairs), stored via `undirected_scores<-`. ```{r} sme_ud <- get_im_scores(sme_ud) head(undirected_scores(sme_ud)) ``` The top-ranked genes for the Pattern\_1 × Pattern\_2 interaction: ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} pair_col <- setdiff(colnames(undirected_scores(sme_ud)), "Gene")[1] plot_im_scores(sme_ud, interaction = pair_col, nGenes = 10) ``` Plot the top gene's spatial expression — the IM score is a summary of where a gene is differentially expressed in the overlap zone. ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} ims <- undirected_scores(sme_ud) top_gene <- ims[order(ims[[pair_col]], decreasing = TRUE), "Gene"][1] plot_spatial(sme_ud, feature_col = top_gene, title = sprintf("%s (top %s gene)", top_gene, pair_col)) ``` # Directed Analysis The directed analysis quantifies how much each pattern's *spatial influence* drives gene expression in the hotspot of the opposing pattern. We focus on Pattern\_1 (immune) → Pattern\_5 (cancer), matching the main vignette. The directed workflow uses Gaussian mixture model thresholds rather than the kernel null distribution used in the undirected workflow, so the hotspot step differs — you supply `type = "pattern"` (on the pattern values) or `type = "influence"` (on the smoothed influence map). ## Step 0: Visualize the starting patterns ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme, feature_col = "Pattern_1", title = "Pattern_1 (immune)") ``` The directed analysis focuses on how Pattern\_1 (immune) radiates into the Pattern\_5 (cancer) region, so we'll track the cancer pattern in the same way: ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme, feature_col = "Pattern_5", title = "Pattern_5 (cancer)") ``` ## Step 1: Calculate influence `calculate_influence()` applies kernel smoothing to each pattern column independently, producing per-spot values that describe how far each pattern "radiates" into the surrounding tissue. The result is a data.frame stored in `metadata(sme)$influence` via `influence_map<-`. ```{r message=FALSE, warning=FALSE} sme_dir <- calculate_influence(sme) head(influence_map(sme_dir)) ``` Plot the smoothed influence map for each pattern (`source = "influence_map"` makes `plot_spatial` read from `metadata(sme)$influence` instead of the continuous pattern column in `colData`): ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, feature_col = "Pattern_1", source = "influence_map", title = "Pattern_1 influence") ``` Pattern\_5's influence map, on the same tissue: ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, feature_col = "Pattern_5", source = "influence_map", title = "Pattern_5 influence") ``` ## Step 2: Find pattern hotspots `find_hotspots_gmm(type = "pattern")` fits a two-component Gaussian mixture model to each pattern's value distribution and labels spots above the boundary as hotspots. Stored under `metadata(sme)$hotspots$pattern`. ```{r message=FALSE, warning=FALSE} sme_dir <- find_hotspots_gmm(sme_dir, type = "pattern") ``` ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, feature_col = "Pattern_1", source = "hotspots", hotspot_type = "pattern", colors = "steelblue", title = "Pattern_1 GMM hotspots") ``` …and Pattern\_5's GMM hotspot, which defines the *target* region the directed analysis will test: ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, feature_col = "Pattern_5", source = "hotspots", hotspot_type = "pattern", colors = "firebrick", title = "Pattern_5 GMM hotspots") ``` ## Step 3: Find influence hotspots Running `find_hotspots_gmm(type = "influence")` applies the same GMM thresholding but to the smoothed influence map from Step 1 — spots under strong influence of each pattern. Stored under `metadata(sme)$hotspots$influence`. ```{r message=FALSE, warning=FALSE} sme_dir <- find_hotspots_gmm(sme_dir, type = "influence") ``` ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, feature_col = "Pattern_1", source = "hotspots", hotspot_type = "influence", colors = "steelblue", title = "Pattern_1 influence hotspots") ``` Pattern\_5's influence hotspot — the complementary *source* region used in the reverse-direction test (Pattern\_5 influencing Pattern\_1): ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, feature_col = "Pattern_5", source = "hotspots", hotspot_type = "influence", colors = "firebrick", title = "Pattern_5 influence hotspots") ``` With both pattern and influence hotspots in place, the directed interaction zone is well-defined. A directed analysis runs *two* independent t-tests per pair — one for each direction — so the per-spot classification is best read one direction at a time: - **Forward (Pattern\_1 → Pattern\_5):** inside Pattern\_1's GMM hotspot, split spots into `"Pattern_1"` (not in Pattern\_5's influence) and `"Pattern_1 near Pattern_5"` (in Pattern\_5's influence — the t-test's interaction zone). The "control" vs "treatment" groups the gene-level test compares. - **Reverse (Pattern\_5 → Pattern\_1):** the mirror — `"Pattern_5"` vs `"Pattern_5 near Pattern_1"`. Each plot also carries a `" influence"` context layer so the reach of the target's influence is visible beyond the source's hotspot. ```{r} labs_fwd <- overlap_map(sme_dir, c("Pattern_1", "Pattern_5"), direction = "forward") table(labs_fwd, useNA = "ifany") ``` ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, source = "interaction", interaction_patterns = c("Pattern_1", "Pattern_5"), direction = "forward", colors = c(Pattern_1 = "steelblue", "Pattern_1 near Pattern_5" = "purple", "Pattern_5 influence" = "gray60")) ``` ```{r} labs_rev <- overlap_map(sme_dir, c("Pattern_1", "Pattern_5"), direction = "reverse") table(labs_rev, useNA = "ifany") ``` ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_spatial(sme_dir, source = "interaction", interaction_patterns = c("Pattern_1", "Pattern_5"), direction = "reverse", colors = c(Pattern_5 = "firebrick", "Pattern_5 near Pattern_1" = "goldenrod", "Pattern_1 influence" = "gray60")) ``` ## Step 4: Directed overlap scores Before scoring genes we look at the pairwise directional overlap across every source → target pair. `calculate_overlap_directed()` quantifies how much each pattern's GMM hotspot overlaps with the opposing pattern's influence zone, stores the table via `overlap_scores<-`, and sets `analysis_type(sme)` to `"directed"`. ```{r} sme_dir <- calculate_overlap_directed(sme_dir) overlap_scores(sme_dir) analysis_type(sme_dir) ``` `plot_overlap_scores()` auto-detects whether the stored scores are undirected (columns `pattern1`, `pattern2`, `overlapScore`) or directed (columns `pattern`, `influence`, `relAbundance`) and picks the right axes and fill scale accordingly. ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} plot_overlap_scores(sme_dir, title = "Directed overlap (relative abundance)") ``` The heatmap flags which source → target pairs have the most directional interaction — good candidates for the per-gene tests in the next step. ## Step 5: Directed gene scores `calculate_gene_scores_directed()` scores each gene for each source → target pair, using the intersection of the target's GMM hotspot and the source's influence hotspot as the interaction zone. `pattern_pairs` restricts to the Pattern\_1 × Pattern\_5 pair. Results written via `directed_scores<-`. ```{r message=FALSE, warning=FALSE} sme_dir <- calculate_gene_scores_directed( sme_dir, pattern_pairs = rbind(c("Pattern_1", "Pattern_5")) ) ``` ```{r} head(directed_scores(sme_dir)) ``` The top gene for the strongest directed interaction, plotted on the tissue: ```{r, message=FALSE, warning=FALSE, dpi=60, fig.width=6} ds <- directed_scores(sme_dir) ci <- sort(unique(ds$cell_interaction))[1] top_gene <- ds[ds$cell_interaction == ci, ] top_gene <- top_gene[order(-top_gene$effect_size), "gene"][1] plot_spatial(sme_dir, feature_col = top_gene, title = sprintf("%s (top directed gene, %s)", top_gene, ci)) ``` # Putting it together Once the individual steps make sense, the whole analysis is a single pipe chain: ```{r eval=FALSE} sme_ud <- sme |> find_all_hotspots() |> calculate_overlap_undirected() |> get_pairwise_interacting_genes( mode = "DE", analysis = "enrichment", minOverlap = 10, workers = 1, pattern_pairs = rbind(c("Pattern_1", "Pattern_2")) ) |> get_im_scores() sme_dir <- sme |> calculate_influence() |> find_hotspots_gmm(type = "pattern") |> find_hotspots_gmm(type = "influence") |> calculate_overlap_directed() |> calculate_gene_scores_directed( pattern_pairs = rbind(c("Pattern_1", "Pattern_5")) ) ``` # Cleanup ```{r} unlink(file.path(data_dir), recursive = TRUE) ``` # Session Info ```{r} sessionInfo() ```