Introduction
Splikit is a comprehensive R package designed for
analyzing alternative splicing events in single-cell RNA sequencing
data. Developed by the Computational and Statistical Genomics Laboratory
at McGill University, splikit provides a complete toolkit for processing
STARsolo splicing junction output, identifying highly variable splicing
events, and performing downstream analyses.
What Does Splikit Do?
Splikit transforms raw splicing junction data into meaningful
biological insights through several key capabilities:
- Junction Data Processing: Converts STARsolo
splicing junction output into structured matrices suitable for
analysis
- Inclusion/Exclusion Matrix Generation: Creates M1
(inclusion) and M2 (exclusion) matrices that capture splicing behavior
across cells
- Variable Event Detection: Identifies highly
variable splicing events using advanced statistical methods
- Gene Expression Integration: Processes standard
gene expression data alongside splicing information
- Statistical Analysis: Provides tools for
correlation analysis, variance computation, and clustering evaluation
How Does Splikit Derive These Data?
We grouped splice junctions according to
the coordinates of their intronic regions into “local junction variants”
(LJVs). An LJV is defined as a set of junctions that share either the
first or the last coordinate.
For each junction within an LJV, we extracted the junction abundance
count for every cell, which we refer to as the M1
count. The M2 count for a given junction in
each cell is defined as the sum of the counts of its alternative
junctions—that is, all other junctions within the same LJV. In other
words, for each junction and cell, the number of reads supporting that
junction is contrasted with the total number of reads supporting its
alternative junctions. For example, if \((
M1_{ij} )\) represents the count for junction \(j\) in cell \(i\), then the M2 count can be written
as:
M2_{ij} = \sum_{\substack{k \in J \\ k \neq j}} M1_{ik}
where \(J\) denotes the set of all
junctions in the LJV. The grouping method is determined by whether the
first or the last coordinate is used, with -E or -S
appended to the event IDs accordingly. This approach may
result in a single junction receiving two different measurements in the
M2 counts (in the exclusion matrix) while having identical measurements
in the M1 matrix (inclusion matrix).
We also address sample-specific junctions. If a junction is present
in only a subset of samples, a vector of zeros is assigned to the M1
matrix for the samples where the junction is absent. The M2 measurements
are then computed accordingly. This figure illustrates two example
LJVs.
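To make the M1/M2 construction concrete, here is a minimal base-R sketch for a single hypothetical LJV (toy numbers, not package code):

```r
# Toy M1 counts: 3 junctions (rows) in one LJV x 2 cells (columns)
m1 <- matrix(c(5, 2,
               1, 0,
               4, 3),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("junc", 1:3), c("cellA", "cellB")))

# M2 for each junction = per-cell LJV total minus the junction's own count
ljv_totals <- colSums(m1)             # total splicing activity per cell
m2 <- sweep(-m1, 2, ljv_totals, "+")  # total - own count, per column
m2
```

For junc1 in cellA, M2 = 1 + 4 = 5, i.e., the reads supporting its alternative junctions within the LJV.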

Traditional single-cell RNA-seq analysis focuses primarily on gene
expression levels, but alternative splicing represents an additional
layer of cellular complexity. Splikit enables researchers to:
- Capture Splicing Diversity: Identify
cell-type-specific splicing patterns that might be missed by gene
expression alone
- Integrate Multiple Modalities: Combine splicing and
expression data for comprehensive cellular characterization
- Scale Efficiently: Handle large single-cell
datasets with optimized C++ implementations
- Ensure Reproducibility: Follow standardized
workflows with well-documented functions
Core Concepts
Understanding splikit requires familiarity with several key concepts
in splicing analysis:
Splicing Junction Events
A splicing junction represents the connection between two exons after
intron removal. In single-cell data, we observe:
- Inclusion Events: When a particular exon or
junction is included in the mature transcript
- Exclusion Events: When a particular exon or
junction is skipped during splicing
Matrix Representations
Splikit uses two primary matrix types:
M1 Matrix (Inclusion Matrix):
- Rows represent splicing events
- Columns represent individual cells
- Values indicate the number of reads supporting inclusion of each event
M2 Matrix (Exclusion Matrix):
- Same dimensions as M1
- Values indicate the number of reads supporting exclusion of each event
- Calculated based on the total splicing activity at shared coordinate groups
Coordinate Groups
Splikit groups splicing junctions that share start or end
coordinates, enabling the calculation of exclusion events. This grouping
is essential because:
- Multiple junctions can share the same donor (5’) or acceptor (3’)
site
- The total splicing activity at these sites helps quantify exclusion
events
- Groups with only one member cannot generate exclusion events
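The grouping step can be sketched with data.table (the table and column names below are illustrative, not splikit’s internal schema):

```r
library(data.table)

# Hypothetical junctions; three share a donor (start) coordinate
jx <- data.table(
  junction = c("j1", "j2", "j3", "j4"),
  start    = c(100, 100, 100, 500),
  end      = c(200, 300, 400, 600)
)

# Group by shared start coordinate; the "_S" suffix mirrors the
# event-naming convention described in this manual
jx[, event_id := paste0("chr1:", start, "_S")]
jx[, group_size := .N, by = event_id]

# Singleton groups (like j4's) cannot generate exclusion events
jx[group_size > 1]
```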
Variable Event Detection
The package identifies highly variable splicing events using
statistical methods:
- Deviance-based Approach: Models splicing ratios
using beta-binomial or similar distributions
- Variance Stabilization: Applies transformations to
identify events with genuine biological variability
- Library-wise Calculation: Accounts for technical
variation between samples or libraries
R6 Object-Oriented Interface
Starting with version 2.0.0, splikit introduces a modern R6
object-oriented interface through the SplikitObject class.
This provides a more intuitive and streamlined workflow with method
chaining support.
Creating a SplikitObject
library(splikit)
# Load toy data for demonstration
toy_sj_data <- load_toy_SJ_object()
# Create SplikitObject from junction abundance data
spl <- SplikitObject$new(junction_ab_object = toy_sj_data)
# View object summary
print(spl)
Complete Workflow with Method Chaining
The R6 interface supports method chaining for a clean, pipeline-style
workflow:
# Complete analysis pipeline with method chaining
spl <- SplikitObject$new(junction_ab_object = toy_sj_data)$
makeM1(min_counts = 1)$
makeM2(n_threads = 4, verbose = TRUE)$
findVariableEvents(min_row_sum = 30, n_threads = 4)
# Access results
m1_matrix <- spl$m1
m2_matrix <- spl$m2
variable_events <- spl$variableEvents
event_data <- spl$eventData
Step-by-Step Workflow
For more control, you can call methods individually:
# Create object
spl <- SplikitObject$new(junction_ab_object = toy_sj_data)
# Step 1: Create M1 matrix
spl$makeM1(min_counts = 1, verbose = TRUE)
print(dim(spl$m1)) # Check dimensions
# Step 2: Create M2 matrix with multi-threading
spl$makeM2(
n_threads = 4, # Number of OpenMP threads
use_cpp = TRUE, # Use optimized C++ implementation
verbose = TRUE
)
print(dim(spl$m2)) # Same dimensions as M1
# Step 3: Find highly variable events
spl$findVariableEvents(
min_row_sum = 30,
n_threads = 4
)
# View top variable events
top_events <- spl$variableEvents[order(-sum_deviance)][1:10]
print(top_events)
Subsetting SplikitObjects
SplikitObjects support flexible subsetting by events, cells, or
both:
# Subset by events (rows)
top_100_events <- spl$variableEvents$events[1:100]
spl_subset <- spl$subset(events = top_100_events)
# Subset by cells (columns)
selected_cells <- colnames(spl$m1)[1:500]
spl_cells <- spl$subset(cells = selected_cells)
# Subset both events and cells
spl_combined <- spl$subset(
events = top_100_events,
cells = selected_cells
)
# Check dimensions
print(dim(spl_subset$m1))
print(dim(spl_cells$m1))
print(dim(spl_combined$m1))
Working with Existing Matrices
You can also create a SplikitObject from pre-computed matrices:
# Load pre-computed M1/M2 data
toy_m1m2 <- load_toy_M1_M2_object()
# Create SplikitObject from existing matrices
spl <- SplikitObject$new(
m1_matrix = toy_m1m2$m1,
m2_matrix = toy_m1m2$m2
)
# Directly find variable events (M1 and M2 already computed)
spl$findVariableEvents(min_row_sum = 50, n_threads = 4)
Accessing SplikitObject Fields
The SplikitObject provides access to all data through fields:
# Core matrices
spl$m1 # Inclusion matrix (events × cells)
spl$m2 # Exclusion matrix (events × cells)
# Metadata
spl$eventData # Event metadata data.table
spl$variableEvents # Variable events results
# Original data
spl$junctionAbObject # Raw junction abundance data
R6 vs Function-Based Interface Comparison
Both interfaces produce identical results. Choose based on your
workflow preference:
R6 Interface (recommended for new users):
spl <- SplikitObject$new(junction_ab_object = toy_sj_data)$
makeM1(min_counts = 1)$
makeM2(n_threads = 4)$
findVariableEvents(min_row_sum = 30)
m1 <- spl$m1
m2 <- spl$m2
hve <- spl$variableEvents
Function-Based Interface (traditional):
m1_result <- make_m1(toy_sj_data, min_counts = 1)
m2 <- make_m2(m1_result$m1_inclusion_matrix, m1_result$event_data, n_threads = 4)
hve <- find_variable_events(m1_result$m1_inclusion_matrix, m2, min_row_sum = 30)
Main Workflow Functions
make_junction_ab()
Purpose: Processes STARsolo splicing junction output
into structured R objects.
Function Signature:
make_junction_ab(STARsolo_SJ_dirs, white_barcode_lists = NULL,
sample_ids, use_internal_whitelist = TRUE,
verbose = FALSE, keep_multi_mapped_junctions = FALSE, ...)
Parameters:
STARsolo_SJ_dirs: Character vector of paths to STARsolo
SJ directories
white_barcode_lists: Optional list of barcode
whitelists for filtering
sample_ids: Unique identifiers for each sample
use_internal_whitelist: Whether to use STARsolo’s
internal filtered barcodes (default TRUE)
verbose: Logical for detailed progress reporting
(default FALSE)
keep_multi_mapped_junctions: Whether to keep
multi-mapped junctions (default FALSE)
...: Additional parameters for future extensions
Expected Directory Structure:
STARsolo_output/
├── SJ/
│ ├── raw/
│ │ ├── matrix.mtx # Junction count matrix
│ │ ├── barcodes.tsv # Cell barcodes
│ │ └── features.tsv # Junction features
│ └── filtered/
│ └── barcodes.tsv # Filtered barcodes
└── SJ.out.tab # STAR junction output
Returns: A named list where each element corresponds
to a sample and contains:
- eventdata: Data.table with junction metadata (chromosome, coordinates, strand)
- junction_ab: Sparse matrix of junction abundances (junctions × cells)
Example:
# Single sample processing
sj_dirs <- "/path/to/STARsolo/output/SJ"
sample_names <- "Sample1"
junction_data <- make_junction_ab(
STARsolo_SJ_dirs = sj_dirs,
sample_ids = sample_names,
use_internal_whitelist = TRUE
)
# Multiple samples with custom barcodes
sj_dirs <- c("/path/to/sample1/SJ", "/path/to/sample2/SJ")
sample_names <- c("Control", "Treatment")
custom_barcodes <- list(
c("AAACCTGAGCATCGCA", "AAACCTGAGCGAACGC"), # Control barcodes
c("AAACCTGAGGAATGGT", "AAACCTGAGGGCTCTC") # Treatment barcodes
)
junction_data <- make_junction_ab(
STARsolo_SJ_dirs = sj_dirs,
white_barcode_lists = custom_barcodes,
sample_ids = sample_names,
use_internal_whitelist = FALSE
)
Understanding the Output:
The eventdata component contains crucial information for each junction:
- chr, start, end, strand: Genomic coordinates
- start_cor_id, end_cor_id: Coordinate identifiers for grouping
- row_names_mtx: Unique junction identifiers
- sample_id: Sample origin
make_m1()
Purpose: Transforms junction abundance data into the
inclusion matrix (M1) with coordinate-based grouping.
Function Signature:
make_m1(junction_ab_object, min_counts = 1, verbose = FALSE)
Parameters:
- junction_ab_object: Output from make_junction_ab()
- min_counts: Minimum count threshold for filtering events (default 1)
- verbose: Logical for detailed progress reporting (default FALSE)
The Coordinate Grouping Process:
This function performs sophisticated grouping of junctions that share
coordinates:
- Start Coordinate Grouping: Junctions sharing the
same 5’ splice site are grouped
- End Coordinate Grouping: Junctions sharing the same
3’ splice site are grouped
- Matrix Expansion: Each group generates new rows in
the M1 matrix
- Event Naming: Groups are labeled with “_S” (start)
or “_E” (end) suffixes
Returns:
- m1_inclusion_matrix: Sparse matrix with expanded events based on coordinate groups
- event_data: Enhanced metadata with group information
Example:
# Process junction data into M1 matrix
m1_result <- make_m1(junction_data)
# Examine the structure
print(dim(m1_result$m1_inclusion_matrix)) # Events × Cells
print(nrow(m1_result$event_data)) # Number of events
# Check the grouping information
table(m1_result$event_data$group_count) # Distribution of group sizes
Understanding M1 Structure:
The M1 matrix represents inclusion evidence:
- Each row corresponds to a specific splicing event (possibly grouped)
- Each column corresponds to a single cell
- Values represent the number of reads supporting inclusion of that event
- Events with “_S” suffix represent start coordinate groups
- Events with “_E” suffix represent end coordinate groups
make_m2()
Purpose: Creates the exclusion matrix (M2) from the
M1 matrix and event data with intelligent memory management.
Function Signature:
make_m2(m1_inclusion_matrix, eventdata, batch_size = 5000,
memory_threshold = 2e9, force_fast = FALSE,
multi_thread = FALSE, n_threads = 1, use_cpp = TRUE,
verbose = FALSE)
Parameters:
- m1_inclusion_matrix: Output from make_m1()
- eventdata: Event metadata from make_m1()
- batch_size: Number of groups to process per batch when using batched mode (default 5000)
- memory_threshold: Maximum rows before switching to batched processing (default 2e9)
- force_fast: Force fast processing regardless of memory estimates (default FALSE)
- multi_thread: Enable parallel processing for batched operations (default FALSE)
- n_threads: Number of OpenMP threads for C++ implementation (default 1)
- use_cpp: Use optimized C++ implementation (default TRUE, recommended)
- verbose: Logical for detailed progress reporting (default FALSE)
Performance Notes (v2.0.0): The C++ implementation
has been completely rewritten for version 2.0.0 with significant improvements:
- ~8x faster execution through two-pass CSC format construction
- ~30x lower memory usage by eliminating intermediate dense matrices
- Memory usage now scales with output size only: O(nnz_output) + O(n_groups)
- Previous versions used O(n_groups × n_cells) for intermediate storage
The Exclusion Calculation Logic:
For each coordinate group, M2 values are calculated as:
M2[event, cell] = Total_Group_Activity[group, cell] - M1[event, cell]
This means:
1. For each coordinate group, sum all inclusion reads across group members
2. For each individual event, subtract its inclusion reads from the group total
3. The remainder represents “exclusion” evidence for that specific event
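The group-total-minus-own-count rule can be checked on toy data with base R’s rowsum() (made-up counts; the events and groups below are hypothetical):

```r
# Two coordinate groups: e1/e2 share a coordinate, e3 is alone
m1 <- matrix(c(3, 1,
               2, 4,
               7, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("e1_S", "e2_S", "e3_S"), c("c1", "c2")))
group <- c("g1", "g1", "g2")

# Per-group, per-cell totals, then subtract each event's own counts
group_totals <- rowsum(m1, group)      # sums rows within each group
m2 <- group_totals[group, ] - m1
rownames(m2) <- rownames(m1)
m2  # e3_S is all zeros: a singleton group carries no exclusion evidence
```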
Example:
# Create M2 matrix
m2_matrix <- make_m2(m1_result$m1_inclusion_matrix, m1_result$event_data)
# Compare dimensions
print(dim(m1_result$m1_inclusion_matrix))
print(dim(m2_matrix)) # Should be identical
# Examine the relationship between M1 and M2
cor_values <- diag(cor(as.matrix(m1_result$m1_inclusion_matrix[1:100, 1:50]),
as.matrix(m2_matrix[1:100, 1:50])))
summary(cor_values) # Often negative correlation
Interpreting M2 Values:
- High M2 values indicate strong evidence for exclusion of that
specific event
- M2 values of zero occur when an event represents all activity in its
coordinate group
- The M1/M2 ratio provides information about splicing ratios
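The inclusion ratio mentioned above (a PSI-like quantity) can be computed directly from the two matrices; a sketch on toy vectors:

```r
# Hypothetical counts for one event across four cells
m1_row <- c(8, 0, 5, 2)   # inclusion reads
m2_row <- c(2, 6, 5, 0)   # exclusion reads

# Inclusion ratio = M1 / (M1 + M2), guarding against cells with no reads
total <- m1_row + m2_row
psi <- ifelse(total > 0, m1_row / total, NA_real_)
round(psi, 2)
```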
Additional Data Processing Functions
make_gene_count()
Purpose: Processes standard gene expression matrices
alongside splicing data.
gene_expression <- make_gene_count(
expression_dirs = "/path/to/10X/output",
sample_ids = "Sample1",
use_internal_whitelist = TRUE
)
make_velo_count()
Purpose: Processes spliced/unspliced matrices for
RNA velocity analysis.
velocity_data <- make_velo_count(
velocyto_dirs = "/path/to/velocyto/output",
sample_ids = "Sample1",
merge_counts = TRUE # Combine spliced/unspliced across samples
)
Feature Selection Functions
find_variable_events()
Purpose: Identifies highly variable splicing events
using deviance-based statistics.
Function Signature:
find_variable_events(m1_matrix, m2_matrix, min_row_sum = 50,
n_threads = 1, verbose = TRUE, ...)
Parameters:
- m1_matrix, m2_matrix: Inclusion and exclusion matrices
- min_row_sum: Minimum total reads per event for inclusion in analysis (default 50)
- n_threads: Number of threads for parallel processing (requires OpenMP)
- verbose: Whether to print progress messages
The Statistical Method:
The function calculates deviance statistics for each event:
- Library-wise Processing: Analyzes each
sample/library separately
- Ratio Modeling: Models the M1/(M1+M2) ratio for
each event
- Deviance Calculation: Computes how much each event
deviates from expected behavior
- Summation: Combines deviance scores across all
libraries
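The deviance idea can be illustrated with a plain binomial deviance against a pooled inclusion ratio; this is only a sketch, and the package’s exact per-library statistic may differ:

```r
# Toy counts for one event across four cells
m1 <- c(8, 1, 5, 0)           # inclusion reads
m2 <- c(2, 7, 5, 6)           # exclusion reads
n  <- m1 + m2
p_hat <- sum(m1) / sum(n)     # pooled inclusion ratio (null model)

# Binomial deviance: saturated vs pooled model, with 0*log(0) taken as 0
dev_term <- function(k, n, p) ifelse(k > 0, k * log(k / (n * p)), 0)
deviance <- 2 * sum(dev_term(m1, n, p_hat) + dev_term(m2, n, 1 - p_hat))
deviance  # larger values indicate more cell-to-cell variability
```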
Returns: A data.table with columns:
- events: Event identifiers
- sum_deviance: Combined deviance score across all libraries
Example:
# Load toy data
toy_data <- load_toy_M1_M2_object()
m1_matrix <- toy_data$m1
m2_matrix <- toy_data$m2
# Find variable events with different parameters
hve_default <- find_variable_events(m1_matrix, m2_matrix)
hve_strict <- find_variable_events(m1_matrix, m2_matrix, min_row_sum = 100)
# Compare results
print(nrow(hve_default)) # Number of events analyzed
print(nrow(hve_strict)) # Fewer events with stricter filter
# Top variable events
top_events <- hve_default[order(-sum_deviance)][1:20]
print(top_events)
Interpreting Results:
- Higher sum_deviance values indicate more variable splicing behavior
- Events with high variability might represent:
  - Cell-type-specific splicing patterns
  - Response to environmental conditions
  - Technical artifacts (require validation)
find_variable_genes()
Purpose: Identifies highly variable genes using
either variance stabilization or deviance methods.
Function Signature:
find_variable_genes(gene_expression_matrix, method = "vst",
n_threads = 1, verbose = TRUE, ...)
Parameters:
- gene_expression_matrix: Sparse gene expression matrix
- method: Either “vst” (variance stabilizing transformation) or “sum_deviance”
- n_threads: Number of threads for parallel processing (only for the sum_deviance method)
- verbose: Logical for progress reporting (default TRUE)
Method Comparison:
VST Method (Default):
- Implements Seurat-style variance stabilization
- Faster computation through C++ implementation
- Good for identifying highly expressed variable genes
Sum Deviance Method:
- Similar to the splicing event approach
- Models gene expression using negative binomial deviances
- Better for lowly expressed genes
Example:
# Using toy gene expression data
toy_data <- load_toy_M1_M2_object()
gene_expr <- toy_data$gene_expression
# VST method (faster)
hvg_vst <- find_variable_genes(gene_expr, method = "vst")
print(head(hvg_vst[order(-standardize_variance)]))
# Deviance method (more comprehensive)
hvg_dev <- find_variable_genes(gene_expr, method = "sum_deviance")
print(head(hvg_dev[order(-sum_deviance)]))
# Compare methods
length(intersect(hvg_vst[1:500]$events, hvg_dev[1:500]$events))
Utility Functions
get_pseudo_correlation()
Purpose: Computes pseudo-R² correlation between
external data and splicing behavior using beta-binomial modeling.
Function Signature:
get_pseudo_correlation(ZDB_matrix, m1_inclusion, m2_exclusion,
metric = "CoxSnell", suppress_warnings = TRUE)
Parameters:
- ZDB_matrix: External data matrix (e.g., gene expression, protein levels); must be dense
- m1_inclusion, m2_exclusion: Can be either sparse or dense matrices
- metric: R² metric to use, “CoxSnell” (default) or “Nagelkerke”
- suppress_warnings: Whether to suppress computational warnings
Use Cases:
- Correlating splicing patterns with gene expression
- Identifying splicing events that respond to external stimuli
- Validating splicing predictions against protein data
Example:
# Create example data
toy_sj <- load_toy_SJ_object()
m1_obj <- make_m1(toy_sj)
m1_matrix <- as.matrix(m1_obj$m1_inclusion_matrix) # Convert to dense
m2_matrix <- as.matrix(make_m2(m1_obj$m1_inclusion_matrix, m1_obj$event_data))
# Create dummy external data (e.g., gene expression)
external_data <- matrix(
rnorm(nrow(m1_matrix) * ncol(m1_matrix)),
nrow = nrow(m1_matrix),
ncol = ncol(m1_matrix)
)
rownames(external_data) <- rownames(m1_matrix)
# Calculate pseudo-correlations
correlations <- get_pseudo_correlation(external_data, m1_matrix, m2_matrix)
# Examine results
head(correlations[order(-abs(pseudo_correlation))])
# Compare to null distribution
plot(correlations$pseudo_correlation, correlations$null_distribution,
xlab = "Observed Correlation", ylab = "Null Distribution",
main = "Pseudo-correlation Analysis")
abline(0, 1, col = "red")
get_rowVar()
Purpose: Efficiently computes row-wise variance for
dense or sparse matrices.
Function Signature:
get_rowVar(M, verbose = FALSE)
Example:
# Dense matrix
dense_mat <- matrix(rnorm(1000), nrow = 100)
var_dense <- get_rowVar(dense_mat)
# Sparse matrix
library(Matrix)
sparse_mat <- rsparsematrix(100, 10, density = 0.1)
var_sparse <- get_rowVar(sparse_mat, verbose = TRUE)
# Compare with base R (for dense matrices)
var_base <- apply(dense_mat, 1, var)
all.equal(var_dense, var_base)
get_silhouette_mean()
Purpose: Computes average silhouette width for
clustering evaluation.
Function Signature:
get_silhouette_mean(X, cluster_assignments, n_threads = 1)
Example:
# Create example clustering data
set.seed(42)
pc_matrix <- matrix(rnorm(1000 * 15), nrow = 1000, ncol = 15)
clusters <- sample(1:5, 1000, replace = TRUE)
# Calculate silhouette score
n_threads <- parallel::detectCores()
silhouette_score <- get_silhouette_mean(pc_matrix, clusters, n_threads)
print(paste("Average silhouette score:", round(silhouette_score, 3)))
Conclusion
Splikit provides a comprehensive framework for single-cell splicing
analysis, from raw STARsolo output to biological insights. The package’s
strength lies in its integrated workflow that handles the complexities
of splicing data while providing flexibility for advanced users.
Key advantages of splikit include:
- Complete workflow coverage from data processing to
statistical analysis
- Efficient implementation with C++ acceleration for
performance-critical functions
- Flexible design that accommodates various
experimental setups and data types
- Integration capabilities with popular single-cell
analysis frameworks
For additional support, examples, and updates, visit the package
documentation and GitHub repository. The splikit development team
welcomes feedback and contributions from the community.
This manual was generated for splikit version 2.0.0. For the most
up-to-date information, please refer to the package documentation and
vignettes.