immGLIPH provides an R implementation of the GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots) and GLIPH2 algorithms for clustering T cell receptors (TCRs) that are predicted to bind the same HLA-restricted peptide antigen.
The package identifies TCR specificity groups by detecting statistically enriched CDR3\(\beta\) motifs (local similarity) and structurally similar CDR3\(\beta\) sequences (global similarity), then clusters them into convergence groups and scores each group for biological significance.
immGLIPH is an R implementation of existing algorithms. Users should cite the original publications:
GLIPH: Glanville, J. et al. Identifying specificity groups in the T cell receptor repertoire. Nature 547, 94–98 (2017). doi:10.1038/nature22976
GLIPH2: Huang, H. et al. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology 38, 1194–1202 (2020). doi:10.1038/s41587-020-0505-4
immGLIPH can be installed from Bioconductor:
BiocManager::install("immGLIPH")
The reference repertoire data (~19 MB) is downloaded automatically the first
time you run runGLIPH() and cached locally via
BiocFileCache. You can
pre-download the data with:
BiocManager::install("BiocFileCache")
ref <- getGLIPHreference()
library(immGLIPH)
immGLIPH integrates with the
scRepertoire ecosystem
through immApex. Both
scRepertoire and immApex are Bioconductor packages and can be installed via
BiocManager::install(). This means runGLIPH() can directly accept:
immGLIPH includes built-in example data derived from the scRepertoire example dataset (Yost et al. 2019):
gliph_input_data: A data frame of 365 TRB CDR3 sequences with
V-gene and patient annotations.gliph_sce: A SingleCellExperiment object with TCR clonotype
information in colData, demonstrating the single-cell workflow.data("gliph_input_data")
head(gliph_input_data)
## CDR3b TRBV patient
## 1 CAISEQGKGELFF TRBV10-3 P17B
## 2 CASSVRRERANTGELFF TRBV9 P17B
## 3 CASSVRRERANTGELFF TRBV9 P17B
## 4 CASSSRQDSTDTQYF TRBV27 P17B
## 5 CASSVRRERANTGELFF TRBV9 P17B
## 6 CASSPRAGTPNTEAFF TRBV4-1 P17B
dim(gliph_input_data)
## [1] 365 3
runGLIPH() accepts a data frame with the following columns:
| Column | Required | Description |
|---|---|---|
| CDR3b | Yes | CDR3\(\beta\) amino acid sequences |
| TRBV | No | V-gene usage (e.g., “TRBV5-1”) |
| patient | No | Donor/sample identifier |
| HLA | No | HLA alleles, comma-separated |
| counts | No | Clone frequency |
Alternative column names are automatically recognized (e.g., cdr3, v_gene,
sample, clone_count).
When working with single-cell immune repertoire data, you can use scRepertoire to prepare your data and pass the output directly to immGLIPH.
library(scRepertoire)
# After processing with cellranger/etc, combine contigs
combined <- combineTCR(contig_list[1:2],
samples = c("P1", "P2"))
# Pass scRepertoire output directly to runGLIPH
results <- runGLIPH(combined,
method = "gliph2")
For SingleCellExperiment objects that already contain TCR metadata (e.g.,
added via scRepertoire::combineExpression()), immGLIPH extracts the receptor
data automatically using immApex::getIR(). Here is an example using the
bundled gliph_sce dataset:
library(SingleCellExperiment)
data("gliph_sce")
# SingleCellExperiment object with TCR info in colData
results <- runGLIPH(gliph_sce,
method = "gliph2",
chains = "TRB")
runGLIPH() Function| Argument | Default | Description |
|---|---|---|
cdr3_sequences |
– | Input data (data frame, vector, SCE, or list) |
method |
"gliph2" |
Algorithm preset: "gliph1", "gliph2", or "custom" |
sim_depth |
1000 | Simulation depth (higher = more reproducible, slower) |
n_cores |
1 | Number of parallel cores |
refdb_beta |
"human_v2.0_CD48" |
Reference database (built-in name or custom data frame) |
local_similarities |
TRUE |
Search for local (motif-based) similarities |
global_similarities |
TRUE |
Search for global (structural) similarities |
structboundaries |
TRUE |
Trim structural boundaries before comparison |
accept_CF |
TRUE |
Filter to sequences starting with C, ending with F |
The method parameter configures a coordinated set of algorithm choices:
| Setting | "gliph1" |
"gliph2" |
"custom" |
|---|---|---|---|
| Local method | Repeated random sampling | Fisher’s exact test | User-defined |
| Global method | Hamming distance cutoff | Struct-based + Fisher | User-defined |
| Clustering | Connected components | Per-motif isolated | User-defined |
| Scoring | GLIPH1 formula | GLIPH2 formula | User-defined |
By default runGLIPH() downloads a reference repertoire via BiocFileCache.
For this vignette we supply a custom reference data frame directly using
refdb_beta to avoid network access:
data("gliph_input_data")
# Use the input data itself as a small reference for demonstration
ref_df <- gliph_input_data[, c("CDR3b", "TRBV")]
res_gliph2 <- runGLIPH(
cdr3_sequences = gliph_input_data[seq_len(200), ],
method = "gliph2",
refdb_beta = ref_df,
sim_depth = 100,
n_cores = 1
)
res_gliph1 <- runGLIPH(
cdr3_sequences = gliph_input_data[seq_len(200), ],
method = "gliph1",
refdb_beta = ref_df,
sim_depth = 100,
n_cores = 1
)
The output is a list with the following elements:
names(res_gliph2)
## [1] "sample_log" "motif_enrichment" "global_enrichment"
## [4] "connections" "cluster_properties" "cluster_list"
## [7] "parameters"
The cluster_properties data frame summarizes each convergence group with
enrichment scores:
head(res_gliph2$cluster_properties)
## NULL
The cluster_list is a named list where each element contains the member
TCRs of a convergence group:
# Number of convergence groups
length(res_gliph2$cluster_list)
## [1] 0
# Members of the first cluster (if any found)
if (length(res_gliph2$cluster_list) > 0) {
head(res_gliph2$cluster_list[[1]])
}
The motif_enrichment element contains the locally enriched motifs:
# Significantly enriched motifs
if (!is.null(res_gliph2$motif_enrichment$selected_motifs)) {
head(res_gliph2$motif_enrichment$selected_motifs)
}
## [1] motif counts num_in_ref avgRef topRef OvE p.value
## <0 rows> (or 0-length row.names)
The connections data frame contains the edge list representing the TCR
similarity network:
if (!is.null(res_gliph2$connections)) {
head(res_gliph2$connections)
}
method = "custom"The "custom" method allows independent control over each algorithmic
component:
res_custom <- runGLIPH(
cdr3_sequences = gliph_input_data[seq_len(200), ],
method = "custom",
refdb_beta = ref_df,
local_method = "fisher", # or "rrs"
global_method = "cutoff", # or "fisher"
clustering_method = "GLIPH1.0", # or "GLIPH2.0"
scoring_method = "GLIPH2.0", # or "GLIPH1.0"
sim_depth = 100,
n_cores = 1
)
For the Fisher-based local method (GLIPH2), you can adjust:
lcminp: Maximum p-value for a motif to be considered enriched
(default 0.01)lcminove: Minimum fold-enrichment per motif length
(default c(1000, 100, 10) for 2-mers, 3-mers, 4-mers)kmer_mindepth: Minimum motif observations in the sample (default 3)res_strict <- runGLIPH(
cdr3_sequences = gliph_input_data[seq_len(200), ],
method = "gliph2",
refdb_beta = ref_df,
lcminp = 0.001, # Stricter p-value
lcminove = c(10000, 1000, 100), # Higher fold-change
sim_depth = 100,
n_cores = 1
)
immGLIPH ships with reference repertoires for human and mouse from the original GLIPH publications. Each is available as CD4, CD8, or combined (CD48) subsets:
| Name | Species | Version | Subset | Source |
|---|---|---|---|---|
"human_v1.0_CD4" |
Human | v1.0 | CD4 | Glanville et al. (2017) |
"human_v1.0_CD8" |
Human | v1.0 | CD8 | Glanville et al. (2017) |
"human_v1.0_CD48" |
Human | v1.0 | CD4+CD8 | Glanville et al. (2017) |
"human_v2.0_CD4" |
Human | v2.0 | CD4 | Huang et al. (2020) |
"human_v2.0_CD8" |
Human | v2.0 | CD8 | Huang et al. (2020) |
"human_v2.0_CD48" |
Human | v2.0 | CD4+CD8 | Huang et al. (2020) |
"mouse_v1.0_CD4" |
Mouse | v1.0 | CD4 | Huang et al. (2020) |
"mouse_v1.0_CD8" |
Mouse | v1.0 | CD8 | Huang et al. (2020) |
"mouse_v1.0_CD48" |
Mouse | v1.0 | CD4+CD8 | Huang et al. (2020) |
"gliph_reference" |
Human | v1.0 | CD4+CD8 | Legacy alias for human_v1.0_CD48 |
Each reference includes pre-computed V-gene usage and CDR3 length frequency distributions, which are automatically used for cluster scoring.
# Use the GLIPH2 mouse reference
res_mouse <- runGLIPH(
cdr3_sequences = mouse_tcr_data,
refdb_beta = "mouse_v1.0_CD48",
method = "gliph2",
sim_depth = 100,
n_cores = 1
)
You can also supply your own reference as a data frame:
custom_ref <- data.frame(
CDR3b = c("CASSLAPGATNEKLFF", "CASSLDRGEVFF", ...),
TRBV = c("TRBV5-1", "TRBV6-2", ...)
)
res <- runGLIPH(
cdr3_sequences = gliph_input_data[seq_len(200), ],
refdb_beta = custom_ref,
method = "gliph2",
sim_depth = 100,
n_cores = 1
)
findMotifs()The findMotifs() function searches for continuous and discontinuous k-mer
motifs in a set of CDR3 sequences. It is used internally by runGLIPH() but
can also be called independently.
| Argument | Default | Description |
|---|---|---|
seqs |
– | Character vector of CDR3\(\beta\) sequences |
q |
2:4 |
Motif lengths to search |
kmer_mindepth |
NULL |
Minimum motif count to return |
discontinuous |
FALSE |
Include discontinuous motifs (with one variable position) |
data("gliph_input_data")
sample_seqs <- as.character(gliph_input_data$CDR3b[seq_len(100)])
# Find all 3-mers appearing at least 5 times
motifs <- findMotifs(seqs = sample_seqs,
q = 3,
kmer_mindepth = 5)
head(motifs[order(motifs$V1, decreasing = TRUE), ])
## motif V1
## 49 CAS 85
## 46 ASS 77
## 45 QYF 34
## 30 SSL 27
## 19 LFF 26
## 21 GEL 22
Including discontinuous motifs (e.g., C.S where . is any amino acid):
disc_motifs <- findMotifs(seqs = sample_seqs,
q = 2,
kmer_mindepth = 5,
discontinuous = TRUE
)
# Show discontinuous motifs (those containing a dot)
disc_only <- disc_motifs[grep("\\.", disc_motifs$motif), ]
head(disc_only[order(disc_only$V1, decreasing = TRUE), ])
## motif V1
## 52 .S 233
## 71 S. 233
## 58 .A 180
## 64 A. 178
## 41 .F 163
## 24 G. 162
clusterScoring()The clusterScoring() function evaluates convergence groups using up to five
metrics. This is called automatically by runGLIPH(), but can be re-run with
different parameters on existing results.
| Argument | Default | Description |
|---|---|---|
cluster_list |
– | Named list of cluster data frames (from runGLIPH()$cluster_list) |
cdr3_sequences |
– | Original input data frame |
refdb_beta |
"human_v2.0_CD48" |
Reference database |
gliph_version |
1 | Scoring formula: 1 (GLIPH) or 2 (GLIPH2) |
sim_depth |
1000 | Resampling depth for score estimation |
The total score is derived from up to five components (depending on available data):
network.size.score: Probability of observing a cluster of this size
by chancecdr3.length.score: Enrichment of CDR3 length distribution within the
clustervgene.score: Enrichment of V-gene usage (requires TRBV column)clonal.expansion.score: Enrichment of expanded clones (requires counts
column)hla.score: Enrichment of shared HLA alleles among donors (requires
patient + HLA columns)# Re-score with GLIPH2 formula
if (length(res_gliph1$cluster_list) > 0) {
rescored <- clusterScoring(
cluster_list = res_gliph1$cluster_list,
cdr3_sequences = gliph_input_data[seq_len(200), ],
refdb_beta = ref_df,
gliph_version = 2,
sim_depth = 100,
n_cores = 1)
head(rescored)
}
deNovoTCRs()The deNovoTCRs() function generates artificial CDR3\(\beta\) sequences that
resemble the positional amino acid composition of a given convergence group.
This can be used to predict novel TCR sequences with similar binding
characteristics.
| Argument | Default | Description |
|---|---|---|
convergence_group_tag |
– | Tag identifying the cluster (from cluster_properties$tag) |
clustering_output |
NULL |
Output list from runGLIPH() |
result_folder |
"" |
Alternative: load from files |
sims |
100,000 | Number of de novo sequences to generate |
num_tops |
1,000 | Return top N highest-scoring sequences |
normalization |
FALSE |
Normalize scores against the reference database |
make_figure |
FALSE |
Plot score vs. rank |
# Generate de novo TCRs for the first convergence group (if any found)
if (length(res_gliph1$cluster_list) > 0) {
de_novo <- deNovoTCRs(
convergence_group_tag = names(res_gliph1$cluster_list)[1],
clustering_output = res_gliph1,
refdb_beta = ref_df,
sims = 10000,
num_tops = 100,
make_figure = FALSE,
n_cores = 1
)
# Top predicted sequences
head(de_novo$de_novo_sequences)
# Positional weight matrix used for generation
head(de_novo$PWM_Scoring)
}
plotNetwork()The plotNetwork() function creates an interactive network visualization of
the convergence groups using the visNetwork package.
| Argument | Default | Description |
|---|---|---|
clustering_output |
NULL |
Output list from runGLIPH() |
color_info |
"total.score" |
Column name for node coloring |
color_palette |
viridis::viridis |
Color palette function |
local_edge_color |
"orange" |
Color for local similarity edges |
global_edge_color |
"#68bceb" |
Color for global similarity edges |
size_info |
NULL |
Column name for node sizing |
cluster_min_size |
3 | Minimum cluster size to display |
if (!is.null(res_gliph1$cluster_properties) &&
nrow(res_gliph1$cluster_properties) > 0) {
plotNetwork(
clustering_output = res_gliph1,
color_info = "total.score",
cluster_min_size = 2,
n_cores = 1
)
}
loadGLIPH()If you saved results to disk using result_folder, you can reload them:
# Save results
res <- runGLIPH(
cdr3_sequences = gliph_input_data,
method = "gliph2",
result_folder = "my_results/",
n_cores = 1
)
# Later, reload
reloaded <- loadGLIPH(result_folder = "my_results/")
When result_folder is specified, runGLIPH() writes several output files:
| File | Description |
|---|---|
local_similarities.txt |
Enriched motifs |
all_motifs.txt |
All tested motifs with statistics |
clone_network.txt |
Network edge list |
convergence_groups.txt |
Cluster properties and scores |
cluster_member_details.txt |
Full member information per cluster |
parameters.txt |
All parameters used |
When immApex is installed, immGLIPH automatically uses its C++-accelerated backends for two computationally intensive steps:
Motif enumeration (findMotifs()): Uses immApex::calculateMotif()
with OpenMP multithreading instead of the pure-R stringdist::qgrams()
approach.
Global Hamming distance network (GLIPH1 method): Uses
immApex::buildNetwork() to compute pairwise distances in a single C++
call, replacing the parallel loop over stringdist::stringdist().
If immApex is not installed, immGLIPH falls back to the original pure-R implementations transparently, no code changes are needed.
# Install immApex for performance acceleration
BiocManager::install("BorchLab/immApex")
Start with GLIPH2: The Fisher-based approach is generally more statistically rigorous than repeated random sampling.
Sample size matters: GLIPH works best with >200 unique CDR3\(\beta\) sequences. Very small samples may yield few or no convergence groups.
Include V-gene information: When available, TRBV data improves both global similarity detection and scoring accuracy.
Adjust sim_depth: For publication-quality results, use
sim_depth >= 1000. For exploratory analysis, sim_depth = 100 is faster.
Parallelization: For large datasets (>5,000 sequences), set
n_cores > 1 to use parallel processing via BiocParallel.
Install immApex: For best performance, install immApex to enable C++-accelerated motif enumeration and network construction (see Performance section above).
Choose the right reference: For mouse data, use refdb_beta = "mouse_v1.0_CD48".
For specialized repertoires, provide a custom data frame via refdb_beta.
sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] immGLIPH_0.99.4 BiocStyle_2.40.0
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.5 sass_0.4.10
## [3] generics_0.1.4 SparseArray_1.12.2
## [5] stringi_1.8.7 lattice_0.22-9
## [7] digest_0.6.39 magrittr_2.0.5
## [9] evaluate_1.0.5 grid_4.6.0
## [11] RColorBrewer_1.1-3 bookdown_0.46
## [13] fastmap_1.2.0 jsonlite_2.0.0
## [15] Matrix_1.7-5 gridExtra_2.3
## [17] BiocManager_1.30.27 SingleCellExperiment_1.34.0
## [19] viridisLite_0.4.3 scales_1.4.0
## [21] codetools_0.2-20 jquerylib_0.1.4
## [23] abind_1.4-8 cli_3.6.6
## [25] rlang_1.2.0 XVector_0.52.0
## [27] Biobase_2.72.0 DelayedArray_0.38.1
## [29] cachem_1.1.0 yaml_2.3.12
## [31] otel_0.2.0 S4Arrays_1.12.0
## [33] tools_4.6.0 parallel_4.6.0
## [35] BiocParallel_1.46.0 dplyr_1.2.1
## [37] ggplot2_4.0.3 SummarizedExperiment_1.42.0
## [39] BiocGenerics_0.58.0 vctrs_0.7.3
## [41] R6_2.6.1 stats4_4.6.0
## [43] matrixStats_1.5.0 lifecycle_1.0.5
## [45] stringr_1.6.0 Seqinfo_1.2.0
## [47] IRanges_2.46.0 S4Vectors_0.50.0
## [49] pkgconfig_2.0.3 pillar_1.11.1
## [51] bslib_0.10.0 gtable_0.3.6
## [53] glue_1.8.1 Rcpp_1.1.1-1.1
## [55] GenomicRanges_1.64.0 xfun_0.57
## [57] tibble_3.3.1 tidyselect_1.2.1
## [59] MatrixGenerics_1.24.0 knitr_1.51
## [61] dichromat_2.0-0.1 farver_2.1.2
## [63] igraph_2.3.0 htmltools_0.5.9
## [65] rmarkdown_2.31 immApex_1.6.0
## [67] compiler_4.6.0 S7_0.2.2