1 Introduction

immGLIPH provides an R implementation of the GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots) and GLIPH2 algorithms for clustering T cell receptors (TCRs) that are predicted to bind the same HLA-restricted peptide antigen.

The package identifies TCR specificity groups by detecting statistically enriched CDR3\(\beta\) motifs (local similarity) and structurally similar CDR3\(\beta\) sequences (global similarity), then clusters them into convergence groups and scores each group for biological significance.

immGLIPH is an R implementation of existing algorithms. Users should cite the original publications:

  • GLIPH: Glanville, J. et al. Identifying specificity groups in the T cell receptor repertoire. Nature 547, 94–98 (2017). doi:10.1038/nature22976

  • GLIPH2: Huang, H. et al. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology 38, 1194–1202 (2020). doi:10.1038/s41587-020-0505-4

1.1 Installation

immGLIPH can be installed from Bioconductor:

BiocManager::install("immGLIPH")

The reference repertoire data (~19 MB) is downloaded automatically the first time you run runGLIPH() and cached locally via BiocFileCache. You can pre-download the data with:

BiocManager::install("BiocFileCache")
ref <- getGLIPHreference()

1.1.1 Loading Package

library(immGLIPH)

1.2 Integration with the scRepertoire Ecosystem

immGLIPH integrates with the scRepertoire ecosystem through immApex. Both scRepertoire and immApex are Bioconductor packages and can be installed via BiocManager::install(). This means runGLIPH() can directly accept:

  • SingleCellExperiment objects with TCR information
  • combineTCR() output lists from scRepertoire
  • Standard data frames or character vectors

2 Quick Start

2.1 Loading the Example Data

immGLIPH includes built-in example data derived from the scRepertoire example dataset (Yost et al. 2019):

  • gliph_input_data: A data frame of 365 TRB CDR3 sequences with V-gene and patient annotations.
  • gliph_sce: A SingleCellExperiment object with TCR clonotype information in colData, demonstrating the single-cell workflow.
data("gliph_input_data")
head(gliph_input_data)
##               CDR3b     TRBV patient
## 1     CAISEQGKGELFF TRBV10-3    P17B
## 2 CASSVRRERANTGELFF    TRBV9    P17B
## 3 CASSVRRERANTGELFF    TRBV9    P17B
## 4   CASSSRQDSTDTQYF   TRBV27    P17B
## 5 CASSVRRERANTGELFF    TRBV9    P17B
## 6  CASSPRAGTPNTEAFF  TRBV4-1    P17B
dim(gliph_input_data)
## [1] 365   3

2.2 Input Data Format

runGLIPH() accepts a data frame with the following columns:

Column Required Description
CDR3b Yes CDR3\(\beta\) amino acid sequences
TRBV No V-gene usage (e.g., “TRBV5-1”)
patient No Donor/sample identifier
HLA No HLA alleles, comma-separated
counts No Clone frequency

Alternative column names are automatically recognized (e.g., cdr3, v_gene, sample, clone_count).

3 Working with Single-Cell Data

When working with single-cell immune repertoire data, you can use scRepertoire to prepare your data and pass the output directly to immGLIPH.

library(scRepertoire)

# After processing with cellranger/etc, combine contigs
combined <- combineTCR(contig_list[1:2],
                        samples = c("P1", "P2"))

# Pass scRepertoire output directly to runGLIPH
results <- runGLIPH(combined, 
                    method = "gliph2")

For SingleCellExperiment objects that already contain TCR metadata (e.g., added via scRepertoire::combineExpression()), immGLIPH extracts the receptor data automatically using immApex::getIR(). Here is an example using the bundled gliph_sce dataset:

library(SingleCellExperiment)
data("gliph_sce")

# SingleCellExperiment object with TCR info in colData
results <- runGLIPH(gliph_sce, 
                    method = "gliph2", 
                    chains = "TRB")

4 The runGLIPH() Function

4.1 Key Arguments

Argument Default Description
cdr3_sequences Input data (data frame, vector, SCE, or list)
method "gliph2" Algorithm preset: "gliph1", "gliph2", or "custom"
sim_depth 1000 Simulation depth (higher = more reproducible, slower)
n_cores 1 Number of parallel cores
refdb_beta "human_v2.0_CD48" Reference database (built-in name or custom data frame)
local_similarities TRUE Search for local (motif-based) similarities
global_similarities TRUE Search for global (structural) similarities
structboundaries TRUE Trim structural boundaries before comparison
accept_CF TRUE Filter to sequences starting with C, ending with F

4.2 Method Presets

The method parameter configures a coordinated set of algorithm choices:

Setting "gliph1" "gliph2" "custom"
Local method Repeated random sampling Fisher’s exact test User-defined
Global method Hamming distance cutoff Struct-based + Fisher User-defined
Clustering Connected components Per-motif isolated User-defined
Scoring GLIPH1 formula GLIPH2 formula User-defined

4.3 Running GLIPH2 (Default)

By default runGLIPH() downloads a reference repertoire via BiocFileCache. For this vignette we supply a custom reference data frame directly using refdb_beta to avoid network access:

data("gliph_input_data")

# Use the input data itself as a small reference for demonstration
ref_df <- gliph_input_data[, c("CDR3b", "TRBV")]

res_gliph2 <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  method         = "gliph2",
  refdb_beta     = ref_df,
  sim_depth      = 100,
  n_cores        = 1
)

4.4 Running GLIPH1

res_gliph1 <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  method         = "gliph1",
  refdb_beta     = ref_df,
  sim_depth      = 100,
  n_cores        = 1
)

4.5 Understanding the Output

The output is a list with the following elements:

names(res_gliph2)
## [1] "sample_log"         "motif_enrichment"   "global_enrichment" 
## [4] "connections"        "cluster_properties" "cluster_list"      
## [7] "parameters"

4.5.1 Cluster Properties

The cluster_properties data frame summarizes each convergence group with enrichment scores:

head(res_gliph2$cluster_properties)
## NULL

4.5.2 Cluster Membership

The cluster_list is a named list where each element contains the member TCRs of a convergence group:

# Number of convergence groups
length(res_gliph2$cluster_list)
## [1] 0
# Members of the first cluster (if any found)
if (length(res_gliph2$cluster_list) > 0) {
  head(res_gliph2$cluster_list[[1]])
}

4.5.3 Motif Enrichment

The motif_enrichment element contains the locally enriched motifs:

# Significantly enriched motifs
if (!is.null(res_gliph2$motif_enrichment$selected_motifs)) {
  head(res_gliph2$motif_enrichment$selected_motifs)
}
## [1] motif      counts     num_in_ref avgRef     topRef     OvE        p.value   
## <0 rows> (or 0-length row.names)

4.5.4 Network Edges

The connections data frame contains the edge list representing the TCR similarity network:

if (!is.null(res_gliph2$connections)) {
  head(res_gliph2$connections)
}

5 Customizing the Analysis

5.1 Using method = "custom"

The "custom" method allows independent control over each algorithmic component:

res_custom <- runGLIPH(
  cdr3_sequences  = gliph_input_data[seq_len(200), ],
  method          = "custom",
  refdb_beta      = ref_df,
  local_method    = "fisher",    # or "rrs"
  global_method   = "cutoff",    # or "fisher"
  clustering_method = "GLIPH1.0", # or "GLIPH2.0"
  scoring_method  = "GLIPH2.0",  # or "GLIPH1.0"
  sim_depth       = 100,
  n_cores         = 1
)

5.2 Adjusting Significance Thresholds

For the Fisher-based local method (GLIPH2), you can adjust:

  • lcminp: Maximum p-value for a motif to be considered enriched (default 0.01)
  • lcminove: Minimum fold-enrichment per motif length (default c(1000, 100, 10) for 2-mers, 3-mers, 4-mers)
  • kmer_mindepth: Minimum motif observations in the sample (default 3)
res_strict <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  method         = "gliph2",
  refdb_beta     = ref_df,
  lcminp         = 0.001,            # Stricter p-value
  lcminove       = c(10000, 1000, 100), # Higher fold-change
  sim_depth      = 100,
  n_cores        = 1
)

5.3 Choosing a Reference Database

immGLIPH ships with reference repertoires for human and mouse from the original GLIPH publications. Each is available as CD4, CD8, or combined (CD48) subsets:

Name Species Version Subset Source
"human_v1.0_CD4" Human v1.0 CD4 Glanville et al. (2017)
"human_v1.0_CD8" Human v1.0 CD8 Glanville et al. (2017)
"human_v1.0_CD48" Human v1.0 CD4+CD8 Glanville et al. (2017)
"human_v2.0_CD4" Human v2.0 CD4 Huang et al. (2020)
"human_v2.0_CD8" Human v2.0 CD8 Huang et al. (2020)
"human_v2.0_CD48" Human v2.0 CD4+CD8 Huang et al. (2020)
"mouse_v1.0_CD4" Mouse v1.0 CD4 Huang et al. (2020)
"mouse_v1.0_CD8" Mouse v1.0 CD8 Huang et al. (2020)
"mouse_v1.0_CD48" Mouse v1.0 CD4+CD8 Huang et al. (2020)
"gliph_reference" Human v1.0 CD4+CD8 Legacy alias for human_v1.0_CD48

Each reference includes pre-computed V-gene usage and CDR3 length frequency distributions, which are automatically used for cluster scoring.

# Use the GLIPH2 mouse reference
res_mouse <- runGLIPH(
  cdr3_sequences = mouse_tcr_data,
  refdb_beta     = "mouse_v1.0_CD48",
  method         = "gliph2",
  sim_depth      = 100,
  n_cores        = 1
)

5.4 Using a Custom Reference Database

You can also supply your own reference as a data frame:

custom_ref <- data.frame(
  CDR3b = c("CASSLAPGATNEKLFF", "CASSLDRGEVFF", ...),
  TRBV  = c("TRBV5-1", "TRBV6-2", ...)
)

res <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  refdb_beta     = custom_ref,
  method         = "gliph2",
  sim_depth      = 100,
  n_cores        = 1
)

6 Motif Discovery with findMotifs()

The findMotifs() function searches for continuous and discontinuous k-mer motifs in a set of CDR3 sequences. It is used internally by runGLIPH() but can also be called independently.

6.1 Key Arguments

Argument Default Description
seqs Character vector of CDR3\(\beta\) sequences
q 2:4 Motif lengths to search
kmer_mindepth NULL Minimum motif count to return
discontinuous FALSE Include discontinuous motifs (with one variable position)

6.2 Example

data("gliph_input_data")
sample_seqs <- as.character(gliph_input_data$CDR3b[seq_len(100)])

# Find all 3-mers appearing at least 5 times
motifs <- findMotifs(seqs = sample_seqs, 
                     q = 3, 
                     kmer_mindepth = 5)
head(motifs[order(motifs$V1, decreasing = TRUE), ])
##    motif V1
## 49   CAS 85
## 46   ASS 77
## 45   QYF 34
## 30   SSL 27
## 19   LFF 26
## 21   GEL 22

Including discontinuous motifs (e.g., C.S where . is any amino acid):

disc_motifs <- findMotifs(seqs          = sample_seqs,
                          q             = 2,
                          kmer_mindepth = 5,
                          discontinuous = TRUE
)
# Show discontinuous motifs (those containing a dot)
disc_only <- disc_motifs[grep("\\.", disc_motifs$motif), ]
head(disc_only[order(disc_only$V1, decreasing = TRUE), ])
##    motif  V1
## 52    .S 233
## 71    S. 233
## 58    .A 180
## 64    A. 178
## 41    .F 163
## 24    G. 162

7 Cluster Scoring with clusterScoring()

The clusterScoring() function evaluates convergence groups using up to five metrics. This is called automatically by runGLIPH(), but can be re-run with different parameters on existing results.

7.1 Key Arguments

Argument Default Description
cluster_list Named list of cluster data frames (from runGLIPH()$cluster_list)
cdr3_sequences Original input data frame
refdb_beta "human_v2.0_CD48" Reference database
gliph_version 1 Scoring formula: 1 (GLIPH) or 2 (GLIPH2)
sim_depth 1000 Resampling depth for score estimation

7.2 Scoring Components

The total score is derived from up to five components (depending on available data):

  1. network.size.score: Probability of observing a cluster of this size by chance
  2. cdr3.length.score: Enrichment of CDR3 length distribution within the cluster
  3. vgene.score: Enrichment of V-gene usage (requires TRBV column)
  4. clonal.expansion.score: Enrichment of expanded clones (requires counts column)
  5. hla.score: Enrichment of shared HLA alleles among donors (requires patient + HLA columns)

7.3 Example

# Re-score with GLIPH2 formula
if (length(res_gliph1$cluster_list) > 0) {
  rescored <- clusterScoring(
    cluster_list   = res_gliph1$cluster_list,
    cdr3_sequences = gliph_input_data[seq_len(200), ],
    refdb_beta     = ref_df,
    gliph_version  = 2,
    sim_depth      = 100,
    n_cores        = 1)
  head(rescored)
}

8 De Novo TCR Generation with deNovoTCRs()

The deNovoTCRs() function generates artificial CDR3\(\beta\) sequences that resemble the positional amino acid composition of a given convergence group. This can be used to predict novel TCR sequences with similar binding characteristics.

8.1 Key Arguments

Argument Default Description
convergence_group_tag Tag identifying the cluster (from cluster_properties$tag)
clustering_output NULL Output list from runGLIPH()
result_folder "" Alternative: load from files
sims 100,000 Number of de novo sequences to generate
num_tops 1,000 Return top N highest-scoring sequences
normalization FALSE Normalize scores against the reference database
make_figure FALSE Plot score vs. rank

8.2 Example

# Generate de novo TCRs for the first convergence group (if any found)
if (length(res_gliph1$cluster_list) > 0) {
  de_novo <- deNovoTCRs(
    convergence_group_tag = names(res_gliph1$cluster_list)[1],
    clustering_output     = res_gliph1,
    refdb_beta            = ref_df,
    sims                  = 10000,
    num_tops              = 100,
    make_figure           = FALSE,
    n_cores               = 1
  )

  # Top predicted sequences
  head(de_novo$de_novo_sequences)

  # Positional weight matrix used for generation
  head(de_novo$PWM_Scoring)
}

9 Network Visualization with plotNetwork()

The plotNetwork() function creates an interactive network visualization of the convergence groups using the visNetwork package.

9.1 Key Arguments

Argument Default Description
clustering_output NULL Output list from runGLIPH()
color_info "total.score" Column name for node coloring
color_palette viridis::viridis Color palette function
local_edge_color "orange" Color for local similarity edges
global_edge_color "#68bceb" Color for global similarity edges
size_info NULL Column name for node sizing
cluster_min_size 3 Minimum cluster size to display

9.2 Example

if (!is.null(res_gliph1$cluster_properties) &&
    nrow(res_gliph1$cluster_properties) > 0) {
  plotNetwork(
    clustering_output = res_gliph1,
    color_info        = "total.score",
    cluster_min_size  = 2,
    n_cores           = 1
  )
}

10 Loading Saved Results with loadGLIPH()

If you saved results to disk using result_folder, you can reload them:

# Save results
res <- runGLIPH(
  cdr3_sequences = gliph_input_data,
  method         = "gliph2",
  result_folder  = "my_results/",
  n_cores        = 1
)

# Later, reload
reloaded <- loadGLIPH(result_folder = "my_results/")

11 Saving Results to Disk

When result_folder is specified, runGLIPH() writes several output files:

File Description
local_similarities.txt Enriched motifs
all_motifs.txt All tested motifs with statistics
clone_network.txt Network edge list
convergence_groups.txt Cluster properties and scores
cluster_member_details.txt Full member information per cluster
parameters.txt All parameters used

12 Performance

12.1 Accelerated Computation with immApex

When immApex is installed, immGLIPH automatically uses its C++-accelerated backends for two computationally intensive steps:

  1. Motif enumeration (findMotifs()): Uses immApex::calculateMotif() with OpenMP multithreading instead of the pure-R stringdist::qgrams() approach.

  2. Global Hamming distance network (GLIPH1 method): Uses immApex::buildNetwork() to compute pairwise distances in a single C++ call, replacing the parallel loop over stringdist::stringdist().

If immApex is not installed, immGLIPH falls back to the original pure-R implementations transparently, no code changes are needed.

# Install immApex for performance acceleration
BiocManager::install("BorchLab/immApex")

13 Tips and Best Practices

  1. Start with GLIPH2: The Fisher-based approach is generally more statistically rigorous than repeated random sampling.

  2. Sample size matters: GLIPH works best with >200 unique CDR3\(\beta\) sequences. Very small samples may yield few or no convergence groups.

  3. Include V-gene information: When available, TRBV data improves both global similarity detection and scoring accuracy.

  4. Adjust sim_depth: For publication-quality results, use sim_depth >= 1000. For exploratory analysis, sim_depth = 100 is faster.

  5. Parallelization: For large datasets (>5,000 sequences), set n_cores > 1 to use parallel processing via BiocParallel.

  6. Install immApex: For best performance, install immApex to enable C++-accelerated motif enumeration and network construction (see Performance section above).

  7. Choose the right reference: For mouse data, use refdb_beta = "mouse_v1.0_CD48". For specialized repertoires, provide a custom data frame via refdb_beta.

14 Session Info

sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] immGLIPH_0.99.4  BiocStyle_2.40.0
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.5               sass_0.4.10                
##  [3] generics_0.1.4              SparseArray_1.12.2         
##  [5] stringi_1.8.7               lattice_0.22-9             
##  [7] digest_0.6.39               magrittr_2.0.5             
##  [9] evaluate_1.0.5              grid_4.6.0                 
## [11] RColorBrewer_1.1-3          bookdown_0.46              
## [13] fastmap_1.2.0               jsonlite_2.0.0             
## [15] Matrix_1.7-5                gridExtra_2.3              
## [17] BiocManager_1.30.27         SingleCellExperiment_1.34.0
## [19] viridisLite_0.4.3           scales_1.4.0               
## [21] codetools_0.2-20            jquerylib_0.1.4            
## [23] abind_1.4-8                 cli_3.6.6                  
## [25] rlang_1.2.0                 XVector_0.52.0             
## [27] Biobase_2.72.0              DelayedArray_0.38.1        
## [29] cachem_1.1.0                yaml_2.3.12                
## [31] otel_0.2.0                  S4Arrays_1.12.0            
## [33] tools_4.6.0                 parallel_4.6.0             
## [35] BiocParallel_1.46.0         dplyr_1.2.1                
## [37] ggplot2_4.0.3               SummarizedExperiment_1.42.0
## [39] BiocGenerics_0.58.0         vctrs_0.7.3                
## [41] R6_2.6.1                    stats4_4.6.0               
## [43] matrixStats_1.5.0           lifecycle_1.0.5            
## [45] stringr_1.6.0               Seqinfo_1.2.0              
## [47] IRanges_2.46.0              S4Vectors_0.50.0           
## [49] pkgconfig_2.0.3             pillar_1.11.1              
## [51] bslib_0.10.0                gtable_0.3.6               
## [53] glue_1.8.1                  Rcpp_1.1.1-1.1             
## [55] GenomicRanges_1.64.0        xfun_0.57                  
## [57] tibble_3.3.1                tidyselect_1.2.1           
## [59] MatrixGenerics_1.24.0       knitr_1.51                 
## [61] dichromat_2.0-0.1           farver_2.1.2               
## [63] igraph_2.3.0                htmltools_0.5.9            
## [65] rmarkdown_2.31              immApex_1.6.0              
## [67] compiler_4.6.0              S7_0.2.2