---
title: "Advanced TCR Analysis with immLynx"
author: "Nick Borcherding"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Advanced TCR Analysis with immLynx}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  tidy = FALSE
)
library(BiocStyle)

python_available <- tryCatch({
    proc <- basilisk::basiliskStart(immLynx::immLynxEnv)
    on.exit(basilisk::basiliskStop(proc))
    TRUE
}, error = function(e) {
    FALSE
})
```

# Introduction

This vignette builds on the foundations established in
`vignette("immLynx_vignette", package = "immLynx")` to explore advanced
analytical workflows for T-cell receptor (TCR) repertoire data. While
the introductory vignette demonstrates each analysis function
individually, real-world studies often require combining multiple methods
to gain complementary views of repertoire biology. For instance,
sequence clustering identifies groups of structurally related TCRs,
while generation probability highlights sequences that are unlikely to
arise from random recombination and may therefore reflect
antigen-driven selection. Protein language model embeddings offer yet
another perspective by capturing learned biochemical features that do
not depend on predefined distance metrics.

This vignette covers the following advanced topics:

- **Method comparison**: Evaluating how different clustering parameters
  and algorithms partition the same repertoire
- **Custom embedding workflows**: Comparing ESM-2 model sizes and
  embedding paired alpha-beta chains
- **Integration with scRepertoire**: Relating immLynx analysis results
  to clonotype definitions from `r Biocpkg("scRepertoire")`
- **Selection pressure analysis**: Combining OLGA generation
  probabilities with sample metadata to identify selection signatures
- **Distance-based analysis**: Using tcrdist3 distance matrices for
  hierarchical clustering and nearest-neighbor analyses
- **HLA association testing**: Linking metaclone assignments to HLA
  genotype data
- **Scalability strategies**: Practical approaches for analyzing large
  datasets with limited computational resources

# Setup

We begin by loading the required packages and the example dataset. The
`immLynx_example` object is a `SingleCellExperiment` containing
scRNA-seq data with TCR annotations from `r Biocpkg("scRepertoire")`.

```{r load}
library(immLynx)
library(scran)
library(scater)
library(ggplot2)

data("immLynx_example")
```

# Comparing Clustering Methods

The MCL algorithm used by clusTCR has a key parameter, `inflation`, that
controls the granularity of cluster assignments. Lower inflation values
(e.g., 2.0) favor larger, more inclusive clusters, while higher values
(e.g., 3.0) produce tighter, more specific groupings. Choosing the
appropriate inflation parameter depends on the biological question: broad
clusters may capture groups of TCRs with shared structural motifs, while
tight clusters more closely approximate antigen-specificity groups.

Here we run clusTCR with two different inflation parameters on the same
dataset and compare the resulting cluster counts. By using distinct
`column_prefix` values, both sets of assignments are stored in the same
`SingleCellExperiment` object for side-by-side comparison.

```{r clustcr-compare, eval=python_available}
sce_mcl_2 <- runClustTCR(immLynx_example,
                            inflation = 2.0,
                            column_prefix = "mcl_2")
sce_mcl_3 <- runClustTCR(sce_mcl_2,
                            inflation = 3.0,
                            column_prefix = "mcl_3")

cat("MCL (inflation=2):",
    length(unique(na.omit(sce_mcl_3$mcl_2_TRB))),
    "clusters\n")
cat("MCL (inflation=3):",
    length(unique(na.omit(sce_mcl_3$mcl_3_TRB))),
    "clusters\n")
```

# Custom Embedding Workflows

## Using Different ESM-2 Model Sizes

The ESM-2 family of protein language models ranges from 8 million to 15
billion parameters. Larger models generally produce embeddings that
better capture structural and functional properties of protein
sequences, but they require substantially more memory and computation
time. For TCR CDR3 sequences, which are short (typically 10-20 amino
acids), smaller models often perform well and may be preferable for
large-scale analyses.

The following example compares embeddings from the 35M and 650M
parameter variants. Both sets of embeddings are stored as separate
dimensionality reductions in the `SingleCellExperiment` and can be
independently visualized with UMAP.

```{r esm-models, eval=FALSE}
sce_small <- runEmbeddings(
  immLynx_example,
  model_name = "facebook/esm2_t12_35M_UR50D",
  reduction_name = "esm_small"
)

sce_med <- runEmbeddings(
  sce_small,
  model_name = "facebook/esm2_t33_650M_UR50D",
  reduction_name = "esm_medium"
)

sce_med <- scater::runUMAP(sce_med,
                      dimred = "esm_small",
                      name = "umap_small")
sce_med <- scater::runUMAP(sce_med,
                      dimred = "esm_medium",
                      name = "umap_medium")

p1 <- scater::plotReducedDim(sce_med,
              dimred = "umap_small") +
  ggtitle("ESM-2 35M")
p2 <- scater::plotReducedDim(sce_med,
              dimred = "umap_medium") +
  ggtitle("ESM-2 650M")
p1 + p2
```

## Embedding Both Chains

For paired alpha-beta TCR data, embedding both chains together captures
the joint receptor structure. The `chains = "both"` option concatenates
the alpha and beta CDR3 sequences before computing embeddings, producing
a single representation per cell that reflects the full receptor
identity. This is particularly useful when alpha chain pairing
contributes to antigen specificity.

```{r both-chains, eval=FALSE}
sce_paired <- runEmbeddings(
  immLynx_example,
  chains = "both",
  reduction_name = "tcr_paired"
)

sce_paired <- scater::runUMAP(sce_paired,
                         dimred = "tcr_paired")
scater::plotReducedDim(sce_paired, dimred = "UMAP")
```

# Integration with scRepertoire Clonotypes

immLynx is designed to complement `r Biocpkg("scRepertoire")`, which
defines clonotypes based on exact CDR3 sequence matching. By comparing
clusTCR cluster assignments with scRepertoire's clonotype definitions,
we can assess how sequence-similarity-based clustering relates to strict
clonotype identity. Ideally, each cluster should contain one or a few
closely related clonotypes, and cells sharing the same clonotype should
be assigned to the same cluster.

```{r screpertoire-integration, eval=python_available}
sce_clust <- runClustTCR(immLynx_example,
                            chains = "TRB")

comparison <- table(
  clustcr = sce_clust$clustcr_TRB,
  clonotype = sce_clust$CTstrict
)

cat("Number of clustcr clusters:", nrow(comparison), "\n")
cat("Number of unique clonotypes:", ncol(comparison), "\n")
```

# Analyzing Selection Pressure

Generation probability (Pgen) from OLGA provides a window into
selection pressures acting on the TCR repertoire. Sequences with low
Pgen are rarely produced by V(D)J recombination; their presence in the
observed repertoire suggests positive selection, potentially driven by
antigen encounter. By comparing Pgen distributions across experimental
conditions or sample types, we can identify conditions associated with
enrichment of rare, potentially antigen-selected sequences.

```{r selection, eval=python_available}
sce_pgen <- runOLGA(immLynx_example,
                       chains = "TRB")

ggplot(as.data.frame(colData(sce_pgen)), aes(x = Type,
                            y = olga_pgen_log10_TRB)) +
  geom_boxplot() +
  labs(title = "TCR Generation Probability by Sample Type",
       x = "Sample Type",
       y = "log10(Pgen)")

low_pgen_cells <- which(colData(sce_pgen)$olga_pgen_log10_TRB < -15)
cat("Cells with Pgen < 10^-15:", length(low_pgen_cells), "\n")
```

# Combining Distance-Based Methods

tcrdist3 distance matrices provide a continuous measure of TCR
similarity that can serve as input to a variety of unsupervised learning
methods. Here we demonstrate hierarchical clustering of the distance
matrix, which complements the MCL-based approach used by clusTCR. The
`complete` linkage method ensures that all members of a cluster are
within a maximum distance of each other, producing compact groups. Note
that because tcrdist3 deduplicates input sequences, the distance matrix
dimensions correspond to unique TCR sequences rather than individual
cells.

```{r combined-distance, eval=python_available}
dist_results <- runTCRdist(immLynx_example, chains = "beta")

d <- as.dist(dist_results$distances$pw_beta)
hc <- hclust(d, method = "complete")
n_unique <- nrow(dist_results$distances$pw_beta)
clusters <- cutree(hc, k = min(50, n_unique))

cat("Distance matrix:", n_unique, "x", n_unique, "\n")
cat("Hierarchical clusters:", length(unique(clusters)), "\n")
```

# Working with Large Datasets

For datasets with thousands or tens of thousands of cells, computational
resources may become a limiting factor. The following strategies can
help manage memory and runtime:

- **Chunked embedding**: The `chunk_size` parameter in `runEmbeddings()`
  controls how many sequences are processed in each batch. Smaller
  chunks reduce peak memory usage at the cost of slightly longer runtime.
- **MCL clustering**: clusTCR's MCL algorithm scales efficiently to
  large sequence sets and is generally preferable to pairwise distance
  methods for datasets exceeding a few thousand unique sequences.
- **Subsampling for distances**: tcrdist3 computes all-versus-all
  distances, which scales quadratically. For very large datasets,
  computing distances on a representative subset is a practical
  alternative.

```{r large-data, eval=FALSE}
sce_large <- runEmbeddings(
  large_sce,
  chunk_size = 16,
  pool = "mean"
)

sce_large <- runClustTCR(
  large_sce,
  inflation = 2.0
)

sample_cells <- sample(colnames(large_sce), 5000)
subset_obj <- large_sce[, sample_cells]
dist_results <- runTCRdist(subset_obj)
```

# HLA Association Analysis

When HLA genotyping data is available, immLynx can test for associations
between metaclone assignments and specific HLA alleles using
`runHLAassociation()`. This analysis identifies metaclones whose
frequency differs significantly between HLA-positive and HLA-negative
individuals, suggesting potential HLA-restricted antigen recognition.
The function performs Fisher's exact test for each metaclone-HLA
combination and returns adjusted p-values.

```{r hla, eval=python_available}
metaclones <- runMetaclonotypist(immLynx_example,
                                 return_input = FALSE)

hla_data <- data.frame(
  barcode = metaclones$barcode,
  HLA_A_01_01 = sample(c(TRUE, FALSE),
                       nrow(metaclones),
                       replace = TRUE),
  HLA_A_02_01 = sample(c(TRUE, FALSE),
                       nrow(metaclones),
                       replace = TRUE),
  HLA_B_07_02 = sample(c(TRUE, FALSE),
                       nrow(metaclones),
                       replace = TRUE)
)

hla_results <- runHLAassociation(metaclones,
                                 hla_data)
head(hla_results)
```

# Exporting Results

All immLynx results can be exported to standard file formats for use
with external tools or for archiving. The `convertToTcrdist()` function
reformats TCR data into the tabular format expected by standalone
tcrdist3, facilitating interoperability with Python-based analysis
pipelines.

```{r export, eval=FALSE}
tcr_data <- extractTCRdata(immLynx_example,
                           chains = "both",
                           format = "wide")
tcrdist_format <- convertToTcrdist(tcr_data,
                                   chains = "both")
write.csv(tcrdist_format, "tcr_for_tcrdist.csv", row.names = FALSE)

clusters <- data.frame(
  barcode = colnames(sce_clust),
  clustcr = sce_clust$clustcr_TRB
)
write.csv(clusters, "cluster_assignments.csv", row.names = FALSE)

embeddings <- reducedDim(sce_small, "esm_small")
write.csv(embeddings, "tcr_embeddings.csv")
```

# Best Practices

1. **Data Quality**: Always validate TCR data before analysis using
   `validateTCRdata()`. This catches common issues such as missing CDR3
   sequences, non-standard amino acids, and inconsistent chain
   annotations that can cause downstream errors or misleading results.

   ```{r validate, eval=FALSE}
   validation <- validateTCRdata(tcr_data, check_sequences = TRUE)
   if (!validation$valid) {
     warning(validation$errors)
   }
   ```

2. **Memory Management**: For large datasets, use smaller chunk sizes in
   `runEmbeddings()` and consider subsampling strategies for
   distance-based methods.

3. **Reproducibility**: Set random seeds before running stochastic
   algorithms such as MCL clustering to ensure reproducible results.

   ```{r seed, eval=FALSE}
   set.seed(42)
   sce <- runClustTCR(sce)
   ```

4. **Multiple Methods**: Use multiple clustering or grouping methods and
   look for consensus. Sequences that are consistently co-clustered
   across methods are more likely to represent biologically meaningful
   groups.

5. **Biological Validation**: Computational clusters should be validated
   against biological evidence such as antigen specificity, phenotypic
   markers, or clinical outcomes whenever possible.

# Session Info

```{r session, eval=TRUE}
sessionInfo()
```