---
title: Getting Started with immReferent
author: 
- name: Nick Borcherding
  email: ncborch@gmail.com
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
output:
  BiocStyle::html_document:
    toc_float: true
package: immReferent
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Getting Started with immReferent}
  %\VignetteEncoding{UTF-8} 
---

```{r, include = FALSE}
library(immReferent)
library(BiocStyle)

# Make chunks robust on CI: evaluate IMGT or OGRDB examples only if the site is reachable
imgt_ok  <- try(is_imgt_available(),  silent = TRUE)
ogrdb_ok <- try(is_ogrdb_available(), silent = TRUE)
imgt_ok  <- if (inherits(imgt_ok,  "try-error")) FALSE else isTRUE(imgt_ok)
ogrdb_ok <- if (inherits(ogrdb_ok, "try-error")) FALSE else isTRUE(ogrdb_ok)

knitr::opts_chunk$set(
  error   = FALSE,
  message = FALSE,
  warning = FALSE,
  tidy    = FALSE
)

set.seed(42)
```

## Introduction

The `immReferent` package provides a centralized and easy-to-use interface for downloading, managing, and loading immune repertoire and HLA reference sequences from the IMGT, IPD-IMGT/HLA and OGRDB databases. Its primary goal is to ensure that analyses are based on consistent, up-to-date, and correctly formatted reference data.

This vignette will walk you through the basic functionality of the package.

## Installation

```{r eval = F}
devtools::install_github("BorchLab/immReferent")
```

Or via Bioconductor (once accepted)

```{r eval = F}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("immReferent")
```

## Downloading Reference Sequences

```{r setup}
library(immReferent)
```

The main function for all data retrieval is `getIMGT()`. It handles both downloading new data from the source and loading previously downloaded data from a local cache.

### Downloading HLA Sequences (IPD-IMGT/HLA)

The IPD-IMGT/HLA database provides reference sequences for the Human Leukocyte Antigen (HLA) system. You can download the complete set of nucleotide or protein sequences using `gene = "HLA"`.

```{r get_hla, eval = imgt_ok}
# Download all available HLA protein sequences
# This will download the file to the cache on the first run
hla_prot <- getIMGT(gene = "HLA", 
                    type = "PROT")

# Inspect the result
print(hla_prot)
cat("Number of sequences:", length(hla_prot), "\n")
cat("First sequence name:", names(hla_prot)[1], "\n")
```

### Downloading TCR/BCR Sequences (IMGT)

For T-cell receptor (TCR) and B-cell receptor (BCR) genes, you can specify the species and the gene or gene family you are interested in.

```{r get_ighv, eval = imgt_ok}
# Download human IGHV nucleotide sequences
ighv_nuc <- getIMGT(species = "human", 
                    gene = "IGHV", 
                    type = "NUC")

# Inspect the result
print(ighv_nuc)
```

You can also download entire families of genes at once by specifying a group name like `"IGH"`, `"IGK"`, `"TRB"`, etc.

```{r get_trb, eval = imgt_ok}
# Download all mouse TRB genes (V, D, J, and C)
trb_mouse <- getIMGT(species = "mouse",
                     gene = "TRB", 
                     type = "NUC")

# This object will contain TRBV, TRBD, TRBJ, and TRBC sequences
print(trb_mouse)
```

### Downloading Germline Sets from OGRDB (AIRR)

OGRDB provides AIRR-compliant **germline sets** for immunoglobulin loci (and growing coverage more broadly). You can retrieve:

- **FASTA** (gapped or ungapped) nucleotide sequences, or
- **AIRR JSON**, which we parse into `DNAStringSet` (and optionally translate to `AAStringSet`).

```{r ogrdb_igh_fasta, eval=ogrdb_ok}
# Human IGH nucleotide sequences (gapped FASTA)
igh_ogrdb <- getOGRDB(
  species = "human",
  locus   = "IGH",
  type    = "NUC",
  format  = "FASTA_GAPPED"
)
igh_ogrdb
```

Pulling the AIRR-formatted sequences:

```{r ogrdb_igk_airr, eval=ogrdb_ok}
# Human IGK sequences via AIRR JSON (parsed to DNAStringSet)
igk_airr <- getOGRDB(
  species = "human",
  locus   = "IGK",
  type    = "NUC",
  format  = "AIRR"
)
igk_airr
```

## Working with the Cache

`immReferent` automatically caches all downloaded data to avoid repeated downloads and to enable offline work.

### Listing Cached Files

You can see all the files currently in your cache using the `listIMGT()` or `listOGRDB()` function.

```{r list_imgt, eval=imgt_ok}
# List the full paths of all cached files
listIMGT()
listOGRDB()
```

### Loading from Cache

When you call `getIMGT()`, it will always load data from the cache if it's available. If you want to *only* load from the cache and prevent any possibility of a download, you can use `loadIMGT()`. This function is useful in offline environments or for ensuring strict reproducibility.

```{r load_imgt, eval=imgt_ok}
# This will load from the cache if available, or download otherwise
ighv_nuc <- getIMGT(species = "human", 
                    gene = "IGHV", 
                    type = "NUC")

# This will load from the cache, or fail if not found and offline
ighv_nuc_from_cache <- loadIMGT(species = "human", 
                                gene = "IGHV", 
                                type = "NUC")
```

Similar to the above, we can pull and load from OGRDB using `getOGRDB()` and `loadOGRDB()`.

```{r eval=ogrdb_ok}
# This will load from the cache if available, or download otherwise
igh_nuc <- getOGRDB(species = "human", 
                    locus = "IGH", 
                    type = "NUC", 
                    format = "FASTA_GAPPED")

# This will load from the cache, or fail if not found and offline
igh_from_cache <- loadOGRDB(species = "human", 
                            locus = "IGH", 
                            type = "NUC", 
                            format = "FASTA_GAPPED")
```

### Refreshing the Cache

If you suspect the online data has been updated and you want to re-download it, you can use `refreshIMGT()` or `refreshOGRDB()`. This is just a convenient shortcut for `getIMGT(..., refresh = TRUE)`.

```{r refresh_imgt, eval=imgt_ok & ogrdb_ok}
# Force a re-download of the human IGHV sequences
ighv_nuc_fresh <- refreshIMGT(species = "human", 
                              gene = "IGHV", 
                              type = "NUC")

# Force a re-download of human IGK (gapped FASTA)
igk_fresh <- refreshOGRDB(species = "human", 
                          locus = "IGK", 
                          type = "NUC", 
                          format = "FASTA_GAPPED")
```

## Exporting Reference Databases for External Tools

`immReferent` provides export functions to create reference databases compatible with popular immune repertoire analysis tools. These functions format the downloaded sequences appropriately for each tool.

### Export to MiXCR

[MiXCR](https://mixcr.com/) is a popular tool for analyzing immune repertoire sequencing data. The `exportMiXCR()` function creates FASTA files formatted for MiXCR's `buildLibrary` command.

```{r export_mixcr, eval = imgt_ok}
# Download human IGH sequences
igh_seqs <- getIMGT(species = "human",
                    gene = "IGH",
                    type = "NUC",
                    suppressMessages = TRUE)

# Export to MiXCR format
mixcr_dir <- tempdir()
mixcr_files <- exportMiXCR(igh_seqs, 
                           mixcr_dir, 
                           chain = "IGH")

# View created files
print(mixcr_files)

# View first few lines of V gene file
if (!is.null(mixcr_files$v_genes)) {
  cat(head(readLines(mixcr_files$v_genes), 4), sep = "\n")
}
```

The output files can be used with MiXCR's `buildLibrary` command:
```bash
mixcr buildLibrary \
    --v-genes-from-fasta v-genes.IGH.fasta \
    --v-gene-feature VRegion \
    --j-genes-from-fasta j-genes.IGH.fasta \
    --d-genes-from-fasta d-genes.IGH.fasta \
    --c-genes-from-fasta c-genes.IGH.fasta \
    --chain IGH \
    --species hs \
    custom-IGH.json.gz
```

### Export to TRUST4

[TRUST4](https://github.com/liulab-dfci/TRUST4) is a tool for reconstructing immune repertoires from RNA-seq data. The `exportTRUST4()` function creates a FASTA file compatible with TRUST4's `--ref` parameter.

```{r export_trust4, eval = imgt_ok}
# Export to TRUST4 format (includes constant regions by default)
trust4_file <- tempfile(fileext = ".fa")
exportTRUST4(igh_seqs, trust4_file)

# View header format
cat(head(readLines(trust4_file), 6), sep = "\n")
```

Use the exported file with TRUST4:
```bash
./run-trust4 -f hg38_bcrtcr.fa --ref custom_IMGT+C.fa \
    -1 reads_1.fq -2 reads_2.fq -o output
```

### Export to Cell Ranger VDJ

[Cell Ranger](https://www.10xgenomics.com/support/software/cell-ranger) from 10x Genomics includes VDJ analysis capabilities. The `exportCellRanger()` function creates a FASTA file suitable for building a custom VDJ reference.

```{r export_cellranger, eval = imgt_ok}
# Export V genes for Cell Ranger
cellranger_file <- tempfile(fileext = ".fa")
exportCellRanger(igh_seqs, cellranger_file)

# View header format
cat(head(readLines(cellranger_file), 4), sep = "\n")
```
Note: For a complete Cell Ranger VDJ reference, you also need a GTF file with gene annotations and should use `cellranger mkvdjref` to build the reference.

### Export to IgBLAST

[IgBLAST](https://ncbi.github.io/igblast/) is NCBI's tool for analyzing immunoglobulin and T cell receptor sequences. The `exportIgBLAST()` function creates FASTA files with simplified headers compatible with IgBLAST.

```{r export_igblast, eval = imgt_ok}
# Export to IgBLAST format
igblast_dir <- tempdir()
igblast_files <- exportIgBLAST(igh_seqs, igblast_dir,
                                organism = "human",
                                receptor_type = "ig")

# View created files
print(igblast_files)

# View header format
if (!is.null(igblast_files$v_genes)) {
  cat(head(readLines(igblast_files$v_genes), 4), sep = "\n")
}
```

After exporting, create BLAST databases using `makeblastdb`:
```bash
makeblastdb -parse_seqids -dbtype nucl \
    -in human_ig_v.fasta -out human_ig_v
makeblastdb -parse_seqids -dbtype nucl \
    -in human_ig_d.fasta -out human_ig_d
makeblastdb -parse_seqids -dbtype nucl \
    -in human_ig_j.fasta -out human_ig_j
```

# Conclusion

This has been a general overview of the capabilities of **immReferent** for downloading and caching immune receptor and HLA sequences from IMGT and OGRDB, as well as exporting them to formats compatible with popular analysis tools. If you have any questions, comments, or suggestions, feel free to visit the [GitHub repository](https://github.com/BorchLab/immReferent).

## Session Info

```{r}
sessionInfo()
```