--- title: Getting Started with immReferent author: - name: Nick Borcherding email: ncborch@gmail.com affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`' output: BiocStyle::html_document: toc_float: true package: immReferent vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Getting Started with immReferent} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} library(immReferent) library(BiocStyle) # Make chunks robust on CI: evaluate IMGT or OGRDB examples only if the site is reachable imgt_ok <- try(is_imgt_available(), silent = TRUE) ogrdb_ok <- try(is_ogrdb_available(), silent = TRUE) imgt_ok <- if (inherits(imgt_ok, "try-error")) FALSE else isTRUE(imgt_ok) ogrdb_ok <- if (inherits(ogrdb_ok, "try-error")) FALSE else isTRUE(ogrdb_ok) knitr::opts_chunk$set( error = FALSE, message = FALSE, warning = FALSE, tidy = FALSE ) set.seed(42) ``` ## Introduction The `immReferent` package provides a centralized and easy-to-use interface for downloading, managing, and loading immune repertoire and HLA reference sequences from the IMGT, IPD-IMGT/HLA and OGRDB databases. Its primary goal is to ensure that analyses are based on consistent, up-to-date, and correctly formatted reference data. This vignette will walk you through the basic functionality of the package. ## Installation ```{r eval = F} devtools::install_github("BorchLab/immReferent") ``` Or via Bioconductor (once accepted) ```{r eval = F} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("immReferent") ``` ## Downloading Reference Sequences ```{r setup} library(immReferent) ``` The main function for all data retrieval is `getIMGT()`. It handles both downloading new data from the source and loading previously downloaded data from a local cache. ### Downloading HLA Sequences (IPD-IMGT/HLA) The IPD-IMGT/HLA database provides reference sequences for the Human Leukocyte Antigen (HLA) system. You can download the complete set of nucleotide or protein sequences using `gene = "HLA"`. ```{r get_hla, eval = imgt_ok} # Download all available HLA protein sequences # This will download the file to the cache on the first run hla_prot <- getIMGT(gene = "HLA", type = "PROT") # Inspect the result print(hla_prot) cat("Number of sequences:", length(hla_prot), "\n") cat("First sequence name:", names(hla_prot)[1], "\n") ``` ### Downloading TCR/BCR Sequences (IMGT) For T-cell receptor (TCR) and B-cell receptor (BCR) genes, you can specify the species and the gene or gene family you are interested in. ```{r get_ighv, eval = imgt_ok} # Download human IGHV nucleotide sequences ighv_nuc <- getIMGT(species = "human", gene = "IGHV", type = "NUC") # Inspect the result print(ighv_nuc) ``` You can also download entire families of genes at once by specifying a group name like `"IGH"`, `"IGK"`, `"TRB"`, etc. ```{r get_trb, eval = imgt_ok} # Download all mouse TRB genes (V, D, J, and C) trb_mouse <- getIMGT(species = "mouse", gene = "TRB", type = "NUC") # This object will contain TRBV, TRBD, TRBJ, and TRBC sequences print(trb_mouse) ``` ### Downloading Germline Sets from OGRDB (AIRR) OGRDB provides AIRR-compliant **germline sets** for immunoglobulin loci (and growing coverage more broadly). You can retrieve: - **FASTA** (gapped or ungapped) nucleotide sequences, or - **AIRR JSON**, which we parse into `DNAStringSet` (and optionally translate to `AAStringSet`). ```{r ogrdb_igh_fasta, eval=ogrdb_ok} # Human IGH nucleotide sequences (gapped FASTA) igh_ogrdb <- getOGRDB( species = "human", locus = "IGH", type = "NUC", format = "FASTA_GAPPED" ) igh_ogrdb ``` Pulling the AIRR-formatted sequences: ```{r ogrdb_igk_airr, eval=ogrdb_ok} # Human IGK sequences via AIRR JSON (parsed to DNAStringSet) igk_airr <- getOGRDB( species = "human", locus = "IGK", type = "NUC", format = "AIRR" ) igk_airr ``` ## Working with the Cache `immReferent` automatically caches all downloaded data to avoid repeated downloads and to enable offline work. ### Listing Cached Files You can see all the files currently in your cache using the `listIMGT()` or `listOGRDB()` function. ```{r list_imgt, eval=imgt_ok} # List the full paths of all cached files listIMGT() listOGRDB() ``` ### Loading from Cache When you call `getIMGT()`, it will always load data from the cache if it's available. If you want to *only* load from the cache and prevent any possibility of a download, you can use `loadIMGT()`. This function is useful in offline environments or for ensuring strict reproducibility. ```{r load_imgt, eval=imgt_ok} # This will load from the cache if available, or download otherwise ighv_nuc <- getIMGT(species = "human", gene = "IGHV", type = "NUC") # This will load from the cache, or fail if not found and offline ighv_nuc_from_cache <- loadIMGT(species = "human", gene = "IGHV", type = "NUC") ``` Similar to the above, we can pull and load from OGRDB using `getOGRDB()` and `loadOGRDB()`. ```{r eval=ogrdb_ok} # This will load from the cache if available, or download otherwise igh_nuc <- getOGRDB(species = "human", locus = "IGH", type = "NUC", format = "FASTA_GAPPED") # This will load from the cache, or fail if not found and offline igh_from_cache <- loadOGRDB(species = "human", locus = "IGH", type = "NUC", format = "FASTA_GAPPED") ``` ### Refreshing the Cache If you suspect the online data has been updated and you want to re-download it, you can use `refreshIMGT()` or `refreshOGRDB()`. This is just a convenient shortcut for `getIMGT(..., refresh = TRUE)`. ```{r refresh_imgt, eval=imgt_ok & ogrdb_ok} # Force a re-download of the human IGHV sequences ighv_nuc_fresh <- refreshIMGT(species = "human", gene = "IGHV", type = "NUC") # Force a re-download of human IGK (gapped FASTA) igk_fresh <- refreshOGRDB(species = "human", locus = "IGK", type = "NUC", format = "FASTA_GAPPED") ``` ## Exporting Reference Databases for External Tools `immReferent` provides export functions to create reference databases compatible with popular immune repertoire analysis tools. These functions format the downloaded sequences appropriately for each tool. ### Export to MiXCR [MiXCR](https://mixcr.com/) is a popular tool for analyzing immune repertoire sequencing data. The `exportMiXCR()` function creates FASTA files formatted for MiXCR's `buildLibrary` command. ```{r export_mixcr, eval = imgt_ok} # Download human IGH sequences igh_seqs <- getIMGT(species = "human", gene = "IGH", type = "NUC", suppressMessages = TRUE) # Export to MiXCR format mixcr_dir <- tempdir() mixcr_files <- exportMiXCR(igh_seqs, mixcr_dir, chain = "IGH") # View created files print(mixcr_files) # View first few lines of V gene file if (!is.null(mixcr_files$v_genes)) { cat(head(readLines(mixcr_files$v_genes), 4), sep = "\n") } ``` The output files can be used with MiXCR's `buildLibrary` command: ```bash mixcr buildLibrary \ --v-genes-from-fasta v-genes.IGH.fasta \ --v-gene-feature VRegion \ --j-genes-from-fasta j-genes.IGH.fasta \ --d-genes-from-fasta d-genes.IGH.fasta \ --c-genes-from-fasta c-genes.IGH.fasta \ --chain IGH \ --species hs \ custom-IGH.json.gz ``` ### Export to TRUST4 [TRUST4](https://github.com/liulab-dfci/TRUST4) is a tool for reconstructing immune repertoires from RNA-seq data. The `exportTRUST4()` function creates a FASTA file compatible with TRUST4's `--ref` parameter. ```{r export_trust4, eval = imgt_ok} # Export to TRUST4 format (includes constant regions by default) trust4_file <- tempfile(fileext = ".fa") exportTRUST4(igh_seqs, trust4_file) # View header format cat(head(readLines(trust4_file), 6), sep = "\n") ``` Use the exported file with TRUST4: ```bash ./run-trust4 -f hg38_bcrtcr.fa --ref custom_IMGT+C.fa \ -1 reads_1.fq -2 reads_2.fq -o output ``` ### Export to Cell Ranger VDJ [Cell Ranger](https://www.10xgenomics.com/support/software/cell-ranger) from 10x Genomics includes VDJ analysis capabilities. The `exportCellRanger()` function creates a FASTA file suitable for building a custom VDJ reference. ```{r export_cellranger, eval = imgt_ok} # Export V genes for Cell Ranger cellranger_file <- tempfile(fileext = ".fa") exportCellRanger(igh_seqs, cellranger_file) # View header format cat(head(readLines(cellranger_file), 4), sep = "\n") ``` Note: For a complete Cell Ranger VDJ reference, you also need a GTF file with gene annotations and should use `cellranger mkvdjref` to build the reference. ### Export to IgBLAST [IgBLAST](https://ncbi.github.io/igblast/) is NCBI's tool for analyzing immunoglobulin and T cell receptor sequences. The `exportIgBLAST()` function creates FASTA files with simplified headers compatible with IgBLAST. ```{r export_igblast, eval = imgt_ok} # Export to IgBLAST format igblast_dir <- tempdir() igblast_files <- exportIgBLAST(igh_seqs, igblast_dir, organism = "human", receptor_type = "ig") # View created files print(igblast_files) # View header format if (!is.null(igblast_files$v_genes)) { cat(head(readLines(igblast_files$v_genes), 4), sep = "\n") } ``` After exporting, create BLAST databases using `makeblastdb`: ```bash makeblastdb -parse_seqids -dbtype nucl \ -in human_ig_v.fasta -out human_ig_v makeblastdb -parse_seqids -dbtype nucl \ -in human_ig_d.fasta -out human_ig_d makeblastdb -parse_seqids -dbtype nucl \ -in human_ig_j.fasta -out human_ig_j ``` # Conclusion This has been a general overview of the capabilities of **immReferent** for downloading and caching immune receptor and HLA sequences from IMGT and OGRDB, as well as exporting them to formats compatible with popular analysis tools. If you have any questions, comments, or suggestions, feel free to visit the [GitHub repository](https://github.com/BorchLab/immReferent). ## Session Info ```{r} sessionInfo() ```