---
title: "HiCaptuRe Introduction"
author: "Laureano Tomás-Daza"
package: HiCaptuRe
date: "`r format(Sys.time(), '%d %B %Y')`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
library(knitr)
library(kableExtra)
```

# Installation

You can install the latest release of `HiCaptuRe` from Bioconductor:

```{r, install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("HiCaptuRe")
```


If you want to test the development version, you can install it from the github repository:

```{r, install2, eval=FALSE}
BiocManager::install("LaureTomas/HiCaptuRe")
```


# Overview

HiCaptuRe is an R package designed to manage and analyze Capture Hi-C data, including high-resolution methods like liCHi-C. These approaches detect long-range genomic interactions involving selected regions of interest ("baits") by combining chromatin capture with targeted probe enrichment. The result is a cost-effective, fragment-level resolution map of genomic contacts.

## Motivation for Bioconductor

HiCaptuRe builds directly on Bioconductor’s `GenomicInteractions` class, extending it with capture‑specific metadata and workflows for importing, and bait‑aware fragment handling in Capture Hi‑Cexperiments. This design ensures seamless use of existing utilities and visualizations, and straightforward integration with CHi‑C callers such as **CHiCAGO**. Unlike CHiCAGO, which focuses on statistical interaction calling, and GenomicInteractions, which provides general‑purpose infrastructure for chromatin contact data, HiCaptuRe addresses the capture‑specific preprocessing, organization, and annotation needs that precede or complement these tools. Distribution through Bioconductor guarantees consistent installation via `BiocManager`, access to genome and annotation resources, and continuous cross‑platform checks for reproducibility, helping integrate capture‑specific workflows into the broader Bioconductor 3D genomics framework.


# Basic Vocabulary

This section defines the key terms used throughout the HiCaptuRe package and Capture Hi-C analysis. Understanding this terminology is essential to correctly interpret the structure of interaction data and the functionality of the package.


- **Restriction fragment**: A genomic interval resulting from digestion of the genome using a restriction enzyme that recognizes a specific DNA motif. These fragments define the resolution of interaction detection in Capture Hi-C data.

- **Hi-C**: A chromosome conformation capture (3C) technique that measures the interaction frequency between all pairs of restriction fragments genome-wide, creating a genome-wide contact map.

- **Capture Hi-C**: A targeted variant of Hi-C that enriches for interactions involving specific genomic regions of interest. This is achieved using designed oligonucleotide probes (baits) that hybridize to the chosen regions, increasing sequencing coverage for relevant interactions.

- **Bait**: A restriction fragment targeted during the capture step (e.g., a promoter or enhancer). These fragments are considered the primary points of interest.

- **Other end (OE)**: A restriction fragment that is not directly targeted during capture but is found interacting with a bait. It represents the uncaptured partner in a bait–OE interaction.

- **Anchor**: One of the two restriction fragments involved in a detected interaction. Each interaction has two anchors: anchor1 and anchor2.

- **Interaction**: A paired-end read (or its post-processed equivalent) that represents a spatial interaction between two anchors. Depending on the annotation, this interaction may be bait–bait or bait–OE.

- **Interactome**: The complete set of interactions detected in a given sample. It can include various interaction classes depending on annotation completeness and data type.

- **Interaction types**:
  - **Bait–Bait (B_B)**: Both anchors are bait fragments.
  - **Bait–OtherEnd (B_OE)**: One anchor is a bait, the other is not.

- **Chicago Score**: A statistical score assigned to each interaction (e.g., by the CHiCAGO method) to assess confidence or interaction strength. This score is often included in peakmatrix or ibed formats.


# Data Origin and Experimental Workflow

HiCaptuRe is designed to support workflows based on data generated using the **Capture Hi-C** experimental pipeline. This typically includes:

1. **Chromatin conformation capture (3C)**:
   Crosslinked DNA is digested with a restriction enzyme (e.g., HindIII) and re-ligated to form hybrid DNA molecules that reflect spatial proximity in the nucleus.

2. **Capture enrichment**:
   A key step in Capture Hi-C is the hybridization of biotinylated oligonucleotide probes to regions of interest (e.g., promoters). This enriches the library for interactions involving those "bait" fragments.

3. **Library preparation and sequencing**:
   The ligated DNA is sheared and sequenced to detect paired-end reads representing interactions between restriction fragments.

4. **Read mapping and fragment-level assignment with HiCUP**:
   The [HiCUP](https://www.bioinformatics.babraham.ac.uk/projects/hicup/) pipeline aligns paired-end Hi-C reads to the reference genome, filters out invalid pairs (e.g., self-ligation, circularized reads, fragments too close), and assigns them to restriction fragments. It also distinguishes between capture and non-capture reads to assess **capture efficiency**. The output is a filtered BAM file.

5. **Interactions calling with CHiCAGO**:
   The [CHiCAGO](https://bioconductor.org/packages/release/bioc/html/Chicago.html) method assigns confidence scores to each interaction using a background model that accounts for distance and biases in Capture Hi-C. These scores (e.g., “CS”) are commonly included in `ibed` or `peakmatrix` files and used as thresholds for downstream filtering.

HiCaptuRe works downstream of this pipeline, assuming that a **interactions file** is already available.

You can use HiCaptuRe to annotate, filter, and format these interactions.

# Capture Hi-C Data Formats (some CHiCAGO outputs)

HiCaptuRe supports multiple file formats generated by the [CHiCAGO](https://bioconductor.org/packages/Chicago) pipeline and related tools. These formats vary in structure, completeness, and intended use. Below, we describe each format supported by HiCaptuRe.

For detailed explanations of how these files are generated and used within the CHiCAGO pipeline, please refer to the [CHiCAGO Bioconductor vignette](https://bioconductor.org/packages/release/bioc/vignettes/Chicago/inst/doc/Chicago.html).


## **interBed - ibed** — Recommended Format

```         
  bait_chr bait_start bait_end         bait_name otherEnd_chr otherEnd_start otherEnd_end otherEnd_name N_reads score
1       19     290159   302184             PLPP2           19         343893       369651         MIER2      21  6.07
2       19     290159   302184             PLPP2           19         370987       379828          THEG      15  7.00
3       19     290159   302184             PLPP2           19         402130       410516        C2CD4C      10  5.60
4       19     343893   369651             MIER2           19         530387       539467         CDC34       5  7.83
5       19     506618   515156 TPGS1,MADCAM1-AS1           19         530387       539467         CDC34      18 11.40
```
- **Purpose**: Standardized and complete interaction format used throughout CHiCAGO pipelines. Recommended for HiCaptuRe workflows.

- **Structure**: Each row represents a single interaction between a bait and another fragment (bait or other-end).

- **Columns**:
  - `bait_chr`, `bait_start`, `bait_end`: genomic location of the bait fragment
  - `bait_name`: name or annotation of the bait
  - `otherEnd_chr`, `otherEnd_start`, `otherEnd_end`: genomic location of the interacting other-end
  - `otherEnd_name`: name or annotation of the OE fragment
  - `N_reads`: number of supporting reads
  - `score`: CHiCAGO score for the interaction

## **peakmatrix** — Multi-sample Interaction Matrix

This format is generated by the `makePeakMatrix.R` script in CHiCAGO Tools and is commonly used in high-throughput CHi-C experiments such as liCHi-C.

```         
  baitChr baitStart baitEnd baitID baitName oeChr oeStart  oeEnd oeID oeName   dist  sample1  sample2
1      19    290159  302184     67    PLPP2    19  343893 369651   75  MIER2  60600 6.072719 3.028272
2      19    290159  302184     67    PLPP2    19  370987 379828   77   THEG  79236 7.004499 3.122154
3      19    290159  302184     67    PLPP2    19  402130 410516   80 C2CD4C 110151 5.600691 7.393738
4      19    343893  369651     75    MIER2    19  530387 539467   92  CDC34 178155 7.829787 4.538076
5      19    370987  379828     77     THEG    19  450586 456228   84      .  77999 2.356043 5.501240
```

- **Purpose**: Compact matrix storing CHiCAGO scores across multiple samples.

- **Structure**:
  - Each row represents a single interaction between a bait and another fragment (bait or other-end).
  - Additional columns store sample-specific CHiCAGO scores

- **Columns**:
  - `baitChr`, `baitStart`, `baitEnd`, `baitID`, `baitName`: genomic coordinates, ID, and annotation of the bait fragment
  - `oeChr`, `oeStart`, `oeEnd`, `oeID`, `oeName`: genomic coordinates, ID, and annotation of the other-end fragment
  - `dist`: genomic distance between bait and OE
  - `sampleX`: CHiCAGO interaction score for sample X (e.g., `sample1`, `sample2`, ...)

## **seqMonk** — Two-row Interaction Format

```         
  V1     V2     V3     V4 V5   V6
1 19 343893 369651  MIER2 21 6.07
2 19 290159 302184  PLPP2 21 6.07
3 19 370987 379828   THEG 15 7.00
4 19 290159 302184  PLPP2 15 7.00
```

- **Purpose**: Used for visualization in [SeqMonk](https://www.bioinformatics.babraham.ac.uk/projects/seqmonk/). It's an output of ChICAGO.

- **Structure**:
  - Each interaction is split across **two consecutive rows**:
    - First row = bait fragment
    - Second row = other end (OE)

- **Columns**:
  - `V1`: chromosome where the fragment is located
  - `V2`: start coordinate of the fragment
  - `V3`: end coordinate of the fragment
  - `V4`: name or identifier of the fragment (e.g., gene symbol or `.`)
  - `V5`: number of reads supporting the interaction
  - `V6`: confidence score assigned by the CHiCAGO method

## **bedpe ** — Generic Paired-End Interaction Format

```         
  V1     V2     V3 V4     V5     V6 V7    V8 V9 V10
1 19 290159 302184 19 343893 369651  1  6.07  *   *
2 19 290159 302184 19 370987 379828  2  7.00  *   *
3 19 290159 302184 19 402130 410516  3  5.60  *   *
4 19 343893 369651 19 530387 539467  4  7.83  *   *
5 19 506618 515156 19 530387 539467  5 11.40  *   *
```

- **Purpose**: A generic paired-end BED format used to describe interactions between two genomic regions. Supported by many tools but lacks bait/OE annotations. HiCaptuRe can import and annotate this format using `annotate_interactions()`.

- **Structure**:
  - Each row represents a single interaction between two genomic fragments
  - First six columns specify the coordinates of both anchors
  - Additional columns can include interaction metadata (e.g., ID, score)

- **Columns**:
  - `V1`, `V2`, `V3`: chromosome, start, and end of the first anchor
  - `V4`, `V5`, `V6`: chromosome, start, and end of the second anchor
  - `V7`: interaction ID (or row number)
  - `V8`: interaction score (e.g., CHiCAGO score)
  - `V9`, `V10`: optional fields (can include strand, annotations, or unused placeholders)

## **washU ** — Minimal Format for Browser Upload

```         
     V1     V2     V3                       V4
1 chr19 290159 302184 chr19:343893-369651,6.07
2 chr19 290159 302184    chr19:370987-379828,7
3 chr19 290159 302184  chr19:402130-410516,5.6
4 chr19 343893 369651 chr19:530387-539467,7.83
5 chr19 506618 515156 chr19:530387-539467,11.4
```

- **Purpose**: Used for uploading interaction data to the [WashU Epigenome Browser](https://epigenomegateway.wustl.edu/). Supported as an input format in HiCaptuRe.

- **Structure**:
  - Each row represents a single interaction between two genomic regions
  - Minimal format with no bait/OE annotation
  - The first region is split into separate columns (`chr`, `start`, `end`)
  - The second region and score are combined into a single string

- **Columns**:
  - `V1`: chromosome of the first anchor
  - `V2`: start coordinate of the first anchor
  - `V3`: end coordinate of the first anchor
  - `V4`: second anchor and CHiCAGO score (or any value) in the format `chr:start-end,score`

## **washUold ** — (Legacy)

```         
                   V1                  V2    V3
1 chr19:290159,302184 chr19:343893,369651  6.07
2 chr19:290159,302184 chr19:370987,379828  7.00
3 chr19:290159,302184 chr19:402130,410516  5.60
4 chr19:343893,369651 chr19:530387,539467  7.83
5 chr19:506618,515156 chr19:530387,539467 11.40
```
- **Purpose**: Legacy version of the WashU browser format with both anchors encoded as strings. Supported for backward compatibility in HiCaptuRe.

- **Structure**:
  - Each row represents a single interaction between two genomic regions
  - Both anchors are stored as combined strings with embedded coordinates
  - Includes CHiCAGO score (or any value)

- **Columns**:
  - `V1`: first anchor in the format `chr:start,end`
  - `V2`: second anchor in the format `chr:start,end`
  - `V3`: CHiCAGO score (or any value)

## Summary
```{r, summary,echo=FALSE}
# Define the table with emojis in logical columns
format_table <- data.frame(
    Format = c("`ibed`", "`peakmatrix`", "`seqMonk`", "`bedpe`", "`washU`", "`washUold`"),
    `Recommended Use` = c(
        "Standard HiCaptuRe input",
        "High-throughput liCHi-C",
        "Visualization in SeqMonk",
        "Generic interaction input/output",
        "WashU browser upload",
        "Legacy WashU format"
    ),
    `Bait/OE Annotation` = c("✅", "✅", "✅ (split rows)", "❌", "❌", "❌"),
    `CHiCAGO Score` = c("✅", "✅", "✅", "✅", "✅ (embedded)", "✅"),
    `Multi-sample` = c("❌", "✅", "❌", "❌", "❌", "❌"),
    check.names = FALSE
)

# Render the table
kable(format_table,
    format = "html", escape = FALSE,
    caption = "Summary of Supported Data Formats in HiCaptuRe"
) |>
    kable_styling(
        bootstrap_options = c("striped", "hover", "condensed", "responsive"),
        full_width = FALSE, position = "center"
    ) |>
    row_spec(0, extra_css = "text-align: center;")
```


# The HiCaptuRe Object

HiCaptuRe wraps a `GenomicInteractions` object and adds slots and metadata specific to Capture Hi-C. This provides:

- Compatibility with Bioconductor tools
- Storage of interaction classification
- Ability to record processing parameters and filtering operations

📦 Slots include:

- `parameters`: stores info about the digest, input file, annotations and different functions used
- `ByBaits`: optional list of bait-level summaries
- `ByRegions`: optional list of region-level summaries


# Typical Workflow

A typical HiCaptuRe workflow consists of the following steps:

1. **Digest the genome** using a restriction enzyme (e.g., HindIII)
2. **Load an interaction file** in ibed or any compatible format
3. **Annotate interactions** using bait fragment annotations
4. **Subset by bait(s) or external regions** (e.g., enhancers)
5. **Export** the processed interactions to standard formats (e.g., ibed, WashU)

Each step is designed to work with genomic data in a modular and reproducible way.


# Example Data
The HiCaptuRe package includes example files to illustrate how to load, annotate, and manipulate Capture Hi-C interaction data. These files are bundled under the `inst/extdata/` directory and can be accessed using `system.file()`.

📄 Available Files  

- **`ibed1_example.zip`**:  
  A ZIP archive containing a standard ibed-formatted interaction file from a sample experiment. Suitable for testing basic `load_interactions()` and `annotate_interactions()` functions.

- **`ibed2_example.zip`**:  
  A second ibed-formatted interaction dataset, useful for comparing samples or testing interaction intersection with `intersect_interactions()`.

- **`peakmatrix_example.zip`**:  
  Contains a multi-sample interaction matrix in `peakmatrix` format, as typically generated by CHiCAGO Tools. Can be used to test `load_interactions()` with multi-sample support and downstream summarization.

- **`annotation_example.txt`**:  
  A tab-delimited file with bait annotations (coordinates and identifiers) used to annotate interaction data. Intended for use with `annotate_interactions()` after genome digestion.

📥 How to Load the Example Files
```{r, example}
ibed1_file <- system.file("extdata", "ibed1_example.zip", package = "HiCaptuRe")
ibed2_file <- system.file("extdata", "ibed2_example.zip", package = "HiCaptuRe")
peakmatrix_file <- system.file("extdata", "peakmatrix_example.zip", package = "HiCaptuRe")
annotation_file <- system.file("extdata", "annotation_example.txt", package = "HiCaptuRe")
```

These files are small and optimized for fast loading during examples and tests.

# 📚  References

- Schoenfelder, S., Javierre, B. M. *et al.* *Promoter Capture Hi-C: High-resolution, genome-wide profiling of promoter interactions*.  J Vis Exp. 2018;(136):57320. https://doi.org/10.3791/57320

- Freire-Pritchett, P., Ray-Jones, H. *et al.*  *Detecting chromosomal interactions in Capture Hi-C data with CHiCAGO and companion tools*.  Nat Protoc. 2021;16:4144–4176. https://doi.org/10.1038/s41596-021-00567-5

- Tomás-Daza, L., Rovirosa, L. *et al.*  *Low input capture Hi-C (liCHi-C) identifies promoter–enhancer interactions at high resolution*.  Nat Commun. 2023;14:268. https://doi.org/10.1038/s41467-023-35911-8

# SessionInfo

```{r, sessioninfo}
sessionInfo()
```