---
title: "HiCaptuRe Functions"
author: "Laureano Tomás-Daza"
package: HiCaptuRe
date: "`r format(Sys.time(), '%d %B %Y')`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(kableExtra)
```


# Overview

This vignette demonstrates the core functionality of the `HiCaptuRe` package using example datasets bundled with the package. We will walk through typical tasks such as loading interaction data, annotating interactions, and exporting results in various formats.

To begin, we first load the example data files provided in `HiCaptuRe`:

```{r, example_data}
ibed1_file <- system.file("extdata", "ibed1_example.zip", package = "HiCaptuRe")
ibed2_file <- system.file("extdata", "ibed2_example.zip", package = "HiCaptuRe")
peakmatrix_file <- system.file("extdata", "peakmatrix_example.zip", package = "HiCaptuRe")
annotation_file <- system.file("extdata", "annotation_example.txt", package = "HiCaptuRe")
```
These files will be used throughout the vignette to showcase the full `HiCaptuRe` workflow.

Next, we load the package:

```{r, library, comment=F, message=F}
library(HiCaptuRe)
```

# Importing Interaction Data: `load_interactions()`

The first step in any `HiCaptuRe` workflow is importing your interaction file. This is done using the `load_interactions()` function.

This function performs multiple tasks:

- Automatically detects the format of the input file (`ibed`, `seqMonk`, `washU`, `washUold`, `bedpe`, or `peakmatrix`)

- Removes technical artifacts such as duplicated interactions

- Normalizes the data into a consistent and accessible structure: a `HiCaptuRe` object

Specifically, `load_interactions()` ensures that:

- Each interaction appears only once (even if present as A–B and B–A in the file)

- For duplicate interactions with differing CHiCAGO scores, the highest score is retained

- Structural consistency is enforced across input formats (e.g., missing annotations or read counts are filled in with placeholders)


```{r,load_interactions}
ibed1 <- load_interactions(file = ibed1_file, genome = "BSgenome.Hsapiens.NCBI.GRCh38")
```

The function automatically detects the format (in this case, ibed) and loads the data into a structured HiCaptuRe object.

```{r,ibed1}
ibed1
```

## What is a `HiCaptuRe` object?

The `HiCaptuRe` object extends the standard `GenomicInteractions` object by including additional metadata and slots relevant to Capture Hi-C experiments.

Each interaction includes:

- `bait_1`, `bait_2`: annotations for each anchor. If not captured, a "." placeholder is used.

- `ID_1`, `ID_2`: restriction fragment IDs derived from the reference genome digest (via `digest_genome()`).

- `reads`: number of reads supporting the interaction.

- `CS`: CHiCAGO score associated with the interaction.

- `counts`: count of times the interaction appears (will always be 1 post-cleaning).

- `int`: interaction class — `"B_B"` for bait–bait or `"B_OE"` for bait–other end.

- `distance`: distance (in bp) between the midpoints of the two interacting fragments.

**Note**
When loading formats that lack read count or annotation information (e.g., `washU` or `bedpe`), `load_interactions()` automatically fills:

- `"non-annotated"` in the bait fields

- `0` in the reads column

### Inspecting the HiCaptuRe Object


Beyond interaction data, the `HiCaptuRe` object contains additional internal components stored in S4 slots. These include both inherited slots from the `GenomicInteractions` class and new ones added specifically by `HiCaptuRe.`

We can inspect the available slots using `slotNames()`:
```{r,slots}
slotNames(ibed1)
```

The slots `anchor1`, `anchor2`, and others like `regions` and `elementMetadata` come from the `GenomicInteractions` class. `HiCaptuRe` introduces three new slots:

- `parameters`: stores metadata for reproducibility

- `ByBaits` and `ByRegions`: used to store interaction summaries generated by other functions

### Tracking Parameters Used in Each Step

The `parameters` slot is automatically updated each time you run a major `HiCaptuRe` function. This allows full traceability of how the object was built, including the genome used, enzyme, digestion settings, and file origins.

We can inspect this slot with `getParameters()`:

```{r, parameters}
getParameters(x = ibed1)
```

This tells us that two major operations have been logged:

- `digest`: shows the genome, restriction enzyme, and motif used when the genome was processed via `digest_genome()`

- `load`: tracks the interaction file path and the format detected during `load_interactions()`

This tracking system supports transparency and reproducibility throughout the analysis pipeline.

## Digesting the Genome: `digest_genome()`

The function `digest_genome()` performs a virtual digestion of a reference genome using a restriction enzyme motif. It generates a data frame of restriction fragments, each identified by a unique `fragment_ID`, which defines the resolution for subsequent interaction mapping.

This function is used both explicitly and internally:

- You can call it directly to explore the digestion or prepare custom fragments.

- It is called internally by `load_interactions()` to ensure that all loaded interaction files share a consistent genomic fragment map.

### Enzyme Parameters

`digest_genome()` supports both manual specification and automatic lookup of enzyme details:

If you provide only `RE_name` (e.g., "HindIII"), the function will automatically fill in the known **motif** and **cut position**

Supported enzymes include:

```{r, enzymes table,echo=FALSE}
enzymes <- data.frame(
    Enzyme = c("HindIII", "EcoRI", "BamHI", "MboI", "DpnII"),
    Motif = c("A^AGCTT", "G^AATTC", "G^GATCC", "^GATC", "^GATC"),
    Cut_Position = c(1, 1, 1, 0, 0)
)

kable(enzymes, caption = "Table: Recognized Restriction Enzymes and Their Motifs") |>
    kable_styling(
        bootstrap_options = c("striped", "hover", "condensed", "responsive"),
        full_width = FALSE, position = "center"
    ) |>
    row_spec(0, extra_css = "text-align: center;")
```

You can also manually override the motif and cut position if needed

### Controlling Genome Digestion

Key arguments that customize the digestion:

- `genome`: Genome identifier (e.g., `"GRCh38"`). Must match a `BSgenome` package.

- `select_chr`: Vector of chromosomes to digest (e.g., `1:22`, `"X"`, `"Y"`). This helps skip unplaced contigs or alternative scaffolds.

- `PAR_mask`: Logical. If `TRUE`, masks pseudoautosomal regions (PARs) from the Y chromosome to match X, preventing artificial duplicates.

- `PAR_file`: Optional file with PAR coordinates (columns: `seqnames`, `start`, `end`). For `"GRCh38"`, this file is included in the package and used automatically.

**Note**: For human genomes, PAR masking can avoid differences between UCSC and Ensembl versions of chromosome Y. If `PAR_mask = TRUE`, masked regions in Y are replaced with `"N"` to prevent motif matches.

### Performance and Caching

The first time you digest a genome, it may take a few seconds to compute all fragments. Internally, `HiCaptuRe` caches this result when used via `load_interactions()`, making repeated use much faster.


```{r, digest_genome}
digest <- digest_genome(genome = "BSgenome.Hsapiens.NCBI.GRCh38", RE_name = "HindIII")
head(digest$digest)
```
This returns a list with:

- `digest`: a data frame with `seqnames`, `start`, `end`, and `fragment_ID`

- `parameters`: metadata about the digestion process (enzyme, motif, PAR settings, etc.)

- `seqinfo`: reference genome sequence metadata


# Annotating Interactions: `annotate_interactions()`

The `annotate_interactions()` function allows you to assign biological annotations to bait fragments in your interaction data. This step is especially important when working with interaction files that lack annotation, such as those in washU or bedpe formats.

In Capture Hi-C experiments, the capture library defines a set of target regions (e.g., gene promoters, enhancers, or structural variants) that were enriched during sequencing. The annotation file provided to this function should represent that design — one line per targeted restriction fragment.

Annotation refers to linking each restriction fragment to a meaningful identifier, such as:

- Ensembl gene or transcript ID

- Gene symbol

- Enhancer or regulatory region ID

- Custom feature names (e.g., from a BED or GTF file)

```{r, annotate}
ibed1_annotated <- annotate_interactions(
    interactions = ibed1,
    annotation = annotation_file
)
ibed1_annotated
```

This call updates the fields `bait_1` and `bait_2` with new annotations for each bait fragment based on overlap with your capture library.

For example, the original `bait_1` column contained only Ensembl transcript IDs; now it includes gene names.

As with all major `HiCaptuRe` functions, annotation settings are tracked in the object’s `parameters` slot:

```{r, annotate_parameters}
parameters <- getParameters(x = ibed1_annotated)
parameters$annotate
```

# Filtering Interactions

After annotating interactions, it is often useful to focus on a subset of the data based on a biologically meaningful list of features. `HiCaptuRe` supports two main ways to filter interactions:

- By bait name using `interactionsByBaits()`

- By genomic region using `interactionsByRegions()`

Both functions return a new `HiCaptuRe` object containing only the selected interactions, and each one updates its own corresponding summary slot.


## Filtering interactions by Baits of interest: `interactionsByBaits()`

The `interactionsByBaits()` function filters your interaction dataset to retain only those interactions where at least one anchor corresponds to a bait of interest.

This is especially useful when you want to focus your analysis on specific genes or regulatory elements (e.g., from an RNA-seq differential expression result or a curated gene list).

```{r, interactionsByBaits}
baits_of_interest <- c("DAZAP1", "PLIN3", "FPR3", "TP53")

ibed_byBaits <- interactionsByBaits(
    interactions = ibed1_annotated,
    baits = baits_of_interest
)

ibed_byBaits
```

In this case, the filtered object contains only the 22 interactions involving the selected baits.

When printing the resulting object, you’ll notice in the output that the `ByBaits` slot has been updated.

### Bait Summary: `getByBaits()`

To view the bait-centric summary added by this function:

```{r, ByBaits}
getByBaits(ibed_byBaits)
```

This summary includes:

- The bait name and fragment ID where it is present

- Number of interactions it participates in

- Number of distinct other ends that is interacting with

- IDs, annotations and distance of the interacting fragments

If some bait is not present in the data it creates a row with missing data.

Each time you call `interactionsByBaits()`, a new entry is added to the `ByBaits` slot, so you can keep track of multiple filtering events.

As the previous functions the slot `parameters` has also been updated.

## Filtering by Genomic Regions: `interactionsByRegions()`

The `interactionsByRegions()` function filters the interaction dataset to retain interactions in which at least one anchor overlaps a given region of interest.

This is ideal for integrating orthogonal omics data such as ChIP-seq peaks, CUT&RUN binding sites, ATAC-seq regions, or structural variant calls.

```{r, interactionsByRegions}
regions <- GenomicRanges::GRanges(
    seqnames = 19,
    ranges = IRanges(start = c(500000, 1000000), end = c(510000, 1100000))
)

ibed_byRegions <- interactionsByRegions(
    interactions = ibed1_annotated,
    regions = regions
)

ibed_byRegions
```
After filtering, the resulting `HiCaptuRe` object includes 8 new metadata columns, 4 for each anchor:

- `region_1/2`    Logical: Does this anchor overlap any region?
- `Nregion_1/2`    Integer: Number of overlapping regions
- `regionID_1/2`    Character: IDs of the overlapping regions
- `regionCov_1/2`    Numeric: Total base pair overlap between anchor and region(s)

### Regions Summary: `getByRegions()`

To view the region-centric summary added by this function:

```{r, byRegions}
getByRegions(ibed_byRegions)
```

The `ByRegions` slot provides a region-centric summary, including:

- Region ID

- Number of interactions involving fragments that overlap the region

- Number of fragments in data overlapping the region

- Number of other end fragments in data overlapping the region

- IDs and annotation of the overlapping fragments

As with `ByBaits`, multiple calls to `interactionsByRegions()` are logged as separate elements, preserving the analysis history. And the `parameters` slot is also updated.

# Intersecting Interaction Sets: `intersect_interactions()`

The `intersect_interactions()` function allows you to compare and classify interactions across multiple `HiCaptuRe` datasets, identifying shared and unique interactions. This is analogous to a classic Venn diagram or UpSet plot operation for genomic interactions.

This function is useful when comparing biological replicates, different cell types, or experimental conditions to identify reproducible or condition-specific contacts.

To run this function, you must provide a named list of at least two `HiCaptuRe` objects. Each dataset should ideally be annotated using the same genome and bait reference for consistency.

```{r, intersect_interactions}
ibed2 <- load_interactions(file = ibed2_file, genome = "BSgenome.Hsapiens.NCBI.GRCh38")
ibed2_annotated <- annotate_interactions(interactions = ibed2, annotation = annotation_file)

interactions_list <- list(A = ibed1_annotated, B = ibed2_annotated)

output <- intersect_interactions(interactions_list = interactions_list)
```

The function returns a list with three elements:

1. **intersections**

A named list of `HiCaptuRe` objects representing each intersection class:

- Unique interactions in each dataset

- Shared interactions across datasets

```{r, intersections}
lapply(output$intersections, function(x) x[1:2])
```

For shared interactions (present in more than one dataset), the result is returned in a peakmatrix-like format, with separate columns containing CHiCAGO scores for each sample.

2. **upset**

An UpSet plot showing the distribution of intersection sets across samples:

```{r, upset}
output$upset
```

This plot is ideal for comparing many datasets simultaneously, and clearly visualizes the number of interactions in each intersection class.

3. **venn**

A Venn diagram visualization of the intersections:

```{r, venn}
output$venn
``` 

**Note**: The Venn diagram is only generated when the number of datasets is less than 8 to maintain visual clarity.

# Summarizing Interaction Distances: `distance_summary`

The `distance_summary()` function provides a quantitative overview of interaction distances, stratified into defined distance intervals. This is particularly useful when comparing **distance profiles between different samples or conditions**, such as to identify global shifts toward short- or long-range interactions.


```{r, distance_summary}
dist_sum <- distance_summary(
    interactions = ibed1_annotated,
    breaks = seq(0, 10^6, 10^5),
    sample = "ibed1"
)
dist_sum
```

In this example, interaction distances are grouped into bins from 0 to 1 Mb in 100 kb increments.

The function returns a tibble where each row represents a specific combination of:

- `int`: Type of interaction — either "B_B" (bait–bait), "B_OE" (bait–other end), or "Total" (combined).

- `total_per_int`: Total number of interactions of each type across all distance bins.

- `sample`: Sample name, as specified in the sample argument.

- `HiCaptuRe`: Total number of interactions in the input HiCaptuRe object.

- `breaks`: Distance bin label (e.g., [0,1e5], (1e5,2e5], etc.).

- `value`: Number of interactions of the given type (int) within that distance bin.


## Visualizing Distance Distributions: `plot_distance_summary()`
The `plot_distance_summary()` function generates bar plots from the output of `distance_summary()`, allowing you to explore how interactions are distributed across genomic distances.

You can visualize interaction counts in three different ways, depending on the normalization strategy:

1. **Absolute**

Plots the **raw number of interactions** per distance bin, without normalization.

```{r, absolute, fig.show="hold", out.width="50%"}
plots <- plot_distance_summary(distances = dist_sum, type_of_value = "absolute")
plots$int_dist
plots$total_dist
```

2. **by_int_type**

Normalizes values **within each interaction type**. This shows the proportion of B_B or B_OE interactions that fall into each distance bin.

```{r, by_int_type}
plots <- plot_distance_summary(distances = dist_sum, type_of_value = "by_int_type")
plots$int_dist_norm_int
```

3. **by_total**

Normalizes values **by the total number of interactions** in the full dataset. This helps compare global interaction profiles across samples.

```{r, by_total}
plots <- plot_distance_summary(distances = dist_sum, type_of_value = "by_total")
plots$int_dist_norm_total
plots$total_dist_norm_total
```

# Extracting Interactions from a peakmatrix: `peakmatrix2list()`

The `peakmatrix2list()` function is an **auxiliary utility** designed specifically for working with interaction data stored in **peakmatrix format**. This format is often used in multi-sample Capture Hi-C experiments, such as liCHi-C, where interactions from all samples are consolidated into a single table with per-sample CHiCAGO scores.

This function splits a peakmatrix-formatted `HiCaptuRe` object into **individual interaction sets**, one per sample, based on a user-defined CHiCAGO score threshold. The result is a **named list of** `HiCaptuRe` **objects**, each containing only the interactions that pass the cutoff in that specific sample.

Use `peakmatrix2list()` only when:

- Your interaction data was loaded using a peakmatrix file

- You need to work with per-sample interaction sets

- You want to perform downstream filtering or exporting for each sample independently

```{r, peakmatrix2list}
peakmatrix <- load_interactions(peakmatrix_file, genome = "BSgenome.Hsapiens.NCBI.GRCh38")
interactions_list <- peakmatrix2list(peakmatrix = peakmatrix)
names(interactions_list)
```

Each element in the output list corresponds to one sample, and contains a filtered `HiCaptuRe` object with only those interactions that passed the CHiCAGO score cutoff in that sample.

# Exporting Processed Interaction Data: `export_interactions()`

The `export_interactions()` function allows you to save a processed `HiCaptuRe` object to disk in a variety of formats for downstream analysis, visualization, or sharing.

This function is typically used **at the end of a workflow**, after annotation, filtering, or formatting steps have been applied.

**Supported Output Formats**

The exported file can be written in the following formats:

- `ibed` (default): Standard interaction format used throughout `HiCaptuRe`

- `peakmatrix`: Multi-sample interaction matrix (only valid for peakmatrix input)

- `washU`: Format for WashU Epigenome Browser (newer version)

- `washUold`: Legacy WashU format

- `cytoscape`: Edge list suitable for Cytoscape network visualization

- `bedpe`: Standard BEDPE format compatible with many genomic tools

```{r, export, eval=F}
export_interactions(
    interactions = ibed1_annotated,
    file = "/path/to/folder/ibed_annotated.ibed",
    type = "ibed"
)
```

**Notes and Behavior**

- If the `HiCaptuRe` object originates from a peakmatrix, it can be exported as:

  - A single peakmatrix file using `format = "peakmatrix"`

  - **Multiple files (one per sample)** if exporting in any **non-peakmatrix** format

  - The function will automatically name the output files based on sample names and append the appropriate extension.

- You can choose whether to overwrite existing files using the `over.write = TRUE` argument.

- **Optional metadata export**: Set `parameters = TRUE` to write a `.parameters` file alongside your exported interaction file. This records all processing steps (e.g., digestion, loading, annotation, filtering), supporting reproducibility.

# SessionInfo

```{r, sessioninfo}
sessionInfo()
```