---
title: "SEMPLR Vignette"
output:
  BiocStyle::html_document:
    toc: yes
    toc_float: yes
vignette: >
  %\VignetteIndexEntry{SEMPLR Vignette} 
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r knitr_setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.height = 4,
    fig.width = 4
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

```{r load_libraries, message=FALSE}
library(VariantAnnotation)
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(SEMPLR)
```


# SNP Effect Matrices

SEMPLR uses SNP Effect Matrices (SEMs) to score potential motifs. These are 
matrices contain binding affinity scores and have rows equal to the length of 
the motif and a column for each nucleotide. SEMs are produced by SEMpl, but a 
default set of 223 are included with this package in the `SEMC` data object. 
A full list of the transcription factors included in this default set can be 
found [here](github.com/grkenney/SEMPLR/blob/main/vignettes/sempl_metadata.csv) 
or by running `semData(SEMC)`.

SEMs are stored inside a `SNPEffectMatrix` object and sets of SEMs are stored 
in a `SNPEffectMatrixCollection`. 
The default collection can be loaded with:

```{r load_data}
SEMC
```

Printing the `SNPEffectMatrixCollection`, we can see that this object has 2
slots. The first holds the `SNPEffectMatrix` objects with the matrices and
baseline information for each motif. The second slots holds meta data for each
SEM. This object contains 223 SEMs and 12 meta data features for each.
```{r display_data}
SEMC
```

We can view the SEM meta data slot with the `semData` function.
```{r view_sem_metadata}
semData(SEMC)
```

We can access all SEMs in the collection the with the function `getSEMs()` 
or some subset of motifs by specifying a vector of semIds in the `semId` 
parameter. All semIds can be viewed in the `transctiption_factor` column
of the meta data. The resulting `SNPEffectMatrix` object contains the 
semId, the baseline value used for normalization, and the SEM that SEMPLR
will use to score binding affinity.
```{r subset_sem_collection}
getSEMs(SEMC, semId = "JUN")
```

# Scoring Binding

## Prepare Inputs

To calculate binding affinity for a given loci in the genome, we first make a 
`GRanges` object for that location. In this example, we will score a location 
on chromosome 12 of the human genome. 

```{r inputs_scoreBinding}
# create a GRanges object
gr <- GenomicRanges::GRanges(
    seqnames = "chr12",
    ranges = 94136009
)
```

## Scoring

This `GRanges` object is passed to the `scoreBinding()` function along with the 
collection of SEMs and a BSgenome object. 

Optionally, a `nFlank` parameter can be specified to dictate the number of 
nucleotides to flank each side of the sequence for the supplied position. 
By default, the number of nucleotides added will be the minimum number 
necessary to score the longest SEM.

A `seqId` parameter may also be used to specify a meta data column in the
`GRanges` object to use as a unique identifier. Otherwise, a unique identifier
will be created from positional information.

The resulting data object is a `SEMScores` object that stores information
about the scored range and its associated sequence, the SEM meta data, and
scoring information. 

Please see the (Scoring method)[#scoring-method] section below for details on 
how sequences are scored with SEMs.

```{r scoreBinding}
# calculate binding propensity
sb <- scoreBinding(x = gr, sem = SEMC, genome = Hsapiens)
sb
```

We can access this scores table with the `scores()` function and 
see that we have a row for each variant and SEM combination. The scoring 
results has 6 columns:

- **seqId**: unique id of the variant as defined in the `id` column of the 
input `GRanges` or `VRanges` object. If not defined, in the `seqId` parameter
of the `scoreBinding` function, a custom unique identifier 
is generated in the format [seqname]:[range]

- **SEM**: the unique identifier of the SEM

- **rc**: the orientation of the sequence scored

- **score**: the raw (unnormalized) binding affinity score 
for the reference and alternative alleles respectively

- **scoreNorm** : the normalized binding affinity score for the 
reference and alternative alleles respectively. More positive scores predict
TF binding and more ngative scores predict no TF binding.

- **index**: The position of the optimally scored sequence in the scored 
sequence.

- **seq**: the optimally scored sequence for the reference 
and alternative alleles respectively

```{r scores_accessor}
scores(sb)
```

We can subset these scores to just see results for the JUN motif and we see that
the normalized score, in the scoreNorm column, is negative.

We can visualize this sequence on the SEM score for each nucleotide position 
and see which nucleotides are contributing to this negative score.

```{r plot_sem}
# subset JUN score
jun_score <- scores(sb)[SEM == "JUN" & rc == "fwd"]
jun_score

# plot the JUN motif with the scored sequence
plotSEM(SEMC,
    motif = "JUN",
    motifSeq = jun_score$seq
)
```

# Enrichment

When scoring large sets of loci, you can also use SEMPLR to predict if some
transcription factors are enriched for binding, bound more than expected, 
within the loci of interest.

Here, we will demonstrate this analysis with simulated data. We will generate
and score a set of 1000 sequences. 200 of which will be generated using the 
JUN SEM to generate likely binding sequences. The remaining 800 sequences
will be completely random. 

We also include assistance in conducting this analysis on gene promoters. See
the (Enrichment in Promoters)[#enrichment-in-promoters] section for more
information.


## Prepare Inputs

Expand the section below to see the code used to generate these sequences.

<details>
<summary>See code used to generate simulated sequences</summary>

```{r simulation_functions}
# create random sequences weighted by ppm probabilities
simulatePPMSeqs <- function(ppm, nSeqs) {
    ppm_t <- t(ppm)
    position_samples <- lapply(
        seq_len(nrow(ppm_t)),
        function(i) {
            sample(colnames(ppm_t),
                size = nSeqs,
                replace = TRUE,
                prob = ppm_t[i, ]
            )
        }
    )

    rand_pwm_seq_mtx <- position_samples |>
        unlist() |>
        matrix(ncol = nrow(ppm_t), nrow = nSeqs)
    rand_pwm_seqs <- apply(
        rand_pwm_seq_mtx, 1,
        function(x) paste0(x, collapse = "")
    )

    return(rand_pwm_seqs)
}


# generate DNA sequences
simulateRandSeqs <- function(seqLength, nSeqs = 1) {
    bps <- c("A", "C", "G", "T")
    seqs <- sample(bps,
        replace = TRUE,
        size = seqLength
    ) |>
        stringi::stri_c(collapse = "") |>
        replicate(n = nSeqs)
    return(seqs)
}
```


```{r build_enrich_scenario}
ppm <- convertSEMsToPPMs(getSEMs(SEMC, "JUN"))[[1]]

# simulate sequences from the JUN PPM
sim_seqs <- simulatePPMSeqs(ppm = ppm, nSeqs = 200)
# add flanks to the simulated sequences to make them 3x the length of the motif
sim_seqs <- lapply(
    sim_seqs,
    function(x) {
        paste0(
            simulateRandSeqs(seqLength = ncol(ppm)), x,
            simulateRandSeqs(seqLength = ncol(ppm))
        )
    }
) |>
    unlist()

# generate random DNA sequences
rand_seqs <- simulateRandSeqs(seqLength = ncol(ppm) * 3, nSeqs = 800)
```

</details>

## Scoring

We have two vectors of sequences, `sim_seqs` and `rand_seqs`. They contain
200 and 800 sequences respectively. All sequences are 21 nucleotides in length.

We will combine all sequences into a single vector and pass this vector of
sequences to `scoreBinding`. Because we did not pass positional/range 
information, this function will now only return the scoring table.

```{r scoring_for_enrich_test}
# combine all sequences into a single vector
all_seqs <- c(sim_seqs, rand_seqs)

sb <- scoreBinding(
    x = all_seqs,
    sem = SEMC, genome = BSgenome.Hsapiens.UCSC.hg19::Hsapiens
)
sb
```


## Test for Enrichment

The results of `scoreBinding` can be passed directly to the `enrichSEMs` 
function along with our collection of SEMs, and the sequences we scored.

Optionally, we can provide a background set of sequences or ranges for this
analysis. If none is provided, `enrichSEMs` will scramble the scored sequences
provided to use as a background.

`enrichSEMs` performs a binomal test to determine if any of the transcription
factors scored are bound more than expected by chance. 

```{r enrichSEMs}
e <- enrichSEMs(sb, sem = SEMC, seqs = all_seqs)

# order the results by adjusted pvalue
head(e[order(padj, decreasing = FALSE)])
```

The resulting columns contain the p-value from the binomal test, the 
Benjamini & Hochberg ajusted p-value, and the number of sequences where
each TF was predicted to be bound in the foreground set (n_bound) and the
background set (n_bound_bg).

## Visualization

While JUN was the top significant SEM of our enrichment analysis, there are
also several other significant SEMs. Because some motif sequences can be very 
similar, this is not unexpected.

SEMPLR's `plotEnrich` function takes this motif relatedness into account 
when visualizing these enrichment results. SEMs are clustered by similarity
and plotted on a dendrogram so similar motifs are plotted near eachother.

The heatmap plots the -log10 adjusted pvalue of the binomial test 
( -log10(padj) ) and the TFs are colored according to an adjusted
pvalue threshold.

```{r plotEnrich}
plotEnrich(e,
    sem = SEMC,
    threshold = 0.05, method = "WPCC",
    pvalRange = c(0, 50),
    heatmapCols = c("lightgrey", "dodgerblue2")
)
```


## Enrichment in Promoters

One potential application of `enrichSEMs` is predicting TF(s) responsible for
observed differential gene expression between two biological conditions. 
Given gene sets of interest (e.g. upregulated genes) and a background set
(e.g. all expressed genes), we could try to predict there are TFs enriched
for binding the promoters of the genes of interest; providing supporting
evidence for a mechanism where the differential expression is driven by
differences in TF activity.

Promoter sequences are extracted and tested for predicted
binding site enrichment in foreground sets. Significant
enrichment of TF binding in the promoters of upregulated genes can provide
mechanistic insight into gene expression changes.

To aid in this analysis, we have provided a helper function, `enrichmentSets`,
to pull genomic ranges of gene promoter and, optionally, background sets given
gene or transcript ids. Below is a minimal example to illustrate this function.
In practice, gene ids of interest could be supplied to `enrichmentSets` to
generate GRanges objects that can be directly supplied to `enrichSEMs`.

```{r promoter_enrichment}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
library(org.Hs.eg.db)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
orgdb <- org.Hs.eg.db

genes_oi <- c("ENSG00000139618", "ENSG00000157764")
background_genes <- c("ENSG00000111640", "ENSG00000075624", "ENSG00000165704")

result <- enrichmentSets(
    foreground_ids = genes_oi,
    background_ids = background_genes,
    txdb = txdb,
    orgdb = orgdb,
    id_type = "ENSEMBL"
)
result
```

The result of `enrichmentSets` is a named list of 3 GRanges objects.
The "foregroundElements" entry includes the promoter ranges for all genes that
were sucessfully identified in the genome.
The "backgroundElements" entry is a randomly selected set of promoter
regions from the background set provided. The "backgroundUniverse" entry 
represents all of the background elements that were sucessfully mapped and 
sampled from. If a background set is not provided 
to `enrichmentSets`, one will be randomly selected from the promoters in the
genome not included in the gene set of interest.

To calculate enrichment for these promoters, we would first score the foreground
ranges with `scoreBinding`. Then we can provide the foreground scores and 
just the background ranges to `enrichSEMs`. While the example above is trivial,
the enrichment steps would look like this:

```
sb <- scoreBinding(
    x = result$foregroundElements,
    sem = SEMC, genome = BSgenome.Hsapiens.UCSC.hg38::Hsapiens
)
e <- enrichSEMs(sb, sem = SEMC, background = result$backgroundElements,
                genome = BSgenome.Hsapiens.UCSC.hg38::Hsapiens)
```


# Scoring Variants

SEMPLR also provides functionality for scoring variants, comparing TF
binding between alleles, and helping identify TFs whose binding is gained
or lost with genetic variation.

## Prepare Inputs

First, we will define the variants we want to score. Variants can be supplied 
to SEMPLR as either a `VRanges` or a `GRanges` object. If using `VRanges` 
alleles must be stored in the `ref` and `alt` parameters. If using `GRanges` 
the alleles should be stored in seperate metadata columns. 

Insertions and deletions should be specified by an empty character vector in 
the appropriate allele. 

Optionally, an `id` column can be defined in the object metadata to be used 
as a unique identifier for each variant.

```{r make_vranges}
vr <- VRanges(
    seqnames = c("chr12", "chr19"),
    ranges = c(94136009, 10640062),
    ref = c("G", "T"), alt = c("C", "A"),
    id = c("variant1", "variant2")
)
```

## Scoring

The `scoreVariants` function scores each allele of each variant against every 
SEM provided. From the resulting data object, we can see that we scored 2 
variants, the SEM meta data stored in the object has 13 fields, and we have 
446 rows in the `scores` results table.

There is a row in the `scores` table for each variant/SEM combination. 
Here we scored 2 variants x 223 SEMs to get 446 rows. 
```{r score_variants}
sempl_results <- scoreVariants(
    x = vr,
    sem = SEMC,
    genome = BSgenome.Hsapiens.UCSC.hg38::Hsapiens,
    varId = "id"
)

sempl_results
```
Similar to the scoring table produced by `scoreBinding`,
`scoreVariants` produces a table with 10 columns:

- **varId**: unique id of the variant as defined in the `id` column of the 
input `GRanges` or `VRanges` object. If not defined, a custom unique identifier 
is generated in the format [seqname]:[range]:[ref_allele]>[alt_allele]

- **SEM**: the unique identifier of the SEM

- **rc**: the orientation of the sequence scored

- **refSeq** and **altSeq**: the optimally scored sequence for the reference 
and alternative alleles respectively

- **refScore** and **altScore**: the raw (unnormalized) binding affinity score 
for the reference and alternative alleles respectively

- **refNorm** and **altNorm**: the normalized binding affinity score for the 
reference and alternative alleles respectively. More positive scores predict
TF binding and more negative scores predict no TF binding.

- **refVarIndex** and **altVarIndex**: The position of the frame's starting 
indeces in the scored sequence

There are accessor functions to isolate each slot of the resulting data object:
```{r SEMScores_accessors, eval = FALSE}
# access the variants slot
getRanges(sempl_results)

# access the semData slot
semData(sempl_results)

# access the scores slot
scores(sempl_results)
```

## Visualization

SEMPLR provides two visualization functions for `scoreVariants` results.

The first function, `plotSEMMotifs` plots all SEM scores for a given variant.

Here, we can see that both HLF and CEBPG are predicted to be bound in the
alt allele of variant2, but no the ref allele.

```{r plotSemMotifs, fig.height=4, fig.width=4}
plotSEMMotifs(
    s = sempl_results,
    variant = "variant2",
    label = "transcription_factor"
)
```

The second function, `plotSEMVariants` plots the scores for all variants
for a given SEM.

Here, we can see that it's variant2 where the mutation from the ref to
the alt allele creates the potential for a gained TF binding site.

```{r plotSEMVariants}
plotSEMVariants(sempl_results, sem = "HLF")
```

# Extras

## Scoring Method

### Sequence preparation

Both the `scoreBinding` and `scoreVariants` functions use the same scoring 
method. If given genome ranges, as `GRanges` or `VRanges` objects, sequences
for those ranges are acquired from the provided `BSgenome` object. Flanks
are added to either end of this sequence equal to the `nFlank` parameter 
provided by the user, otherwise, the length of the flanks are equal to the
longest SEM (number of rows). 

### Scoring

SEMPLR attempts to find the optimal binding location of the associated 
transcription factor by scoring every frame that includes at least one 
nucleotide of the provided range. If no flanks were added (`nFlank = 0`) then
SEMPLR will find the optimal binding site within the provided region.

SEMs are log transformed matrices, and therefore can be added per base at 
each position to generate a binding affinity score (stored in the `score`
column of the scores table). These scores are normalized to the baselines
produced by the SEMpl commandline tool and the normalized scores are stored
in the `scoreNorm` field. These normalized scores should be used for further
analysis.

The optimal binding site is defined as the site with the highest normalized
score.

In general, more positive scores predict TF binding and negative scores predict
no TF binding.

## Loading a custom set of SEMs

The `SEMC` data object is provided with this package with a default set of 223
SEMs. While we think this set may be sufficient for many analyses, SEMPLR also
supports generation of new `SNPEffectMatrixCollection`s should you want to 
generate a custom set of SEMs with the SEMpl command line tool.

First, we create a list of file paths to the .sem files we want to include. 
We will also load the meta data for each sem in a `data.table` object. 
If meta data is used, all SEMs must be represented in the meta data table.

```{r load_custom_sems}
# find .sem files
sem_folder <- system.file("extdata", "SEMs", package = "SEMPLR")
sem_files <- list.files(sem_folder, full.names = TRUE)

# load metadata
sempl_metadata_file <- system.file("extdata", "sempl_metadata.csv",
    package = "SEMPLR"
)
sempl_metadata <- read.csv(sempl_metadata_file)
```

We will load the matrix and meta data for all SEMs in a single
`SNPEffectMatrixCollection` object. This object has two slots, one containing 
a named list of the matrices and a second slot containing our meta data table 
with a key column connecting the meta data to the names of the matrices.


```{r format_meta_data}
ix <- lapply(
    sem_files,
    function(x) which(sempl_metadata$SEM == basename(x))
) |>
    unlist()
sem_ids <- sempl_metadata$transcription_factor[ix]
```


```{r loadSEMCollection}
sc <- loadSEMCollection(
    semFiles = sem_files,
    semMetaData = sempl_metadata,
    semMetaKey = "transcription_factor",
    semIds = sem_ids
)
sc
```

```{r session_info}
devtools::session_info()
```