--- title: "SEMPLR Vignette" output: BiocStyle::html_document: toc: yes toc_float: yes vignette: > %\VignetteIndexEntry{SEMPLR Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r knitr_setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 4, fig.width = 4 ) options(rmarkdown.html_vignette.check_title = FALSE) ``` ```{r load_libraries, message=FALSE} library(VariantAnnotation) library(GenomicRanges) library(BSgenome.Hsapiens.UCSC.hg19) library(SEMPLR) ``` # SNP Effect Matrices SEMPLR uses SNP Effect Matrices (SEMs) to score potential motifs. These are matrices contain binding affinity scores and have rows equal to the length of the motif and a column for each nucleotide. SEMs are produced by SEMpl, but a default set of 223 are included with this package in the `SEMC` data object. A full list of the transcription factors included in this default set can be found [here](github.com/grkenney/SEMPLR/blob/main/vignettes/sempl_metadata.csv) or by running `semData(SEMC)`. SEMs are stored inside a `SNPEffectMatrix` object and sets of SEMs are stored in a `SNPEffectMatrixCollection`. The default collection can be loaded with: ```{r load_data} SEMC ``` Printing the `SNPEffectMatrixCollection`, we can see that this object has 2 slots. The first holds the `SNPEffectMatrix` objects with the matrices and baseline information for each motif. The second slots holds meta data for each SEM. This object contains 223 SEMs and 12 meta data features for each. ```{r display_data} SEMC ``` We can view the SEM meta data slot with the `semData` function. ```{r view_sem_metadata} semData(SEMC) ``` We can access all SEMs in the collection the with the function `getSEMs()` or some subset of motifs by specifying a vector of semIds in the `semId` parameter. All semIds can be viewed in the `transctiption_factor` column of the meta data. The resulting `SNPEffectMatrix` object contains the semId, the baseline value used for normalization, and the SEM that SEMPLR will use to score binding affinity. ```{r subset_sem_collection} getSEMs(SEMC, semId = "JUN") ``` # Scoring Binding ## Prepare Inputs To calculate binding affinity for a given loci in the genome, we first make a `GRanges` object for that location. In this example, we will score a location on chromosome 12 of the human genome. ```{r inputs_scoreBinding} # create a GRanges object gr <- GenomicRanges::GRanges( seqnames = "chr12", ranges = 94136009 ) ``` ## Scoring This `GRanges` object is passed to the `scoreBinding()` function along with the collection of SEMs and a BSgenome object. Optionally, a `nFlank` parameter can be specified to dictate the number of nucleotides to flank each side of the sequence for the supplied position. By default, the number of nucleotides added will be the minimum number necessary to score the longest SEM. A `seqId` parameter may also be used to specify a meta data column in the `GRanges` object to use as a unique identifier. Otherwise, a unique identifier will be created from positional information. The resulting data object is a `SEMScores` object that stores information about the scored range and its associated sequence, the SEM meta data, and scoring information. Please see the (Scoring method)[#scoring-method] section below for details on how sequences are scored with SEMs. ```{r scoreBinding} # calculate binding propensity sb <- scoreBinding(x = gr, sem = SEMC, genome = Hsapiens) sb ``` We can access this scores table with the `scores()` function and see that we have a row for each variant and SEM combination. The scoring results has 6 columns: - **seqId**: unique id of the variant as defined in the `id` column of the input `GRanges` or `VRanges` object. If not defined, in the `seqId` parameter of the `scoreBinding` function, a custom unique identifier is generated in the format [seqname]:[range] - **SEM**: the unique identifier of the SEM - **rc**: the orientation of the sequence scored - **score**: the raw (unnormalized) binding affinity score for the reference and alternative alleles respectively - **scoreNorm** : the normalized binding affinity score for the reference and alternative alleles respectively. More positive scores predict TF binding and more ngative scores predict no TF binding. - **index**: The position of the optimally scored sequence in the scored sequence. - **seq**: the optimally scored sequence for the reference and alternative alleles respectively ```{r scores_accessor} scores(sb) ``` We can subset these scores to just see results for the JUN motif and we see that the normalized score, in the scoreNorm column, is negative. We can visualize this sequence on the SEM score for each nucleotide position and see which nucleotides are contributing to this negative score. ```{r plot_sem} # subset JUN score jun_score <- scores(sb)[SEM == "JUN" & rc == "fwd"] jun_score # plot the JUN motif with the scored sequence plotSEM(SEMC, motif = "JUN", motifSeq = jun_score$seq ) ``` # Enrichment When scoring large sets of loci, you can also use SEMPLR to predict if some transcription factors are enriched for binding, bound more than expected, within the loci of interest. Here, we will demonstrate this analysis with simulated data. We will generate and score a set of 1000 sequences. 200 of which will be generated using the JUN SEM to generate likely binding sequences. The remaining 800 sequences will be completely random. We also include assistance in conducting this analysis on gene promoters. See the (Enrichment in Promoters)[#enrichment-in-promoters] section for more information. ## Prepare Inputs Expand the section below to see the code used to generate these sequences.
See code used to generate simulated sequences ```{r simulation_functions} # create random sequences weighted by ppm probabilities simulatePPMSeqs <- function(ppm, nSeqs) { ppm_t <- t(ppm) position_samples <- lapply( seq_len(nrow(ppm_t)), function(i) { sample(colnames(ppm_t), size = nSeqs, replace = TRUE, prob = ppm_t[i, ] ) } ) rand_pwm_seq_mtx <- position_samples |> unlist() |> matrix(ncol = nrow(ppm_t), nrow = nSeqs) rand_pwm_seqs <- apply( rand_pwm_seq_mtx, 1, function(x) paste0(x, collapse = "") ) return(rand_pwm_seqs) } # generate DNA sequences simulateRandSeqs <- function(seqLength, nSeqs = 1) { bps <- c("A", "C", "G", "T") seqs <- sample(bps, replace = TRUE, size = seqLength ) |> stringi::stri_c(collapse = "") |> replicate(n = nSeqs) return(seqs) } ``` ```{r build_enrich_scenario} ppm <- convertSEMsToPPMs(getSEMs(SEMC, "JUN"))[[1]] # simulate sequences from the JUN PPM sim_seqs <- simulatePPMSeqs(ppm = ppm, nSeqs = 200) # add flanks to the simulated sequences to make them 3x the length of the motif sim_seqs <- lapply( sim_seqs, function(x) { paste0( simulateRandSeqs(seqLength = ncol(ppm)), x, simulateRandSeqs(seqLength = ncol(ppm)) ) } ) |> unlist() # generate random DNA sequences rand_seqs <- simulateRandSeqs(seqLength = ncol(ppm) * 3, nSeqs = 800) ```
## Scoring We have two vectors of sequences, `sim_seqs` and `rand_seqs`. They contain 200 and 800 sequences respectively. All sequences are 21 nucleotides in length. We will combine all sequences into a single vector and pass this vector of sequences to `scoreBinding`. Because we did not pass positional/range information, this function will now only return the scoring table. ```{r scoring_for_enrich_test} # combine all sequences into a single vector all_seqs <- c(sim_seqs, rand_seqs) sb <- scoreBinding( x = all_seqs, sem = SEMC, genome = BSgenome.Hsapiens.UCSC.hg19::Hsapiens ) sb ``` ## Test for Enrichment The results of `scoreBinding` can be passed directly to the `enrichSEMs` function along with our collection of SEMs, and the sequences we scored. Optionally, we can provide a background set of sequences or ranges for this analysis. If none is provided, `enrichSEMs` will scramble the scored sequences provided to use as a background. `enrichSEMs` performs a binomal test to determine if any of the transcription factors scored are bound more than expected by chance. ```{r enrichSEMs} e <- enrichSEMs(sb, sem = SEMC, seqs = all_seqs) # order the results by adjusted pvalue head(e[order(padj, decreasing = FALSE)]) ``` The resulting columns contain the p-value from the binomal test, the Benjamini & Hochberg ajusted p-value, and the number of sequences where each TF was predicted to be bound in the foreground set (n_bound) and the background set (n_bound_bg). ## Visualization While JUN was the top significant SEM of our enrichment analysis, there are also several other significant SEMs. Because some motif sequences can be very similar, this is not unexpected. SEMPLR's `plotEnrich` function takes this motif relatedness into account when visualizing these enrichment results. SEMs are clustered by similarity and plotted on a dendrogram so similar motifs are plotted near eachother. The heatmap plots the -log10 adjusted pvalue of the binomial test ( -log10(padj) ) and the TFs are colored according to an adjusted pvalue threshold. ```{r plotEnrich} plotEnrich(e, sem = SEMC, threshold = 0.05, method = "WPCC", pvalRange = c(0, 50), heatmapCols = c("lightgrey", "dodgerblue2") ) ``` ## Enrichment in Promoters One potential application of `enrichSEMs` is predicting TF(s) responsible for observed differential gene expression between two biological conditions. Given gene sets of interest (e.g. upregulated genes) and a background set (e.g. all expressed genes), we could try to predict there are TFs enriched for binding the promoters of the genes of interest; providing supporting evidence for a mechanism where the differential expression is driven by differences in TF activity. Promoter sequences are extracted and tested for predicted binding site enrichment in foreground sets. Significant enrichment of TF binding in the promoters of upregulated genes can provide mechanistic insight into gene expression changes. To aid in this analysis, we have provided a helper function, `enrichmentSets`, to pull genomic ranges of gene promoter and, optionally, background sets given gene or transcript ids. Below is a minimal example to illustrate this function. In practice, gene ids of interest could be supplied to `enrichmentSets` to generate GRanges objects that can be directly supplied to `enrichSEMs`. ```{r promoter_enrichment} library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(BSgenome.Hsapiens.UCSC.hg38) library(org.Hs.eg.db) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene orgdb <- org.Hs.eg.db genes_oi <- c("ENSG00000139618", "ENSG00000157764") background_genes <- c("ENSG00000111640", "ENSG00000075624", "ENSG00000165704") result <- enrichmentSets( foreground_ids = genes_oi, background_ids = background_genes, txdb = txdb, orgdb = orgdb, id_type = "ENSEMBL" ) result ``` The result of `enrichmentSets` is a named list of 3 GRanges objects. The "foregroundElements" entry includes the promoter ranges for all genes that were sucessfully identified in the genome. The "backgroundElements" entry is a randomly selected set of promoter regions from the background set provided. The "backgroundUniverse" entry represents all of the background elements that were sucessfully mapped and sampled from. If a background set is not provided to `enrichmentSets`, one will be randomly selected from the promoters in the genome not included in the gene set of interest. To calculate enrichment for these promoters, we would first score the foreground ranges with `scoreBinding`. Then we can provide the foreground scores and just the background ranges to `enrichSEMs`. While the example above is trivial, the enrichment steps would look like this: ``` sb <- scoreBinding( x = result$foregroundElements, sem = SEMC, genome = BSgenome.Hsapiens.UCSC.hg38::Hsapiens ) e <- enrichSEMs(sb, sem = SEMC, background = result$backgroundElements, genome = BSgenome.Hsapiens.UCSC.hg38::Hsapiens) ``` # Scoring Variants SEMPLR also provides functionality for scoring variants, comparing TF binding between alleles, and helping identify TFs whose binding is gained or lost with genetic variation. ## Prepare Inputs First, we will define the variants we want to score. Variants can be supplied to SEMPLR as either a `VRanges` or a `GRanges` object. If using `VRanges` alleles must be stored in the `ref` and `alt` parameters. If using `GRanges` the alleles should be stored in seperate metadata columns. Insertions and deletions should be specified by an empty character vector in the appropriate allele. Optionally, an `id` column can be defined in the object metadata to be used as a unique identifier for each variant. ```{r make_vranges} vr <- VRanges( seqnames = c("chr12", "chr19"), ranges = c(94136009, 10640062), ref = c("G", "T"), alt = c("C", "A"), id = c("variant1", "variant2") ) ``` ## Scoring The `scoreVariants` function scores each allele of each variant against every SEM provided. From the resulting data object, we can see that we scored 2 variants, the SEM meta data stored in the object has 13 fields, and we have 446 rows in the `scores` results table. There is a row in the `scores` table for each variant/SEM combination. Here we scored 2 variants x 223 SEMs to get 446 rows. ```{r score_variants} sempl_results <- scoreVariants( x = vr, sem = SEMC, genome = BSgenome.Hsapiens.UCSC.hg38::Hsapiens, varId = "id" ) sempl_results ``` Similar to the scoring table produced by `scoreBinding`, `scoreVariants` produces a table with 10 columns: - **varId**: unique id of the variant as defined in the `id` column of the input `GRanges` or `VRanges` object. If not defined, a custom unique identifier is generated in the format [seqname]:[range]:[ref_allele]>[alt_allele] - **SEM**: the unique identifier of the SEM - **rc**: the orientation of the sequence scored - **refSeq** and **altSeq**: the optimally scored sequence for the reference and alternative alleles respectively - **refScore** and **altScore**: the raw (unnormalized) binding affinity score for the reference and alternative alleles respectively - **refNorm** and **altNorm**: the normalized binding affinity score for the reference and alternative alleles respectively. More positive scores predict TF binding and more negative scores predict no TF binding. - **refVarIndex** and **altVarIndex**: The position of the frame's starting indeces in the scored sequence There are accessor functions to isolate each slot of the resulting data object: ```{r SEMScores_accessors, eval = FALSE} # access the variants slot getRanges(sempl_results) # access the semData slot semData(sempl_results) # access the scores slot scores(sempl_results) ``` ## Visualization SEMPLR provides two visualization functions for `scoreVariants` results. The first function, `plotSEMMotifs` plots all SEM scores for a given variant. Here, we can see that both HLF and CEBPG are predicted to be bound in the alt allele of variant2, but no the ref allele. ```{r plotSemMotifs, fig.height=4, fig.width=4} plotSEMMotifs( s = sempl_results, variant = "variant2", label = "transcription_factor" ) ``` The second function, `plotSEMVariants` plots the scores for all variants for a given SEM. Here, we can see that it's variant2 where the mutation from the ref to the alt allele creates the potential for a gained TF binding site. ```{r plotSEMVariants} plotSEMVariants(sempl_results, sem = "HLF") ``` # Extras ## Scoring Method ### Sequence preparation Both the `scoreBinding` and `scoreVariants` functions use the same scoring method. If given genome ranges, as `GRanges` or `VRanges` objects, sequences for those ranges are acquired from the provided `BSgenome` object. Flanks are added to either end of this sequence equal to the `nFlank` parameter provided by the user, otherwise, the length of the flanks are equal to the longest SEM (number of rows). ### Scoring SEMPLR attempts to find the optimal binding location of the associated transcription factor by scoring every frame that includes at least one nucleotide of the provided range. If no flanks were added (`nFlank = 0`) then SEMPLR will find the optimal binding site within the provided region. SEMs are log transformed matrices, and therefore can be added per base at each position to generate a binding affinity score (stored in the `score` column of the scores table). These scores are normalized to the baselines produced by the SEMpl commandline tool and the normalized scores are stored in the `scoreNorm` field. These normalized scores should be used for further analysis. The optimal binding site is defined as the site with the highest normalized score. In general, more positive scores predict TF binding and negative scores predict no TF binding. ## Loading a custom set of SEMs The `SEMC` data object is provided with this package with a default set of 223 SEMs. While we think this set may be sufficient for many analyses, SEMPLR also supports generation of new `SNPEffectMatrixCollection`s should you want to generate a custom set of SEMs with the SEMpl command line tool. First, we create a list of file paths to the .sem files we want to include. We will also load the meta data for each sem in a `data.table` object. If meta data is used, all SEMs must be represented in the meta data table. ```{r load_custom_sems} # find .sem files sem_folder <- system.file("extdata", "SEMs", package = "SEMPLR") sem_files <- list.files(sem_folder, full.names = TRUE) # load metadata sempl_metadata_file <- system.file("extdata", "sempl_metadata.csv", package = "SEMPLR" ) sempl_metadata <- read.csv(sempl_metadata_file) ``` We will load the matrix and meta data for all SEMs in a single `SNPEffectMatrixCollection` object. This object has two slots, one containing a named list of the matrices and a second slot containing our meta data table with a key column connecting the meta data to the names of the matrices. ```{r format_meta_data} ix <- lapply( sem_files, function(x) which(sempl_metadata$SEM == basename(x)) ) |> unlist() sem_ids <- sempl_metadata$transcription_factor[ix] ``` ```{r loadSEMCollection} sc <- loadSEMCollection( semFiles = sem_files, semMetaData = sempl_metadata, semMetaKey = "transcription_factor", semIds = sem_ids ) sc ``` ```{r session_info} devtools::session_info() ```