---
title: "Getting Started with mutSignatures"
author: "Damiano Fantini, Ph.D."
date: "`r Sys.Date()`"
output: html_document
vignette: >
%\VignetteIndexEntry{Getting Started with mutSignatures}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
% \VignetteDepends{mutSignatures}
---
```{r setup, include=FALSE}
set.seed(1234567)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.align = "center", results = "asis")
my.t0 <- Sys.time()
library(dplyr)
library(reshape2)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(mutSignatures)
# load data
data("mutSigData")
# Extra cols
head.muty <- c("T[C>A]G", "T[C>T]A", "G[C>T]A", "C[C>T]T", "T[C>G]T", "T[C>G]C",
"T[C>G]A", "T[C>T]T", "T[C>G]T", "T[C>T]G", "T[C>A]G", "T[C>T]A",
"A[T>C]T", "A[T>G]G", "C[T>C]T", "T[C>T]A", "T[C>T]A", "G[T>A]C")
```
Cancer cells accumulate DNA mutations as result of DNA damage and DNA repair processes. *mutSignatures* is a computational framework aimed at deciphering DNA mutational signatures operating in cancer.
- The framework includes three modules that support *1)* raw data import and pre-processing, *2)* mutation counts deconvolution, and *3)* data visualization. Therefore, *mutSignatures* is a **comprehensive software suite that can guide the user throughout all steps of a mutational signature analysis**.
- Inputs are VCF files, MAF files, as well as other commonly used file formats storing DNA variant information. These files are imported and pre-processed to obtain a `mutationCounts`-class object (which includes a numeric matrix of DNA mutation counts).
- The framework *de-novo* extracts mutational signatures via Non-Negative Matrix Factorization (NMF). Bootstrapping is performed as part of the analysis.
- The framework relies on parallelization (optional) and is optimized for use on multi-core systems.
- The *mutSignatures* framework was developed in the Meeks Lab at Northwestern University (Chicago, IL), and was described in a *Scientific Reports* paper [Fantini D et al, 2020](https://www.nature.com/articles/s41598-020-75062-0/)
- The software is based on a custom R-based implementation of the MATLAB WTSI framework by [Alexandrov LB et al, 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3588146/), and includes a wide range of tools for expanded functionality.
- The *mutSignatures* framework has been used for analyses described in peer-reviewed publications, including [Fantini D et al, 2018](https://www.nature.com/articles/s41388-017-0099-6/) and [Fantini D et al, 2019](https://pubmed.ncbi.nlm.nih.gov/30446446/).
## Tutorials and How-tos
More vignettes discussing how to extract and analyze mutational signatures from different sources using `mutSignatures` can be found at the following URLs:
- Getting started with mutSignatures (extended): [link to PDF](https://www.data-pulse.com/projects/mutSignatures/vignette_A.pdf)
- Deploying mutSignatures on Computational Clusters: [link to PDF](https://www.data-pulse.com/projects/mutSignatures/vignette_B.pdf)
- Deconvoluting Mutation Counts Against Known Mutational Signatures: [link to PDF](https://www.data-pulse.com/projects/mutSignatures/vignette_C.pdf)
- More tutorials: [mutsignatures official website](https://www.data-pulse.com/dev_site/mutsignatures/)
## Installation
The package is available from CRAN. The *official* version of `mutSignatures` (version 2.1.1) can be installed from any CRAN mirror.
```{r eval = FALSE}
install.packages("mutSignatures")
```
The latest (most recently updated, *beta*) version of the package is available on GitHub. It is possible to install `mutSignatures` using `devtools`.
```{r eval = FALSE}
devtools::install_github("dami82/mutSignatures", force = TRUE,
build_opts = NULL, build_vignettes = TRUE)
```
### Note on software Dependencies
One of the pre-processing steps in the mutSignatures pipeline relies on a list of Bioconductor () libraries. These libraries are optional, since they are only required for the *attachContext()* step (these libs are not automatically downloaded when installing the package). The bioconductor libraries are: *i)* IRanges; *ii)* GenomicRanges; *iii)* BSgenome; *iv)* GenomeInfoDb. You can install them as shown below.
```{r echo=TRUE, eval=FALSE, include=TRUE}
# Get BiocManager
if (!"BiocManager" %in% rownames(utils::installed.packages()))
install.packages("BiocManager")
# Get Bioc libs
BiocManager::install(pkgs = c("IRanges", "GenomicRanges", "BSgenome", "GenomeInfoDb"))
```
To complete the *attachContext()* step, a BSgenome package including full genome sequences of the organism of interest has to be installed as well. For example, human *hg19* genome sequences can be installed as shown below.
```{r echo=TRUE, eval=FALSE, include=TRUE}
BiocManager::install(pkgs = "BSgenome.Hsapiens.UCSC.hg19")
```
## Pipeline
The typical `mutSignatures` analytic pipeline includes three steps:
- **Data Import and pre-processing**: Mutation data are imported from a VCF, MAF, or similar file. Single Nucleotide Variants (SNVs) are filtered, nucleotide context around the mutation site is retrieved, and tri-nucleotide mutation types are computed and counted across samples. The result of this step is a `mutationCounts`-class object.
- **De-novo extraction of mutational signatures**: Non-negative Matrix factorization is performed. Typically, 500 to 1,000 bootstrapping iterations are performed (for preparative analytic runs). 100-500 iterations are typically enough for a preliminary analysis. The NMF step returns both: *i)* **mutational signatures**, as well as *ii)* **mutSignature exposures**.
- **Visualization, downstream analysis, data export**: Results can be visualized (barplots). Signatures can be compared and matched to known signatures (*i.e.* previously extracted signatures or published signatures, such as the COSMIC signatures published by the Sanger Institute, London). You can use the `msigPlot()` method to generate plots.
#### Alternative pipeline
If the set of mutational signatures is already known, it is possible to use `mutSignatures` to estimate the contribution of each mutational pattern in a collection of samples. In this case, the pipeline includes the following steps:
- **Data Import and pre-processing**: *see above*.
- **Estimate exposures to known mutational signatures**: Use a Fast Combinatorial Nonnegative Least-Square approach to compute exposures when the mutation signatures are pre-defined.
- **Visualization, downstream analysis, data export**: *see above*.
## Load library, set-up environment
For running the demos described in this vignette, the following R libraries will be used: `dplyr`, `reshape2`, `ggplot2`, `kableExtra`, `BSgenome.Hsapiens.UCSC.hg19`, and `mutSignatures`. Before proceeding, we assign the *hg19* BSgenome object to a variable named `hg19`, and we load the `mutSigData` dataset, which is provided together with the `mutSignatures` package.
```{r eval=FALSE}
# Required libs
library(dplyr)
library(reshape2)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(BSgenome.Hsapiens.UCSC.hg19)
# Load mutSignatures
library(mutSignatures)
# prep hg19
hg19 <- BSgenome.Hsapiens.UCSC.hg19
# load data
data("mutSigData")
```
## Example 1 - De novo extraction of Mutational Signatures from BLCA samples
The first demo shows how to extract mutational signatures from a dataset including DNA mutations from bladder cancer samples.
### Data Import and pre-processing
Here, the input has a VCF-like structure, and is decorated with an extra column (namely, `SAMPLEID`) that includes a unique identifier for the biological samples.
```{r}
# Import data (VCF-like format)
x <- mutSigData$inputC
# Filter non SNV
x <- filterSNV(dataSet = x, seq_colNames = c("REF", "ALT"))
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
```
```{r eval=FALSE}
# Attach context
x <- attachContext(mutData = x,
chr_colName = "CHROM",
start_colName = "POS",
end_colName = "POS",
nucl_contextN = 3,
BSGenomeDb = hg19)
```
```{r include=FALSE, eval=TRUE, echo=FALSE}
x <- mutSignatures::mutSigData$inputC.ctx
```
```{r}
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
# Remove mismatches
x <- removeMismatchMut(mutData = x, # input data.frame
refMut_colName = "REF", # column name for ref base
context_colName = "context", # column name for context
refMut_format = "N")
```
```{r eval=FALSE}
# Compute mutType
x <- attachMutType(mutData = x, # as above
ref_colName = "REF", # column name for ref base
var_colName = "ALT", # column name for mut base
context_colName = "context")
```
```{r include=FALSE, eval=TRUE, echo=FALSE}
x <- x[1:18, ]
x$mutType <- head.muty
```
```{r}
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
```
```{r eval=FALSE}
# Count
blca.counts <- countMutTypes(mutTable = x,
mutType_colName = "mutType",
sample_colName = "SAMPLEID")
```
```{r include=FALSE, eval=TRUE, echo=FALSE}
# Count
blca.counts <- mutSignatures::as.mutation.counts(mutSigData$blcaMUTS)
```
```{r results='markup'}
# Mutation Counts
print(blca.counts)
```
### Signature Extraction
After multi-nucleotide mutation count data have been prepared, it is possible to proceed with the *de-novo* signature extraction. Settings guiding the analysis are defined as a list of parameters that can be tuned via the `setMutClusterParams()` function. Next, the NMF analysis is executed by the `decipherMutationalProcesses()` function. At the end of the analysis, a silhouette plot is returned. This plot can be very helpful to understand if the obtained signatures are relatively weak or robust.
```{r}
# how many signatures should we extract?
num.sign <- 4
# Define parameters for the non-negative matrix factorization procedure.
# you should parallelize if possible
blca.params <-
mutSignatures::setMutClusterParams(
num_processesToExtract = num.sign, # num signatures to extract
num_totIterations = 20, # bootstrapping: usually 500-1000
num_parallelCores = 4) # total num of cores to use (parallelization)
```
```{r eval=FALSE, include=TRUE, echo=TRUE}
# Extract new signatures - may take a while
blca.analysis <-
decipherMutationalProcesses(input = blca.counts,
params = blca.params)
```
```{r eval=TRUE, include=TRUE, echo=FALSE}
blca.analysis <- list(Results = mutSigData$inputS$Results)
tmp <- mutSigData$inputS$silhouetteTMP
xrange <- c(min(tmp[,3]), max(tmp[,3]))
xrange[1] <- ifelse(xrange[1] > 0, 0, (-1.15) * abs(xrange[1]))
xrange[2] <- 1.15
graphics::barplot(tmp[nrow(tmp):1, 3],
col = base::as.factor(tmp[nrow(tmp):1,1]),
xlim = xrange,
horiz = TRUE, xlab = "Silhouette Value",
ylab = "",
main = "Silhouette Plot",
border = base::as.factor(tmp[nrow(tmp):1,1]))
graphics::abline(v=0)
graphics::title(ylab="Iter. Results (by Group)", line=1, cex.lab=1, font = 2)
```
### Downstream analyses and visualization
Profiles of *de-novo* extracted mutational signatures can be visualized (by barplots). Also it is possible to plot the estimated exposures of each sample to the identified mutational patterns. It is also possible to compare *de-novo* extracted signatures with known signatures (such as signatures from previous analyses, or COSMIC signatures). To cast `mutSignature` objects as `data.frames`, you can use the `mutSignatures::as.data.frame()` method.
**Note**: you can use the `mutSignatures::getCosmicSignatures()` function to download a copy of the COSMIC signatures.
```{r fig.align='center', fig.width=12, fig.height=3}
# Retrieve signatures (results)
blca.sig <- blca.analysis$Results$signatures
# Retrieve exposures (results)
blca.exp <- blca.analysis$Results$exposures
# Plot signature 1 (standard barplot, you can pass extra args such as ylim)
msigPlot(blca.sig, signature = 1, ylim = c(0, 0.10))
```
```{r}
# Plot exposures (ggplot2 object, you can customize as any other ggplot2 object)
msigPlot(blca.exp, main = "BLCA samples") +
scale_fill_manual(values = c("#1f78b4", "#cab2d6", "#ff7f00", "#a6cee3"))
# Export Signatures as data.frame
xprt <- coerceObj(x = blca.sig, to = "data.frame")
head(xprt) %>% kable() %>% kable_styling(bootstrap_options = "striped")
# Get signatures from data (imported as data.frame)
# and then convert it to mutSignatures object
cosmixSigs <- mutSigData$blcaSIGS %>%
dplyr::select(starts_with("COSMIC")) %>%
as.mutation.signatures()
blcaKnwnSigs <- mutSigData$blcaSIGS %>%
dplyr::select(starts_with("BLCA")) %>%
as.mutation.signatures()
# Compare de-novo signatures with selected COSMIC signatures
msig1 <- matchSignatures(mutSign = blca.sig, reference = cosmixSigs,
threshold = 0.45, plot = TRUE)
msig2 <- matchSignatures(mutSign = blca.sig, reference = blcaKnwnSigs,
threshold = 0.45, plot = TRUE)
```
```{r fig.height=5, fig.width=12, fig.align='center'}
# Visualize match
# signature 1 is similar to COSMIC ;
# signatures 2 and 3 are similar to COSMIC
# Here, we should probably extract only 2 mutational signatures
hm1 <- msig1$plot + ggtitle("Match to COSMIC signs.")
hm2 <- msig2$plot + ggtitle("Match to known BLCA signs.")
# Show
grid.arrange(hm1, hm2, ncol = 2)
```
## Example 2 - Estimate the contribution of a panel of known mutational signatures
The second demo illustrates a case where the signatures are known. We will use the COSMIC signatures 1, 2, 5, and 13 in the analysis, as well as a set of custom signatures previously identified in bladder cancer tumor samples. We will estimate their contributions to a collection of *blca* mutations that are included in the `mutSigData` dataset.
### Data Import and pre-processing
Data import and pre-processing can be conducted as shown above. **Note** that if a `data.frame` with counts of mutation types across samples is available, it is sufficient to use the `as.mutation.counts()` method to convert it to the desired object.
```{r}
# Retrieve a mutation.counts data.frame
x <- mutSigData$blcaMUTS
# Visualize header
x[1:10, 1:5] %>% kable() %>% kable_styling(bootstrap_options = "striped")
# Convert it
xx <- as.mutation.counts(x)
```
```{r results='markup'}
# Print to screen
print(xx)
```
### Estimate Exposures to known Mutational Signatures
After preparing the mutational signatures as a `mutationSignatures`-class object, it is sufficient to run the `resolveMutSignatures()` function. It is not recommended to use a large number of signatures when running this analysis (for example, refrain from using all COSMIC signatures). Likewise, you don't want to run this step using a very low number of signatures. The algorithm is way more accurate if **ONLY AND ALL** the relevant signatures are used (*i.e.*, the signatures we are reasonably expecting to be operative in the tumor samples being analyzed). This analysis is typically very fast!
```{r}
# Obtain 4 COSMIC signatures
cosmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("COSMIC"))
cosmx <- as.mutation.signatures(cosmx)
# Obtain 4 BLCA signatures
blcmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("BLCA"))
blcmx <- as.mutation.signatures(blcmx)
```
```{r results='markup'}
# Visualize cosmx
print(cosmx)
# Visualize cosmx
print(blcmx)
```
```{r}
# Run analysis
blca.expo1 <- resolveMutSignatures(mutCountData = xx,
signFreqData = cosmx)
blca.expo2 <- resolveMutSignatures(mutCountData = xx,
signFreqData = blcmx)
```
### Downstream analysis
As discussed above, exposures to mutational signatures can be plotted (barplots), and results can be exported using the `mutSignatures::data.frame` method.
```{r fig.width=12, fig.height=5}
# Retrieve exposures (results)
blca.exp.1x <- blca.expo1$results$count.result
blca.exp.2x <- blca.expo2$results$count.result
# Plot exposures
bp1 <- msigPlot(blca.exp.1x, main = "BLCA | COSMIC sigs.") +
scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))
bp2 <- msigPlot(blca.exp.2x, main = "BLCA | pre-blca sigs.") +
scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))
# Visualize
grid.arrange(bp1, bp2, ncol = 2)
# Compare sigs
# Export Exposures as data.frame
xprt <- as.data.frame(blca.exp.1x, transpose = TRUE)
head(xprt) %>% round() %>% kable() %>%
kable_styling(bootstrap_options = "striped")
```
## More info and examples
More info and other examples can be found at the following URLs:
- **Our Scientific Reports** paper describing the latest version of the software: *MutSignatures: An R Package for Extraction and Analysis of Cancer Mutational Signatures*, [https://www.nature.com/articles/s41598-020-75062-0/](https://www.nature.com/articles/s41598-020-75062-0/). Please, cite this paper if you use *mutSignatures* in your work! Thanks.
- **Oncogene paper:** Mutational Signatures operative in bladder cancer, [https://www.nature.com/articles/s41388-017-0099-6/](https://www.nature.com/articles/s41388-017-0099-6/)
- **More tutorials and vignettes about the mutSignatures R library:** [https://www.data-pulse.com/dev_site/mutsignatures/](https://www.data-pulse.com/dev_site/mutsignatures/)
## SessionInfo
```{r results='markup'}
sessionInfo()
```
Thanks for your interest in our software. If you run into any issue, bug, or you have a question about `mutSignatures`, feel free to email . At this time, *mutSignatures* is mainly supported by *DF* in his free time (so, please, be patient upon your inquires). **C-2020** *(November-01, 2020)*.