Title: Signature Analyzer for Targeted Sequencing (SATS)
Version: 1.0.4
Date: 2025-07-01
Description: Performs mutational signature analysis for targeted sequenced tumors. Unlike the canonical analysis of mutational signatures, SATS factorizes the mutation counts matrix into a panel context matrix (measuring the size of the targeted sequenced genome for each tumor in units of million base pairs (Mb)), a signature profile matrix, and a signature activity matrix. SATS also calculates the expected number of mutations attributed to a signature, namely signature expectancy, for each targeted sequenced tumor. For more details see Lee et al. (2024) <doi:10.1101/2023.05.18.23290188>.
Imports: stats, glmnet, GenomicRanges, IRanges, Biostrings, dplyr, BSgenome.Hsapiens.UCSC.hg19
Depends: R (≥ 4.1.0)
Suggests: testthat
License: GPL-2
NeedsCompilation: yes
Packaged: 2025-07-01 12:53:45 UTC; wheelerwi
Author: DongHyuk Lee [aut], Bin Zhu [aut], Bill Wheeler [cre]
Maintainer: Bill Wheeler <wheelerb@imsweb.com>
Repository: CRAN
Date/Publication: 2025-07-05 18:40:06 UTC

CalculateSignatureBurdens

Description

Estimates the expected number of mutations attributed to TMB-based catalog signatures (signature expectancy), given the panel size matrix, the catalog signature profile matrix, and the signature activity matrix.

Usage

 CalculateSignatureBurdens(L, W, H)

Arguments

L

Panel size matrix or data frame with samples in columns

W

Catalog signature profiles matrix or data frame with signatures in columns

H

Activity matrix or data frame with samples in columns

Details

The panel size matrix L is of size P (the number of mutation contexts) by N (the number of samples). The catalog signature profile matrix W is of size P by K (the number of signatures), and the activity matrix H is of size K by N. For single base substitutions (SBS), P is 96. Consequently, ncol(L) = ncol(H) = N and ncol(W) = nrow(H) = K.
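As a rough illustration of these dimension requirements, the base R sketch below builds toy matrices and computes a K by N matrix whose entry (k, n) is the sum over contexts p of L[p, n] * W[p, k] * H[k, n]. This formula is an assumption based on the factorization described for the package, not a copy of the CalculateSignatureBurdens() source, and all values are simulated placeholders.

```r
# Toy dimensions (assumed for illustration): P contexts, K signatures, N samples
set.seed(1)
P <- 96; K <- 3; N <- 5

L <- matrix(runif(P * N, 0.5, 1.5), nrow = P, ncol = N)  # panel sizes in Mb
W <- matrix(rexp(P * K), nrow = P, ncol = K)             # signature profiles
H <- matrix(rexp(K * N), nrow = K, ncol = N)             # signature activities

# Dimension checks matching the requirements stated above
stopifnot(nrow(L) == nrow(W), ncol(W) == nrow(H), ncol(L) == ncol(H))

# Assumed expectancy: entry (k, n) = H[k, n] * sum_p W[p, k] * L[p, n]
expectancy <- (t(W) %*% L) * H
dim(expectancy)
```

With the package installed, the corresponding call would be CalculateSignatureBurdens(L, W, H), as in the Examples section.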

Value

A matrix of dimension K X N, where K is the number of signatures and N is the number of samples.

Author(s)

Donghyuk Lee <dhyuklee@pusan.ac.kr> and Bin Zhu <bin.zhu@nih.gov>

See Also

EstimateSigActivity

Examples

    data(SimData, package="SATS")

    CalculateSignatureBurdens(SimData$L, SimData$TrueW_TMB, SimData$TrueH)

EstimateSigActivity

Description

Estimation of signature activities given the original mutation type matrix, the panel size matrix, and the catalog signature profile matrix.

Usage

 EstimateSigActivity(V, L, W, n.start=50, iter.max=5000, eps=1e-5)

Arguments

V

Mutation type matrix or data frame with samples in columns

L

Panel size matrix or data frame with samples in columns

W

Catalog signature profiles matrix or data frame with signatures in columns

n.start

Number of initializations. The default is 50.

iter.max

Maximum number of iterations in the EM algorithm. The default is 5000.

eps

Stopping tolerance in the EM algorithm. The default is 1e-5.

Details

The panel size matrix L and the mutation type matrix V are of size P (the number of mutation contexts) by N (the number of samples). The catalog signature profile matrix W is of size P by K (the number of signatures). For single base substitutions (SBS), P is 96. The objects V, L, and W must satisfy dim(V) = dim(L) and ncol(W) = K. EstimateSigActivity() uses the EM algorithm to estimate the signature activities; n.start, iter.max and eps control the EM algorithm. Because the EM algorithm can converge to a local saddle point, it is good practice to try multiple initial values (n.start, default 50). For each initial value, the maximum number of EM iterations (iter.max) defaults to 5000, and the stopping tolerance (eps) defaults to 1e-5.
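To make the roles of n.start, iter.max and eps concrete, here is a minimal base R sketch of a multi-start EM fit. It assumes a Poisson model, V[p, n] ~ Poisson(L[p, n] * sum_k W[p, k] * H[k, n]), which this manual does not state explicitly. The helper em_activity_sketch() and its convergence rule are hypothetical and are not the implementation used by EstimateSigActivity().

```r
# Hypothetical helper, not part of SATS: EM updates for H with W held fixed,
# assuming V[p, n] ~ Poisson(L[p, n] * sum_k W[p, k] * H[k, n]).
em_activity_sketch <- function(V, L, W, n.start = 5, iter.max = 500, eps = 1e-5) {
  V <- as.matrix(V); L <- as.matrix(L); W <- as.matrix(W)
  stopifnot(all(dim(V) == dim(L)), nrow(W) == nrow(V))
  K <- ncol(W); N <- ncol(V)
  tiny <- .Machine$double.eps

  # Poisson log-likelihood up to a constant that does not depend on H
  loglik <- function(H) {
    mu <- pmax(L * (W %*% H), tiny)
    sum(V * log(mu) - mu)
  }

  denom <- t(W) %*% L                          # K x N, fixed across iterations
  best <- NULL
  for (s in seq_len(n.start)) {
    H <- matrix(runif(K * N, 0.1, 1), K, N)    # random positive start
    ll_old <- loglik(H)
    converged <- FALSE
    for (it in seq_len(iter.max)) {
      # E-step and M-step combined into one multiplicative update
      H <- H * (t(W) %*% (V / pmax(W %*% H, tiny))) / denom
      ll_new <- loglik(H)
      # Assumed stopping rule: relative change in log-likelihood below eps
      if (abs(ll_new - ll_old) < eps * (abs(ll_old) + eps)) {
        converged <- TRUE
        break
      }
      ll_old <- ll_new
    }
    # Keep the start with the highest log-likelihood
    if (is.null(best) || ll_new > best$loglike) {
      best <- list(H = H, loglike = ll_new, converged = converged)
    }
  }
  best
}

# Simulated toy data (placeholders, not SimData)
set.seed(42)
P <- 96; K <- 2; N <- 4
W <- matrix(rexp(P * K), P, K)
W <- sweep(W, 2, colSums(W), "/")              # each profile sums to 1
L <- matrix(runif(P * N, 0.5, 1.5), P, N)
H_true <- matrix(runif(K * N, 50, 150), K, N)
V <- matrix(rpois(P * N, L * (W %*% H_true)), P, N)

fit <- em_activity_sketch(V, L, W)
str(fit)
```

The returned list mirrors the H, loglike and converged elements described under Value. Keeping the best of several random starts is the reason for trying multiple initial values.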

Value

A list containing the estimated activity matrix H, the log-likelihood loglike, and the logical value converged.

Author(s)

Donghyuk Lee <dhyuklee@pusan.ac.kr> and Bin Zhu <bin.zhu@nih.gov>

See Also

CalculateSignatureBurdens

Examples

    data(SimData, package="SATS")

  
    EstimateSigActivity(SimData$V, SimData$L, SimData$TrueW_TMB)
   

Generate Panel Size Matrix

Description

Generation of the panel size matrix given the panel information.

Usage

 GeneratePanelSize(genomic_information, Types)

Arguments

genomic_information

Data frame of panel information (see details).

Types

Mutation type order, either "COSMIC" or "signeR" (see details).

Details

The first argument 'genomic_information' must contain the columns 'Chromosome', 'Start_Position', 'End_Position', and 'SEQ_ASSAY_ID', with column names exactly as written. 'Chromosome' contains the chromosome number, and 'Start_Position' and 'End_Position' are the start and end positions of each targeted region. 'SEQ_ASSAY_ID' identifies the panel each region belongs to; the result has one column per panel.

The second argument specifies the mutation type order as either "COSMIC" or "signeR", where "COSMIC" corresponds to the order from the COSMIC database v3.2 and "signeR" corresponds to the order from the signeR package.

Note: The result of 'GeneratePanelSize()' may NOT be an 'L' matrix. The 'L' matrix can be constructed by binding together the columns of the function output that correspond to the columns of the 'V' matrix. The resulting matrix can be used as the opportunity matrix for the 'signeR()' function and as the 'L' matrix for the 'EstimateSigActivity()' and 'CalculateSignatureBurdens()' functions. It is therefore important that the mutation type order (row names) be the same as in the input mutation type matrix 'V'. We highly recommend confirming that both 'V' and 'L' have the same mutation type order, corresponding to either the COSMIC database v3.2 or the signeR package (both have the same order but use different notation), so that the analysis is consistent.
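The construction of 'L' from the per-panel output can be sketched in base R as follows. The object names (panel_size, sample_panel), the panel and sample labels, and the placeholder mutation type labels are all hypothetical. In practice panel_size would be the output of 'GeneratePanelSize()' and the sample-to-panel mapping would come from your own sample annotation.

```r
set.seed(3)
types <- paste0("type", 1:96)   # placeholder labels, not real COSMIC/signeR names

# Stand-in for GeneratePanelSize() output: 96 rows, one column per panel
panel_size <- data.frame(PanelA = runif(96), PanelB = runif(96), row.names = types)

# Hypothetical mapping from sample ID to the panel used for that sample
sample_panel <- c(S1 = "PanelA", S2 = "PanelB", S3 = "PanelA")

# Toy mutation type matrix V with samples in columns
V <- matrix(rpois(96 * 3, 2), nrow = 96, ncol = 3,
            dimnames = list(types, names(sample_panel)))

# Pick, for each column of V, the panel column that matches that sample
panels <- unname(sample_panel[colnames(V)])
L <- as.matrix(panel_size)[, panels, drop = FALSE]
colnames(L) <- colnames(V)

# V and L must agree in dimension and in mutation type order
stopifnot(all(dim(L) == dim(V)), identical(rownames(L), rownames(V)))
```

Checking the row names explicitly, as in the last line, guards against mixing the "COSMIC" and "signeR" labelings between 'V' and 'L'.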

Value

A data frame of 96 by 'S' (the number of panels, 'SEQ_ASSAY_ID') where entries denote the number of trinucleotides per million base pairs.

Author(s)

Donghyuk Lee <dhyuklee@pusan.ac.kr> and Bin Zhu <bin.zhu@nih.gov>

Examples

    data(SimData, package="SATS")

    GeneratePanelSize(genomic_information = SimData$PanelEx, Types = "COSMIC")
    GeneratePanelSize(genomic_information = SimData$PanelEx, Types = "signeR")

Find a subset of TMB-based catalog SBS signatures

Description

This function finds a subset of TMB-based catalog SBS signatures whose linear combination approximates the de novo SBS signatures detected by signeR.

Usage

 MappingSignature(W_hat, W_ref=NULL, niter=100, cutoff.I2=0.1, min.repeats=80)

Arguments

W_hat

Matrix or data frame of de novo signatures from signeR

W_ref

NULL or a matrix or data frame of TMB-based catalog signatures. If NULL, then it will default to SimData$W_TMB (see SimData).

niter

Number of iterations. The default is 100.

cutoff.I2

Cutoff value to select signatures. The default is 0.1.

min.repeats

Minimum number of iterations to select signatures with I^2 > cutoff.I2. The default is 80.

Details

MappingSignature() applies penalized non-negative least squares (pNNLS) for selecting the TMB-based catalog signatures. Specifically, it repeats pNNLS 100 times (niter) to reduce the randomness of cross-validation involved in pNNLS. Then TMB-based catalog signatures are selected with a coefficient greater than 0.1 (cutoff.I2) in more than 80 repeats (min.repeats).
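The selection rule can be illustrated with a small base R sketch that covers only the counting step. The coefficient matrix below is made up rather than produced by pNNLS, the signature labels are placeholders, and treating min.repeats as an inclusive threshold (freq >= min.repeats) is an assumption, since the text above says "more than" while the argument is described as a minimum.

```r
niter <- 100; cutoff.I2 <- 0.1; min.repeats <- 80

# Made-up coefficients: one row per candidate signature, one column per repeat
coef_mat <- rbind(
  SigA = rep(0.45, niter),                        # above cutoff in all 100 repeats
  SigB = rep(0.02, niter),                        # never above cutoff
  SigC = ifelse(seq_len(niter) <= 90, 0.40, 0),   # above cutoff in 90 repeats
  SigD = ifelse(seq_len(niter) <= 50, 0.20, 0)    # above cutoff in 50 repeats
)

# Count, per signature, the repeats whose coefficient exceeds the cutoff
freq <- rowSums(coef_mat > cutoff.I2)

# Keep signatures that pass the cutoff often enough (inclusive threshold assumed)
selected <- data.frame(Signature = names(freq), freq = as.integer(freq),
                       stringsAsFactors = FALSE)
selected <- selected[selected$freq >= min.repeats, ]
selected
```

Repeating the fit and counting in this way damps the run-to-run variation that cross-validation introduces into any single pNNLS fit.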

Value

A data frame with the column names of W_ref (COSMIC SBS names are returned if COSMIC catalog-based reference signatures are used) and freq (the number of repetitions, out of niter iterations, in which the coefficient exceeded the cutoff value).

Author(s)

Donghyuk Lee <dhyuklee@pusan.ac.kr> and Bin Zhu <bin.zhu@nih.gov>


Example Data

Description

The pan-cancer repertoire of reference signatures and the reference TMB (Tumor mutation burden) signature profiles of SBS (Single base substitutions) and DBS (Double base substitutions).

Details

This file consists of the list RefTMB with the following objects:


SATS (Signature Analyzer for Targeted Sequencing)

Description

This package performs mutational signature analysis for targeted sequenced tumors. Unlike the canonical analysis of mutational signatures, SATS factorizes the mutation counts matrix into a panel context matrix (measuring the size of the targeted sequenced genome for each tumor in units of million base pairs (Mb)), a signature profile matrix, and a signature activity matrix. SATS also calculates the expected number of mutations attributed to a signature, namely signature expectancy, for each targeted sequenced tumor.

Details

This package includes a novel algorithm, SATS, to perform mutational signature analysis for targeted sequenced tumors. The algorithm first applies the signeR algorithm to extract profiles of de novo mutational signatures while adjusting for the varying panel sizes. Next, the profiles of the identified de novo mutational signatures are mapped to the profiles of tumor mutation burden (TMB) catalog signatures, expressed as the number of mutations per million base pairs, using penalized non-negative least squares. Then, given the panel sizes and the profiles of the mapped TMB catalog signatures, signature activities are estimated for all samples simultaneously through the Expectation-Maximization (EM) algorithm. Finally, the expected number of mutations attributed to a signature, namely signature expectancy, is calculated for each targeted sequenced tumor.

The main functions in this package are EstimateSigActivity, CalculateSignatureBurdens, and MappingSignature.

Author(s)

Donghyuk Lee <dhyuklee@pusan.ac.kr> and Bin Zhu <bin.zhu@nih.gov>

References

Lee, D., Hua, M., Wang, D., Song, L., Yu, K., Yang, X., Shi, J., Landi, M., Zhu, B. The mutational signatures of 100,477 targeted sequenced tumors. Submitted.


Data for examples

Description

Simulated data as an example

Details

This file consists of the list SimData with the following objects: