| Title: | Analyze and Integrate Any Type of Nucleotide Recoding RNA-Seq Data |
| Version: | 0.1.0 |
| Description: | A complete rewrite and reimagining of 'bakR' (see 'Vock et al.' (2025) <doi:10.1371/journal.pcbi.1013179>). Designed to support a wide array of analyses of nucleotide recoding RNA-seq (NR-seq) datasets of any type, including TimeLapse-seq/SLAM-seq/TUC-seq, Start-TimeLapse-seq (STL-seq), TT-TimeLapse-seq (TT-TL-seq), and subcellular NR-seq. 'EZbakR' extends standard NR-seq mutational modeling to support multi-label analyses (e.g., 4sU and 6sG dual labeling), and implements an improved hierarchical model to better account for transcript-to-transcript variance in metabolic label incorporation. 'EZbakR' also generalizes dynamical systems modeling of NR-seq data to support analyses of premature mRNA processing and flow between subcellular compartments. Finally, 'EZbakR' implements flexible and well-powered comparative analyses of all estimated parameters via design matrix-specified generalized linear modeling. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Suggests: | knitr, MASS, rmarkdown, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| Imports: | arrow, data.table, dplyr, ggplot2, magrittr, methods, purrr, rlang, tidyr, tximport |
| Depends: | R (≥ 3.5) |
| LazyData: | true |
| VignetteBuilder: | knitr |
| URL: | https://isaacvock.github.io/EZbakR/, https://github.com/isaacvock/EZbakR |
| BugReports: | https://github.com/isaacvock/EZbakR/issues |
| NeedsCompilation: | no |
| Packaged: | 2025-12-04 22:09:30 UTC; isaac |
| Author: | Isaac Vock |
| Maintainer: | Isaac Vock <isaac.vock@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-10 21:20:08 UTC |
Average parameter estimates across replicates, and regularize variance estimates
Description
AverageAndRegularize fits a generalized linear model to your data to effectively
average parameter estimates across replicates and get overall uncertainty estimates
for those parameters. The linear model to which your data is fit is specified via
an R formula object supplied to the formula_mean parameter. Uncertainty estimates
are regularized via a hierarchical modeling strategy originally introduced with
bakR, though slightly improved upon since then.
Usage
AverageAndRegularize(
obj,
features = NULL,
parameter = "log_kdeg",
type = "kinetics",
kstrat = NULL,
populations = NULL,
fraction_design = NULL,
exactMatch = TRUE,
repeatID = NULL,
formula_mean = NULL,
sd_grouping_factors = NULL,
include_all_parameters = TRUE,
sd_reg_factor = 10,
error_if_singular = TRUE,
min_reads = 10,
convert_tl_to_factor = TRUE,
regress_se_with_abs = TRUE,
force_lm = FALSE,
force_optim = force_lm,
conservative = FALSE,
character_limit = 20,
feature_lengths = NULL,
feature_sample_counts = NULL,
scale_factor_df = NULL,
overwrite = TRUE
)
Arguments
obj |
An |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
parameter |
Parameter to average across replicates of a given condition. |
type |
What type of table is the parameter found in? Default is "kinetics", but can also be set to "fractions". |
kstrat |
If |
populations |
Character vector of the set of mutational populations that you want to infer the fractions of. Only relevant if type == "fractions". |
fraction_design |
"Design matrix" specifying which RNA populations exist in your samples. Only relevant if type == "fractions". |
exactMatch |
If TRUE, then |
repeatID |
If multiple |
formula_mean |
An R formula object specifying how the NOTE: EZbakR automatically removes any intercept terms from the model. That way, there is no ambiguity about what parameter is defined as the reference. |
sd_grouping_factors |
What metadf columns should data be grouped by when estimating
standard deviations across replicates? If this is NULL, then EZbakR will check to see
if the |
include_all_parameters |
If TRUE, an additional table will be saved with the prefix |
sd_reg_factor |
Determines how strongly variance estimates are shrunk towards trend. Higher numbers lead to more regularization. Eventually, this will be replaced with estimation of how much variance there seems to be in the population of variances. |
error_if_singular |
If TRUE, linear model will throw an error if parameters cannot be uniquely identified. This is most often caused by parameters that cannot be estimated from the data, e.g., due to limited replicate numbers or correlated sample characteristics (i.e., all treatment As also correspond to batch As, and all treatment Bs correspond to batch Bs). |
min_reads |
Minimum number of reads in all samples for a feature to be kept. |
convert_tl_to_factor |
If a label time variable is included in the |
regress_se_with_abs |
If TRUE, and if |
force_lm |
Certain formulas lend themselves to efficient approximations of the
full call to |
force_optim |
Old parameter that is now passed the value of force_lm (see Usage above). |
conservative |
If TRUE, conservative variance regularization will be performed. In this case, variances below the trend will be regularized up to the trend, and variances above the trend will be left unregularized. This avoids underestimation of variances. |
character_limit |
Limit on the number of characters of the name given to the output table. Will attempt to concatenate the parameter name with the names of all of the features. If this is too long, only the parameter name will be used. |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
feature_sample_counts |
Data frame with columns |
scale_factor_df |
Data frame with columns "sample" and a second column of whatever name you please. The second column should denote scale factors by which read counts in that sample should be multiplied by in order to normalize these read counts. |
overwrite |
If TRUE, identical, existing output will be overwritten. |
Details
The EZbakR website has an extensive vignette walking through various use cases
and model types that you can fit with AverageAndRegularize(): vignette link.
EZbakR improves upon bakR by balancing extra conservativeness in several steps
with a more highly powered statistical testing scheme in its CompareParameters()
function. In particular, the following changes to the variance regularization
scheme were made:
Sample-specific parameter uncertainties are used to generate conservative estimates of feature-specific replicate variabilities. In addition, a small floor is set to ensure that replicate variance estimates are never below a certain level, again to avoid underestimating parameter variance.
Condition-wide replicate variabilities are regressed against read coverage plus either a) |logit(estimate)| when modeling average fraction labeled, which captures the fact that estimates are best around a logit(fraction labeled) of 0 and get worse for more extreme fraction labeled's; b) log(kdeg) when modeling log degradation rate constants (at first, I considered a strategy similar to the fraction labeled modeling, but found that agreement between a fully rigorous MCMC sampling approach and EZbakR was significantly improved by just regressing against the value of the log kinetic parameter, likely due to the non-linear transformation of fraction labeled to log(kdeg)); or c) nothing else, i.e., only coverage, in all other cases.
Features with replicate variabilities below the inferred trend have their replicate variabilities set equal to that predicted by the trend. This helps limit underestimation of parameter variance. Features with above-trend replicate variabilities have their replicate variabilities regularized with a Normal prior, Normal likelihood Bayesian model, as in bakR (so the log(variance) is the inferred mean of this distribution, and the known variance is inferred from the amount of variance about the linear dataset-wide trend).
All of this allows CompareParameters() to use a less conservative statistical test
when calculating p-values, while still controlling false discovery rates.
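As a minimal sketch of the regularization rule described in the list above (all variable names are illustrative placeholders, not EZbakR internals):
# Observed feature-wise log(replicate variance) and the
# dataset-wide trend's prediction for each feature
log_var    <- c(-3.2, -1.1, -2.5)
trend_pred <- c(-2.8, -2.0, -2.4)
obs_sd   <- 0.7  # assumed uncertainty in each observed log-variance
prior_sd <- 0.5  # assumed spread of log-variances about the trend
# Normal prior/Normal likelihood posterior mean for each log-variance
posterior_mean <- (log_var / obs_sd^2 + trend_pred / prior_sd^2) /
  (1 / obs_sd^2 + 1 / prior_sd^2)
# Below-trend variances are raised to the trend; above-trend
# variances are shrunk toward it
reg_log_var <- ifelse(log_var < trend_pred, trend_pred, posterior_mean)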
Value
EZbakRData object with an additional "averages" table, as well as a
fullfit table under the same list heading, which includes extra information about
the priors used for regularization purposes.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Estimate Kinetics
ezbdo <- EstimateKinetics(ezbdo)
# Average estimates across replicates
ezbdo <- AverageAndRegularize(ezbdo)
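# A custom averaging model can be specified via formula_mean; e.g.,
# averaging within each level of the simulated metadf column "treatment"
# (intercept terms are removed automatically):
ezbdo <- AverageAndRegularize(ezbdo, formula_mean = ~treatment)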
Get contrasts of estimated parameters
Description
CompareParameters() calculates differences in parameters estimated by
AverageAndRegularize() or EZDynamics() and performs null hypothesis
statistical testing, comparing their values to a null hypothesis of 0.
Usage
CompareParameters(
obj,
design_factor,
reference,
experimental,
param_name,
parameter = "log_kdeg",
type = "averages",
param_function,
condition = NULL,
features = NULL,
exactMatch = TRUE,
repeatID = NULL,
formula_mean = NULL,
sd_grouping_factors = NULL,
fit_params = NULL,
normalize_by_median = FALSE,
reference_levels = NULL,
experimental_levels = NULL,
overwrite = TRUE
)
Arguments
obj |
An |
design_factor |
Name of factor from |
reference |
Name of reference |
experimental |
Name of |
param_name |
If you want to assess the significance of a single parameter, rather than the comparison of two parameters, specify that one parameter's name here. |
parameter |
Parameter to average across replicates of a given condition. |
type |
Type of table to use. Can either be "averages" or "dynamics". |
param_function |
NOT YET IMPLEMENTED. Will allow you to specify more complicated functions of parameters when hypotheses you need to test are combinations of parameters rather than individual parameters or simple differences in two parameters. |
condition |
Same as |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
exactMatch |
If TRUE, then |
repeatID |
If multiple |
formula_mean |
An R formula object specifying how the |
sd_grouping_factors |
Metadf columns that data was grouped by when estimating standard deviations across replicates for the averages object you want to use. |
fit_params |
Character vector of parameter names in the averages object you would like to use. |
normalize_by_median |
If TRUE, then median difference will be set equal to 0. This can account for global biases in parameter estimates due to things like differences in effective label times. Does risk eliminating real signal though, so user discretion is advised. |
reference_levels |
Same as |
experimental_levels |
Same as |
overwrite |
If TRUE, then identical output will be overwritten if it exists. |
Details
The EZbakR website has an extensive vignette walking through various use cases
and parameters you can compare with CompareParameters(): vignette link.
There are essentially 3 scenarios that CompareParameters() can handle:
1. Pairwise comparisons: compare reference to experimental parameter estimates of a specified design_factor from AverageAndRegularize(). log(experimental/reference) is the computed "difference" in this case.
2. Assess the value of a single parameter, which itself should represent a difference between other parameters. The name of this parameter can be specified via the param_name argument. This is useful for various interaction models where some of the parameters of these models may represent things like "effect of A on condition X".
3. Pairwise comparison of dynamical systems model parameter estimates: similar to the first case listed above, but now when type == "dynamics". design_factor can now be a vector of all the metadf columns you stratified parameter estimates by.
Eventually, a 4th option via the currently non-functional param_function argument
will be implemented, allowing you to specify functions of parameters to be assessed;
this can be useful for certain interaction models.
CompareParameters() calculates p-values using what is essentially an asymptotic Wald test:
the ratio of the estimated difference to its standard error is compared to a standard
normal distribution. P-values are then multiple-test adjusted using the method of
Benjamini and Hochberg, as implemented in R's p.adjust() function.
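For intuition, a minimal sketch of this testing scheme, with made-up effect sizes and standard errors (difference and se are illustrative names, not EZbakR internals):
# Hypothetical estimated differences and their standard errors
difference <- c(0.8, -0.1, 1.5)
se         <- c(0.3,  0.2, 0.4)
# Two-sided asymptotic Wald test against a standard normal
pval <- 2 * pnorm(-abs(difference / se))
# Benjamini-Hochberg multiple-test adjustment
padj <- p.adjust(pval, method = "BH")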
Value
EZbakRData object with an additional "comparisons" table, detailing
the result of a comparison of a parameter estimate's values across two different
conditions.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Estimate Kinetics
ezbdo <- EstimateKinetics(ezbdo)
# Average estimates across replicates
ezbdo <- AverageAndRegularize(ezbdo)
# Compare parameters across conditions
ezbdo <- CompareParameters(
ezbdo,
design_factor = "treatment",
reference = "treatment1",
experimental = "treatment2"
)
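# Retrieve the resulting comparison table with EZget()
comparisons <- EZget(ezbdo, type = "comparisons")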
Correct for experimental/bioinformatic dropout of labeled RNA.
Description
Uses the strategy described here, similar to that originally presented in Berg et al. 2024.
Usage
CorrectDropout(
obj,
strategy = c("grandR", "bakR"),
grouping_factors = NULL,
features = NULL,
populations = NULL,
fraction_design = NULL,
repeatID = NULL,
exactMatch = TRUE,
read_cutoff = 25,
dropout_cutoff = 5,
...
)
Arguments
obj |
An EZbakRFractions object, which is an EZbakRData object on which
you have run |
strategy |
Which dropout correction strategy to use. Options are:
The "bakR" strategy has the advantage of being model-derived, making it possible to assess model fit and thus whether the simple assumptions of both the "bakR" and "grandR" dropout models are met. The "grandR" strategy has the advantage of being more robust. Thus, the "grandR" strategy is currently used by default. |
grouping_factors |
Which sample-detail columns in the metadf should be used
to group -s4U samples by for calculating the average -s4U RPM? The default value of
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
read_cutoff |
Minimum number of reads for a feature to be used to fit the dropout model. |
dropout_cutoff |
Maximum ratio of -s4U:+s4U RPMs for a feature to be used to fit the dropout model (i.e., simple outlier filtering cutoff). |
... |
Parameters passed to internal |
Details
Dropout is the disproportionate loss of labeled RNA, or of reads derived from labeled
RNA, described independently here
and here. It can originate from a combination of
bioinformatic (loss of high mutation content reads due to alignment problems),
technical (loss of labeled RNA during RNA extraction), and biological (transcriptional
shutoff in rare cases caused by metabolic label toxicity) sources.
CorrectDropout() compares label-fed and label-free controls from the same
experimental conditions to estimate and correct for this dropout. It assumes
that there is a single number (referred to as the dropout rate, or pdo) which
describes the rate at which labeled RNA is lost (relative to unlabeled RNA).
pdo ranges from 0 (no dropout) to 1 (complete loss of all labeled RNA), and
is thus interpreted as the proportion of labeled RNA (or of reads from labeled RNA)
disproportionately lost, relative to the equivalent unlabeled species.
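The algebra implied by this single-parameter model can be sketched as follows (theta and pdo are illustrative names; this is a worked restatement of the model above, not EZbakR's internal code):
pdo   <- 0.3  # dropout rate
theta <- 0.5  # true fraction labeled
# Labeled reads are lost at rate pdo relative to unlabeled reads,
# biasing the observed fraction labeled downward
theta_obs <- (1 - pdo) * theta / (1 - pdo * theta)
# Inverting the model recovers the true fraction from the observed one
theta_hat <- theta_obs / ((1 - pdo) + pdo * theta_obs)
all.equal(theta_hat, theta)  # TRUE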
Value
An EZbakRData object with the specified "fractions" table replaced
with a dropout corrected table.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Correct for dropout
ezbdo <- CorrectDropout(ezbdo)
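# The model-derived "bakR" strategy can be requested explicitly
ezbdo <- CorrectDropout(ezbdo, strategy = "bakR")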
Deconvolve multi-feature fractions.
Description
Combines the output of EstimateFractions with feature
quantification performed by an outside tool (e.g., RSEM, kallisto, salmon, etc.)
to infer fraction news for features that reads cannot always be assigned to
unambiguously. This is a generalization of EstimateIsoformFractions
which performs this deconvolution for transcript isoforms.
Usage
DeconvolveFractions(
obj,
feature_type = c("gene", "isoform"),
features = NULL,
populations = NULL,
fraction_design = NULL,
repeatID = NULL,
exactMatch = TRUE,
fraction_name = NULL,
quant_name = NULL,
gene_to_transcript = NULL,
overwrite = TRUE,
TPM_min = 1,
count_min = 10
)
Arguments
obj |
An |
feature_type |
Either |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
fraction_name |
Name of fraction estimate table to use. Should be stored in the
|
quant_name |
Name of transcript isoform quantification table to use. Should be stored
in the obj$readcounts list under this name. Use |
gene_to_transcript |
Table with columns |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
TPM_min |
Minimum TPM for a transcript to be kept in analysis. |
count_min |
Minimum expected_count for a transcript to be kept in analysis. |
Details
DeconvolveFractions expects as input a "fractions" table with estimates
for fraction news of at least one convolved feature set. A convolved feature
set is one where some reads cannot be unambiguously assigned to one instance
of that feature type. For example, it is often impossible to assign short
reads to a single transcript isoform. Thus, something like the "TEC" assignment
provided by fastq2EZbakR is an instance of a convolved feature set, as it
is an assignment of reads to transcript isoforms with which they are compatible.
Another example is assignment to the exonic regions of genes, for fusion genes
(where a read may be consistent with both the putative fusion gene as well
as one of the fusion components).
DeconvolveFractions deconvolves fraction news
by fitting a linear mixing model to the convolved fraction new estimates +
feature abundance estimates. In other words, each convolved fraction new (data) is modeled
as a weighted average of single feature fraction news (parameters to estimate),
with the weights determined by the relative abundances of the features
in the convolved set (data). The convolved fraction new is modeled as a beta distribution with mean
equal to the weighted feature fraction new average and variance related
to the number of reads in the convolved feature set.
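As a small numerical sketch of this mixing model (the weights and isoform-level fraction news below are made up for illustration):
# Two isoforms of one gene: relative abundances (weights, from
# quantification) and isoform-level fraction news (parameters to infer)
weights <- c(0.7, 0.3)
fn_iso  <- c(0.2, 0.8)
# The convolved, gene-level fraction new is modeled as their
# weighted average
fn_convolved <- sum(weights * fn_iso)  # 0.38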
Features with estimated TPMs less than TPM_min (1 by default) or an estimated number of read
counts less than count_min (10 by default) are removed from convolved feature sets and will
not have their fraction news estimated.
Value
An EZbakRData object with an additional table under the "fractions"
list. Has the same form as the output of EstimateFractions(), and will have the
feature column "transcript_id".
Examples
# Load dependencies
library(dplyr)
# Simulates a single sample worth of data
simdata_iso <- SimulateIsoforms(nfeatures = 30)
# We have to manually create the metadf in this case
metadf <- tibble(sample = 'sampleA',
tl = 4,
condition = 'A')
ezbdo <- EZbakRData(simdata_iso$cB,
metadf)
ezbdo <- EstimateFractions(ezbdo)
### Hack in the true, simulated isoform levels
reads <- simdata_iso$ground_truth %>%
dplyr::select(transcript_id, true_count, true_TPM) %>%
dplyr::mutate(sample = 'sampleA',
effective_length = 10000) %>%
dplyr::rename(expected_count = true_count,
TPM = true_TPM)
# Name of table needs to have "isoform_quant" in it
ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads
### Perform deconvolution
ezbdo <- DeconvolveFractions(ezbdo, feature_type = "isoform")
Generalized dynamical systems modeling
Description
EZDynamics() estimates parameters of a user-specified dynamical systems
model. The dynamical system model is specified through an adjacency matrix,
which is an NxN matrix described below (see graph documentation). Modeling
can either be done for species all assayed in each sample, or species that
are assayed across a set of independent samples (e.g., subcellular fractionation
involves assaying different species in different samples).
Usage
EZDynamics(
obj,
graph,
sub_features,
grouping_features,
scale_factors = NULL,
sample_feature = NULL,
modeled_to_measured = NULL,
parameter_names = paste0("logk", 1:max(graph)),
unassigned_name = "__no_feature",
type = "averages",
prior_means = rep(-3, times = max(graph)),
prior_sds = c(3, rep(1, times = max(graph) - 1)),
avg_params_tokeep = NULL,
avg_params_todrop = NULL,
label_time_name = "tl",
features = NULL,
populations = NULL,
fraction_design = NULL,
parameter = NULL,
repeatID = NULL,
exactMatch = TRUE,
feature_lengths = NULL,
use_coverage = TRUE,
normalization_repeatID = NULL,
normalization_exactMatch = TRUE,
species_to_sf = NULL,
overwrite = TRUE
)
Arguments
obj |
Currently must be an EZbakRData object on which |
graph |
An NxN adjacency matrix, where N represents the number of species being modeled. One of these species must be called "0" and represent the "no RNA" species. This is the species from which some species are synthesized (e.g., 0 -> P means premature RNA is synthesized from no RNA), and the species to which some species are degraded (e.g., M -> 0 means mature RNA is converted to "no RNA" via degradation). The rows and columns of this matrix must be the names of all modeled species, and rownames(graph) == colnames(graph). Entry i,j of the matrix is either 0, if species i cannot be converted into species j under your model, or an integer from 1:npars (where npars = total number of parameters to be estimated) if it can. For example, the model 0 -> P -> M -> 0 would have the |
sub_features |
Which feature columns distinguish between the different
measured species? Note, the measured species need not have the same name,
and may not be directly equivalent to, the modeled species. The relationship
between the modeled species in |
grouping_features |
Which features are the overarching feature assignments
by which |
scale_factors |
Data frame mapping samples to factors by which to multiply read counts so as to ensure proper normalization between different RNA populations. Only relevant if you are modeling relationships between distinct RNA populations, for example RNA from nuclear and cytoplasmic fractions. Will eventually be inferred automatically. |
sample_feature |
If different samples involve assaying different species,
this must be the name of the metadf column that distinguishes the different
classes of samples. For example, if analyzing a subcellular fractionation
dataset, you likely included a column in your metadf called "compartment".
This would then be your |
modeled_to_measured |
If For example, if your model is 0 -> P -> M -> 0, where P stands for premature RNA
and M stands for mature RNA, and you have a column called
GF that corresponds to assignment anywhere in the gene, and XF that corresponds
to assignment of exclusively exon mapping reads, then your As another example, if your model is 0 -> N -> C -> 0, where N stands for
"nuclear RNA" and C stands for "cytoplasmic RNA", and your |
parameter_names |
Vector of names you would like to give to the estimated
parameters. ith element should correspond to name of parameter given the ID
i in |
unassigned_name |
What value will a |
type |
What type of table would you like to use? Currently only supports "averages", but will support "fractions" in the near future. |
prior_means |
Mean of log-Normal prior for kinetic parameters. Should
be vector where ith value is mean for ith parameter, i = index in |
prior_sds |
Std. dev. of log-Normal prior for kinetic parameters.
Should be vector where ith value is the std. dev. for the ith parameter, i = index in |
avg_params_tokeep |
Names of parameters in averages table that you would like to keep. Other parameters will be discarded. Don't include the prefixes "mean_", "sd_", or "coverage_"; these will be imputed automatically. In other words, this should be the base parameter names. |
avg_params_todrop |
Names of parameters in averages table that you would like to drop. Other parameters will be kept. Don't include the prefixes "mean_", "sd_", or "coverage_"; these will be imputed automatically. In other words, this should be the base parameter names. |
label_time_name |
Name of relevant label time column that will be found in the parameter names. Defaults to the standard "tl". |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
parameter |
Parameter to average across replicates of a given condition.
Has to be |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
use_coverage |
If TRUE, normalized read counts will be used to inform kinetic parameter estimates. If FALSE, only fraction news will be used, which will leave some parameters (e.g., synthesis rate) unidentifiable, though has the advantage of avoiding the potential biases induced by normalization problems. |
normalization_repeatID |
For extracting the |
normalization_exactMatch |
For extracting the |
species_to_sf |
List mapping individual RNA species in |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
Details
When running AverageAndRegularize() to produce input for EZDynamics(), you
must set parameter to logit_fraction_high<muttype> (<muttype> = type of mutation
modeled by EstimateFractions(), e.g., TC). If you have multiple distinct label times,
you must also include the label time (tl of your metadf)
in your regression formula. EZDynamics() models the logit(fraction high <muttype>),
and this will depend on the label time (longer label time = higher fraction), which
is why these two conditions must be met. If you only have a single label time though,
EZDynamics will be able to impute this one value for all samples from your metadf.
You can also include additional interaction terms in your AverageAndRegularize()
model for different experimental conditions in which experiments were conducted,
so that inferred kinetic parameters can be compared across these conditions. Currently,
more complex modeling beyond simple stratification of samples by one or more condition
is not possible with EZDynamics().
For normalization purposes, especially if analyzing pre-mRNA processing dynamics,
you will need to provide AverageAndRegularize() with a table of feature lengths
via the feature_lengths parameter. This will be used in all cases to length
normalize read counts. Even in the case when you are just modeling mature mRNA
dynamics, this is technically necessary for accurate estimation of scale factors.
The first step of EZDynamics() is attempted inference of normalization scale
factors for read counts. If you have scale factors you calculated yourself,
e.g. via specialized spike-in protocols, you can provide these via the scale_factors
parameter. If not, EZDynamics() will try to infer these from the fraction labeled's
in each sample_feature (e.g., in different subcellular compartments). This relies on
having some sample_feature's that are a combination of other sample_feature's. For
example, if analyzing subcellular fractionation data, you may have 1) total RNA; 2)
cytoplasmic RNA; and 3) nuclear RNA. Total RNA = cytoplasmic + nuclear RNA, and thus
the fraction of reads from labeled RNA is a function of the total cytoplasmic and
nuclear fraction labeled's, as well as the relative molecular abundances of cytoplasmic
and nuclear RNA. The latter is precisely the scale factors we need to estimate.
If you do not have sufficient combinations of data to perform this scale factor estimation,
EZDynamics() will only use the fraction labeled's for modeling kinetic parameters.
It can then perform post-hoc normalization to estimate a single synthesis rate constant,
using the downstream rate constants to infer the unknown normalization scale factor necessary
to combine kinetic parameter estimates and read counts to infer this rate constant.
For estimating kinetic parameters, EZDynamics() infers the solution of the linear
system of ODEs specified in your graph matrix input. This is done by representing
the system of equations as a matrix, and deriving the general solution of the levels
of each modeled species of RNA from the eigenvalues and eigenvectors of this matrix.
While this makes EZDynamics() orders of magnitude more efficient than if it had to
numerically infer the solution for each round of optimization, needing to compute
eigenvalues and eigenvectors in each optimization iteration is still non-trivial,
meaning that EZDynamics() may take anywhere from 10s of minutes to a couple hours
to run, depending on how complex your model is and how many distinct sets of samples
and experimental conditions you have.
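A minimal sketch of this eigenvalue-based solution strategy, using the 0 -> P -> M -> 0 model as an example (rate constants and initial conditions are illustrative; EZbakR's internals differ in detail):
ksyn <- 10; kp <- 2; kdeg <- 0.5
# dx/dt = A %*% x + b for x = (P, M):
# dP/dt = ksyn - kp*P; dM/dt = kp*P - kdeg*M
A <- matrix(c(-kp,     0,
               kp, -kdeg),
            nrow = 2, byrow = TRUE)
b <- c(ksyn, 0)
# Steady state and eigendecomposition yield the general solution
x_ss <- solve(-A, b)
eig  <- eigen(A)
x_at <- function(t, x0 = c(0, 0)) {
  drop(x_ss + eig$vectors %*% diag(exp(eig$values * t)) %*%
         solve(eig$vectors, x0 - x_ss))
}
x_at(1)  # levels of labeled P and M after 1 unit of label time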
Value
EZbakRData object with an additional "dynamics" table.
Examples
##### MODELING CYTOPLASMIC TO NUCLEAR FLOW
### Simulate data and get replicate average fractions estimates
simdata <- EZSimulate(
nfeatures = 30,
ntreatments = 1,
mode = "dynamics",
label_time = c(1, 3),
dynamics_preset = "nuc2cyto"
)
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
ezbdo <- EstimateFractions(ezbdo)
ezbdo <- AverageAndRegularize(ezbdo,
formula_mean = ~tl:compartment - 1,
type = "fractions",
parameter = "logit_fraction_highTC")
### ODE model: the graph and measured species
graph <- matrix(c(0, 1, 0,
0, 0, 2,
3, 0, 0),
nrow = 3,
ncol = 3,
byrow = TRUE)
colnames(graph) <- c("0", "N", "C")
rownames(graph) <- colnames(graph)
modeled_to_measured <- list(
nuclear = list(GF ~ N),
cytoplasm = list(GF ~ C),
total = list(GF ~ C + N) # total RNA is a combination of C and N
)
### Fit model
ezbdo <- EZDynamics(ezbdo,
graph = graph,
sub_features = "GF",
grouping_features = "GF",
sample_feature = "compartment",
modeled_to_measured = modeled_to_measured)
Make an MA plot from an EZbakR comparison
Description
Make a plot of effect size (y-axis) vs. log10(read coverage) (x-axis), coloring points by position relative to user-defined decision cutoffs.
Usage
EZMAPlot(
obj,
parameter = "log_kdeg",
design_factor = NULL,
reference = NULL,
experimental = NULL,
param_name = NULL,
param_function = NULL,
features = NULL,
condition = NULL,
repeatID = NULL,
exactMatch = TRUE,
plotlog2 = TRUE,
FDR_cutoff = 0.05,
difference_cutoff = log(2),
size = NULL,
features_to_highlight = NULL,
highlight_shape = 21,
highlight_size_diff = 1,
highlight_stroke = 0.7,
highlight_fill = NA,
highlight_color = "black"
)
Arguments
obj |
An object of class |
parameter |
Name of parameter whose comparison you want to plot. |
design_factor |
Name of factor from |
reference |
Name of reference |
experimental |
Name of |
param_name |
If you want to assess the significance of a single parameter, rather than the comparison of two parameters, specify that one parameter's name here. |
param_function |
NOT YET IMPLEMENTED. Will allow you to specify more complicated functions of parameters when hypotheses you need to test are combinations of parameters rather than individual parameters or simple differences in two parameters. |
features |
Character vector of feature names for which comparisons were made. |
condition |
Defunct parameter that has been replaced with |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
plotlog2 |
If TRUE, assume that log(parameter) difference is passed in and that you want to plot log2(parameter) difference. |
FDR_cutoff |
False discovery cutoff by which to color points. |
difference_cutoff |
Minimum absolute difference cutoff by which to color points. |
size |
Size of points, passed to |
features_to_highlight |
Features you want to highlight in the plot (black circle will be drawn around them). This can either be a data frame with one column per feature type in the comparison table you are visualizing, or a vector of feature names if the relevant comparison table will only have one feature type noted. |
highlight_shape |
Shape of points overlayed on highlighted features. Defaults to an open circle |
highlight_size_diff |
Sets how much larger the points overlaid on the highlighted features should be than the original plot points. |
highlight_stroke |
Stroke width of the points overlayed on the highlighted features. |
highlight_fill |
Fill color of the points overlayed on the highlighted
features. Default is for them to be fill-less ( |
highlight_color |
Color of the points overlaid on the highlighted features. |
Details
EZMAPlot() accepts as input the output of CompareParameters(), i.e.,
an EZbakRData object with at least one "comparisons" table. It will plot
the "avg_coverage" column in this table vs. the "difference" column.
In the simplest case, "difference" represents a log-fold change in a kinetic
parameter (e.g., kdeg) estimate. More complicated linear model fits and
comparisons can yield different parameter estimates.
NOTE: some outputs of CompareParameters() are not meant for visualization
via an MA plot. For example, when fitting certain interaction models,
some of the parameter estimates may represent average log(kinetic parameter)
in one condition. See discussion of one example of this here.
EZbakR estimates kinetic parameters in EstimateKinetics() and EZDynamics()
on a log-scale. By default, since log2-fold changes are a bit easier to interpret
and more common for these kind of visualizations, EZMAPlot() multiplies
the y-axis value by log2(exp(1)), which is the factor required to convert from
a log to a log2 scale. You can turn this off by setting plotlog2 to FALSE.
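For reference, the conversion works as follows:
# A natural-log fold change of log(2) becomes a log2 fold change of 1
log_fc  <- log(2)
log2_fc <- log_fc * log2(exp(1))  # = 1, i.e., a 2-fold change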
Value
A ggplot2 object. Y-axis = log2(estimate of interest (e.g., fold-change
in degradation rate constant); X-axis = log10(average normalized read coverage);
points colored by location relative to FDR and effect size cutoffs.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Estimate Kinetics
ezbdo <- EstimateKinetics(ezbdo)
# Average estimates across replicates
ezbdo <- AverageAndRegularize(ezbdo)
# Compare parameters across conditions
ezbdo <- CompareParameters(
ezbdo,
design_factor = "treatment",
reference = "treatment1",
experimental = "treatment2"
)
# Make MA plot (ggplot object that you can save and add/modify layers)
EZMAPlot(ezbdo)
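# Decision cutoffs used for coloring can be adjusted, e.g.:
EZMAPlot(ezbdo, FDR_cutoff = 0.01, difference_cutoff = log(1.5))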
Run quality control checks
Description
EZQC() assesses multiple aspects of your NR-seq data and generates a number
of plots visualizing dataset-wide trends.
Usage
EZQC(obj, ...)
Arguments
obj |
EZbakRData or EZbakRFractions object. |
... |
Parameters passed to the class-specific method.
If you have provided an EZbakRFractions object, then these can be (all play the
same role as in
If you have provided an EZbakRData object, then these can be (all same the same
purpose as in
|
Details
EZQC() checks the following aspects of your NR-seq data. If you have passed
an EZbakRData object, then EZQC() checks:
Raw mutation rates: In all sequencing reads, how many T's in the reference were a C in the read? The hope is that raw mutation rates are higher in all +label samples than in -label controls. Higher raw mutation rates, especially when using standard label times (e.g., 2 hours or more in mammalian systems), are typically a sign of good label incorporation and low labeled RNA/read dropout. If you don't have -label samples, know that background mutation rates are typically less than 0.2%, so +label raw mutation rates several times higher than this would be preferable.
Mutation rates in labeled and unlabeled reads: The raw mutation rate counts all mutations in all reads. In a standard NR-seq experiment performed with a single metabolic label, there are typically two populations of reads: 1) those from labeled RNA, having higher mutation rates due to chemical conversion/recoding of the metabolic label, and 2) those from unlabeled RNA, having lower, background levels of mutations. EZbakR fits a two-component mixture model to estimate the mutation rates in these two populations separately (see the sketch at the end of this section). A successful NR-seq experiment should have a labeled read mutation rate of > 1% and a low background mutation rate of < 0.3%.
Read count replicate correlation: Simply the log10 read count correlation for replicates, as inferred from your metadf.
If you have passed an EZbakRFractions object, i.e., the output of EstimateFractions(),
then in addition to the checks in the EZbakRData input case, EZQC() also checks:
Fraction labeled distribution: This is the distribution of feature-wise fraction labeled's (or fraction high mutation content's) estimated by EstimateFractions(). The "ideal" is a distribution with mean around 0.5, as this maximizes the amount of RNA with synthesis and degradation kinetics within the dynamic range of the experiment. In practice, you will (and should) be at least a bit lower than this, as longer label times risk physiological impacts of metabolic labeling.
Fraction labeled replicate correlation: This is the logit(fraction labeled) correlation between replicates, as inferred from your metadf.
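As a sketch of the two-component mixture idea referenced in the first list above (simulated read-level mutation counts; all names are illustrative, and EZbakR's actual estimation differs in detail):
set.seed(1)
fn <- 0.4; pnew <- 0.05; pold <- 0.002; nT <- 30
# Reads come from labeled RNA (mutation rate pnew) with probability fn,
# else from unlabeled RNA (background rate pold)
labeled <- rbinom(1000, 1, fn)
muts    <- rbinom(1000, nT, ifelse(labeled == 1, pnew, pold))
# Two-component binomial mixture log-likelihood
mix_ll <- function(fn, pnew, pold) {
  sum(log(fn * dbinom(muts, nT, pnew) +
            (1 - fn) * dbinom(muts, nT, pold)))
}
mix_ll(0.4, 0.05, 0.002)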
Value
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Run QC
QC <- EZQC(ezbdo)
Run quality control checks
Description
Run quality control checks
Usage
## S3 method for class 'EZbakRArrowData'
EZQC(
obj,
mutrate_populations = "all",
features = "all",
filter_cols = "all",
filter_condition = `&`,
remove_features = c("NA", "__no_feature"),
...
)
Arguments
obj |
An EZbakRArrowData object |
mutrate_populations |
Same as in |
features |
Same as in |
filter_cols |
Same as in |
filter_condition |
Same as in |
remove_features |
Same as in |
... |
Additional arguments. Currently goes unused. |
Value
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
Run quality control checks
Description
Run quality control checks
Usage
## S3 method for class 'EZbakRData'
EZQC(
obj,
mutrate_populations = "all",
features = "all",
filter_cols = "all",
filter_condition = `&`,
remove_features = c("NA", "__no_feature"),
...
)
Arguments
obj |
An EZbakRData object |
mutrate_populations |
Same as in |
features |
Same as in |
filter_cols |
Same as in |
filter_condition |
Same as in |
remove_features |
Same as in |
... |
Additional arguments. Currently goes unused. |
Value
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
Run quality control checks
Description
Run quality control checks
Usage
## S3 method for class 'EZbakRFractions'
EZQC(obj, features = NULL, populations = NULL, fraction_design = NULL, ...)
Arguments
obj |
EZbakRFractions object, which is an EZbakRData object on which
|
features |
Set of features analyzed in the fractions table you are
interested in QCing. This gets passed to |
populations |
Set of mutation types analyzed in the fractions table
you are interested in QCing. This gets passed to |
fraction_design |
The fraction "design matrix" specified to get the
fractions table you are interested in QCing. This gets passed to |
... |
Additional arguments. Currently goes unused. |
Value
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
Simulate NR-seq data for multiple replicates of multiple biological conditions
Description
EZSimulate() is a user-friendly wrapper around SimulateMultiCondition(). It
sets convenient defaults so as to quickly generate easy-to-interpret output.
EZSimulate() has all of the same parameters as SimulateMultiCondition(),
but it also has a number of additional parameters that guide its default behavior
and allow you to simulate multi-condition data without specifying the multiple,
sometimes complex, arguments that you would need to specify in SimulateMultiCondition()
to get the same behavior. In particular, users only have to set a single parameter,
nfeatures (number of features to simulate data for), by default. The EZSimulate()-unique
parameters ntreatments and nreps have default values that guide the simulation in the
case where only nfeatures is specified. In particular, nreps replicates of each of
ntreatments different conditions will be simulated, with the assumed model
log(kdeg) ~ treatment and log(ksyn) ~ 1. In other words, different kdeg values will be
simulated for each treatment level, and ksyn values will not differ across conditions.
Usage
EZSimulate(
nfeatures,
mode = c("standard", "dynamics"),
ntreatments = ifelse(mode == "standard", 2, 1),
nreps = 3,
nctlreps = 1,
metadf = NULL,
mean_formula = NULL,
param_details = NULL,
seqdepth = nfeatures * 2500,
label_time = 2,
pnew = 0.05,
pold = 0.001,
readlength = 200,
Ucont_alpha = 25,
Ucont_beta = 75,
feature_prefix = "Gene",
dispslope = 5,
dispint = 0.01,
logkdegsdtrend_slope = -0.3,
logkdegsdtrend_intercept = -2.25,
logksynsdtrend_slope = -0.3,
logksynsdtrend_intercept = -2.25,
logkdeg_mean = -1.9,
logkdeg_sd = 0.7,
logksyn_mean = 2.3,
logksyn_sd = 0.7,
logkdeg_diff_avg = 0,
logksyn_diff_avg = 0,
logkdeg_diff_sd = 0.5,
logksyn_diff_sd = 0.5,
pdiff_kd = 0.1,
pdiff_ks = 0,
pdiff_both = 0,
pdo = 0,
dynamics_preset = c("preRNA", "nuc2cyto", "preRNAwithPdeg", "nuc2cytowithNdeg",
"subtlseq", "nuc2cytowithpreRNA"),
unassigned_name = "__no_feature",
dispersion = 1000,
lfn_sd = 0.2,
treatment_effects = NULL,
effect_avg_default = 0,
effect_sd_default = 0.5,
fraction_affected_default = 0.5,
log_means = NULL,
log_sds = NULL
)
Arguments
nfeatures |
Number of "features" (e.g., genes) for which to simulated data. |
mode |
Currently, EZSimulate can simulate in two modes: "standard" and "dynamics".
The former is the default and involves simulating multiple conditions of standard NR-seq data.
"dynamics" calls |
ntreatments |
Number of distinct treatments to simulate. This parameter is
only relevant if |
nreps |
Number of replicates of each treatment to simulate. This parameter is
only relevant if |
nctlreps |
Number of -s4U replicates of each treatment to simulate. This parameter
is only relevant if |
metadf |
A data frame with the following columns:
These parameters (described more below) can also be included in metadf to specify sample-specific simulation parameters:
|
mean_formula |
A formula object that specifies the linear model used to
relate the factors in the |
param_details |
A data frame with one row for each column of the design matrix
obtained from
If param_details is not specified by the user, the first column of the design matrix
is assumed to represent the reference parameter, all parameters are assumed to be
non-global, logkdeg_mean and logksyn_mean are set to the equivalently named parameter values
described below for the reference and |
seqdepth |
Total number of reads in each sample. |
label_time |
Length of s4U feed to simulate. |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont_alpha |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape1 = |
Ucont_beta |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape2 = |
feature_prefix |
Name given to the i-th feature is |
dispslope |
Negative binomial dispersion parameter "slope" with respect to read counts. See DESeq2 paper for dispersion model used. |
dispint |
Negative binomial dispersion parameter "intercept" with respect to read counts. See DESeq2 paper for dispersion model used. |
logkdegsdtrend_slope |
Slope for log10(read count) vs. log(kdeg) replicate variability trend |
logkdegsdtrend_intercept |
Intercept for log10(read count) vs. log(kdeg) replicate variability trend |
logksynsdtrend_slope |
Slope for log10(read count) vs. log(ksyn) replicate variability trend |
logksynsdtrend_intercept |
Intercept for log10(read count) vs. log(ksyn) replicate variability trend |
logkdeg_mean |
Mean of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logkdeg_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logksyn_mean |
Mean of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logksyn_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logkdeg_diff_avg |
Mean of normal distribution from which non-reference log(kdeg)
linear model parameters are drawn from for each feature if |
logksyn_diff_avg |
Mean of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
logkdeg_diff_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter are drawn from for each feature if |
logksyn_diff_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
pdiff_kd |
Proportion of features for which non-reference log(kdeg) linear model parameters differ from the reference. |
pdiff_ks |
Proportion of features for which non-reference log(ksyn) linear model parameters differ from the reference. |
pdiff_both |
Proportion of features for which BOTH non-reference log(kdeg) and log(ksyn) linear model parameters differ from the reference. |
pdo |
Dropout rate; think of this as the probability that an s4U-containing
molecule is lost during library preparation and sequencing. If |
dynamics_preset |
Which preset model to use for simulation of dynamics.
Therefore, only relevant if
|
unassigned_name |
String to give to reads not assigned to a given feature. |
dispersion |
Negative binomial |
lfn_sd |
Logit(fn) replicate variability. |
treatment_effects |
Data frame describing effects of treatment on each
parameter. Should have five columns: "parameter_index", "treatment_index", "mean", "sd",
and "fraction_affected".
Each row corresponds to the effect the ith (i = treatment_index) treatment has on the
jth (j = parameter_index) kinetic parameter. Effect sizes, on a log-scale, are drawn from
a Normal distribution with mean and standard deviation set by the mean and sd columns,
respectively. The number of non-zero effects is set by "fraction_affected", and is
equal to |
effect_avg_default |
If |
effect_sd_default |
If |
fraction_affected_default |
If |
log_means |
Vector of log-Normal logmeans from which the distribution of
feature-specific parameters will be drawn from. Length of vector should be the same
as max(entries in |
log_sds |
Vector of log-Normal logsds from which the distribution of feature-specific parameters will be drawn from. If not provided, will be 0.4 for all parameters. |
Value
A list containing 5 elements:
cB: Tibble that can be provided as the cB arg to EZbakRData().
metadf: Tibble that can be provided as the metadf arg to EZbakRData().
PerRepTruth: Tibble containing replicate-by-replicate simulated ground truth.
AvgTruth: Tibble containing average simulated ground truth.
param_details: Tibble containing information about simulated linear model parameters.
Examples
# Simulate standard data
simdata_standard <- EZSimulate(30)
# Simulate dynamical systems data
simdata_ode <- EZSimulate(30,
mode = "dynamics",
ntreatments = 1,
label_time = c(1, 3),
dynamics_preset = "nuc2cyto")
Make a volcano plot from an EZbakR comparison
Description
Make a plot of effect size (x-axis) vs. multiple-test adjusted p-value (y-axis), coloring points by position relative to user-defined decision cutoffs.
Usage
EZVolcanoPlot(
obj,
parameter = "log_kdeg",
design_factor = NULL,
reference = NULL,
experimental = NULL,
param_name = NULL,
param_function = NULL,
features = NULL,
condition = NULL,
normalize_by_median = NULL,
repeatID = NULL,
exactMatch = TRUE,
plotlog2 = TRUE,
FDR_cutoff = 0.05,
difference_cutoff = log(2),
size = NULL,
features_to_highlight = NULL,
highlight_shape = 21,
highlight_size_diff = 1,
highlight_stroke = 0.7,
highlight_fill = NA,
highlight_color = "black"
)
Arguments
obj |
An object of class |
parameter |
Name of parameter whose comparison you want to plot. |
design_factor |
Name of factor from |
reference |
Name of reference |
experimental |
Name of |
param_name |
If you want to assess the significance of a single parameter, rather than the comparison of two parameters, specify that one parameter's name here. |
param_function |
NOT YET IMPLEMENTED. Will allow you to specify more complicated functions of parameters when hypotheses you need to test are combinations of parameters rather than individual parameters or simple differences in two parameters. |
features |
Character vector of feature names for which comparisons were made. |
condition |
Defunct parameter that has been replaced with |
normalize_by_median |
Whether or not the median was subtracted from the estimated parameter differences. |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
plotlog2 |
If TRUE, assume that log(parameter) difference is passed in and that you want to plot log2(parameter) difference. TO-DO: probably best to change this to a more general scale parameter by which the parameter is multiplied. Default would be log2(exp(1)) to convert log() to log2(). |
FDR_cutoff |
False discovery cutoff by which to color points. |
difference_cutoff |
Minimum absolute difference cutoff by which to color points. |
size |
Size of points, passed to |
features_to_highlight |
Features you want to highlight in the plot (black circle will be drawn around them). This can either be a data frame with one column per feature type in the comparison table you are visualizing, or a vector of feature names if the relevant comparison table will only have one feature type noted. |
highlight_shape |
Shape of points overlayed on highlighted features. Defaults to an open circle |
highlight_size_diff |
Sets how much larger the points overlaid on the highlighted features should be than the original plot points. |
highlight_stroke |
Stroke width of the points overlayed on the highlighted features. |
highlight_fill |
Fill color of the points overlayed on the highlighted
features. Default is for them to be fill-less ( |
highlight_color |
Stroke color of the points overlaid on the highlighted features. |
Details
EZVolcanoPlot() accepts as input the output of CompareParameters(), i.e.,
an EZbakRData object with at least one "comparisons" table. It will plot
the "difference" column in this table versus -log10 of the "padj" column.
In the simplest case, "difference" represents a log-fold change in a kinetic
parameter (e.g., kdeg) estimate. More complicated linear model fits and
comparisons can yield different parameter estimates.
NOTE: some outputs of CompareParameters() are not meant for visualization
via a volcano plot. For example, when fitting certain interaction models,
some of the parameter estimates may represent average log(kinetic parameter)
in one condition. See discussion of one example of this here.
EZbakR estimates kinetic parameters in EstimateKinetics() and EZDynamics()
on a log-scale. By default, since log2-fold changes are a bit easier to interpret
and more common for these kind of visualizations, EZVolcanoPlot() multiplies
the x-axis value by log2(exp(1)), which is the factor required to convert from
a log to a log2 scale. You can turn this off by setting plotlog2 to FALSE.
Value
A ggplot2 object. X-axis = log2(estimate of interest (e.g., fold-change
in degradation rate constant); Y-axis = -log10(multiple test adjusted p-value);
points colored by location relative to FDR and effect size cutoffs.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Estimate Kinetics
ezbdo <- EstimateKinetics(ezbdo)
# Average estimates across replicates
ezbdo <- AverageAndRegularize(ezbdo)
# Compare parameters across conditions
ezbdo <- CompareParameters(
ezbdo,
design_factor = "treatment",
reference = "treatment1",
experimental = "treatment2"
)
# Make volcano plot (ggplot object that you can save and add/modify layers)
EZVolcanoPlot(ezbdo)
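# Since the return value is a ggplot object, layers can be added:
library(ggplot2)
EZVolcanoPlot(ezbdo) +
  ggtitle("treatment2 vs. treatment1")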
EZbakRArrowData object helper function for users
Description
EZbakRArrowData creates an object of class EZbakRArrowData and checks the validity
of the provided input.
Usage
EZbakRArrowData(cBds, metadf)
Arguments
cBds |
ArrowDataset with the following fields:
|
metadf |
Data frame detailing various aspects of each of the samples included in the cBds. This includes:
|
Value
An EZbakRArrowData object. This is simply a list of the provided cBds and
metadf with class EZbakRArrowData.
Examples
# Load dependency
library(arrow)
simdata <- EZSimulate(30)
# Create directory to write dataset to
outdir <- tempdir()
dataset_dir <- file.path(outdir, "arrow_dataset")
# Create dataset
write_dataset(
simdata$cB,
path = dataset_dir,
format = "parquet",
partitioning = "sample"
)
# Create EZbakRArrowData object
ds <- open_dataset(dataset_dir)
ezbdo <- EZbakRArrowData(ds,
simdata$metadf)
EZbakRData object helper function for users
Description
EZbakRData creates an object of class EZbakRData and checks the validity
of the provided input.
Usage
EZbakRData(cB, metadf)
Arguments
cB |
Data frame with the following columns:
|
metadf |
Data frame detailing various aspects of each of the samples included in the cB. This includes:
|
Value
An EZbakRData object. This is simply a list of the provided cB and
metadf with class EZbakRData.
Examples
# Simulate data
simdata <- EZSimulate(30)
# Create EZbakRData object
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
EZbakRFractions helper function for users
Description
EZbakRFractions creates an object of class EZbakRFractions and checks the validity
of the provided input.
Usage
EZbakRFractions(fractions, metadf, name = NULL, character_limit = 20)
Arguments
fractions |
Data frame with the following columns:
|
metadf |
Data frame detailing various aspects of each of the samples included in the fractions data frame. This includes:
|
name |
Optional; name to give to fractions table. |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
Value
An EZbakRFractions object. This is simply a list of the provided fractions and
metadf with class EZbakRFractions.
Examples
# Simulate data
simdata <- EZSimulate(30)
# Get fractions table by estimating (for demonstration)
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
ezbdo <- EstimateFractions(ezbdo)
fxns <- EZget(ezbdo, type = "fractions")
# Create EZbakRFractions object
ezbfo <- EZbakRFractions(fxns, simdata$metadf)
EZbakRKinetics helper function for users
Description
EZbakRKinetics creates an object of class EZbakRKinetics and checks the validity
of the provided input.
Usage
EZbakRKinetics(kinetics, metadf, features, name = NULL, character_limit = 20)
Arguments
kinetics |
Data frame with the following columns:
|
metadf |
Data frame detailing various aspects of each of the samples included in the kinetics data frame. This includes:
|
features |
Features tracked in |
name |
Optional; name to give to fractions table. |
character_limit |
If name is chosen automatically, limit on the number of
characters in said name. If default name yields a string longer than this,
then kinetics table will be named |
Value
An EZbakRKinetics object. This is simply a list of the provided kinetics and
metadf with class EZbakRKinetics
Examples
# Simulate data
simdata <- EZSimulate(30)
# Get kinetics table by estimating (for demonstration)
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
ezbdo <- EstimateFractions(ezbdo)
ezbdo <- EstimateKinetics(ezbdo)
kinetics <- EZget(ezbdo, type = "kinetics")
# Create EZbakRKinetics object
ezbko <- EZbakRKinetics(kinetics, simdata$metadf, features = "feature")
Easily get EZbakR table of estimates of interest
Description
EZget() returns a table of interest from your EZbakRData object. It is meant
to make it easier to find and access certain analyses, as a single EZbakRData
object may include analyses of different features, kinetic parameters, dynamical
systems models, comparisons, etc.
Usage
EZget(
obj,
type = c("fractions", "kinetics", "readcounts", "averages", "comparisons", "dynamics"),
features = NULL,
populations = NULL,
fraction_design = NULL,
isoforms = NULL,
kstrat = NULL,
parameter = NULL,
counttype = NULL,
design_factor = NULL,
dynamics_design_factors = NULL,
scale_factors = NULL,
cstrat = NULL,
feature_lengths = NULL,
experimental = NULL,
reference = NULL,
param_name = NULL,
param_function = NULL,
formula_mean = NULL,
sd_grouping_factors = NULL,
fit_params = NULL,
repeatID = NULL,
sub_features = NULL,
grouping_features = NULL,
sample_feature = NULL,
modeled_to_measured = NULL,
graph = NULL,
normalize_by_median = NULL,
deconvolved = NULL,
returnNameOnly = FALSE,
exactMatch = FALSE,
alwaysCheck = FALSE
)
Arguments
obj |
EZbakRData object |
type |
The class of EZbakR outputs that you would like to search through. Equivalent to the name of the list in the EZbakRData object that contains the tables of interest. |
features |
Features that must be present in the table of interest.
If |
populations |
Only relevant if |
fraction_design |
Only relevant if |
isoforms |
If the relevant table is the result of isoform deconvolution |
kstrat |
Only relevant if |
parameter |
Only relevant if |
counttype |
String denoting what type of read count information you are looking for. Current options are "TMM_normalized", "transcript", and "matrix". TO-DO: Not sure this is being used in any way currently... |
design_factor |
design_factor specified in relevant run of |
dynamics_design_factors |
design_factors included in final |
scale_factors |
Sample group scale factors used in |
cstrat |
Strategy used for comparative analyses. Can be:
|
feature_lengths |
Table of feature lengths used for length normalization. |
experimental |
Experimental condition specified in relevant run of |
reference |
Reference condition specified in relevant run of |
param_name |
Parameter name specified in relevant run of |
param_function |
Function of parameters specified in relevant run of |
formula_mean |
An R formula object specifying how the |
sd_grouping_factors |
What metadf columns should data be grouped by when estimating standard deviations across replicates? Therefore, only relevant if type == "averages". |
fit_params |
Character vector of names of parameters in linear model fit. Therefore, only relevant if type == "averages". |
repeatID |
Numerical ID for duplicate objects with same metadata. |
sub_features |
Only relevant if type == "dynamics". Feature columns
that distinguished between the different measured species when running
|
grouping_features |
Only relevant if type == "dynamics". Features
that were the overarching feature assignments by which |
sample_feature |
Only relevant if type == "dynamics". Name of the metadf
column that distinguished the different classes of samples when running
|
modeled_to_measured |
Only relevant if type == "dynamics". Specifies
the relationship between |
graph |
Only relevant if type == "dynamics". NxN adjacency matrix,
where N represents the number of species modeled when running |
normalize_by_median |
Whether median difference was subtracted from estimated kinetic parameter difference. Thus, only relevant if type == "comparisons". |
deconvolved |
Only relevant if type == "fractions". Boolean that is TRUE if
fractions table is result of performing multi-feature deconvolution with
|
returnNameOnly |
If TRUE, then only the names of tables that passed your
search criteria will be returned. Else, the single table passing your search
criteria will be returned. If there is more than one table that passes your
search criteria and |
exactMatch |
If TRUE, then for |
alwaysCheck |
If TRUE, then even if there is only a single table for the |
Details
The input to EZget() is 1) the type of table you want to get ("fractions",
"kinetics", "readcounts", "averages", "comparisons", or "dynamics") and 2) the metadata necessary
to uniquely specify the table of interest. Above, every available piece of metadata
that can be specified for this purpose is documented. You only need to specify the
minimum information necessary. For example, if you would like to get a "fractions"
table from an analysis of exon bins (feature == "exon_bins", and potentially
other overarching features like "XF", "GF", or "rname"), and none of your other
"fractions" tables includes exon_bins as a feature, then you can get this table
with EZget(ezbdo, type = "fractions", features = "exon_bins"), where ezbdo
is your EZbakRData object.
As another example, imagine you want to get a "kinetics" table from an analysis
of gene-wise kinetic parameters (e.g., features == "XF"). You may have multiple
"kinetics" tables, all with "XF" as at least one of their features. If all of the
other tables have additional features though, then you can tell EZget() that
"XF" is the only feature present in your table of interest by setting exactMatch
to TRUE, which tells EZget() that the metadata you specify should exactly match
the relevant metadata for the table of interest. So the call in this case would
look like EZget(ezbdo, type = "kinetics", features = "XF", exactMatch = TRUE).
EZget() is used internally in almost every single EZbakR function to specify
the input table for each analysis. Thus, the usage and metadata described here
also applies to all functions that require you to specify which table you want
to use (e.g., EstimateKinetics(), AverageAndRegularize(), CompareParameters(),
etc.).
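When in doubt, returnNameOnly = TRUE offers a quick way to see which tables match a loose search before committing to one (a minimal sketch, assuming an EZbakRData object ezbdo with multiple kinetics tables):
# List the names of all kinetics tables whose features include "XF"
EZget(ezbdo, type = "kinetics", features = "XF", returnNameOnly = TRUE)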
Value
Table of interest from the relevant EZbakRdata list (set by the
type parameter).
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Estimate Kinetics
ezbdo <- EstimateKinetics(ezbdo)
# Average log(kdeg) estimates across replicates
ezbdo <- AverageAndRegularize(ezbdo)
# Average log(ksyn) estimates across replicates
ezbdo <- AverageAndRegularize(ezbdo, parameter = "log_ksyn")
# Compare log(kdeg) across conditions
ezbdo <- CompareParameters(
ezbdo,
design_factor = "treatment",
reference = "treatment1",
experimental = "treatment2"
)
# Get the one and only fractions object
fxns <- EZget(ezbdo, type = "fractions")
# Get the log(ksyn) averages table
ksyn_avg <- EZget(ezbdo, type = "averages", parameter = "log_ksyn")
Make a PCA plot from an EZbakR dataset
Description
Perform principal component analysis (PCA) on either fraction labeled estimates or read counts, plotting the first two principal components and coloring points by sample characteristics recorded in the metadf.
Usage
EZpcaPlot(
obj,
data_type = c("fraction_labeled", "reads"),
features = NULL,
exactMatch = TRUE,
variance_decile = 7,
center = TRUE,
scale = TRUE,
point_size = 3,
metadf_cols_to_use = "all"
)
Arguments
obj |
An object of class |
data_type |
Specifies what data to use for the PCA. Options are "fraction_labeled" (default; means using fraction high T-to-C or other mutation type estimate from EZbakR) or "reads" (means using log10(read counts + 1)). |
features |
Character vector of feature names for which comparisons were made. |
exactMatch |
If TRUE, then |
variance_decile |
Integer between (inclusive) 1 and 9. Features with sample-to-sample
variance greater than the nth decile (n = |
center |
From |
scale |
From |
point_size |
Size of points in PCA plot |
metadf_cols_to_use |
Columns in the EZbakR metadf that will be used to color points in the PCA plot. Points will be colored by the interaction between all of these columns (i.e., samples with unique combinations of values of these columns will get unique colors). Default is to use all columns (except "sample"), specified as "all". |
Details
EZpcaPlot() accepts as input an EZbakRData object on which EstimateFractions()
has been run. PCA is then performed on either the fraction labeled estimates
(data_type = "fraction_labeled") or on log10(read counts + 1)
(data_type = "reads"), with centering and scaling controlled by the
center and scale parameters.
To focus the plot on the most informative features, only features whose
sample-to-sample variance exceeds the decile set by variance_decile are
retained before performing PCA.
The first two principal components are plotted, with points colored by the
interaction of the metadf columns specified in metadf_cols_to_use.
Value
A ggplot2 object.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Make PCA plot (ggplot object that you can save and add/modify layers)
EZpcaPlot(ezbdo)
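The data used for the PCA can also be switched from fraction labeled estimates to read counts via the documented data_type option:
# PCA on log10(read counts + 1) instead of fraction labeled estimates
EZpcaPlot(ezbdo, data_type = "reads")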
Estimate fractions of each RNA population.
Description
The first step of any NR-seq analysis is to figure out the fraction of reads
from each mutational population in your data. For example, if you are performing
a standard SLAM-seq or TimeLapse-seq experiment, this means estimating the fraction
of reads with high T-to-C mutation content, and the fraction with low T-to-C mutation
content. This is what EstimateFractions is for.
Usage
EstimateFractions(
obj,
features = "all",
mutrate_populations = "all",
fraction_design = NULL,
Poisson = TRUE,
strategy = c("standard", "hierarchical"),
filter_cols = "all",
filter_condition = `&`,
remove_features = c("NA", "__no_feature"),
split_multi_features = FALSE,
multi_feature_cols = NULL,
multi_feature_sep = "+",
pnew_prior_mean = -2.94,
pnew_prior_sd = 0.3,
pold_prior_mean = -6.5,
pold_prior_sd = 0.5,
hier_readcutoff = 300,
init_pnew_prior_sd = 0.8,
pnew_prior_sd_min = 0.01,
pnew_prior_sd_max = 0.15,
pold_est = NULL,
pold_from_nolabel = FALSE,
grouping_factors = NULL,
character_limit = 20,
overwrite = TRUE
)
## S3 method for class 'EZbakRData'
EstimateFractions(
obj,
features = "all",
mutrate_populations = "all",
fraction_design = NULL,
Poisson = TRUE,
strategy = c("standard", "hierarchical"),
filter_cols = "all",
filter_condition = `&`,
remove_features = c("NA", "__no_feature"),
split_multi_features = FALSE,
multi_feature_cols = NULL,
multi_feature_sep = "+",
pnew_prior_mean = -2.94,
pnew_prior_sd = 0.3,
pold_prior_mean = -6.5,
pold_prior_sd = 0.5,
hier_readcutoff = 300,
init_pnew_prior_sd = 0.3,
pnew_prior_sd_min = 0.01,
pnew_prior_sd_max = 0.15,
pold_est = NULL,
pold_from_nolabel = FALSE,
grouping_factors = NULL,
character_limit = 20,
overwrite = TRUE
)
## S3 method for class 'EZbakRArrowData'
EstimateFractions(
obj,
features = "all",
mutrate_populations = "all",
fraction_design = NULL,
Poisson = TRUE,
strategy = c("standard", "hierarchical"),
filter_cols = "all",
filter_condition = `&`,
remove_features = c("NA", "__no_feature"),
split_multi_features = FALSE,
multi_feature_cols = NULL,
multi_feature_sep = "+",
pnew_prior_mean = -2.94,
pnew_prior_sd = 0.3,
pold_prior_mean = -6.5,
pold_prior_sd = 0.5,
hier_readcutoff = 300,
init_pnew_prior_sd = 0.8,
pnew_prior_sd_min = 0.01,
pnew_prior_sd_max = 0.15,
pold_est = NULL,
pold_from_nolabel = FALSE,
grouping_factors = NULL,
character_limit = 20,
overwrite = TRUE
)
Arguments
obj |
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
mutrate_populations |
Character vector of the set of mutational populations that you want to infer the rates of mutations for. By default, all mutation rates are estimated for all populations present in cB. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the If you call the function
For example, assume you are doing a typical TimeLapse-seq/SLAM-seq/TUC-seq/etc. experiment
where you have fed cells with s^4U and recoded any incorporated s^4U to a nucleotide
that reverse transcriptase will read as a cytosine. That means that As another example, consider TILAC, a NR-seq extension developed by the Simon lab. TILAC was originally
described in Courvan et al., 2022. In this
method, two populations of RNA, one from s^4U fed cells and one from s^6G fed cells, are pooled
and prepped for sequencing together. This allows for internally controlled comparisons of RNA
abundance without spike-ins. s^4U is recoded to a cytosine analog by TimeLapse chemistry
(or similar chemistry) and s^6G is recoded to an adenine analog. Thus, |
Poisson |
If |
strategy |
String denoting which new read mutation rate estimation strategy to use. Options include:
|
filter_cols |
Which feature columns should be used to filter out feature-less reads. The default value of "all" checks all feature columns for whether or not a read failed to get assigned to said feature. |
filter_condition |
Only two possible values for this make sense: |
remove_features |
All of the feature names that could indicate failed assignment of a read to a given feature. The fastq2EZbakR pipeline uses a value of '__no_feature'. |
split_multi_features |
If a set of reads maps ambiguously to multiple features,
should data for such reads be copied for each feature in the ambiguous set? If this is
|
multi_feature_cols |
Character vector of columns that have the potential to
include assignment to multiple features. Only these columns will have their features split
if |
multi_feature_sep |
String representing how ambiguous feature assignments are distinguished in the feature names. For example, the default value of "+" denotes that if a read maps to multiple features (call them featureA and featureB, for example), then the feature column will have a value of "featureA+featureB" for that read. |
pnew_prior_mean |
Mean for logit(pnew) prior. |
pnew_prior_sd |
Standard deviation for logit(pnew) prior. |
pold_prior_mean |
Mean for logit(pold) prior. |
pold_prior_sd |
Standard deviation for logit(pold) prior. |
hier_readcutoff |
If |
init_pnew_prior_sd |
If |
pnew_prior_sd_min |
The minimum logit(pnew) prior standard deviation when |
pnew_prior_sd_max |
Similar to |
pold_est |
Background mutation rate estimates if you have them. Can either be a single number applied to all samples or a named vector of values, where the names should be sample names. |
pold_from_nolabel |
Fix background mutation rate estimate to mutation rates seen in -label samples.
By default, a single background rate is used for all samples, inferred from the average mutation rate
across all -label samples. The |
grouping_factors |
If |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
Details
EstimateFractions uses mixture modeling to estimate the fraction of reads from
each mutational population in your data, and this is done for each feature in your
data (i.e., combination of columns that specify genomic features from which reads
were derived). The set of mutational populations in your
data can be specified by providing a fraction_design object, described in more
depth above (also see ?create_fraction_design). There are several flavors
of mixture modeling that can be performed by EstimateFractions. These are
as follows:
The default: global mutation rate parameters (global = same for all reads from all features) are estimated for each sample by fitting a single two-component mixture model to all of the reads in that sample. These are used to estimate the fraction of reads from each feature that are from each mutational population, also using a two-component mixture model.
With Poisson set to TRUE, this is a nucleotide-content adjusted Poisson mixture model, which is a more efficient alternative to binomial mixture modeling.
With Poisson set to FALSE, this is a binomial mixture model.
Low mutation rate from -label: if pold_from_nolabel is TRUE, then the background, no-label mutation rates are estimated from -label samples. By default, a single set of background mutation rates is estimated for all samples, but you can change this behavior by setting grouping_factors to specify columns in your metadf by which samples should be stratified. This is a great strategy to use if you have low label incorporation rates or if you used a fairly short label time (see the sketch after this list).
Hierarchical model: if strategy == "hierarchical", which is currently only compatible with single-mutation-type modeling (e.g., standard T-to-C mutation modeling), then high T-to-C content mutation rates are estimated for each feature. The global sample-wide estimates are used as an informative prior to increase the accuracy of this process by avoiding extreme estimates.
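For instance, a minimal sketch of the -label background-rate strategy (this assumes your metadf includes -label samples, i.e., tl == 0, and a hypothetical "genotype" column by which to stratify them):
# Estimate background mutation rates from -label samples,
# stratifying samples by the (hypothetical) genotype metadf column
ezbdo <- EstimateFractions(ezbdo,
                           pold_from_nolabel = TRUE,
                           grouping_factors = "genotype")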
Value
An EZbakRFractions object, which is just an EZbakRData object
with a "fractions" list element. This will include tables of fraction estimates
(e.g., fraction of reads from the high T-to-C mutation rate population in a
standard single-label s4U experiment; termed fraction_highTC in the table) for
all features in all samples.
Methods (by class)
-
EstimateFractions(EZbakRData): Method for class EZbakRData. Estimates fractions using an entirely in-memory object. -
EstimateFractions(EZbakRArrowData): Method for class EZbakRArrowData. This is an alternative to the fully in-memory EZbakRData EstimateFractions method that can help with analyses of larger-than-RAM datasets. The provided "cB" is expected to be an on-disk Arrow Dataset. Furthermore, it is expected to be partitioned by sample name, which allows this method to read only a single sample's worth of data into memory at a time. This can significantly reduce RAM usage. Input object should be created with EZbakRArrowData().
Examples
# Simulate data to analyze
simdata <- SimulateOneRep(30)
# Create EZbakR input
metadf <- data.frame(sample = "sampleA", tl = 2)
ezbdo <- EZbakRData(simdata$cB, metadf)
# Estimate fractions
ezbdo <- EstimateFractions(ezbdo)
Estimate isoform-specific fraction news (or more generally "fractions").
Description
Combines the output of EstimateFractions with transcript isoform
quantification performed by an outside tool (e.g., RSEM, kallisto, salmon, etc.)
to infer transcript isoform-specific fraction news (or more generally fraction
of reads coming from a particular mutation population).
Usage
EstimateIsoformFractions(
obj,
features = NULL,
populations = NULL,
fraction_design = NULL,
repeatID = NULL,
exactMatch = TRUE,
fraction_name = NULL,
quant_name = NULL,
gene_to_transcript = NULL,
overwrite = TRUE,
TPM_min = 1,
count_min = 10
)
Arguments
obj |
An |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
fraction_name |
Name of fraction estimate table to use. Should be stored in the
|
quant_name |
Name of transcript isoform quantification table to use. Should be stored
in the obj$readcounts list under this name. Use |
gene_to_transcript |
Table with columns |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
TPM_min |
Minimum TPM for a transcript to be kept in analysis. |
count_min |
Minimum expected_count for a transcript to be kept in analysis. |
Details
EstimateIsoformFractions expects as input a "fractions" table with estimates
for transcript equivalence class (TEC) fraction news. A transcript equivalence class
is the set of transcript isoforms with which a sequencing read is compatible.
fastq2EZbakR is able to assign
reads to these equivalence classes so that EZbakR can estimate the fraction of
reads in each TEC that are from labeled RNA.
EstimateIsoformFractions estimates transcript isoform fraction news
by fitting a linear mixing model to the TEC fraction new estimates + transcript
isoform abundance estimates. In other words, each TEC fraction new (data) is modeled
as a weighted average of transcript isoform fraction news (parameters to estimate),
with the weights determined by the relative abundances of the transcript isoforms
in the TEC (data). The TEC fraction new is modeled as a Beta distribution with mean
equal to the weighted transcript isoform fraction new average and variance related
to the number of reads in the TEC.
Transcript isoforms with an estimated TPM less than TPM_min (1 by default)
or an estimated number of read counts less than count_min (10 by default)
are removed from TECs and will not have their fraction news estimated.
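To make the mixing model concrete, here is a toy calculation (purely illustrative numbers) of the expected fraction new for a TEC containing two isoforms:
# Isoform relative abundances within the TEC (weights sum to 1)
w <- c(0.7, 0.3)
# Isoform-specific fraction news (the parameters being estimated)
theta_iso <- c(0.8, 0.2)
# Expected TEC fraction new is the abundance-weighted average
theta_TEC <- sum(w * theta_iso) # 0.62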
Value
An EZbakRData object with an additional table under the "fractions"
list. Has the same form as the output of EstimateFractions(), and will have the
feature column "transcript_id".
Examples
# Load dependencies
library(dplyr)
# Simulates a single sample worth of data
simdata_iso <- SimulateIsoforms(nfeatures = 100)
# We have to manually create the metadf in this case
metadf <- tibble(sample = 'sampleA',
tl = 4,
condition = 'A')
ezbdo <- EZbakRData(simdata_iso$cB,
metadf)
ezbdo <- EstimateFractions(ezbdo)
### Hack in the true, simulated isoform levels
reads <- simdata_iso$ground_truth %>%
dplyr::select(transcript_id, true_count, true_TPM) %>%
dplyr::mutate(sample = 'sampleA',
effective_length = 10000) %>%
dplyr::rename(expected_count = true_count,
TPM = true_TPM)
# Name of table needs to have "isoform_quant" in it
ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads
### Perform deconvolution
ezbdo <- EstimateIsoformFractions(ezbdo)
Estimate kinetic parameters
Description
Simple kinetic parameter (kdeg and ksyn) estimation using fraction estimates
from EstimateFractions(). Several slightly different kinetic parameter
inference strategies are implemented.
Usage
EstimateKinetics(
obj,
strategy = c("standard", "tilac", "NSS", "shortfeed", "pulse-chase"),
features = NULL,
populations = NULL,
fraction_design = NULL,
repeatID = NULL,
exactMatch = TRUE,
grouping_factor = NULL,
reference_factor = NULL,
character_limit = 20,
feature_lengths = NULL,
exclude_pulse_estimates = TRUE,
scale_factors = NULL,
overwrite = TRUE
)
Arguments
obj |
An |
strategy |
Kinetic parameter estimation strategy. Options include:
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
grouping_factor |
Which sample-detail columns in the metadf should be used
to group samples by for calculating the average RPM ( |
reference_factor |
Which sample-detail column in the metafd should be used to
determine which group of samples provide information about the starting levels of RNA
for each sample? This should have values that match those in |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
exclude_pulse_estimates |
If |
scale_factors |
Data frame with two columns, "sample" and "scale_factor".
sample should be the sample name, and scale_factor is the factor which read counts
in that sample should be divided to get the normalized read counts. This should match
the convention from DESeq2 and thus you can provide, for example, scale factors derived
from DESeq2 for this. By default, EZbakR uses DESeq2's median of ratios normalization
method, but specifying |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
Details
EstimateKinetics() estimates the kinetics of RNA synthesis and degradation
from standard single-label NR-seq data. It also technically supports analyses
of the dual-label TILAC experiment, but that functionality is not as actively
tested or maintained as the more standard analyses.
EstimateKinetics() assumes a simple linear ODE model of RNA dynamics:
\frac{\text{dR}}{\text{dt}} = k_{\text{syn}} - k_{\text{deg}}\,\text{R}
where:
\text{R} = amount of RNA
k_{\text{syn}} = synthesis rate constant
k_{\text{deg}} = degradation rate constant
The general solution to this ODE is:
\text{R}(t) = \frac{k_{\text{syn}}}{k_{\text{deg}}} + \left(\text{R}(0) - \frac{k_{\text{syn}}}{k_{\text{deg}}}\right)e^{-k_{\text{deg}}\text{t}}
where \text{R}(0) is the initial RNA level.
The default kinetic parameter estimation strategy (strategy == "standard")
assumes that for labeled RNA (or more precisely RNA synthesized in the presence
of label) \text{R(0)} = 0. Thus, it assumes a pulse-label rather than
pulse-chase experimental design (I've written in several places about why
pulse-label designs are superior to pulse-chases in almost all settings).
For unlabeled RNA, it assumes that \text{R(0)} = \frac{k_{\text{syn}}}{k_{\text{deg}}}.
This is known as the steady-state assumption and is the key assumption of this method.
More broadly, assuming steady-state means that this method assumes that RNA levels
from a given feature are constant for the entire metabolic labeling duration.
As EZbakR is designed for analyses of bulk NR-seq data, "constant" means that the
average RNA levels across all profiled cells is constant (thus asynchronous populations
of dividing cells still count as steady-state, even if the RNA expression landscapes
in individual cells are quite dynamic). This assumption may be violated in cases
where labeling coincides with or closely follows a perturbation (e.g., drug treatment).
When the steady-state assumption holds, there is a simple relationship between the
fraction of reads from labeled RNA (\theta) and the turnover kinetics of the RNA:
\theta = 1 - e^{-k_{\text{deg}}t}
The power in this approach is thus that turnover kinetics are estimated from
an "internally normalized" quantity: \theta (termed the "fraction new",
"fraction labeled", "fraction high mutation content", or new-to-total ratio (NTR), depending
on where you look or who you ask). "Internally normalized" means that a normalization
scale factor does not need to be estimated in order to accurately infer turnover
kinetics. \theta represents a ratio of read counts from the same feature
in the same library, thus any scale factor would appear in both the numerator
and denominator of this estimate and cancel out.
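Inverting this relationship gives the estimator implied by the steady-state model (a minimal sketch; theta and tl stand in for a fraction new estimate and its label time):
# Degradation rate constant implied by a fraction new estimate
kdeg <- -log(1 - theta) / tl
# Corresponding RNA half-life
thalf <- log(2) / kdeg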
When this assumption is not valid, the strategy = "NSS" approach may be preferable.
This approach relies on the same model of RNA dynamics, but now assumes that the
initial RNA levels (\text{R(0)} for the unlabeled RNA) are not at the steady-state
levels dictated by the current synthesis and turnover kinetics. The idea for this
strategy was first presented in Narain et al. 2021.
To run this approach, you need to be able to estimate the initial RNA levels of
each label-fed sample, as the fraction of reads from labeled RNA will no longer
only reflect the turnover kinetics (as the old RNA is assumed to potentially
not have reached the new steady-state levels yet). This means that for each
label-fed sample, you need to have a sample whose assayed RNA population represents
the starting RNA population for the label-fed sample. EZbakR will use these
reference samples to infer \text{R(0)} for unlabeled RNAs and estimate
turnover kinetics from the ratio of this quantity to the remaining unlabeled RNA levels
after labeling:
\theta_{\text{NSS}} = \frac{\text{R(tl)}}{\text{R(0)}}
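Since unlabeled RNA decays exponentially under the model above, this ratio relates to the turnover kinetics as (a sketch of the implied estimator, not necessarily the exact implementation):
\theta_{\text{NSS}} = e^{-k_{\text{deg}}\text{tl}}, \text{ so } k_{\text{deg}} = -\frac{\log(\theta_{\text{NSS}})}{\text{tl}}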
While I really like the idea of this strategy, it is not without some severe limitations. For one, the major benefit of NR-seq, internally normalized estimation of turnover kinetics, is out the window. Thus, read counts need to be normalized between the relevant reference and label-fed sample pairs. In addition, the variance patterns of this ratio are somewhat unfortunate. Its variance is incredibly high for more stable RNAs, severely limiting the effective dynamic range of this approach relative to steady-state analyses. I continue to work to refine EZbakR's implementation of this approach, but for now I urge caution in its use and interpretation. See my discussion of an alternative approach for navigating non-steady-state data here.
As an aside, you may wonder how this strategy deals with dynamic regulation of synthesis and degradation rate constants during labeling. To answer this, you have to realize that the duration of metabolic labeling represents an integration time over which the best we can do is assess average kinetics. Thus, if rate constants are changing during the labeling, this strategy can still be thought of as providing an estimate of the time-averaged synthesis and turnover kinetics during the label time.
Pulse-chase support (strategy = "pulse-chase") is also implemented, though somewhat
begrudgingly: I don't like pulse-chase experiments and think they are
way more popular than they should be for purely historical reasons. But lots
of people keep doing pulse-chase NR-seq, so c'est la vie.
Value
EZbakRKinetics object, which is just an EZbakRData object with
a "kinetics" slot. This includes tables of kinetic parameter estimates for
each feature in each sample for which kinetic parameters can be estimated.
Examples
# Simulate data to analyze
simdata <- SimulateOneRep(30)
# Create EZbakR input
metadf <- data.frame(sample = "sampleA", tl = 2)
ezbdo <- EZbakRData(simdata$cB, metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Estimate Kinetics
ezbdo <- EstimateKinetics(ezbdo)
Estimate mutation rates
Description
Two-component mixture models are fit to all data to estimate global high and low mutation rates for all samples. Estimation of these mutation rates is regularized through the use of weakly informative priors whose parameters can be altered using the arguments defined below.
Usage
EstimateMutRates(
obj,
populations = "all",
pnew_prior_mean = -2.94,
pnew_prior_sd = 0.3,
pold_prior_mean = -6.5,
pold_prior_sd = 0.5,
pold_est = NULL,
pold_from_nolabel = FALSE,
grouping_factors = NULL
)
## S3 method for class 'EZbakRData'
EstimateMutRates(
obj,
populations = "all",
pnew_prior_mean = -2.94,
pnew_prior_sd = 0.3,
pold_prior_mean = -6.5,
pold_prior_sd = 0.5,
pold_est = NULL,
pold_from_nolabel = FALSE,
grouping_factors = NULL
)
## S3 method for class 'EZbakRArrowData'
EstimateMutRates(
obj,
populations = "all",
pnew_prior_mean = -2.94,
pnew_prior_sd = 0.3,
pold_prior_mean = -6.5,
pold_prior_sd = 0.5,
pold_est = NULL,
pold_from_nolabel = FALSE,
grouping_factors = NULL
)
Arguments
obj |
An |
populations |
Character vector of the set of mutational populations that you want to infer the fractions of. For example, say your cB file contains columns tracking T-to-C and G-to-A |
pnew_prior_mean |
logit-Normal mean for logit(pnew) prior. |
pnew_prior_sd |
logit-Normal sd for logit(pnew) prior. |
pold_prior_mean |
logit-Normal mean for logit(pold) prior. |
pold_prior_sd |
logit-Normal sd for logit(pold) prior. |
pold_est |
Background mutation rate estimates if you have them. Can either be a single number applied to all samples or a named vector of values, where the names should be sample names. |
pold_from_nolabel |
Fix background mutation rate estimate to mutation rates seen in -label samples.
By default, a single background rate is used for all samples, inferred from the average mutation rate
across all -label samples. The |
grouping_factors |
If |
Value
EZbakRData object with an added mutrates slot containing estimated
high and low mutation rates for each mutation type modeled.
Methods (by class)
-
EstimateMutRates(EZbakRData): Method for class EZbakRData. Estimates mutation rates using a fully in-memory object. -
EstimateMutRates(EZbakRArrowData): Method for class EZbakRArrowData. Estimates mutation rates using a partially on-disk object.
Examples
# Simulate data to analyze
simdata <- SimulateOneRep(30)
# Create EZbakR input
metadf <- data.frame(sample = "sampleA", tl = 2)
ezbdo <- EZbakRData(simdata$cB, metadf)
# Estimate mutation rates
mutrates <- EstimateMutRates(ezbdo)
Import transcript isoform quantification into EZbakRData object
Description
A convenient wrapper to tximport() for importing isoform quantification
data into an EZbakRData object. You need to run this before running
EstimateIsoformFractions.
Usage
ImportIsoformQuant(
obj,
files,
quant_tool = c("none", "salmon", "sailfish", "alevin", "piscem", "kallisto", "rsem",
"stringtie"),
txIn = TRUE,
...
)
Arguments
obj |
An |
files |
A named vector of paths to all transcript quantification files that you would like
to import. This will be passed as the first argument of |
quant_tool |
String denoting the type of software used to generate the abundances. Will
get passed to the |
txIn |
Whether or not you are providing isoform-level quantification files.
Alternative ( |
... |
Additional arguments to be passed to |
Value
An EZbakRData object with an additional element in the readcounts
list named "isform_quant_<quant_tool>". It contains TPM, expected_count,
and effective length information for each transcript_id and each sample.
Examples
# Dependencies for example
library(dplyr)
library(data.table)
# Simulate and analyze data
simdata <- EZSimulate(30)
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
ezbdo <- EstimateFractions(ezbdo)
# Hack to generate example quantification files
savedir <- tempdir()
rsem_data <- tibble(
transcript_id = paste0("tscript_feature", 1:30),
gene_id = paste0("feature", 1:30),
length = 1000,
effective_length = 1000,
expected_count = 1000,
TPM = 10,
FPKM = 10,
IsoPct = 1
)
fwrite(rsem_data, file.path(savedir, "Sample_1.isoforms.results"), sep = '\t')
files <- file.path(savedir,"Sample_1.isoforms.results")
names(files) <- "Sample_1"
# Read in file
ezbdo <- ImportIsoformQuant(ezbdo, files, quant_tool = "rsem")
Normalize for experimental/bioinformatic dropout of labeled RNA.
Description
Uses a strategy similar to that originally presented in Berg et al. 2024 to normalize for dropout. Normalizing for dropout means identifying a reference sample with low dropout, and estimating dropout in each sample relative to that sample.
Usage
NormalizeForDropout(
obj,
normalize_across_tls = FALSE,
grouping_factors = NULL,
features = NULL,
populations = NULL,
fraction_design = NULL,
repeatID = NULL,
exactMatch = TRUE,
read_cutoff = 25
)
Arguments
obj |
An EZbakRFractions object, which is an EZbakRData object on which
you have run |
normalize_across_tls |
If TRUE, samples with different label times will be normalized by identifying the sample with the maximum inferred degradation rate constant (kdeg) and using it as the reference. The degradation kinetics of this sample will then be used to infer reference fraction news at the other label times |
grouping_factors |
Which sample-detail columns in the metadf should be used
to group -s4U samples by for calculating the average -s4U RPM? The default value of
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
read_cutoff |
Minimum number of reads for a feature to be used to fit dropout model. |
Details
NormalizeForDropout() has a number of unique advantages relative to
CorrectDropout():
-
NormalizeForDropout() doesn't require -label control data. -
NormalizeForDropout() compares an internally normalized quantity (fraction new) across samples, which has some advantages over the absolute dropout estimates derived from comparisons of normalized read counts in CorrectDropout(). -
NormalizeForDropout() may be used to normalize half-life estimates across very different biological contexts (e.g., different cell types).
There are also some caveats to be aware of when using NormalizeForDropout():
Be careful using
NormalizeForDropout() when you have multiple different label times. Dropout normalization requires that each sample be compared to a reference sample with the same label time. Thus, normalization will be performed separately for groups of samples with different label times. If the extent of dropout in the references with different label times differs, there will still be unaccounted-for dropout biases between some of the samples.
NormalizeForDropout() effectively assumes that there are no true global differences in turnover kinetics of RNA. If such differences actually exist (e.g., half-lives in one context are on average truly lower than those in another), NormalizeForDropout() risks normalizing away these real differences. This is similar to how statistical normalization strategies implemented in differential expression analysis software like DESeq2 assume that there are no global differences in RNA levels.
By default, all samples with the same label time are normalized with respect
to a reference sample chosen from among them. If you want to further separate
the groups of samples that are normalized together, specify the columns of
your metadf by which you want to additionally group samples via the grouping_factors
parameter. This behavior can be changed by setting normalize_across_tls to
TRUE, which will instead select a single reference sample (the one with the
maximum inferred kdeg) and use its inferred degradation kinetics to compute
reference fraction news across all label times.
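A minimal sketch of further stratifying the normalization groups (assuming a hypothetical "cell_type" column in your metadf):
# Normalize samples separately within each cell type
ezbdo <- NormalizeForDropout(ezbdo, grouping_factors = "cell_type")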
Value
An EZbakRData object with the specified "fractions" table replaced
with a dropout corrected table.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Normalize for dropout
ezbdo <- NormalizeForDropout(ezbdo)
Simple simulation function
Description
Simple simulation function
Usage
SimpleSim(
nreads = 1000,
fn = 0.5,
pnew = 0.05,
pold = 0.001,
rlen = 100,
Ucont = 0.25
)
Arguments
nreads |
Number of reads to simulate |
fn |
Fraction of reads that are new in simulation. Whether a read will be new will be determined by a draw from a Bernoulli(fn) distribution. |
pnew |
T-to-C mutation rate in new reads |
pold |
T-to-C mutation rate in old reads |
rlen |
Length of simulated reads |
Ucont |
Fraction of nucleotides in simulated reads that are Ts (U in RNA) |
Value
Tibble with 3 columns:
nT: Simulated number of Ts
TC: Simulated number of T-to-C mutations
n: Number of simulated reads with nT Ts and TC mutations.
Examples
# Simulate 1 gene's worth of data
simdata <- SimpleSim()
Simulation of generalized dynamical system model.
Description
SimulateDynamics() simulates any specified dynamical system of interconverting
RNA species. Its required input is similar to that of EZDynamics(), i.e.,
an adjacency matrix describing the set of species and how they are related to
one another, and a list of formulas relating the actually assayed species to the
modeled species. Currently, SimulateDynamics() implements a naive heteroskedastic
replicate variability simulation and is not designed to simulate multiple experimental
conditions.
Usage
SimulateDynamics(
nfeatures,
graph,
metadf,
log_means,
log_sds,
ntreatments = 1,
treatment_effects = NULL,
formula_list = NULL,
unassigned_name = "__no_feature",
seqdepth = nfeatures * 2500,
dispersion = 100,
lfn_sd = 0.2,
effect_avg_default = 0,
effect_sd_default = 0.5,
fraction_affected_default = 0.5,
...
)
Arguments
nfeatures |
Number of "features" to simulate data for. A "feature" in this case may contain a number of "sub-features". For example, you may want to simulate pre-RNA and mature RNA for a set of "genes", in which case the number of features is the number of genes. |
graph |
An adjacency matrix describing the reaction diagram graph relating the various RNA species to one another. |
metadf |
Data frame with two required columns ( |
log_means |
Vector of log-Normal logmeans from which the distribution of
feature-specific parameters will be drawn from. Length of vector should be the same
as max(entries in |
log_sds |
Vector of log-Normal logsds from which the distribution of feature-specific parameters will be drawn from. |
ntreatments |
Number of distinct experimental treatments to simulate. By default, only a single "treatment" (you might refer to this as wild-type, or control) is simulated. Increase this if you would like to explore performing comparative dynamical systems modeling. |
treatment_effects |
Data frame describing effects of treatment on each
parameter. Should have five columns: "parameter_index", "treatment_index", "mean", "sd",
and "fraction_affected".
Each row corresponds to the effect the ith (i = treatment_index) treatment has on the
jth (j = parameter_index) kinetic parameter. Effect sizes, on a log-scale, are drawn from
a Normal distribution with mean and standard deviation set by the mean and sd columns,
respectively. The number of non-zero effects is set by "fraction_affected", and is
equal to |
formula_list |
A list of named lists. The names of each sub-list should be
the same as the sample names as they are found in |
unassigned_name |
String to give to reads not assigned to a given feature. |
seqdepth |
Total number of reads in each sample. |
dispersion |
Negative binomial |
lfn_sd |
Logit(fn) replicate variability. |
effect_avg_default |
If |
effect_sd_default |
If |
fraction_affected_default |
If |
... |
Parameters passed to |
Simulation of transcript isoform kinetic parameters.
Description
SimulateIsoforms() performs a simple simulation of isoform-specific kinetic
parameters to showcase and test EstimateIsoformFractions(). It assumes that
there are a set of reads (fraction of total set by funique parameter) which
map uniquely to a given isoform, while the rest are ambiguous to all isoforms
from that gene. Mutational content of these reads is simulated as in
SimulateOneRep().
Usage
SimulateIsoforms(
nfeatures,
nt = NULL,
seqdepth = nfeatures * 2500,
label_time = 4,
sample_name = "sampleA",
feature_prefix = "Gene",
pnew = 0.1,
pold = 0.002,
funique = 0.2,
readlength = 200,
Ucont = 0.25,
avg_numiso = 2,
psynthdiff = 0.5,
logkdeg_mean = -1.9,
logkdeg_sd = 0.7,
logksyn_mean = 2.3,
logksyn_sd = 0.7
)
Arguments
nfeatures |
Number of "features" to simulate data for. Each feature will have a simulated number of transcript isoforms |
nt |
(Optional), can provide a vector of the number of isoforms you would
like to simulate for each of the |
seqdepth |
Total number of sequencing reads to simulate |
label_time |
Length of s^4U feed to simulate. |
sample_name |
Character vector to assign to |
feature_prefix |
Name given to the i-th feature is |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
funique |
Fraction of reads that uniquely "map" to a single isoform. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont |
Percentage of nucleotides simulated to be U's. |
avg_numiso |
Average number of isoforms for each feature. Feature-specific
isoform counts are drawn from a Poisson distribution with this average. NOTE:
to ensure that all features have multiple isoforms, the simulated number of
isoforms drawn from a Poisson distribution is incremented by 2. Thus, the
actual average number of isoforms from each feature is |
psynthdiff |
Percentage of genes for which all isoform abundance differences are synthesis driven. If not synthesis driven, then isoform abundance differences will be driven by differences in isoform kdegs. |
logkdeg_mean |
meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
sdlog of a log-normal distribution from which ksyns are simulated |
Value
List with two elements:
cB: Tibble that can be passed as the
cBarg toEZbakRData().ground_truth: Tibble containing simulated ground truth.
Examples
simdata <- SimulateIsoforms(30)
Simulate NR-seq data for multiple replicates of multiple biological conditions
Description
SimulateMultiCondition is a highly flexible simulator that combines linear modeling
of log(kdeg)'s and log(ksyn)'s with SimulateOneRep to simulate an NR-seq dataset. The linear model
allows you to simulate multiple distinct treatments, batch effects, interaction effects,
etc. The current downside of this flexibility is its relative complexity to use.
Easier-to-use simulators are on the way to EZbakR.
Usage
SimulateMultiCondition(
nfeatures,
metadf,
mean_formula,
param_details = NULL,
seqdepth = nfeatures * 2500,
label_time = 2,
pnew = 0.05,
pold = 0.001,
readlength = 200,
Ucont_alpha = 25,
Ucont_beta = 75,
feature_prefix = "Gene",
dispslope = 5,
dispint = 0.01,
logkdegsdtrend_slope = -0.3,
logkdegsdtrend_intercept = -2.25,
logksynsdtrend_slope = -0.3,
logksynsdtrend_intercept = -2.25,
logkdeg_mean = -1.9,
logkdeg_sd = 0.7,
logksyn_mean = 2.3,
logksyn_sd = 0.7,
logkdeg_diff_avg = 0,
logksyn_diff_avg = 0,
logkdeg_diff_sd = 0.5,
logksyn_diff_sd = 0.5,
pdiff_kd = 0.1,
pdiff_ks = 0,
pdiff_both = 0,
pdo = 0
)
Arguments
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
metadf |
A data frame with the following columns:
These parameters (described more below) can also be included in metadf to specify sample-specific simulation parameters:
|
mean_formula |
A formula object that specifies the linear model used to
relate the factors in the |
param_details |
A data frame with one row for each column of the design matrix
obtained from your mean_formula and metadf.
If param_details is not specified by the user, the first column of the design matrix
is assumed to represent the reference parameter, all parameters are assumed to be
non-global, logkdeg_mean and logksyn_mean are set to the equivalently named parameter values
described below for the reference and |
seqdepth |
Only relevant if |
label_time |
Length of s^4U feed to simulate. |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont_alpha |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape1 = |
Ucont_beta |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape2 = |
feature_prefix |
Name given to the i-th feature is |
dispslope |
Negative binomial dispersion parameter "slope" with respect to read counts. See DESeq2 paper for dispersion model used. |
dispint |
Negative binomial dispersion parameter "intercept" with respect to read counts. See DESeq2 paper for dispersion model used. |
logkdegsdtrend_slope |
Slope for log10(read count) vs. log(kdeg) replicate variability trend |
logkdegsdtrend_intercept |
Intercept for log10(read count) vs. log(kdeg) replicate variability trend |
logksynsdtrend_slope |
Slope for log10(read count) vs. log(ksyn) replicate variability trend |
logksynsdtrend_intercept |
Intercept for log10(read count) vs. log(ksyn) replicate variability trend |
logkdeg_mean |
Mean of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logkdeg_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logksyn_mean |
Mean of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logksyn_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logkdeg_diff_avg |
Mean of normal distribution from which non-reference log(kdeg)
linear model parameters are drawn from for each feature if |
logksyn_diff_avg |
Mean of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
logkdeg_diff_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter are drawn from for each feature if |
logksyn_diff_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
pdiff_kd |
Proportion of features for which non-reference log(kdeg) linear model parameters differ from the reference. |
pdiff_ks |
Proportion of features for which non-reference log(ksyn) linear model parameters differ from the reference. |
pdiff_both |
Proportion of features for which BOTH non-reference log(kdeg) and log(ksyn) linear model parameters differ from the reference. ksyns are simulated |
pdo |
Dropout rate; think of this as the probability that a s4U containing
molecule is lost during library preparation and sequencing. If |
Value
A list containing 6 elements:
cB: Tibble that can be provided as the
cBarg toEZbakRData().metadf: Tibble that can be provided as the
metadfarg toEZbakRData().PerRepTruth: Tibble containing replicate-by-replicate simulated ground truth
AvgTruth: Tibble containing average simulated ground truth
param_details: Tibble containing information about simulated linear model parameters
UnbiasedFractions: Tibble containing no dropout ground truth
Examples
simdata <- SimulateMultiCondition(30,
data.frame(sample = c('sampleA', 'sampleB'),
treatment = c('treatment1', 'treatment2')),
mean_formula = ~treatment-1)
Simulate one replicate of multi-label NR-seq data
Description
Generalizes SimulateOneRep() to simulate any combination of mutation types. Currently, no kinetic model is used to relate certain parameters to the fractions of reads belonging to each simulated mutational population. Instead, these fractions are drawn from a Dirichlet distribution with gene-specific parameters.
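For intuition, here is a base-R sketch of how one feature's population fractions could be drawn from such a Dirichlet distribution (via normalized Gamma variates; the alpha bounds mirror the alpha_min and alpha_max defaults below):
# Gene-specific Dirichlet concentration parameters for 3 populations
alpha <- runif(3, min = 3, max = 6)
# A Dirichlet draw is a vector of independent Gamma(alpha_i, 1) variates,
# normalized to sum to 1
g <- rgamma(length(alpha), shape = alpha, rate = 1)
fractions <- g / sum(g)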
Usage
SimulateMultiLabel(
nfeatures,
populations = c("TC"),
fraction_design = create_fraction_design(populations),
fractions_matrix = NULL,
read_vect = NULL,
sample_name = "sampleA",
feature_prefix = "Gene",
kdeg_vect = NULL,
ksyn_vect = NULL,
logkdeg_mean = -1.9,
logkdeg_sd = 0.7,
logksyn_mean = 2.3,
logksyn_sd = 0.7,
phighs = stats::setNames(rep(0.05, times = length(populations)), populations),
plows = stats::setNames(rep(0.002, times = length(populations)), populations),
seqdepth = nfeatures * 2500,
readlength = 200,
alpha_min = 3,
alpha_max = 6,
Ucont = 0.25,
Acont = 0.25,
Gcont = 0.25,
Ccont = 0.25
)
Arguments
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
populations |
Vector of mutation populations you want to simulate. |
fraction_design |
Fraction design matrix, specifying which potential mutational populations should actually exist. See ?EstimateFractions for more details. |
fractions_matrix |
Matrix of fractions of each mutational population to simulate. If not provided, this will be simulated. One row for each feature, one column for each mutational population, rows should sum to 1. |
read_vect |
Vector of length = |
sample_name |
Character vector to assign to |
feature_prefix |
Name given to the i-th feature is |
kdeg_vect |
Vector of length = |
ksyn_vect |
Vector of length = |
logkdeg_mean |
If necessary, meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
If necessary, sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
If necessary, meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
If necessary, sdlog of a log-normal distribution from which ksyns are simulated |
phighs |
Vector of probabilities of mutation rates in labeled reads of each type denoted in
|
plows |
Vector of probabilities of mutation rates in unlabeled reads of each type denoted in
|
seqdepth |
Only relevant if |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
alpha_min |
Minimum possible value of alpha element of Dirichlet random variable |
alpha_max |
Maximum possible value of alpha element of Dirichlet random variable |
Ucont |
Probability that a nucleotide in a simulated read is a U. |
Acont |
Probability that a nucleotide in a simulated read is an A. |
Gcont |
Probability that a nucleotide in a simulated read is a G. |
Ccont |
Probability that a nucleotide in a simulated read is a C. |
Value
List with two elements:
cB: Tibble that can be passed as the
cBarg toEZbakRData().ground_truth: Tibble containing simulated ground truth.
Examples
simdata <- SimulateMultiLabel(3)
Simulate a single replicate of NR-seq data
Description
In SimulateOneRep, users have the option to either provide vectors of feature-specific
read counts, fraction news, kdegs, and ksyns for the simulation, or to have those drawn
from relevant distributions whose properties can be tuned by the various optional
parameters of SimulateOneRep. The number of mutable nucleotides (nT) in
a read is drawn from a binomial distribution with readlength trials and a probability
of "success" equal to Ucont. A read's status as new or old is drawn from a Bernoulli
distribution with probability of "success" equal to the feature's fraction new. If a read
is new, the number of mutations in the read is drawn from a binomial distribution with
probability of mutation equal to pnew. If a read is old, the number of mutations is instead
drawn from a binomial distribution with probability of mutation equal to pold.
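The per-read generative process described above can be mirrored in a few lines of base R. This is a sketch of the documented model using the default parameter values, not the package's internal code; the fraction new (fn) is a hypothetical value:
# Simulate one read's T-to-C mutation count under the documented model
readlength <- 200
Ucont <- stats::rbeta(1, shape1 = 25, shape2 = 75)       # feature's U content (Ucont_alpha/Ucont_beta defaults)
fn <- 0.4                                                # hypothetical fraction new
nT <- stats::rbinom(1, size = readlength, prob = Ucont)  # mutable Ts in the read
is_new <- stats::rbinom(1, size = 1, prob = fn)          # read's new/old status
p <- if (is_new == 1) 0.05 else 0.002                    # pnew vs. pold
TC <- stats::rbinom(1, size = nT, prob = p)              # T-to-C mutations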
Usage
SimulateOneRep(
nfeatures,
read_vect = NULL,
label_time = 2,
sample_name = "sampleA",
feature_prefix = "Gene",
fn_vect = NULL,
kdeg_vect = NULL,
ksyn_vect = NULL,
pnew = 0.05,
pold = 0.002,
logkdeg_mean = -1.9,
logkdeg_sd = 0.7,
logksyn_mean = 2.3,
logksyn_sd = 0.7,
seqdepth = nfeatures * 2500,
readlength = 200,
Ucont_alpha = 25,
Ucont_beta = 75,
feature_pnew = FALSE,
pnew_kdeg_corr = FALSE,
logit_pnew_mean = -2.5,
logit_pnew_sd = 0.1
)
Arguments
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
read_vect |
Vector of length = nfeatures specifying the number of reads to simulate for each feature. If not provided, read counts will be simulated (see seqdepth). |
label_time |
Length of the s4U feed to simulate. |
sample_name |
Character vector to assign to the sample column of the output cB table. |
feature_prefix |
Name given to the i-th feature is paste0(feature_prefix, i). |
fn_vect |
Vector of length = nfeatures specifying each feature's fraction new. If not provided, fraction news are calculated from the simulated kdegs and the label_time. |
kdeg_vect |
Vector of length = nfeatures specifying each feature's degradation rate constant. If not provided, kdegs are drawn from a log-normal distribution (see logkdeg_mean and logkdeg_sd). |
ksyn_vect |
Vector of length = nfeatures specifying each feature's synthesis rate constant. If not provided, ksyns are drawn from a log-normal distribution (see logksyn_mean and logksyn_sd). |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
logkdeg_mean |
If necessary, meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
If necessary, sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
If necessary, meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
If necessary, sdlog of a log-normal distribution from which ksyns are simulated |
seqdepth |
Only relevant if read_vect is not provided; total number of reads to simulate. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont_alpha |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape1 = Ucont_alpha. |
Ucont_beta |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape2 = Ucont_beta. |
feature_pnew |
Boolean; if TRUE, simulate a different pnew for each feature |
pnew_kdeg_corr |
Boolean; only relevant if feature_pnew is TRUE. If TRUE, feature-specific pnews are simulated so as to correlate with feature kdegs. |
logit_pnew_mean |
If feature_pnew is TRUE, mean of the distribution from which logit(pnew) values are drawn for each feature. |
logit_pnew_sd |
If feature_pnew is TRUE, standard deviation of the distribution from which logit(pnew) values are drawn for each feature. |
Value
List with two elements:
- cB: Tibble that can be passed as the cB arg to EZbakRData().
- ground_truth: Tibble containing simulated ground truth.
Examples
simdata <- SimulateOneRep(30)
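Feature-specific parameters can also be supplied directly rather than drawn from the default log-normal distributions; a sketch assuming fraction news are then derived from the supplied kdegs and the label_time:
# All features share kdeg = 0.5 (hypothetical value)
simdata <- SimulateOneRep(30, kdeg_vect = rep(0.5, 30))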
Vectorized simulation of one replicate of multi-label NR-seq data
Description
Generalizes SimulateOneRep() to simulate any combination of mutation types. Currently, no kinetic model is used to relate the simulated kinetic parameters to the fractions of reads belonging to each simulated mutational population. Instead, these fractions are drawn from a Dirichlet distribution with gene-specific parameters.
Usage
VectSimulateMultiLabel(
nfeatures,
populations = c("TC"),
fraction_design = create_fraction_design(populations),
fractions_matrix = NULL,
read_vect = NULL,
sample_name = "sampleA",
feature_prefix = "Gene",
kdeg_vect = NULL,
ksyn_vect = NULL,
logkdeg_mean = -1.9,
logkdeg_sd = 0.7,
logksyn_mean = 2.3,
logksyn_sd = 0.7,
phighs = stats::setNames(rep(0.05, times = length(populations)), populations),
plows = stats::setNames(rep(0.002, times = length(populations)), populations),
seqdepth = nfeatures * 2500,
readlength = 200,
alpha_min = 3,
alpha_max = 6,
Ucont = 0.25,
Acont = 0.25,
Gcont = 0.25,
Ccont = 0.25
)
Arguments
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
populations |
Vector of mutation populations you want to simulate. |
fraction_design |
Fraction design matrix, specifying which potential mutational populations should actually exist. See ?EstimateFractions for more details. |
fractions_matrix |
Matrix of fractions of each mutational population to simulate. If not provided, this will be simulated. One row for each feature, one column for each mutational population, rows should sum to 1. |
read_vect |
Vector of length = nfeatures specifying the number of reads to simulate for each feature. If not provided, read counts will be simulated (see seqdepth). |
sample_name |
Character vector to assign to the sample column of the output cB table. |
feature_prefix |
Name given to the i-th feature is paste0(feature_prefix, i). |
kdeg_vect |
Vector of length = nfeatures specifying each feature's degradation rate constant. If not provided, kdegs are drawn from a log-normal distribution (see logkdeg_mean and logkdeg_sd). |
ksyn_vect |
Vector of length = nfeatures specifying each feature's synthesis rate constant. If not provided, ksyns are drawn from a log-normal distribution (see logksyn_mean and logksyn_sd). |
logkdeg_mean |
If necessary, meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
If necessary, sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
If necessary, meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
If necessary, sdlog of a log-normal distribution from which ksyns are simulated |
phighs |
Vector of mutation rates (probabilities that a mutable nucleotide is mutated) in labeled reads, one for each mutation type denoted in populations. |
plows |
Vector of mutation rates (probabilities that a mutable nucleotide is mutated) in unlabeled reads, one for each mutation type denoted in populations. |
seqdepth |
Only relevant if read_vect is not provided; total number of reads to simulate. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
alpha_min |
Minimum possible value of alpha element of Dirichlet random variable |
alpha_max |
Maximum possible value of alpha element of Dirichlet random variable |
Ucont |
Probability that a nucleotide in a simulated read is a U. |
Acont |
Probability that a nucleotide in a simulated read is an A. |
Gcont |
Probability that a nucleotide in a simulated read is a G. |
Ccont |
Probability that a nucleotide in a simulated read is a C. |
Value
List with two elements:
- cB: Tibble that can be passed as the cB arg to EZbakRData().
- ground_truth: Tibble containing simulated ground truth.
Examples
simdata <- VectSimulateMultiLabel(30)
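Read counts can likewise be fixed rather than simulated; a minimal sketch with hypothetical counts:
# Exactly 1000 reads per feature
simdata <- VectSimulateMultiLabel(30, read_vect = rep(1000, 30))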
Make plots to visually assess dropout trends
Description
Plots a measure of dropout (the ratio of -label to +label RPM) as a function of feature fraction new, with the model fit depicted. Use this function to qualitatively assess model fit and whether the modeling assumptions are met.
Usage
VisualizeDropout(
obj,
plot_type = c("grandR", "bakR"),
grouping_factors = NULL,
features = NULL,
populations = NULL,
fraction_design = NULL,
repeatID = NULL,
exactMatch = TRUE,
n_min = 50,
dropout_cutoff = 5,
...
)
Arguments
obj |
An EZbakRFractions object, which is an EZbakRData object on which
you have run EstimateFractions(). |
plot_type |
Which type of plot to make. Options are "grandR" and "bakR", which produce dropout plots in the styles of those respective packages. |
grouping_factors |
Which sample-detail columns in the metadf should be used
to group -s4U samples by for calculating the average -s4U RPM? The default of
NULL uses all sample-detail columns. |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns. |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the mutation types in populations are present in your data. |
repeatID |
If multiple fractions tables have identical metadata, this numerical ID distinguishes which one to use. |
exactMatch |
If TRUE, then features and populations must exactly match those of the fractions table for it to be used. |
n_min |
Minimum raw number of reads a feature must have to be included in the plot. |
dropout_cutoff |
Maximum dropout value included in plot. |
... |
Parameters passed to the internal plotting functions. |
Value
A list of ggplot2 objects, one for each +label sample.
Examples
# Simulate data to analyze
simdata <- EZSimulate(30)
# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)
# Visualize Dropout
dropout_plots <- VisualizeDropout(ezbdo)
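Since the return value is a list of ggplot2 objects (one per +label sample), individual plots can be displayed or saved; the file name below is illustrative:
# Display the first dropout plot
print(dropout_plots[[1]])
# ggplot2::ggsave("dropout_sampleA.pdf", dropout_plots[[1]])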
Generate a fraction_design table for EstimateFractions.
Description
A fraction_design table denotes which populations of labeled and unlabeled RNA are present in your data.
A fraction_design table has one column for each mutation type (e.g., TC) present
in your cB file, plus one column named "present". Each entry is either TRUE or
FALSE, and the rows enumerate all possible combinations of TRUE and FALSE across the
mutation type columns. A value of TRUE in a mutation type column represents
a population of reads that have high amounts (on average) of that mutation type.
For example, if your fraction_design table has mutation type columns "TC" and
"GA", the row with TC == TRUE and GA == FALSE represents a population of reads
with high T-to-C mutation content and low G-to-A mutation content. In other words,
these are reads from RNA synthesized in the presence of s4U but not s6G. If such
a population exists in your data, the "present" column for that row should have a
value of TRUE.
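As a sketch of the dual-label case described above, the resulting table can be generated and inspected directly:
fd <- create_fraction_design(c("TC", "GA"))
fd  # columns TC, GA, and present; one row per TRUE/FALSE combination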
Usage
create_fraction_design(mutrate_populations)
Arguments
mutrate_populations |
Character vector of the set of mutational populations
present in your data. For example, s4U-fed data with standard nucleotide recoding
chemistry (e.g., TimeLapse, SLAM, TUC, AMUC, etc.) would have a mutrate_populations of "TC". |
Value
A fraction_design table that assumes that every possible combination of
mutational populations listed in mutrate_populations is present in your data.
The present column can be modified if this assumption is incorrect. This default
is chosen because it will in theory work for all analyses; it may just be unnecessarily
inefficient and estimate the abundances of populations that don't exist.
Examples
# Standard, single-label NR-seq
fd <- create_fraction_design(c("TC"))
# Dual-label NR-seq
fd2 <- create_fraction_design(c("TC", "GA"))
# Adjust dual-label output for TILAC
fd2$present <- ifelse(fd2$TC & fd2$GA, FALSE, fd2$present)
Example cB table
Description
An example cB table used to create an EZbakRData object. This cB table is a
subset of a cB file from the DCP2 KO dataset published in Luo et al., 2020.
The original file is large (69 MB), so the example cB file has been
downsampled and contains only a subset of reads from chromosome 21.
Usage
example_cB
Format
example_cB
A tibble with 10,000 rows and 7 columns:
- sample
Sample name
- rname
Chromosome name
- GF
Gene name for reads aligning to any region of a gene
- XF
Gene name for reads aligning to exclusively exonic regions of a gene
- TC
Number of T-to-C mutations
- nT
Number of Ts
- n
Number of reads with the same values for the first six columns
References
Luo et al. (2020) Biochemistry. 59(42), 4121-4142
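Because the package uses LazyData, the table can be inspected directly:
# Peek at the example cB table
head(example_cB)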
Example metadf
Description
An example metadf table used to create an EZbakRData object. This metadf
describes the DCP2 KO dataset published in Luo et al., 2020.
Usage
example_metadf
Format
example_metadf
A tibble with 6 rows and 3 columns:
- sample
Sample name
- tl
Metabolic label feed time
- genotype
Whether sample was collected from WT or DCP2 KO cells
References
Luo et al. (2020) Biochemistry. 59(42), 4121-4142
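Together with example_cB, this metadf can be used to construct an EZbakRData object:
ezbdo <- EZbakRData(example_cB, example_metadf)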
Get normalized read counts from either a cB table or EZbakRFractions object.
Description
Uses a trimmed mean of M-values (TMM) normalization strategy, similar to the approaches used by DESeq2 and edgeR.
Usage
get_normalized_read_counts(
obj,
features_to_analyze,
fractions_name = NULL,
feature_lengths = NULL,
scale_factors = NULL
)
## S3 method for class 'EZbakRFractions'
get_normalized_read_counts(
obj,
features_to_analyze,
fractions_name = NULL,
feature_lengths = NULL,
scale_factors = NULL
)
## S3 method for class 'EZbakRData'
get_normalized_read_counts(
obj,
features_to_analyze,
fractions_name = NULL,
feature_lengths = NULL,
scale_factors = NULL
)
Arguments
obj |
An EZbakRData or EZbakRFractions object. |
features_to_analyze |
Features in relevant table |
fractions_name |
Name of fractions table to use |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
scale_factors |
Dataframe with two columns, one being "sample" (sample names) and the other being "scale_factor" (value to divide read counts by to normalize them) |
Value
Data table of normalized read counts.
Methods (by class)
- get_normalized_read_counts(EZbakRFractions): Method for class EZbakRFractions. Get normalized read counts from a fractions table.
- get_normalized_read_counts(EZbakRData): Method for class EZbakRData. Get normalized read counts from a cB table.
Examples
# Simulate data
simdata <- EZSimulate(30)
# Create EZbakRData object
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
# Get normalized read counts
reads <- get_normalized_read_counts(ezbdo, features_to_analyze = "feature")
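Precomputed normalization factors can be supplied via scale_factors instead of relying on TMM. A sketch assuming the simulated metadf has a sample column, with hypothetical unit factors:
# Unit scale factors, one per sample
sf <- data.frame(sample = simdata$metadf$sample, scale_factor = 1)
reads <- get_normalized_read_counts(ezbdo, features_to_analyze = "feature",
                                    scale_factors = sf)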
EZbakRArrowData object constructor for internal use
Description
new_EZbakRArrowData efficiently creates an object of class EZbakRArrowData.
It does not perform any rigorous checks of the legitimacy of this object.
Usage
new_EZbakRArrowData(cBds, metadf)
Arguments
cBds |
Arrow dataset tracking the sample ID, mutational and nucleotide content, and feature assignment of sequencing reads. |
metadf |
Data frame tracking features of each of the samples included
in |
EZbakRData object constructor for internal use
Description
new_EZbakRData efficiently creates an object of class EZbakRData.
It does not perform any rigorous checks of the legitimacy of this object.
Usage
new_EZbakRData(cB, metadf)
Arguments
cB |
Data frame tracking the sample ID, mutational and nucleotide content, and feature assignment of sequencing reads. |
metadf |
Data frame tracking features of each of the samples included
in |
EZbakRFractions object constructor
Description
new_EZbakRFractions efficiently creates an object of class EZbakRFractions.
It does not perform any rigorous checks of the legitimacy of this object.
Usage
new_EZbakRFractions(fractions, metadf, name = NULL, character_limit = 20)
Arguments
fractions |
Data frame containing information about the fraction of reads from each mutational population of interest. |
metadf |
Data frame reporting aspects of each of the samples included in fractions. |
name |
Optional; name to give to fractions table. |
character_limit |
Maximum number of characters for the name of the output fractions table. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is longer than character_limit characters, a shorter default name is used instead. |
EZbakRKinetics object constructor
Description
new_EZbakRKinetics efficiently creates an object of class EZbakRKinetics.
It does not perform any rigorous checks of the legitimacy of this object.
Usage
new_EZbakRKinetics(
kinetics,
features,
metadf,
name = NULL,
character_limit = 20
)
Arguments
kinetics |
Data frame containing information about the kinetic parameters of interest for each set of features tracked. |
features |
Features tracked in kinetics. |
metadf |
Data frame describing each of the samples included in kinetics. |
name |
Optional; name to give to the kinetics table. |
character_limit |
Maximum number of characters for the name of the output kinetics table. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is longer than character_limit characters, a shorter default name is used instead. |
Example ODE model graphs and formulas
Description
A list of example "graphs" and species formulas that can be passed to
EZDynamics or SimulateDynamics, and that are used under the hood
by EZSimulate to facilitate simulations of ODE models.
Usage
ode_models
Format
ode_models
A list with 6 elements
- nuc2cyto
Simplest model of nuclear and cytoplasmic RNA dynamics: 0 -> N -> C -> 0
- preRNA
Simplest model of pre-RNA and mature RNA dynamics: 0 -> P -> M -> 0
- preRNAwithPdeg
Same as preRNA, but now pre-RNA can also degrade.
- nuc2cytowithNdeg
Same as nuc2cyto, but now nuclear RNA can also degrade.
- subtlseq
Subcellular TimeLapse-seq model, similar to that described in Ietswaart et al., 2024. Simplest model discussed there, lacking nuclear degradation: 0 -> CH -> NP -> CY -> PL -> 0, and CY can also degrade.
- nuc2cytowithpreRNA
Combination of nuc2cyto and preRNA where preRNA is first synthesized, then either processed or exported to the cytoplasm. Processing can also occur in the cytoplasm, and mature nuclear RNA can be exported to the cytoplasm. Only mature RNA degrades.
Each element of the list has two items:
- graph
Matrix representation of ODE system graph.
- formulas
Formula objects relating measured species to modeled species.
References
Ietswaart et al. (2024) Molecular Cell. 84(14), 2765-2784.
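Individual models can be pulled out of the list and inspected or passed to EZDynamics/SimulateDynamics; a minimal sketch:
# Inspect the simplest nuclear-to-cytoplasmic model (0 -> N -> C -> 0)
ode_models$nuc2cyto$graph      # matrix encoding of the ODE system graph
ode_models$nuc2cyto$formulas   # formulas relating measured to modeled species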
Print method for EZbakRArrowData objects
Description
Print method for EZbakRArrowData objects
Usage
## S3 method for class 'EZbakRArrowData'
print(x, max_name_chars = 60, ...)
Arguments
x |
An |
max_name_chars |
Maximum number of characters to print on each line |
... |
Ignored |
Value
The input EZbakRData object, invisibly
Examples
# Simulate data to analyze
simdata <- SimulateOneRep(30)
# Create EZbakR input
metadf <- data.frame(sample = "sampleA", tl = 2)
ezbdo <- EZbakRData(simdata$cB, metadf)
# Print
print(ezbdo)
Print method for EZbakRData objects
Description
Print method for EZbakRData objects
Usage
## S3 method for class 'EZbakRData'
print(x, max_name_chars = 60, ...)
Arguments
x |
An |
max_name_chars |
Maximum number of characters to print on each line |
... |
Ignored |
Value
The input EZbakRData object, invisibly
Examples
# Simulate data to analyze
simdata <- SimulateOneRep(30)
# Create EZbakR input
metadf <- data.frame(sample = "sampleA", tl = 2)
ezbdo <- EZbakRData(simdata$cB, metadf)
# Print
print(ezbdo)
Standard fraction_design table for EstimateFractions
Description
An example fraction_design table for a standard NR-seq experiment with
s4U labeling. This table tells EstimateFractions that there are two
populations of reads, one with high T-to-C mutation content and one
with low T-to-C mutation content.
Usage
standard_fraction_design
Format
standard_fraction_design
A tibble with 2 rows and 2 columns:
- TC
Boolean denoting if population represented by that row has high T-to-C mutational content
- present
Boolean denoting if population represented by that row is expected to be present in this dataset
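This table should match the default single-label output of create_fraction_design (a presumed equivalence, not stated explicitly in the package docs):
fd <- create_fraction_design(c("TC"))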
TILAC fraction_design table for EstimateFractions
Description
An example fraction_design table for a TILAC experiment. TILAC was originally
described in Courvan et al., 2022. In this
method, two populations of RNA, one from s4U-fed cells and one from s6G-fed cells, are pooled
and prepped for sequencing together. This allows for internally controlled comparisons of RNA
abundance without spike-ins. s4U is recoded to a cytosine analog by TimeLapse chemistry
(or similar chemistry), and s6G is recoded to an adenine analog. Thus, the fraction_design includes
columns called TC and GA. A unique aspect of the TILAC fraction_design table is that
one of the possible populations, that with TC and GA both TRUE, is denoted as not present (present = FALSE).
This is because no RNA was exposed to both s4U and s6G; thus, a population of reads
with both high T-to-C and high G-to-A mutational content should not exist.
Usage
tilac_fraction_design
Format
tilac_fraction_design
A tibble with 4 rows and 3 columns:
- TC
Boolean denoting if population represented by that row has high T-to-C mutational content
- GA
Boolean denoting if population represented by that row has high G-to-A mutational content
- present
Boolean denoting if population represented by that row is expected to be present in this dataset
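This table can presumably be rebuilt with create_fraction_design followed by the TILAC adjustment shown in that function's examples:
fd2 <- create_fraction_design(c("TC", "GA"))
fd2$present <- ifelse(fd2$TC & fd2$GA, FALSE, fd2$present)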
EZbakRArrowData object validator
Description
validate_EZbakRArrowData ensures that input for EZbakRArrowData object construction
is valid.
Usage
validate_EZbakRArrowData(obj)
Arguments
obj |
An object of class EZbakRArrowData. |
EZbakRData object validator
Description
validate_EZbakRData ensures that input for EZbakRData construction
is valid.
Usage
validate_EZbakRData(obj)
Arguments
obj |
An object of class EZbakRData. |
EZbakRFractions object validator
Description
validate_EZbakRFractions ensures that input for EZbakRFractions construction
is valid.
Usage
validate_EZbakRFractions(obj)
Arguments
obj |
An object of class EZbakRFractions. |
EZbakRKinetics object validator
Description
validate_EZbakRKinetics ensures that input for EZbakRKinetics construction
is valid.
Usage
validate_EZbakRKinetics(obj, features)
Arguments
obj |
An object of class EZbakRKinetics. |
features |
Features tracked in the kinetics table of obj. |