--- title: "Differential RNA structurome analysis using `dStruct`" author: - name: Krishna Choudhary affiliation: University of California, San Francisco, USA email: kchoudhary@ucdavis.edu - name: Sharon Aviran affiliation: University of California, Davis, USA package: dStruct output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default vignette: > %\VignetteIndexEntry{Differential RNA structurome analysis using `dStruct`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: "`r 'refs.bib'`" abstract: >

A large variety of RNA molecules exist in cells. They adopt diverse structures to serve important cellular functions. In the last decade, a number of deep sequencing-based technologies have been developed to study RNA structures at a genome-wide scale, at high resolution, and in vivo or in vitro. These technologies generate large-scale and highly noisy data. A common goal in many studies utilizing such data is to compare RNA structuromes under two or more conditions, and identify RNAs or their regions that may have altered structures between the conditions.

`dStruct`: dStruct is a package meant for **d**ifferential analysis of RNA **Struct**urome profiling data [@choudhary2019dstruct]. It is broadly applicable to all existing technologies for RNA structurome profiling. It accounts for inherent biological variation in structuromes between replicate samples and returns differential regions. Here, we describe the input data format required by dStruct, provide use cases, and discuss options for analysis and visualization. --- ```{r echo = FALSE, message = FALSE, warning = FALSE} library(BiocStyle) ``` *** For details on the methods presented here, consider having a look at our manuscript: > Choudhary, K., Lai, Y. H., Tran, E. J., & Aviran, S. (2019). dStruct: identifying differentially reactive regions from RNA structurome profiling data. Genome Biology, 20(1), 1-26. doi: [10.1186/s13059-019-1641-3](https://doi.org/10.1186/s13059-019-1641-3) # Load packages {-} The following code chunk loads the `dStruct` and `tidyverse` packages. We use `tidyverse` for cleaner code presentation in this vignette. ```{r load-libs, message = FALSE, warning = FALSE} library(dStruct) library(tidyverse) ``` # Introduction dStruct is broadly applicable to technologies for RNA structurome profiling, e.g., SHAPE-Seq, Structure-Seq, PARS, SHAPE-MaP, etc. The raw data from these technologies are reads from DNA sequencing platforms. The start sites of read alignment coordinates are adjacent to sites of reactions of ribonucleotides with a structure-probing reagent. For some technologies, the sites of mismatches between reads and the reference genome sequences are sites of reactions, e.g., SHAPE-MaP and DMS-MaPseq. The counts of reads mapping to the nucleotides of an RNA should be tallied and converted to reactivities, which represent the degrees of reactions. The specific steps to process data from sequencing reads to reactivities depend on the structurome profiling technology. For a review on structurome profiling technologies and methods for assessing reactivities, see @choudhary2017comparative. dStruct takes nucleotide-level normalized reactivities of one or multiple RNAs as input. Currently, the preffered normalization method depends on the RNA structurome profiling technology. We have provided an implementation of the 2-8 % method for normalization (@low2010shape) via a function named `twoEightNormalize`, which is commonly used to normalize data from technologies such as SHAPE-Seq, Structure-Seq, etc. In the following, we assume that the user has processed the raw reads to obtain normalized reactivities for their RNAs of interest. # Input data The input data to dStruct is a `list` object. Each element of the `list` should be a `data.frame` containing normalized reactivities. ## Load inbuilt sample data We have provided two test datasets with this package. These are derived from experiments described in @lai2019genome and @wan2014landscape. As described in the section on differential analysis below, we use these to illustrate de novo and guided discovery use cases, respectively. To start with, let us examine the data structure required for input. The Lai et al. data can be loaded as follows. ```{r} data(lai2019) ``` This adds an object named `lai2019` to the global environment, which can be checked using the following function. ```{r} ls() ``` `lai2019` is an object of class `list`. Each of the list elements is a `data.frame` containing normalized reactivities. ```{r} class(lai2019) ``` ```{r} names(lai2019) %>% head() ``` All elements of `lai2019` have the same structure. Let us check one of them. ```{r} class(lai2019[[1]]) ``` ```{r} head(lai2019[["YAL042W"]], n= 20) ``` Each row of the above `data.frame` has the reactivity for one nucleotide of the RNA with id `"YAL042W"` and each column is one sample. Consecutive rows must represent consecutive nucleotides. `NA` values represent unavailable reactivities. If the data were generated using base-selective reagents (e.g., DMS used by Lai et al.), `dStruct` expects that the reactivities of unprobed bases have been masked as `NA` before running the differential analysis. Currently, `dStruct` supports comparisons of samples from two conditions at a time. The prefixes `A` and `B` in the columns of each `data.frame` of `lai2019` stand for the two conditions. It is required that the conditions be labeled as `A` and `B` because `dStruct` will be looking for these internally. The numeric suffixes in the column names are replicate sample numbers. These must start with `1` and increase in steps of 1. If the samples were prepared in batches with each batch having one sample of each group, the corresponding samples should be given the same numeric suffix. # Differential analysis Differential analysis can be performed for multiple RNAs in one step or a single RNA at a time. Additionally, we allow for two modes of analysis, one to identify differentially reactive regions de novo, and another called as guided discovery to assess the significance of differences in reactivities in the regions provided by the user. ## De novo discovery The function `dStruct` analyzes reactivity profiles for a single transcript to identify differential regions de novo. Additionally, we provide a wrapper function called `dStructome` which can analyze profiles for multiple transcripts simultaneously. For example, a single transcript, say `YAL042W`, can be analyzed as follows. ```{r, warning=FALSE} dStruct(rdf = lai2019[["YAL042W"]], reps_A = 3, reps_B = 2, batches = TRUE, min_length = 21, between_combs = data.frame(c("A3", "B1", "B2")), within_combs = data.frame(c("A1", "A2", "A3")), ind_regions = TRUE) ``` **Input arguments.** The arguments `rdf`, `reps_A` and `reps_B` of the `dStruct` function are required while the others are optional. These take in a `data.frame` with reactivity values, the number of samples of group A (group of wild-type *S. cerevisiae* samples), and the number of samples of group B (group of *dbp2$\Delta$ S. cerevisiae* samples), respectively. Set `batches` to `TRUE` if the samples were prepared in batches with a paired experiment design, i.e., two samples in each batch -- one of each group. `min_length` is the minimum length of a differential region that `dStruct` would search for. For the Lai et al. dataset, we set it to 21 nt because that is the putative length of regions bound by Dbp2, which was the RNA helicase under investigation in their study. As described in our manuscript, `dStruct` performs differential analysis by regrouping the samples in homogeneous and heterogeneous groups and assessing the dissimilarity of reactivity profiles in these groups. The homogeneous groups comprise of samples from the same original group (e.g., A or B) while the heterogeneous groups comprise of samples from both A and B, respectively (see @choudhary2019dstruct for details). Users can explicitly specify these groupings to the `dStruct` function as shown in the example above. Otherwise, `dStruct` would automatically generate the homogeneous and heterogeneous groups. Setting `ind_regions` to `TRUE` requires test for significance of difference in reactivity patterns in individual regions. If it is set to `FALSE` all regions identified within a transcript are tested collectively to obtain significance scores at the level of a transcript. **Output.** The output is an `IRanges` object with columns giving the start (column titled *start*) and end (column titled *end*) coordinates of the tested regions from the transcript whose reactivity profile is input, *p*-values, medians of nucleotide-wise differences of the between-group and within-group *d* scores (column titled *del_d*), and the FDR-adjusted *p*-values. Alternatively, all transcripts in a reactivity list can be analyzed in one step using the wrapper function `dStructome`. By default, `dStructome` spawns multiple processes, which speeds up the analysis by processing transcripts in parallel. This behavior can be changed using the argument `processes` as shown in the following code chunk. ```{r, warning=FALSE} res <- dStructome(lai2019, 3, 2, batches= TRUE, min_length = 21, between_combs = data.frame(c("A3", "B1", "B2")), within_combs = data.frame(c("A1", "A2", "A3")), ind_regions = TRUE, processes = 1) ``` This returns the coordinates of regions in transcripts with the corresponding *p*- and FDR-values. ```{r} head(res) ``` ## Guided discovery The function `dStructGuided` tests for differential reactivity profiles of pre-defined regions. Additionally, the wrapper function `dStructome` can analyze profiles for multiple regions simultaneously when its argument `method = "guided"`. We illustrate this mode of running `dStruct` using PARS data from @wan2014landscape. We downloaded the data reported in their manuscript and processed it to get PARS scores as described in @choudhary2019dstruct. The PARS scores are available with the `dStruct` package and can be loaded as follows. ```{r} data(wan2014) ``` This adds an object named `wan2014` to the global environment, which can be checked using the following function. ```{r} ls() ``` `wan2014` is an object of class `list` and has the same structure as described for Lai et al.'s data above. Each of its elements is a `data.frame` containing PARS scores for an 11 nt region. At its center, each region has a single-nucleotide variant between the two groups of samples (the regions could be defined by users based on other considerations, e.g., regions bound by proteins based on assays such as eCLIP-seq). There are two samples of group A and one of group B. ```{r} wan2014[[1]] ``` The names of the list elements give a transcript's ID and the coordinates of its region to be tested. ```{r} names(wan2014) %>% head() ``` To test a single region, use `dStructGuided` as follows. ```{r} dStructGuided(wan2014[[1]], reps_A = 2, reps_B = 1) ``` `dStructGuided` returns a *p*-value for significance of differential PARS scores in the input region and the median of nucleotide-wise differences of the between-group and within-group *d* scores. Alternatively, all regions in `wan2014` can be analyzed in one step using the wrapper function `dStructome`. ```{r, warning=FALSE} res_predefined_regs <- dStructome(wan2014, reps_A = 2, reps_B = 1, method = "guided", processes = 1) ``` This returns a table with the coordinates of regions in transcripts with the corresponding *p*- and FDR-values. ```{r} head(res_predefined_regs) ``` # Visualizing results The function `plotDStructurome` takes in a list of `data.frame`s with reactivities and the results from `dStructome` to save a PDF file showing the reactivities in a region. We illustrate this with the result for the `lai2019` data. Let us plot the reactivities for the gene with the lowest FDR value. We can identify it as follows. ```{r} toPlot <- res@elementMetadata %>% data.frame() %$% magrittr::extract(t, order(FDR)) %>% head(., n = 1) toPlot ``` Users may pass the entire reactivity list to `plotDStructurome` and the complete result table. By default, the results are plotted for FDR < 0.05 and $y$-axis limits range from 0 to 3. However, this may take some time in case there are a large number of differential regions. Here, we plot the reactivities for the gene in the object `toPlot`. ```{r, message=FALSE, warning=FALSE} plotDStructurome(rl = lai2019[toPlot], diff_regions = subset(res, t == toPlot), outfile = paste0("DRRs_in_", toPlot), fdr = 0.05, ylim = c(-0.05, 3)) ``` This saves a PDF file named "DRRs_in_YJR045C.pdf" in the current working directory. # Session Information {-} ```{r} sessionInfo() ``` # References {-}