--- title: "Introduction to derfinderHelper" author: - name: Leonardo Collado-Torres affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus - &ccb Center for Computational Biology, Johns Hopkins University email: lcolladotor@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('derfinderHelper')`" vignette: > %\VignetteIndexEntry{Introduction to derfinderHelper} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Basics ## Install `r Biocpkg('derfinderHelper')` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('derfinderHelper')` is a `R` package available via the [Bioconductor](http://bioconductor/packages/derfinderHelper) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('derfinderHelper')` by using the following commands in your `R` session: ```{r 'installDer', eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("derfinderHelper") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Required knowledge `r Biocpkg('derfinderHelper')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. A `r Biocpkg('derfinderHelper')` user is not expected to deal with those packages directly but will need to be familiar with `r Biocpkg('derfinder')`. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `derfinder` or `derfinderHelper` tags and check [the older posts](https://support.bioconductor.org/t/derfinder/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `r Biocpkg('derfinderHelper')` We hope that `r Biocpkg('derfinderHelper')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r 'citation'} ## Citation info citation("derfinderHelper") ``` # Introduction to `r Biocpkg('derfinderHelper')` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( derfinderHelper = citation("derfinderHelper")[1], BiocStyle = citation("BiocStyle"), knitr = citation("knitr")[3], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown"), R = citation(), IRanges = citation("IRanges"), Matrix = citation("Matrix"), S4Vectors = RefManageR::BibEntry( bibtype = "manual", key = "S4Vectors", author = "Hervé Pagès and Michael Lawrence and Patrick Aboyoun", title = "S4Vectors: S4 implementation of vector-like and list-like objects", year = 2017, doi = "10.18129/B9.bioc.S4Vectors" ), sessioninfo = citation("sessioninfo"), testthat = citation("testthat") ) ``` `r Biocpkg('derfinderHelper')` `r Citep(bib[['derfinderHelper']])` is a small package that was created to speed up the F-statistics approach implemented in the parent package `r Biocpkg('derfinder')`. It contains a single function, `fstats.apply()`, which is used to calculate the F-statistics for a given data matrix, null and an alternative models. The data is generally arranged in an matrix where the rows ($n$) are the genomic features of interest (gene-level summaries, exon-level summaries, or base-level data) and the columns ($m$) represent the samples. The other two main arguments for `fstats.apply()` are the null and alternative model matrices which are $m \times p_0$ and $m \times p$ where $p_0$ is the number of covariates in the null model and $p$ is the number of covariates in the alternative model. The models have to be nested and thus by definition $p > p_0$. The end result is a vector of F-statistics with length $n$, which is run length encoded for memory saving purposes. Other arguments of `fstats.apply()` are related to flow in `r Biocpkg('derfinder')` such as the scaling factor (`scalefac`) used, whether to subset the data (`index`), and if the data was separated into chunks and saved to disk to lower the memory load (`lowMemDir`). Implementation-wise, `adjustF` is useful when the denominator of the F-statistic calculation is too small. Finally, `method` controls how will the F-statistics be calculated. * `Matrix` is the recommended option because it uses around half the memory load of `regular` and can be faster. Specially if the data was saved in this format previously by `r Biocpkg('derfinder')`. * `Rle` uses the least amount of memory but gets very slow as the number of samples increases. Thus making it less than ideal in several cases. * `regular` uses base `R` to calculate the F-statistics and can require a large amount of memory. This is noticeable when using several cores to run `fstats.apply()` on different portions of the data. The F-statistics for each feature $i$ are calculated using the following formula: $$ F_i = \frac{ (\text{RSS0}_i - \text{RSS1}_i)/(\text{df}_1 - \text{df}_0) }{ \text{adjustF} + (\text{RSS1}_i / (p - p_0 - \text{df_1}))} $$ # Example The following section walks through an example. However, in practice, you will probably not use this package directly and it will be used via `r Biocpkg('derfinder')`. ## Data First lets create an example data set where we have information for 1000 features and 16 samples where samples 1 to 4 are from group A, 5 to 8 from group B, 9 to 12 from group C, and 13 to 16 from group D. ```{r 'createData'} ## Create some toy data suppressPackageStartupMessages(library("IRanges")) set.seed(20140923) toyData <- DataFrame( "sample1" = Rle(sample(0:10, 1000, TRUE)), "sample2" = Rle(sample(0:10, 1000, TRUE)), "sample3" = Rle(sample(0:10, 1000, TRUE)), "sample4" = Rle(sample(0:10, 1000, TRUE)), "sample5" = Rle(sample(0:15, 1000, TRUE)), "sample6" = Rle(sample(0:15, 1000, TRUE)), "sample7" = Rle(sample(0:15, 1000, TRUE)), "sample8" = Rle(sample(0:15, 1000, TRUE)), "sample9" = Rle(sample(0:20, 1000, TRUE)), "sample10" = Rle(sample(0:20, 1000, TRUE)), "sample11" = Rle(sample(0:20, 1000, TRUE)), "sample12" = Rle(sample(0:20, 1000, TRUE)), "sample13" = Rle(sample(0:100, 1000, TRUE)), "sample14" = Rle(sample(0:100, 1000, TRUE)), "sample15" = Rle(sample(0:100, 1000, TRUE)), "sample16" = Rle(sample(0:100, 1000, TRUE)) ) ## Lets say that we have 4 groups group <- factor(rep(toupper(letters[1:4]), each = 4)) ## Note that some groups have higher coverage, we can adjust for this in the model sampleDepth <- sapply(toyData, sum) sampleDepth ``` ## Models Next we create the model matrices for our example data set. Lets say that we want to calculate F-statistics comparing the alternative hypothesis that the group coefficients are not 0 versus the null hypothesis that they are equal to 0, when adjusting for the sample depth. To do so, we create the nested models. ```{r 'createModels'} ## Build the model matrices mod <- model.matrix(~ sampleDepth + group) mod0 <- model.matrix(~sampleDepth) ## Explore them mod mod0 ``` ## Get F-statistics Finally, we can calculate the F-statistics using `fstats.apply()`. ```{r 'calculateFstats'} library("derfinderHelper") fstats <- fstats.apply(data = toyData, mod = mod, mod0 = mod0, scalefac = 1) fstats ``` We can then proceed to use this information in `r Biocpkg('derfinder')` or in any way you like. # Details We created `r Biocpkg('derfinderHelper')` for calculating F-statistics using `SnowParam()` from `r Biocpkg('BiocParallel')`. Using this form of parallelization requires loading the necessary packages in the child processes. Because `r Biocpkg('derfinder')` takes a long time to load, we shipped off `fstats.apply()` to its own package to improve the speed of the calculations while retaining the memory advantages of `SnowParam()` over `MulticoreParam()`. Note that transforming the data from a `DataFrame` to a `dgCMatrix` takes some time, so the most efficient performance is achieved when the data is converted at the beginning instead of at every permutation calculation. This is done in `derfinder::preprocessCoverage()` when `lowMemDir` is specified. # Reproducibility This package was made possible thanks to: * R `r Citep(bib[['R']])` * `r Biocpkg('IRanges')` `r Citep(bib[['IRanges']])` * `r CRANpkg('Matrix')` `r Citep(bib[['Matrix']])` * `r Biocpkg('S4Vectors')` `r Citep(bib[['S4Vectors']])` * `r CRANpkg('sessioninfo')` `r Citep(bib[['sessioninfo']])` * `r CRANpkg('knitr')` `r Citep(bib[['knitr']])` * `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])` * `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` * `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])` * `r CRANpkg('testthat')` `r Citep(bib[['testthat']])` Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library("rmarkdown") system.time(render("derfinderHelper.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("derfinderHelper.Rmd", tangle = TRUE) ``` Date the vignette was generated. ```{r reproducibility1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproducibility2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproducibility3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])` with `r CRANpkg('knitr')` `r Citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])` running behind the scenes. Citations made with `r CRANpkg('RefManageR')` `r Citep(bib[['RefManageR']])`. ```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```