--- title: "Local Use of dnaEPICO" author: - name: Paul Ruiz affiliation: - Queensland University of Technology email: ruizpint@qut.edu.au - name: Divya Mehta affiliation: - Queensland University of Technology output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 date: "`r Sys.Date()`" package: dnaEPICO bibliography: dnaEPICO.bib vignette: > %\VignetteIndexEntry{Interactive Local Use of dnaEPICO} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} startTime <- Sys.time() knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ) library(dnaEPICO) can_run_preprocessing <- requireNamespace("minfiData", quietly = TRUE) && requireNamespace("IlluminaHumanMethylation450kmanifest", quietly = TRUE) && requireNamespace("IlluminaHumanMethylation450kanno.ilmn12.hg19", quietly = TRUE) can_run_sva <- requireNamespace("minfiData", quietly = TRUE) ``` # Introduction This vignette demonstrates the local, interactive use of `dnaEPICO`. In this mode, the main functions return structured in-memory result objects that can be inspected directly in the R session. This is the recommended route when you are checking inputs, reviewing quality-control outputs, or learning the structure of the package results before moving to larger analyses. The examples below focus on the first three workflow stages: 1. `preprocessingMinfiEwasWater()` 2. `svaEnmix()` 3. `preprocessingPheno()` All examples use `saveOutputs = FALSE`, which keeps the workflow in memory and avoids writing the legacy file outputs. # Citation The `r Biocpkg("dnaEPICO")` package uses methods from several external packages. Because no single manuscript describes all components, the guidance below explains how to cite `dnaEPICO` depending on the functions you use. This citation guidance is adapted from the vignettes and user guides of [minfi](https://github.com/hansenlab/minfi/blob/devel/vignettes/minfi.Rmd), [ENmix](https://github.com/xuz1/ENmix/blob/master/vignettes/ENmix.Rmd), and [wateRmelon](https://github.com/schalkwyk/wateRmelon/blob/master/vignettes/references.bib). - If you use `preprocessingMinfiEwasWater()`, please cite @minfi; @minfiEPIC; @ENmix; @wateRmelon; @ewastools; @SWAN; @funnorm; @noob; @quantile. - If you use `svaEnmix()`, please cite @minfi; @ENmix. # Required knowledge `r Biocpkg("dnaEPICO")` is built on core Bioconductor infrastructure for high-dimensional genomic data, with a focus on Illumina DNA methylation arrays. To read this vignette, we assume that you are already familiar with the general DNA methylation workflow. If not, we recommend first reading [this tutorial](https://paulyrp.github.io/2025-cpgpneurogenomics-workshop/tutorial.html), which provides a practical introduction to the main concepts and analysis steps. Preprocessing and quality control are performed using established Bioconductor tools, including `r Biocpkg("minfi")`, `r Biocpkg("ENmix")`, and `r Biocpkg("wateRmelon")`. Users are expected to have basic familiarity with R, Bioconductor workflows, and Illumina IDAT file structures. If you are asking yourself the question "Where do I start using Bioconductor?", you might be interested in [this post](https://www.bioconductor.org/install/). # Installation The `r Biocpkg("dnaEPICO")` package depends on `r Biocpkg("minfi")`, `r Biocpkg("ENmix")`, and `r Biocpkg("wateRmelon")`, among others. Install dependencies with BiocManager: ```{r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("dnaEPICO") BiocManager::valid() ``` ```{r, eval = FALSE, message = FALSE} library("dnaEPICO") ``` # Step 1: preprocessingMinfiEwasWater The preprocessing wrapper reads a phenotype table and IDAT files, performs quality control, normalisation, filtering, and cell-composition estimation, and returns a list-based result object. The next chunk runs the function on a small example input and prints the class, element names, and a preview of the filtered phenotype table. ### Create example input files The next chunk shows how to recreate the temporary files used in this example. It writes a small phenotype table to a temporary directory, copies a small set of IDAT files, and records the cross-reactive probe reference used during probe filtering. ```{r preprocessing-local-inputs, eval = can_run_preprocessing, message = FALSE, warning = FALSE} preprocessing_inputs <- dnaEPICO:::exampleMinfiIdatInputsDnaEpico(n = 6L) names(preprocessing_inputs) preprocessing_inputs$tempDir basename(preprocessing_inputs$phenoFile) head(basename(list.files(preprocessing_inputs$idatFolder, full.names = TRUE)), 4) basename(preprocessing_inputs$crossReactivePath) ``` `preprocessing_inputs` is a small list returned by the internal example helper. `names(preprocessing_inputs)` shows its main elements: `tempDir` is the temporary working directory, `idatFolder` contains the copied example IDAT files, `phenoFile` is the phenotype table used by the wrapper, `targets` is the same phenotype information already loaded into memory, `arrayType` and `annotationVersion` describe the array platform, and `crossReactivePath` points to the cross-reactive probe reference used later during filtering. ```{r preprocessing-local, eval = can_run_preprocessing, message = FALSE, warning = FALSE} preprocessResult <- preprocessingMinfiEwasWater( phenoFile = preprocessing_inputs$phenoFile, idatFolder = preprocessing_inputs$idatFolder, outputLogs = file.path(preprocessing_inputs$tempDir, "logs"), nSamples = 6, SampleID = "Sample_Name", arrayType = preprocessing_inputs$arrayType, annotationVersion = preprocessing_inputs$annotationVersion, scriptLabel = "preprocessingMinfiEwasWater", baseDataFolder = file.path(preprocessing_inputs$tempDir, "rData"), figureBaseDir = file.path(preprocessing_inputs$tempDir, "figures"), detPThreshold = 1, normMethods = "quantile", sexColumn = "Sex", pvalThreshold = 1, chrToRemove = "", snpsToRemove = "SBE", mafThreshold = 1, crossReactivePath = preprocessing_inputs$crossReactivePath, plotGroupVar = "Sex", lcRef = "saliva", phenoOrder = "Sample_Name;Sex;Basename;Sentrix_ID;Sentrix_Position", lcPhenoDir = preprocessing_inputs$tempDir, display = FALSE, verbose = FALSE, logs = FALSE, saveOutputs = FALSE ) class(preprocessResult) names(preprocessResult) head(preprocessResult$targets[, c( "Sample_Name", "Sex", "Sentrix_ID", "Sentrix_Position" )]) ``` The returned object has class `dnaEPICO_preprocessingMinfiEwasWater`. `names(preprocessResult)` lists the main elements returned by the wrapper. The `targets` element is the filtered phenotype table aligned to the retained samples. The remaining elements are nested result objects for the major preprocessing stages, including the `RGSet`, the methylation metrics, and the cell-composition results. Arguments: - `phenoFile` and `idatFolder` define the phenotype table and the directory of raw IDAT pairs. - `nSamples = 6` keeps the example small enough for interactive use. - `SampleID = "Sample_Name"` identifies the column used to align phenotype data and methylation objects. - `arrayType` and `annotationVersion` must match the array platform being processed. - `detPThreshold`, `normMethods`, `pvalThreshold`, `chrToRemove`, `snpsToRemove`, and `mafThreshold` control quality filtering and probe filtering. - `crossReactivePath` supplies the cross-reactive probe reference used during filtering. - `sexColumn`, `plotGroupVar`, `lcRef`, and `phenoOrder` control sex prediction, quality-control grouping, cell-type estimation, and the structure of the returned phenotype table. - `display = FALSE`, `verbose = FALSE`, `logs = FALSE`, and `saveOutputs = FALSE` keep the example quiet, in memory, and free of file outputs. The next chunk inspects three important elements: the filtered `RGSet`, the metric matrices, and the phenotype table augmented with cell-composition estimates. ```{r preprocessing-local-inspect, eval = can_run_preprocessing, message = FALSE, warning = FALSE} class(preprocessResult$RGSet) names(preprocessResult$metricsData) head(preprocessResult$lcData$phenoLC[, 1:6]) ``` In local use, these are the most useful objects to inspect first. `preprocessResult$RGSet` is the core methylation object. `metricsData` contains the beta-value, M-value, and copy-number matrices used in downstream steps. `lcData$phenoLC` is the phenotype table augmented with estimated cell composition and is the object passed forward to later workflow stages. # Step 2: svaEnmix The SVA wrapper loads a phenotype table and a saved `RGSet`, estimates surrogate variables from ENmix control probes, and returns both the surrogate variables and the merged phenotype table. ### Create example input files The next chunk shows how to recreate the temporary files consumed by `svaEnmix()`: a phenotype CSV file and a saved `RGSet` object in `.RData` format. ```{r sva-local-inputs, eval = can_run_sva, message = FALSE, warning = FALSE} sva_inputs <- dnaEPICO:::exampleMinfiBaseDataDnaEpico() sva_temp_dir <- tempdir() sva_targets_file <- file.path(sva_temp_dir, "sva_targets.csv") sva_rgset_file <- file.path(sva_temp_dir, "sva_RGSet.RData") names(sva_inputs) class(sva_inputs$RGSet) utils::write.csv(sva_inputs$targets, sva_targets_file, row.names = FALSE) RGSet <- sva_inputs$RGSet save(RGSet, file = sva_rgset_file) basename(c(sva_targets_file, sva_rgset_file)) file.exists(c(sva_targets_file, sva_rgset_file)) ``` `sva_inputs` is another helper list used to make the vignette self-contained. `names(sva_inputs)` shows that it provides the in-memory `RGSet`, the matching `targets` table, and a `crossReactivePath` reference. In this step, the wrapper needs file inputs rather than the in-memory objects, so the chunk writes `targets` to `sva_targets.csv` and saves the `RGSet` object to `sva_RGSet.RData`. ```{r sva-local, eval = can_run_sva, message = FALSE, warning = FALSE} svaResult <- svaEnmix( phenoFile = sva_targets_file, rgsetData = sva_rgset_file, outputLogs = file.path(tempdir(), "logs"), SampleID = "Sample_Name", arrayType = "IlluminaHumanMethylation450k", annotationVersion = "ilmn12.hg19", SentrixIDColumn = "Sentrix_ID", SentrixPositionColumn = "Sentrix_Position", ctrlSvaPercVar = 0.90, ctrlSvaFlag = 1, scriptLabel = "svaEnmix", display = FALSE, verbose = FALSE, logs = FALSE, saveOutputs = FALSE ) class(svaResult) names(svaResult) dim(svaResult$svaData$sva) head(svaResult$mergedPheno[, 1:min(6, ncol(svaResult$mergedPheno))]) ``` The returned object has class `dnaEPICO_svaEnmix`, and `names(svaResult)` shows the main SVA components returned by the wrapper. The `svaData$sva` matrix contains one row per sample and one column per surrogate variable. The `mergedPheno` table is often the most convenient object to inspect because it combines the original phenotype data and the new SVA columns in one place. Arguments: - `phenoFile` points to the phenotype table that already includes the cell composition estimates from preprocessing. - `rgsetData` points to the saved `RGSet` used to derive ENmix control-probe surrogate variables. - `SampleID`, `SentrixIDColumn`, and `SentrixPositionColumn` identify the sample and array-position columns used in the SVA analysis. - `ctrlSvaPercVar = 0.90` asks the procedure to keep enough control-derived surrogate variables to explain 90% of the control-probe variance. - `ctrlSvaFlag = 1` enables the control-based SVA workflow. - `display = FALSE`, `verbose = FALSE`, `logs = FALSE`, and `saveOutputs = FALSE` keep the example interactive and in-memory. The association analysis between the surrogate variables and array-position metadata is stored separately in `analysisData`. ```{r sva-local-inspect, eval = can_run_sva, message = FALSE, warning = FALSE} class(svaResult$analysisData) names(svaResult$analysisData) ``` This object contains the fitted models and ANOVA summaries used to assess how the surrogate variables relate to Sentrix chip and position effects. # Step 3: preprocessingPheno The phenotype-preparation wrapper aligns the phenotype table with the beta, M-value, and copy-number matrices; splits the data by timepoint; constructs a combined longitudinal object; and prepares Clock Foundation export tables. ### Create example input files The next chunk shows how to recreate the temporary phenotype and matrix files used in this example. The helper writes a phenotype table plus aligned beta, M-value, and copy-number objects to a temporary directory. ```{r preprocessingPheno-local-inputs, message = FALSE, warning = FALSE} pheno_inputs <- dnaEPICO:::examplePreprocessingPhenoStateDnaEpico() names(pheno_inputs) pheno_inputs$tempDir basename(c( pheno_inputs$phenoPath, pheno_inputs$betaPath, pheno_inputs$mPath, pheno_inputs$cnPath )) ``` `pheno_inputs` is a list that bundles both the temporary files and the corresponding in-memory objects used in this stage. `names(pheno_inputs)` shows that it contains the temporary directory, the phenotype table, the three saved matrix files (`phenoPath`, `betaPath`, `mPath`, and `cnPath`), and precomputed helper objects such as `metricsData`, `timepointData`, `combinedData`, and `clockFoundation`. ```{r preprocessingPheno-local, message = FALSE, warning = FALSE} phenoResult <- preprocessingPheno( phenoFile = pheno_inputs$phenoPath, betaPath = pheno_inputs$betaPath, mPath = pheno_inputs$mPath, cnPath = pheno_inputs$cnPath, SampleID = "Sample_Name", timeVar = "Timepoint", timepoints = "1,2", combineTimepoints = "1,2", outputPheno = file.path(pheno_inputs$tempDir, "data", "preprocessingPheno"), outputRData = file.path( pheno_inputs$tempDir, "rData", "preprocessingPheno", "metrics" ), outputRDataMerge = file.path( pheno_inputs$tempDir, "rData", "preprocessingPheno", "mergeData" ), sexColumn = "Sex", outputLogs = file.path(pheno_inputs$tempDir, "logs"), outputDir = file.path(pheno_inputs$tempDir, "clockFoundation"), verbose = FALSE, logs = FALSE, saveOutputs = FALSE ) class(phenoResult) names(phenoResult) names(phenoResult$combinedData) head(phenoResult$combinedData$phenoBeta[, 1:6]) ``` The returned object has class `dnaEPICO_preprocessingPheno`, and `names(phenoResult)` shows the main returned components. The `combinedData` element groups the combined longitudinal objects, and `phenoResult$combinedData$phenoBeta` is the key downstream table for modeling. It combines phenotype columns and CpG beta values in a single data frame that can be passed to the GLM or GLMM wrappers. Arguments: - `phenoFile`, `betaPath`, `mPath`, and `cnPath` define the phenotype table and the three methylation matrices that are aligned in this step. - `SampleID = "Sample_Name"` is the key used to match samples across the loaded objects. - `timeVar = "Timepoint"`, `timepoints = "1,2"`, and `combineTimepoints = "1,2"` define the separate exports and the combined longitudinal object. - `outputPheno`, `outputRData`, `outputRDataMerge`, and `outputDir` still need to be specified even when `saveOutputs = FALSE`, because they define the canonical output structure for the wrapper. - `sexColumn = "Sex"` identifies the sex variable used by the helper functions. - `verbose = FALSE`, `logs = FALSE`, and `saveOutputs = FALSE` keep the example focused on the returned object rather than file writing. The Clock Foundation export objects are returned in the same result, which makes it possible to inspect the export-ready tables before deciding whether they should be used in a reporting or export workflow. ```{r preprocessingPheno-local-inspect, message = FALSE, warning = FALSE} names(phenoResult$clockFoundation) head(phenoResult$clockFoundation$phenoCF) ``` The `clockFoundation` element contains the export-ready objects, and `phenoResult$clockFoundation$phenoCF` is the phenotype table prepared for [Clock Foundation](https://dnamage.clockfoundation.org/user) use. # Summary In interactive local use, the most important habit is to inspect the returned objects after each step. In practice, the following checks are usually enough to confirm that the workflow is behaving as expected: - `class(result)` to confirm which wrapper produced the object - `names(result)` to see the main return elements - `head()` on the phenotype-like tables - `class()` or `dim()` on the methylation objects and matrices The next vignette shows how to use `saveOutputs = TRUE` and the Makefile workflow. The file-based vignette also describes the regression and mixed-effects model structures used downstream, including how phenotype-specific PRS terms are added through `prsMap`. # Basics Date the vignette was generated. ```{r, echo = FALSE} Sys.time() ``` Wallclock time spent generating the vignette. ```{r, echo = FALSE} totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r, echo = FALSE} sessionInfo() ``` ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve, so it is critical to learn where to ask for help. We would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help. Please remember to use the `dnaEPICO` tag and check the [older posts](https://support.bioconductor.org/tag/dnaEPICO/). If you want to receive help, please provide a small reproducible example and your session information so the source of the problem can be tracked efficiently. # References