---
title: "NBDCtools"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{NBDCtools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
Sys.setenv("_R_CHECK_CRAN_INCOMING_" = "true") # always build vignette like on CRAN
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Background

The `NBDCtools` R package makes use of the regular structure of NBDC datasets, especially standardized metadata (data dictionary and levels table; see, e.g., [here](https://docs.abcdstudy.org/latest/documentation/curation/metadata.html)) and the organization of tabulated data as one file per table in the BIDS `rawdata/phenotype/` directory (see [here](https://docs.abcdstudy.org/latest/documentation/curation/structure.html#rawdata) for information about the structure of the ABCD file-based data, and [here](https://docs.hbcdstudy.org/latest/datacuration/phenotypes/) for the HBCD study).

The package assumes that users have downloaded the complete tabulated dataset as file-based data and saved the files in a local directory. Using functions from the package, users can then create custom datasets by specifying the study name and any set of variable names and/or table names in its data dictionary. By making use of the study’s metadata, the functions automatically retrieve the needed columns from different files on disk and join them into a data frame in memory. This provides a fast, storage- and memory-efficient, and highly reproducible way to work with data from the NBDC Data Hub that can be used as an alternative to creating and downloading different datasets (and creating on-disk representations for each of them) through the [Data Exploration & Analysis Portal (DEAP)](https://nbdc.deapscience.com) or the [NBDC Data Access Platform](https://nbdc-datashare.lassoinformatics.com).

### Download data using DEAP

To download data from the NBDC Data Hub in the format required by the `NBDCtools` package, follow these steps:

1. Log in to the [DEAP](https://nbdc.deapscience.com) application and select the `My datasets` tab.
1. At the bottom of the page, click on `Pre-assembled datasets`.
1. In the pop-up window, select the `All tables` option.
1. Click on the `Download tables` button to download the data files.
1. Unzip the downloaded file to a local directory and remember the path to this directory, as you will need it to load the data using the `NBDCtools` package.

![](img/get_data.png){width=100%}

## Getting started

The most essential and frequently used function in the `NBDCtools` package is `create_dataset()`. This omnibus function loads selected variables from files and creates an analysis-ready data frame in one step, incorporating various transformation and cleaning options.

In this vignette, we will demonstrate the use of the `create_dataset()` function with simulated ABCD data files. We will illustrate how to join variables, perform various transformations, and explore some advanced options.

## Setup

> **IMPORTANT:** Please ensure that both the `NBDCtools` and `NBDCtoolsData` packages are installed. When `NBDCtools` is loaded, it will automatically load the required objects from the `NBDCtoolsData` package, so you don't need to load it separately.

To load `NBDCtools`, use the following command:

```{r setup}
library(NBDCtools)
```

Alternatively, you can call functions directly without loading the package by using `::`, e.g., `NBDCtools::name_of_function(...)`. You can also access `NBDCtoolsData` objects directly using the `::` syntax.

## Load and join data

We can use the following command to inspect the simulated data files:

```{r}
dir_abcd <- system.file("extdata", "phenotype", package = "NBDCtools")
list.files(dir_abcd)
```

Next, we will use the `create_dataset()` function to load data from the files in `dir_abcd` with selected variables of interest.

```{r}
vars <- c(
  "ab_g_dyn__visit_type",
  "ab_g_dyn__cohort_grade",
  "ab_g_dyn__visit__day1_dt",
  "ab_g_stc__gen_pc__01",
  "ab_g_dyn__visit_age",
  "ab_g_dyn__visit_days",
  "ab_g_dyn__visit_dtt",
  "mr_y_qc__raw__dmri__r01__series_t"
)

create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars
)
```

> **NOTE:** The simulated data contains only a few variables and rows. A real dataset will typically have many more rows, variables, and tables.

Users can select which variables to join using the following four arguments:

- `vars`: Individual variables of interest
- `tables`: Full tables of interest
- `vars_add`: Additional individual variables
- `tables_add`: Additional full tables

Columns of interest specified by the `vars` and `tables` arguments are full-joined, meaning the resulting data frame retains all rows with at least one non-missing value in the selected variables/tables. Additional columns specified by the `vars_add` and `tables_add` arguments are left-joined to the data frame containing the columns of interest, retaining all of its rows and adding the columns from the additional variables/tables.

The `create_dataset()` function utilizes the low-level function `join_tabulated()` for data joining. For more information about the `join_tabulated()` function, refer to the [Join data](https://software.nbdc-datahub.org/NBDCtools/articles/join.html) vignette. For a diagram detailing the joining strategy for main and additional variables/tables, see [this page](https://docs.deapscience.com/create_edit/create.html#joining) (the `NBDCtools` package uses the same approach as the [DEAP](https://nbdc.deapscience.com) application).

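
The difference between the two kinds of join can be illustrated with a minimal base R sketch on toy data frames. This is not how `join_tabulated()` is implemented; the tables, the variable names, and the key columns `participant_id` and `session_id` are assumptions made up for this illustration.

```r
# Hypothetical identifier columns shared by all toy tables (assumed names).
keys <- c("participant_id", "session_id")

# Two toy tables holding "columns of interest".
tbl_a <- data.frame(
  participant_id = c("p1", "p1", "p2"),
  session_id     = c("ses-00", "ses-01", "ses-00"),
  var_a          = c(1, 2, 3)
)
tbl_b <- data.frame(
  participant_id = c("p1", "p3"),
  session_id     = c("ses-00", "ses-00"),
  var_b          = c("x", "y")
)

# A toy table holding an "additional" column.
tbl_add <- data.frame(
  participant_id = c("p1", "p4"),
  session_id     = c("ses-00", "ses-00"),
  var_add        = c(TRUE, FALSE)
)

# Columns of interest: a full join keeps every row found in either table.
main <- merge(tbl_a, tbl_b, by = keys, all = TRUE)

# Additional columns: a left join keeps the rows of `main` and only adds columns,
# so the row for "p4" (present only in `tbl_add`) does not appear in the result.
result <- merge(main, tbl_add, by = keys, all.x = TRUE)

nrow(main)
nrow(result)
```

In this sketch, moving a table from the full-join step to the left-join step changes which rows end up in the result, which is the same effect as moving a variable from `vars` to `vars_add`.
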
For example, if we only specify the `mr_y_qc__raw__dmri__r01__series_t` variable in `vars` and move the others to `vars_add`, we will get a different number of rows in the data:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = c(
    "mr_y_qc__raw__dmri__r01__series_t"
  ),
  vars_add = c(
    "ab_g_dyn__visit_type",
    "ab_g_dyn__cohort_grade",
    "ab_g_dyn__visit__day1_dt",
    "ab_g_stc__gen_pc__01",
    "ab_g_dyn__visit_age",
    "ab_g_dyn__visit_days",
    "ab_g_dyn__visit_dtt"
  )
)
```

## Process data

After loading and joining the data, the `create_dataset()` function performs several transformation steps. Each step is reported with an informational (`ℹ`) message in the console, allowing users to see which actions are being taken. For example, the output above indicates that the function has executed the following steps:

```
#> ℹ Converting categorical variables to factors.
#> ℹ Adding variable and value labels.
```

These steps utilize lower-level functions that can be used independently. The [Transform data](https://software.nbdc-datahub.org/NBDCtools/articles/transformation.html) vignette describes how to do so.

### Default transformations

By default, `create_dataset()` performs the following two transformation steps (users can skip them by setting the respective arguments to `FALSE`):

- `categ_to_factor`: Converts categorical columns to factors using the lower-level function `transf_factor()`.
- `add_labels`: Adds variable and value labels using the lower-level function `transf_label()`.

### Additional transformations

Users can also apply additional transformations to the data by setting the respective arguments to `TRUE`. The following transformations are available:

- `value_to_label`: Converts categorical columns' numeric values to labels using the lower-level function `transf_value_to_label()`.
- `value_to_na`: Converts categorical missingness/non-response codes to `NA` using the lower-level function `transf_value_to_na()`.
- `time_to_hms`: Converts time variables to `hms` class using the lower-level function `transf_time_to_hms()`.

Here is an example of adding these additional transformations to the `create_dataset()` call:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_label = TRUE,
  value_to_na = TRUE,
  time_to_hms = TRUE
)
```

### Shadow matrices

The `create_dataset()` function also includes the option to process shadow matrices. Shadow matrices are tables with the same dimensions as the original data that provide information about why a given cell is missing in the original data. Using the `bind_shadow = TRUE` argument, users can append the shadow matrix as additional columns to the end of the data frame.

```{r eval=FALSE}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  bind_shadow = TRUE
)
```

Please note that shadow matrices are processed differently for ABCD and HBCD study datasets:

- **ABCD:** Currently, no raw shadow matrix data is being released. As such, `create_dataset()` will create a shadow matrix from the data using `naniar::as_shadow()` if `bind_shadow` is set to `TRUE`.
- **HBCD:** The shadow matrix is provided as a separate file in the `rawdata/phenotype/` directory. By default, the `create_dataset()` function will read it from the file and append it to the data frame if `bind_shadow` is set to `TRUE`. Users can set the additional argument `naniar_shadow = TRUE` if they prefer the shadow matrix to be created from the data using `naniar::as_shadow()` instead:

```{r eval=FALSE}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  bind_shadow = TRUE,
  naniar_shadow = TRUE
)
```

> **IMPORTANT:** `naniar::as_shadow()` requires the `naniar` package to be installed, which is not a dependency of `NBDCtools`. If you want to use this option, please install the `naniar` package first using `install.packages("naniar")`.

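
To make the idea of a shadow matrix concrete, here is a minimal base R sketch that builds a simple missingness shadow for a toy data frame and appends it to the data. It is a simplified illustration, not the code used by `create_dataset()`; the toy column names are made up, and the `_NA` suffix and the `"!NA"`/`"NA"` levels are only loosely modeled on the `naniar` convention. A shadow matrix released with a study can carry more detailed reasons for missingness than this binary flag.

```r
# Toy data with missing cells (hypothetical column names).
dat <- data.frame(
  age   = c(10, NA, 12),
  score = c(NA, NA, 3)
)

# Shadow: same dimensions as `dat`; each cell flags whether the original is missing.
shadow <- as.data.frame(lapply(dat, function(x) {
  factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(dat), "_NA")

# Append the shadow columns to the end of the data frame.
dat_shadow <- cbind(dat, shadow)

names(dat_shadow)
```
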
For more information about shadow matrices, please refer to the [Work with shadow matrices](https://software.nbdc-datahub.org/NBDCtools/articles/shadow.html) vignette.

## Advanced options

The `create_dataset()` function calls several other low-level functions to process the data. Some of these low-level functions have additional arguments that can be used to customize the processing. To use these arguments, users can pass them to the `create_dataset()` function using the `...` argument.

For example, if we select `value_to_na = TRUE`, the function will call the lower-level `transf_value_to_na()` function, which will convert factor levels that represent missingness/non-response codes to `NA`. This is useful when the data contains specific codes that indicate missingness, as in the ABCD study, where `"222"`, `"333"`, `"444"`, etc. are used consistently (see [here](https://docs.abcdstudy.org/latest/documentation/curation/standards.html#non-responsemissingness-codes) for more details).

One can change the non-response/missingness codes that should be converted to `NA` by passing the `missing_codes` argument to the `create_dataset()` function. For example, to convert the levels `1` and `2` to `NA` (this is typically not advisable in a real-world scenario), we can call the function as follows:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_na = TRUE,
  missing_codes = c("1", "2")
)
```

First, `create_dataset()` prints a message indicating which additional arguments are passed to the low-level functions:

```r
#> ℹ Argument `missing_codes` is passed to `transf_value_to_na()`.
```

In the results, we can see that in column `ab_g_dyn__visit_type`, the levels `1` and `2` are converted to `NA`, while the other values are kept as is.

If the user passes incorrect or nonexistent arguments, they will be ignored. For example, if we pass an additional argument `my_arg` to the `create_dataset()` function, it will be ignored and the returned data will be the same as if we had not passed this argument at all:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_na = TRUE,
  my_arg = "some_value" # this argument will be ignored
)
```

For more information about the available arguments and their usage, please refer to the documentation of the lower-level functions on the [Reference](https://software.nbdc-datahub.org/NBDCtools/reference/index.html) page.

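
The behavior described in this section, forwarding known arguments and ignoring unknown ones, can be sketched in base R as follows. This is only an illustration of the general pattern and not the actual implementation of `create_dataset()`; the functions `to_na()` and `wrapper()` are hypothetical and exist only in this example.

```r
# Hypothetical low-level function with an optional argument.
to_na <- function(x, missing_codes = c("222", "333")) {
  x[x %in% missing_codes] <- NA
  x
}

# Hypothetical wrapper: forwards only those `...` arguments that `to_na()` accepts
# and silently drops everything else.
wrapper <- function(x, ...) {
  dots <- list(...)
  known <- dots[names(dots) %in% names(formals(to_na))]
  do.call(to_na, c(list(x), known))
}

res_default <- wrapper(c("1", "222", "3"))                        # default codes
res_custom  <- wrapper(c("1", "222", "3"), missing_codes = "1")   # forwarded
res_ignored <- wrapper(c("1", "222", "3"), my_arg = "some_value") # ignored
```

Here `res_ignored` is identical to `res_default`, because `my_arg` is not an argument of `to_na()` and is therefore dropped before the call.
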