---
title: "NBDCtools"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{NBDCtools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
Sys.setenv("_R_CHECK_CRAN_INCOMING_" = "true") # always build vignette like on CRAN
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Background

The `NBDCtools` R package makes use of the regular structure of NBDC datasets, especially
standardized metadata (data dictionary and levels table; see, e.g.,
[here](https://docs.abcdstudy.org/latest/documentation/curation/metadata.html))
and the organization of tabulated data as one file per table in the BIDS
`rawdata/phenotype/` directory
(see [here](https://docs.abcdstudy.org/latest/documentation/curation/structure.html#rawdata)
for information about the structure of the ABCD file-based data, and
[here](https://docs.hbcdstudy.org/latest/datacuration/phenotypes/)
for the HBCD study).

The package assumes that users downloaded the complete tabulated
dataset as file-based data and saved the files in a local directory. Using
functions from the package, users can then create custom datasets by specifying
the study name and any set of variable names and/or table names in its data
dictionary.

By making use of the study’s metadata, the functions automatically
retrieve the needed columns from different files on disk, and join them to a
data frame in memory. This provides a fast, storage- and memory-efficient, and
highly reproducible way to work with data from the NBDC Data Hub that can be
used as an alternative to creating and downloading different datasets (and
creating on-disk representations for each of them) through the
[Data Exploration & Analysis Portal (DEAP)](https://nbdc.deapscience.com) or
the [NBDC Data Access Platform](https://nbdc-datashare.lassoinformatics.com).

### Download data using DEAP

To download data from the NBDC Data Hub in the format that is required by the
`NBDCtools` package, follow the following steps:

1. Log in to the [DEAP](https://nbdc.deapscience.com) application and select
   the `My datasets` tab.
1. On the bottom of the page, click on `Pre-assembled datasets`.
1. In the pop-up window, select the `All tables` option.
1. Click on the `Download tables` button to download the data files.
1. Unzip the downloaded file to a local directory and remember the path to this
   directory, as you will need it to load the data using the `NBDCtools`
   package.

![](img/get_data.png){width=100%}

## Getting started

To begin using the `NBDCtools` package effectively, the most essential and
frequently utilized function is `create_dataset()`. This omnibus function loads
selected variables from files and creates an analysis-ready data frame in one
step, incorporating various transformation and cleaning options.

In this vignette, we will demonstrate the use of the `create_dataset()` function
with simulated ABCD data files. We will illustrate how to join variables,
perform various transformations, and explore some advanced options.

## Setup

> **IMPORTANT:** Please ensure that the both the `NBDCtools` and `NBDCtoolsData`
packages are installed. When `NBDCtools` is loaded, it will automatically load
the required objects from `NBDCtoolsData` package, so you don't need to load it
separately.

To load `NBDCtools`, use the following command:

```{r setup}
library(NBDCtools)
```

Alternatively, you can call functions directly without loading the package by
using `::`, e.g., `NBDCtools::name_of_function(...)`. You can also access
`NBDCtoolsData` objects directly using the colon-colon syntax.

## Load and join data

We can use the following command to inspect the simulated data files:

```{r}
dir_abcd <- system.file("extdata", "phenotype", package = "NBDCtools")
list.files(dir_abcd)
```

Next, we will use the `create_dataset()` function to load data from the files in
`dir_abcd` with selected variables of interest.

```{r}
vars <- c(
  "ab_g_dyn__visit_type", 
  "ab_g_dyn__cohort_grade", 
  "ab_g_dyn__visit__day1_dt", 
  "ab_g_stc__gen_pc__01", 
  "ab_g_dyn__visit_age", 
  "ab_g_dyn__visit_days", 
  "ab_g_dyn__visit_dtt", 
  "mr_y_qc__raw__dmri__r01__series_t"
)
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars
)
```

> **NOTE:** The simulated data contains only a few variables and rows. In a
real-world scenario, each file will typically have many more rows and tables.

Users can select which variables to join using the following four arguments:

-  `vars`: Individual variables of interest
-  `tables`: Full tables of interest
-  `vars_add`: Additional individual variables
-  `tables_add`: Additional full tables

Columns of interest specified by the `vars` and `tables` arguments are
full-joined, meaning the resulting data frame retains all rows with at least one
non-missing value in the selected variables/tables. Additional columns specified
by the `vars_add` and `tables_add` arguments are left-joined to the data frame
containing the columns of interest, retaining all rows and adding columns from
the additional variables/tables. The `create_dataset()` function utilizes the
low-level function `join_tabulated()` for data joining. For more information
about the `join_tabulated()` function, refer to the [Join data](https://software.nbdc-datahub.org/NBDCtools/articles/join.html)
vignette. For a diagram detailing the joining strategy for main and additional
variables/tables, see [this
page](https://docs.deapscience.com/create_edit/create.html#joining) (the
`NBCDtools` package uses the same approach as the
[DEAP](https://nbdc.deapscience.com) application).

For example, if we only specify the `mr_y_qc__raw__dmri` variable in `vars` and
move others to `vars_add`,  we will have different number of rows in the data:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = c(
    "mr_y_qc__raw__dmri__r01__series_t"
  ),
  vars_add = c(
    "ab_g_dyn__visit_type",
    "ab_g_dyn__cohort_grade", 
    "ab_g_dyn__visit__day1_dt",
    "ab_g_stc__gen_pc__01", 
    "ab_g_dyn__visit_age",
    "ab_g_dyn__visit_days", 
    "ab_g_dyn__visit_dtt"
  )
)
```


## Process data

After loading and joining the data, the `create_dataset()` function performs
several transformation steps. Each step is reported with an `i` message in the
console, allowing users to see which actions are being taken. For example, the
output indicates that the function has executed the following steps:

```
#> ℹ Converting categorical variables to factors.
#> ℹ Adding variable and value labels.
```

These steps utilize lower-level functions that can be used independently. The
[Transform data](https://software.nbdc-datahub.org/NBDCtools/articles/transformation.html) 
vignette describes how to do so.

### Default transformations

By default, `create_dataset()` performs the following two transformation steps
(users can choose to not execute them by setting the respective arguments to
`FALSE`):

- `categ_to_factor`: Converts categorical columns to factors using
  the lower-level function `transf_factor()`.
- `add_labels`: Adds variable and value labels using the lower-level
  function `transf_label()`.

### Additional transformations

Users can also apply additional transformations to the data by setting the
respective arguments to `TRUE`. The following transformations are available:

- `value_to_label`: Converts categorical columns' numeric values to labels
  using the lower-level function `transf_value_to_label()`.
- `value_to_na`: Converts categorical missingness/non-response codes to `NA`
  using the lower-level function `transf_value_to_na()`.
- `time_to_hms`: Converts time variables to `hms` class using the lower-level
  function `transf_time_to_hms()`.
  
Here is an example of adding these additional transformations to the
`create_dataset()` function:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_label = TRUE,
  value_to_na = TRUE,
  time_to_hms = TRUE
)
```

### Shadow matrices

The `create_dataset()` function also includes the option to process shadow
matrices. Shadow matrices are tables with the same dimensions as the original
data and provide information about why a given cell is missing in the original
data. Using the `bind_shadow = TRUE` argument, users can append the shadow
matrix as additional columns to the end of the data frame.

```{r eval=FALSE}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  bind_shadow = TRUE
)
```

Please note that shadow matrices are processed differently for ABCD and HBCD
study datasets:

- **ABCD:** Currently, no raw shadow matrix data is being released. As such,
  `create_dataset()` will create a shadow matrix from the data using
  `naniar::as_shadow()` if `bind_shadow` is set to `TRUE`.
- **HBCD:** The shadow matrix is provided as a separate file in the
  `rawdata/phenotype/` directory. The `create_dataset()` function will read it
  from the file and append it to the data frame by default if `bind_shadow` is
  set to `TRUE`. Users can use the additional argument `naniar_shadow = TRUE` if
  they prefer for the shadow matrix to be created from the data using
  `naniar::as_shadow()` instead:
    ```{r eval=FALSE}
    create_dataset(
      dir_data = dir_abcd,
      study = "abcd",
      vars = vars,
      bind_shadow = TRUE,
      naniar_shadow = TRUE
    )
    ```
    
> **IMPORTANT:** The `naniar::as_shadow()` requires the `naniar` package to be
installed, which is not a dependency of `NBDCtools`. If you want to use this
option, please install the `naniar` package first using
`install.packages("naniar")`.

For more information about shadow matrices, please refer to the
[Work with shadow matrices](https://software.nbdc-datahub.org/NBDCtools/articles/shadow.html) 
vignette.

## Advanced options

The `create_dataset()` function calls several other low-level functions to
process the data. Some of these low-level functions have additional arguments
that can be used to customize the processing. To use these arguments, users
can pass them to the `create_dataset()` function using the `...` argument.

For example, if we select `value_to_na = TRUE`, the function will
call the lower-level `transf_value_to_na()` function, which will convert
factor levels that represent missingness/non-response codes to `NA`.
This is useful when the data contains specific codes that indicate missingness
like in the ABCD study where `"222"`, `"333"`, `"444"`, etc. are used
consistently (see,
[here](https://docs.abcdstudy.org/latest/documentation/curation/standards.html#non-responsemissingness-codes)
for more details).

One can change the non-response/missingness codes that should be converted to
`NA` by passing the `missing_codes` argument to the `create_dataset()` function.
For example, if we want to convert the levels `1` and `2` to `NA` (this is
typically not advisable in a real-world scenario), we can do so by passing the
`missing_codes` argument to the `create_dataset()` function as follows:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_na = TRUE,
  missing_codes = c("1", "2")
)
```

First `create_dataset()` prints out the message that indicating which additional
arguments are passed to the low-level functions:

```r
#> ℹ Argument `missing_codes` is passed to `transf_value_to_na()`.
```

In the results, we can see that in column `ab_g_dyn__visit_type`, the levels `1`
and `2` are converted to `NA`, while the other values are kept as is.

If the user defines wrong or not existing arguments, they will be ignored. For
example, if we pass an additional argument `my_arg` to `create_dataset()`
function, it will be ignored and the returned data will be the same as if we did
not pass this argument at all:

```{r}
create_dataset(
  dir_data = dir_abcd,
  study = "abcd",
  vars = vars,
  value_to_na = TRUE,
  my_arg = "some_value" # this argument will be ignored
)
```

Please refer to the lower-level functions documentation for more information
about the available arguments and their usage on the
[Reference](https://software.nbdc-datahub.org/NBDCtools/reference/index.html)
page.