--- title: "Package design vignette for {readepi}" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Package Design vignette for {readepi}} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{css, echo = FALSE} .section { opacity: 1; } ``` ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Concept and motivation ::: {style="text-align: justify;"} This document outlines the design decisions guiding the development strategies of the `{readepi}` R package, the reasoning behind them, as well as the possible pros and cons of each decision. Importing data from various sources into the R environment is the first step in the workflow of outbreak analysis. Health data are often stored in individual files of different formats, in relational database management systems (RDBMS), and more importantly, many health organizations store their data in health information systems (HIS) that are wrapped under hood of a specific Application Programming Interfaces (APIs). Many R packages have been developed over the years to read data stored in a file or in a directory containing multiple files. We recommend the [{rio}](http://gesistsa.github.io/rio/) package for importing data that are relatively small in size and the [{data.table}](https://CRAN.R-project.org/package=data.table) package for large files. For retrieving data from RDBMS, we recommend the [{DBI}](https://dbi.r-dbi.org/) package. There are several R packages for reading data from HIS such as {fingertipsR}, {REDCapR}, {godataR}, and {globaldothealth}, which are used to fetch data from [Fingertips](https://fingertips.phe.org.uk/), [REDCap](https://projectredcap.org/software/), [Go.Data](https://www.who.int/tools/godata), and [Global.Health](https://global.health/) respectively. However, these packages are usually designed to read from specific HIS and can't be used to query others. This increases the dependency on many other packages and introduces the challenge of having a unified framework for importing data from multiple HIS. As such, we propose `{readepi}`, a centralized tool that will provide users with the capability of importing data from various HIS and RDBMS. `{readepi}` aims at importing data from several potential sources in the same way. The data sources include distributed health information systems and public databases as shown in the figure below. ::: ![readepi roadmap](../man/figures/roadmap_readepi.drawio.svg) ## Scope ::: {style="text-align: justify;"} The `{readepi}` package is designed to import data from two common sources of institutional health-related data: HIS wrapped with specific APIs and RDBMS that run on specific servers. To import data from these sources, users must have read access and provide the relevant query parameters to fetch the target data. The current version of `{readepi}` supports importing data from: - **HISs**: [DHIS2](https://dhis2.org/) and [SORMAS](https://sormas.org/), - **RDBMS**: MS SQL, SQLite, MySQL, and PostgreSQL. In next releases, we plan to include features for reading data from additional HISs like GoData, Globaldothealth, and ODK, as well as RDBMS such as MS Access. ::: ![Diagram of current functions available in {readepi}](../man/figures/readepi_design_diagram_v-0.1.0.drawio.svg) ## Output ::: {style="text-align: justify;"} The main functions of the {readepi} package return a `data frame` object that contains the data fetched from the target source with the specified request parameters. The `login()` function returns a connection object that is used in the subsequent queries. ::: ## Design decisions ::: {style="text-align: justify;"} The aim of {readepi} is to simplify and standardize the process of fetching data from APIs and servers. We strive to make this easy for users by limiting the number of required arguments to access and retrieve the data of interest from the target source. As such, the package is structured around few main functions: `read_dhis2()`, `read_sormas()`, and `read_rdbms()`; and one auxiliary functions (`login()`). ### Authentication The `login()` function is used to establish connection with the data source. It verifies the user's identity and determines if they are authorized to access the requested database or API. Establishing this connection is crucial for ensuring successful data import. However, the basic authentication does not work for SORMAS. To maintain the design of the package across all HIS, the login function returns a object when importing data from SORMAS. Once authentication credentials are provided, they are securely stored within the connection object. This prevents the need to re-supply them for subsequent requests in other functions. The Figure below lists the arguments needed to call the `login()` function. ::: ```{r login, fig.align='center', echo=FALSE} DiagrammeR::grViz("digraph{ graph[rankdir = LR] node[shape = rectangle, style = filled, fontname = Courier, align = center] subgraph cluster_0 { graph[shape = rectangle] style = rounded bgcolor = LightBlue label = 'Arguments for login() function' node[shape = rectangle, fillcolor = LemonChiffon, margin = 0.25] A[label = 'type: the source name'] B[label = 'from: the URL, or IP address, or hostname'] C[label = 'user_name: the user name'] D[label = 'password: the password or API token or key'] E[label = 'db_name: the database name (RDBMS only)'] F[label = 'port: the port id (RDBMS only)'] G[label = 'driver_name: the driver name (RDBMS only)'] } }") ``` ::: {style="text-align: justify;"} The `type` argument refers to the name of the data source of interest. The current version of the package covers the following types: ``` i) RDBMS: “ms sql”, “mysql”, “postgresql”, “sqlite” ii) APIs: “dhis2”, “sormas” ``` ::: ### Data import You can use one of the functions below depending on the data source. - `read_rdbms()`: for importing data from RDBMS. It takes the following arguments: - **login**: A `Pool` object obtained from the `login()` function - **query**: A string with an SQL query or a list of parameters. When the query parameters are provided as a list, they will be used to form the appropriate SQL query internally. The resulting SQL query will then be executed by the `read_rdbms()`. - `read_dhis2()`: for importing data from DHIS2. This function expect the following arguments: - **login**: A `httr2_response` object returned by the `login()` function - **org_unit**: A character with the organisation unit ID or name - **program**: A character with the program ID or name - `read_sormas()`: for importing data from SORMAS. It takes the following arguments: - **login**: A `list` object returned by the `login()` - **disease**: A character vector with the names of the diseases of interest. Users can get the list of all diseases which data is available on their SORMAS system using the `sormas_get_diseases()` function. - **since**: A Date value in ISO8601 format (YYYY-mm-dd). ::: {style="text-align: justify;"} Note that, when reading from RDBMS, the `query` argument could be an [SQL query]{.underline} or a [list with a vector of table names, fields and rows to subset on]{.underline}. For HIS, we strongly recommend reading the vignette on the [query_parameters](./query_parameters.Rmd) for more details about the request parameters that are supported in the current version of the package. ::: ## Dependencies - The main and internal functions of the package rely primarily on three packages: - [{httr2}](https://CRAN.R-project.org/package=httr2) or [{data.table}](https://CRAN.R-project.org/package=data.table): These are used to construct and execute API requests. - [{dplyr}](https://CRAN.R-project.org/package=dplyr): Utilized for its data manipulation capabilities. - The `read_rdbms()` function depends on the following packages: - [{DBI}](https://CRAN.R-project.org/package=DBI): Used for database connectivity and querying. - [{pool}](https://CRAN.R-project.org/package=pool): Provides functionality for managing multiple database connections. - [{odbc}](https://CRAN.R-project.org/package=odbc): Supplies drivers required for accessing various DBMS. - [{RMySQL}](https://CRAN.R-project.org/package=RMySQL): Specifically used for MySQL database connectivity. These functions also require system dependencies for OS-X and Linux systems, detailed in the [install drivers vignette](./install_drivers.Rmd) vignette. Additionally, the development of the package necessitates the inclusion of other required packages: - [{checkmate}](https://CRAN.R-project.org/package=checkmate) - [{httptest2}](https://CRAN.R-project.org/package=httptest2) - [{bookdown}](https://CRAN.R-project.org/package=bookdown) - [{rmarkdown}](https://CRAN.R-project.org/package=rmarkdown) - [{testthat}](https://CRAN.R-project.org/package=testthat) (\>= 3.0.0) - [{knitr}](https://CRAN.R-project.org/package=knitr) - [{cli}](https://CRAN.R-project.org/package=cli) - [{DiagrammeR}](https://CRAN.R-project.org/package=DiagrammeR) - [{cyclocomp}](https://CRAN.R-project.org/package=cyclocomp) ## Contribute There are no special requirements to contributing to {readepi}, please follow the [package contributing guide](https://github.com/epiverse-trace/.github/blob/main/CONTRIBUTING.md).