--- title: "Comparison with healthbR and microdatasus" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Comparison with healthbR and microdatasus} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE) ``` `datasusr` is one of several R packages that pull Brazilian public health data into the tidyverse. This article compares it with the two most relevant alternatives so that you can pick the right tool — and, in many cases, that tool will *not* be `datasusr`. - [`healthbR`](https://github.com/SidneyBissoli/healthbR) (Sidney Bissoli): modern, broad-scope toolkit covering DATASUS *and* IBGE surveys, primary-care indicators, ANS, and ANVISA, with module-specific helpers, variable dictionaries, parallel downloads, and Parquet caching. - [`microdatasus`](https://github.com/rfsaldanha/microdatasus) (Saldanha, Bastos & Barcellos, 2019): the original tidyverse-friendly DATASUS reader, with rich `process_*()` functions for variable labelling. Currently archived on CRAN. ## TL;DR — when to use what **Reach for `healthbR` first** when you want analysis-ready data and your question spans more than the DATASUS FTP. It covers VIGITEL, PNS, PNAD Contínua, POF, Censo, SI-PNI (microdata 2020+), SISAB, ANS, and ANVISA in addition to SIM/SINASC/SIH/SIA/SINAN/CNES, exposes a uniform `_data()` / `_variables()` / `_dictionary()` API, filters by ICD-10 prefix at the call site, parallelises downloads via `future`/`furrr`, and caches as Parquet via `arrow`. For most applied public-health analyses this is the most productive starting point. **Use `datasusr`** when you specifically need a small, fast, dependency-light reader for raw DBC files from the DATASUS FTP, a persistent on-disk cache with age/size pruning, or low-level access to the FTP catalog (custom paths, custom file types, custom column types). It does *not* try to replace `healthbR`'s breadth or `microdatasus`'s variable-labelling features — it stays focused on fast I/O and the FTP. **Use `microdatasus`** when you want its `process_sim()` / `process_sinasc()` / etc. labelling pipelines for SIM/SINASC/SIH/SIA/CNES and a select set of SINAN diseases, in a workflow you already know. ## Feature comparison | Feature | datasusr | healthbR | microdatasus | |---|---|---|---| | **Data sources** | DATASUS FTP only (18 systems catalogued) | DATASUS FTP + IBGE surveys + SISAB + ANS + ANVISA | DATASUS FTP only (~15 systems) | | **Survey microdata** (VIGITEL, PNS, PNAD-C, POF, Censo) | No | Yes | No | | **Regulatory data** (ANS, ANVISA) | No | Yes | No | | **Primary-care indicators** (SISAB) | No | Yes | No | | **DBC reading** | In-memory C parser (bundled `blast`) | Vendored C decompressor | External `read.dbc` | | **Module-specific API** | No (one generic `datasus_fetch()`) | Yes (`sim_data()`, `sih_data()`, …) | Yes (`fetch_datasus(information_system = "SIH-RD")`) | | **Variable dictionaries / labels** | No (raw fields) | Yes (`*_variables()`, `*_dictionary()`) | Yes (`process_*()` family) | | **Filter by ICD-10 prefix** | No (post-hoc with `dplyr`) | Yes (`cause = "I"`, `diagnosis = "J"`) | No | | **Parallel downloads** | `curl::multi_download` (sequential per call) | `future` / `furrr` plans | Sequential | | **Caching** | Persistent on disk, configurable, with `prune` by age/size | Per-module Parquet cache (via `arrow`) with `*_cache_status()` | Session-level temp files | | **Cache opt-in** | tempdir() default, persistent only on opt-in | Per-module helpers | None | | **Column selection at read** | Yes (`select`) | Module-dependent | Yes (`vars`) | | **Type control at read** | Yes (`col_types`, `guess_types`) | No | No | | **Date parsing at read** | Yes (`parse_dates`) | Module-dependent | Via `process_*()` | | **Return type** | Tibble, snake_case by default | Tibble, snake_case | Data frame → tibble after `process_*()` | | **CRAN status** | This submission | On CRAN | Archived | | **Dependencies** | cli, curl, dplyr, purrr, rlang, stringr, tibble, tidyr | + readr, jsonlite, foreign (Imports); arrow, furrr, future, survey, srvyr (Suggests) | foreign, read.dbc (archived) | | **License** | MIT | MIT | MIT | ## Reading DBC files All three packages ultimately rely on Mark Adler's `blast` decompressor for the PKWare DCL format used by DATASUS. The differences are in *how* the decompressed data is materialised in R: - **`microdatasus`** delegates to the `read.dbc` package, which writes a temporary `.dbf` file and parses it with `foreign::read.dbf()`. All numeric fields become `double`. - **`healthbR`** vendors its own C decompressor and parses fields into tidy tibbles per module. Each `_data()` call applies the module's specific cleaning/labelling rules. - **`datasusr`** decompresses and parses entirely in memory in a single C call. The compressed payload goes straight to a memory buffer and is parsed field-by-field into pre-allocated R vectors, with optional column selection, integer-vs-double inference, explicit `col_types`, and date parsing — all at the C level. There are no temporary files on disk. For a single large DBC file with column selection, `datasusr` is the fastest option. For a multi-source analysis touching surveys and DATASUS together, `healthbR`'s integrated workflow saves more wall time than the per-file speedup is worth. ## API ergonomics `healthbR` provides a uniform per-module API: ```{r} library(healthbR) obitos <- sim_data(year = 2022, uf = "AC") obitos_cv <- sim_data(year = 2022, uf = "AC", cause = "I") # ICD-10 prefix nascidos <- sinasc_data(year = 2022, uf = "AC") intern <- sih_data(year = 2022, month = 1:6, uf = c("SP", "RJ")) ambul <- sia_data(year = 2022, month = 1, uf = "AC", type = "AM") # Variable metadata and value labels sim_variables() sim_dictionary("SEXO") ``` `datasusr` is intentionally one level lower. There is a single `datasus_fetch()` entry point, plus a layered API that exposes the catalog and the FTP: ```{r} library(datasusr) # Browse the catalog datasus_sources() datasus_file_types(source = "SINAN") # Validate availability against the FTP files <- datasus_list_files( source = "SINAN", file_type = "DENG", year = 2020:2024 ) # Download with caching downloads <- datasus_download(files, cache_dir = tempdir()) # Read with column selection and explicit types df <- read_datasus_dbc( downloads$local_file[[1]], select = c("dt_notific", "id_municip", "classi_fin"), col_types = c(dt_notific = "date"), parse_dates = TRUE ) ``` If you do not need that level of control, `datasus_fetch()` collapses the four steps above into one call, similar to `healthbR`'s `_data()`. ## Caching and parallelism `healthbR` writes per-module Parquet files (when `arrow` is installed) and exposes per-module cache helpers (`sim_cache_status()`, `sih_clear_cache()`, …). Parallel downloads are opt-in via a `future` plan, so a single call can pull many UFs/months/years at once. `datasusr` keeps the cache as raw `.dbc` files in a single configurable directory (default: a session-scoped subdirectory of `tempdir()`). Pruning is global rather than per-module: ```{r} datasus_cache_info() datasus_cache_prune(older_than_days = 90) datasus_cache_prune(max_size_bytes = 5 * 1024^3) datasus_cache_clear() ``` If you want a persistent cross-session cache, point `cache_dir` (or the `DATASUSR_CACHE_DIR` env var, or the `datasusr.cache_dir` option) at a directory of your choice, e.g. `tools::R_user_dir("datasusr", "cache")`. `datasusr` does not bundle a parallel-download abstraction, but `datasus_list_files()` returns a tibble of URLs that you can hand to `curl::multi_download()` or your own `future_map()` if you need parallel I/O. ## Variable labelling Neither `datasusr` nor base R itself attaches semantic labels to DATASUS variables. `healthbR` and `microdatasus` both do, with different styles: - `healthbR`: per-module `*_variables()` and `*_dictionary()` return tibbles describing each field and its categorical levels. - `microdatasus`: `process_sim()`, `process_sinasc()`, … rewrite the data frame in place with decoded values. If you start in `datasusr`, you can join against the territorial tables to recover human-readable municipality / region names, but ICD codes, sex codes, race codes, etc. remain raw: ```{r} library(dplyr) mun <- datasus_get_territory("tb_municip", cache_dir = tempdir()) df <- datasus_fetch( source = "SIM", file_type = "DO", year = 2022, uf = "PE", cache_dir = tempdir() ) |> left_join( select(mun, co_municip, ds_nome), by = c("codmunocor" = "co_municip") ) ``` For analyses where labelled variables matter, prefer `healthbR` (or `microdatasus`'s `process_*()` family). ## Composing the packages The packages are not mutually exclusive. A common pattern: - Use `healthbR` as the default entry point for surveys and labelled DATASUS data. - Drop down to `datasusr` when you need fast custom reads of arbitrary DBC files, or when you want to navigate the FTP catalog directly (`datasus_ftp_ls()`, `datasus_list_files()`, `datasus_docs_url()`). - Reach for `microdatasus::process_*()` if you already rely on its labelling pipelines. ## Summary `healthbR` is the most complete and currently the most actively maintained option for Brazilian public health data in R, and is the recommended starting point for most users. `datasusr` is a focused DATASUS-FTP reader and catalog tool: it shines when you need fast, flexible, dependency-light I/O against raw DBC files, but it does not try to match `healthbR`'s scope.