--- title: "Working with larger-than-memory data" date: "`r Sys.Date()`" output: rmarkdown::html_vignette urlcolor: blue vignette: > %\VignetteIndexEntry{Working with larger-than-memory data} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = identical(tolower(Sys.getenv("NOT_CRAN")), "true"), out.width = "100%" ) ``` # Larger-than-memory Data Many data sets of Brazilian censuses are too big to load in users' RAM memory. To avoid this problem, **{censobr}** works with files saved in `.parquet` format and, by default, the functions in **{censobr}** returns an [Arrow table](https://arrow.apache.org/docs/r/articles/arrow.html#tabular-data-in-arrow) rather than a `data.frame`. There are a few really simple alternative ways to work with Arrow tables in R without loading the full data to memory. We cover four alternative approaches in this vignette. First, let's read the 2010 mortality data, which we'll use throughout vignette for illustration purposes. ```{r warning = FALSE, message=FALSE} library(censobr) # read 2010 mortality data df <- censobr::read_mortality( year = 2010, add_labels = 'pt', showProgress = FALSE ) ``` ## 1. `{dplyr}` Because of the seamless integration between `{arrow}` and `{dplyr}`, Arrow tables can be analyzed pretty much like a regular `data.frame` using the `{dplyr}`syntax. There is a small but important difference, though. When using {dplyr} with an Arrow table, the operations are not executed immediately. Instead, {dplyr} builds a lazy query plan that is only evaluated when you explicitly ask for the results. To retrieve the actual results, you need to call either: - `collect()`: This brings the results into memory as a regular `data.frame`. - `compute()`: This materializes the result in the Arrow format (e.g., as a new Arrow table). Without calling one of these, the query is just prepared but not executed, which is useful for delaying heavy computations until needed. In the example below, we create a new Arrow table that only includes the deaths records of men in the state of Rio de Janeiro without loading the data to memory. Note that we only piece of data we `collect()` (i.e. load to memory) here are the first observations of the data. ```{r warning = FALSE, message=FALSE} library(dplyr) # Filter deaths of men in the state of Rio de Janeiro rio <- df |> filter(V0704 == 'Masculino' & abbrev_state == 'RJ') head(rio) |> collect() ``` ## 2. `{duckdb}` `{duckdb}` is another powerful library to work with larger-than-memory data in R through database interface. There are different ways to use `{duckdb}`, but here cover three alternatives ### 2.1 Combining `{duckdb}` & `{dplyr}` One easy option is to combine `{duckdb}` & `{dplyr}`. Note here that first you need to convert the Arrow table into a DuckDB table with `arrow::to_duckdb()`. Also note that the you need to use a bit of `SQL` syntax inside the dplyr call. Using the same example as above: ```{r warning = FALSE, message=FALSE} library(duckdb) library(dplyr) library(arrow) # Filter deaths of men in the state of Rio de Janeiro rio1 <- df |> arrow::to_duckdb() |> filter(sql("V0704 LIKE '%Masculino%' AND abbrev_state = 'RJ'")) head(rio1) |> collect() ``` ### 2.2 Combining `{duckdb}` & `SQL` Another alternative is to combine `{duckdb}` with `{DBI}` using database interface and `SQL` syntax. ```{r warning = FALSE} library(duckdb) library(DBI) # create databse connection con <- duckdb::dbConnect(duckdb::duckdb()) # register the data in the data base duckdb::duckdb_register_arrow(con, 'mortality_2010_tbl', df) # Filter deaths of men in the state of Rio de Janeiro rio2 <- DBI::dbGetQuery(con, "SELECT * FROM 'mortality_2010_tbl' WHERE V0704 LIKE '%Masculino%' AND abbrev_state = 'RJ'") head(rio2) ``` ### 2.3 `{duckplyr}` A third alternative is the new `{duckplyr}` package. This this library is under development so it still does not cover many of the `{dplyr}`verbs. (See [here](https://duckdb.org/2024/04/02/duckplyr.html)). Nonetheless, it looks like a really promising and seamless approach that allow users to leverage the powerful capabilities of {duckdb}. ```{r warning = FALSE, message = FALSE} library(duckplyr) rio3 <- df |> duckplyr::filter(V0704 == 'Masculino' & abbrev_state == 'RJ') head(rio3) |> collect() ```