--- title: "Memory and Parallelization" author: "Victor Granda (Sapfluxnet Team)" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Memory and Parallelization} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Memory In order to be able to work with the whole database at the *sapwood* or *plant* level it is recommended at least $16GB$ of RAM memory. This is because loading all data objects already consumes $4GB$ and any operation like aggregation or metric calculation results in extra memory needed: ```{r memory_all, eval=FALSE} library(sapfluxnetr) # This will need at least 5GB of memory during the process folder <- 'RData/plant' sfn_metadata <- read_sfn_metadata(folder) daily_results <- sfn_sites_in_folder(folder) %>% filter_sites_by_md( si_biome %in% c("Temperate forest", 'Woodland/Shrubland'), sites = sites, metadata = sfn_metadata ) %>% read_sfn_data(folder) %>% daily_metrics(tidy = TRUE, metadata = sfn_metadata) # Important to save, this way you will have access to the object in the future save(daily_results, file = 'daily_results.RData') ``` To circumvent this in less powerful systems, we recommend to work in small subsets of sites (25-30) and join the tidy results afterwards: ```{r memory_steps, eval = FALSE} library(sapfluxnetr) folder <- 'RData/plant' metadata <- read_sfn_metadata(folder) sites <- sfn_sites_in_folder(folder) %>% filter_sites_by_md( si_biome %in% c("Temperate forest", 'Woodland/Shrubland'), sites = sites, metadata = sfn_metadata ) daily_results_1 <- read_sfn_data(sites[1:30], folder) %>% daily_metrics(tidy = TRUE, metadata = sfn_metadata) daily_results_2 <- read_sfn_data(sites[31:60], folder) %>% daily_metrics(tidy = TRUE, metadata = sfn_metadata) daily_results_3 <- read_sfn_data(sites[61:90], folder) %>% daily_metrics(tidy = TRUE, metadata = sfn_metadata) daily_results_4 <- read_sfn_data(sites[91:110], folder) %>% daily_metrics(tidy = TRUE, metadata = sfn_metadata) daily_results_steps <- bind_rows( daily_results_1, daily_results_2, daily_results_3, daily_results_4 ) rm(daily_results_1, daily_results_2, daily_results_3, daily_results_4) save(daily_results_steps, file = 'daily_results_steps.RData') ``` ## Parallelization `sapfluxnetr` includes the capability to parallelize the metrics calculation when performed on a `sfn_data_multi` object. This is made thenks to the [furrr](https://github.com/futureverse/furrr) package, which uses the [future](https://github.com/futureverse/future) package behind the scenes. By default, the code will run in a sequential process, which is the usual way the R code runs. But setting the `future::plan` to `multicore` (in Linux), `multisession` (in Windows) or `multiprocess` (automatically choose between the previous plans depending on the system) will run the code in parallel, dividing the sites between the available cores. > Be advised, parallelization usually means more RAM used, so in systems with less then 16GB maybe is not a good idea. Also, the time benefits start to show when analysing 10 sites or more. ```{r parallelizations, eval = FALSE} # loading future package library(future) # setting the plan plan('multiprocess') # metrics!! 
## Parallelization

`sapfluxnetr` includes the capability to parallelize the metrics calculation
when it is performed on a `sfn_data_multi` object. This is possible thanks to
the [furrr](https://github.com/futureverse/furrr) package, which uses the
[future](https://github.com/futureverse/future) package behind the scenes. By
default the code runs sequentially, as R code usually does. But setting the
`future::plan` to `multicore` (on Linux), `multisession` (on Windows, and a
safe cross-platform choice) or `multiprocess` (which automatically chooses
between the previous plans depending on the system, but is deprecated in recent
versions of `future`) will run the code in parallel, dividing the sites between
the available cores.

> Be advised: parallelization usually means more RAM being used, so on systems
> with less than $16GB$ it may not be a good idea. Also, the time benefits only
> start to show when analysing around 10 sites or more.

```{r parallelizations, eval = FALSE}
# load the future package
library(future)

# set the plan ('multicore' is also an option on Linux)
plan('multisession')

# metrics!!
daily_results_parallel <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Woodland/Shrubland'),
    sites = ., metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

# Important to save, this way you will have access to the object in the future
save(daily_results_parallel, file = 'daily_results_parallel.RData')
```

### Memory limit

When using `furrr`, even with the `sequential` plan, the `future` package sets
a limit of $500MB$ of exported data (globals) per core. With sapfluxnet data
this limit is easily exceeded, causing an error. To avoid this, we can raise
the `future.globals.maxSize` option to a higher value ($1GB$ for example,
although the appropriate limit really depends on the plan and the number of
sites):

```{r max_limit, eval = FALSE}
# future library
library(future)

# plan sequential, not really needed as it is the default, but for the sake of
# clarity
plan('sequential')

# raise the limit to 1GB, which in bytes is 1024*1024^2
options('future.globals.maxSize' = 1024*1024^2)

# do the metrics
daily_results_limit <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Woodland/Shrubland'),
    sites = ., metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

# Important to save, this way you will have access to the object in the future
save(daily_results_limit, file = 'daily_results_limit.RData')
```
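Since each parallel worker receives its own copy of the exported data, another
way to keep memory under control is to cap the number of workers. This is plain
`future` functionality rather than anything specific to `sapfluxnetr`; the
sketch below assumes the `folder` and `sfn_metadata` objects from the previous
chunks, and the worker count of 2 is just an example:

```{r workers_memory, eval = FALSE}
library(future)

# check how many cores the machine offers
availableCores()

# use only two workers to trade speed for a lower memory footprint
# (the worker count here is an arbitrary example)
plan(multisession, workers = 2)

# raise the per-future globals limit to 1GB, as above
options('future.globals.maxSize' = 1024*1024^2)

# the metrics pipeline itself is unchanged
daily_results_workers <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Woodland/Shrubland'),
    sites = ., metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)
```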