--- title: "Parallel processing" subtitle: "speeding up skytrackr computation" author: "Koen Hufkens" output: bookdown::html_document2: base_format: rmarkdown::html_vignette fig_caption: yes toc: true toc_depth: 2 pkgdown: as_is: true vignette: > %\VignetteIndexEntry{Parallel processing} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center" ) set.seed(0) library(dplyr) library(multidplyr) library(skytrackr) ``` # Introduction Light loggers are often deployed in bulk. In order to speed up processing you can easily convert a standard serial {dplyr} based processing to a parallel one using the {multidplyr} package. To demonstrate this I'll use a fake dataset with two loggers (using the included demo data). # Data setup For this exercise you will need {skytrackr} and {multidplyr} libraries loaded. I use the included demo dataset `cc876` and load it into two different data frames, while renaming the logger in one of them. I then merge both data frames. This will create a dataset with two loggers (although containing the same information), suitable to demonstrate parallel processing. ```{r eval = TRUE} library(skytrackr) library(dplyr) library(multidplyr) # creating a fake dataset # by duplicating data and # renaming the logger df1 <- skytrackr::cc876 df1$logger <- "CC888" df2 <- skytrackr::cc876 df <- bind_rows(df1,df2) ``` Next I'll define the cluster setup using the `new_cluster()` function. In general you can safely use all but one of your cores on your local machine (n - 1). To detect the number of cores on your machine you can use `parallel::detectCores()`. In most modern computers there are at least four (4) local cores. In our case, with two datasets to process, one core will go unused. If more data is presented it will be distributed over the available cores. ```{r eval = TRUE} # detect number of cores automatically # n <- parallel::detectCores() - 1 # in this case I force them to two (2) n <- 2 # create a new cluster cluster <- new_cluster(n) ``` Since every CPU processes the data in isolation you need to explicitly specify the libraries you wish to use. In this case, we provide the cluster with the {skytrackr} library (and its dependencies). ```{r eval = FALSE} # Make sure the "skytrackr" library # is made available cluster_library(cluster, "skytrackr") ``` With the cluster details specified I can now partition the data for distribution to the different CPUs. The partitioning of data is done by the standard {dplyr} `group_by()` function, followed by the {multidplyr} `partition()` function using the cluster specifications as an argument. This will transform the `tibble` data frame into a partitioned data frame, or `party_df`. ```{r eval = TRUE} # split tasks by logger # across cluster partitions df_logger <- df |> group_by(logger) |> partition(cluster) print(df_logger) ``` # Parallel processing Finally, after the setup we can now call the main parallel routine to be executed. The setup follows the routine specified in the main README with a few exceptions. Mainly, the generation of the mask, step-selection function and setting the random seed must be included in the `do()` statement. The parallel sessions do not have access to shared memory. As such, we have to define the mask, step-selection function and random seed for each run (logger) separately. 
```{r eval = FALSE}
# run the analysis in parallel
# on the cluster (local or remote)
locations <- df_logger |>
  group_by(logger) |>
  do({

    # set seed per parallel unit
    set.seed(1)

    # define a land mask with a bounding box
    # (xmin, ymin, xmax, ymax) and an off-shore
    # buffer (in km), in addition you can specify
    # the resolution of the resulting raster
    mask <- stk_mask(
      bbox = c(-20, -40, 60, 60),
      buffer = 150, # in km
      resolution = 0.5 # map grid in degrees
    )

    # define a step selection distribution
    ssf <- function(x, shape = 0.9, scale = 100, tolerance = 1500){
      # normalize over expected range with km increments
      norm <- sum(stats::dgamma(1:tolerance, shape = shape, scale = scale))
      prob <- stats::dgamma(x, shape = shape, scale = scale) / norm
      return(prob)
    }

    skytrackr(
      ., # the data for the current logger
      mask = mask,
      step_selection = ssf,
      plot = FALSE,
      verbose = FALSE,
      start_location = c(51.08, 3.73),
      tolerance = 1500, # in km
      scale = log(c(0.00001, 50)),
      range = c(0.09, 148),
      control = list(
        sampler = 'DEzs',
        settings = list(
          burnin = 250,
          iterations = 3000,
          message = FALSE
        )
      )
    )
  })
```

The output of the parallel run is a `party_df` data frame, which is incompatible with the {skytrackr} functions. A simple conversion back to a regular data frame is possible by calling the `as.data.frame()` function (dropping the {multidplyr} ancillary data), making the data compatible with, for example, `stk_map()`.

```{r eval = FALSE}
# drop the parallel processing info
locations <- locations |>
  as.data.frame()
```
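With the ancillary parallel processing information dropped, the usual downstream workflow applies again. A minimal sketch, assuming the `stk_map()` interface from the main README, and noting that {multidplyr} workers shut down automatically once the cluster object is garbage collected:

```{r eval = FALSE}
# map the estimated locations
# of both loggers
stk_map(locations)

# release the workers; the cluster is shut
# down automatically once garbage collected
rm(cluster)
```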