---
title: "Custom aggregation"
author: "Victor Granda (Sapfluxnet Team)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Custom aggregation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `sapfluxnetr` package offers a flexible and powerful API, based on the
`tidyverse` packages, to aggregate and summarise site data through the
`sfn_metrics` function. All the functions in the metrics family (`?metrics`)
use `sfn_metrics` under the hood. If you want full control over the statistics
returned and the aggregation periods, we recommend using this API directly.
This vignette shows you how.

## Pre-fixed summarising functions

The metrics family consists of the following predefined wrappers around
`sfn_metrics`:

  1. `daily_metrics`
  1. `monthly_metrics`
  1. `predawn_metrics`
  1. `midday_metrics`
  1. `nightly_metrics`
  1. `daylight_metrics`

See each function's help page for a detailed description and usage examples.

## Custom summarising functions

`daily_metrics` and related functions return a complete set of metrics ready for
use, but if you want different metrics you can supply your own summarising
functions using the `.funs` argument.  
The accepted ways of specifying these functions are described in
the `summarise_all` help (`?dplyr::summarise_all`). The recommended way is a
list of formulas, each containing the function call:

```{r custom_summ}
# libraries
library(sapfluxnetr)
library(dplyr)

### only mean and sd at a daily scale
# data
data('ARG_TRE', package = 'sapfluxnetr')

# summarising funs (as a list of formulas)
custom_funs <- list(mean = ~ mean(., na.rm = TRUE), std_dev = ~ sd(., na.rm = TRUE))

# metrics
foo_simpler_metrics <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)

foo_simpler_metrics[['sapf']]
```

> When supplying only one function to `.funs`, the variable names are not
  changed to include the metric name at the end, so the summary returns the
  same column names as the original data.

## Special interest intervals

You can also choose if the "special interest" intervals (predawn, midday,
nighttime or daylight) are calculated or not. For example, if you are only
interested in the midday interval you can use:

```{r special_intervals}
foo_simpler_metrics_midday <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'midday', int_start = 11, int_end = 13
)

foo_simpler_metrics_midday[['sapf']]
```

## Custom aggregation periods

The `period` argument in `sfn_metrics` is passed to the `.collapse_timestamp`
function, so it accepts the same input:

  + "frequency period" format, where frequency is a number and period is an
    interval given as a character string (e.g. "1 year", "7 days")

```{r custom_aggregation}
# weekly
foo_weekly <- sfn_metrics(
  ARG_TRE,
  period = '7 days',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)

foo_weekly[['env']]
```

  + A custom function name (without quotes). This way you can build irregular
    or custom periods.
    The first argument of this function must always be the timestamp to be
    collapsed. It can have other arguments, which are supplied through the dots
    (`...`) argument of `sfn_metrics`. The function must also always return a
    vector of timestamps of the same length as the original timestamp.  
    For example, if you want to summarise by quarters you can use the `quarter`
    function from the lubridate package:

```{r custom_aggregation_2}
# data (another example site shipped with the package)
data('AUS_CAN_ST2_MIX', package = 'sapfluxnetr')

foo_custom <- sfn_metrics(
  AUS_CAN_ST2_MIX,
  period = lubridate::quarter,
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general',
  with_year = TRUE # argument for lubridate::quarter
)

foo_custom[['env']]
```
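
The requirements listed above (timestamp as first argument, same-length output)
can be sketched with a hand-written period function. `half_month` below is a
hypothetical helper, not part of `sapfluxnetr`, and the timestamps are made-up
toy values. It collapses each timestamp to the 1st or the 16th of its month:

```{r half_month_sketch}
# hypothetical custom period function: first argument is the timestamp,
# extra arguments (unused here) would arrive through the dots
half_month <- function(timestamp, ...) {
  # first day of the month at 00:00, in the time zone of the input
  month_start <- lubridate::floor_date(timestamp, unit = "month")
  # TRUE for days 16 onwards
  second_half <- lubridate::mday(timestamp) > 15
  # shift second-half timestamps to the 16th, leave the rest on the 1st
  month_start + lubridate::days(15 * second_half)
}

toy_timestamp <- as.POSIXct(
  c("2010-03-03 10:00:00", "2010-03-20 18:30:00", "2010-04-15 00:00:00"),
  tz = "UTC"
)

# one collapsed value per input value
half_month(toy_timestamp)
```

Assuming it satisfies the contract described above, a function like this could
then be supplied as `period = half_month` in `sfn_metrics`; that call is not
run here.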

## Extra parameters

`sfn_metrics` has a `...` parameter intended to supply additional parameters to
the internal functions used:

  1. `.collapse_timestamp` accepts the following extra arguments:
     
     - `side`  
        `"start"` by default in the `sfn_metrics` implementation.
  
  1. `dplyr::summarise_all` accepts extra arguments that are applied to the
     summarising functions provided (to **all** of them, so every function must
     accept the argument or an error will be raised). This is why we recommend
     the list-of-formulas approach, where the arguments are specified for each
     individual function.

For example, if we want the TIMESTAMPs after aggregation to show the end of the
period instead of the beginning (the default), we can do the following:

```{r extra_params}
foo_simpler_metrics_end <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general',
  side = "end"
)

foo_simpler_metrics_end[['sapf']]
```

Compared with the `foo_simpler_metrics` object calculated before, each period
(daily in this case) is now identified in the TIMESTAMP column by its end
instead of its start.

> When supplying a custom function as the `period` argument, the default
  coverage statistic is not reliable, as there is no way of knowing beforehand
  the length of the period/s in minutes.

## Temporary columns helpers

The internal aggregation process in `sfn_metrics` generates some transitory
columns which can be used in the summarising functions:

### `TIMESTAMP_coll`

When aggregating by the declared period (e.g. `"1 day"`), the TIMESTAMP column
collapses to the period start/end value, meaning that all the TIMESTAMP values
for the same day become identical.  
This makes it impossible to use summarising functions that return the time of
day at which an event happens (e.g. the time of day at which the maximum sap
flow occurs), because all TIMESTAMP values are identical.
For that kind of summarising function, a transitory column called
`TIMESTAMP_coll`, which keeps the original (uncollapsed) timestamp values, is
created. So in this case we can write a function that takes the variable values
for the day and the `TIMESTAMP_coll` values for the day, returns the TIMESTAMP
at which the maximum sap flow occurs, and use it with `sfn_metrics`:

```{r timestamp_coll}
max_time <- function(x, time) {
  
  # x: vector of values for a day
  # time: TIMESTAMP for the day 

  # if all the values in x are NA (a daily summary of a day with no measures,
  # for example), which.max returns a length 0 index and the result would be
  # a length 0 POSIXct vector, which would crash the dplyr summarise step.
  # So, check if all are NA and, if true, return NA as POSIXct
  if(all(is.na(x))) {
    return(as.POSIXct(NA, tz = attr(time, 'tz'), origin = lubridate::origin))
  } else {
    time[which.max(x)]
  }
}

# both functions are named, so the resulting columns get a clear suffix
custom_funs <- list(
  max = ~ max(., na.rm = TRUE),
  max_time = ~ max_time(., TIMESTAMP_coll)
)

max_time_metrics <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)

max_time_metrics[['sapf']]
```
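
A helper of this kind can be checked on a small vector before handing it to
`sfn_metrics`. The sketch below uses a hypothetical sibling helper, `min_time`
(not part of `sapfluxnetr`), and made-up toy values to show both the normal
case and the all-`NA` case:

```{r min_time_sketch}
# hypothetical helper: time at which the minimum value occurs
min_time <- function(x, time) {
  # all-NA input: which.min would give a length 0 index, so return a
  # length 1 NA instead; indexing with NA keeps the class and time zone
  if (all(is.na(x))) {
    return(time[NA_integer_])
  }
  time[which.min(x)]
}

# toy hourly timestamps and values
toy_time <- as.POSIXct("2010-01-01 00:00:00", tz = "UTC") + 3600 * 0:3
toy_flow <- c(5, 1, NA, 3)

min_time(toy_flow, toy_time)          # timestamp of the second value
min_time(rep(NA_real_, 4), toy_time)  # NA, still a POSIXct of length 1
```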

## Sub-daily aggregations

`sfn_metrics` also allows sub-daily aggregations by means of the `period`
parameter. Sapfluxnet datasets usually have sub-daily data with time steps in
the range of 30 minutes to 2 hours. This means that data can be aggregated in
periods above 2 hours. We can easily aggregate to a 3-hour period:

```{r subdaily_periods}
custom_funs <- list(max = ~ max(., na.rm = TRUE))

three_hours_agg <- sfn_metrics(
  ARG_TRE,
  period = '3 hours',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)

three_hours_agg[['sapf']]
```
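
To get a feel for what a 3-hour period means for individual timestamps, the
binning can be reproduced on a few made-up values with
`lubridate::floor_date`. This is only an illustration of the idea; that
`sfn_metrics` computes its bins in exactly this way is an assumption:

```{r three_hours_bins_sketch}
# toy timestamps at irregular sub-daily times
toy_timestamp <- as.POSIXct(
  c("2010-01-01 00:30:00", "2010-01-01 02:30:00", "2010-01-01 04:00:00"),
  tz = "UTC"
)

# start of the 3-hour bin each timestamp falls into
bin_start <- lubridate::floor_date(toy_timestamp, unit = "3 hours")

data.frame(TIMESTAMP = toy_timestamp, bin_start = bin_start)
```

The first two timestamps share a bin, so a summarising function would see
their values together.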