Type: Package
Title: Sparklyr Extension for 'Flint'
Version: 0.2.2
Maintainer: Edgar Ruiz <edgar@rstudio.com>
Description: This sparklyr extension makes 'Flint' time series library functionalities (https://github.com/twosigma/flint) easily accessible through R.
License: Apache License 2.0
URL: https://github.com/r-spark/sparklyr.flint
BugReports: https://github.com/r-spark/sparklyr.flint/issues
Depends: R (≥ 3.2)
Imports: dbplyr, dplyr, rlang, sparklyr (≥ 1.3)
Suggests: knitr, rmarkdown, tibble
VignetteBuilder: knitr
Encoding: UTF-8
RoxygenNote: 7.1.1
SystemRequirements: Spark: 2.x or above
Collate: 'imports.R' 'globals.R' 'sdf_utils.R' 'asof_join.R' 'init.R' 'window_exprs.R' 'summarizers.R' 'ols_regression.R' 'reexports.R' 'utils.R'
NeedsCompilation: no
Packaged: 2022-01-10 15:59:43 UTC; yitaoli
Author: Yitao Li [aut], Edgar Ruiz [aut, cre]
Repository: CRAN
Date/Publication: 2022-01-11 08:50:13 UTC

Temporal future left join

Description

Perform a left-outer join on 2 'TimeSeriesRDD's based on inexact timestamp matches, where each record from 'left' with timestamp 't' matches the record from 'right' having the earliest timestamp at or after 't' if 'strict_lookahead' is FALSE (default), or having the earliest timestamp strictly after 't' if 'strict_lookahead' is TRUE. Notice this is equivalent to 'asof_join()' with 'direction' = "<=" if 'strict_lookahead' is FALSE (default), or 'direction' = "<" if 'strict_lookahead' is TRUE (a sketch demonstrating this equivalence follows the examples below). See asof_join.

Usage

asof_future_left_join(
  left,
  right,
  tol = "0ms",
  key_columns = list(),
  left_prefix = NULL,
  right_prefix = NULL,
  strict_lookahead = FALSE
)

Arguments

left

The left 'TimeSeriesRDD'

right

The right 'TimeSeriesRDD'

tol

A character vector specifying a time duration (e.g., "0ns", "5ms", "5s", "1d", etc.) as the tolerance for the absolute difference in timestamp values between each record from 'left' and its matching record from 'right'. By default, 'tol' is "0ms", which means a record from 'left' will only be matched with a record from 'right' if the two have exactly the same timestamp.

key_columns

Columns to be used as the matching key among records from 'left' and 'right': if non-empty, then in addition to the matching criteria imposed by timestamps, a record from 'left' will match one from 'right' only if they also have equal values in all key columns.

left_prefix

A string to prepend to all columns from 'left' after the join (usually for disambiguation purposes if 'left' and 'right' contain overlapping column names).

right_prefix

A string to prepend to all columns from 'right' after the join (usually for disambiguation purposes if 'left' and 'right' contain overlapping column names).

strict_lookahead

Whether each record from 'left' with timestamp 't' should match a record from 'right' with the smallest timestamp strictly greater than 't' (default: FALSE)

See Also

Other Temporal join functions: asof_join(), asof_left_join()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")
if (!is.null(sc)) {
  ts_1 <- copy_to(sc, tibble::tibble(t = seq(10), u = seq(10))) %>%
    from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_2 <- copy_to(sc, tibble::tibble(t = seq(10) + 1, v = seq(10) + 1L)) %>%
    from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  future_left_join_ts <- asof_future_left_join(ts_1, ts_2, tol = "1s")
} else {
  message("Unable to establish a Spark connection!")
}
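
As noted in the description, 'asof_future_left_join()' is a thin wrapper around 'asof_join()'. A minimal sketch of the equivalence, reusing 'sc', 'ts_1', and 'ts_2' from the example above:

if (!is.null(sc)) {
  # equivalent to asof_future_left_join(ts_1, ts_2, tol = "1s")
  equivalent_ts <- asof_join(ts_1, ts_2, tol = "1s", direction = "<=")
  # with strict_lookahead = TRUE, the equivalent direction would be "<"
}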


Temporal join

Description

Perform left-outer join on 2 'TimeSeriesRDD's based on inexact timestamp matches

Usage

asof_join(
  left,
  right,
  tol = "0ms",
  direction = c(">=", "<=", "<"),
  key_columns = list(),
  left_prefix = NULL,
  right_prefix = NULL
)

Arguments

left

The left 'TimeSeriesRDD'

right

The right 'TimeSeriesRDD'

tol

A character vector specifying a time duration (e.g., "0ns", "5ms", "5s", "1d", etc.) as the tolerance for the absolute difference in timestamp values between each record from 'left' and its matching record from 'right'. By default, 'tol' is "0ms", which means a record from 'left' will only be matched with a record from 'right' if the two have exactly the same timestamp.

direction

Specifies the temporal direction of the join; must be one of ">=", "<=", or "<". If 'direction' is ">=", then each record from 'left' with timestamp 'tl' gets joined with the record from 'right' having the largest/most recent timestamp 'tr' such that 'tl' >= 'tr' and 'tl' - 'tr' <= 'tol' (or equivalently, 0 <= 'tl' - 'tr' <= 'tol'). If 'direction' is "<=", then each record from 'left' with timestamp 'tl' gets joined with the record from 'right' having the smallest/least recent timestamp 'tr' such that 'tl' <= 'tr' and 'tr' - 'tl' <= 'tol' (or equivalently, 0 <= 'tr' - 'tl' <= 'tol'). If 'direction' is "<", then each record from 'left' with timestamp 'tl' gets joined with the record from 'right' having the smallest/least recent timestamp 'tr' such that 'tr' > 'tl' and 'tr' - 'tl' <= 'tol' (or equivalently, 0 < 'tr' - 'tl' <= 'tol'). A sketch contrasting the three directions follows the examples below.

key_columns

Columns to be used as the matching key among records from 'left' and 'right': if non-empty, then in addition to the matching criteria imposed by timestamps, a record from 'left' will match one from 'right' only if they also have equal values in all key columns.

left_prefix

A string to prepend to all columns from 'left' after the join (usually for disambiguation purposes if 'left' and 'right' contain overlapping column names).

right_prefix

A string to prepend to all columns from 'right' after the join (usually for disambiguation purposes if 'left' and 'right' contain overlapping column names).

See Also

Other Temporal join functions: asof_future_left_join(), asof_left_join()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")
if (!is.null(sc)) {
  ts_1 <- copy_to(sc, tibble::tibble(t = seq(10), u = seq(10))) %>%
    from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_2 <- copy_to(sc, tibble::tibble(t = seq(10) + 1, v = seq(10) + 1L)) %>%
    from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  future_left_join_ts <- asof_join(ts_1, ts_2, tol = "1s", direction = "<=")
} else {
  message("Unable to establish a Spark connection!")
}
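
To make the 'direction' semantics above concrete, the following sketch (reusing 'sc', 'ts_1', and 'ts_2' from the example above) runs the same join under each of the three supported directions:

if (!is.null(sc)) {
  # right record at or before each left timestamp (temporal left join)
  past_ts <- asof_join(ts_1, ts_2, tol = "1s", direction = ">=")
  # right record at or after each left timestamp (temporal future left join)
  future_ts <- asof_join(ts_1, ts_2, tol = "1s", direction = "<=")
  # right record strictly after each left timestamp
  strict_future_ts <- asof_join(ts_1, ts_2, tol = "1s", direction = "<")
}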


Temporal left join

Description

Perform a left-outer join on 2 'TimeSeriesRDD's based on inexact timestamp matches, where each record from 'left' with timestamp 't' matches the record from 'right' having the most recent timestamp at or before 't'. Notice this is equivalent to 'asof_join()' with 'direction' = ">=" (a sketch demonstrating this equivalence follows the examples below). See asof_join.

Usage

asof_left_join(
  left,
  right,
  tol = "0ms",
  key_columns = list(),
  left_prefix = NULL,
  right_prefix = NULL
)

Arguments

left

The left 'TimeSeriesRDD'

right

The right 'TimeSeriesRDD'

tol

A character vector specifying a time duration (e.g., "0ns", "5ms", "5s", "1d", etc.) as the tolerance for the absolute difference in timestamp values between each record from 'left' and its matching record from 'right'. By default, 'tol' is "0ms", which means a record from 'left' will only be matched with a record from 'right' if the two have exactly the same timestamp.

key_columns

Columns to be used as the matching key among records from 'left' and 'right': if non-empty, then in addition to the matching criteria imposed by timestamps, a record from 'left' will match one from 'right' only if they also have equal values in all key columns.

left_prefix

A string to prepend to all columns from 'left' after the join (usually for disambiguation purposes if 'left' and 'right' contain overlapping column names).

right_prefix

A string to prepend to all columns from 'right' after the join (usually for disambiguation purposes if 'left' and 'right' contain overlapping column names).

See Also

Other Temporal join functions: asof_future_left_join(), asof_join()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")
if (!is.null(sc)) {
  ts_1 <- copy_to(sc, tibble::tibble(t = seq(10), u = seq(10))) %>%
    from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_2 <- copy_to(sc, tibble::tibble(t = seq(10) + 1, v = seq(10) + 1L)) %>%
    from_sdf(is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  left_join_ts <- asof_left_join(ts_1, ts_2, tol = "1s")
} else {
  message("Unable to establish a Spark connection!")
}
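
As noted in the description, the call above is equivalent to 'asof_join()' with 'direction' = ">="; a minimal sketch reusing 'sc', 'ts_1', and 'ts_2' from the example above:

if (!is.null(sc)) {
  equivalent_ts <- asof_join(ts_1, ts_2, tol = "1s", direction = ">=")
}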


Collect data from a TimeSeriesRDD

Description

Collect data from a TimeSeriesRDD into an R data frame

Usage

## S3 method for class 'ts_rdd'
collect(x, ...)

Arguments

x

A com.twosigma.flint.timeseries.TimeSeriesRDD object

...

Additional arguments to 'sdf_collect()'

Value

An R data frame containing the same time series data as the input TimeSeriesRDD

See Also

Other Spark dataframe utility functions: from_rdd(), from_sdf(), spark_connection.ts_rdd(), spark_dataframe.ts_rdd(), spark_jobj.ts_rdd(), to_sdf(), ts_rdd_builder()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- from_sdf(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  df <- ts %>% collect()
} else {
  message("Unable to establish a Spark connection!")
}


Construct a TimeSeriesRDD from a Spark RDD of rows

Description

Construct a TimeSeriesRDD containing time series data from a Spark RDD of rows

Usage

from_rdd(
  rdd,
  schema,
  is_sorted = FALSE,
  time_unit = .sparklyr.flint.globals$kValidTimeUnits,
  time_column = .sparklyr.flint.globals$kDefaultTimeColumn
)

fromRDD(
  rdd,
  schema,
  is_sorted = FALSE,
  time_unit = .sparklyr.flint.globals$kValidTimeUnits,
  time_column = .sparklyr.flint.globals$kDefaultTimeColumn
)

Arguments

rdd

A Spark RDD[Row] object containing time series data

schema

A Spark StructType object containing schema of the time series data

is_sorted

Whether the rows being imported are already sorted by time

time_unit

Time unit of the time column (must be one of the following values: "NANOSECONDS", "MICROSECONDS", "MILLISECONDS", "SECONDS", "MINUTES", "HOURS", "DAYS")

time_column

Name of the time column

Value

A TimeSeriesRDD usable by the Flint time series library

See Also

Other Spark dataframe utility functions: collect.ts_rdd(), from_sdf(), spark_connection.ts_rdd(), spark_dataframe.ts_rdd(), spark_jobj.ts_rdd(), to_sdf(), ts_rdd_builder()


Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  rdd <- spark_dataframe(sdf) %>% invoke("rdd")
  schema <- spark_dataframe(sdf) %>% invoke("schema")
  ts <- from_rdd(
    rdd, schema,
    is_sorted = TRUE, time_unit = "SECONDS", time_column = "t"
  )
} else {
  message("Unable to establish a Spark connection!")
}


Construct a TimeSeriesRDD from a Spark DataFrame

Description

Construct a TimeSeriesRDD containing time series data from a Spark DataFrame

Usage

from_sdf(
  sdf,
  is_sorted = FALSE,
  time_unit = .sparklyr.flint.globals$kValidTimeUnits,
  time_column = .sparklyr.flint.globals$kDefaultTimeColumn
)

fromSDF(
  sdf,
  is_sorted = FALSE,
  time_unit = .sparklyr.flint.globals$kValidTimeUnits,
  time_column = .sparklyr.flint.globals$kDefaultTimeColumn
)

Arguments

sdf

A Spark DataFrame object

is_sorted

Whether the rows being imported are already sorted by time

time_unit

Time unit of the time column (must be one of the following values: "NANOSECONDS", "MICROSECONDS", "MILLISECONDS", "SECONDS", "MINUTES", "HOURS", "DAYS")

time_column

Name of the time column

Value

A TimeSeriesRDD usable by the Flint time series library

See Also

Other Spark dataframe utility functions: collect.ts_rdd(), from_rdd(), spark_connection.ts_rdd(), spark_dataframe.ts_rdd(), spark_jobj.ts_rdd(), to_sdf(), ts_rdd_builder()


Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- from_sdf(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
} else {
  message("Unable to establish a Spark connection!")
}
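
The reverse conversion is handled by 'to_sdf()' from the same family of utility functions (see above); a minimal sketch reusing 'sc' and 'ts' from the example above:

if (!is.null(sc)) {
  # convert the TimeSeriesRDD back into a Spark DataFrame
  sdf_roundtrip <- to_sdf(ts)
}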


Dependencies and initialization procedures

Description

Functions in this file specify all runtime dependencies of sparklyr.flint and package-wide constants in '.sparklyr.flint.globals'.


OLS regression

Description

Ordinary least squares regression

Usage

ols_regression(
  ts_rdd,
  formula,
  weight = NULL,
  has_intercept = TRUE,
  ignore_const_vars = FALSE,
  const_var_threshold = 1e-12
)

Arguments

ts_rdd

Timeseries RDD containing dependent and independent variables

formula

An object of class "formula" (or one that can be coerced to that class) which symbolically describes the model to be fitted, with the left-hand-side being the column name of the dependent variable, and the right-hand-side being column name(s) of independent variable(s) delimited by '+', e.g., 'mpg ~ hp + weight + am' for predicting 'mpg' based on 'hp', 'weight' and 'am'

weight

Name of the weight column if performing a weighted OLS regression, or NULL if otherwise. Default: NULL.

has_intercept

Whether to include an intercept term (default: TRUE). If FALSE, then the resulting regression plane will always pass through the origin.

ignore_const_vars

Whether to ignore independent variables that are constant or nearly constant based on 'const_var_threshold' (default: FALSE). If TRUE, the scalar fields of the regression result are the same as if the constant variables were not included as independent variables. The output beta, tStat, and stdErr columns will still have one entry per independent variable; however, entries corresponding to independent variables that are considered constant will have 0.0 for beta and stdErr, and Double.NaN for tStat. If FALSE and at least one independent variable is considered constant, the regression will output Double.NaN for all values. Note that if there are multiple independent variables that can be considered constant and the resulting model should have an intercept term, then it is recommended to set both 'ignore_const_vars' and 'has_intercept' to TRUE.

const_var_threshold

Consider an independent variable 'x' as constant if ((number of observations) * variance(x)) is less than this value. Default: 1e-12.

Value

A TimeSeriesRDD with the following schema:

- "samples": [[LongType]], the number of samples
- "beta": [[ArrayType]] of [[DoubleType]], beta without the intercept component
- "intercept": [[DoubleType]], the intercept
- "hasIntercept": [[BooleanType]], whether the model has an intercept term
- "stdErr_intercept": [[DoubleType]], the standard error of the intercept
- "stdErr_beta": [[ArrayType]] of [[DoubleType]], the standard error of beta
- "rSquared": [[DoubleType]], the r-squared statistic
- "r": [[DoubleType]], the square root of the r-squared statistic
- "tStat_intercept": [[DoubleType]], the t-stat of the intercept
- "tStat_beta": [[ArrayType]] of [[DoubleType]], the t-stats of beta
- "logLikelihood": [[DoubleType]], the log-likelihood of the data given the fitted betas
- "akaikeIC": [[DoubleType]], the Akaike information criterion
- "bayesIC": [[DoubleType]], the Bayes information criterion
- "cond": [[DoubleType]], the condition number of the Gram matrix X^T X, where X is the matrix formed by row vectors of independent variables (including a constant entry corresponding to the intercept if 'has_intercept' is TRUE)
- "const_columns": [[ArrayType]] of [[StringType]], the list of independent variables that are considered constant

See Also

Other summarizers: summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  mtcars_sdf <- copy_to(sc, mtcars, overwrite = TRUE) %>%
    dplyr::mutate(time = 0L)
  mtcars_ts <- from_sdf(mtcars_sdf, is_sorted = TRUE, time_unit = "SECONDS")
  model <- ols_regression(
    mtcars_ts, mpg ~ cyl + disp + hp + drat + wt + vs + am + gear + carb
  ) %>%
      collect()
} else {
  message("Unable to establish a Spark connection!")
}
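
The collected result is a single-row R data frame following the schema listed under Value; a sketch of how one might inspect it, assuming array columns (such as "beta") collect into R list columns:

if (!is.null(sc)) {
  print(model$rSquared)  # goodness of fit
  print(model$intercept) # fitted intercept
  print(model$beta[[1]]) # one coefficient per independent variable
}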


Utility functions for importing a Spark data frame into a TimeSeriesRDD

Description

These functions provide an interface for specifying how a Spark data frame should be imported into a TimeSeriesRDD (e.g., which column represents time, whether rows are already ordered by time, the time unit being used, etc.)

Arguments

sc

Spark connection

is_sorted

Whether the rows being imported are already sorted by time

time_unit

Time unit of the time column (must be one of the following values: "NANOSECONDS", "MICROSECONDS", "MILLISECONDS", "SECONDS", "MINUTES", "HOURS", "DAYS")

time_column

Name of the time column
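
These shared arguments are consumed by the importing functions documented elsewhere in this manual, such as 'from_sdf()' and 'ts_rdd_builder()'. A minimal sketch of how they fit together, assuming a local Spark connection can be established:

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  # 't' holds timestamps in seconds and rows are already ordered by time
  ts <- from_sdf(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
}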


Retrieve Spark connection associated with an R object

Description

See spark_connection for more details.

Usage

## S3 method for class 'ts_rdd'
spark_connection(x, ...)

Arguments

x

An R object from which a 'spark_connection' can be obtained.

...

Optional arguments; currently unused.

See Also

Other Spark dataframe utility functions: collect.ts_rdd(), from_rdd(), from_sdf(), spark_dataframe.ts_rdd(), spark_jobj.ts_rdd(), to_sdf(), ts_rdd_builder()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  print(spark_connection(ts))
} else {
  message("Unable to establish a Spark connection!")
}


Retrieve a Spark DataFrame

Description

Retrieve a Spark DataFrame from a TimeSeriesRDD object

Usage

## S3 method for class 'ts_rdd'
spark_dataframe(x, ...)

Arguments

x

An R object wrapping, or containing, a Spark DataFrame.

...

Optional arguments; currently unused.

See Also

Other Spark dataframe utility functions: collect.ts_rdd(), from_rdd(), from_sdf(), spark_connection.ts_rdd(), spark_jobj.ts_rdd(), to_sdf(), ts_rdd_builder()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- from_sdf(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  print(ts %>% spark_dataframe())
  print(sdf %>% spark_dataframe()) # the former should contain the same set of
                                   # rows as the latter does, modulo possible
                                   # difference in types of timestamp columns
} else {
  message("Unable to establish a Spark connection!")
}


Retrieve a Spark JVM Object Reference

Description

See spark_jobj for more details.

Usage

## S3 method for class 'ts_rdd'
spark_jobj(x, ...)

Arguments

x

An R object containing, or wrapping, a 'spark_jobj'.

...

Optional arguments; currently unused.

See Also

Other Spark dataframe utility functions: collect.ts_rdd(), from_rdd(), from_sdf(), spark_connection.ts_rdd(), spark_dataframe.ts_rdd(), to_sdf(), ts_rdd_builder()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  print(spark_jobj(ts))
} else {
  message("Unable to establish a Spark connection!")
}


Average summarizer

Description

Compute moving average of 'column' and store results in a new column named '<column>_mean'

Usage

summarize_avg(ts_rdd, column, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from looking behind 1 hour at each time point, 'in_future("5s")' to summarize data from looking forward 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_avg <- summarize_avg(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}
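
The same summarizer can look forward instead of behind by switching the window expression; a sketch reusing 'sc' and 'ts' from the example above:

if (!is.null(sc)) {
  # moving average over [t, t + 3s] at each time point t
  ts_avg_future <- summarize_avg(ts, column = "v", window = in_future("3s"))
}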


Correlation summarizer

Description

Compute pairwise correlations among the list of columns specified and store results in new columns named with the following pattern: '<column1>_<column2>_correlation' and '<column1>_<column2>_correlationTStat', where column1 and column2 are names of any 2 distinct columns

Usage

summarize_corr(ts_rdd, columns, key_columns = list(), incremental = FALSE)

Arguments

ts_rdd

Timeseries RDD being summarized

columns

A list of column names

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), u = rnorm(10), v = rnorm(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_corr <- summarize_corr(ts, columns = c("u", "v"))
} else {
  message("Unable to establish a Spark connection!")
}
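
With 'incremental' set as described above, the summarizer emits, at each timestamp, the correlation over all records up to and including that timestamp; a sketch reusing 'sc' and 'ts' from the example above:

if (!is.null(sc)) {
  ts_corr_incr <- summarize_corr(ts, columns = c("u", "v"), incremental = TRUE)
}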


Pairwise correlation summarizer

Description

Compute pairwise correlations for all possible pairs of columns such that the first column of each pair is one of 'xcolumns' and the second column of each pair is one of 'ycolumns', storing results in new columns named with the following pattern: '<column1>_<column2>_correlation' and '<column1>_<column2>_correlationTStat' for each pair of columns (column1, column2)

Usage

summarize_corr2(
  ts_rdd,
  xcolumns,
  ycolumns,
  key_columns = list(),
  incremental = FALSE
)

Arguments

ts_rdd

Timeseries RDD being summarized

xcolumns

A list of column names

ycolumns

A list of column names disjoint from xcolumns

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(
    sc,
    tibble::tibble(t = seq(10), x1 = rnorm(10), x2 = rnorm(10), y1 = rnorm(10), y2 = rnorm(10))
  )
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_corr2 <- summarize_corr2(ts, xcolumns = c("x1", "x2"), ycolumns = c("y1", "y2"))
} else {
  message("Unable to establish a Spark connection!")
}
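
Per the naming convention described above, the result should contain one correlation and one t-stat column per (x, y) pair; a quick way to inspect them, reusing 'sc' and 'ts_corr2' from the example above:

if (!is.null(sc)) {
  print(colnames(collect(ts_corr2)))
  # expected to include, e.g., "x1_y1_correlation" and "x1_y1_correlationTStat"
}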


Count summarizer

Description

Count the total number of records if no column is specified, or the number of non-null values within the specified column within each time window or within each group of records with identical timestamps

Usage

summarize_count(ts_rdd, column = NULL, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

If not NULL, then report the number of values in the column specified that are not NULL or NaN within each time window or group of records with identical timestamps, and store the counts in a new column named '<column>_count'. Otherwise the number of records within each time window or group of records with identical timestamps is reported, and stored in a column named 'count'.

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from looking behind 1 hour at each time point, 'in_future("5s")' to summarize data from looking forward 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_count <- summarize_count(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}
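
When 'column' is left as NULL, the summarizer counts records rather than non-null values, as described above; a sketch reusing 'sc' and 'ts' from the example above:

if (!is.null(sc)) {
  # number of records within each 3-second look-behind window,
  # stored in a column named "count"
  ts_n <- summarize_count(ts, window = in_past("3s"))
}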


Covariance summarizer

Description

Compute covariance between values from 'xcolumn' and 'ycolumn' within each time window or within each group of records with identical timestamps, and store results in a new column named '<xcolumn>_<ycolumn>_covariance'

Usage

summarize_covar(ts_rdd, xcolumn, ycolumn, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

xcolumn

Column representing the first random variable

ycolumn

Column representing the second random variable

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from looking behind 1 hour at each time point, 'in_future("5s")' to summarize data from looking forward 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), u = rnorm(10), v = rnorm(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_covar <- summarize_covar(ts, xcolumn = "u", ycolumn = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


Dot product summarizer

Description

Compute dot product of values from 'xcolumn' and 'ycolumn' within a moving time window or within each group of records with identical timestamps and store results in a new column named '<xcolumn>_<ycolumn>_dotProduct'

Usage

summarize_dot_product(
  ts_rdd,
  xcolumn,
  ycolumn,
  window = NULL,
  key_columns = list()
)

Arguments

ts_rdd

Timeseries RDD being summarized

xcolumn

Name of the first column

ycolumn

Name of the second column

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from looking behind 1 hour at each time point, 'in_future("5s")' to summarize data from looking forward 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), u = seq(10, 1, -1), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_dot_product <- summarize_dot_product(ts, xcolumn = "u", ycolumn = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


EMA half-life summarizer

Description

Calculate the exponential moving average of a time series using the half-life specified and store the result in a new column named '<column>_ema'. See https://github.com/twosigma/flint/blob/master/doc/ema.md for details on different EMA implementations.

Usage

summarize_ema_half_life(
  ts_rdd,
  column,
  half_life_duration,
  window = NULL,
  time_column = "time",
  interpolation = c("previous", "linear", "current"),
  convention = c("legacy", "convolution", "core"),
  key_columns = list()
)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

half_life_duration

A time duration specified in string form (e.g., "1d", "1h", "15m", etc) representing the half-life duration

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize the EMA of 'column' within the time interval of [t - 1h, t] for each timestamp 't', 'in_future("5s")' to summarize EMA of 'column' within the time interval of [t, t + 5s] for each timestamp 't'), or 'NULL' to summarize EMA of 'column' within the time interval of (-inf, t] for each timestamp 't'

time_column

Name of the column containing timestamps (default: "time")

interpolation

Method used for interpolating values between two consecutive data points, must be one of "previous", "linear", and "current" (default: "previous"). See https://github.com/twosigma/flint/blob/master/doc/ema.md for details on different interpolation methods.

convention

Convolution convention, must be one of "convolution", "core", and "legacy" (default: "legacy"). See https://github.com/twosigma/flint/blob/master/doc/ema.md for details.

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  price_sdf <- copy_to(
    sc,
    data.frame(time = seq(1000), price = rnorm(1000))
  )
  ts <- fromSDF(price_sdf, is_sorted = TRUE, time_unit = "SECONDS")
  ts_ema <- summarize_ema_half_life(
    ts,
    column = "price",
    half_life_duration = "100s"
  )
} else {
  message("Unable to establish a Spark connection!")
}


Exponential weighted moving average summarizer

Description

Compute the exponential weighted moving average (EWMA) of 'column' and store results in a new column named '<column>_ewma'. At time t[n], the i-th value x[i] with timestamp t[i] will have a weighted value of [weight(i, n) * x[i]], where weight(i, n) is determined by both 'alpha' and 'smoothing_duration'.

Usage

summarize_ewma(
  ts_rdd,
  column,
  alpha = 0.05,
  smoothing_duration = "1d",
  time_column = "time",
  convention = c("core", "legacy"),
  key_columns = list()
)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

alpha

A smoothing factor between 0 and 1 (default: 0.05) – a higher alpha discounts older observations faster

smoothing_duration

A time duration specified in string form (e.g., "1d", "1h", "15m", etc) or "constant". The weight applied to a past observation from time t[p] at time t[n] is jointly determined by 'alpha' and 'smoothing_duration'.

If 'smoothing_duration' is a fixed time duration such as "1d", then weight(p, n) = (1 - alpha) ^ [(t[n] - t[p]) / smoothing_duration]

If 'smoothing_duration' is "constant", then weight(p, n) = (1 - alpha) ^ (n - p). In other words, this option assumes the difference between consecutive timestamps is equal to some constant 'diff', so that 'smoothing_duration' is effectively also equal to 'diff'; then t[n] - t[p] = (n - p) * diff, and weight(p, n) = (1 - alpha) ^ [(t[n] - t[p]) / smoothing_duration] = (1 - alpha) ^ [(n - p) * diff / diff] = (1 - alpha) ^ (n - p). A plain-R illustration of these weight formulas follows the examples below.

time_column

Name of the column containing timestamps (default: "time")

convention

One of "core" or "legacy" (default: "core")

If 'convention' is "core", then the output will be weighted sum of all observations divided by the sum of all weight coefficients (see https://github.com/twosigma/flint/blob/master/doc/ema.md#core).

If 'convention' is "legacy", then the output will simply be the weighted sum of all observations, without being normalized by the sum of all weight coefficients (see https://github.com/twosigma/flint/blob/master/doc/ema.md#legacy).

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  price_sdf <- copy_to(
    sc,
    data.frame(
      time = ceiling(seq(12) / 2),
      price = seq(12) / 2,
      id = rep(c(3L, 7L), 6)
    )
  )
  ts <- fromSDF(price_sdf, is_sorted = TRUE, time_unit = "DAYS")
  ts_ewma <- summarize_ewma(
    ts,
    column = "price",
    smoothing_duration = "1d",
    key_columns = "id"
  )
} else {
  message("Unable to establish a Spark connection!")
}
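
A plain-R illustration of the weight formulas above (illustrative only, not part of the package API): with a fixed 'smoothing_duration', the weight decays exponentially in the time difference, and with "constant" it decays in the index difference.

alpha <- 0.05
smoothing_duration <- 86400 # "1d" expressed in seconds
# weight of an observation from time t_p when summarizing at time t_n
weight_fixed <- function(t_p, t_n) {
  (1 - alpha)^((t_n - t_p) / smoothing_duration)
}
weight_fixed(0, 86400) # one full day back: (1 - alpha)^1 = 0.95
# "constant" convention: the weight depends only on how many observations back
weight_constant <- function(p, n) (1 - alpha)^(n - p)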


Geometric mean summarizer

Description

Compute geometric mean of values from 'column' within a moving time window or within each group of records with identical timestamps and store results in a new column named '<column>_geometricMean'

Usage

summarize_geometric_mean(
  ts_rdd,
  column,
  key_columns = list(),
  incremental = FALSE
)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), u = seq(10, 1, -1)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_geometric_mean <- summarize_geometric_mean(ts, column = "u")
} else {
  message("Unable to establish a Spark connection!")
}


Kurtosis summarizer

Description

Compute the excess kurtosis (fourth standardized moment minus 3) of 'column' and store the result in a new column named '<column>_kurtosis'

Usage

summarize_kurtosis(ts_rdd, column, key_columns = list(), incremental = FALSE)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  price_sdf <- copy_to(
    sc,
    data.frame(
      time = ceiling(seq(12) / 2),
      price = seq(12) / 2,
      id = rep(c(3L, 7L), 6)
    )
  )
  ts <- fromSDF(price_sdf, is_sorted = TRUE, time_unit = "DAYS")
  ts_kurtosis <- summarize_kurtosis(ts, column = "price")
} else {
  message("Unable to establish a Spark connection!")
}


Maximum value summarizer

Description

Find maximum value among values from 'column' within each time window or within each group of records with identical timestamps, and store results in a new column named '<column>_max'

Usage

summarize_max(ts_rdd, column, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from looking behind 1 hour at each time point, 'in_future("5s")' to summarize data from looking forward 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_max <- summarize_max(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


Minimum value summarizer

Description

Find minimum value among values from 'column' within each time window or within each group of records with identical timestamps, and store results in a new column named '<column>_min'

Usage

summarize_min(ts_rdd, column, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from looking behind 1 hour at each time point, 'in_future("5s")' to summarize data from looking forward 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_min <- summarize_min(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


N-th central moment summarizer

Description

Compute n-th central moment of the column specified and store result in a new column named '<column>_<n>thCentralMoment'

Usage

summarize_nth_central_moment(
  ts_rdd,
  column,
  n,
  key_columns = list(),
  incremental = FALSE
)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

n

The order of moment to calculate

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = rnorm(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_4th_central_moment <- summarize_nth_central_moment(ts, column = "v", n = 4L)
} else {
  message("Unable to establish a Spark connection!")
}


N-th moment summarizer

Description

Compute n-th moment of the column specified and store result in a new column named '<column>_<n>thMoment'

Usage

summarize_nth_moment(
  ts_rdd,
  column,
  n,
  key_columns = list(),
  incremental = FALSE
)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

n

The order of moment to calculate

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = rnorm(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_4th_moment <- summarize_nth_moment(ts, column = "v", n = 4L)
} else {
  message("Unable to establish a Spark connection!")
}


Product summarizer

Description

Compute product of values from the given column within a moving time window and store results in a new column named '<column>_product'

Usage

summarize_product(ts_rdd, column, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_product <- summarize_product(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


Quantile summarizer

Description

Compute quantiles of 'column' within each time window or within each group of records with identical timestamps, and store results in new columns named '<column>_<quantile value>quantile'

Usage

summarize_quantile(ts_rdd, column, p, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

p

List of quantile probabilities

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_quantile <- summarize_quantile(
    ts, column = "v", p = c(0.5, 0.75, 0.99), window = in_past("3s")
  )
} else {
  message("Unable to establish a Spark connection!")
}


Skewness summarizer

Description

Compute skewness (third standardized moment) of 'column' and store the result in a new column named '<column>_skewness'

Usage

summarize_skewness(ts_rdd, column, key_columns = list(), incremental = FALSE)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  price_sdf <- copy_to(
    sc,
    data.frame(
      time = ceiling(seq(12) / 2),
      price = seq(12) / 2,
      id = rep(c(3L, 7L), 6)
    )
  )
  ts <- fromSDF(price_sdf, is_sorted = TRUE, time_unit = "DAYS")
  ts_skewness <- summarize_skewness(ts, column = "price")
} else {
  message("Unable to establish a Spark connection!")
}
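
Because 'price_sdf' above also carries an 'id' column, a natural follow-up (a sketch reusing 'ts' from the example above; 'ts_skewness_by_id' is an illustrative name) is to compute skewness separately for each id by passing it as a key column:

if (!is.null(sc)) {
  # key_columns = c("id"): records with id 3 and id 7 form two separate
  # time series and are summarized independently
  ts_skewness_by_id <- summarize_skewness(ts, column = "price", key_columns = c("id"))
}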


Standard deviation summarizer

Description

Compute unbiased (i.e., Bessel's correction is applied) sample standard deviation of values from 'column' within each time window or within each group of records with identical timestamps, and store results in a new column named '<column>_stddev'

Usage

summarize_stddev(ts_rdd, column, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_stddev <- summarize_stddev(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


Sum summarizer

Description

Compute moving sums on the column specified and store results in a new column named '<column>_sum'

Usage

summarize_sum(ts_rdd, column, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_sum <- summarize_sum(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


Variance summarizer

Description

Compute variance of values from 'column' within each time window or within each group of records with identical timestamps, and store results in a new column named '<column>_variance', with Bessel's correction applied to the results

Usage

summarize_var(ts_rdd, column, window = NULL, key_columns = list())

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_var <- summarize_var(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


Weighted average summarizer

Description

Compute moving weighted average, weighted standard deviation, weighted t-stat, and observation count with the column and weight column specified, and store results in new columns named '<column>_<weighted_column>_mean', '<column>_<weighted_column>_weightedStandardDeviation', '<column>_<weighted_column>_weightedTStat', and '<column>_<weighted_column>_observationCount'.

Usage

summarize_weighted_avg(
  ts_rdd,
  column,
  weight_column,
  window = NULL,
  key_columns = list()
)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

weight_column

Column specifying relative weight of each data point

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10), w = seq(1, 0.1, -0.1)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_weighted_avg <- summarize_weighted_avg(
    ts,
    column = "v", weight_column = "w", window = in_past("3s")
  )
} else {
  message("Unable to establish a Spark connection!")
}


Pearson weighted correlation summarizer

Description

Compute Pearson weighted correlation between 'xcolumn' and 'ycolumn' weighted by 'weight_column' and store the result in a new column named '<xcolumn>_<ycolumn>_<weight_column>_weightedCorrelation'

Usage

summarize_weighted_corr(
  ts_rdd,
  xcolumn,
  ycolumn,
  weight_column,
  key_columns = list(),
  incremental = FALSE
)

Arguments

ts_rdd

Timeseries RDD being summarized

xcolumn

Column representing the first random variable

ycolumn

Column representing the second random variable

weight_column

Column specifying relative weight of each data point

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_covar(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), x = rnorm(10), y = rnorm(10), w = 1.1^seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_weighted_corr <- summarize_weighted_corr(ts, xcolumn = "x", ycolumn = "y", weight_column = "w")
} else {
  message("Unable to establish a Spark connection!")
}


Weighted covariance summarizer

Description

Compute unbiased weighted covariance between values from 'xcolumn' and 'ycolumn' within each time window or within each group of records with identical timestamps, using values from 'weight_column' as relative weights, and store results in a new column named '<xcolumn>_<ycolumn>_<weight_column>_weightedCovariance'

Usage

summarize_weighted_covar(
  ts_rdd,
  xcolumn,
  ycolumn,
  weight_column,
  window = NULL,
  key_columns = list()
)

Arguments

ts_rdd

Timeseries RDD being summarized

xcolumn

Column representing the first random variable

ycolumn

Column representing the second random variable

weight_column

Column specifying relative weight of each data point

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_z_score()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), u = rnorm(10), v = rnorm(10), w = 1.1^seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_weighted_covar <- summarize_weighted_covar(
    ts,
    xcolumn = "u", ycolumn = "v", weight_column = "w", window = in_past("3s")
  )
} else {
  message("Unable to establish a Spark connection!")
}


Z-score summarizer

Description

Compute the z-score of value(s) in the column specified, with respect to the sample mean and standard deviation observed so far, with the option for out-of-sample calculation, and store the result in a new column named '<column>_zScore'.

Usage

summarize_z_score(
  ts_rdd,
  column,
  include_current_observation = FALSE,
  key_columns = list(),
  incremental = FALSE
)

Arguments

ts_rdd

Timeseries RDD being summarized

column

Column to be summarized

include_current_observation

If TRUE, then use unbiased sample standard deviation including the current observation in the z-score calculation; otherwise use unbiased sample standard deviation excluding the current observation

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr2(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = rnorm(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_z_score <- summarize_z_score(ts, column = "v", include_current_observation = TRUE)
} else {
  message("Unable to establish a Spark connection!")
}
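
As a plain-R illustration of what 'include_current_observation' toggles (a conceptual sketch, not the package's implementation; it assumes the mean and the standard deviation are both taken over the corresponding sample), for the i-th value of a single un-keyed series 'x':

# include_current_observation = TRUE: the current value is part of the sample
z_including <- function(x, i) {
  (x[i] - mean(x[1:i])) / sd(x[1:i])
}
# include_current_observation = FALSE: the current value is held out
z_excluding <- function(x, i) {
  (x[i] - mean(x[1:(i - 1)])) / sd(x[1:(i - 1)])
}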


Wrapper functions for commonly used summarizer functions

Description

R wrapper functions for commonly used Flint summarizer functionalities such as sum and count.

Arguments

ts_rdd

Timeseries RDD being summarized

window

Either an R expression specifying time windows to be summarized (e.g., 'in_past("1h")' to summarize data from the past hour at each time point, 'in_future("5s")' to summarize data from the next 5 seconds at each time point), or 'NULL' to compute aggregate statistics on records grouped by timestamps

column

Column to be summarized

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.
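
Examples

This entry documents shared parameters only; the individual wrappers are described under their own summarize_* help pages. As a combined sketch (following the connection pattern of the other examples; 'ts_sum_by_id' is an illustrative name), the parameters above plug into any of the wrappers:

library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10), id = rep(c(1L, 2L), 5)))
  ts <- from_sdf(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  # moving sum over the past 3 seconds, computed separately for each id
  ts_sum_by_id <- summarize_sum(
    ts, column = "v", window = in_past("3s"), key_columns = c("id")
  )
} else {
  message("Unable to establish a Spark connection!")
}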


Export data from TimeSeriesRDD to a Spark dataframe

Description

Construct a Spark dataframe containing time series data from a TimeSeriesRDD

Usage

to_sdf(ts_rdd)

toSDF(ts_rdd)

Arguments

ts_rdd

A TimeSeriesRDD object

Value

A Spark dataframe containing time series data exported from 'ts_rdd'

See Also

Other Spark dataframe utility functions: collect.ts_rdd(), from_rdd(), from_sdf(), spark_connection.ts_rdd(), spark_dataframe.ts_rdd(), spark_jobj.ts_rdd(), ts_rdd_builder()

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- from_sdf(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_avg <- summarize_avg(ts, column = "v", window = in_past("3s"))
  # now export the average values from `ts_avg` back to a Spark dataframe
  # named `sdf_avg`
  sdf_avg <- ts_avg %>% to_sdf()
} else {
  message("Unable to establish a Spark connection!")
}
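
From there, standard sparklyr workflows apply; for instance, the summarized results in 'sdf_avg' can be brought into an R data frame with 'dplyr::collect()':

if (!is.null(sc)) {
  # pull the Spark dataframe produced by to_sdf() into local R memory
  df_avg <- sdf_avg %>% dplyr::collect()
}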


Attempt to establish a Spark connection

Description

Attempt to connect to Apache Spark and return a Spark connection object upon success

Usage

try_spark_connect(...)

Arguments

...

Parameters for sparklyr::spark_connect

Value

A Spark connection object if the attempt was successful, or NULL otherwise

Examples


try_spark_connect(master = "local")


TimeSeriesRDD builder object

Description

Builder object containing all required info (i.e., isSorted, timeUnit, and timeColumn) for importing a Spark data frame into a TimeSeriesRDD

Usage

ts_rdd_builder(
  sc,
  is_sorted = FALSE,
  time_unit = .sparklyr.flint.globals$kValidTimeUnits,
  time_column = .sparklyr.flint.globals$kDefaultTimeColumn
)

Arguments

sc

Spark connection

is_sorted

Whether the rows being imported are already sorted by time

time_unit

Time unit of the time column (must be one of the following values: "NANOSECONDS", "MICROSECONDS", "MILLISECONDS", "SECONDS", "MINUTES", "HOURS", "DAYS")

time_column

Name of the time column

Value

A reusable TimeSeriesRDD builder object

See Also

Other Spark dataframe utility functions: collect.ts_rdd(), from_rdd(), from_sdf(), spark_connection.ts_rdd(), spark_dataframe.ts_rdd(), spark_jobj.ts_rdd(), to_sdf()
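
Examples

No example ships with this entry, so the following is a sketch of how a builder might be reused across several Spark dataframes that share the same time column and time unit. The '$fromSDF()' accessor is an assumption about the builder's interface inferred from its description above, not documented usage:

library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  builder <- ts_rdd_builder(
    sc, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t"
  )
  sdf_1 <- copy_to(sc, tibble::tibble(t = seq(10), u = seq(10)), name = "sdf_1")
  sdf_2 <- copy_to(sc, tibble::tibble(t = seq(10), v = rnorm(10)), name = "sdf_2")
  # assumed interface: apply the same builder to each Spark dataframe
  ts_1 <- builder$fromSDF(sdf_1)
  ts_2 <- builder$fromSDF(sdf_2)
} else {
  message("Unable to establish a Spark connection!")
}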


Time window specifications

Description

Functions for specifying commonly used types of time windows, which should only be used within the context of summarize_* functions (e.g., 'summarize_count(ts_rdd, window = in_past("3s"))'). When a time window specification is passed to a summarize_* function, the Spark connection parameter ('sc') of the time window object is injected automatically and will be the same Spark connection the underlying timeseries RDD object is associated with, so 'sc' never needs to be specified explicitly.

Create a sliding time window capturing data within the closed interval of [current time - duration, current time]

Create a sliding time window capturing data within the closed interval of [current time, current time + duration]

Usage

in_past(duration, sc)

in_future(duration, sc)

Arguments

duration

String representing the length of the time window, containing a number followed by a time unit (e.g., "10s" or "10sec"), where the time unit must be one of the following: "d", "day", "h", "hour", "min", "minute", "s", "sec", "second", "ms", "milli", "millisecond", "µs", "micro", "microsecond", "ns", "nano", "nanosecond"

sc

Spark connection (does not need to be specified within the context of 'summarize_*' functions)

Value

A time window object usable by the Flint time series library

Examples


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_count <- summarize_count(ts, column = "v", window = in_past("3s"))
} else {
  message("Unable to establish a Spark connection!")
}


library(sparklyr)
library(sparklyr.flint)

sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  sdf <- copy_to(sc, tibble::tibble(t = seq(10), v = seq(10)))
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  ts_count <- summarize_count(ts, column = "v", window = in_future("3s"))
} else {
  message("Unable to establish a Spark connection!")
}