---
title: "Introduction to theft"
author: "Trent Henderson"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 4
vignette: >
  %\VignetteIndexEntry{Introduction to theft}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 7,
  fig.width = 7,
  warning = FALSE,
  fig.align = "center"
)
```

```{r setup, message = FALSE, warning = FALSE}
library(theft)
```

## Purpose

`theft` enables the standardised calculation of time-series features from six existing feature sets across R and Python, as well as any user-supplied features.

## Core functionality

All time-series datasets passed into `theft` must be a `tbl_ts` generated through the [`tsibble`](https://tsibble.tidyverts.org/) R package. This ensures consistency with the broader [`tidyverts`](https://tidyverts.org/) collection of packages.

To explore package functionality, we are going to use a dataset that comes with `theft` called `simData`. This dataset contains a collection of simulated time-series processes: Gaussian noise, AR(1), ARMA(1,1), MA(1), noisy sinusoid, and a random walk. The dataset can be accessed via:

```{r, message = FALSE, warning = FALSE, eval = FALSE}
theft::simData
```

Note that `simData` is a `tsibble` with two `key` variables, `id` and `process`, which identify each time series by its unique ID and group, and an `index` variable, `timepoint`, which denotes the time indices. The data has the following structure:

```{r, message = FALSE, warning = FALSE}
head(simData)
```

### Calculating time-series features

The core function in `theft` is `calculate_features`. You can choose which feature sets to calculate with the `feature_set` argument. The choices are currently `"catch22"`, `"feasts"`, `"tsfeatures"`, `"tsfresh"`, `"tsfel"`, and `"kats"`.

Note that `kats`, `tsfresh` and `tsfel` are Python packages.
The R package `reticulate` is used to call Python code that uses these packages and applies it within the broader *tidy* data philosophy embodied by `theft`. `theft` currently provides access to $>1100$ features from these six sets alone. However, as discussed in the functionality demonstrations below, you can also supply your own list of features!

#### Installing Python feature sets

This step is only needed if you want to use the `Kats`, `tsfresh` or `TSFEL` feature sets; the R-based sets will run fine without it. Before using the Python feature sets, you should have a working Python 3.9 installation. After installing `theft`, run the function `install_python_pkgs(venv, python)`, where the `venv` argument is the name of the virtual environment you want to create and `python` is the path to the Python interpreter you want to use.

For example, if you wanted to install the Python libraries to the default virtual environment folder used by `reticulate`, you would run the following after first having installed `theft` (here I am just creating a new virtual environment called `"theft-package"`---you can call it whatever you like!):

```{r, eval = FALSE}
install_python_pkgs(venv = "theft-package", python = "/usr/local/bin/python3.10")
```

You can then run the following to activate the virtual environment:

```{r, eval = FALSE}
init_theft("theft-package")
```

You are now ready to commit theft!

**NOTE 1: You only need to call** `init_theft` **once per session.**

**NOTE 2: There are also separate installation functions for each Python feature set, such as** `install_tsfresh`**, if you only need one of the libraries and want to keep your dependencies light.**

#### Calculating features

The core function in `theft` is `calculate_features`, which takes the following arguments:

* `data`---a `tbl_ts` containing the time series data
* `feature_set`---character or vector of characters denoting the set of time-series features to calculate.
Can be one or more of `"catch22"`, `"feasts"`, `"tsfeatures"`, `"tsfresh"`, `"tsfel"`, or `"kats"`
* `features`---a named list containing a set of user-supplied functions to calculate on `data`. Each function should take a single argument, which is the time series. Defaults to `NULL` for no manually-specified features. Each list entry must have a name, as `calculate_features` uses these names to label the features. If you don't want to use the existing feature sets and only want to compute those passed to `features`, set `feature_set = NULL`
* `catch24`---Boolean specifying whether to compute `catch24` in addition to `catch22` if `catch22` is one of the feature sets selected. Defaults to `FALSE`
* `tsfresh_cleanup`---Boolean specifying whether to use the in-built `tsfresh` relevant feature filter or not. Defaults to `FALSE`
* `use_compengine`---Boolean specifying whether to use the `"compengine"` features in `tsfeatures`. Defaults to `FALSE`, which substantially reduces computation time
* `seed`---integer used to fix R's random number generator to ensure reproducibility. Defaults to `123`

Here is an example with the `catch22` set:

```{r, message = FALSE, warning = FALSE}
feature_matrix <- calculate_features(data = simData, 
                                     feature_set = "catch22")

head(feature_matrix)
```

Note that `data` must be a `tsibble::tbl_ts` object, which has specified `key` (i.e., identifying) and `index` (i.e., time) variables. `theft` treats the first variable in the `key` as the ID variable and the second as the grouping variable (if there is one). Any other key variables will be ignored by `theft`.

You can also supply your own named list of functions to compute as time-series features. Below is an example with mean and standard deviation. Note that the list *must* be named, as `theft` uses the list element names to label the time-series features internally.
If you don't want to use any of the existing feature sets in `theft` and only want to calculate the features you supply to `features`, just set `feature_set = NULL`.

```{r, message = FALSE, warning = FALSE}
feature_matrix2 <- calculate_features(data = simData, 
                                      feature_set = NULL,
                                      features = list("mean" = mean, "sd" = sd))

head(feature_matrix2)
```

### Comparison of feature sets

For a detailed comparison of the six feature sets, see [this paper](https://ieeexplore.ieee.org/document/9679937)^[T. Henderson and B. D. Fulcher, "An Empirical Evaluation of Time-Series Feature Sets," 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 1032-1038, doi: 10.1109/ICDMW53433.2021.00134.].

## Reading and processing hctsa-formatted files

As `theft` is based on the foundations laid by [`hctsa`](https://github.com/benfulcher/hctsa), there is also functionality for reading in `hctsa`-formatted MATLAB files and automatically processing them into tidy dataframes ready for feature extraction in `theft`. The `process_hctsa_file` function takes a string filepath to the MATLAB file and does all the work for you, returning a dataframe with naming conventions consistent with other `theft` functionality. As per `hctsa` specifications for [Input File Format 1](https://time-series-features.gitbook.io/hctsa-manual/installing-and-using-hctsa/calculating/input_files#input-file-format-1-.mat-file), this file should have 3 variables with the following exact names: `timeSeriesData`, `labels`, and `keywords`. The filepath can be a local drive path or a URL.

## Analysing, interpreting, and visualising time-series features

Please see the companion package [`theftdlc`](https://github.com/hendersontrent/theftdlc) ('`theft` downloadable content') for a large suite of functions that are designed to work on top of `theft`.
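
## Appendix: preparing your own data

All examples in this vignette use `simData`, which is already a `tbl_ts`. The following is a minimal sketch of how you might prepare your own long-format data to meet the same requirement. The toy data frame and its column names (`id`, `group`, `timepoint`, `values`) are hypothetical and chosen only for illustration; the sketch also assumes that a single measured column is what `calculate_features` expects, so please check the function documentation for your installed version of `theft`.

```{r, message = FALSE, warning = FALSE, eval = FALSE}
library(tsibble)
library(theft)

# Hypothetical toy data: two short series in long format,
# one row per observation
set.seed(123)

my_data <- data.frame(
  id = rep(c("series_1", "series_2"), each = 50),
  group = rep(c("noise", "walk"), each = 50),
  timepoint = rep(1:50, times = 2),
  values = c(rnorm(50), cumsum(rnorm(50)))
)

# Declare the key (ID variable first, grouping variable second)
# and the index (time variable)
my_tsibble <- as_tsibble(my_data, key = c(id, group), index = timepoint)

# Compute only the user-supplied features (assumes the single
# measured column `values` is picked up automatically)
my_features <- calculate_features(data = my_tsibble,
                                  feature_set = NULL,
                                  features = list("mean" = mean, "sd" = sd))

head(my_features)
```

The ID variable is listed first in `key` and the grouping variable second because `theft` treats the key variables in that order. The chunk is marked `eval = FALSE` so that it does not run when the vignette is built.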