---
title: "Introduction to theft"
author: "Trent Henderson"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 4
vignette: >
  %\VignetteIndexEntry{Introduction to theft}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
  
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 7,
  fig.width = 7,
  warning = FALSE,
  fig.align = "center"
)
```

```{r setup, message = FALSE, warning = FALSE}
library(theft)
```

## Purpose

`theft` standardises the calculation of time-series features from six existing feature sets across R and Python, as well as any user-supplied features.

## Core functionality

Every time-series dataset passed into `theft` must be a `tbl_ts` created with the [`tsibble`](https://tsibble.tidyverts.org/) R package. This ensures consistency with the broader [`tidyverts`](https://tidyverts.org/) collection of packages. To explore package functionality, we are going to use a dataset that comes with `theft` called `simData`. This dataset contains simulated time-series processes: Gaussian noise, AR(1), ARMA(1,1), MA(1), a noisy sinusoid, and a random walk. The dataset can be accessed via:
  
```{r, message = FALSE, warning = FALSE, eval = FALSE}
theft::simData
```

Note that `simData` is a `tsibble` with two `key` variables, `id` and `process`, which identify each time series by its unique ID and group, and an `index` variable, `timepoint`, which denotes the time indices.

The data has the following structure:
  
```{r, message = FALSE, warning = FALSE}
head(simData)
```

### Calculating time-series features

The core function in `theft` is `calculate_features`. You can choose which subset of features to calculate with the `feature_set` argument. The choices are currently `"catch22"`, `"feasts"`, `"tsfeatures"`, `"tsfresh"`, `"tsfel"`, and `"kats"`.

Note that `kats`, `tsfresh` and `tsfel` are Python packages. The R package `reticulate` is used to call Python code that uses these packages, and the results are returned within the broader *tidy* data philosophy embodied by `theft`. `theft` currently provides access to $>1100$ features from these six sets alone. As the demonstrations below show, you can also supply your own list of features!

#### Installing Python feature sets

The R-based feature sets run without any Python setup. If you want to use the `Kats`, `tsfresh` or `TSFEL` feature sets, you need a working Python 3.9 installation. After installing `theft`, run the function `install_python_pkgs(venv, python)`, where the `venv` argument is the name of the virtual environment you want to create and `python` is the path to the Python interpreter you want to use.

For example, if you wanted to install the Python libraries to the default virtual environment folder used by `reticulate`, you would run the following after first having installed `theft` (here I am just creating a new virtual environment called `"theft-package"`---you can call it whatever you like!):

```{r, eval = FALSE}
install_python_pkgs(venv = "theft-package", python = "/usr/local/bin/python3.10")
```

You can then run the following to activate the virtual environment:

```{r, eval = FALSE}
init_theft("theft-package")
```

You are now ready to commit theft!

**NOTE 1: You only need to call** `init_theft` **once per session.**

**NOTE 2: There are also separate installation functions for each Python feature set, such as** `install_tsfresh` **if you only need one of the libraries and want to keep your dependencies light.**
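
If you are unsure which path to pass as `python`, base R's `Sys.which` can look up an interpreter on your search path. The sketch below is only a starting point: it assumes an executable named `python3` is on your `PATH`, and it does not check that the interpreter it finds is the version you need. The `install_python_pkgs` call is left commented out so that you can confirm the path first.

```{r, eval = FALSE}
# Look up the first `python3` executable on the PATH (an assumption: your
# interpreter may have a different name or live somewhere else entirely)
python_path <- unname(Sys.which("python3"))

if (nzchar(python_path)) {
  message("Found a Python interpreter at: ", python_path)
  # Once you are happy with the path, pass it to the installer:
  # install_python_pkgs(venv = "theft-package", python = python_path)
} else {
  message("No `python3` found on the PATH; supply the full path manually.")
}
```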

#### Calculating features

`calculate_features` takes the following arguments:

* `data`---a `tbl_ts` containing the time series data
* `feature_set`---character or vector of characters denoting the set of time-series features to calculate. Can be one or more of `"catch22"`, `"feasts"`, `"tsfeatures"`, `"tsfresh"`, `"tsfel"`, or `"kats"`
* `features`---a named list containing a set of user-supplied functions to calculate on `data`. Each function should take a single argument which is the time series. Defaults to `NULL` for no manually-specified features. Each list entry must have a name as `calculate_features` looks for these to name the features. If you don't want to use the existing feature sets and only compute those passed to `features`, set `feature_set = NULL`
* `catch24`---Boolean specifying whether to compute `catch24` in addition to `catch22` if `catch22` is one of the feature sets selected. Defaults to `FALSE`
* `tsfresh_cleanup`---Boolean specifying whether to use the in-built `tsfresh` relevant feature filter or not. Defaults to `FALSE`
* `use_compengine`---Boolean specifying whether to use the `"compengine"` features in `tsfeatures`. Defaults to `FALSE` because these features are computationally expensive to calculate
* `seed`---integer denoting a fixed number for R's random number generator to ensure reproducibility. Defaults to `123`

Here is an example with the `catch22` set:
  
```{r, message = FALSE, warning = FALSE}
feature_matrix <- calculate_features(data = simData, feature_set = "catch22")
head(feature_matrix)
```
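
The other arguments listed above combine in the same call. As a sketch that uses only arguments from that list, the following requests `catch22` together with the `catch24` extension and sets the seed explicitly. The object name `feature_matrix_c24` is just an illustrative choice.

```{r, eval = FALSE}
library(theft)

# catch22 plus the catch24 extension, with the seed stated explicitly
feature_matrix_c24 <- calculate_features(
  data = simData,
  feature_set = "catch22",
  catch24 = TRUE,
  seed = 123
)

head(feature_matrix_c24)
```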

Note that `data` must be a `tsibble::tbl_ts` object, which has specified `key` (i.e., identifying) and `index` (i.e., time) variables. `theft` treats the first variable in the `key` as the ID variable and the second as the grouping variable (if there is one). Any other key variables will be ignored by `theft`.
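
If your own data lives in an ordinary data frame, it needs converting first. Here is a minimal sketch, assuming a long-format data frame with one row per observation. The column names `id`, `group`, `timepoint`, and `values` and the simulated series are purely illustrative.

```{r, eval = FALSE}
library(tsibble)

set.seed(123)

# Two made-up series of 50 observations each, stored in long format
my_data <- data.frame(
  id = rep(c("series_1", "series_2"), each = 50),
  group = rep(c("noise", "random_walk"), each = 50),
  timepoint = rep(1:50, times = 2),
  values = c(rnorm(50), cumsum(rnorm(50)))
)

# First key variable = ID, second key variable = group; index = time
my_tsibble <- as_tsibble(my_data, key = c(id, group), index = timepoint)

key_vars(my_tsibble)   # the key variables
index_var(my_tsibble)  # the index variable
```

The resulting `my_tsibble` can then be passed as the `data` argument of `calculate_features`.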

You can also supply your own named list of functions to compute as time-series features. Below is an example with the mean and standard deviation. The list *must* be named, as `theft` uses the list element names to label the time-series features internally. If you only want the features you supply to `features` and none of the existing feature sets, set `feature_set = NULL`.

```{r, message = FALSE, warning = FALSE}
feature_matrix2 <- calculate_features(data = simData, 
                                      feature_set = NULL,
                                      features = list("mean" = mean, "sd" = sd))

head(feature_matrix2)
```
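
User-supplied features are not limited to existing functions such as `mean` and `sd`; anonymous functions can go in the named list too. The sketch below assumes that each function receives the series values as a numeric vector and should return a single number, which goes slightly beyond what is stated above, so treat it as an assumption. The feature names `lag1_acf` and `prop_above_mean` are hypothetical. Checking the functions on a plain vector first can catch mistakes before the full run:

```{r, eval = FALSE}
my_features <- list(
  # Autocorrelation at lag 1
  "lag1_acf" = function(x) stats::acf(x, lag.max = 1, plot = FALSE)$acf[2],
  # Proportion of observations above the series mean
  "prop_above_mean" = function(x) mean(x > mean(x))
)

# Quick sanity check on a simulated series: one number per feature
set.seed(123)
test_series <- rnorm(100)
checks <- sapply(my_features, function(f) f(test_series))
checks

# Then pass the named list to `features` as before:
# feature_matrix3 <- calculate_features(data = simData,
#                                       feature_set = NULL,
#                                       features = my_features)
```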

### Comparison of feature sets

For a detailed comparison of the six feature sets, see [this paper](https://ieeexplore.ieee.org/document/9679937)^[T. Henderson and B. D. Fulcher, "An Empirical Evaluation of Time-Series Feature Sets," 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 1032-1038, doi: 10.1109/ICDMW53433.2021.00134.].

## Reading and processing hctsa-formatted files

As `theft` is based on the foundations laid by [`hctsa`](https://github.com/benfulcher/hctsa), there is also functionality for reading in `hctsa`-formatted Matlab files and automatically processing them into tidy dataframes ready for feature extraction in `theft`. The `process_hctsa_file` function takes a string filepath to the Matlab file and does all the work for you, returning a dataframe with naming conventions consistent with other `theft` functionality. As per `hctsa` specifications for [Input File Format 1](https://time-series-features.gitbook.io/hctsa-manual/installing-and-using-hctsa/calculating/input_files#input-file-format-1-.mat-file), this file should have 3 variables with the following exact names: `timeSeriesData`, `labels`, and `keywords`. The filepath can be a local drive path or a URL.

## Analysing, interpreting, and visualising time-series features

Please see the companion package [`theftdlc`](https://github.com/hendersontrent/theftdlc) ('`theft` downloadable content') for a large suite of functions that are designed to work on top of `theft`.