---
title: "A Short Introduction to the FSinR Package"
author: "Francisco Aragón Royón"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{A Short Introduction to the FSinR Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

### 1. Introduction

The package **FSinR** contains functions to perform the feature selection process. More specifically, it contains a large number of filter and wrapper methods widely used in the literature, which are combined with search algorithms in order to obtain an optimal subset of features. The FSinR package uses the functions for training classification and regression models available in the R caret package to generate wrapper measures. This gives the package access to a wide range of methods and functionality. In addition, the package has been implemented so that its use is as easy and intuitive as possible. This is why the package contains a number of high-level functions from which any search algorithm or filter/wrapper method can be called.

The package can be installed from the CRAN repository as follows:

```{r eval = FALSE, 'Install package'}
install.packages("FSinR")
```

### 2. Feature Selection Process

As mentioned above, the feature selection process aims to obtain an optimal subset of features. To do this, a search algorithm is combined with a filter/wrapper method. The search algorithm guides the process through the feature space according to the results that the filter/wrapper method returns for the evaluated subsets. The subset of features obtained in this process is useful for generating simpler models, which require fewer computational resources and can achieve better final performance.

The feature selection process is done with the `featureSelection` function.
This is the main function of the package and its parameters are:

- data: a matrix or data.frame with the dataset
- class: the name of the dependent variable
- searcher: the search algorithm
- evaluator: the evaluation method (filter or wrapper method)

The function returns the best subset of features found and the measure value of that subset, together with additional information such as the execution time.

#### 2.1. Search algorithms

The search functions present in the package are the following:

1. Sequential forward selection
1. Sequential floating forward selection
1. Sequential backward selection
1. Sequential floating backward selection
1. Breadth-first search
1. Depth-first search
1. Genetic algorithm
1. Whale optimization algorithm
1. Ant colony optimization
1. Simulated annealing
1. Tabu search
1. Hill-Climbing
1. Las Vegas wrapper

The package contains a function, `searchAlgorithm`, which allows you to select the search algorithm to be used in the feature selection process. The function has the following parameters:

- searcher: the name of the search algorithm
- params: a list of specific parameters for each algorithm

The result of the call to this function is another function, which is passed to the main function as the search algorithm.

#### 2.2. Filter methods

The filter methods implemented in the package are:

1. Chi-squared
1. Cramer V
1. F-score
1. Relief
1. Rough Sets consistency
1. Binary consistency
1. Inconsistent Examples consistency
1. Inconsistent Examples Pairs consistency
1. Determination Coefficient
1. Mutual information
1. Gain ratio
1. Symmetrical uncertainty
1. Gini index
1. Jd
1. MDLC
1. ReliefFeatureSetMeasure

The `filterEvaluator` function allows you to select a filter method from the above.
The function has the parameters:

- filter: the name of the filter method
- params: a list of specific parameters for each method

The result of the function call is a function that is used as a filter evaluation method in the main function, `featureSelection`.

#### 2.3. Wrapper methods

The FSinR package allows the 238 models available in the caret package to be used as wrapper methods. The complete list of caret models can be found [here](https://topepo.github.io/caret/available-models.html). In addition, the caret package makes it possible to set a group of options to customize the models (e.g. resampling techniques, evaluation measure, grid parameters, ...) using the `trainControl` and `train` functions. In FSinR, the `wrapperEvaluator` function is used to set all these parameters and to generate the wrapper model, relying on the caret functions underneath.

The `wrapperEvaluator` function has the following parameters:

- learner: the name of one of the models available in caret
- resamplingParams: list of parameters for the `trainControl` function
- fittingParams: list of parameters for the `train` function (x, y, method and trControl are not necessary)

The result of this function is another function that is used as a wrapper evaluation measure in the main function.

#### 2.4. Feature Selection process example

To demonstrate in a simple manner how the package works, the iris dataset will be used in this example. It is important to note that the FSinR package does not divide data into training and test data. Instead, it applies the feature selection process to the entire dataset passed to it as a parameter. Therefore, in a modeling process the user should divide the dataset prior to performing the feature selection process. Missing data must also be processed prior to the use of the package.
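
The split recommended above could look roughly like the following sketch. It is not evaluated in this vignette. It uses `createDataPartition` from caret to hold out 20% of the rows, runs the selection on the training partition only, and then keeps the selected columns in both partitions. The seed, the 80/20 ratio and the variable names are arbitrary choices made for illustration, and the extraction of the selected names assumes that `bestFeatures` is a one-row 0/1 matrix whose column names are the feature names.

```{r eval=FALSE, 'Split before selection (sketch)'}
library(caret)
library(FSinR)

data(iris)
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Hold out 20% of the rows, stratified by the target variable
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]

# Run the feature selection on the training partition only
evaluator <- wrapperEvaluator("knn")
searcher <- searchAlgorithm('sequentialForwardSelection')
results <- featureSelection(trainData, 'Species', searcher, evaluator)

# Assumption: bestFeatures is a one-row 0/1 matrix with the feature names as column names
selected <- colnames(results$bestFeatures)[as.vector(results$bestFeatures) == 1]

# Keep only the selected features (plus the target) in both partitions
trainReduced <- trainData[, c(selected, 'Species')]
testReduced <- testData[, c(selected, 'Species')]
```

The test partition never takes part in the search, so any model later fitted on `trainReduced` can be assessed on `testReduced` without the selection step having seen those rows.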
Since the main purpose of this vignette is to illustrate the use of the package and not the entire modeling process, the entire unpartitioned dataset is used in this case.

The iris dataset consists of 150 instances of 4 variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) that determine the type of iris plant. The target variable, Species, has 3 possible classes (setosa, versicolor, virginica).

```{r message=FALSE, 'Load libraries'}
library(caret)
library(FSinR)

data(iris)
```

##### 2.4.1. Search + Wrapper

Next we illustrate an example of feature selection composed of a search algorithm and a wrapper method.

First, the wrapper method must be generated. To do this, call the `wrapperEvaluator` function and specify which model to use as the wrapper method. Since the iris problem is a classification problem, the example uses a knn model.

```{r, 'Generate Evaluator (S+W)'}
evaluator <- wrapperEvaluator("knn")
```

The next step is to generate the search algorithm. This is done by calling the `searchAlgorithm` function and specifying the algorithm to use. In this case the search algorithm is a sequential search, `sequentialForwardSelection`.

```{r, 'Generate searcher (S+W)'}
searcher <- searchAlgorithm('sequentialForwardSelection')
```

Once the search algorithm and the wrapper method have been generated, we call the main function that performs the feature selection process.

```{r, 'Feature Selection (S+W)'}
results <- featureSelection(iris, 'Species', searcher, evaluator)
```

The results show the best subset of features found and its evaluation.

```{r, 'Results (S+W)'}
results$bestFeatures
results$bestValue
```

The above example illustrates a simple use of the package, but the call can be made more specific by setting additional parameters.

Regarding the wrapper method, it is possible to tune the model parameters.
First, you can set the resampling parameters, which are the same arguments that are passed to the `trainControl` function of the caret package. Second, you can set the fitting parameters, which are the same as for the caret `train` function.

```{r eval=FALSE, 'Generate wrapper (S+W) 2'}
resamplingParams <- list(method = "cv", number = 10)
fittingParams <- list(preProc = c("center", "scale"), metric = "Accuracy", tuneGrid = expand.grid(k = c(1:20)))

evaluator <- wrapperEvaluator("knn", resamplingParams, fittingParams)
```

For more details, the way in which caret trains and tunes a model is described [here](https://topepo.github.io/caret/model-training-and-tuning.html). Note that the FSinR package automatically detects from the metric whether the objective of the problem is to maximize or to minimize.

As for the search algorithm, it is also possible to set a number of parameters, which are specific to each search algorithm. If we now use a tabu search, we can set the number of iterations of the algorithm, the size of the tabu list, and an intensification phase and a diversification phase, among others.

```{r eval=FALSE, 'Search generator (W+S) 2'}
searcher <- searchAlgorithm('tabu', list(tamTabuList = 4, iter = 5, intensification = 2, iterIntensification = 5, diversification = 1, iterDiversification = 5, verbose = FALSE))
```

Most search algorithms in the FSinR package include a parameter called verbose which, if set to TRUE, prints the progress of the iterations of the algorithm in the console.

For tabu search, note that the number of neighbors considered and evaluated in each iteration, `numNeigh`, defaults to all available neighbors. A high value of this parameter considerably increases the computation time.

Finally, the feature selection process is run again.

```{r eval=FALSE, 'Feature Selection (S+W) 2'}
results <- featureSelection(iris, 'Species', searcher, evaluator)
```

##### 2.4.2. Search + Filter

This example is very similar to the previous one. The main difference is that a filter method is generated as the evaluator instead of a wrapper method. This is done with the function `filterEvaluator`.

```{r, 'Generate Evaluator (S+F)'}
evaluator <- filterEvaluator('MDLC')
```

The search algorithm is then generated:

```{r, 'Generate searcher (S+F)'}
searcher <- searchAlgorithm('sequentialForwardSelection')
```

And finally the feature selection process is performed, passing to the `featureSelection` function the filter method and the search algorithm generated.

```{r, 'Feature Selection (S+F)'}
results <- featureSelection(iris, 'Species', searcher, evaluator)
```

Again, the best subset obtained and the value obtained by the evaluation measure are shown as results.

```{r, 'Results (S+F)'}
results$bestFeatures
results$bestValue
```

As in the previous example, you can set the parameters of both the search algorithm and the filter method if the functions have them.

##### 2.4.3. Individual evaluation of features

Advanced use of the package allows individual evaluation of a set of features using the filter/wrapper methods. That is, the filter/wrapper methods can be used directly without the need to include them in a search algorithm. This is not a defined high-level functionality within the package, since an individual evaluation is not a feature selection process itself. But it is a functionality that can be realized and has to be taken into consideration.

To do this, an evaluator must be generated using a filter method or a wrapper method.

```{r, 'Generate Evaluator (F/W)'}
filter_evaluator <- filterEvaluator("IEConsistency")

wrapper_evaluator <- wrapperEvaluator("lvq")
```

A measure is then obtained from the evaluator by passing it the dataset, the name of the dependent variable and the set of features to be evaluated.

```{r, 'Results (F/W)'}
resultFilter <- filter_evaluator(iris, 'Species', c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
resultFilter

resultWrapper <- wrapper_evaluator(iris, 'Species', c("Petal.Length", "Petal.Width"))
resultWrapper
```

### 3. Direct Feature Selection Process

In this case, the feature selection process consists of a direct search algorithm (cut-off method) combined with an evaluator based on filter or wrapper methods.

The direct feature selection process is performed using the `directFeatureSelection` function. This function has the same parameters as the `featureSelection` function, with the difference that the searcher parameter is replaced by the directSearcher parameter, which contains a direct search algorithm.

The function returns the selected subset of features and the value of the individual evaluation of each of these features.

#### 3.1. Direct search algorithms

The direct search functions present in the package are the following:

1. Select k best
1. Select percentile
1. Select threshold
1. Select threshold range
1. Select difference
1. Select slope

The package contains the `directSearchAlgorithm` function that allows you to select the direct search algorithm to be used. The parameters are the same as those of the `searchAlgorithm` function, except that the searcher parameter is replaced by the directSearcher parameter, which contains the name of the chosen cut-off method.

The result of the call to this function is another function, which is passed to the main function as the direct search algorithm.

#### 3.2. Direct Feature Selection process example

The previous example showed the functionality of the package on a classification problem. This example shows a regression problem. For this we use the mtcars dataset, composed of 32 instances of 11 variables: the target, mpg, and 10 predictors. The regression problem consists of forecasting the value of the mpg variable.

Compared to the process in the previous example, the main change in a direct feature selection process is that a direct search algorithm has to be generated. In addition, the function for performing the feature selection process changes to `directFeatureSelection`. The generation of the filter or wrapper methods remains the same.

```{r 'DFS'}
library(caret)
library(FSinR)

data(mtcars)

evaluator <- filterEvaluator('determinationCoefficient')

directSearcher <- directSearchAlgorithm('selectKBest', list(k = 3))

results <- directFeatureSelection(mtcars, 'mpg', directSearcher, evaluator)
results$bestFeatures
results$featuresSelected
results$valuePerFeature
```

### 4. Hybrid Feature Selection Process

The hybrid feature selection process is composed of a hybrid search algorithm that combines the measures obtained from 2 evaluators. This process is performed with the `hybridFeatureSelection` function. The parameters of this function are as follows:

- data: a matrix or data.frame with the dataset
- class: the name of the dependent variable
- hybridSearcher: the hybrid search algorithm
- evaluator_1: the first evaluation method (filter or wrapper method)
- evaluator_2: the second evaluation method (filter or wrapper method)

As before, the function returns the best subset of features found and its measure, among other values.

#### 4.1. Hybrid search algorithms

The hybrid search function present in the package is the following:

1. Linear Consistency-Constrained (LCC)

The package contains the `hybridSearchAlgorithm` function that allows you to select this hybrid search function.
The parameters are the same as those of the `searchAlgorithm` and `directSearchAlgorithm` functions, except that the searcher (directSearcher) parameter is replaced by the hybridSearcher parameter, which contains the name of the chosen hybrid method.

The result of the call to this function is another function, which is passed to the main function as the hybrid search algorithm.

#### 4.2. Hybrid Feature Selection process example

This example continues with the previous regression problem.

The hybrid feature selection process is slightly different from the ones above, because in addition to generating a hybrid search algorithm, two evaluators must be generated. The package implements only one hybrid search algorithm, LCC. This algorithm requires a first evaluator to evaluate the features individually, and a second evaluator to evaluate the features jointly.

The package implements different filter/wrapper methods focused on individual feature evaluation or on set evaluation. Set evaluation methods can also evaluate features individually, but individual evaluation methods (Cramer, Chi squared, F-Score and Relief) cannot evaluate sets of features.

Therefore, to perform a hybrid feature selection process, a hybrid search algorithm has to be generated together with an individual feature evaluator (individual or set evaluation method) and a set evaluator (set evaluation method).

```{r 'HFS'}
library(caret)
library(FSinR)

data(mtcars)

evaluator_1 <- filterEvaluator('determinationCoefficient')
evaluator_2 <- filterEvaluator('ReliefFeatureSetMeasure')

hybridSearcher <- hybridSearchAlgorithm('LCC')

results <- hybridFeatureSelection(mtcars, 'mpg', hybridSearcher, evaluator_1, evaluator_2)
results$bestFeatures
results$bestValue
```
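
The distinction between individual and set evaluation described above can be illustrated with a short sketch, which is not evaluated in this vignette. It calls a set evaluation method directly, in the same way as in section 2.4.3, first on a single feature and then on a pair of features. The choice of the `wt` and `hp` columns of mtcars and the variable names are arbitrary and only for illustration.

```{r eval=FALSE, 'Individual vs set evaluation (sketch)'}
library(FSinR)

data(mtcars)

# A set evaluation method, generated as in the previous examples
evaluator <- filterEvaluator('determinationCoefficient')

# Individual evaluation: a single feature is passed
valueSingle <- evaluator(mtcars, 'mpg', 'wt')
valueSingle

# Set evaluation: the same evaluator scores two features together
valuePair <- evaluator(mtcars, 'mpg', c('wt', 'hp'))
valuePair
```

Because the same evaluator accepts both a single feature and a set, a set evaluation method such as this one can fill either evaluator role in `hybridFeatureSelection`.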