% \VignetteIndexEntry{ A short fscaret package introduction with examples } % \VignetteDepends{fscaret} % \VignettePackage{fscaret} \documentclass[a4,12pt]{article} \usepackage{amsmath} \usepackage[pdftex]{graphicx} \usepackage{color} \usepackage{xspace} \usepackage{fancyvrb} \usepackage{fancyhdr} \usepackage{lastpage} \usepackage[ colorlinks=true, linkcolor=blue, citecolor=blue, urlcolor=blue] {hyperref} \usepackage{Sweave} \SweaveOpts{keep.source=TRUE} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % define new colors for use \definecolor{darkgreen}{rgb}{0,0.6,0} \definecolor{darkred}{rgb}{0.6,0.0,0} \definecolor{lightbrown}{rgb}{1,0.9,0.8} \definecolor{brown}{rgb}{0.6,0.3,0.3} \definecolor{darkblue}{rgb}{0,0,0.8} \definecolor{darkmagenta}{rgb}{0.5,0,0.5} \newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt\textsl{#1}}} \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} \renewcommand{\pkg}[1]{{\textsf{#1}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \fvset{fontsize=\footnotesize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand{\bld}[1]{\mbox{\boldmath{$#1$}}} \newcommand{\shell}[1]{\mbox{$#1$}} \renewcommand{\vec}[1]{\mbox{\bf {#1}}} \newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize} \newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize} \newcommand{\halfs}{\frac{1}{2}} \setlength{\oddsidemargin}{-.25 truein} \setlength{\evensidemargin}{0truein} \setlength{\topmargin}{-0.2truein} \setlength{\textwidth}{7 truein} \setlength{\textheight}{8.5 truein} \setlength{\parindent}{0.20truein} \setlength{\parskip}{0.10truein} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \pagestyle{fancy} \lhead{} \chead{A short \pkg{fscaret} package introduction with examples} \rhead{} \lfoot{} \cfoot{} \rfoot{\thepage\ of \pageref{LastPage}} \renewcommand{\headrulewidth}{1pt} \renewcommand{\footrulewidth}{1pt} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \title{A short \pkg{fscaret} package introduction with examples} \author{Jakub Szlek (j.szlek@uj.edu.pl)} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \thispagestyle{empty} \section{ Installation } As it is in case of \pkg{caret}, the \pkg{fscaret} uses large number of R packages but it loads them when needed. To fully take advantage of the package it is recommended to install it both with dependent and suggested packages. Install \pkg{fscaret} with the command <>= install.packages("fscaret", dependencies = c("Depends", "Suggests")) @ from R console.\\ {\color{darkred}\\Be adivsed! Running above code would install all possible packages (in some cases more than 240!), but it is necessary to fully benefit from fscaret. If you wish to use only specific algorithms, check which parameter from \code{funcRegPred} corresponds to which package.\\ \\In the second case install \pkg{fscaret} with the command <>= install.packages("fscaret", dependencies = c("Depends")) @ } \section{ Overview } In general \pkg{fscaret} is a wrapper module. It uses the engine of \pkg{caret} to build models and to get the variable ranking from them. When models are build package tries to draw variable importance from them directly or indirectly. The raw feature ranking would be worthless, since the results in this form cannot be compared.\\ That is why within \pkg{fscaret} the scaling process was introduced. Developed models are used to get prediction errors (RMSE and MSE). Finally the output is produced. It contains the data frame of variable importance, errors for all build models and preprocessed data set if the \code{preprocessData = TRUE} when calling the main \code{fscaret()} function. Also is possible to retrive the original models build with \code{train()} function of \pkg{caret}, to do this you should set \code{saveModel=TRUE} in the call of \code{fscaret} function.\\ \\In summary the whole feature ranking process can be divided into: \begin{enumerate} \item User provides input data sets and a few settings \item Models are build \item Variable rankings are draw out of the models \item Generalization error is calculated for each model \item Variable rankings are scaled according to generalization error \item The results are gathered in tables \end{enumerate} \section{ Input } \subsection{ Data set format } Be advised that \pkg{fscaret} assumes that data sets are in MISO format (multiple input single output). The example of such (with header) is: \begin{tabular}{|l|l|l|l|l|} \hline Input\_no1 & Input\_no2 & Input\_no3 & \ldots & Output \\ \hline 2 & 5.1 & 32.06 & \ldots & 1.02 \\ \hline 5 & 1.21 & 2.06 & \ldots & 7.2 \\ \hline \end{tabular} For more information on reading files in R, please write \code{?read.csv} in R console. If \code{fscaret()} function is switched to \code{classPred=TRUE}, the output must be in binary format (0/1). \subsection{An example} \label{firstEx} There are plenty of methods to introduce data sets into R. The best way is to read file (preasumably csv with \texttt{tab} as column separator) as follows: \begin{enumerate} \item Select file name <>= basename_file <- "My_database" file_name <- paste(basename_file,".csv",sep="") @ \item Read data file into matrix <>= matrixTrain <- read.csv(file_name,header=TRUE,sep="\t", strip.white = TRUE, na.strings = c("NA","")) @ \item Put loaded matrix into \code{data.frame} <>= matrixTrain <- as.data.frame(matrixTrain) @ \end{enumerate} Be advised to use \code{header=TRUE} when you have data set with column names as first row and \code{header=FALSE} when there are no column names.\\ Starting from \pkg{fscaret} version 0.9 setting \code{header} is fixed to \code{TRUE}.\\ The last step is obligatory to introduce data into \pkg{fscaret} functions as it checks if the data presented is in \code{data.frame} format. \section{ Function fscaret() } \subsection{Settings} All the settings are documented in \texttt{Reference manual} of \pkg{fscaret} \url{http://cran.r-project.org/web/packages/fscaret/fscaret.pdf}. Here we will concentrate only on a few valuable ones. \begin{itemize} \item \code{installReqPckg} The default setting is \code{FALSE}, but if set to \code{TRUE} prior to calculations it installs all packages from the sections `Depends` and `Suggests` of \textsc{DESCRIPTION}. Please be advised to be logged as root (admin) if you want to install packages for all users. \item \code{preprocessData} The default setting is \code{FALSE}, but if set to \code{TRUE} prior to calculations it performs the data preprocessing, which in short is realized in two steps: \begin{enumerate} \item Check for near zero variance predictors and flag as near zero if: \begin{itemize} \item the percentage of unique values is less than 20% and \item the ratio of the most frequent to the second most frequent value is greater than 20, \end{itemize} \item Check for susceptibility to multicollinearity \begin{itemize} \item Calculate correlation matrix \item Find variables with correlation 0.9 or more and delete them \end{itemize} \end{enumerate} \item \code{regPred} Default option is \code{TRUE} and so the regression models are applied \item \code{classPred} Default option is \code{FALSE} and if set \code{classPred=TRUE} remember to set \code{regPred=FALSE} \item \code{myTimeLimit} Time limit in seconds for single model development, be advised that some models need as time to be build, if the option is omitted, the standard 24-hours time limit is applied. This function is off on non-Unix like systems. \item \code{Used.funcRegPred} Vector of regression models to be used, for all available models please enter \code{Used.funcRegPred="all"}, the listing of functions is: <>= library(fscaret) data(funcRegPred) funcRegPred @ \item \code{Used.funcClassPred} Vector of classification models to be used, for all available models please enter \code{Used.funcClassPred="all"}, the listing of functions is: <>= library(fscaret) data(funcClassPred) funcClassPred @ \item \code{no.cores} The default setting is \code{NULL} as to maximize the CPU utilization and to use all available cores. \item \code{missData} This option handles the missing data. Possible values are: \begin{itemize} \item \code{missData="delRow"} - for deletion of observations (rows) with missing values, \item \code{missData="delCol"} - for deletion of attributes (columns) with missing values, \item \code{missData="meanCol"} - for imputing mean to missing values, \item \code{missData=NULL} - no action is taken. \end{itemize} \item \code{supress.output} Default option is \code{FALSE}, but it is sometimes justifed to supress the output of intermediate functions and focus on ranking predictions. \item \code{saveModel} Default option is \code{FALSE} as some models have large size, and therefore saving all obtained would lead to 100-500 MB RData files. Keep in mind that loading such large objects into R would require a lot of RAM, e.g. 140MB RData file consumes about 1.5GB of RAM. On the other hand one may want to utilize developed models.\\ To export a model from a result of \code{fscaret()} function, e.g. \code{myFS} object: <>= my_res_foba <- myFS$VarImp$model$foba my_res_foba <- structure(my_res_foba,class="train") @ \end{itemize} \subsection{Regression problems - an example} A simple example of regression problem utilizing the data provided in the \pkg{fscaret}: <>= library(fscaret) data(dataset.train) data(dataset.test) trainDF <- dataset.train testDF <- dataset.test myFS<-fscaret(trainDF, testDF, myTimeLimit = 5, preprocessData=TRUE, Used.funcRegPred=c("pcr","pls"), with.labels=TRUE, supress.output=TRUE, no.cores=1) myRES_tab <- myFS$VarImp$matrixVarImp.MSE[1:10,] myRES_tab <- subset(myRES_tab, select=c("pcr","pls","SUM%","ImpGrad","Input_no")) myRES_rawMSE <- myFS$VarImp$rawMSE myRES_PPlabels <- myFS$PPlabels @ <>= library(fscaret) # if((Sys.info()['sysname'])=="SunOS"){ myRES_tab <- data.frame(pcr = c(5.862841e+01, 1.567799e+01, 1.916511e+01, 2.519981e-01, 1.872058e-02, 2.880832e-04, 5.880416e-04, 7.190168e-05, 1.570926e-06, 1.081909e-06), pls = c(5.227714e+01, 2.741963e+01, 1.995465e+01, 3.161112e-01, 3.079973e-02, 1.324904e-03, 2.781880e-04, 6.894892e-05, 2.697715e-06, 1.078743e-06), "SUM" = c(1.000000e+02, 3.885975e+01, 3.527303e+01, 5.122461e-01, 4.465089e-02, 1.454379e-03, 7.810516e-04, 1.270005e-04, 3.848898e-06, 1.948191e-06), ImpGrad=c(0.000000, 61.140253, 9.229896, 98.547769, 91.283313, 96.742777, 46.296556, 83.739807, 96.969384, 49.383143), Input_no=c("4","5","22","23","2","13","9","1","17","21")) names(myRES_tab)[length(myRES_tab)-2]<-"SUM%" myRES_rawMSE <- data.frame(pcr = c(716.6597), pls = c(671.8195)) myRES_PPlabels <- data.frame("Orig Input No"=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 29), Labels = c("Balaban.index", "Dreiding.energy", "Fused.aromatic.ring.count", "Hyper.wiener.index", "Szeged.index", "Ring.count.of.atom", "pI", "Quaternary_structure", "PLGA_Mw", "La_to_Gly", "PVA_conc_inner_phase", "PVA_conc_outer_phase", "PVA_Mw", "Inner_phase_volume", "Encaps_rate", "PLGA_conc", "PLGA_to_Placticizer", "diss_pH", "diss_add", "Prod_method", "Asymmetric.atom.count.1", "Hyper.wiener.index.1", "Szeged.index.1", "count", "pH_14_logd", "bpKa2", "Cyclomatic.number.2")) # } else { # data(dataset.train) # data(dataset.test) # trainDF <- dataset.train # testDF <- dataset.test # myFS<-fscaret(trainDF, testDF, myTimeLimit = 5, preprocessData=TRUE,regPred=TRUE, # Used.funcRegPred=c("pcr","pls"), with.labels=TRUE, # supress.output=TRUE, no.cores=1, saveModel=FALSE) # myRES_tab <- myFS$VarImp$matrixVarImp.MSE[1:10,] # myRES_rawMSE <- myFS$VarImp$rawMSE # myRES_PPlabels <- myFS$PPlabels # } myRES_tab <- subset(myRES_tab, select=c("pcr","pls","SUM%","ImpGrad","Input_no")) @ <>= # library(MASS) # # # make testing set # data(Pima.te) # # Pima.te[,8] <- as.numeric(Pima.te[,8])-1 # # myDF <- Pima.te # # myFS.class<-fscaret(myDF, myDF, myTimeLimit = 20, preprocessData=FALSE, with.labels=TRUE, classPred=TRUE, regPred=FALSE, Used.funcClassPred=c("knn","rpart"), # supress.output=FALSE, no.cores=1) # # print(myFS.class) # myRES.class_tab <- myFS.class$VarImp$matrixVarImp.MeasureError[,] # myRES.class_tab <- subset(myRES.class_tab, select=c("knn","rpart","SUM%","ImpGrad","Input_no")) # myRES.class_rawError <- myFS.class$VarImp$rawMeasureError @ \subsection{Classification problems - an example} An example of classification problem utilizing the data \code{data(Pima.te)} in the \pkg{MASS}: <>= library(MASS) # make testing set data(Pima.te) Pima.te[,8] <- as.numeric(Pima.te[,8])-1 myDF <- Pima.te myFS.class<-fscaret(myDF, myDF, myTimeLimit = 5, preprocessData=FALSE, with.labels=TRUE, classPred=TRUE,regPred=FALSE, Used.funcClassPred=c("knn","rpart"), supress.output=TRUE, no.cores=1) myRES.class_tab <- myFS.class$VarImp$matrixVarImp.MeasureError myRES.class_tab <- subset(myRES.class_tab, select=c("knn","rpart","SUM%","ImpGrad","Input_no")) myRES.class_rawError <- myFS.class$VarImp$rawMeasureError @ \section{Output} For regression problems, as it was stated previously there are three lists of outputs. \begin{enumerate} \item Feature ranking and generalization errors for models: <>= # Print out the Variable importance results for MSE scaling print(myRES_tab) @ \item Raw RMSE/MSE errors for each model <>= # Print out the generalization error for models print(myRES_rawMSE) @ \item Reduced data frame of inputs after preprocessing <>= # Print out the reduced number of inputs after preprocessing print(myRES_PPlabels) @ \end{enumerate} As one can see in the example there were only two models used \code{"pcr","pls"}, to use all available models please set option \code{Used.funcRegPred="all"}. The results can be presented on a bar plot (see Figure \ref{F:barPlot}). Then the arbitrary feature reduction can be applied. \begin{figure} \begin{center} <>= # Present variable importance on barplot a=0.9 b=0.7 c=2 # if((Sys.info()['sysname'])=="SunOS"){ myFS <- NULL myFS$VarImp$matrixVarImp.MSE <- myRES_tab # } lk_row.mse=nrow(myFS$VarImp$matrixVarImp.MSE) setEPS() barplot1 <- barplot(myFS$VarImp$matrixVarImp.MSE$"SUM%"[1:(a*lk_row.mse)], cex.names=b, las = c, xlab="Variables", ylab="Importance Sum%", names.arg=c(myFS$VarImp$matrixVarImp.MSE$Input_no[1:(a*lk_row.mse)])) lines(x = barplot1, y = myFS$VarImp$matrixVarImp.MSE$"SUM%"[1:(a*lk_row.mse)]) points(x = barplot1, y = myFS$VarImp$matrixVarImp.MSE$"SUM%"[1:(a*lk_row.mse)]) @ \caption{A sum of feature ranking of models trained and tested on \code{dataset.train}, two models were used \code{"pcr","pls"}. } \label{F:barPlot} \end{center} \end{figure} For classification problems, two lists of outputs. \begin{enumerate} \item Feature ranking and errors (F-measure) for models: <>= # Print out the Variable importance results for F-measure scaling print(myRES.class_tab) @ \item Raw F-measures for each model <>= # Print out the generalization error for models print(myRES.class_rawError) @ \end{enumerate} As one can see in the example there were only two models used \code{"knn","rpart"}, to use all available models please set option \code{Used.funcClassPred="all"}. The results can be presented on a bar plot as the previous ones. Then the arbitrary feature reduction can be applied. \section{Known issues} \begin{enumerate} \item In some cases during model development stage users can encounter "caught segfault" errors. It is highly depenent on the input data and the model. The nature of the error prevents function \code{fscaret()} returning proper results, therefore no scaling of variable importance is done, and no summary of feature ranking is presented.\\The way around is to exclude the troublesome method from calculations. If you encounter an odd behaviour of your working script, e.g. results of an object \code{myFS} in \code{VarImp} is an empty \code{list()}, search for "segfault" in a Rout file. In the example given below \code{"partDSA"} is the trouble maker. Then run once again computations. <>= library(fscaret) myFuncRegPred <- funcRegPred[which(funcRegPred!="partDSA")] print(funcRegPred) myFS<-fscaret(trainDF, testDF, myTimeLimit = 12*60*60, preprocessData=TRUE,regPred=TRUE, Used.funcRegPred=myFuncRegPred, with.labels=TRUE, supress.output=TRUE, no.cores=NULL, saveModel=FALSE) @ \end{enumerate} \section{Acknowledgments} This work was funded by Poland-Singapore bilateral cooperation project no 2/3/POL-SIN/2012. \section{References} \begin{enumerate} \item Szlek J, Paclawski A, Lau R, Jachowicz R, Mendyk A. Heuristic modeling of macromolecule release from PLGA microspheres. International Journal of Nanomedicine. 2013:8(1); 4601 - 4611.\href{http://www.dovepress.com/heuristic-modeling-of-macromolecule-release-from-plga-microspheres-peer-reviewed-article-IJN}{link to webpage} \item Szlek, J., Paclawski, A., Lau, R., Jachowicz, R., Mendyk, A. Heuristic modeling of macromolecules release from PLGA microspheres. Conference proceedings. Gdansk, May 24-25, 2013.\href{http://www.polgerpharm.gumed.edu.pl/attachment/attachment/20321/AbstractBook_Cover_Corrected.pdf}{Abstract book} \item Paclawski A, Szlek J, Lau R, Jachowicz R, Mendyk A. Empirical modeling of the fine particle fraction for carrier-based pulmonary delivery formulations. International Journal of Nanomedicine. 2015:10(1); 801 - 810.\href{http://www.dovepress.com/empirical-modeling-of-the-fine-particle-fraction-fornbspcarrier-based--peer-reviewed-article-IJN}{link to webpage} \item Szlek, J., Paclawski, A., Lau, R., Jachowicz, R., Kazemi, P., Mendyk, A. Empirical search for factors affecting mean particle size of PLGA microspheres containing macromolecular drugs. Computer Methods and Programs in Biomedicine. 2016:134; 137-147.\href{https://doi.org/10.1016/j.cmpb.2016.07.006}{link to webpage} \item Eskandarian, S., Bahrami, P., Kazemi, P. A comprehensive data mining approach to estimate the rate of penetration: Application of neural network, rule based models and feature ranking. Journal of Petroleum Science and Engineering. 2017:156; 605 - 615.\href{https://doi.org/10.1016/j.petrol.2017.06.039}{link to webpage} \item Manuel Amunategui. Ensemble Feature Selection On Steroids: fscaret Package. At: Data Exploration and Machine Learning, Hands-on. On-line Practical walkthroughs on machine learning, data exploration and finding insight.\href{http://amunategui.github.io/fscaret-Walkthrough/}{On-line resource} \end{enumerate} \end{document}