\documentclass[11pt, oneside, a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[sectionbib, round]{natbib}
\usepackage[textwidth = 17cm, top = 2cm, bottom = 2cm]{geometry}
\usepackage{nameref}
\usepackage{bm}
\usepackage{amsmath, amssymb, amsfonts}
\usepackage{graphicx}
\usepackage{hyperref}
\newcommand{\pkg}[1]{\texorpdfstring%
{{\normalfont\fontseries{b}\selectfont #1}}%
{#1}}
\begin{document}
\SweaveOpts{engine = R}
%\VignetteIndexEntry{rebmix: The Rebmix Package}
%\VignetteKeyword{continuous variable}
%\VignetteKeyword{discrete variable}
%\VignetteKeyword{finite mixture}
%\VignetteKeyword{parameter estimation}
%\VignetteKeyword{REBMIX algorithm}
%\VignettePackage{rebmix}
\bibliographystyle{abbrvnat}
\title{\pkg{rebmix}: Finite Mixture Modeling, Clustering \& Classification}
\author{Marko Nagode, Branislav Pani\'{c}, Jernej Klemenc \& Simon Oman}
\date{\today}
\maketitle

\begin{abstract}
The \pkg{rebmix} package provides R functions for random univariate and multivariate finite mixture model generation, estimation, clustering, latent class analysis and classification. Variables can be continuous, discrete, independent or dependent and may follow normal, lognormal, Weibull, gamma, binomial, Poisson, Dirac or von Mises parametric families.
\end{abstract}

\section{Introduction}

To cite the REBMIX algorithm, please refer to \citep{Nagode_Fajdiga_2011a, Nagode_Fajdiga_2011b, Nagode_2015, Nagode_2018, Panic_2020a, Panic_2020b, Panic_2020c}. For the theoretical background, please see \url{http://doi.org/10.5963/JAO0302001}.

\section{What's new in version \texttt{2.16.0}}

Version \texttt{2.16.0} introduces the handling of warnings and errors: the user can now monitor from R the problems that may occur in the C++ code. Some minor changes and bug fixes have been made in both the C++ and the R code. The rough estimation procedure has been thoroughly analysed and improved. An additional argument \texttt{Mode} has been added to the \texttt{REBMIX} method, so the detection of the global mode can now be performed in different ways, which can be advantageous when handling different types of datasets.

\section{Previous versions}

Version \texttt{2.15.0} introduces several auxiliary methods and some improvements to the core methods of the package. The two new auxiliary methods are aimed primarily at digital images and other datasets of spatial interest that are initially processed by flattening or similarly removing the spatial information. The newly added \texttt{labelmoments} method estimates spatial moments for clustering solutions obtained with either the \texttt{RCLRMIX} or the \texttt{mapclusters} method. The output of the \texttt{RCLRMIX} or \texttt{mapclusters} method must first be converted into a two-dimensional or three-dimensional spatial matrix. The spatial moments are then estimated similarly to the widely used image moments \citep{liao1996image}. From the estimated spatial moments, the \texttt{labelmoments} method also estimates the adjacency matrix between the different labels of the clustering. See the manual for more information. The second added method, \texttt{mergelabels}, uses the spectral clustering of \citet{ng2001spectral} to propose another clustering solution that merges the most similar labels of the previous clustering solution. \texttt{mergelabels} accepts two adjacency matrices. The first one can be created with the \texttt{labelmoments} method for spatial similarity, and the second can be, for example, a similarity matrix of the different mixture components. Version \texttt{2.15.0} therefore also improves the class \texttt{RCLRMIX}, which now has a slot \texttt{A} holding an adjacency matrix (i.e., a similarity matrix) between the different mixture components, computed according to the argument \texttt{Rule} of the \texttt{RCLRMIX} method. Finally, the C++ functions \texttt{CombineComponentsEntropy()} and \texttt{CombineComponentsDemp()} have been improved and further tested.
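For illustration, the two methods might be combined along the following lines. This is only a hypothetical sketch: the argument names, the returned components and the object \texttt{imageclu} (standing for an \texttt{RCLRMIX} clustering of a flattened $128 \times 128$ image) are assumptions and should be checked against the manual.

<<eval = FALSE>>=
## Hypothetical sketch; argument names and return values are assumptions.
## Reshape the cluster labels into a two-dimensional spatial matrix.
Zp <- matrix(a.Zp(imageclu), nrow = 128, ncol = 128)

## Estimate spatial moments and the label adjacency matrix.
moments <- labelmoments(Zp, cmax = 8)

## Merge the most similar labels using two adjacency matrices: the spatial
## one from labelmoments and the component similarity matrix in slot A.
merged <- mergelabels(A = list(moments$A, a.A(imageclu)), k = 4)
@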
Version \texttt{2.14.2} introduces several important improvements. The mixture parameters can now be estimated from an object of class \texttt{"Histogram"}, which represents the data with bins and bin counts. Objects of class \texttt{"Histogram"} are best suited for discrete datasets such as digital images: the image resolution can be quite high, yet there is a finite number of realizations of the pixel intensity, often much smaller than the number of observations. It is then more economical to represent the original dataset as ordered pairs $(\bar{y}_j, k_j)$, where $\bar{y}_j$ is the $j$th possible realization and $k_j$ is its count. Continuous datasets can also be shrunk in this way, at the cost of some bias. The bias is usually small when the number of observations is large, and vice versa; whenever time and memory efficiency matter, shrinking the dataset is recommended. The original dataset (an object of class \texttt{data.frame}) is transformed into a \texttt{"Histogram"} object with the newly added methods \texttt{chistogram} and \texttt{fhistogram}. The \texttt{fhistogram} method is memory intensive but very fast and is best suited for one-dimensional datasets. Since it precomputes all possible realizations, even those not present in the dataset, the argument \texttt{shrink = TRUE} should be added. The \texttt{chistogram} method is slower but memory efficient: a new realization $\bar{y}_j$ is appended only when it is encountered. The two methods can be chained in a for loop so that multiple datasets are shrunk into a single \texttt{"Histogram"} object (see the sketch below). This is useful when, for example, multiple images are to be loaded, or when a large dataset is processed in parts to avoid a large memory overhead.

The \texttt{mapclusters} method has been added for maximum a posteriori estimation of cluster membership. It complements the \texttt{RCLRMIX} method, which performs clustering, cluster membership prediction and cluster merging using different estimators. For large datasets this can be very time consuming. Using objects of class \texttt{"Histogram"} speeds up the \texttt{RCLRMIX} method enormously, but the cluster membership is then determined only for the realizations $\bar{y}_j$, and the original order of the observations is not preserved either. The \texttt{mapclusters} method can therefore be used to obtain the cluster membership of the original \texttt{data.frame} dataset. Finally, all density estimation methods (\texttt{demix}, \texttt{dfmix}, \texttt{pemix} and \texttt{pfmix}) have been modified to support objects of class \texttt{"Histogram"}.
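A minimal sketch of such chaining follows. It assumes that the method accepts a previously created \texttt{"Histogram"} object via the argument \texttt{x} and the raw data via \texttt{Dataset}; the file names and the number of bins are purely illustrative, so check the manual before use.

<<eval = FALSE>>=
## Hypothetical sketch; the argument names x, Dataset and K are assumptions.
ybar <- NULL
for (name in c("image1.csv", "image2.csv")) {
  y <- read.csv(name)
  ## Append the realizations of y to the bins accumulated so far.
  ybar <- chistogram(x = ybar, Dataset = y, K = 256)
}
@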
Version \texttt{2.14.0} introduces a three-parameter variant of the Gumbel parametric family
\begin{equation}
f(y) = \frac{1}{\sigma}\exp\left(\frac{y - \mu}{\xi \sigma} - \exp\left(\frac{y - \mu}{\xi \sigma}\right)\right),
\end{equation}
where $\xi \in \{-1, \text{NA}, 1\}$; $\xi = -1$ recovers the Gumbel distribution of maxima introduced in version \texttt{2.13.1} below, while $\xi = 1$ gives the distribution of minima. If $\xi$ is initially set to $\text{NA}$, the package estimates it; otherwise $\xi$ keeps its initial value. The package has been extended to handle three-parameter parametric families. The class \texttt{"EMMIX.Theta"} and the method \texttt{EMMIX} have been added; they enable parameter estimation for all parametric families available in the package by using the EM algorithm only.

Version \texttt{2.13.1} introduces the Gumbel probability density function
\begin{equation}
f(y) = \frac{1}{\sigma}\exp\left(-\frac{y - \mu}{\sigma} - \exp\left(-\frac{y - \mu}{\sigma}\right)\right).
\end{equation}
Gumbel mixture models can now be estimated; in the mixture model the parameters $\mu$ and $\sigma$ are estimated for all components. The corresponding equations for the REBMIX and EM algorithms have been derived. The EM algorithm has also been implemented for all other parametric family types (normal, lognormal, Weibull, gamma, binomial, Poisson and von Mises), so it can now be used with all parametric family types. Additionally, a histogram-based EM algorithm has been added to speed up the calculations when the datasets contain large numbers of observations. Finally, the package has been debugged further and some core functions have been improved.

Version \texttt{2.12.0} introduces the Knuth algorithm \citep{Knuth_2019} as an effective way of searching for the optimal number of bins, which affects the time efficiency of the \texttt{REBMIX} method considerably. The user can now enter different numbers of bins for different random variables. Two accompanying methods, \texttt{optbins} and \texttt{bins}, have been added. The C++ code has been optimized regarding memory allocation and efficiency, and the R code has been debugged and further improved regarding the messaging of invalid user input. Outlier detection has been simplified. The \texttt{REBMIX} method now delivers somewhat more components, but it also identifies components with a very low probability of occurrence, which is important for our future plans. The \texttt{plot} method has been improved, and the \texttt{RCLRMIX} and \texttt{REBMIX} methods have been debugged and improved.

Version \texttt{2.11.0} introduces the expectation-maximization (EM) algorithm for the improved estimation of Gaussian mixture model parameters (with diagonal and unrestricted covariance matrices). Here the REBMIX algorithm is used to provide the initial parameters for the EM algorithm. Two variants of the EM algorithm are implemented, namely the original EM algorithm of \citet{Dempster_1977} and a $k$-means-like variant (classification EM) as described in \citet{Celeux_1992}. As the REBMIX algorithm estimates a wide range of Gaussian mixture model parameters for different numbers of components, three different strategies, named \textbf{exhaustive}, \textbf{best} and \textbf{single}, have been implemented. The \textbf{exhaustive} strategy runs the EM algorithm (or its variant) on each solution of Gaussian mixture model parameters provided by the REBMIX algorithm.
The \textbf{best} strategy utilizes a voting scheme for the parameters estimated by the REBMIX algorithm and runs the EM algorithm only on the selected optimal parameters. The best candidates are chosen by the highest value of the likelihood function for each number of components $c$ from the specified minimum \texttt{cmin} to the specified maximum \texttt{cmax}. The \textbf{single} strategy is useful when a single value of, for example, the number of bins in the histogram preprocessing is supplied as input for the REBMIX algorithm; when multiple numbers of bins $k$ are supplied, this strategy is the same as the \textbf{exhaustive} strategy.

To tackle the slow linear convergence of the EM algorithm, simple acceleration methods are implemented, which can be controlled with the parameters \texttt{acceleration} and \texttt{acceleration.multiplier}. The increment of the EM algorithm in each iteration can be written as
\begin{equation}
\Delta\boldsymbol{\Theta} = \boldsymbol{\Theta}^{(i+1)} - \boldsymbol{\Theta}^{(i)}.
\end{equation}
To reduce the number of iterations needed by the EM algorithm, the standard EM increment $\Delta\boldsymbol{\Theta}$ can be multiplied by a multiplier $a_{\mathrm{EM}}$, referred to as \texttt{acceleration.multiplier}. The update in each EM iteration then becomes
\begin{equation}
\boldsymbol{\Theta}^{(i+1)} = \boldsymbol{\Theta}^{(i)} + a_{\mathrm{EM}}\Delta\boldsymbol{\Theta}.
\end{equation}
The safe range for the multiplier $a_{\mathrm{EM}}$ lies between 1.0 and 2.0, where 1.0 gives the standard EM increment and 2.0 doubles it. However, this does not necessarily mean that multiplication by a value of 2.0 will double the speed of the EM algorithm (i.e., halve the required number of iterations). A value of 1.5 is safe and mostly speeds up the EM algorithm whilst retaining good estimates; a value of 1.9 can speed up the estimation significantly, yet it can also degrade the quality of the estimated parameters. The value of the multiplier, set with the \texttt{acceleration.multiplier} parameter, therefore needs to be chosen carefully.

The parameter \texttt{acceleration} controls how the multiplier $a_{\mathrm{EM}}$ is handled and can be one of \textbf{fixed}, \textbf{line} or \textbf{golden}. With the \textbf{fixed} option, $a_{\mathrm{EM}}$ is specified via the \texttt{acceleration.multiplier} parameter and the increment is multiplied by this fixed value in each iteration of the EM algorithm. The \textbf{line} and \textbf{golden} options perform a line search and a golden-section search, respectively, for the value of $a_{\mathrm{EM}}$ that yields the highest increase in the likelihood function in each EM iteration. EM handling is carried out using the newly introduced class \texttt{"EM.Control"}. The classes \texttt{"REBMIX"} and \texttt{"REBMVNORM"} and their signature method \texttt{REBMIX} now accept an \texttt{"EM.Control"} object via the argument \texttt{EMcontrol}.
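For instance, an \texttt{"EM.Control"} object requesting the golden-section search could be created as follows. This sketch simply mirrors the constructor calls used in the examples later in this vignette.

<<eval = FALSE>>=
## Sketch mirroring the examples below: exhaustive strategy with a
## golden-section search for the acceleration multiplier.
EM <- new("EM.Control", strategy = "exhaustive", variant = "EM",
  acceleration = "golden", acceleration.multiplier = 1.0,
  tolerance = 1e-4, maximum.iterations = 1000, K = 0)
@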
The slots of the class \texttt{"EM.Control"} follow the same naming convention as the input argument \texttt{EMcontrol} (\texttt{strategy}, \texttt{variant}, \texttt{acceleration}, \texttt{tolerance}, \texttt{acceleration.multiplier} and \texttt{maximum.iterations}), and accessor functions named \texttt{a.\textit{slot name}} as well as setter functions named \texttt{a.\textit{slot name}<-} are provided. The methods \texttt{Zp} and \texttt{coef} have been replaced by the \texttt{a.Zp}, \texttt{a.theta1.all} and \texttt{a.theta2.all} getters. All slots can be accessed via accessors; their names are generally composed of \texttt{a.} followed by the slot name and are used to read the slots. The class \texttt{"RNGMIX.Theta"} has been added to simplify random finite mixture model generation, and the method \texttt{show} has been added for the \texttt{"RCLS.chunk"} class. The minimum number of components \texttt{cmin} has been added to the \texttt{REBMIX} arguments and to the \texttt{"REBMIX"} class. The \texttt{"Parzen window"} preprocessing has been renamed to the more commonly known \texttt{"kernel density estimation"}. Rough parameter estimation for the binomial and Poisson parametric families has also been improved, and the package has been broadened to latent class analysis in version \texttt{2.10.3}. The method \texttt{split} has been improved and examples of its proper use have been added.

GCC 8.1 notes and warnings in the C++ functions have been eliminated in version \texttt{2.10.2}. The Cholesky decomposition is now used instead of the LU decomposition to calculate the logarithm of the determinant and the inverse of the variance-covariance matrices. Special attention has been paid to resolving numerical problems related to high-dimensional datasets.

Version \texttt{2.10.1} is a further debugged version of \texttt{2.10.0}. A large \texttt{K} in combination with a large dimension $d$ can lead to histograms with numerous nonempty bins $v$. In order to restrain $v$, the well-known RootN rule \citep{Velleman_1976} may intuitively be extended to multiple dimensions
\begin{equation}
v_{\mathrm{max}} = \frac{1 + d}{d} n^\frac{d}{1 + d}.
\end{equation}
In the limit $d \to \infty$, $v_{\mathrm{max}} = n$; for $d = 1$, $v_{\mathrm{max}} = 2 \sqrt{n}$. A short numeric illustration of this rule closes this section.

Minor debugging and function improvements have also been carried out in version \texttt{2.10.0}. The acceleration rate now increases progressively. Each time the inner loop starts, the counter $I_{2}$ \citep[see][for details]{Nagode_2015} is initiated and the constant
\begin{equation}
A = \left. \frac{1 - a_{\mathrm{r}}}{a_{\mathrm{r}} (D_{l} w_{l} - D_{\mathrm{min}})} \right|_{I_{2} = 1}
\end{equation}
is calculated. The acceleration rate $a_{\mathrm{r}}$ at $I_{2} = 1$ always equals the value stored in the input argument \texttt{ar}. Otherwise
\begin{equation}
a_{\mathrm{r}} = \left. \frac{1}{A (D_{l} w_{l} - D_{\mathrm{min}}) + 1} \right|_{I_{2} > 1}.
\end{equation}

The Newton-Raphson root finding in the C\texttt{++} functions was improved in version \texttt{2.9.3}; this affects only the Weibull, gamma and von Mises parametric families. A circular von Mises parametric family was added and further debugging carried out in version \texttt{2.9.2}. Version \texttt{2.9.1} is a further debugged version of \texttt{2.8.4}. The R code has been extended and rewritten in the S4 class system, and the background C code has been extended and rewritten as object-oriented C\texttt{++} code, so the package can now more easily be extended to other parametric families. Multivariate normal mixtures with unrestricted variance-covariance matrices have been added. Clustering has also been added and classification improved.
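The promised numeric illustration of the extended RootN rule follows; it uses plain base R and no package functions.

<<>>=
## Maximum number of nonempty bins by the extended RootN rule.
vmax <- function(n, d) (1 + d) / d * n^(d / (1 + d))
vmax(n = 10000, d = 1) # 2 * sqrt(10000) = 200
vmax(n = 10000, d = 2)
@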
\section{Examples}

To illustrate the use of the REBMIX algorithm, univariate and multivariate datasets are considered. The \pkg{rebmix} package is loaded and the prompt before starting a new page is set to \texttt{TRUE}.

<<>>=
##############################################
## R sources for reproducing the results in ##
## rebmix package                           ##
##############################################

options(prompt = "R> ", continue = "+ ", width = 80,
  useFancyQuotes = FALSE, digits = 3)
@

<<>>=
###################
## Preliminaries ##
###################

## Load the package and set the prompt before starting a new page to TRUE.
library(rebmix)
devAskNewPage(ask = TRUE)
@

\subsection{Gamma datasets}

Three gamma mixtures are considered \citep{Wiper_2001}, where $\theta$ stands for the scale and $\beta$ for the shape parameter, so that each component mean equals $\theta \beta$. The first mixture has four well-separated components with means $2$, $4$, $6$ and $8$, respectively
\begin{center}
\(\begin{array}{lll}
\theta_{1} = 1/100 & \beta_{1} = 200 & n_{1} = 100 \\
\theta_{2} = 1/100 & \beta_{2} = 400 & n_{2} = 100 \\
\theta_{3} = 1/100 & \beta_{3} = 600 & n_{3} = 100 \\
\theta_{4} = 1/100 & \beta_{4} = 800 & n_{4} = 100.
\end{array}\)
\end{center}
The second has equal means but different variances and weights
\begin{center}
\(\begin{array}{lll}
\theta_{1} = 1/27 & \beta_{1} = 9 & n_{1} = 40 \\
\theta_{2} = 1/270 & \beta_{2} = 90 & n_{2} = 360.
\end{array}\)
\end{center}
The third is a mixture of a rather diffuse component with mean $6$ and two lower-weighted components with smaller variances and means of $2$ and $10$, respectively
\begin{center}
\(\begin{array}{lll}
\theta_{1} = 1/20 & \beta_{1} = 40 & n_{1} = 80 \\
\theta_{2} = 1 & \beta_{2} = 6 & n_{2} = 240 \\
\theta_{3} = 1/20 & \beta_{3} = 200 & n_{3} = 80.
\end{array}\)
\end{center}

\subsubsection{Finite mixture generation}

<<>>=
####################
## Gamma datasets ##
####################

## Generate gamma datasets.
n <- c(100, 100, 100, 100)
Theta <- new("RNGMIX.Theta", c = 4, pdf = "gamma")
a.theta1(Theta) <- rep(1/100, 4)
a.theta2(Theta) <- c(200, 400, 600, 800)
gamma1 <- RNGMIX(Dataset.name = "gamma1", n = n, Theta = a.Theta(Theta))

n <- c(40, 360)
Theta <- new("RNGMIX.Theta", c = 2, pdf = "gamma")
a.theta1(Theta) <- c(1/27, 1/270)
a.theta2(Theta) <- c(9, 90)
gamma2 <- RNGMIX(Dataset.name = "gamma2", n = n, Theta = a.Theta(Theta))

n <- c(80, 240, 80)
Theta <- new("RNGMIX.Theta", c = 3, pdf = "gamma")
a.theta1(Theta) <- c(1/20, 1, 1/20)
a.theta2(Theta) <- c(40, 6, 200)
gamma3 <- RNGMIX(Dataset.name = "gamma3", rseed = -4, n = n,
  Theta = a.Theta(Theta))
@

\subsubsection{Finite mixture estimation}

<<>>=
## Estimate number of components, component weights and component parameters.
gamma1est <- REBMIX(Dataset = a.Dataset(gamma1),
  Preprocessing = "kernel density estimation",
  cmax = 8, Criterion = "BIC", pdf = "gamma")
gamma2est <- REBMIX(Dataset = a.Dataset(gamma2),
  Preprocessing = "histogram",
  cmax = 8, Criterion = "BIC", pdf = "gamma")
gamma3est <- REBMIX(Dataset = a.Dataset(gamma3),
  Preprocessing = "histogram",
  cmax = 8, Criterion = "BIC", pdf = "gamma", K = 23:27)
@

\subsubsection{Plot method}

\begin{figure}[htb]\centering
<<fig = TRUE>>=
plot(gamma3est, pos = 1, what = c("pdf", "marginal cdf"), ncol = 2,
  npts = 1000, family = "sans")
@
\caption{Gamma 3 dataset. Empirical density (circles) and predictive gamma mixture density (black solid line).}
\end{figure}

\subsubsection{Summary, \texttt{a.theta1.all} and \texttt{a.theta2.all} methods}

<<>>=
summary(gamma2est)
a.theta1.all(gamma1est, pos = 1)
a.theta2.all(gamma1est, pos = 1)
@

\subsubsection{Bootstrap methods}

<<>>=
## Bootstrap finite mixture.
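## Bootstrap = "p" presumably selects the parametric bootstrap, i.e.,
## resampling from the fitted mixture; B sets the number of bootstrap
## replicates (kept small here for speed).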
gamma3boot <- boot(x = gamma3est, pos = 1, Bootstrap = "p", B = 10)
gamma3boot
summary(gamma3boot)
@

\subsubsection{Estimation of the mixture parameters using the Best REBMIX\&EM strategy}

<<>>=
## EM.control object creation.
EM <- new("EM.Control", strategy = "best", variant = "EM",
  acceleration = "fixed", acceleration.multiplier = 1.0,
  tolerance = 1e-4, maximum.iterations = 1000, K = 0)

gamma1est.em <- REBMIX(Dataset = a.Dataset(gamma1),
  Preprocessing = "kernel density estimation",
  cmax = 8, Criterion = "BIC", pdf = "gamma", EMcontrol = EM)
gamma2est.em <- REBMIX(Dataset = a.Dataset(gamma2),
  Preprocessing = "histogram",
  cmax = 8, Criterion = "BIC", pdf = "gamma", EMcontrol = EM)
gamma3est.em <- REBMIX(Dataset = a.Dataset(gamma3),
  Preprocessing = "histogram",
  cmax = 8, Criterion = "BIC", pdf = "gamma", K = 23:27, EMcontrol = EM)

summary(gamma1est.em)
summary(gamma2est.em)
summary(gamma3est.em)
@

\subsection{Poisson dataset}

The dataset consists of $n = 600$ two-dimensional observations obtained by generating data points separately from each of three Poisson distributions. The component dataset sizes and parameters, which are those studied in \citet{Jinwen_2009}, are displayed below
\begin{center}
\(\begin{array}{ll}
\bm{\theta}_{1} = (3, 2)^{\top} & n_{1} = 200 \\
\bm{\theta}_{2} = (9, 10)^{\top} & n_{2} = 200 \\
\bm{\theta}_{3} = (15, 16)^{\top} & n_{3} = 200.
\end{array}\)
\end{center}
For this dataset, \citet{Jinwen_2009} conduct $100$ experiments by selecting different initial values of the mixing proportions. In all cases, the adaptive gradient BYY learning algorithm leads to the correct model selection, i.e., it finally allocates the correct number of Poisson components for the dataset. It also yields estimates of the parameters of the original, true Poisson mixture that generated the dataset. As the dataset of \citet{Jinwen_2009} cannot be reproduced exactly, $10$ datasets are generated with random seeds $r_{\mathrm{seed}}$ ranging from $-1$ to $-10$.

\subsubsection{Finite mixture generation}

<<>>=
#####################
## Poisson dataset ##
#####################

## Generate the Poisson dataset.
n <- c(200, 200, 200)
Theta <- new("RNGMIX.Theta", c = 3, pdf = rep("Poisson", 2))
a.theta1(Theta, 1) <- c(3, 2)
a.theta1(Theta, 2) <- c(9, 10)
a.theta1(Theta, 3) <- c(15, 16)
poisson <- RNGMIX(Dataset.name = paste("Poisson_", 1:10, sep = ""),
  n = n, Theta = a.Theta(Theta))
@

\subsubsection{Finite mixture estimation}

<<>>=
## Estimate number of components, component weights and component parameters.
poissonest <- REBMIX(Dataset = a.Dataset(poisson),
  Preprocessing = "histogram",
  cmax = 10, Criterion = "MDL5", pdf = rep("Poisson", 2), K = 1)
@

\subsubsection{Plot method}

\begin{figure}[htb]\centering
<<fig = TRUE>>=
plot(poissonest, pos = 1, what = c("pdf", "marginal pdf", "IC", "D", "logL"),
  nrow = 2, ncol = 3, npts = 1000, family = "sans")
@
\caption{Poisson dataset. Empirical densities (coloured large circles), predictive multivariate Poisson-Poisson mixture density (coloured small circles), empirical densities (circles), predictive univariate marginal Poisson mixture densities and progress charts (solid line).}
\end{figure}

\subsubsection{Clustering}

\begin{figure}[htb]\centering
<<fig = TRUE>>=
poissonclu <- RCLRMIX(x = poissonest, pos = 1, Zt = a.Zt(poisson))
plot(poissonclu, family = "sans")
@
\caption{Poisson dataset. Predictive cluster membership (coloured circles), error (black circles).}
\end{figure}

\subsubsection{Summary, \texttt{a.theta1.all} and \texttt{a.theta2.all} methods}

<<>>=
## Visualize results.
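## summary() prints the estimated mixture models; a.theta1.all() and
## a.theta2.all() return the first and second sets of component
## parameters of the model at position pos.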
summary(poissonest)
a.theta1.all(poissonest, pos = 1)
a.theta2.all(poissonest, pos = 1)
@

\subsubsection{Estimation of the mixture parameters using the Exhaustive REBMIX\&EM strategy}

<<>>=
## EM.control object creation.
EM <- new("EM.Control", strategy = "exhaustive", variant = "EM",
  acceleration = "fixed", acceleration.multiplier = 1.0,
  tolerance = 1e-4, maximum.iterations = 1000, K = 0)

poissonest.em <- REBMIX(Dataset = a.Dataset(poisson),
  Preprocessing = "histogram",
  cmax = 10, Criterion = "MDL5", pdf = rep("Poisson", 2), K = 1,
  EMcontrol = EM)

summary(poissonest.em)
@

\subsection{Multivariate normal dataset}

The \texttt{mvnorm} dataset from \citet{Panic_2020c} consists of $250$ observations drawn from a $5$-component bivariate normal mixture.

\subsubsection{Finite mixture generation}

<<>>=
##################################
## Multivariate normal dataset ##
##################################

## Generate normal dataset.
n <- c(50, 50, 50, 50, 50)
Theta <- new("RNGMVNORM.Theta", c = 5, d = 2)
a.theta1(Theta, 1) <- c(2.7, 3.7)
a.theta1(Theta, 2) <- c(5.7, 9.1)
a.theta1(Theta, 3) <- c(2.0, 9.0)
a.theta1(Theta, 4) <- c(9.5, 6.6)
a.theta1(Theta, 5) <- c(6.3, 0.6)
a.theta2(Theta, 1) <- c(0.9, -0.1, -0.1, 0.4)
a.theta2(Theta, 2) <- c(2.8, -1.3, -1.3, 1.5)
a.theta2(Theta, 3) <- c(0.1, 0.0, 0.0, 0.3)
a.theta2(Theta, 4) <- c(1.3, -0.4, -0.4, 0.4)
a.theta2(Theta, 5) <- c(0.5, 0.3, 0.3, 2.5)
mvnorm.simulated <- RNGMIX(model = "RNGMVNORM",
  Dataset.name = "mvnormdataset", rseed = -1, n = n,
  Theta = a.Theta(Theta))
@

\subsubsection{Finite mixture estimation}

<<>>=
## Estimate number of components, component weights and component parameters.
mvnormest <- REBMIX(model = "REBMVNORM",
  Dataset = a.Dataset(mvnorm.simulated),
  Preprocessing = "histogram", cmax = 20, Criterion = "BIC")
@

\subsubsection{Plot method}

\begin{figure}[htb]\centering
<<fig = TRUE>>=
plot(mvnormest, family = "sans")
@
\caption{Dataset \texttt{mvnorm}. Empirical densities (coloured circles), predictive multivariate normal mixture density (coloured lines).}
\end{figure}

\subsubsection{Clustering}

\begin{figure}[htb]\centering
<<fig = TRUE>>=
mvnormclu <- RCLRMIX(model = "RCLRMVNORM", x = mvnormest)
plot(mvnormclu, family = "sans")
@
\caption{Dataset \texttt{mvnorm}. Predictive cluster membership (coloured circles).}
\end{figure}

\subsubsection{Summary method of estimation}

<<>>=
summary(mvnormest)
@

\subsubsection{Summary method of clustering}

<<>>=
summary(mvnormclu)
@

\subsubsection{Estimation of the mixture parameters using the Single REBMIX\&EM strategy}

The strategy requires the preprocessing parameter \texttt{K} to be known. It can be estimated with the \texttt{optbins} method, for which different rules are implemented; more information can be found in \citet{Panic_2020c}.

<<>>=
## EM.control object creation.
EM <- new("EM.Control", strategy = "single", variant = "EM",
  acceleration = "fixed", acceleration.multiplier = 1.0,
  tolerance = 1e-4, maximum.iterations = 1000, K = 0)

## Optimal K estimation.
K <- optbins(Dataset = a.Dataset(mvnorm.simulated), Rule = "Knuth equal",
  kmin = 2, kmax = 100)

## Finite mixture estimation.
mvnormest.em <- REBMIX(model = "REBMVNORM",
  Dataset = a.Dataset(mvnorm.simulated),
  Preprocessing = "histogram", cmax = 20, K = K, Criterion = "BIC",
  EMcontrol = EM)

summary(mvnormest.em)
@

\subsubsection{Clustering with the exhaustive REBMIX\&ECM strategy and the ICL criterion}

<<>>=
## EM.control object creation.
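## variant = "ECM" selects the classification EM variant (Celeux and
## Govaert, 1992); together with the ICL criterion it favours
## well-separated clusters.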
CEM <- new("EM.Control", strategy = "exhaustive", variant = "ECM", acceleration = "fixed", acceleration.multiplier = 1.0, tolerance = 1e-4, maximum.iterations = 1000, K = 0) ## Estimate number of components, component weights and component parameters. mvnormest.cem <- REBMIX(model = "REBMVNORM", Dataset = a.Dataset(mvnorm.simulated), Preprocessing = "histogram", cmax = 10, Criterion = "ICL", EMcontrol = CEM) mvnorm.clu <- RCLRMIX(model = "RCLRMVNORM", x = mvnormest.cem) @ \begin{figure}[htb]\centering <>= plot(mvnorm.clu, family = "sans") @ \caption{Dataset \texttt{mvnorm}. Predictive cluster membership (coloured circles) for exhaustive REBMIX\&ECM strategy and ICL criterion.} \end{figure} \subsubsection{Acceleration of the EM algorithm} Standard EM algorithm with fixed acceleration.multiplier of $a_{\texttt{EM}} = 1.0$: <>= ## Create the EM.Control object to utilize one of the REBMIX&EM strategies. EM.normal <- new("EM.Control", strategy = "exhaustive", variant = "EM", acceleration = "fixed", acceleration.multiplier = 1.0, tolerance = 1e-4, maximum.iterations = 1000, K = 0) ## Estimate number of components, component weights and component parameters. mvnormestest.em.normal <- REBMIX(model = "REBMVNORM", Dataset = a.Dataset(mvnorm.simulated), Preprocessing = "histogram", cmax = 15, Criterion = "BIC", EMcontrol = EM.normal) cat("Total number of EM algorithm iterations: ", a.summary.EM(mvnormestest.em.normal, pos = 1, col.name = "total.iterations.nbr"), ". Value of BIC: ", a.summary(mvnormestest.em.normal, pos = 1, col.name = "IC")) @ Standard EM algorithm with fixed acceleration.multiplier of $a_{\texttt{EM}} = 1.5$: <>= ## Create the EM.Control object to utilize one of the REBMIX&EM strategies. EM.fixed1.5 <- new("EM.Control", strategy = "exhaustive", variant = "EM", acceleration = "fixed", acceleration.multiplier = 1.5, tolerance = 1e-4, maximum.iterations = 1000, K = 0) ## Estimate number of components, component weights and component parameters. mvnormest.em.fixed1.5 <- REBMIX(model = "REBMVNORM", Dataset = a.Dataset(mvnorm.simulated), Preprocessing = "histogram", cmax = 15, Criterion = "BIC", EMcontrol = EM.fixed1.5) cat("Total number of EM algorithm iterations: ", a.summary.EM(mvnormest.em.fixed1.5, pos = 1, col.name = "total.iterations.nbr"), ". Value of BIC: ", a.summary(mvnormest.em.fixed1.5, pos = 1, col.name = "IC")) @ Standard EM algorithm with line search for optimal increment $a_{\texttt{EM}}$ in each iteration: <>= ## Create the EM.Control object to utilize one of the REBMIX&EM strategies. EM.line <- new("EM.Control", strategy = "exhaustive", variant = "EM", acceleration = "line", acceleration.multiplier = 1.0, tolerance = 1e-4, maximum.iterations = 1000, K = 0) ## Estimate number of components, component weights and component parameters. mvnormest.em.line <- REBMIX(model = "REBMVNORM", Dataset = a.Dataset(mvnorm.simulated), Preprocessing = "histogram", cmax = 15, Criterion = "BIC", EMcontrol = EM.line) cat("Total number of EM algorithm iterations: ", a.summary.EM(mvnormest.em.line, pos = 1, col.name = "total.iterations.nbr"), ". Value of BIC: ", a.summary(mvnormest.em.line, pos = 1, col.name = "IC")) @ Standard EM algorithm with golden search for optimal increment $a_{\texttt{EM}}$ in each iteration: <>= ## Create the EM.Control object to utilize one of the REBMIX&EM strategies. 
EM.golden <- new("EM.Control", strategy = "exhaustive", variant = "EM", acceleration = "golden", acceleration.multiplier = 1.0, tolerance = 1e-4, maximum.iterations = 1000, K = 0) ## Estimate number of components, component weights and component parameters. mvnormest.em.golden <- REBMIX(model = "REBMVNORM", Dataset = a.Dataset(mvnorm.simulated), Preprocessing = "histogram", cmax = 15, Criterion = "BIC", EMcontrol = EM.golden) cat("Total number of EM algorithm iterations: ", a.summary.EM(mvnormest.em.golden, pos = 1, col.name = "total.iterations.nbr"), ". Value of BIC: ", a.summary(mvnormest.em.golden, pos = 1, col.name = "IC")) @ \subsection{Multivariate \texttt{sensorlessdrive} dataset} These data are a result of the sensorless drive diagnosis procedure. The features are extracted from electric current drive signals \citep{sensorless_drive_citation}. The main objective is the sensorless fault detection and classification. <>= data(sensorlessdrive) ## Split dataset into train (75%) and test (25%) subsets. set.seed(5) Drive <- split(p = 0.75, Dataset = sensorlessdrive, class = 4) @ \subsubsection{Finite mixture estimation} <>= ## Estimate number of components, component weights and component ## parameters for train subsets. driveest <- REBMIX(model = "REBMVNORM", Dataset = a.train(Drive), Preprocessing = "histogram", cmax = 15, Criterion = "BIC") @ \subsubsection{Classification} <>= ## Selected features. drivecla <- RCLSMIX(model = "RCLSMVNORM", x = list(driveest), Dataset = a.test(Drive), Zt = a.Zt(Drive)) @ \subsubsection{Show and summary methods} <>= drivecla summary(drivecla) @ \subsubsection{Plot method} \begin{figure}[htb]\centering <>= # Plot selected features. plot(drivecla, nrow = 3, ncol = 2, family = "sans") @ \caption{Dataset \texttt{iris}. Predictive class membership (coloured circles), error (black circles).} \end{figure} \subsubsection{Estimation of using the histogram based EM algorithm} When dealing with large datasets it is useful to shrink datasets to reduce the computational overload of the EM algorithm. The histogram based EM algorithm can be used for this purpose. The dataset is discretized and the EM is computed with the binned data. Integer parameter \texttt{K} from the \texttt{EM.Control} object is used to set the shrinkage level. A smaller value means less containers while a larger value implicates more containers. <>= ## EM.control object creation. EM <- new("EM.Control", strategy = "exhaustive", variant = "EM", acceleration = "fixed", acceleration.multiplier = 1.0, tolerance = 1e-4, maximum.iterations = 1000, K = 300) ## Estimate number of components, component weights and component ## parameters for train subsets. driveest <- REBMIX(model = "REBMVNORM", Dataset = a.train(Drive), Preprocessing = "histogram", cmax = 15, Criterion = "BIC", EMcontrol = EM) drivecla <- RCLSMIX(model = "RCLSMVNORM", x = list(driveest), Dataset = a.test(Drive), Zt = a.Zt(Drive)) summary(drivecla) @ \subsection{Multivariate \texttt{adult} dataset} The \texttt{adult} dataset containing $48842$ instances with $16$ continuous, binary and discrete variables was extracted from the census bureau database \citet{Asuncion_Newman_2007}. Extraction was done by Barry Becker from the 1994 census bureau database. The \texttt{adult} dataset is loaded, complete cases are extracted and levels are replaced with numbers. <>= data(adult) ## Find complete cases. adult <- adult[complete.cases(adult),] ## Replace levels with numbers. 
adult <- as.data.frame(data.matrix(adult))
@

The numbers of unique values of the variables are determined and displayed.

<<>>=
## Find numbers of levels.
cmax <- unlist(lapply(apply(adult[, c(-1, -16)], 2, unique), length))
cmax
@

The dataset is split into train and test subsets for the two incomes, and the \texttt{Type} and \texttt{Income} columns are removed.

<<>>=
## Split adult dataset into train and test subsets for two Incomes
## and remove Type and Income columns.
Adult <- split(p = list(type = 1, train = 2, test = 1),
  Dataset = adult, class = 16)
@

\subsubsection{Finite mixture estimation}

The number of components, the component weights and the component parameters are estimated for the set of chunks $y_{1j}, y_{2j}, \ldots, y_{14j}$, assuming that the variables are independent.

<<>>=
## Estimate number of components, component weights and component parameters
## for the set of chunks 1:14.
adultest <- list()
for (i in 1:14) {
  adultest[[i]] <- REBMIX(Dataset = a.train(chunk(Adult, i)),
    Preprocessing = "histogram", cmax = min(120, cmax[i]),
    Criterion = "BIC", pdf = "Dirac", K = 1)
}
@

\subsubsection{Classification}

The class membership prediction is based upon the best-first search algorithm.

<<>>=
## Class membership prediction based upon the best-first search algorithm.
adultcla <- BFSMIX(x = adultest, Dataset = a.test(Adult), Zt = a.Zt(Adult))
@

\subsubsection{Show and summary methods}

<<>>=
adultcla
summary(adultcla)
@

\subsubsection{Plot method}

\begin{figure}[htb]\centering
<<fig = TRUE>>=
## Plot selected chunks.
plot(adultcla, nrow = 5, ncol = 2, family = "sans")
@
\caption{Dataset \texttt{adult}. Predictive class membership (coloured circles), error (black circles).}
\end{figure}

<<>>=
rm(list = ls())
@

\section{Summary}\label{sec:summary}

The users of the \pkg{rebmix} package are kindly encouraged to inform the authors about bugs and wishes.

\bibliography{rebmix}

\vspace{\baselineskip}\noindent\emph{Marko Nagode\\
University of Ljubljana\\
Faculty of Mechanical Engineering\\
A\v{s}ker\v{c}eva 6\\
1000 Ljubljana\\
Slovenia}\\
\href{mailto:Marko.Nagode@fs.uni-lj.si}{Marko.Nagode@fs.uni-lj.si}.

\end{document}