% =========================================================================== % File: clusterCrit.Rnw % Created: 2010-12-17 08:48:52 % Last modification: 2017-11-09 15:28:18 % Author: Bernard Desgraupes % e-mail: bdesgraupes@users.sourceforge.net % =========================================================================== % % % Vignette meta-information % % ------------------------- %\VignetteIndexEntry{An R Package for Computing Clustering Quality Indices} %\VignetteDepends{clusterCrit} %\VignetteKeywords{clustering, quality index, internal index, external index, concordance, R, S} %\VignettePackage{clusterCrit} \documentclass[a4paper,10pt]{article} \usepackage{graphicx} \usepackage{ifthen} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsmath} \usepackage{amsopn} \usepackage{amsthm} \usepackage{amscd} \usepackage{varioref} \usepackage[T1]{fontenc} \usepackage[latin1]{inputenc} \usepackage{times} \newcommand{\transp}[1]{{}^t\!#1} \newcommand{\ttransp}[1]{\transp{#1}\,#1} \newcommand{\clust}[2][k]{#2^{\{#1\}}} \newcommand{\R}{\mathbb{R}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\Corr}{\mathrm{Corr}} \newcommand{\Tr}{\mathrm{Tr}} \newcommand{\sgn}{\mathrm{sgn}} \pagestyle{myheadings} \markright{Package clusterCrit for R} \begin{document} \title{Clustering Indices} \author{Bernard Desgraupes\\ University Paris Ouest\\ Lab Modal'X} \date{\small November 2017} \maketitle \hrule \vspace{5mm} \tableofcontents \vspace{5mm} \hrule \newpage \section{Internal clustering criteria} \label{sec-clust-int-idx} \subsection{Algebraic background and notations} \label{subsec-int-crit-notation} Let us denote by $A$ the data matrix: each row is an observation $O_{i}$ corresponding to an individual and each column represents a variable observed for all the individuals. There are $N$ observations and $p$ variables. The size of matrix $A$ is $N\times p$. The data are assumed to be partitioned in $K$ groups (or \emph{clusters}). 
Let us denote by $P$ the vector representing a partition of the data: it is an integer vector with values between 1 and $K$. The size of $P$ is equal to the number $N$ of observations. For each index $i$ ($1\leq i\leq N$), the coordinate $P_{i}$ is equal to the number $k$ ($1\leq k\leq K$) of the cluster the observation $O_{i}$ belongs to.

The cluster $C_{k}$ can be represented by a submatrix $\clust{A}$ of the matrix $A$ made of the rows of $A$ whose index $i$ is such that $P_{i}=k$. If $n_{k}$ denotes the cardinality of $C_{k}$, the matrix $\clust{A}$ has size $n_{k}\times p$ and one has the relation $\sum_{k}n_{k}=N$.

Let us denote by $I_{k}$ the set of the indices of the observations belonging to the cluster $C_{k}$:
$$
I_{k}=\{i\,|\,O_{i}\in C_{k}\}=\{i\,|\,P_{i}=k\}.
$$
The matrix $\clust{A}$ can also be denoted formally as $A_{\{I_{k}\}}$.

Let us denote by $\clust{\mu}$ the barycenter of the observations in the cluster $C_{k}$ and by $\mu$ the barycenter of all the observations. $\clust{\mu}$ and $\mu$ are row-vectors of length $p$: they are the means of the rows of the matrices $\clust{A}$ and $A$ respectively:
\begin{eqnarray}
\clust{\mu} & = & \frac{1}{n_{k}}\sum_{i\in I_{k}}x_{i} \\
\mu & = & \frac{1}{N}\sum_{i=1}^N x_{i}
\end{eqnarray}
where $x_{i}$ denotes the row of index $i$ in $A$.

\subsubsection{Total dispersion}
\label{subsubsec-dispersion-totale}

Each column vector $V_{j}$ ($1\leq j\leq p$) of the matrix $A$ can be interpreted as a sample of size $N$ of the $j$-th observed variable. Let us center each of these vectors with respect to its mean by setting $v_{j}=V_{j} - \mu_{j}$. If $X$ is the matrix formed by the centered vectors $v_{j}$, the scatter matrix $T$ is the matrix defined by
$$
T=\ttransp{X}.
$$
The general term of $T$ is:
\begin{equation}
t_{ij}=\sum_{l=1}^N (a_{li}-\mu_{i})(a_{lj}-\mu_{j})
\end{equation}
The matrix $T$ is equal to $N$ times the variance-covariance matrix of the family of column vectors $(V_{1},\dots,V_{p})$.
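The barycenter formulas above can be checked directly in base R. The following sketch uses a small made-up matrix \texttt{A} and partition vector \texttt{P} (illustration values only, not package data):

```r
# Toy data: N = 6 observations, p = 2 variables, K = 2 clusters
# (A and P are made-up illustration values, not package data)
A <- matrix(c(1, 2, 1, 4, 3, 2,
              2, 1, 3, 5, 4, 6), ncol = 2)
P <- c(1, 1, 1, 2, 2, 2)

# Barycenter mu^{k} of each cluster: column means of the submatrix A^{k}
n_k  <- as.vector(table(P))
mu_k <- rowsum(A, P) / n_k

# Global barycenter mu: column means of the whole matrix
mu <- colMeans(A)

# Sanity check: mu is the weighted mean of the cluster barycenters
all.equal(mu, as.vector(n_k %*% mu_k) / nrow(A))
```

Note that `rowsum` sums the rows of `A` by group, so dividing by the cluster sizes $n_{k}$ gives the row vectors $\clust{\mu}$ stacked in a $K\times p$ matrix.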
The general term of $T$ can thus also be written as \begin{equation} t_{ij}=N\times\Cov(V_{i},V_{j}). \label{eq-tij-cov} \end{equation} In particular, the diagonal terms are $N$ times the variances of the vectors $V_{i}$: \begin{equation} t_{ii}=N\times\Var(V_{i}). \label{eq-tii-var} \end{equation} One can also write: \begin{equation} t_{ij}=\transp{(V_{i}-\mu_{i})}(V_{j}-\mu_{j}) \end{equation} where, by a slight abuse of notation, $\mu_{i}$ and $\mu_{j}$ are here identified with the vectors $\mu_{i}\mathbf{1}$ and $\mu_{j}\mathbf{1}$ respectively. The scatter matrix is a square symmetric matrix of size $p\times p$. As it is of the form $\ttransp{X}$, the quadratic form it represents is positive semi-definite. Indeed, if one takes any vector $v$ in $\R^p$: \begin{equation} \transp{v}Tv=\transp{v}\ttransp{X}v=\ttransp{(Xv)}=||Xv||^2\geq 0 \end{equation} In particular, the eigenvalues and the determinant of the scatter matrix are also greater than or equal to 0. If $N>p$ and if the matrix $X$ has maximal rank $p$, the form is in fact positive definite. The total scattering TSS (\emph{total sum of squares}) is the trace of the matrix $T$: \begin{equation} TSS=\Tr(T)=N\sum_{j=1}^p\Var(V_{j}) \end{equation} \emph{Geometric interpretation}: let us denote by $M_{1},\dots,M_{N}$ the points of the space $\R^p$ representing all the observations: the coordinates of $M_{i}$ are the coefficients of the $i$-th row of the data matrix $A$. Similarly, let us denote by $G$ the barycenter of these points: its coordinates are the coefficients of the vector $\mu$. 
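The definition of $T$ and of TSS can be verified numerically. The sketch below uses a small made-up matrix; note that R's \texttt{cov} divides by $N-1$ whereas the $\Var$ and $\Cov$ of the text divide by $N$, so the relation $t_{ij}=N\times\Cov(V_{i},V_{j})$ translates into $T=(N-1)\,\mathtt{cov}(A)$ in R:

```r
# Small made-up data matrix: N = 5 observations, p = 2 variables
A <- matrix(c(1, 3, 2, 5, 4,
              2, 2, 4, 6, 1), ncol = 2)
N <- nrow(A)

# Center the columns, then form the scatter matrix T = t(X) %*% X
X    <- scale(A, center = TRUE, scale = FALSE)
Tmat <- t(X) %*% X

# Tr(T) is the total sum of squares (TSS)
TSS <- sum(diag(Tmat))

# R's cov() uses the 1/(N-1) convention, the text's Cov uses 1/N,
# hence Tmat equals (N-1) * cov(A)
all.equal(Tmat, (N - 1) * cov(A))

# TSS also equals the sum of the squared centered entries
all.equal(TSS, sum(X^2))
```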
One can easily prove the following relations:
\begin{eqnarray}
\Tr(T) & = & \sum_{i=1}^N ||M_{i}-G||^2 \\
       & = & \frac{1}{N}\sum_{i<j} ||M_{i}-M_{j}||^2
\end{eqnarray}

Let us denote by $I_{W}$ the set of pairs of indices of points belonging to the same cluster and by $I_{B}$ the set of pairs of indices of points belonging to different clusters. Comparing every between-cluster distance $d_{rs}$ (with $(r,s)\in I_{B}$) to every within-cluster distance $d_{uv}$ (with $(u,v)\in I_{W}$), one defines $s^{+}$ as the number of concordant comparisons and $s^{-}$ as the number of discordant ones:
\begin{eqnarray}
s^{+} & = & \mathrm{Card}\,\{(r,s)\in I_{B},\ (u,v)\in I_{W}\ |\ d_{uv} < d_{rs}\} \\
s^{-} & = & \mathrm{Card}\,\{(r,s)\in I_{B},\ (u,v)\in I_{W}\ |\ d_{uv} > d_{rs}\}
\end{eqnarray}
Their difference is:
\begin{equation}
s^{+}-s^{-}=\sum_{(r,s)\in I_{B}}\sum_{(u,v)\in I_{W}} \sgn(d_{rs} - d_{uv})
\end{equation}

As an illustration, the following code computes the Ball\_Hall index for partitions with 2 to 7 clusters on a test data set shipped with the package, then plots the data and the index values:
\begin{center}
<<fig=TRUE>>=
library(clusterCrit)
dataPath <- file.path(path.package(package="clusterCrit"),"unitTests","data","testsInternal_400_4.Rdata")
load(file=dataPath, envir=.GlobalEnv)
traj <- traj_400_4
part <- part_400_4
critName <- "Ball_Hall"
vals <- vector(mode="numeric",length=6)
for (k in 2:7) {
	idx <- intCriteria(traj,part[[k]],critName)
	vals[k-1] <- idx[[1]]
}
par(mfrow=c(1,2))
plot(traj[,1],traj[,2],col=part[[4]],xlab="",ylab="",asp=1)
plot(2:7,vals,type='o',main=critName,xlab="# of clusters",ylab="Index",asp=1)
# bestidx <- bestCriterion(vals,critName)
# points(bestidx+1,vals[bestidx],pch=8,col="green")
par(mfrow=c(1,1))
@
\end{center}

\section{External comparison indices}
\label{sec-clust-ext-idx}

The external comparison indices are designed to measure the similarity between two partitions. They take into account only the distribution of the points among the different clusters and do not measure the quality of this distribution.

\subsection{Notation}
\label{subsec-ext-crit-notation}

All the suggested indices rely on a confusion matrix counting the pairs of points depending on whether they are considered as belonging to the same cluster or not, according to partition $P_{1}$ or to partition $P_{2}$. There are thus four possibilities:
\begin{itemize}
\item the two points belong to the same cluster, according to both $P_{1}$ and $P_{2}$
\item the two points belong to the same cluster according to $P_{1}$ but not to $P_{2}$
\item the two points belong to the same cluster according to $P_{2}$ but not to $P_{1}$
\item the two points do not belong to the same cluster, according to both $P_{1}$ and $P_{2}$.
\end{itemize}
Let us denote by $yy$, $yn$, $ny$, $nn$ ($y$ means \emph{yes}, and $n$ means \emph{no}) the numbers of pairs of points falling in these four categories respectively. $N_{T}$ being the total number of pairs of points, one has:
\begin{equation}
N_{T}=\frac{N(N-1)}{2}=yy+yn+ny+nn.
\end{equation}

\subsection{Precision and recall coefficients}
\label{subsec-coeff-prec-rapp}

If partition $P_{1}$ is used as a reference, one defines the \emph{precision coefficient} as the proportion of pairs of points correctly grouped together in $P_{2}$, that is to say pairs which are also grouped together according to the reference partition $P_{1}$. Among the $yy+ny$ pairs grouped together according to $P_{2}$, $yy$ are correctly grouped. One thus has:
\begin{equation}
\mathcal{P}=\frac{yy}{yy+ny}.
\label{eq-precision-coeff}
\end{equation}
Similarly, one defines the \emph{recall coefficient} as the proportion of pairs of points grouped together in $P_{1}$ which are also grouped together in partition $P_{2}$. This is the proportion of pairs which are supposed to be grouped together according to the reference partition $P_{1}$ and which are effectively marked as such by partition $P_{2}$. Among the $yy+yn$ pairs grouped together in $P_{1}$, $yy$ are also grouped together in $P_{2}$. One thus has:
\begin{equation}
\mathcal{R}=\frac{yy}{yy+yn}
\label{eq-recall-coeff}
\end{equation}
In terms of conditional probabilities, one can write
\begin{equation}
\mathcal{P}=P(gp_{1}|gp_{2})\quad\mbox{and}\quad\mathcal{R}=P(gp_{2}|gp_{1})
\end{equation}
where the events $gp_{1}$ and $gp_{2}$ mean that two points are grouped together in $P_{1}$ and in $P_{2}$ respectively.
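The four counts and the precision and recall coefficients can be computed in a few lines of base R. The helper below is a naive $O(N^2)$ sketch for illustration only (the package performs these computations in its Fortran core), and the two partitions are made-up examples:

```r
# Count the four categories of point pairs for two partitions p1, p2
pairCounts <- function(p1, p2) {
  same1 <- outer(p1, p1, "==")   # grouped together according to p1
  same2 <- outer(p2, p2, "==")   # grouped together according to p2
  up <- upper.tri(same1)         # each unordered pair counted once
  c(yy = sum(same1[up] & same2[up]),
    yn = sum(same1[up] & !same2[up]),
    ny = sum(!same1[up] & same2[up]),
    nn = sum(!same1[up] & !same2[up]))
}

# Made-up partitions of 4 points
p1 <- c(1, 1, 2, 2)
p2 <- c(1, 1, 1, 2)
cnt <- pairCounts(p1, p2)

# Precision and recall as defined in the text
prec <- cnt[["yy"]] / (cnt[["yy"]] + cnt[["ny"]])
rec  <- cnt[["yy"]] / (cnt[["yy"]] + cnt[["yn"]])
```

For these partitions the counts are $yy=1$, $yn=1$, $ny=2$, $nn=2$, whose sum is $N(N-1)/2=6$ as expected.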
The $\mathcal{F}$-measure is the harmonic mean of the precision and recall coefficients:
\begin{equation}
\mathcal{F}=\frac{2}{\dfrac{1}{\mathcal{P}}+\dfrac{1}{\mathcal{R}}}
=\frac{2\mathcal{P}\times\mathcal{R}}{\mathcal{P}+\mathcal{R}}=\frac{2yy}{2yy+yn+ny}
\end{equation}
There is also a weighted version of this measure, called the $\mathcal{F}_{\alpha}$-measure, defined as follows:
\begin{equation}
\mathcal{F}_{\alpha}=\frac{(1+\alpha)\mathcal{P}\times\mathcal{R}}{\alpha\mathcal{P}+\mathcal{R}}\mbox{ with }\alpha>0
\end{equation}

\subsection{Indicator variables}
\label{subsec-var-indic}

Let us associate to each partition $P_{a}$ ($a=1,2$) the binary random variable $X_{a}$ defined on the set of pairs of indices $(i,j)$ such that $i<j$: the variable $X_{a}(i,j)$ is equal to 1 if the observations $O_{i}$ and $O_{j}$ belong to the same cluster according to the partition $P_{a}$, and to 0 otherwise.

\section{Usage of the \emph{clusterCrit} package}
\label{sec-clusterCrit-usage}

The \emph{clusterCrit} package for R provides an implementation of all the indices described in the preceding sections. The core of the package is written in Fortran and is optimized in order to avoid duplicate calculations.
It can be installed from the R console with the following instruction:
\begin{verbatim}
install.packages("clusterCrit")
\end{verbatim}
Once it is installed, it can be loaded in an R session with the following instruction:
\begin{verbatim}
library(clusterCrit)
\end{verbatim}

\subsection{Available commands}
\label{subsec-pkg-commands}

The \emph{clusterCrit} package defines several functions which let you compute internal quality indices or external comparison indices. The partitions are specified as an integer vector giving the index of the cluster each observation belongs to. The possible values are integers between 1 and $K$, where $K$ is the number of clusters.
\par\medskip
The \emph{intCriteria} function calculates one or several internal quality indices. Its syntax is:
\begin{verbatim}
intCriteria(traj, part, crit)
\end{verbatim}
The \emph{traj} argument is the matrix of observations (also known as trajectories). The \emph{part} argument is the partition vector. The \emph{crit} argument is a list containing the names of the indices to compute. One can use the keyword \verb|"all"| in order to compute all the available indices. Use the \emph{getCriteriaNames} function to list the names of the currently available indices. All the names are case insensitive and can be abbreviated as long as the abbreviation remains unambiguous.
\par\medskip
The \emph{extCriteria} function calculates one or several external indices (including the precision and recall coefficients). Its syntax is:
\begin{verbatim}
extCriteria(part1, part2, crit)
\end{verbatim}
The \emph{part1} and \emph{part2} arguments are the partition vectors. The meaning of the \emph{crit} argument is the same as for the \emph{intCriteria} function.
\par\medskip
Given a vector of several clustering quality index values computed with a given criterion, the \emph{bestCriterion} function returns the position of the value which must be considered as the best in the sense of the specified criterion.
Its syntax is:
\begin{verbatim}
bestCriterion(x, crit)
\end{verbatim}
The \emph{x} argument is a numeric vector of quality index values. The \emph{crit} argument is the name of the criterion: it is case insensitive and can be abbreviated. Typically, a data set is clustered several times (using different algorithms or specifying a different number of clusters) and a clustering index is calculated each time: the \emph{bestCriterion} function tells which value is considered the best. For instance, if one uses the Calinski\_Harabasz index, the best value is the largest one.
\par\medskip
The \emph{concordance} function calculates the concordance matrix between two partitions of the same data. Its syntax is:
\begin{verbatim}
concordance(part1, part2)
\end{verbatim}
The arguments are the partition vectors. The function returns a $2\times 2$ matrix of the form:
$$
\begin{pmatrix}
yy & yn \\
ny & nn
\end{pmatrix}
$$
The entries are the numbers of pairs of points classified as belonging or not belonging to the same cluster with respect to both partitions. Since there are $N(N-1)/2$ pairs of distinct points, one has:
$$
yy + yn + ny + nn=N(N-1)/2
$$
\par\medskip
The \emph{getCriteriaNames} function is a convenience function which returns the names of the currently implemented indices. Its syntax is:
\begin{verbatim}
getCriteriaNames(isInternal)
\end{verbatim}
where the argument \emph{isInternal} is a logical value: if TRUE the function returns the names of the internal indices, otherwise it returns the names of the external ones.
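To illustrate what \emph{bestCriterion} does, here is a hypothetical base-R sketch: it looks up whether a criterion should be maximized or minimized and returns the position of the best value. The direction table below covers only a few well-known indices and is an assumption made for this example, not the package's internal data:

```r
# Hypothetical sketch of the bestCriterion logic (illustration only).
# The direction table is an assumption for this example, not the
# package's actual internal table.
bestIndexSketch <- function(x, crit) {
  maximize <- c(calinski_harabasz = TRUE, dunn = TRUE, silhouette = TRUE,
                davies_bouldin = FALSE, c_index = FALSE)
  if (maximize[[tolower(crit)]]) which.max(x) else which.min(x)
}

vals <- c(120.5, 310.2, 290.7, 150.1)
bestIndexSketch(vals, "Calinski_Harabasz")   # largest value is best -> 2
bestIndexSketch(vals, "Davies_Bouldin")      # smallest value is best -> 1
```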
\subsection{Examples of use}
\label{subsec-pkg-examples}

First load the package:
<<>>=
library(clusterCrit)
@
Let us create some artificial data:
<<>>=
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.5), ncol = 2))
@
Now perform the \emph{kmeans} algorithm in order to get a partition with 3 clusters (the \emph{kmeans} function is provided by R in the \emph{stats} package and is available by default):
<<>>=
cl <- kmeans(x, 3)
@
Let us get the names of the internal indices:
<<>>=
getCriteriaNames(TRUE)
@
Let us compute all the internal indices and display one of them:
<<>>=
intIdx <- intCriteria(x,cl$cluster,"all")
length(intIdx)
intIdx[["trace_w"]]
@
It is possible to compute only a few indices:
<<>>=
intCriteria(x,cl$cluster,c("C_index","Calinski_Harabasz","Dunn"))
@
The names are case insensitive and can be abbreviated:
<<>>=
intCriteria(x,cl$cluster,c("det","cal","dav"))
@
Here is now an example with the external criteria. Let us generate two artificial partitions:
<<>>=
part1 <- sample(1:3, 150, replace=TRUE)
part2 <- sample(1:5, 150, replace=TRUE)
@
Let us get the names of the external indices:
<<>>=
getCriteriaNames(FALSE)
@
Let us compute all the external indices and retrieve one of them:
<<>>=
extIdx <- extCriteria(part1,part2,"all")
length(extIdx)
extIdx[["jaccard"]]
@
Let us compute only some of them:
<<>>=
extCriteria(part1,part2,c("Rand","Folkes"))
@
The names are case insensitive and can be abbreviated:
<<>>=
extCriteria(part1,part2,c("ra","fo"))
@

\subsection{Benchmark}
\label{subsec-benchmark}

The \emph{clusterCrit} package is written in Fortran, which makes the calculations quite fast. Nevertheless some indices are more demanding and require more computations than others. The following timings have been evaluated using the \emph{rbenchmark} package: the various indices have been computed separately 100 times on a set of 400 points partitioned into four groups.
The results are not interesting \emph{per se}; they rather serve to compare the amount of computation required by the different indices. The following table summarizes the timings for the internal indices (they are expressed in seconds for 100 replications, so each value must be divided by 100 to get the time of a single computation):
\begin{center}
\begin{tabular}{|c|c|}
\hline
all & 3.095 \\ \hline
Ball\_Hall & 0.944 \\ \hline
Banfeld\_Raftery & 0.946 \\ \hline
C\_index & 2.898 \\ \hline
Calinski\_Harabasz & 0.930 \\ \hline
Davies\_Bouldin & 0.926 \\ \hline
Det\_Ratio & 0.930 \\ \hline
Dunn & 0.969 \\ \hline
Gamma & 2.188 \\ \hline
G\_plus & 2.170 \\ \hline
GDI11 & 0.985 \\ \hline
GDI12 & 0.971 \\ \hline
GDI13 & 0.957 \\ \hline
GDI21 & 0.966 \\ \hline
GDI22 & 0.961 \\ \hline
GDI23 & 0.953 \\ \hline
GDI31 & 0.959 \\ \hline
GDI32 & 0.957 \\ \hline
GDI33 & 0.948 \\ \hline
GDI41 & 0.936 \\ \hline
GDI42 & 0.933 \\ \hline
GDI43 & 0.923 \\ \hline
GDI51 & 0.934 \\ \hline
GDI52 & 0.934 \\ \hline
GDI53 & 0.921 \\ \hline
Ksq\_DetW & 0.930 \\ \hline
Log\_Det\_Ratio & 0.930 \\ \hline
Log\_SS\_Ratio & 0.923 \\ \hline
McClain\_Rao & 0.958 \\ \hline
PBM & 0.928 \\ \hline
Point\_Biserial & 0.959 \\ \hline
Ray\_Turi & 0.923 \\ \hline
Ratkowsky\_Lance & 0.923 \\ \hline
Scott\_Symons & 0.965 \\ \hline
SD\_Scat & 0.930 \\ \hline
SD\_Dis & 0.923 \\ \hline
S\_Dbw & 0.924 \\ \hline
Silhouette & 0.992 \\ \hline
Tau & 2.174 \\ \hline
Trace\_W & 0.945 \\ \hline
Trace\_WiB & 0.952 \\ \hline
Wemmert\_Gancarski & 0.960 \\ \hline
Xie\_Beni & 0.978 \\ \hline
\end{tabular}
\end{center}
We observe that the C index is the most time consuming. The Gamma, G\_plus and Tau indices also need intensive calculations because the concordance and discordance counts concern a huge quantity of pairs of points. All the other indices require more or less the same amount of time.
Using the keyword \verb|"all"| in the \emph{intCriteria} function is quite efficient because the code is optimized to avoid duplicate calculations and to reuse values already computed for other indices. The timing result for calculating all the indices simultaneously 100 times is 3.095 seconds.

In contrast, benchmarking the external indices does not exhibit any noticeable difference: they all take more or less the same time and are very fast. Here are the results for 100 replications of the \emph{extCriteria} function applied to two partitions containing 150 items:
\begin{center}
\begin{tabular}{|c|c|}
\hline
all & 0.010 \\ \hline
Czekanowski\_Dice & 0.010 \\ \hline
Folkes\_Mallows & 0.010 \\ \hline
Hubert & 0.011 \\ \hline
Jaccard & 0.010 \\ \hline
Kulczynski & 0.011 \\ \hline
McNemar & 0.010 \\ \hline
Phi & 0.010 \\ \hline
Precision & 0.010 \\ \hline
Rand & 0.010 \\ \hline
Recall & 0.011 \\ \hline
Rogers\_Tanimoto & 0.010 \\ \hline
Russel\_Rao & 0.011 \\ \hline
Sokal\_Sneath1 & 0.010 \\ \hline
Sokal\_Sneath2 & 0.009 \\ \hline
\end{tabular}
\end{center}

\newpage

\addcontentsline{toc}{section}{Bibliography}
\nocite{*}
\bibliographystyle{plain}
\bibliography{clusterCrit}

\end{document}