%\VignetteIndexEntry{blockcluster tutorial} %\VignetteKeywords{Rcpp, C++, STK++, Co-Clustering, letent block model} %\VignettePackage{blockcluster} \documentclass[a4paper,11pt]{article} \usepackage[utf8]{inputenc} \usepackage{amsfonts,amstext,amsmath,amssymb} \usepackage{graphicx} \usepackage{rotating} \usepackage{multirow} %\usepackage[toc,page]{appendix} \usepackage{tikz} \usetikzlibrary{arrows} \usetikzlibrary{decorations.markings} %\usepackage{array} \usepackage{color} \definecolor{dkgreen}{rgb}{0,0.6,0} \definecolor{gray}{rgb}{0.5,0.5,0.5} \definecolor{mauve}{rgb}{0.58,0,0.82} % une couleur par auteur lors de la redaction \newcommand{\cb}[1]{{\color{blue}{{\bf CB:} #1}}} \newcommand{\fl}[1]{{\color{red}{{\bf FL:} #1}}} \usepackage{fullpage} \usepackage{hyperref} \usepackage{listings} \lstset{ language=R, basicstyle=\footnotesize\ttfamily, keywordstyle=\color{blue}, commentstyle=\ttfamily\color{dkgreen}, backgroundcolor=\color{white}, stringstyle=\color{orange}, showspaces=false, showstringspaces=false, showtabs=false, tabsize=2, captionpos=b, breaklines=true, breakatwhitespace=false, title=\lstname, escapeinside={}, keywordstyle={}, morekeywords={} } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Diverses commandes \newcommand{\R}{\mathbb{R}} \newcommand{\Rd}{\mathbb{R}^d} \newcommand{\Xd}{\mathbb{X}^d} % Diverses commandes \newcommand{\calA}{{\cal A}} \newcommand{\calB}{{\cal B}} \newcommand{\calE}{{\cal E}} \newcommand{\calF}{{\cal F}} \newcommand{\calM}{{\cal M}} \newcommand{\calN}{{\cal N}} \newcommand{\calP}{{\cal P}} \newcommand{\calT}{{\cal T}} \newcommand{\calU}{{\cal U}} \newcommand{\calW}{{\cal W}} \newcommand{\calX}{{\cal X}} \newcommand{\calZ}{{\cal Z}} % Lettre ou chiffe en gras \newcommand{\un}{\mathbf{1}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bd}{\mathbf{d}} \newcommand{\bg}{\mathbf{g}} \newcommand{\bh}{\mathbf{h}} \newcommand{\be}{\mathbf{e}} \newcommand{\bL}{\mathbf{L}} \newcommand{\bn}{\mathbf{n}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bq}{\mathbf{q}} \newcommand{\br}{\mathbf{r}} \newcommand{\bs}{\mathbf{s}} \newcommand{\bt}{\mathbf{t}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bW}{\mathbf{W}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bsx}{\boldsymbol{x}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bsX}{\boldsymbol{X}} \newcommand{\by}{\mathbf{y}} \newcommand{\bsy}{\boldsymbol{y}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bsY}{\boldsymbol{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bz}{\mathbf{z}} % Lettre grecque en gras % requiert \usepackage{amsbsy} \newcommand{\balpha}{\boldsymbol{\alpha}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bchi}{\boldsymbol{\chi}} \newcommand{\bdelta}{\boldsymbol{\delta}} \newcommand{\bDelta}{\boldsymbol{\Delta}} \newcommand{\bepsilon}{\boldsymbol{\epsilon}} \newcommand{\bGamma}{\boldsymbol{\Gamma}} \newcommand{\bgamma}{\boldsymbol{\gamma}} \newcommand{\blambda}{\boldsymbol{\lambda}} \newcommand{\bkappa}{\boldsymbol{\kappa}} \newcommand{\bmu}{\boldsymbol{\mu}} \newcommand{\bnu}{\boldsymbol{\nu}} \newcommand{\bpi}{\boldsymbol{\pi}} \newcommand{\bphi}{\boldsymbol{\phi}} \newcommand{\brho}{\boldsymbol{\rho}} \newcommand{\bsigma}{\boldsymbol{\sigma}} \newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\btheta}{\boldsymbol{\theta}} \newcommand{\bTheta}{\boldsymbol{\Theta}} \newcommand{\bvarepsilon}{\boldsymbol{\varepsilon}} \newcommand{\bxi}{\boldsymbol{\xi}} % Lettre verticale en gras 
% requires \usepackage{amsbsy}
\newcommand{\ve}[1]{{\boldsymbol{#1}}}
\newcommand{\x}{\boldsymbol{x}}
\newcommand{\X}{\boldsymbol{X}}
\newcommand{\z}{\boldsymbol{z}}
\newcommand{\Z}{\boldsymbol{Z}}
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}
\newcommand{\trace}{\mathop{\mathrm{Tr}}}
\newcommand{\diag}{\mathop{\mathrm{diag}}}
\newcommand{\mix}{\textsc{mixmod}}
\newcommand{\II}{1 \! \! 1}
\newcommand{\IR}{\mathbb{R}}
\newcommand{\IZ}{\mathbb{Z}}
\newcommand{\IN}{\mathbb{N}}
\newcommand{\IE}{\mathbb{E}}
\newcommand{\mat}[4]{\begin{array}{cc}#1 & #2 \\#3 & #4 \end{array}}
\newcommand{\matb}[4]{\begin{array}{cc}{\bf #1} & {\bf #2} \\{\bf #3} & {\bf #4} \end{array}}
\newcommand{\med}{\mathrm{med}}
\newcommand{\tr}{\mbox{trace}}
\newcommand{\tra}[1]{\mbox{tr}{\bf #1}}
\newcommand{\var}{\mbox{var}}
\newtheorem{exmp}{Example}
<<echo=FALSE, results=hide>>=
library(blockcluster)
bc.version <- packageDescription("blockcluster")$Version
bc.date <- packageDescription("blockcluster")$Date
@
%opening
\title{A tutorial for the {\bf blockcluster} R package\\Version 4}
\author{Parmeet Singh Bhatia\footnote{Siemens, bhatia.parmeet@gmail.com},
Serge Iovleff\footnote{INRIA-Lille, serge.iovleff@inria.fr}}
\date{\today}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\tableofcontents

\begin{abstract}
{\bf blockcluster} is an {\bf R} package for the co-clustering of binary, contingency, continuous and categorical data. The core library is written in {\bf C++}, and the {\bf blockcluster} API acts as a bridge between the {\bf C++} core library and the {\bf R} statistical computing environment. The package is based on the latent block models proposed in \cite{Govaert2003}, \cite{govaert2008binary} and \cite{govaert2010contingency} for the simultaneous clustering of rows and columns. This tutorial is based on version 4 of the package.
\end{abstract}

\section{Introduction}

Cluster analysis is an important tool in a variety of scientific areas such as pattern recognition, information retrieval, micro-array analysis, data mining, and so forth. Although many clustering procedures, such as hierarchical clustering, $k$-means or self-organizing maps, aim to construct an optimal partition of objects or, sometimes, of variables, there are other methods, called block clustering methods, which consider the two sets simultaneously and organize the data into homogeneous blocks. Let $\bx$ denote an $n\times d$ data matrix defined by $\bx = \{(x_{ij} ); i \in I \mbox{ and } j \in J\}$, where $I$ is a set of $n$ objects (rows, observations, cases, etc.) and $J$ is a set of $d$ variables (columns, attributes, etc.). The basic idea of these methods is to permute the objects and the variables in order to draw a correspondence structure on $I \times J$.

\begin{figure}[h!]
\centering
\includegraphics[width = 100mm]{figs/binarysamplecocluster.png}
\caption{Binary data set (a), data reorganized by a partition on $I$ (b), by partitions on $I$ and $J$ simultaneously (c) and summary matrix (d).}
\label{fig:samplecoclust}
\end{figure}

For illustration, consider Figure~\ref{fig:samplecoclust}, where a binary data set defined on a set of $n=10$ individuals $I = \{A,B,C,D,E,F,G,H,I,J\}$ and a set of $d=7$ binary variables $J = \{1,2,3,4,5,6,7\}$ is re-organized into a set of $3\times3$ clusters by permuting the rows and columns. Owing to the ever increasing importance of co-clustering in a variety of scientific areas, we have developed an R package for this purpose, called {\bf blockcluster}.
The R package {\bf blockcluster} allows one to estimate the parameters of the co-clustering models \cite{Govaert2003} for binary, contingency, continuous and categorical data. The package is distinctive in the generative models it implements (latent block models) and in the algorithms it uses (BEM, BCEM); moreover, special attention has been paid to designing the library so that very large data sets can be handled in reasonable time. The R package is available on CRAN at \url{https://CRAN.R-project.org/package=blockcluster}.

The aim of this tutorial is to describe the usage of the R package {\bf blockcluster} and to familiarize users with its various capabilities. The rest of the article is organized as follows. Section~\ref{sec:usingblockcluster} gives various details of the package and demonstrates its usage on a simulated binary data set. Section~\ref{sec:applications} provides two examples with real data sets.

\section{Package details}
\label{sec:usingblockcluster}

This package contains two main functions, namely {\bf cocluster} and {\bf coclusterStrategy}, to perform co-clustering and to set the various input parameters respectively. The convenience functions {\bf coclusterBinary}, {\bf coclusterCategorical}, {\bf coclusterContingency} and {\bf coclusterContinuous} are specialized versions of the {\bf cocluster} function. The package also contains two helper functions, namely {\bf summary} and {\bf plot}, to get a summary of the estimated model parameters and to plot the results respectively. We will first go through the details of the two main functions. The helper functions are self-explanatory and we will use them in various examples for better understanding.

\subsection{cocluster function}

Up to version 3, this was the main function of the {\bf blockcluster} package; it performs co-clustering for binary, categorical, contingency and continuous data. The prototype of the function is as follows:
\begin{lstlisting}
cocluster( data, datatype, semisupervised = FALSE
         , rowlabels = numeric(0), collabels = numeric(0)
         , model = NULL, nbcocluster, strategy = coclusterStrategy()
         , nbCore = 1)
\end{lstlisting}
The various inputs of the {\bf cocluster} function are as follows:
\begin{itemize}
\item {\bf data:} Input data as a matrix (or a list containing the data matrix, a numeric vector of row effects and a numeric vector of column effects, in the case of contingency data with known row and column effects).
\item {\bf datatype:} The type of data, which can be "binary", "categorical", "continuous" or "contingency".
\item {\bf semisupervised:} Boolean value specifying whether to perform semi-supervised co-clustering or not. Make sure to provide row and/or column labels if the specified value is true. The default value is false.
\item {\bf rowlabels:} Vector specifying the class of the rows. The class numbers start from zero. Provide -1 for an unknown row class.
\item {\bf collabels:} Vector specifying the class of the columns. The class numbers start from zero. Provide -1 for an unknown column class.
\item {\bf model:} The name of the model. The various models available in the package are given in Tables~\ref{tab:binarymodeltypes}, \ref{tab:categoricalmodeltypes}, \ref{tab:continuousmodeltypes} and \ref{tab:contingencymodeltypes}.
\item {\bf nbcocluster:} Integer vector specifying the number of row and column clusters respectively.
\item {\bf strategy:} This input can be used to control various input parameters. It can be created using the function {\bf coclusterStrategy}, as explained in Section~\ref{sec:coclusterStrategy}.
\item {\bf nbCore:} This input can be used to control the number of threads to use. Set it to 0 to use all available cores. The default is 1.
\end{itemize}
The only mandatory inputs to the function {\bf cocluster} are {\bf data}, {\bf datatype} and {\bf nbcocluster}. The default model for each data type is the most general model, with free row and column proportions and unequal dispersion/variance for each block. Furthermore, the default set of input parameters works well in most cases; these parameters are explained in further detail in Section~\ref{sec:coclusterStrategy}. The package also comes with OpenMP support (if supported by your operating system and R); the number of threads is set through the {\bf nbCore} argument.
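As a quick illustration, a minimal call might look as follows; it uses the simulated binary data set shipped with the package (analysed in detail later in this section) and leaves the model and the strategy at their default values:
\begin{lstlisting}
# minimal sketch: co-cluster the simulated binary data set shipped with
# the package into 2 row clusters and 3 column clusters, using the
# default model and the default strategy
data("binarydata")
out <- cocluster(binarydata, datatype = "binary", nbcocluster = c(2, 3))
summary(out)
plot(out)
\end{lstlisting}
The specialized functions described below accept essentially the same arguments (minus {\bf datatype}) and are generally more convenient to use.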
\subsubsection{The coclusterBinary function}

The \verb+coclusterBinary+ function is a specialization of the \verb+cocluster+ function for binary data. The prototype of the function is as follows:
\begin{lstlisting}
# cocluster for binary data
coclusterBinary( data, semisupervised = FALSE
               , rowlabels = numeric(0), collabels = numeric(0)
               , model = NULL, nbcocluster, strategy = coclusterStrategy()
               , a=1, b=1, nbCore=1)
\end{lstlisting}
This function has two additional parameters $a$ and $b$ corresponding to the Bayesian form of the likelihood function. The default values correspond to the case "no prior". The available binary models are given in Table~\ref{tab:binarymodeltypes}.
\begin{table}[h!]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
Model & Datatype & Proportions & Dispersion/Variance\\
\hline
{\bf pik\_rhol\_epsilonkl} & binary & unequal & unequal\\
{\bf pik\_rhol\_epsilon}   & binary & unequal & equal\\
{\bf pi\_rho\_epsilonkl}   & binary & equal   & unequal\\
{\bf pi\_rho\_epsilon}     & binary & equal   & equal\\
\hline
\end{tabular}
\caption{Binary models available in the package {\bf blockcluster}.}
\label{tab:binarymodeltypes}
\end{table}

\subsubsection{The coclusterCategorical function}

The \verb+coclusterCategorical+ function is a specialization of the \verb+cocluster+ function for categorical data. The prototype of the function is as follows:
\begin{lstlisting}
# cocluster for categorical data
coclusterCategorical( data, semisupervised = FALSE
                    , rowlabels = numeric(0), collabels = numeric(0)
                    , model = NULL, nbcocluster, strategy = coclusterStrategy()
                    , a=1, b=1, nbCore=1)
\end{lstlisting}
This function has two additional parameters $a$ and $b$ corresponding to the Bayesian form of the likelihood function. The default values correspond to the case "no prior". The available categorical models are given in Table~\ref{tab:categoricalmodeltypes}.
\begin{table}[h!]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
Model & Datatype & Proportions & Dispersion/Variance\\
\hline
{\bf pik\_rhol\_multi} & categorical & unequal & N.A \\
{\bf pi\_rho\_multi}   & categorical & equal   & N.A \\
\hline
\end{tabular}
\caption{Categorical models available in the package {\bf blockcluster}.}
\label{tab:categoricalmodeltypes}
\end{table}

\subsubsection{The coclusterContinuous function}

The \verb+coclusterContinuous+ function is a specialization of the \verb+cocluster+ function for continuous data. The prototype of the function is as follows:
\begin{lstlisting}
# cocluster for continuous data (Gaussian models)
coclusterContinuous( data, semisupervised = FALSE
                   , rowlabels = numeric(0), collabels = numeric(0)
                   , model = NULL, nbcocluster, strategy = coclusterStrategy(), nbCore=1)
\end{lstlisting}
The available continuous models are given in Table~\ref{tab:continuousmodeltypes}.
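As an illustration, a minimal sketch of a Gaussian co-clustering run on the simulated \verb+gaussiandata+ set shipped with the package (also used in the appendix) is given below; since no model is specified, the default model {\bf pik\_rhol\_sigma2kl} is used:
\begin{lstlisting}
# minimal sketch: Gaussian co-clustering with 2 row clusters and
# 3 column clusters, default model and default strategy
data("gaussiandata")
out <- coclusterContinuous(gaussiandata, nbcocluster = c(2, 3))
summary(out)
\end{lstlisting}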
\begin{table}[h!]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
Model & Datatype & Proportions & Dispersion/Variance\\
\hline
{\bf pik\_rhol\_sigma2kl} & continuous & unequal & unequal\\
{\bf pik\_rhol\_sigma2}   & continuous & unequal & equal \\
{\bf pi\_rho\_sigma2kl}   & continuous & equal   & unequal \\
{\bf pi\_rho\_sigma2}     & continuous & equal   & equal \\
\hline
\end{tabular}
\caption{Continuous models available in the package {\bf blockcluster}.}
\label{tab:continuousmodeltypes}
\end{table}

\subsubsection{The coclusterContingency function}

The \verb+coclusterContingency+ function is a specialization of the \verb+cocluster+ function for contingency data. The prototype of the function is as follows:
\begin{lstlisting}
# cocluster for contingency data (Poisson models)
coclusterContingency( data, semisupervised = FALSE
                    , rowlabels = numeric(0), collabels = numeric(0)
                    , model = NULL, nbcocluster, strategy = coclusterStrategy())
\end{lstlisting}
The available contingency models are given in Table~\ref{tab:contingencymodeltypes}.
\begin{table}[h!]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
Model & Datatype & Proportions & Dispersion/Variance\\
\hline
{\bf pik\_rhol\_unknown} & contingency & unequal & N.A\\
{\bf pi\_rho\_unknown}   & contingency & equal   & N.A \\
{\bf pik\_rhol\_known}   & contingency & unequal & N.A \\
{\bf pi\_rho\_known}     & contingency & equal   & N.A \\
\hline
\end{tabular}
\caption{Contingency models available in the package {\bf blockcluster}.}
\label{tab:contingencymodeltypes}
\end{table}

\subsection{coclusterStrategy function}
\label{sec:coclusterStrategy}

The package {\bf blockcluster} provides a function called {\bf coclusterStrategy} which can be used to set the values of the various algorithmic input parameters. The prototype of the function is as follows:
\begin{lstlisting}
coclusterStrategy( algo = "BEM", initmethod = "emInitStep"
                 , stopcriteria = "Parameter", semisupervised = FALSE
                 , nbinitmax = 100, nbiterationsxem = 50, nbiterationsXEM = 500
                 , nbinititerations = 10, initepsilon = 0.01
                 , nbiterations_int = 5, epsilon_int = 0.01
                 , epsilonxem = 1e-04, epsilonXEM = 1e-10, nbtry = 2
                 , nbxem = 5)
\end{lstlisting}
In the following example, we call the function {\bf coclusterStrategy} without any arguments and then call the overloaded function {\bf summary} to see the default values of the various input parameters.
<<>>=
defaultstrategy <- coclusterStrategy()
summary(defaultstrategy)
@
To set these input parameters, we have to pass appropriate arguments to the function {\bf coclusterStrategy}, as shown in the example below where we set the {\bf nbtry}, {\bf nbxem} and {\bf algo} parameters.
<<>>=
newstrategy <- coclusterStrategy(nbtry=5, nbxem=10, algo='BCEM')
@
The {\bf newstrategy} object can then be passed to the function {\bf cocluster} to perform co-clustering using the newly set input parameters. The various input arguments of the function {\bf coclusterStrategy} are as follows:
\begin{itemize}
\item {\bf algo:} The valid values for this parameter are "BEM" (default), "BCEM", "BSEM" and "BGibbs" (the latter only for binary models), which are respectively the Block EM, Block Classification EM and Block Stochastic EM algorithms, and Gibbs sampling.
\item {\bf stopcriteria:} It specifies the stopping criterion. It can be based on either the relative change in the parameter values (preferred) or the relative change in the log-likelihood. Valid values are "Parameter" and "Likelihood". The default is "Parameter".
\item{\bf initmethod:} Method used to initialize the model parameters. The valid values are "cemInitStep", "emInitStep" and "randomInit".
\item{\bf nbinititerations:} Number of global iterations used in the initialization step. The default value is 10.
\item{\bf initepsilon:} Tolerance value used during initialization. The default value is 1e-2.
\item{\bf nbiterations\_int:} Number of iterations for the internal E-step. The default value is 5.
\item{\bf epsilon\_int:} Tolerance value for the relative change in parameters/likelihood in the internal E-step. The default value is 1e-2.
\item{\bf nbtry:} Number of tries (XEM steps). The default value is 2.
\item{\bf nbxem:} Number of xem steps. The default value is 5.
\item{\bf nbiterationsxem:} Number of EM iterations used during the xem step. The default value is 50.
\item{\bf nbiterationsXEM:} Number of EM iterations used during the XEM step. The default value is 500.
\item{\bf epsilonxem:} Tolerance value used during the xem step. The default value is 1e-4.
\item{\bf epsilonXEM:} Tolerance value used during the XEM step. The default value is 1e-10.
\end{itemize}
To understand many of the above input parameters, we need some basic idea of the algorithms and the way they run inside the package {\bf blockcluster}, which is why Section~\ref{sec:inputarguments} is dedicated to them.

\subsubsection{Understanding various input parameters}
\label{sec:inputarguments}

You might be wondering why there are so many kinds of iterations and tolerances inside the package. To get a basic understanding of the various input parameters, it is important to know a bit about the algorithms. We will not go through the full-fledged theory of these algorithms here, but will provide enough details to make the meaning of all the input parameters clear. From now on everything is explained using BEM, but it applies in the same way to BCEM as well as to the BSEM/BGibbs algorithms. In layman's terms, the BEM algorithm can be described as follows.
\begin{enumerate}
\item Run the EM algorithm on rows.
\item Run the EM algorithm on columns.
\item Iterate between the above two steps until convergence.
\end{enumerate}
The following strategy is employed to run the various algorithms.
\begin{enumerate}
\item Run the BEM algorithm {\bf 'nbxem'} times (with a high tolerance and a small number of iterations) and keep the best model parameters (based on the likelihood) among these runs. We call this the {\bf 'xem'} step.
\item Starting from the best model parameters, run the algorithm again, but this time with a low tolerance (small epsilon) and a large number of iterations. We call this the {\bf 'XEM'} step.
\item Repeat the above two steps {\bf 'nbtry'} times and keep the best model estimate.
\end{enumerate}
With this background, the various input parameters are explained as follows.
\begin{itemize}
\item {\bf nbxem, nbtry:} As explained above, these numbers represent the number of times we run the {\bf 'xem'} step and the {\bf 'xem'+'XEM'} steps respectively. The values of {\bf 'nbxem'} and {\bf 'nbtry'} need to be tuned intuitively, and can have a substantial effect on the final results. A good way to set these values is to run the co-clustering a few times and check whether the final log-likelihood is stable. If not, one may need to increase these values. In practice, it is better to increase {\bf 'nbxem'}, as it can lead to better (more stable) results without increasing the running time too much (see the sketch after this list).
\item {\bf nbiterationsxem, nbiterationsXEM:} These are the numbers of iterations of the BEM algorithm, i.e.\ the number of times we run EM on rows and EM on columns. As the names suggest, they apply to the {\bf 'xem'} and {\bf 'XEM'} steps respectively.
\item {\bf nbiterations\_int:} This is the number of iterations for the EM algorithm on rows/columns.
\item {\bf epsilonxem, epsilonXEM:} These are the tolerance values for the BEM algorithm during the {\bf 'xem'} and {\bf 'XEM'} steps respectively.
\item {\bf epsilon\_int:} This is the tolerance value for the EM algorithm on rows/columns.
\item {\bf initepsilon, nbinititerations:} These are the tolerance value and the number of iterations used during the initialization of the model parameters.
\end{itemize}
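As an illustration of the advice above, a harder data set might be handled by increasing the number of xem runs and tries before passing the resulting strategy to one of the co-clustering functions. The values below are purely indicative and should be adapted to the data at hand:
\begin{lstlisting}
# indicative values only: more xem runs, more tries and more xem
# iterations than the defaults, for a harder problem
hardstrategy <- coclusterStrategy(nbtry = 5, nbxem = 20, nbiterationsxem = 100)
data("binarydata")
out <- coclusterBinary(binarydata, nbcocluster = c(2, 3), strategy = hardstrategy)
\end{lstlisting}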
\subsection{Model Parameters}

When the {\bf summary} function is called on the output of the {\bf cocluster} function, it gives the estimated values of the various model parameters. The parameters common to all models are the row and column mixing proportions. The model parameters for the various data types are as follows.

\subsubsection{Binary Models}

The parameters $\balpha$ of the underlying distribution of a binary data set are given by the matrix $\bp=(p_{k\ell})$, where $p_{k\ell} \in ]0,1[$ $\forall$ $k=1,\ldots,g$ and $\ell=1,\ldots,m$, and the probability distribution $f_{k\ell}(x_{ij};\bp)=f(x_{ij};p_{k\ell})$ is the Bernoulli distribution
$$f(x_{ij};p_{k\ell})=(p_{k\ell})^{x_{ij}}(1-p_{k\ell})^{1-x_{ij}}.$$
We re-parameterize the model density as follows:
$$f_{k\ell}(x_{ij};\balpha)=(\varepsilon_{k\ell})^{|x_{ij}-a_{k\ell}|} (1-\varepsilon_{k\ell})^{1-|x_{ij}-a_{k\ell}|}$$
where
$$\left\{
\begin{array}{ll}
a_{k\ell}=0, \; \varepsilon_{k\ell}=p_{k\ell}   &\mbox{ if } p_{k\ell}<0.5\\
a_{k\ell}=1, \; \varepsilon_{k\ell}=1-p_{k\ell} &\mbox{ if } p_{k\ell}>0.5.
\end{array}
\right.
$$
Hence the parameters $p_{k\ell}$ of the Bernoulli mixture model are replaced by the following parameters:
\begin{itemize}
\item the binary value $a_{k\ell}$, which acts as the center of the block $k,\ell$ and gives, for each block, the most frequent binary value;
\item the value $\varepsilon_{k\ell}$, belonging to $]0,1/2[$, which characterizes the dispersion of the block $k,\ell$ and represents, for each block, the probability of observing a value different from the center.
\end{itemize}

\subsubsection{Categorical Models}

The categorical models are a simple extension of the binary models to more than two modalities: instead of the Bernoulli distribution, we use the multinomial (categorical) distribution. The model parameters for each block $k,\ell$ are $\balpha_{k\ell}=(\balpha_{k\ell}^{h})_{h=1,\ldots,r}$ with $\sum_h{\balpha_{k\ell}^{h}}=1$, where $r$ is the number of modalities.

\subsubsection{Continuous Models}

In this case, the continuous data are modeled using the unidimensional normal distribution. Hence the density for each block is given by
$$f_{k\ell}(x_{ij};\balpha)= \frac{1}{\sqrt{2\pi\sigma_{k\ell}^2}}\exp\left\{-\frac{1}{2\sigma_{k\ell}^2}(x_{ij}-\mu_{k\ell})^2\right\}.$$
The parameters of the model are $\balpha=(\balpha_{11},\ldots,\balpha_{gm})$ where $\balpha_{k\ell}=(\mu_{k\ell},\sigma^2_{k\ell})$, i.e.\ the mean and variance of the block $k,\ell$.

\subsubsection{Contingency Models}

In this case, it is assumed that, for each block $k,\ell$, the values $x_{ij}$ are distributed according to the Poisson distribution $\mathcal{P}(\mu_i \nu_j \gamma_{k\ell})$, where the Poisson parameter is split into $\mu_i$ and $\nu_j$, the effects of the row $i$ and the column $j$ respectively, and $\gamma_{k\ell}$, the effect of the block $k,\ell$. Then we have
$$f_{k\ell}(x_{ij};\balpha)= \frac{e^{-\mu_i \nu_j \gamma_{k\ell}} (\mu_i \nu_j \gamma_{k\ell})^{x_{ij}}}{x_{ij}!}$$
where $\balpha=(\bmu,\bnu,\bgamma)$ with $\bmu=(\mu_1,\ldots,\mu_n)$, $\bnu=(\nu_1,\ldots,\nu_d)$ and $\bgamma=(\gamma_{11},\ldots,\gamma_{gm})$. The row and column effects are either provided by the user, for the models {\bf pik\_rhol\_known} and {\bf pi\_rho\_known}, or estimated by the package itself, for the models {\bf pik\_rhol\_unknown} and {\bf pi\_rho\_unknown}.
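Besides being printed by {\bf summary}, the estimated parameters are also stored in the S4 object returned by the co-clustering functions. A minimal sketch of how they might be inspected is given below; the exact slot names depend on the data type and can be listed with \verb+slotNames+:
\begin{lstlisting}
# sketch only: inspect the fitted object returned by a co-clustering call
data("binarydata")
out <- coclusterBinary(binarydata, nbcocluster = c(2, 3))
summary(out)     # prints the estimated parameters
slotNames(out)   # lists the slots of the returned S4 object
\end{lstlisting}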
\subsection{Example using simulated Binary dataset}

The various parameters used to simulate this binary data set are given in Table~\ref{tab:binarytableparam}. The class centers and dispersions are represented by $\ba$ and $\bepsilon$ respectively, whereas $\bpi$ and $\brho$ represent the row and column proportions respectively. The data consist of $1000$ rows (samples) and $100$ columns (variables) with two clusters on the rows and three clusters on the columns. The following R commands show how to load the library, process the data and visualize/summarize the results using {\bf blockcluster}.
\begin{table}[h!]
\begin{minipage}{.59\linewidth}
\flushright
\begin{tabular}{|c|c|c|c|}
\hline
\multirow{2}{1cm}{$\ba$, $\bepsilon$}&0, 0.1&0, 0.3&1, 0.1\\
\cline{2-4}
&1, 0.3&1, 0.2&0, 0.1\\
\hline
\end{tabular}
\end{minipage}
\begin{minipage}{.39\linewidth}
\flushleft
\begin{tabular}{|c|c|c|}
\hline
$\bpi$&.6&.4\\
\hline
\end{tabular}
\\
\begin{tabular}{|c|c|c|c|}
\hline
$\brho$&.3&.3&.4\\
\hline
\end{tabular}
\end{minipage}
\caption{Parameters for simulation of binary data.}
\label{tab:binarytableparam}
\end{table}
<<>>=
library(blockcluster)
data("binarydata")
out<-coclusterBinary(binarydata, nbcocluster=c(2,3))
summary(out)
@
Note that you also get the explicit Integrated Complete Likelihood (ICL) value for binary and categorical models, and an asymptotic value otherwise. This value can be used for model selection. The following {\bf R} command is used to plot the original and co-clustered data (Figure~\ref{fig:binarydata}(a)) with the default value of {\bf asp}, which is $0$ (FALSE). When {\bf asp} is FALSE, R graphics will optimize the output figure for the display, hence the original aspect ratio may not be conserved. To conserve the original aspect ratio, set {\bf asp} to $1$ or TRUE.
<<>>=
plot(out, asp = 0)
@
To plot the various block distributions (Figure~\ref{fig:binarydata}(b)), the following {\bf R} command is used with the {\bf type} argument of the overloaded {\bf plot} function set to 'distribution' ({\bf type} is 'cocluster' by default, which plots the original and co-clustered data as shown in Figure~\ref{fig:binarydata}(a)).
<<>>=
plot(out, type = 'distribution')
@
\begin{figure}[h!]
\begin{minipage}{.4\linewidth}
\includegraphics[width = 55mm,height=80mm]{figs/coclustbinary1.jpg}\\
\centering(a)
\end{minipage}
\begin{minipage}{.59\linewidth}
\includegraphics[width = 75mm,height=80mm]{figs/distributionbinary.jpeg}\\
\centering(b)
\end{minipage}
\caption{Original and co-clustered binary data (a), and distributions for each block along with various mixture densities (b).}
\label{fig:binarydata}
\end{figure}

\section{Examples with real datasets}
\label{sec:applications}

This section demonstrates the applicability of the package on real data. Two examples are used: one for image segmentation and the other for document (co-)clustering.

\subsection{Image segmentation}

Automatic image segmentation is an important technique and has numerous applications, especially in the field of medical imaging. Here we present an interesting application of co-clustering (as a pre-processing step) for segmenting objects in an image. We assume that the object pixels follow a Gaussian distribution, and hence run the {\bf blockcluster} package with the Gaussian family model {\bf pik\_rhol\_sigma2kl} on the image shown in Figure~\ref{fig:coclustersnake}. It can be clearly seen that the image is nicely segmented, with the snake and the insect in two different blocks.
\begin{figure}[h!]
\centering
\includegraphics[scale=.25]{figs/coclustersnake.jpg}
\caption{Original and co-clustered (segmented) image.}
\label{fig:coclustersnake}
\end{figure}
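A minimal sketch of such a call is given below, assuming the image has already been loaded into a numeric matrix \verb+img+ of grey-level pixel intensities (one matrix entry per pixel) with any image-reading package; the number of row and column clusters is purely indicative:
\begin{lstlisting}
# sketch only: "img" is assumed to be a numeric matrix of grey levels
# obtained beforehand from the image file
out <- coclusterContinuous(img, nbcocluster = c(2, 2),
                           model = "pik_rhol_sigma2kl")
plot(out)
\end{lstlisting}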
\subsection{Document clustering}

Document clustering is yet another data mining technique where co-clustering proves to be very useful. Here we run our package on one of the data sets used in \cite{Dhillon01}, which is publicly available at \url{ftp://ftp.cs.cornell.edu/pub/smart}. We mix Medline (1033 medical abstracts) and Cranfield (1398 aeronautical abstracts), making a total of 2431 documents. Furthermore, we use all the words (excluding stop words) as features, making a total of $9275$ unique words. The data matrix consists of words on the rows and documents on the columns, with each entry giving the term frequency, that is the number of occurrences of the corresponding word in the corresponding document. We assume that the term frequencies follow a Poisson distribution; hence we can apply the model {\bf pik\_rhol\_unknown}, available in our package for contingency (Poisson family) data sets with unknown row and column effects. Table~\ref{tab:confusionmatrix} shows the confusion matrices and compares our results with the classical bipartite spectral graph partitioning algorithm of \cite{Dhillon01}; we obtain $100$ percent correct classification. Figure~\ref{fig:coclusterdocument} depicts the $2\times2$ checkerboard pattern in the data matrix, hence confirming the more frequent occurrence of a particular set of words in one class of documents and vice versa. Please note that the data matrix images are extremely sparse (data points almost invisible) and have been processed using simple image processing tools for visualization purposes only.
\begin{table}[h!]
\begin{minipage}{.49\linewidth}
\centering
\begin{tabular}{c|c|c|}
\cline{2-3}
&Medline&Cranfield\\
\hline
\multicolumn{1}{|c|}{Medline}&1026&0\\
\hline
\multicolumn{1}{|c|}{Cranfield}&7&1400\\
\hline
\end{tabular}\\
\vspace{.1cm}
(a)
\end{minipage}
\begin{minipage}{.49\linewidth}
\centering
\begin{tabular}{c|c|c|}
\cline{2-3}
&Medline&Cranfield\\
\hline
\multicolumn{1}{|c|}{Medline}&1033&0\\
\hline
\multicolumn{1}{|c|}{Cranfield}&0&1398\\
\hline
\end{tabular}\\
\vspace{.1cm}
(b)
\end{minipage}
\caption{Confusion matrices: results reported in \cite{Dhillon01} (a), and results using {\bf blockcluster} (b). The difference in the number of Cranfield documents arises because we used the already extracted data distributed with the collection, in which two of the documents are missing.}
\label{tab:confusionmatrix}
\end{table}
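The corresponding call is sketched below; \verb+tdmatrix+ is a placeholder name for the $9275\times2431$ words-by-documents term-frequency matrix described above, which has to be built beforehand from the raw document collection. Two row clusters and two column clusters are requested, matching the $2\times2$ structure discussed above:
\begin{lstlisting}
# sketch only: "tdmatrix" stands for the words x documents
# term-frequency matrix described in the text
out <- coclusterContingency(tdmatrix, nbcocluster = c(2, 2),
                            model = "pik_rhol_unknown")
summary(out)
plot(out)
\end{lstlisting}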
\begin{figure}[h!]
\begin{minipage}{.49\linewidth}
\centering
\begin{sideways}Unique Words \end{sideways}
\begin{tikzpicture}
\draw[black, decoration={markings,mark=at position 1 with {\arrow[scale=2,black]{>}}}, postaction={decorate}, shorten >=0.4pt ] (1.0,0) -- (1.0,4);
\end{tikzpicture}
\hspace{.2cm}
\includegraphics[width = 25mm,height=60mm]{figs/orgdata.jpg}\\
\begin{tikzpicture}
\draw[black, decoration={markings,mark=at position 1 with {\arrow[scale=2,black]{>}}}, postaction={decorate}, shorten >=0.4pt ] (1.0,5) -- (3.0,5);
\end{tikzpicture}\\
Documents\\
\vspace{.1cm}
(a)
\end{minipage}
\begin{minipage}{.49\linewidth}
\centering
\begin{sideways}Unique Words \end{sideways}
\begin{tikzpicture}
\draw[black, decoration={markings,mark=at position 1 with {\arrow[scale=2,black]{>}}}, postaction={decorate}, shorten >=0.4pt ] (1.0,0) -- (1.0,4);
\end{tikzpicture}
\hspace{.2cm}
\includegraphics[width = 25mm,height=60mm]{figs/coclustdata.jpg}\\
\begin{tikzpicture}
\draw[black, decoration={markings,mark=at position 1 with {\arrow[scale=2,black]{>}}}, postaction={decorate}, shorten >=0.4pt ] (0,5) -- (2,5);
\end{tikzpicture}\\
Documents\\
\vspace{.1cm}
(b)
\end{minipage}
\caption{Original data matrix with words on the rows and documents on the columns (a), and checkerboard pattern in the words-by-documents matrix obtained after performing co-clustering (b).}
\label{fig:coclusterdocument}
\end{figure}

\section{Remarks}

This tutorial gives a brief introduction to the {\bf blockcluster} R package. It demonstrates the use of the package on a binary data set, but the package can be used in a similar fashion for the other types of data, namely {\bf contingency}, {\bf continuous} and {\bf categorical}. Please note that this tutorial is based on version 4 of the package.

\bibliographystyle{plain}
\bibliography{biblio}

\appendix
\section{Examples}

In this appendix, we present the main function calls used to launch the various models implemented in {\bf blockcluster}.

\subsection{Example with simulated binary dataset}
<<>>=
data(binarydata)
out<-coclusterBinary(binarydata,nbcocluster=c(3,2), model="pik_rhol_epsilon")
summary(out)
plot(out)
@

\subsection{Example with simulated categorical dataset}
<<>>=
data(categoricaldata)
out<-coclusterCategorical(categoricaldata,nbcocluster=c(3,2))
summary(out)
@

\subsection{Examples with simulated Poisson dataset}
<<>>=
data(contingencydataunknown)
out<-coclusterContingency( contingencydataunknown, nbcocluster=c(2,3))
summary(out)
@
Contingency model using known row/column effects:
<<>>=
data(contingencydataunknown)
mui= rep(1,nrow(contingencydataunknown))
nuj= rep(1,ncol(contingencydataunknown))
out<-coclusterContingency( list(contingencydataunknown, mui, nuj)
                         , nbcocluster=c(2,3), model="pik_rhol_known")
summary(out)
@

\subsection{Examples with simulated Gaussian dataset}
<<>>=
data(gaussiandata)
out<-coclusterContinuous(gaussiandata,nbcocluster=c(2,3))
summary(out)
@
Gaussian model using a common variance:
<<>>=
data(gaussiandata)
out<-coclusterContinuous(gaussiandata,nbcocluster=c(2,3), model="pik_rhol_sigma2")
summary(out)
@

\end{document}