\documentclass{article}
\usepackage{hyperref}
\usepackage{graphicx}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\begin{document}
\title{Inferring mutual information networks using the \Rpackage{minet} package}
\author{Patrick E. Meyer, Fr\'ed\'eric Lafitte, Gianluca Bontempi}
%\VignetteIndexEntry{Inferring mutual information networks using the minet package}
\maketitle
%____________________________________________________________________ INTRODUCTION
\section{Introduction}
The \Rpackage{minet} package provides a set of functions to infer mutual information networks from a dataset \cite{papier2}. If fed with microarray data, the package returns a network where nodes denote genes and edges model statistical dependencies between genes. The weight of an edge provides evidence about the existence of a specific (e.g.\ transcriptional) gene-to-gene interaction. The inference proceeds in two steps. First, the Mutual Information Matrix ($MIM$) is computed, a square matrix whose $MIM_{ij}$ term is the mutual information between genes $X_i$ and $X_j$. Second, an inference algorithm takes the $MIM$ matrix as input and attributes a score to each edge connecting a pair of nodes. The package implements several entropy estimators as well as several inference methods, namely \Robject{aracne}, \Robject{clr} and \Robject{mrnet} \cite{aracne,clr,papier}. The package also integrates accuracy assessment tools, such as PR-curves and ROC-curves, to compare the inferred network with a reference one.
This vignette guides the package user in:
\begin{enumerate}
\item Estimating the mutual information matrix.
\item Inferring a network modeling the interactions between the dataset's variables.
\item Comparing the inferred network to a network of known interactions in order to compute $F_\beta$-scores.
\item Plotting precision-recall and receiver operating characteristic curves.
\item Plotting the inferred network using the \Rpackage{Rgraphviz} package.
\end{enumerate}
The data used in the following examples was generated using the {\it SynTReN} simulator \cite{syntren}. This data generator uses a known network of interacting genes in order to generate gene expression levels for all the genes included in the network. Once the network is inferred from the generated data, it can be compared to the true underlying network in order to validate the inference algorithm.
%____________________________________________________ MUTUAL INFORMATION ESTIMATION
\section{Mutual Information Estimation}
Mutual information networks are a subcategory of network inference methods. These methods set a link between two nodes if the pair exhibits a high score based on the mutual information between the corresponding variables. Mutual information networks rely on the computation of the mutual information matrix (MIM), a square matrix whose element
$$ MIM_{ij} = I(X_i ; X_j) = \sum_{x_i \in \mathcal{X}_i} \sum_{x_j \in \mathcal{X}_j} p(x_i , x_j) \log \frac{p(x_i , x_j)}{p(x_i)p(x_j)} $$
is the mutual information between $X_i$ and $X_j$, where $X_i \in \mathcal{X}$, $i = 1, \ldots, n$, is a discrete random variable denoting the expression level of the $i$th gene.
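As an illustration of this definition, the mutual information between two discretized variables can be estimated directly from their empirical joint distribution. The following sketch uses only base R and is not part of the package, whose estimators (presented below) should be preferred in practice:
<<eval=FALSE>>=
# Illustrative plug-in estimate (in nats) of I(X;Y) for two
# discrete vectors x and y.
mutinfo <- function(x, y) {
  pxy <- table(x, y) / length(x)   # empirical joint distribution
  px  <- rowSums(pxy)              # marginal distribution of x
  py  <- colSums(pxy)              # marginal distribution of y
  nz  <- pxy > 0                   # skip zero cells to avoid log(0)
  sum(pxy[nz] * log(pxy[nz] / outer(px, py)[nz]))
}
@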
\subsection{Obtaining The Mutual Information Matrix}
<<>>=
library(minet)
data(syn.data)
# Mutual information estimation
estimator <- "spearman"
mim <- build.mim(syn.data,estimator)
mim[1:5,1:5]
@
In the above code, the mutual information matrix is built using the function \Rfunction{build.mim}. This function takes as input the dataset and the name of one of the mutual information estimators implemented in the package. The entropy-based estimators require discretized data values; the \Rfunction{discretize} function allows the user to choose between two binning algorithms.
%____________________________________________________________ NETWORK INFERENCE
\section{Network Inference}
Three network inference methods are available in the package: \Rfunction{aracne}, \Rfunction{clr} and \Rfunction{mrnet}. These receive the mutual information matrix as input and return the weighted adjacency matrix of the network. The network can also be inferred directly from the dataset by using the \Rfunction{minet} function. This function takes as input the dataset, the name of the estimator, the name of the discretization method and the number of bins to be used.
\subsection{Obtaining The Network}
In the following code, the \Rfunction{mrnet} algorithm is applied to the mutual information matrix estimated in the previous section:
<<>>=
# Network inference
net <- mrnet(mim)
net[1:5,1:5]
@
The returned value is the weighted adjacency matrix of the network.
\subsection{MRNET}
The MRNET approach \cite{papier} consists in repeating an MRMR feature selection procedure for each variable of the dataset. The MRMR method \cite{peng} starts by selecting the variable $X_i$ having the highest mutual information with the target $Y$. In the following steps, given a set $\mathcal{S}$ of selected variables, the criterion updates $\mathcal{S}$ by choosing the variable $X_k$ that maximizes
$$ I(X_k;Y) - \frac{1}{|\mathcal{S}|}\sum_{X_i \in \mathcal{S}} I(X_k;X_i). $$
The weight of each pair $X_i, X_j$ is the maximum of the two scores computed when $X_i$ is the target and when $X_j$ is the target.
\subsection{CLR}
The CLR algorithm \cite{clr} considers the MIM as the weighted adjacency matrix of the network but, instead of using the information $I(X_i;X_j)$ as the weight of the link between features $X_i$ and $X_j$, it takes into account the score $\sqrt{z_i^2+z_j^2}$, where
$$ z_i = \max \bigg\lbrace 0, \frac{I(X_i;X_j)-\mu_i}{\sigma_i} \bigg\rbrace $$
and $\mu_i$ and $\sigma_i$ are, respectively, the mean and the standard deviation of the empirical distribution of the mutual information values $I(X_i;X_k)$, $k=1,\ldots,n$.
\subsection{ARACNE}
The ARACNE algorithm \cite{aracne} is based on the Data Processing Inequality. This inequality states that, if gene $X_{1}$ interacts with gene $X_{3}$ through gene $X_{2}$, then
$$I(X_{1};X_{3})\leq\min\left(I(X_{1};X_{2}),I(X_{2};X_{3})\right).$$
The ARACNE procedure starts by assigning to each pair of nodes a weight equal to their mutual information. Then, the weakest edge of each triplet is interpreted as an indirect interaction and is removed if the difference between the two lowest weights is above a threshold $W_0$. The function \Rfunction{aracne} has an extra argument \Robject{eps} which is the numerical value of $W_0$.
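For comparison, the two other inference algorithms can be applied to the same mutual information matrix. In the following sketch, the value passed to \Robject{eps} is an arbitrary illustrative choice:
<<eval=FALSE>>=
# CLR and ARACNE applied to the matrix computed above;
# eps=0.05 is only an illustrative value for the threshold W0.
net.clr    <- clr(mim)
net.aracne <- aracne(mim, eps=0.05)
@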
\subsection{The \Rfunction{minet} function}
The \Rfunction{minet} function infers the mutual information network directly from the input dataset. Besides the dataset, this function's arguments are the mutual information estimator, the inference method, the binning algorithm and the number of bins to be used. All the instructions used until now can thus be summarized with the following call to \Rfunction{minet}:
<<>>=
library(minet)
data(syn.data)
net <- minet(syn.data, method="mrnet")
net[1:5,1:5]
@
Note that in this case the returned object is the \emph{normalized} weighted adjacency matrix of the network (i.e.\ the values range from $0$ to $1$).
%____________________________________________________________________ VALIDATION
\section{Validation}
\subsection{Obtaining Confusion Matrices}
The networks inferred by this package are weighted, but low-weight edges can be removed by using a threshold value. By setting to $0$ all edges whose weight is lower than the threshold and to $1$ all the other weights, the network inference problem can be seen as a binary decision problem. The decision made by the algorithm can be summarized by a confusion matrix (see table \ref{tab:confusion}).
\begin{table}
\begin{center}\begin{tabular}{c|cc}
{\scriptsize\bf EDGE} & Inferred & Not inferred \tabularnewline\hline
Exists & TP & FN \tabularnewline
Does not exist & FP & TN \tabularnewline
\end{tabular}\end{center}
\caption{Confusion matrix}
\label{tab:confusion}
\end{table}
In our case, the threshold value can be seen as the minimal edge weight required for an edge to be inferred: edges whose weight is strictly below the threshold are removed from the network. A different confusion matrix is therefore obtained for each threshold. The table returned by the \Rfunction{validate} function contains all the confusion matrices obtained with \Robject{steps} thresholds ranging from the lowest to the highest edge weight.
<<>>=
library(minet)
data(syn.data)
data(syn.net)
net <- minet(syn.data)
# Inferred network validation
table <- validate(net, syn.net, steps=20)
table[1:10,]
@
In the above code, the \Rfunction{validate} function compares the inferred network \Robject{net} to \Robject{syn.net}, the network underlying \Robject{syn.data}. Note that the true underlying network has to be a matrix containing values $1$ (presence of the edge) or $0$ (absence of the edge). Each line of the returned table contains the threshold used and the confusion matrix obtained by comparing \Robject{syn.net} to the inferred network. Note that the \Rfunction{validate} function distinguishes the following cases:
\begin{itemize}
\item Both networks are oriented
\item Both networks are unoriented
\item One of the networks is oriented and the other unoriented
\end{itemize}
In the third case, the oriented network is considered unoriented.
\subsection{Using the Confusion Matrices}
The confusion matrix summarizes the decisions made by the algorithm. Thus, in order to compare inference algorithms, we compare their confusion matrices; more precisely, we compare several criteria derived from these matrices \cite{prroc}:
\begin{itemize}
\item Precision: $ p = \frac{TP}{TP+FP} $
\item Recall: $ r = \frac{TP}{TP+FN} $
\item True Positive Rate: $ tpr = \frac{TP}{TP+FN} $
\item False Positive Rate: $ fpr = \frac{FP}{FP+TN} $
\item $F_\beta$-score: $ F_\beta = (1+\beta^2)\frac{pr}{\beta^2 p + r} $
\end{itemize}
These scores are returned by the functions \Rfunction{rates}, \Rfunction{pr} and \Rfunction{fscores}.
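For example, the $F_\beta$-scores can be used to select the threshold achieving the best trade-off between precision and recall. The following sketch assumes, as described above, that the first column of the validation table holds the threshold:
<<eval=FALSE>>=
# F1-scores (beta=1) for every threshold in the validation table.
f <- fscores(table, beta=1)
# Threshold achieving the highest F1-score; the first column of
# the table is assumed to hold the threshold values.
table[which.max(f), 1]
@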
The functions \Rfunction{show.pr} and \Rfunction{show.roc} can be used to visualize precision-recall curves and receiver operating characteristic curves, respectively. The \Rfunction{show.pr} function uses the precisions and recalls computed by the function \Rfunction{pr}, and \Rfunction{show.roc} relies on the rates returned by the \Rfunction{rates} function. All these functions take as input the data.frame returned by the \Rfunction{validate} function:
<<>>=
library(minet)
data(syn.data)
data(syn.net)
net1 <- minet(syn.data,method="mrnet")
net2 <- minet(syn.data,method="clr")
# Inferred network validation
table1 <- validate(net1, syn.net, steps=50)
table2 <- validate(net2, syn.net, steps=50)
@
Once the data.frames \Robject{table1} and \Robject{table2} are computed, we can use the following functions:
\begin{itemize}
\item \Rfunction{pr(table)} to obtain precisions and recalls.
\item \Rfunction{rates(table)} to obtain true positive rates and false positive rates.
\item \Rfunction{fscores(table,beta)} to obtain $F_{\beta}$-scores.
\end{itemize}
Both functions \Rfunction{show.pr} and \Rfunction{show.roc} return the device associated with the plotting window used. This allows the user to plot several curves on the same figure. The following code generates the curves:
<<>>=
# Precision-recall curves
dev <- show.pr( table1, pch=2, type="b", col="green" )
show.pr( table2, device=dev, pch=1, type="b", col="blue" )
# ROC curves
dev <- show.roc( table1, type="b", col="green" )
show.roc( table2, device=dev, type="b", col="blue" )
@
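Finally, as announced in the introduction, the inferred network can be plotted using the \Rpackage{Rgraphviz} package. The following sketch is one possible way of doing so; it assumes that \Rpackage{Rgraphviz} and the \Rpackage{graph} package are installed, and the threshold $0.1$ is an arbitrary illustrative value:
<<eval=FALSE>>=
# Keep only the edges whose normalized weight exceeds an
# arbitrary threshold, then coerce the resulting binary
# adjacency matrix to a graph object and plot it.
library(Rgraphviz)
adj <- (net1 > 0.1) + 0
am <- new("graphAM", adjMat=adj, edgemode="undirected")
plot(as(am, "graphNEL"))
@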
\bibliographystyle{plain}
\begin{thebibliography}{10}

\bibitem{aracne}
Katia Basso, Adam Margolin, Gustavo Stolovitzky, Ulf Klein, Riccardo Dalla-Favera, and Andrea Califano.
\newblock Reverse engineering of regulatory networks in human B cells.
\newblock {\em Nature Genetics}, 37, 2005.

\bibitem{relnet}
A.~J. Butte and I.~S. Kohane.
\newblock Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements.
\newblock {\em Pacific Symposium on Biocomputing}, 5:415--426, 2000.

\bibitem{prroc}
J.~Davis and M.~Goadrich.
\newblock The relationship between precision-recall and ROC curves.
\newblock In {\em Proceedings of the 23rd International Conference on Machine Learning}, 2006.

\bibitem{syntren}
T.~Van den Bulcke, K.~Van Leemput, B.~Naudts, P.~van Remortel, H.~Ma, A.~Verschoren, B.~De Moor, and K.~Marchal.
\newblock SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms.
\newblock {\em BMC Bioinformatics}, 7(1):43, 2006.

\bibitem{discretize}
J.~Dougherty, R.~Kohavi, and M.~Sahami.
\newblock Supervised and unsupervised discretization of continuous features.
\newblock In {\em International Conference on Machine Learning}, pages 194--202, 1995.

\bibitem{clr}
J.~J. Faith, B.~Hayete, J.~T. Thaden, I.~Mogno, J.~Wierzbowski, G.~Cottarel, S.~Kasif, J.~J. Collins, and T.~S. Gardner.
\newblock Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles.
\newblock {\em PLoS Biology}, 5, 2007.

\bibitem{gardner}
T.~S. Gardner and J.~Faith.
\newblock Reverse-engineering transcription control networks.
\newblock {\em Physics of Life Reviews}, 2, 2005.

\bibitem{hausser}
J.~Hausser.
\newblock Improving entropy estimation and inferring genetic regulatory networks.
\newblock Master's thesis, National Institute of Applied Sciences, Lyon, 2006.

\bibitem{papier}
P.~E. Meyer, K.~Kontos, F.~Lafitte, and G.~Bontempi.
\newblock Information-theoretic inference of large transcriptional regulatory networks.
\newblock {\em EURASIP Journal on Bioinformatics and Systems Biology}, 2007.

\bibitem{papier2}
P.~E. Meyer, F.~Lafitte, and G.~Bontempi.
\newblock minet: An R/Bioconductor package for inferring large transcriptional networks using mutual information.
\newblock {\em BMC Bioinformatics}, 9:461, 2008.

\bibitem{nemenman}
I.~Nemenman, W.~Bialek, and R.~de~Ruyter~van Steveninck.
\newblock Entropy and information in neural spike trains: Progress on the sampling problem.
\newblock {\em Physical Review Letters}, 69, 2004.

\bibitem{paninski}
L.~Paninski.
\newblock Estimation of entropy and mutual information.
\newblock {\em Neural Computation}, 15(6):1191--1253, 2003.

\bibitem{peng}
H.~Peng, F.~Long, and C.~Ding.
\newblock Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy.
\newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence}, 27(8), 2005.

\bibitem{schafer}
J.~Schafer and K.~Strimmer.
\newblock A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.
\newblock {\em Statistical Applications in Genetics and Molecular Biology}, 4(32), 2005.

\bibitem{sg}
T.~Schurmann and P.~Grassberger.
\newblock Entropy estimation of symbol sequences.
\newblock {\em Chaos}, 1996.

\bibitem{vansomeren}
E.~P. van Someren, L.~F.~A. Wessels, E.~Backer, and M.~J.~T. Reinders.
\newblock Genetic network modeling.
\newblock {\em Pharmacogenomics}, 3(4):507--525, 2002.

\end{thebibliography}
\end{document}