% NOTE -- ONLY EDIT ScISI.Rnw!!!
%
%\VignetteIndexEntry{ScISI Vignette}
%\VignetteDepends{ppiData, xtable}
%\VignetteKeywords{Interactome}
%\VignettePackage{ScISI}
\documentclass{article}

\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}


\newcommand{\classdef}[1]{%
  {\em #1}
}

\begin{document}
\title{Creating In Silico Interactomes}
\author{Tony Chiang}
\maketitle

\section{Introduction}
We explore the problem of estimating \textit{in silico} interactomes: a collection of 
protein complexes for some cell or tissue under certain specific conditions. We will 
focus our attentions on Saccharomyces cerevisiae though this need not be the case in 
general.

This document will detail the various functions of \Rpackage{ScISI} by creating a
running example for an \textit{in silico} interactome. It is recommended that the reader
has a working session of $R$ and investigates the examples as they arise. 

<<loadlibs, eval=TRUE, results=hide>>=
library("ScISI")
@

\section{Obtaining Information}

\subsection{Gene Ontology}
Before an \textit{in silico} interactome can be assembled, we must obtain the necessary
information. As remarked earlier, two sources are online databases, Gene Ontology (GO) 
and the Munich Information Center for Protein Sequences (MIPS), and the third is a 
catalogue of \Rpackage{apComplex} estimates from a number of small and large
wet-lab experiments. We start by discussing how information is parsed from the GO and 
MIPS databases.

Parsing through the GO database to collect the required information is done by calling the
\Rfunction{getGOInfo} function:

<<getGOInfo, eval=TRUE>>=
goECodes = c("IEA", "NAS", "ND", "NR")
go = getGOInfo(wantDefault = TRUE, toGrep = NULL, parseType=NULL, 
               eCode = goECodes, wantAllComplexes = TRUE)
@

All the information that is obtained from the \Rfunction{getGOInfo} function comes from the $R$
packages \Rpackage{GO} and \Rpackage{GOstats}. It is important to note that the Gene
Ontology repository has periodic updates, so the protein complexes obtained will 
reflect the version of GO you are using. \Rfunction{getGOInfo} essentially parses 
through the Cellular Component (CC) of GO and (a)greps for specified key terms within certain 
boundary conditions. The parsing is carried out on terms (i.e. character vectors) together with 
perl regular expressions (this accounts for a boundary condition on protein complexes). 
The parameters of \Rfunction{getGOInfo} have the following characteristics:

\begin{itemize}
\item[1] \Rfunarg{wantDefault}: A logical: if passed into the function as TRUE, 
three default terms will be parsed: complex by agrep, ``$\verb1\\Base\\b1$'' by grep, and 
``$\verb1\\Bsome\\b1$'' by grep. The latter terms will search for protein complexes such as 
RNA polymerase or Ribosomes which may not contain the term complex.

\item[2] \Rfunarg{toGrep}: A named list with only one entry named ``pattern'' supplied. 
The pattern entry holds a character vector of terms, with perl regular expressions,
supplied by the users. \Rfunction{getGOInfo} will parse for these terms within the CC 
ontology. If toGrep is not NULL, parseType must not be NULL. This argument lets 
the user append to the search terms if \Rfunarg{wantDefault} is TRUE, or it allows 
the user to set the search terms if FALSE.

\item[3] \Rfunarg{parseType}: A character vector of parse terms, e.g. ``grep'' or ``agrep''. 
This vector must either be of length one or of the same length as the vector contained in 
\Robject{toGrep}. If parseType is of length one, all entries of toGrep will be parsed 
uniformly. Otherwise, the $i^{th}$ entry from parseType defines the command for the 
$i^{th}$ entry from toGrep.


To illustrate one example of appending to the search defaults, we have supplied the 
suffix $-somal$ as a candidate for \Rfunarg{toBeParsed}: 


<<getGOInfoToGrep, eval=TRUE>>=
toBeParsed = list()
toBeParsed$pattern = "\\Bsomal\\b"
goGrep = getGOInfo(wantDefault = TRUE, toGrep = toBeParsed, parseType = "grep",
                   eCode = NULL, wantAllComplexes = TRUE)
getGOTerm(setdiff(names(goGrep), names(go)))

@

Because \Rfunarg{wantDefault} is still kept as $TRUE$, the three default search terms were also 
parsed. The last line of code displays those GO protein complexes found with the addition of the 
term $-somal$ which were not found with only the default search terms. 

\item[4] \Rfunarg{eCode}: A character vector of evidence code terms (cite here). The is 
another search restriction; whereas perl regular expression put restrictions on 
the protein complexes as a whole, the evidence codes operate on the individual proteins themselves. 
If a protein is only referenced by evidence codes found within the eCode vector, it is dis-allowed 
and removed from the protein complex. In the example above, we have chosen the evidence codes $IEA$ 
(Inferred from Electronic Annotation), $NAS$ (Non-traceable Author Statement), $ND$ (No biological 
Data available), and $NR$ (Not Recorded).

\item[5] \Rfunarg{wantAllComplexes}: A logical. If TRUE, the children of parsed nodes will 
also be returned (accounting for sub-complexes). In addition, nodes that are children of 
the node ``GO:0043234'' (the protein complex node) are returned as well. 
\end{itemize}

<<getGOInfoOut, eval=TRUE>>=
class(go)
names(go)[1:5]
go$"GO:0005830"
@ 

The return value of \Rfunction{getGOInfo} is a named list (the code and actual output is
shown above). The names of this list corresond to the GO ID's of the protein complexes selected,
and the entries of the list correspond to the yeast genes (using their systematic names) 
annotated to the repsective GO ID (node). The list is one way to
represent the hyper-graph. Each entry of the return value represents a particular hyper-edge, 
and the union over all the entries of the list (i.e. over all the hyper-edges) yields the
vertex set.

The incidence matrix, $M$, of a bi-partite graph has its rows indexed by proteins and has its columns
indexed by the protein complexes. $M_{i,j} = 1$ if and only if protein $i$ is a member of complex $j$,
and $M_{i,j} = 0$ otherwise. Therefore, $M$ is a $\{0,1\}$ matrix which is generally sparse.

The function \Rfunction{createGOMatrix} takes the output of \Rfunction{getGOInfo} and creates the 
bi-partite graph incidence matrix of this hyper-graph list. 

<<createGOMatrix, eval=TRUE>>=
goM = list2Matrix(go)
@ 

We remark that we the proteins names used to index the rows of the incidence matrix are the 
systematic gene names for yeast. To establish uniformity, we have used the systematic gene names
across all data sets. The columns of the incidence matrix are indexed by the GO id's 
corresponding to certain protein complexes. 

Unfortunately, term mining is rarely comprehensive nor exhaustive. There exists GO nodes that 
neither have the term $complex$ within the Go terms nor do they have any terms ending in the suffixes
$-ase$ nor $-some$. We have manually found additionally GO nodes of potential interest. The 
\Robject{xtraGO} data set is a character vector of these hand selected GO protein complexes. There are 
32 total extra complexes though 6 of these complexes were already obtained 
by the \Rfunction{getGOInfo} function. From the remaining 26 complexes,
only 6 complexes were found to have non-trivial overlap with S. cerevisae in the GO CC ontology. To 
check possible GO nodes and to add to the GO protein complexes, we call the \Rfunction{xtraGONodes}
function (example code shown below). It is important to note that the \Rfunction{xtraGONodes} can only 
be called after an initial incidence matrix for the bi-partite graph of the GO protein complexes has 
been built since this function will compare the extra GO nodes with those existing GO protein complexes.

\subsection{Munich Information Center for Protein Sequences}
Parsing for information from the MIPS database (http://mips.gsf.de/genre/proj/yeast/)
is completely analogous to parsing the GO database and is done by calling the 
\Rfunction{getMipsInfo} function:

<<getMipsInfo, eval=TRUE>>=
mips = getMipsInfo(wantDefault = TRUE, toGrep = NULL, parseType = NULL, 
                   eCode = NULL, wantSubComplexes = TRUE, ht=FALSE)
@

Unlike the \Rfunction{getGOInfo} function, \Rfunction{getMipsInfo} parse through three files 
downloaded from the complex catalogue of the MIPS database. One file contains a hierarchy 
for the protein complexes (i.e. complexes are built from sub-complexes)
while the second file contains the protein composition of each of the levels of the hierarchy from the 
first. Finally, the third file contains evidence codes which are used to annotate proteins to each 
protein complex. These files are updated bi-annually, so it is important to note that \Rfunction{getMipsInfo} 
should be updated periodically to correspond with the newer data files. With the exception of the
 \Rfunarg{wantSubComplexes} argument, the functionality of \Rfunction{getMipsInfo} is identical 
to that of \Rfunction{getGOInfo} though the implementation is quite different. 
Thus, the parameters of \Rfunction{getMipsInfo} have the following characteristics:

\begin{itemize}
\item[1] \Rfunarg{wantDefault}: A logical: if passed into the function as TRUE, 
three default terms will be parsed: complex by grep, ``$\verb1\\Base\\b1$'' by grep, and 
``$\verb1\\Bsome\\b1$'' by grep.

\item[2] \Rfunarg{toGrep}: A named list with only one entry named ``pattern''.
The entry is a character vector containing terms, with perl regular expressions,
supplied by the users. \Rfunction{getMipsInfo} will parse for these terms within the 
hierarchy. If toGrep is not NULL, parseType must not be NULL neither. The 
terms supplied will be in addition to those of the default if \Rfunarg{wantDefault} is 
TRUE or will be the only terms parsed if FALSE.

\item[3] \Rfunarg{parseType}: A character vector of parse terms, e.g. ``grep'' or ``agrep''. 
This vector must either be of length one or of the same length as the vector toGrep. If 
parseType is of length one, all entries of toGrep will be parsed uniformly. Otherwise, the 
$i^{th}$ entry from parseType defines the command for the $i^{th}$ entry from toGrep.

Exactly like our in example with the \Rfunction{getGOInfo} function, we append to the default 
search terms with the suffix $-somal$.

<<getMipsInfoToGrep, eval=TRUE>>=
toBeParsed = list()
toBeParsed$pattern = "\\Bsomal\\b"
mipsGrep = getMipsInfo(wantDefault = TRUE, toGrep = toBeParsed, parseType = "grep",
                       eCode = NULL, wantSubComplexes = TRUE)
#mNm = setdiff(names(mipsGrep$Mips), names(mips$Mips))
#mId = strsplit(mNm, split = "-")
#sapply(mId, function(q) mipsGrep$DESC[q[2]])
@ 

Again the output of the code above shows the extra
protein clusters obtained with the addition of $-somal$ to the three default search terms.  

\item[4] \Rfunarg{eCode}: A character vector. This is a character vector of evidence codes.
If a protein is only indexed by the evidence codes given in this vector, it is removed 
from the protein complex it was annotated using only these evidence codes. The default setting 
of this parameter is set to $NULL$ as the vast majority of the evidence codes pertain to the 
protein complex estimates obtained from Affinity Purification - Mass Spectrometry (AP-MS) 
experiments. 

\item[5] \Rfunarg{wantSubComplexes}: A logical. If FALSE, only the top levels of the 
hierarchy will be returned (the aggregate protein complex); if TRUE, 
all the intermediates of the hierarchy  will also be returned 
(the sub-complexes as well). There are certain computational instances when retaining
only the top level protein complexes is necessary, but the default setting is to retain
all protein complexes, and hence the logical is TRUE. 

\item[6] \Rfunarg{ht}: A logical. If FALSE, then the protein complexes obtained by 
high throughput experimentation will not be extracted. Otherwise, if TRUE, those protein
complexes obtained by high throughput analysis are extracted.

\end{itemize}

Another difference between the \Rfunction{getGOInfo} and \Rfunction{getMipsInfo} functions is 
their respective outputs. We have seen that the output from the \Rfunction{getGOInfo} is a named
list that represents the hyper-graph, but, from looking at the output printed below,
this is not the case for \Rfunction{getMipsInfo}. 

The output of \Rfunction{getMipsInfo} is a list with two 
entries: 

<<mipsInfo>>=
class(mips)
names(mips)
@ 

The first, $Mips$, is a named list of character vectors where the names are the MIPS IDs
and the vectors contain the constituent members to the respective protein complexes (i.e. it is also
a hyper-graph representation exactly like the \Rfunction{getGOInfo} output):

<<getMipsInfoMips, eval=TRUE>>=
class(mips)
names(mips)[1:10]
mips$Mips$"MIPS-510.40"
@ 

And the second is a named character vector where the names, again,  correspond to MIPS ID and the 
entries contain a description of its respective protein complex: 

<<getMipsInfoDESC, eval=TRUE>>=
names(mips$DESC)[1:5]
mips$DESC["510.40"]
@ 

The \Rpackage{GO} package has methods of returning descriptions of any GO ID, and, therefore, it is 
redundant to create such a character vector containing the descriptions. Since there is no such 
methods for the MIPS dataset, it is necessary to record this vector as we parse the MIPS files. 

The MIPS bi-partite graph incidence matrix is created exactly as the GO bipartite graph incidence
matrix. The argument for \Rfunction{createMipsMatrix} is the output from \Rfunction{getMipsInfo}.

<<createMipsMatrix, eval=TRUE>>=
mipsM = list2Matrix(mips)
@ 

Again we use the systematic gene names for yeast to index the rows of the MIPS incidence matrix
and the the MIPS ID's corresponding to protein complexes index the columns. It is important to 
note that \Rfunction{createMipsMatrix} has prefixed each of the MIPS id's with the term ``MIPS-''.
This is useful when we merge different incidence matrices, for it allows us to know where each
protein complex orginates. 

\subsection{Purified Data Set Complexes estimated by apComplex}

Protein complexes obtained from the both the GO and MIPS repository haven generally come
from small-scale experiments, and the resulting protein complex estimates have been 
curated. These estimates, therefore, have a stronger likelihood to correctly represent 
true protein complexes found in vivo. Within the last five years, high throughput 
techniques such as Tandem Affinity Purification followed by Mass Spectroscopy have assayed
protein complex co-membership relationships in large scale. We use data derived 
from such AP-MS experiments and complex estimation algorithms on such data to generate
novel protein complex estimates (which have not been curated). 

<<ppiData, eval=TRUE>>=
library("ppiData")
get("Gavin2006BPGraph")
@ 

Presently, three small-scale and two large-scale AP-MS experimental datasets are publically
available. The R-data package \Rpackage{ppiData} contain these five datasets and store them
as a directed graph object (as seen from the example "Gavin2006BPGraph"). These directed
graphs represent bait to prey co-membership data assayed from the experiment rather than 
explicit protein complex composition. In fact, protein complex estimation is a difficult
computational problem that is in itself an active area of research.  

From the bait/prey data obtained from \Rpackage{ppiData}, we applied a quality assessment
on the protein interactions (cite); those proteins which are likely affected by a systematic 
bias of the experimental assay are removed from further analysis. Once the quality 
assessment is completed, we applied the penalized likelihood method of protein complex
estimation given in \Rpackage{apComplex} (cite) to obtain protein complex estimates 
from the data.

The methods and discussion for conducting the quality assessment on the AP-MS protein
interaction data can be found within the R-package \Rpackage{ppiStats}, and we defer
discussion of this process to that package. Likewise, a discussion concerning the 
estimation of protein complexes from the bait to prey data can be found within 
\Rpackage{apComplex}. For the purposes of generating high quality protein complexes
to be added to the in silico interactome, we generated protein complexes via 
apComplex by removing all proteins likely affected by a systematic bias at the 
p-value threshold of 0.01 and then setting the \Robject{commonFrac} parameter 
of the \Rfunction{findComplexes} function of \Rpackage{apComplex} to $\frac{1}{2}$.
One further step taken to produce higher quality complexes is to use statistically significant
derived data, i.e. only those estimates which fall under the multi-bait 
(more than one protein in the complex was used as a bait protein in the AP-MS techonology) 
and multi-edge (the directed graph showed higher instances of reciprocity) were taken to be 
high quality protein complex estimates. 

Again the bi-partite graph incidence matrix are indexed in the rows by the yeast standard 
gene names, but since there are no complex id's for the estimates reported by \Rpackage{apComplex},
the \Rfunction{getAPMSData} function denominates ad hoc id's for these complexes. The columns
are indexed by such ad hoc ids which reference the wet-lab experiments from where the were 
derived.


\section{Comparing Protein Complexes Within and Across Databases}

Now that we have obtained the various bi-partite graph incidence matrices, we need to evaluate and 
cross reference the protein complexes within a single database as well as
across databases and eliminate redundancy. There are two items 
for which we must find when cross referencing protein complexes: equality or inclusion. To compare
two bi-partite graph incidence matrices, we use the \Rfunction{runCompareComplex} function:

<<runCompareComplex, eval=TRUE>>=
mips2mips = runCompareComplex(mipsM, mipsM)
go2go = runCompareComplex(goM, goM)  
mips2go = runCompareComplex(mipsM, goM)
names(mips2go)
@ 

The \Rfunction{runCompareComplex} function takes two 
bi-partite graph incidence matrices as arguments,
and compares each complex from the first bipartite 
graph to the each complex from the second 
bi-partite graph. The argument $byWhich$ shall be 
elucidated in the following paragraphs. Note that
the bi-partite graphs need not be different as we 
can run this function on a single bi-partite graph.
 The output of this function is a named list whose 
entries carry various statistical
information. An explanation of each entry follows:

\subsection{Jaccard Coefficients}
The $JC$ entry of the output from \Rfunction{runCompareComplex} contains a matrix of the Jaccard 
similarity coefficient between the two incidence matrices that are compaired - in our running example, 
the complexes of mips and the complexes of go. The $(i,j)^{th}$ entry of this matrix corresponds to 
the Jaccard similarity between the $i^{th}$ complex of mips and the $j^{th}$ complex of go. The 
coefficient is calculated by taking the quotient of the cardinality of the intersection of these 
complexes $mips_i \cap go_j$ by the cardinality of the union of the same complexes 
$mips_i \cup go_j$:

\begin{equation}
JC_{i,j} = \frac{|mips_i \cap go_j|}{|mips_i \cup go_j|}.
\end{equation}

<<runCompareComplexJC, eval=TRUE>>=
all(diag(mips2mips[["JC"]]) == 1)
@ 

Equality of complexes returns a Jaccard index of unity while disjoint complexes return zero. So 
when we compared the Incidence matrix of MIPS against itself, we would have gotten $1$'s along
the main diaganol. We remark that if one complex completely contains another, e.g. $go_j 
\subset mips_i$, the Jaccard index will be the cardinality of $go_j$ divided by the cardinality 
of $mips_i$: 

\begin{equation}
\frac{|go_j|}{|mips_i|};
\end{equation}

it is important to realize that containment is not easily ascertained without the cardinality of 
each complex at hand.

\subsection{Alignment of Protein Complexes}
The $maxIntersect$ entry of mips2go aligns either the complexes of $go$ to complexes
of $mips$ or complexes of $mips$ to complexes of $go$ or both. We define alignment in this 
context to mean for a given protein complex of one bipartite graph which complex(es) in the second
bipartite graph will return a maximal Jaccard coefficient. This operation is not symmetric, and 
therefore the $byWhich$ argument in \Rfunction{runCompareComplex} instructs the function as to which 
alignment to produce: ``ROW'' compares all the complexes of the second bi-partite graph with each 
complex to the first; ``COL''compares all the complexes of the first bi-partite graph with each 
complex of the second; and ``BOTH'' will produce both the the outputs. This particular alignment will 
often not be one to one, but rather, one to many if either the ``ROW'' or the ``COL'' parameters are 
set. The mapping will be many to many if the ``BOTH'' parameter is set.

The $maxIntersect$ entry is itself a named list with two entries:

<<runCompareComplexMaxInterSect, eval=TRUE>>=
names(mips2go$maxIntersect)
@ 

The maximize entry records the maximum Jaccard indices for each of the complex aligned either by 
``row'' or by ``column'' or both if they are both called. The maxComp entry contains a list of 
named vectors also either by ``row'' or ``column''; the names of the vectors correspond to the 
complex aligned and the entries are the names of those complexes for which produce the highest 
Jaccard index. 

<<runCompareComplexMImaxComp, eval = TRUE>>=

mips2go$maxIntersect$maximize$row[1:3]
mips2go$maxIntersect$maxComp$row[1:3]

@ 

The information obtained by the $maxIntersect$ entry allows us to interpret how similar (or close)
each complex is with respect to other complexes. From this information, we can decide whether to 
allow or dis-allow any protein complex within the \texttt{in silico} interactome if they are deemed
to be too similar. For the creation of the working example interactome, we only dis-allow one complex
from a pair of complexes if and only if that pair returns Jaccard index 1 which implies equality 
between the two complexes.

\subsection{Equality of Protein Complexes}
The $equal$ entry records protein complexes of the first bi-partite graph matrix equal to those
in the second and this is easily done by selecting those complexes which have a Jaccard index of 
unity. To produce an \textit{in silico} interactome, we will eventually merge the different 
bi-partite graph matrices, and if two different bi-partite graphs have common protein complexes, we need
to determine those complexes and eliminate redundancy. 

<<runCompareComplexEqual, eval=TRUE>>=
length(mips2go$equal)
names(mips2go$equal[[1]])
mips2go$equal[[1]]
@ 

The $equal$ entry is a list of list which in turn has five entries: ``BG1Comp'' records the complex
of the first bi-partite graph; ``BG2Comp'' records the complex of the second bi-partite graph; 
``orderBG1Comp'' records the cardinality of ``BG1Comp''; analogously for ``orderBG2Comp''; and 
finally, ``intersect'' records the cardinality of $BG1Comp \cap BG2Comp$. Rather than just recording
the complex names, this data structure allows the user to inspect relative size of those common 
complexes. 

We remark here that there can be redendancy even within a single database as well, and it is necessary 
that we find and eliminate those protein complexes that may have occurred more than once in a database:

<<runCompareComplexEqualMips, eval=TRUE, results=hide>>=
ind = which(sapply(mips2mips$equal, function(w) w$BG1Comp != w$BG2Comp))
mips2mips$equal[ind]
@ 

The code above shows the redundancy within the Mips data-base. Though we have hidden the outputs, 
we encourage the user to run these commands to see the redundancies found within Mips and within 
any other data repository.

\subsection{Eliminating Redundancy}
The ``toBeRm'' entry takes only those protein complexes of the first bi-partite graph (in our
running example, this would be mips or equivalently mipsM) which are equal to some protein complexes 
of the second bi-partite graph. The result is a character vector of protein complexes that will need
to be deleted in order to avoid redundancy in the \textit{in silico} interactome. 

\subsection{Protein Sub-Complexes}
The ``subcomplex'' looks for any complex from one bi-partite graph which is completely contained
in a complex from the other bi-partite graph. The outputs in this entry are identical to those of the 
$equal$ and defer the reader to the explanation written previously.

While the storage of the data does not explicitly state which complex contains the other complex, it is 
pretty clear that the sub-complex must have the smaller cardinality. It is quite evident that if two 
complexes are equal, then one of the two complexes must be removed to eliminate redundancy; it is not 
clear, however, whether or not to remove sub-complexes. For the purpose of simulating AP-MS technology,
sub-complexes will be masked by the aggregate complex, and thus, removal is needed to avoid 
redundancy. There may be other computational tools that require the availability of the sub-complexes, 
so there is no consistent framework calling for the removal of sub-complexes, and hence, we
leave that decision to the users. For the construction of the working example of the \texttt{in silico}
interactome, we do not remove protein sub-complexes, and so our estimates will carry through with 
their inclusion.

\subsection{Removing Protein Sub-Complexes From the Interactome}

If removal of sub-complexes is necessary, we have built a simple component 
into the \Rfunction{runCompareComplex} function that makes this process fairly simple.
The $toBeRmSubC$ entry
records all those protein clusters which a proper sub-complexes of 
another protein complex, either within a single database or across 
databases. 

<<runCompareComplexSubC, eval=TRUE>>=
mips2mips$toBeRmSubC[1:3]
mips2go$toBeRmSubC[1:3]
@ 

Much like the $toBeRm$ entry of the output from \Rfunction{runCompareComplex}, the $toBeRmSubC$
entry is a a character vector of protein complexes that should be removed if sub-complexes are
irrelevant.

As an aside, we remark that it is necessary to record what we have removed, either redundant 
protein complexes or protein sub-complexes, so we advise users who are building these 
\texttt{in silico} interactomes to save the pair-wise comparison data from the output of
\Rfunction{runCompareComplex}.

<<saveData>>=
save(mips2go, file="mips2go.rda", compress=TRUE)
@ 

\section{Combining Bi-Partite Graphs - Building In Silico Interactomes}

After we have called the \Rfunction{runCompareComplex} function on two different bi-partite graph 
matrices, we can merge these repsective graph incidence matrices to form an aggregate
incidence matrix and thus creating an aggregate bi-partite graph. The aggregate bi-partite 
graph, $A$, has certain important characteristics: the rows of $A$ are indexed by the union of 
the proteins of the two constituent bi-partite graphs and the columns are indexed by the union
of the protein complexes of the two constituent bi-partite graphs.

To merge two incidence matrices, we simply call the \Rfunction{mergeBGMat} function:

<<mergeBGMat, eval=TRUE>>=
  dim(mipsM)
  dim(goM)
  ISI = mergeBGMat(mipsM, goM, toBeRm = c(mips2go$toBeRm, mips2mips$toBeRm, go2go$toBeRm))
 
@ 

The first two arguments of the \Rfunction{mergeBGMat} function is two bi-partite graph incidence
matrices; the last argument is a character vector of those protein complexes which should be 
deleted in order to avoid redundancy or to eliminate certain protein complexes
that should be dis-allowed from the \textit{in silico} interactome. We have chosen to only remove
redundant protein complexes in our working example; the end user has the freedom to choose
whichever protein complexes to remove from the in silico interactome. A character vector 
consisting of the identification code of the unwanted protein complexes should be passed into
the \Robject{toBeRm} parameter so that these protein complexes are disallowed.

Once two bi-partite graphs are merged, we can now call the \Rfunction{runCompareComplex} on the $merged$
incidence matrix and a new bi-partite graph matrix (one which is not any of the constituent 
matrices of $merged$). It is also important to call the \Rfunction{runCompareComplex}
function on each bi-partite graph from within one database in addition to 
comparing across databases. In this way, redundancy can be eliminated in 
all avenues, and we can construct the \textit{in silico} interactome
iteratively by merging a new incidence matrix in each subsequent iteration. 

\section{The estimated Saccharomyces cerevisiae In Silico Interactome}

Up until now, we have demonstrated how to use the various functions of the \Rpackage{ScISI}, but 
we diverge from that endeavor to explore the characteristics of our working example 
\texttt{in silico} interactome for budding yeast. By obtaining the dimension tag of incidence
matrix representing the interactome,

<<dimISI, eval = TRUE>>=
dim(ISI)
@ 

we can determine the number of uniquely expressed genes of the yeast genome (i.e. the number of rows
of the ISI which is the first entry of the output above), and the number of distinct multi-protein
complexes derived from the five different data sources (respectively the number of columns of the
ISI which corresponds to the second entry). 

There are two summary statistics that are of some importance. The first figure gives an indication 
of the distribution of the cardinality of the protein complexes within the \textit{in silico} 
interactome. We can see from this figure that the majority of the protein complexes have less
than ten constituent proteins. This statistic verfies what many have conjectured - that 
most multi-protein complexes are have relatively few proteins. It is also widely conjectured that 
the larger multi-protein complexes are built from the union of sub-complexes (e.g. Mitochondrial
translocase complex) which can be verified in our working interactome by exploring the 
relationships between aggregate complexes and sub-complexes. We have touched upon this analysis
in Table~\ref{ta:subComplexes}, but a more thorough and exhaustive analysis is conducted in 
(cite).


\begin{figure}[hb]
\centering
<<distPlot, fig=TRUE, echo=FALSE>>=
par(mfrow = c(1,2))
cS <- table(colSums(ISI))
size <- as.numeric(names(cS))
repet <- as.vector(cS)
freq = rep(size, repet)
hist(freq, xlab = "ScISI Protein Complex Cardinality", ylab = "Frequency",
     main = "Complex Cardinality")

rS <- table(rowSums(ISI))
size <- as.numeric(names(rS))
repet <- as.vector(rS)
freq = rep(size, repet)
hist(freq, xlab = "Complex Membership per Protein", ylab = "Frequency",
     main = "Complex Membership")
par(mfrow = c(1,1))
@
\caption{\label{Distribution of Complex Cardinality and Distribution of Complex Membership per Protein}}
\end{figure}

The second summary statistic gives the distribution of the number of complexes for which each 
protein may have membership. This statistic is important because we can use this distribution 
(along with the summary statistic above concerning protein complex size) to give us a sense 
of the nodal degree of a generic protein in this rather large co-membership network.

The second histogram shows that an overwhelming majority of proteins participate in fewer that 
five distinct protein complexes. Because the majority of complexes have fewer than ten members and
because the majority of proteins are members of fewer than 5 protein complexes, the number of 
neighbors of a generic protein in the protein-protein co-membership graph should also be relatively
small. 

We also discuss some simple summary statistic concerning the pairwise comparisons between the 
five data-sets in which we have obtained our protein complexes.

Table~\ref{ta:Repetitions} reports the number of protein complexes 
that were redundant by comparing the repositories
pairwise including self comparisons. Table~\ref{ta:Repetitions} shows 
that the only data-set with self-redundancy is derived from the 
MIPS repository; and upon investigation, MIPS has categorized protein
complexes by known functionality, and three protein complexes were 
placed in two different categories each. Table~\ref{ta:Repetitions} 
directly shows that the no two repositories has extremely large 
overlap.

<<testStuff, echo=FALSE, results=hide>>=
data("redundantM")
data("subCompM")
@ 


<<redundant, eval = TRUE, echo=FALSE, results=tex >>=
library("xtable")
xtable(redundantM, display = c("s","d","d","d","d","d"), label = "ta:Repetitions", 
caption="Number of repetitive Protein Complexes")

@ 


Not limited to redundant protein complexes, we also process each data-set
independently to find and to remove all protein
sub-complexes. The
results are reported in Table~\ref{ta:subComplexes} where the row
names indicate the location of the sub-complex and the
column names indicate the location of the aggregate protein complex.
Reading across row two of Table~\ref{ta:subComplexes} we see that
GO contains protein complexes that were protein sub-complexes 
in each of the other data repositories including itself. Table~\ref{ta:subComplexes}
is much less sparse relative to Table~\ref{ta:Repetitions}, and it is this
relative comparison that makes all the estimates more credible since each 
data repository is based on unique experiments coupled with unique 
estimation algorithms. 

<<subComp, eval = TRUE, echo=FALSE, results=tex>>=

xtable(subCompM, display = c("s","d","d","d","d","d"), label = "ta:subComplexes", 
caption = "Number of Protein Sub-Complexes")

@ 

While these four summary statistics give a good description of the \textit{in silico} interactome, 
these statistics cannot give a comprehensive description; that would take a much more considerable 
dissection of the interactome (for which computational experiments should be conducted). It is also 
necessary to state that this interactome represents Saccharomyces cerevisiae for some set of 
environmental conditions, and so is one of the many estimates that can be created by this package.

We remark that the interactome can only give predicted members of the protein complexes of interest,
but it cannot ascertain the multiplicity of each individual constituent protein member. Therefore,
it is important to be mindful that the \textit{in silico} interactome is a good model and 
representation of known biology, and so it is limited in its presentation.

\section{Customizing the Interactome}
We have successfully built an \textit{in silico} interactome, and for some computational methodologies,
this interactome is the best interactome to use. In this section, we present some other functions within
\Rpackage{ScISI} that allows the users to refine the interactome from the coarser one built. 

\subsection{Generating an Instance of Class yeastData}
One of the functionalities of the \Rpackage{ScISI} package is to create an instance of the class
\Rclass{yeastData}. Before we can instantiate this class, some preliminary functions need to be 
called. 

The class \Rclass{yeastData} will contain all the available information of some bi-partite graph, and
so we must collate these sets of information into one usable format. We can do this by calling either 
the \Rfunction{createGODataFrame} or \Rfunction{createMipsDataFrame} function. 

<<createDataFrame, eval=TRUE>>=
goDF = createGODataFrame(go, goM)
mipsDesc <- sapply(mips, function(x) {attr(x, "desc")})
mipsDF = createMipsDataFrame(mipsDesc, mipsM)
@ 

The names of these two
functions describe the output quite explicitly, but the arguments need some clarification. The 
arguments for the both functions are essentially the same. The arguments for \Rfunction{createGODataFrame}
are:

\begin{itemize}
 \item[1] \Rfunarg{go} - A named list whose entries are character vectors. This is essentially the 
hyper-graph of the GO protein complexes. The GO id's are obtained by calling the \Rfunction{names} 
function on the list, and the description for the protein complexes is obtained by calling the
\Rfunction{getGOTerm} function from \Rpackage{GO}.
\item[2] \Rfunarg{goM} - The incidence matrix representing the GO hyper-graph. The names of the 
GO protein complexes are obtained by calling the \Rfunction{colnames} on this matrix.
\end{itemize}

The arguments for the function \Rfunction{createMipsDataFrame} are:

\begin{itemize}
\item[1]{mips\$DESC} - A named character vector. The MIPS' id's are obtained by calling \Rfunction{names}
on the vector, and the description is obtained by accessing the vector values.
\item[2]{mipsM} - The incidence matrix representing the MIPS hyper-graph. The names of the MIPS' 
protein complexes are obtained by calling the \Rfunction{colnames} on this matrix.
\end{itemize}

The generated database holds three distinct pieces of 
information for each protein complex: its name relative to the incidence matrix; its id (GO or MIPS);
and its description.

<<dataFrameOut, eval=TRUE>>=
#head(mipsDF)
@ 

Once we have created either the GO dataframe or the MIPS dataframe, we can proceed to call an 
instance of the \Rclass{yeastData} class. Instantiating an \Rclass{yeastData} object is done
by calling the \Rfunction{createYeastDataObj} function:

<<createYeastDataObj, eval=TRUE>>=
mipsOb = createYeastDataObj(mipsDF)
goOb = createYeastDataObj(goDF)
@ 

With the instance of the \Rclass{yeastData} class, we have access to three generic methods: 
\Rmethod{ID}, \Rmethod{Desc}, and \Rmethod{getURL}. 

<<yeastDataMethods, eval=TRUE>>=
ID(mipsOb, "MIPS-510.40")
Desc(mipsOb, "MIPS-510.40")
getURL(mipsOb, "MIPS-510.40")
@ 

The arguments for each of the methods are identical: the first argument (\Rfunarg{mipsOb} in 
our running example) is the instance of \Rclass{yeastData}; the second argument 
(``MIPS-510.40'' in the running example) is the name of the protein complex as 
defined in the incidence matrix \Robject{mipsM}, i.e. the column names. A good exercise is 
too create an instance of \Rclass{yeastData} with the GO dataframe and work with the 
generic methods.

The return values the \Rmethod{ID} and \Rmethod{Desc} are self-explanatory. The output for
\Rmethod{getURL} returns the url for the web-page containing a detailed description of the 
MIPS or GO protein complex. This allows the user to parse through the derived protein 
complexes manually and refine the \textit{in silico} interactome. Simply call the 
\Rfunction{browseURL} function with the output of \Rmethod{getURL} as the argument to open a 
browser with the specified web-page.


\subsection{Generating an Informational .html File}
Once we have created an instance of \Rclass{yeastData}, we can call the
\Rfunction{ScISI2html} function to generate a .html file in the user's home 
directory. 

<<ScISI2html, eval=TRUE, results = hide>>=
mipsNames = colnames(mipsM)
mipsCompOrder = colSums(mipsM)
url = vector(length = length(mipsNames))
linkNames = vector(length = length(mipsNames))
for(i in 1:length(mipsNames)){
url[i] = getURL(mipsOb, mipsNames[i])
linkNames[i] = Desc(mipsOb, mipsNames[i])
}
ScISI2html(urlVect = url, linkName = linkNames, filename = "mips.html", 
           title = "MIPS Protein Complex", compSize = mipsCompOrder)

goNames = as.character(goDF[,1])
goCompOrder = colSums(goM[,goNames])
url = vector(length = length(goNames))
linkNames = vector(length = length(goNames))
for(i in 1:length(goNames)){
    print(i)
    url[i] = getURL(goOb, goNames[i])
    linkNames[i] = Desc(goOb, goNames[i])
}
ScISI2html(urlVect = url, linkName = linkNames, filename = "go.html", 
           title = "GO Protein Complex", compSize = goCompOrder)
@ 

The arguments for the \Rfunction{ScISI2html} are somewhat self explanatory, but we will provide
a terse description of each argument:

\begin{itemize}

\item[1] \Rfunarg{urlVect} is a character vector consisting of url address for the protein complexes.

\item[2] \Rfunarg{linkName} is a character vector taht will be anchored by \Rfunarg{url}.

\item[3] \Rfunarg{filename} is a character that will name the file, e.g. in our running example, 
the output file will be named ``mips.html'' and ``go.html''.

\item[4] \Rfunarg{title} will be the title of the .html file.

\end{itemize}

The purpose for this .html file is a quick and efficient way for users to manually parse the 
protein complexes derived from \Rpackage{ScISi} either for quality control and for refinement.\

<<unWantedComp, eval = TRUE>>=

#ISI2 = unWantedComp(ISI, unwantedComplex =
#                         c("GO:0000262",
#                           "GO:0000228",
#                           "GO:0000775",
#                           "GO:0010008",
#                           "GO:0005792",
#                           "GO:0005768",
#                           "GO:0005769",
#                           "GO:0005770",
#                           "GO:0005777",
#                           "GO:0005844",
#                           "GO:0001400"))

data(ScISIC)
identical(ISI, ScISIC)

@ 

The function has two arguments, the bi-partite graph incidence matrix and a character vector. 
Each entry of the character vector is an protein complex ID which will be deleted. After the 
deletion of these protein complexes, the function also removes any proteins which are no longer 
apart of any protein complex.

<<checking>>=
@ 

\section{Conclusion}
The process given above has created one example on an \textit{in silico} interactome. For the purposes of 
simulating AP-MS data and generating estimates of via estimation algorithms, this interactome is a sound 
place from where to begin. It is clearly not the most suitable interactome for every computational 
experiment, and we concede that refinements will inevitably be made. The \textit{in silico} interactome will 
also change and refine itself as the data from the GO and MIPS databases are refined. One of the most important
aspect to the R package \Rpackage{ScISI} is its dynamic nature, for the basic principal in which it has been  
built is that the \textit{in silico} interactome must be amenable to change due to a number of outside
circumstances. 

\end{document}