%\VignetteIndexEntry{tiger User Guide}
%\VignetteKeywords{TimeCourse, GeneExpression, Transcription}
%\VignetteDepends{Biobase, annotate, puma}
%\VignettePackage{tiger}
\documentclass[a4paper]{article}
\usepackage{url}
\usepackage[authoryear,round]{natbib}

\title{tiger User Guide}
\author{Antti Honkela, Pei Gao, Jonatan Ropponen, \\
  Magnus Rattray, and Neil D. Lawrence}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\tiger}{\Rpackage{tiger}}

\usepackage{Sweave}
\begin{document}
\maketitle


\section{Abstract}

The \tiger{} package implements our methodology of Gaussian process
differential equation models for analysis of gene expression time
series from single input motif networks.  The package can be used for
inferring unobserved transcription factor (TF) protein concentrations
from expression measurements of known target genes, or for ranking
candidate targets of a TF.

\section{Citing \tiger{}}

The \tiger{} package is based on a body of methodological research.
Citing \tiger{} in publications will usually involve citing one or more
of the methodology papers \citep{Lawrence2007,Gao2008,Honkela2010PNAS} that the
software is based on as well as citing the software package itself.


\section{Introductory example analysis - Drosophila development}
\label{section:Introductory example}

In this section we introduce the main functions of the \Rpackage{puma}
package by repeating some of the analysis from the PNAS
paper~\citep{Honkela2010PNAS}\footnote{Note that the results reported
  in the paper were run using an earlier version of this package for
  MATLAB, so there can be minor differences.}.

\subsection{Installing the \tiger{} package}

The recommended way to install \tiger{} is to use the
\Rfunction{biocLite} function available from the bioconductor
website. Installing in this way should ensure that all appropriate
dependencies are met.

\begin{Schunk}
\begin{Sinput}
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("tiger")
\end{Sinput}
\end{Schunk}

% To install the tiger software, unpack the software and run
% \begin{verbatim}
% R CMD INSTALL tiger
% \end{verbatim}

To load the package start R and run
\begin{Schunk}
\begin{Sinput}
> library(tiger)
\end{Sinput}
\end{Schunk}

\subsection{Loading the data}

To get started, you need some preprocessed time series expression
data.  If the data originates from Affymetrix arrays, we highly
recommend processing it with \Rfunction{mmgmos} from the
\Rpackage{puma} package.  This processing extracts error bars on the
expression measurements directly from the array data to allow judging
the reliability of individual measurements.  This information is
directly utilised by all the models in this package.

To start from scratch on Affymetrix data, the .CEL files from
\url{ftp://ftp.fruitfly.org/pub/embryo_tc_array_data/} may be
processed using:
\begin{Schunk}
\begin{Sinput}
> # Names of CEL files
> expfiles <- c(paste("embryo_tc_4_", 1:12, ".CEL", sep=""),
+               paste("embryo_tc_6_", 1:12, ".CEL", sep=""),
+               paste("embryo_tc_8_", 1:12, ".CEL", sep=""))
> # Load the CEL files
> expdata <- ReadAffy(filenames=expfiles,
+                     celfile.path="embryo_tc_array_data")
> # Setup experimental data (observation times)
> pData(expdata) <- data.frame("time.h" = rep(1:12, 3),
+                              row.names=rownames(pData(expdata)))
> # Run mmgMOS processing (requires several minutes to complete)
> drosophila_mmgmos_exprs <- mmgmos(expdata)
> drosophila_mmgmos_fragment <- drosophila_mmgmos_exprs
\end{Sinput}
\end{Schunk}

This data needs to be further processed to make it suitable for our
models.  This can be done using
\begin{Schunk}
\begin{Sinput}
> drosophila_gpsim_fragment <-
+   processData(drosophila_mmgmos_fragment,
+               experiments=rep(1:3, each=12))
\end{Sinput}
\end{Schunk}

Here the last argument specifies that we have three independent time
series of measurements.

In order to save time with the demos, a part of the result of this is
included in this package and can be loaded using
\begin{Schunk}
\begin{Sinput}
> data(drosophila_gpsim_fragment)
\end{Sinput}
\end{Schunk}

\subsection{Learning individual models}

Let us now recreate some the models shown in the plots of the PNAS
paper~\citep{Honkela2010PNAS}:
\begin{Schunk}
\begin{Sinput}
> # FBgn names of target genes
> targets <- c('FBgn0003486', 'FBgn0033188', 'FBgn0035257')
> # Load gene annotations
> library(annotate)
> aliasMapping <- getAnnMap("ALIAS2PROBE",
+                   annotation(drosophila_gpsim_fragment))
> # Get the probe identifier for TF 'twi'
> twi <- get('twi', env=aliasMapping)
> # Load alternative gene annotations
> fbgnMapping <- getAnnMap("FLYBASE2PROBE",
+                  annotation(drosophila_gpsim_fragment))
> # Get the probe identifiers for target genes
> targetProbes <- mget(targets, env=fbgnMapping)
> st_models <- list()
> # Learn single-target models for each gene individually
> for (i in seq(along=targetProbes)) {
+   st_models[[i]] <- GPLearn(drosophila_gpsim_fragment,
+                             TF=twi, targets=targetProbes[i],
+                             useGpdisim=TRUE, quiet=TRUE)
+ }
> # Learn a joint model for all targets
> mt_model <- GPLearn(drosophila_gpsim_fragment, TF=twi,
+                     targets=targetProbes,
+                     useGpdisim=TRUE, quiet=TRUE)
> # Display the joint model parameters
> show(mt_model)
\end{Sinput}
\begin{Soutput}
Gaussian process driving input single input motif model:
  Number of time points: 
  Driving TF: 143396_at
  Target genes (3):
    148227_at
    152715_at
    147995_at
  Basal transcription rate:
    Gene 1: 31.4901487775542
    Gene 2: 0.0077988721171123
    Gene 3: 1.48241754617025e-06
  Kernel:
    Multiple output block kernel:
    Block 1
    Normalised version of the kernel.
    RBF inverse width: 0.7718466 (length scale 1.138242)
    RBF variance: 1.754535
    Block 2
    Normalised version of the kernel
    DISIM decay: 0.07285956
    DISIM inverse width: 0.7718466 (length scale 1.138242)
    DISIM Variance: 1
    SIM decay: 2005.556
    SIM Variance: 0.001472529
    RBF Variance: 1.754535
    Block 3
    Normalised version of the kernel
    DISIM decay: 0.07285956
    DISIM inverse width: 0.7718466 (length scale 1.138242)
    DISIM Variance: 1
    SIM decay: 0.4985881
    SIM Variance: 0.03226757
    RBF Variance: 1.754535
    Block 4
    Normalised version of the kernel
    DISIM decay: 0.07285956
    DISIM inverse width: 0.7718466 (length scale 1.138242)
    DISIM Variance: 1
    SIM decay: 0.0002446364
    SIM Variance: 0.003265287
    RBF Variance: 1.754535
  Log-likelihood: -31.84288 
\end{Soutput}
\end{Schunk}

\subsection{Visualising the models}

The models can be plotted using commands like
\begin{Schunk}
\begin{Sinput}
> GPPlot(st_models[[1]], nameMapping=getAnnMap("FLYBASE",
+                          annotation(drosophila_gpsim_fragment)))
> GPPlot(mt_model, nameMapping=getAnnMap("FLYBASE",
+                    annotation(drosophila_gpsim_fragment)))
\end{Sinput}
\end{Schunk}

\begin{figure}
  \begin{center}
\includegraphics{tiger-009}
\end{center}
\caption{Single target models for the gene FBgn0003486. The models for
  each repeated time series are shown in different columns.}
\end{figure}

\begin{figure}
  \begin{center}
\includegraphics{tiger-010}
\end{center}
\caption{Multiple-target model for all the example genes.  The
  call creates independent figures for each repeated time series.}
\end{figure}

\subsection{Ranking the targets}

Bulk ranking of candidate targets can be accomplished using
\begin{Schunk}
\begin{Sinput}
> ## Rank the targets, filtering weakly expressed genes with average
> ## expression z-score below 1.8
> scores <- GPRankTargets(drosophila_gpsim_fragment, TF=twi,
+                         testTargets=targetProbes,
+                         options=list(quiet=TRUE),
+                         filterLimit=1.8)
> ## Sort the returned list according to log-likelihood
> scores <- sort(scores, decreasing=TRUE)
> write.scores(scores)
\end{Sinput}
\begin{Soutput}
"log-likelihood" "null_log-likelihood"
"147995_at" 6.75970270587137 -487.893231050121
"152715_at" -1.51920691642152 -539.73619673943
"148227_at" -1.52526566233162 -73.4806804255218
\end{Soutput}
\end{Schunk}

To save space, \Rfunction{GPRankTargets} does not return the models by
default.  If those are needed later e.g. for plotting, they can be
recreated using the inferred parameters saved together with the
ranking using
\begin{Schunk}
\begin{Sinput}
> topmodel <- generateModels(drosophila_gpsim_fragment,
+                            scores[1])
> show(topmodel)
\end{Sinput}
\begin{Soutput}
[[1]]
Gaussian process driving input single input motif model:
  Number of time points: 
  Driving TF: 143396_at
  Target genes (1):
    147995_at
  Basal transcription rate:
    Gene 1: 0.000135154902328578
  Kernel:
    Multiple output block kernel:
    Block 1
    Normalised version of the kernel.
    RBF inverse width: 0.760276 (length scale 1.146870)
    RBF variance: 1.804526
    Block 2
    Normalised version of the kernel
    DISIM decay: 0.01789634
    DISIM inverse width: 0.760276 (length scale 1.146870)
    DISIM Variance: 1
    SIM decay: 0.01952063
    SIM Variance: 0.002722844
    RBF Variance: 1.804526
  Log-likelihood: 6.759703 
\end{Soutput}
\end{Schunk}

\subsection{Ranking using known targets with multiple-target models}

Ranking using known targets with multiple-target models can be
accomplished simply by adding the \texttt{knownTargets} argument
\begin{Schunk}
\begin{Sinput}
> ## Rank the targets, filtering weakly expressed genes with average
> ## expression z-score below 1.8
> scores <- GPRankTargets(drosophila_gpsim_fragment, TF=twi,
+                         knownTargets=targetProbes[1],
+                         testTargets=targetProbes[2:3],
+                         options=list(quiet=TRUE),
+                         filterLimit=1.8)
> ## Sort the returned list according to log-likelihood
> scores <- sort(scores, decreasing=TRUE)
> write.scores(scores)
\end{Sinput}
\begin{Soutput}
"log-likelihood" "null_log-likelihood"
"152715_at" -28.0896855288335 -539.73619673943
"147995_at" -206.238519107964 -487.893231050121
\end{Soutput}
\end{Schunk}

\subsection{Running ranking in a batch environment}

\Rfunction{GPRankTargets} can be easily run in a batch environment
using the argument \texttt{scoreSaveFile}.  This indicates a file to
which scores are saved after processing each gene.  Thus one could,
for example, split the data to, say, 3 separate blocks according to
the reminder after division by 3 and run each of these independently.
The first for loop could then be run in parallel (e.g. as separate
jobs on a cluster), as each step is independent of the others.  After
these have all completed, the latter loop could be used to gather the
results.
\begin{Schunk}
\begin{Sinput}
> for (i in seq(1, 3)) {
+   targetIndices <- seq(i,
+     length(featureNames(drosophila_gpsim_fragment)), by=3)
+   outfile <- paste('ranking_results_', i, '.Rdata', sep='')
+   scores <- GPrankTargets(preprocData, TF=twi,
+                           testTargets=targetIndices,
+                           scoreSaveFile=outfile)
+ }
> for (i in seq(1, 3)) {
+   outfile <- paste('ranking_results_', i, '.Rdata', sep='')
+   load(outfile)
+   if (i==1)
+     scores <- scoreList
+   else
+     scores <- c(scores, scoreList)
+ }
> show(scores)
\end{Sinput}
\end{Schunk}

\section{Experimental feature: Using non-Affymetrix data}

Using non-Affymetrix data, or data without associated uncertainty
information for the expression data in general, requires more because
of two reasons
\begin{itemize}
\item noise variances need to be estimated together with other model
  parameters; and
\item weakly expressed genes cannot be easily filtered \emph{a
    priori}.
\end{itemize}

The first of these is automatically taken care of by all the above
functions, but the latter requires some extra steps after fitting the
models.

In order to get started, you need to create an
\Rfunction{ExpressionTimeSeries} object of your data set.  This can be
accomplished with the function
\begin{Schunk}
\begin{Sinput}
> procData <- processRawData(data, times=c(...),
+                            experiments=c(...))
\end{Sinput}
\end{Schunk}

Filtering of weakly expressed genes requires more care and visualising
the fitted models is highly recommended to avoid mistakes.

Based on initial experiments, it seems possible to perform the
filtering based on the statistic \texttt{loglikelihoods(scores) -
  baseloglikelihoods(scores)}, but selection of suitable threshold is
highly dataset specific.

\bibliographystyle{plainnat}
\bibliography{gpsim}

\end{document}