%\VignetteIndexEntry{SpeCond} %\VignettePackage{SpeCond} \documentclass[a4paper]{article} \usepackage{times} \usepackage{a4wide} \usepackage{verbatim} %\usepackage{asmsymb} \usepackage[pdftex,linktocpage]{hyperref} \usepackage{graphicx} % \usepackage{epsfig} \usepackage{float} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\small\texttt{#1}}} \clearpage \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} \begin{document} \title{Condition-specific detection with SpeCond} \author{Florence Cavalli} \maketitle \tableofcontents \pagebreak \section*{Introduction} This vignette presents the SpeCond package, an R package which performs gene expression data analysis to detect condition-specific genes. Such genes are significantly up- or down-regulated in a small number of different conditions. Conditions can be environmental conditions, tissues, organs or any other sources that you wish to compare in terms of gene expression.\\ Condition-specific genes are essentially outliers in an expression pattern. In order to detect such outliers, SpeCond fits a mixture of normal distributions to the expression values. In addition to the main function \Rfunction{SpeCond}, this package includes other methods such as \Rfunction{writeSpeCondResult}, \Rfunction{getFullHtmlSpeCondResult} and \Rfunction{getGeneHtmlPage} which produce text files and html reports. \section{The detection process in a few words} % SpeCond uses a normal mixture model to determine for each particular gene the null distribution of the gene expression values. From the null distribution, it detects the outliers as gene's condition-specific expression.\\ The main steps of the procedure are as follows: (i) model the expression data with a mixture of normal distribution(s), (ii) identify the null distribution as well as candidate outlier observations, (iii) compute p-values of the expression values using the null distribution and adjust for multiple comparisons using the Benjami and Yekutiely (or a user-defined) method (iv) identify significant condition-specific expression values of the gene using the adjusted p-values. \section{Quick start} \subsection{SpeCond function} The function \Rfunction{SpeCond} enables the user to perform the full analysis. This function is called with the following arguments: \begin{itemize} \item \emph{expressionMatrix}: an ExpressionSet object or a matrix of expression values (in log2); columns are the conditions, rows are genes (or probe sets) \item \emph{param.detection}: a vector containing the parameters for both steps of the procedure, see section "Parameters" \item \emph{multitest.correction.method}: the multitest correction method; the default is "BY", see \Rfunction{p.adjust} for the possible values \item \emph{prefix.file}: a prefix added to the histogram file (if produced). It will be used to link to the result html pages generated by other functions using the result object of this function (if no other prefix value is implemented). The default is "A". It is useful to change the prefix when you perform a new analysis with different parameters, as you may want to compare the results \item \emph{print.hist.pv}: a logical (TRUE/FALSE) value indicating whether a histogram of p-values is to be printed; the default is FALSE \item \emph{condition.factor}: this argument can be used if expressionMatrix is an ExpressionSet object; a factor object of length equal to the number of columns (samples) of the ExpressionSet object specifying which sample(s) belong to which condition (condition.factor levels); can be extracted from the phenoData \item \emph{condition.method}: this argument can be used if expressionMatrix is an ExpressionSet object; the method (mean, median or max) to summarise the samples by conditions (defined by the condition.factor vector) \end{itemize} %\item \emph{do.logtransform}: if TRUE, the data are log transformed before the analysis, the default is FALSE \textbf{Example:}\\ Note: For this vignette we will work in a temporary directory (using \Rfunction{tempdir()} as below), to limit the size of the source files. However, for your own use you should ignore it, by default the .RData and files are generated and saved in the current directory. <>= d=tempdir() oldir=getwd() setwd(d) @ Loading the library and the example dataset; an ExpressionSet and a matrix expression value: <>= library(SpeCond) data(expSetSpeCondExample) expSetSpeCondExample class(expSetSpeCondExample) data(expressionSpeCondExample) Mexp=expressionSpeCondExample class(Mexp) dim(Mexp) @ \textbf{Perform the analysis with default parameters:}\\ Using an expression matrix as input:\\ <>= generalResult=SpeCond(Mexp, param.detection=NULL, multitest.correction.method="BY", prefix.file="E", print.hist.pv=TRUE, fit1=NULL, fit2=NULL, specificOutlierStep1=NULL) names(generalResult) specificResult=generalResult$specificResult @ Using an ExpressionSet object as input:\\ Specifying the \emph{condition.factor} and the \emph{condition.method} arguments to extract correctly the expression values according to the conditions, see \Rfunction{getMatrixFromExpressionSet} for details. <>= generalResult=SpeCond(expSetSpeCondExample, param.detection=NULL, multitest.correction.method="BY", prefix.file="E", print.hist.pv=TRUE, fit1=NULL, fit2=NULL, specificOutlierStep1=NULL, condition.factor=expSetSpeCondExample$Tissue, condition.method="mean") @ Retreive the expression matrix used by the analysis with \Rfunction{getMatrixFronExpressionSet} and the same arguments (\emph{condition.factor} and \emph{condition.method}) used in the \Rfunction{SpeCond} function. This is necessary for the visualisation functions. <>= MexpS=getMatrixFromExpressionSet(expSetSpeCondExample, condition.factor=expSetSpeCondExample$Tissue,condition.method="mean") @ The \textbf{generalResult} and \textbf{specificResult} objects generated above are described in the Output section. \subsection{Result visualisation} Two functions \Rfunction{getFullHtmlSpeCondResult} and \Rfunction{getGeneHtmlPage} allow the user to visualise the results. \subsubsection{Html page displaying the full detection result} A general result html page generated by the \Rfunction{getFullHtmlSpeCondResult} function gives an overall view of the condition specific behaviour of genes present in the dataset. This page mainly contains a plot of the number of specific genes by condition, a specific profile heatmap and the result table with the specific genes and their specific condition(s).\\ \\ The function \Rfunction{getFullHtmlSpeCondResult} is called with the following arguments: \begin{itemize} \item \emph{SpeCondResult}: the result object of the \Rfunction{SpeCond} functions \item \emph{param.detection}: the parameter matrix used in the analysis (by \Rfunction{SpeCond}) \item \emph{page.name}: the name of the result html page. The default is "SpeCond\_result" \item \emph{page.title}: the title of the result html page. The default is "Condition-specific analysis results" \end{itemize} Further arguments are described in the paragraph "Visualisation, html pages". <>= getFullHtmlSpeCondResult(SpeCondResult=generalResult, param.detection= specificResult$param.detection, page.name="Example_SpeCond_results", page.title="Tissue specific results", sort.condition="all", heatmap.profile=TRUE, heatmap.expression=FALSE, heatmap.unique.profile=FALSE, expressionMatrix=Mexp) @ \subsubsection{Html result pages for each gene} A result html page for each gene can be produced by the function \Rfunction{getGeneHtmlPage}. In addition, this function creates an index to allow you to navigate easily between the results pages for each gene. Each gene result page contains the parameter set used, the expression profile plot and its density curve. An additional plot displays the normal density functions from the normal mixture model as well as the null distribution curve determined by SpeCond. Finally, it presents the condition(s) in which the gene is specific with the associated p-value.\\ \\ The function \Rfunction{getGeneHtmlPage} is called with the following arguments: \begin{itemize} \item \emph{expressionMatrix}: the matrix of expression values initially used \item \emph{specificResult}: the R object result of the \Rfunction{getSpecificProbeset} function \item \emph{name.index.html}: the name of the html index, the default is "index.html" \item \emph{prefix.file}: a prefix added to the generated file(s) and \emph{outdir} directory name to linked to the index file. The default is NULL, the \emph{prefix.file} attribute of \emph{specificResult} object is used \item \emph{outdir}: the name of the directory in which the generated files will be created. The default is "Single\_result\_pages" \item \emph{gene.html}: a vector of gene names for which you want to create html pages, same as the row names of the \emph{expressionMatrix} object, the default is NULL (the values of the \emph{gene.html.ids} argument will be used) \item \emph{gene.html.ids}: a vector of integers corresponding to the row numbers in the \emph{expressionMatrix} object for the genes for which you want to create html pages. The default is the first 10 rows (or the number of rows of the expressionMatrix if lower to 10). \end{itemize} \textbf{Remark}: If both \emph{gene.html} and \emph{gene.html.ids} are set to NULL, the gene html pages for every gene in \emph{expressionMatrix} will be generated. It is possible to use \emph{gene.html} or \emph{gene.html.ids} to select a set of genes.\\ For a given set of genes, you may wish to compare the results contained in two or more \textbf{specificResult} objects, which were derived using different parameter sets. For this reason, it is useful to change the prefix (by setting the variable \textbf{prefix.file}) as well as the name of the index file (by setting the variable \textbf{name.index.html}). <>= genePageInfo=getGeneHtmlPage(Mexp, specificResult, name.index.html= "index_example_SpeCond_Results.html", gene.html.ids=c(1:20)) @ \section{Parameters} \subsection{Description} SpeCond uses two sets of parameters that can be tuned intuitively by the user. Two parameters, \emph{lambda} and \emph{beta}, are involved in the implementation of the normal mixture model. Additionally, four parameters are used for the determination of the null distribution: the percentage threshold (\emph{per}), the median difference (\emph{md}), the minimum log-likelihood (\emph{mlk}) and the minimum standard deviation ratio (\emph{rsd}).\\ To be detected as an outlier, a normal component of the mixed distribution must meet several criteria. The difference between its median and that of the principal normal component (distribution clustering most of the expression values) must be greater than \emph{md}, and the percentage of expression values attributed to this normal must be smaller than \emph{per}. Lastly, the component in question and the principal component must be well-separated, such that the minimum of the absolute log-likelihood ratio of the expression values under the two components is larger than \emph{mlk}, or the component in question must be much more spread-out, such that the ratio of their standard deviations is less than \emph{rsd} (see Figure~\ref{fig-detection}). A p-value threshold is set to define a condition as specific for a given gene. Adjusted p-values are used. % Need to defiend principal normal component???? %\lambda and \beta \begin{itemize} \item \emph{lambda}: influences the choice of models by affecting the selection of one, two or three normal distributions, thus introducing some weight on the effect of number of parameters to be defined by the clustering model. The default is 1, the model uses the Bayesian Information Criterion (BIC) value taking into account the log-likelihood value \item \emph{beta}: influences the prior applied during the determination of the variance of the normal distributions. It is necessary in the first fitting step to allow the model to capture isolated outliers. It is set by default to 0 in step 2 \item \emph{per}, percentage threshold: this is the maximum percentage of conditions in which a gene can be defined as specific. As per increases, more expression values will be detected as outliers, so each gene can have a larger number of specific conditions associated with it. \item \emph{md}, median difference: this is the minimum distance between the median values of two normal components of the mixted distribution, which allows identification of one component as an outlier (i.e. possibly not part of the null distribution). Low median distances are excluded to filter out noise that is common in biological samples \item \emph{mlk}, minimum log-likelihood: allows the identification of clusters of conditions that are well separated from the other(s) in the model. If the minimum log-likelihood of a set of expression values in a normal component exceeds \emph{mlk}, then this component is a possible outlier (i.e. not part of the null distribution) \item \emph{rsd}, minimum of standard deviation ratio: allows the identification of clusters of conditions that are extremely spread out compared to the distribution clustering of most expression values. If the ratio of standard deviation between the the component and the principale normal component is below \emph{rsd}, then this component is a possible outlier (i.e. is not part of the null distribution) \item \emph{pv}, p-value threshold to detect a condition as specific for a given gene \end{itemize} Figure~\ref{fig-detection} presents the different detection cases. As \emph{mlk} increases and \emph{rsd} decreases, it is less likely that a mixture component will be defined as an outlier.\\ \begin{figure}[htbp] \begin{center} \includegraphics{Figure1_null_distribution} % \includegraphics{Figure1_test} \end{center} \caption{\small Determination of the null distribution: The three different conditions evaluated in order to consider a normal component as part of the null distribution. (I) The median of the values from each component must have a difference larger than \emph{md}. If this first condition is fulfilled, the procedure tests the following conditions: The normal component will not be part of the null distribution if: (II) The normal component is small and well separated i.e. the minimum of the absolute log-likelihood ratio of the expression values under the two components is larger than \emph{mlk}, (III) The normal component is small and largely spread out i.e. the standard deviation ratio is smaller than \emph{rsd}.} \label{fig-detection} \end{figure} \subsection{How to change the parameter values} The parameter's values are stored in a matrix (\emph{param.detection}). The procedure includes two steps, which are described in the following section. As a consequence each parameter can be changed to improve the detection. The first row called "Step1" and the second row "Step2" of \emph{param.detection} contain the parameters for the first and the second step of the procedure, respectively.\\ To obtain the default parameters, use the function \Rfunction{getDefaultParameter}. <>= param.detection=getDefaultParameter() param.detection @ The function \Rfunction{createParameterMatrix} enables the user to create a new parameter matrix or to change value(s) from the default or a given parameter matrix. The different arguments allow to change the values in the \emph{param.detection} send as argument. If \emph{param.detection} is not specified (i.e. NULL), the default parameters will be used and then changed according to the other arguments. The different arguments are: \emph{param.detection}, \emph{beta.1}, \emph{beta.2}, \emph{lambda.1}, \emph{lambda.2}, \emph{per.1}, \emph{per.2}, \emph{md.1}, \emph{md.2}, \emph{mlk.1}, \emph{mlk.2}, \emph{rsd.1}, \emph{rsd.2}, \emph{pv.1} and \emph{pv.2}. <>= param.detection2=createParameterMatrix(mlk.1=10) param.detection2 param.detection2B=createParameterMatrix(param.detection=param.detection2, mlk.1=15, rsd.2=0.2) param.detection2B @ Remark: \textbf{We strongly advice the following rules:} \begin{itemize} \item beta.2=0 \item md.1=md.2 \item per.1<=per.2 \item pv.1=pv.2 \end{itemize} \subsection{The effect of the parameters on the detections} To give you a sense of the importance and the effect of the parameter values, we have created a simulated dataset. As example, we analyse this dataset with the default parameters then improve the specific detection modifying two parameters. This detailed analysis is present in the ``Improve the detection changing the parameter values'' section. \section{More details} \subsection{Advanced usage of SpeCond} The procedure performed by \Rfunction{SpeCond} integrates two steps (presented in Figure~\ref{fig-procedure}) that can be processed separately using the functions presented in the following section. Additionally, after running the \Rfunction{SpeCond} function for the first time, it is possible to test other parameters to obtain the null distribution. In this case, it is not necessary to re-process the fitting step so the results of the first run can be used as an input for the second run. For this purpose, three extra \Rfunction{SpeCond} parameters allow to perform only the latest parts of the procedure (depending of which arguments are still set to NULL). \begin{itemize} \item \emph{fit1}: the result of the first fitting from the function \Rfunction{fitPrior} or generalResult\$fit1, the default is NULL \item \emph{fit2}: the result of the second fitting from the function \Rfunction{fitNoPriorWithExclusion} or generalResult\$fit2, the default is NULL \item \emph{specificOutlierStep1}: the result of the first detection step from the function \Rfunction{getSelectiveOutliers} or generalResult\$specificOutlierStep1, the default is NULL \end{itemize} Example:\\ Change the detection parameters for the first step and apply these to the previously computed fitting: <>= param.detection2=createParameterMatrix(param.detection, mlk.1=10) param.detection2 generalResult2=SpeCond(Mexp, param.detection=param.detection2, multitest.correction.method="BY", prefix.file="E2", print.hist.pv=TRUE, fit1=generalResult$fit1, fit2=NULL, specificOutlierStep1=NULL) @ Change the parameters for the second step and apply these to the result of the first step and the second fitting computed previously: <>= param.detection3=createParameterMatrix(param.detection, rsd.2=0.2, per.2=0.2) param.detection3 generalResult3=SpeCond(Mexp, param.detection=param.detection3, multitest.correction.method="BY", prefix.file="E3", print.hist.pv=TRUE, fit1=generalResult$fit1, fit2=generalResult$fit2, specificOutlierStep1=generalResult$specificOutliersStep1) @ \begin{figure}[H] \begin{center} \includegraphics{ProcedureSteps} \end{center} \caption{\small Procedure Steps: Short description of the two step procedure indicating key points in each steps} \label{fig-procedure} \end{figure} \subsection{Stepwise analysis} As the method works with two steps it can be very useful to process step by step, saving the object(s) and enabling to re-run only the last detection function for example with a new set of parameters.\\ Here, we described the procedure step by step:\\ Use the function \Rfunction{getMatrixFromExpressionSet} if your dataset is an ExpressionSet to obtain the \textbf{Mexp} matrix as presented in the ``Quick Start'' paragraph.\\ \textbf{Step1}: get the mixture distributions fitting the expression data with a prior on the variance to catch the single outliers. Then detected the single outliers using the first row ("Step1") of the \emph{param.detection} matrix: <>= param.detection fit1=fitPrior(Mexp, param.detection=param.detection) @ Or specifying only \emph{lambda} and \emph{beta} values: <>= fit1=fitPrior(Mexp, lambda=param.detection[1,"lambda"], beta=param.detection[1,"beta"]) @ Detect the single outliers and store them in \textbf{specificOutlierStep1}. <>= specificOutlierStep1=getSpecificOutliersStep1(Mexp, fit=fit1$fit1,param.detection, multitest.correction.method="BY", prefix.file="run1_Step1", print.hist.pv=FALSE) @ \textbf{Step2}: get the second fitting ignoring the values of outliers detected in step 1, then perform the specific detection using the second row ("Step2") of the param.detection matrix: <>= fit2=fitNoPriorWithExclusion(Mexp, specificOutlierStep1=specificOutlierStep1, param.detection=param.detection) @ Or specifying only \emph{lambda} and \emph{beta} values: <>= fit2=fitNoPriorWithExclusion(Mexp, specificOutlierStep1=specificOutlierStep1, lambda=param.detection[2,"lambda"], beta=param.detection[2,"beta"]) @ Detect the condition specific for each gene and store them in \textbf{specificResult}. <>= specificResult=getSpecificResult(Mexp, fit=fit2, specificOutlierStep1=specificOutlierStep1, param.detection, multitest.correction.method="BY", prefix.file="run1_Step2", print.hist.pv=FALSE) @ \section{Output} The SpeCond package can produce three different types of outputs: R objects, text files and html pages. \subsection{R Objects} The result of \Rfunction{SpeCond} the \textbf{generalResult} object is of class \textbf{sp\_list}. It is a large list containing all the parameters and results of the analysis. The five attributes of \textbf{generalResult} correspond to the output of the functions presented in the Stepwise analysis paragraph. So the object \textbf{specifcResult} obtained by the \Rfunction{getSpecificResult} function is a class \textbf{sp\_list} as well and corresponds to the fifth attributes of the \textbf{generalResult} of the \Rfunction{SpeCond} function. The later is a list of seven attributes containing all the parameters and results of the detection.\\ Remark: For all result objects the order of genes and conditions from the input expression value matrix is preserved.\\ The \textbf{generalResult} attributes are: \begin{itemize} \item \emph{prefix.file}: the prefix used for this analysis. It will be used by default in the function \Rfunction{getFullHtmlSpeCondResult} and \Rfunction{getGeneHtmlPage} \item \emph{fit1}: a list of genes as first attributes and for each gene a list of three attributes: \begin{itemize} \item\emph{G}: number of normal components fitting the data \item\emph{NorMixParam}: the parameters of each normal component of the mixture distribution fitting the expression values of the gene: proportion, mean and standard deviation \item\emph{classification}: the normal component of the mixture distribution, to which the expression value is attributed \item Remark: if \emph{evaluate.lamda.beta} is TRUE in \Rfunction{fitPrior}, the result object will be a list of two arguments: \begin{itemize} \item \emph{fit1}: same list as presented above with \emph{G}, \emph{NorMix\_param} and \emph{classification} attributes for each gene \item \emph{G.lambda.beta.effect}: matrix presenting the number of times the values of G (number of normal components for a particular gene) have changed between \emph{lambda}=0 and the \emph{lambda.1} value and between \emph{beta}=0 and the \emph{beta.1} value \end{itemize} \end{itemize} \item \emph{fit2}: same list as \emph{fit1} described above with attributes: \emph{G\_initial}, \emph{G},\emph{NorMixParam} and \emph{classification} for each gene and an additional attribute \emph{specificOutlierStep1}: \begin{itemize} \item \emph{specificOutlierStep1}: the condition(s) for which the expression value of the gene is detected as outlier in the first step of the procedure. If NULL, no expression value has been detected in the first step. The second fitting ignores these expression values \end{itemize} \item \emph{specificOutliersStep1}: a list of all genes with the condition(s) (i.e. column id(s)), in which the gene has been detected as specific, NULL if not \item \emph{specificResult}: described below \end{itemize} The \textbf{specificResult} object is of class \textbf{sp\_list}. This list containing 8 attributes \begin{itemize} \item \emph{prefix.file}: the prefix used for this analysis. It will be used by default in the function \Rfunction{getGeneHtmlPage} \item \emph{param.detection}: the parameters used for the two steps \item \emph{fit}: the fit object of the second step (same as \emph{fit2}) \item \emph{L.specific.result}: full detection results (it will be used by the \Rfunction{getFullHtmlSpeCondResult} function). This list contains 7 attributes: \begin{itemize} \item \emph{M.specific.all}: matrix of 0: not selective, 1: selective up-regulated, -1: selective down-regulated; same dimensions as the input matrix of expression values \item \emph{M.specific}: same as \emph{M.specific.all} but reduced to the specific genes. NULL if no gene has been detected as specific \item \emph{M.specific.sum.row}: Number of conditions in which the gene is specific (-1 and 1 values) \item \emph{M.specific.sum.column}: Number of specific genes by conditions \item \emph{L.pv}: list of all genes with a matrix of conditions and the corresponding p-values (if the gene is specific) \item \emph{specific}: vector of size the number of genes with only the values "Not specific" or "Specific" according to the specificity of the gene (used by the html display) \item \emph{L.condition.specific.id}: list of the specific genes with a vector of column numbers (condition ids), for which the gene is specific \end{itemize} \item \emph{L.null}: a list containing a vector of 1 and 0 representing the null distribution. The length of the vector for each gene corresponds to the number of normal components fitting the gene expression value. The list is sorted as the gene in the input matrix of expression values \item \emph{L.mlk}: a list of vectors containing the minimum log-likelihood computed between normal distributions. NULL if the mixture model of the gene is composed of only one component or if the proportion of all components is superior to the \emph{per.2} parameter \item \emph{L.rsd}: a list of vectors containing the standard deviation ratios computed between normal distributions. NULL if the mixture model of the gene has only one component \item \emph{identic.row.ids}: row number(s) from the initial input matrix, which contain identical values for all conditions. These rows are not considered in the analysis \end{itemize} Example: <>= names(generalResult) specificResult=generalResult$specificResult names(specificResult) names(specificResult$L.specific.result) dim(specificResult$L.specific.result$M.specific.all) dim(specificResult$L.specific.result$M.specific) specificResult @ %\item \emph{L.selective.result.profile}: list of three attributes: M.selective.profile, M.selective.profile.unique, M.selective.profile.table to create the profile heatmap and the profile text files outputs %names(specificResult$L.selective.result.profile) \subsection{Text files} Three functions enable to write the results to text files. \begin{itemize} \item\Rfunction{writeSpeCondResult}: To write the three following text files: \begin{itemize} \item The table of genes detected as specific and in which condition they are specific (0: not specific, 1: specific up-regulated, -1: specific down-regulated). The file name is set by the parameter \emph{file.name.profile}, the default name is "specific\_profile.txt". \item The list of the specific genes. The file name is set by the parameter \emph{file.specific.gene}, the default name is: "list\_specific\_probeset.txt". \item The table of the unique specific profiles detected. The file name is set by the parameter \emph{file.name.unique.profile}, the default name is: "specific\_unique\_profile.txt". \end{itemize} \item\Rfunction{writeUniqueProfileSpecificResult}: To write a text file with the unique specific profiles among the conditions. If \emph{full.list.gene} is TRUE, the last column corresponds to the gene's names which have the profile described in the row. \item\Rfunction{writeGeneResult}: To write a text file containing the list of all genes from the input expression values matrix, whether they have been detected as condition specific or not (S/N), for how many conditions in total, in how many conditions as up-regulated, in how many conditions as down-regulated, in which conditions as up-regulated and down-regulated. The argument \emph{gene.names} can be used to select a subset of genes (the default is NULL). \end{itemize} All these functions use the function \Rfunction{getProfile} to obtain the profiles of the specific genes, see documentation and example below <>= L.specific.result.profile=getProfile(specificResult$L.specific.result$M.specific) writeSpeCondResult(specificResult$L.specific.result,file.name.profile= "Example_specific_profile.txt", file.specific.gene="Example_list_specific_gene.txt", file.name.unique.profile="Example_specific_unique_profile.txt") writeUniqueProfileSpecificResult(L.specific.result=specificResult$L.specific.result, file.name.unique.profile="Example_specific_unique_profile.txt", full.list.gene=FALSE) writeGeneResult(specificResult$L.specific.result, file.name.result.gene= "Example_gene_gummary_result.txt", gene.names=rownames(Mexp)[1:10]) @ \subsection{Visualisation, html pages} The Html page functions \Rfunction{getFullHtmlSpeCondResult} and \Rfunction{getGeneHtmlPage} have been presented above. Here we present all the possible arguments of these functions enabling to personalise the result html pages. \subsubsection{getFullHtmlSpeCondResult} Some other arguments can be used in the \Rfunction{getFullHtmlSpeCondResult}; notably the arguments \emph{prefix.file} and \emph{outdir} enable to properly save the different result pages. \begin{itemize} \item \emph{SpeCondResult}: the result object of \textbf{sp\_list} class result of the \Rfunction{SpeCond} functions \item \emph{L.specific.result}: list of results present in the \textbf{specificResult} \textbf{sp\_list} class object , see \Rfunction{SpeCond} or \Rfunction{getSpecificResult} functions \item \emph{param.detection}: the parameter matrix used in the analysis (by \Rfunction{SpeCond}) \item \emph{page.name}: the name of the result html page. The default is "SpeCond\_result" \item \emph{page.title}: the title of the result html page. The default is "Condition-specific analysis results" \item \emph{prefix.file}: a prefix added to the generated file(s) and the \emph{outdir} directory name to linked them to the full result html page. The default is NULL. It is useful to change the prefix when you create a new result page. As you may want to get results with different parameter sets and plots so using a different \emph{SpeCondResult} or \emph{L.specific.result} objects \item \emph{outdir}: the name of the directory in which the generated files will be created. The default is "General\_result" \item \emph{sort.condition}: the way to sort the conditions in the barplot that presents the number of specific genes per condition. Values are "positive", "negative" or "all" determines that the conditions are sorted by the number of specific genes detected as up-regulated, down-regulated or both respectively \item \emph{gene.page.info}: the result of the \Rfunction{getGeneHtmlPage} function. Enables the creation of links between this full result page and the single result pages created by the previous function. The default is "NULL"; no links are created \item \emph{heatmap.profile}: a logical (TRUE/FALSE) whether to print or not a heatmap showing the specific profile of the genes, the default is FALSE \item \emph{heatmap.expression}: a logical (TRUE/FALSE) whether to print or not a heatmap showing the expression of the genes, the default is FALSE \item \emph{heatmap.unique.profile}: a logical (TRUE/FALSE) whether to print or not a heatmap showing the unique specific profile, the default is FALSE \item \emph{expressionMatrix}: Must not be NULL if heatmap.expression=TRUE, must be the same as the input expression matrix, the default is NULL \end{itemize} \textbf{Remark}: One can use either \emph{SpeCondResult} or \emph{L.specific.result} as the \emph{L.specific.result} is included in the \emph{SpeCondResult} object. If \emph{prefix.file} is NULL the \emph{prefix.file} attribute of \emph{SpeCondResult} is used. \subsubsection{getGeneHtmlPage} The function \Rfunction{getGeneHtmlPage} creates one html page by genes and the corresponding plots. As a consequence a large amount of files can be generated. To keep track of the result pages linked to different parameter sets, it is important to use the \emph{prefix.file} arguments. The main index html page will be created in the current directory whereas all the other gene html pages will be created in the \emph{outdir} directory. \\ Here are all the \Rfunction{getGeneHtmlPage} arguments, more importantly the arguments \emph{prefix.file} and \emph{outdir} allow to configure the classification of your results. \begin{itemize} \item \emph{expressionMatrix}: the matrix of expression values initially used \item \emph{specificResult}: the R object result of the \Rfunction{getSpecificProbeset} function \item \emph{name.index.html}: the name of the html index, the default is "index.html" \item \emph{prefix.file}: a prefix added to the generated file(s) and \emph{outdir} directory name to linked to the index file. The default is NULL, the \emph{prefix.file} attribute of the \emph{specificdResult} is used \item \emph{outdir}: The name of the directory in which the generated files will be created. The default is "Single\_result\_pages" \item \emph{gene.html}: a vector of gene names, same as the row names of the \emph{expressionMatrix} object, the default is NULL \item \emph{gene.html.ids}: a vector of integers corresponding to the row numbers in the \emph{expressionMatrix} object for the gene for which you want to create html pages. The default is the 10 first rows (or the number of row of the expressionMatrix if inferior to 10) \end{itemize} \section{Changing the parameter values to improve the detection results} In this section, we will use a simulated dataset to appreciate the effect of the parameters. The principal parameters of the detection are \emph{mlk.2} and \emph{rsd.2}.\\ The dataset is simulated from three different normal distributions. The default expression values for each probeset is randomly generated from a normal distribution of mean=7 and sd=0.6. The probesets 1 to 100 have specific expression values for the conditions 10, 20 and 30 coming from a normal distribution of mean=11 and sd=0.5. The probesets 200 to 300 have specific expression values for the conditions 9, 18 and 27 coming from a normale distribution of mean=13 and sd=0.4. We will change the parameter \emph{mlk.2} and \emph{rsd.2} to improve the specific detection (detecting the two types of specific behavior) starting with the default parameters. (The following code is present the \emph{SpeCond.R} file available on the Bioconductor package page.)\\ \\ Load the simulated dataset: <>= data(simulatedSpeCondData) @ Perform the first analysis using the default parameters: <>= generalResult_S1=SpeCond(simulatedSpeCondData, param.detection=NULL, multitest.correction.method="BY", prefix.file="S1", print.hist.pv=TRUE, fit1=NULL, fit2=NULL, specificOutlierStep1=NULL) specificResult_S1=generalResult_S1$specificResult getFullHtmlSpeCondResult(L.specific.result=specificResult_S1$L.specific.result, page.name="simulatedSpeCondData_results", page.title="Condition specific results mlk 25 default", prefix.file="S1", sort.condition="all") genePageInfo_S1=getGeneHtmlPage(simulatedSpeCondData, specificResult_S1, name.index.html="index_simulatedSpeCondData.html", prefix.file="S1", gene.html.ids=c(1:20)) @ We open \emph{S1\_simulatedSpeCondData\_results.html} and \emph{S1\_index\_simulatedSpeCondData.html} files present in the current directory to observe the results.\\ Looking at the first results, only few probeset are detected as specific. If you look at the gene result pages you can observe that the \emph{mlk} value of the Step2 in particular is too high as the Normal 2 is still part of the null distribution whereas it can be consider as separated from the Normal 1. As a consequence we change the \emph{mlk.2} parameter to 10. Additionally we are using here the result of the fitting from the first detection as described in the Stepwise analysis paragraph.\\ \\ Change the parameter: <>= param.detection10=createParameterMatrix(param.detection=param.detection, mlk.2=10) param.detection10 @ Perform the analysis changing the prefix value: <>= generalResult_S2=SpeCond(simulatedSpeCondData, param.detection=param.detection10, multitest.correction.method="BY", prefix.file="S2", print.hist.pv=TRUE, fit1=generalResult_S1$fit1, fit2=generalResult_S1$fit2, specificOutlierStep1=generalResult_S1$specificOutliersStep1) specificResult_S2=generalResult_S2$specificResult genePageInfo_S2=getGeneHtmlPage(simulatedSpeCondData, specificResult_S2, name.index.html="index_simulatedSpeCondData.html", prefix.file="S2", gene.html.ids=c(1:20,200:250)) getFullHtmlSpeCondResult(L.specific.result=specificResult_S2$L.specific.result, param.detection=param.detection10, page.name="simulatedSpeCondData_results", page.title="Condition specific results mlk 10",prefix.file="S2", sort.condition="all", gene.page.info=genePageInfo_S2) @ Looking at \emph{S2\_simulatedSpeCondData\_results.html} and \emph{S2\_index\_simulatedSpeCondData.html} files we observe that the detection improves but we are still missing some specific probesets with ids between 1 and 100 and more importantly we do not yet detect several specific probe with ids between 200 and 300 for which their specific values are more spread than the core of the expression values. To detect them we decrease \emph{mlk.2} and increase \emph{rsd.2} to 8 and to 0.3 respectively.\\ \\ Change the parameter: <>= param.detection8_0.3=createParameterMatrix(param.detection=param.detection, mlk.2=8, rsd.2=0.3) param.detection8_0.3 @ Perform the analysis changing the prefix value: <>= generalResult_S3=SpeCond(simulatedSpeCondData, param.detection=param.detection8_0.3, multitest.correction.method="BY", prefix.file="S3", print.hist.pv=TRUE, fit1=generalResult_S1$fit1, fit2=generalResult_S1$fit2, specificOutlierStep1=generalResult_S1$specificOutliersStep1) specificResult_S3=generalResult_S3$specificResult genePageInfo_S3=getGeneHtmlPage(simulatedSpeCondData, specificResult_S3, name.index.html="index_simulatedSpeCondData.html", prefix.file="S3", gene.html.ids=c(1:20,195:310)) getFullHtmlSpeCondResult(L.specific.result=specificResult_S3$L.specific.result, param.detection=param.detection8_0.3, page.name="simulatedSpeCondData_results", page.title="Condition specific results mlk 8, rsd 0.3", prefix.file="S3", sort.condition="all",gene.page.info=genePageInfo_S3) @ <>= setwd(oldir) @ Looking at \emph{S3\_simulatedSpeCondData\_results.html} and \emph{S3\_index\_simulatedSpeCondData.html} files we observe that the detection has largely improved. Change the \emph{gene.html.ids} or \emph{gene.html} values to generate different html pages and keep improving the detection results. \section{Advice} % These are some little tricks mainly for people not very familiar with R that can help you during the analysis. % \subsection*{Save your data!} The fitting of the normal component using mclust can be done only once, in each of the steps of the procedure, as a consequence it is beneficial to save the fitting results as shown below. <>= save(fit1,file="fit1.RData") ## load("fit1.RData") @ The parameter effect can be evaluated separately for the two steps. Saving the outliers detected by the first steps allows to test only the effect of the second set of parameters. <>= save(specificOutlierStep1,file="specificOutlierStep1.RData") ## load("specificStep1.RData") save(fit2,file="fit2.RData") ## load("fit2.Rdata") save(specificResult,file="specificResult.RData") @ \subsection*{Session Info} <>= toLatex(sessionInfo()) @ \section{References} Florence MG Cavalli, Juan-Manuel Vaquerizas, Richard Bourgon, Nicholas M Luscombe (in preparation)\\ \\ C. Fraley and A. E. Raftery, Model-based clustering, discriminant analysis, and density estimation, \emph{Journal of the American Statistical Association, Vol. 97, pages 611-631 (2002)}.\\ \\ C. Fraley and A. E. Raftery, MCLUST Version 3 for R: Normal Mixture Modeling and Model-based Clustering, Technical Report No. 504, Department of Statistics, University of Washington, September 2006.\\ \end{document}