%\VignetteIndexEntry{Main vignette:Fletcher2013b} %\VignetteKeywords{Fletcher2013b} %\VignettePackage{Fletcher2013b} \documentclass[11pt]{article} %\usepackage{amsmath} %\usepackage[pdftex]{graphicx} %\usepackage{layouts} %\usepackage{bm} \usepackage{Sweave,fullpage} \usepackage{float} \SweaveOpts{keep.source=TRUE,eps=FALSE,width=4,height=4.5, include=FALSE} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\textit{#1}} \usepackage{subfig} \usepackage{color} \usepackage{hyperref} \definecolor{linkcolor}{rgb}{0.0,0.0,0.75} \hypersetup{colorlinks=true, linkcolor=linkcolor, urlcolor=cyan} \setlength{\skip\footins}{15mm} \bibliographystyle{unsrt} \title{ Vignette for \emph{Fletcher2013b}: master regulators of FGFR2 signalling and breast cancer risk. } \author{ Mauro AA Castro\footnote{joint first authors}, Michael NC Fletcher\footnotemark[1], Xin Wang, Ines de Santiago, \\ Martin O'Reilly, Suet-Feung Chin, Oscar M Rueda, Carlos Caldas, \\ Bruce AJ Ponder, Florian Markowetz and Kerstin B Meyer \thanks{Cancer Research UK - Cambridge Research Institute, Robinson Way Cambridge, CB2 0RE, UK.} \\ \texttt{\small florian.markowetz@cancer.org.uk} \\ \texttt{\small kerstin.meyer@cancer.org.uk} \\ } \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents <>= options(width=70) @ %---------------------------- %---------------------------- \newpage %---------------------------- %---------------------------- \section{Description} The package \Rpackage{Fletcher2013b} contains a set of transcriptions networks and related datasets that can be used to reproduce the results in Fletcher et al. \cite{Fletcher2013}. The first part of this study is available in the package \Rpackage{Fletcher2013a}, which contains the time-course gene expression data and has been separated for better organization on the data distribution. Here we provide the R scripts to reproduce the bioinformatics analysis. Please refer to Fletcher et al. \cite{Fletcher2013} for more details about the biological background and experimental design of the study. \section{Data sources for regulatory network inference} The METABRIC breast cancer gene expression dataset \cite{Curtis2012} was used in two cohorts, a discovery set (n = 997) and a validation set (n = 995). The METABRIC normal breast expression dataset (n = 144) was used as a non-cancer, tissue control and a T-cell acute lymphoblastic leukaemia gene expression dataset (n = 57) was included as a non-related tissue, cancer control \cite{Vlierberghe2011}. These data sets are publicly available at: \begin{itemize} \item METABRIC discovery set \href{https://www.ebi.ac.uk/ega/studies/EGAS00000000083}{EGAD00010000210} \item METABRIC validation set \href{https://www.ebi.ac.uk/ega/studies/EGAS00000000083}{EGAD00010000211} \item METABRIC normals \href{https://www.ebi.ac.uk/ega/studies/EGAS00000000083}{EGAD00010000212} \item T-cell ALL \href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33469}{GSE33469} \end{itemize} \section{Reconstruction of the breast cancer transcription networks} Due to the large-scale datasets and the parallel processing required to compute the transcription networks, this package provides 4 pre-processed networks named: \Robject{rtni1st} (METABRIC discovery set), \Robject{rtni2nd} (METABRIC validation set), \Robject{rtniNormals} (METABRIC normals) and \Robject{rtniTALL} (T-cell ALL). These R objects will be required to reproduce the analyses along the vignette: \begin{small} <>= library(Fletcher2013b) data(rtni1st) data(rtni2nd) data(rtniNormals) data(rtniTALL) @ \end{small} Next we describe the main methods used to compute the transcription networks, and in the R package \Rpackage{RTN} we provide a short tutorial demostrating the inference pipeline. \subsection{Transcription network inference pipeline} In order to make all methods used in this study available for different users, we implemented the R package called \Rpackage{RTN: reconstruction of transcriptional networks and analysis of master regulators}, which is designed for the reconstruction of transcriptional networks using mutual information \cite{Margolin2006a}. It is implemented by S4 classes in \emph{R} \cite{Rcore} and extends several methods previously validated for assessing transcriptional regulatory units, or regulons (\textit{e.g.} MRA \cite{Carro2010} and GSEA \cite{Subramanian2005}. The main advantage of using \Rpackage{RTN} lies in the provision of a statistical pipeline that runs the network inference in a stepwise process together with a parallel computing algorithm that demands high performance. The \Rpackage{RTN} package should be installed prior to running this vignette. Additionally, in \Rpackage{RTN} we provide a tutorial showing how to compute a transcriptional network using a toy example, which is generated with default options and pValueCutoff=0.05. Here, the pre-processed breast cancer transcription networks were generated by a more stringent threshold, with pValueCutoff=1e-6. To reproduce these large networks we suggest as minimum computational resources a cluster >= 8 nodes and RAM >= 8 GB per node (specific routines should be tuned for the available resources). The inference pipeline is executed in four steps: (\textbf{\textit{i}}) check the consistency of the input data and remove non-informative probes, (\textbf{\textit{ii}}) compute the mutual information and remove the non-significant associations by permutation analysis, (\textbf{\textit{iii}}) remove unstable interactions by bootstrap and (\textbf{\textit{iv}}) apply the data processing inequality filter. These steps are described next. \subsection{Pre-processing of gene expression data} Non-informative microarray probes with low dynamic range of expression were removed from the gene expression matrices. This procedure aims to filter out probes that exhibit low coefficient of variation (CV), below the CV median value. For breast cancer samples, this CV threshold yields a good overlap (>90\%) with the corresponding differential expression analysis of cancer vs. normal cohort samples. The differential expression analysis therefore was used for quality control purposes. The advantage of using the CV here is that the same procedure could be applied across all samples, guaranteeing statistical independence between cancer and normal cohorts. In an alternative approach, for a given gene with multiple probes the \Rpackage{RTN} package selects the probe exibiting the maximum CV, which yields higher gene representativity. We have carried out both approaches and the overall results converged to the same scenario as described in \cite{Fletcher2013}. \subsection{Mutual information (MI) computation} The MI algorithm used in the \Rpackage{RTN} package extends the methods available in \Rpackage{minet} \cite{Meyer2008}. The structure of the regulatory network was derived by mapping all significant interactions between TF and target probes. The TF list was derived from that used in a previous ARACNe/MRA publication \cite{Carro2010} by converting Affymetrix probe IDs into the equivalent probes on the Illumina Human-HT12 Expression BeadChip. Non-significant interactions were removed by permutation analysis. Unstable interactions were additionally removed by bootstrap analysis in order to create a consensus bootstrap network (referred to as the transcriptional network (TN)). \subsection{Application of data processing inequality (DPI)} DPI was applied to the RN with tolerance = 0.0 to remove interactions likely to be mediated by another TF \cite{Margolin2006b}. As DPI removes the weakest edge of each network triplet, the vast majority of indirect interactions are likely to be removed. We also tested DPI tolerance ranging from 0.1 to 0.5 in order to assess the stability of the regulatory units identified in the transcriptional networks. Both the TN and the post-DPI network (filtered transcriptional network) were used in the MRA analysis. \section{Master Regulator Analysis (MRA)} The application of MRA has been described in detail in a previous publication \cite{Carro2010}. MRA computes the overlap between two lists: the TFs and their candidate regulated genes (referred to as regulons) and the gene expression signatures from other sources. In this case, the MRA analytical pipeline estimates the statistical significance of the overlap between all the regulons in each TN using a hypergeometric test. The stability of MRA results was tested by comparing the MRA results between the filtered and unfiltered TN networks, removing master regulators inconsistent with the previous analysis (\textit{i.e.} selected regulons must be significant in both TN networks). Next we retrieve one of the FGFR2 signatures (\textit{i.e.} differentially expressed genes from \textit{Exp1}) and run the MRA analysis on METABRIC discovery set: \begin{small} <>= sigt <- Fletcher2013pipeline.deg(what="Exp1",idtype="entrez") MRA1 <- Fletcher2013pipeline.mra1st(hits=sigt$E2FGF10, verbose=FALSE) @ \end{small} We provide the following functions to run the MRA analysis on the other 3 TN networks: \begin{small} <>= MRA2 <- Fletcher2013pipeline.mra2nd(hits=sigt$E2FGF10) MRA3 <- Fletcher2013pipeline.mraNormals(hits=sigt$E2FGF10) MRA4 <- Fletcher2013pipeline.mraTALL(hits=sigt$E2FGF10) @ \end{small} Each of these MRA pipelines constitutes a wrapper function that uses the pre-processed transcriptional networks together with the MRA algorithm implemented in the \Rpackage{RTN} package. Therefore, different signatures can also be interrogated on METABRIC datasets using these functions (for detailed description and default settings, please see the package's documentation). \section{Transcriptional network of consensus master regulators} Next, the pipeline function plots a graph representing all regulons identified in the consensus MRA analysis. The network is generated by the R package \Rpackage{RedeR} \cite{Castro2012} and should require some user input in order to tune the layout in the software's interface (Figure \ref{fig1}). \begin{small} <>= Fletcher2013pipeline.consensusnet() @ \end{small} \textit{As a suggestion, set 'anchor' to the master regulators at the end of the 'relax' algorithm for a better layout control! right-click the square nodes and then assign 'transform' and 'anchor'!!!} \section{Enrichment maps} In addition to the clustering analysis, the regulons were also represented in an association map showing the degree of similarity among them, the number of common targets. Likewise, the similarity is assessed by the Jaccard coefficient, which is plotted in the association map by the R package \Rpackage{RedeR} \cite{Castro2012}. In the next pipeline, a graph representation is generated for regulons exhibiting $JC \geq 0.4$ (Figure \ref{fig2}). \begin{small} <>= Fletcher2013pipeline.enrichmap() @ \end{small} \textit{Suggestion: zoom in/out with a scroll wheel, and adjust the graph settings interactively!} %---------------------------- %---------------------------- %---------------------------- %---------------------------- %%%%%% %Fig1% %%%%%% \begin{figure}[h!] \begin{center} \includegraphics[width=0.8\textwidth]{fig1.pdf} \caption{\label{fig1}% \textbf{Breast cancer transcriptional network (TN) enriched for the FGFR2 responsive genes.} The network shows the 5 MRs, each one comprising one TF (square nodes) and all inferred targets (round nodes) applying a DPI threshold of 0.01.} \end{center} \end{figure} \clearpage %%%%%% %Fig2% %%%%%% \begin{figure}[h!] \begin{center} \includegraphics[width=0.9\textwidth]{fig2.pdf} \caption{\label{fig2}% \textbf{Enrichment map derived from the relevance network in breast cancer.} Edge width depics the overlap of regulons, and shades of orange indicate degree of enrichment of a regulon in at least one of the three FGFR2 gene signatures.} \end{center} \end{figure} %---------------------------- %---------------------------- \clearpage %---------------------------- %---------------------------- \section{Session information} \begin{scriptsize} <>= print(sessionInfo(), locale=FALSE) @ \end{scriptsize} \newpage \bibliography{bib} \end{document}