%\VignetteIndexEntry{Protein quantification in LC-MS, SRM, DIA experiments} %\VignetteKeywords{Mass Spectrometry, Proteomics} %\VignettePackage{MSstats} \def\todo#1{{\color{red}[TODO: #1]}} \def\meena#1{{\color{green}[MEENA: #1]}} \def\m{$\tt{MSstats}$~} \documentclass[11pt]{article} \usepackage{Sweave} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \usepackage{color} \definecolor{darkblue}{rgb}{0.0,0.0,0.75} \usepackage[% baseurl={http://www.bioconductor.org},% pdftitle={Protein quantification in LC-MS, SRM, DIA experiments with label-free or labeled synthetic peptides },% pdfauthor={Meena Choi},% pdfkeywords={Bioconductor},% pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,% pagecolor=darkblue,raiselinks,plainpages,pdftex]{hyperref} \def\code#1{ \vspace{-7mm} \begin{flushleft} \colorbox{ColorLightGray}{ \begin{minipage}{6.6in} \begin{flushleft} \small \tt #1 \end{flushleft} \end{minipage} } \end{flushleft} } \def\figref#1{Figure~\ref{fig:#1}} \def\secref#1{Section~\ref{sec:#1}} \def\tabref#1{Table~\ref{tab:#1}} \usepackage{amsmath,amsfonts,amsthm} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} % Sweave options for inside << >>= : eval = FALSE (don't run), echo = FALSE (don't show in tex file), results = hide (show no output) \begin{document} \SweaveOpts{concordance=TRUE} \title{Protein significance analysis for mass spectrometry-based proteomics} \author {Meena Choi ({\tt choi67@purdue.edu}) \\ \\http://msstats.org} \maketitle \tableofcontents %============================================ \newpage %\section{Task} %\begin{enumerate} %\item Finding differentially abundant proteins across conditions. %\item Quantification for group or sample %\item Planning future experimental design (e.g. sample size calculation). %\end{enumerate} %============================================ %============================================ \section{Statistical relative protein quantification for LC-MS, SRM and DIA} MSstats is an open-source R-based package for statistical relative quantification of peptides and proteins in mass spectrometry-based proteomics experiments. \subsection*{Applicability} \m 2.0 is applicable to multiple types of sample preparation, including label-free workflows, workflows that use stable isotope labeled reference proteins and peptides, and workflows that use fractionation. It is applicable to global Liquid Chromatography coupled with Mass Spectrometry (LC-MS), targeted Selected Reaction Monitoring (SRM) and Data-Independent Acquisition (DIA or SWATH-MS). It is applicable to experiments that make arbitrary complex comparisons of experimental conditions or times. \m 2.0 is currently not applicable to experiments that compare multiple metabolically labeled endogenous samples within a same run. It is not applicable to experiments with iTRAQ labeling. These experiments will be supported in the future. \subsection*{Statistical functionalities} \m 2.0 performs three analysis steps. The first step, {\it data processing and visualization}, transforms and normalizes the intensities of the peaks, and generates workflow-specific and customizable numeric summaries for data visualization and quality control. The second step, {\it statistical modeling and inference}, automatically detects the experimental design (e.g. group comparison or time course, presence of labeled reference peptides or proteins) from the data. It then reflects the experimental design, the type of spectral acquisition strategy, and the scope of conclusions (e.g. restricted to the subjects, or expanded to the underlying populations), and fits an appropriate linear mixed model by means of $\tt{lm}$ and $\tt{lmer}$ functionalities in R. The model is used to detect differentially abundant proteins or peptides, or to summarize the protein or peptide abundance in a single biological replicate or condition (that can be used, e.g. as input to clustering or classification). The third step, {\it statistical experimental design}, views the dataset being analyzed as a pilot study, utilizes its variance components, and calculates the minimal number of replicates necessary to achieve a pre-specified statistical power. \subsection*{Interoperability with existing computational tools} \m takes as input data in a tabular CVS format, which can be produced by any spectral processing tool such as SuperHirn, MaxQuant, Progenesis, MultiQuant, OpenMS or OpenSWATH. For statistics experts, \m 2.0 satisfies the interoperability requirements of Bioconductor, and takes as input data in the $\tt{MSnSet}$ format~\citep{Gatto:2012tj}. The command line-based workflow is partitioned into a series of independent steps, that facilitate the development and testing of alternative statistical approaches. \m 2.0 complies with the maintenance and documentation requirements of Bioconductor. Finally, \m 2.0 is available as an external tool within Skyline ~\citep{MacLean:2010hd}. The external tool support within Skyline manages \m installation, point-and-click execution, parameter collection in Windows forms and output display. Skyline manages the annotations of the experimental design, and the processing of raw data. It outputs a custom report, that is fed as a single stream input into {\tt MSstats}. This design buffers proteomics users from the details of the R implementation, while enabling rigorous statistical modeling. \subsection*{Availability} \m 2.0 is available under the Artistic-2.0 license at \url{msstats.org}. \m as an external tool is available at \url{http://proteome.gs.washington.edu/software/Skyline/tools.html}. \m 2.0 is currently also under evaluation by Bioconductor (\url{http://www.bioconductor.org}). \subsection*{Changes from MSstats 1.0 and SRMstats} For special cases of some experimental workflows, the underlying statistical methodology was previously described, and implemented in R-based packages MSstats 1.0 and SRMstats ~\citep{Chang:2012uv,Clough:2012bc,Surinova:2013jk}. \m 2.0 supersedes MSstats 1.0 and SRMstats, in that it implements all the analysis steps that are available in these packages. In addition, it extends the methodology and the implementation, as follows. \begin{itemize} \item Unlike MSstats 1.0 (limited to label-free shotgun LC-MS experiments) and SRMstats (limited to SRM experiments), \m 2.0 integrates the statistical analysis steps across two sample preparation workflows (label-free, and using labeled reference proteins or peptides), and three spectral acquisition strategies (global LC-MS, targeted SRM and data-Independent DIA or SWATH-MS). The integration enables a greater flexibility of statistical modeling for each dataset. \item \m includes new statistical capabilities: \begin{itemize} \item {\it Data processing:} quantification of between-run interferences, custom imputation of missing values by low-intensity signals, custom removal of spectral features. \item {\it Data visualization:} more flexible plots of the protein profiles using $\tt{ggplot2}$ functionalities in R, in particular displaying pre-specified proteins, customizing axis range, label and angle. \item {\it Fitting linear mixed effects models:} fit appropriate linear models for specialized circumstances (e.g. proteins with a single replicate in a condition, proteins with a single feature, proteins with various patterns of missing intensities in groups or runs). For label-free experiments, fit appropriate models for experiments with and without technical replicates. Model the unequal variance between features using iterative weight least squares. \item {\it Diagnostics of the quality of model fit:} residual plots and normal quantile-quantile plots across features of a protein, and separately for each feature of a protein to detect deviations from the model assumptions such as unequal variance. \item {\it Calculation of the sample size:} support of multiple modeling options in the analysis of the future experiment, such as expanded or restricted scope of biological replicates, and experiments with or without systematic interferences. \item {\it Summarization of protein abundance in a subject or in a condition on a relative scale:} support of label-free experiments and experiments with labeled reference proteins or peptides. Support of multiple output formats (long format and data matrix). \end{itemize} \item MSstats 2.0 facilitates the interoperability with existing computational tools. In addition to taking as input a table in a CSV format, it can now be used as an external tool with Skyline by researchers who are unfamiliar with R. It also supports input in the $\tt{MSnSet}$ format, and partitions the analysis into a series of separate well-defined steps for interoperability with Bioconductor. \end{itemize} \vspace{-0.5cm} @ %++++++++++++++ % FIGURE: overview \begin{figure}[!h] \centering \small \includegraphics[scale=.9]{MSstatsOverview.pdf} \label{fig:overview} \caption{Overview of workflow in \m} \end{figure} To get started with this package in R, first load the package and then visit the help section of MSstats-package first by the following code. \begin{small} <>= library(MSstats) @ \end{small} \begin{small} <>= ?MSstats @ \end{small} %============================================ \section{Allowable data formats} \subsection{SRM with stable isotope labeled reference peptides} \begin{enumerate} \item[(1)] {\bf 10-column format} The preferred structure of data for use in \m is ``long" format with 10 columns such as the report from other software for identifying peaks. The purpose of MSstats is for statistical analysis after peak identification and quantitation. Therefore, input for MSstats would be output of other software(such as Skyline, MaxQuant, mProphet, openSWATH and so on) for reading raw files such as mzXML, wiff files, or other types of files and identifying peaks. That software generates the report as .csv file. The raw data is required to contain variable of ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity. The variable names should be fixed but are not case-sensitive. Using MSstats report format in Skyline, this required input data will be automatically generated. %% each variable meaning \begin{enumerate} \item {\tt ProteinName}: This column needs information about Protein ID. Statistical analysis for each unique labeling in this column will be done. If you want peptide-level analysis, you can use peptide ID in this column. \item {\tt PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge}: The combination of these 4 columns will be used as feature information. If the information of one or more columns is not available for the original raw data, please retain the column variables and type in fixed value (especially LC-MS data, See below LC-MS example input.). For example, the original raw data does not contain the information of {\tt ProductCharge}, we retain the column {\tt ProductCharge} and type in 0 for all transitions in {\tt RawData}. \item {\tt IsotopeLabelType}: This column need to say whether this measurement is based on the endogenous peptides (use 'L') or reference(labeled synthetic) peptides (use 'H'). \item {\tt Condition}: For case-control experiments, this should include case(disease) or control information. For time-course experiments, this should have time points (T1,T2,...) for each measurement. If you have combination of case-control and time-course experiments, this should also combination of these two, such as Disease\_T1, Disease\_T2, Control\_T1,and Control\_T2. \item {\tt BioReplicate}: This should label with unique patient ID (i.e., same patients should label with the same ID). %%% case-control and time-course \m does not require that technical replicates be specified in the data. They should be labeled the same as the corresponding biological replicate. \m detects the presence of technical replicates and accounts for them in the model-based analysis. \item {\tt Run}: This means MS experiment measuremnts. If the same individual(patient) repetitively measured in the same Condition, {\tt BioReplicate} values are the same, but {\tt Run} values are different based on repetition. If all the transitions from one sample is not measured in a single MS run, but split into multiple MS runs due to the limited transitions per run, please also specify this information. \item {\tt Intensity}: This is required to be original signal without any log transformation and can be specified as the peak of height or the peak of area under curve. Any other quantitative representative of abundance can be used instead. \end{enumerate} %++++++++++++++ % FIGURE: example data An example of a dataset in SRM is shown in~\figref{inputSRM}. \begin{figure}[!h] \centering \small \includegraphics[totalheight=1.5in, width=5.5in]{requiredInput.jpg} \caption{Example dataset from SRM experiment (as a .CSV file) in ``long" format. Each row corresponds to a single intensity. (See section 3 in details.) \label{fig:inputSRM}} \end{figure} %++++++++++++++ % table: example design of experiment \item[(2)] {\bf How to assign {\tt Condition, BioReplicate, Run} columns from design of experiment} {\tt Condition, BioReplicate, Run} columns are based on your design of experiments. {\tt Condition} means Disease groups, different time points or the combination of groups and time points. {\tt BioReplicate} needs sample or patient ID. {\tt Run} means MS measurement run. \begin{enumerate} \item {\bf Case-control experiment}\\ Let's assume that there are two groups, disease and control, and each has 2 individuals (biological replicates), which are total 4 individual. Also we repeat to measure each individual twice as technical replicates. Then the possible values in these 3 columns are in ~\tabref{designCase}. \begin{table}[h!] \small \begin{center} \begin{tabular}{|c|c|c|} \hline Condition & BioReplicate& Run \\ \hline \hline Disease & sample1 & 1\\ \hline Disease & sample1 & 2\\ \hline Disease & sample2 & 3\\ \hline Disease & sample2 & 4\\ \hline Control & sample3 & 5\\ \hline Control & sample3 & 6\\ \hline Control & sample4 & 7\\ \hline Control & sample4 & 8\\ \hline \end{tabular} \caption{Possible assignment for design of experiment information in case-control \label{tab:designCase}} \end{center} \end{table} \vspace{-0.5cm} If there is no technical replicate, {\tt BioReplicate} and {\tt Run} will be the same. In case-control, {\tt BioReplicate} is nested in {\tt Condition} and {\tt Run} is the combination of biological replicates and technical replicates. \vspace{0.3cm} \item {\bf Time-course experiment}\\ In time-cour experiment, the same individual is repetitively measured across conditions (time points). Let's assume that there are four individuals (biological replicates). We measure each individual at two time points, time1 and time2, repetively. Then the values in these 3 columns are one of rows in ~\tabref{designTime}. \begin{table}[h!] \small \begin{center} \begin{tabular}{|c|c|c|} \hline Condition & BioReplicate& Run \\ \hline \hline Time1 & sample1 & 1\\ \hline Time2 & sample1 & 2\\ \hline Time1 & sample2 & 3\\ \hline Time2 & sample2 & 4\\ \hline Time1 & sample3 & 5\\ \hline Time2 & sample3 & 6\\ \hline Time1 & sample4 & 7\\ \hline Time2 & sample4 & 8\\ \hline \end{tabular} \caption{Possible assignment for design of experiment information in time-course \label{tab:designTime}} \end{center} \end{table} \vspace{-0.5cm} If there is no technical replicate, {\tt Run} column is the combination of {\tt Condition} and {\tt BioReplicate}. Then if there are technical replicates, {\tt Run} column is the combination of {\tt Condition}, {\tt BioReplicate}, and technical replicates. \end{enumerate} \item[(3)] {\bf {\it MSnSet} format } \m also allows data to be in the format of {\tt MSnSet}, suggested general format on the proteomics in \m package. A MSnSet contains several components, of which the most commonly accessed are the assayData, phenoData, and the featureData components. The assayData is a matrix of intensities, where each row corresponds to the unit of analysis, a peptide feature; the columns correspond to sample ids. The phenoData contains columns that describe the biological samples, conditions in the experiment. The featureData contains columns describing the peptide features, such as the name or id of the underlying protein and information of features. For data stored as an expression set, group labels information are required. If more than one variable is listed in group argument, a concatenated variable is created based on the variables. The remaining information (peptide feature ids, biological replicate ids, and abundance) can be extracted from the rows and columns of featureData and phenoData or the users can assign them based on their design of experiments. \end{enumerate} %====================== \subsection{Label-free LC-MS} \begin{enumerate} \item[(1)] {\bf 10-column format} For label-free LC-MS, the required input is 10-column format, which is the same as for SRM in section 2.1 (1). Only difference between expeirments in input format of MSstats is how to assign feature information. Feature information will consist of 4 columns, {\tt PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge}. If any value of 4 columns is different, it will make different feature ID. The datasets from LC-MS experiment do not have multiple fragments per peptide. Therefore {\tt PrecursorCharge, FragmentIon, ProductCharge} columns might not be available. You can retain the column variables with any fixed value (such as NA or 0). In example data set, peptide feature ID is used in {\tt FragmentIon} column to distinguish feature.(~\figref{inputLCMS}) \begin{figure}[!h] \centering \small \includegraphics[totalheight=1.5in, width=5.5in]{required_LCMS.png} \caption{Example dataset from LC-MS shortgun experiment.(See section 4 in details.) \label{fig:inputLCMS}} \end{figure} %++++++++++++++ % table: example design of experiment \item[(2)] {\bf How to assign {\tt Condition, BioReplicate, Run} columns from design of experiment} For label-free LC-MS, it is the same as for SRM in section 2.1 (2). \item[(3)] {\bf {\it MSnSet} format } If there are all required information in {\it MSnSet}, it is the same as for SRM in section 2.1 (3). \end{enumerate} %====================== \subsection{Label-free DIA} \begin{enumerate} \item[(1)] {\bf 10-column format} For label-free DIA, the required input is 10-column format, which is the same as for SRM in section 2.1 (1). The values of required columns for DIA can be extracted from output of other software after identifing and quantifying peaks, such as openSWATH. Each row in required input for \m corresponds to single intensity value. Only difference between expeirments in input format of MSstats is how to assign feature information. For example, openSWATH provides the information about 4 columns, {\tt PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge}. In this case, the format of input is the same as SRM experiment. If some variables are not available or you want to use feature ID itself, use feature ID values in {\tt FragmentIon} or {\tt ProductCharge}. In example data set, feature ID (or transition ID) is used in {\tt ProductCharge} column to distinguish feature.(~\figref{inputDIA}) \begin{figure}[!h] \centering \small \includegraphics[totalheight=1.5in, width=5.5in]{required_DIA.png} \caption{Example dataset from DIA experiment.(See section 5 in details.) \label{fig:inputDIA}} \end{figure} %++++++++++++++ % table: example design of experiment \item[(2)] {\bf How to assign {\tt Condition, BioReplicate, Run} columns from design of experiment} For label-free DIA, it is the same as for SRM in section 2.1 (2). \item[(3)] {\bf {\it MSnSet} format } If there are all required information in {\it MSnSet}, it is the same as for SRM in section 2.1 (3). \end{enumerate} \vspace{0.2in} %============================================ \section{Example workflow with label-based SRM: time-course investigation of {\it S. Cerevisiae}} %We demonstrate the use of the software in a case study: a labeled reference peptide SRM experiments of yeast.%%(\cite{thaminy11}) \subsection{Experimental design} The example dataset is from a label-based SRM experiment of a time course yeast study. This is a partial data set obtained from a published study ~\citep{Picotti09}. The experiment targeted 45 proteins in the glycolysis/gluconeogenesis/TCA cycle/glyoxylate cycle network, which spans the range of protein abundance from less than 128 to 10E6 copies per cell. Three biological replicates were analyzed at ten time points (T1-T10), while yeasts transited through exponential growth in a glucose-rich medium (T1-T4), diauxic shift (T5-T6), post-diauxic phase (T7-T9), and stationary phase (T10). Prior to trypsinization, the samples were mixed with an equal amount of proteins from the same N15-labeled yeast sample, which was used as a reference. Each sample was profiled in a single mass spectrometry run, where each protein was represented by up to two peptides and each peptide by up to three transitions. The goal of this study is to detect significantly change in protein abundance across time points. Transcriptional activity under the same experimental conditions has been previously investigated by (DeRisi et. al., 1997). Genes coding for 29 of the proteins are differentially expressed between conditions similar to those represented by T7 and T1 and could be treated as external sources to validate the proteomics analysis. In this exampled data set, two of the targeted proteins are selected and validated with gene expression study: Protein IDHC (gene name IDP2) is differentially expressed in time point 1 and time point 7, whereas, Protein PMG2 (gene name GPM2) is not. The protein names are based on Swiss Prot Name. This dataset is stored in the package in a data structure {\tt RawData}. To know about example dataset, {\tt RawData}, visit the help section of RawData first by the following code. \begin{small} <>= ?RawData @ \end{small} %============================================ \subsection{Reading the data} Labeled reference peptide SRM data from two proteins identified in yeast study is stored in "long" format as a data.frame labeled {\tt RawData} in \m; the data can be accessed once the package has been installed and loaded in {\tt R}. \begin{footnotesize} <<>>= head(RawData) @ \end{footnotesize} %============================================ \subsection{Pre-processing data and quality control of MS runs} \begin{enumerate} \item[(1)] {\bf Data processing steps and options} All data need to pre-processing prior to statistical analysis. Possible steps may include: \begin{itemize} \item logarithm transformation with base 2 (default) or 10 of the intensities \item normalization to remove systematic bias between MS run after logarithm transformation. For label-based experiments, constant normalization is performed based on reference signals across runs among all proteins. For label-free experiments, constant normalization based on endogenous signals across runs among all proteins is performed. Therefore users need to be careful for label-free experiments. In this case, normalization with standard protein or any other normalization has to be done prior to use of the package. If normalization=FALSE, no normalization is performed. \item calculation of between-run-interference score (based on the correlation between mean of peptide by run and intensity). It detects interference for further statistical model. \end{itemize} After pre-processing data, the summary of experimental design will be showed. When a transition is missing completely in a condition or a MS run, a warning message is sent to the console notifying the user of the missing transitions. The quantitative data after data pre-processing and quality control of MS runs not only contain the same variable from the raw data, but with addition variables for statistical model fitting and group comparison. For examples, Variable ABUNDANCE represents the final measurement, which could be normalized or not depending on the options you specified in {\tt dataProcess}. Default option in {\tt dataProcess} is with log2 transformation and normalization. \\ To get started with this function, visit the help section of dataProcess first by the following code. \begin{small} <>= ?dataProcess @ \end{small} \begin{small} <<>>= QuantData<-dataProcess(RawData) @ \end{small} \begin{footnotesize} <<>>= head(QuantData) @ \end{footnotesize} %============================================ \item[(2)]{\bf Visualization for explanatory data analysis} To illustrate the quantitative data after data-preprocessing and quality control of MS runs, dataProcessPlots takes the quantitative data from function ({\tt dataProcess}) as input and can generates three types of figures. \begin{itemize} \item Profile Plot (~\figref{Profile}): identify the potential sources of variation of each protein. X-axis is run. Y-axis is log-intensities of transitions. Reference/endogenous signals are in the left/right panel. Line colors indicate peptides and line types indicate transitions. \item QC Plot (~\figref{QC}) : illustrate the systematic bias between MS runs. After normalization, the reference signals for all proteins should be stable across MS runs. X-axis is run. Y-axis is log-intensities of transition. Reference/endogenous signals are in the left/right panel. The pdf file contains (1) QC plot for all proteins and (2) QC plots for each protein separately. \item Condition Plot (~\figref{Condition}): illustrate the systematic difference between conditions. X-axis is condition. Y-axis is log ratio of endogenous over reference. For label-free, Y-axis is log intensity of endogenous. If scale is TRUE, the levels of conditions is scaled according to its actual values at x-axis. Red points indicate the mean of log ratio for each condition. If interval is "SE", blue error bars indicate the confidence interval with 0.95 significant level for each condition. If interval is "SD", blue error bars indicate the standard deviation for each condition.The interval is not related with model-based analysis. \end{itemize} There are several options such as size of label, axis, file and so on. For example, {\tt which.Protein} can be assigned as protein list to draw plots. and {\tt address} is the name of folder that will store pdf file for plots. If {\tt address}=FALSE, plot will be just showed in window.\\ To get started with this function and information about detailed options, visit the help section of dataProcessPlots first by the following code. \begin{small} <>= ?dataProcessPlots @ \end{small} \begin{enumerate} \item{Quality control plot } \begin{small} <>= dataProcessPlots(data=QuantData,type="QCPlot") @ \end{small} \begin{figure}[ht!] \centering \includegraphics[width=2.5in]{QCPlot_all_before.pdf} \includegraphics[width=2.5in]{QCPlot_all.pdf} \caption{\small All protein together in QCplots before(left panel) and after(right panel) normalization} \label{fig:QC} \end{figure} \clearpage \item{Profile plot} \begin{small} <>= dataProcessPlots(data=QuantData,type="ProfilePlot") @ \end{small} \vspace{-0.3cm} \begin{figure}[ht!] \centering \includegraphics[width=2.5in]{ProfilePlot1.pdf} \includegraphics[width=2.5in]{ProfilePlot2.pdf} \vspace{-0.3cm} \caption{\small Profile plots for Protein IDHC and PMG2 after normalization. Quantitative profiles of reference and endogenous transitions in label-based SRM experiments. X-axis: run. Y-axis: log-intensities of transitions. Line types indicate transitions, and colors indicate peptides.} \label{fig:Profile} \end{figure} \item{Condition plot} \begin{small} <>= dataProcessPlots(data=QuantData,type="ConditionPlot") @ \end{small} \vspace{-0.3cm} \begin{figure}[ht!] \centering \includegraphics[width=2.5in]{Condition1.pdf} \includegraphics[width=2.5in]{Condition2.pdf} \vspace{-0.3cm} \caption{\small Condition plots for Protein IDHC and PMG2. Illustrate the systematic difference between conditions. X-axis is condition. Y-axis is log ratio of endogenous over reference. Red points indicate the mean of log ratio for each condition. Blue error bars indicate the confidence interval with 0.95 significant level for each condition.} \label{fig:Condition} \end{figure} \end{enumerate} \end{enumerate} %============================================ \subsection{Model-based inference} \subsubsection{Setting up a linear mixed effects model} A statistical model has the capability to formally characterize the sources of variation of all measurements that pertain to a protein, and to distinguish the systematic patterns of differential abundance from noise. A statistical model describes the relationship between a {\it response} variable and a set of variables that have been observed along with the response. In proteomic experiments, categorical variables, whose values are denoted by labels, may include peptide features, conditions under which replicates are observed, e.g., disease state, treatment, or time, and biological replicates. \m tests for significant changes in protein abundance across conditions based on a family of linear mixed-effects models in LC-MS, SRM, DIA experiment. Experimental design of case-control study (patients are not repeatedly measured) or time course study (patients are repeatedly measured) is automatically determined based on proper statistical model. Other choices of model specification include : \begin{enumerate} \item labeling technique : TRUE(default) represent the labeled reference peptide study. FALSE represent label-free study. \item scope of inference for biological replicates(variable SUBJECT) or technical replicates(variable RUN): restricted scope, which limited conclusion from the model to observed biological unites or technical units, or expanded scope, which expands conclusion from the model to the population of biological units or technical units. The underlying model fitting functions are {\tt lm} and {\tt lmer} for fixed-effects models and mixed-effects models, respectively. \item interference: TRUE(default) means data contain interference transitions and need additional model interaction to address the interference. FALSE means data contain no interference transitions and no need additional model interaction to address the interference. \item unequal variance between features: If the unequal variation of error for different peptide features is detected, then a possible solution is to account for the unequal error variation by means of a procedure called iteratively re-weighted least squares. featureVar=TRUE performs an iterative fitting procedure, in which features are weighted inversely proportionally to the variation in their intensities, so that feature with large variation are given less importance in the estimation of parameters in the model. \item In general, we recommend to keep interaction({\tt interference}=TRUE) and to use constant variance among features({\tt featureVar}=FALSE) for SRM experiment. \end{enumerate} To get started with this function, visit the help section of groupComparison first by the following code. \begin{small} <>= ?groupComparison @ \end{small} Also comparison of interest can be assigned as {\tt contrast.matrix}. Based on the levels of conditions, specify 1 or -1 to the conditions of interests and 0 otherwise. The levels of conditions are sorted alphabetically. Command {\tt levels(QuantData\$GROUP\_ORIGINAL)} can illustrate the actual order of the levels of conditions. When a feature is missing completely in a condition or a MS run, a warning message is sent to the console notifying the user of the missing feature. Additional filtering or imputing process is required before model fitting. For example, we attempt to compare differential abundance between time 1 and 7 in a set of targeted proteins. In statistical terms, we test $H_0: L = \mu_\text{\ T7} - \mu_\text{\ T1} = 0$ against the alternative $H_a: L = \mu_\text{\ T7} - \mu_\text{\ T1} \neq 0$, where $\mu_\text{\ T1}$ is the mean abundance of the protein in the time 1. In this label-based SRM experiment, we recommend the fitted model with expanded scope of technical replication (i.e. {\tt labeled}=TRUE, {\tt scopeOfTechReplication}="expanded" as default). Finally, the {\tt groupComparison} function is used to simultaneously estimate the log fold change $L$ and the corresponding standard error of the estimate, and to test $H_0$. \begin{small} <<>>== levels(QuantData$GROUP_ORIGINAL) @ <<>>== comparison<-matrix(c(-1,0,0,0,0,0,1,0,0,0),nrow=1) row.names(comparison)<-"T7-T1" @ <<>>== testResultOneComparison<-groupComparison(contrast.matrix=comparison, data=QuantData) testResultOneComparison$ComparisonResult @ \end{small} The testing result contains variable of Protein, Comparison(Label), log2 fold change(logFC), standard error (SE), T values (Tvalue), degree of freedom (DF), raw p-values (pvalue), adjusted p-values based on Benjamini and Hochberg method to collect multiple testing issue and further control false discovery rate (adj.pvalue). The positive logFC values indicates up-regulated and the negative log FC values indicates down-regulated in time 7. The small adjusted p-value means that the regulation is statistically significant. Also multiple companions between groups are possible. This is a testing result of multiple comparisons of {\tt QuantData} based on the intensity-based linear model with expanded scope. The comparison is time 1 vs time 3 (T3-T1), time 1 vs time 7 (T7-T1), time 1 vs time 9 (T9-T1). This result dataset is stored in the package in a data structure {\tt testResultMultiComparisons}. \begin{small} <<>>== comparison1<-matrix(c(-1,0,1,0,0,0,0,0,0,0),nrow=1) comparison2<-matrix(c(-1,0,0,0,0,0,1,0,0,0),nrow=1) comparison3<-matrix(c(-1,0,0,0,0,0,0,0,1,0),nrow=1) comparison<-rbind(comparison1,comparison2, comparison3) row.names(comparison)<-c("T3-T1","T7-T1","T9-T1") testResultMultiComparisons<-groupComparison(contrast.matrix=comparison,data=QuantData) testResultMultiComparisons$ComparisonResult @ \end{small} %============================================ \subsubsection{Verifying the assumption of the model} Results based on statistical models are accurate as long as the assumptions of the model are met. The model assumes that the measurement errors are normally distributed with mean 0 and constant variance $\sigma^2_{\text{Error}}$. The assumption of a constant variance can be checked by examining the residuals from the model, i.e., the deviations of the predicted intensities from the actual intensities. If a plot of residuals against predicted values shows a random scatter, then the assumption is appropriate. Residual plots can be produced for all proteins in the dataset by using the {\tt modelBasedQCPlots} function with {\tt type="ResidualPlots"} in \m. Option, {\tt which.Protein} can be assigned as protein list to draw plots and {\tt address} is the name of folder that will store pdf file for plots. If {\tt address}=FALSE, plot will be just showed in window. To check whether the errors are well approximated by a normal distribution, the {\tt modelBasedQCPlots} function with {\tt type="QQPlots"} in \m can be used. This function produces a normal quantile-quantile plot for each protein. If points fall approximately along a straight line, then the assumption is appropriate for that protein. Only large deviations from the line are problematic. The input of this function is "ModelQC" in the testing results from function ({\tt groupComparison}). The example result is {\tt testResultMultiComparisons}\\ To get started with this function, visit the help section of modelBasedQCPlots first by the following code. \begin{small} <>= ?modelBasedQCPlots @ <>== modelBasedQCPlots(data=testResultMultiComparisons$ModelQC, type="ResidualPlots", which.Protein="PMG2", address=FALSE) @ <>== modelBasedQCPlots(data=testResultMultiComparisons$ModelQC,type="QQPlots", which.Protein="PMG2", address=FALSE) @ \end{small} \begin{figure}[h!] \begin{center} \includegraphics[width=2.5in]{ResidualPlot2.pdf} \includegraphics[width=2.5in]{QQPlot2.pdf} \includegraphics[width=3in]{SRM_QQPlot_perFeature_light.pdf} \includegraphics[width=3in]{SRM_QQPlot_perFeature_heavy.pdf} \caption{\small (a) Residual plot (upper left panel) and (b) normal quantile-quantile plot (upper right panel) for Protein PMG2. The unequal variance between feature is detected for Protein PMG2. The pattern in the qqplot is due to deviations from the assumption of constant variance, and not necessarily from the assumption of Normality. (c) and (d) The normal quantile-quantile plot per feature (bottom panel) illustrates that. It shows that (1) since the slope of the Q-Q plot is related to the variance of the distribution of the errors, the different slopes of the lines offer another indication of unequal variance, (2) the assumptions of Normality of each separate feature is more plausible. The features with lower intensities have a larger variance, and are more likely to deviate from the Normality assumption. } \end{center} \end{figure} In practice, it is common for the variation of error to be unequal for different peptide features. If this is diagnosed from the residual plots, then a possible solution is to account for the unequal error variation by means of a procedure called {\it iteratively re-weighted least squares}. This procedure is implemented in \m by means of the {\tt featureVar} argument in {\tt groupComparison}. {\tt featureVar} = TRUE performs an iterative fitting procedure, in which features are weighted inversely proportional to the variation in their intensities, so that features with large variation are given less importance in the estimation of parameters in the model. %============================================ \subsubsection{Testing for protein-level differential abundance between conditions} To summarize the results of fold changes and adjusted p-values for differentially abundant proteins, {\tt groupComparisonPlots} takes testing results from function ({\tt groupComparison}) as input : (1) volcano plot (specify "VolcanoPlot" in option type) for each comparison separately; (2) heatmap (specify "Heatmap" in option type) for multiple comparisons ; (3) comparison plot (specify "ComparisonPlot" in option type) for multiple comparisons per protein. To get started with this function, visit the help section of groupComparisonPlots first by the following code. \begin{small} <>= ?groupComparisonPlots @ \end{small} \begin{itemize} \item Volcano plot : illustrate actual log-fold changes and adjusted p-values for each comparison separately with all proteins. The x-axis is the log fold change. The base of logarithm transformation is the same as specified in "logTrans" from {\tt dataProcess}. The y-axis is the negative log2 adjusted p-values. The horizontal dashed line represents the FDR cutoff. The points below the FDR cutoff line are non-significantly abundant proteins (colored in black). The points above the FDR cutoff line are significantly abundant proteins (colored in red/blue for up-/down-regulated). If fold change cutoff is specified (FCcutoff = specific value), the points above the FDR cutoff line but within the FC cutoff line are non-significantly abundant proteins (colored in black). \item Heatmap : illustrate up-/down-regulated proteins for multiple comparisons with all proteins. Each column represents each comparison of interest. Each row represents each protein. Color red/blue represents proteins in that specific comparison are significantly up-regulated/down-regulated proteins with FDR cutoff and/or FC cutoff. The color scheme shows the evidences of significance. The darker color it is, the stronger evidence of significance it has. Color gold represents proteins are not significantly different in abundance. \item Comparison plot : illustrate log-fold change and its variation of multiple comparisons for single protein. X-axis is comparison of interest. Y-axis is the log fold change. The red points are the estimated log fold change from the model. The blue error bars are the confidence interval with 0.95 significant level for log fold change. This interval is only based on the standard error, which is estimated from the model. \end{itemize} The input of this function is "ComparisonResult" in the testing results from function ({\tt groupComparison}). The example result is {\tt testResultMultiComparisons} based on the testing results of comparison of time1 and time 7. \begin{enumerate} \item[(1)]{Volcano plot} \begin{footnotesize} <<>>= head(testResultMultiComparisons$ComparisonResult) @ \end{footnotesize} {\bf 1. Default : sig=0.05 / No FCcutoff / No ylimUp / with Protein name} \begin{small} <>= groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="VolcanoPlot",which.Comparison=c("T7-T1"), address=FALSE) @ \end{small} \begin{figure}[h!] \begin{center} \includegraphics[width=2.5in]{VolcanoYeast1.pdf} \includegraphics[width=2.5in]{VolcanoYeast2.pdf} \vspace{-0.5cm} \caption{\small Volcano plot of a single comparison. The x-axis is the log fold change of two conditions. The y-axis is the negative log2 adjusted p-values. The dashed line represents the FDR cutoff of 0.05(default setup {\tt sig}=0.05). The points below the cutoff line are the non-significantly abundant proteins (colored in black). Left volcano plot is the result for default option. Right volcano plot is the result for setting up {\tt FCcutoff}=70 and upper limit of y-axis is 100 without protein names. The setup of FCcutoff=70 is for demonstrate purpose onlythat protein IDHC is not significant after applying fold change cutoff.} \end{center} \end{figure} {\bf 2. FCcutoff=70 / ylimUp=100 / without Protein name}\\ Note : FCcutoff=70 is due to demonstration purpose. \begin{small} <>= groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="VolcanoPlot",FCcutoff=70, ylimUp=100, ProteinName=FALSE, which.Comparison=c("T7-T1"), address=FALSE) @ \end{small} \vspace{-1cm} \begin{figure}[h!] \begin{center} \includegraphics[width=2.5in]{Volcano45Yeast1.pdf} \includegraphics[width=2.5in]{Volcano45Yeast2.pdf} \vspace{-0.5cm} \caption{\small Volcano plots of a single comparison for 45 proteins from the example time courses yeast study with all 45 proteins. Left volcano plot is the result for default option. Right volcano plot is the result for setting up {\tt FCcutoff}=1.5 and upper limit of y-axis is 60 without protein names. \label{fig:volcano}} \end{center} \end{figure} \clearpage \item[(2)]{Heatmap} \vspace{0.3cm} {\bf 1. Default : sig=0.05 / No FCcutoff} \begin{small} <>= groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="Heatmap",address=FALSE) @ \end{small} {\bf 2. FCcutoff=70} \begin{small} <>= groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="Heatmap",FCcutoff=70,address=FALSE) @ \end{small} \begin{figure}[h!] \begin{center} \includegraphics[width=1.5in]{Heatmap_colorkey.png} \\ \vspace{0.3cm} \includegraphics[width=2.5in]{HeatmapYeast1.png} \includegraphics[width=2.5in]{HeatmapYeast2.png} \vspace{-0.2in} \caption{\small Heatmap of multiple comparisons. The column represents different comparisons. The row represents each protein. Color red/blue represents proteins in that specific comparison are significantly up-regulated/down-regulated proteins based on adjusted p-values with the FDR cutoff (specified in the option {\tt sig}) and/or with the fold change cutoff (specified in the option {\tt FCcutoff}). Color gold represents proteins are not significantly different in abundance. The color scheme shows the evidences of significance. The darker color it is, the stronger evidence of significance it has. The color key shows the cutoff of adjusted p-value for color scheme. Left Heatmap is applied with FDR cutoff is 0.05 and no fold change cutoff. Right Heatmap is applied with both FDR cutoff is 0.05 and FC cutoff is 70. Note that specifying the FC cutoff = 70 is for demonstration purpose, so that protein IDHC becomes insignificant after both FDR and fold change cutoff.} \end{center} \end{figure} \newpage \begin{figure}[h!] \begin{center} %\includegraphics[width=1.5in]{Heatmap_colorkey.png} \\ \vspace{-0.3in} \includegraphics[width=2.5in]{Heatmap45Yeast1.pdf} \includegraphics[width=2.5in]{Heatmap45Yeast2.pdf} \vspace{-0.3in} \caption{\small Heatmap of multiple comparisons for the 45 proteins from the example time courses yeast study with all 45 proteins, which is the same result of ~\figref{volcano}. Left heatmp is the result for default option using FDR cutoff=0.05 only. Right heatmap is the result under both FDR cutoff=0.05 and FCcutoff=1.5} \end{center} \end{figure} \vspace{-0.3cm} \item[(3)]{Comparison plot} \vspace{-0.3cm} \begin{small} <>= groupComparisonPlots(data=testResultMultiComparisons$ComparisonResult, type="ComparisonPlot") @ \end{small} \vspace{-0.2in} \begin{figure}[ht!] \begin{center} \includegraphics[width=2.5in]{Comparison1.pdf} \includegraphics[width=2.5in]{Comparison2.pdf} \vspace{-0.1in} \caption{\small Comparison plots for multiple comparisons. The x-axis is the comparisons which the user designed. The y-axis is the log fold change for each comparison. The red points mean the estimated log fold change. The blue error bars mean the confidence interval with 0.95 significant level for log fold change. The horizontal line at log fold change=0 (the log fold change of one comparison is located around this line) means that there is no significant difference for this comparison.} \end{center} \end{figure} \end{enumerate} \vspace{-0.2in} \clearpage %============================================ \subsection{Sample size calculation for a future experiment} To calculate sample size for future experiments, the function fits the model and uses variance components. The underlying model fitting with intensity-based linear model with technical MS run replication. Input in the example is {\tt QuantData}. Estimated sample size is rounded to 0 decimal. Four options of the calculation: \begin{enumerate} \item number of biological replicates per condition \item number of peptides per protein \item number of transitions per peptide \item power is a pre-specified statistical power which defined as the probability of detecting a true fold change. Users should input the average of power that is expected. Otherwise, the function calculate the power for this category, \end{enumerate} It can only obtain either one of the categories of the sample size calculation (numSample, numPep, numTran) or power calculation at the same time. Other three values are need to be inputed.\\ To get started with this function, visit the help section of designSampleSize first by the following code. \begin{small} <>= ?designSampleSize @ \end{small} {\tt desiredFC}, the range of a desired fold change which includes the lower and upper values of the desired fold change, and FDR, a pre-specified false discovery ratio (FDR) to control the overall false positive, are required. Two options for fitting model are possible, choice of scope of biological replication and choice of interference data. For scope of biological replication, "restricted"(default) represents restricted scope of biological replication to the selected individuals. "expanded" represents expanded scope of biological replication to the whole population. For interference, TRUE(default) means data contain interference transitions and need additional model interaction to address the interference. FALSE means data contain no interference transitions and no need additional model interaction to address the interference. \subsubsection{Minimal number of biological replicates per condition} \begin{footnotesize} <<>>== designSampleSize(data=QuantData,numSample=TRUE,numPep=3,numTran=4,power=0.8, desiredFC=c(1.25,1.75),FDR=0.05) @ \end{footnotesize} \subsubsection{Power calculation} \begin{footnotesize} <<>>== designSampleSize(data=QuantData,numSample=2,numPep=3,numTran=4,power=TRUE, desiredFC=c(1.25,1.75),FDR=0.05) @ \end{footnotesize} %============================================ \subsubsection{Visualization for sample size calculation} To illustrate the relationship of desired fold change and the calculated minimal number sample size which are (1) number of biological replicates per condition, (2) number of peptides per protein, (3) number of transitions per peptide, and (4) power. The input is the result from function ({\tt designSampleSize}). \\ To get started with this function, visit the help section of designSampleSizePlots first by the following code. \begin{small} <>= ?designSampleSizePlots @ <>== # Minimal number of biological replicates per condition result.sample<-designSampleSize(data=QuantData,numSample=TRUE,numPep=3,numTran=4,power=0.8, desiredFC=c(1.25,1.75),FDR=0.05) designSampleSizePlots(data=result.sample) @ \end{small} \begin{center} \includegraphics[width=2.5in]{SampleSizeReplicateYeast.pdf}\\ \end{center} \begin{small} <>== # Power result.power<-designSampleSize(data=QuantData,numSample=2,numPep=3,numTran=4,power=TRUE, desiredFC=c(1.25,1.75),FDR=0.05) designSampleSizePlots(data=result.power) @ \end{small} \begin{center} \includegraphics[width=2.5in]{power.pdf}\\ \end{center} %============================================ \subsection{Quantification of protein abundance in individual samples or in conditions} Downstream analysis, e.g., clustering or classification of individual samples based on protein profiles, sometimes requires a single value of a protein in each biological replicate. That is, it is sometimes of interest to have a table where columns are biological replicates and rows are proteins. A summarization step is needed to express the measurements from multiple features into a single value per protein per biological replicate. As in testing for differential abundance, this can be done by using model-based quantities. In \m, the {\tt quantification} function can be used to produce the estimates of protein abundance for all biological replicates in the study or for all groups. For sample quantification, the label of each biological sample is a combination of the corresponding group and the sample ID. The same model with groupComparison will be used. However, if there is only one transition in a certain protein, the estimate of variation is NA. Therefore, the result may be unreliable. The quantification for light is the endogenous transition quantification.The quantification for reference (heavy) is the average among all reference intensities. The quantification of ratio between endogenous and reference intensity would be the quantification of light minus the reference quantification. Two formats of output are supported. "long" for long format which has the columns named Protein, Condition, LonIntensities (and BioReplicate if sample quantification). "matrix" for data matrix format which has the rows for Protein and the columns, which are Groups(or Conditions) for group quantification or the combinations of BioReplicate and Condition (labeled by "BioReplicate"\_"Condition") for sample quantification. Default is "matrix". The input of this function is the quantitative data from function ({\tt dataProcess}). The example data is {\tt QuantData}.\\ To get started with this function, visit the help section of quantification first by the following code. \begin{small} <>= ?quantification @ \end{small} Consider quantitative data (i.e. {\tt QuantData}) from a yeast study with ten time points of interests, three biological replicates, and no technical replicates which is a time-course experiment. Sample quantification shows model-based estimation of protein abundance in each biological replicate within each time point. Group quantification shows model-based estimation of protein abundance in each time point. \begin{small} <<>>== # (1): Sample quantification subQuant<-quantification(QuantData) head(subQuant) @ <<>>== # (2): Group quantification groupQuant<-quantification(QuantData, type="Group") head(groupQuant) @ \end{small} \clearpage %============================================ \section{Example workflow with label-free LC-MS: controlled spike-in experiment} The workflow is the same as SRM experiment except some options for analysis and inference. The summary of result is shown below. See details from {\it user tutorial} in \url{msstats.org} \subsection{Experimental design} ~\cite{Mueller:2007fo}, has previously described the experiment with the sample with 6 standard Proteins, (horse myoglobin, bovine carbonic anhydrase, horse Cytochrome c, chicken lysozyme, yeast alcohol dehydrogenase, rabbit aldolase A) Each protein was represented by 7-21 peptides and each peptide was represented by 1-5 transition. Using latin square design of experiment frame, each sample has different dilution for each protein. Each of the LC-MS runs is analyzed in 3 technical replicate. There are total 18 injection runs. The data set is available in website (\url{http://prottools.ethz.ch/muellelu/web/Latin\_Square\_Data.php}) and in \url{msstats.org}.\\ \subsection{Pre-processing data and quality control of MS runs} The default normalization method for label-free MS experiment in MSstats is the constant normalization with all endogenous intensities along all proteins. However, if there are small number of features in each run, compared with SRM and DIA experiments, we can not keep the assumption that the distribution of intensities across run is the same. The alternative could be normalization using internal standard peptide. See details in section 3.3. \begin{figure}[h!] \begin{center} \includegraphics[width=3in]{LCMS_ProfilePlot.pdf} \caption{\small Profile plot for one spiked protein, Alcohol dehydrogenase-Yeast. The profiles show some non-parallel patterns (i.e. interferences). Features with low signals have missing values. \label{fig:profileLCMS}} \end{center} \end{figure} \subsection{Model-based inference} LC-MS experiments are often subject to random interferences, and contain missing values. To express the fact that the interferences are non-systematic, we recommend to remove interaction ({\tt interference}=FALSE). To express the fact that the variances are unequal between the features, we recommend to use feature-specific variance ({\tt featureVar}=TRUE). Also there are three options to handle missing values, (1) remove interaction which means to assume features demonstrate no interference across runs, (2) impute with average minimum intensity across runs, (3) remove features that completely miss in some conditions. \subsubsection{Setting up a linear mixed effects model} First, we assign the comparison of interest. Next,we fit a model that specifies systematic interferences and constant variance. Since this is a label-free experiment, use {\tt labeled}=FALSE. See section 3.4.1 for details regarding specifying scope of conclusions. \subsubsection{Verifying the assumption of the model} Also, in this case the protein has many missing peaks (complete missing in some conditions).The groupComparison function automatically detects this pattern of missing values, and uses the model without interaction. See in section 3.4.2 for other possible options. %% use defaul \begin{figure}[h!] \begin{center} \includegraphics[width=3in]{LCMS_constant_ResidualPlot.pdf} \caption{\small Residual plot for one spiked protein, Alcohol dehydrogenase-Yeast. The profiles shows a larger variance for features with a lower intensity. \label{fig:outputLCMS}} \end{center} \end{figure} \subsection{Testing for protein-level differential abundance between conditions} Heatmap summarizes all results of comparison, that \m detects difference between groups. See details about other types of plot in section 3.5. %% use defaul \begin{figure}[h!] \begin{center} \includegraphics[width=3in]{LCMS_Heatmap.pdf} \caption{\small Heatmap of significance testing results of 4 spiked proteins across 6 conditions. \label{fig:heatmapLCMS}} \end{center} \end{figure} \subsection{Sample size calculation for a future experiment} Same as in SRM experiments. See details in section 3.5. \subsection{Quantification of protein abundance in individual samples or in conditions} Same as in SRM experiments. See details in section 3.6. \clearpage %============================================ \section{Example workflow with label-free DIA: a group comparison study of {\it S.Pyogenes}} The workflow is the same as SRM experiment except some options for analysis and inference. The summary of result is shown below. See details from {\it user tutorial} in \url{msstats.org} \subsection{Experimental design} The example dataset is from a label-free DIA experiment of S.pyogenes study. For demonstration purposes, this partial data set consists of two proteins only. Two biological conditions are Strep0 and Strep10 (S. Pyogenes with 0\% and 10\% of human plasma added). Two biological replicates from each condition were profile with a SWATH-MS-enabled AB SCIEX TripleTOF 5600 System. The identification and quantification of spectral peaks was assisted by a spectral library, and was performed using OpenSWATH software \url {http://proteomics.ethz.ch/openswath.html}. \subsection{Pre-processing data and quality control of MS runs} The default normalization method for label-free MS experiment in MSstats is the constant normalization with all endogenous intensities along all proteins. In case of DIA, there are enough number of intensities in each run. Therefore, quantile normalization is also suitable because the assumption for same distribution of intensities for endogenous transition across all MS runs is reasonable. See details in section 4.3. \begin{figure}[h!] \begin{center} \includegraphics[width=2.5in]{DIA_ProfilePlot_269.pdf} \includegraphics[width=2.5in]{DIA_ProfilePlot_377.pdf} \caption{\small Profile plots for two example proteins, Probable RNA helicase exp9 and FabG in DIA experiment. The proteins have a large number of features. The profiles show some non-parallel patterns (i.e. interferences). \label{fig:profileDIA}} \end{center} \end{figure} %% \todo{Hannes : could you confirm the name of protein numbered 515084? the first few letters(here, XX) are broked.} \subsection{Model-based inference} DIA experiments are often subject to random interferences, and unequal variance between features. To express the fact that the interferences are non-systematic, we recommend to remove interaction ({\tt interference}=FALSE). To express the fact that the variances are unequal between the features, we recommend to use feature-specific variance ({\tt featureVar}=TRUE). \subsubsection{Setting up a linear mixed effects model} First, we assign the comparison of interest. Next,we fit a model that specifies systematic interferences and constant variance. Since this is a label-free experiment, use {\tt labeled}=FALSE. See section 3.4.1 for details regarding specifying scope of conclusions. \subsubsection{Verifying the assumption of the model} The unequal variance between features is detected from Residual plot and normal quantile-quantile plots.(~\figref{outputDIA}) Therefore to consider the unequal variance is recommended as below. See in section 3.4.2 for other possible options. %% use defaul \begin{figure}[h!] \begin{center} \includegraphics[width=2.8in]{DIA_constant_ResidualPlot4.pdf} \includegraphics[width=2.8in]{DIA_constant_QQPlot4.pdf} \caption{\small (a) Residual plot (left panel) for FabG protein using the model under the assumption constant variance across features. (b) normal quantile-quantile plot (right panel) shows the pattern in the qqplot that is due to deviations from the assumption of constant variance, and not necessarily from the assumption of Normality. \label{fig:outputDIA}} \end{center} \end{figure} \subsubsection{Testing for protein-level differential abundance between conditions} Based on detection about interference in ~\figref{profileDIA} and unequal variance in ~\figref{outputDIA}, we consider them using option {\tt featureVar}=TRUE,{\tt interference}=FALSE as below. See in section 3.4.3 for other options and visualizations. \subsection{Sample size calculation for a future experiment} Same as in SRM experiments. See details in section 3.5. \subsection{Quantification of protein abundance in individual samples or in conditions} Same as in SRM experiments. See details in section 3.6. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% { \bibliographystyle{natdin} \large \bibliography{bibliography} } \end{document}