%\VignetteIndexEntry{primer - protViz: Visualizing and Analyzing Mass Spectrometry Related Data in Proteomics} \documentclass[nojss]{jss} \usepackage{thumbpdf,lmodern} \title{\pkg{protViz}: Visualizing and Analyzing Mass Spectrometry Related Data in Proteomics} \author{Christian Panse\\ Functional Genomics Center Zurich \And Jonas Grossmann\\ Functional Genomics Center Zurich} \title{protViz: Visualizing and Analyzing Mass Spectrometry Related Data in Proteomics} \Plainauthor{Christian Panse, Jonas Grossmann} \Plaintitle{protViz: Visualizing and Analyzing Mass Spectrometry Related Data in Proteomics} \Shorttitle{protViz} \Keywords{proteomics, mass spectrometry, fragment-ion} \Plainkeywords{proteomics, mass spectrometry, fragment-ion} \Abstract{ \pkg{protViz} is an R package to do quality checks, visualizations and analysis of mass spectrometry data, coming from proteomics experiments. The package is developed, tested and used at the Functional Genomics Center Zurich. We use this package mainly for prototyping, teaching, and having {\em fun} with proteomics data. But it can also be used to do data analysis for small scale data sets. Nevertheless, if one is patient, it also handles large data sets.} \Address{ Jonas Grossmann and Christian Panse\\ Functional Genomics Center Zurich, UZH\texttt{|}ETHZ\\ Winterthurerstr. 190\\ CH-8057, Z{\"u}rich, Switzerland\\ Telephone: +41-44-63-53912\\ E-mail: \email{jg@fgcz.ethz.ch}, \email{cp@fgcz.ethz.ch}\\ URL: \url{https://fgcz.ch} } \begin{document} \tableofcontents \newpage <>= options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE) @ \SweaveOpts{concordance=TRUE} \section{Related Work} {\em The method of choice in proteomics is mass spectrometry.} There are already packages in R which deal with mass spec related data. Some of them are listed here: \begin{itemize} \item \pkg{OrgMassSpec}: Organic Mass Spectrometry \item \pkg{MSnbase} package (basic functions for mass spec data including quant aspect with iTRAQ data)\\ \url{http://bioconductor.org/packages/MSnbase/} \item \pkg{plgem} -- spectral counting quantification, applicable to MudPIT experiments\\ \url{http://www.bioconductor.org/packages/plgem/} \item \pkg{synapter} -- MSe (Hi3 = Top3 Quantification) for Waters Q-tof data aquired in MSe mode\\ \url{http://bioconductor.org/packages/synapter/} \item \pkg{mzR} \\ \url{http://bioconductor.org/packages/mzR/} \item isobar iTRAQ/TMT quantification package\\ \url{http://bioconductor.org/packages/isobar/} \item \pkg{readMzXmlData}\\ \url{https://CRAN.R-project.org/package=readMzXmlData} %\item \pkg{msQC}\\ %\url{http://bioconductor.org/packages/3.0/bioc/html/msQC.html} \item \pkg{rawDiag} - an R package supporting rational LC-MS method optimization for bottom-up proteomics on multiple OS platforms \citep{rawDiag} \end{itemize} \section{Get Data In -- Preprocessing} The most time consuming and challenging part of data analysis and visualization is shaping the data the way that they can easily further process. In this package, we intentionally left this part away because it is very infrastructure dependent. Moreover, we use also commercial tools to analyze data and export the data into R accessible formats. We provide a different kind of importers if these formats are available, but with little effort, one can bring other exports in a similar format which will make it easy to use our package for a variety of tools. \subsection{Identification - In-silico from Proteins to Peptides} For demonstration, we use a sequence of peptides derived from a tryptic digest using the Swiss-Prot \code{FETUA\_BOVIN Alpha-2-HS-glycoprotein} protein (P12763). \code{fcat} and \code{tryptic-digest} are commandline programs which are included in the package. \code{fcat} removes the lines starting with \code{>} and all 'new line' character within the protein sequence while \code{tryptic-digest} is doing the triptic digest of a protein sequence applying the rule: cleave after arginine (R) and lysine (K) except followed by proline(P). Both programs can be used through the \code{Fasta} Rcpp module. <<>>= library(protViz) fname <- system.file("extdata", name='P12763.fasta', package = "protViz") F <- Fasta$new(fname) @ print the first 60 characters of P12763. <<>>= substr(F$getSequences(), 1, 60) @ <<>>= (fetuin <- F$getTrypticPeptides()) @ \section{Peptide Identification} {\em The currency in proteomics are the peptides.} In proteomics, proteins are digested to so-called peptides since peptides are much easier to handle biochemically than proteins. Proteins are very different in nature some are very sticky while others are soluble in aqueous solutions while again are only sitting in membranes. Therefore, proteins are chopped up into peptides because it is fair to assume, that for each protein, there will be many peptides behaving well so that they can be measured with the mass spectrometer. This step introduces another problem, the so-called protein inference problem. In this package here, we do not touch at all upon the protein inference. \subsection{Computing Mass and Hydrophobicity of a Peptide Sequence} \code{parentIonMass} computes the mass of an amino acid sequence. <<>>= mass <- protViz::parentIonMass(fetuin) @ The \code{ssrc} function derives a measure for the hydrophobicity based on the method described in \citep{pmid15238601}. <<>>= hydrophobicity <- protViz::ssrc(fetuin) @ The content of \code{mass} and \code{hydrophobicity} can be seen in the Table \ref{Table:hydrophobicity}. <>= library(xtable) print(xtable(data.frame(peptide = names(hydrophobicity), mass = parentIonMass(names(hydrophobicity)), hydrophobicity=hydrophobicity), caption="parent ion mass and hydrophobicity values of the tryptic digested protein \texttt{P12763}.", label="Table:hydrophobicity"), include.rownames=FALSE, scalebox="0.75") @ A figure below shows a scatter plot graphing the parent ion mass versus the hydrophobicity value of each in-silico tryptic digested peptide of the \texttt{FETUA BOVIN} (P12763) protein. <>= op <- par(mfrow = c(1, 1)) plot(hydrophobicity ~ mass, log = 'xy', pch = 16, col = '#88888888', cex = 2, main = "sp|P12763|FETUA_BOVIN Alpha-2-HS-glycoprotein", sub = 'tryptic peptides') text(mass, hydrophobicity, fetuin, pos=2, cex=0.5, col = '#CCCCCC88') @ \subsection{In-silico Peptide Fragmentation} The fragment ions computation of a peptide follows the rules proposed in \citep{pmid6525415}. Beside the \code{b} and \code{y} ions the \code{FUN} argument of \code{fragmentIon} defines which ions are computed. the default ions beeing computed are defined in the function \code{defaultIon}. The are no limits for defining other forms of fragment ions for ETD (c and z ions) CID (b and y ions). <>= defaultIon @ <>= ## plot in-silico fragment ions of peptides <- c('HTLNQIDSVK', 'ALGGEDVR', 'TPIVGQPSIPGGPVR') pims <- peptides |> protViz::parentIonMass() fis <- peptides |> protViz::fragmentIon() par(mfrow = c(3, 1)); rv <- mapply(FUN = function(fi, pim, peptide){ plot(0,0, xlab='m/Z', ylab='', xlim = range(c(fi$b, fi$y)), ylim = c(0,1), type = 'n', axes = FALSE, sub=paste(pim, "Da")); axis(1, fi$b,round(fi$b, 2)) pepSeq <- strsplit(peptide, "") axis(3, fi$b, pepSeq[[1]]) abline(v = fi$b, col='red', lwd=2) abline(v = fi$y, col='blue',lwd=2) abline(v = fi$c, col='orange') abline(v = fi$z, col='cyan') }, fis, pims, peptides) @ The next lines compute the singly and doubly charged fragment ions of the \code{HTLNQIDSVK} peptide. Which are usually the ones that can be used to make an identification. <>= Hydrogen<-1.007825 (fi.HTLNQIDSVK.1 <- fragmentIon('HTLNQIDSVK'))[[1]] (fi.HTLNQIDSVK.2 <-(fi.HTLNQIDSVK.1[[1]] + Hydrogen) / 2) @ \subsection{Peptide Sequence -- Fragment Ion Matching} Given a peptide sequence and a tandem mass spectrum. For the assignment of a candidate peptide an in-silico fragment ion spectra \code{fi} is computed. The function \code{findNN} determines for each fragment ion the closed peak in the MS2. If the difference between the in-silico mass and the measured mass is inside the 'accuracy' mass window of the mass spec device the in-silico fragment ion is considered as a potential hit. <>= peptideSequence <- 'HTLNQIDSVK' spec <- list(scans=1138, title="178: (rt=22.3807) [20080816_23_fetuin_160.RAW]", rtinseconds=1342.8402, charge=2, mZ=c(195.139940, 221.211970, 239.251780, 290.221750, 316.300770, 333.300050, 352.258420, 448.384360, 466.348830, 496.207570, 509.565910, 538.458310, 547.253380, 556.173940, 560.358050, 569.122080, 594.435500, 689.536940, 707.624790, 803.509240, 804.528220, 822.528020, 891.631250, 909.544400, 916.631600, 973.702160, 990.594520, 999.430580, 1008.583600, 1017.692500, 1027.605900), intensity=c(931.8, 322.5, 5045, 733.9, 588.8, 9186, 604.6, 1593, 531.8, 520.4, 976.4, 410.5, 2756, 2279, 5819, 2.679e+05, 1267, 1542, 979.2, 9577, 3283, 9441, 1520, 1310, 1.8e+04, 587.5, 2685, 671.7, 3734, 8266, 3309)) fi <- protViz::fragmentIon(peptideSequence) n <- nchar(peptideSequence) by.mZ <- c(fi[[1]]$b, fi[[1]]$y) by.label <- c(paste("b",1:n,sep=''), paste("y",n:1,sep='')) # should be a R-core function as findInterval! idx <- protViz::findNN(by.mZ, spec$mZ) mZ.error <- abs(spec$mZ[idx]-by.mZ) plot(mZ.error[mZ.error.idx<-order(mZ.error)], main="Error Plot", pch='o', cex=0.5, sub='The error cut-off is 0.6Da (grey line).', log='y') abline(h=0.6,col='grey') text(1:length(by.label), mZ.error[mZ.error.idx], by.label[mZ.error.idx], cex=0.75,pos=3) @ The graphic above is showing the mass error of the assignment between the MS2 \code{spec} and the singly charged fragment ions of \code{HTLNQIDSVK}. The function \code{psm} is doing the peptide sequence matching. Of course, the more theoretical ions match (up to a small error tolerance, given by the system) the measured ion chain, the more likely it is, that the measured spectrum indeed is from the inferred peptide (and therefore the protein is identified) \subsection{Modifications} <>= library(protViz) ptm.0 <- cbind(AA="-", mono=0.0, avg=0.0, desc="unmodified", unimodAccID=NA) ptm.616 <- cbind(AA='S', mono=-27.010899, avg=NA, desc="Substituition", unimodAccID=616) ptm.651 <- cbind(AA='N', mono=27.010899, avg=NA, desc="Substituition", unimodAccID=651) m <- as.data.frame(rbind(ptm.0, ptm.616, ptm.651)) genMod(c('TAFDEAIAELDTLNEESYK','TAFDEAIAELDTLSEESYK'), m$AA) fi <- protViz::fragmentIon(c('TAFDEAIAELDTLSEESYK', 'TAFDEAIAELDTLNEESYK', 'TAFDEAIAELDTLSEESYK', 'TAFDEAIAELDTLNEESYK'), modified=c('0000000000000200000', '0000000000000100000', '0000000000000000000', '0000000000000000000'), modification=m$mono) @ \subsection{Labeling Peaklists} The \code{peakplot} \citet{peakplot} function performs the labeling of the tandem mass spectra. %\SweaveOpts{width=9.5, height=6} %\setkeys{Gin}{width=0.95\columnwidth} <>= data(msms) op <- par(mfrow = c(2, 1)) protViz::peakplot("TAFDEAIAELDTLNEESYK", msms[[1]]) protViz::peakplot("TAFDEAIAELDTLSEESYK", msms[[2]]) par(op) @ The following code snippet combines all the functions to implement a simple peptide search engine. As default arguments, the mass spectrum \code{x}, mZ and intensity arrays list, and a character vector of peptide sequences are given. <>= .peptideFragmentIonSpectrumMatch <- function (x, peptideSet, framentIonMassToleranceDa = 0.1) { ## Here we ignore the peptide mass # peptideMassTolerancePPM = 5 # query.mass <- ((x$pepmass[1] * x$charge)) - (1.007825 * (x$charge - 1)) # eps <- query.mass * peptideMassTolerancePPM * 1e-06 # pimIdx <- protViz::parentIonMass(peptideSequence) # lower <- protViz::findNN(query.mass - eps, pimIdx) # upper <- protViz::findNN(query.mass + eps, pimIdx) rv <- lapply(peptideSet, FUN = protViz::psm, spec = x, plot = FALSE) |> lapply(FUN = function(p) { ## determine peaks considered as hits idx <- abs(p$mZ.Da.error) < framentIonMassToleranceDa intensityRatio <- sum(x$intensity[idx]) / sum(x$intensity) ## derive objectives for a good match data.frame(nHits=sum(idx), intensityRatio = intensityRatio) }) |> Reduce(f=rbind) idx.tophit <- which(rv$nHits == max(rv$nHits))[1] data.frame(peptideMatch = peptideSet[idx.tophit], nHits = max(rv$nHits), nPeaks = length(x$mZ), intensityRatio = rv$intensityRatio[idx.tophit] ) } @ define a set of peptide sequences <<>>= peptideSet <- c("ELIVSK", 'TAFDEAIAELDTLNEESYK','TAFDEAIAELDTLSEESYK') @ generate a in-silico tandem mass spectrum <<>>= mZ <- protViz::fragmentIon("TAFDEAIAELDTLNEESYK")[[1]] |> unlist() |> sort() intensity <- mZ |> length() |> sample() msms.insilico <- list(mZ = mZ, intensity = intensity) @ generate reverse peptide sequences <<>>= peptideSet.rev <- peptideSet |> sapply(FUN = function(x){ strsplit(x, "")[[1]] |> rev() |> paste0(collapse = "") }) @ The output is an assignment of the best matching peptide. <<>>= lapply(list(msms[[1]], msms[[2]], msms.insilico), FUN = .peptideFragmentIonSpectrumMatch, peptideSet = c(peptideSet, peptideSet.rev), framentIonMassToleranceDa = 0.05) |> Reduce(f=rbind) @ \section{Quantification} For an overview on Quantitative Proteomics read \citet{pmid22772140, pmid22821268}. The authors are aware that meaningful statistics usually require a much higher number of biological replicates. In almost all cases there are not more than three to six repetitions. For the moment there are limited options due to the availability of machine time and the limits of the technologies. \subsection{Label-free methods on protein level} The data set \code{fetuinLFQ} contains a subset of our results descriped in \cite{pmid20576481}. The example below shows a visualization using trellis plots. It graphs the abundance of four protein independency from the fetuin concentration spiked into the sample. <>= library(lattice) data(fetuinLFQ) cv<-1-1:7/10 t<-trellis.par.get("strip.background") t$col<-(rgb(cv,cv,cv)) trellis.par.set("strip.background",t) print(xyplot(abundance~conc|prot*method, groups=prot, xlab="Fetuin concentration spiked into experiment [fmol]", ylab="Abundance", aspect=1, data=fetuinLFQ$t3pq[fetuinLFQ$t3pq$prot %in% c('Fetuin', 'P15891', 'P32324', 'P34730'),], panel = function(x, y, subscripts, groups) { if (groups[subscripts][1] == "Fetuin") { panel.fill(col="#ffcccc") } panel.grid(h=-1,v=-1) panel.xyplot(x, y) panel.loess(x,y, span=1) if (groups[subscripts][1] == "Fetuin") { panel.text(min(fetuinLFQ$t3pq$conc), max(fetuinLFQ$t3pq$abundance), paste("R-squared:", round(summary(lm(x~y))$r.squared,2)), cex=0.75, pos=4) } } )) @ The plot shows the estimated concentration of the four proteins using the top three most intense peptides. The Fetuin peptides are spiked in with increasing concentration while the three other yeast proteins are kept stable in the background. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{pgLFQ -- LCMS based label-free quantification} LC-MS based label-free quantification (LFQ) is a very popular method to extract relative quantitative information from mass spectrometry experiments. At the FGCZ we use the software ProgenesisLCMS for this workflow \url{http://www.nonlinear.com/products/progenesis/lc-ms/overview/}. Progenesis is a graphical software which does the aligning between several LCMS experiments, extracts signal intensities from LCMS maps and annotates the master map with peptide and protein labels. <>= data(pgLFQfeature) data(pgLFQprot) featureDensityPlot<-function(data, n=ncol(data), nbins=30){ my.col<-rainbow(n); mids<-numeric() density<-numeric() for (i in 1:n) { h<-hist(data[,i],nbins, plot=FALSE) mids<-c(mids, h$mids) density<-c(density, h$density) } plot(mids,density, type='n') for (i in 1:n) { h<-hist(data[,i],nbins, plot=FALSE) lines(h$mids,h$density, col=my.col[i]) } legend("topleft", names(data), cex=0.5, text.col=my.col ) } par(mfrow=c(1,1)); featureDensityPlot(asinh(pgLFQfeature$"Normalized abundance"), nbins=25) @ The \code{featureDensityPlot} shows the normalized signal intensity distribution (asinh transformed) over 24 LCMS runs which are aligned in this experiment. <>= op<-par(mfrow=c(1,1),mar=c(18,18,4,1),cex=0.5) samples<-names(pgLFQfeature$"Normalized abundance") image(cor(asinh(pgLFQfeature$"Normalized abundance")), col=gray(seq(0,1,length=20)), main='pgLFQfeature correlation', axes=FALSE) axis(1,at=seq(from=0, to=1, length.out=length(samples)), labels=samples, las=2) axis(2,at=seq(from=0, to=1, length.out=length(samples)), labels=samples, las=2) par(op) @ This image plot shows the correlation between runs on feature level (values are \code{asinh} transformed). White is perfect correlation while black indicates a poor correlation. <>= op<-par(mfrow=c(1,1),mar=c(18,18,4,1),cex=0.5) image(cor(asinh(pgLFQprot$"Normalized abundance")), main='pgLFQprot correlation', axes=FALSE, col=gray(seq(0,1,length=20))) axis(1,at=seq(from=0, to=1, length.out=length(samples)), labels=samples, las=2) axis(2,at=seq(from=0, to=1, length.out=length(samples)), labels=samples, las=2) par(op) @ This figure shows the correlation between runs on protein level (values are \code{asinh} transformed). White is perfect correlation while black indicates a poor correlation. Striking is the fact that the six biological replicates for each condition cluster very well. <>= par(mfrow=c(2,2),mar=c(6,3,4,1)) ANOVA<-pgLFQaov(pgLFQprot$"Normalized abundance", groups=as.factor(pgLFQprot$grouping), names=pgLFQprot$output$Accession, idx=c(15,16,196,107), plot=TRUE) @ This figure shows the result for four proteins which either differ significantly in expression across conditions (green boxplots) using an analysis of variance test, or non-differing protein expression (red boxplot). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{iTRAQ -- Two Group Analysis} The data for the next section is an iTRAQ-8-plex experiment where two conditions are compared (each condition has four biological replicates) \subsubsection{Sanity Check} <>= data(iTRAQ) x<-rnorm(100) par(mfrow=c(3,3),mar=c(6,4,3,0.5)); for (i in 3:10){ qqnorm(asinh(iTRAQ[,i]), main=names(iTRAQ)[i]) qqline(asinh(iTRAQ[,i]), col='grey') } b<-boxplot(asinh(iTRAQ[,c(3:10)]), main='boxplot iTRAQ') @ A first quality check to see if all reporter ion channels are having the same distributions. Shown in the figure are Q-Q plots of the individual reporter channels against a normal distribution. The last is a boxplot for all individual channels. \subsubsection{On Protein Level} <>= data(iTRAQ) group1Protein<-numeric() group2Protein<-numeric() for (i in c(3,4,5,6)) group1Protein<-cbind(group1Protein, asinh(tapply(iTRAQ[,i], paste(iTRAQ$prot), sum, na.rm=TRUE))) for (i in 7:10) group2Protein<-cbind(group2Protein, asinh(tapply(iTRAQ[,i], paste(iTRAQ$prot), sum, na.rm=TRUE))) par(mfrow=c(2,3),mar=c(6,3,4,1)) for (i in 1:nrow(group1Protein)){ boxplot.color="#ffcccc" tt.p_value <- t.test(as.numeric(group1Protein[i,]), as.numeric(group2Protein[i,]))$p.value if (tt.p_value < 0.05) boxplot.color='lightgreen' b <- boxplot(as.numeric(group1Protein[i,]), as.numeric(group2Protein[i,]), main=row.names(group1Protein)[i], sub=paste("t.test: p-value =", round(tt.p_value,2)), col=boxplot.color, axes=FALSE) axis(1, 1:2, c('group_1','group_2')); axis(2); box() points(rep(1,b$n[1]), as.numeric(group1Protein[i,]), col='blue') points(rep(2,b$n[2]), as.numeric(group2Protein[i,]), col='blue') } @ This figure shows five proteins which are tested if they differ across conditions using the four biological replicates with a $t$~statistic. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{On Peptide Level} The same can be done on peptide level using the \code{protViz} function \code{iTRAQ2GroupAnalysis}. <>= data(iTRAQ) q <- iTRAQ2GroupAnalysis(data=iTRAQ, group1=c(3,4,5,6), group2=7:10, INDEX=paste(iTRAQ$prot,iTRAQ$peptide), plot=FALSE) q[1:10,] @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Pressure Profiles QC} A common problem with mass spec setup is the pure reliability of the high-pressure pump. The following graphics provide visualizations for quality control. An overview of the pressure profile data can be seen by using the \code{ppp} function. <>= data(pressureProfile) ppp(pressureProfile) @ The lines plots the pressure profiles data on a scatter plot ``Pc'' versus ``time'' grouped by time range (no figure because of too many data items). The Trellis \code{xyplot} shows the Pc development over each instrument run to a specified relative runtime $(25, 30, \ldots)$. <>= pp.data<-pps(pressureProfile, time=seq(25,40,by=5)) print(xyplot(Pc ~ as.factor(file) | paste("time =", as.character(time), "minutes"), panel = function(x, y){ m<-sum(y)/length(y) m5<-(max(y)-min(y))*0.05 panel.abline(h=c(m-m5,m,m+m5), col=rep("#ffcccc",3),lwd=c(1,2,1)) panel.grid(h=-1, v=0) panel.xyplot(x, y) }, ylab='Pc [psi]', layout=c(1,4), sub='The three red lines indicate the average plus min 5%.', scales = list(x = list(rot = 45)), data=pp.data)) @ While each panel in the \code{xyplot} above shows the data to a given point in time, we try to use the levelplot to get an overview of the whole pressure profile data. <>= pp.data<-pps(pressureProfile, time=seq(0,140,length=128)) print(levelplot(Pc ~ time * as.factor(file), main='Pc(psi)', data=pp.data, col.regions=rainbow(100)[1:80])) @ \label{lab:datapreparation} The \pkg{protViz} package has also been used in \citealp{pmid20576481,pPTM,specL,msqc1,pmid28035797,Egloff2018,Gehrig2020,Kockmann2021}. \bibliography{protViz} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \appendix \section{Session information}\label{sec:sessionInfo} An overview of the package versions used to produce this document are shown below. <>= toLatex(sessionInfo()) @ \end{document}