--- title: "Peptide Annotation and Protein Inference" author: "Witold Wolski" date: "March 16, 2017" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Peptide Annotation and Protein Inference} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE) ``` # Introduction This Vignette describes how the `prozor::greedy` function can be used to infer proteins form peptide identifications using the Occam's razor principle. The method tries to find a minimal set of porteins which can explain all the peptides identified. ```{r loadData} library(prozor) rm(list = ls()) file = system.file("extdata/IDResults.txt.gz" , package = "prozor") specMeta <- readr::read_tsv(file) head(specMeta) ``` Annotate peptide sequences with protein sequences from two fasta files on with reviewed entries only (sp) and the other with reviewed and Trembl entries (sp/tr). ```{r fig.cap = "Number of proteins in the All and Canonical database."} length(unique(specMeta$peptideSeq)) upeptide <- unique(specMeta$peptideSeq) resAll <- prozor::readPeptideFasta( system.file("p1000_db1_example/Annotation_allSeq.fasta.gz" , package = "prozor")) resCan <- prozor::readPeptideFasta( system.file("p1000_db1_example/Annotation_canSeq.fasta.gz" , package = "prozor")) annotAll <- prozor::annotatePeptides(upeptide, resAll) head(subset(annotAll,select = -proteinSequence)) annotCan <- prozor::annotatePeptides(upeptide, resCan) barplot(c(All = length(resAll),Canonical = length(resCan))) ``` ```{r fig.cap="Number of unique peptide protein pairs for the All and Canonical database."} barplot(c(All = nrow(annotAll), Canonical = nrow(annotCan)), ylab = "# peptide protein matches.") ``` We can see that using the larger fasta database reduces the proportion of proteotypic peptides. ```{r} PCProteotypic_all <- sum(table(annotAll$peptideSeq) == 1) / length(table(annotAll$peptideSeq)) * 100 PCProteotypic_canonical <- sum(table(annotCan$peptideSeq) == 1) / length(table(annotCan$peptideSeq)) * 100 barplot( c(All = PCProteotypic_all, Canonical = PCProteotypic_canonical), las = 2, ylab = "% proteotypic" ) ``` # Protein Inference We can now identify a minmal set of proteins explaining all the peptides observed for both databases ```{r} library(Matrix) precursors <- unique(subset(specMeta, select = c( peptideModSeq, precursorCharge, peptideSeq ))) ``` ## Protein Inference for database with Trembl identifiers ```{r greedyTrembl, fig.cap="Peptide protein machtes for the All database. Rows - peptides, Columns - proteins, black - peptide protein match."} library(Matrix) annotatedPrecursors <- merge(precursors , subset(annotAll, select = c(peptideSeq, proteinID)), by.x = "peptideSeq", by.y = "peptideSeq") xx <- prepareMatrix(annotatedPrecursors, proteinID = "proteinID", peptideID = "peptideSeq") image(xx) xxAll <- greedy(xx) ``` ## Protein Inference for Reviewed/Canonical database ```{r greedyCanonical, fig.cap="Peptide protein machtes for the Canonical database. Rows - peptides, Columns - proteins, black - peptide protein match."} annotatedPrecursors <- merge(precursors , subset(annotCan, select = c(peptideSeq, proteinID)), by.x = "peptideSeq", by.y = "peptideSeq") xx <- prepareMatrix(annotatedPrecursors , proteinID = "proteinID", peptideID = "peptideSeq") image(xx) xxCAN <- greedy(xx) ``` # Conclusion We see that the number of proteins needed to explain all the peptides is practically identical for both databases. Also in practice using a database with more entries does not lead to more identified proteins. On the contrary, it might even reduce the number of porteins identified. ```{r fig.cap="Number of proteins before and after protein inference."} barplot(c(All_before = length(unique(annotAll$proteinID)), All_after = length(unique(unlist( xxAll ))) , Canonical_before = length(unique(annotCan$proteinID)), Canonical_after = length(unique(unlist( xxCAN ))))) ``` # TODO [Protein Grouping and Clustering in Scaffold](https://www.dropbox.com/s/a4kqyc4gxln8sfj/scaffold_protein_grouping_clustering.pdf?dl=1) # Session Info ```{r} sessionInfo() ```