---
title: "Using JohnsonKinaseData to predict kinase-substrate relationships"
author: "Florian Geier"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('JohnsonKinaseData')`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{JohnsonKinaseData}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: JohnsonKinaseData.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

Johnson et al. [@Johnson2023] published substrate affinities for 303 human serine/threonine-specific kinases in the form of position-specific weight matrices (PWMs). The `r Biocpkg("JohnsonKinaseData")` package provides access to these PWMs, along with basic functionality to match user-provided phosphosites against all kinase PWMs. The aim is to give the user a simple way of predicting kinase-substrate relationships based on PWM-phosphosite matching. These predictions can serve to infer kinase activity from differential phospho-proteomic data.

# Installation

The `r Biocpkg("JohnsonKinaseData")` package can be installed using the following code:

```{r installation, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ExperimentHub")
BiocManager::install("JohnsonKinaseData")
```

# Loading kinase PWMs

The kinase PWMs can be accessed with the `getKinasePWM()` function. It returns a list of 303 human serine/threonine-specific PWMs.

```{r, load-pwm}
library(JohnsonKinaseData)

pwms <- getKinasePWM()

head(names(pwms))
```

Each PWM is a numeric matrix with amino acids as rows and positions as columns. Matrix elements are log2-odds scores measuring differential affinity relative to a random frequency of amino acids [@Johnson2023].

```{r pwm-example}
pwms[["PLK2"]]
```

Besides the 20 standard amino acids, phosphorylated serine, threonine and tyrosine residues are also included.
These phosphorylated residues are distinct from the central phospho-acceptor (serine/threonine at position `0`) and can have a strong impact on the affinity of a given kinase-substrate pair (phospho-priming).

The central phospho-acceptor site is located at position `0` and only measures the favorability of serine over threonine. The user can exclude this favorability measure by setting the parameter `includeSTfavorability` to `FALSE`, in which case the central position doesn't contribute to the PWM score.

```{r, pwm-st}
pwms2 <- getKinasePWM(includeSTfavorability=FALSE)
```

# Processing user-provided phosphosites

Phosphorylated peptides are often represented in one of two formats: (1) phosphorylated residues are indicated by an asterisk, as in `SAGLLS*DEDC`, or (2) phosphorylated residues are given as lower-case letters, as in `SAGLLsDEDC`.

In order to unify the phosphosite representation for PWM matching, `r Biocpkg("JohnsonKinaseData")` provides the function `processPhosphopeptides()`. It takes a character vector of phosphopeptides, aligns them to the central phospho-acceptor position, and pads and/or truncates the surrounding residues such that each processed site consists of 5 upstream residues, a central acceptor and 4 downstream residues. The central phospho-acceptor position is defined as the left-closest position to the midpoint of the peptide, given by `floor(nchar(sites)/2)+1`. This midpoint is also the default alignment position if no phosphorylated residue is recognized.

```{r}
ppeps <- c("SAGLLS*DEDC", "GDtND", "EKGDSN__", "HKRNyGsDER", "PEKS*GyNV")

sites <- processPhosphopeptides(ppeps)

sites
```

If a peptide contains several phosphorylated residues, the option `onlyCentralAcceptor` controls how the acceptor position is selected. Setting `onlyCentralAcceptor=FALSE` returns all possible aligned phosphosites for a given input peptide. Note that in this case the output is not parallel to the input.

```{r}
sites <- processPhosphopeptides(ppeps, onlyCentralAcceptor=FALSE)

sites
```

# Scoring of user-provided phosphosites

Once peptides are processed to sites, the function `scorePhosphosites()` can be used to create a matrix of kinase-substrate match scores.

```{r}
selected <- sites |>
  dplyr::filter(acceptor %in% c('S','T')) |>
  dplyr::pull(processed)

scores <- scorePhosphosites(pwms, selected)

dim(scores)

scores[,1:5]
```

The PWM scoring can be parallelized by supplying a `BiocParallelParam` object to `BPPARAM=`.

```{r}
scores <- scorePhosphosites(pwms, selected, BPPARAM=BiocParallel::SerialParam())
```

By default, the resulting score is the log2-odds score of the PWM. Alternatively, setting `scoreType="percentile"` calculates a percentile rank of the log2-odds score. For each PWM, the rank is taken against a background score distribution derived by matching that PWM to the 85,603 unique phosphosites published in Johnson et al. 2023.

```{r}
scores <- scorePhosphosites(pwms, selected, scoreType="percentile")

scores[,1:5]
```

Quantifying PWM matches by percentile rank was first described in Yaffe et al. 2001 [@Yaffe2001]. It is also the matching score underlying the kinase activity predictions published in Johnson et al. 2023 [@Johnson2023].

Note that these percentile ranks do not account for phospho-priming, as non-central phosphorylated residues were missing from the background sites published in Johnson et al. That is, the score distributions derived from the background sites do not reflect the impact of phospho-priming.

# Session info

```{r session-info}
sessionInfo()
```

# References