The AWAggregator package implements an
attribute-weighted aggregation algorithm which leverages
peptide-spectrum match (PSM) attributes to provide a more accurate
estimate of protein abundance compared to conventional aggregation
methods. This algorithm employs pre-trained random forest models to
predict the quantitative inaccuracy of PSMs based on their attributes.
PSMs are then aggregated to the protein level using a weighted average,
taking the predicted inaccuracy into account. Additionally, the package
allows users to construct their own training sets that are more relevant
to their specific experimental conditions if desired.
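The weighted-average step described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the PSM table, the column names, and the choice of weights as the reciprocal of the predicted inaccuracy are all assumptions made for illustration only.

```r
# Hypothetical PSM-level data: two proteins, one reporter ion channel
psm <- data.frame(
    Protein = c("P1", "P1", "P1", "P2", "P2"),
    Intensity = c(1.0, 1.2, 3.0, 0.5, 0.7),
    # Hypothetical predicted quantitative inaccuracy (larger = less reliable)
    PredInaccuracy = c(0.1, 0.1, 1.0, 0.2, 0.2)
)

# Assumption: weight each PSM by the reciprocal of its predicted inaccuracy
psm$Weight <- 1 / psm$PredInaccuracy

# Weighted average of PSM intensities within each protein
weighted <- sapply(
    split(psm, psm$Protein),
    function(d) weighted.mean(d$Intensity, d$Weight)
)

# Unweighted mean for comparison
unweighted <- tapply(psm$Intensity, psm$Protein, mean)

print(rbind(weighted, unweighted))
```

In this toy example, the third PSM of P1 has a high predicted inaccuracy, so it contributes little to the weighted estimate compared with the plain mean.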
Since ExperimentHub can only retrieve data from the
AWAggregatorData package with Bioconductor version 3.21 or
later, please use the legacy version of the AWAggregator
package if you are using an earlier Bioconductor version: https://github.com/Tan-Jiahua/AWAggregator-compat
Functions available in the AWAggregator package:
getDistMetric(): Calculates the distance metric for
PSMs. The distance metric reflects whether the quantified ratio of each
pair of samples of a PSM diverges from that of other PSMs in the same
redundant/unique group. The redundant group, unique group, and distance
metric were originally defined in the iPQF method. Please refer to
“iPQF: a new peptide-to-protein summarization method using peptide
spectra characteristics to improve protein quantification” for more
details.
getPSMAttributes(): Retrieves attributes required
for training or test sets.
getAvgScaledErrorOfLog2FC(): Calculates the Average
Scaled Error of log2FC values required for training sets.
mergeTrainingSets(): Extracts a similar number of
PSMs from each input dataset and merges them into a single training
set.
fitQuantInaccuracyModel(): Trains a random forest
model to predict the level of quantitative inaccuracy of PSMs.
aggregateByAttributes(): Aggregates PSMs using a
random forest model.
convertPDFormat(): Converts output from Proteome
Discoverer into the input format required by
AWAggregator.
Function available in the associated AWAggregatorData
package:
loadQuantInaccuracyModel(): Loads a pre-trained random
forest model for predicting the level of quantitative inaccuracy of
PSMs.
Data available in the AWAggregator package:
sample.PSM.FP: represents sample PSMs mapped to the
proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the
psm.tsv output file generated by FragPipe. Columns
unnecessary for the AWAggregator have been removed from the
sample data.
sample.prot.PD: represents sample proteins A0AV96,
A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the TXT export of the
proteins page in the Proteome Discoverer search results. Columns
unnecessary for the AWAggregator have been removed from the
sample data.
sample.PSM.PD: represents sample PSMs mapped to the
proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the
TXT export of the PSMs page in the Proteome Discoverer search results.
Columns unnecessary for the AWAggregator have been removed
from the sample data.
Data available in the associated AWAggregatorData
package:
regr: represents the pre-trained random forest model
that incorporates the average coefficient of variation (CV) as a
feature.
regr.no.CV: represents the pre-trained random forest
model that does not include the average CV as a feature.
benchmark.set.1, benchmark.set.2,
benchmark.set.3: represent PSMs in Benchmark Sets 1 to 3,
derived from the psm.tsv output files generated by
FragPipe, which are used to train the random forest model. Columns
unnecessary for the AWAggregator have been removed from these
datasets.
The AWAggregator package and the associated
AWAggregatorData package can be installed from
Bioconductor.
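A typical installation could look like the sketch below. It uses the standard BiocManager workflow for Bioconductor packages and assumes a Bioconductor release in which both packages are available.

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install both packages from Bioconductor
BiocManager::install(c("AWAggregator", "AWAggregatorData"))
```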
Load the AWAggregator package and the
AWAggregatorData package.
library(AWAggregator)
library(AWAggregatorData)
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
In this example, we aggregate the reporter ion intensities of PSMs to
the protein level. We use the sample dataset sample.PSM.FP,
included in the AWAggregator package and derived from the
psm.tsv output file generated by FragPipe. This dataset
includes reporter ion intensities from nine samples, labeled from
Sample 1 to Sample 9, without replicates. The
PSMs are mapped to the following proteins: A0AV96, A0AVF1, A0AVT1,
A0FGR8, and A0M8Q6, with unnecessary columns removed for clarity.
This example demonstrates the basic functionality of the
AWAggregator package using the default pre-trained
model.
# Load the pre-trained random forest model that does not use the average CV as
# a feature. The average CV is the average coefficient of variation (in percent)
# of the processed PSM reporter ion intensities across replicate groups. It is
# recommended to load the model with the average CV when replicates are
# available; otherwise, use the model without the average CV
data(sample.PSM.FP)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
groups <- samples
df <- getPSMAttributes(
    PSM=sample.PSM.FP,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
    # fixed post-translational modifications (PTMs)
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
## These groups are automatically removed when average CV is calculated because of lack of replicates:
## Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9
## There are no replicates so average CV will not be generated as an attribute.
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
The output data frame provides estimates of protein abundance.
Protein Sample 1 Sample 2 Sample 3 Sample 4 ...
sp|A0AV96|RBM47_HUMAN 0.9292177 1.0111264 0.7933874 0.9606382 ...
sp|A0AVF1|IFT56_HUMAN 0.6646691 0.6600642 0.6696656 0.7984397 ...
sp|A0AVT1|UBA6_HUMAN 1.1883116 1.1752203 1.0482381 1.0910095 ...
sp|A0FGR8|ESYT2_HUMAN 0.9304190 0.8504465 1.0550898 0.7952998 ...
sp|A0M8Q6|IGLC7_HUMAN 0.4205675 0.6393757 0.7475482 0.6968704 ...
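The messages above report that groups without replicates are dropped when the average CV is calculated. The following base R sketch illustrates that idea for a single PSM. It assumes that the CV of a group is the standard deviation divided by the mean, expressed as a percentage, and that the average is taken over groups with at least two replicates. The helper `avgCV()` and the example values are hypothetical and do not reproduce the package's internal processing.

```r
# Hypothetical helper: average CV (%) of one PSM across replicate groups
avgCV <- function(intensities, groups) {
    byGroup <- split(intensities, groups)
    # Keep only groups with at least two replicates
    byGroup <- byGroup[lengths(byGroup) >= 2]
    if (length(byGroup) == 0) {
        return(NA_real_)
    }
    cvs <- sapply(byGroup, function(x) sd(x) / mean(x) * 100)
    mean(cvs)
}

# Two groups with replicates (A, B) and one without (C)
intensities <- c(100, 110, 200, 180, 50)
groups <- c("A", "A", "B", "B", "C")
avgCV(intensities, groups)

# No replicates at all, as in the sample dataset: no average CV is available
avgCV(c(100, 200, 300), c("Sample 1", "Sample 2", "Sample 3"))
```

When every group contains a single sample, as in this example dataset, no CV can be computed, which is why the model without the average CV is used.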
In this example, we convert the search result from Proteome
Discoverer to the format required by AWAggregator and
aggregate the reporter ion intensities of PSMs to the protein level. We
use the sample dataset sample.PSM.PD, alongside its
corresponding protein table sample.prot.PD, both included
in the AWAggregator package. These files are derived from
the TXT exports of the proteins and PSMs pages in the search results
from Proteome Discoverer. This dataset includes reporter ion intensities
from nine samples, labeled from Sample 1 to
Sample 9, without replicates. The PSM and protein tables
contain the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6,
with unnecessary columns removed for clarity.
# Load the pre-trained random forest model that does not use the average CV as
# a feature. The average CV is the average coefficient of variation (in percent)
# of the processed PSM reporter ion intensities across replicate groups. It is
# recommended to load the model with the average CV when replicates are
# available; otherwise, use the model without the average CV
data(sample.PSM.PD)
data(sample.prot.PD)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
groups <- samples
df <- convertPDFormat(
    PSM=sample.PSM.PD,
    protein=sample.prot.PD,
    colOfReporterIonInt=samples
)
df <- getPSMAttributes(
    PSM=df,
    # TMT tag and carbamidomethylation are applied as static PTMs
    fixedPTMs=c('TMT6plex', 'Carbamidomethyl'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
## These groups are automatically removed when average CV is calculated because of lack of replicates:
## Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9
## There are no replicates so average CV will not be generated as an attribute.
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
The output data frame provides estimates of protein abundance.
Protein Sample 1 Sample 2 Sample 3 Sample 4 ...
A0AV96_Homo sapiens 0.9392033 0.9514846 0.7096284 0.9393484 ...
A0AVF1_Homo sapiens 0.6591366 0.6534372 0.7121089 0.7741971 ...
A0AVT1_Homo sapiens 1.2035820 1.1647425 1.0494833 1.1121796 ...
A0FGR8_Homo sapiens 0.9664924 0.8391658 1.0946545 0.7832414 ...
A0M8Q6_Homo sapiens 0.3516833 0.4695273 0.7225070 0.6042526 ...
Retraining the AWA model with additional spike-in datasets can increase the number of quantified PSMs in the merged training set, and hence the robustness of the correlation. Retraining with experiment-specific in-house spike-in datasets may provide further benefits, because such data better represent the hardware and acquisition modes actually used.
In this example, we create a training set by merging three benchmark
spike-in datasets (benchmark.set.1,
benchmark.set.2, and benchmark.set.3), all
included in the AWAggregatorData package and derived from the
psm.tsv output files generated by FragPipe. This combined
training set is then used to train a random forest model.
We load the spike-in datasets using the ExperimentHub
package. These datasets correspond to the sets described in the
AWAggregator publication. You may substitute your own
spike-in datasets if desired.
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
Firstly, we calculate the attributes and the values of Average Scaled
Error of log2FC in benchmark.set.1.
library(stringr)
# Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
    grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
PSM1 <- getPSMAttributes(
    PSM=benchmarkSet1,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
    # fixed PTMs
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups
)
PSM1 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM1,
    colOfReporterIonInt=samples,
    groups=groups,
    # As the original work indicates, the actual protein fold change may
    # deviate from the intended value after TMT labelling when H1+E6 is
    # involved. Therefore, H1+E6 is not used in the calculation of the
    # Average Scaled Error of log2FC
    expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
    speciesAtConstLevel='HUMAN'
)
Secondly, we calculate the attributes and the values of Average
Scaled Error of log2FC in benchmark.set.2.
benchmark.set.2 consists of three separate mass
spectrometry runs, indicated by the Replicate column. Because of
potential run-specific differences, each run is processed individually
using the lapply function, and the results are merged using the
bind_rows function.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, setequal, union
## The following object is masked from 'package:generics':
##
## explain
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_3')
samples <- colnames(benchmarkSet2)[
    grep('H1[+]Y[0-9]+_[1-3]', colnames(benchmarkSet2))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
# Process each replicate separately using lapply()
# lapply() loops over all unique replicate IDs in benchmarkSet2.
# 'X' is the current replicate ID.
tmp <- lapply(unique(benchmarkSet2$Replicate), FUN=function(X){
    # Select PSMs from the current replicate X
    df <- benchmarkSet2[benchmarkSet2$Replicate == X, ]
    df <- getPSMAttributes(
        PSM=df,
        fixedPTMs=c('229.1629', '57.0214'),
        colOfReporterIonInt=samples,
        groups=groups,
        setProgressBar=FALSE
    )
    df <- getAvgScaledErrorOfLog2FC(
        PSM=df,
        colOfReporterIonInt=samples,
        groups=groups,
        expectedRelativeAbundance=list(`H1+Y1`=1, `H1+Y4`=4, `H1+Y10`=10),
        speciesAtConstLevel='HUMAN'
    )
    # Return the processed PSMs from the current replicate
    return(df)
})
# Combine results from all replicates into one data frame
PSM2 <- bind_rows(tmp)
Thirdly, we calculate the attributes and the values of Average Scaled
Error of log2FC in benchmark.set.3.
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM3 <- getPSMAttributes(
    PSM=benchmarkSet3,
    fixedPTMs=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups,
    # The signals for yeast PSMs in group H1+Y0 come entirely from noise, so
    # they are not used for calculating the average CV
    groupsExcludedFromCV='H1+Y0'
)
## These groups are removed when average CV is calculated because of the setting of groupsExcludedFromCV:
## H1+Y0
Next, we merge these three datasets into a new training set. The minimum number of PSMs to extract from each dataset is set to the number of PSMs in the smallest dataset. Because complete sets of PSMs mapped to the selected proteins are extracted, the final PSM count from each dataset is equal to or slightly larger than this preset value.
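The selection logic described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code of mergeTrainingSets(): the helpers `sampleWholeProteins()` and `makeSet()`, the `Protein` column, the random protein order, and the toy datasets are all hypothetical.

```r
# Hypothetical helper: draw whole proteins at random until at least
# `numPSMs` PSMs have been collected from one dataset
sampleWholeProteins <- function(psm, numPSMs, seed = 1) {
    set.seed(seed)
    proteins <- sample(unique(psm$Protein))
    # Number of PSMs per protein, in the shuffled order
    counts <- as.vector(table(psm$Protein)[proteins])
    # Smallest number of proteins whose PSMs reach the target
    n <- which(cumsum(counts) >= numPSMs)[1]
    psm[psm$Protein %in% proteins[seq_len(n)], , drop = FALSE]
}

# Toy dataset with `nProt` proteins and `psmPerProt` PSMs per protein
makeSet <- function(nProt, psmPerProt) {
    data.frame(Protein = rep(paste0("P", seq_len(nProt)), each = psmPerProt))
}
sets <- list(makeSet(10, 3), makeSet(20, 4), makeSet(5, 7))

# Target size: the number of PSMs in the smallest dataset
target <- min(sapply(sets, nrow))

# Extract whole proteins from each dataset and stack the results
merged <- do.call(rbind, lapply(seq_along(sets), function(i) {
    cbind(Set = i, sampleWholeProteins(sets[[i]], target))
}))
table(merged$Set)
```

Because proteins are never split, each dataset contributes at least `target` PSMs, and possibly a few more when the last selected protein overshoots the target.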
Train a new random forest model using Average CV as an attribute.
## Growing trees.. Progress: 40%. Estimated remaining time: 46 seconds.
## Growing trees.. Progress: 81%. Estimated remaining time: 14 seconds.
## Model training time = 1.3340398311615 minutes
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 stringr_1.5.2 AWAggregatorData_0.99.4
## [4] ExperimentHub_2.99.6 AnnotationHub_3.99.6 BiocFileCache_2.99.6
## [7] dbplyr_2.5.1 BiocGenerics_0.55.4 generics_0.1.4
## [10] AWAggregator_1.1.0 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.49.2 toOrdinal_1.3-0.0 xfun_0.54
## [4] bslib_0.9.0 httr2_1.2.1 Biobase_2.69.1
## [7] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1
## [10] stats4_4.5.1 curl_7.0.0 tibble_3.3.0
## [13] AnnotationDbi_1.71.2 RSQLite_2.4.3 blob_1.2.4
## [16] pkgconfig_2.0.3 Matrix_1.7-4 S4Vectors_0.49.0
## [19] lifecycle_1.0.4 compiler_4.5.1 Biostrings_2.79.1
## [22] brio_1.1.5 progress_1.2.3 Seqinfo_1.1.0
## [25] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
## [28] sass_0.4.10 yaml_2.3.10 pillar_1.11.1
## [31] crayon_1.5.3 jquerylib_0.1.4 tidyr_1.3.1
## [34] cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37
## [37] stringi_1.8.7 purrr_1.1.0 BiocVersion_3.23.1
## [40] maketools_1.3.2 fastmap_1.2.0 grid_4.5.1
## [43] cli_3.6.5 magrittr_2.0.4 withr_3.0.2
## [46] prettyunits_1.2.0 filelock_1.0.3 rappdirs_0.3.3
## [49] bit64_4.6.0-1 XVector_0.51.0 httr_1.4.7
## [52] rmarkdown_2.30 Peptides_2.4.6 bit_4.6.0
## [55] ranger_0.17.0 png_0.1-8 hms_1.1.4
## [58] memoise_2.0.1 evaluate_1.0.5 knitr_1.50
## [61] IRanges_2.45.0 testthat_3.2.3 rlang_1.1.6
## [64] Rcpp_1.1.0 glue_1.8.0 DBI_1.2.3
## [67] BiocManager_1.30.26 jsonlite_2.0.0 R6_2.6.1