Package: MsBackendSql
Authors: Johannes Rainer [aut, cre] (ORCID:
https://orcid.org/0000-0002-6977-7147),
Chong Tang [ctb],
Laurent Gatto [ctb] (ORCID: https://orcid.org/0000-0002-1520-2268)
Compiled: Tue Apr 15 19:24:34 2025
The Spectra Bioconductor package provides a flexible and
expandable infrastructure for Mass Spectrometry (MS) data. The package supports
interchangeable use of different backends that provide additional file support
or different ways to store and represent MS data. The
MsBackendSql package provides backends to store data from whole
MS experiments in SQL databases. The data in such databases can be easily (and
efficiently) accessed using Spectra objects that use the MsBackendSql class
as an interface to the data in the database. Such Spectra objects have a
minimal memory footprint and hence allow analysis of very large data sets even
on computers with limited hardware capabilities. For certain operations, the
performance of this data representation is superior to that of other low-memory
(on-disk) data representations such as Spectra’s MsBackendMzR backend.
Finally, the MsBackendSql supports also remote data access to e.g. a central
database server hosting several large MS data sets.
The package can be installed with the BiocManager package. To install
BiocManager use install.packages("BiocManager") and, after that,
BiocManager::install("MsBackendSql") to install this package.
MsBackendSql SQL databasesMsBackendSql SQL databases can be created either by importing (raw) MS data
from MS data files using the createMsBackendSqlDatabase() or using the
backendInitialize() function by providing in addition to the database
connection also the full MS data to import as a DataFrame. In the first
example we use the createMsBackendSqlDatabase() function to import the full MS
data from the provided MS data files into an (empty) database. Below we first
create an empty SQLite database (in a temporary file) and use the
createMsBackendSqlDatabase() function to create all necessary tables in that
database and import the MS data from two mzML files (from the r Biocpkg("msdata") package).
library(RSQLite)
dbfile <- tempfile()
con <- dbConnect(SQLite(), dbfile)
library(Spectra)
library(MsBackendSql)
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
createMsBackendSqlDatabase(con, fls)
dbDisconnect(con)By default (with parameters blob = TRUE and peaksStorageMode = "blob2") the
peaks data matrix of each spectrum is stored as a BLOB data type into the
database (one entry per spectrum). This has advantages on the performance to
extract the peaks data from the database, but does not allow to filter
individual peaks by their m/z or intensity values directly in the database. As
an alternative (using blob = FALSE) it is also possible to store the
individual m/z and intensity values in separate columns of the database
table. This long table format results however in considerably larger databases
(with potentially poorer performance). Note also that the code and backend is
optimized for MySQL/MariaDB databases by taking advantage of table partitioning
and specialized table storage options. Any other SQL database server is however
also supported (also portable, self-contained SQLite databases). In fact,
performance for MsBackendSql databases with peaks data stored as BLOB data
type is similar for SQLite and MySQL/MariaDB databases.
The MsBackendSql package provides two backends to interact with such
databases: the MsBackendSql class and the MsBackendOfflineSql class, that
inherits all properties and functions from the former, but does not store the
connection to the database within the object. The MsBackendOfflineSql object
thus supports parallel processing and allows to save/load the object (e.g. using
save and saveRDS). The MsBackendOfflineSql might therefore be used as the
preferred backend to SQL databases for most applications.
To access the data in the database we create below a Spectra object providing
the database connection information in the constructor call and specifying to
use the MsBackendOfflineSql as backend (parameter source). We stored the
data to a SQLite database, thus we provide the database name (SQLite database
file name) and the SQLite DBI driver with parameters dbname and drv. Which
parameters are required to connect to the database depends on the SQL database
and the used driver. For a MySQL/MariaDB database we would use the MariaDB()
driver and would have to provide the database name, user name, password as well
as the host name and port through which the database is accessible.
sps <- Spectra(dbname = dbfile, source = MsBackendOfflineSql(), drv = SQLite())
sps## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 35 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/RtmprpBYzG/file21750b75de4e07Spectra objects allow also to change the backend to any other backend
(extending MsBackend) using the setBackend() function. Below we use this
function to first load all data into memory by changing from the
MsBackendOfflineSql to a MsBackendMemory.
sps_mem <- setBackend(sps, MsBackendMemory())
sps_mem## MSn data (Spectra) with 1862 spectra in a MsBackendMemory backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 35 more variables/columns.
## Processing:
##  Switch backend from MsBackendOfflineSql to MsBackendMemory [Tue Apr 15 19:24:42 2025]With this function it is also possible to change from any backend to a
MsBackendOfflineSql (or MsBackendSql) in which case a new database is
created and all data from the originating backend is stored in this database. To
change the backend to an MsBackendOfflineSql we need to provide the connection
information to the SQL database as additional parameters. These parameters are
the same that need to be passed to a dbConnect() call to establish the
connection to the database. These parameters include the database driver
(parameter drv), the database name and eventually the user name, host etc (see
?dbConnect for more information). In the simple example below we store the
data into a SQLite database and thus only need to provide the database name,
which corresponds SQLite database file. In our example we store the data into a
temporary file. Optionally, setBackend() supports also the parameters blob
and peaksDataStorage described above for the createMsBackendSqlDatabase()
function.
sps2 <- setBackend(sps_mem, MsBackendOfflineSql(), drv = SQLite(),
                   dbname = tempfile())
sps2## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 35 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/RtmprpBYzG/file21750b5a315fe6
## Processing:
##  Switch backend from MsBackendOfflineSql to MsBackendMemory [Tue Apr 15 19:24:42 2025]
##  Switch backend from MsBackendMemory to MsBackendOfflineSql [Tue Apr 15 19:24:43 2025]Similar to any other Spectra object we can retrieve the available spectra
variables using the spectraVariables() function.
spectraVariables(sps)##  [1] "msLevel"                  "rtime"                   
##  [3] "acquisitionNum"           "scanIndex"               
##  [5] "dataStorage"              "dataOrigin"              
##  [7] "centroided"               "smoothed"                
##  [9] "polarity"                 "precScanNum"             
## [11] "precursorMz"              "precursorIntensity"      
## [13] "precursorCharge"          "collisionEnergy"         
## [15] "isolationWindowLowerMz"   "isolationWindowTargetMz" 
## [17] "isolationWindowUpperMz"   "peaksCount"              
## [19] "totIonCurrent"            "basePeakMZ"              
## [21] "basePeakIntensity"        "ionisationEnergy"        
## [23] "lowMZ"                    "highMZ"                  
## [25] "mergedScan"               "mergedResultScanNum"     
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"  
## [29] "injectionTime"            "filterString"            
## [31] "spectrumId"               "ionMobilityDriftTime"    
## [33] "scanWindowLowerLimit"     "scanWindowUpperLimit"    
## [35] "electronBeamEnergy"       "spectrum_id_"The MS peak data can be accessed using either the mz(), intensity() or
peaksData() functions. Below we extract the peaks matrix of the 5th spectrum
and display the first 6 rows.
peaksData(sps)[[5]] |>
head()##            mz intensity
## [1,] 105.0347         0
## [2,] 105.0362       164
## [3,] 105.0376         0
## [4,] 105.0391         0
## [5,] 105.0405       328
## [6,] 105.0420         0All data (peaks data or spectra variables) are always retrieved on-the-fly
from the database resulting thus in a minimal memory footprint for the Spectra
object.
print(object.size(sps), units = "KB")## 114.6 KbThe backend supports also adding additional spectra variables or changing their values. Below we add 10 seconds to the retention time of each spectrum.
sps$rtime <- sps$rtime + 10Such operations do however not change the data in the database (which is always considered read-only) but are cached locally within the backend object (in memory). The size in memory of the object is thus higher after changing that spectra variable.
print(object.size(sps), units = "KB")## 129.2 KbSuch $<- operations can also be used to cache spectra variables
(temporarily) in memory which can eventually improve performance. Below we test
the time it takes to extract the MS level from each spectrum from the database,
then cache the MS levels in memory using $msLevel <- and test the timing to
extract these cached variable.
system.time(msLevel(sps))##    user  system elapsed 
##   0.015   0.000   0.015sps$msLevel <- msLevel(sps)
system.time(msLevel(sps))##    user  system elapsed 
##   0.007   0.000   0.007We can also use the reset() function to reset the data to its original state
(this will cause any local spectra variables to be deleted and the backend to be
initialized with the original data in the database).
sps <- reset(sps)The need to retrieve any spectra data on-the-fly from the database has an impact
on the performance of data access functions of Spectra objects using
MsBackendSql/MsBackendOfflineSql backends. To evaluate this we compare below
the performance of the MsBackendSql to other Spectra backends, specifically,
the MsBackendMzR which is the default backend to read and represent raw MS
data, and the MsBackendMemory backend that keeps all MS data in memory (and is
thus not suggested for larger MS experiments). Similar to the MsBackendMzR,
also the MsBackendSql keeps only a limited amount of data in memory. These
on-disk backends need thus to retrieve spectra and MS peaks data on-the-fly
from either the original raw data files (in the case of the MsBackendMzR) or
from the SQL database (in the case of the MsBackendSql). The in-memory backend
MsBackendMemory is supposed to provide the fastest data access since all data
is kept in memory.
Below we thus create Spectra objects from the same data but using the
different backends.
con <- dbConnect(SQLite(), dbfile)
sps <- Spectra(con, source = MsBackendSql())
sps_mzr <- Spectra(fls, source = MsBackendMzR())
sps_im <- setBackend(sps_mzr, backend = MsBackendMemory())At first we compare the memory footprint of the 3 backends.
print(object.size(sps), units = "KB")## 112.9 Kbprint(object.size(sps_mzr), units = "KB")## 401.4 Kbprint(object.size(sps_im), units = "KB")## 54509.1 KbThe MsBackendSql has the lowest memory footprint of all 3 backends because it
does not keep any data in memory. The MsBackendMzR keeps all spectra
variables, except the MS peaks data, in memory and has thus a larger size. The
MsBackendMemory keeps all data (including the MS peaks data) in memory and has
thus the largest size in memory.
Next we compare the performance to extract the MS level for each spectrum from
the 4 different Spectra objects.
library(microbenchmark)
microbenchmark(msLevel(sps),
               msLevel(sps_mzr),
               msLevel(sps_im))## Unit: microseconds
##              expr      min        lq        mean    median         uq       max
##      msLevel(sps) 9075.515 9492.4720 10174.27107 9806.2070 10968.7150 12954.522
##  msLevel(sps_mzr)  626.023  663.1530   892.68709  709.9900   784.5570  8982.692
##   msLevel(sps_im)   14.818   22.6545    39.44521   41.3245    51.4205    73.248
##  neval cld
##    100 a  
##    100  b 
##    100   cExtracting MS levels is thus slowest for the MsBackendSql, which is not
surprising because both other backends keep this data in memory while the
MsBackendSql needs to retrieve it from the database.
We next compare the performance to access the full peaks data from each
Spectra object.
microbenchmark(peaksData(sps, BPPARAM = SerialParam()),
               peaksData(sps_mzr, BPPARAM = SerialParam()),
               peaksData(sps_im, BPPARAM = SerialParam()),
               times = 10)## Unit: milliseconds
##                                         expr        min         lq        mean
##      peaksData(sps, BPPARAM = SerialParam())  63.534895  89.317356  199.122897
##  peaksData(sps_mzr, BPPARAM = SerialParam()) 695.335248 735.872955 1063.021581
##   peaksData(sps_im, BPPARAM = SerialParam())   1.175141   1.282965    4.648899
##      median          uq        max neval cld
##  124.245089  147.784205  667.32555    10  a 
##  786.181355 1357.766999 1834.46520    10   b
##    1.655021    2.076847   31.48762    10  aAs expected, the MsBackendMemory has the fasted access to the full peaks
data. The MsBackendSql outperforms however the MsBackendMzR providing faster
access to the m/z and intensity values.
Performance can be improved for the MsBackendMzR using parallel
processing. Note that the MsBackendSql does not support parallel
processing and thus parallel processing is (silently) disabled in functions such
as peaksData().
m2 <- MulticoreParam(2)
microbenchmark(peaksData(sps, BPPARAM = m2),
               peaksData(sps_mzr, BPPARAM = m2),
               peaksData(sps_im, BPPARAM = m2),
               times = 10)## Unit: microseconds
##                              expr        min         lq        mean      median
##      peaksData(sps, BPPARAM = m2)  57089.097  78698.172  149405.337   97000.369
##  peaksData(sps_mzr, BPPARAM = m2) 707535.467 762801.732 1144929.600 1018310.785
##   peaksData(sps_im, BPPARAM = m2)    599.237    956.972    1531.655    1397.655
##           uq         max neval cld
##   102026.638  679687.762    10  a 
##  1481558.941 1973334.180    10   b
##     2038.081    3002.343    10  aWe next compare the performance of subsetting operations.
microbenchmark(filterRt(sps, rt = c(50, 100)),
               filterRt(sps_mzr, rt = c(50, 100)),
               filterRt(sps_im, rt = c(50, 100)))## Unit: microseconds
##                                expr      min       lq      mean   median
##      filterRt(sps, rt = c(50, 100)) 3084.820 3364.604 3885.9162 3583.426
##  filterRt(sps_mzr, rt = c(50, 100)) 1987.844 2234.152 2612.4719 2434.856
##   filterRt(sps_im, rt = c(50, 100))  635.672  726.579  799.8071  768.355
##        uq       max neval cld
##  3923.035 23471.827   100 a  
##  2563.178 14831.049   100  b 
##   831.446  2828.657   100   cThe two on-disk backends MsBackendSql and MsBackendMzR show a comparable
performance for this operation. This filtering does involves access to a spectra
variables (the retention time in this case) which, for the MsBackendSql needs
first to be retrieved from the backend. The MsBackendSql backend allows
however also to cache spectra variables (i.e. they are stored within the
MsBackendSql object). Any access to such cached spectra variables can
eventually be faster because no dedicated SQL query is needed.
To evaluate the performance of a pure subsetting operation we first define the
indices of 10 random spectra and subset the Spectra objects to these.
idx <- sample(seq_along(sps), 10)
microbenchmark(sps[idx],
               sps_mzr[idx],
               sps_im[idx])## Unit: microseconds
##          expr      min        lq      mean    median        uq      max neval
##      sps[idx]  203.498  221.6815  243.7263  250.3315  260.0285  347.724   100
##  sps_mzr[idx] 1028.232 1036.8395 1057.6455 1043.0985 1055.4180 1973.849   100
##   sps_im[idx]  307.361  327.2190  357.7035  343.9675  361.0475 1631.650   100
##  cld
##  a  
##   b 
##    cHere the MsBackendSql outperforms the other backends because it does not keep
any data in memory and hence does not need to subset these. The two other
backends need to subset the data they keep in memory which is in both cases a
data frame with either a reduced set of spectra variables or the full MS data.
At last we compare also the extraction of the peaks data from the such subset
Spectra objects.
sps_10 <- sps[idx]
sps_mzr_10 <- sps_mzr[idx]
sps_im_10 <- sps_im[idx]
microbenchmark(peaksData(sps_10),
               peaksData(sps_mzr_10),
               peaksData(sps_im_10),
               times = 10)## Unit: microseconds
##                   expr        min         lq        mean      median         uq
##      peaksData(sps_10)   1347.151   2216.415   2884.6083   2817.9625   3425.109
##  peaksData(sps_mzr_10) 131915.979 134734.611 139034.6948 139896.5860 142934.359
##   peaksData(sps_im_10)    315.723    343.110    572.9997    521.1625    748.791
##         max neval cld
##    4926.594    10  a 
##  146907.296    10   b
##     997.407    10  aThe MsBackendSql outperforms the MsBackendMzR while, not unexpectedly, the
MsBackendMemory provides fasted access.
The backends from the MsBackendSql package use standard SQL calls to retrieve MS data from the database and hence any SQL database system (for which an R package is available) is supported. SQLite-based databases would represent the easiest and most user friendly solution since no database server administration and user management is required. Indeed, performance of SQLite is very high, even for very large data sets. Server-based databases on the other hand have the advantage to enable a centralized storage and control of MS data (inclusive user management etc). Also, such server systems would also allow data set or server-specific configurations to improve performance.
A comparison between a SQLite-based with a MariaDB-based MsBackendSql database for a large data set comprising over 8,000 samples and over 15,000,000 spectra is available here. In brief, performance to extract data was comparable and for individual spectra variables even faster for the SQLite database. Only when more complex SQL queries were involved (combining several primary keys or data fields) the more advanced MariaDB database outperformed SQLite.
MsBackendSqlThe MsBackendSql backend does not support parallel processing since the
database connection can not be shared across the different (parallel)
processes. Thus, all methods on Spectra objects that use a MsBackendSql will
automatically (and silently) disable parallel processing even if a dedicated
parallel processing setup was passed along with the BPPARAM method.
Some functions on Spectra objects require to load the MS peak data (i.e., m/z
and intensity values) into memory. For very large data sets (or computers with
limited hardware resources) such function calls can cause out-of-memory
errors. One example is the lengths() function that determines the number of
peaks per spectrum by loading the peak matrix first into memory. Such functions
should ideally be called using the peaksapply() function with parameter
chunkSize (e.g., peaksapply(sps, lengths, chunkSize = 5000L)). Instead of
processing the full data set, the data will be first split into chunks of size
chunkSize that are stepwise processed. Hence, only data from chunkSize
spectra is loaded into memory in one iteration.
The MsBackendSql provides an MS data representations and storage mode with a
minimal memory footprint (in R) that is still comparably efficient for standard
processing and subsetting operations. This backend is specifically useful for
very large MS data sets, that could even be hosted on remote (MySQL/MariaDB)
servers. A potential use case for this backend could thus be to set up a central
storage place for MS experiments with data analysts connecting remotely to this
server to perform initial data exploration and filtering. After subsetting to a
smaller data set of interest, users could then retrieve/download this data by
changing the backend to e.g. a MsBackendMemory, which would result in a
download of the full data to the user computer’s memory.
sessionInfo()## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] microbenchmark_1.5.0 RSQLite_2.3.9        MsBackendSql_1.8.0  
## [4] Spectra_1.18.0       BiocParallel_1.42.0  S4Vectors_0.46.0    
## [7] BiocGenerics_0.54.0  generics_0.1.3       BiocStyle_2.36.0    
## 
## loaded via a namespace (and not attached):
##  [1] sandwich_3.1-1         sass_0.4.10            MsCoreUtils_1.20.0    
##  [4] lattice_0.22-7         stringi_1.8.7          hms_1.1.3             
##  [7] digest_0.6.37          grid_4.5.0             evaluate_1.0.3        
## [10] bookdown_0.43          mvtnorm_1.3-3          fastmap_1.2.0         
## [13] blob_1.2.4             Matrix_1.7-3           jsonlite_2.0.0        
## [16] ProtGenerics_1.40.0    progress_1.2.3         mzR_2.42.0            
## [19] DBI_1.2.3              survival_3.8-3         multcomp_1.4-28       
## [22] BiocManager_1.30.25    TH.data_1.1-3          codetools_0.2-20      
## [25] jquerylib_0.1.4        cli_3.6.4              rlang_1.1.6           
## [28] crayon_1.5.3           Biobase_2.68.0         splines_4.5.0         
## [31] bit64_4.6.0-1          cachem_1.1.0           yaml_2.3.10           
## [34] tools_4.5.0            parallel_4.5.0         memoise_2.0.1         
## [37] ncdf4_1.24             fastmatch_1.1-6        vctrs_0.6.5           
## [40] R6_2.6.1               zoo_1.8-14             lifecycle_1.0.4       
## [43] fs_1.6.6               IRanges_2.42.0         bit_4.6.0             
## [46] clue_0.3-66            MASS_7.3-65            cluster_2.1.8.1       
## [49] pkgconfig_2.0.3        bslib_0.9.0            data.table_1.17.0     
## [52] Rcpp_1.0.14            xfun_0.52              knitr_1.50            
## [55] htmltools_0.5.8.1      rmarkdown_2.29         compiler_4.5.0        
## [58] prettyunits_1.2.0      MetaboCoreUtils_1.16.0