--- title: "Using and understanding a Chromatograms object" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Using and understanding a Chromatograms object} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignettePackage{Chromatograms} %\VignetteDepends{Chromatograms,BiocStyle,mzR,msdata,Spectra,MsBackendMetaboLights} --- ```{r style, echo = FALSE, results = 'asis', message=FALSE} BiocStyle::markdown() ``` **Package**: `r BiocStyle::pkg_ver('Chromatograms')`
**Compiled**: `r date()` ```{r, echo = FALSE, message = FALSE} library(Chromatograms) library(BiocStyle) register(SerialParam()) ``` # Introduction The *Chromatograms* package provides a scalable and flexible infrastructure to represent, retrieve, and handle chromatographic data. The `Chromatograms` object offers a standardized interface to access and manipulate chromatographic data while supporting various ways to store and retrieve this data through the concept of exchangeable *backends*. This vignette provides general examples and descriptions for the *Chromatograms* package. Contributions to this vignette (content or correction of typos) or requests for additional details and information are highly welcome, ideally *via* pull requests or *issues* on the package's [github repository](https://github.com/RforMassSpectrometry/Chromatograms). This vignette describe the structure of a Chromatograms object and the methods that need to be implemented. In order to see the structure of the object, the `@` accessor is used to access the different slots. This is possible because the `Chromatograms` class is defined as an S4 class. However users should not use the `@` accessor to access the data stored in a `Chromatograms` object, but instead use the methods defined by the `Chromatograms` class. # Installation The package can be installed with the *BiocManager* package. To install *BiocManager*, use `install.packages("BiocManager")`, and after that, use `BiocManager::install("Chromatograms")` to install *Chromatograms*. # The Chromatograms object The `Chromatograms` object is a container for chromatographic data, which includes peaks data (*retention time* and related intensity values, also\ referred to as *peaks data variables* in the context of `Chromatograms`) and metadata of individual chromatograms (so-called *chromatogram variables*). While a core set of chromatogram variables (the `coreChromatogramsVariables()`) and peaks data variables (the `corePeaksVariables()`) are guaranteed to be provided by a `Chromatograms`, it is possible to add arbitrary variables to a `Chromatograms` object. The `Chromatograms` object is designed to contain chromatographic data for a (large) set of chromatograms. The data is organized *linearly* and can be thought of as a list of chromatograms, where each element in the `Chromatograms` is one chromatogram. ## Available backends Backends allow to use different *backends* to store chromatographic data while providing *via* the `Chromatograms` class a unified interface to use that data. The `Chromatograms` package defines a set of example backends but any object extending the base `ChromBackend` class could be used instead. The default backends are: - `ChromBackendMemory`: the *default* backend to store data in memory. Due to its design the `ChromBackendMemory` provides fast access to the peaks data and metadata. Since all data is kept in memory, this backend has a relatively large memory footprint (depending on the data) and is thus not suggested for very large experiments. - `ChromBackendMzR`: this backend keeps only the chromatographic metadata variables in memory and relies on the `r Biocpkg("mzR")` package to read chromatographic peaks (retention time and intensity values) from the original mzML files on-demand. - `ChromBackendSpectra`: this backend generates chromatographic data from a `Spectra` object. It can be used to create Total Ion Chromatograms (TIC), Base Peak Chromatograms (BPC), or Extracted Ion Chromatograms (EICs). It supports both in-memory and file-backed `Spectra` objects. The backend uses **factorization** to group spectra into chromatograms based on variables like MS level and data origin (see details below). All backends provide a consistent interface through the `Chromatograms` object, regardless of where or how the data is stored. The `ChromBackendSpectra` has a special feature: it uses an internal sort index (`spectraSortIndex`) to maintain retention time order without physically reordering the underlying `Spectra` object. This is particularly important for disk-backed `Spectra` objects, as it avoids loading all data into memory. The sort index is automatically maintained during subsetting and factorization operations. ## Chromatographic peaks data The *peaks data variables* information in the `Chromatograms` object can be accessed using the `peaksData()` function. `peaksData` can be accessed, replaced, and also filtered/subsetted. The *core peaks data variables* all have their own accessors and are as follows: - `rtime`: A `numeric` vector containing retention time values. - `intensity`: A `numeric` vector containing intensity values. ## Chromatograms metadata The metadata of individual chromatograms (so called *chromatograms variables*), can be accessed using the `chromData()` function. The `chromData` can be accessed, replaced, and filtered. The *core chromatogram variables* all have their own accessor methods, and it is guaranteed that a value is returned by them (or `NA` if the information is not available). The core variables and their data types are (alphabetically ordered): - `chromIndex`: an `integer` with the index of the chromatogram in the original source file (e.g., *mzML* file). - `collisionEnergy`: for SRM data, `numeric` with the collision energy of the precursor. - `dataOrigin`: optional `character` with the origin of a chromatogram. - `storageLocation`: `character` defining where the data is (currently) stored. - `msLevel`: `integer` defining the MS level of the data. - `mz`: optional `numeric` with the (target) m/z value for the chromatographic data. - `mzMin`: optional `numeric` with the lower m/z value of the m/z range in case the data (e.g., an extracted ion chromatogram EIC) was extracted from a `Chromtagorams` object. - `mzMax`: optional `numeric` with the upper m/z value of the m/z range. - `precursorMz`: for SRM data, `numeric` with the target m/z of the precursor (parent). - `precursorMzMin`: for SRM data, optional `numeric` with the lower m/z of the precursor's isolation window. - `precursorMzMax`: for SRM data, optional `numeric` with the upper m/z of the precursor's isolation window. - `productMz`: for SRM data, `numeric` with the target m/z of the product ion. - `productMzMin`: for SRM data, optional `numeric` with the lower m/z of the product's isolation window. - `productMzMax`: for SRM data, optional `numeric` with the upper m/z of the product's isolation window. For details on the individual variables and their getter/setter functions, see the help for `Chromatograms` (`?Chromatograms`). Also, note that these variables are suggested but not required to characterize a chromatogram. ## Creating `Chromatograms` objects The simplest way to create a `Chromatograms` object is by defining a backend ofchoice, which mainly depends on what type of data you have, and passing that to the `Chromatograms` constructor function. Below we create such an object for a set of 2 chromatograms, providing their metadata through a data.frame with the MS level, m/z, and chromatogram index columns, and peaks data. The metadata includes the MS level, m/z, and chromatogram index, while the peaks data includes the retention time and intensity in a list of data.frames. ```{r} # A data.frame with chromatogram variables. cdata <- data.frame( msLevel = c(1L, 1L), mz = c(112.2, 123.3), chromIndex = c(1L, 2L) ) # Retention time and intensity values for each chromatogram. pdata <- list( data.frame( rtime = c(11, 12.4, 12.8, 13.2, 14.6, 15.1, 16.5), intensity = c(50.5, 123.3, 153.6, 2354.3, 243.4, 123.4, 83.2) ), data.frame( rtime = c(45.1, 46.2, 53, 54.2, 55.3, 56.4, 57.5), intensity = c(100, 180.1, 300.45, 1400, 1200.3, 300.2, 150.1) ) ) # Create and initialize the backend be <- backendInitialize(ChromBackendMemory(), chromData = cdata, peaksData = pdata ) # Create Chromatograms object chr <- Chromatograms(be) chr ``` Alternatively, it is possible to import chromatograhic data from mass spectrometry raw files in mzML/mzXML or CDF format. Below, we create a `Chromatograms` object from an mzML file and define to use a `ChromBackendMzR` backend to *store* the data (note that this requires the `r Biocpkg("mzR")` package to be installed). This backend, specifically designed for raw LC-MS data, keeps only a subset of chromatogram variables in memory while reading the retention time and intensity values from the original data files only on demand. See section [Backends](#backends) for more details on backends and their properties. ```{r} MRM_file <- system.file("proteomics", "MRM-standmix-5.mzML.gz", package = "msdata" ) be <- backendInitialize(ChromBackendMzR(), files = MRM_file, BPPARAM = SerialParam() ) chr_mzr <- Chromatograms(be) ``` The `Chromatograms` object `chr_mzr` now contains the chromatograms from the mzML file `MRM_file`. The chromatograms can be accessed and manipulated using the `Chromatograms` object's methods and functions. It is also possible to create a `Chromatograms` object directly from a `Spectra` object. This is particularly useful when you want to generate total ion chromatograms (TIC), base peak chromatograms (BPC), or extracted ion chromatograms (EIC) from spectral data. A worked example is provided in the [plotting](#plotting-chromatograms-from-a-spectra-object) section below. Basic information about the `Chromatograms` object can be accessed using functions such as `length()`, which tell us how many chromatograms are contained in the object: ```{r} length(chr) length(chr_mzr) ``` # Access data from a Chromatograms object The `Chromatograms` object provides a set of methods to access and manipulate the chromatographic data. The following sections describe how to do such thingson the peaks data and related metadata. ## peaksData The main function to access the full or a part of the peaks data is `peaksData()` (imaginative right), This function returns a list of data.frames, where each data.frame contains the retention time and intensity values for one chromatogram. It is used such as below: ```{r} peaksData(chr) ``` Specific peaks variables can be accessed by either precising the `"columns"` parameter in `peaksData()` or using `$`. ```{r} peaksData(chr, columns = c("rtime"), drop = TRUE) chr$rtime ``` The methods above also allows to replace the peaks data. It can either be the full peaks data: ```{r} peaksData(chr) <- list( data.frame( rtime = c(1, 2, 3, 4, 5, 6, 7), intensity = c(1, 2, 3, 4, 5, 6, 7) ), data.frame( rtime = c(1, 2, 3, 4, 5, 6, 7), intensity = c(1, 2, 3, 4, 5, 6, 7) ) ) ``` Or for specific variables: ```{r} chr$rtime <- list( c(8, 9, 10, 11, 12, 13, 14), c(8, 9, 10, 11, 12, 13, 14) ) ``` The peak data can be therefore accessed, replaced but also filtered/subsetted. The filtering can be done using the `filterPeaksData()` function. This function filters numerical peaks data variables based on the specified numerical ranges parameter. This function does not reduce the number of chromatograms in the object, but it removes the specified peaks data (e.g., "rtime" and "intensity" pairs) from the peaksData. ```{r} chr_filt <- filterPeaksData(chr, variables = "rtime", ranges = c(12, 15)) length(chr_filt) length(rtime(chr_filt)) ``` As you can see the number of chromatograms in the `Chromatograms` object is not reduced, but the peaks data is filtered/reduced. ## chromData The main function to access the full chromatographic metadata is `chromData()`.This function returns the metadata of the chromatograms stored in the `Chromatograms` object. It can be used as follows: ```{r} chromData(chr) ``` Specific chromatogram variables can be accessed by either precising the `"columns"` parameter in `chromData()` or using `$`. ```{r} chromData(chr, columns = c("msLevel")) chr$chromIndex ``` The metadata can be replaced using the same methods as for the peaks data. ```{r} chr$msLevel <- c(2L, 2L) chromData(chr) ``` extra columns can also be added by the user using the `$` operator. ```{r} chr$extra <- c("extra1", "extra2") chromData(chr) ``` As for the peaks data, the filtering can be done using the `filterChromData()` function. This function filters the chromatogram variables based on the specified ranges parameter. However, contrarily to the peaks data, the filtering *does* reduces the number of chromatograms in the object. ```{r} chr_filt <- filterChromData(chr, variables = "chromIndex", ranges = c(1, 2), keep = TRUE ) length(chr_filt) length(chr) ``` The number of chromatograms in the `Chromatograms` object is reduced. Note that for `ChromBackendSpectra`, when you subset the `Chromatograms` object, the underlying `Spectra` object and its sort index are also properly subset and updated. This ensures that peak data extraction remains efficient even after subsetting operations. # Lazy Processing and Parallelization The `Chromatograms` object is designed to be scalable and flexible. It is therefore possible to perform processing in a lazy manner, i.e., only when the data is needed, and in a parallelized way. ## Processing queue Some functions, such as those that require reading large amounts of data from source files, are deferred and executed only when the data is needed. For example, when `filterPeaksData()` is applied, it initially returns the same `Chromatograms` object as the input, but the filtering step is stored in the processing queue of the object. Later, when `peaksData` is accessed, all stacked operations are performed, and the updated data is returned. This approach is particularly important for backends that do not store data in memory, such as `ChromBackendMzR`. It ensures that data is read from the source file only when required, reducing memory usage. However, loading and processing data in smaller chunks can further minimize memory demands, allowing efficient handling of large datasets. It is possible to add also custom functions to the processing queue of the object. Such a function can be applicable to both the peaks data and the chromatogram metadata. Below we demonstrate how to add a custom function to the processing queue of a `Chromatograms` object. Below we define a function that divides the intensities of each peak by a value which can be passed with argument `y`. ```{r define-function} ## Define a function that takes the backend as an input, divides the intensity ## by parameter y and returns it. Note that ... is required in ## the function's definition. divide_intensities <- function(x, y, ...) { intensity(x) <- lapply(intensity(x), `/`, y) x } ## Add the function to the procesing queue chr_2 <- addProcessing(chr, divide_intensities, y = 2) chr_2 ``` Object `chr_2` has now 2 processing steps in its lazy evaluation queue. Calling `intensity()` on this object will now return intensities that are half of the intensities of the original objects `chr`. ```{r custom-processing} intensity(chr_2) intensity(chr) ``` Finally, for `Chromatograms` that use a *writeable* backend, such as the `ChromBackendMemory` it is possible to apply the processing queue to the peak data and write that back to the data storage with the `applyProcessing()` function. Below we use this to make all data manipulations on peak data of the `sps_rep` object persistent. ```{r applyProcessing} length(chr_2@processingQueue) chr_2 <- applyProcessing(chr_2) length(chr_2@processingQueue) chr_2 ``` Before `applyProcessing()` the lazy evaluation queue contained 2 processing steps, which were then applied to the peak data and *written* to the data storage. Note that calling `reset()` **after** `applyProcessing()` can no longer *restore* the data. ## Parallelization The functions are designed to run in multiple chunks (i.e., pieces) of the object simultaneously, enabling parallelization. This is achieved using the `BiocParallel` package. For `ChromBackendMzR`, data is automatically split and processed by files. For other backends, chunk-wise processing can be enabled by setting the `processingChunkSize` of a `Chromatograms` object, which defines the number of chromatograms for which peak data should be loaded and processed in each iteration. The `processingChunkFactor()` function can be used to evaluate how the data will be split. Below, we use this function to assess how chunk-wise processing would be performed with two `Chromatograms` objects: ```{r} processingChunkFactor(chr) ``` For the `Chromatograms` with the in-memory backend an empty `factor()` is returned, thus, no chunk-wise processing will be performed. We next evaluate whether the `Chromatograms` with the `ChromBackendMzR` on-disk backend would use chunk-wise processing. ```{r} processingChunkFactor(chr_mzr) |> head() ``` Here the factor would on yl be of length 1, meaning that all chromatograms will be processed in one go. however the length would be higher if more than one file is used. As this data is quite big (`r length(chr_mzr)` chromatograms), we can set the `processingChunkSize` to 10 to process the data in chunks of 10 chromatograms. ```{r} processingChunkSize(chr_mzr) <- 10 processingChunkFactor(chr_mzr) |> table() ``` The `Chromatograms` with the `ChromBackendMzR` backend would now split the data in about equally sized arbitrary chunks and no longer by original data file. `processingChunkSize` thus overrides any splitting suggested by the backend. While chunk-wise processing reduces the memory demand of operations, the splitting and merging of the data and results can negatively impact performance. Thus, small data sets or `Chromatograms` with in-memory backends willgenerally not benefit from this type of processing. For computationally intense operation on the other hand, chunk-wise processing has the advantage, that chunks can (and will) be processed in parallel (depending on the parallel processing setup). # Changing backend type In the previous sections we learned already that a `Chromatograms` object can use different backends for the actual data handling. It is also possible to change the backend of a `Chromatograms` to a different one with the `setBackend()` function. As of now it is only possible to change the `ChrombackendMzR` to an in-memory backend such as `ChromBackendMemory`. ```{r} print(object.size(chr_mzr), units = "Mb") chr_mzr <- setBackend(chr_mzr, ChromBackendMemory(), BPPARAM = SerialParam()) chr_mzr chr_mzr@backend@peaksData[[1]] |> head() # data is now in memory ``` With the call the full peak data was imported from the original mzML files into the object. This has obviously an impact on the object's size, which is now much larger than before. ```{r} print(object.size(chr_mzr), units = "Mb") ``` ## Choosing the right backend Different backends are suited for different use cases: - **`ChromBackendMemory`**: Best for small to medium datasets where fast access is needed. All data is kept in memory, providing the fastest access but higher memory consumption. - **`ChromBackendMzR`**: Ideal for large datasets stored in mzML/mzXML/CDF files. Only metadata is kept in memory, while peak data is read on-demand, significantly reducing memory footprint at the cost of slower data access. - **`ChromBackendSpectra`**: Perfect for generating chromatograms from spectral data, especially when creating TICs, BPCs, or EICs from existing `Spectra` objects. The backend intelligently handles both in-memory and disk-backed `Spectra` objects through its internal sorting mechanism, avoiding unnecessary memory consumption while maintaining good performance. # Plotting chromatograms from a `Spectra` object For this purpose let's create a lightweight in-memory `Spectra` object and derive a `Chromatograms` from it. This avoids any external downloads while still illustrating the `ChromBackendSpectra` workflow. ```{r, message=FALSE} library(Spectra) library(IRanges) sp <- Spectra( DataFrame( rtime = c(100, 110, 120, 130, 140), msLevel = c(1L, 1L, 1L, 1L, 1L), dataOrigin = rep("example", 5L), mz = NumericList( c(100, 101), c(100, 101), c(100, 101), c(100, 101), c(100, 101), compress = FALSE ), intensity = NumericList( c(10, 20), c(15, 25), c(30, 5), c(12, 18), c(40, 2), compress = FALSE ) ), source = MsBackendDataFrame() ) chr_s <- Chromatograms(sp) ``` We now have a `Chromatograms` object `chr_s` with a `ChromBackendSpectra` backend. one chromatogram was generated per file. ```{r} chr_s ``` The `ChromBackendSpectra` backend provides flexibility in how chromatograms are generated from spectral data through a process called **factorization**. ## Understanding Factorization Factorization is the process of grouping individual spectra into chromatograms based on one or more variables. Think of it as creating separate "bins" where each bin becomes one chromatogram. By default, the `factorize.by` parameter is set to `c("msLevel", "dataOrigin")`, which means: - All MS1 spectra from file "A" → Chromatogram 1 - All MS2 spectra from file "A" → Chromatogram 2 - All MS1 spectra from file "B" → Chromatogram 3 - All MS2 spectra from file "B" → Chromatogram 4 Each unique combination of the factorization variables creates a separate chromatogram. This allows you to organize your spectral data into meaningful chromatographic traces that can be visualized and analyzed together. You can customize the factorization behavior by changing the `factorize.by` parameter. For example, using only `factorize.by = "dataOrigin"` would create one chromatogram per file (combining all MS levels), while adding more variables would create more granular groupings. Additionally, you can provide custom chromatogram metadata to define specific m/z and retention time ranges: ```{r} ## Create custom metadata for EIC extraction custom_cd <- data.frame( msLevel = c(1L, 1L), dataOrigin = rep(dataOrigin(sp)[1], 2), mzMin = c(100, 200), mzMax = c(100.5, 200.5) ) chr_custom <- Chromatograms(sp, chromData = custom_cd) chr_custom ``` This approach allows you to pre-define the chromatographic regions you want to extract, which is useful for targeted analysis workflows. ## Re-factorizing after metadata changes If you modify the chromatogram metadata (particularly the factorization columns like `msLevel` or `dataOrigin`), you may need to re-factorize the data to update the groupings. This can be done using the `factorize()` function: ```{r} ## Work on a copy so downstream examples are not affected chr_s_tmp <- chr_s ## Modify metadata chr_s_tmp$msLevel <- rep(2L, length(chr_s_tmp)) ## Re-factorize to update the groupings chr_s_tmp <- factorize(chr_s_tmp) chromData(chr_s_tmp) ``` This recalculates which spectra belong to which chromatograms based on the updated metadata. Now, let's say we want to plot specific area of the chromatograms. ```{r} chromData(chr_s)$rtmin <- 125 chromData(chr_s)$rtmax <- 180 chromData(chr_s)$mzmin <- 100 chromData(chr_s)$mzmax <- 100.5 ``` The `Chromatograms` object provides a set of functions to plot the chromatograms and their peaks data. The `plotChromatograms()` function can be used to plot each single chromatograms into its own plot. ```{r} library(RColorBrewer) col3 <- brewer.pal(3, "Dark2") plotChromatograms(chr_s, col = col3) ``` On the overhand if the users wants to easily compare the chromatograms, the `plotChromatogramsOverlay()` function can be used to overlay all chromatograms into one plot. ```{r} plotChromatogramsOverlay(chr_s, col = col3) ``` # Extracting chromatographic regions of interest The `chromExtract()` function allows you to extract specific regions of interest from a `Chromatograms` object based on a peak table. This is particularly useful when you want to focus on specific retention time windows or m/z ranges that correspond to detected peaks or features of interest. ## Basic extraction by retention time For backends like `ChromBackendMemory` and `ChromBackendMzR`, you can extract regions based on retention time ranges: ```{r} ## Define peaks of interest with retention time windows peak_table <- data.frame( rtMin = c(8, 11), rtMax = c(10, 13), msLevel = c(2L, 2L), chromIndex = c(1L, 2L) ) ## Extract those regions chr_extracted <- chromExtract(chr, peak_table, by = c("msLevel", "chromIndex")) chr_extracted ``` The resulting `Chromatograms` object contains only the data within the specified retention time windows. Note that extra columns in `peak_table` are added to the chromatogram metadata: ```{r} chromData(chr_extracted) ``` ## Extraction with m/z filtering (ChromBackendSpectra only) When using `ChromBackendSpectra`, you can also filter by m/z ranges, which is useful for extracting ion chromatograms (EICs) for specific mass windows: ```{r} ## Define peak table with both retention time and m/z windows peak_table_mz <- data.frame( rtMin = c(125, 125), rtMax = c(180, 180), mzMin = c(100, 140), mzMax = c(100.5, 140.5), msLevel = c(1L, 1L), dataOrigin = rep(dataOrigin(chr_s)[1], 2), featureID = c("feature_1", "feature_2") ) ## Extract EICs for these features chr_eics <- chromExtract(chr_s, peak_table_mz, by = c("msLevel", "dataOrigin")) chr_eics ``` Notice that the custom column `featureID` from the peak table is now part of the chromatogram metadata: ```{r} chromData(chr_eics) ``` This is particularly useful for linking extracted chromatograms back to feature tables or peak detection results. # Imputing missing values in chromatograms Real chromatographic data often has gaps or missing intensity values at certain retention times, which can occur due to instrumental limitations, data processing artifacts, or sparse sampling. The `imputePeaksData()` function provides several methods to interpolate these missing values, which can improve downstream analysis and visualization. ## Available imputation methods The package provides four imputation methods: - **"linear"**: Linear interpolation between known values. Fast and simple, good for data with regular gaps. - **"spline"**: Cubic spline interpolation. Provides smooth curves but may introduce artifacts. - **"gaussian"**: Gaussian kernel smoothing. Uses a Gaussian kernel to estimate values based on neighboring points. - **"loess"**: Locally weighted scatter plot smoothing. Provides robust smoothing with local polynomial regression. ## Extrapolation vs. Interpolation By default, `imputePeaksData()` performs only interpolation (fills gaps between observed values). You can control this behavior with the `extrapolate` parameter: - `extrapolate = FALSE` (default): Only interpolation is performed. Leading and trailing `NA` values (outside the range of observed data) remain as `NA`. - `extrapolate = TRUE`: Both interpolation and extrapolation are performed. All `NA` values are filled. ## Example: Imputing an extracted ion chromatogram (EIC) To demonstrate imputation we first build a small `Spectra` object that already contains a few `NA` intensity values — mimicking a real-world EIC with gaps — and then extract a chromatogram from it. ```{r} ## A small Spectra with some missing intensities at m/z 100 sp_gaps <- Spectra( DataFrame( rtime = c(100, 110, 120, 130, 140, 150, 160, 170, 180), msLevel = rep(1L, 9), dataOrigin = rep("impute_demo", 9), mz = NumericList(100, 100, 100, 100, 100, 100, 100, 100, 100, compress = FALSE), intensity = NumericList(50, NA, 120, 200, NA, NA, 80, 30, 10, compress = FALSE) ), source = MsBackendDataFrame() ) ## Derive a Chromatograms and extract the EIC for m/z ≈ 100 chr_gaps <- Chromatograms(sp_gaps) eic_table <- data.frame( rtMin = 100, rtMax = 180, mzMin = 99.5, mzMax = 100.5, msLevel = 1L, dataOrigin = "impute_demo" ) chr_eic <- chromExtract(chr_gaps, eic_table, by = c("msLevel", "dataOrigin")) chr_eic ``` Now let's examine the raw data and apply different imputation methods: ```{r, fig.height=10, fig.width=8} ## Create copies for comparison chr_linear <- imputePeaksData(chr_eic, method = "linear") chr_spline <- imputePeaksData(chr_eic, method = "spline") chr_gaussian <- imputePeaksData(chr_eic, method = "gaussian", window = 2, sd = 1) chr_loess <- imputePeaksData(chr_eic, method = "loess", span = 0.75) ## Plot all methods for comparison par(mfrow = c(3, 2), mar = c(4, 4, 2, 1)) ## Original data plotChromatograms(chr_eic, main = "Original EIC") ## Linear interpolation plotChromatograms(chr_linear, main = "Linear Imputation") ## Spline interpolation plotChromatograms(chr_spline, main = "Spline Imputation") ## Gaussian smoothing plotChromatograms(chr_gaussian, main = "Gaussian Smoothing (window=2, sd=1)") ## LOESS smoothing plotChromatograms(chr_loess, main = "LOESS Smoothing (span=0.75)") ``` ## Selecting the right imputation method The choice of imputation method depends on your data characteristics and analysis goals: - Use **"linear"** for quick interpolation of small gaps in regularly sampled data. - Use **"spline"** for smooth curves when data is fairly regular, but be aware it can overshoot. - Use **"gaussian"** for local smoothing that preserves peak shapes while filling gaps. - Use **"loess"** when you want robust smoothing that adapts to local data density. ## Imputation in lazy evaluation pipelines For on-disk backends like `ChromBackendMzR`, imputation is particularly useful when combined with the lazy evaluation queue. The imputation function is added to the processing queue and is only applied when peak data is actually accessed: ```{r} ## For on-disk backends, add imputation to the lazy queue chr_mzr_imputed <- imputePeaksData( chr_mzr, method = "gaussian", window = 5, sd = 2 ) chr_mzr_imputed ``` The imputation is **not** performed immediately. Instead, it's stored in the processing queue. When you call `peaksData()` on the object, the raw data is read from the file and then imputation is applied on-the-fly: ```{r} ## This reads from disk and applies imputation in one step peak_data <- peaksData(chr_mzr_imputed[1]) ``` This approach is highly efficient for large datasets because: 1. Data is only read from disk when needed 2. Imputation is applied on-the-fly during data access 3. No temporary files are created 4. Memory usage remains minimal You can verify the processing queue contains your imputation step: ```{r} length(chr_mzr_imputed@processingQueue) ``` And if you want to make the imputation permanent (for in-memory backends), use `applyProcessing()`: ```{r} ## For in-memory backends, you can persist the imputation chr_in_memory <- setBackend(chr_mzr_imputed, ChromBackendMemory()) chr_in_memory <- applyProcessing(chr_in_memory) # Now imputation is permanently applied length(chr_in_memory@processingQueue) ``` # Comparing chromatograms The `compareChromatograms()` function computes pairwise similarity between chromatograms in two steps. First, a mapping function (`MAPFUN`, default `matchRtime()`) aligns two chromatograms onto a common retention-time grid by linear interpolation, returning aligned intensity vectors for each. Then, a scoring function (`FUN`, default `cor()`, i.e. Pearson correlation) computes a similarity value from those vectors. Additional arguments such as `method = "spearman"` can be passed via `...` to the scoring function, and a fully custom `FUN` or `MAPFUN` can be supplied. The result is always a 3-dimensional numeric array of dimensions *n × m × 2*. Layer `[, , "score"]` contains pairwise similarity scores; layer `[, , "n_peaks"]` contains the number of overlapping retention-time points used for each comparison. The `minPeaks` parameter (default `4`) sets the minimum number of overlapping points required to compute a score — pairs below this threshold return `NA` in the score layer while the actual overlap count is still recorded in `n_peaks`. This avoids unreliable scores from very sparse overlaps and skips the `FUN` computation entirely for those pairs. Use `minPeaks = 2L` to compute a score whenever two or more points overlap. ## Comparing chromatograms within a single object When called with a single `Chromatograms` object, `compareChromatograms()` computes all pairwise similarities and returns a symmetric *n × n × 2* array. Let's pick a subset of the MRM chromatograms we loaded earlier: ```{r compare-chromatograms-mrm} ## Pick 8 MRM chromatograms, skipping the first (a TIC with no m/z info) chr_sub <- chr_mzr[2:9] cor_arr <- compareChromatograms(chr_sub) cor_arr[, , "score"] ## similarity scores cor_arr[, , "n_peaks"] ## number of overlapping RT points per pair ## Use a chromData column as row/column labels cor_arr_labeled <- compareChromatograms(chr_sub, labelsColumn = "chromIndex") cor_arr_labeled[, , "score"] ``` We can visualise the score layer as a heatmap, labelling rows and columns with the MRM precursor → product m/z transitions stored in `chromData()`: ```{r compare-chromatograms-heatmap, fig.width=6, fig.height=5} library(pheatmap) ## Label rows/columns with precursor → product m/z transitions mz_labels <- paste0(round(chromData(chr_sub)$precursorMz, 1), " → ", round(chromData(chr_sub)$productMz, 1)) score_mat <- cor_arr[, , "score"] rownames(score_mat) <- colnames(score_mat) <- mz_labels pheatmap(score_mat, main = "Pairwise Pearson correlation", color = hcl.colors(30, palette = "RdYlBu", rev = TRUE)) ``` A custom similarity function can be passed via `FUN`. For example, cosine similarity, useful for checking co-elution regardless of absolute intensity differences: ```{r compare-chromatograms-custom} cosine <- function(x, y) { sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))) } cos_arr <- compareChromatograms(chr_sub, FUN = cosine) cos_arr[, , "score"] ``` The `n_peaks` layer is useful even when `minPeaks` blocks a score: it shows how many RT points overlapped, letting you distinguish *no overlap at all* (`n_peaks = 0`) from *some overlap but below the threshold* (`n_peaks > 0` but `score = NA`): ```{r compare-chromatograms-minpeaks} ## Require at least 10 overlapping RT points to compute a score cor_strict <- compareChromatograms(chr_sub, minPeaks = 10L) cor_strict[, , "score"] ## NAs for pairs with < 10 common RT points cor_strict[, , "n_peaks"] ## actual overlap counts are always recorded ``` ## Comparing two Chromatograms objects When called with two `Chromatograms` objects, `compareChromatograms(x, y)` returns an *n × m × 2* array with similarities between each chromatogram in `x` (rows) and each in `y` (columns). ```{r compare-chromatograms-xy} ## Compare the first 4 chromatograms against the last 4 res <- compareChromatograms(chr_sub[1:4], chr_sub[5:8]) res[, , "score"] ``` ## Comparing groups of chromatograms To compare chromatograms within separate groups (e.g. per `dataOrigin`), split the object first and apply `compareChromatograms()` to each subset: ```{r compare-chromatograms-groups, eval=FALSE} grp_list <- split(chr_sub, chromData(chr_sub)$dataOrigin) lapply(grp_list, compareChromatograms) ``` # Session information ```{r si} sessionInfo() ```